0. data analysis in R
1. dplyr
2. data.table
3. dtplyr
4. speed test
0. data analysis in R
- When analyzing data in R, dplyr is probably the most widely used package (it is for me, anyway)
- dplyr slows down as the data grows; data.table is clearly faster than dplyr, so I use data.table for big data
- For now I mix dplyr and data.table as needed(?)
- I heard there is a package called dtplyr, so let's compare dplyr vs data.table vs dtplyr
1. dplyr
- The staple of data analysis; documentation: https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
- Its hallmark is the intuitive(?) syntax built on %>%; honestly, I'm just used to it
- Data Wrangling Cheat Sheet
- Data Transformation with dplyr :: Cheat Sheet
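To make the pipe style concrete, here is a minimal sketch using the built-in mtcars data set. The data set and the column choices are only for illustration and are not part of the original notes. Each verb takes a data frame and returns a new one, so the steps read top to bottom.

```r
library(dplyr)

# mtcars ships with R; it stands in for any data frame here
by_cyl <- mtcars %>%
  filter(wt < 5) %>%                 # keep cars with weight below 5 (unit: 1000 lbs)
  group_by(cyl) %>%                  # one group per cylinder count
  summarise(mean_mpg = mean(mpg),    # average fuel economy per group
            n = n())                 # number of cars per group

print(by_cyl)
```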
2. data.table
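- Roughly speaking, it is fast because it is memory-efficient, and the gap widens on big data
- fread and fwrite are somewhat more general-purpose(?); the dt[i, j, by] syntax is concise but a bit confusing (to me)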
- documentation: https://cran.r-project.org/web/packages/data.table/data.table.pdf
- github: https://github.com/Rdatatable/data.table
- Data Transformation with data.table :: Cheat Sheet
- data.table tutorial: https://www.listendata.com/2016/10/r-data-table.html
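Since the dt[i, j, by] form is the confusing part, a small sketch may help: i selects rows, j says what to compute, and by names the grouping columns. The example again uses mtcars purely as stand-in data, and the temporary CSV file is only there to show an fwrite/fread round trip.

```r
library(data.table)

# Convert the built-in mtcars data frame to a data.table (illustrative data only)
dt <- as.data.table(mtcars)

# dt[i, j, by]: i = which rows, j = what to compute, by = grouping columns
by_cyl <- dt[wt < 5, .(mean_mpg = mean(mpg), n = .N), by = cyl]

# fwrite/fread round trip through a temporary CSV file
path <- tempfile(fileext = ".csv")
fwrite(by_cyl, path)
back <- fread(path)

print(back)
```

Read aloud, the middle line is "take the rows where wt < 5, compute the mean of mpg and the row count, grouped by cyl".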
3. dtplyr
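- A package that lets you use data.table through dplyr syntax
- Simply(?) put: dplyr is the frontend and data.table is the backend
- Naturally I would expect it to be slower than data.table; the dtplyr documentation gives these reasons: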
1. Each dplyr verb must do some work to convert dplyr syntax to data.table syntax. This takes time proportional to the complexity of the input code, not the input data, so should be a negligible overhead for large datasets. Initial benchmarks suggest that the overhead should be under 1ms per dplyr call.
2. To match dplyr semantics, mutate() does not modify in place by default. This means that most expressions involving mutate() must make a copy that would not be necessary if you were using data.table directly. (You can opt out of this behaviour in lazy_dt() with immutable = FALSE).
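- Start by creating the initial lazy data table with lazy_dt()
dtplyr_dt <- lazy_dt(dataset)
- End the pipeline with as_tibble(): nothing is computed until that point, so the whole chain is translated to data.table code and evaluated in one go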
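library(dplyr)
library(dtplyr)

mtcars2 <- lazy_dt(mtcars)  # assumed definition: the built-in mtcars data wrapped as a lazy data.table

mtcars2 %>%
  filter(wt < 5) %>%
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>%
  summarise(l100k = mean(l100k)) %>%
  as_tibble()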
- ref: https://dtplyr.tidyverse.org/
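The two reasons quoted above can be observed directly. The sketch below is an illustration under assumptions, not a benchmark: it uses mtcars as stand-in data, prints the data.table code that dtplyr generates with show_query(), and then contrasts the default copying behaviour with immutable = FALSE. The object names dt1, dt2, and lazy1 are made up for this example.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Default: lazy_dt() treats its input as immutable, so dt1 is never modified
dt1 <- as.data.table(mtcars)
lazy1 <- lazy_dt(dt1)

query <- lazy1 %>%
  filter(wt < 5) %>%
  mutate(l100k = 235.21 / mpg) %>%
  group_by(cyl) %>%
  summarise(l100k = mean(l100k))

# Nothing has been computed yet; show_query() prints the data.table translation
show_query(query)

# as_tibble() triggers the actual computation
res <- as_tibble(query)

# Opting out of the copy: with immutable = FALSE, mutate() may modify dt2 in place
dt2 <- as.data.table(mtcars)
lazy_dt(dt2, immutable = FALSE) %>%
  mutate(l100k = 235.21 / mpg) %>%
  as_tibble()

print("l100k" %in% names(dt1))  # was the default input left untouched?
print("l100k" %in% names(dt2))  # did the in-place variant add the column?
```

The in-place variant avoids the copy, at the price of giving up the usual dplyr guarantee that the input is left unchanged.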
4. speed test
- example 1. dplyr vs data.table vs dtplyr vs dt_dtplyr (dtplyr applied to a data.table rather than a data frame)
- example 2.
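The benchmark code for example 2 refers to three objects, df_dt, df_tb, and df_lz, whose construction is not shown in these notes. A plausible setup is sketched here, assuming the data is the flights table from the nycflights13 package. The column names used in the benchmark (origin, carrier, dep_delay, year, month, day) match that table, but the data source itself is a guess.

```r
library(data.table)
library(dplyr)
library(dtplyr)
library(microbenchmark)
library(ggplot2)        # needed for autoplot() on microbenchmark results
library(nycflights13)   # assumed data source; not named in the original notes

df_tb <- as_tibble(flights)       # tibble for plain dplyr
df_dt <- as.data.table(flights)   # data.table for native data.table syntax
df_lz <- lazy_dt(df_dt)           # lazy data.table for dtplyr
```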
# Benchmark 1: filter rows by origin and carrier
results = microbenchmark(
  `data.table` = df_dt[origin == 'JFK' & carrier == 'AA'],
  `dplyr`      = df_tb %>% filter(origin == 'JFK' & carrier == 'AA'),
  `dtplyr`     = df_lz %>% filter(origin == 'JFK' & carrier == 'AA') %>% as_tibble(),
  times = 100
)
# Plot the timing distribution of each expression
autoplot(results) +
  aes(fill = expr) +
  theme_bw() +
  labs(title = "Filter")
# Benchmark 2: grouped mean of dep_delay, then filter on the aggregate
results2 = microbenchmark(
  `data.table` = df_dt[, .(mean_delay = mean(dep_delay, na.rm = TRUE)),
                       by = c('year', 'month', 'day', 'carrier', 'origin')][mean_delay >= 10],
  `dplyr` = df_tb %>%
    group_by(year, month, day, carrier, origin) %>%
    summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
    ungroup() %>%
    filter(mean_delay >= 10),
  `dtplyr` = df_lz %>%
    group_by(year, month, day, carrier, origin) %>%
    summarize(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
    ungroup() %>%
    filter(mean_delay >= 10) %>%
    as_tibble(),
  times = 100
)
# Plot the timing distribution of each expression
autoplot(results2) +
  aes(fill = expr) +
  theme_bw() +
  labs(title = "group_by, mean, filter")
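Besides the plots, it can be useful to confirm that the three approaches return the same rows and to read the timings as numbers. The sketch below does both on a small stand-in data set (mtcars) so that it runs on its own. The names tb, dt, lz, and small are made up for this example, and the timings it prints say nothing about the benchmarks above.

```r
library(data.table)
library(dplyr)
library(dtplyr)
library(microbenchmark)

# Small stand-in data so the snippet is self-contained (not the benchmark data)
tb <- as_tibble(mtcars)
dt <- as.data.table(mtcars)
lz <- lazy_dt(dt)

# Check that the three approaches agree before comparing their speed
out_dt <- dt[cyl == 4 & am == 1]
out_tb <- tb %>% filter(cyl == 4 & am == 1)
out_lz <- lz %>% filter(cyl == 4 & am == 1) %>% as_tibble()
stopifnot(nrow(out_dt) == nrow(out_tb), nrow(out_tb) == nrow(out_lz))

small <- microbenchmark(
  `data.table` = dt[cyl == 4 & am == 1],
  `dplyr`      = tb %>% filter(cyl == 4 & am == 1),
  `dtplyr`     = lz %>% filter(cyl == 4 & am == 1) %>% as_tibble(),
  times = 20
)

# Median time per expression, reported in milliseconds
timing <- summary(small, unit = "ms")[, c("expr", "median")]
print(timing)
```

On data this small the fixed per-call overhead dominates, so the ranking may differ from what a large data set shows.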