library(tidyfast)
library(data.table)
library(magrittr)
tidyfast
Author(s): Tyson S. Barrett, Mark Fairbanks, Ivan Leung, Indrajeet Patil
Maintainer: Tyson S. Barrett (t.barrett88@gmail.com)
The goal of tidyfast is to provide fast and efficient alternatives to some tidyr (and a few dplyr) functions using data.table under the hood. Each function has the prefix dt_ to allow for autocomplete in IDEs such as RStudio. These functions should complement some of the current functionality in dtplyr (but notably do not use the lazy_dt() framework of dtplyr). This package imports data.table and cpp11 (no other dependencies). These are, in essence, translations from a more tidyverse grammar to data.table. Most functions herein cover places where, in my opinion, the data.table syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of data.table.
Relationship with data.table
tidyfast was designed to be an extension to and translation of data.table. As such, there are three main ways tidyfast is related to data.table.
- This package is built directly on data.table, using direct calls to [.data.table and other functions under the hood.
- It only relies on two packages, cpp11 and data.table, both stable packages that are unlikely to have breaking changes often. This follows the data.table principle of few dependencies.
- It was designed to also show how others can use data.table within their own package to create functions that flexibly call data.table in complex ways.
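To make the last point concrete, here is a minimal sketch of a wrapper function that calls data.table internally. The helper count_by() and the toy data are hypothetical and are not part of tidyfast; in a real package, data.table would also need to be listed under Imports and imported in the NAMESPACE.

```r
library(data.table)

# Hypothetical helper (not part of tidyfast): count rows per group.
# `dt_` is a data.table; `by_cols` is a character vector of column names.
count_by <- function(dt_, by_cols) {
  stopifnot(is.data.table(dt_), is.character(by_cols))
  # `by` accepts a character vector, so no non-standard evaluation is needed
  dt_[, .(n = .N), by = by_cols]
}

# Toy data, for illustration only
example_dt <- data.table(grp = c("a", "b", "a"), val = 1:3)
counts <- count_by(example_dt, "grp")
counts
```

Passing column names as strings keeps the wrapper simple; tidyfast itself accepts unquoted column names, which takes more machinery than this sketch shows.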
Overview
As shown on the tidyfast GitHub page, tidyfast has several functions that have the prefix dt_. A few notable functions from the package are shown below.
dt_fill
Filling NAs is a common task, but tidyr::fill() can become too slow, especially when filling within many, many groups. dt_fill() is useful for this and can be used in a few different ways.
x = 1:10
dt_with_nas <- data.table(
  x = x,
  y = shift(x, 2L),
  z = shift(x, -2L),
  a = sample(c(rep(NA, 10), x), 10),
  id = sample(1:3, 10, replace = TRUE)
)
# Original
dt_with_nas
x y z a id
<int> <int> <int> <int> <int>
1: 1 NA 3 4 1
2: 2 NA 4 NA 1
3: 3 1 5 NA 2
4: 4 2 6 2 2
5: 5 3 7 3 3
6: 6 4 8 1 3
7: 7 5 9 9 3
8: 8 6 10 NA 1
9: 9 7 NA 10 2
10: 10 8 NA 6 1
# Fill down (the default direction); immutable = FALSE modifies dt_with_nas in place
dt_fill(dt_with_nas, y, z, a, immutable = FALSE)
x y z a id
<int> <int> <int> <int> <int>
1: 1 NA 3 4 1
2: 2 NA 4 4 1
3: 3 1 5 4 2
4: 4 2 6 2 2
5: 5 3 7 3 3
6: 6 4 8 1 3
7: 7 5 9 9 3
8: 8 6 10 9 1
9: 9 7 10 10 2
10: 10 8 10 6 1
# Fill within groups defined by `id` (dt_with_nas was already filled in place above)
dt_fill(dt_with_nas,
  y, z, a, id = list(id))
x y z a id
<int> <int> <int> <int> <int>
1: 1 NA 3 4 1
2: 2 NA 4 4 1
3: 3 1 5 4 2
4: 4 2 6 2 2
5: 5 3 7 3 3
6: 6 4 8 1 3
7: 7 5 9 9 3
8: 8 6 10 9 1
9: 9 7 10 10 2
10: 10 8 10 6 1
# both down and then up filling by group
dt_fill(dt_with_nas,
  y, z, a, id = list(id),
  .direction = "downup")
x y z a id
<int> <int> <int> <int> <int>
1: 1 6 3 4 1
2: 2 6 4 4 1
3: 3 1 5 4 2
4: 4 2 6 2 2
5: 5 3 7 3 3
6: 6 4 8 1 3
7: 7 5 9 9 3
8: 8 6 10 9 1
9: 9 7 10 10 2
10: 10 8 10 6 1
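For comparison, a grouped fill can be written directly in data.table with nafill(). This is only a sketch of the kind of syntax dt_fill() saves you from writing, not necessarily how dt_fill() is implemented; the toy table and its values are made up for illustration.

```r
library(data.table)

# Toy data, for illustration only
toy <- data.table(
  id = c(1L, 1L, 1L, 2L, 2L, 2L),
  v  = c(NA, 5L, NA, 7L, NA, NA)
)

# Fill down within each id: last observation carried forward ("locf")
down <- toy[, .(v = nafill(v, type = "locf")), by = id]

# Fill down and then up within each id: "locf" followed by
# next observation carried backward ("nocb")
downup <- toy[, .(v = nafill(nafill(v, type = "locf"), type = "nocb")), by = id]

down
downup
```

Filling down leaves the leading NA in group 1 untouched because there is no earlier value to carry forward; the second pass with "nocb" fills it from the next observed value.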
dt_nest
Nesting data can be useful for a number of reasons, including running multiple statistical models in a structured way, storing non-standard data types (e.g., graphics), easing the cognitive burden of joining data sets, and storing information that is only useful as a group (e.g., boundaries of polygons). The dt_nest() function takes a data.table and ID variables and nests the remaining columns into a list column of data.tables, as shown below.
dt <- data.table(
  x = rnorm(1e5),
  y = runif(1e5),
  grp = sample(1L:5L, 1e5, replace = TRUE),
  nested1 = lapply(1:10, sample, 10, replace = TRUE),
  nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
  id = 1:1e5
)

nested <- dt_nest(dt, grp)
nested
Key: <grp>
grp data
<int> <list>
1: 1 <data.table[20074x5]>
2: 2 <data.table[19792x5]>
3: 3 <data.table[20113x5]>
4: 4 <data.table[19991x5]>
5: 5 <data.table[20030x5]>
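The same idea can be sketched in plain data.table by collecting .SD (the subset of data for each group) into a list column. This is an illustration of the underlying idiom on made-up toy data, and is an assumption about the general approach rather than a copy of the dt_nest() source.

```r
library(data.table)

# Toy data, for illustration only
toy <- data.table(
  grp = c(1L, 1L, 2L, 2L, 2L),
  x   = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# One row per group; the remaining columns are collected into a list column
nested_toy <- toy[, .(data = list(.SD)), by = grp]

# Number of rows held in each nested data.table
nested_rows <- sapply(nested_toy$data, nrow)
nested_rows
```

Each element of the data column is itself a data.table, so it can be passed to a modeling function or row-bound back together later.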
dt_pivot_longer and dt_pivot_wider
The last example for this brief post is pivoting. In my opinion, the pivot syntax is easy to remember and use, so it is nice to have that syntax with the performance of melt() and dcast(). Although it doesn't have the full functionality of tidyr's pivot functions, the syntax can do most things you need to do when reshaping data.
billboard <- tidyr::billboard

longer <- billboard %>%
  dt_pivot_longer(
    cols = c(-artist, -track, -date.entered),
    names_to = "week",
    values_to = "rank"
  )
Warning in melt.data.table(data = dt_, id.vars = id_vars, measure.vars = cols,
: 'measure.vars' [wk1, wk2, wk3, wk4, ...] are not all of the same type. By
order of hierarchy, the molten data value column will be of type 'double'. All
measure variables not of type 'double' will be coerced too. Check DETAILS in
?melt.data.table for more on coercion.
longer
artist track date.entered week rank
<char> <char> <Date> <char> <num>
1: 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
2: 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
3: 3 Doors Down Kryptonite 2000-04-08 wk1 81
4: 3 Doors Down Loser 2000-10-21 wk1 76
5: 504 Boyz Wobble Wobble 2000-04-15 wk1 57
---
24088: Yankee Grey Another Nine Minutes 2000-04-29 wk76 NA
24089: Yearwood, Trisha Real Live Woman 2000-04-01 wk76 NA
24090: Ying Yang Twins Whistle While You Tw... 2000-03-18 wk76 NA
24091: Zombie Nation Kernkraft 400 2000-09-02 wk76 NA
24092: matchbox twenty Bent 2000-04-29 wk76 NA
We can also take that long data set and turn it wide again.
wider <- longer %>%
  dt_pivot_wider(
    names_from = week,
    values_from = rank
  )
wider[, .(artist, track, wk1, wk2)]
Key: <artist, track>
artist track wk1 wk2
<char> <char> <num> <num>
1: 2 Pac Baby Don't Cry (Keep... 87 82
2: 2Ge+her The Hardest Part Of ... 91 87
3: 3 Doors Down Kryptonite 81 70
4: 3 Doors Down Loser 76 76
5: 504 Boyz Wobble Wobble 57 34
---
313: Yankee Grey Another Nine Minutes 86 83
314: Yearwood, Trisha Real Live Woman 85 83
315: Ying Yang Twins Whistle While You Tw... 95 94
316: Zombie Nation Kernkraft 400 99 99
317: matchbox twenty Bent 60 37
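For reference, the same round trip can be sketched with melt() and dcast() directly. The toy table below is made up for illustration, and the calls show the general data.table pattern rather than the exact code inside the tidyfast pivot functions.

```r
library(data.table)

# Toy wide data, for illustration only
wide_toy <- data.table(
  track = c("a", "b"),
  wk1   = c(10, 20),
  wk2   = c(12, 18)
)

# Wide to long: one row per track and week
long_toy <- melt(
  wide_toy,
  id.vars = "track",
  variable.name = "week",
  value.name = "rank",
  variable.factor = FALSE  # keep `week` as character rather than factor
)

# Long back to wide: one column per week
back_toy <- dcast(long_toy, track ~ week, value.var = "rank")
back_toy
```

Here every measure column is a double, so melt() has nothing to coerce and does not emit the type warning shown earlier.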