Seal of Approval: tidyfast

seal of approval
bridge package
Author

Tyson S. Barrett

Published

August 1, 2024

tidyfast

tidyfast hex sticker

Author(s): Tyson S. Barrett, Mark Fairbanks, Ivan Leung, Indrajeet Patil

Maintainer: Tyson S. Barrett

The goal of tidyfast is to provide fast and efficient alternatives to some tidyr (and a few dplyr) functions using data.table under the hood. Each function has the prefix dt_ to allow for autocomplete in IDEs such as RStudio. These functions should complement some of the current functionality in dtplyr (but notably do not use the lazy_dt() framework of dtplyr). The package imports data.table and cpp11 (no other dependencies). The functions are, in essence, translations from a more tidyverse-style grammar to data.table. Most of them cover places where, in my opinion, the data.table syntax is not obvious or clear. As such, they translate a simple function call into the fast, efficient, and concise syntax of data.table.

Relationship with data.table

tidyfast was designed to be an extension to and translation of data.table. As such, there are three main ways tidyfast is related to data.table.

  1. This package is built directly on data.table using direct calls to [.data.table and other functions under the hood.
  2. It relies on only two packages, cpp11 and data.table, both stable packages that are unlikely to have frequent breaking changes. This follows the data.table principle of few dependencies.
  3. It was designed to also show how others can use data.table within their own package to create functions that flexibly call data.table in complex ways.
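To make the third point concrete, here is a minimal sketch of a wrapper function that calls [.data.table internally. The helper count_by() and the toy data are hypothetical and are not part of tidyfast; they only illustrate the general pattern of hiding data.table syntax behind a simple function call.

```r
library(data.table)

# Hypothetical helper (not a tidyfast function): count rows per group.
# `by_cols` is a character vector of column names to group by.
count_by <- function(df, by_cols) {
  dt <- as.data.table(df)            # accept data.frames as well as data.tables
  dt[, .(n = .N), by = by_cols]      # direct call to `[.data.table`
}

toy <- data.frame(g = c("a", "b", "a", "a"), v = 1:4)
counts <- count_by(toy, "g")
counts$n   # 3 rows in group "a", 1 row in group "b"
```

Inside a real package, data.table would also need to be listed under Imports in DESCRIPTION and imported in NAMESPACE so that the [ call is dispatched to data.table.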

Overview

As shown on the tidyfast GitHub page, tidyfast has several functions that have the prefix dt_. A few notable functions from the package are shown below.

library(tidyfast)
library(data.table)
library(magrittr)

dt_fill

Filling NAs is a common task, but tidyr::fill() can become too slow, especially when filling within many, many groups. dt_fill() is useful here and can be used in a few different ways.

x = 1:10
dt_with_nas <- data.table(
  x = x,
  y = shift(x, 2L),
  z = shift(x, -2L),
  a = sample(c(rep(NA, 10), x), 10),
  id = sample(1:3, 10, replace = TRUE)
)

# Original
dt_with_nas
        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3     4     1
 2:     2    NA     4    NA     1
 3:     3     1     5    NA     2
 4:     4     2     6     2     2
 5:     5     3     7     3     3
 6:     6     4     8     1     3
 7:     7     5     9     9     3
 8:     8     6    10    NA     1
 9:     9     7    NA    10     2
10:    10     8    NA     6     1
# Default fill direction (down), no grouping; immutable = FALSE modifies dt_with_nas in place
dt_fill(dt_with_nas, y, z, a, immutable = FALSE)
        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3     4     1
 2:     2    NA     4     4     1
 3:     3     1     5     4     2
 4:     4     2     6     2     2
 5:     5     3     7     3     3
 6:     6     4     8     1     3
 7:     7     5     9     9     3
 8:     8     6    10     9     1
 9:     9     7    10    10     2
10:    10     8    10     6     1
# by the grouping variable `id`
# (dt_with_nas was already filled in place by the call above, so the output is unchanged)
dt_fill(dt_with_nas, 
        y, z, a, 
        id = list(id))
        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3     4     1
 2:     2    NA     4     4     1
 3:     3     1     5     4     2
 4:     4     2     6     2     2
 5:     5     3     7     3     3
 6:     6     4     8     1     3
 7:     7     5     9     9     3
 8:     8     6    10     9     1
 9:     9     7    10    10     2
10:    10     8    10     6     1
# both down and then up filling by group
dt_fill(dt_with_nas, 
        y, z, a, 
        id = list(id), 
        .direction = "downup")
        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1     6     3     4     1
 2:     2     6     4     4     1
 3:     3     1     5     4     2
 4:     4     2     6     2     2
 5:     5     3     7     3     3
 6:     6     4     8     1     3
 7:     7     5     9     9     3
 8:     8     6    10     9     1
 9:     9     7    10    10     2
10:    10     8    10     6     1
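
For comparison, a grouped fill can be sketched in plain data.table with nafill(). This is only an illustration of the kind of syntax that dt_fill() spares you from writing, not necessarily how dt_fill() is implemented; the toy table below is made up for the example.

```r
library(data.table)

# Toy data (hypothetical): two groups with missing values in `a`
toy <- data.table(
  id = c(1L, 1L, 1L, 2L, 2L, 2L),
  a  = c(NA, 5L, NA, NA, NA, 7L)
)

# Fill down within each id group ("locf" = last observation carried forward)
down <- copy(toy)[, a := nafill(a, type = "locf"), by = id]
down$a    # NA 5 5 NA NA 7: leading NAs have nothing above them to carry forward

# Fill down, then up ("nocb" = next observation carried backward)
downup <- copy(toy)[
  , a := nafill(nafill(a, type = "locf"), type = "nocb"), by = id
]
downup$a  # 5 5 5 7 7 7
```

copy() is used so that := does not modify toy by reference, which is the same concern the immutable argument of dt_fill() addresses.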

dt_nest

Nesting data can be useful for a number of reasons, including running multiple statistical models in a structured way, storing non-standard data types (e.g., graphics), easing the cognitive burden of joining data sets, and storing information that is only useful as a group (e.g., boundaries of polygons), among others. The dt_nest() function takes a data.table and ID variables and nests the remaining columns into a list column of data.tables as shown below.

dt <- data.table(
   x = rnorm(1e5),
   y = runif(1e5),
   grp = sample(1L:5L, 1e5, replace = TRUE),
   nested1 = lapply(1:10, sample, 10, replace = TRUE),
   nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
   id = 1:1e5
)

nested <- dt_nest(dt, grp)
nested
Key: <grp>
     grp                  data
   <int>                <list>
1:     1 <data.table[20074x5]>
2:     2 <data.table[19792x5]>
3:     3 <data.table[20113x5]>
4:     4 <data.table[19991x5]>
5:     5 <data.table[20030x5]>
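
The same idea can be sketched in plain data.table, where .SD (the subset of data for each group) is wrapped in a list. This is an illustration on a small made-up table, not a statement about dt_nest() internals.

```r
library(data.table)

# Toy data (hypothetical)
toy <- data.table(
  grp = c(1L, 1L, 2L, 2L, 2L),
  x   = c(0.1, 0.2, 0.3, 0.4, 0.5),
  y   = letters[1:5]
)

# .SD holds the non-grouping columns of the current group;
# list(.SD) stores one data.table per group in a list column
nested_toy <- toy[, .(data = list(.SD)), keyby = grp]

# Each element of the list column is itself a data.table
nested_toy$data[[2]]

# One way to undo the nesting: return each stored table, by group
unnested <- nested_toy[, data[[1L]], by = grp]
```

Here nested_toy has one row per group, and unnested recovers the original five rows with grp restored as a regular column.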

dt_pivot_longer and dt_pivot_wider

The last example for this brief post is pivoting. In my opinion, the pivot syntax is easy to remember and use, so it is nice to have that syntax with the performance of melt() and dcast(). Although the tidyfast functions don’t have the full functionality of tidyr’s pivot functions, they can do most things you need to do when reshaping data.

billboard <- tidyr::billboard 

longer <- billboard %>%
  dt_pivot_longer(
     cols = c(-artist, -track, -date.entered),
     names_to = "week",
     values_to = "rank"
  )
Warning in melt.data.table(data = dt_, id.vars = id_vars, measure.vars = cols,
: 'measure.vars' [wk1, wk2, wk3, wk4, ...] are not all of the same type. By
order of hierarchy, the molten data value column will be of type 'double'. All
measure variables not of type 'double' will be coerced too. Check DETAILS in
?melt.data.table for more on coercion.
longer
                 artist                   track date.entered   week  rank
                 <char>                  <char>       <Date> <char> <num>
    1:            2 Pac Baby Don't Cry (Keep...   2000-02-26    wk1    87
    2:          2Ge+her The Hardest Part Of ...   2000-09-02    wk1    91
    3:     3 Doors Down              Kryptonite   2000-04-08    wk1    81
    4:     3 Doors Down                   Loser   2000-10-21    wk1    76
    5:         504 Boyz           Wobble Wobble   2000-04-15    wk1    57
   ---                                                                   
24088:      Yankee Grey    Another Nine Minutes   2000-04-29   wk76    NA
24089: Yearwood, Trisha         Real Live Woman   2000-04-01   wk76    NA
24090:  Ying Yang Twins Whistle While You Tw...   2000-03-18   wk76    NA
24091:    Zombie Nation           Kernkraft 400   2000-09-02   wk76    NA
24092:  matchbox twenty                    Bent   2000-04-29   wk76    NA

You can also take that long data set and turn it wide again.

wider <- longer %>% 
  dt_pivot_wider(
    names_from = week,
    values_from = rank
  )
wider[, .(artist, track, wk1, wk2)]
Key: <artist, track>
               artist                   track   wk1   wk2
               <char>                  <char> <num> <num>
  1:            2 Pac Baby Don't Cry (Keep...    87    82
  2:          2Ge+her The Hardest Part Of ...    91    87
  3:     3 Doors Down              Kryptonite    81    70
  4:     3 Doors Down                   Loser    76    76
  5:         504 Boyz           Wobble Wobble    57    34
 ---                                                     
313:      Yankee Grey    Another Nine Minutes    86    83
314: Yearwood, Trisha         Real Live Woman    85    83
315:  Ying Yang Twins Whistle While You Tw...    95    94
316:    Zombie Nation           Kernkraft 400    99    99
317:  matchbox twenty                    Bent    60    37
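
For readers curious about the underlying data.table calls, the round trip above corresponds roughly to melt() followed by dcast(). The sketch below uses a small made-up table rather than billboard, and the column names are chosen only to mirror the example.

```r
library(data.table)

# Toy data (hypothetical), shaped like a tiny version of billboard
toy <- data.table(
  artist = c("A", "B"),
  wk1    = c(10, 20),
  wk2    = c(11, NA)
)

# Wide to long: one row per artist and week
long <- melt(
  toy,
  id.vars         = "artist",
  variable.name   = "week",
  value.name      = "rank",
  variable.factor = FALSE   # keep `week` as character rather than factor
)

# Long to wide: one column per week again
wide <- dcast(long, artist ~ week, value.var = "rank")
```

The pivot functions trade the formula interface of dcast() and the id.vars/measure.vars arguments of melt() for names_to, values_to, names_from, and values_from.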
