Seal of Approval: tidyfast

`tidyfast`

Author(s): Tyson S. Barrett, Mark Fairbanks, Ivan Leung, Indrajeet Patil

Maintainer: Tyson S. Barrett (t.barrett88@gmail.com)

The goal of tidyfast is to provide fast and efficient alternatives to some tidyr (and a few dplyr) functions using data.table under the hood. Each function has the prefix dt_ to allow for autocomplete in IDEs such as RStudio. These functions should complement some of the current functionality in dtplyr (but notably do not use the lazy_dt() framework of dtplyr). This package imports data.table and cpp11 (no other dependencies). These are, in essence, translations from a more tidyverse grammar to data.table. Most functions herein cover places where, in my opinion, the data.table syntax is not obvious or clear. As such, these functions can translate a simple function call into the fast, efficient, and concise syntax of data.table.

Relationship with `data.table`

tidyfast was designed to be an extension to and translation of data.table. As such, there are three main ways tidyfast is related to data.table.

This package is built directly on data.table, using direct calls to [.data.table and other functions under the hood.
It relies on only two packages, cpp11 and data.table, both stable packages that are unlikely to have frequent breaking changes. This follows the data.table principle of few dependencies.
It was also designed to show how others can use data.table within their own packages to create functions that flexibly call data.table in complex ways.
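
To make the third point concrete, here is a minimal sketch of a package-style function that wraps a call to [.data.table. The helper name fill_down() and its arguments cols and by_cols are hypothetical and are not part of tidyfast. It uses data.table::nafill() for the filling step, which is an assumption made for illustration and not necessarily how dt_fill() is implemented.

```r
library(data.table)

# Hypothetical helper (not part of tidyfast): carry the last non-NA value
# forward in the columns named in `cols`, optionally within the groups
# defined by the columns named in `by_cols`.
fill_down <- function(dt, cols, by_cols = NULL) {
  out <- copy(dt)  # work on a copy so the caller's table is left untouched
  out[, (cols) := lapply(.SD, nafill, type = "locf"),
      by = by_cols, .SDcols = cols]
  out[]
}

toy <- data.table(
  id = c(1L, 1L, 2L, 2L),
  a  = c(1L, NA, NA, 4L)
)

fill_down(toy, cols = "a")                  # a becomes 1, 1, 1, 4
fill_down(toy, cols = "a", by_cols = "id")  # a becomes 1, 1, NA, 4
```

The wrapper hides the `:=`, .SD, and .SDcols details behind a single function call, which is the kind of translation the package aims for. Note that nafill() only handles numeric columns, so this sketch is narrower than a general-purpose fill.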

Overview

As shown on the tidyfast GitHub page, tidyfast has several functions that have the prefix dt_. A few notable functions from the package are shown below.

library(tidyfast)
library(data.table)
library(magrittr)

dt_fill

Filling NAs is a common task, but tidyr::fill() can become too slow, especially when done across many, many groups. dt_fill() is useful here and can be used in a few different ways.

x = 1:10
dt_with_nas <- data.table(
  x = x,
  y = shift(x, 2L),
  z = shift(x, -2L),
  a = sample(c(rep(NA, 10), x), 10),
  id = sample(1:3, 10, replace = TRUE)
)

# Original
dt_with_nas

        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3     4     1
 2:     2    NA     4    NA     1
 3:     3     1     5    NA     2
 4:     4     2     6     2     2
 5:     5     3     7     3     3
 6:     6     4     8     1     3
 7:     7     5     9     9     3
 8:     8     6    10    NA     1
 9:     9     7    NA    10     2
10:    10     8    NA     6     1

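# Default fill direction ("down"), no grouping
# Note: immutable = FALSE fills dt_with_nas by reference, so the calls below start from the filled table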
dt_fill(dt_with_nas, y, z, a, immutable = FALSE)

        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3     4     1
 2:     2    NA     4     4     1
 3:     3     1     5     4     2
 4:     4     2     6     2     2
 5:     5     3     7     3     3
 6:     6     4     8     1     3
 7:     7     5     9     9     3
 8:     8     6    10     9     1
 9:     9     7    10    10     2
10:    10     8    10     6     1

# by id variable called `id`
dt_fill(dt_with_nas, 
        y, z, a, 
        id = list(id))

        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1    NA     3     4     1
 2:     2    NA     4     4     1
 3:     3     1     5     4     2
 4:     4     2     6     2     2
 5:     5     3     7     3     3
 6:     6     4     8     1     3
 7:     7     5     9     9     3
 8:     8     6    10     9     1
 9:     9     7    10    10     2
10:    10     8    10     6     1

# both down and then up filling by group
dt_fill(dt_with_nas, 
        y, z, a, 
        id = list(id), 
        .direction = "downup")

        x     y     z     a    id
    <int> <int> <int> <int> <int>
 1:     1     6     3     4     1
 2:     2     6     4     4     1
 3:     3     1     5     4     2
 4:     4     2     6     2     2
 5:     5     3     7     3     3
 6:     6     4     8     1     3
 7:     7     5     9     9     3
 8:     8     6    10     9     1
 9:     9     7    10    10     2
10:    10     8    10     6     1

dt_nest

Nesting data can be useful for a number of reasons, including running multiple statistical models in a structured way, storing non-standard data types (e.g., graphics), easing the cognitive burden of joining data sets, storing information that is only useful as a group (e.g., boundaries of polygons), among others. The dt_nest() function takes a data.table and ID variables and nests the remaining columns into a list column of data.tables as shown below.

dt <- data.table(
   x = rnorm(1e5),
   y = runif(1e5),
   grp = sample(1L:5L, 1e5, replace = TRUE),
   nested1 = lapply(1:10, sample, 10, replace = TRUE),
   nested2 = lapply(c("thing1", "thing2"), sample, 10, replace = TRUE),
   id = 1:1e5
)

nested <- dt_nest(dt, grp)
nested

Key: <grp>
     grp                  data
   <int>                <list>
1:     1 <data.table[20074x5]>
2:     2 <data.table[19792x5]>
3:     3 <data.table[20113x5]>
4:     4 <data.table[19991x5]>
5:     5 <data.table[20030x5]>
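
As a small illustration of the modeling use case mentioned above, the sketch below nests a toy table and fits one linear model per group. The toy data and the column names x, y, and slope are assumptions made for this example. Only dt_nest() comes from tidyfast; the rest is base R and data.table.

```r
library(data.table)
library(tidyfast)

set.seed(1)
toy <- data.table(
  grp = rep(1:2, each = 50),
  x   = rnorm(100),
  y   = rnorm(100)
)

# One row per group; the remaining columns live in the `data` list column
nested <- dt_nest(toy, grp)

# Fit y ~ x separately on each nested data.table and keep only the slope
slopes <- vapply(
  nested$data,
  function(d) coef(lm(y ~ x, data = d))[["x"]],
  numeric(1)
)

# Store the per-group slopes next to the nested data
nested[, slope := slopes]
nested
```

Keeping the results in the same table as the nested data means each estimate stays aligned with the group it came from.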

dt_pivot_longer and dt_pivot_wider

The last example for this brief post is pivoting. In my opinion, the pivot syntax is easy to remember and use, so it is nice to have that syntax with the performance of melt() and dcast(). Although these functions don’t have the full functionality of tidyr’s pivot functions, they can do most things you need when reshaping data.
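
A minimal sketch of a round trip between wide and long formats is shown here. The toy table and the names stuff and things are made up for this illustration. The melt() and dcast() calls at the end are rough data.table equivalents for comparison, not necessarily what the tidyfast functions run internally.

```r
library(data.table)
library(tidyfast)

wide <- data.table(
  z = c("a", "b", "c"),
  x = c(1, 2, 3),
  y = c(4, 5, 6)
)

# Wide to long: stack x and y into name/value pairs, keeping z as the identifier
longer <- dt_pivot_longer(
  wide,
  cols = c(x, y),
  names_to = "stuff",
  values_to = "things"
)

# Long to wide: spread the values of `stuff` back out into separate columns
wider <- dt_pivot_wider(
  longer,
  id_cols = z,
  names_from = stuff,
  values_from = things
)

# Rough data.table equivalents of the two calls above
melt(wide, id.vars = "z", variable.name = "stuff", value.name = "things")
dcast(longer, z ~ stuff, value.var = "things")
```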
