Piping data.tables

Like a devoted plumber, modern R loves pipes. The magrittr pipe has a long history and its fair share of detractors, but with the native pipe operator released in R 4.1.0 in May 2021, it’s clear that chaining operations is now part of the R vernacular.

So it’s no surprise that people often wonder how to use pipes with data.table, as one participant did in the recent data.table tutorial at LatinR 2023. The surprising answer is that data.table has supported pipelines since its inception in 2006. Furthermore, you can easily use either the magrittr pipe or the native pipe.

Image: vector graphic of two people dressed in orange safety gear maintaining a thick pipe. Image by storyset on Freepik.

The data.table “pipe”

Instead of passing data to functions, data.table syntax is all about operating inside the [ operator¹.

DT[rows, columns, by]

Where DT is a data.table object, the rows argument is used for filtering and joining operations, the columns argument can summarise and mutate, and the by argument defines the groups to which to apply these operations.
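If you want to try the three slots without downloading anything, here is a minimal sketch on a tiny made-up table. The toy object and its values are invented for illustration; only the column names mirror the penguins data. The examples in the rest of this post assume that penguins is already a data.table (for instance, converted with as.data.table()).

```r
library(data.table)

# Toy data made up for illustration; only the column names mirror the penguins data.
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap", "Chinstrap", "Gentoo"),
  island  = c("Torgersen", "Dream", "Dream", "Dream", "Dream", "Biscoe"),
  sex     = c("male", "female", "male", "female", "male", "male"),
  flipper_length_mm = c(181, 192, 196, 198, 200, 217)
)

# rows: keep only Chinstrap penguins
# columns: compute a summary column
# by: return one row per sex
toy_summary <- toy[
  species == "Chinstrap",
  .(mean_flipper_length = mean(flipper_length_mm)),
  by = sex
]

print(toy_summary)
```

All three slots are optional, so the examples below fill in only the ones they need.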

So, to get only Chinstrap penguins from the penguins dataset, instead of using base::subset() or dplyr::filter() you would do

penguins_chinstrap <- penguins[species == "Chinstrap"]

Or, to get the mean flipper length of these penguins for each island and sex, you could summarise the data like this:

penguins_chinstrap[, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

But because the output of the first operation is a data.table, you can add another [ operator after the first to chain both operations:

penguins[species == "Chinstrap"][, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

I usually call this the ][ pipe.

You might have noticed that for just two operations, this line of code is already too long, so for even moderately long chains it’s usually advisable to put each operation on its own line. There’s some controversy on how to break the ][ pipe into lines and indent it. One option is to add a new line after the second [, which has the advantage of actually writing the ][ pipe explicitly.

penguins[species == "Chinstrap" ][
  , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

A second option is to add the new line before the end of the first operation like so:

penguins[species == "Chinstrap" 
       ][ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

Personally, I don’t like either syntax very much. No matter how you slice it, you always get what feel to me like incomplete lines. Also, RStudio doesn’t correctly indent the second syntax automatically.

Alternatively, the ][ pipe can go on its own line like so:

penguins[species == "Chinstrap" 
][ 
  , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)
]

This is indented correctly by RStudio and has the advantage of making it easy to comment out each individual step:

penguins[species == "Chinstrap" 
][ 
  # , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)
]
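To get a feel for how this layout scales, here is a hedged sketch of a three-step chain. It runs on a small made-up table (toy and its values are invented for illustration; only the column names mirror the penguins data), and the final ordering step is my own addition to make the chain longer.

```r
library(data.table)

# Toy data made up for illustration; only the column names mirror the penguins data.
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap", "Chinstrap", "Gentoo"),
  island  = c("Torgersen", "Dream", "Dream", "Dream", "Dream", "Biscoe"),
  sex     = c("male", "female", "male", "female", "male", "male"),
  flipper_length_mm = c(181, 192, 196, 198, 200, 217)
)

# Each step sits between its own ][ lines, so any step can be commented out.
result <- toy[species == "Chinstrap"
][
  # Summarise: mean flipper length per sex and island
  , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)
][
  # Sort: largest mean first
  order(-mean_flipper_length)
]

print(result)
```

Commenting out the contents of a step leaves an empty [ ], which simply returns the table unchanged, so the rest of the chain keeps working.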

data.table and magrittr

Until the introduction of the native pipe, I used to write long data.table pipelines using magrittr. To do this, I took advantage of the . placeholder which, within a magrittr pipe, refers to the result of the previous step.

library(magrittr)

penguins[species == "Chinstrap"] %>%
  .[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

I really like this syntax as it’s very clean. Each line of code is a complete operation without dangling parts and it’s easy to comment out single steps.

The only downside is that the dot here has two meanings: as the placeholder for the previous result in .[, and as an alias for list in .(mean_flipper_length = mean(flipper_length_mm)). It’s not a huge issue, though, since I tend to read .[ as a single entity, but it can trip up some people.
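If the double meaning bothers you, one way around it is to spell out list() instead of using the .() alias, so that the only dot left is the magrittr placeholder. This is a minimal sketch on a made-up table (toy and its values are invented for illustration; only the column names mirror the penguins data):

```r
library(data.table)
library(magrittr)

# Toy data made up for illustration; only the column names mirror the penguins data.
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap", "Chinstrap", "Gentoo"),
  island  = c("Torgersen", "Dream", "Dream", "Dream", "Dream", "Biscoe"),
  sex     = c("male", "female", "male", "female", "male", "male"),
  flipper_length_mm = c(181, 192, 196, 198, 200, 217)
)

# Two kinds of dot: the magrittr placeholder in .[ and the list alias in .(
with_dot <- toy[species == "Chinstrap"] %>%
  .[, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

# Only the placeholder dot remains; list() replaces the .() alias
with_list <- toy[species == "Chinstrap"] %>%
  .[, list(mean_flipper_length = mean(flipper_length_mm)), by = list(sex, island)]

# Both spellings should give the same table
print(all.equal(with_dot, with_list))
```

It’s a bit more typing, but it can be worth it in code that other people need to read.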

Using the native pipe

At first, the native pipe didn’t have a placeholder and didn’t support chaining to [, so the above syntax wasn’t directly applicable. But you could cheat by creating an alias for [ and using that alias as a regular function. So this works:

DT <- `[`

penguins[species == "Chinstrap"] |> 
  DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))

This worked so well that data.table officially added the DT() function (currently only in the development version), so if you’re using the latest development version you don’t even need the first line².

This syntax is fine, but I don’t like that I need to write one more character. Also, the closing character being a ) can get confusing, because it adds to the closing ) that you usually have in the by argument.

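The native pipe gained a _ placeholder in R 4.2.0, but only for named arguments. From R 4.3.0 onwards, the placeholder can also be used on the right-hand side of the pipe as the head of an extraction such as _[. So now the magrittr syntax can be directly translated to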

penguins[species == "Chinstrap"] |> 
  _[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

I like this syntax even more than the original magrittr one because it solves the double meaning problem and operations get hugged by a pair of brackets.
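Here is a hedged sketch of a slightly longer chain in this style. It needs R >= 4.3.0 to parse, and it runs on a small made-up table (toy and its values are invented for illustration; only the column names mirror the penguins data). The final ordering step is my own addition to show a second _[ in the chain.

```r
# Requires R >= 4.3.0: older versions cannot parse `_[` after the pipe.
library(data.table)

# Toy data made up for illustration; only the column names mirror the penguins data.
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap", "Chinstrap", "Gentoo"),
  island  = c("Torgersen", "Dream", "Dream", "Dream", "Dream", "Biscoe"),
  sex     = c("male", "female", "male", "female", "male", "male"),
  flipper_length_mm = c(181, 192, 196, 198, 200, 217)
)

# Every line after the first is a complete _[ ... ] step.
piped <- toy[species == "Chinstrap"] |>
  _[, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)] |>
  _[order(-mean_flipper_length)]

print(piped)
```

As with the magrittr version, commenting out a whole line removes that step without leaving anything dangling, as long as it isn’t the last one.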

The four pipes of data.table

So, there you are, 4 different ways you can pipe your data.tables.

Use the ][ pipe if you want your code to have minimal dependencies and work in older versions of R. Use the %>% pipe if you want your code to work in older versions of R and don’t mind the extra dependency. Use the |> pipe if you want minimal dependencies and don’t mind requiring a recent R: R >= 4.1.0 for the DT() alias, or R >= 4.3.0 for the _ placeholder.

Footnotes