Like a devoted plumber, modern R loves pipes. The magrittr pipe has a long history and its fair share of detractors, but with the native pipe operator released in May 2021 it’s clear that chaining operations is now part of the R vernacular.
So it’s no surprise that people often wonder how to use pipes with data.table, as one participant of the recent data.table tutorial at LatinR 2023 did. The surprising answer is that data.table has supported pipelines since its inception in 2006. Furthermore, you can easily use either the magrittr or the native pipe.
The data.table “pipe”
Instead of passing data to functions, data.table syntax is all about operating inside the [ operator.
DT[rows, columns, by]
Where DT is a data.table object, the rows argument is used for filtering and joining operations, the columns argument can summarise and mutate, and the by argument defines the groups to which to apply these operations.
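As a minimal sketch of the three arguments, here is each one used on its own. The toy table and its column names are made up for illustration and are not the penguins data used in the rest of this post.

```r
library(data.table)

# Hypothetical toy data, made up for illustration
toy <- data.table(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 2, 3, 4, 5)
)

# rows: keep only the rows where value is greater than 1
filtered <- toy[value > 1]

# columns: compute a summary over all rows
total <- toy[, .(total = sum(value))]

# by: compute the same summary once per group
per_group <- toy[, .(total = sum(value)), by = group]
```

Each call returns a new data.table, which is what makes chaining possible.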
So, to get only Chinstrap penguins from the penguins dataset, instead of using base::subset() or dplyr::filter() you would do
penguins_chinstrap <- penguins[species == "Chinstrap"]
Or, to get the mean flipper length of these penguins for each island and sex, you could summarise the data like this:
penguins_chinstrap[, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
But because the output of the first operation is a data.table, you can add another [ operator after the first to chain both operations:
penguins[species == "Chinstrap"][, .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
I usually call this the ][ pipe.
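To check that chaining gives the same result as going through an intermediate object, here is a small sketch. The toy table is a hypothetical stand-in for penguins with only the columns needed here, and its values are made up.

```r
library(data.table)

# Hypothetical stand-in for the penguins data (values made up)
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap"),
  sex = c("male", "male", "female", "male"),
  flipper_length_mm = c(181, 192, 196, 200)
)

# Two steps with an intermediate object
chinstrap <- toy[species == "Chinstrap"]
two_steps <- chinstrap[, .(mean_flipper_length = mean(flipper_length_mm)), by = sex]

# The same two operations chained with ][
chained <- toy[species == "Chinstrap"][, .(mean_flipper_length = mean(flipper_length_mm)), by = sex]

# Compare both results using data.table's all.equal() method
same <- isTRUE(all.equal(two_steps, chained))
```

The second [ simply operates on whatever the first one returned, so no intermediate name is needed.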
You might have noticed that for just two operations, this line of code is already too long, so for even moderately long chains it’s usually advisable to put each operation on its own line. There’s some controversy on how to break the ][ pipe into lines and indent it. One option is to add a new line after the second [, which has the advantage of actually writing the ][ pipe explicitly.
penguins[species == "Chinstrap"][
  , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
A second option is to add the new line before the end of the first operation like so:
penguins[species == "Chinstrap"
  ][ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
Personally, I don’t like this syntax very much. No matter how you slice it, you always get what feel to me like incomplete lines. Also, RStudio doesn’t correctly indent the second syntax automatically.
Alternatively, the ][ pipe can go on its own line like so:
penguins[species == "Chinstrap"
  ][
    , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)
  ]
This is indented correctly by RStudio and has the advantage of making it easy to comment out each individual step:
penguins[species == "Chinstrap"
  ][
    # , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)
  ]
data.table and magrittr
Until the introduction of the native pipe, I used to write long data.table pipelines using magrittr. To do this, I took advantage of the . placeholder which, within a magrittr pipe, refers to the result of the previous step.
library(magrittr)
penguins[species == "Chinstrap"] %>%
  .[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
I really like this syntax as it’s very clean. Each line of code is a complete operation without dangling parts and it’s easy to comment out single steps.
The only downside is that the dot here has two meanings: as the placeholder for the previous result in .[, and as an alias for list in .(mean_flipper_length = mean(flipper_length_mm)). It’s not a huge issue, though, since I tend to read .[ as a single entity, but it can trip up some people.
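If the double meaning bothers you, one way around it is to spell out list() in the columns argument so that the only dot left is the magrittr placeholder. This is a sketch on a hypothetical toy table with made-up values, not the penguins data.

```r
library(data.table)
library(magrittr)

# Hypothetical stand-in for the penguins data (values made up)
toy <- data.table(
  species = c("Adelie", "Chinstrap", "Chinstrap", "Chinstrap"),
  sex = c("male", "male", "female", "male"),
  flipper_length_mm = c(181, 192, 196, 200)
)

# Two dots: the magrittr placeholder in .[ and the list alias in .(
with_dot <- toy[species == "Chinstrap"] %>%
  .[, .(mean_flipper_length = mean(flipper_length_mm)), by = sex]

# One dot: list() written out, so only the placeholder remains
with_list <- toy[species == "Chinstrap"] %>%
  .[, list(mean_flipper_length = mean(flipper_length_mm)), by = sex]

# Both spellings should give the same table
same_result <- isTRUE(all.equal(with_dot, with_list))
```

The trade-off is a few extra characters in exchange for a line where every dot means the same thing.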
Using the native pipe
At first, the native pipe didn’t have a placeholder and it didn’t allow chaining to [, so the above syntax wasn’t directly applicable. But you could cheat by creating an alias for [ and using that alias as a regular function. So this works:
DT <- `[`
penguins[species == "Chinstrap"] |>
  DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))
This worked so well that data.table officially added the DT() function (currently only in the development version), so if you’re using the latest development version you don’t even need the first line.
This syntax is fine, but I don’t like that I need to write one more character, and the closing character being a ) can get confusing because it adds to the closing ) that you usually have in the by argument.
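The _ placeholder was added to the native pipe in R 4.2.0, but at first it could only be passed as a named argument. From R 4.3.0 onwards, it can also be used at the head of an extraction call such as _[ on the right-hand side of the pipe. So now the magrittr syntax can be directly translated to
penguins[species == "Chinstrap"] |>
  _[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]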
I like this syntax even more than the original magrittr one because it solves the double meaning problem and operations get hugged by a pair of brackets.
The four pipes of data.table
So there you are: four different ways to pipe your data.tables.
Use the ][ pipe if you want your code to have minimal dependencies and work in older versions of R. Use the %>% pipe if you want your code to work in older versions of R and don’t mind the extra dependency. Use the |> pipe if you want minimal dependencies and don’t mind depending on a recent version of R: the DT() alias needs R >= 4.1.0 and the _ placeholder needs R >= 4.3.0.

penguins[species == "Chinstrap"][
  , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

penguins[species == "Chinstrap"] %>%
  .[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]

penguins[species == "Chinstrap"] |>
  DT( , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island))

penguins[species == "Chinstrap"] |>
  _[ , .(mean_flipper_length = mean(flipper_length_mm)), by = .(sex, island)]
Image by storyset on Freepik