Column assignment and reference semantics in data.table

The goal of this blog post is to explain some similarities and differences between the base R data.frame object type, and the data.table object type. We will focus on accessing and assigning values, and discuss two major differences:

Syntax means the structure of the code that is used: the characters and symbols that execute tasks. The data.table package uses a syntax where most operations can be done within the square brackets: DT[i, j, by].
Semantics refers to what the code means when it runs: here, what happens to an object when it is assigned or modified. We say that a data.table object has reference semantics, meaning we can modify a data.table from within a function, and see those modifications after the function is done executing. In other words, two different R variables can point to, and modify, the same data.table.
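As a minimal sketch of what "two variables pointing to the same data.table" looks like in practice, the code below creates a second name for a table and modifies the table through that name. The names DT, alias, and indep are only illustrative; copy() is the data.table function for making an independent copy.

```r
library(data.table)

DT <- data.table(x = 1:3)

# Plain assignment does not copy a data.table:
# both names now refer to the same underlying table.
alias <- DT

# Add a column by reference through the second name.
alias[, y := x * 2L]
names(DT)   # DT has the new column y as well

# copy() makes an independent table, so later changes do not propagate.
indep <- copy(DT)
indep[, z := 0L]
names(DT)   # DT has no column z
```

If shared modification is not what you want, copy() is the explicit way to opt out of it.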

Difference in syntax

To break down the similarities and differences in syntax, consider the data below,


library(data.table)
library(knitr)

syntax <- function(type, name, columns, code){
  mcall <- match.call()
  dt.args <- lapply(as.list(mcall[-1]), paste)
  do.call(data.table, dt.args)
}

syntax.tab <- rbind(
  syntax(frame, literal, one, "df$col_name <- value"),
  syntax(table, literal, one, "DT[, col_name := value]"),
  syntax(frame, variable, multiple, 'df[, col_names_list] <- values'),
  syntax(table, variable, multiple, 'DT[, (col_names_list) := values]'))

syntax.tab |> kable()

type	name	columns	code
frame	literal	one	df$col_name <- value
table	literal	one	DT[, col_name := value]
frame	variable	multiple	df[, col_names_list] <- values
table	variable	multiple	DT[, (col_names_list) := values]

The table above defines the different syntax required to do column assignment in data tables (DT) and frames (df).

type indicates object type: frame or table.
name indicates whether the column(s) to assign are literally written in the code (col_name), or if the names are stored in a variable (col_names_list).
columns indicates whether only one or multiple (one or more) columns can be assigned using the syntax.
code is the exact syntax of the R code used for the assignment.
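To make the four rows of the table concrete, here is a small runnable sketch that applies each form to toy data. The column names (a, b, c, d) and the contents of col_names_list and values are made up for illustration.

```r
library(data.table)

df <- data.frame(a = 1:3)
DT <- data.table(a = 1:3)

# Literal column name, one column.
df$b <- df$a * 2
DT[, b := a * 2]

# Column names stored in a variable, multiple columns.
col_names_list <- c("c", "d")
values <- list(letters[1:3], c(TRUE, FALSE, TRUE))
df[, col_names_list] <- values
DT[, (col_names_list) := values]

names(df)
names(DT)
```

Both objects end up with the same four columns; only the assignment syntax differs.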

Note that there are other ways to do column assignment. For example,

df[["col_name"]] <- value can also be used for single column assignment in a data frame.
set(DT, j=col_names_list, value=values) is a more efficient version of column assignment for data tables, which is recommended for use in loops, as it avoids the overhead of the [.data.table method.
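The two alternatives above can be sketched as follows. The toy columns and the loop that multiplies each column by 10 are assumptions chosen only to show the calls; set() also accepts an i argument to restrict the assignment to given rows.

```r
library(data.table)

# Base R: double square brackets with a quoted column name.
df <- data.frame(a = 1:3)
df[["b"]] <- df[["a"]] + 1L

# data.table: set() inside a loop, one column per iteration.
DT <- data.table(a = 1:3, b = 4:6)
for (col_name in names(DT)) {
  set(DT, j = col_name, value = DT[[col_name]] * 10L)
}

# set() with i modifies only the given rows, here row 1 of column a.
set(DT, i = 1L, j = "a", value = 0L)
DT
```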

Below is a reshaped version of the table above, to facilitate easier comparison between frame and table versions:


options(width=100)
data.table::dcast(syntax.tab, name + columns ~ type, value.var="code")  |> kable()

name	columns	frame	table
literal	one	df$col_name <- value	DT[, col_name := value]
variable	multiple	df[, col_names_list] <- values	DT[, (col_names_list) := values]

The table above shows the equivalent code for assignment of columns using either a data.frame or data.table. In fact, the code in the frame column above can also be used for assignment of a data.table, but it may be less efficient than using := inside the data.table square brackets, as we will discuss in the next section.

One reason why data.table uses a custom assignment syntax is for consistency: the same syntax can be used, with square brackets and :=, for one or multiple column assignment. (Note the use of parentheses around col_names_list in the second row of the table column above, to indicate that the left side of := is a variable storing column names or numbers, instead of a direct unquoted column name.)
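The effect of the parentheses can be seen in a short sketch. The variable new_col is a hypothetical name used only for this illustration.

```r
library(data.table)

DT <- data.table(x = 1:3)
new_col <- "y"

# Without parentheses, the left side of := is taken as a literal name,
# so this creates a column called "new_col".
DT[, new_col := 0L]

# With parentheses, the left side is evaluated,
# so this creates a column called "y".
DT[, (new_col) := 1L]

names(DT)
```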

Another reason why data.table uses a custom assignment syntax is for efficiency, as we see in the next section.

Base “copy on write” versus data.table reference semantics

R has “copy on write” semantics, meaning that in base R if a variable is modified inside a function, a copy is made of the whole variable, and the original is left unchanged. For example, consider the code below:

dt_outside <- data.table(x=1:3)

base_assign <- function(dt_inside, variable, value){
  dt_inside[1,variable] <- value # makes a copy of input variable!
}

base_assign(dt_outside, "x", 0)

dt_outside

       x
   <int>
1:     1
2:     2
3:     3

In the code above, we pass dt_outside to the base_assign function, which makes a copy called dt_inside before it is modified, so that the data in dt_outside is unchanged after the function is done. Compare with the code below,

dt_assign <- function(dt_inside, variable, value){
  dt_inside[1, (variable) := value] # directly modifies input variable
}

dt_assign(dt_outside, "x", 0)

dt_outside

       x
   <int>
1:     0
2:     2
3:     3

The output above shows that by using the square brackets and := assignment, we can modify data.table objects in functions without copying them. Here, the variables dt_inside and dt_outside point to the same underlying data.
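Modification by reference is not always what a caller expects from a function. As a hedged sketch, a function can take an explicit copy() first and then use := on that copy, which gives up the efficiency benefit in exchange for leaving the input untouched. The function name dt_assign_copy and the variable dt_local are hypothetical, introduced only for this example.

```r
library(data.table)

dt_assign_copy <- function(dt_inside, variable, value) {
  dt_local <- copy(dt_inside)        # explicit copy, independent of the input
  dt_local[1, (variable) := value]   # modifies only the local copy
  dt_local[]                         # return the modified copy
}

dt_outside <- data.table(x = 1:3)
dt_modified <- dt_assign_copy(dt_outside, "x", 0L)

dt_outside$x    # unchanged
dt_modified$x   # first value replaced
```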

Efficiency of reference semantics

Reference semantics mean that data.table assignment is potentially much more efficient than base R, in terms of time and memory usage. To demonstrate, we use the following benchmark. Assume we have a table with $N$ rows, but we just want to modify one row. This should be a constant time/space operation (independent of $N$), but because of the base R copy on write semantics, it will be a linear time/space operation, $O(N)$.


atime_result <- atime::atime(
  N = 10^seq(1, 7, by = 0.5),
  setup = {
    dt <- data.table(x = 1:N)
  },
  dt_assign = dt_assign(dt, "x", 0),
  base_assign = base_assign(dt, "x", 0))

plot(atime_result)

We can see from the plot above that for base_assign, both time and space increase with $N$, because the entire table is copied; whereas dt_assign is constant time/space, because only one row is modified with no copy necessary.

Note

The code in this section used a data.table object in both function calls to illustrate the constant time/space assignment which is possible, but the linear time/space result for base assignment also applies to other data structures, such as data.frame and tibble.

As an exercise, add two more expressions to the atime benchmark: base_assign with a data.frame object and tibble object. You should see linear time/space for both.

Conclusions

In this post we have explored the syntax and semantics for assignment using base R and data.table square brackets with :=, and we have seen how the reference semantics of data.table can be very beneficial for computational efficiency.
