Column assignment and reference semantics in data.table
tips
tutorials
developer
Author
Toby Hocking
Published
February 18, 2024
The goal of this blog post is to explain some similarities and differences between the base R data.frame object type, and the data.table object type. We will focus on accessing and assigning values, and discuss two major differences:
Syntax means the structure of the code that is used: the characters and symbols that execute tasks. The data.table package uses a syntax where most operations can be done within the square brackets: DT[i, j, by].
Semantics refers to the meaning of the code: how objects and variables behave when the code is executed. We say that a data.table object has reference semantics, meaning we can modify a data.table from within a function, and see those modifications after the function is done executing. In other words, two different R variables can point to, and modify, the same data.table.
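As a rough illustration of reference semantics, the sketch below shows two variables pointing to the same data.table. The variable names DT1, DT2 and DT3 are hypothetical, chosen only for this example; copy() is the data.table function that makes an explicit, independent copy.

```r
library(data.table)

DT1 <- data.table(x = 1:3)
DT2 <- DT1              # no copy: both names point to the same table
DT2[, y := x * 2L]      # modify by reference through DT2
names(DT1)              # DT1 also has the new column y

DT3 <- copy(DT1)        # explicit copy, independent of DT1
DT3[, z := 0L]          # adds z to DT3 only
names(DT1)              # DT1 is unchanged by the assignment to DT3
```

Using copy() is the usual way to opt out of reference semantics when an independent table is needed.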
Difference in syntax
To break down the similarities and differences in syntax, consider the data below,
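A minimal sketch of that data, assuming it is stored in a variable named syntax.tab (the name used by the reshaping code later in this section) and that its rows match the reshaped table shown further down. Building it with data.table() is an assumption made for illustration.

```r
library(data.table)

# One row per combination of object type and way of naming columns.
# Row contents are taken from the reshaped table later in this section.
syntax.tab <- data.table(
  type = c("frame", "table", "frame", "table"),
  name = c("literal", "literal", "variable", "variable"),
  columns = c("one", "one", "multiple", "multiple"),
  code = c(
    "df$col_name <- value",
    "DT[, col_name := value]",
    "df[, col_names_list] <- values",
    "DT[, (col_names_list) := values]"))
syntax.tab
```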
The table above defines the different syntax required to do column assignment in data tables (DT) and frames (df).
type indicates object type: frame or table.
name indicates whether the column(s) to assign are literally written in the code (col_name), or if the names are stored in a variable (col_names_list).
columns indicates whether only one or multiple (one or more) columns can be assigned using the syntax.
code is the exact syntax of the R code used for the assignment.
Note that there are other ways to do column assignment. For example,
df[["col_name"]] <- value can also be used for single column assignment in a data frame.
set(DT, j=col_names_list, value=values) is a more efficient version of column assignment for data tables, which is recommended for use in loops, as it avoids the overhead of the [.data.table method.
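The two alternatives above can be sketched as follows. The example data and the multiplication by 10 are hypothetical, chosen only to show the calls.

```r
library(data.table)

# Data frame: single column assignment with [[<-
df <- data.frame(a = 1:3, b = 4:6)
df[["a"]] <- df[["a"]] * 10L

# Data table: set() inside a loop over column names,
# avoiding the overhead of [.data.table on each iteration
DT <- data.table(a = 1:3, b = 4:6)
col_names_list <- c("a", "b")
for (col_name in col_names_list) {
  set(DT, j = col_name, value = DT[[col_name]] * 10L)
}
DT
```

Like :=, set() modifies DT by reference, so there is no need to assign its result back to DT.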
Below is a reshaped version of the table above, to facilitate easier comparison between frame and table versions:
options(width=100)
data.table::dcast(syntax.tab, name + columns ~ type, value.var="code") |>
  knitr::kable()
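| name     | columns  | frame                          | table                            |
|:---------|:---------|:-------------------------------|:---------------------------------|
| literal  | one      | df$col_name <- value           | DT[, col_name := value]          |
| variable | multiple | df[, col_names_list] <- values | DT[, (col_names_list) := values] |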
The table above shows equivalent code for column assignment using either a data.frame or a data.table. In fact, the code in the frame column above can also be used to assign columns of a data.table, but it may be less efficient than the data.table square brackets with :=, as we will discuss in the next section.
One reason why data.table uses a custom assignment syntax is consistency: the same syntax, square brackets with :=, can be used to assign one column or multiple columns. (Note the use of parentheses around col_names_list in the second row of the table column above, which indicate that the left side of := is a variable storing column names or numbers, instead of a direct unquoted column name.)
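A short sketch of both forms, using a hypothetical table with a single column x. The new column names double_x and triple_x are made up for this example.

```r
library(data.table)

DT <- data.table(x = 1:3)

# One column, name written literally in the code
DT[, col_name := x + 1L]

# Multiple columns, names stored in a variable:
# the parentheses tell := to look up the variable's value
col_names_list <- c("double_x", "triple_x")
DT[, (col_names_list) := list(x * 2L, x * 3L)]

names(DT)
```

Without the parentheses, the left side of := would be read as a literal column name, as in the first assignment.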
Another reason why data.table uses a custom assignment syntax is for efficiency, as we see in the next section.
Base “copy on write” versus data.table reference semantics
R has “copy on write” semantics, meaning that in base R, if a variable passed to a function is modified inside that function, a copy is made of the whole variable. For example, consider the code below:
library(data.table)
dt_outside <- data.table(x=1:3)
base_assign <- function(dt_inside, variable, value){
  dt_inside[1, variable] <- value # makes a copy of input variable!
}
base_assign(dt_outside, "x", 0)
dt_outside
       x
   <int>
1:     1
2:     2
3:     3
In the code above, we pass dt_outside to the base_assign function, which makes a copy called dt_inside before it is modified, so that the data in dt_outside is unchanged after the function is done. Compare with the code below,
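A minimal sketch of the data.table version, assuming a function named dt_assign (the name used in the benchmark discussion later in this post) whose body uses square brackets with :=. The exact body is an assumption, and 0L is passed so that the assigned value matches the integer type of column x.

```r
library(data.table)

dt_outside <- data.table(x = 1:3)

dt_assign <- function(dt_inside, variable, value){
  # parentheses: variable stores the column name
  dt_inside[1, (variable) := value] # modifies in place, no copy
  invisible(dt_inside)
}
dt_assign(dt_outside, "x", 0L)
dt_outside$x # first element is now 0: the change is visible outside
```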
The output above shows that by using the square brackets and := assignment, we can modify data.table objects in functions without copying them. Here, the variables dt_inside and dt_outside point to the same underlying data.
Efficiency of reference semantics
Reference semantics mean that data.table assignment is potentially much more efficient than base R, in terms of time and memory usage. To demonstrate, we use the following benchmark. Assume we have a table with \(N\) rows, but we just want to modify one row. This should be a constant time/space operation (independent of \(N\)), but because of the base R copy on write semantics, it will be a linear time/space operation, \(O(N)\).
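One way such a benchmark could be set up is sketched below with the atime package. The function bodies, the range of N, and the use of plot() on the result are assumptions made for illustration, not necessarily the code behind the figure discussed next.

```r
library(data.table)

base_assign <- function(dt_inside, variable, value){
  dt_inside[1, variable] <- value   # copy on write
}
dt_assign <- function(dt_inside, variable, value){
  dt_inside[1, (variable) := value] # assignment by reference
}

atime_result <- atime::atime(
  N = 10^seq(1, 5),                 # table sizes to try (assumed range)
  setup = {
    dt_outside <- data.table(x = 1:N)
  },
  base_assign = base_assign(dt_outside, "x", 0L),
  dt_assign = dt_assign(dt_outside, "x", 0L))
plot(atime_result)
```

Each named argument after setup is an expression to time at every value of N, so adding another method to the comparison only requires one more named expression.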
We can see from the plot above that for base_assign, both time and space increase with \(N\), because the entire table is copied; whereas dt_assign is constant time/space, because only one row is modified with no copy necessary.
Note
The code in this section used a data.table object in both function calls, to illustrate the constant time/space assignment that is possible with :=. The linear time/space result for base_assign also applies to other data structures, such as data.frame and tibble.
As an exercise, add two more expressions to the atime benchmark: base_assign with a data.frame object and tibble object. You should see linear time/space for both.
Conclusions
In this post we have explored the syntax and semantics for assignment using base R and data.table square brackets with :=, and we have seen how the reference semantics of data.table can be very beneficial for computational efficiency.