Column assignment and reference semantics in {data.table}

tips
tutorials
developer
Author

Toby Hocking

Published

February 18, 2024

The goal of this blog post is to explain some similarities and differences between the base R data.frame object type, and the data.table object type. We will focus on accessing and assigning values, and discuss two major differences:

Difference in syntax

To break down the similarities and differences in syntax, consider the data below,

See source code
library(data.table)
library(knitr)

syntax <- function(type, name, columns, code){
  mcall <- match.call()
  dt.args <- lapply(as.list(mcall[-1]), paste)
  do.call(data.table, dt.args)
}

syntax.tab <- rbind(
  syntax(frame, literal, one, "df$col_name <- value"),
  syntax(table, literal, one, "DT[, col_name := value]"),
  syntax(frame, variable, multiple, 'df[, col_names_list] <- values'),
  syntax(table, variable, multiple, 'DT[, (col_names_list) := values]'))

syntax.tab |> kable()
type name columns code
frame literal one df$col_name <- value
table literal one DT[, col_name := value]
frame variable multiple df[, col_names_list] <- values
table variable multiple DT[, (col_names_list) := values]

The table above defines the different syntax required to do column assignment in data tables (DT) and frames (df).

  • type indicates object type: frame or table.

  • name indicates whether the column(s) to assign are literally written in the code (col_name), or if the names are stored in a variable (col_names_list).

  • columns indicates whether only one or multiple (one or more) columns can be assigned using the syntax.

  • code is the exact syntax of the R code used for the assignment.

Note that there are other ways to do column assignment. For example,

  • DF[["col_name"]] <- value can also be used for single column assignment in a data frame.

  • set(DT, j=col_name_list, value=values) is a more efficient version of column assignment for data tables, that is recommended for use in loops, as it avoids the overhead of the [.data.table method.

Below is a reshaped version of the table above, to facilitate easier comparison between frame and table versions:

See source code
options(width=100)
data.table::dcast(syntax.tab, name + columns ~ type, value.var="code")  |> kable()
name columns frame table
literal one df$col_name <- value DT[, col_name := value]
variable multiple df[, col_names_list] <- values DT[, (col_names_list) := values]

The table above shows the equivalent code for assignment of columns using either a data.frame or data.table. In fact, the code in the frame column above can also be used for assignment of a data.table, but it may be less efficient than the data table square brackets, as we will discuss in the next section.

One reason why data.table uses a custom assignment syntax is for consistency: the same syntax can be used, with square brackets and :=, for one or multiple column assignment. (Note the use parentheses around col_names_list in the second row of the table column above, to indicate that the left side of := is a variable storing column names or numbers, instead of a direct unquoted column name.)

Another reason why data.table uses a custom assignment syntax is for efficiency, as we see in the next section.

Base “copy on write” versus data.table reference semantics

R has “copy on write” semantics, meaning that in base R if a variable is modified inside a function, a copy is made of the whole variable. For example, consider the code below

dt_outside <- data.table(x=1:3)

base_assign <- function(dt_inside, variable, value){
  dt_inside[1,variable] <- value # makes a copy of input variable!
}

base_assign(dt_outside, "x", 0)

dt_outside
       x
   <int>
1:     1
2:     2
3:     3

In the code above, we pass dt_outside to the base_assign function, which makes a copy called dt_inside before it is modified, so that the data in dt_outside is unchanged after the function is done. Compare with the code below,

dt_assign <- function(dt_inside, variable, value){
  dt_inside[1, (variable) := value] # directly modifies input variable
}

dt_assign(dt_outside, "x", 0)

dt_outside
       x
   <int>
1:     0
2:     2
3:     3

The output above shows that by using the square brackets and := assignment, we can modify data.table objects in functions without copying them. Here, the variables dt_inside and dt_outside point to the same underlying data.

Efficiency of reference semantics

Reference semantics mean that data.table assignment is potentially much more efficient than base R, in terms of time and memory usage. To demonstrate, we use the following benchmark. Assume we have a table with \(N\) rows, but we just want to modify one row. This should be a constant time/space operation (independent of \(N\)), but because of the base R copy on write semantics, it will be a linear time/space operation, \(O(N)\).

See source code
atime_result <- atime::atime(
  N = 10^seq(1, 7, by = 0.5),
  setup = {
    dt <- data.table(x = 1:N)
  },
  dt_assign = dt_assign(dt, "x", 0),
  base_assign = base_assign(dt, "x", 0))

plot(atime_result)
Loading required namespace: ggplot2
Loading required namespace: directlabels

We can see from the plot above that for base_assign, both time and space increase with \(N\), because the entire table is copied; whereas dt_assign is constant time/space, because only one row is modified with no copy necessary.

Note

The code in this section used a data.table object in both function calls to illustrate the constant time/space assignment which is possible, but the visualized result also applies to other data structures.

As an exercise, add two more expressions to the atime benchmark: base_assign with a data.frame object and tibble object. You should see linear time/space for both.

Conclusions

In this post we have explored the syntax and semantics for assignment using base R and data.table square brackets with :=, and we have seen how the reference semantics of data.table can be very beneficial for computational efficiency.

No matching items