Visualizing performance regression of data.table with atime

Categories

performance, testing, developer
Author

Doris Afriyie Amoakohene

Published

October 10, 2024

Since August 2023, I have been working on performance testing for the data.table package in R, work that could help expand the open-source ecosystem around it. Reliable performance testing increases confidence in code contributions by ensuring the sustained efficiency of the data.table package.

In data.table, the term “performance regression” refers to a change to the data.table source code, or to the core R build, that causes an increase in execution time or memory usage.

It is important that we prevent significant performance regressions from reaching the current release of the data.table package. Slowness and excessive memory usage are frustrating; in fact, they are exactly the problems data.table is most often used to solve. Any performance regression that makes it into a release degrades the user experience.

In this blog post, I will demonstrate the use of benchmarking techniques to verify whether reported issues on data.table have been successfully resolved.

Overview

Understanding performance in data.table

data.table is an extension of R’s data.frame, designed to handle large datasets efficiently. It provides a syntax that is both concise and expressive, allowing users to perform complex data manipulations with ease. Its efficiency is particularly evident when dealing with tasks like filtering, grouping, aggregating, and joining data.
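
For instance, here is a small sketch of typical data.table operations (the table and column names are illustrative, not taken from the benchmarks below):

library(data.table)

# A toy table: two groups of three rows each
DT <- data.table(id = rep(c("a", "b"), each = 3), v = 1:6)

DT[v > 2]                          # filter rows
DT[, .(total = sum(v)), by = id]   # aggregate within each group
setkey(DT, id)                     # set a key for fast joins and lookups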

The development team behind data.table is committed to continuously improving its performance. Over the years, several major version changes have been introduced, aiming to enhance speed and efficiency. These changes include algorithmic optimizations, memory management improvements, and enhancements to parallel processing capabilities. Upgrading to the latest version ensures that users can leverage the most recent performance enhancements.

Why do we run performance tests on GitHub commits?

Running performance tests on GitHub commits helps maintain a high performance standard for the package: it detects performance regressions early, guides code optimization, validates performance improvements, ensures consistent performance over time, and encourages confidence in code contributions from new contributors.

It is an essential practice to deliver a performant and reliable package to end-users.

Benchmarking for performance evaluation

To evaluate data.table performance, it is essential to employ benchmarking methodologies. The approach I use relies on the atime_versions function from the atime package, which measures the actual execution time and memory usage of specific operations. This allows accurate comparisons between different versions of the data.table package, and produces a graphical visualization of the results.
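
As a quick illustration of the package's core idea, the atime() function benchmarks one or more expressions across a range of data sizes; below is a minimal sketch in the style of the atime documentation (the expressions are illustrative):

library(atime)

# Each named argument after setup is one expression to benchmark;
# atime stops increasing N once its time limit is exceeded.
atime.list <- atime::atime(
  N = 2^seq(2, 20),
  setup = { x <- rnorm(N) },  # re-evaluated for every data size N
  mean = mean(x),
  sum = sum(x))
plot(atime.list)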

Details of the performance tests

The primary function atime_versions has six main arguments:

  1. pkg.path: This argument specifies the location on your system where you have stored a git clone of the data.table package.

  2. pkg.edit.fun: A function that edits the package source before installation so that multiple versions can be installed side by side. The default behavior is designed to work with Rcpp packages: every occurrence of the string “PKG” in the package code is replaced with “PKG.SHA”, where “SHA” is the commit SHA/id of the version being installed.

  3. N: A sequence of numbers defining the different data sizes at which to benchmark the operation; the setup code is evaluated once for each value of N.

  4. setup: This section contains the code that generates the dataset used in the benchmarking process; the size of the data is determined by the current value of N.

  5. expr: This section contains the expression that represents the operation being benchmarked. It uses the data.table:::`[.data.table` syntax to perform the operation on the dataset.

In the syntax data.table:::`[.data.table`, the first part, data.table:::, installs and loads different versions of the data.table package based on the specified commit ids. Hence, data.table::: is translated to data.table.SHA1::: for each version hash SHA1. Following that, the expression specified within `[.data.table` is executed on each installed version. This process is repeated for all the commit ids specified in the code.

For example:

data.table.ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4:::`[.data.table`(DT, , .(v3=mean(v3, na.rm=TRUE)), by=id3, verbose=TRUE)

In this example, the function `[.data.table` from the data.table version at commit ec1259af1bf13fc0c96a1d3f9e84d55d8106a9a4 is executed on the DT dataset. The expression calculates the mean of the v3 column (ignoring missing values) grouped by id3, and the verbose=TRUE argument enables verbose output during the operation. This process is repeated for every commit id in the code, to compare the performance of the corresponding versions of the data.table package.

  6. ... : This specifies the different versions of the data.table package to be tested. Here it includes three versions: “Before,” “Regression,” and “Fixed,” each associated with a specific commit id. A skeletal call combining all six arguments is sketched below; a complete, runnable example follows later in the post.
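
Putting the arguments together, a skeletal call looks like this (the path, sizes, and SHAs are placeholders; pkg.edit.fun is shown later in the post):

atime.list <- atime::atime_versions(
  pkg.path = "path/to/data.table/clone",  # git clone of data.table
  pkg.edit.fun = pkg.edit.fun,      # renames PKG to PKG.SHA before install
  N = 10^seq(1, 7),                 # data sizes to benchmark
  setup = { DT <- data.table(x = rnorm(N)) },  # setup depends on N
  expr = data.table:::`[.data.table`(DT, , sum(x)),
  "Before" = "<commit sha>",
  "Regression" = "<commit sha>",
  "Fixed" = "<commit sha>")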

Test procedure

We run the full performance regression analysis with atime at three points in the commit history:

  1. Before the change causing performance regression is made (Before)
  2. When the change causing performance regression is first submitted (Regression)
  3. After the Pull Request (PR) which fixes the performance regression (Fixed)

Overall workflow

When a fixing Pull Request is submitted, our procedure automatically takes the following steps:

  1. Pass the hashes for the different branches (Before, Regression, Fixed) to atime_versions, along with the other parameters of the test (data sizes N, the expression to run, etc.).

  2. Use the atime_versions function to measure time and memory usage across different versions.

  3. Generate a plot to showcase the test results, using the atime package's built-in plotting functions.

  4. Display the plot and test results as a comment on the submitted Pull Request.

Here is an example of how to perform the atime test. More documentation of the atime package can be found here.

Example

The first example we will show is an issue reported about group computations, specifically when running R's C eval: link to GitHub Issue that reported the regression. The regression was caused by the inclusion of certain code within an #if block. This PR discusses the specific C code in q7 and q8 of the “db-benchmark” which caused the regression.

This PR fixed the regression problem.

The details of the code problems and solutions are not required for the example; we link them only to map out the regression-and-fix process.

To produce performance test results, we first load package dependencies, as well as the current GitHub snapshot of data.table in development:

library(atime)
library(ggplot2)
library(data.table)

# Clone the data.table repository into a temporary directory so that
# atime_versions can check out and install specific commits from it.
tdir <- tempfile()
dir.create(tdir)
git2r::clone("https://github.com/Rdatatable/data.table", tdir)

Next, we establish our performance test. Here, we will create a data.table object and then compute the range by group. We vary the size of the object by varying values of N across tests.

# Setup: build a data.table with N rows; id3 mixes ~0.9*N unique ids
# with ~0.1*N duplicates, so groups vary in size.
d <- data.table(
  id3 = sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE))),
  v1 = sample(5L, N, TRUE),
  v2 = sample(5L, N, TRUE)
)

# Expression under test: max(v1) - min(v2) within each id3 group.
data.table:::`[.data.table`(d, , (max(v1)-min(v2)), by = id3)
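
For readers unfamiliar with the function-call form, the expression above is equivalent to the ordinary bracket syntax; the prefixed form is used only so that atime_versions can rewrite the data.table::: namespace to a versioned one (data.table.SHA:::) for each commit:

d[, max(v1) - min(v2), by = id3]   # same computation, ordinary syntax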

This setup and expression are then passed to atime_versions, along with a bit of package management information and the hashes (a.k.a. “commit id” or “SHA”) for the commits before, during, and after the performance regression.

atime.list.4200 <- atime::atime_versions(
  pkg.path = tdir,
  pkg.edit.fun = pkg.edit.fun,
  N = 10^seq(1,20), # data sizes; atime stops growing N once its time limit is exceeded
  setup = { 
    set.seed(108)
    d <- data.table(
      id3 = sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE))),
      v1 = sample(5L, N, TRUE),
      v2 = sample(5L, N, TRUE))
  },
  expr = data.table:::`[.data.table`(d, , (max(v1)-min(v2)), by = id3),
  "Before" = "793f8545c363d222de18ac892bc7abb80154e724", # commit hash in PR prior to regression
  "Regression" = "c152ced0e5799acee1589910c69c1a2c6586b95d", # commit hash in PR causing regression
  "Fixed" = "f750448a2efcd258b3aba57136ee6a95ce56b302" # commit hash in PR that fixes the regression
)
Note

The function pkg.edit.fun that is passed to atime_versions above is a custom function written to manage the packages and paths on the server running this test.

You can see the code below if you wish.

Code
# Rename the package (data.table -> data.table.SHA) throughout its source
# so that multiple versions can be installed side by side.
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path){
      pkg_find_replace <- function(glob, FIND, REPLACE){
        atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
      }
      Package_regex <- gsub(".", "_?", old.Package, fixed=TRUE)
      Package_ <- gsub(".", "_", old.Package, fixed=TRUE)
      new.Package_ <- paste0(Package_, "_", sha)
      pkg_find_replace(
        "DESCRIPTION", 
        paste0("Package:\\s+", old.Package),
        paste("Package:", new.Package))
      pkg_find_replace(
        file.path("src","Makevars.*in"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        Package_regex,
        new.Package_)
      pkg_find_replace(
        file.path("R", "onLoad.R"),
        sprintf('packageVersion\\("%s"\\)', old.Package),
        sprintf('packageVersion\\("%s"\\)', new.Package))
      pkg_find_replace(
        file.path("src", "init.c"),
        paste0("R_init_", Package_regex),
        paste0("R_init_", gsub("[.]", "_", new.Package_)))
      pkg_find_replace(
        "NAMESPACE",
        sprintf('useDynLib\\("?%s"?', Package_regex),
        paste0('useDynLib(', new.Package_))
    }
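
To make the effect concrete, the first pkg_find_replace call edits the DESCRIPTION file of the clone roughly as follows (illustrative; the SHA shown is the “Fixed” commit from the call above):

# before:  Package: data.table
# after:   Package: data.table.f750448a2efcd258b3aba57136ee6a95ce56b302
# The renamed package can then be installed alongside other versions.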

Results

The atime package uses the results of the performance test to create the following plot:

Plot showing the 3 branches (Regression, Fixed and Before) of the issues in #4200

The graph compares the time required to execute the operation before, during, and after fixing a regression issue. The x-axis (N) represents the size of the data on a logarithmic scale. The y-axis represents the median time in milliseconds (logarithmic scale).

Lines:

“Before”: performance before the regression was introduced; this is the level we hope to return to after the fix.

“Regression”: degraded performance after the commit that introduced the regression.

“Fixed”: performance after the fixing Pull Request, which should match “Before.”

In the graph, as the data size N increases, the median time of the “Regression” version grows noticeably faster than the other two. After the fix, the “Fixed” line falls back onto the “Before” line, indicating that the regression was successfully addressed.
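
The figure can be reproduced from the atime_versions result using the package's plot method; a minimal sketch (the styling of the figure above may differ):

# plot() on an atime result draws median time (and memory) versus N,
# one line per version, on log-log axes.
plot(atime.list.4200)

# The underlying measurements are also available for inspection:
head(atime.list.4200$measurements)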

Automated testing with GitHub Actions

As part of the data.table ecosystem project, Anirban Chetia has implemented a GitHub Action that automatically runs performance tests whenever a Pull Request is opened against the data.table repository. The action runs the atime performance tests and posts plots of the results as a comment within the pull request. See an example in this pull request.

This action allows the package maintainers to easily determine whether a Pull Request has any impact on the time or memory usage of the data.table package. To learn more, you can visit Anirban's documentation or this README about the atime package.

Conclusion

In this blog post, we have delved into the use of the atime package to compare the asymptotic time and memory usage of different development versions of the data.table package. Specifically, we visualized the comparisons between the “Before,” “Regression,” and “Fixed” versions for a specific performance regression issue.

By employing benchmarking tools like atime, we gain valuable insights into the performance characteristics of proposed updates to the data.table package. This allows us to identify and address performance regressions, ensuring that each new version of the package has indeed solved the particular issue reported.

For more examples or practice with atime and regression, you can visit this link and the corresponding fix PR here.
