Welcome to the data.table ecosystem project!

Hi! My name is Toby Dylan Hocking, and I have been using R since 2003. That makes 20 years, can you believe it?

Toby still using R in 50 years (artist rendering). Source: wikimedia.

I work as an Assistant Professor of Computer Science, and my research expertise is machine learning, the modern branch of artificial intelligence that uses big data. R is an important tool in my machine learning work, and in the work of many people in academia, industry, and government, because it provides so many useful functions for handling big data.

The {data.table} package

Since 2015, I have been using the R package data.table for much of the data processing I do before and after running machine learning algorithms: to get the raw data into the right format for the algorithm, and to get the results into the right format for visualization and interpretation.
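To make that concrete, here is a minimal sketch of the "results into the right format" step. The table, its column names, and its numbers are made up for illustration and do not come from a real experiment. Only melt() and the usual DT[i, j, by] syntax from data.table are used.

```r
library(data.table)

# Hypothetical results: one row per model, one column per evaluation metric.
results <- data.table(
  model = c("linear", "tree"),
  accuracy = c(0.81, 0.86),
  auc = c(0.88, 0.90))

# Reshape wide to long: one row per (model, metric) pair,
# which is the shape most plotting code expects.
long <- melt(
  results,
  id.vars = "model",
  variable.name = "metric",
  value.name = "value")

# Summarize by group: the best value observed for each metric.
best <- long[, .(best_value = max(value)), by = metric]
```

The long table has one row for each combination of model and metric, so it can be passed directly to a plotting function that maps metric to a panel or color.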

data.table is highly valued for its long-term stability and its lightning-fast speed in large data calculations.

Ecosystem expansion

With these use cases in mind, I proposed a project, “Expanding the data.table ecosystem for efficient big data manipulation in R,” and I am excited to announce that it has been funded by the National Science Foundation’s “Pathways to Enable Open Source Ecosystems” (POSE) program, for work between September 2023 and August 2025.

Our project attempts to address three issues with the current state of the data.table project:

Informal governance

The data.table package was originally created by Matt Dowle in 2008. His brilliant use of efficient algorithms and C implementations brought the world a package that has stood the test of time, and is now one of the most-used R packages available.

However, the growth of this incredible package will require more leaders than Dowle himself to help build, review, test, and organize new contributions.

Thus, one goal of this grant is to bring together data.table developers and contributors to propose a new governance structure for the package’s source code.

Limited centralized testing infrastructure

Since the primary draw of data.table is the speed of its algorithmic implementations, adding new functionality to the package is not simple. In particular, new elements must be heavily tested to ensure that they do not interfere with the core computations.

In this grant, we plan to develop software to automate the testing of new package contributions, to smooth the growth process. This part of the project includes a centralized reverse dependency checking system, new benchmarks comparing data.table with other systems such as polars and arrow, and new performance testing software.
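As a rough idea of what a performance test measures, here is a minimal sketch that times one grouped sum in data.table and in base R, and checks that both give the same answer. This is not the project's testing software. The data size, column names, and the choice of tapply() as the comparison are assumptions made for illustration.

```r
library(data.table)

# Simulated data (hypothetical): 100,000 rows spread over 26 groups.
set.seed(1)
n <- 1e5
dt <- data.table(
  group = sample(letters, n, replace = TRUE),
  x = runif(n))

# Time the same grouped sum in data.table and in base R.
time_dt <- system.time(
  sum_dt <- dt[, .(total = sum(x)), by = group])
time_base <- system.time(
  sum_base <- tapply(dt$x, dt$group, sum))

# A speed comparison is only meaningful if both computations agree,
# so compare the totals after putting the groups in the same order.
same_result <- isTRUE(all.equal(
  sum_dt[order(group)]$total,
  as.numeric(sum_base[order(names(sum_base))])))

print(rbind(data.table = time_dt, base = time_base)[, "elapsed", drop = FALSE])
```

A real performance test would repeat each timing several times and over a range of data sizes, since a single run at one size says little about how the code scales.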

Limited documentation and outreach

To encourage more people to learn and adopt data.table, we will massively expand the number of tutorials, guides, and other documentation on how to use the package effectively. Part of this will be translation projects, so that data.table becomes more accessible in languages other than English. This grant will also include travel awards, which will support selected speakers in traveling to conferences to share data.table updates and usage.
