Hi! My name is Toby Dylan Hocking, and I have been using R since 2003, which means 20 years. Can you believe it?
I work as an Assistant Professor of Computer Science, and my research expertise is machine learning, the modern branch of artificial intelligence which uses big data. R is an important tool in my machine learning work, and in the work of many people in academia/industry/government, because it provides so many useful functions for handling big data.
The {data.table} package
Since 2015, I have been using the R package data.table to do large parts of data processing before and after running machine learning algorithms - to get the raw data into the right format for the algorithm, and also to get the results in the right format for visualization/interpretation.
data.table is highly valued for its long-term stability and its lightning-fast speed in large data calculations.
Ecosystem expansion
With these use cases in mind, I proposed a project “Expanding the data.table ecosystem for efficient big data manipulation in R,” and I am excited to announce that it has been funded by the National Science Foundation’s “Pathways to Enable Open Source Ecosystems” (POSE) grant, for work between September 2023 and August 2025.
Our project attempts to address three issues with the current state of the data.table project:
- Informal governance
The data.table package was originally created by Matt Dowle in 2008. His brilliant use of efficient algorithms and C implementations brought the world a package that has stood the test of time, and is now one of the most-used R packages available.
However, the growth of this incredible package will require more leaders than Dowle himself to help build, review, test, and organize new contributions.
Thus, one goal of this grant is to bring together data.table developers and contributors to propose a new governance structure for the package’s source code.
- Limited centralized testing infrastructure
Since the primary draw of data.table is the speed of its algorithmic implementations, adding new functionality to the package is not simple. In particular, new elements must be heavily tested to ensure that they do not interfere with the core computations.
In this grant, we plan to develop software to automate the testing of new package contributions, to smooth the growth process. This part of the project includes a centralized reverse dependency checking system, new benchmarks comparing data.table with other systems such as polars and arrow, and new performance testing software.
- Limited documentation and outreach
To encourage more people to learn and adopt data.table, we will be massively expanding the number of tutorials, documentation pages, and guides on how to use the package effectively. Part of this will be translation projects so that data.table will be more accessible in languages other than English. This grant will also include travel awards, to support selected speakers to travel to conferences and share data.table updates and usage.
Interested in contributing a tutorial/vignette or blog post? Email r.data.table@gmail.com!
How you can get involved
Interested in helping grow the data.table ecosystem? There are so many ways to get involved!
- Follow us for updates on social media:
Subscribe to this blog, The Raft
Take the Community Survey to weigh in on next steps.
Participate in the deep discussions on GitHub:
- Email r.data.table@gmail.com to:
Be added to the community Slack.
Propose a guest blog for The Raft.
Ask questions, make suggestions, or volunteer your expertise.
- Apply for the upcoming Travel Grants and Translation Projects - watch this blog for more information!