library(data.table)
library(reshape2)
library(atime)
library(ggplot2)
library(reticulate)
#use_python("C:/Users/amoak/AppData/Local/Programs/Python/Python312/python.exe") #If you want to reproduce, please change to the path of python on your computer.
virtualenv_create("fm-proj")
use_virtualenv("fm-proj", required = F)
R and Python are two programming languages that have gained immense popularity among data scientists, statisticians, and researchers. In this blog, we will explore two widely used libraries, data.table in R and pandas in Python, both of which excel at data manipulation, focusing on their capabilities for reading, writing, and reshaping data. We will also explain how to graphically demonstrate the time taken by each operation, using the atime R package to compare and visualize the asymptotic performance (time and memory usage) of the functions mentioned above. By comparing the asymptotic performance of these packages, we aim to provide insights that help data scientists make informed choices for data manipulation and analysis.
Performance testing with atime
The atime::atime() function requires the following arguments:
N: A sequence of parameters that control dataset sizes. Performance tests will measure time/memory usage across the different data sizes, as measured by values of N.
setup: A code expression that will be run for each value in N to create data of various sizes.
expr: One or more named code expressions that will be evaluated for each value in N, after data creation, and tested for computational performance.
seconds.limit (optional): A cutoff time; if an expression is slower than this limit for any data size, then no larger data sizes will be measured.
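For concreteness, here is a minimal sketch of an atime::atime() call, separate from the benchmarks below; the data sizes, expression names, and one-second cutoff are arbitrary choices for illustration:

# Minimal sketch: compare two ways of summing N random numbers.
atime.example <- atime::atime(
  N = as.integer(10^seq(1, 7, by = 0.5)), # dataset sizes to test
  setup = {
    x <- rnorm(N) # runs once per value of N, before timing
  },
  seconds.limit = 1, # stop growing N once an expression exceeds 1 second
  "vectorized" = sum(x),
  "for.loop" = { total <- 0; for (xi in x) total <- total + xi; total })
plot(atime.example)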
Interfacing Python and R with reticulate
This analysis will employ the reticulate package in R, which facilitates interoperability between Python and R by providing a robust set of tools for seamless integration. Leveraging this package enables the efficient exchange of data and functionality between both languages, allowing for a more expansive approach to the analysis. The package includes facilities for:
Calling Python from R, including raw code or sourcing Python scripts and modules.
Translation between R and Python objects; for example, between R and pandas data frames, or between R matrices and NumPy arrays.
Flexible binding to different versions of Python including virtual environments and Conda environments.
The reticulate translation incurs overhead that depends on the complexity of the Python code and on how often execution switches between R and Python. This overhead is typically constant for small scripts, but it increases with larger or more complex code. The code below uses reticulate to benchmark writing a data frame to CSV with R's data.table and Python's pandas; the key overhead factors are Python environment initialization and data transfer between R and Python. In the benchmarks below, the R-to-pandas conversion happens once per data size in the setup step, so data transfer is excluded from the timed expressions.
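As a rough sketch of this round trip (not the benchmark code itself), the reticulate pattern used throughout this post looks like the following; the file name is a placeholder:

library(reticulate)
pd <- import("pandas")                     # bind the pandas module to an R object
r_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
py_df <- r_to_py(r_df)                     # R data.frame -> pandas DataFrame
py_df$to_csv("example.csv", index = FALSE) # call a pandas method from R
r_again <- py_to_r(py_df)                  # pandas DataFrame -> R data.frame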
(Figure: https://github.com/DorisAmoakohene/Slides-and-Blogs/blob/main/Screenshot%202024-06-26%20150315.png)
Example 1: Writing a CSV File with data.table::fwrite() and pandas.DataFrame.to_csv()
Setup
To reproduce this analysis, you will need the packages loaded in the code chunks shown above, as well as the code to set up a virtual environment using your choice of local Python install.
file_path <- "data.csv"
fwrite: fast CSV writer
data.table provides the fwrite() function for writing data to a file, while pandas offers the to_csv() method for writing a DataFrame to a CSV file.
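Outside of the benchmark, the two calls look roughly like this (file names are placeholders):

dt <- data.frame(a = 1:5, b = rnorm(5))
data.table::fwrite(dt, "out_r.csv")       # data.table's multi-threaded CSV writer
py_dt <- reticulate::r_to_py(dt)          # the same data as a pandas DataFrame
py_dt$to_csv("out_py.csv", index = FALSE) # index = FALSE omits pandas row labels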
Comparison code
write.colors <- c(
  "data.table::fwrite" = "#D6604D",
  "pandas.to_csv" = "#BF812D")
file_path <- "data.csv"
n.rows <- 100
seconds.limit <- 10
atime.write.vary.cols <- atime::atime(
  N = as.integer(10^seq(2, 10, by = 0.2)),
  setup = {
    set.seed(1)
    input.vec <- rnorm(n.rows * N)
    input.mat <- matrix(input.vec, n.rows, N)
    input.df <- data.frame(input.mat)
    pd <- import("pandas")
    input_df_pd <- r_to_py(input.df) # convert once per size, outside the timed expressions
  },
  seconds.limit = seconds.limit,
  "data.table::fwrite" = {
    data.table::fwrite(input.df, tempfile(), showProgress = FALSE)
  },
  "pandas.to_csv" = {
    input_df_pd$to_csv(file_path, index = FALSE)
  })
refs.write.vary.cols <- atime::references_best(atime.write.vary.cols)
The predict() function generates predictions based on the atime::references_best() result. The resulting plot illustrates the data size N that can be processed within a given time or memory limit.
pred.write.vary.cols <- predict(refs.write.vary.cols)
plot(pred.write.vary.cols)
gg.write.dt.pd <- plot(pred.write.vary.cols) +
  theme(text = element_text(size = 15)) +
  ggtitle(sprintf("Write real numbers to CSV, with pandas in Python\nand data.table in R, %d x N", n.rows)) +
  scale_x_log10("N = number of columns to write") +
  scale_y_log10("Computation time (seconds)\nmedian line, min/max band\nover 10 timings") +
  facet_null() +
  scale_fill_manual(values = write.colors) +
  scale_color_manual(values = write.colors)
print(gg.write.dt.pd)
The plot above shows that, in terms of writing data, the data.table package in R outperforms the pandas library in Python, and it is particularly useful when dealing with large datasets.
Example 2: Reading a CSV File with data.table::fread() and pandas.read_csv()
fread: fast CSV reader
data.table provides the fread() function for reading data from a CSV file, while pandas offers the read_csv() function for the same purpose.
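Again, a small sketch of each call, assuming the files written in the Example 1 sketch exist:

r.read <- data.table::fread("out_r.csv") # returns a data.table
pd <- reticulate::import("pandas")
py.read <- pd$read_csv("out_py.csv")     # returns a pandas DataFrame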
Comparison code
read.colors <- c(
  "data.table::fread" = "#D6604D",
  "pandas.read_csv" = "#BF812D")
n.rows <- 100
seconds.limit <- 10
file_path <- "data.csv"
atime.read <- atime::atime(
  N = as.integer(10^seq(2, 15, by = 0.2)),
  setup = {
    set.seed(1)
    input.vec <- rnorm(n.rows * N)
    input.mat <- matrix(input.vec, n.rows, N)
    input.df <- data.frame(input.mat)
    data.table::fwrite(input.df, file_path) # write the file that both readers will read
    reticulate::py_run_string("import pandas as pd")
    py$file_path <- file_path               # make the path visible to Python
  },
  seconds.limit = seconds.limit,
  "data.table::fread" = {
    data.table::fread(file_path, showProgress = FALSE)
  },
  "pandas.read_csv" = {
    reticulate::py_run_string("pd.read_csv(file_path)")
  })
refs.read.vary.cols <- atime::references_best(atime.read)
pred.read.vary.cols <- predict(refs.read.vary.cols)
plot(pred.read.vary.cols)
gg.read.pd <- plot(pred.read.vary.cols) +
  theme(text = element_text(size = 15)) +
  ggtitle(sprintf("Read real numbers from CSV, with pandas in Python\nand data.table in R, %d x N", n.rows)) +
  scale_x_log10("N = number of columns to read") +
  scale_y_log10("Computation time (seconds)\nmedian line, min/max band\nover 10 timings") +
  facet_null() +
  scale_fill_manual(values = read.colors) +
  scale_color_manual(values = read.colors)
plot(gg.read.pd)
When it comes to reading data, data.table in R also demonstrates its superiority. It provides fast and efficient methods for importing various file formats, including CSV. The fread() function is known for its speed and memory efficiency, making it an optimal choice for handling large datasets.
Example 3: Reshape Performance Comparison
Data reshaping means changing the shape of the data to get it into a more appropriate format for learning, plotting, and so on. Here we consider both wide-to-long and long-to-wide reshaping: starting with a wide table (many columns) and ending up with a long table (fewer columns), and vice versa.
A. Wide to long reshape
In data.table, the data.table::melt() function converts data from a wide format to a long format, while in pandas the pandas.melt() function does the same.
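As a small worked example of a wide-to-long reshape, here is a toy table, separate from the benchmark below:

wide <- data.table::data.table(id = 1:2, A = c(10, 20), B = c(30, 40))
long <- data.table::melt(wide, id.vars = "id",
  variable.name = "category", value.name = "value")
# long has 4 rows: (1, A, 10), (2, A, 20), (1, B, 30), (2, B, 40)
pd <- reticulate::import("pandas")
py.long <- pd$melt(reticulate::r_to_py(wide), id_vars = "id",
  var_name = "category", value_name = "value")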
Comparison code
data.table::melt() is faster
ml.colors <- c(
  "data.table::melt" = "#D6604D",
  "pd.melt" = "#BF812D")
n.folds <- 10
n.rows <- 100
seconds.limit <- 10
ml.reshape.atime <- atime::atime(
  N = as.integer(10^seq(2, 15, by = 0.2)),
  setup = {
    df <- data.frame(
      id = rep(1:N, each = 2),
      category = rep(c("A", "B"), N),
      value = rnorm(2 * N))
    py_df <- reticulate::r_to_py(df)
    pd <- import("pandas")
  },
  seconds.limit = seconds.limit,
  "data.table::melt" = {
    data.table::melt(data.table(df), id.vars = c("id", "category"),
      variable.name = "variable", value.name = "value")
  },
  "pd.melt" = {
    pd$melt(py_df, id_vars = c("id", "category"), value_name = "value")
  })
ml.reshape.refs <- atime::references_best(ml.reshape.atime)
ml.reshape.pred <- predict(ml.reshape.refs)
plot(ml.reshape.pred)
ml.wide2long.pd <- plot(ml.reshape.pred) +
  theme(text = element_text(size = 15)) +
  ggtitle("Reshape wide to long with pandas and data.table\nover real numbers") +
  scale_x_log10("N = number of unique id values (2*N rows)") +
  scale_y_log10("Computation time (seconds)\nmedian line, min/max band\nover 10 timings") +
  facet_null() +
  scale_fill_manual(values = ml.colors) +
  scale_color_manual(values = ml.colors)
plot(ml.wide2long.pd)
When converting data from wide to long format, as shown in the above plot, data.table’s melt() function efficiently gathers multiple columns into key-value pairs, allowing for easy transformation of wide data into a longer, more structured format.
B. Long to wide reshape
In data.table, the data.table::dcast() function is often used to convert data from a long format to a wide format, while in pandas the pandas.pivot_table() function (called as pd$pivot_table() via reticulate) serves the same purpose.
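As a small worked example of the reverse reshape, using the same toy table in long form:

long <- data.table::data.table(
  id = c(1, 1, 2, 2),
  category = c("A", "B", "A", "B"),
  value = c(10, 30, 20, 40))
wide <- data.table::dcast(long, id ~ category, value.var = "value")
# wide has 2 rows: (id, A, B) = (1, 10, 30) and (2, 20, 40)
pd <- reticulate::import("pandas")
py.wide <- pd$pivot_table(reticulate::r_to_py(long),
  values = "value", index = "id", columns = "category")
# pivot_table aggregates duplicates with the mean by default; here each
# (id, category) cell holds exactly one value, so values pass through unchanged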
Comparison code
data.table::dcast() is faster
ml.colors <- c(
  "data.table::dcast" = "#D6604D",
  "pd$pivot_table" = "#BF812D")
n.folds <- 10
n.rows <- 100
seconds.limit <- 1
ml.long2wide.atime <- atime::atime(
  N = as.integer(10^seq(2, 7, by = 0.2)),
  setup = {
    df <- data.frame(
      id = rep(1:N, each = 2),
      category = rep(c("A", "B"), N),
      value = rnorm(2 * N))
    py_df <- reticulate::r_to_py(df)
    pd <- import("pandas")
  },
  seconds.limit = seconds.limit,
  "data.table::dcast" = {
    data.table::dcast(data.table(df), id ~ category, value.var = "value")
  },
  "pd$pivot_table" = {
    pd$pivot_table(py_df, values = "value", index = "id", columns = "category")
  })
ml.long2wide.refs <- atime::references_best(ml.long2wide.atime)
ml.long2wide.pred <- predict(ml.long2wide.refs)
plot(ml.long2wide.pred)
ml.long2wide <- plot(ml.long2wide.pred) +
  theme(text = element_text(size = 15)) +
  ggtitle("Reshape long to wide with pandas and data.table\nover real numbers") +
  scale_x_log10("N = number of unique id values (2*N rows)") +
  scale_y_log10("Computation time (seconds)\nmedian line, min/max band\nover 10 timings") +
  facet_null() +
  scale_fill_manual(values = ml.colors) +
  scale_color_manual(values = ml.colors)
plot(ml.long2wide)
On the other hand, when transforming data from long to wide format, data.table's dcast() function proves to be more efficient. Although data.table has the advantage in writing, reading, and reshaping, the pandas library in Python still provides a solid framework for data manipulation, as seen in the plots.
Conclusions
In conclusion, we have shown how to use atime to compare the asymptotic time of the two packages. Both pandas and data.table are powerful libraries for data manipulation in Python and R, respectively. pandas offers a comprehensive set of functions and a user-friendly interface, making it suitable for a wide range of data analysis tasks. data.table, on the other hand, excels in performance and memory efficiency, making it an excellent choice for handling large datasets and complex operations.
The choice between the two libraries ultimately depends on the specific requirements of your data manipulation tasks. It is recommended to consider the size of the dataset, the complexity of the operations, and personal preference when making a decision.
References:
The code in this post was copied and modified from the links below: