Sebastian Krantz
September 21, 2024
collapse
Author(s): Sebastian Krantz
Maintainer: Sebastian Krantz (sebastian.krantz@graduateinstitute.ch)
collapse is a large C/C++-based infrastructure package facilitating complex statistical computing, data transformation, and exploration tasks in R, at outstanding levels of performance and memory efficiency. It also implements a class-agnostic approach to R programming, supporting vector, matrix, and data frame-like objects (including xts, tibble, data.table, and sf). It has a stable API, depends only on Rcpp, and supports R versions >= 3.4.0.
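As a minimal sketch of what class-agnostic means in practice (assuming only that collapse is attached; the data chosen here are purely illustrative), the same generic function accepts vectors, matrices, and data frames and returns a result of the matching class:
library(collapse)
fmean(mtcars$mpg)                       # numeric vector -> scalar mean
fmean(as.matrix(mtcars), drop = FALSE)  # matrix -> 1 x k matrix of column means
fmean(iris[1:4], g = iris$Species)      # data frame -> data frame of group means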
data.table
At the C-level, collapse took much inspiration from data.table, and leverages some of its core algorithms like radixsort, while adding significant statistical functionality and new algorithms within a class-agnostic programming framework that seamlessly supports data.table. Notably, collapse::qDT() is a highly efficient anything-to-data.table converter, and all manipulation functions in collapse return a valid data.table object when a data.table is passed, enabling subsequent reference operations (:=).
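A small sketch of this interplay (the object name mt is just illustrative): qDT() converts an arbitrary data frame, and the result immediately supports data.table reference semantics:
library(collapse)
library(data.table)
mt <- qDT(mtcars)                        # efficient anything -> data.table conversion
mt[, mpg_demeaned := fwithin(mpg, cyl)]  # := works because mt is a valid data.table
head(mt, 2)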
Its added functionality includes a rich set of Fast Statistical Functions supporting vectorized (grouped, weighted) statistical operations on matrix-like objects. These are integrated with fast data manipulation functions in a way that allows even complex statistical expressions to be vectorized across groups. The package also adds flexible time series functions and classes supporting irregular series and panels, (panel-)data transformations, vectorized hash joins, fast aggregation and recast pivots, (internal) support for variable labels, powerful descriptive tools, memory-efficient programming tools, and recursive tools for heterogeneous nested data.
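Two of these capabilities in a brief sketch, using the wlddev data shipped with collapse (the particular columns are chosen only for illustration): a population-weighted grouped mean and a panel-lag on a country-year panel:
library(collapse)
fmean(wlddev$PCGDP, g = wlddev$region, w = wlddev$POP)              # weighted mean GDP per capita by region
flag(wlddev$LIFEEX, 1, g = wlddev$iso3c, t = wlddev$year) |> head()  # lag within country, ordered by year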
It is highly (and interactively) configurable, and a navigable internal documentation and overview facilitates its use.
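For instance, global defaults can be adjusted with set_collapse() (the particular settings below are arbitrary examples), and the documentation overview is reachable from the console:
library(collapse)
set_collapse(nthreads = 4, sort = FALSE)  # multithread by default; group in first-appearance order
get_collapse()                            # query the current global settings
help("collapse-documentation")            # navigable overview of all documentation topics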
The easiest way to load collapse and data.table together is via the fastverse package:
library(fastverse)
-- Attaching packages ------------------------------------------------------------------------------- fastverse 0.3.4 --
v data.table 1.16.2 v kit 0.0.19
v magrittr 2.0.3 v collapse 2.0.18
This demonstrates collapse’s deep integration with data.table.
There are many reasons to use collapse, e.g., to compute advanced statistics very fast:
# Fast tidyverse-like functions: one of the ways to code with collapse
mtcarsDT <- qDT(mtcars)       # data.table version of mtcars used below
mtcDTagg <- mtcarsDT |>
fgroup_by(cyl, vs, am) |>
fsummarise(mpg_wtd_median = fmedian(mpg, wt), # Weighted median
mpg_wtd_p90 = fnth(mpg, 0.9, wt, ties = "q8"), # Weighted 90% quantile type 8
mpg_wtd_mode = fmode(mpg, wt, ties = "max"), # Weighted maximum mode
mpg_range = fmax(mpg) %-=% fmin(mpg), # Range: vectorized and memory efficient
lm_mpg_carb = fsum(mpg, W(carb)) %/=% fsum(W(carb)^2)) # coef(lm(mpg ~ carb)): vectorized
# Note: for increased parsimony, can abbreviate fgroup_by -> gby, fsummarise -> smr
mtcDTagg[, new2 := 1][1:3] # Still a data.table
cyl vs am mpg_wtd_median mpg_wtd_p90 mpg_wtd_mode mpg_range lm_mpg_carb new2
<num> <num> <num> <num> <num> <num> <num> <num> <num>
1: 4 0 1 26.0 26.00000 26.0 0.0 NaN 1
2: 4 1 0 22.8 24.40000 24.4 2.9 2.1 1
3: 4 1 1 30.4 33.80484 30.4 12.5 -1.7 1
Or, more simply, convenience functions like collap() for fast multi-type aggregation. The wlddev dataset shipped with collapse is a country-year panel of World Bank development indicators:
country iso3c date year decade region income OECD PCGDP LIFEEX GINI ODA POP
1 Afghanistan AFG 1961-01-01 1960 1960 South Asia Low income FALSE NA 32.446 NA 116769997 8996973
2 Afghanistan AFG 1962-01-01 1961 1960 South Asia Low income FALSE NA 32.962 NA 232080002 9169410
3 Afghanistan AFG 1963-01-01 1962 1960 South Asia Low income FALSE NA 33.471 NA 112839996 9351441
# Population weighted mean for numeric and mode for non-numeric columns (multithreaded and
# vectorized across groups and columns, the default in statistical functions is na.rm = TRUE)
wlddev |> collap(~ year + income, fmean, fmode, w = ~ POP, nthreads = 4) |> ss(1:3)
country iso3c date year decade region income OECD PCGDP LIFEEX GINI
1 United States USA 1961-01-01 1960 1960 Europe & Central Asia High income TRUE 12768.7126 68.59372 NA
2 Ethiopia ETH 1961-01-01 1960 1960 Sub-Saharan Africa Low income FALSE 658.4778 38.33382 NA
3 India IND 1961-01-01 1960 1960 South Asia Lower middle income FALSE 500.7932 45.26707 NA
ODA POP
1 911825661 749495030
2 160457982 147355735
3 3278899549 927990163
We can also use the low-level API for statistical programming:
fmean(mtcars$mpg, g = mtcars$gear)   # Grouped mean of mpg by gear
       3        4        5
16.10667 24.53333 21.38000
g <- GRP(mtcars, ~ cyl + vs + am)    # Reusable grouping object
fmean(mtcars$mpg, g)                 # Grouped mean using the GRP object
   4.0.1    4.1.0    4.1.1    6.0.1    6.1.0    8.0.0    8.0.1
26.00000 22.90000 28.37143 20.56667 19.12500 15.05000 15.40000
vars <- c("carb", "hp", "qsec") # columns to aggregate
# Aggregating: weighted mean - vectorized across groups and columns
add_vars(g$groups, # Grouping columns
fmean(get_vars(mtcars, vars), g,
w = mtcars$wt, use.g.names = FALSE)
)
cyl vs am carb hp qsec
1 4 0 1 2.000000 91.00000 16.70000
2 4 1 0 1.720045 83.60420 21.04028
3 4 1 1 1.416115 82.11819 18.75509
4 6 0 1 4.670296 131.78463 16.33306
5 6 1 0 2.522685 115.32202 19.21275
6 8 0 0 3.186582 196.74988 17.20449
7 8 0 1 6.118694 301.60682 14.55297
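The Fast Statistical Functions also have a TRA argument to sweep computed statistics back over the original data; a brief sketch reusing the grouping object g from above (the choice of mpg is just illustrative):
# Grouped centering: subtract the group means from each observation
fmean(mtcars$mpg, g, TRA = "-") |> head()
# fwithin(mtcars$mpg, g) is a dedicated equivalent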
# Let's aggregate a matrix
m <- matrix(abs(rnorm(32^2)), 32)
m |> fmean(g) |> t() |> fmean(g) |> t()
4.0.1 4.1.0 4.1.1 6.0.1 6.1.0 8.0.0 8.0.1
4.0.1 0.06123789 1.4724382 0.7459940 1.5902129 0.8873607 0.6604920 0.8391957
4.1.0 0.78205486 0.8791056 1.2617126 0.8701933 1.1794070 0.7191204 0.7533241
4.1.1 0.66639757 0.7604432 0.8743168 0.8242863 0.8504150 0.7627944 0.8825123
6.0.1 0.71533372 0.4045359 0.8556836 0.8525144 0.9329643 0.7946364 0.8641836
6.1.0 1.10214877 1.2206170 0.9442454 0.9216912 0.7367946 0.7187178 0.6150456
8.0.0 0.47671550 0.7937906 0.7432943 0.9049254 0.6613901 0.7820188 0.8534884
8.0.1 0.94090449 0.8689585 0.7382680 1.0496066 1.2714088 0.8370710 0.5039534
# Multiply the rows with a vector (by reference)
setop(m, "*", mtcars$mpg, rowwise = TRUE)
# Replace some elements with a number
setv(m, 3:40, 5.76) # Could also use a vector to copy from
whichv(m, 5.76) # get the indices back...
[1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
It is also fairly easy to do more involved data exploration and manipulation:
# Groningen Growth and Development Center 10 Sector Database (see ?GGDC10S)
namlab(GGDC10S, N = TRUE, Ndistinct = TRUE, class = TRUE)
Variable Class N Ndist Label
1 Country character 5027 43 Country
2 Regioncode character 5027 6 Region code
3 Region character 5027 6 Region
4 Variable character 5027 2 Variable
5 Year numeric 5027 67 Year
6 AGR numeric 4364 4353 Agriculture
7 MIN numeric 4355 4224 Mining
8 MAN numeric 4355 4353 Manufacturing
9 PU numeric 4354 4237 Utilities
10 CON numeric 4355 4339 Construction
11 WRT numeric 4355 4344 Trade, restaurants and hotels
12 TRA numeric 4355 4334 Transport, storage and communication
13 FIRE numeric 4355 4349 Finance, insurance, real estate and business services
14 GOV numeric 3482 3470 Government services
15 OTH numeric 4248 4238 Community, social and personal services
16 SUM numeric 4364 4364 Summation of sector GDP
# Detailed statistics for the SUM column, grouped by Variable (EMP / VA)
descr(GGDC10S, SUM ~ Variable)
Dataset: GGDC10S, 1 Variables, N = 5027
Grouped by: Variable [2]
N Perc
EMP 2516 50.05
VA 2511 49.95
------------------------------------------------------------------------------------------------------------------------
SUM (numeric): Summation of sector GDP
Statistics (N = 4364, 13.19% NAs)
N Perc Ndist Mean SD Min Max Skew Kurt
EMP 2225 50.99 2225 36846.87 96318.65 173.88 764200 5.02 30.98
VA 2139 49.01 2139 43'961639.1 358'350627 0 8.06794210e+09 15.77 289.46
Quantiles
1% 5% 10% 25% 50% 75% 90% 95% 99%
EMP 256.12 599.38 1599.27 3555.62 9593.98 24801.5 66975.01 152402.28 550909.6
VA 0 25.01 444.54 21302 243186.47 1'396139.11 15'926968.3 104'405351 692'993893
------------------------------------------------------------------------------------------------------------------------
# Compute growth rate (Employment and VA, all sectors)
GGDC10S_growth <- tfmv(GGDC10S, AGR:SUM, fgrowth, # tfmv = transform variables. Alternatively: fmutate(across(...))
g = list(Country, Variable), t = Year, # Internal grouping and ordering, passed to fgrowth()
apply = FALSE) # apply = FALSE ensures we call fgrowth.data.frame
# Recast the dataset, median growth rate across years, taking along variable labels
GGDC_med_growth <- pivot(GGDC10S_growth,
ids = c("Country", "Regioncode", "Region"),
values = slt(GGDC10S, AGR:SUM, return = "names"), # slt = shorthand for fselect()
names = list(from = "Variable", to = "Sectorcode"),
labels = list(to = "Sector"),
FUN = fmedian, # Fast function = vectorized
how = "recast" # Recast (transposition) method
) |> qDT()
GGDC_med_growth[1:3]
Country Regioncode Region Sectorcode Sector VA EMP
<char> <char> <char> <fctr> <fctr> <num> <num>
1: BWA SSA Sub-saharan Africa AGR Agriculture 8.790267 0.8921475
2: ETH SSA Sub-saharan Africa AGR Agriculture 6.664964 2.5876142
3: GHA SSA Sub-saharan Africa AGR Agriculture 28.215905 1.4045550
# Finally, let's just join this to wlddev, enabling multiple matches (Cartesian product)
# -> on average 61 years x 11 sectors = 671 records per unique (country) match
join(wlddev, GGDC_med_growth, on = c("iso3c" = "Country"),
how = "inner", multiple = TRUE) |> ss(1:3)
inner join: wlddev[iso3c] 2379/13176 (18.1%) <61:11> GGDC_med_growth[Country] 429/473 (90.7%)
country iso3c date year decade region income OECD PCGDP LIFEEX GINI
1 Argentina ARG 1961-01-01 1960 1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055 NA
2 Argentina ARG 1961-01-01 1960 1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055 NA
3 Argentina ARG 1961-01-01 1960 1960 Latin America & Caribbean Upper middle income FALSE 5642.765 65.055 NA
ODA POP Regioncode Region Sectorcode Sector VA EMP
1 219809998 20481779 LAM Latin America AGR Agriculture 32.91968 -0.8646301
2 219809998 20481779 LAM Latin America MIN Mining 25.72799 1.5627293
3 219809998 20481779 LAM Latin America MAN Manufacturing 26.66754 1.0801500
In summary: collapse provides flexible high-performance statistical and data manipulation tools, which extend and seamlessly integrate with data.table. The package follows a similar development philosophy emphasizing API stability, parsimonious syntax, and zero dependencies (apart from Rcpp). data.table users may wish to employ collapse for some of the advanced statistical and manipulation functionality showcased above, but also to efficiently manipulate other data frame-like objects, such as sf data frames.