<- c(
chr.pos.vec "chr10:213,054,000-213,055,000",
"chrM:111,000", # no end.
"chr1:110-111 chr2:220-222") # two ranges.
nc
Maintainer: Toby Dylan Hocking (toby.hocking@r-project.org)
User-friendly functions for extracting a data table (row for each match, column for each group) from non-tabular text data using regular expressions, and for melting columns that match a regular expression. Patterns are defined using a readable syntax that makes it easy to build complex patterns in terms of simpler, re-usable sub-patterns. Named R arguments are translated to column names in the output, thereby providing a standard interface to three regular expression ‘C’ libraries (‘PCRE’, ‘RE2’, ‘ICU’). Output can also include numeric columns via user-specified type conversion functions.
Relationship with data.table
Whereas data.table
provides several functions such as patterns()
and measure()
which support some regex engines (PCRE, TRE), nc
interfaces with two other engines (RE2, ICU). nc
imports data.table
, and always returns regex match results as a data.table
.
Overview
nc
is useful for extracting numeric data from text, for example consider the following strings, which indicate genomic positions, in bases on a chromosome:
The data above consist of a chromosome name (chr10), followed by a start position, and then optionally a dash and an end position. Using nc
, we can extract these different pieces of information into a data table using the code below, which inputs the data to parse (first argument), along with a regular expression (subsequent arguments).
::capture_first_vec(
nc
chr.pos.vec,chrom="chr.*?",
":",
start="[0-9,]+")
chrom start
<char> <char>
1: chr10 213,054,000
2: chrM 111,000
3: chr1 110
The code above uses chrom
and start
as argument names, which are therefore used for column names in the output data table (one row per input subject string, one column per named argument / capture group). However the code above only parses the start position (and not the optional end position). Below, we create a more complex regex to parse both the start and end, by first defining a common pattern to parse an integer,
<- function(x) as.integer(gsub("[^0-9]", "", x))
keep.digits
<- list("[0-9,]+", keep.digits) int.pattern
In the code above, we use a list to group the regex "[0-9],]+"
with the function keep.digits
which will be used for parsing the text that is extracted by that regex. We use that pattern twice in the code below,
<- list(
range.pattern chrom="chr.*?",
":",
start=int.pattern,
list( # un-named list becomes non-capturing group.
"-",
end=int.pattern
"?") # chromEnd is optional.
), ::capture_first_vec(chr.pos.vec, range.pattern) nc
chrom start end
<char> <int> <int>
1: chr10 213054000 213055000
2: chrM 111000 NA
3: chr1 110 111
The result above is a data table containing the first match in each subject (three rows total). Note the second row has end=NA
because that optional group did not match.
But the last subject has two potential matches (only the first is reported above). What if we wanted to get all matches in each subject? We can use another function, as in the code below.
::capture_all_str(chr.pos.vec, range.pattern) nc
chrom start end
<char> <int> <int>
1: chr10 213054000 213055000
2: chrM 111000 NA
3: chr1 110 111
4: chr2 220 222
The output above includes all matches in each subject (four rows total), but does not include any information about which subject each row came from, because it treats the subject as a single string to parse. To get that info, we can use capture_all_str()
for each row, using by=.I
as in the code below.
library(data.table)
data.table(chr.pos.vec)[, nc::capture_all_str(
=.I] chr.pos.vec, range.pattern), by
I chrom start end
<int> <char> <int> <int>
1: 1 chr10 213054000 213055000
2: 2 chrM 111000 NA
3: 3 chr1 110 111
4: 3 chr2 220 222
The output above includes the additional I
column which is the index of the subject that each match came from (two rows with I=3
because there are two matches in the third subject).
Finally, data.table::melt()
is used to power the long-to-wide data reshaping functionality in nc
. In data.table
we could use measure()
to specify a set of variables to reshape, as in the code below.
<- data.table(iris)[1]) (iris.wide
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<num> <num> <num> <num> <fctr>
1: 5.1 3.5 1.4 0.2 setosa
melt(iris.wide, measure.vars=measure(value.name, dim, pattern="(.*)[.](.*)"))
Species dim Sepal Petal
<fctr> <char> <num> <num>
1: setosa Length 5.1 1.4
2: setosa Width 3.5 0.2
The result above has reshaped the four numeric input columns into two numeric output columns (value.name
is the sentinel/keyword indicating that we want to make a new column for each unique value captured in that group). The equivalent nc
code would be as below, with the regex defined using a named argument for each capture group (instead of one long pattern
string with parentheses for each capture group).
::capture_melt_multiple(
nc
iris.wide,column=".*",
"[.]",
dim=".*")
Species dim Petal Sepal
<fctr> <char> <num> <num>
1: setosa Length 1.4 5.1
2: setosa Width 0.2 3.5
The nc
code above produces the same result, and in fact uses data.table::melt()
internally.
For more info about the nc
package, please read the vignettes on its CRAN page.