dcmodifydt
executes
dcmodify
modification
rules on a data.table, making it easy to switch between data.frame ,
database or data.table implementation.
dcmodify
separates intent from execution: a user specifies
what, why and how of an automatic data change and uses dcmodify to
execute them on a data.table
.
data.table
is an excellent package with great performance. Why not
code modification rules directly in R script? Keep it simple is
certainly a good advise, reasons to use dcmodify
are:
-
You have a set “correction” rules in a production process that you apply repeatly, e.g. in each production run.
-
You want to share these rules with different production processes.
-
You have data matter specialists that do not know (or have to) program in R, and use this as a quality frame work to communicate and specify the step to correct or derive variables.
In these cases it is useful to label, describe and document your rules,
and use dcmodify
, dcmodifydt
or dcmodifydb
to apply the
modification rules.
The development version from GitHub with:
# install.packages("devtools")
devtools::install_github("data-cleaning/dcmodifydt")
library(dcmodifydt)
#> Loading required package: dcmodify
library(data.table)
m <- modifier( if (age > 130) age = 130
, income[age < 12] <- 0
)
dat <- fread(text =
"age, income
140, 300
11, 2000
25, 3000"
)
# modify a copy of the data
dat_m <- modify(dat, m, copy = TRUE)
print(dat_m)
#> age income
#> 1: 130 300
#> 2: 11 0
#> 3: 25 3000
# the data it self
setmodify(dat, m)
print(dat)
#> age income
#> 1: 130 300
#> 2: 11 0
#> 3: 25 3000
Create a modifier object:
m <- modifier(.file = "example/example.yml")
From this specification file “example/example.yml”:
rules:
- expr: if (age > 130) age = 130L
name: M1
label: 'Maximum age'
description: |
Human age is limited. (can use "=")
Cap the age at 130
- expr: is.na(age) <- age < 0
name: M2
label: 'Unknown age'
description: |
Negative Age, nah...
(set to NA)
- expr: income[age < 12] <- 0
name: M3
label: 'No Child Labor'
description: |
Children should not work. (R syntax)
Set income to zero for children.
- expr: retired <- (age > 67)
name: M4
label: 'Retired'
description: |
Derive a new variable...
dat <- data.table::fread(text =
"age, income
11, 2000
150, 300
25, 2000
-10, 2000")
modify(dat, m, copy=FALSE)
print(dat)
#> age income retired
#> 1: 11 0 FALSE
#> 2: 130 300 TRUE
#> 3: 25 2000 FALSE
#> 4: NA 2000 NA
For convenience it is possible to dump the data.table statements. This file can be sourced, resulting in the modifications of the supplied table (name)
dump_dt(m, name = "my_dt", file="change_my_dt.R")
# ####################################
# Generated by dcmodifydt, do not edit
# dcmodify version: 0.1.9
# dcmodifydt version: 0.1.0.9000
# ####################################
# ensure that sourcing the file only affects 'my_dt'
local({
dat <- my_dt
# M1: Maximum age
dat[age > 130, age := 130L]
# M2: Unknown age
dat[age < 0, age := NA]
# M3: No Child Labor
dat[age < 12, income := 0]
# M4: Retired
dat[, retired := (age > 67)]
})