Skip to content

data-cleaning/dcmodifydt

Repository files navigation

dcmodifydt

CRAN status R-CMD-check Codecov test coverage

dcmodifydt executes dcmodify modification rules on a data.table, making it easy to switch between data.frame , database or data.table implementation.

dcmodify separates intent from execution: a user specifies what, why and how of an automatic data change and uses dcmodify to execute them on a data.table.

Rationale

data.table is an excellent package with great performance. Why not code modification rules directly in R script? Keep it simple is certainly a good advise, reasons to use dcmodify are:

  • You have a set “correction” rules in a production process that you apply repeatly, e.g. in each production run.

  • You want to share these rules with different production processes.

  • You have data matter specialists that do not know (or have to) program in R, and use this as a quality frame work to communicate and specify the step to correct or derive variables.

In these cases it is useful to label, describe and document your rules, and use dcmodify, dcmodifydt or dcmodifydb to apply the modification rules.

Installation

The development version from GitHub with:

# install.packages("devtools")
devtools::install_github("data-cleaning/dcmodifydt")

Example

library(dcmodifydt)
#> Loading required package: dcmodify
library(data.table)

m <- modifier( if (age > 130) age = 130
             , income[age < 12] <- 0
             )

dat <- fread(text =
"age, income
 140,  300
  11, 2000
  25, 3000"
)

# modify a copy of the data
dat_m <- modify(dat, m, copy = TRUE)
print(dat_m)
#>    age income
#> 1: 130    300
#> 2:  11      0
#> 3:  25   3000

# the data it self
setmodify(dat, m)
print(dat)
#>    age income
#> 1: 130    300
#> 2:  11      0
#> 3:  25   3000

Documented modification rules:

Create a modifier object:

m <- modifier(.file = "example/example.yml")

From this specification file “example/example.yml”:

rules:
- expr: if (age > 130) age = 130L
  name: M1
  label: 'Maximum age'
  description: |
    Human age is limited. (can use  "=")
    Cap the age at 130
- expr: is.na(age) <- age < 0
  name: M2
  label: 'Unknown age'
  description: |
    Negative Age, nah...
    (set to NA)
- expr: income[age < 12] <- 0
  name: M3
  label: 'No Child Labor'
  description: |
    Children should not work. (R syntax)
    Set income to zero for children.
- expr: retired <- (age > 67)
  name: M4
  label: 'Retired'
  description: |
    Derive a new variable...
dat <- data.table::fread(text = 
"age, income
  11,   2000
 150,    300
  25,   2000
 -10,   2000")

modify(dat, m, copy=FALSE)
print(dat)
#>    age income retired
#> 1:  11      0   FALSE
#> 2: 130    300    TRUE
#> 3:  25   2000   FALSE
#> 4:  NA   2000      NA

For convenience it is possible to dump the data.table statements. This file can be sourced, resulting in the modifications of the supplied table (name)

dump_dt(m, name = "my_dt", file="change_my_dt.R")
# ####################################
# Generated by dcmodifydt, do not edit
# dcmodify version: 0.1.9
# dcmodifydt version: 0.1.0.9000
# ####################################
# ensure that sourcing the file only affects 'my_dt'
local({
  dat <- my_dt
  
  # M1: Maximum age
  dat[age > 130, age := 130L]
  
  # M2: Unknown age
  dat[age < 0, age := NA]
  
  # M3: No Child Labor
  dat[age < 12, income := 0]
  
  # M4: Retired
  dat[, retired := (age > 67)]
})

About

dcmodify on a data.table

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages