Split-Apply-Combine with Dynamic Groups |
Estimate group aggregates, where one can set user-defined conditions that each group of records must satisfy to be suitable for aggregation. If a group of records is not suitable, it is expanded using a collapsing scheme defined by the user. |
van der Loo M (2023). accumulate: Split-Apply-Combine with Dynamic Groups. R package version 0.9.0.001, https://github.com/markvanderloo/accumulate. |
GitHub |
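The collapsing idea behind accumulate can be sketched in plain Python. This is not the package's R interface: the function name `aggregate_with_collapse`, the representation of the collapsing scheme as a list of key tuples (fine to coarse), and the minimum-group-size condition are all assumptions made for illustration.

```python
from statistics import mean

def aggregate_with_collapse(records, levels, min_size, value="value"):
    """For each finest-level group, average `value` over the first grouping
    level (ordered fine to coarse) that holds at least `min_size` records.
    Assumes every coarser level uses a subset of the keys in levels[0]."""
    result = {}
    finest = levels[0]
    for target in {tuple(r[k] for k in finest) for r in records}:
        label = dict(zip(finest, target))
        for keys in levels:
            # Project the target group's label onto this (coarser) key set.
            wanted = tuple(label[k] for k in keys)
            group = [r[value] for r in records
                     if tuple(r[k] for k in keys) == wanted]
            if len(group) >= min_size:
                result[target] = (keys, mean(group))
                break
        else:
            result[target] = (None, None)  # no level satisfied the condition
    return result

records = [
    {"region": "N", "sector": "A", "value": 10.0},
    {"region": "N", "sector": "A", "value": 20.0},
    {"region": "N", "sector": "B", "value": 40.0},
    {"region": "S", "sector": "A", "value": 5.0},
]
levels = [("region", "sector"), ("region",)]
result = aggregate_with_collapse(records, levels, min_size=2)
```

Here group (N, A) is large enough on its own, group (N, B) is expanded to all of region N, and group (S, A) stays unresolved because even the coarsest level holds a single record.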
Modify Data Using Externally Defined Modification Rules |
Data cleaning scripts typically contain a lot of 'if this change that' type of statements. Such statements are typically condensed expert knowledge. With this package, such 'data modifying rules' are taken out of the code and instead become parameters to the workflow. This allows one to maintain, document, and reason about data modification rules as separate entities. |
van der Loo M, de Jonge E (2023). dcmodify: Modify Data Using Externally Defined Modification Rules. R package version 0.8.0, https://CRAN.R-project.org/package=dcmodify. |
GitHub |
Deductive Correction, Deductive Imputation, and Deterministic Correction |
A collection of methods for automated data cleaning where all actions are logged. |
van der Loo M, de Jonge E, Scholtus S (2015). deducorrect: Deductive Correction, Deductive Imputation, and Deterministic Correction. R package version 1.3.7, https://CRAN.R-project.org/package=deducorrect. |
GitHub |
Data Correction and Imputation Using Deductive Methods |
Attempt to repair inconsistencies and missing values in data records by using information from valid values and validation rules restricting the data. |
van der Loo M, de Jonge E (2021). deductive: Data Correction and Imputation Using Deductive Methods. R package version 1.0.0, https://CRAN.R-project.org/package=deductive. |
GitHub |
Parsing, Applying, and Manipulating Data Cleaning Rules |
Facilitates reading and manipulating (multivariate) data restrictions (edit rules) on numerical and categorical data. Rules can be defined with common R syntax and parsed to an internal (matrix-like) format. Rules can be manipulated with variable elimination and value substitution methods, allowing for feasibility checks and more. Data can be tested against the rules and erroneous fields can be found based on Fellegi and Holt's generalized principle. Rule dependencies can be visualized using the 'igraph' package. |
de Jonge E, van der Loo M (2018). editrules: Parsing, Applying, and Manipulating Data Cleaning Rules. R package version 2.9.3, https://CRAN.R-project.org/package=editrules. |
GitHub |
Locate Errors with Validation Rules |
Errors in data can be located and removed using validation rules from package
'validate'. See also Van der Loo and De Jonge (2018) |
de Jonge E, van der Loo M (2022). errorlocate: Locate Errors with Validation Rules. R package version 1.1, https://CRAN.R-project.org/package=errorlocate. |
GitHub |
Univariate Outlier Detection |
Detect outliers in one-dimensional data. |
M.P.J. van der Loo, extremevalues, an R package for outlier detection in univariate data, R package version 2.3 |
GitHub |
Gower's Distance |
Compute Gower's distance (or similarity) coefficient between records. Compute the top-n matches between records. Core algorithms are executed in parallel on systems supporting OpenMP. |
van der Loo M (2022). gower: Gower's Distance. R package version 1.0.1, https://CRAN.R-project.org/package=gower. |
GitHub |
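As a rough illustration of the coefficient itself, the following plain-Python sketch computes an unweighted Gower distance between two records. It is not the gower package's R interface: the function name `gower_distance`, the dict-based record layout, and the `ranges` argument are assumptions, and only numeric and categorical variables are handled.

```python
def gower_distance(x, y, ranges):
    """Unweighted Gower distance between two records given as dicts.
    `ranges` maps numeric field names to their (nonzero) range in the data;
    every other field is treated as categorical."""
    total, used = 0.0, 0
    for key in x:
        if x[key] is None or y[key] is None:
            continue  # variables missing in either record are skipped
        if key in ranges:
            total += abs(x[key] - y[key]) / ranges[key]  # range-scaled difference
        else:
            total += 0.0 if x[key] == y[key] else 1.0    # simple mismatch
        used += 1
    return total / used if used else float("nan")

a = {"age": 30, "income": 2000, "city": "Utrecht"}
b = {"age": 40, "income": 3000, "city": "Delft"}
ranges = {"age": 50, "income": 4000}
d = gower_distance(a, b, ranges)  # (10/50 + 1000/4000 + 1) / 3
```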
Hash R Objects to Integers Fast |
Apply an adaptation of the SuperFastHash algorithm to any R
object. Hash whole R objects or, for vectors or lists, hash R objects to obtain
a set of hash values that is stored in a structure equivalent to the input. See
|
van der Loo M (2021). hashr: Hash R Objects to Integers Fast. R package version 0.1.4, https://CRAN.R-project.org/package=hashr. |
GitHub |
Manipulation of Linear Systems of (in)Equalities |
Variable elimination (Gaussian elimination, Fourier-Motzkin elimination), Moore-Penrose pseudoinverse, reduction to reduced row echelon form, value substitution, projecting a vector on the convex polytope described by a system of (in)equations, simplify systems by removing spurious columns and rows and collapse implied equalities, test if a matrix is totally unimodular, compute variable ranges implied by linear (in)equalities. |
van der Loo M, de Jonge E (2023). lintools: Manipulation of Linear Systems of (in)Equalities. R package version 0.1.7, https://CRAN.R-project.org/package=lintools. |
GitHub |
Track Changes in Data |
A framework that allows for easy logging of changes in data.
Main features: start tracking changes by adding a single line of code to
an existing script. Track changes in multiple datasets, using multiple
loggers. Add custom-built loggers or use loggers offered by other
packages. |
van der Loo MPJ (2021). “Monitoring Data in R with the lumberjack Package.” Journal of Statistical Software, 98(1), 1–13. doi:10.18637/jss.v098.i01. |
GitHub |
Adapt Numerical Records to Fit (in)Equality Restrictions |
Minimally adjust the values of numerical records in a data.frame, such that each record satisfies a predefined set of equality and/or inequality constraints. The constraints can be defined using the 'validate' package. The core algorithms have recently been moved to the 'lintools' package, refer to 'lintools' for a more basic interface and access to a version of the algorithm that works with sparse matrices. |
van der Loo M, De Jonge E (2018). Statistical Data Cleaning with Applications in R. John Wiley and Sons, Inc, New York. ISBN 1118897153, doi:10.1002/9781118897126. |
GitHub |
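For a single equality constraint, the notion of a minimal adjustment has a closed form: the Euclidean projection of the record onto the hyperplane defined by the constraint. The plain-Python sketch below shows only that special case. The function name `project_onto_equality` is hypothetical, and the package itself handles sets of equalities and inequalities, which this sketch does not attempt.

```python
def project_onto_equality(x, a, b):
    """Return the point closest to x (in Euclidean distance) that satisfies
    sum(a_i * x_i) == b. Assumes `a` is not the zero vector."""
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    norm2 = sum(ai * ai for ai in a)
    return [xi - ai * residual / norm2 for ai, xi in zip(a, x)]

# Record violating the balance rule: total == part1 + part2
record = [100.0, 60.0, 30.0]
adjusted = project_onto_equality(record, a=[1.0, -1.0, -1.0], b=0.0)
```

The violation of 10 is spread evenly over the three fields because all coefficients have the same magnitude.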
Trends and Indices for Monitoring Data |
The TRIM model is widely used for estimating growth and decline of
animal populations based on (possibly sparsely available) count data. The
current package is a reimplementation of the original TRIM software developed
at Statistics Netherlands by Jeroen Pannekoek. See
|
Bogaart P, van der Loo M, Pannekoek J (2020). rtrim: Trends and Indices for Monitoring Data. R package version 2.1.1, https://CRAN.R-project.org/package=rtrim. |
GitHub |
Software Option Settings Manager for R |
Provides option settings management that goes beyond R's default 'options' function. With this package, users can define their own option settings manager holding option names, default values and (if so desired) ranges or sets of allowed option values that will be automatically checked. Settings can then be retrieved, altered and reset to defaults with ease. For R programmers and package developers it offers cloning and merging functionality which allows for conveniently defining global and local options, possibly in a multilevel options hierarchy. See the package vignette for some examples concerning functions, S4 classes, and reference classes. There are convenience functions to reset par() and options() to their 'factory defaults'. |
van der Loo M (2021). settings: Software Option Settings Manager for R. R package version 0.2.7, https://CRAN.R-project.org/package=settings. |
GitHub |
Simple Imputation |
Easy to use interfaces to a number of imputation methods that fit in the not-a-pipe operator of the 'magrittr' package. |
van der Loo M (2022). simputation: Simple Imputation. R package version 0.2.8, https://CRAN.R-project.org/package=simputation. |
GitHub |
Approximate String Matching, Fuzzy Text Search, and String Distance Functions |
Implements an approximate string matching version of R's native
'match' function. Also offers fuzzy text search based on various string
distance measures. Can calculate various string distances based on edits
(Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram,
cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An
implementation of soundex is provided as well. Distances can be computed between
character vectors while taking proper care of encoding or between integer
vectors representing generic sequences. This package is built for speed and
runs in parallel by using 'OpenMP'. An API for C or C++ is exposed as well.
Reference: MPJ van der Loo (2014) |
van der Loo M (2014). “The stringdist package for approximate string matching.” The R Journal, 6, 111-122. https://CRAN.R-project.org/package=stringdist. |
GitHub |
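To make the edit-based distances and the approximate 'match' idea concrete, here is a small plain-Python sketch of the Levenshtein distance and a lookup built on it. It is not the stringdist R or C API: the names `levenshtein` and `approx_match` and the `max_dist` parameter are assumptions for illustration, and indices are zero-based as usual in Python.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def approx_match(x, table, max_dist):
    """Index of the entry in `table` closest to x, or None when the table
    is empty or the closest entry is further away than max_dist."""
    if not table:
        return None
    best = min(range(len(table)), key=lambda i: levenshtein(x, table[i]))
    return best if levenshtein(x, table[best]) <= max_dist else None

hit = approx_match("leia", ["uhura", "leela"], max_dist=2)
miss = approx_match("leia", ["uhura", "leela"], max_dist=1)
```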
Lightweight and Feature Complete Unit Testing Framework |
Provides a lightweight (zero-dependency) and easy to use unit testing framework. Main features: install tests with the package. Test results are treated as data that can be stored and manipulated. Test files are R scripts interspersed with test commands, that can be programmed over. Fully automated build-install-test sequence for packages. Skip tests when not run locally (e.g. on CRAN). Flexible and configurable output printing. Compare computed output with output stored with the package. Run tests in parallel. Extensible by other packages. Report side effects. |
van der Loo M (2020). “A method for deriving information from running R code.” The R Journal, 13, 42-52. https://journal.r-project.org/articles/RJ-2021-056/. |
GitHub |
Data Validation Infrastructure |
Declare data validation rules and data quality indicators;
confront data with them and analyze or visualize the results.
The package supports rules that are per-field, in-record,
cross-record or cross-dataset. Rules can be automatically
analyzed for rule type and connectivity. Supports checks implied
by an SDMX DSD file as well. See also Van der Loo
and De Jonge (2018) |
Mark P. J. van der Loo, Edwin de Jonge (2021). Data Validation Infrastructure for R. Journal of Statistical Software, 97(10), 1-31. doi:10.18637/jss.v097.i10 |
GitHub |
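The declare-then-confront workflow can be illustrated with a minimal plain-Python sketch. It covers in-record rules only and is not the validate package's R interface: the rule names, the dict-of-predicates representation, and the functions `confront` and `summarize` are assumptions for illustration.

```python
# Rules are declared separately from the data, as named predicates on a record.
rules = {
    "age_nonnegative": lambda r: r["age"] >= 0,
    "adult_if_employed": lambda r: r["job"] == "none" or r["age"] >= 15,
}

def confront(records, rules):
    """Evaluate every rule on every record; returns {rule name: [bool, ...]}."""
    return {name: [bool(rule(r)) for r in records] for name, rule in rules.items()}

def summarize(outcome):
    """Count passes and fails per rule."""
    return {name: {"passes": sum(v), "fails": len(v) - sum(v)}
            for name, v in outcome.items()}

records = [
    {"age": 34, "job": "teacher"},
    {"age": -1, "job": "none"},
    {"age": 12, "job": "baker"},
]
outcome = confront(records, rules)
summary = summarize(outcome)
```

Keeping the outcome as plain data (one boolean per rule and record) is what makes it straightforward to analyze or visualize the results afterwards.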
Checking and Simplifying Validation Rule Sets |
Rule sets with validation rules may contain redundancies or contradictions. Functions for finding redundancies and problematic rules are provided, given a set of rules formulated with 'validate'. |
de Jonge E, van der Loo M (2020). validatetools: Checking and Simplifying Validation Rule Sets. R package version 0.5.0, https://CRAN.R-project.org/package=validatetools. |
GitHub |