| Title: | Compact Multiple Imputation, Assessment, and Reporting |
|---|---|
| Description: | Provides compact tools for missing-data analysis, including artificial amputation, chained single and multiple imputation, statistical and machine-learning-based imputation methods, diagnostic evaluation, and post-imputation pooling. |
| Authors: | Imad EL BADISY [aut, cre] |
| Maintainer: | Imad EL BADISY <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.8.0 |
| Built: | 2026-06-10 20:35:18 UTC |
| Source: | https://github.com/ielbadisy/mimar |
ampute() creates benchmark data by adding missing values under MCAR, MAR,
or MNAR mechanisms. The returned object keeps the original data, amputated
data, and masks for original, added, and total missingness.

    ## S3 method for class 'data.frame'
    ampute(
      x,
      prop = 0.1,
      mechanism = c("MCAR", "MAR", "MNAR"),
      target = NULL,
      by = NULL,
      direction = c("both", "left", "right"),
      seed = NULL,
      ...
    )

    ampute(x, ...)


| Argument | Description |
|---|---|
| x | A data frame. |
| prop | Target marginal proportion of newly introduced missing values. |
| mechanism | Missingness mechanism: missing completely at random ("MCAR"), missing at random ("MAR"), or missing not at random ("MNAR"). |
| target | Character vector of variables eligible for amputation. Defaults to all variables. |
| by | Character vector of fully or partially observed variables used to drive MAR probabilities. Required for mechanism = "MAR". |
| direction | For MNAR numeric targets, whether missingness is more likely in both tails, the left tail, or the right tail. |
| seed | Optional random seed. |
| ... | Passed to methods. |

For each target variable $X_j$, observed cells are removed by drawing
$R_{ij} \sim \mathrm{Bernoulli}(\pi_{ij})$ and setting $x_{ij}$ to missing when
$R_{ij} = 1$. For MCAR, $\pi_{ij} = p$, where $p$ is prop. For MAR,
probabilities are based on observed by variables through a standardized score
$z_i$:
$$\pi_{ij} = \operatorname{logit}^{-1}(\alpha_j + z_i),$$
with $\alpha_j$ calibrated so that the average probability is prop. For
MNAR, $z_i$ is derived from the target variable itself; direction
selects high values, low values, or both tails for numeric targets.
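The MAR case can be sketched in base R. This is a minimal illustration rather than the mimar implementation: it assumes a logistic link for the missingness probability and calibrates the intercept with uniroot(). The data frame and all object names (dat, z, alpha, pi_mar, amputed) are hypothetical.

```r
set.seed(1)
n <- 500
dat <- data.frame(age = rnorm(n, 50, 10), bmi = rnorm(n, 25, 4))
prop <- 0.2

# Standardized score built from the driver variable (the role played by `by`).
z <- as.numeric(scale(dat$age))

# Assumed logistic link: find the intercept that makes the average
# missingness probability equal to `prop`.
alpha <- uniroot(
  function(a) mean(plogis(a + z)) - prop,
  interval = c(-20, 20)
)$root
pi_mar <- plogis(alpha + z)

# Draw Bernoulli indicators and blank out the selected cells of the target.
r <- rbinom(n, size = 1, prob = pi_mar) == 1
amputed <- dat
amputed$bmi[r] <- NA
```

Setting pi_mar to the constant prop would give the MCAR case; replacing z with a score computed from bmi itself would give an MNAR-style mechanism.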
A mimar_amputation object.
complete() extracts completed data sets from imputation objects. For
mimar_imputation objects, use complete(x) or complete(x, 1) for one
completed data set, complete(x, "all") for all completed data sets, and
complete(x, "long") for a stacked long data frame.

    complete(x, ...)


| Argument | Description |
|---|---|
| x | An object containing completed data. |
| ... | Passed to methods. |

A data frame or list of data frames.
describe() summarizes missingness for data frames and returns compact
summaries for mimar objects.

    describe(x, ...)


| Argument | Description |
|---|---|
| x | An object to describe. |
| ... | Passed to methods. |

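For a data matrix $X = (x_{ij})$ with $n$ rows, missingness is represented by
the indicator
$$R_{ij} = \begin{cases} 1 & \text{if } x_{ij} \text{ is missing}, \\ 0 & \text{otherwise}, \end{cases}$$
where $i$ indexes rows and $j$ indexes variables. The variable-level
missingness proportion reported by describe() is
$$\hat{p}_j = \frac{1}{n} \sum_{i=1}^{n} R_{ij}.$$
Row summaries use $\sum_{j} R_{ij}$, and missingness patterns
are the unique row vectors of $R$.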
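The quantities above can be reproduced with a few base R calls. This is a sketch on a small hypothetical data frame, not the code behind describe(); the object names are illustrative.

```r
# Hypothetical data with missing values in two of three variables.
X <- data.frame(
  a = c(1, NA, 3, NA),
  b = c("u", "v", NA, NA),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# Missingness indicator matrix: 1 = missing, 0 = observed.
R <- is.na(X) * 1L

# Variable-level missingness proportions (column means of R).
p_hat <- colMeans(R)

# Number of missing cells per row.
row_missing <- rowSums(R)

# Missingness patterns: the unique rows of R.
patterns <- unique(R)
```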
A mimar S3 object.
evaluate() compares imputed values with known truth when an amputation
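object is available. Numeric targets are summarized with errors such as the
root mean squared error over the set $\mathcal{M}$ of amputed cells,
$$\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{M}|} \sum_{(i,j) \in \mathcal{M}} (\hat{x}_{ij} - x_{ij})^2},$$
while categorical targets are summarized with agreement and balanced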
accuracy across classes. When no truth is available, evaluate() reports
diagnostics that can be computed from the imputed data alone. Distribution,
variability, and recovery summaries are computed across all completed data
sets; per-imputation recovery metrics are kept in recovery_by_imputation.
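As a rough illustration of these recovery metrics, the base R sketch below computes a root mean squared error for a numeric target and agreement plus balanced accuracy (the mean of per-class recall) for a categorical one. The vectors are hypothetical, and the exact set of metrics returned by evaluate() may differ.

```r
# Numeric target: true values versus imputed values at the amputed cells.
truth_num <- c(2.0, 4.0, 6.0)
imp_num <- c(2.5, 3.0, 6.5)
rmse <- sqrt(mean((imp_num - truth_num)^2))

# Categorical target: true classes versus imputed classes.
truth_cat <- factor(c("a", "a", "a", "b"))
imp_cat <- factor(c("a", "a", "b", "b"), levels = levels(truth_cat))

# Overall agreement: share of cells where the imputed class is correct.
agreement <- mean(imp_cat == truth_cat)

# Balanced accuracy: recall computed within each true class, then averaged.
recall <- tapply(imp_cat == truth_cat, truth_cat, mean)
balanced_accuracy <- mean(recall)
```

Balanced accuracy weights every class equally, so a rare class that is always imputed incorrectly lowers the score even when overall agreement is high.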

    evaluate(x, ...)


| Argument | Description |
|---|---|
| x | A […] |
| ... | Passed to methods. |

A mimar_evaluation object.
fit() trains a mimar_imputer on complete observed rows for one target
variable. In the imputation loop, x is the predictor block
$X_{-j}^{\mathrm{obs}}$ and y is the observed target vector
$X_{j}^{\mathrm{obs}}$.

    fit(object, ...)

    ## S3 method for class 'mimar_imputer'
    fit(
      object,
      x,
      y,
      target = y,
      variable = "target",
      donors = 5,
      seed = NULL,
      ...
    )


| Argument | Description |
|---|---|
| object | Object to fit. |
| ... | Passed to methods. |
| x | Predictor data frame containing observed rows for the current target variable. |
| y | Observed target vector for the current variable. |
| target | Original target vector, used to validate type support and restore imputed values to the correct storage type. |
| variable | Variable name used in diagnostics and error messages. |
| donors | Number of predictive mean matching donors for […] |
| seed | Optional random seed. |

The fitted object stores the original imputer descriptor and the model needed
by predict(). Native imputers are implemented directly in mimar; wrapped
imputers call their original learner packages directly:
- "norm" fits a linear model and draws
  $\tilde{y}_i = \hat{\mu}_i + \varepsilon_i$, with
  $\varepsilon_i \sim N(0, \hat{\sigma}^2)$.
- "pmm" fits the same linear model but imputes by predictive mean
  matching: among observed donors with fitted values closest to
  $\hat{\mu}_i$, one donor value is sampled.
- "logreg" fits a binomial GLM and draws classes from fitted
  Bernoulli probabilities.
- "polyreg" fits one-vs-rest binomial GLMs and samples classes from
  normalized class probabilities.
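The difference between the "norm" and "pmm" draws can be sketched with lm() in base R. This is a simplified illustration on simulated data, not the mimar code: the variable names, the nearest-fitted-value donor rule, and the absence of any parameter uncertainty draw are assumptions of the sketch.

```r
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
miss <- seq_len(n) %in% sample(n, 40)   # rows whose target is treated as missing

obs <- data.frame(x = x[!miss], y = y[!miss])
new <- data.frame(x = x[miss])

# Shared linear model fitted on observed rows.
fit_lm <- lm(y ~ x, data = obs)
mu_mis <- predict(fit_lm, newdata = new)
sigma_hat <- summary(fit_lm)$sigma

# "norm"-style draw: predicted mean plus Gaussian noise.
y_norm <- mu_mis + rnorm(length(mu_mis), sd = sigma_hat)

# "pmm"-style draw: sample one observed value among the `donors`
# observed rows whose fitted values are closest to the predicted mean.
donors <- 5
mu_obs <- fitted(fit_lm)
y_pmm <- vapply(mu_mis, function(mu) {
  pool <- order(abs(mu_obs - mu))[seq_len(donors)]
  obs$y[pool[sample.int(donors, 1)]]
}, numeric(1))
```

Because "pmm" only returns values that were actually observed, its imputations stay inside the observed range, whereas the "norm" draw can produce values outside it.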
A fitted object.
fit(mimar_imputer): Fit a mimar_imputer.
impute() performs single or multiple imputation through a chained update
procedure owned by mimar. The default imputer = "pmm" uses predictive
mean matching. A named imputer such as "naive", "rf", "xgboost",
"knn", or "glmnet" can also be supplied to use that learner for every
incomplete variable it supports. The returned
object keeps completed datasets first; use complete() to extract one
completed dataset, all completed datasets, or a stacked long data frame.

    impute(x, m = 5, imputer = "pmm", maxit = 5, seed = NULL, donors = 5,
           ncore = 1, verbose = FALSE, ...)

    ## S3 method for class 'data.frame'
    impute(x, m = 5, imputer = "pmm", maxit = 5, seed = NULL, donors = 5,
           ncore = 1, verbose = FALSE, ...)

    ## S3 method for class 'mimar_amputation'
    impute(x, m = 5, imputer = "pmm", maxit = 5, seed = NULL, donors = 5,
           ncore = 1, verbose = FALSE, ...)


| Argument | Description |
|---|---|
| x | A data frame or mimar_amputation object. |
| m | Number of completed data sets to generate. |
| imputer | Imputer name or a […] |
| maxit | Number of chained-equation iterations. |
| seed | Optional random seed. |
| donors | Number of donor candidates used by donor-based imputers. |
| ncore | Number of CPU cores used to run completed datasets in parallel. The default is 1. |
| verbose | Logical; if […] |
| ... | Passed to methods. |

Hyperparameters for learner-backed imputers can be supplied through the
imputer() specification or directly through ... when calling impute().
For donor-based imputers, donors controls the donor pool used by "pmm",
"spmm", "knn", and "hotdeck".
Let $X_j$ be an incomplete variable and $X_{-j}$ all remaining
variables. At each iteration, mimar fits an imputer learner on observed rows
$(X_{-j}^{\mathrm{obs}}, X_{j}^{\mathrm{obs}})$ and predicts missing cells
from $X_{-j}^{\mathrm{mis}}$. This is
repeated across incomplete variables for maxit iterations and across $m$
independent completed data sets. The algorithm is intentionally
learner-agnostic: each imputer is constructed with imputer(), trained with
fit(), and used through predict().
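The chained update can be sketched in base R with a single stochastic linear learner. This is only an illustration of the loop structure under stated assumptions: the helper chain_once() is hypothetical, all variables are numeric, starting values are drawn from observed values, and lm() plus Gaussian noise stands in for whichever imputer is requested.

```r
set.seed(7)
n <- 150
dat <- data.frame(x1 = rnorm(n))
dat$x2 <- 0.5 * dat$x1 + rnorm(n)
dat$x3 <- dat$x1 - dat$x2 + rnorm(n)
dat$x2[sample(n, 30)] <- NA
dat$x3[sample(n, 30)] <- NA

# Hypothetical helper: one chained run producing one completed data set.
chain_once <- function(data, maxit = 5) {
  miss <- is.na(data)
  incomplete <- names(which(colSums(miss) > 0))
  # Starting values: random draws from the observed values of each variable.
  for (v in incomplete) {
    data[[v]][miss[, v]] <- sample(data[[v]][!miss[, v]], sum(miss[, v]),
                                   replace = TRUE)
  }
  for (it in seq_len(maxit)) {
    for (v in incomplete) {
      obs <- !miss[, v]
      # Fit on rows where v was observed, using all other variables.
      model <- lm(reformulate(setdiff(names(data), v), response = v),
                  data = data[obs, ])
      # Predict the originally missing cells and add residual noise.
      mu <- predict(model, newdata = data[!obs, , drop = FALSE])
      data[[v]][!obs] <- mu + rnorm(sum(!obs), sd = summary(model)$sigma)
    }
  }
  data
}

# m independent completed data sets.
m <- 3
completed <- lapply(seq_len(m), function(k) chain_once(dat))
```

Only cells that were missing in the input are ever overwritten; observed cells pass through unchanged in every completed data set.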
Each requested imputer is applied to all incomplete variables it supports.
Use imputer_registry() to inspect target-type compatibility.
Learner-backed methods are supervised stochastic update rules inside the
chained workflow; inspect diagnostics and downstream sensitivity rather than
treating any single learner as a guarantee of proper uncertainty
quantification.
A mimar_imputation object.
impute(data.frame): Impute a data frame.
impute(mimar_amputation): Impute a mimar_amputation object and retain truth masks
for later evaluation.
imputer() constructs a standalone learner descriptor used by the chained
imputation engine. All imputers expose the same standard lifecycle:

    imputer(method, ...)

    ## Default S3 method:
    imputer(method, spec = NULL, ...)


| Argument | Description |
|---|---|
| method | Imputer method name. |
| ... | Hyperparameters retained for later use by impute() and fit(). |
| spec | Optional learner specification retained for future extensions. |

The lifecycle has three steps:
- construct with imputer(method);
- fit on observed rows with fit(object, x, y);
- impute new rows with predict(fitted, newdata).
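The construct, fit, predict lifecycle can be mimicked with a toy S3 design in base R. Everything here is hypothetical: toy_imputer(), toy_fit(), and the toy_fitted class are illustrative names and do not reproduce mimar internals. The sketch only shows how a descriptor, a fitted object, and a predict() method can be kept separate.

```r
# Step 1: a descriptor that records the method name and its hyperparameters.
toy_imputer <- function(method, ...) {
  structure(list(method = method, params = list(...)), class = "toy_imputer")
}

# Step 2: fit on observed rows; this toy version only supports "mean".
toy_fit <- function(object, x, y) {
  stopifnot(inherits(object, "toy_imputer"), object$method == "mean")
  structure(list(imputer = object, value = mean(y)), class = "toy_fitted")
}

# Step 3: predict for new rows through the standard predict() generic.
predict.toy_fitted <- function(object, newdata, ...) {
  rep(object$value, nrow(newdata))
}

spec <- toy_imputer("mean")
fitted_obj <- toy_fit(spec, x = data.frame(z = 1:3), y = c(2, 4, 6))
pred <- predict(fitted_obj, newdata = data.frame(z = 4:5))
```

Keeping the descriptor inside the fitted object means the hyperparameters supplied at construction time remain available when predictions are made.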
Native learners implemented in mimar include "mean", "median",
"mode", "naive", "norm", "pmm", "spmm", "logreg", "polyreg",
"knn", and "hotdeck". Learner-backed imputers such as "rf",
"xgboost", "svm", "bart", "nbayes", "rpart", "glmnet",
"gbm", and "famd" are called directly through their original packages
installed with mimar. Additional arguments supplied to imputer() are
retained as hyperparameters and used by impute() and fit().
The "superlearner" imputer, also available as "sl", cross-validates a
candidate imputer library on observed cells and combines candidates using
non-negative loss-based weights.
Compatibility with target types is explicit. If an imputer does not support a
numeric, binary, or multiclass target, mimar stops with an error rather than
silently falling back to another method.
A mimar_imputer object.
imputer(default): Construct a mimar_imputer.
imputer_registry() returns the imputer names accepted by impute() and
metadata describing target-type support and backend packages. Native
mimar methods display package = "internal". The result is returned as a
tibble.

    imputer_registry()

A tibble.
plot.mimar_imputation() draws imputation diagnostics. By default it shows
imputed cell counts. Other plot types show a cell-status map, observed versus
imputed distributions, boxplots across imputations, bivariate diagnostics,
categorical proportions, convergence traces, variable-level imputation
methods, or between-imputation variability.

    ## S3 method for class 'mimar_imputation'
    plot(
      x,
      type = c("imputed", "missing", "density", "strip", "boxplot", "xy",
               "proportion", "trace", "methods", "variability"),
      variable = NULL,
      formula = NULL,
      statistic = c("mean", "sd"),
      ...
    )

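
| Argument | Description |
|---|---|
| x | A mimar_imputation object. |
| type | Plot type: "imputed", "missing", "density", "strip", "boxplot", "xy", "proportion", "trace", "methods", or "variability". |
| variable | Optional variable name or names used by distribution plots. |
| formula | Optional formula for bivariate and stratified diagnostics. Use forms such as […] |
| statistic | Trace statistic, either "mean" or "sd". |
| ... | Unused. |
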
A ggplot object.
pool() combines post-fit quantities estimated separately on each completed
data set. The object being pooled is a quantity: a scalar, vector, matrix,
array, model coefficient, survival probability, metric, or other estimate of
interest. A data frame is only a convenient tabular adapter for tidy scalar
estimates; it is not the statistical target being pooled.

    pool(x, ...)

    ## S3 method for class 'numeric'
    pool(x, variance = NULL, std.error = NULL, covariance = NULL, rule = NULL,
         transform = NULL, inverse = NULL, conf.level = 0.95,
         name = "quantity", ...)

    ## S3 method for class 'list'
    pool(x, variance = NULL, std.error = NULL, covariance = NULL, rule = NULL,
         transform = NULL, inverse = NULL, conf.level = 0.95, ...)

    ## S3 method for class 'matrix'
    pool(x, variance = NULL, std.error = NULL, covariance = NULL, rule = NULL,
         transform = NULL, inverse = NULL, conf.level = 0.95, ...)

    ## S3 method for class 'data.frame'
    pool(x, variance = NULL, std.error = NULL, covariance = NULL, rule = NULL,
         transform = NULL, inverse = NULL, conf.level = 0.95, ...)


| Argument | Description |
|---|---|
| x | Quantity estimates across imputations. Use a numeric vector for one scalar quantity, a list for scalar/vector/matrix/array quantities, a matrix with imputations in rows and quantities in columns, or a data frame as a tabular adapter for tidy scalar estimates. |
| ... | Passed to methods. |
| variance | Complete-data variance for the quantity in each imputation. For vector quantities this may also be a list of elementwise variance vectors or matrices matching x. |
| std.error | Complete-data standard error for the quantity in each imputation. Ignored when […] |
| covariance | For a list of numeric vectors, optional list of covariance matrices. When supplied, vector pooling uses Rubin's multivariate matrix form and returns the pooled covariance matrix. |
| rule | Pooling rule. |
| transform | Optional function applied before pooling, for example […] |
| inverse | Optional inverse transformation applied to pooled estimates and intervals. |
| conf.level | Confidence level for interval estimates. |
| name | Name of a scalar quantity. |

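For a scalar quantity with estimates $\hat{Q}_1, \dots, \hat{Q}_m$ and
complete-data variances $U_1, \dots, U_m$, Rubin rules are
$$\bar{Q} = \frac{1}{m} \sum_{k=1}^{m} \hat{Q}_k, \qquad \bar{U} = \frac{1}{m} \sum_{k=1}^{m} U_k, \qquad B = \frac{1}{m-1} \sum_{k=1}^{m} (\hat{Q}_k - \bar{Q})^2,$$
and total variance is
$$T = \bar{U} + \left(1 + \frac{1}{m}\right) B.$$
The reported standard error is $\sqrt{T}$. Confidence intervals use a $t$
reference distribution with
$$\nu = (m - 1) \left(1 + \frac{\bar{U}}{(1 + 1/m)\, B}\right)^{2}$$
degrees of freedom.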
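Rubin's rules for a scalar can be written out in a few lines of base R. The helper rubin_scalar() below is a hypothetical name used for illustration; it assumes the classical large-sample degrees of freedom and is not the code behind pool(). The inputs reuse the three estimates and standard errors from the package example for "age".

```r
# Hypothetical helper: pool m scalar estimates `q` with complete-data
# variances `u` using Rubin's rules.
rubin_scalar <- function(q, u, conf.level = 0.95) {
  m <- length(q)
  q_bar <- mean(q)                       # pooled estimate
  u_bar <- mean(u)                       # within-imputation variance
  b <- var(q)                            # between-imputation variance (divisor m - 1)
  t_var <- u_bar + (1 + 1 / m) * b       # total variance
  df <- (m - 1) * (1 + u_bar / ((1 + 1 / m) * b))^2
  half <- qt(1 - (1 - conf.level) / 2, df) * sqrt(t_var)
  c(estimate = q_bar, std.error = sqrt(t_var), df = df,
    lower = q_bar - half, upper = q_bar + half)
}

# Variances are the squared standard errors.
res <- rubin_scalar(q = c(0.10, 0.11, 0.09), u = c(0.04, 0.05, 0.04)^2)
```

The degrees of freedom formula divides by the between-imputation variance, so this sketch requires the estimates to differ across imputations.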
For a vector quantity, pass x as a list of numeric vectors and covariance
as a list of complete-data covariance matrices. Rubin's matrix form is then
used: $\bar{Q}$ is a vector and $T = \bar{U} + (1 + 1/m) B$ is the
pooled covariance matrix. For matrices or arrays, pass a list of same-shaped
quantities. Unless a joint covariance structure is supplied through a vector
input, these are pooled element by element, which is appropriate for grids of
scalar estimands such as survival probabilities at several times and
covariate profiles. For survival-probability matrices, pool_survmat()
applies the complementary log-log transform and back-transform automatically.
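For the vector case, the matrix form of Rubin's rules can be sketched in base R as follows. The inputs are the coefficient vectors and covariance matrices from the package example; the object names (q_mat, u_bar, b, t_mat) are illustrative and the code is not taken from pool().

```r
betas <- list(
  c(age = 0.10, bmi = 0.30),
  c(age = 0.11, bmi = 0.32),
  c(age = 0.09, bmi = 0.29)
)
covs <- list(
  diag(c(0.04, 0.08)^2),
  diag(c(0.05, 0.09)^2),
  diag(c(0.04, 0.08)^2)
)

m <- length(betas)
q_mat <- do.call(rbind, betas)      # m x p matrix, one row per imputation
q_bar <- colMeans(q_mat)            # pooled coefficient vector
u_bar <- Reduce(`+`, covs) / m      # average complete-data covariance
b <- cov(q_mat)                     # between-imputation covariance (divisor m - 1)
t_mat <- u_bar + (1 + 1 / m) * b    # pooled covariance matrix
```

Unlike elementwise pooling, the between-imputation term here carries off-diagonal covariances, which matter for joint tests on several coefficients.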
Some metrics do not have reliable complete-data variance estimates or do not
satisfy approximate normality. Following Marshall et al. (2009), pool()
reports robust summaries by default when no variance is supplied: median,
interquartile range, and range across imputations. Use rule = "mean" to
request a mean and between-imputation standard error for such metrics.
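The robust summary for a metric without a variance estimate amounts to a few order statistics across imputations. The sketch below uses hypothetical AUC values, one per completed data set, and base R's default quantile definition; the names and layout of the summary returned by pool() may differ.

```r
# Hypothetical AUC estimates, one per completed data set.
auc <- c(0.71, 0.74, 0.69, 0.73, 0.72)

robust <- c(
  median = median(auc),
  q1 = unname(quantile(auc, 0.25)),
  q3 = unname(quantile(auc, 0.75)),
  min = min(auc),
  max = max(auc)
)

# Interquartile range across imputations.
robust["iqr"] <- robust["q3"] - robust["q1"]
```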
A list is the preferred input for post-fit quantities. Use a list of
length m, one element per imputation. Each element can be a scalar,
vector, matrix, or array. When covariance is supplied for a list of
vectors, Rubin's multivariate matrix rule is used. Otherwise list elements
are pooled element by element.
Data frames are accepted as a convenience adapter, but the pooled
object is not the data frame itself. Rows must encode post-fit scalar
quantities: term, estimate, std.error, and imputation for Rubin
pooling, or metric, value, and imputation for metric summaries.
A mimar_pool object.
pool(numeric): Pool a scalar quantity observed across imputations.
pool(list): Pool a list of scalar, vector, matrix, or array quantities.
pool(matrix): Pool a matrix whose rows are imputations and columns are
scalar quantities.
pool(data.frame): Tabular adapter for tidy scalar estimates or metrics.
Marshall A, Altman DG, Holder RL, Royston P. Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines. BMC Medical Research Methodology. 2009;9:57.

    # Scalar quantity: pool three estimates of one coefficient
    # using their complete-data standard errors.
    pool(c(0.10, 0.11, 0.09), std.error = c(0.04, 0.05, 0.04), name = "age")

    # Vector quantity: pool coefficient vectors together with
    # their complete-data covariance matrices.
    betas <- list(
      c(age = 0.10, bmi = 0.30),
      c(age = 0.11, bmi = 0.32),
      c(age = 0.09, bmi = 0.29)
    )
    covs <- list(
      diag(c(0.04, 0.08)^2),
      diag(c(0.05, 0.09)^2),
      diag(c(0.04, 0.08)^2)
    )
    pool(betas, covariance = covs)

pool_survmat() pools a list of same-shaped survival-probability matrices
or arrays by applying Rubin-style pooling on the complementary
log-log scale and back-transforming the result. The helper is designed for
predicted survival probabilities at a grid of times, subjects, or covariate
profiles.

    pool_survmat(
      x,
      variance = NULL,
      std.error = NULL,
      rule = NULL,
      conf.level = 0.95,
      clip = 1e-12,
      ...
    )


| Argument | Description |
|---|---|
| x | A non-empty list of numeric matrices or arrays containing survival probabilities. All elements must have the same dimensions. |
| variance | Optional list of within-imputation variances with the same dimensions as x. |
| std.error | Optional list of within-imputation standard errors with the same dimensions as x. |
| rule | Pooling rule. Defaults to Rubin pooling when within-imputation variance is supplied and to the robust median/IQR/range summary otherwise. |
| conf.level | Confidence level for interval estimates. |
| clip | Small positive value used to keep probabilities away from 0 and 1 before applying the cloglog transform. |
| ... | Passed to lower-level pooling helpers. |

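Let $S_{k,ij}$ be the survival probability for imputation $k$, row
index $i$, and column index $j$, with $0 < S_{k,ij} < 1$. Define
the complementary log-log transform
$$\eta_{k,ij} = g(S_{k,ij}) = \log\{-\log(S_{k,ij})\}.$$
Rubin pooling is then applied elementwise on $\eta_{k,ij}$:
$$\bar{\eta}_{ij} = \frac{1}{m} \sum_{k=1}^{m} \eta_{k,ij}, \qquad T_{ij} = \bar{U}_{ij} + \left(1 + \frac{1}{m}\right) B_{ij},$$
where $\bar{U}_{ij}$ and $B_{ij}$ are the within- and between-imputation
variances on the transformed scale. The pooled survival probability is
$$\bar{S}_{ij} = g^{-1}(\bar{\eta}_{ij}) = \exp\{-\exp(\bar{\eta}_{ij})\}.$$
A delta-method standard error on the original scale is
$$\mathrm{SE}(\bar{S}_{ij}) = \left|\bar{S}_{ij} \log \bar{S}_{ij}\right| \sqrt{T_{ij}}.$$
Confidence intervals are obtained on the transformed scale and then back-transformed:
$$\left[\, g^{-1}\!\left(\bar{\eta}_{ij} + t_{\nu_{ij},\,1-\alpha/2} \sqrt{T_{ij}}\right),\; g^{-1}\!\left(\bar{\eta}_{ij} - t_{\nu_{ij},\,1-\alpha/2} \sqrt{T_{ij}}\right) \right],$$
where $\nu_{ij}$ is the Rubin degrees of freedom and $\alpha$ is one minus
conf.level. Because $g^{-1}$ is
decreasing, the lower survival bound comes from the upper transformed bound.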
Probabilities are clipped to [clip, 1 - clip] before transformation to
avoid log(0) at the boundaries.
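The steps above can be sketched in base R. The helper pool_cloglog() is a hypothetical name, not part of mimar, and the sketch makes one assumption that the source does not state: the supplied standard errors are on the survival scale and are moved to the cloglog scale with the delta method before Rubin pooling. The survival matrices are those from the package example; the standard errors are invented for illustration.

```r
cloglog <- function(s) log(-log(s))
inv_cloglog <- function(eta) exp(-exp(eta))

# Hypothetical helper: elementwise Rubin pooling on the cloglog scale.
pool_cloglog <- function(surv, se, conf.level = 0.95, clip = 1e-12) {
  m <- length(surv)
  # Keep probabilities inside [clip, 1 - clip] so the transform is finite.
  surv <- lapply(surv, function(s) pmin(pmax(s, clip), 1 - clip))
  eta <- lapply(surv, cloglog)
  # Assumed delta method: Var(eta) is approximately Var(S) / (S log S)^2.
  u <- Map(function(s, e) (e / (s * log(s)))^2, surv, se)
  eta_bar <- Reduce(`+`, eta) / m
  u_bar <- Reduce(`+`, u) / m
  b <- Reduce(`+`, lapply(eta, function(e) (e - eta_bar)^2)) / (m - 1)
  t_var <- u_bar + (1 + 1 / m) * b
  df <- (m - 1) * (1 + u_bar / ((1 + 1 / m) * b))^2
  crit <- qt(1 - (1 - conf.level) / 2, df)
  s_bar <- inv_cloglog(eta_bar)
  list(
    estimate = s_bar,
    std.error = abs(s_bar * log(s_bar)) * sqrt(t_var),
    # The inverse transform is decreasing, so the bounds swap.
    lower = inv_cloglog(eta_bar + crit * sqrt(t_var)),
    upper = inv_cloglog(eta_bar - crit * sqrt(t_var))
  )
}

surv <- list(
  matrix(c(0.90, 0.80, 0.70, 0.60), 2, 2),
  matrix(c(0.91, 0.79, 0.72, 0.61), 2, 2),
  matrix(c(0.89, 0.81, 0.71, 0.59), 2, 2)
)
se <- replicate(3, matrix(0.03, 2, 2), simplify = FALSE)
pooled <- pool_cloglog(surv, se)
```

Pooling on the transformed scale keeps every back-transformed estimate and interval bound strictly between 0 and 1, which pooling raw probabilities does not guarantee.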
A mimar_pool object with pooled survival probabilities.

    # Three imputations of a 2 x 2 grid of predicted survival probabilities.
    surv <- list(
      matrix(c(0.90, 0.80, 0.70, 0.60), 2, 2),
      matrix(c(0.91, 0.79, 0.72, 0.61), 2, 2),
      matrix(c(0.89, 0.81, 0.71, 0.59), 2, 2)
    )

    # No variances supplied, so the robust summary is used by default.
    pool_survmat(surv)
