--- title: "Chained Multiple Imputation Workflows with mimar" output: rmarkdown::html_vignette bibliography: references.bib vignette: > %\VignetteIndexEntry{Chained Multiple Imputation Workflows with mimar} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## Overview Missing data are ubiquitous in applied statistics. A large proportion of real data sets contain at least some incomplete cases, and the way missingness is handled can substantially affect point estimates, standard errors, and downstream inference. Complete-case analysis discards potentially informative observations and, unless data are missing completely at random, introduces selection bias. Single imputation fills missing values once but ignores imputation uncertainty, leading to standard errors that are too narrow. Multiple imputation addresses both problems. By generating $m$ plausible completions of the data and combining results using Rubin's pooling rules [@rubin1987], it propagates uncertainty about what the missing values might have been into the final estimates. The chained-equation formulation, sometimes called fully conditional specification, makes multiple imputation tractable for mixed-type data by modeling each incomplete variable conditionally on all others, cycling through variables until the chained summaries appear stable [@vanbuuren2011mice]. `mimar` is designed for analyses where missing-data handling, artificial missingness benchmarks, imputation diagnostics, and downstream pooling should live in one compact grammar. Its contribution is not to replace the mature special-purpose imputation ecosystem, but to make a complete workflow concise: describe the missingness, create benchmark amputations when needed, impute with native or learner-backed update rules, inspect diagnostics, evaluate recovered cells when truth is available, and pool post-fit quantities. `mimar` organizes this workflow into a small set of composable functions: ```r describe() # characterise missingness structure ampute() # introduce controlled artificial missingness imputer_registry()# inspect available imputation methods impute() # run the chained multiple imputation complete() # extract a single completed dataset evaluate() # score imputation quality against known truth pool() # apply Rubin's pooling rules plot() # visualise any of the above objects ``` The central design decision is that `mimar` owns the outer chained-equation loop and multiple-imputation bookkeeping. Each imputer is simply a variable-level update rule slotted into that loop. This separation means that native parametric methods, donor-based methods, and machine-learning-backed methods are called identically and obey the same convergence and type-restoration logic: ```r impute(data, imputer = "pmm", m = 5, maxit = 5) impute(data, imputer = "rf", m = 5, maxit = 5) impute(data, imputer = "xgboost", m = 5, maxit = 5) ``` No external modelling framework is required. Learner-backed imputers call their original packages directly, and those backend packages are installed with `mimar` so all registered imputers are available without extra installation steps. ## Succinct Use For everyday work, `impute()` is the only function you need. The input data can contain `NA`, and the completed outputs returned by `complete()` do not. ```{r} succinct_dat <- data.frame( x = c(1, NA, 3, 4, NA), z = factor(c("a", "b", NA, "a", "b")) ) i <- mimar::impute(succinct_dat, imputer = "knn", m = 2, maxit = 2, seed = 1) mimar::complete(i, 1) mimar::complete(i, "all") ``` The workflow can be summarized as: | Stage | Function | Main object | |---|---|---| | Missingness description | `describe()` | `mimar_missing` | | Artificial missingness | `ampute()` | `mimar_amputation` | | Imputation | `impute()` | `mimar_imputation` | | Completed data extraction | `complete()` | data frame or list | | Quality assessment | `evaluate()` | `mimar_evaluation` | | Post-fit pooling | `pool()` | `mimar_pool` | | Diagnostics | `plot()` | `ggplot` | ```{r} library(mimar) set.seed(1) dat <- data.frame( age = rnorm(120, 50, 10), bmi = rnorm(120, 25, 4), sex = factor(sample(c("F", "M"), 120, TRUE)), group = factor(sample(c("A", "B", "C"), 120, TRUE)), smoker = sample(c(TRUE, FALSE), 120, TRUE) ) head(dat) ``` ## Characterizing the Missingness Structure Before imputing, it is good practice to understand the extent and pattern of missingness. `describe()` returns a summary object that reports per-variable missingness proportions, complete-case counts, and the missingness pattern matrix. Formally, let $X$ be an $n \times p$ data matrix and define the missingness indicator $$ R_{ij} = \begin{cases} 1, & X_{ij}\ \text{is missing},\\ 0, & X_{ij}\ \text{is observed}. \end{cases} $$ The per-variable missingness proportion is $$ \hat{\pi}_j = \frac{1}{n}\sum_{i=1}^{n} R_{ij}, $$ and the complete-case count is $$ n_{\mathrm{cc}} = \sum_{i=1}^{n} I(R_{i1}=0,\ldots,R_{ip}=0). $$ These summaries answer the first practical question: how much information would be lost under complete-case analysis, and which variables are driving that loss? ```{r} d <- describe(dat) d summary(d) ``` ```{r, fig.width=7, fig.height=4} plot(d) ``` ## Imputer Registry `mimar` ships with native imputers and learner-backed methods. The registry documents each method, whether it is implemented natively or through a backend package, and which target types it supports. ```{r} imputer_registry() ``` `describe("imputers")` returns the same registry through the unified `describe` interface, which also exposes the corresponding print and plot methods. ```{r} describe("imputers") ``` ## Simulating Missingness for Benchmarking When the goal is to evaluate imputation quality, a natural approach is to start with a complete dataset, introduce missingness under a known mechanism, impute, and compare the recovered values to the truth. `ampute()` supports this workflow by generating artificial missingness and tracking three masks: - `mask_original`: cells already missing before amputation; - `mask_added`: cells made missing by `ampute()`; - `mask_total`: all missing cells after amputation. Only `mask_added` cells have known truth and are used by `evaluate()`. ### Missing completely at random Under MCAR, each cell in the target variable is independently assigned to be missing with constant probability: $$ A_{ij} \sim \operatorname{Bernoulli}(\pi), \qquad X_{ij} \leftarrow \mathrm{NA}\ \text{if}\ A_{ij}=1, $$ where $\pi$ is `prop`. MCAR is the only mechanism under which complete-case analysis is unbiased, making it a useful baseline when comparing imputers. ### Missing at random Under MAR, the probability of missingness depends on observed covariates but not on the missing value itself. `mimar` constructs a standardized linear score $s_i$ from the `by` variables and maps it through a logistic function: $$ p_i = \frac{1}{1+\exp[-(\alpha+s_i)]}. $$ The intercept $\alpha$ is calibrated so that $n^{-1}\sum_i p_i \approx \texttt{prop}$. This ensures that the targeted missingness proportion is respected on average while allowing individual probabilities to vary with the observed predictors. ### Missing not at random Under MNAR, $s_i$ is derived from the target variable itself, which is the mechanism most harmful to inference because the missingness is informative about the value being missed. For numeric targets, `ampute()` can emphasize the right tail, left tail, or both tails of the distribution. ```{r} a <- ampute( dat, prop = 0.25, mechanism = "MAR", target = c("bmi", "group"), by = c("age", "sex"), seed = 1 ) a summary(a) ``` ## Quick Visual Checks The first thing most users want to see is whether the diagnostics are actually visible and legible. This compact set of plots appears early in the vignette so the rendered HTML shows the core visual outputs without requiring a long scroll. ```{r, fig.width=7, fig.height=4, fig.show='hold'} i_knn <- impute(a, imputer = "knn", m = 3, maxit = 3, seed = 1) plot(i_knn, type = "density", variable = "bmi") plot(i_knn, type = "xy", formula = bmi ~ age | sex) ``` ## The Chained Imputation Algorithm The fully conditional specification approach imputes each incomplete variable in turn, conditioning on all other variables. This is repeated for $T$ iterations and replicated $m$ times to obtain $m$ independent completed datasets. For incomplete variable $X_j$, let $$ O_j = \{i : R_{ij}=0\}, \qquad M_j = \{i : R_{ij}=1\}. $$ At iteration $t$, with the current completed data $\widetilde X^{(t)}$, `mimar` fits an imputation model using the observed rows of variable $j$: $$ \hat f_j^{(t)}:\widetilde X_{-j,O_j}^{(t)} \rightarrow X_{j,O_j}. $$ Missing values are then updated by applying the learned model to the missing rows: $$ \widetilde X_{j,M_j}^{(t+1)} = g_j\!\left(\hat f_j^{(t)},\widetilde X_{-j,M_j}^{(t)}\right), $$ where $g_j$ is the prediction or donor-sampling step specific to the chosen imputer. Observed values are never overwritten; $\widetilde X_{j,O_j}^{(t)}$ is reset to $X_{j,O_j}$ after each update. The full algorithm, written compactly, is as follows. Let $h$ be the imputer, $T$ the number of chained iterations, $m$ the number of imputations, and $\mathcal{B}(O_j)$ a bootstrap sample from the observed rows of variable $j$. $$ \begin{array}{ll} \textbf{Input:} & X,\ R,\ h,\ m,\ T.\\[2mm] \textbf{Initialize:} & \widetilde X^{(0)} \leftarrow \operatorname{init}(X), \quad\text{medians for numeric, modes for categorical variables.}\\[2mm] \textbf{For } k=1,\ldots,m: & \widetilde X_k^{(0)} \leftarrow \widetilde X^{(0)}.\\[1mm] & \textbf{For } t=1,\ldots,T:\\[1mm] & \quad \textbf{For each } j \text{ with } M_j \ne \varnothing:\\[1mm] & \quad\quad B_j \leftarrow \mathcal{B}(O_j).\\[1mm] & \quad\quad \hat f_{jk}^{(t)} \leftarrow \operatorname{fit}_h\!\left(\widetilde X_{k,B_j,-j}^{(t-1)}, X_{B_j,j}\right).\\[1mm] & \quad\quad \widetilde X_{k,M_j,j}^{(t)} \leftarrow \operatorname{restore}_j\!\left[ g_h\!\left(\hat f_{jk}^{(t)}, \widetilde X_{k,M_j,-j}^{(t-1)}\right) \right].\\[1mm] & \quad\quad \widetilde X_{k,O_j,j}^{(t)} \leftarrow X_{O_j,j}.\\[2mm] \textbf{Return:} & \{\widetilde X_1^{(T)},\ldots,\widetilde X_m^{(T)}\} \text{ as a \texttt{mimar\_imputation} object.} \end{array} $$ The bootstrap draw within each iteration is what introduces between-imputation variability and ensures that the $m$ completed datasets are not identical. The `restore` step converts predictions back to the target storage type (numeric, integer, `Date`, logical, factor, or character), which allows the same loop to handle numeric-only, categorical-only, and mixed-type data frames without special-casing at the call site. ## Implemented Update Equations The imputers differ in how $\hat f_j$ is fitted and how $g_j$ generates draws. The choice of imputer determines both the distributional assumptions made about each variable and the computational cost per iteration. ### Mean, Median, Mode, and Naive These constant-predictor imputers ignore the predictor variables entirely. For numeric targets, `mean` and `median` use $$ \widetilde x_{ij} = \bar x_j \quad\text{or}\quad \widetilde x_{ij} = \operatorname{median}(x_{O_j,j}). $$ For categorical targets, `mode` imputes the most frequent observed category: $$ \widetilde x_{ij} = \arg\max_c \sum_{i \in O_j} I(x_{ij}=c). $$ `naive` selects median for numeric-like variables and mode for categorical-like variables automatically. These methods are primarily useful as baselines and for initialising the chain; they do not preserve the conditional distribution of the missing variable given observed covariates. ### Normal Linear Draw `norm` fits an ordinary linear model $$ X_j = X_{-j}\beta_j + \epsilon_j, \qquad \epsilon_j \sim N(0,\sigma_j^2), $$ and draws from the predictive distribution at missing rows: $$ \widetilde X_{j,M_j} = \hat X_{-j,M_j}\hat\beta_j + e, \qquad e \sim N(0,\hat\sigma_j^2). $$ The stochastic draw is important: deterministic regression imputation underestimates the variance of the imputed variable and collapses the distribution around the regression surface. ### Predictive Mean Matching `pmm` combines linear prediction with donor sampling, which addresses the main limitation of `norm` for non-Gaussian variables: imputed values are constrained to the observed support. The method fits the same linear model but instead of drawing from the normal predictive distribution, it selects a donor from the observed rows whose predicted value is closest to the missing row's prediction: $$ d(i,\ell) = |\hat\mu_i - \hat\mu_\ell|, \qquad \widetilde x_{ij} = x_{\ell j},\quad \ell \sim \mathcal{D}_i. $$ This makes PMM robust to departures from normality and ensures that imputed values respect the range and granularity of the observed data [@vanbuuren2011mice]. `spmm` uses the same donor-matching rule in a single-step variant. ### Logistic and Polytomous Regression For binary targets, `logreg` models the conditional probability directly: $$ \Pr(X_j = 1 \mid X_{-j}) = \operatorname{logit}^{-1}(X_{-j}\beta_j), $$ and imputes by drawing $$ \widetilde X_{ij} \sim \operatorname{Bernoulli}(\hat p_i). $$ For multiclass targets, `polyreg` fits one-vs-rest logistic models for each class $c = 1,\ldots,C$ and normalises the resulting probabilities: $$ \widetilde p_{ic} = \frac{\hat p_{ic}}{\sum_{r=1}^{C}\hat p_{ir}}, $$ then draws from $\operatorname{Categorical}(\widetilde p_i)$. Stochastic class draws, rather than hard majority-class assignment, are necessary for preserving the uncertainty that multiple imputation is designed to propagate. ### Donor Methods: knn, hotdeck, famd `knn` and `hotdeck` are donor-based imputers that do not assume any parametric form for the conditional distribution. Predictors are encoded into a numeric design matrix $Z$. The donor distance for missing row $i$ is $$ d(i,\ell) = \left\|Z_i - Z_\ell\right\|_2. $$ `knn` samples the imputed value from the $k$ nearest observed donors. `hotdeck` uses the same standardized distance inside the chained-equation loop, operating as a stochastic hot-deck update. `famd` extends the donor approach to mixed-type predictors by first projecting observations into FAMD-assisted coordinates with `missMDA` [@josse2016missmda], then applying donor matching in that space. ### Learner-Backed Imputers `rf`, `ranger`, `rpart`, `nbayes`, `svm`, `bart`, `glmnet`, `gbm`, and `xgboost` treat the imputation of each variable as a supervised learning problem. The learner is fitted on observed rows and predicts (or draws from) the conditional distribution at missing rows: $$ \hat f_j = \mathcal{A}_h(X_{-j,O_j}, X_{j,O_j}). $$ Numeric targets use regression predictions; binary and multiclass targets use sampled class probabilities when the backend exposes them. These learner-backed methods are pragmatic stochastic chained update rules. They are useful for predictive recovery and sensitivity analyses, but users should inspect between-imputation variability, convergence-screening plots, and downstream model sensitivity rather than assuming that every learner automatically yields proper multiple-imputation uncertainty. `rf` operates as a variable-level supervised update inside `mimar`'s chained loop; it is not a whole-dataframe imputation engine. This is conceptually related to the missForest strategy [@stekhoven2012missforest], but the outer iteration, multiple-imputation replication, and type handling are controlled by `mimar`. The `ranger`, `glmnet`, `gbm`, and `xgboost` backends follow @wright2017ranger, @friedman2010glmnet, @friedman2001gbm, and @chen2016xgboost, respectively. ## Running the Imputation All imputers are called through the same `impute()` interface. The `m` argument controls the number of completed datasets and `maxit` the number of chained-equation cycles per dataset. Increasing `maxit` allows the conditional distributions more cycles to converge; in practice, 5 to 10 iterations is often sufficient for well-behaved data, though trace diagnostics and sensitivity analyses should be checked for complex missingness patterns. When a learner-backed imputer is requested, `mimar` applies that imputer to each incomplete variable rather than silently substituting another method. This keeps benchmark comparisons interpretable: `imputer = "rf"` means that random-forest updates are used for numeric, binary, and multiclass targets, and the same principle holds for the other learner-backed imputers. ```{r} i_knn <- impute(a, imputer = "knn", m = 3, maxit = 3, seed = 1) i_knn summary(i_knn) complete(i_knn, 1) ``` The vignette uses `knn` for the main diagnostics because it is dependency-light, fast on the small example data, and produces visible stochastic donor variation across imputations. The same plotting calls work for the other registered imputers. The `m` completed datasets are conditionally independent once their random seeds have been fixed. `mimar` therefore parallelises at the completed-dataset level when `ncore > 1`. Internally, the package maps the same single-chain imputation routine over `1:m` with `functionals::fmap()`. This is the natural parallel boundary: each worker receives the original incomplete data, the initial median/mode completion, the imputer specification, and a deterministic seed offset. Worker `k` uses `seed + k - 1`, so a run remains reproducible for a fixed `seed`, `m`, `maxit`, and imputer. ```r i_parallel <- impute( a, imputer = "knn", m = 5, maxit = 5, seed = 1, ncore = 2 ) ``` The default `ncore = 1` is deliberately conservative. It avoids parallel overhead for small data and gives the most predictable behavior in constrained execution environments. Increasing `ncore` is most useful when `m` is greater than one and each chain is expensive, for example with random forests, gradient boosting, penalized regression, or BART. Parallelism does not change the chained-equation model; it only changes how the independent completed datasets are scheduled. Methods that differ in their statistical assumptions can be compared by running them on the same amputed object: ```{r} i_knn_small <- impute(a, imputer = "knn", m = 1, maxit = 2, seed = 1) i_hotdeck <- impute(a, imputer = "hotdeck", m = 1, maxit = 2, seed = 1) summary(i_knn_small) summary(i_hotdeck) ``` Learner-backed methods are available through the same call: ```r impute(a, imputer = "rf", m = 5, maxit = 5, seed = 1) impute(a, imputer = "glmnet", m = 5, maxit = 5, seed = 1) impute(a, imputer = "xgboost", m = 5, maxit = 5, seed = 1) impute(a, imputer = "superlearner", m = 5, maxit = 5, seed = 1) ``` ## Tuning Imputers Learner-backed imputers expose their hyperparameters through a reusable specification object, or directly through `...` when calling `impute()`. The same hyperparameter set is applied across the full chained-imputation pipeline. ```{r} rf_spec <- imputer("rf", num.trees = 500) xgb_spec <- imputer("xgboost", nrounds = 100, max_depth = 3) describe(a) i_knn <- impute(a, imputer = "knn", m = 3, maxit = 3, seed = 1, donors = 10) summary(i_knn) ``` The learner-backed specifications can then be used in the same pipeline: ```r i_rf <- impute(a, imputer = rf_spec, m = 3, maxit = 3, seed = 1) i_xgb <- impute(a, imputer = xgb_spec, m = 3, maxit = 3, seed = 1) ``` ## Super Learner Imputation `superlearner` builds a cross-validated ensemble of candidate imputers. For each incomplete variable, the candidate library is evaluated on observed cells, non-negative weights are assigned from the observed-cell prediction loss, and the weighted ensemble is used inside the same chained-imputation loop. ```{r} sl_spec <- imputer( "superlearner", library = c("pmm", "knn", "rpart"), folds = 3, metalearner = "inverse_loss" ) i_sl <- impute(a, imputer = sl_spec, m = 2, maxit = 2, seed = 1) summary(i_sl) ``` The shorter `sl` alias is equivalent: ```r impute(a, imputer = "sl", m = 5, maxit = 5, seed = 1) ``` ## Evaluating Imputation Quality When the imputation object was produced from an amputed dataset, the true values of the artificially introduced missing cells are known. `evaluate()` exploits this by scoring only cells where `mask_added == TRUE`, isolating imputation error from any pre-existing missingness. For numeric variables with true values $y_i$ and imputed values $\hat y_i$ over the $q$ recovered cells: $$ \operatorname{RMSE} = \sqrt{\frac{1}{q}\sum_{i=1}^{q}(\hat y_i-y_i)^2}, \qquad \operatorname{MAE} = \frac{1}{q}\sum_{i=1}^{q}|\hat y_i-y_i|, \qquad \operatorname{Bias} = \frac{1}{q}\sum_{i=1}^{q}(\hat y_i-y_i). $$ RMSE penalises large errors more heavily than MAE; comparing the two gives a sense of whether the imputer is making a few large mistakes or many small ones. Bias diagnoses systematic over- or under-imputation, which can occur when the imputation model is misspecified relative to the true conditional distribution. For categorical variables, $$ \operatorname{Accuracy} = \frac{1}{q}\sum_{i=1}^{q} I(\hat y_i=y_i). $$ Balanced accuracy averages per-class recall and is more informative when class frequencies are unequal among the missing cells. ```{r} e <- evaluate(i_knn) e describe(e) head(e$recovery_by_imputation) ``` ```{r, fig.width=7, fig.height=4} plot(i_knn) plot(i_knn, type = "missing") plot(i_knn, type = "density") plot(e) ``` ## Diagnostic Plotting Diagnostic plotting is not a cosmetic step in multiple imputation. The pooled analysis assumes that the imputed values are plausible draws from the conditional distribution implied by the imputation model. No single plot proves that assumption or establishes convergence, but several simple graphics expose common failures: chains that have not stabilised, imputed values with implausible marginal distributions, categorical levels that are over- or under-produced, and relationships between variables that are distorted among the imputed cells. `mimar` keeps lightweight iteration diagnostics in the imputation object. For each completed dataset, iteration, and incomplete variable, it records the number of imputed cells and, for numeric variables, the mean and standard deviation of the current imputed values. These summaries are small enough to store by default but rich enough to support convergence-screening trace plots. ```{r} head(i_knn$diagnostics$trace) ``` The trace plot shows how a summary of the imputed values changes over the chained iterations. Strong monotone trends or chains that remain separated can signal unstable chained updates, feedback between derived variables, or a poor predictor specification. Flat traces are not automatically good: deterministic or constant imputers such as `naive` can be flat because they do not generate meaningful stochastic updates. Trace plots are screening tools, not formal proofs of convergence, and are most informative for stochastic methods such as `pmm`, `norm`, `logreg`, `polyreg`, and learner-backed imputers. ```{r, fig.width=7, fig.height=4} plot(i_knn, type = "trace", statistic = "mean") ``` Density plots compare the observed distribution with the distribution of imputed values. The observed data are drawn once, while imputed values are drawn separately for each completed dataset. This mirrors the common diagnostic style where the observed curve acts as a reference and each imputation contributes its own imputed curve. Density diagnostics are drawn as lines rather than filled areas so that several completed datasets can be compared without the later layers obscuring the earlier ones. The marginal distributions do not have to be identical under MAR, because missingness may depend on observed covariates. Large differences should still be investigated because they may represent either a real conditional difference or an imputation-model problem. ```{r, fig.width=7, fig.height=4} plot(i_knn, type = "density", variable = "bmi") ``` Box-and-whisker diagnostics provide the same comparison in a more robust form. Imputation number `0` denotes the observed values and `1:m` denote the completed datasets. This is useful when densities are unstable because there are few imputed values or because the variable has a skewed distribution with outliers. ```{r, fig.width=7, fig.height=4} plot(i_knn, type = "boxplot", variable = "bmi") ``` The strip plot displays individual observed and imputed values by imputation number. It is especially helpful for small datasets, variables with few missing cells, and situations where overplotting is limited. A tight pile of imputed points at one value reveals deterministic or overly concentrated imputation. ```{r, fig.width=7, fig.height=4} plot(i_knn, type = "strip", variable = "bmi") ``` Bivariate diagnostics examine whether the imputed rows preserve relationships between variables. The formula interface follows the familiar `y ~ x` pattern, with an optional conditioning variable after `|`. Observed points are shown separately from rows where either plotted variable had to be imputed. ```{r, fig.width=7, fig.height=4} plot(i_knn, type = "xy", formula = bmi ~ age | sex) ``` For categorical variables, density and boxplot diagnostics are not meaningful. The proportion plot compares the category distribution among observed values with the category distribution generated within each imputation. A stratified formula can be used when a categorical discrepancy may be explained by another observed variable. ```{r, fig.width=7, fig.height=4} plot(i_knn, type = "proportion", variable = "group") plot(i_knn, type = "proportion", formula = group ~ sex) ``` ## Pooling Across Imputations The purpose of generating $m > 1$ completed datasets is to obtain valid standard errors that reflect uncertainty about the missing data. Analysing each completed dataset separately and then combining results using Rubin's rules [@rubin1987; @marshall2009pooling] accomplishes this. `pool()` is designed for post-fit quantities computed after imputation. The statistical target is the quantity, not the storage container. A quantity may be a scalar estimate, a vector of regression coefficients, a covariance-aware parameter vector, a matrix of survival probabilities, or a scalar performance metric. Data frames are accepted only as a tidy adapter for scalar estimates that are naturally stored one row per term and imputation. For a single quantity, let $Q_k$ be its estimate and $U_k$ the complete-data variance estimate from completed dataset $k$. The pooled point estimate and between-imputation variance are: $$ \bar Q = \frac{1}{m}\sum_{k=1}^{m}Q_k, \qquad B = \frac{1}{m-1}\sum_{k=1}^{m}(Q_k-\bar Q)^2. $$ The total variance combines within- and between-imputation components: $$ \bar U = \frac{1}{m}\sum_{k=1}^{m}U_k, \qquad T = \bar U + \left(1+\frac{1}{m}\right)B. $$ The term $(1+m^{-1})B$ is the excess variance attributable to missingness. The fraction of missing information, $\lambda = (1+m^{-1})B/T$, quantifies how much the missingness inflates uncertainty relative to a complete-data analysis; variables with large $\lambda$ are those for which inference is most sensitive to the imputation model. The result is a `mimar_pool` object whose `pooled` table contains the pooled quantity, total standard error, confidence interval, and variance diagnostics for each `term`. ```{r} pool(c(0.10, 0.11, 0.09), std.error = c(0.04, 0.05, 0.04), name = "age") ``` For several coefficients from the same fitted model, the vector and covariance matrix should be pooled together. This keeps the joint uncertainty structure, which is needed for multivariate Wald tests and downstream contrasts. ```{r} betas <- list( c(age = 0.10, bmi = 0.30), c(age = 0.11, bmi = 0.32), c(age = 0.09, bmi = 0.29) ) covariances <- list( diag(c(0.04, 0.08)^2), diag(c(0.05, 0.09)^2), diag(c(0.04, 0.08)^2) ) pooled_betas <- pool(betas, covariance = covariances) pooled_betas pooled_betas$estimate pooled_betas$variance ``` Matrix-valued quantities are supplied as a list with one matrix per imputation. For example, a survival probability matrix might have time points in rows and covariate profiles in columns. Unless a full joint covariance is available, `pool()` treats each cell as a scalar estimand and pools element by element. ```{r} survival_probabilities <- list( matrix(c(0.90, 0.80, 0.70, 0.60), nrow = 2), matrix(c(0.91, 0.79, 0.72, 0.61), nrow = 2), matrix(c(0.89, 0.81, 0.71, 0.59), nrow = 2) ) pooled_survival <- pool(survival_probabilities) pooled_survival pooled_survival$estimate ``` The tidy data-frame adapter remains useful for model outputs that already have one scalar row per term and imputation. In the next example there are six input rows because two quantities, `age` and `bmi`, are estimated in each of three imputations. `pool()` returns two pooled quantities: one row for `age` and one row for `bmi`. ```{r} external_results <- data.frame( term = rep(c("age", "bmi"), each = 3), estimate = c(0.10, 0.11, 0.09, 0.30, 0.32, 0.29), std.error = c(0.04, 0.05, 0.04, 0.08, 0.09, 0.08), imputation = rep(1:3, times = 2) ) p <- pool(external_results) p ``` When no reliable complete-data variance is available, as is common for bounded performance measures, Marshall et al. recommend robust summaries. Accordingly, `pool()` reports the median, interquartile range, and range by default for such quantities. Use `rule = "mean"` only when a mean and between-imputation standard error are substantively appropriate. ## Limitations and Future Work `mimar` is intentionally compact. It standardizes missingness description, amputation, imputation, diagnostics, evaluation, and pooling, but it does not yet expose all controls available in larger imputation systems. In particular, there is no full predictor-matrix interface, passive-imputation grammar for deterministic derived variables, formal convergence statistic, or automatic tuning layer for learner-backed imputers. The learner-backed methods should be interpreted as supervised update rules inside a chained stochastic workflow. They can improve predictive recovery, but they do not by themselves guarantee congeniality with a downstream analysis model or valid uncertainty quantification in every setting. For inferential analyses, users should compare methods, inspect trace and distribution diagnostics, evaluate recovery when benchmark truth is available, and assess whether substantive conclusions are robust to plausible imputation choices. ## References