pyampute.ampute module
Transformer for generating multivariate missingness in complete datasets
- class pyampute.ampute.MultivariateAmputation(prop=0.5, patterns=None, std=True, verbose=False, seed=None, lower_range=-3, upper_range=3, max_diff_with_target=0.001, max_iter=100)
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Generating multivariate missingness patterns in complete datasets
n = number of samples.
m = number of features/variables.
k = number of patterns.
Amputation is the opposite of imputation: it generates missing values in complete datasets. This is useful for evaluating the effect of missing values on your model, possibly in a larger pipeline that combines various imputation methods, estimators and evaluation metrics.
We provide several examples and an extensive blogpost to explain in more detail how certain parameter choices affect the generated missingness.
Our class is compatible with the scikit-learn-style fit and transform paradigm and can be used in a scikit-learn Pipeline (see this example).
- Parameters
prop (float, default : 0.5) – Proportion of incomplete data rows as a decimal or percent.
patterns (List[Dict], default: DEFAULT_PATTERN) – List of k dictionaries. Each dictionary has the following key-value pairs:
- incomplete_vars (Union[ArrayLike[int], ArrayLike[str]]) –
Indicates which variables should be amputed. Use a list of int for variable indices or a list of str for column names. observed_vars is the complement of incomplete_vars.
- weights (Union[ArrayLike[float], Dict[int, float], Dict[str, float]], default: all 0s (MCAR) or observed_vars weight 1 (MAR) or incomplete_vars weight 1 (MNAR)) –
Specifies the (relative) size of the effect of each specified variable on the missing variables. If using an array, you must specify all m weights. If using a dictionary, the keys are either variable indices or column names; unspecified variables are assumed to have a weight of 0. Negative values have a decreasing effect, 0s indicate no role in missingness, and positive values have an increasing effect. The weighted score for sample i in pattern k is the inner product of the weights and sample[i]. Note: weights must be defined if the corresponding mechanism is MAR+MNAR.
- mechanism (str, {MAR, MCAR, MNAR, MAR+MNAR}) –
Case insensitive. MAR+MNAR is only possible by passing a custom weight array.
- freq (float [0,1], default: all patterns with equal frequency (1/k)) –
Relative occurrence of a pattern with respect to other patterns. All frequencies across k dicts/patterns must sum to 1. Either specify freq for all patterns, or for none to get the default. For example (k = 3 patterns), freq := [0.4, 0.4, 0.2] means that of all rows with missing values, 40% should have pattern 1, 40% pattern 2, and 20% pattern 3.
- score_to_probability_func (Union[str, Callable[ArrayLike[floats] -> ArrayLike[floats]]], {“sigmoid-right”, “sigmoid-left”, “sigmoid-mid”, “sigmoid-tail”, Callable}) –
Converts standardized weighted scores for each data row (in a data subset corresponding to pattern k) to a probability of missingness. Choosing one of the sigmoid options (case insensitive) applies a sigmoid function with a logit cutoff per pattern. The sigmoid functions dictate that a [high, low, average, extreme] score (respectively) has a high probability of amputation. The sigmoid functions will be shifted to ensure correct joint missingness probabilities. Custom functions must accept arrays with values in (-inf, inf) and output values in [0,1]. We will not shift custom functions; refer to Amputing with a custom probability function for more.
std (bool, default : True) – Whether or not to standardize data before computing weighted scores. Standardization ensures that weights can be interpreted relative to each other. Do not standardize if the train and test split is done after amputation (to prevent leakage).
verbose (bool, default : False) – Toggle on to see INFO level logging information.
seed (int, optional) – If you want reproducible results during amputation, set an integer seed. If you don’t set it, a random number will be produced every time.
lower_range (float, default : -3) – Lower limit in range when searching for horizontal shift of score_to_probability_func.
upper_range (float, default : 3) – Upper limit in range when searching for horizontal shift of score_to_probability_func.
max_diff_with_target (float, default : 0.001) – The allowable error between the desired percent of missing data (prop) and the calculated joint missingness probability.
max_iter (int, default : 100) – Max number of iterations for binary search when searching for horizontal shift of score_to_probability_func.
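The interplay between weights, std, score_to_probability_func and the search parameters (lower_range, upper_range, max_diff_with_target, max_iter) can be illustrated with a small NumPy sketch. This is not pyampute's internal code: the helper names sigmoid_right and find_shift are hypothetical, and the sketch assumes a single pattern whose mean missingness probability should match prop.

```python
import numpy as np


def sigmoid_right(scores, shift):
    # Hypothetical helper: high scores get a high probability of missingness.
    return 1.0 / (1.0 + np.exp(-(scores + shift)))


def find_shift(scores, prop, lower_range=-3.0, upper_range=3.0,
               max_diff_with_target=0.001, max_iter=100):
    # Binary search for a horizontal shift so that the mean probability
    # is within max_diff_with_target of prop (illustrative only).
    lo, hi = lower_range, upper_range
    shift = (lo + hi) / 2
    for _ in range(max_iter):
        shift = (lo + hi) / 2
        mean_prob = sigmoid_right(scores, shift).mean()
        if abs(mean_prob - prop) <= max_diff_with_target:
            break
        if mean_prob < prop:
            lo = shift  # probabilities too low: move the shift up
        else:
            hi = shift  # probabilities too high: move the shift down
    return shift


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))   # n = 1000 samples, m = 3 features
weights = np.array([1.0, 0.0, -0.5])  # one weight per feature

# std=True analogue: standardize columns so weights are comparable.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Weighted score per sample: inner product of the weights and the row.
scores = X_std @ weights
scores = (scores - scores.mean()) / scores.std()

shift = find_shift(scores, prop=0.3)
probs = sigmoid_right(scores, shift)
```

In this sketch, rows with a large value in the first feature and a small value in the third receive the highest missingness probabilities, while the second feature plays no role.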
- DEFAULT_PATTERN
If patterns are not passed, the default is the following:
{ "incomplete_vars": random 50% of vars, "mechanism": "MAR", "freq": 1, "score-to-prob": "sigmoid-right" }
- Type
Dict[str, Any]
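For comparison with the default, a custom patterns list for a dataset with four variables might look like the sketch below. The variable indices, weights and frequencies are made up for illustration; only the dictionary keys come from the parameter description of this class.

```python
import math

# Three illustrative patterns (k = 3) for a dataset with 4 variables.
patterns = [
    # Pattern 1: variable 0 goes missing completely at random.
    {"incomplete_vars": [0], "mechanism": "MCAR", "freq": 0.4},
    # Pattern 2: variables 1 and 2 go missing depending on the observed
    # variables 0 and 3; low weighted scores are amputed more often.
    {
        "incomplete_vars": [1, 2],
        "mechanism": "MAR",
        "weights": {0: 1.0, 3: -0.5},
        "freq": 0.4,
        "score_to_probability_func": "sigmoid-left",
    },
    # Pattern 3: variable 3 goes missing depending on its own values.
    {"incomplete_vars": [3], "mechanism": "MNAR", "freq": 0.2},
]

# The frequencies across all patterns must sum to 1.
total_freq = sum(p["freq"] for p in patterns)
assert math.isclose(total_freq, 1.0)
```

A list like this would be passed as MultivariateAmputation(prop=0.3, patterns=patterns); keys left out of a pattern fall back to the values in DEFAULTS.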
- DEFAULTS
Default values used, especially if values are not passed for parameters in certain patterns (not to be confused with patterns not being specified at all).
- Type
Dict[str, Any]
See also
mdPatterns
Displays missing data patterns in incomplete datasets
MCARTest
Statistical hypothesis test for Missing Completely At Random (MCAR)
Notes
The methodology for multivariate amputation was proposed by Schouten et al. (2018). For a more thorough understanding of how the input parameters can be used, read this blogpost. Multivariate amputation is also implemented in R, as the function mice::ampute.
Examples
>>> import numpy as np
>>> from pyampute.ampute import MultivariateAmputation
>>> seed = 2022  # any integer; makes the generated data reproducible
>>> n = 1000  # number of samples
>>> m = 10  # number of features
>>> rng = np.random.default_rng(seed)
>>> X_compl = rng.standard_normal((n, m))
>>> ma = MultivariateAmputation()
>>> X_incompl = ma.fit_transform(X_compl)
- DEFAULTS = {'lower_range': -3, 'max_diff_with_target': 0.001, 'max_iter': 100, 'mechanism': 'MAR', 'score_to_probability_func': 'SIGMOID-RIGHT', 'upper_range': 3}
- fit(X, y=None)
Fits amputer on complete data X.
Validates input data and parameter settings.
- Parameters
X (Matrix) – Matrix of shape (n, m). Complete input data, where n is the number of data rows (samples) and m is the number of features (columns, variables). Data cannot contain missing values and should be numeric; otherwise it will be forced to be numeric.
y (ArrayLike) – Ignored. Not used, present here for consistency.
- Return type
- transform(X, y=None)
Masks data according to the desired pattern and returns the incomplete data X.
- Parameters
X (Matrix) – Matrix of shape (n, m). Complete input data, where n is the number of data rows (samples) and m is the number of features (columns, variables). Data cannot contain missing values and should be numeric; otherwise it will be forced to be numeric.
y (ArrayLike) – Ignored. Not used, present here for consistency.
- Returns
X_incomplete – Matrix of shape (n, m). Incomplete data masked according to parameters.
- Return type
Matrix
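After amputation it can be useful to check that the result matches the requested settings. The sketch below shows two such checks with plain NumPy. The helper names incomplete_row_fraction and missingness_patterns are hypothetical, and X_incomplete is a small hand-built stand-in for the output of transform, assuming missing values are encoded as NaN.

```python
import numpy as np


def incomplete_row_fraction(X):
    # Fraction of rows with at least one missing value (compare to prop).
    return np.isnan(X).any(axis=1).mean()


def missingness_patterns(X):
    # Distinct missingness patterns among incomplete rows and their counts.
    # In each pattern row, 1 marks a missing value and 0 an observed one.
    mask = np.isnan(X).astype(int)
    incomplete = mask[mask.any(axis=1)]
    return np.unique(incomplete, axis=0, return_counts=True)


# Hand-built stand-in for an amputed matrix of shape (n, m) = (4, 3).
X_incomplete = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, 6.0],
    [np.nan, 8.0, 9.0],
    [10.0, 11.0, 12.0],
])

fraction = incomplete_row_fraction(X_incomplete)
unique_patterns, counts = missingness_patterns(X_incomplete)
```

Comparing fraction to prop, and the relative counts to the freq values of each pattern, gives a quick sanity check on a real amputed dataset.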