pyampute.exploration

In the pyampute.exploration folder, we provide functionality for inspecting incomplete datasets.

mdPatterns displays all unique missing data patterns in an incomplete dataset, which is useful for investigating whether there is structure (or a good explanation) for the occurrence of missing data.

MCARTest performs a statistical hypothesis test to evaluate whether it is likely that missing data has a Missing Completely At Random (MCAR) mechanism.

pyampute.exploration.md_patterns module

Displays missing data patterns in incomplete datasets

class pyampute.exploration.md_patterns.mdPatterns

Bases: object

Displays missing data patterns in incomplete datasets

Extracts all unique missing data patterns in an incomplete dataset and creates a visualization. In the returned table, 0 and 1 refer to missing and observed values, respectively; the plot shows missing values in red and observed values in blue.

Parameters

None (currently no parameters are available).

md_patterns

k is the number of unique missing data patterns and m the number of dataset columns (features). 0 and 1 correspond to missing and observed values respectively. The first row displays the data rows with no missing values and the last row gives column totals. The first column displays the count or proportion of rows that follow a pattern and the last column displays the number of missing values per pattern.

Type

pandas DataFrame of shape (k+2, m+2)
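The layout described above can be checked by hand against the nhanes2 output shown under Examples below. The following is a minimal sketch, not part of pyampute: it re-enters the k = 5 pattern rows of that output and recomputes the margins, assuming that a proportion is simply the row count divided by the total number of rows. All variable names are illustrative.

```python
import pandas as pd

cols = ["age", "hyp", "bmi", "chl"]

# Pattern rows copied from the nhanes2 example output (1 = observed, 0 = missing).
patterns = pd.DataFrame(
    [
        [13, 1, 1, 1, 1],
        [3, 1, 1, 1, 0],
        [1, 1, 1, 0, 1],
        [1, 1, 0, 0, 1],
        [7, 1, 0, 0, 0],
    ],
    columns=["row_count", *cols],
)

# Last column of the table: missing values per pattern = zeros in that row.
n_missing_per_pattern = (patterns[cols] == 0).sum(axis=1)

# Last row of the table: missing values per column = number of data rows
# whose pattern has a 0 in that column.
n_missing_per_col = (1 - patterns[cols]).mul(patterns["row_count"], axis=0).sum()
total_missing = int(n_missing_per_col.sum())

# Assumed meaning of count_or_proportion="proportion": count / total rows.
proportions = patterns["row_count"] / patterns["row_count"].sum()

print(n_missing_per_pattern.tolist())  # per-pattern totals
print(n_missing_per_col.to_dict())     # per-column totals
print(total_missing)                   # bottom-right cell
```

The recomputed margins match the printed table: 0, 1, 1, 2, 3 missing values per pattern, 0, 8, 9, 10 per column, and 27 in total.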

See also

MultivariateAmputation

Transformer for generating multivariate missingness in complete datasets

Notes

This class is useful for investigating any structure in an incomplete dataset, and can help to identify possible causes of the missingness or ways to handle it. We follow the logic of a comparable R function, mice::md.pattern.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from pyampute.exploration.md_patterns import mdPatterns
>>> nhanes2 = pd.read_csv("data/nhanes2.csv")
>>> mdp = mdPatterns()
>>> patterns = mdp.get_patterns(nhanes2)
>>> print(patterns)
                        row_count  age  hyp  bmi  chl  n_missing_values
rows_no_missing                 13    1    1    1    1                 0
1                                3    1    1    1    0                 1
2                                1    1    1    0    1                 1
3                                1    1    0    0    1                 2
4                                7    1    0    0    0                 3
n_missing_values_per_col              0    8    9   10                27
get_patterns(X, count_or_proportion='count', show_plot=True)

Extracts and visualizes missing data patterns in an incomplete dataset

Parameters
  • X (Matrix of shape (n, m)) – Dataset with missing values. n rows (samples) and m columns (features).

  • count_or_proportion (str, {"count", "proportion"}, default : "count") – Whether the number of rows should be specified as a count or a proportion.

  • show_plot (bool, default : True) – Whether a plot should be displayed using plt.show.

Returns

md_patterns – k is the number of unique missing data patterns and m the number of dataset columns (features). The first row displays the data rows with no missing values and the last row gives column totals. The first column displays the count or proportion of rows that follow a pattern; the last column displays the number of missing values per pattern.

Return type

pandas DataFrame of shape (k+2, m+2)
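To make the idea of a missing data pattern concrete, the sketch below counts unique patterns with plain pandas. It is an illustration under stated assumptions, not pyampute's implementation: the function name sketch_md_patterns and the toy data are hypothetical, the sort order is a choice made here, and the summary row with column totals and the plot are omitted.

```python
import numpy as np
import pandas as pd


def sketch_md_patterns(X: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: count unique patterns (1 = observed, 0 = missing)."""
    cols = list(X.columns)
    observed = X.notna().astype(int)

    # One row per unique pattern, with the number of data rows that follow it.
    patterns = observed.groupby(cols).size().reset_index(name="row_count")

    # Missing values per pattern = number of zeros in the indicator row.
    patterns["n_missing_values"] = len(cols) - patterns[cols].sum(axis=1)

    # Fewest missing values first, so a complete-case pattern (if any) leads.
    patterns = patterns.sort_values(
        ["n_missing_values", "row_count"], ascending=[True, False]
    ).reset_index(drop=True)
    return patterns[["row_count", *cols, "n_missing_values"]]


# Hypothetical toy data with three distinct patterns.
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [1.0, np.nan, 3.0, np.nan, 5.0],
        "c": [np.nan, np.nan, 3.0, np.nan, 5.0],
    }
)
result = sketch_md_patterns(toy)
print(result)
```

Every data row belongs to exactly one pattern, so the row_count column always sums to the number of rows in X.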

pyampute.exploration.mcar_statistical_tests module

Statistical hypothesis test for Missing Completely At Random (MCAR)

class pyampute.exploration.mcar_statistical_tests.MCARTest(method='little')

Bases: object

Statistical hypothesis test for Missing Completely At Random (MCAR)

Performs Little’s MCAR test (see Little, R.J.A. (1988)). Null hypothesis: data is Missing Completely At Random (MCAR). Alternative hypothesis: data is not MCAR.
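For orientation, the test statistic from Little (1988) can be written as follows. The notation is chosen here for illustration and is not taken from the pyampute source: J is the number of missing data patterns, n_j the number of rows in pattern j, p_j the number of variables observed in pattern j, p the total number of variables, \bar{y}_j the mean of the observed variables within pattern j, and \hat{\mu}_j and \hat{\Sigma}_j the maximum likelihood estimates of the overall mean and covariance, restricted to the variables observed in pattern j.

```latex
d^2 = \sum_{j=1}^{J} n_j \,
      (\bar{y}_j - \hat{\mu}_j)^{\top} \,
      \hat{\Sigma}_j^{-1} \,
      (\bar{y}_j - \hat{\mu}_j),
\qquad
\nu = \sum_{j=1}^{J} p_j - p
```

Under the null hypothesis, d^2 is asymptotically chi-square distributed with \nu degrees of freedom, so the p-value is the probability that a chi-square variable with \nu degrees of freedom exceeds d^2.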

Parameters

method (str, {"little", "ttest"}, default : "little") – Whether to perform a chi-square test on the entire dataset (“little”) or separate t-tests for every combination of variables (“ttest”).

See also

mdPatterns

Displays missing data patterns in incomplete datasets

MultivariateAmputation

Transformer for generating multivariate missingness in complete datasets

Notes

We advise using Little’s MCAR test with care. Rejecting the null hypothesis does not always mean that data is not MCAR, nor is failing to reject the null hypothesis a guarantee that data is MCAR. See Schouten et al. (2021) for a thorough discussion of missingness mechanisms.

Examples

>>> import pandas as pd
>>> from pyampute.exploration.mcar_statistical_tests import MCARTest
>>> data_mcar = pd.read_table("data/missingdata_mcar.csv")
>>> mt = MCARTest(method="little")
>>> print(mt.little_mcar_test(data_mcar))
0.17365464213775494
static little_mcar_test(X)

Implementation of Little’s MCAR test

Parameters

X (Matrix of shape (n, m)) – Dataset with missing values. n rows (samples) and m columns (features).

Returns

pvalue – The p-value of a chi-square hypothesis test. Null hypothesis: data is Missing Completely At Random (MCAR). Alternative hypothesis: data is not MCAR.

Return type

float
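The returned p-value still has to be compared against a significance level chosen by the user. The helper below is a hypothetical sketch of that step, not a pyampute function; the default level of 0.05 is a common convention and an assumption here. The wording of the second message reflects the caution in the Notes: a large p-value is not proof of MCAR.

```python
def interpret_mcar_pvalue(pvalue: float, alpha: float = 0.05) -> str:
    """Hypothetical helper: phrase the outcome of an MCAR test cautiously."""
    if not 0.0 <= pvalue <= 1.0:
        raise ValueError("pvalue must lie in [0, 1]")
    if pvalue < alpha:
        return "reject H0: evidence against MCAR"
    return "fail to reject H0: no evidence against MCAR, which is not proof of MCAR"


# p-value printed in the little_mcar_test example above.
message = interpret_mcar_pvalue(0.17365464213775494)
print(message)
```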

static mcar_t_tests(X)

Performs t-tests for MCAR for each pair of features.

Parameters

X (Matrix of shape (n, m)) – Dataset with missing values. n rows (samples) and m columns (features).

Returns

pvalues – The p-values of t-tests for each pair of features. Null hypothesis for cell pvalues[h, j]: data in feature h is Missing Completely At Random (MCAR) with respect to feature j, for all h, j in {1, 2, ..., m}. The diagonal has no values.

Return type

pandas DataFrame of shape (m, m)
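The comparison behind each cell can be sketched as follows: for a pair (h, j), the observed values of feature j are split by whether feature h is missing, and the two groups are compared. This is a minimal illustration, not pyampute's code. It assumes a Welch (unequal variance) comparison and stops at the t statistic; turning a statistic into a p-value additionally requires the t distribution, for example via scipy.stats.ttest_ind with equal_var=False. The function name and toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def welch_t_matrix_sketch(X: pd.DataFrame) -> pd.DataFrame:
    """Illustrative only: Welch t statistics per (h, j) pair, not p-values."""
    cols = list(X.columns)
    tstats = pd.DataFrame(np.nan, index=cols, columns=cols)
    for h in cols:
        h_missing = X[h].isna()
        for j in cols:
            if h == j:
                continue  # h has no observed values where h itself is missing
            a = X.loc[h_missing, j].dropna()   # values of j where h is missing
            b = X.loc[~h_missing, j].dropna()  # values of j where h is observed
            if len(a) < 2 or len(b) < 2:
                continue  # too few values to estimate a variance
            se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
            if se > 0:
                tstats.loc[h, j] = (a.mean() - b.mean()) / se
    return tstats


# Hypothetical toy data: y is missing exactly where x is small.
toy = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "y": [np.nan, np.nan, np.nan, 4.0, 5.0, 6.0],
    }
)
tstats = welch_t_matrix_sketch(toy)
print(tstats)
```

In the toy data, the cell for h = y and j = x is strongly negative because y is missing only for small x, which is the kind of dependence the pairwise tests are meant to flag. The cell for h = x stays empty because x has no missing values.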