We present an R function to generate missing values in complete datasets. Such an amputation procedure is useful for accurately evaluating the effect of missing data on analysis outcomes. The R function ampute is available in the multiple imputation package mice. Van Buuren’s book (2018) gives an extensive overview of missing data methodology and the multiple imputation algorithm MICE. In this tutorial, we focus on amputation, which is the generation of missing values in complete data and, as such, the opposite of imputation.
This tutorial covers
For a theoretical justification and a demonstration of the method, we refer to Schouten, Lugtig and Vink (2018); please use this paper as your reference when citing the method. The paper discusses how missing data methods are evaluated in four steps:
Obviously, the second step in this procedure (amputation) is very important, since the amputation procedure determines the severity of the missing data problem. Before ampute existed, a proper amputation procedure was not available. Therefore, most simulation studies were performed with missing data that were missing completely at random (MCAR). However, in real-world problems the MCAR assumption is often unlikely to hold, and missing data methods need to handle MAR and MNAR mechanisms as well. Hence, we needed an amputation procedure that could create severe MAR and MNAR missingness.
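To make the distinction between the mechanisms concrete, the following base R sketch contrasts MCAR and MAR missingness in a single variable. It does not use ampute; the variables x and y, the coefficient 0.7 and the logistic form of the MAR probability are assumptions chosen only for illustration.

```r
set.seed(42)

# Hypothetical complete data: x is always observed, y will receive missing values
n <- 10000
x <- rnorm(n)
y <- 0.7 * x + rnorm(n)

# MCAR: every value of y has the same 50% chance of being missing
y_mcar <- ifelse(runif(n) < 0.5, NA, y)

# MAR: the chance that y is missing increases with the observed x
# (the logistic form is an illustrative assumption)
y_mar <- ifelse(runif(n) < plogis(2 * x), NA, y)

# Compare the mean of y in the full data with the complete-case means
means <- c(full = mean(y),
           mcar = mean(y_mcar, na.rm = TRUE),
           mar  = mean(y_mar, na.rm = TRUE))
print(round(means, 2))
```

Under these assumptions the MCAR complete-case mean should stay close to the full-data mean, whereas the MAR complete-case mean should be clearly lower, because cases with high x (and therefore high y) are more likely to be missing. This is why a simulation study based on MCAR alone can understate the severity of a missing data problem.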
An example of how ampute can be used to evaluate missing data methods can be found in Schouten and Vink (2018). With ampute it is straightforward to generate missing values in multivariate datasets, with any desired proportion, varying underlying mechanisms, different missingness patterns and varying data distributions.
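As a minimal sketch of such a call, the example below amputes a small simulated dataset. The data (three correlated normal variables), the seed and the chosen values prop = 0.3 and mech = "MAR" are assumptions for illustration; the amputed data are returned in the amp element of the result.

```r
library(mice)

set.seed(2018)

# Hypothetical complete dataset with three correlated numeric variables
n <- 500
x1 <- rnorm(n)
complete_data <- data.frame(
  x1 = x1,
  x2 = 0.5 * x1 + rnorm(n),
  x3 = 0.3 * x1 + rnorm(n)
)

# Ampute roughly 30% of the rows under a MAR mechanism,
# leaving all other arguments at their defaults
result <- ampute(complete_data, prop = 0.3, mech = "MAR")
incomplete_data <- result$amp

# Share of rows with at least one missing value
print(mean(!complete.cases(incomplete_data)))

# Number of missing values per variable
print(colSums(is.na(incomplete_data)))
```

Because the assignment of missing values is random, the observed share of incomplete rows should be close to, but not exactly, the requested proportion.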
We will now discuss the multivariate amputation procedure that underlies ampute. Then, we will discuss the function’s arguments and some additional features. Finally, we propose solutions for special cases such as mixed missingness mechanisms and amputation in datasets with a large number of variables.
The multivariate amputation procedure is built on an initial idea proposed by (1999) and adapted to be more generic and easy to use in Schouten, Lugtig and Vink (2018). Figure 1 shows a schematic overview of the resulting amputation procedure. On the left, the method requires a complete dataset of \(n\) participants and \(m\) variables. On the right, multiple subsets with either incomplete or complete data are merged, resulting in an incomplete version of the original dataset.
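The split, ampute and merge flow described above can be sketched in base R as follows. This is a simplified illustration and not the implementation used in ampute: the two patterns, the target proportion, the equally weighted standardized sum score and the logistic probability function are all assumptions made for this example.

```r
set.seed(1)

# Hypothetical complete data: n = 300 participants, m = 3 variables
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# Assumed missingness patterns (1 = stays observed, 0 = becomes missing)
patterns <- rbind(c(0, 1, 1),   # pattern 1: x1 is amputed
                  c(1, 0, 0))   # pattern 2: x2 and x3 are amputed
prop <- 0.4                     # assumed target share of incomplete rows

# Step 1: assign every row at random to one pattern (one subset per pattern)
subset_id <- sample(nrow(patterns), n, replace = TRUE)

amputed <- dat
for (k in seq_len(nrow(patterns))) {
  rows <- which(subset_id == k)
  observed <- patterns[k, ] == 1

  # Step 2: MAR-type score built only from variables that stay observed
  score <- rowSums(scale(dat[rows, observed, drop = FALSE]))

  # Step 3: higher scores get a higher probability of being amputed;
  # the intercept is tuned so that the mean probability equals prop
  f <- function(a) mean(plogis(a + score)) - prop
  intercept <- uniroot(f, interval = c(-20, 20))$root
  make_missing <- rbinom(length(rows), 1, plogis(intercept + score)) == 1

  # Step 4: apply the pattern to the selected rows of this subset
  amputed[rows[make_missing], !observed] <- NA
}

# Step 5: the subsets are already merged, because `amputed` keeps the row order
print(mean(!complete.cases(amputed)))
```

In this sketch each subset contains both rows that remain complete and rows that receive the subset's pattern, so the merged result is an incomplete version of the original dataset with the same dimensions and row order.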