We present an R-function to generate missing values in complete datasets. Such an amputation procedure is useful to accurately evaluate the effect of missing data on analysis outcomes.
R-function ampute
is available in multiple imputation package mice. Van Buuren’s book (2018) gives an extensive overview of missing data methodology and multiple imputation algorithm MICE. In this tutorial, we will focus on amputation, which is the generation of missing values in complete data and as such, the opposite of imputation.
This tutorial covers
For a theoretical justification and a demonstration of the method, we refer to Schouten, Lugtig and Vink (2018) (use this paper as your reference). The paper discusses how missing data methods are evaluated in four steps:
Obiously, the second step in this procedure (amputation) is very important, since the amputation procedure determines the severity of the missing data problem. Before the existence of ampute
, a proper amputation procedure was not available. Therefore, most simulation studies were performed with completely random missing data (MCAR). However, in real world problems the MCAR assumption is often unlikely and missing data methods need to handle MAR and MNAR mechanisms as well. Hence, we needed an amputation procedure that could create severe MAR and MNAR missingness: ampute
!
An example of how ampute
can be used to evaluate missing data methods can be found in Schouten and Vink (2021). With ampute
it is straightforward to generate missing values in multivariate datasets, with any desired proportion, varying underlying mechanisms, different missingness patterns and varying data distributions.
We will now discuss the multivariate amputation procedure that underlies ampute
. Then, we will discuss the function’s arguments and some additional features. In the end, we propose solutions for special cases such as mixed missingness mechanisms and amputation in datasets with a large number of variables.
The multivariate amputation procedure is built on an initial idea proposed by (1999) and adapted to be more generic and easy to use in Schouten, Lugtig and Vink (2018). Figure 1 shows a schematic overview of the resulting amputation procedure. On the left, the method requires a complete dataset of \(n\) participants and \(m\) variables. On the right, multiple subsets with either incomplete or complete data are merged, resulting in an incomplete version of the original dataset.
The core of the procedure lies in the missingness patterns. A missing data pattern is a particular combination of variables with missing values and variables that remain complete. Based on the number of missing data patterns \(k\), the complete dataset is randomly divided into \(k\) subsets. The size of these subsets may differ between the patterns using a so-called frequency vector.
All the data rows in a certain subset are candidates for a certain missing data pattern. Whether or not the candidates will become incomplete depends on a combination of factors.
One of those factors is the missingness mechanism. In case of MCAR missingness, all data rows have the same probability of being amputed. For MAR and MNAR missingness, we will use so-called weighted sum scores. In essence, MAR misingness occurs when the information about the missing data is in the observed data. In case of MNAR missingness, the information about the missing data is missing itself. For a discussion of how these mechanisms interact with observed data, see Schouten and Vink (2021).
A weighted sum score is simply the outcome of a linear regression equation where the coefficients are determined by the user. We will discuss this a bit more below. For now it suffices to say that based on its weighted sum score, a candidate obtains a certain probability of being amputed. These probabilities are assigned using one of four logistic distribution types.
In the end, in every subset the specified proportion of data rows is made incomplete according to the missing data pattern of their candidacy. All subsets are merged and we now have one incomplete dataset, amputed according to the parameters specified by the user! We will now discuss these parameters in a bit more detail.
ampute
and its argumentsAlthough all parameters are connected, ampute
provides a way to manipulate the missing data generation procedure without influencing other parameters. Figure 2 shows the most important arguments and the order in which we will discuss them.
In short, ampute
’s arguments are used for the following:
Use the help function to read ampute
’s documentation. The function is available in multiple imputation package mice.
The first argument data
is an input argument for a complete dataset. In this tutorial, as in many simulation studies, we will randomly generate a dataset to be our complete dataset. Here, we will use function mvrnorm
from R-package MASS to sample from a multivariate normal distribution.
set.seed(2016)
testdata <- as.data.frame(MASS::mvrnorm(n = 10000,
mu = c(10, 5, 0),
Sigma = matrix(data = c(1.0, 0.2, 0.2,
0.2, 1.0, 0.2,
0.2, 0.2, 1.0),
nrow = 3,
byrow = T)))
summary(testdata)
## V1 V2 V3
## Min. : 6.346 Min. :0.9789 Min. :-3.799124
## 1st Qu.: 9.317 1st Qu.:4.3199 1st Qu.:-0.690542
## Median :10.003 Median :4.9963 Median :-0.009941
## Mean : 9.998 Mean :4.9973 Mean : 0.000759
## 3rd Qu.:10.677 3rd Qu.:5.6759 3rd Qu.: 0.673069
## Max. :13.770 Max. :8.6579 Max. : 3.810729
We can immediately generate missing values by calling ampute
. The resulting object is of class mads
and contains the default values that are used as arguments. It is important to know that the incomplete dataset is stored under object amp
in class mads
.
## [1] "mads"
## V1 V2 V3
## 1 NA 6.018999 0.3681981
## 2 9.668991 4.391821 -1.1127595
## 3 10.273415 4.662521 0.1796964
## 4 10.124800 4.055544 NA
## 5 NA 7.171578 1.7141378
## 6 10.803311 5.038649 NA
Apart from the argument values and the incomplete dataset, the mads
object contains the assigned subset for each data row (cand
), the weighted sum scores (scores
) and the original data (data
).
## [1] "call" "prop" "patterns" "freq" "mech" "weights"
## [7] "cont" "std" "type" "odds" "amp" "cand"
## [13] "scores" "data"
We can quickly investigate the incomplete dataset with function md.pattern
, where the resulting visualization shows the missing data in red and the observed data in blue. The first row always shows the complete cases, of which we have approximately 50%. Each subsequent row depicts a specific missing data pattern. By default, ampute
generates missing values in each variable. Note that because md.pattern
sorts the columns in increasing order of missing data proportion, the variables are displayed in a different order than in the dataset itself.
## V3 V1 V2
## 5014 1 1 1 0
## 1670 1 1 0 1
## 1670 1 0 1 1
## 1646 0 1 1 1
## 1646 1670 1670 4986
The argument prop
specifies the proportion of incomplete rows. As a default, the missingness proportion is 0.5:
## [1] 0.5
A proportion of 0.5 means that 50% of the data rows will have missing values. This is not the same as the proportion of missing cells, because incomplete cases will still have some observed values for some variables. The number of missing cells therefore depends on the missing data patterns that are specified.
To specify the proportion of missing cells, additional argument bycases
should be set to FALSE
.
## V3 V1 V2
## 4046 1 1 1 0
## 1998 1 1 0 1
## 1987 1 0 1 1
## 1969 0 1 1 1
## 1969 1987 1998 5954
As the testdata
contains 10000 * 3 = 30000 cells, a missing data proportion of 0.2 means that approximately 6000 cells will become missing. As the visualization shows, this is indeed the case. In combination with the current set of missing data patterns, the resulting proportion of incomplete cases is:
## [1] 0.6
The core idea of ampute
is the generation of missing data patterns. Each pattern is a combination of missingness on specific variables while other variables remain complete. For example, someone could have forgotten to fill in the last page of a questionnaire, which results in missing values for a specific set of questions/variables. Or a participant misses one or more waves in a longitudinal study. Thus, each pattern is a specific combination of incomplete and complete variables.
The patterns
argument uses a matrix to define these pattern. In this matrix, the patterns are placed on the rows and the variables on the columns. The value 0
is used for variables that should have missing values and 1
is used for complete variables.
## V1 V2 V3
## 1 0 1 1
## 2 1 0 1
## 3 1 1 0
By default, the number of patterns equals the number of variables, where in each pattern one variable contains missing values. Note that as a result, there are no cases with missing values on more than one variable. A case either has missingness on V1, V2 or V3 or remains complete.
We can manipulate the matrix by changing the values or by adding rows. For example, we can change the matrix by creating a missingness pattern where cases have missing values in V1 and V2 but not on V3 (pattern 2). Furthermore, we will add a fourth missing data pattern that generates missingness in V1 and V3.
## V1 V2 V3
## 1 0 1 1
## 2 0 0 1
## 3 1 1 0
## 4 0 1 0
Then, we again call the amputation procedure with the desired patterns matrix as its third argument. We inspect the result with the md.pattern
function.
## V2 V3 V1
## 5079 1 1 1 0
## 1216 1 1 0 1
## 1248 1 0 1 1
## 1200 1 0 0 2
## 1257 0 1 0 2
## 1257 2448 3673 7378
As explained before, the amputation procedure divides the complete dataset into multiple subsets. The number of these subsets is determined by the number of patterns. The size of the subsets, and therefore the relative occurrence of the missing data patterns, can be determined with a frequency vector.
Argument freq
is a vector with values between 0 and 1. The number of values determines the number of subsets and must be equal to the number of patterns.
## [1] 0.25 0.25 0.25 0.25
By default, the frequency vector has values of equal size (1/number of patterns). This means that all four subsets will have approximately the same size. We can adapt the frequency vector, for instance such that subset one becomes much larger than the other subsets.
Note that the frequency values should always sum to 1.0 in order to divide all the cases over the subsets.
## V2 V3 V1
## 4982 1 1 1 0
## 3555 1 1 0 1
## 524 1 0 1 1
## 473 1 0 0 2
## 466 0 1 0 2
## 466 997 4494 5957
As the visualization shows, there are now four missing data patterns and the first pattern occurs 7 times as often as the other three patterns.
At this point in the procedure, we are able to generate missing data with a specified proportion, the missingness patterns of interest and in the desired relative occurence of those patterns.
We will now decide what kind of missingness mechanism we will implement. As said before, there are three mechanisms: MCAR, MAR and MNAR. In case of MCAR missingness, all data rows have the same probability of being amputed. With MAR misingness, the information about the missing data is in the observed data and with MNAR missingness, the information about the missing data is missing itself. We refer to Schouten and Vink (2021) for a more thorough discussion of missingness mechanisms.
Argument mech
in function ampute
is a string with either MCAR
, MAR
or MNAR
. As a default:
## [1] "MAR"
For MAR and MNAR mechanisms, we calculate a weighted sum score for all data rows. This calculation depends on pre-specified weights
which are manipulated by the user through a matrix. The weights differ per pattern, and therefore, similar to the patterns matrix, the weights matrix contains the patterns on the rows and the variables on the columns (\(k\) by \(m\)).
A weighted sum score is simply the outcome of a linear regression equation where the coefficients are the values of the weights matrix. When data row \(i\) is a candidate for pattern \(k\), the weighted sum score is therefore:
\[\begin{equation*} wss_i = w_{k,1} \cdot y_{1,i} + w_{k,2} \cdot y_{2,i} + ... + w_{k,m} \cdot y_{m,i}, \end{equation*}\]
where \(\{y_{1,i}, y_{2,i}, ..., y_{m,i}\}\) is the set of variable values of case \(i\) and \(\{w_{k,1}, w_{k,2}, ..., w_{k,m}\}\) are the pre-specified weights on row \(k\) of the weights matrix. In our example, \(m=3\) and \(k\in\{1, 2, 3, 4\}\) because there are three variables and four missing data patterns.
In general, larger weights will give higher sum scores than smaller weights. For instance, if variables V1 and V2 have weights 4 and 2 respectively, V1’s influence on the sum scores is twice as large as that of V2. Note that the influence of the weights is relative; weight values of 0.4 and 0.2 will have the same effect.
Note that the direction of the weights is important as well. A positive weight will increase the weighted sum score while a negative weight will decrease it.
Note as well that each pattern receives its own weights and that the comparison of weights will happen within a pattern and not between patterns. For instance, variable V1 can have a weight of 4 in the first pattern, but a weight of -0.2 in the second pattern.
## V1 V2 V3
## 1 0 1 1
## 2 0 0 1
## 3 1 1 0
## 4 0 1 0
By default, the weights matrix assigns values of 1
to the variables that remain complete in a certain pattern, and 0
to variables that will become incomplete. As such, the default weights matrix contains a MAR mechanism where every variable has the same influence.
Instead, for pattern 1, we can give variable V2 a larger weight than variable V3.
By choosing the values 0.8 and 0.4, variable V2 is weighted twice as much as variable V3. For pattern 3, we will weight variable V1 thrice as much as variable V2.
## V1 V2 V3
## 1 0 0.8 0.4
## 2 0 0.0 1.0
## 3 3 1.0 0.0
## 4 0 1.0 0.0
The weights matrix is very powerful, because the user determines which variables determine the missingness (by a non-zero value) and which variables do not (by assigning a 0). As such, the weights matrix can also be used to switch from MAR to MNAR missingness, or even to create a combined mechanism.
We can see this when we change the mech
argument to MNAR
. It turns out that ampute
then uses a weighs matrix that shows 1
for all variables that should become incomplete according to the patterns matrix (0
in the patterns matrix).
## V1 V2 V3
## 1 0 1 1
## 2 0 0 1
## 3 1 1 0
## 4 0 1 0
## V1 V2 V3
## 1 1 0 0
## 2 1 1 0
## 3 0 0 1
## 4 1 0 1
Before we inspect the results of the weights matrix, we will quickly discuss the type
argument in ampute
. Here, we refer to the type of logistic probability distribution that is applied to the weighted sum scores, as shown in Figure 3. In ampute
, these types can be called by setting cont == TRUE
(this is by default the case) and then setting argument type
to LEFT
, MID
, RIGHT
or TAIL
.
With RIGHT
missingness, cases with high weighted sum scores will have a larger probability of becoming incomplete. With a left-tailed (LEFT), centered (MID) or both-tailed (TAIL) missingness type, larger probabilities are assigned to the candidates with low, average or extreme weighted sum scores respectively.
ampute
Within package mice, we developed a few extra functions to visualize and inspect the generated missing data problem. We will first discuss boxplots, then scatterplots and finally the possibility to specify your own probability distribution by means of odds values.
Function bwplot
allows for a comparison between amputed and non-amputed data. Note that the function uses as input the mads
object and not the incomplete dataset.
When using the function, argument which.pat
can be used to specify which patterns we want to inspect (default: all patterns). With argument yvar
, we specify the variable names of the variables we are interested in (default: all variables).
In addition to the visualizations, the function returns the descriptives when descriptives = TRUE
(default). In the output, the column Amp
shows a 1
for the amputed data and 0
for the non-amputed data.
We will now inspect the results of the amputation procedure for pattern 1.
## $Descriptives
## , , Variable = V1
##
## Descriptives
## Pattern Amp Mean Var N
## 1 1 0.11770 0.96079 3552
## 1 0 -0.12512 0.99905 3495
##
## , , Variable = V2
##
## Descriptives
## Pattern Amp Mean Var N
## 1 1 0.36608 0.84028 3552
## 1 0 -0.39542 0.85435 3495
##
## , , Variable = V3
##
## Descriptives
## Pattern Amp Mean Var N
## 1 1 0.22595 0.96457 3552
## 1 0 -0.26021 0.90809 3495
##
##
## $`Boxplot pattern 1`
We see that in pattern 1 the amputed data are shifted to the right with respect to the non-amputed data. This shift is due to the default value for argument type
, which is RIGHT
.
The shift in the distribution is largest for variable V2, due to the specified weight of 0.8. Although we would not expect to see a shift for variable V1 (in the first pattern V1 would be amputed, en because we have chosen a MAR mechanism, V1 will not influence the missingness), we still see a small difference between the amputed and non-amputed data for V1. This is due to the positive correlation between V1 on the one hand and V2 and V3 on the other hand These correlations were created during the simulation of the data.
To test whether missingness is MCAR or MAR, some researchers use a t-test. If desired, one could use the function tsum.test
from package BSDA
to perform a t-test using the summary statistics from the descriptives that are provided. Be aware of the general assumptions and limitations for t-tests.
##
## Welch Modified Two-Sample t-Test
##
## data: Summarized x and y
## t = 35.184, df = 6961.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.7371929 0.8241871
## sample estimates:
## mean of x mean of y
## 0.39077 -0.38992
We will now change the missingness types as follows and evaluate pattern 2.
result <- ampute(testdata, freq = myfreq, patterns = mypatterns,
weights = myweights, cont = TRUE,
type = c("RIGHT", "TAIL", "MID", "LEFT"))
bwplot(result, which.pat = 2, descriptives = FALSE)
## $`Boxplot pattern 2`
Since we specified TAIL
missingness, we expect that data rows with extreme (both low and high) sum scores will have a larger probability of being amputed. Indeed, if we inspect V3, the interquartile range (IQR) of the amputed data is much wider than that of the non-amputed data. As expected, since V1 and V2 were not weighted for missing data pattern 2, we do not see an effect for these variables.
Similar inspections can be done using the function xyplot
. The scatterplots show the correlation between the variable values and the weighted sum scores. For the fourth pattern, the scatterplots are as follows.
## $`Scatterplot Pattern 4`
From the scatterplots of pattern 4 we can learn the following three things:
The amputed data is on the left hand side of the weighted sum scores. This is due to the type
setting: we specified a LEFT
probability distribution.
There is a perfect correlation between variable V2 and the weighted sum scores. Clearly, pattern 4 depends on variable V2 only. This is exactly as we specified in the weights matrix.
Note that there are other R-packages with nice functions to visualize missing data patterns. An example is package naniar from Nicholas Tierney.
run
It may be desirable to quickly run ampute
without actually executing the amputation procedure. An empty run will generate the default argument values, which can then be adapted as desired. This can be especially useful for large datasets. For an empty run, the argument run
can be set to FALSE
.
## data frame with 0 columns and 0 rows
Sometimes, we wish to determine the missingness proportion per variable (and not for the number of incomplete rows, or the number of missing cells). At first sight, this is easy to accomplish by using the default patterns matrix as this will create patterns with missingness in 1 variable only and then specifying the proportion with the prop
argument.
However, for datasets with a small N:M ratio, this can give problems. For instance, the Boston dataset in the MASS package has 506 data rows and 14 variables. Let’s say we wish to create 30% missing values in each of the columns.
## [1] 506 14
With the default settings in ampute
and prop = 0.3
, the function will divide 506 rows over 14 patterns, resulting in approximately 36 data rows per subset. The function will then ampute 0.3 * 36 = 11
rows per pattern. In total, this will give a missingness proportion per variable of 11 / 506 = 0.02
wich is clearly not what we desired!
In order to reach 30% missingness in each variable in the Boston dataset, 0.3 * 506 = 152
data rows per subset should be made incomplete. This means that at least 152 / 36 = 422%
of the available data rows in each subset should be amputed. Obviously, this is not possible.
We will present two possible solutions.
In solution 1 we first create masks for every variable separately, and then apply the masks at the end. The following steps should be taken:
For every i in 1:14:
ampute.continuous
to generate a mask that indicates which rows should become incompleteWhen you have the masks for all variables, apply the masks.
# for instance for i = 1
# step 1 and 2
data_withoutvar1 <- data[, c(2:14)]
std_datawithoutvar1 <- scale(data_withoutvar1)
# step 3
# we generate only 1 pattern so the weights matrix has to contain 1 row
# specify the weights as desired
# here we use 1 for all variables,
# indicating that we use all variables - i to create missingness in variable i
# which is a MAR mechanism
weights <- matrix(rep(1, 13), nrow = 1)
my_scores <- apply(std_datawithoutvar1, 1, function (x) weights %*% x)
# step 4
# P has to be set to 2 for all rows in the dataset
# meaning that all rows are candidates for this pattern
# specify the probability distribution with argument type
# and the desired missingness proportion
mask_var1 <- ampute.continuous(P = rep(2, nrow(data_withoutvar1)),
scores = list(my_scores), prop = 0.3, type = "RIGHT")
# in the mask, 0 indicates the row should become incomplete,
# 1 means the row should stay complete
mask_var1
## [[1]]
## [1] 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0
## [149] 1 1 0 1 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 0 1
## [223] 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1
## [260] 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 0 0 0 1 1 1 0 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [334] 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 1 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1
Note that this procedure of first calculating the masks and then applying those masks will pose a problem for the quality of the MAR mechanism. Here, we use all observed data to generate the missing values, but later we will ampute the observed data with another mask. Hence, this approach will create a weak MNAR mechanism and not purely MAR.
In solution 2 we decide to generate the missingness by using 3 patterns instead of 14:
my_pat <- matrix(c(rep(0, 5), rep(1, 9),
rep(1, 5), rep(0, 5), rep(1, 4),
rep(1, 10), rep(0, 4)), nrow = 3, byrow = TRUE)
With this approach, all variables will become incomplete, but the data rows will have more than 1 missing value. Be aware that the missing data method that will be used has to be able to deal with this.
solution2 <- ampute(data, prop = 0.3, patterns = my_pat)
md.pattern(solution2$amp, rotate.names = TRUE)
## crim zn indus chas nox ptratio black lstat medv rm age dis rad tax
## 346 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
## 60 1 1 1 1 1 1 1 1 1 0 0 0 0 0 5
## 50 1 1 1 1 1 0 0 0 0 1 1 1 1 1 4
## 50 0 0 0 0 0 1 1 1 1 1 1 1 1 1 5
## 50 50 50 50 50 50 50 50 50 60 60 60 60 60 750
In this tutorial, we have discussed how to generate a combination of MAR and MNAR missingness using the weights matrix. Unfortunately, the generation of both MCAR and MAR missingness (or any other form of weak MAR) is currently not directly possible with ampute
. The following code shows how a combination of MCAR and MAR can still be created.
# ampute the complete data once for every mechanism
ampdata1 <- ampute(testdata, patterns = c(0, 1, 1), prop = 0.2, mech = "MAR")$amp
ampdata2 <- ampute(testdata, patterns = c(1, 1, 0), prop = 0.8, mech = "MCAR")$amp
# create a random allocation vector
# use the prob argument to specify how much of each mechanism should be created
# here, 0.5 of the missingness should be MAR and 0.5 should be MCAR
indices <- sample(x = c(1, 2), size = nrow(testdata),
replace = TRUE, prob = c(0.5, 0.5))
# create an empty data matrix
# fill this matrix with values from either of the two amputed datasets
ampdata <- matrix(NA, nrow = nrow(testdata), ncol = ncol(testdata))
ampdata[indices == 1, ] <- as.matrix(ampdata1[indices == 1, ])
ampdata[indices == 2, ] <- as.matrix(ampdata2[indices == 2, ])
odds
Function ampute
provides 4 logistic probability distributions that are used to translate a weighted sum score into a probability of being amputed. By default, in order to use these distributions, we set the argument cont == TRUE
.
However, when cont == FALSE
, the user has the possibility to define the probability values manually by means of the odds
argument. When using odds values:
The odds are specified by means of a matrix, and the default is as follows. The number of rows indicates the number of patterns, which is 4 in this tutorial. The number of columns determines the number of equally sized groups. Here, for each pattern, the weighted sum scores are divided over 4 groups.
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 1 2 3 4
## [3,] 1 2 3 4
## [4,] 1 2 3 4
The values 1, 2, 3, 4
indicate that a data row with a weighted sum score in the highest quantile (last value) has a probability of becoming incomplete that is four times as large (last value is a 4) as a data row with a weighted sum score in the lowest quantile (first value is 1).
We can adapt the odds matrix as follows. For pattern 3, we keep the four quantiles but we assign a 0
for positions 2 and 3. This means that data rows with an average weighted sum score (quantile 2 and 3) have a zero probability of becoming incomplete.
For pattern 4, we let data rows with an average weighted sum score (two center quantiles) have a twice as large probability of becoming incomplete than data rows with extreme weighted sum scores (odds values are 2 and 1). Since we assign 6 values to the odds matrix, the weighted sum scores of subset 4 are divided into 6 quantiles.
myodds[3, ] <- c(1, 0, 0, 1)
myodds[4, ] <- c(1, 1, 2, 2)
myodds <- cbind(myodds, matrix(c(NA, NA, NA, 1, NA, NA, NA, 1), nrow = 4, byrow = F))
myodds
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 2 3 4 NA NA
## [2,] 1 2 3 4 NA NA
## [3,] 1 0 0 1 NA NA
## [4,] 1 1 2 2 1 1
result <- ampute(testdata, freq = myfreq, patterns = mypatterns,
weights = myweights, cont = FALSE, odds = myodds, prop = 0.3)
In Figure 4 we show the quantiles for 100 candidates in pattern 1. Indeed, there are four groups of approximately equal size. They all have some nonzero probability of becoming incomplete, but this probability is largest for data rows with high weighted sum scores. In fact, for data rows in quantile 4, the probability is 0.8, which is four times as large as for data rows in quantile 1, which have a probability of being amputed of 0.2.
Figure 5 shows the relation between the quantiles and the variable values. Because in pattern 1, variables V2 and V3 are both uses to calculate the weighted sum scores (remember the weights matrix), the groups can be distinguished well.
We see that cases with a high value on V2, are more often in group 4 than in group 1. Group 4 contains the higher weighted sum scores. Hence, cases with a high value on V2, have a higher weighted sum score. Again, this is exactly as we specified the weights matrix. For variable V3, the relation between the values and the group allocation is less distinct but still present.
Figure 6 shows the probabilities for 100 candidates for pattern 3. We see that data rows with average weighted sum scores, have a zero probability of being amputed. The data rows with extreme sum scores have a probability of 0.6 of becoming incomplete. Since the odds values are similar for the lowest and highest quantile, the probabilities for these two subgroups are equal.
The diagram for pattern 4 will have 6 groups instead of 4, but with a similar distribution of 0 and nonzero probabilities.
ampute
!For questions or comments regarding ampute
or the amputation methodology, contact Rianne Schouten. Find her contact details here.
News! Multivariate amputation is now available in Python as well; in library pyampute
.
Brand, Jaap. 1999. “Development, Implementation and Evaluation of Multiple Imputation Strategies for the Statistical Analysis of Incomplete Data Sets.” PhD thesis.
Schouten, Rianne Margaretha, Peter Lugtig, and Gerko Vink. 2018. “Generating Missing Values for Simulation Purposes: A Multivariate Amputation Procedure.” Journal of Statistical Computation and Simulation 88 (15): 2909–30.
Schouten, Rianne Margaretha, and Gerko Vink. 2021. “The Dance of the Mechanisms: How Observed Information Influences the Validity of Missingness Assumptions.” Sociological Methods & Research 50 (3): 1243–58.
Van Buuren, Stef. 2018. Flexible Imputation of Missing Data. CRC press.