Compute response distribution indicators

Compute response distribution indicators for responses to multi-item scales or matrix questions.

Usage

resp_distributions(x, min_valid_responses = 1, id = T)

Arguments

x: A data frame containing survey responses in wide format. For more information see section "Data requirements" below.
min_valid_responses: A single numeric value between 0 and 1. Defines the minimum share of valid (non-missing) responses a respondent must have for response quality indicators to be calculated. Default is 1.
id: Default is TRUE. If TRUE, a column named id with integer ids is created. If FALSE, no id column is created. Alternatively, a numeric or character vector of unique values identifying each respondent can be supplied. It must have the same length as the number of rows of x.
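
As a rough illustration of what min_valid_responses expresses, the share of valid responses per respondent can be computed in base R and compared with a threshold. This is a minimal sketch, not the package's implementation: the data, the threshold value and the use of ">=" for the comparison are assumptions made for illustration.

```r
# Hypothetical data: three respondents, three items
responses <- data.frame(
  var_a = c(1, 5, 3),
  var_b = c(2, 3, NA),
  var_c = c(1, NA, NA)
)

# Hypothetical threshold, analogous to min_valid_responses
min_valid_responses <- 0.5

# Share of non-missing answers per respondent (row)
valid_share <- rowMeans(!is.na(responses))

# Assumption: a respondent qualifies if the share reaches the threshold
keep <- valid_share >= min_valid_responses

data.frame(valid_share = round(valid_share, 2), keep = keep)
```

Under this reading, a threshold of 1 keeps only respondents who answered every item.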

Value

Returns a data frame with response quality indicators per respondent. Dimensions:

Rows: Equal to number of rows in x.
Columns: Six response distribution indicator columns + id column (if specified).

Details

The following response distribution indicators are calculated per respondent:

n_na: number of intra-individual missing answers
prop_na: proportion of intra-individual missing responses
ii_mean: intra-individual mean
ii_median: intra-individual median
ii_sd: intra-individual standard deviation
mahal: Mahalanobis distance per respondent
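
The row-wise indicators in this list can be approximated with a few base R calls. The following is a sketch under the assumption that each indicator is a plain row-wise summary that ignores missing values; it is not taken from the package source, and the data are made up.

```r
# Hypothetical data: three respondents, three items on a 1-5 scale
responses <- data.frame(
  var_a = c(1, 4, 5),
  var_b = c(2, 5, 3),
  var_c = c(1, 2, NA)
)

m <- as.matrix(responses)

# One row per respondent, one column per indicator
indicators <- data.frame(
  n_na      = rowSums(is.na(m)),                     # count of missing answers
  prop_na   = rowMeans(is.na(m)),                    # share of missing answers
  ii_mean   = apply(m, 1, mean, na.rm = TRUE),       # intra-individual mean
  ii_median = apply(m, 1, median, na.rm = TRUE),     # intra-individual median
  ii_sd     = apply(m, 1, sd, na.rm = TRUE)          # intra-individual SD
)

round(indicators, 2)
```

Note that sd() returns NA when only one valid response is left, so ii_sd is undefined for respondents with a single answer.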

Intra-individual response variability (ii_sd) has been proposed to measure insufficient effort responding (Dunn et al., 2018) and to distinguish between random and conscientious responding (Marjanovic et al., 2015).

Intra-individual location indicators (ii_mean, ii_median) can be used to assess the average location of responses on a set of questions.

Mahalanobis distance is an outlier detection indicator. It represents the distance of a participant's responses from the center of a multivariate normal distribution defined by the data of all respondents.
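
To make the idea concrete, the distance can be sketched with stats::mahalanobis() from base R, which returns squared distances. This is an illustration under assumptions, not the package's code: it uses only complete cases, the sample mean as center and the sample covariance matrix, and it takes the square root to obtain a distance.

```r
# Hypothetical complete-case data: six respondents, three items
complete <- data.frame(
  var_a = c(1, 4, 3, 3, 2, 1),
  var_b = c(2, 5, 2, 4, 1, 2),
  var_c = c(1, 2, 3, 3, 4, 5)
)

center <- colMeans(complete)   # center of the response distribution
covmat <- cov(complete)        # sample covariance of the items

# stats::mahalanobis() returns the SQUARED distance per row
d2 <- mahalanobis(complete, center, covmat)

# Assumption: report the distance itself rather than its square
d <- sqrt(d2)
round(d, 2)
```

Larger values mark respondents whose answer pattern lies further from the bulk of the data.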

Data requirements

resp_distributions() assumes that data comes from multi-item scales or matrix questions, which have the same number and labeling of response options for many questions. The input data frame must be structured in the following way:

The data frame is in wide format, meaning each row represents one respondent, each column represents one variable.
All responses have integer values.
Missing values are set to NA.
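
Before calling resp_distributions(), these requirements can be checked with a small helper. The function check_input() below is hypothetical and not part of the package; it is a minimal sketch that only tests for a data frame with numeric, whole-number values.

```r
# Hypothetical helper, not part of the package
check_input <- function(x) {
  if (!is.data.frame(x)) {
    stop("x must be a data frame in wide format")
  }
  is_num <- vapply(x, is.numeric, logical(1))
  if (!all(is_num)) {
    stop("non-numeric columns: ", paste(names(x)[!is_num], collapse = ", "))
  }
  # Ignore NA (missing values) and test the remaining values
  vals <- unlist(x, use.names = FALSE)
  vals <- vals[!is.na(vals)]
  if (any(vals != round(vals))) {
    stop("all responses must have integer values")
  }
  invisible(TRUE)
}

check_input(data.frame(var_a = c(1, NA, 3), var_b = c(2, 5, NA)))
```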

Reverse coding of variables

The interpretation of the indicators depends on whether the response data of negatively worded questions were reversed or not:

Do not reverse data of negatively worded questions if you want to assess average response patterns (Dunn et al., 2018).
Reverse data of negatively worded questions if you want to assess whether responses are distributed randomly or not with respect to an assumed latent variable (Marjanovic et al., 2015).
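
Reverse coding itself is a one-line transformation: each value is subtracted from the sum of the scale minimum and maximum. The helper reverse_code() and its arguments are hypothetical names used for illustration, assuming a 1 to 5 response scale.

```r
# Hypothetical helper: reverse a response on a scale_min..scale_max scale
reverse_code <- function(x, scale_min = 1, scale_max = 5) {
  (scale_min + scale_max) - x
}

# Missing values stay NA
reverse_code(c(1, 2, 5, NA))
#> [1]  5  4  1 NA
```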

Mahalanobis distance could not be calculated

Under certain circumstances, the Mahalanobis distance cannot be calculated. This may happen if there is high collinearity (strong correlation between variables) or if there are too many missing values. Although this can occur in ordinary survey research data, the message can also indicate that something in the data is "off" for one of the reasons stated above. A manual inspection for low-quality responses can be a next step.
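
One way to start such an inspection is to look at the item correlations and at the rank of the covariance matrix. The following sketch uses made-up data in which two items are identical; it only illustrates the collinearity case and makes no claim about the checks the package runs internally.

```r
# Hypothetical data: var_b duplicates var_a, so the items are collinear
collinear <- data.frame(
  var_a = c(1, 4, 3, 5, 2),
  var_c = c(1, 2, 3, 1, 4)
)
collinear$var_b <- collinear$var_a

# Largest absolute correlation between two different items
cors <- cor(collinear)
max_cor <- max(abs(cors[upper.tri(cors)]))

# A covariance matrix of reduced rank cannot be inverted,
# which the Mahalanobis distance requires
cov_rank <- qr(cov(collinear))$rank
rank_deficient <- cov_rank < ncol(collinear)

max_cor
rank_deficient
```

A correlation close to 1 together with a rank-deficient covariance matrix suggests dropping or merging the redundant item before recomputing the indicators.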

References

Dunn, Alexandra M., Eric D. Heggestad, Linda R. Shanock, and Nels Theilgard. 2018. “Intra-Individual Response Variability as an Indicator of Insufficient Effort Responding: Comparison to Other Indicators and Relationships with Individual Differences.” Journal of Business and Psychology 33(1):105–21. doi: 10.1007/s10869-016-9479-0.

Marjanovic, Zdravko, Ronald Holden, Ward Struthers, Robert Cribbie, and Esther Greenglass. 2015. “The Inter-Item Standard Deviation (ISD): An Index That Discriminates between Conscientious and Random Responders.” Personality and Individual Differences 84:79–83. doi: 10.1016/j.paid.2014.08.021.

Author

Matthias Roth, Matthias Bluemke & Clemens Lechner

Examples

# A small test data set with ten respondents
# and responses to three survey questions
# with response scales from 1 to 5.
testdata <- data.frame(
  var_a = c(1,4,3,5,3,2,3,1,3,NA),
  var_b = c(2,5,2,3,4,1,NA,2,NA,NA),
  var_c = c(1,2,3,NA,3,4,4,5,NA,NA))

# Calculate response distribution indicators
resp_distributions(x = testdata) |>
  round(2)
#> # A tibble: 10 × 7
#>       id  n_na prop_na ii_mean ii_sd ii_median mahal
#>    <dbl> <dbl>   <dbl>   <dbl> <dbl>     <dbl> <dbl>
#>  1     1     0    0       1.33  0.58         1  2.04
#>  2     2     0    0       3.67  1.53         4  1.6 
#>  3     3     0    0       2.67  0.58         3  1.38
#>  4     4     1    0.33   NA    NA           NA NA   
#>  5     5     0    0       3.33  0.58         3  0.97
#>  6     6     0    0       2.33  1.53         2  1.38
#>  7     7     1    0.33   NA    NA           NA NA   
#>  8     8     0    0       2.67  2.08         2  1.88
#>  9     9     2    0.67   NA    NA           NA NA   
#> 10    10     3    1      NA    NA           NA NA   

# Include respondents with NA values by decreasing the
# necessary share of valid responses per respondent.
resp_distributions(
  x = testdata,
  min_valid_responses = 0.2) |>
  round(2)
#> # A tibble: 10 × 7
#>       id  n_na prop_na ii_mean  ii_sd ii_median mahal
#>    <dbl> <dbl>   <dbl>   <dbl>  <dbl>     <dbl> <dbl>
#>  1     1     0    0       1.33   0.58       1    2.27
#>  2     2     0    0       3.67   1.53       4    1.68
#>  3     3     0    0       2.67   0.58       3    1.05
#>  4     4     1    0.33    4      1.41       4    2.21
#>  5     5     0    0       3.33   0.58       3    1.24
#>  6     6     0    0       2.33   1.53       2    1.29
#>  7     7     1    0.33    3.5    0.71       3.5  0.71
#>  8     8     0    0       2.67   2.08       2    2.24
#>  9     9     2    0.67    3    NaN          3    0.24
#> 10    10     3    1      NA     NA         NA   NA