Getting started with resquin

An introduction to `resquin`

This short tutorial describe the functions in resquin and how you can use them on a technical level. For a more substantive introduction see the (forthcoming) article Using resquin in practice.

Functions in resquin calculate response quality indicators for survey data stored in a data frame or tibble. The functions assume that the input data frame is structured in the following way:

The data frame is in wide format, meaning each row represents one respondent, each column represents one variable.
All variables have the same number of response options.
The variables are in same the order as the questions respondents saw while taking the survey.
All responses have integer values.
Missing values are set to NA.
(For resp_styles()) Reverse keyed variables are in their original form. No items were recoded.

Example dataset of survey responses

Consider the following (fake) data set of survey responses.

# A test data set with three items and ten respondents
testdata <- data.frame(
  var_a = c(1,4,3,5,3,2,3,1,3,NA),
  var_b = c(2,5,2,3,4,1,NA,2,NA,NA),
  var_c = c(1,2,3,NA,3,4,4,5,NA,NA))

testdata
#>    var_a var_b var_c
#> 1      1     2     1
#> 2      4     5     2
#> 3      3     2     3
#> 4      5     3    NA
#> 5      3     4     3
#> 6      2     1     4
#> 7      3    NA     4
#> 8      1     2     5
#> 9      3    NA    NA
#> 10    NA    NA    NA

The data set contains responses to three survey questions (var_a,var_b and var_c) from ten respondents. All three survey question allow responses on a scale from 1 to 5. Some respondents have missing values, which are set to NA.

Lets use this data set to calculate response quality indicators.

`resp_styles()`: Response style indicators

Response styles capture systematic shifts in respondents response behavior. For example, respondents with an extreme response style may only choose the lowest and highest categories (in our example 1 and 5) while mid-point responder only choose the midpoint of a scale (in our example 3).

To calculate response styles we can use the resp_styles() function. First, we need to specify our data argument x. Then, we need to specify the minimum and maximum of the scales used in our questionnaire (scale_min and scale_max respectively). Remember that all questions included must have the same number of response options. We will discuss the arguments min_valid_responses and normalize later.

library(resquin)
# Calculating response style indicators for all respondents with no missing values
results_response_styles <- resp_styles(
  x = testdata,
  scale_min = 1,
  scale_max = 5,
  min_valid_responses = 1, # Excludes respondents with less than 100% valid responses
  normalize = T)  # Presents results in percent of all responses

round(results_response_styles,2)
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 6
#>       id   MRS   ARS   DRS   ERS  NERS
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1  0     0     1     0.67  0.33
#>  2     2  0     0.67  0.33  0.33  0.67
#>  3     3  0.67  0     0.33  0     1   
#>  4     4 NA    NA    NA    NA    NA   
#>  5     5  0.67  0.33  0     0     1   
#>  6     6  0     0.33  0.67  0.33  0.67
#>  7     7 NA    NA    NA    NA    NA   
#>  8     8  0     0.33  0.67  0.67  0.33
#>  9     9 NA    NA    NA    NA    NA   
#> 10    10 NA    NA    NA    NA    NA

The resulting data frame contains five columns corresponding to the middle response style (MRS), acquiescence response style (ARS), disaquiescence response style (DRS), extreme response style (ERS), and non-extreme response style (NERS) - you can learn more about the response styles in the help file of the function using ?resp_styles.

Each respondent receives one value for each indicator, given that they can be calculated. Because normalize is set to TRUEthe values are expressed as the share of responses of a respondent that can be attributed to a response style. For example, respondent one has an ERS value of 0.67 meaning that two out of three responses can be identified as extreme responses. On the other hand, respondent one does not have any mid-point response, leading to a value of 0 in the MRS column.

Instead of calculating proportions, we can extract the counts of responses that can be attributed to a response option by setting normalize to FALSE.

Finally, we can decide to include or exclude respondents from receiving response style values by setting min_valid_responses, which can take values from 0 to 1. min_valid_responses sets the share of valid responses (i.e. non-missing responses) a respondent must have to receive response style values. A value of 0 indicates that response style values should be calculated for all respondents, regardless of whether or not they have missing values. A value of 1 indicates that response styles should only be calculated for respondents who have valid responses on all variables. Values between 0 and 1 indicate the share of responses that need to be valid to be included in the response style calculations.

`resp_distributions()`: Intra-individual response distribution indicators

resp_distributions() calculates indicators which reflect the location and variability of responses within a respondent. resp_distributions() works similar to resp_styles(): We need to specify the data argument and we can include or exclude respondents from the calculations based on amount of missing data they exhibit (for an explanation see paragraph above).

# Calulating response distribution indicators for all respondents with no missing values
results_resp_distributions <- resp_distributions(
  x = testdata,
  min_valid_responses = 1) # Excludes respondents with less than 100% valid responses

round(results_resp_distributions,2)
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 7
#>       id  n_na prop_na ii_mean ii_sd ii_median mahal
#>    <dbl> <dbl>   <dbl>   <dbl> <dbl>     <dbl> <dbl>
#>  1     1     0    0       1.33  0.58         1  2.04
#>  2     2     0    0       3.67  1.53         4  1.6 
#>  3     3     0    0       2.67  0.58         3  1.38
#>  4     4     1    0.33   NA    NA           NA NA   
#>  5     5     0    0       3.33  0.58         3  0.97
#>  6     6     0    0       2.33  1.53         2  1.38
#>  7     7     1    0.33   NA    NA           NA NA   
#>  8     8     0    0       2.67  2.08         2  1.88
#>  9     9     2    0.67   NA    NA           NA NA   
#> 10    10     3    1      NA    NA           NA NA

The resulting data frame contains eight columns:

n_na: number of intra-individual missing answers
prop_na: proportion of intra-individual missing responses
ii_mean: intra-individual mean
ii_median: intra-individual median
ii_sd: intra-individual standard deviation
mahal: Mahalanobis distance per respondent.

You can learn more about the response distribution indicators using ?resp_distributions

`resp_nondifferentiation()`: Response nondifferentiation indicators

resp_nondifferentiation calculates indicators primarily measuring what is known as straightlining. However, different indicators measure different aspects of straightlining. While Simple Nondifferentiation just measures whether respondents used the same response option for all questions, other indicators such as the Mean Root of Pairs Method or the Scale Point Variation Method provide continuous values of nondifferentiation, measuring how varied the choice of response options for each respondent is.

resp_nondifferentiation() works just like resp_distributions() and resp_styles().

results_resp_nondifferentiation <- resp_nondifferentiation(
  x = testdata,
  min_valid_responses = 1) # Excludes respondents with less than 100% valid responses

round(results_resp_nondifferentiation,2)
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 5
#>       id simple_nondifferentiation mean_root_pairs max_identical_rating
#>    <dbl>                     <dbl>           <dbl>                <dbl>
#>  1     1                         0            1                    0.67
#>  2     2                         0            0.21                 0.33
#>  3     3                         0            1                    0.67
#>  4     4                        NA           NA                   NA   
#>  5     5                         0            1                    0.67
#>  6     6                         0            0.21                 0.33
#>  7     7                        NA           NA                   NA   
#>  8     8                         0            0                    0.33
#>  9     9                        NA           NA                   NA   
#> 10    10                        NA           NA                   NA   
#> # ℹ 1 more variable: scale_point_variation <dbl>

The resulting dataframe contains four columns containing indicators described by (Kim et al. 2019):

simple_nondifferentiation: Respondents are assigned 1 or 0 depending on whether all responses have the same value (1) or not (0).
mean_root_pairs: Mean of the root of the absolute differences between all pairs in a multi-item scale or matrix questions. It ranges from 0 (least straightlining) to 1 (most straightlining).
max_identical_rating: Proportion of the most commonly selected response option among all responses in a multi-item scale or matrix questions. It ranges from 0 (least straightlining) to 1 (most straightlining).
scale_point_variation: The measure becomes larger if respondents use more scales points in a multi-item scale or matrix questions.

You can learn more about the the response nondifferentiation indicators using ?resp_nondifferentiation.

`resp_patterns()`: Response pattern indicators

Response patterns describe suspicious response patterns arising from careless or insufficient effort responding. Careless or insufficient effort respondents might choose the same answer option repeatedly (i.e. creating a long string of equal responses) or use pattern, such as zig-zaging, to complete a survey.

resp_patterns() provides three columns which contain longstring analysis indicators for each respondent. These three columns are always returned. There are two more optional columns () for the analysis of other repeating response patterns: defined_patterns and arbitrary_patterns:

To obtain the three longstring analysis columns simply call resp_patterns():

results_resp_patterns <- resp_patterns(testdata)
round(results_resp_patterns,2)
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 4
#>       id n_transitions mean_string_length longest_string_length
#>    <dbl>         <dbl>              <dbl>                 <dbl>
#>  1     1             2                  1                     1
#>  2     2             2                  1                     1
#>  3     3             2                  1                     1
#>  4     4            NA                 NA                    NA
#>  5     5             2                  1                     1
#>  6     6             2                  1                     1
#>  7     7            NA                 NA                    NA
#>  8     8             2                  1                     1
#>  9     9            NA                 NA                    NA
#> 10    10            NA                 NA                    NA

To obtain counts of defined response patterns per respondents use the defined_patterns argument. You can either supply a single vector representing a response pattern, or a list of vectors representing multiple response patterns.

# Single defined pattern
defined_patterns_single <- resp_patterns(
  x = testdata,
  defined_patterns = c(1,2,3)
)
defined_patterns_single
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 5
#>       id n_transitions mean_string_length longest_string_length defined_patterns
#>    <int>         <dbl>              <dbl>                 <int> <list>          
#>  1     1             2                  1                     1 <int [1]>       
#>  2     2             2                  1                     1 <int [1]>       
#>  3     3             2                  1                     1 <int [1]>       
#>  4     4            NA                 NA                    NA <NULL>          
#>  5     5             2                  1                     1 <int [1]>       
#>  6     6             2                  1                     1 <int [1]>       
#>  7     7            NA                 NA                    NA <NULL>          
#>  8     8             2                  1                     1 <int [1]>       
#>  9     9            NA                 NA                    NA <NULL>          
#> 10    10            NA                 NA                    NA <NULL>

# Multiple defined patterns
defined_patterns_multiple <- resp_patterns(
  x = testdata,
  defined_patterns = list( # wrap multiple pattern vectors into a list
    c(1,2,3),
    c(2,3,4),
    c(4,3,2),
    c(3,2,1)
  )
)

defined_patterns_multiple
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 5
#>       id n_transitions mean_string_length longest_string_length defined_patterns
#>    <int>         <dbl>              <dbl>                 <int> <list>          
#>  1     1             2                  1                     1 <int [4]>       
#>  2     2             2                  1                     1 <int [4]>       
#>  3     3             2                  1                     1 <int [4]>       
#>  4     4            NA                 NA                    NA <NULL>          
#>  5     5             2                  1                     1 <int [4]>       
#>  6     6             2                  1                     1 <int [4]>       
#>  7     7            NA                 NA                    NA <NULL>          
#>  8     8             2                  1                     1 <int [4]>       
#>  9     9            NA                 NA                    NA <NULL>          
#> 10    10            NA                 NA                    NA <NULL>

# defined patterns are returned as a single list column.
# One way to make the data accessible is by using tidyr::unnest_wider()
# This way, each column represents the count of one defined pattern

defined_patterns_multiple |> 
  tidyr::unnest_wider(defined_patterns)
#> # A tibble: 10 × 8
#>       id n_transitions mean_string_length longest_string_length `1_2_3` `2_3_4`
#>    <int>         <dbl>              <dbl>                 <int>   <int>   <int>
#>  1     1             2                  1                     1       0       0
#>  2     2             2                  1                     1       0       0
#>  3     3             2                  1                     1       0       0
#>  4     4            NA                 NA                    NA      NA      NA
#>  5     5             2                  1                     1       0       0
#>  6     6             2                  1                     1       0       0
#>  7     7            NA                 NA                    NA      NA      NA
#>  8     8             2                  1                     1       0       0
#>  9     9            NA                 NA                    NA      NA      NA
#> 10    10            NA                 NA                    NA      NA      NA
#> # ℹ 2 more variables: `4_3_2` <int>, `3_2_1` <int>

Finally, you can request the detection of arbitrary response patterns with counts larger than one and pattern lengths longer than one using the arbitrary_patterns argument

arbitrary_patterns_length_2 <- resp_patterns(
  x = testdata,
  arbitrary_patterns = 2
)

arbitrary_patterns_length_2
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 5
#>       id n_transitions mean_string_length longest_string_length
#>    <int>         <dbl>              <dbl>                 <int>
#>  1     1             2                  1                     1
#>  2     2             2                  1                     1
#>  3     3             2                  1                     1
#>  4     4            NA                 NA                    NA
#>  5     5             2                  1                     1
#>  6     6             2                  1                     1
#>  7     7            NA                 NA                    NA
#>  8     8             2                  1                     1
#>  9     9            NA                 NA                    NA
#> 10    10            NA                 NA                    NA
#> # ℹ 1 more variable: arbitrary_patterns <list>

# You can also request the detection of patterns of multiple 
# lengths, in this case 2 and 3
arbitrary_patterns_length_2_3 <- resp_patterns(
  x = testdata,
  arbitrary_patterns = c(2,3)
)
arbitrary_patterns_length_2_3
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 5
#>       id n_transitions mean_string_length longest_string_length
#>    <int>         <dbl>              <dbl>                 <int>
#>  1     1             2                  1                     1
#>  2     2             2                  1                     1
#>  3     3             2                  1                     1
#>  4     4            NA                 NA                    NA
#>  5     5             2                  1                     1
#>  6     6             2                  1                     1
#>  7     7            NA                 NA                    NA
#>  8     8             2                  1                     1
#>  9     9            NA                 NA                    NA
#> 10    10            NA                 NA                    NA
#> # ℹ 1 more variable: arbitrary_patterns <list>

Using the `id` argument: Uniquely identify respondents

In its default setting, all resquin functions provide an integer id column running from 1 to the number of respondents in the data set supplied in x. You can turn of the creation of the id column by setting id = False or supply a vector of unique ids (character or numeric). The latter option is useful if a data set provides its own vector of ids.

Here are examples of the three options:

# Default. Integer ids.
resp_distributions(testdata)
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 7
#>       id  n_na prop_na ii_mean  ii_sd ii_median  mahal
#>    <int> <dbl>   <dbl>   <dbl>  <dbl>     <dbl>  <dbl>
#>  1     1     0   0        1.33  0.577         1  2.04 
#>  2     2     0   0        3.67  1.53          4  1.60 
#>  3     3     0   0        2.67  0.577         3  1.38 
#>  4     4     1   0.333   NA    NA            NA NA    
#>  5     5     0   0        3.33  0.577         3  0.970
#>  6     6     0   0        2.33  1.53          2  1.38 
#>  7     7     1   0.333   NA    NA            NA NA    
#>  8     8     0   0        2.67  2.08          2  1.88 
#>  9     9     2   0.667   NA    NA            NA NA    
#> 10    10     3   1       NA    NA            NA NA

# No id column
resp_distributions(testdata,id = F)
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 6
#>     n_na prop_na ii_mean  ii_sd ii_median  mahal
#>    <dbl>   <dbl>   <dbl>  <dbl>     <dbl>  <dbl>
#>  1     0   0        1.33  0.577         1  2.04 
#>  2     0   0        3.67  1.53          4  1.60 
#>  3     0   0        2.67  0.577         3  1.38 
#>  4     1   0.333   NA    NA            NA NA    
#>  5     0   0        3.33  0.577         3  0.970
#>  6     0   0        2.33  1.53          2  1.38 
#>  7     1   0.333   NA    NA            NA NA    
#>  8     0   0        2.67  2.08          2  1.88 
#>  9     2   0.667   NA    NA            NA NA    
#> 10     3   1       NA    NA            NA NA

# Custom id vectors
custom_ids <- letters[1:nrow(testdata)]
resp_distributions(testdata,id = custom_ids)
#> # Number of missings due to min_valid_responses equal to 1: 4
#> # A data frame:                                             10 × 7
#>    id     n_na prop_na ii_mean  ii_sd ii_median  mahal
#>    <chr> <dbl>   <dbl>   <dbl>  <dbl>     <dbl>  <dbl>
#>  1 a         0   0        1.33  0.577         1  2.04 
#>  2 b         0   0        3.67  1.53          4  1.60 
#>  3 c         0   0        2.67  0.577         3  1.38 
#>  4 d         1   0.333   NA    NA            NA NA    
#>  5 e         0   0        3.33  0.577         3  0.970
#>  6 f         0   0        2.33  1.53          2  1.38 
#>  7 g         1   0.333   NA    NA            NA NA    
#>  8 h         0   0        2.67  2.08          2  1.88 
#>  9 i         2   0.667   NA    NA            NA NA    
#> 10 j         3   1       NA    NA            NA NA

Missing data as a threat to the validity of response quality indicators

The argument min_valid_responses can be used to control how many responses a respondent must have for a response quality indicator to be computed. Per default, resquin requires respondents to have valid responses for all questions asked (min_valid_responses = 1). In other words, per default resquin requires respondents to have no missing values.

Of course, missing values are a common occurrence in survey data, thus it can make sense to reduce the required share of valid responses in min_valid_responses from 1 to a lower one. This leads to the calculation of response quality indicator values for more respondents. For example, reducing min_valid_responses to 0.5 would mean that respondents only have to have valid (i.e. non-missing) responses on 50% of the questions asked.

However, allowing NA values (i.e. missing values) can change the interpretation and validity of response quality indicators. For example, a respondent with ten identical responses on a ten item scale (i.e. ten time response option “3”) has more evidence for straightlining then a respondent with two identical responses which are separated by eight NA values. Both would receive a 1 on the simple_nondifferentiation indicator of resp_nondifferentiation() or 1 (meaning absolute presence of) a Middle Response Style if the response scale had five response options. But should the respondents be considered to be identical in their extent of straightlining or Middle Response Style? This is up to the interpretation of the analyst.

Although an increasing number of missing values per respondent can pose validity threats to all response quality indicators, some indicators are more affected by NA values than others. Indicators calculated with resp_distributions() are not affected as much by missing data, because the resp_distributions() indicators average over all response values. For the following response indicators missing values are more of a threat to validity because they conceptually rely more on the position of responses on the response scale:

resp_patterns():
- mean_string_length - Response strings with identical values can be broken by NA.
- longest_string_length - Response strings with identical values can be broken by NA.
resp_nondifferentiation():
- simple_nondifferentiation - Straightlining with identical values can be broken by NA.
resp_styles():
- All response styles - Response strings with identical values can be broken by NA.

In any case, it helps to report how missing values were handled in an analysis to increase confidence in and replicability of the results.

References

Kim, Yujin, Jennifer Dykema, John Stevenson, Penny Black, and D. Paul Moberg. 2019. “Straightlining: Overview of Measurement, Comparison of Indicators, and Effects in Mail–Web Mixed-Mode Surveys.” Social Science Computer Review 37 (2): 214–33. https://doi.org/10.1177/0894439317752406.

An introduction to resquin

Example dataset of survey responses

resp_styles(): Response style indicators

resp_distributions(): Intra-individual response distribution indicators

resp_nondifferentiation(): Response nondifferentiation indicators

resp_patterns(): Response pattern indicators

Using the id argument: Uniquely identify respondents