
This article covers more advanced use cases as well as details of how the QuestionLink package works. You can quickly navigate to relevant topics with the table of contents on the left.

If you have not already, we suggest working through the QuestionLink tutorial first.

The code below simulates response data for four questions which all capture the same construct in a population. This process is explained in detail in the tutorial. This simulated dataset will be used in the code examples later on.

Data simulation code (click to unfold)
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 

# Loading necessary packages

library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(ggplot2)
library(forcats)
library(knitr)
library(kableExtra)

## Installing questionlink with devtools
#devtools::install_github("https://github.com/MatRoth/questionlink")
## Installing questionlink with pak
#install.packages("pak")
#pak::pak("https://github.com/MatRoth/questionlink")

library(questionlink)



# Defining custom functions

format_table <- function(df){
  df |> kable() |> 
    kable_styling(position = "left",
      full_width = FALSE
      )
}

plot_question_years <- function(combined_df){
  
  year_min = min(combined_df$year)
  year_max = max(combined_df$year)
  years_vec = c(year_min:year_max)
  major_breaks_vec = years_vec[years_vec %% 5 == 0]
  
  combined_df |> 
    group_by(year, question) |> 
    summarise() |> 
    ungroup() |> 
    arrange(question) |> 
    mutate(question = fct_rev(question)) |> 
    ggplot(aes(year, question, color = question))+
    geom_point(size = 3)+
    scale_x_continuous(breaks = major_breaks_vec,
                       minor_breaks = years_vec)+
    theme_minimal(base_size = 16)+
    theme(legend.position = "none")
}

# helper function that allows us to cut continuous values into ordinal chunks
cut_responses <- function(x, breaks){
  ord_vec <- cut(x, c(-Inf, breaks, Inf), labels = FALSE)
  return(ord_vec)
}

# ------------------
# Simulating response data for four different questions for the same construct in the same population
# ------------------

# Parameters of the simulation

## Sample size per simulated survey wave
survey_sample_n = 2000

## Response thresholds for the four simulated survey questions
response_thresholds_A <- c(-1.5, -0.5, 0.5, 1.5)
response_thresholds_B <- (response_thresholds_A - 0.5) * 1.5
response_thresholds_C <- (response_thresholds_A + 0.5) * 0.7
response_thresholds_D <- seq(-1, 2, length.out = 9)

sim_df <- tibble(
  year = c(1980:2020),
  # construct linearly increases over time
  target_construct_mean = ((year - min(year))/ (max(year)-min(year)))-0.5, 
  simulated_true_score = map(target_construct_mean, ~ rnorm(survey_sample_n, .x, 1))
) |> 
  unnest(simulated_true_score) |> 
  mutate( # Cut simulated true score into ordinal responses for Questions A to D
    question_A = cut_responses(simulated_true_score, response_thresholds_A),
    question_B = cut_responses(simulated_true_score, response_thresholds_B),
    question_C = cut_responses(simulated_true_score, response_thresholds_C),
    question_D = cut_responses(simulated_true_score, response_thresholds_D)
  )

# Bringing the simulated data into long format

qlink_format_df <-   sim_df |> 
  select(year, contains("question")) |> 
  pivot_longer(contains("question"), values_to = "response") |> 
  separate(name, into = c("discard", "question")) |> 
  select(-discard)

# Introducing gaps in the time series for the questions

## Questions A, B, and C get survey waves in even years, D in odd years
qlink_format_df <- qlink_format_df |> 
  filter(!(question == "A" & year %% 2 != 0)) |> 
  filter(!(question == "B" & year %% 2 != 0)) |> 
  filter(!(question == "C" & year %% 2 != 0)) |> 
  filter(!(question == "D" & year %% 2 == 0)) 

## Defining larger gaps in time series by defining the "possible years" for each question

possible_years_lists <- list(
  A = c(2000:2015),
  B = c(1980:1990),
  C = c(1980:2020),
  D = c(1985:2005)
)

qlink_format_df <- qlink_format_df |> 
  mutate(
    valid_year = map2_lgl(year, question, ~ .x %in% possible_years_lists[[.y]])
  ) |> 
  filter(valid_year) |> 
  select(-valid_year)

We simulate data from 1980 to 2020 with deliberate gaps in the time series for each question so that we can showcase the purpose of QuestionLink.

plot_question_years(qlink_format_df)

We also prepare a harmonization solution with QuestionLink, since we will explore the resulting object in greater detail in this article.

# Creating a ql_prepare() object with the harmonization opportunities

prepared_connections <- qlink_format_df |> 
  ql_prepare(
    time_relaxation = 1,
    use_relay = TRUE)

# Creating a ql_harmonize() object with the calculated harmonization solutions

harmonization_solution <- prepared_connections |> 
  ql_harmonize()

Inverted items

When combining data from different sources with their own question versions for a concept of interest, we often encounter the issue that not all questions measure the concept in the same numerical order. Some questions represent a stronger expression of the concept with higher numerical scores, while other questions are inverted in the sense that higher numerical scores mean a weaker expression of the concept.

Imagine two questions for social trust; both with 5 response options. However, in one scale “1” is the lowest level of trust and “5” is the highest level of trust. Meanwhile, in the other scale “1” is the highest level of trust and “5” is the lowest level of trust. This is an obstacle to pooling and harmonizing the data!

This issue of inverted items can stem from two different causes:

(1) Response scales with different directions can lead to different scale “orientations”.

For example: [fully agree] … [do not agree at all] versus [do not agree at all] … [fully agree]

(2) The question wording is different in relation to the concept. This is sometimes discussed as “negative” or “reversed” items.

For example: “All in all, I trust other people” versus “All in all, I do not trust other people.”

Resolving inverted items during harmonization

For the QuestionLink package, we decided to leave item inversion to the users. This means you need to recode inverted items before feeding the data into ql_prepare().

You might wonder why the package does not automate this process. This would be fairly easy to implement, but would require users to add additional meta-data on each question. Specifically, one would have to specify inverted items in a separate data structure or as another column. We thus found it best to leave this step up to users before applying the package so as not to add an unnecessary layer of complexity to the package usage.

As best practice, we would recommend inverting response scores of questions that do not match the orientation of the chosen target question. That way, scores can in the end be interpreted in line with the scale of the target question. If there are no other considerations to be made, we would also recommend choosing a target question that follows the heuristic “higher means more” or “better”. That way all analyses with the harmonized data become easier to interpret.

Inverting items is a trivial matter in R:

invert_item <- function(responses,
                        min_score = min(responses, na.rm = TRUE),
                        max_score = max(responses, na.rm = TRUE)) {
  # Reflect the scale: inverted score = (min_score + max_score) - original score
  (min_score + max_score) - responses
}

For example, for a 5-point scale:

c(1,2,3,4,5) |> invert_item()
## [1] 5 4 3 2 1

However, it is often better to clearly specify the minimum and maximum scores. That way, the function works even in data sets where respondents did not use the full scale. Consider again a 5-point scale:

#wrong:
c(2,3,4,5) |> invert_item()
## [1] 5 4 3 2
#correct:
c(2,3,4,5) |> invert_item(min_score = 1, max_score = 5)
## [1] 4 3 2 1
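
If one question in your harmonization data is inverted relative to the others, you can apply this inversion before ql_prepare(). A minimal sketch, pretending for illustration that question “B” were inverted (in our simulated data it is not):

# Recode the (hypothetically) inverted question "B" in the long-format data
qlink_format_df |> 
  mutate(response = ifelse(question == "B",
                           invert_item(response, min_score = 1, max_score = 5),
                           response))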

Weights

Many survey programs provide weights in their scientific use files. Weights are generally provided when surveys are based on probability samples of a population. In such cases, weights are often included in analyses with the survey data to correct for some aspects of this sampling.

Design weights correct for conscious sample design decisions. In Germany, for example, some survey programs “oversample” the federal states of former East Germany. This means surveys such as the ALLBUS recruit proportionally more respondents in these states to increase the statistical resolution there. However, to avoid skewing analyses, this has to be addressed with a design weight.

Adjustment weights, meanwhile, correct for sampling issues that were not intended, such as unit non-response. Often, this involves weighting the data set so that it aligns better to the socio-demographic distribution of the target population (i.e., post-stratification).

Since QuestionLink harmonizes questions based on their frequency distribution in a specific population, we need to apply weights to get as close as possible to representing the “same” population. To this end, ql_prepare() can process an additional column in the combined data set that we feed into it.

Aside from year, question, and response we simply add another column called weight. In this column, we provide the weights from different sources. We would routinely recommend using the design weight if provided, but adding adjustment weights can also be a sound choice depending on your field and the assumptions you are comfortable with.

Weights in the scientific use files of survey programs can also vary in the way they are scaled. Some weights result in frequency distributions as they would have been in the total target population. However, we would recommend using rescaled weights that are standardized to the sample size of each survey wave. You can easily identify these weights by calculating the mean in each wave. Weights that are rescaled to represent the sample size should always have a mean of exactly 1.
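
A quick sketch of this check, assuming a hypothetical survey_df in our long format with an added weight column:

# Rescaled weights should average to exactly 1 within each survey wave
survey_df |> 
  group_by(question, year) |> 
  summarise(mean_weight = mean(weight))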

If you want to use design and adjustment weights at the same time, you can simply multiply the rescaled weights and enter their product as the weight column into ql_prepare().
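
As a sketch, with hypothetical design_weight and adjustment_weight columns that are both already rescaled:

# Combine both weights and hand the product to ql_prepare() as the weight column
survey_df |> 
  mutate(weight = design_weight * adjustment_weight) |> 
  select(year, question, response, weight) |> 
  ql_prepare()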

Missing values and non-substantive responses

In many survey data sets, we find that some response values are missing. As mentioned earlier, when preparing data to be passed into ql_prepare(), we remove missing values first. Such missing values are often either actual missing responses (e.g., due to omissions or refusals) or non-substantive responses, such as “don’t know”. In many survey contexts, such cases are few and should not impact the quality of the harmonization solution overly much.
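
Such a cleaning step might look like this (a sketch assuming a hypothetical raw_survey_df where 98 and 99 are non-substantive codes for “don’t know” and refusals):

# Drop non-substantive codes and missing responses before ql_prepare()
raw_survey_df |> 
  filter(!response %in% c(98, 99)) |> 
  filter(!is.na(response)) |> 
  ql_prepare()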

Nonetheless, under certain circumstances we recommend taking a closer look:

  1. If there are many missing values, the issue becomes more pressing.
  2. If the proportion of missing values varies considerably between source surveys, it might mean that surveys differ systematically in how respondents behave.
  3. Sometimes, some source surveys offer non-substantive response options while others do not.

In such cases, you need to consider that QuestionLink assumes that the probability of a missing response value is either statistically independent of the measured concept or that biases are at least the same across all data sources. Another way of looking at the same issue is this: across all questions, the participants with valid responses should show the same distribution of true construct scores.

These issues perhaps become clearer with two practical examples.

Imagine respondents who omit answers out of embarrassment; for example, if they hold an opinion they think is seen as unpopular or even immoral. That is, in itself, not yet a problem for QuestionLink as long as the same problem persists across all questions. However, now imagine that one question is more subtle while the other question is very direct. Respondents with strong beliefs might then answer the subtle question but omit the direct one. In that case, the two underlying populations are no longer the same in terms of the distribution of true scores (i.e., what people truly believe), and this would bias the harmonization solution.

Another example is two questions where only one features a non-substantive “don’t know” option. That is in itself not a problem, as long as people who would have chosen “don’t know” pick a random substantive response option when forced to choose. However, if respondents choose a midpoint category as a way of expressing “don’t know” when they do not get that option, then the distributions are again biased.

If you are concerned about systematic missing values, you might consider analyzing the pattern of missingness; for example, by looking at the demographic distribution of respondents with missing responses. Those should be similar across data sources if missing responses are random or the result of a similar mechanism across sources. A potential solution for missing values is to impute them before handing the responses over to ql_prepare(). However, note that this only works if the reason for missingness is reflected in other variables. In technical terms: QuestionLink assumes that values are missing completely at random (MCAR). With careful imputation, we can mitigate this assumption down to the less restrictive missing at random (MAR) assumption. This is the case when responses are not missing randomly, but any systematic pattern of missing values can be explained with other variables in the dataset. However, even imputation cannot resolve responses that are missing not at random (MNAR). For a quick introduction to these types of missingness, see Wikipedia.
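
A minimal imputation sketch with the mice package, assuming the hypothetical raw_survey_df contains covariates (here the likewise hypothetical age and education) that explain the missingness:

# Impute missing responses from covariates before ql_prepare()
library(mice)

imputed_df <- raw_survey_df |> 
  select(response, age, education) |> 
  mice(m = 1, printFlag = FALSE) |> 
  complete()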

Lastly, for clarification: Missing values are a problem in the harmonization data; i.e., the data handed to ql_prepare() to calculate the harmonization solutions. However, they are not a problem when we harmonize data with ql_transform() using a previously established harmonization solution.

Ensuring data integrity with documentation meta-data

In our examples up until now, QuestionLink simply inferred the response scale of each question from the supplied data. Specifically, it infers the minimum and maximum response scale from what it finds in the pooled harmonization data. However, this can be dangerous in some cases.

One example is zeros as responses. In some cases, zeros are valid responses, for example in 11-point scales from 0 to 10. In other cases, zeros can be undeclared missing values, for example when response options were omitted in a web survey. Of course, the summaries from summary() and ql_check() applied to a ql_prepare() object can reveal such issues. However, we recommend that you clearly define the response scale minimum and maximum for each question. This also means reviewing the documentation of your source surveys carefully.

Once you have this information, you can easily supply it to ql_prepare() with a list. This then informs the ql_check() function and leads to more reliable data validation checks.

min_max_list = list(
  "A" = c(1,5),
  "B" = c(1,5),
  "C" = c(1,5),
  "D" = c(1,10)
)

Consider the following example. First, we delete all values of 10 from instrument “D” to corrupt our data.

corrupted_data <- qlink_format_df |> 
  filter(!(question == "D" & response == 10))

Then we run ql_check() without providing additional scale information:

corrupted_data |> 
  ql_prepare(
    use_relay = TRUE,
    time_relaxation = 1
  ) |> 
  ql_check()
## 
## ── A ✔ ──
## 
## ── B ✔ ──
## 
## ── C ✔ ──
## 
## ── D ✔ ──
## 

Despite our corrupted data, we get no warning. As far as ql_check() is concerned, question D is simply a 9-point scale with all response options in use.

However, if we provide the scale information, ql_check() now spots the inconsistency!

min_max_list = list(
  "A" = c(1,5),
  "B" = c(1,5),
  "C" = c(1,5),
  "D" = c(1,10)
)

corrupted_data |> 
  ql_prepare(
    use_relay = TRUE,
    time_relaxation = 1,
    scale_min_max = min_max_list
  ) |> 
  ql_check()
## ── A ✔ ──
## 
## ── B ✔ ──
## 
## ── C ✔ ──
## 
## ── D ──
## 
## Issue: Not all response options between the minimum (1) and maximum (10) of D
## were used.
## Year(s): 1985, 1987, 1989, 1991, 1993, 1995, 1997, 1999, 2001, 2003, 2005
## 

Retrieving recoding tables

QuestionLink makes it unnecessary to handle recoding tables manually. However, for purposes of transparency or to document the solutions in papers it might become necessary to access the recoding tables.

To do so, we will delve manually into the output object of ql_harmonize(). To get a quick overview of what that object type (or any other object) contains, you can use str(). We will use the “attributes” summarized_rec_tbl and connections_rec_tbl. You can imagine these attributes as containers nested inside the larger object.

harmonization_solution |> str(max.level = 1)
##  'questionlink' chr "questionlink_object"
##  - attr(*, "connections_rec_tbl")= tibble [1,610 × 13] (S3: tbl_df/tbl/data.frame)
##  - attr(*, "summarized_rec_tbl")= tibble [100 × 5] (S3: tbl_df/tbl/data.frame)
##  - attr(*, "data")= tibble [92,000 × 5] (S3: tbl_df/tbl/data.frame)
##  - attr(*, "population_flag")= logi FALSE
##  - attr(*, "weight_flag")= logi FALSE
##  - attr(*, "scale_min_max")= list()

Retrieving the aggregated recoding tables

The most likely use case is that you want to access the final, aggregated recoding table that QuestionLink has formed from all available connections. This can be easily done by picking the recoding information from the ql_harmonize() object: apply |> attr("summarized_rec_tbl") to it. This returns all recoding tables combined in a single data frame.

# Create a ql_harmonize object:
harmonization_solution <- qlink_format_df |> 
  ql_prepare(
    use_relay = TRUE,
    time_relaxation = 1
  ) |> 
  ql_harmonize()

harmonization_solution |> 
  attr("summarized_rec_tbl") |> 
  head(10) |> 
  format_table()
source_question   target_question   population   source   median_equivalent
A                 B                 dummy        1        1.8525822
A                 B                 dummy        2        2.8072627
A                 B                 dummy        3        3.4988292
A                 B                 dummy        4        4.2687967
A                 B                 dummy        5        4.9837607
A                 C                 dummy        1        0.6357454
A                 C                 dummy        2        1.2935848
A                 C                 dummy        3        2.4731599
A                 C                 dummy        4        3.8667301
A                 C                 dummy        5        5.1112039

If you want the recoding information for a specific pair of questions, simply filter that data frame:

# Looking up recodings from "A" to "C"

harmonization_solution |> 
  attr("summarized_rec_tbl") |> 
  filter(
    source_question == "A",
    target_question == "C"
  ) |> 
  select(-population) |> 
  format_table()
source_question   target_question   source   median_equivalent
A                 C                 1        0.6357454
A                 C                 2        1.2935848
A                 C                 3        2.4731599
A                 C                 4        3.8667301
A                 C                 5        5.1112039

Retrieving all recoding tables that were calculated

If you will recall, QuestionLink calculates all possible equating solutions (and thus recoding tables). This means it will calculate one recoding table per allowed connection it finds (direct, time-relaxed, or relay).

Accessing these recoding tables can make sense if you want to assess the variability of the separate equatings. This can be interesting for methodological research or to identify problems with specific data sources or across time.

Specifically, ql_harmonize() objects contain a data frame in the attribute connections_rec_tbl, which contains all connections between the questions and for each connection the equating solution in the form of a recoding table.

A quick look into that data frame shows that it contains a number of variables:

harmonization_solution |> 
  attr("connections_rec_tbl") |> 
  names() |> 
  as_tibble_col(column_name = "variable names") |> 
  format_table()
variable names
source_question
source_year
source_population
target_question
target_year
target_population
connection_type
relay_question
relay_source_year
relay_source_population
relay_target_year
relay_target_population
recoding_tables

Let us go through these step by step. In general, all variables describe a connection from one question to another. The last variable, recoding_tables, then gives the equating solution for that specific connection.

  1. The first three variables, source_question, source_year, & source_population, denote the source question (and its specific data source in terms of year and population); that is the question we want to harmonize. For example, the row might harmonize question “A” with data gathered in 1992 from Germany towards some other question format.
  2. The next three variables, target_question, target_year, & target_population, denote the target question (and its specific data source in terms of year and population); that is the question we want to use as the target scale format for our harmonization. For example, the row might harmonize the source question towards the format of question “B” with data gathered in 1992 from Germany.
  3. connection_type then denotes the type of connection that was used in this connection instance to harmonize the source question towards the format of the target question. The three possible types are direct, time_relaxation, and relay. Of course, time_relaxation, and relay connections only appear in the data frame if we allowed these two types of connections during the ql_prepare() stage of our harmonization.
  4. The next five variables all detail a relay used for harmonization and can be ignored if you are not interested in this specifically. Of course, these variables only have valid values if the connection is of the type relay. If you remember our data example, there were no direct or time-relaxed connections between questions “A” and “B”. Thus we harmonize “A” to “B” via the relay question “C”. In such a case, the relay variables for this connection would be filled with information on which relay_question was used (e.g., “C”), which relay_source_year and relay_source_population instance was used to harmonize the source question to the relay (e.g., “A” to “C”), and lastly, which relay_target_year and relay_target_population instance was used to harmonize the equating solution onwards to the target question (e.g., “C” to “B”).
  5. Last but not least, recoding_tables contain the equating solutions for each connection. This is where we need to get our recoding tables from.

The content of each cell in the column recoding_tables is itself a nested table.

harmonization_solution |> 
  attr("connections_rec_tbl") |> 
  pull(recoding_tables) |> 
  first() |> 
  format_table()
source   target
1        0.5092251
2        0.7146371
3        1.5531716
4        3.2322738
5        5.1176471

We chose this “nested” structure because it means that we get a “one row, one connection” data frame. For our purposes, however, we might want to unnest() that structure so that one row is one source question score harmonized via a specific connection.

Consider one row (omitting the relay variables for brevity):

harmonization_solution |> 
  attr("connections_rec_tbl") |> 
  select(-contains("relay")) |> 
  head(1) |> 
  format_table()
source_question   source_year   source_population   target_question   target_year   target_population   connection_type   recoding_tables
B                 1980          dummy               C                 1980          dummy               direct            1.0000000, 2.0000000, 3.0000000, 4.0000000, 5.0000000, 0.5092251, 0.7146371, 1.5531716, 3.2322738, 5.1176471

Now observe how unnest() unfolds this one row into five rows; one for each of the five possible response options of the source question B. This also adds the columns source for the source question’s scores and target for the harmonized scores in the target question’s format.

harmonization_solution |> 
  attr("connections_rec_tbl") |> 
  select(-contains("relay")) |> 
  head(1) |> 
  unnest(recoding_tables) |> 
  format_table()
source_question   source_year   source_population   target_question   target_year   target_population   connection_type   source   target
B                 1980          dummy               C                 1980          dummy               direct            1        0.5092251
B                 1980          dummy               C                 1980          dummy               direct            2        0.7146371
B                 1980          dummy               C                 1980          dummy               direct            3        1.5531716
B                 1980          dummy               C                 1980          dummy               direct            4        3.2322738
B                 1980          dummy               C                 1980          dummy               direct            5        5.1176471

If you are interested in the variability of equating solutions between two questions, for example over time, you can simply unnest() and then filter() the data frame.

As an example, we look at all the direct connections between “A” and “C”. The plot shows source scores of question “A” on the x-axis and the calculated equivalents in the target format of “C” on the y-axis. Note how there are several “target” equivalent values for each “source” score; specifically, one equivalent per direct connection. However, they are very similar across all those direct connections. (This will look messier in empirical data!)

harmonization_solution |> 
  attr("connections_rec_tbl") |> 
  mutate(connection_id = row_number()) |> 
  filter(
    source_question == "A",
    target_question == "C",
    connection_type == "direct"
  ) |> 
  unnest(recoding_tables) |> 
  ggplot(aes(source,target))+
  geom_line(aes(group = connection_id), alpha = 0.5)+
  geom_point(alpha = 0.5)+
  theme_minimal()+
  coord_fixed()+
  scale_x_continuous(limits = c(0.5,5.5), breaks = c(1:5), minor_breaks = NULL)+
  scale_y_continuous(limits = c(0.5,5.5), breaks = c(1:5), minor_breaks = NULL)

Now we repeat this plot, but this time with all relay connections. We immediately see that (1) there are many more relay connections and (2) that they are far messier. The reason is that a relay connection performs two equating processes, one chained after the other. This means that random measurement and sampling errors are amplified.

harmonization_solution |> 
  attr("connections_rec_tbl") |> 
  mutate(connection_id = row_number()) |> 
  filter(
    source_question == "A",
    target_question == "C",
    connection_type == "relay"
  ) |> 
  unnest(recoding_tables) |> 
  ggplot(aes(source,target))+
  geom_line(aes(group = connection_id), alpha = 0.1)+
  geom_point(alpha = 0.2)+
  theme_minimal()+
  coord_fixed()+
  scale_x_continuous(limits = c(0.5,5.5), breaks = c(1:5), minor_breaks = NULL)+
  scale_y_continuous(limits = c(0.5,5.5), breaks = c(1:5), minor_breaks = NULL)

This variability of equating solutions derived from relay connections might seem worrying. However, if you will recall, QuestionLink does not use these equating solutions separately. Instead, it aggregates them into one solution by choosing the median equivalent score. If we do the same here, we immediately see that the median-aggregated direct connections and the median-aggregated relay connections match almost perfectly. The aggregation over many links has mitigated the problem of randomness in measurement and sampling.

harmonization_solution |> 
  attr("connections_rec_tbl") |> 
  filter(
    source_question == "A",
    target_question == "C"
  ) |> 
  unnest(recoding_tables) |> 
  select(connection_type, source, target) |> 
  group_by(connection_type, source) |> 
  summarise(
    target = median(target)
  ) |> 
  ggplot(aes(source,target, color = connection_type))+
  geom_line(aes(group = connection_type))+
  geom_point()+
  theme_minimal()+
  coord_fixed()+
  scale_x_continuous(limits = c(0.5,5.5), breaks = c(1:5), minor_breaks = NULL)+
  scale_y_continuous(limits = c(0.5,5.5), breaks = c(1:5), minor_breaks = NULL)

Several populations

Until now, we only dealt with data from a single population (e.g., the adult German population). However, we often face harmonization tasks with data from several populations, such as different countries. We cannot form an equating connection across populations: each connection QuestionLink uses is within one population (and at approximately the same point in time). What QuestionLink can do, however, is perform its tasks population by population. This means we have to find suitable connections in every population (e.g., country), but then we can harmonize each population’s data with recoding tables that are specific to that population.

To illustrate this feature, we copy our simulated dataset three times and pretend that these data were gathered from the populations pop_1, pop_2, and pop_3:

multi_pop_data <- tibble(
  population = c("pop_1", "pop_2", "pop_3"),
  data = list(qlink_format_df, qlink_format_df, qlink_format_df)
  ) |> 
  unnest(data)

The harmonization process with QuestionLink now stays exactly the same. We do not even have to specify that we are now using the multi-population mode. It is enough that our input data now has a population column. Note how the summary() of the ql_prepare() object now lists the three populations with the message 'Populations found: "pop_1", "pop_2", and "pop_3"'.

multi_pop_harmonization_prepared <- multi_pop_data |> 
  ql_prepare(
    use_relay = TRUE,
    time_relaxation = 1
  ) 

multi_pop_harmonization_prepared |> summary()
## 
## ── Summary: QuestionLink Prepare ───────────────────────────────────────────────
## Current stage of the QuestionLink workflow: ql_prepare ✔ -> ql_harmonize ->
## ql_transform
## 
## • Number of unique question combinations: 12
## 
## ── Number of found connections: ──
## 
## • Direct connections: 84
## • Time relaxation connections: 216
## • Relay connections: 4530
## 
## ── Additional information: ──
## 
## • Populations found: "pop_1", "pop_2", and "pop_3"
## • No weight variable was supplied.
## 
## ── Overview over question combinations: ──
## 
## 
## 
## Source question   Target question    Direct   Time relaxation   Relay
## ----------------  ----------------  -------  ----------------  ------
## A                 C                      24                NA     420
## C                 A                      24                NA     420
## B                 C                      18                NA     420
## C                 B                      18                NA     420
## A                 B                      NA                NA     291
## A                 D                      NA                21     507
## B                 A                      NA                NA     291
## B                 D                      NA                21     375
## C                 D                      NA                66     252
## D                 A                      NA                21     507
## D                 B                      NA                21     375
## D                 C                      NA                66     252
## 
## ── Overview over response options found: ──
## 
## 
## 
## Question   Response options                User supplied 
## ---------  ------------------------------  --------------
## A|pop_1    1, 2, 3, 4, 5                   FALSE         
## A|pop_2    1, 2, 3, 4, 5                   FALSE         
## A|pop_3    1, 2, 3, 4, 5                   FALSE         
## B|pop_1    1, 2, 3, 4, 5                   FALSE         
## B|pop_2    1, 2, 3, 4, 5                   FALSE         
## B|pop_3    1, 2, 3, 4, 5                   FALSE         
## C|pop_1    1, 2, 3, 4, 5                   FALSE         
## C|pop_2    1, 2, 3, 4, 5                   FALSE         
## C|pop_3    1, 2, 3, 4, 5                   FALSE         
## D|pop_1    1, 2, 3, 4, 5, 6, 7, 8, 9, 10   FALSE         
## D|pop_2    1, 2, 3, 4, 5, 6, 7, 8, 9, 10   FALSE         
## D|pop_3    1, 2, 3, 4, 5, 6, 7, 8, 9, 10   FALSE

We can also look at the connections plot for each population, simply by calling plot() on the ql_prepare() object:

multi_pop_harmonization_prepared |> plot(combine_population_plots = FALSE)
We can then harmonize the entire, multi-population data frame by calling ql_harmonize() and ql_transform():

multi_pop_harmonization_prepared |> 
  ql_harmonize() |> 
  ql_transform(
    target_question = "A"
  ) |> 
  head() |> 
  format_table()
population   year   question   response   weight   target_question   median_equivalent
pop_1        1980   B          4          1        A                 3.527580
pop_1        1980   C          3          1        A                 3.332302
pop_1        1980   B          3          1        A                 2.297157
pop_1        1980   C          1          1        A                 1.707434
pop_1        1980   B          2          1        A                 1.192544
pop_1        1980   C          1          1        A                 1.707434

What happens under the hood is that QuestionLink performs all steps for each population, one after the other. If we look at the summarized recoding tables in the ql_harmonize() object, we see that we get a recoding table for each question combination, in both directions, for each population. Note that with count() we condense the recoding tables down to one line per recoding table, arriving at 48 recoding tables (and rows): 4 instruments * 4 instruments * 3 populations = 48.

multi_pop_harmonization_prepared |> 
  ql_harmonize() |> 
  attr("summarized_rec_tbl") |> 
  count(source_question, target_question, population)
## # A tibble: 48 × 4
##    source_question target_question population     n
##    <chr>           <chr>           <chr>      <int>
##  1 A               A               pop_1          5
##  2 A               A               pop_2          5
##  3 A               A               pop_3          5
##  4 A               B               pop_1          5
##  5 A               B               pop_2          5
##  6 A               B               pop_3          5
##  7 A               C               pop_1          5
##  8 A               C               pop_2          5
##  9 A               C               pop_3          5
## 10 A               D               pop_1          5
## # ℹ 38 more rows

Lastly, what happens if we cannot find links for some questions in some populations? As an example, we delete question A from the harmonization data in population 1. In such cases QuestionLink makes its inability to harmonize some values transparent. Specifically, it details which scores of which instrument in which population were not harmonized.

multi_pop_research_data <- multi_pop_data |> 
  transmute(population,
            source_question = question,
            response)

multi_pop_harmonization_data_incomplete <- multi_pop_data |> 
  filter(!(population == "pop_1" & question == "A"))

multi_pop_harmonization_data_incomplete |> 
  ql_prepare(
    use_relay = TRUE,
    time_relaxation = 1
  ) |> 
  ql_harmonize() |> 
  ql_transform(
    target_question = "A",
    new_data = multi_pop_research_data
  ) -> incomplete_solution

If we look at some examples for question A, we see that QuestionLink has harmonized responses to question A in population 2 and 3. In population 1, however, it has added NA values to indicate that no harmonization solution was possible.

 incomplete_solution |> 
   filter(source_question == "A" | target_question == "A") |> 
   group_by(population) |> 
   summarise(across(everything(), first)) |> 
   format_table()
population   source_question   response   target_question   median_equivalent
pop_1        A                 3          NA                NA
pop_2        B                 4          A                 3.52758
pop_3        B                 4          A                 3.52758

Overcoming gaps in harmonization data

The structure of QuestionLink is optimal for making use of existing random samples of one (or more) populations collected by survey programs over longer time series. However, in some cases there are simply questions that cannot be harmonized with the others using that approach. Here we want to address some possible ways of overcoming such limitations.

Choosing the right target question

Often, the choice of target question (i.e., to which question’s format to harmonize) is a matter of convenience. However, there are edge cases where the choice of target question determines how many other questions we can harmonize towards it.

Consider the following case:

chain_limit_prep <- tibble(
  question = c("A", "B", "C", "D") |> rev(),
  year = list(
    c(1980:1990) +5,
    c(1980:1990) +12,
    c(1980:1990) +19,
    c(1980:1990) +26
  )
) |> unnest(year) |>
  mutate(response = 1) |> 
  ql_prepare(use_relay = TRUE, time_relaxation = 0) 

chain_limit_prep |> 
  plot()

We immediately see that we can only directly harmonize temporally adjacent questions:

  • A to B

  • B to A and C

  • C to B and D

  • D to C

However, we can use relays. Yet since relay (i.e., chained) equatings amplify random errors, QuestionLink limits relays to a single jump. This means we cannot reach D from A or vice versa.

However, both B and C can reach all other questions:

  • B has direct connections to A and C and a relay to D (via C)

  • C has direct connections to B and D and a relay to A (via B)

Looking into the connection attribute of the ql_prepare() object, we can easily verify this with a bit of data wrangling:

number_of_questions <- chain_limit_prep |> 
  attr("data") |> 
  dplyr::pull(question) |> 
  unique() |> 
  length()

question_connections_table <- chain_limit_prep |> 
  attr("connection") |> 
  dplyr::select(source_question, target_question) |> 
  dplyr::distinct() |> 
  dplyr::arrange(source_question) |> 
  dplyr::group_by(source_question) |> 
  dplyr::count() |> 
  dplyr::mutate(
    pct_link = 100 * n / (number_of_questions -1),
    question_connections = dplyr::case_when(
      pct_link == 0 ~ "no links",
      pct_link == 100 ~ "links to all",
      pct_link > 0 & pct_link < 100 ~ "links to some"
    ),
    question_connections = factor(
      question_connections,
      levels = c("no links", "links to some", "links to all")
    )
  ) |> 
  dplyr::select(question = source_question, question_connections)


chain_limit_prep |> 
  attr("data") |> 
  dplyr::group_by(year, question) |> 
  dplyr::summarise() |> 
  dplyr::arrange(year, question) |> 
  dplyr::full_join(question_connections_table)
## # A tibble: 44 × 3
## # Groups:   year [32]
##     year question question_connections
##    <dbl> <chr>    <fct>               
##  1  1985 D        links to some       
##  2  1986 D        links to some       
##  3  1987 D        links to some       
##  4  1988 D        links to some       
##  5  1989 D        links to some       
##  6  1990 D        links to some       
##  7  1991 D        links to some       
##  8  1992 C        links to all        
##  9  1992 D        links to some       
## 10  1993 C        links to all        
## # ℹ 34 more rows

Cutting intersecting samples down to size

Sometimes the issue is that we do have random samples for two questions, but the target population is not a perfect match. However, as long as there is some overlap (i.e., intersection) between the target populations of the data sources, we might still make do.

Examples for such situations include:

  • Surveys which cover only a sub-population of the general population of a country, such as samples of the employed population, samples of migrants, or samples of women.

  • Surveys might also cover different age ranges.

  • Surveys might cover different regions, such as covering some but not all federal states.

In all these cases, we might still harmonize two questions if their respective samples have an intersection of sufficient size. All we have to do is exclude cases from one or both samples so that their remaining populations match.

For example, if we want to harmonize a sample of people in retirement with a sample of the general population, we might exclude all non-pensioners from the general sample and then match both of them for harmonization. Or imagine a sample of adult women and a sample of only employed people. We might exclude all unemployed respondents from the first and all non-female people from the second sample to match them.
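
A minimal sketch for the second example, assuming hypothetical data frames women_df and employed_df that carry employed and gender indicator columns alongside the usual year, question, and response columns:

# Cut both samples down to their intersection: employed women
harmonization_df <- bind_rows(
  women_df |> filter(employed == TRUE),
  employed_df |> filter(gender == "female")
) |> 
  select(year, question, response)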

However, please note that this might lead to small remaining sample sizes, which increases the random error of each equating solution. This is, of course, less of an issue if there are many such connections (e.g., when there are many waves to harmonize).

Conducting split-half experiments

Sometimes, there are simply no existing data to form a connection between two questions. In such cases, we might consider collecting data ourselves to bridge the missing connection(s). Of course it is seldom feasible to collect random samples like the larger survey programs do. However, if you will recall, the random groups design is traditionally a split-half experiment. And our requirement for random samples was just to be able to harmonize across survey programs. In a split-half experiment, we get the “same population” requirement for free. This means we can resort to cheaper sampling methods, such as non-probability samples.

There are some aspects that we should consider, however:

  • Sample size is key! Unlike other connections, we only get one connection between two questions here. After all, we do this only when there are no other connections. This means that the higher our sample size, the better we can mitigate the random error of equating. (As a rule of thumb: N >= 2000.)

  • This not only affects the quality of harmonizing the two questions in the experiment. Depending on the other questions and the data availability, QuestionLink might route relay connections through our split-half experiment data, using it as a relay. This means its randomness is then injected into other harmonizations as well.

  • We should strive to at least ensure socio-demographic variety via quotas. This does not make our data “representative” of the general population, of course. However, the harmonization solution will be more comparable if we avoid a very homogeneous sample; for example, one with mostly highly educated respondents.

  • Lastly, note that we assume that both questions work the same way in our non-probability sample as they do in the general population, which we discuss more in the section Population invariance. Just for clarification: We do not mean that we get the same response distribution as in the general population. We mean that respondents should understand the question in the same way (in the formal sense of measurement invariance).

Adding split-half experimental data to QuestionLink to bridge missing connections is easily done, but the process might seem counter-intuitive at first. It might be best to look at the practical process and then to try and understand why it works the way it does.

Adding split-half experimental data works like this:

  1. Bring the experimental data into a long format, just like we normally do with data for QuestionLink. At this stage, we have a question and a response column, with the respective Question (i.e., condition in the split-half experiment) and the individual response scores.
  2. Next, we fill the year column. This step is crucial! We have to choose years that are outside the range of the years from the random sample surveys. Otherwise, QuestionLink might equate one of our non-probability experimental conditions with a random sample of one of the surveys. You can choose any arbitrary year that is far enough away from the others so that it does not trigger the time-relaxation range. However, it might perhaps be best to clearly demarcate experimental data. You might choose years from previous or future centuries or even negative years.
  3. If you have different populations (e.g., countries), then we recommend using experimental data only for the population it was collected in. Thus enter that population’s id into the population column.
  4. If the random sample data sources use weights, add a weight column to your experimental data, where every row has the value 1.
  5. Lastly, add the experimental data as additional rows to your other harmonization data (e.g., with add_row() or a join). Then proceed as normal with the QuestionLink R-package.

The reason for this peculiar structure with imaginary data collection years is that we want to fold experimental data into the QuestionLink connection routing. However, we also need to avoid that QuestionLink forms connections between the general population and our experimental sample(s).
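
A minimal sketch of steps 1 to 5, assuming a hypothetical experiment_df from a split-half experiment in which respondents were randomly assigned to question “A” or question “D” (the uniformly sampled responses are mere placeholders):

# Long-format experimental data with an imaginary year far outside
# the survey time series (step 2); no weights are needed here (step 4)
experiment_df <- tibble(
  question = rep(c("A", "D"), each = 1000),
  response = c(sample(1:5, 1000, replace = TRUE),
               sample(1:10, 1000, replace = TRUE)),
  year = -9999
)

# Step 5: append the experimental rows and proceed as usual
bind_rows(qlink_format_df, experiment_df) |> 
  ql_prepare(use_relay = TRUE, time_relaxation = 1)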

A last issue is that this process might lead to less readable and perhaps confusing connection plots if we apply plot() to the ql_prepare() object.

However, this can be resolved with a bit of ggplot() magic.

If you simply want to omit all experimental data points from the plot, you can use the limits argument in ggplot2::scale_x_continuous(). This way you can define the minimum and maximum years to display. If you choose only the range of the random sample surveys, the experiments are not displayed.

# Experimental years dummy data:
experimental_years_data <- tibble(
  question = c("A", "B") |> rev(),
  year = list(
    c(-10, c(1980:1990) +5),
    c(-10, c(1980:1990) +20)
  )
) |> unnest(year) |>
  mutate(response = 1) |> 
  ql_prepare(use_relay = TRUE, time_relaxation = 0) 

If we add negative year numbers to our data, the plot() becomes unreadable:

# Without limits

experimental_years_data |> 
  plot()

However, if we exclude the experimental years, then it works as normal:

# With limits
experimental_years_data |> 
  plot()+
  scale_x_continuous(limits = c(1980, 2020))

If you want to clearly designate specific “years” as experiments, you can proceed as follows. Here we set two imaginary experiments to 1960 and 1970, a bit away from the earliest random sample data from 1980. Then we define labels for our experiments and supply them inside scale_x_continuous() via a custom lambda function (~) with ifelse().

# Experimental years dummy data:
experimental_years_data_2 <- tibble(
  question = c("A", "B") |> rev(),
  year = list(
    c(1960, 1970, c(1980:1990) +5),
    c(1960, 1970, c(1980:1990) +20)
  )
) |> unnest(year) |>
  mutate(response = 1) |> 
  ql_prepare(use_relay = TRUE, time_relaxation = 0) 


# define experiment "years"
experiment_labels <- c(
  "1960" = "Experiment 1",
  "1970" = "Experiment 2" 
)

# plot
experimental_years_data_2 |> 
  plot()+
  scale_x_continuous(
    labels = ~ ifelse(.x %in% names(experiment_labels),
                      experiment_labels[as.character(.x)],
                      .x))

Population invariance

One last issue is less technical and more a methodological limitation of observed score equating. The issue of population invariance becomes relevant when we want to apply the recoding tables derived from harmonization data to other data measured with the same questions. In simple terms, population invariance holds if an equating solution derived from random groups data in one population is the same as the equating solution we would get from random groups data in another population.

If this were the case, then we could equate in one population and apply the recoding tables directly in the other population. However, such population invariance is not guaranteed when applying observed score equating algorithms. In formal terms, we only get population invariance across two populations if both questions are measurement invariant across the two populations. In intuitive terms, this means we only get population invariance if the questions are understood and responded to in the same way in both populations.

The problem is, of course, that we do not know whether single-item questions are measurement invariant. This is because all conventional methods for assessing measurement invariance require multiple indicators, such as several items for one construct. As a concrete example, think of multi-group confirmatory factor analyses, which require several items per factor.

Now in pragmatic terms, we can relegate this issue to assumptions. Every harmonization project must make some assumptions about separate and joint data quality, and this might just add another such assumption. In fact, we routinely make this assumption when analyzing single-source data in the social sciences. If we asked men and women in Germany how interested they were in politics and found a higher average score for men, most researchers would say men are more interested in politics. This means social scientists routinely assume measurement invariance; here, between male and female respondents.

The way QuestionLink is set up, it assumes population invariance within each population defined with a unique population id. It also assumes that population invariance between such groups is violated and that (chained) equating across populations is not desirable. This way, users can make their own decisions. Up until now, we used countries as population examples. However, you can also distinguish subpopulations. In data from one country, you might define “women” and “men” as ids in population. Then you would get separate recoding tables, which would also be applied separately to other data.

Apart from which data to use for deriving the recoding tables, the issue of population invariance also governs which other data we can apply the recoding tables to. Imagine deriving recoding tables from surveys of the general adult population in a country. Can you then use the resulting recoding table to harmonize survey data collected in a sample of teenagers? Well, you can if you feel confident that those teenagers understand and treat the question similarly to adults.

Sometimes, we can also try to vet our population invariance assumption empirically. Specifically, we can check for population invariance when we have random groups data from both populations. Then we can define these populations with a population id and apply QuestionLink. If we get very similar recoding tables in both (or all) populations, this supports the population invariance assumption.
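
With the three simulated (and here identical) populations from above, such a check could look like the sketch below; in real data, the lines would diverge to the degree that population invariance is violated.

# Compare the summarized recoding tables across populations
multi_pop_harmonization_prepared |> 
  ql_harmonize() |> 
  attr("summarized_rec_tbl") |> 
  filter(source_question == "A", target_question == "C") |> 
  ggplot(aes(source, median_equivalent, color = population)) +
  geom_line() +
  geom_point() +
  theme_minimal()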