Creating a Qualtrics Documenter

Qualtrics R Excel Regex Github Data Documentation

I import data from Qualtrics into R and create a codebook, or documentation spreadsheet, that can be used for question labelling and recoding of response values (see the renaming-script project).

Joshua Scriven
2023-03-10

To avoid inadvertently sharing sensitive client data, this example uses a simplified dataset that mimics most of the issues encountered in the actual project.

Purpose:

Create a codebook that lists every question from a Qualtrics survey, indicating the corresponding column(s) in the response data along with every possible response value (for non-open-ended questions). The codebook can be used for quickly cross-referencing exported response data against the survey questionnaire. It can also be used as a tool for renaming exported data columns to more meaningful names. The Qualtrics question-labeling option was set to “Block”, but the regex patterns contained below can be updated to apply to any of the other options.
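As a quick illustration of that labeling scheme (the values here are made up, not client data), Block-style question names take the form Q&lt;block&gt;.&lt;question&gt;, which is the shape the regex patterns in this project assume:

```r
# Block-style labels look like "Q<block>.<question>", e.g., Q1.2 or Q20.6.
block_pattern <- "^Q[0-9]+\\.[0-9]+"
grep(block_pattern, c("Q1.2", "Q20.6", "StartDate"), value = TRUE)
# -> "Q1.2" "Q20.6"
# For the "Default" labeling option (Q1, Q2, ...), the analogous anchor
# pattern would be "^Q[0-9]+$".
```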

Joshua’s work principles applied to this project:

Load my custom functions from GitHub

source("https://raw.githubusercontent.com/joshuascriven/helper_functions/main/helper_functions.R")

Load CRAN packages

packageloader(c(
"tidyverse"
, "dplyr"
, "knitr"
, "qualtRics"
, "odbc"
, "lubridate"
, "stringr"
, "anytime"
, "tm"
, "fastDummies"
, "jtools"
, "janitor"
, "here"
, "data.table"
, "openxlsx"
, "GrpString"
, "kableExtra"
, "gtools"
, "flextable"
))
[1] "packages loaded"

Import data and data conversion tables

Connecting to the Qualtrics survey data requires an API token. I retrieve my credentials from a local file in a centralized location using my getKey() function. I query the API to fetch three types of Qualtrics data using the R “qualtRics” package. I always label projects so I can cross-reference them in emails, system folders, and code searches.
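getKey() comes from my helper functions loaded above; the following is only a minimal sketch of the idea, assuming a two-column key/value TSV stored outside the repository (the path and column names here are illustrative, not the real implementation):

```r
# Sketch of a credential lookup: read a local key/value TSV and return the
# value for a named key. The path and column names are assumptions.
getKey_sketch <- function(name, path = "~/keys/credentials.tsv") {
  keys <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
  keys$value[keys$key == name]
}
# Usage: api_key <- getKey_sketch("qualtrics_api_key")
```

Keeping credentials in one local file outside version control avoids hard-coding tokens in scripts that get shared or pushed to GitHub.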

data_out (Qualtrics Questionnaire Metadata)

I use metadata() to get metadata for each question, cleaning question text of HTML and leading/trailing whitespace while retaining the original text values. I create row indicators for any imported data to help with later joins and troubleshooting.

data_dir <- "../data/"
projname <- "qo_request_JS22"

api_key <- read.table(filemaker("api_key","tsv"), header = TRUE)

# Get Qualtrics data
qualtrics_api_credentials(
  api_key = getKey("qualtrics_api_key")
  , base_url = getKey("qualtrics_base_url")
  , install = TRUE,overwrite = T); readRenviron("~/.Renviron")

sid <- all_surveys()$id[grep("^Li.*To.* TEST COPY$",all_surveys()$name)]

dat <- fetch_survey(surveyID = sid,
                         verbose = TRUE,force_request = T)

# Extract Q metadata
md <- metadata(sid)
q_rule <- stack((lapply(md$questions, function(x) as.character(x$validation))))

df_list <- list(
  qnum     = stack(lapply(md$questions, function(x) x$questionName))
  , qid    = data.frame(ind = names(md$questions))
  , qtext  = stack(lapply(md$questions, function(x) cleanhtml(x$questionText)))
  , qtype  = stack(lapply(md$questions, function(x) x$questionType$type))
  , qslct  = stack(lapply(md$questions, function(x) x$questionType$selector))
  , q_sbsl = stack(lapply(md$questions
                          , function(x) as.character(x$questionType$subSelector)))
  , q_frce = stack(lapply(md$questions
                          , function(x) x$validation$doesForceResponse))
  , q_rule = aggregate(values ~ ind
                  , FUN = paste
                  , collapse = " ; "
                  , data = q_rule)
  , q_opt_txt  = cbind(cbind(stack(lapply(md$questions
                    , function(w) lapply(w$choices, function(z) z$choiceText)))
                         , stack(lapply(md$questions
          , function(w) as.character(lapply(w$choices, function(z) z$recode))))
  ))[,-4]
  
)

data_out <- df_list %>% reduce(full_join, by='ind')
names(data_out) <- c(names(df_list),"q_opt_num")

data_out <- data_out %>% 
  mutate(
  source = "data_out"
  ,qtext_stripped = gsub(":$", "", trimws(qtext))
  ,rownum_data_out = seq(nrow(.))
)

tab_options (Response Metadata)

Next, I use column_map() to extract metadata on responses. Each row in the resulting data represents a response choice for a survey question.

tab_options <- tab_options_bac <- qualtRics::column_map(sid)
tab_options$source <- "tab_options"

tab_desc (Response Data)

Importantly, the metadata provided by metadata() and column_map() describe the questions in the survey instrument, not the final data that Qualtrics exports. Responses to “select all that apply” questions, for instance, produce a binary indicator for each choice option offered. These columns tend to follow a naming pattern that ties each set back to its survey question.
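A minimal, made-up illustration of that export shape (not the client data):

```r
# A "select all that apply" question Q5 with three choices exports as three
# binary indicator columns sharing the question's prefix:
resp <- data.frame(
  Q5_1 = c(1, 0, 1),  # respondent checked choice 1?
  Q5_2 = c(0, 0, 1),  # respondent checked choice 2?
  Q5_3 = c(1, 1, 0)   # respondent checked choice 3?
)
# One survey question, three exported columns:
grep("^Q5_", names(resp), value = TRUE)  # -> "Q5_1" "Q5_2" "Q5_3"
```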

tab_desc <- stack(lapply(dat, function(x) attributes(x)$label))
names(tab_desc) <- c("desc","qname") 
tab_desc <- tab_desc %>%
  mutate(rownum_tab_desc = seq(nrow(tab_desc))
         ,source = "tab_desc")

Merging datasets

When a question in the Qualtrics survey allows the respondent to select all options that apply, Qualtrics stores responses for that question type as a set of binary indicators, one for each of the n choices presented. A similar situation arises with Qualtrics matrix or grouped questions, where multiple responses are elicited from respondents for a set of questions sharing the same scale. Hence, metadata() returns a QID (question identifier) column whose length equals the number of columns in the exported data, while column_map() provides all response options for a given QID/exported data column. Qualtrics does not provide a key to merge these three datasets, so I use the following method to merge them.

Candidate Fields for Merge-key

A merge-key column is one that can uniquely identify the same observation across datasets. Of the fields available, only tab_options$qname uniquely identifies every possible question-response in the survey (i.e., as entered in the web UI). tab_desc$qname is also necessarily a set of unique values, consisting of the column names of the exported data. Because their lengths differ, however, a simple 1-to-1 join isn’t possible.
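A candidate key can be screened with a quick duplicate check before attempting a join; this sketch uses base R with illustrative values, not the survey’s real identifiers:

```r
# A field can serve as a 1-to-1 merge key only if it contains no duplicates.
# anyDuplicated() returns 0 when a vector has no repeated values.
is_unique_key <- function(x) anyDuplicated(x) == 0

is_unique_key(c("Q1.1", "Q1.2", "Q1.3"))  # -> TRUE: usable as a key
is_unique_key(c("QID1", "QID1", "QID2"))  # -> FALSE: repeats per sub-question
```

The tabyl() frequency tables that follow apply the same logic dataset by dataset, sorting by count so duplicated values surface at the top.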

# test duplication
tabyl(tab_options$qid) %>% arrange(desc(n)) %>% head() %>%
  kbl() %>%
  kable_styling()# duplicates
tab_options$qid n percent
QID367 132 0.0888889
QID357 120 0.0808081
QID373 108 0.0727273
QID403 96 0.0646465
QID411 75 0.0505051
QID361 72 0.0484848
tabyl(data_out$qid) %>% arrange(desc(n)) %>% head() %>%
  kbl() %>%
  kable_styling() # duplicates
data_out$qid n percent
QID167 138 0.0969782
QID11 107 0.0751933
QID162 79 0.0555165
QID165 63 0.0442727
QID163 49 0.0344343
QID101 40 0.0281096
tabyl(data_out$qnum) %>% arrange(desc(n)) %>% head() %>%
    kbl() %>%
  kable_styling()# duplicates
data_out$qnum n percent
Q1.37 138 0.0969782
Q1.32 107 0.0751933
Q1.33 79 0.0555165
Q1.35 63 0.0442727
Q1.34 49 0.0344343
Q20.6 40 0.0281096
tabyl(tab_desc$desc) %>% arrange(desc(n)) %>% head() %>%
  kbl() %>%
  kable_styling()# duplicates
tab_desc$desc n percent
Provide details in comment box: 32 0.0207523
Provide explanation in comment box: 8 0.0051881
Was guidance provided by primary Supervisor? 6 0.0038911
Comment Box: - Other - OTI 1 - Enter Reason 2 0.0012970
Comment Box: - Other - OTI 10 - Enter Reason 2 0.0012970
Comment Box: - Other - OTI 11 - Enter Reason 2 0.0012970
tabyl(tab_desc$qname) %>% 
  filter(grepl("Q[0-9]",`tab_desc$qname`)) %>%
  arrange(desc(n)) %>% 
  head() %>%
  kbl() %>%
  kable_styling() # unique
tab_desc$qname n percent
Q1.2 1 0.0006485
Q1.3 1 0.0006485
Q1.4 1 0.0006485
Q1.4_6_TEXT 1 0.0006485
Q1.5 1 0.0006485
Q1.6 1 0.0006485
tabyl(tab_options$qname) %>% arrange(desc(n)) %>% head() %>%
  kbl() %>%
  kable_styling() # unique
tab_options$qname n percent
Q1.1 1 0.0006734
Q1.10 1 0.0006734
Q1.11 1 0.0006734
Q1.12 1 0.0006734
Q1.13 1 0.0006734
Q1.14 1 0.0006734

tab_options$qid, data_out$qid, and data_out$qnum all share the same length and uniquely identify each survey question. However, data_out$qnum uses the labeling selected by the user in Qualtrics, while qid in both instances is the internal Qualtrics identifier for that question, usually based on the order of question creation.

# test length
getuniquelen(tab_options$qid) # len 1
[1] 232
getuniquelen(data_out$qid) # len 1
[1] 232
getuniquelen(data_out$qnum) # len 1
[1] 232
getuniquelen(tab_desc$desc) # len diff
[1] 1481
getuniquelen(tab_desc$qname) # len diff
[1] 1542
getuniquelen(tab_options$qname) # len diff
[1] 1485

Finally, there is some overlap between tab_options$qname and tab_desc$qname: a survey question, as indicated by its QID or the root of its qname, appears multiple times in each dataset either because it comprises multiple sub-questions or because its responses are recorded as if there existed multiple sub-questions (the binary-indicator example mentioned above). It is also clear that substrings of the qname values in both datasets map exactly to qnum in data_out.

# test overlap
len(intersect(tab_options$qid,data_out$qnum))
[1] 0
len(setdiff(tab_options$qid,data_out$qid)) # full overlap
[1] 0
len(intersect(tab_options$qname,tab_desc$qname)) # some overlap
[1] 948
len(setdiff(tab_options$qname,tab_desc$qname)) # some overlap
[1] 537

Solution

data_out can be joined, using qnum, with parsed versions of qname from both tab_desc and tab_options. However, because a 1-to-1 relationship should exist between tab_desc and tab_options for much of the data and a 1-to-many relationship between either of them and data_out, the first two must be merged before joining the result to the third dataset.

The following table shows the pattern of candidate merge-key column values; every digit is replaced with a 0, so the number of 0s reflects the number of digits. Embedded-data fields, calculated fields, and Qualtrics-generated fields in the exported data are excluded.

temp1  <- grep("Q[0-9]",mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_desc$qname)))),val=T)
temp2  <- mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_options$qname))))
temp3  <- mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_options$qid))))
temp4  <- mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(data_out$qid))))
temp5  <- mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(data_out$qnum))))

cbind(
  paste0(temp1, collapse = ", ")
  ,paste0(temp2, collapse = ", ")
  ,paste0(temp3, collapse = ", ")
  ,paste0(temp4, collapse = ", ")
  ,paste0(temp5, collapse = ", ")) %>% `colnames<-`(
    c( "tab_desc`$`qname"
       , "tab_options`$`qname"
       , "tab_options`$`qid"
       , "data_out`$`qid"
       , "data_out`$`qnum") ) %>%
  kbl() %>%
  kable_styling()
tab_desc$qname tab_options$qname tab_options$qid data_out$qid data_out$qnum
0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.0 - Parent Topics, Q00.0 - Parent Topics, Q0.0 - Sentiment, Q00.0 - Sentiment, Q0.0 - Sentiment Polarity, Q00.0 - Sentiment Polarity, Q0.0 - Sentiment Score, Q00.0 - Sentiment Score, Q0.0 - Topic Sentiment Label, Q00.0 - Topic Sentiment Label, Q0.0 - Topic Sentiment Score, Q00.0 - Topic Sentiment Score, Q0.0 - Topics, Q00.0 - Topics, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.00#0_0_0, Q0.00#0_00_0, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00_0, Q0.0_0, Q0.0_00, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.0_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT Q0.0, Q0.00, Q00.0, Q00.00, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00#0_0_0_TEXT, Q0.00#0_00_0_TEXT, Q0.0#0_0_0_TEXT, Q0.0#0_00_0_TEXT, Q0.0(0), Q0.0(00), Q0.0_0, Q0.0_00, Q0.00_0, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q0.00_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT, Q0.0_TEXT, Q00.0_x0, Q00.0_x00, Q00.0_xx0, Q00.0_xx0_0 QID0, QID000, QID00 QID0, QID000, QID00 Q0.0, Q0.00, Q00.0, Q00.00

Parsing tab_options$qname

The exported data field names should be left unmodified, so I chose to harmonize tab_options$qname to tab_desc$qname instead.

setdiff(temp2,temp1)
 [1] "Q0.00#0_0_0_TEXT"  "Q0.00#0_00_0_TEXT" "Q0.0#0_0_0_TEXT"  
 [4] "Q0.0#0_00_0_TEXT"  "Q0.0(0)"           "Q0.0(00)"         
 [7] "Q0.00_0_TEXT"      "Q0.0_TEXT"         "Q00.0_x0"         
[10] "Q00.0_x00"         "Q00.0_xx0"         "Q00.0_xx0_0"      
found <- grepl(pattern = "\\(", x = tab_options$qname)

temp <- paste0(
  gsub(".*\\((.*)\\).*", "\\1", tab_options$qname),"_",tab_options$qname)
tab_options$qname_cleaned <- gsub("\\s*\\([^\\)]+\\)","",as.character(temp))

tab_options<- tab_options %>% 
  rename(qname_orig = qname
         ,qname = qname_cleaned) %>% 
  mutate(rownum_tab_options = seq(nrow(tab_options))
)
tab_options$qname[!found] <- tab_options$qname_orig[!found]


found <- grepl(pattern = "x", x = tab_options$qname)

tab_options$qname_cleaned <- gsub("\\s*x","",as.character(tab_options$qname))

tab_options <- tab_options %>% select(!qname_orig) %>% rename(qname_orig = qname
         ,qname = qname_cleaned) 
tab_options$qname[!found] <- tab_options$qname_orig[!found]

temp1  <- grep("Q[0-9]",mixedsort(getunique(gsub(pattern = "[0-9]"
                                                 ,replacement = "0"
                                                 ,x = getunique(tab_desc$qname)))),val=T)
temp2  <- mixedsort(getunique(gsub(pattern = "[0-9]"
                                   ,replacement = "0"
                                   ,x = getunique(tab_options$qname))))
setdiff(temp2,temp1)
[1] "Q0.00#0_0_0_TEXT"  "Q0.00#0_00_0_TEXT" "Q0.0#0_0_0_TEXT"  
[4] "Q0.0#0_00_0_TEXT"  "Q0.00_0_TEXT"      "Q0.0_TEXT"        
"Q0.0_TEXT" %in% temp1
[1] FALSE
"Q0.0_TEXT" %in% temp2
[1] TRUE
temp1  <- paste0(grep("Q[0-9]",mixedsort(getunique(gsub(pattern = "[0-9]",replacement = "0"
,x = getunique(tab_desc$qname)))),val=T), collapse = ", ")
temp2  <- paste0(mixedsort(getunique(
  gsub(pattern = "[0-9]"
       ,replacement = "0"
       ,x = getunique(tab_options$qname)))), collapse = ", ")
temp3  <- paste0(mixedsort(getunique(
  gsub(pattern = "[0-9]" 
       ,replacement = "0"
       ,x = getunique(tab_options$qid)))), collapse = ", ")
temp4  <- paste0(mixedsort(getunique(
  gsub(pattern = "[0-9]"
       ,replacement = "0"
       ,x = getunique(data_out$qid)))), collapse = ", ")
temp5  <- paste0(mixedsort(getunique(
  gsub(pattern = "[0-9]"
       ,replacement = "0"
       ,x = getunique(data_out$qnum)))), collapse = ", ")

cbind(
  temp1
  ,temp2
  ,temp3
  ,temp4
  ,temp5) %>% `colnames<-`(
    c( "tab_desc`$`qname"
       , "tab_options`$`qname"
       , "tab_options`$`qid"
       , "data_out`$`qid"
       , "data_out`$`qnum") ) %>%
  kbl() %>%
  kable_styling()
tab_desc$qname tab_options$qname tab_options$qid data_out$qid data_out$qnum
0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.0 - Parent Topics, Q00.0 - Parent Topics, Q0.0 - Sentiment, Q00.0 - Sentiment, Q0.0 - Sentiment Polarity, Q00.0 - Sentiment Polarity, Q0.0 - Sentiment Score, Q00.0 - Sentiment Score, Q0.0 - Topic Sentiment Label, Q00.0 - Topic Sentiment Label, Q0.0 - Topic Sentiment Score, Q00.0 - Topic Sentiment Score, Q0.0 - Topics, Q00.0 - Topics, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.00#0_0_0, Q0.00#0_00_0, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00_0, Q0.0_0, Q0.0_00, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.0_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT 0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00#0_0_0_TEXT, Q0.00#0_00_0_TEXT, Q0.0#0_0_0_TEXT, Q0.0#0_00_0_TEXT, Q0.0_0, Q0.0_00, Q0.00_0, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.00_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT, Q0.0_TEXT QID0, QID000, QID00 QID0, QID000, QID00 Q0.0, Q0.00, Q00.0, Q00.00

Final Stretch

data_out <- data_out %>% mutate(case = paste0(qtype,"_",qslct,"_",q_sbsl))

tab_values <- merge(tab_desc,tab_options
                    ,by = "qname"
                    , suffixes = c("_tbdesc","_tbopt"),all = T) 
tab_values <- tab_values %>% 
    extract(qname, c("QROOT", "B"), "(^Q[0-9]+.[0-9]+)|^[0-9]+_(Q[0-9]+.[0-9]+)") %>%
    mutate(QROOT = case_when(QROOT!="" ~ QROOT
                             ,TRUE ~B )
           , rownum_tab_values = seq(nrow(.))) %>%
      select(QROOT) %>%
  bind_cols(tab_values) %>% 
  mutate(QROOT = case_when(is.na(QROOT) ~ as.character(qname)
                           ,TRUE ~QROOT))

data_out_dist <- data_out  %>% 
  distinct(qnum, .keep_all = TRUE)

data_out_exampled <- full_join(data_out_dist
                               , tab_values
                               , by=c("qnum"="QROOT")) %>% 
  mutate(rownum_data_out_exampled =  seq(nrow(.)))


(yt <- getunique(
  gsub(pattern = "[0-9]+"
       ,replacement = "0"
       ,x = getunique(tab_values$qname))))
 [1] "StartDate"                    "EndDate"                     
 [3] "Status"                       "IPAddress"                   
 [5] "Progress"                     "Duration (in seconds)"       
 [7] "Finished"                     "RecordedDate"                
 [9] "ResponseId"                   "RecipientLastName"           
[11] "RecipientFirstName"           "RecipientEmail"              
[13] "ExternalReference"            "LocationLatitude"            
[15] "LocationLongitude"            "DistributionChannel"         
[17] "UserLanguage"                 "Q_RelevantIDDuplicate"       
[19] "Q_RelevantIDDuplicateScore"   "Q_RelevantIDFraudScore"      
[21] "Q_RelevantIDLastStartDate"    "Q0.0"                        
[23] "Q0.0_0_TEXT"                  "Q0.0_0"                      
[25] "Q0.0#0_0_0"                   "Q0.0#0_0"                    
[27] "0_Q0.0"                       "Q0.0_0_0"                    
[29] "FSFN"                         "TicketKey0"                  
[31] "Q_R"                          "ReviewerEmail"               
[33] "Q_RelatedResponseField"       "Q_RelatedResponseFieldValue" 
[35] "ResponseID"                   "RetakeLinkNew"               
[37] "CountyCombined"               "Q_TicketTeam"                
[39] "Q_TicketOwner"                "Supervisor"                  
[41] "Manager"                      "Region Unit Combined"        
[43] "Q_URL"                        "Skills Assessment QO Survey" 
[45] "RetakeLinkReplace"            "Q0.0 - Parent Topics"        
[47] "Q0.0 - Sentiment Polarity"    "Q0.0 - Sentiment Score"      
[49] "Q0.0 - Sentiment"             "Q0.0 - Topic Sentiment Label"
[51] "Q0.0 - Topic Sentiment Score" "Q0.0 - Topics"               
[53] "Q0.0#0_0_0_TEXT"              "Q0.0_TEXT"                   
(yt <- sort(yt[grep("Q.*\\.",yt)]))
 [1] "0_Q0.0"                       "Q0.0"                        
 [3] "Q0.0 - Parent Topics"         "Q0.0 - Sentiment"            
 [5] "Q0.0 - Sentiment Polarity"    "Q0.0 - Sentiment Score"      
 [7] "Q0.0 - Topic Sentiment Label" "Q0.0 - Topic Sentiment Score"
 [9] "Q0.0 - Topics"                "Q0.0#0_0"                    
[11] "Q0.0#0_0_0"                   "Q0.0#0_0_0_TEXT"             
[13] "Q0.0_0"                       "Q0.0_0_0"                    
[15] "Q0.0_0_TEXT"                  "Q0.0_TEXT"                   
strs.vec <- yt
patts <- c("_", "#", "TEXT", "^Q", "^Q[0-9]+.[0-9]+$")
PatternInfo(patts, strs.vec)
                             length  _  # TEXT ^Q ^Q[0-9]+.[0-9]+$
0_Q0.0                            6  2 -1   -1 -1               -1
Q0.0                              4 -1 -1   -1  1                1
Q0.0 - Parent Topics             20 -1 -1   -1  1               -1
Q0.0 - Sentiment                 16 -1 -1   -1  1               -1
Q0.0 - Sentiment Polarity        25 -1 -1   -1  1               -1
Q0.0 - Sentiment Score           22 -1 -1   -1  1               -1
Q0.0 - Topic Sentiment Label     28 -1 -1   -1  1               -1
Q0.0 - Topic Sentiment Score     28 -1 -1   -1  1               -1
Q0.0 - Topics                    13 -1 -1   -1  1               -1
Q0.0#0_0                          8  7  5   -1  1               -1
Q0.0#0_0_0                       10  7  5   -1  1               -1
Q0.0#0_0_0_TEXT                  15  7  5   12  1               -1
Q0.0_0                            6  5 -1   -1  1               -1
Q0.0_0_0                          8  5 -1   -1  1               -1
Q0.0_0_TEXT                      11  5 -1    8  1               -1
Q0.0_TEXT                         9  5 -1    6  1               -1
grep("Q[0-9]+.[0-9]+_[0-9]+",c("Q22.0#0_0","Q43.6_1","Q22.0#0_0_0_TEXT","Q1.3" ), val=T)
[1] "Q43.6_1"
grep("Q[0-9]+.[0-9]+",c("Q22.0#0_0","Q43.6_1","Q22.0#0_0_0_TEXT","Q1.3" ), val=T)
[1] "Q22.0#0_0"        "Q43.6_1"          "Q22.0#0_0_0_TEXT"
[4] "Q1.3"            
patts <- getunique(gsub(replacement = "[0-9]+", pattern = "0",x = yt))
patts <- sprintf("^%s$",patts)
strs.vec <- as.character(data_out_exampled$qname)
names(strs.vec) <- data_out_exampled$rownum_data_out_exampled
matched.mat <- as_tibble(((PatternInfo(patts, strs.vec)) >-1)*1) %>%
  select(-length) %>% 
  mutate(qname = strs.vec
         , rownum_formatched = names(strs.vec)) %>% 
  select(where(function(x) any((x!=0)))) %>%
  filter()

# empty == good...if not empty, pattern applies to multiple question types
matched.mat %>% 
  select(contains("Q[")) %>% 
  mutate(rowsum= rowSums(.)) %>% 
  filter(rowsum >1)
# A tibble: 0 x 17
# ... with 17 variables: ^[0-9]+_Q[0-9]+.[0-9]+$ <dbl>,
#   ^Q[0-9]+.[0-9]+$ <dbl>, ^Q[0-9]+.[0-9]+ - Parent Topics$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Sentiment$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Sentiment Polarity$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Sentiment Score$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Topic Sentiment Label$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Topic Sentiment Score$ <dbl>, ...
# create question type pattern indicator
dat_matched <- merge(data_out_exampled
                     , matched.mat
                     , by.x="rownum_data_out_exampled"
                     , by.y =  "rownum_formatched")

# get row ids for column names with no patterns matched
dat_matched <- dat_matched %>% rowwise() %>% 
  mutate(dummysums = sum(c_across(contains("Q[")), na.rm = T)) %>% 
  ungroup()
nonmatch <- dat_matched %>% filter(dummysums==0) # non-matched column rows
match <- dat_matched %>% filter(dummysums==1)
match <- (match %>% select(contains("Q[")) %>% 
            dedummy(.) %>% cbind(match)) %>% 
  rename("pattmatched"=".")  %>% 
  mutate(groupmatached = paste0(case, "__",as.numeric(as.factor((pattmatched)))))  %>% 
  select(
  qnum
  , pattmatched
  ,groupmatached
  , textEntry
  , choice
  , desc
  , qname.x
  , case
  ) %>% rename_all(~stringr::str_replace(.,"\\.x","")) %>% filter(!is.na(desc))


# check if all questions in final data are in this match set
setdiff(match$qname, names(dat)) # some questions were never answered, so they do not appear as column names in the data
 [1] "Q6.1 - Parent Topics"          "Q6.1 - Sentiment Polarity"    
 [3] "Q6.1 - Sentiment Score"        "Q6.1 - Sentiment"             
 [5] "Q6.1 - Topic Sentiment Label"  "Q6.1 - Topic Sentiment Score" 
 [7] "Q6.1 - Topics"                 "Q8.4 - Parent Topics"         
 [9] "Q8.4 - Sentiment Polarity"     "Q8.4 - Sentiment Score"       
[11] "Q8.4 - Sentiment"              "Q8.4 - Topic Sentiment Label" 
[13] "Q8.4 - Topic Sentiment Score"  "Q8.4 - Topics"                
[15] "Q20.3 - Parent Topics"         "Q20.3 - Sentiment Polarity"   
[17] "Q20.3 - Sentiment Score"       "Q20.3 - Sentiment"            
[19] "Q20.3 - Topic Sentiment Label" "Q20.3 - Topic Sentiment Score"
[21] "Q20.3 - Topics"               
setdiff(names(dat), match$qname)
 [1] "StartDate"                   "EndDate"                    
 [3] "Status"                      "IPAddress"                  
 [5] "Progress"                    "Duration (in seconds)"      
 [7] "Finished"                    "RecordedDate"               
 [9] "ResponseId"                  "RecipientLastName"          
[11] "RecipientFirstName"          "RecipientEmail"             
[13] "ExternalReference"           "LocationLatitude"           
[15] "LocationLongitude"           "DistributionChannel"        
[17] "UserLanguage"                "Q_RelevantIDDuplicate"      
[19] "Q_RelevantIDDuplicateScore"  "Q_RelevantIDFraudScore"     
[21] "Q_RelevantIDLastStartDate"   "Q1.23_5"                    
[23] "Q1.23_6"                     "FSFN"                       
[25] "TicketKey1"                  "Q_R"                        
[27] "ReviewerEmail"               "Q_RelatedResponseField"     
[29] "Q_RelatedResponseFieldValue" "ResponseID"                 
[31] "RetakeLinkNew"               "CountyCombined"             
[33] "Q_TicketTeam"                "Q_TicketOwner"              
[35] "Supervisor"                  "Manager"                    
[37] "Region Unit Combined"        "Q_URL"                      
[39] "Skills Assessment QO Survey" "RetakeLinkReplace"          
# intersect(names(dat), match$qname)

case_patterns <- match %>% 
  group_by(pattmatched, case, groupmatached) %>% 
  count(pattmatched, case, groupmatached) %>% 
  mutate(grouppattmatched = paste0(pattmatched,"::",case))  %>% 
  ungroup() %>% 
  mutate(row = row_number())
pattmatched case groupmatached n grouppattmatched row rules
^[0-9]+_Q[0-9]+.[0-9]+$ MC_SAVR_TX MC_SAVR_TX__1 90 ^[0-9]+_Q[0-9]+.[0-9]+$::MC_SAVR_TX 1 qnum
^[0-9]+_Q[0-9]+.[0-9]+$ TE_ESTB_NA TE_ESTB_NA__1 30 ^[0-9]+_Q[0-9]+.[0-9]+$::TE_ESTB_NA 2 qnum
^Q[0-9]+.[0-9]+#[0-9]+_[0-9]+$ SBS_SBSMatrix_NA SBS_SBSMatrix_NA__2 150 ^Q[0-9]+.[0-9]+#[0-9]+_[0-9]+$::SBS_SBSMatrix_NA 3 qnum
^Q[0-9]+.[0-9]+#[0-9]+[0-9]+[0-9]+$ SBS_SBSMatrix_NA SBS_SBSMatrix_NA__3 175 ^Q[0-9]+.[0-9]+#[0-9]+[0-9]+[0-9]+$::SBS_SBSMatrix_NA 4 qnum
^Q[0-9]+.[0-9]+$ MC_DL_NA MC_DL_NA__5 32 ^Q[0-9]+.[0-9]+$::MC_DL_NA 5 qnum
^Q[0-9]+.[0-9]+$ MC_SAVR_TX MC_SAVR_TX__5 70 ^Q[0-9]+.[0-9]+$::MC_SAVR_TX 6 qnum
^Q[0-9]+.[0-9]+$ TE_ESTB_NA TE_ESTB_NA__5 49 ^Q[0-9]+.[0-9]+$::TE_ESTB_NA 7 qnum
^Q[0-9]+.[0-9]+$ TE_SL_NA TE_SL_NA__5 13 ^Q[0-9]+.[0-9]+$::TE_SL_NA 8 qnum
^Q[0-9]+.[0-9]+_[0-9]+$ Matrix_Likert_SingleAnswer Matrix_Likert_SingleAnswer__6 131 ^Q[0-9]+.[0-9]+_[0-9]+$::Matrix_Likert_SingleAnswer 9 qnum
^Q[0-9]+.[0-9]+_[0-9]+$ MC_MACOL_TX MC_MACOL_TX__6 81 ^Q[0-9]+.[0-9]+_[0-9]+$::MC_MACOL_TX 10 qname
^Q[0-9]+.[0-9]+_[0-9]+$ MC_MAVR_TX MC_MAVR_TX__6 161 ^Q[0-9]+.[0-9]+_[0-9]+$::MC_MAVR_TX 11 qname
^Q[0-9]+.[0-9]+_[0-9]+$ TE_AUTO_STATIC TE_AUTO_STATIC__6 1 ^Q[0-9]+.[0-9]+_[0-9]+$::TE_AUTO_STATIC 12 qnum
^Q[0-9]+.[0-9]+_[0-9]+$ TE_FORM_NA TE_FORM_NA__6 4 ^Q[0-9]+.[0-9]+_[0-9]+$::TE_FORM_NA 13 colondash
^Q[0-9]+.[0-9]+[0-9]+[0-9]+$ Matrix_Likert_MultipleAnswer Matrix_Likert_MultipleAnswer__7 358 ^Q[0-9]+.[0-9]+[0-9]+[0-9]+$::Matrix_Likert_MultipleAnswer 14 underscore
^Q[0-9]+.[0-9]+_[0-9]+_TEXT$ MC_MAVR_TX MC_MAVR_TX__8 3 ^Q[0-9]+.[0-9]+_[0-9]+_TEXT$::MC_MAVR_TX 15 text
^Q[0-9]+.[0-9]+_[0-9]+_TEXT$ MC_SAVR_TX MC_SAVR_TX__8 1 ^Q[0-9]+.[0-9]+_[0-9]+_TEXT$::MC_SAVR_TX 16 text2
# update rules table if survey gets new types of questions since last run
case_patterns_ruled <- read.csv("case_patterns_ruled.csv")
patterns_new <- case_patterns[!case_patterns$grouppattmatched %in% case_patterns_ruled$grouppattmatched,]

# open rule setting table to add new rules from clipboard
if(nrow(patterns_new) > 0){
  # clipr::write_clip(patterns_new)
  # browseURL("case_patterns_ruled.csv")
}
NULL
# after updates to ruled table (using troubleshoot.R to establish new rules)
case_patterns <- merge(case_patterns_ruled[c("grouppattmatched","rules")]
                       , case_patterns
                       , "grouppattmatched")
 
rule <- list()
case <- list()
rulenames <- getunique(case_patterns$rules)
for (name_j in rulenames){
  rule[name_j] <- case_patterns %>% filter(rules==name_j) %>% select(groupmatached)
  case[name_j] <- case_patterns %>% filter(rules==name_j) %>% select(case)
}

matchout <- list()

for (i in rule$qnum){
  temp <- match %>% filter(groupmatached==i)
  matchout <- bind_rows(matchout,merge(temp,data_out,"qnum"))
}

for (i in rule$qname){
  temp <- match %>% filter(groupmatached==i)
  temp2 <- data_out %>% mutate(qname = paste0(qnum,"_", q_opt_num))
  matchout <- bind_rows(matchout,merge(temp,temp2,"qname"))
}

for (i in rule$colondash){
  temp <- match %>% filter(groupmatached==i)
  temp <- temp %>% pull(desc) %>% 
    str_split(string= ., pattern = ": - ", simplify = T) %>% 
    as.data.frame() %>% 
    mutate_if(., is.character, ~trimws(.)) %>% 
    rename(qtext_stripped=V1, q_opt_txt=V2) %>% 
    cbind(temp) 
  matchout <- bind_rows(matchout,merge(
    temp
    ,data_out
    , c("qtext_stripped","q_opt_txt")))
}

for (rx in seq_along(rule$underscore)){
  i <- rule$underscore[rx]
  j <- case$underscore[rx]
  temp <- match %>% filter(groupmatached==i)
  temp <- temp %>% pull(qname) %>% str_split(string= ., pattern = "_", simplify = T) %>% 
    as.data.frame() %>% 
    mutate_if(., is.character, ~trimws(.)) %>% 
    mutate(qnum_q_opt_num = paste0(V1,"::",V3)) %>%
    cbind(temp) 
  temp2 <- data_out %>% filter(!is.na(q_opt_num), case == j) %>% 
    mutate(qnum_q_opt_num = paste0(qnum, "::", q_opt_num))
  matchout <- bind_rows(matchout,left_join(temp,temp2,"qnum_q_opt_num"))
}

for (i in rule$text){
  temp <- match %>% filter(groupmatached==i) %>%
    mutate(qid = str_extract(textEntry, "[a-zA-Z0-9]+(?=\\.)"))
  newcols <- setdiff(names(data_out),names(temp))
  temp2 <- data.frame(matrix(nrow=nrow(temp), ncol = len(newcols)))
  colnames(temp2) <- newcols
  matchout <- bind_rows(matchout,bind_cols(temp, temp2))
}

for (i in rule$text2){
  temp <- match %>% filter(groupmatached==i) %>% arrange(qnum)
  temp$qid <- data_out %>% filter(qnum %in% temp$qnum) %>% 
    arrange(qnum) %>% pull(qid) %>% getunique()
  newcols <- setdiff(names(data_out),names(temp))
  temp2 <- data.frame(matrix(nrow=nrow(temp), ncol = len(newcols)))
  colnames(temp2) <- newcols
  matchout <- bind_rows(matchout,bind_cols(temp, temp2))
}

matchout$qnum <- matchout %>%
  select(contains("qnum"),-contains("qnum_q_opt_num")) %>% 
  coalesce_df(everything()) %>% pull(result)

# empty = good here
matchout %>% group_by(qname, q_opt_num) %>% 
  count(qname, q_opt_num) %>% 
  filter(n>1) %>% arrange(desc(n))
# A tibble: 0 x 3
# Groups:   qname, q_opt_num [0]
# ... with 3 variables: qname <fct>, q_opt_num <chr>, n <int>
setdiff(matchout$qname,names(dat))
character(0)
xy <- matchout %>% 
  select(
    qnum
    ,qname
    ,qid
    ,desc
    ,qtext
    ,qtype
    ,qslct
    ,q_sbsl
    ,q_frce
    ,q_rule
    ,q_opt_txt
    ,q_opt_num
  ) 

xy <- xy %>% 
  pull(qnum) %>% 
  str_split(string = ., pattern = "\\.", simplify = T) %>% 
  as.data.frame() %>% 
  mutate_at("V2", as.numeric) %>% 
  cbind(xy) %>% 
  arrange(V1, V2) %>% 
    mutate(qid = case_when(is.na(qid) ~ "TEXT", TRUE~qid)
           , updated = Sys.time()
           ,rownum = row_number()) %>% 
  select(rownum, updated, names(xy))


##############################
# Output :-)
##############################

write.csv(xy, sprintf("captured__%s.csv",
       format(Sys.time(), "%m_%d_%Y__H%H")),row.names = F)

Example Rows From Result