Creating a Qualtrics Documenter

Qualtrics R Excel Regex Github Data Documentation

I import data from Qualtrics into R and create a codebook, or documentation spreadsheet, that can be used for question labelling and recoding of response values (see the renaming-script project).

Joshua Scriven
2023-03-10

To avoid inadvertently sharing sensitive client data, this example uses a simplified dataset that mimics most of the issues encountered in the actual project.

Purpose:

Create a codebook that lists every question from a Qualtrics survey, indicating the corresponding column(s) in the response data along with every possible response value (for non-open-ended questions). The codebook can be used for quickly cross-referencing exported response data against the survey questionnaire. It can also be used as a tool for renaming exported data columns to more meaningful names. The Qualtrics question-labeling option was set to “Block”, but the regex patterns contained below can be updated to apply to any of the other options.
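As a quick illustration of that labeling scheme (the values here are made up, not client data), Block-style question names take the form Q&lt;block&gt;.&lt;question&gt;, which is the shape the regex patterns in this project assume:

```r
# Block-style labels look like "Q<block>.<question>", e.g., Q1.2 or Q20.6.
block_pattern <- "^Q[0-9]+\\.[0-9]+"
grep(block_pattern, c("Q1.2", "Q20.6", "StartDate"), value = TRUE)
# -> "Q1.2" "Q20.6"
# For the "Default" labeling option (Q1, Q2, ...), the analogous anchor
# pattern would be "^Q[0-9]+$".
```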

Joshua’s work principles applied to this project:

Load my custom functions from GitHub

source("https://raw.githubusercontent.com/joshuascriven/helper_functions/main/helper_functions.R")

Load CRAN packages

packageloader(c(
"tidyverse"
, "dplyr"
, "knitr"
, "qualtRics"
, "odbc"
, "lubridate"
, "stringr"
, "anytime"
, "tm"
, "fastDummies"
, "jtools"
, "janitor"
, "here"
, "data.table"
, "openxlsx"
, "GrpString"
, "kableExtra"
, "gtools"
, "flextable"
))
[1] "packages loaded"

Import data and data conversion tables

Connecting to the Qualtrics survey data requires an API token. I retrieve my credentials from a local file in a centralized location using my getKey() function. I query the API to fetch three types of Qualtrics data using the R “qualtRics” package. I always label projects so I can cross-reference them in emails, system folders, and code searches.
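getKey() comes from my helper functions loaded above; the following is only a minimal sketch of the idea, assuming a two-column key/value TSV stored outside the repository (the path and column names here are illustrative, not the real implementation):

```r
# Sketch of a credential lookup: read a local key/value TSV and return the
# value for a named key. The path and column names are assumptions.
getKey_sketch <- function(name, path = "~/keys/credentials.tsv") {
  keys <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
  keys$value[keys$key == name]
}
# Usage: api_key <- getKey_sketch("qualtrics_api_key")
```

Keeping credentials in one local file outside version control avoids hard-coding tokens in scripts that get shared or pushed to GitHub.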

data_out (Qualtrics Questionnaire Metadata)

I use metadata() to get metadata for each question, cleaning question text of HTML and leading/trailing whitespace while retaining the original text values. I create row indicators for any imported data to help with later joins and troubleshooting.

data_dir <- "../data/"
projname <- "qo_request_JS22"

api_key <- read.table(filemaker("api_key","tsv"), header = TRUE)

# Get Qualtrics data
qualtrics_api_credentials(
  api_key = getKey("qualtrics_api_key")
  , base_url = getKey("qualtrics_base_url")
  , install = TRUE,overwrite = T); readRenviron("~/.Renviron")

sid <- all_surveys()$id[grep("^Li.*To.* TEST COPY$",all_surveys()$name)]

dat <- fetch_survey(surveyID = sid,
                         verbose = TRUE,force_request = T)

# Extract Q metadata
md <- metadata(sid)
q_rule <- stack((lapply(md$questions, function(x) as.character(x$validation))))

df_list <- list(
  qnum     = stack(lapply(md$questions, function(x) x$questionName))
  , qid    = data.frame(ind = names(md$questions))
  , qtext  = stack(lapply(md$questions, function(x) cleanhtml(x$questionText)))
  , qtype  = stack(lapply(md$questions, function(x) x$questionType$type))
  , qslct  = stack(lapply(md$questions, function(x) x$questionType$selector))
  , q_sbsl = stack(lapply(md$questions
                          , function(x) as.character(x$questionType$subSelector)))
  , q_frce = stack(lapply(md$questions
                          , function(x) x$validation$doesForceResponse))
  , q_rule = aggregate(values ~ ind
                  , FUN = paste
                  , collapse = " ; "
                  , data = q_rule)
  , q_opt_txt  = cbind(cbind(stack(lapply(md$questions
                    , function(w) lapply(w$choices, function(z) z$choiceText)))
                         , stack(lapply(md$questions
          , function(w) as.character(lapply(w$choices, function(z) z$recode))))
  ))[,-4]
  
)

data_out <- df_list %>% reduce(full_join, by='ind')
names(data_out) <- c(names(df_list),"q_opt_num")

data_out <- data_out %>% 
  mutate(
  source = "data_out"
  ,qtext_stripped = gsub(":$", "", trimws(qtext))
  ,rownum_data_out = seq(nrow(.))
)

tab_options (Response Metadata)

Next, I use column_map() to extract metadata on responses. Each row in the resulting data represents a response choice for a survey question.

tab_options <- tab_options_bac <- qualtRics::column_map(sid)
tab_options$source <- "tab_options"

tab_desc (Response Data)

Importantly, the metadata provided by metadata() and column_map() describe the questions in the survey instrument, not the final data that Qualtrics exports. Responses to “select all that apply” questions, for instance, produce a binary indicator for each choice option offered. These columns tend to follow a naming pattern that ties each set back to its survey question.
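A minimal, made-up illustration of that export shape (not the client data):

```r
# A "select all that apply" question Q5 with three choices exports as three
# binary indicator columns sharing the question's prefix:
resp <- data.frame(
  Q5_1 = c(1, 0, 1),  # respondent checked choice 1?
  Q5_2 = c(0, 0, 1),  # respondent checked choice 2?
  Q5_3 = c(1, 1, 0)   # respondent checked choice 3?
)
# One survey question, three exported columns:
grep("^Q5_", names(resp), value = TRUE)  # -> "Q5_1" "Q5_2" "Q5_3"
```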

tab_desc <- stack(lapply(dat, function(x) attributes(x)$label))
names(tab_desc) <- c("desc","qname") 
tab_desc <- tab_desc %>%
  mutate(rownum_tab_desc = seq(nrow(tab_desc))
         ,source = "tab_desc")

Merging datasets

When a question in the Qualtrics survey allows the respondent to select all options that apply, Qualtrics stores responses for that question type as a set of binary indicators, one for each of the n choices presented. A similar situation arises with Qualtrics matrix or grouped questions, where multiple responses are elicited from respondents for a set of questions sharing the same scale. Hence, metadata() returns a QID (question identifier) column whose length equals the number of columns in the exported data, while column_map() provides all response options for a given QID/exported data column. Qualtrics does not provide a key to merge these three datasets, so I use the following method to merge them.

Candidate Fields for Merge-key

A merge-key column is one that can uniquely identify the same observation across datasets. Of the fields available, only tab_options$qname uniquely identifies every possible question-response in the survey (i.e., as entered in the web UI). tab_desc$qname is also necessarily a set of unique values, consisting of the column names of the exported data. Because their lengths differ, however, a simple 1-to-1 join isn’t possible.
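A candidate key can be screened with a quick duplicate check before attempting a join; this sketch uses base R with illustrative values, not the survey’s real identifiers:

```r
# A field can serve as a 1-to-1 merge key only if it contains no duplicates.
# anyDuplicated() returns 0 when a vector has no repeated values.
is_unique_key <- function(x) anyDuplicated(x) == 0

is_unique_key(c("Q1.1", "Q1.2", "Q1.3"))  # -> TRUE: usable as a key
is_unique_key(c("QID1", "QID1", "QID2"))  # -> FALSE: repeats per sub-question
```

The tabyl() frequency tables that follow apply the same logic dataset by dataset, sorting by count so duplicated values surface at the top.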

# test duplication
tabyl(tab_options$qid) %>% arrange(desc(n)) %>% head() %>%
  kbl() %>%
  kable_styling()# duplicates
tab_options$qid n percent
QID367 132 0.0888889
QID357 120 0.0808081
QID373 108 0.0727273
QID403 96 0.0646465
QID411 75 0.0505051
QID361 72 0.0484848
tabyl(data_out$qid) %>% arrange(desc(n)) %>% head() %>%
  kbl() %>%
  kable_styling() # duplicates
data_out$qid n percent
QID167 138 0.0969782
QID11 107 0.0751933
QID162 79 0.0555165
QID165 63 0.0442727
QID163 49 0.0344343
QID101 40 0.0281096
tabyl(data_out$qnum) %>% arrange(desc(n)) %>% head() %>%
    kbl() %>%
  kable_styling()# duplicates
data_out$qnum n percent
Q1.37 138 0.0969782
Q1.32 107 0.0751933
Q1.33 79 0.0555165
Q1.35 63 0.0442727
Q1.34 49 0.0344343
Q20.6 40 0.0281096
tabyl(tab_desc$desc) %>% arrange(desc(n)) %>% head() %>%
  kbl() %>%
  kable_styling()# duplicates
tab_desc$desc n percent
Provide details in comment box: 32 0.0207523
Provide explanation in comment box: 8 0.0051881
Was guidance provided by primary Supervisor? 6 0.0038911
Comment Box: - Other - OTI 1 - Enter Reason 2 0.0012970
Comment Box: - Other - OTI 10 - Enter Reason 2 0.0012970
Comment Box: - Other - OTI 11 - Enter Reason 2 0.0012970
tabyl(tab_desc$qname) %>% 
  filter(grepl("Q[0-9]",`tab_desc$qname`)) %>%
  arrange(desc(n)) %>% 
  head() %>%
  kbl() %>%
  kable_styling() # unique
tab_desc$qname n percent
Q1.2 1 0.0006485
Q1.3 1 0.0006485
Q1.4 1 0.0006485
Q1.4_6_TEXT 1 0.0006485
Q1.5 1 0.0006485
Q1.6 1 0.0006485
tabyl(tab_options$qname) %>% arrange(desc(n)) %>% head() %>%
  kbl() %>%
  kable_styling() # unique
tab_options$qname n percent
Q1.1 1 0.0006734
Q1.10 1 0.0006734
Q1.11 1 0.0006734
Q1.12 1 0.0006734
Q1.13 1 0.0006734
Q1.14 1 0.0006734

tab_options$qid, data_out$qid, and data_out$qnum all share the same length and uniquely identify each survey question. However, data_out$qnum uses the labeling selected by the user in Qualtrics, while qid in both instances is the internal Qualtrics identifier for that question, usually based on the order of question creation.

# test length
getuniquelen(tab_options$qid) # len 1
[1] 232
getuniquelen(data_out$qid) # len 1
[1] 232
getuniquelen(data_out$qnum) # len 1
[1] 232
getuniquelen(tab_desc$desc) # len diff
[1] 1481
getuniquelen(tab_desc$qname) # len diff
[1] 1542
getuniquelen(tab_options$qname) # len diff
[1] 1485

Finally, there is some overlap between tab_options$qname and tab_desc$qname: a survey question, as indicated by its QID or the root of its qname, appears multiple times in each dataset either because it comprises multiple sub-questions or because its responses are recorded as if there existed multiple sub-questions (the binary-indicator example mentioned above). It is also clear that substrings of the qname values in both datasets map exactly to qnum in data_out.

# test overlap
len(intersect(tab_options$qid,data_out$qnum))
[1] 0
len(setdiff(tab_options$qid,data_out$qid)) # full overlap
[1] 0
len(intersect(tab_options$qname,tab_desc$qname)) # some overlap
[1] 948
len(setdiff(tab_options$qname,tab_desc$qname)) # some overlap
[1] 537

Solution

data_out can be joined, using qnum, with parsed versions of qname from both tab_desc and tab_options. However, because a 1-to-1 relationship should exist between tab_desc and tab_options for much of the data and a 1-to-many relationship between either of them and data_out, the first two must be merged before joining the result to the third dataset.

The following table shows the pattern of candidate merge-key column values; every digit is replaced with a 0, so the number of 0s reflects the number of digits. Embedded-data fields, calculated fields, and Qualtrics-generated fields in the exported data are excluded.

temp1  <- grep("Q[0-9]",mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_desc$qname)))),val=T)
temp2  <- mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_options$qname))))
temp3  <- mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_options$qid))))
temp4  <- mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(data_out$qid))))
temp5  <- mixedsort(getunique(
  gsub(pattern = "[0-9]",replacement = "0",x = getunique(data_out$qnum))))

cbind(
  paste0(temp1, collapse = ", ")
  ,paste0(temp2, collapse = ", ")
  ,paste0(temp3, collapse = ", ")
  ,paste0(temp4, collapse = ", ")
  ,paste0(temp5, collapse = ", ")) %>% `colnames<-`(
    c( "tab_desc`$`qname"
       , "tab_options`$`qname"
       , "tab_options`$`qid"
       , "data_out`$`qid"
       , "data_out`$`qnum") ) %>%
  kbl() %>%
  kable_styling()
tab_desc$qname tab_options$qname tab_options$qid data_out$qid data_out$qnum
0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.0 - Parent Topics, Q00.0 - Parent Topics, Q0.0 - Sentiment, Q00.0 - Sentiment, Q0.0 - Sentiment Polarity, Q00.0 - Sentiment Polarity, Q0.0 - Sentiment Score, Q00.0 - Sentiment Score, Q0.0 - Topic Sentiment Label, Q00.0 - Topic Sentiment Label, Q0.0 - Topic Sentiment Score, Q00.0 - Topic Sentiment Score, Q0.0 - Topics, Q00.0 - Topics, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.00#0_0_0, Q0.00#0_00_0, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00_0, Q0.0_0, Q0.0_00, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.0_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT Q0.0, Q0.00, Q00.0, Q00.00, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00#0_0_0_TEXT, Q0.00#0_00_0_TEXT, Q0.0#0_0_0_TEXT, Q0.0#0_00_0_TEXT, Q0.0(0), Q0.0(00), Q0.0_0, Q0.0_00, Q0.00_0, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q0.00_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT, Q0.0_TEXT, Q00.0_x0, Q00.0_x00, Q00.0_xx0, Q00.0_xx0_0 QID0, QID000, QID00 QID0, QID000, QID00 Q0.0, Q0.00, Q00.0, Q00.00

Parsing tab_options$qname

The exported data field names should be left unmodified, so I chose to harmonize tab_options$qname to tab_desc$qname instead.

setdiff(temp2,temp1)
 [1] "Q0.00#0_0_0_TEXT"  "Q0.00#0_00_0_TEXT" "Q0.0#0_0_0_TEXT"  
 [4] "Q0.0#0_00_0_TEXT"  "Q0.0(0)"           "Q0.0(00)"         
 [7] "Q0.00_0_TEXT"      "Q0.0_TEXT"         "Q00.0_x0"         
[10] "Q00.0_x00"         "Q00.0_xx0"         "Q00.0_xx0_0"      
found <- grepl(pattern = "\\(", x = tab_options$qname)

temp <- paste0(
  gsub(".*\\((.*)\\).*", "\\1", tab_options$qname),"_",tab_options$qname)
tab_options$qname_cleaned <- gsub("\\s*\\([^\\)]+\\)","",as.character(temp))

tab_options<- tab_options %>% 
  rename(qname_orig = qname
         ,qname = qname_cleaned) %>% 
  mutate(rownum_tab_options = seq(nrow(tab_options))
)
tab_options$qname[!found] <- tab_options$qname_orig[!found]


found <- grepl(pattern = "x", x = tab_options$qname)

tab_options$qname_cleaned <- gsub("\\s*x","",as.character(tab_options$qname))

tab_options <- tab_options %>% select(!qname_orig) %>% rename(qname_orig = qname
         ,qname = qname_cleaned) 
tab_options$qname[!found] <- tab_options$qname_orig[!found]

temp1  <- grep("Q[0-9]",mixedsort(getunique(gsub(pattern = "[0-9]"
                                                 ,replacement = "0"
                                                 ,x = getunique(tab_desc$qname)))),val=T)
temp2  <- mixedsort(getunique(gsub(pattern = "[0-9]"
                                   ,replacement = "0"
                                   ,x = getunique(tab_options$qname))))
setdiff(temp2,temp1)
[1] "Q0.00#0_0_0_TEXT"  "Q0.00#0_00_0_TEXT" "Q0.0#0_0_0_TEXT"  
[4] "Q0.0#0_00_0_TEXT"  "Q0.00_0_TEXT"      "Q0.0_TEXT"        
"Q0.0_TEXT" %in% temp1
[1] FALSE
"Q0.0_TEXT" %in% temp2
[1] TRUE
temp1  <- paste0(grep("Q[0-9]",mixedsort(getunique(gsub(pattern = "[0-9]",replacement = "0"
,x = getunique(tab_desc$qname)))),val=T), collapse = ", ")
temp2  <- paste0(mixedsort(getunique(
  gsub(pattern = "[0-9]"
       ,replacement = "0"
       ,x = getunique(tab_options$qname)))), collapse = ", ")
temp3  <- paste0(mixedsort(getunique(
  gsub(pattern = "[0-9]" 
       ,replacement = "0"
       ,x = getunique(tab_options$qid)))), collapse = ", ")
temp4  <- paste0(mixedsort(getunique(
  gsub(pattern = "[0-9]"
       ,replacement = "0"
       ,x = getunique(data_out$qid)))), collapse = ", ")
temp5  <- paste0(mixedsort(getunique(
  gsub(pattern = "[0-9]"
       ,replacement = "0"
       ,x = getunique(data_out$qnum)))), collapse = ", ")

cbind(
  temp1
  ,temp2
  ,temp3
  ,temp4
  ,temp5) %>% `colnames<-`(
    c( "tab_desc`$`qname"
       , "tab_options`$`qname"
       , "tab_options`$`qid"
       , "data_out`$`qid"
       , "data_out`$`qnum") ) %>%
  kbl() %>%
  kable_styling()
tab_desc$qname tab_options$qname tab_options$qid data_out$qid data_out$qnum
0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.0 - Parent Topics, Q00.0 - Parent Topics, Q0.0 - Sentiment, Q00.0 - Sentiment, Q0.0 - Sentiment Polarity, Q00.0 - Sentiment Polarity, Q0.0 - Sentiment Score, Q00.0 - Sentiment Score, Q0.0 - Topic Sentiment Label, Q00.0 - Topic Sentiment Label, Q0.0 - Topic Sentiment Score, Q00.0 - Topic Sentiment Score, Q0.0 - Topics, Q00.0 - Topics, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.00#0_0_0, Q0.00#0_00_0, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00_0, Q0.0_0, Q0.0_00, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.0_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT 0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00#0_0_0_TEXT, Q0.00#0_00_0_TEXT, Q0.0#0_0_0_TEXT, Q0.0#0_00_0_TEXT, Q0.0_0, Q0.0_00, Q0.00_0, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.00_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT, Q0.0_TEXT QID0, QID000, QID00 QID0, QID000, QID00 Q0.0, Q0.00, Q00.0, Q00.00

Final Stretch

data_out <- data_out %>% mutate(case = paste0(qtype,"_",qslct,"_",q_sbsl))

tab_values <- merge(tab_desc,tab_options
                    ,by = "qname"
                    , suffixes = c("_tbdesc","_tbopt"),all = T) 
tab_values <- tab_values %>% 
    extract(qname, c("QROOT", "B"), "(^Q[0-9]+.[0-9]+)|^[0-9]+_(Q[0-9]+.[0-9]+)") %>%
    mutate(QROOT = case_when(QROOT!="" ~ QROOT
                             ,TRUE ~B )
           , rownum_tab_values = seq(nrow(.))) %>%
      select(QROOT) %>%
  bind_cols(tab_values) %>% 
  mutate(QROOT = case_when(is.na(QROOT) ~ as.character(qname)
                           ,TRUE ~QROOT))

data_out_dist <- data_out  %>% 
  distinct(qnum, .keep_all = TRUE)

data_out_exampled <- full_join(data_out_dist
                               , tab_values
                               , by=c("qnum"="QROOT")) %>% 
  mutate(rownum_data_out_exampled =  seq(nrow(.)))


(yt <- getunique(
  gsub(pattern = "[0-9]+"
       ,replacement = "0"
       ,x = getunique(tab_values$qname))))
 [1] "StartDate"                    "EndDate"                     
 [3] "Status"                       "IPAddress"                   
 [5] "Progress"                     "Duration (in seconds)"       
 [7] "Finished"                     "RecordedDate"                
 [9] "ResponseId"                   "RecipientLastName"           
[11] "RecipientFirstName"           "RecipientEmail"              
[13] "ExternalReference"            "LocationLatitude"            
[15] "LocationLongitude"            "DistributionChannel"         
[17] "UserLanguage"                 "Q_RelevantIDDuplicate"       
[19] "Q_RelevantIDDuplicateScore"   "Q_RelevantIDFraudScore"      
[21] "Q_RelevantIDLastStartDate"    "Q0.0"                        
[23] "Q0.0_0_TEXT"                  "Q0.0_0"                      
[25] "Q0.0#0_0_0"                   "Q0.0#0_0"                    
[27] "0_Q0.0"                       "Q0.0_0_0"                    
[29] "FSFN"                         "TicketKey0"                  
[31] "Q_R"                          "ReviewerEmail"               
[33] "Q_RelatedResponseField"       "Q_RelatedResponseFieldValue" 
[35] "ResponseID"                   "RetakeLinkNew"               
[37] "CountyCombined"               "Q_TicketTeam"                
[39] "Q_TicketOwner"                "Supervisor"                  
[41] "Manager"                      "Region Unit Combined"        
[43] "Q_URL"                        "Skills Assessment QO Survey" 
[45] "RetakeLinkReplace"            "Q0.0 - Parent Topics"        
[47] "Q0.0 - Sentiment Polarity"    "Q0.0 - Sentiment Score"      
[49] "Q0.0 - Sentiment"             "Q0.0 - Topic Sentiment Label"
[51] "Q0.0 - Topic Sentiment Score" "Q0.0 - Topics"               
[53] "Q0.0#0_0_0_TEXT"              "Q0.0_TEXT"                   
(yt <- sort(yt[grep("Q.*\\.",yt)]))
 [1] "0_Q0.0"                       "Q0.0"                        
 [3] "Q0.0 - Parent Topics"         "Q0.0 - Sentiment"            
 [5] "Q0.0 - Sentiment Polarity"    "Q0.0 - Sentiment Score"      
 [7] "Q0.0 - Topic Sentiment Label" "Q0.0 - Topic Sentiment Score"
 [9] "Q0.0 - Topics"                "Q0.0#0_0"                    
[11] "Q0.0#0_0_0"                   "Q0.0#0_0_0_TEXT"             
[13] "Q0.0_0"                       "Q0.0_0_0"                    
[15] "Q0.0_0_TEXT"                  "Q0.0_TEXT"                   
strs.vec <- yt
patts <- c("_", "#", "TEXT", "^Q", "^Q[0-9]+.[0-9]+$")
PatternInfo(patts, strs.vec)
                             length  _  # TEXT ^Q ^Q[0-9]+.[0-9]+$
0_Q0.0                            6  2 -1   -1 -1               -1
Q0.0                              4 -1 -1   -1  1                1
Q0.0 - Parent Topics             20 -1 -1   -1  1               -1
Q0.0 - Sentiment                 16 -1 -1   -1  1               -1
Q0.0 - Sentiment Polarity        25 -1 -1   -1  1               -1
Q0.0 - Sentiment Score           22 -1 -1   -1  1               -1
Q0.0 - Topic Sentiment Label     28 -1 -1   -1  1               -1
Q0.0 - Topic Sentiment Score     28 -1 -1   -1  1               -1
Q0.0 - Topics                    13 -1 -1   -1  1               -1
Q0.0#0_0                          8  7  5   -1  1               -1
Q0.0#0_0_0                       10  7  5   -1  1               -1
Q0.0#0_0_0_TEXT                  15  7  5   12  1               -1
Q0.0_0                            6  5 -1   -1  1               -1
Q0.0_0_0                          8  5 -1   -1  1               -1
Q0.0_0_TEXT                      11  5 -1    8  1               -1
Q0.0_TEXT                         9  5 -1    6  1               -1
grep("Q[0-9]+.[0-9]+_[0-9]+",c("Q22.0#0_0","Q43.6_1","Q22.0#0_0_0_TEXT","Q1.3" ), val=T)
[1] "Q43.6_1"
grep("Q[0-9]+.[0-9]+",c("Q22.0#0_0","Q43.6_1","Q22.0#0_0_0_TEXT","Q1.3" ), val=T)
[1] "Q22.0#0_0"        "Q43.6_1"          "Q22.0#0_0_0_TEXT"
[4] "Q1.3"            
patts <- getunique(gsub(replacement = "[0-9]+", pattern = "0",x = yt))
patts <- sprintf("^%s$",patts)
strs.vec <- as.character(data_out_exampled$qname)
names(strs.vec) <- data_out_exampled$rownum_data_out_exampled
matched.mat <- as_tibble(((PatternInfo(patts, strs.vec)) >-1)*1) %>%
  select(-length) %>% 
  mutate(qname = strs.vec
         , rownum_formatched = names(strs.vec)) %>% 
  select(where(function(x) any((x!=0)))) %>%
  filter()

# empty == good...if not empty, pattern applies to multiple question types
matched.mat %>% 
  select(contains("Q[")) %>% 
  mutate(rowsum= rowSums(.)) %>% 
  filter(rowsum >1)
# A tibble: 0 x 17
# ... with 17 variables: ^[0-9]+_Q[0-9]+.[0-9]+$ <dbl>,
#   ^Q[0-9]+.[0-9]+$ <dbl>, ^Q[0-9]+.[0-9]+ - Parent Topics$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Sentiment$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Sentiment Polarity$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Sentiment Score$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Topic Sentiment Label$ <dbl>,
#   ^Q[0-9]+.[0-9]+ - Topic Sentiment Score$ <dbl>, ...
# create question type pattern indicator
dat_matched <- merge(data_out_exampled
                     , matched.mat
                     , by.x="rownum_data_out_exampled"
                     , by.y =  "rownum_formatched")

# get row ids for column names with no patterns matched
dat_matched <- dat_matched %>% rowwise() %>% 
  mutate(dummysums = sum(c_across(contains("Q[")), na.rm = T)) %>% 
  ungroup()
nonmatch <- dat_matched %>% filter(dummysums==0) # non-matched column rows
match <- dat_matched %>% filter(dummysums==1)
match <- (match %>% select(contains("Q[")) %>% 
            dedummy(.) %>% cbind(match)) %>% 
  rename("pattmatched"=".")  %>% 
  mutate(groupmatached = paste0(case, "__",as.numeric(as.factor((pattmatched)))))  %>% 
  select(
  qnum
  , pattmatched
  ,groupmatached
  , textEntry
  , choice
  , desc
  , qname.x
  , case
  ) %>% rename_all(~stringr::str_replace(.,"\\.x","")) %>% filter(!is.na(desc))


# check if all questions in final data are in this match set
setdiff(match$qname, names(dat)) # some questions were never answered, so they do not appear as column names in the data
 [1] "Q6.1 - Parent Topics"          "Q6.1 - Sentiment Polarity"    
 [3] "Q6.1 - Sentiment Score"        "Q6.1 - Sentiment"             
 [5] "Q6.1 - Topic Sentiment Label"  "Q6.1 - Topic Sentiment Score" 
 [7] "Q6.1 - Topics"                 "Q8.4 - Parent Topics"         
 [9] "Q8.4 - Sentiment Polarity"     "Q8.4 - Sentiment Score"       
[11] "Q8.4 - Sentiment"              "Q8.4 - Topic Sentiment Label" 
[13] "Q8.4 - Topic Sentiment Score"  "Q8.4 - Topics"                
[15] "Q20.3 - Parent Topics"         "Q20.3 - Sentiment Polarity"   
[17] "Q20.3 - Sentiment Score"       "Q20.3 - Sentiment"            
[19] "Q20.3 - Topic Sentiment Label" "Q20.3 - Topic Sentiment Score"
[21] "Q20.3 - Topics"               
setdiff(names(dat), match$qname)
 [1] "StartDate"                   "EndDate"                    
 [3] "Status"                      "IPAddress"                  
 [5] "Progress"                    "Duration (in seconds)"      
 [7] "Finished"                    "RecordedDate"               
 [9] "ResponseId"                  "RecipientLastName"          
[11] "RecipientFirstName"          "RecipientEmail"             
[13] "ExternalReference"           "LocationLatitude"           
[15] "LocationLongitude"           "DistributionChannel"        
[17] "UserLanguage"                "Q_RelevantIDDuplicate"      
[19] "Q_RelevantIDDuplicateScore"  "Q_RelevantIDFraudScore"     
[21] "Q_RelevantIDLastStartDate"   "Q1.23_5"                    
[23] "Q1.23_6"                     "FSFN"                       
[25] "TicketKey1"                  "Q_R"                        
[27] "ReviewerEmail"               "Q_RelatedResponseField"     
[29] "Q_RelatedResponseFieldValue" "ResponseID"                 
[31] "RetakeLinkNew"               "CountyCombined"             
[33] "Q_TicketTeam"                "Q_TicketOwner"              
[35] "Supervisor"                  "Manager"                    
[37] "Region Unit Combined"        "Q_URL"                      
[39] "Skills Assessment QO Survey" "RetakeLinkReplace"          
# intersect(names(dat), match$qname)

case_patterns <- match %>% 
  group_by(pattmatched, case, groupmatached) %>% 
  count(pattmatched, case, groupmatached) %>% 
  mutate(grouppattmatched = paste0(pattmatched,"::",case))  %>% 
  ungroup() %>% 
  mutate(row = row_number())
pattmatched case groupmatached n grouppattmatched row rules
^[0-9]+_Q[0-9]+.[0-9]+$ MC_SAVR_TX MC_SAVR_TX__1 90 ^[0-9]+_Q[0-9]+.[0-9]+$::MC_SAVR_TX 1 qnum
^[0-9]+_Q[0-9]+.[0-9]+$ TE_ESTB_NA TE_ESTB_NA__1 30 ^[0-9]+_Q[0-9]+.[0-9]+$::TE_ESTB_NA 2 qnum
^Q[0-9]+.[0-9]+#[0-9]+_[0-9]+$ SBS_SBSMatrix_NA SBS_SBSMatrix_NA__2 150 ^Q[0-9]+.[0-9]+#[0-9]+_[0-9]+$::SBS_SBSMatrix_NA 3 qnum
^Q[0-9]+.[0-9]+#[0-9]+[0-9]+[0-9]+$ SBS_SBSMatrix_NA SBS_SBSMatrix_NA__3 175 ^Q[0-9]+.[0-9]+#[0-9]+[0-9]+[0-9]+$::SBS_SBSMatrix_NA 4 qnum
^Q[0-9]+.[0-9]+$ MC_DL_NA MC_DL_NA__5 32 ^Q[0-9]+.[0-9]+$::MC_DL_NA 5 qnum
^Q[0-9]+.[0-9]+$ MC_SAVR_TX MC_SAVR_TX__5 70 ^Q[0-9]+.[0-9]+$::MC_SAVR_TX 6 qnum
^Q[0-9]+.[0-9]+$ TE_ESTB_NA TE_ESTB_NA__5 49 ^Q[0-9]+.[0-9]+$::TE_ESTB_NA 7 qnum
^Q[0-9]+.[0-9]+$ TE_SL_NA TE_SL_NA__5 13 ^Q[0-9]+.[0-9]+$::TE_SL_NA 8 qnum
^Q[0-9]+.[0-9]+_[0-9]+$ Matrix_Likert_SingleAnswer Matrix_Likert_SingleAnswer__6 131 ^Q[0-9]+.[0-9]+_[0-9]+$::Matrix_Likert_SingleAnswer 9 qnum
^Q[0-9]+.[0-9]+_[0-9]+$ MC_MACOL_TX MC_MACOL_TX__6 81 ^Q[0-9]+.[0-9]+_[0-9]+$::MC_MACOL_TX 10 qname
^Q[0-9]+.[0-9]+_[0-9]+$ MC_MAVR_TX MC_MAVR_TX__6 161 ^Q[0-9]+.[0-9]+_[0-9]+$::MC_MAVR_TX 11 qname
^Q[0-9]+.[0-9]+_[0-9]+$ TE_AUTO_STATIC TE_AUTO_STATIC__6 1 ^Q[0-9]+.[0-9]+_[0-9]+$::TE_AUTO_STATIC 12 qnum
^Q[0-9]+.[0-9]+_[0-9]+$ TE_FORM_NA TE_FORM_NA__6 4 ^Q[0-9]+.[0-9]+_[0-9]+$::TE_FORM_NA 13 colondash
^Q[0-9]+.[0-9]+[0-9]+[0-9]+$ Matrix_Likert_MultipleAnswer Matrix_Likert_MultipleAnswer__7 358 ^Q[0-9]+.[0-9]+[0-9]+[0-9]+$::Matrix_Likert_MultipleAnswer 14 underscore
^Q[0-9]+.[0-9]+_[0-9]+_TEXT$ MC_MAVR_TX MC_MAVR_TX__8 3 ^Q[0-9]+.[0-9]+_[0-9]+_TEXT$::MC_MAVR_TX 15 text
^Q[0-9]+.[0-9]+_[0-9]+_TEXT$ MC_SAVR_TX MC_SAVR_TX__8 1 ^Q[0-9]+.[0-9]+_[0-9]+_TEXT$::MC_SAVR_TX 16 text2
# update rules table if survey gets new types of questions since last run
case_patterns_ruled <- read.csv("case_patterns_ruled.csv")
patterns_new <- case_patterns[!case_patterns$grouppattmatched %in% case_patterns_ruled$grouppattmatched,]

# open rule setting table to add new rules from clipboard
if(nrow(patterns_new) > 0){
  # clipr::write_clip(patterns_new)
  # browseURL("case_patterns_ruled.csv")
}
NULL
# after updates to ruled table (using troubleshoot.R to establish new rules)
case_patterns <- merge(case_patterns_ruled[c("grouppattmatched","rules")]
                       , case_patterns
                       , "grouppattmatched")
 
rule <- list()
case <- list()
rulenames <- getunique(case_patterns$rules)
for (name_j in rulenames){
  rule[name_j] <- case_patterns %>% filter(rules==name_j) %>% select(groupmatached)
  case[name_j] <- case_patterns %>% filter(rules==name_j) %>% select(case)
}

matchout <- list()

for (i in rule$qnum){
  temp <- match %>% filter(groupmatached==i)
  matchout <- bind_rows(matchout,merge(temp,data_out,"qnum"))
}

for (i in rule$qname){
  temp <- match %>% filter(groupmatached==i)
  temp2 <- data_out %>% mutate(qname = paste0(qnum,"_", q_opt_num))
  matchout <- bind_rows(matchout,merge(temp,temp2,"qname"))
}

for (i in rule$colondash){
  temp <- match %>% filter(groupmatached==i)
  temp <- temp %>% pull(desc) %>% 
    str_split(string= ., pattern = ": - ", simplify = T) %>% 
    as.data.frame() %>% 
    mutate_if(., is.character, ~trimws(.)) %>% 
    rename(qtext_stripped=V1, q_opt_txt=V2) %>% 
    cbind(temp) 
  matchout <- bind_rows(matchout,merge(
    temp
    ,data_out
    , c("qtext_stripped","q_opt_txt")))
}

for (rx in seq_along(rule$underscore)){
  i <- rule$underscore[rx]
  j <- case$underscore[rx]
  temp <- match %>% filter(groupmatached==i)
  temp <- temp %>% pull(qname) %>% str_split(string= ., pattern = "_", simplify = T) %>% 
    as.data.frame() %>% 
    mutate_if(., is.character, ~trimws(.)) %>% 
    mutate(qnum_q_opt_num = paste0(V1,"::",V3)) %>%
    cbind(temp) 
  temp2 <- data_out %>% filter(!is.na(q_opt_num), case == j) %>% 
    mutate(qnum_q_opt_num = paste0(qnum, "::", q_opt_num))
  matchout <- bind_rows(matchout,left_join(temp,temp2,"qnum_q_opt_num"))
}

for (i in rule$text){
  temp <- match %>% filter(groupmatached==i) %>%
    mutate(qid = str_extract(textEntry, "[a-zA-Z0-9]+(?=\\.)"))
  newcols <- setdiff(names(data_out),names(temp))
  temp2 <- data.frame(matrix(nrow=nrow(temp), ncol = len(newcols)))
  colnames(temp2) <- newcols
  matchout <- bind_rows(matchout,bind_cols(temp, temp2))
}

for (i in rule$text2){
  temp <- match %>% filter(groupmatached==i) %>% arrange(qnum)
  temp$qid <- data_out %>% filter(qnum %in% temp$qnum) %>% 
    arrange(qnum) %>% pull(qid) %>% getunique()
  newcols <- setdiff(names(data_out),names(temp))
  temp2 <- data.frame(matrix(nrow=nrow(temp), ncol = len(newcols)))
  colnames(temp2) <- newcols
  matchout <- bind_rows(matchout,bind_cols(temp, temp2))
}

matchout$qnum <- matchout %>%
  select(contains("qnum"),-contains("qnum_q_opt_num")) %>% 
  coalesce_df(everything()) %>% pull(result)

# empty = good here
matchout %>% group_by(qname, q_opt_num) %>% 
  count(qname, q_opt_num) %>% 
  filter(n>1) %>% arrange(desc(n))
# A tibble: 0 x 3
# Groups:   qname, q_opt_num [0]
# ... with 3 variables: qname <fct>, q_opt_num <chr>, n <int>
setdiff(matchout$qname,names(dat))
character(0)
xy <- matchout %>% 
  select(
    qnum
    ,qname
    ,qid
    ,desc
    ,qtext
    ,qtype
    ,qslct
    ,q_sbsl
    ,q_frce
    ,q_rule
    ,q_opt_txt
    ,q_opt_num
  ) 

xy <- xy %>% 
  pull(qnum) %>% 
  str_split(string = ., pattern = "\\.", simplify = T) %>% 
  as.data.frame() %>% 
  mutate_at("V2", as.numeric) %>% 
  cbind(xy) %>% 
  arrange(V1, V2) %>% 
    mutate(qid = case_when(is.na(qid) ~ "TEXT", TRUE~qid)
           , updated = Sys.time()
           ,rownum = row_number()) %>% 
  select(rownum, updated, names(xy))


##############################
# Output :-)
##############################

write.csv(xy, sprintf("captured__%s.csv",
       format(Sys.time(), "%m_%d_%Y__H%H")),row.names = F)

Example Rows From Result