I import data from Qualtrics into R and create a codebook or documentation spreadsheet. This can be used for question labelling and recoding of response values (see renaming script project).
To avoid inadvertently sharing sensitive client data, this example uses a simplified dataset that mimics most of the issues encountered in the actual project.
Purpose:
Create a codebook that lists every question from a Qualtrics survey, indicating the respective column(s) in the response data, along with every possible response value (for non-open-ended questions). The codebook can be used for quickly cross-referencing exported response data against the survey questionnaire. It can also be used as a tool for renaming exported data columns to more meaningful names. The Qualtrics question labeling option was set to “Block”, but the regex patterns contained below can be updated to apply to any of the other options.
Joshua’s work principles applied to this project:
packageloader()
, which both installs and loads packages, and len()
, which aliases length()
for convenience, both come from my helper functions file:
source("https://raw.githubusercontent.com/joshuascriven/helper_functions/main/helper_functions.R")
packageloader(c(
"tidyverse"
, "dplyr"
, "knitr"
, "qualtRics"
, "odbc"
, "lubridate"
, "stringr"
, "anytime"
, "tm"
, "fastDummies"
, "jtools"
, "janitor"
, "here"
, "odbc"
, "data.table"
, "openxlsx"
, "GrpString"
, "kableExtra"
, "gtools"
, "flextable"
))
[1] "packages loaded"
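Since helper_functions.R is sourced remotely, here is a hypothetical sketch of what the two helpers might look like (the actual definitions in the file may differ):

```r
# Hypothetical sketch of the two helpers sourced from helper_functions.R;
# the real definitions may differ.
packageloader <- function(pkgs) {
  # install any packages not yet present, then attach all of them
  missing <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing) > 0) install.packages(missing)
  invisible(lapply(pkgs, library, character.only = TRUE))
  print("packages loaded")
}

# len() simply aliases length() to save keystrokes
len <- length
```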
Connecting to the Qualtrics survey data requires an API token. I retrieve my credentials from a local file in a centralized location using my getKey()
function. I query the API to fetch 3 types of Qualtrics data using the R “qualtRics” package. I always label projects to cross-reference in emails, system folders, and code searches.
I use metadata()
to get metadata for each question, stripping leading and trailing whitespace from question text while retaining the original text values. I create row indicators for any imported data to help with later joins or troubleshooting.
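getKey() is one of my helper functions; a minimal sketch of the idea, assuming a two-column key/value TSV (the path and file layout here are invented, and the real helper may differ):

```r
# Hypothetical sketch of getKey(): look up a named credential in a local
# two-column table (key, value) kept outside the repository.
getKey <- function(keyname, path = "~/credentials/api_key.tsv") {
  keys <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
  keys$value[keys$key == keyname]
}
```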
data_dir <- "../data/"
projname <- "qo_request_JS22"
api_key <- read.table(filemaker("api_key","tsv"), header = TRUE)
# Get Qualtrics data
qualtrics_api_credentials(
api_key = getKey("qualtrics_api_key")
, base_url = getKey("qualtrics_base_url")
, install = TRUE, overwrite = TRUE); readRenviron("~/.Renviron")
sid <- all_surveys()$id[grep("^Li.*To.* TEST COPY$",all_surveys()$name)]
dat <- fetch_survey(surveyID = sid,
verbose = TRUE,force_request = T)
# Extract Q metadata
md <- metadata(sid)
q_rule <- stack((lapply(md$questions, function(x) as.character(x$validation))))
df_list <- list(
qnum = stack(lapply(md$questions, function(x) x$questionName))
, qid = data.frame(ind = names(md$questions))
, qtext = stack(lapply(md$questions, function(x) cleanhtml(x$questionText)))
, qtype = stack(lapply(md$questions, function(x) x$questionType$type))
, qslct = stack(lapply(md$questions, function(x) x$questionType$selector))
, q_sbsl = stack(lapply(md$questions
, function(x) as.character(x$questionType$subSelector)))
, q_frce = stack(lapply(md$questions
, function(x) x$validation$doesForceResponse))
, q_rule = aggregate(values ~ ind
, FUN = paste
, collapse = " ; "
, data = q_rule)
, q_opt_txt = cbind(cbind(stack(lapply(md$questions
, function(w) lapply(w$choices, function(z) z$choiceText)))
, stack(lapply(md$questions
, function(w) as.character(lapply(w$choices, function(z) z$recode))))
))[,-4]
)
data_out <- df_list %>% reduce(full_join, by='ind')
names(data_out) <- c(names(df_list),"q_opt_num")
data_out <- data_out %>%
mutate(
source = "data_out"
,qtext_stripped = gsub(":$", "", trimws(qtext))
,rownum_data_out = seq(nrow(.))
)
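The stack(lapply(...)) idiom used to build df_list can be illustrated on a toy metadata list (names and text invented): stack() turns a named list into a two-column data frame (values, ind), so each per-question field becomes a joinable table keyed by QID.

```r
# Toy illustration of the stack(lapply(...)) idiom: a named list of
# per-question values becomes a data frame with a "values" column and an
# "ind" column holding the list names (the QIDs).
md_demo <- list(QID1 = "First question text", QID2 = "Second question text")
stack(md_demo)
```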
Next, I use column_map()
to extract metadata on responses. Each row in the resulting data represents a response choice for a survey question.
tab_options <- tab_options_bac <- qualtRics::column_map(sid)
tab_options$source <- "tab_options"
Importantly, the metadata provided by metadata()
and column_map()
describe the questions in the survey instrument, not the final data that Qualtrics exports. Responses to “select all that apply” questions, for instance, produce one binary indicator per choice option, and these columns follow a naming pattern that ties each set back to its survey question.
When a question in the Qualtrics survey allows the respondent to select all options that apply, Qualtrics stores responses for that question type as a set of binary indicators, one for each of the n choices presented. A similar situation arises for Qualtrics matrix or grouped questions, where multiple responses are elicited from respondents for a set of questions that share the same scale. Hence, metadata()
returns a QID (question identifier) column, the length of which is equal to the n columns in the exported data, while column_map()
provides all response options for a given QID/exported data column. Qualtrics does not provide a key to merge these three datasets, so I use the following method.
A merge-key column is one that can uniquely identify the same observation across various datasets. Of the possible fields available, only tab_options$qname
uniquely identifies every possible question-response for the survey (i.e., as entered in the web UI). tab_desc$qname
is necessarily a set of unique values, consisting of the column names of the exported data. Because their lengths differ, however, a simple 1-to-1 join isn’t possible.
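A toy demonstration of why a simple join fans out (all names and values invented): qid repeats across response options, so merging on it multiplies rows, while qname stays unique per exported column.

```r
# Invented values: one row per question vs. one row per response option.
questions <- data.frame(qid = c("QID1", "QID2"), qnum = c("Q1.1", "Q1.2"))
options_  <- data.frame(qid    = c("QID1", "QID1", "QID2"),
                        choice = c("Yes", "No", "Maybe"))
# Merging on qid fans QID1 out across its two choices.
merged <- merge(questions, options_, by = "qid")
```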
# test duplication
tabyl(tab_options$qid) %>% arrange(desc(n)) %>% head() %>%
kbl() %>%
kable_styling()# duplicates
tab_options$qid | n | percent |
---|---|---|
QID367 | 132 | 0.0888889 |
QID357 | 120 | 0.0808081 |
QID373 | 108 | 0.0727273 |
QID403 | 96 | 0.0646465 |
QID411 | 75 | 0.0505051 |
QID361 | 72 | 0.0484848 |
tabyl(data_out$qid) %>% arrange(desc(n)) %>% head() %>%
kbl() %>%
kable_styling() # duplicates
data_out$qid | n | percent |
---|---|---|
QID167 | 138 | 0.0969782 |
QID11 | 107 | 0.0751933 |
QID162 | 79 | 0.0555165 |
QID165 | 63 | 0.0442727 |
QID163 | 49 | 0.0344343 |
QID101 | 40 | 0.0281096 |
tabyl(data_out$qnum) %>% arrange(desc(n)) %>% head() %>%
kbl() %>%
kable_styling()# duplicates
data_out$qnum | n | percent |
---|---|---|
Q1.37 | 138 | 0.0969782 |
Q1.32 | 107 | 0.0751933 |
Q1.33 | 79 | 0.0555165 |
Q1.35 | 63 | 0.0442727 |
Q1.34 | 49 | 0.0344343 |
Q20.6 | 40 | 0.0281096 |
tabyl(tab_desc$desc) %>% arrange(desc(n)) %>% head() %>%
kbl() %>%
kable_styling()# duplicates
tab_desc$desc | n | percent |
---|---|---|
Provide details in comment box: | 32 | 0.0207523 |
Provide explanation in comment box: | 8 | 0.0051881 |
Was guidance provided by primary Supervisor? | 6 | 0.0038911 |
Comment Box: - Other - OTI 1 - Enter Reason | 2 | 0.0012970 |
Comment Box: - Other - OTI 10 - Enter Reason | 2 | 0.0012970 |
Comment Box: - Other - OTI 11 - Enter Reason | 2 | 0.0012970 |
tabyl(tab_desc$qname) %>%
filter(grepl("Q[0-9]",`tab_desc$qname`)) %>%
arrange(desc(n)) %>%
head() %>%
kbl() %>%
kable_styling() # unique
tab_desc$qname | n | percent |
---|---|---|
Q1.2 | 1 | 0.0006485 |
Q1.3 | 1 | 0.0006485 |
Q1.4 | 1 | 0.0006485 |
Q1.4_6_TEXT | 1 | 0.0006485 |
Q1.5 | 1 | 0.0006485 |
Q1.6 | 1 | 0.0006485 |
tabyl(tab_options$qname) %>% arrange(desc(n)) %>% head() %>%
kbl() %>%
kable_styling() # unique
tab_options$qname | n | percent |
---|---|---|
Q1.1 | 1 | 0.0006734 |
Q1.10 | 1 | 0.0006734 |
Q1.11 | 1 | 0.0006734 |
Q1.12 | 1 | 0.0006734 |
Q1.13 | 1 | 0.0006734 |
Q1.14 | 1 | 0.0006734 |
tab_options$qid
, data_out$qid
and data_out$qnum
all share the same length, and each uniquely identifies the survey questions. However, data_out$qnum
uses the labeling selected by the user in Qualtrics, while qid
in both instances represents the internal Qualtrics identifier for that question and is usually based on the order of question creation.
# test length
getuniquelen(tab_options$qid) # len 1
[1] 232
getuniquelen(data_out$qid) # len 1
[1] 232
getuniquelen(data_out$qnum) # len 1
[1] 232
getuniquelen(tab_desc$desc) # len diff
[1] 1481
getuniquelen(tab_desc$qname) # len diff
[1] 1542
getuniquelen(tab_options$qname) # len diff
[1] 1485
Finally, there is some overlap between tab_options$qname
and tab_desc$qname
, resulting from the fact that a survey question, as indicated by its QID or the root of its qname, will appear multiple times in the respective data, either because it comprises multiple sub-questions or because its responses are recorded as if there existed multiple sub-questions (the binary-indicator example above). It is also clear that substrings of qname
values in both datasets relate exactly to qnum
in data_out
# test overlap
len(intersect(tab_options$qid,data_out$qnum))
[1] 0
len(setdiff(tab_options$qid,data_out$qid)) # full overlap
[1] 0
len(intersect(tab_options$qname,tab_desc$qname)) # some overlap
[1] 948
len(setdiff(tab_options$qname,tab_desc$qname)) # some overlap
[1] 537
data_out can be joined using qnum
with parsed versions of qname
from both tab_desc and tab_options. However, because a 1:1 relationship should exist between tab_desc and tab_options for much of the data, and a 1:many between either and data_out, the first two must be merged before joining their merger to the third dataset.
The following table shows the patterns of the candidate merge-key column values, with every digit replaced by a 0 so that only the structure of each name remains. Embedded data fields, calculated fields, and Qualtrics-generated fields in the exported data are excluded.
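The masking used to produce the table is plain digit substitution; for example:

```r
# Replace every digit with "0" so only the structural pattern of each
# column name remains (digit-run lengths are preserved).
gsub("[0-9]", "0", c("Q12.3_45_TEXT", "Q1.2#1_10", "3_Q4.5"))
# "Q00.0_00_TEXT" "Q0.0#0_00" "0_Q0.0"
```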
temp1 <- grep("Q[0-9]",mixedsort(getunique(
gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_desc$qname)))),val=T)
temp2 <- mixedsort(getunique(
gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_options$qname))))
temp3 <- mixedsort(getunique(
gsub(pattern = "[0-9]",replacement = "0",x = getunique(tab_options$qid))))
temp4 <- mixedsort(getunique(
gsub(pattern = "[0-9]",replacement = "0",x = getunique(data_out$qid))))
temp5 <- mixedsort(getunique(
gsub(pattern = "[0-9]",replacement = "0",x = getunique(data_out$qnum))))
cbind(
paste0(temp1, collapse = ", ")
,paste0(temp2, collapse = ", ")
,paste0(temp3, collapse = ", ")
,paste0(temp4, collapse = ", ")
,paste0(temp5, collapse = ", ")) %>% `colnames<-`(
c( "tab_desc`$`qname"
, "tab_options`$`qname"
, "tab_options`$`qid"
, "data_out`$`qid"
, "data_out`$`qnum") ) %>%
kbl() %>%
kable_styling()
tab_desc$qname | tab_options$qname | tab_options$qid | data_out$qid | data_out$qnum |
---|---|---|---|---|
0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.0 - Parent Topics, Q00.0 - Parent Topics, Q0.0 - Sentiment, Q00.0 - Sentiment, Q0.0 - Sentiment Polarity, Q00.0 - Sentiment Polarity, Q0.0 - Sentiment Score, Q00.0 - Sentiment Score, Q0.0 - Topic Sentiment Label, Q00.0 - Topic Sentiment Label, Q0.0 - Topic Sentiment Score, Q00.0 - Topic Sentiment Score, Q0.0 - Topics, Q00.0 - Topics, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.00#0_0_0, Q0.00#0_00_0, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00_0, Q0.0_0, Q0.0_00, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.0_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT | Q0.0, Q0.00, Q00.0, Q00.00, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00#0_0_0_TEXT, Q0.00#0_00_0_TEXT, Q0.0#0_0_0_TEXT, Q0.0#0_00_0_TEXT, Q0.0(0), Q0.0(00), Q0.0_0, Q0.0_00, Q0.00_0, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q0.00_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT, Q0.0_TEXT, Q00.0_x0, Q00.0_x00, Q00.0_xx0, Q00.0_xx0_0 | QID0, QID000, QID00 | QID0, QID000, QID00 | Q0.0, Q0.00, Q00.0, Q00.00 |
The exported data field names should be left unmodified, so I chose to harmonize tab_options$qname
to tab_desc$qname
.
setdiff(temp2,temp1)
[1] "Q0.00#0_0_0_TEXT" "Q0.00#0_00_0_TEXT" "Q0.0#0_0_0_TEXT"
[4] "Q0.0#0_00_0_TEXT" "Q0.0(0)" "Q0.0(00)"
[7] "Q0.00_0_TEXT" "Q0.0_TEXT" "Q00.0_x0"
[10] "Q00.0_x00" "Q00.0_xx0" "Q00.0_xx0_0"
found <- grepl(pattern = "\\(", x = tab_options$qname)
temp <- paste0(
gsub(".*\\((.*)\\).*", "\\1", tab_options$qname),"_",tab_options$qname)
tab_options$qname_cleaned <- gsub("\\s*\\([^\\)]+\\)","",as.character(temp))
tab_options<- tab_options %>%
rename(qname_orig = qname
,qname = qname_cleaned) %>%
mutate(rownum_tab_options = seq(nrow(tab_options))
)
tab_options$qname[!found] <- tab_options$qname_orig[!found]
found <- grepl(pattern = "x", x = tab_options$qname)
tab_options$qname_cleaned <- gsub("\\s*x","",as.character(tab_options$qname))
tab_options <- tab_options %>% select(!qname_orig) %>% rename(qname_orig = qname
,qname = qname_cleaned)
tab_options$qname[!found] <- tab_options$qname_orig[!found]
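On an invented value, the first cleaning pass rewrites the loop-suffix form to the prefixed form used by tab_desc$qname:

```r
# Same regexes as in the cleaning pass above, applied to one made-up
# name: "Q3.1(2)" becomes "2_Q3.1".
x <- "Q3.1(2)"
prefix <- gsub(".*\\((.*)\\).*", "\\1", x)   # capture the number in parens
root   <- gsub("\\s*\\([^\\)]+\\)", "", x)   # drop the "(n)" suffix
paste0(prefix, "_", root)                    # "2_Q3.1"
```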
temp1 <- grep("Q[0-9]",mixedsort(getunique(gsub(pattern = "[0-9]"
,replacement = "0"
,x = getunique(tab_desc$qname)))),val=T)
temp2 <- mixedsort(getunique(gsub(pattern = "[0-9]"
,replacement = "0"
,x = getunique(tab_options$qname))))
setdiff(temp2,temp1)
[1] "Q0.00#0_0_0_TEXT" "Q0.00#0_00_0_TEXT" "Q0.0#0_0_0_TEXT"
[4] "Q0.0#0_00_0_TEXT" "Q0.00_0_TEXT" "Q0.0_TEXT"
"Q0.0_TEXT" %in% temp1
[1] FALSE
"Q0.0_TEXT" %in% temp2
[1] TRUE
temp1 <- paste0(grep("Q[0-9]",mixedsort(getunique(gsub(pattern = "[0-9]",replacement = "0"
,x = getunique(tab_desc$qname)))),val=T), collapse = ", ")
temp2 <- paste0(mixedsort(getunique(
gsub(pattern = "[0-9]"
,replacement = "0"
,x = getunique(tab_options$qname)))), collapse = ", ")
temp3 <- paste0(mixedsort(getunique(
gsub(pattern = "[0-9]"
,replacement = "0"
,x = getunique(tab_options$qid)))), collapse = ", ")
temp4 <- paste0(mixedsort(getunique(
gsub(pattern = "[0-9]"
,replacement = "0"
,x = getunique(data_out$qid)))), collapse = ", ")
temp5 <- paste0(mixedsort(getunique(
gsub(pattern = "[0-9]"
,replacement = "0"
,x = getunique(data_out$qnum)))), collapse = ", ")
cbind(
temp1
,temp2
,temp3
,temp4
,temp5) %>% `colnames<-`(
c( "tab_desc`$`qname"
, "tab_options`$`qname"
, "tab_options`$`qid"
, "data_out`$`qid"
, "data_out`$`qnum") ) %>%
kbl() %>%
kable_styling()
tab_desc$qname | tab_options$qname | tab_options$qid | data_out$qid | data_out$qnum |
---|---|---|---|---|
0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.0 - Parent Topics, Q00.0 - Parent Topics, Q0.0 - Sentiment, Q00.0 - Sentiment, Q0.0 - Sentiment Polarity, Q00.0 - Sentiment Polarity, Q0.0 - Sentiment Score, Q00.0 - Sentiment Score, Q0.0 - Topic Sentiment Label, Q00.0 - Topic Sentiment Label, Q0.0 - Topic Sentiment Score, Q00.0 - Topic Sentiment Score, Q0.0 - Topics, Q00.0 - Topics, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.00#0_0_0, Q0.00#0_00_0, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00_0, Q0.0_0, Q0.0_00, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.0_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT | 0_Q0.0, 00_Q0.0, Q0.0, Q0.00, Q00.0, Q00.00, Q0.00#0_0, Q0.00#0_00, Q0.0#0_0, Q0.0#0_00, Q0.0#0_0_0, Q0.0#0_00_0, Q0.00#0_0_0_TEXT, Q0.00#0_00_0_TEXT, Q0.0#0_0_0_TEXT, Q0.0#0_00_0_TEXT, Q0.0_0, Q0.0_00, Q0.00_0, Q0.00_00, Q00.0_0, Q00.0_00, Q00.00_0, Q00.00_00, Q0.00_0_0, Q0.00_0_00, Q0.00_00_0, Q0.00_00_00, Q00.0_0_0, Q0.00_0_TEXT, Q0.0_00_TEXT, Q00.0_00_TEXT, Q00.0_0_TEXT, Q0.0_TEXT | QID0, QID000, QID00 | QID0, QID000, QID00 | Q0.0, Q0.00, Q00.0, Q00.00 |
data_out <- data_out %>% mutate(case = paste0(qtype,"_",qslct,"_",q_sbsl))
tab_values <- merge(tab_desc,tab_options
,by = "qname"
, suffixes = c("_tbdesc","_tbopt"),all = T)
tab_values <- tab_values %>%
extract(qname, c("QROOT", "B"), "(^Q[0-9]+.[0-9]+)|^[0-9]+_(Q[0-9]+.[0-9]+)") %>%
mutate(QROOT = case_when(QROOT!="" ~ QROOT
,TRUE ~B )
, rownum_tab_values = seq(nrow(.))) %>%
select(QROOT) %>%
bind_cols(tab_values) %>%
mutate(QROOT = case_when(is.na(QROOT) ~ as.character(qname)
,TRUE ~QROOT))
data_out_dist <- data_out %>%
distinct(qnum, .keep_all = TRUE)
data_out_exampled <- full_join(data_out_dist
, tab_values
, by=c("qnum"="QROOT")) %>%
mutate(rownum_data_out_exampled = seq(nrow(.)))
(yt <- getunique(
gsub(pattern = "[0-9]+"
,replacement = "0"
,x = getunique(tab_values$qname))))
[1] "StartDate" "EndDate"
[3] "Status" "IPAddress"
[5] "Progress" "Duration (in seconds)"
[7] "Finished" "RecordedDate"
[9] "ResponseId" "RecipientLastName"
[11] "RecipientFirstName" "RecipientEmail"
[13] "ExternalReference" "LocationLatitude"
[15] "LocationLongitude" "DistributionChannel"
[17] "UserLanguage" "Q_RelevantIDDuplicate"
[19] "Q_RelevantIDDuplicateScore" "Q_RelevantIDFraudScore"
[21] "Q_RelevantIDLastStartDate" "Q0.0"
[23] "Q0.0_0_TEXT" "Q0.0_0"
[25] "Q0.0#0_0_0" "Q0.0#0_0"
[27] "0_Q0.0" "Q0.0_0_0"
[29] "FSFN" "TicketKey0"
[31] "Q_R" "ReviewerEmail"
[33] "Q_RelatedResponseField" "Q_RelatedResponseFieldValue"
[35] "ResponseID" "RetakeLinkNew"
[37] "CountyCombined" "Q_TicketTeam"
[39] "Q_TicketOwner" "Supervisor"
[41] "Manager" "Region Unit Combined"
[43] "Q_URL" "Skills Assessment QO Survey"
[45] "RetakeLinkReplace" "Q0.0 - Parent Topics"
[47] "Q0.0 - Sentiment Polarity" "Q0.0 - Sentiment Score"
[49] "Q0.0 - Sentiment" "Q0.0 - Topic Sentiment Label"
[51] "Q0.0 - Topic Sentiment Score" "Q0.0 - Topics"
[53] "Q0.0#0_0_0_TEXT" "Q0.0_TEXT"
[1] "0_Q0.0" "Q0.0"
[3] "Q0.0 - Parent Topics" "Q0.0 - Sentiment"
[5] "Q0.0 - Sentiment Polarity" "Q0.0 - Sentiment Score"
[7] "Q0.0 - Topic Sentiment Label" "Q0.0 - Topic Sentiment Score"
[9] "Q0.0 - Topics" "Q0.0#0_0"
[11] "Q0.0#0_0_0" "Q0.0#0_0_0_TEXT"
[13] "Q0.0_0" "Q0.0_0_0"
[15] "Q0.0_0_TEXT" "Q0.0_TEXT"
strs.vec <- yt
patts <- c("_", "#", "TEXT", "^Q", "^Q[0-9]+.[0-9]+$")
PatternInfo(patts, strs.vec)
length _ # TEXT ^Q ^Q[0-9]+.[0-9]+$
0_Q0.0 6 2 -1 -1 -1 -1
Q0.0 4 -1 -1 -1 1 1
Q0.0 - Parent Topics 20 -1 -1 -1 1 -1
Q0.0 - Sentiment 16 -1 -1 -1 1 -1
Q0.0 - Sentiment Polarity 25 -1 -1 -1 1 -1
Q0.0 - Sentiment Score 22 -1 -1 -1 1 -1
Q0.0 - Topic Sentiment Label 28 -1 -1 -1 1 -1
Q0.0 - Topic Sentiment Score 28 -1 -1 -1 1 -1
Q0.0 - Topics 13 -1 -1 -1 1 -1
Q0.0#0_0 8 7 5 -1 1 -1
Q0.0#0_0_0 10 7 5 -1 1 -1
Q0.0#0_0_0_TEXT 15 7 5 12 1 -1
Q0.0_0 6 5 -1 -1 1 -1
Q0.0_0_0 8 5 -1 -1 1 -1
Q0.0_0_TEXT 11 5 -1 8 1 -1
Q0.0_TEXT 9 5 -1 6 1 -1
[1] "Q43.6_1"
[1] "Q22.0#0_0" "Q43.6_1" "Q22.0#0_0_0_TEXT"
[4] "Q1.3"
patts <- getunique(gsub(replacement = "[0-9]+", pattern = "0",x = yt))
patts <- sprintf("^%s$",patts)
strs.vec <- as.character(data_out_exampled$qname)
names(strs.vec) <- data_out_exampled$rownum_data_out_exampled
matched.mat <- as_tibble(((PatternInfo(patts, strs.vec)) >-1)*1) %>%
select(-length) %>%
mutate(qname = strs.vec
, rownum_formatched = names(strs.vec)) %>%
select(where(function(x) any((x!=0)))) %>%
filter()
# empty == good...if not empty, pattern applies to multiple question types
matched.mat %>%
select(contains("Q[")) %>%
mutate(rowsum= rowSums(.)) %>%
filter(rowsum >1)
# A tibble: 0 x 17
# ... with 17 variables: ^[0-9]+_Q[0-9]+.[0-9]+$ <dbl>,
# ^Q[0-9]+.[0-9]+$ <dbl>, ^Q[0-9]+.[0-9]+ - Parent Topics$ <dbl>,
# ^Q[0-9]+.[0-9]+ - Sentiment$ <dbl>,
# ^Q[0-9]+.[0-9]+ - Sentiment Polarity$ <dbl>,
# ^Q[0-9]+.[0-9]+ - Sentiment Score$ <dbl>,
# ^Q[0-9]+.[0-9]+ - Topic Sentiment Label$ <dbl>,
# ^Q[0-9]+.[0-9]+ - Topic Sentiment Score$ <dbl>, ...
# create question type pattern indicator
dat_matched <- merge(data_out_exampled
, matched.mat
, by.x="rownum_data_out_exampled"
, by.y = "rownum_formatched")
# get row ids for column names with no patterns matched
dat_matched <- dat_matched %>% rowwise() %>%
mutate(dummysums = sum(c_across(contains("Q[")), na.rm = T)) %>%
ungroup()
nonmatch <- dat_matched %>% filter(dummysums==0) # non-matched column rows
match <- dat_matched %>% filter(dummysums==1)
match <- (match %>% select(contains("Q[")) %>%
dedummy(.) %>% cbind(match)) %>%
rename("pattmatched"=".") %>%
mutate(groupmatached = paste0(case, "__",as.numeric(as.factor((pattmatched))))) %>%
select(
qnum
, pattmatched
,groupmatached
, textEntry
, choice
, desc
, qname.x
, case
) %>% rename_all(~stringr::str_replace(.,"\\.x","")) %>% filter(!is.na(desc))
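dedummy() comes from my helper functions file; a minimal base-R sketch of the idea (the real helper may differ):

```r
# Sketch of dedummy(): collapse a set of 0/1 pattern-indicator columns
# back into one column naming the matched pattern per row.
dedummy_sketch <- function(df) {
  data.frame(. = names(df)[max.col(as.matrix(df), ties.method = "first")])
}
```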
# check if all questions in final data are in this match set
setdiff(match$qname, names(dat)) # some questions were never answered, so they do not appear as column names in the data.
[1] "Q6.1 - Parent Topics" "Q6.1 - Sentiment Polarity"
[3] "Q6.1 - Sentiment Score" "Q6.1 - Sentiment"
[5] "Q6.1 - Topic Sentiment Label" "Q6.1 - Topic Sentiment Score"
[7] "Q6.1 - Topics" "Q8.4 - Parent Topics"
[9] "Q8.4 - Sentiment Polarity" "Q8.4 - Sentiment Score"
[11] "Q8.4 - Sentiment" "Q8.4 - Topic Sentiment Label"
[13] "Q8.4 - Topic Sentiment Score" "Q8.4 - Topics"
[15] "Q20.3 - Parent Topics" "Q20.3 - Sentiment Polarity"
[17] "Q20.3 - Sentiment Score" "Q20.3 - Sentiment"
[19] "Q20.3 - Topic Sentiment Label" "Q20.3 - Topic Sentiment Score"
[21] "Q20.3 - Topics"
[1] "StartDate" "EndDate"
[3] "Status" "IPAddress"
[5] "Progress" "Duration (in seconds)"
[7] "Finished" "RecordedDate"
[9] "ResponseId" "RecipientLastName"
[11] "RecipientFirstName" "RecipientEmail"
[13] "ExternalReference" "LocationLatitude"
[15] "LocationLongitude" "DistributionChannel"
[17] "UserLanguage" "Q_RelevantIDDuplicate"
[19] "Q_RelevantIDDuplicateScore" "Q_RelevantIDFraudScore"
[21] "Q_RelevantIDLastStartDate" "Q1.23_5"
[23] "Q1.23_6" "FSFN"
[25] "TicketKey1" "Q_R"
[27] "ReviewerEmail" "Q_RelatedResponseField"
[29] "Q_RelatedResponseFieldValue" "ResponseID"
[31] "RetakeLinkNew" "CountyCombined"
[33] "Q_TicketTeam" "Q_TicketOwner"
[35] "Supervisor" "Manager"
[37] "Region Unit Combined" "Q_URL"
[39] "Skills Assessment QO Survey" "RetakeLinkReplace"
# intersect(names(dat), match$qname)
case_patterns <- match %>%
group_by(pattmatched, case, groupmatached) %>%
count(pattmatched, case, groupmatached) %>%
mutate(grouppattmatched = paste0(pattmatched,"::",case)) %>%
ungroup() %>%
mutate(row = row_number())
pattmatched | case | groupmatached | n | grouppattmatched | row | rules |
---|---|---|---|---|---|---|
^[0-9]+_Q[0-9]+.[0-9]+$ | MC_SAVR_TX | MC_SAVR_TX__1 | 90 | ^[0-9]+_Q[0-9]+.[0-9]+$::MC_SAVR_TX | 1 | qnum |
^[0-9]+_Q[0-9]+.[0-9]+$ | TE_ESTB_NA | TE_ESTB_NA__1 | 30 | ^[0-9]+_Q[0-9]+.[0-9]+$::TE_ESTB_NA | 2 | qnum |
^Q[0-9]+.[0-9]+#[0-9]+_[0-9]+$ | SBS_SBSMatrix_NA | SBS_SBSMatrix_NA__2 | 150 | ^Q[0-9]+.[0-9]+#[0-9]+_[0-9]+$::SBS_SBSMatrix_NA | 3 | qnum |
^Q[0-9]+.[0-9]+#[0-9]+[0-9]+[0-9]+$ | SBS_SBSMatrix_NA | SBS_SBSMatrix_NA__3 | 175 | ^Q[0-9]+.[0-9]+#[0-9]+[0-9]+[0-9]+$::SBS_SBSMatrix_NA | 4 | qnum |
^Q[0-9]+.[0-9]+$ | MC_DL_NA | MC_DL_NA__5 | 32 | ^Q[0-9]+.[0-9]+$::MC_DL_NA | 5 | qnum |
^Q[0-9]+.[0-9]+$ | MC_SAVR_TX | MC_SAVR_TX__5 | 70 | ^Q[0-9]+.[0-9]+$::MC_SAVR_TX | 6 | qnum |
^Q[0-9]+.[0-9]+$ | TE_ESTB_NA | TE_ESTB_NA__5 | 49 | ^Q[0-9]+.[0-9]+$::TE_ESTB_NA | 7 | qnum |
^Q[0-9]+.[0-9]+$ | TE_SL_NA | TE_SL_NA__5 | 13 | ^Q[0-9]+.[0-9]+$::TE_SL_NA | 8 | qnum |
^Q[0-9]+.[0-9]+_[0-9]+$ | Matrix_Likert_SingleAnswer | Matrix_Likert_SingleAnswer__6 | 131 | ^Q[0-9]+.[0-9]+_[0-9]+$::Matrix_Likert_SingleAnswer | 9 | qnum |
^Q[0-9]+.[0-9]+_[0-9]+$ | MC_MACOL_TX | MC_MACOL_TX__6 | 81 | ^Q[0-9]+.[0-9]+_[0-9]+$::MC_MACOL_TX | 10 | qname |
^Q[0-9]+.[0-9]+_[0-9]+$ | MC_MAVR_TX | MC_MAVR_TX__6 | 161 | ^Q[0-9]+.[0-9]+_[0-9]+$::MC_MAVR_TX | 11 | qname |
^Q[0-9]+.[0-9]+_[0-9]+$ | TE_AUTO_STATIC | TE_AUTO_STATIC__6 | 1 | ^Q[0-9]+.[0-9]+_[0-9]+$::TE_AUTO_STATIC | 12 | qnum |
^Q[0-9]+.[0-9]+_[0-9]+$ | TE_FORM_NA | TE_FORM_NA__6 | 4 | ^Q[0-9]+.[0-9]+_[0-9]+$::TE_FORM_NA | 13 | colondash |
^Q[0-9]+.[0-9]+[0-9]+[0-9]+$ | Matrix_Likert_MultipleAnswer | Matrix_Likert_MultipleAnswer__7 | 358 | ^Q[0-9]+.[0-9]+[0-9]+[0-9]+$::Matrix_Likert_MultipleAnswer | 14 | underscore |
^Q[0-9]+.[0-9]+_[0-9]+_TEXT$ | MC_MAVR_TX | MC_MAVR_TX__8 | 3 | ^Q[0-9]+.[0-9]+_[0-9]+_TEXT$::MC_MAVR_TX | 15 | text |
^Q[0-9]+.[0-9]+_[0-9]+_TEXT$ | MC_SAVR_TX | MC_SAVR_TX__8 | 1 | ^Q[0-9]+.[0-9]+_[0-9]+_TEXT$::MC_SAVR_TX | 16 | text2 |
# update rules table if survey gets new types of questions since last run
case_patterns_ruled <- read.csv("case_patterns_ruled.csv")
patterns_new <- case_patterns[!case_patterns$grouppattmatched %in% case_patterns_ruled$grouppattmatched,]
# open rule setting table to add new rules from clipboard
if(nrow(patterns_new) > 0){
# clipr::write_clip(patterns_new)
# browseURL("case_patterns_ruled.csv")
}
NULL
# after updates to ruled table (using troubleshoot.R to establish new rules)
case_patterns <- merge(case_patterns_ruled[c("grouppattmatched","rules")]
, case_patterns
, "grouppattmatched")
rule <- list()
case <- list()
rulenames <- getunique(case_patterns$rules)
for (name_j in rulenames){
rule[name_j] <- case_patterns %>% filter(rules==name_j) %>% select(groupmatached)
case[name_j] <- case_patterns %>% filter(rules==name_j) %>% select(case)
}
matchout <- list()
for (i in rule$qnum){
temp <- match %>% filter(groupmatached==i)
matchout <- bind_rows(matchout,merge(temp,data_out,"qnum"))
}
for (i in rule$qname){
temp <- match %>% filter(groupmatached==i)
temp2 <- data_out %>% mutate(qname = paste0(qnum,"_", q_opt_num))
matchout <- bind_rows(matchout,merge(temp,temp2,"qname"))
}
for (i in rule$colondash){
temp <- match %>% filter(groupmatached==i)
temp <- temp %>% pull(desc) %>%
str_split(string= ., pattern = ": - ", simplify = T) %>%
as.data.frame() %>%
mutate_if(., is.character, ~trimws(.)) %>%
rename(qtext_stripped=V1, q_opt_txt=V2) %>%
cbind(temp)
matchout <- bind_rows(matchout,merge(
temp
,data_out
, c("qtext_stripped","q_opt_txt")))
}
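The colondash rule’s split can be seen on an invented description string: the exported description is cut at ": - " to recover question text and choice text.

```r
# Base-R version of the split done with str_split() above, on a made-up
# description value.
parts <- trimws(strsplit("Provide details in comment box: - Other reason",
                         ": - ")[[1]])
parts  # question text, then choice text
```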
for (rx in seq_along(rule$underscore)){
i <- rule$underscore[rx]
j <- case$underscore[rx]
temp <- match %>% filter(groupmatached==i)
temp <- temp %>% pull(qname) %>% str_split(string= ., pattern = "_", simplify = T) %>%
as.data.frame() %>%
mutate_if(., is.character, ~trimws(.)) %>%
mutate(qnum_q_opt_num = paste0(V1,"::",V3)) %>%
cbind(temp)
temp2 <- data_out %>% filter(!is.na(q_opt_num), case == j) %>%
mutate(qnum_q_opt_num = paste0(qnum, "::", q_opt_num))
matchout <- bind_rows(matchout,left_join(temp,temp2,"qnum_q_opt_num"))
}
for (i in rule$text){
temp <- match %>% filter(groupmatached==i) %>%
mutate(qid = str_extract(textEntry, "[a-zA-Z0-9]+(?=\\.)"))
newcols <- setdiff(names(data_out),names(temp))
temp2 <- data.frame(matrix(nrow=nrow(temp), ncol = len(newcols)))
colnames(temp2) <- newcols
matchout <- bind_rows(matchout,bind_cols(temp, temp2))
}
for (i in rule$text2){
temp <- match %>% filter(groupmatached==i) %>% arrange(qnum)
temp$qid <- data_out %>% filter(qnum %in% temp$qnum) %>%
arrange(qnum) %>% pull(qid) %>% getunique()
newcols <- setdiff(names(data_out),names(temp))
temp2 <- data.frame(matrix(nrow=nrow(temp), ncol = len(newcols)))
colnames(temp2) <- newcols
matchout <- bind_rows(matchout,bind_cols(temp, temp2))
}
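The lookahead regex used by the text rules, applied to an invented textEntry value, keeps everything before the first dot:

```r
# Base-R equivalent of str_extract(textEntry, "[a-zA-Z0-9]+(?=\\.)"),
# on a made-up pointer value.
x <- "QID123.TEXT"
regmatches(x, regexpr("[a-zA-Z0-9]+(?=\\.)", x, perl = TRUE))  # "QID123"
```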
matchout$qnum <- matchout %>%
select(contains("qnum"),-contains("qnum_q_opt_num")) %>%
coalesce_df(everything()) %>% pull(result)
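coalesce_df() is another custom helper; a base-R sketch of the idea, assuming it takes the first non-NA value row-wise across the selected columns (the real helper may differ):

```r
# Sketch of the coalescing step: per row, keep the first non-NA value
# found across the qnum columns.
coalesce_cols <- function(df) {
  unname(apply(df, 1, function(r) r[!is.na(r)][1]))
}
```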
# empty = good here
matchout %>% group_by(qname, q_opt_num) %>%
count(qname, q_opt_num) %>%
filter(n>1) %>% arrange(desc(n))
# A tibble: 0 x 3
# Groups: qname, q_opt_num [0]
# ... with 3 variables: qname <fct>, q_opt_num <chr>, n <int>
character(0)
xy <- matchout %>%
select(
qnum
,qname
,qid
,desc
,qtext
,qtype
,qslct
,q_sbsl
,q_frce
,q_rule
,q_opt_txt
,q_opt_num
)
xy <- xy %>%
pull(qnum) %>%
str_split(string = ., pattern = "\\.", simplify = T) %>%
as.data.frame() %>%
mutate_at("V2", as.numeric) %>%
cbind(xy) %>%
arrange(V1, V2) %>%
mutate(qid = case_when(is.na(qid) ~ "TEXT", TRUE~qid)
, updated = Sys.time()
,rownum = row_number()) %>%
select(rownum, updated, names(xy))
##############################
# Output :-)
##############################
write.csv(xy, sprintf("captured__%s.csv",
format(Sys.time(), "%m_%d_%Y__H%H")),row.names = F)
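The timestamp format can be checked against a fixed (invented) time; stamping the filename this way means reruns never overwrite an earlier codebook.

```r
# Same format string as the write.csv() call above, with a fixed time so
# the result is reproducible.
stamp <- format(as.POSIXct("2022-03-05 14:00:00", tz = "UTC"), "%m_%d_%Y__H%H")
sprintf("captured__%s.csv", stamp)  # "captured__03_05_2022__H14.csv"
```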
qnum | qname | qid | desc | qtext | qtype | qslct | q_sbsl | q_frce | q_rule | q_opt_txt | q_opt_num |
---|---|---|---|---|---|---|---|---|---|---|---|
Q15.1 | Q15.1_1 | QID345 | A consultation was completed with a subject matter expert when indicated: - Domestic Violence Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | Not required | 3 |
Q15.1 | Q15.1_1 | QID345 | A consultation was completed with a subject matter expert when indicated: - Domestic Violence Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | No | 2 |
Q15.1 | Q15.1_1 | QID345 | A consultation was completed with a subject matter expert when indicated: - Domestic Violence Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | Yes | 1 |
Q15.1 | Q15.1_2 | QID345 | A consultation was completed with a subject matter expert when indicated: - Substance Abuse Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | Not required | 3 |
Q15.1 | Q15.1_2 | QID345 | A consultation was completed with a subject matter expert when indicated: - Substance Abuse Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | No | 2 |
Q15.1 | Q15.1_2 | QID345 | A consultation was completed with a subject matter expert when indicated: - Substance Abuse Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | Yes | 1 |
Q15.1 | Q15.1_3 | QID345 | A consultation was completed with a subject matter expert when indicated: - Mental Health Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | Not required | 3 |
Q15.1 | Q15.1_3 | QID345 | A consultation was completed with a subject matter expert when indicated: - Mental Health Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | No | 2 |
Q15.1 | Q15.1_3 | QID345 | A consultation was completed with a subject matter expert when indicated: - Mental Health Expert | A consultation was completed with a subject matter expert when indicated: | Matrix | Likert | SingleAnswer | FALSE | FALSE | Yes | 1 |
Q15.2 | Q15.2_1 | QID300 | Was the information and/or recommendations from the Subject Matter Expert considered in the decision making? - Domestic Violence Expert | Was the information and/or recommendations from the Subject Matter Expert considered in the decision making? | Matrix | Likert | SingleAnswer | FALSE | FALSE | Yes | 1 |
Q15.2 | Q15.2_1 | QID300 | Was the information and/or recommendations from the Subject Matter Expert considered in the decision making? - Domestic Violence Expert | Was the information and/or recommendations from the Subject Matter Expert considered in the decision making? | Matrix | Likert | SingleAnswer | FALSE | FALSE | No | 2 |
Q15.2 | Q15.2_2 | QID300 | Was the information and/or recommendations from the Subject Matter Expert considered in the decision making? - Substance Abuse Expert | Was the information and/or recommendations from the Subject Matter Expert considered in the decision making? | Matrix | Likert | SingleAnswer | FALSE | FALSE | Yes | 1 |
Q15.2 | Q15.2_2 | QID300 | Was the information and/or recommendations from the Subject Matter Expert considered in the decision making? - Substance Abuse Expert | Was the information and/or recommendations from the Subject Matter Expert considered in the decision making? | Matrix | Likert | SingleAnswer | FALSE | FALSE | No | 2 |
Q15.4 | Q15.4 | QID283 | Provide explanation in comment box: | Provide explanation in comment box: | TE | ESTB | FALSE | FALSE | |||
Q16.1 | Q16.1 | QID78 | A multi-disciplinary staffing was held when a new report is received on open case management cases. | A multi-disciplinary staffing was held when a new report is received on open case management cases. | MC | SAVR | TX | FALSE | FALSE | N/A – A multi-disciplinary staffing was not required. | 4 |
Q16.1 | Q16.1 | QID78 | A multi-disciplinary staffing was held when a new report is received on open case management cases. | A multi-disciplinary staffing was held when a new report is received on open case management cases. | MC | SAVR | TX | FALSE | FALSE | No - The multi-disciplinary staffing was not held within 14 days. | 2 |
Q16.1 | Q16.1 | QID78 | A multi-disciplinary staffing was held when a new report is received on open case management cases. | A multi-disciplinary staffing was held when a new report is received on open case management cases. | MC | SAVR | TX | FALSE | FALSE | No – A multi-disciplinary staffing was not held at all. | 3 |
Q16.1 | Q16.1 | QID78 | A multi-disciplinary staffing was held when a new report is received on open case management cases. | A multi-disciplinary staffing was held when a new report is received on open case management cases. | MC | SAVR | TX | FALSE | FALSE | Yes – A multi-disciplinary staffing was held with all relevant parties within 14 days | 1 |