I show how I create a re-coding table for harmonizing variation in person names entered without a clear format instructions.
When data validation is not used to enforce a certain format for collecting person names, certain variations in data entry for the same person might emerge according to different naming conventions. These variations might lie in the name order (i.e., “LASTNAME, FIRSTNAME” vs “FIRSTNAME LASTNAME”), its content (e.g., including “JR” or “Dr.”), the type of delimiter used (e.g., “FIRSTNAME LASTNAME” vs “FIRSTNAME-LASTNAME”), or different ways of writing first and last names made up of of multiple words. In this post, I show how I create a re-coding table for names, in a subsequent post, I show how a re-coding table can be used to correct existing and new names as a part of an automated process.
packageloader(c(
"tidyverse"
,"dplyr"
,"randomNames"
,"stringr"
,"stringi"
,"purrr"
,"janitor"
,"kableExtra"
,"flextable"
))
[1] "packages loaded"
To use source data across multiple projects, I centralize in a single folder then easily retrieve into projects with my custom filemaker()
function.
data_dir <- "../data/"
data <-read.csv(filemaker("test_data_names","csv"))
Assumptions about the source data affect the rules that should be applied. These assumptions enter the rule logic in the order specified below.
Every name in the data represents a unique individual, though slight variations of that individual’s name might exist due to input error.
Name suffixes (e.g., Jr. , III, etc.) should be retained because they might provide an exception to the above rule, while prefixes are dropped (Mrs., Dr., etc.).
Middle name initials are superfluous (i.e., “Joshua J Scriven” is the same person as “Joshua Scriven”) and are dropped.
When three words make up a full name, the second word counts as the second part of the first name (i.e., “Joshua John” is the first name of “Joshua John Scriven”)
When two words are combined by hyphenation, they are counted as one word.
Location of commas within a name is used to settle ambiguities between two-word last names that should be hyphenated and are not and two-word first names, which may or may not be hyphenated. (e.g., “Joshua John Scriven vs John Scriven, Joshua vs Scriven, Joshua John).
When only a single hyphenated word appears, the hyphen inaccurately represents a comma.
Capitalization errors exist in the data and there is a preference for uppercase in the output.
Taking a look at the data, there is wide variation in the names and name formatting. I use my name_anonymizer()
function to anonymize the names in the data. See my other post on anonymizing names to see the code for it.
input <- name_anonymizer(toupper(data$worker))
tabyl(input)[sample(x = (1:nrow(tabyl(input))),25),]%>%
head(10) %>%
select(input,n,percent ) %>%
regulartable() %>%
autofit()
input | n | percent |
JESSICA, CHANTELL | 1 | 0.0002583979 |
RIDA-JOSIAH, CIARA | 3 | 0.0007751938 |
TERRENCE, TY | 3 | 0.0007751938 |
BRAEDON, BRIANA M | 4 | 0.0010335917 |
ARELY, KRYSTAL | 10 | 0.0025839793 |
DOMENIC-LEANNA, WILLIE | 3 | 0.0007751938 |
STEPHANIE, TRENT L | 3 | 0.0007751938 |
DAVEION, SAVANNAH | 5 | 0.0012919897 |
JUSTIN, ROEANN I | 4 | 0.0010335917 |
MAJDIYYA, LISA | 8 | 0.0020671835 |
I identify the distinct patterns in the corpus and assign a name for each. I translate each pattern into a regex expression and use the expression to find an example case. This step is exploratory for new data and confirmatory for pre-examined data.
The length of name parts is irrelevant except for the distinction between a middle name and a compound first name. I restrict my regular expressions to identify the former (the commented out parts show the expressions I would need if the distinction was not important). To visualize the patterns that exist, every substring of letters in a name (word) can be reduced to a single character. Here I use ‘Z’. Non-alphanumeric characters are treated as delimiters and their order helps to define the pattern names.
patterns_exist <- getunique(str_replace_all(
string = input
, pattern = "\\w+","Z"))
patterns_exist <- data.frame(
cbind(patterns_exist
, as.numeric(unlist(lapply(patterns_exist,function (x) nchar(x)))))) %>%
arrange(as.numeric(V2),patterns_exist)
patterns_exist["pattnames"] <- c("dash",
"space",
"space_space",
"comma_space",
"space_dash",
"apos_comma_space",
"dash_comma_space",
"space_comma_space",
"comma_space_apos",
"comma_space_dash",
"comma_space_space",
"space_dot_comma_space",
"apos_comma_space_space",
"dash_comma_space_space",
"space_comma_space_space",
"comma_space_space_parens",
"space_dot_comma_space_space",
"space__space_comma_space_space")
patterns_exist["pattreg"] <- c("^\\w+-\\w+$",
"^\\w+\\s\\w+$",
"^\\w+\\s{2}\\w+$",
"^\\w+,\\s\\w+$",
"^\\w+\\s\\w+-\\w+$",
"^\\w+\\'\\w+,\\s\\w+$",
"^\\w+-\\w+,\\s\\w+$",
"^\\w+\\s\\w+,\\s\\w+$",
"^\\w+,\\s\\w+'\\w+$",
"^\\w+,\\s\\w+-\\w+$",
"^\\w+,\\s\\w+\\s[A-Z]{1}$", #"^\\w+,\\s\\w+\\s\\w+$",
"^\\w+\\s\\w+\\.,\\s\\w+$",
"^\\w+\\'\\w+,\\s\\w+\\s[A-Z]{1}$", #"^\\w+\\'\\w+,\\s\\w+\\s\\w+$",
"^\\w+-\\w+,\\s\\w+\\s[A-Z]{1}$",# "^\\w+-\\w+,\\s\\w+\\s\\w+$",
"^\\w+\\s\\w+,\\s\\w+\\s[A-Z]{1}$", #"^\\w+\\s\\w+,\\s\\w+\\s\\w+$",
"^\\w+,\\s\\w+.*[()]$",
"^\\w+\\s\\w+\\.,\\s\\w+\\s[A-Z]{1}$",#"^\\w+\\s\\w+\\.,\\s\\w+\\s\\w+$",
"^\\w+\\s\\w+\\s\\w+,\\s\\w+\\s[A-Z]{1}$") #"^\\w+\\s\\w+\\s\\w+,\\s\\w+\\s\\w+$")
Here are the first 10 lines of the table so far:
patterns_exist | pattnames | pattreg |
Z-Z | dash | ^\w+-\w+$ |
Z Z | space | ^\w+\s\w+$ |
Z Z | space_space | ^\w+\s{2}\w+$ |
Z, Z | comma_space | ^\w+,\s\w+$ |
Z Z-Z | space_dash | ^\w+\s\w+-\w+$ |
Z'Z, Z | apos_comma_space | ^\w+\'\w+,\s\w+$ |
Z-Z, Z | dash_comma_space | ^\w+-\w+,\s\w+$ |
Z Z, Z | space_comma_space | ^\w+\s\w+,\s\w+$ |
Z, Z'Z | comma_space_apos | ^\w+,\s\w+'\w+$ |
Z, Z-Z | comma_space_dash | ^\w+,\s\w+-\w+$ |
Next, I count the unique and all occurrences of the named patterns in the data and find examples to confirm that the the names flagged by the regex pattern matches identified Z-pattern.
get_reg_examples <- function(find,obs){
return(
paste(na.omit(grep(find,getunique(input),val=T)[1:obs]),collapse="|"))
}
patterns_exist[["pattexamp"]] <- patterns_exist[["pattreg"]] %>%
map_chr(get_reg_examples,1)
get_reg_examples_count <- function(find){
data.frame(pattexamp_instance_all=sum(grepl(find,(input)))
,pattexamp_instance_unique=sum(grepl(find,getunique(input))))
}
patterns_exist[c("pattexamp_count_all","pattexamp_count_unique")] <-
patterns_exist[["pattreg"]] %>% map_df(get_reg_examples_count)
Looking at counts of all the patterns, it appears that data entry workers generally relied on either comma+space or a space delimiter, which I assumed from the beginning to indicate Last Name, First Name and First Name Last Name pairs, respectively. Other patterns appear to be true errors or variation in names introduced by non-alphanumeric characters proper to a particular name (e.g., “’” and “-”).
patterns_exist | pattexamp | pattexamp_count_all | pattexamp_count_unique |
Z-Z | RAMIREZ-ANETA | 27 | 11 |
Z Z | LATAYVIA ALEXANDER | 60 | 23 |
Z Z | MEGAN WALKER | 3 | 1 |
Z, Z | PERRY, FREDDRICKA | 2,458 | 591 |
Z Z-Z | MERCEDES DIAZ-PEREZ | 6 | 2 |
Z'Z, Z | O'BRIEN, CAROLINE | 2 | 1 |
Z-Z, Z | CLASSENS-SOTO, FERNANDO | 70 | 19 |
Z Z, Z | DIAZ PEREZ, MERCEDES | 39 | 10 |
Z, Z'Z | PETTY, TE'MOY | 3 | 1 |
Z, Z-Z | SIMON, KASI-ANN | 9 | 2 |
Z, Z Z | MOORE, BRIDGET N | 995 | 272 |
Z Z., Z | WELLS JR., RICK | 9 | 1 |
Z'Z, Z Z | D'ALESSANDRO, AMANDA N | 5 | 2 |
Z-Z, Z Z | MARTINEZ-MORALES, ADRIANA P | 34 | 6 |
Z Z, Z Z | BOICE II, JOHN E | 9 | 2 |
Z, Z (Z) | MYERS, DEBORAH (DEBRA) | 1 | 1 |
Z Z., Z Z | BENNETT JR., STACEY L | 8 | 2 |
Z Z Z, Z Z | ALERS DE AZA, JASMINE E | 3 | 1 |
Now, I create additions to the table recording my regex capture groups for name parts for each name pattern (rulecap
). I also define the way in which the captured groups will be reordered to create a consistent pattern across all names (ruleout
).
patterns_exist["rulecap"] <- c(
"(\\w*)-?\\s*(\\w*)",
"(\\w*)-?\\s*(\\w*)",
"(\\w*)-?\\s*(\\w*)",
"([^,]+),\\s*(.*)",
"(\\w*)\\s*(\\w*-\\w*)$",
"([^,]+),\\s*(.*)",
"([^,]+),\\s*(.*)",
"^(\\w+)\\s+(\\w+)\\s*,\\s*(\\w+)$",
"([^,]+),\\s*(.*)",
"([^,]+),\\s*(.*)",
"^(\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"([^,]+),\\s*(.*)",
"^(\\w+'\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"^(\\w+-\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"^(\\w+\\s\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"^(\\w+)\\s*,\\s*(\\w+)\\s+\\(.*",
"^(\\w+\\s\\w+\\.)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"^(\\w+\\s\\w+\\s\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$")
patterns_exist["ruleout"] <- c(
"\\2 \\1",
"\\1 \\2",
"\\1 \\2",
"\\2 \\1",
"\\1 \\2",
"\\2 \\1",
"\\2 \\1",
"\\3 \\1-\\2",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1")
Here are the first 10 lines:
patterns_exist | pattnames | rulecap | ruleout |
Z-Z | dash | ([^,]+)-\s*(.*) | \2 \1 |
Z Z | space | \1 | |
Z Z | space_space | ([^,]+)\s{2}(.*) | \2 \1 |
Z, Z | comma_space | ([^,]+),\s*(.*) | \2 \1 |
Z Z-Z | space_dash | \1 | |
Z'Z, Z | apos_comma_space | ([^,]+),\s*(.*) | \2 \1 |
Z-Z, Z | dash_comma_space | ([^,]+),\s*(.*) | \2 \1 |
Z Z, Z | space_comma_space | ^(\w+)\s+(\w+)\s*,\s*(\w+)$ | \3 \1-\2 |
Z, Z'Z | comma_space_apos | ([^,]+),\s*(.*) | \2 \1 |
Z, Z-Z | comma_space_dash | ([^,]+),\s*(.*) | \2 \1 |
Checking to see if any names are not flagged by my regular expressions, it appears that I have not accounted for those compound first names (remember that I assumed a middle name is identified by a middle initial whereas a compound first name consists of two words, each with more than one letter).
leftovers <- grep(paste(patterns_exist$pattreg, collapse = "|")
, input,val=TRUE,invert = T)
head(leftovers)
[1] "HERNANDEZ, JANETSY RUBIE" "SMITH, KADEEN OLIVIA"
[3] "LYNCH, MATTHEW ERWI" "JENSEN, MARY ELIZABETH"
[5] "HERNANDEZ, CHERYL LYNNE" "JENSEN, MARY ELIZABETH"
(temp <- getunique(str_replace_all(
string = leftovers
, pattern = "\\w+","Z")))
[1] "Z, Z Z" "Z-Z, Z Z"
I use stringr::str_replace_all()
to create additional regex patterns which flag second words of length greater than 1. I then update the remaining columns by re-applying my counter and example-getter functions.
temp <- (patterns_exist %>%
filter(patterns_exist %in% temp)) %>%
mutate(
pattreg = str_replace_all(string = pattreg
, pattern = "1", replacement = "2,")
,ruleout = str_replace_all(string = ruleout
, pattern = "\\\\2 \\\\1", replacement = "\\\\2 \\\\3 \\\\1")
,pattexamp = NA
,pattnames = paste0(pattnames,"_middle")
)
patterns_exist <- patterns_exist %>% bind_rows(temp)
patterns_exist[["pattexamp"]] <-
patterns_exist[["pattreg"]] %>%
map_chr(get_reg_examples,1)
patterns_exist[c("pattexamp_count_all","pattexamp_count_unique")] <-
patterns_exist[["pattreg"]] %>%
map_df(get_reg_examples_count)
(leftovers <- grep(paste(patterns_exist$pattreg, collapse = "|")
, input,val=TRUE,invert = T))
[1] NA NA NA NA NA
The final table can be used for documentation purposes, but also for re-writing the values matching my named patterns according to the rules in the final two columns.
patterns_exist | pattnames | pattexamp | rulecap | ruleout |
Z-Z | dash | RAMIREZ-ANETA | ([^,]+)-\s*(.*) | \2 \1 |
Z Z | space | LATAYVIA ALEXANDER | \1 | |
Z Z | space_space | MEGAN WALKER | ([^,]+)\s{2}(.*) | \2 \1 |
Z, Z | comma_space | PERRY, FREDDRICKA | ([^,]+),\s*(.*) | \2 \1 |
Z Z-Z | space_dash | MERCEDES DIAZ-PEREZ | \1 | |
Z'Z, Z | apos_comma_space | O'BRIEN, CAROLINE | ([^,]+),\s*(.*) | \2 \1 |
Z-Z, Z | dash_comma_space | CLASSENS-SOTO, FERNANDO | ([^,]+),\s*(.*) | \2 \1 |
Z Z, Z | space_comma_space | DIAZ PEREZ, MERCEDES | ^(\w+)\s+(\w+)\s*,\s*(\w+)$ | \3 \1-\2 |
Z, Z'Z | comma_space_apos | PETTY, TE'MOY | ([^,]+),\s*(.*) | \2 \1 |
Z, Z-Z | comma_space_dash | SIMON, KASI-ANN | ([^,]+),\s*(.*) | \2 \1 |
Z, Z Z | comma_space_space | MOORE, BRIDGET N | ^(\w+)\s*,\s*(\w+)\s+(\w+)$ | \2 \1 |
Z Z., Z | space_dot_comma_space | WELLS JR., RICK | ([^,]+),\s*(.*) | \2 \1 |
Z'Z, Z Z | apos_comma_space_space | D'ALESSANDRO, AMANDA N | ^(\w+'\w+)\s*,\s*(\w+)\s+(\w+)$ | \2 \1 |
Z-Z, Z Z | dash_comma_space_space | MARTINEZ-MORALES, ADRIANA P | ^(\w+-\w+)\s*,\s*(\w+)\s+(\w+)$ | \2 \1 |
Z Z, Z Z | space_comma_space_space | BOICE II, JOHN E | ^(\w+\s\w+)\s*,\s*(\w+)\s+(\w+)$ | \2 \1 |
Z, Z (Z) | comma_space_space_parens | MYERS, DEBORAH (DEBRA) | ^(\w+)\s*,\s*(\w+)\s+\(.* | \2 \1 |
Z Z., Z Z | space_dot_comma_space_space | BENNETT JR., STACEY L | ^(\w+\s\w+\.)\s*,\s*(\w+)\s+(\w+)$ | \2 \1 |
Z Z Z, Z Z | space__space_comma_space_space | ALERS DE AZA, JASMINE E | ^(\w+\s\w+\s\w+)\s*,\s*(\w+)\s+(\w+)$ | \2 \1 |
Z, Z Z | comma_space_space_middle | HERNANDEZ, JANETSY RUBIE | ^(\w+)\s*,\s*(\w+)\s+(\w+)$ | \2 \3 \1 |
Z-Z, Z Z | dash_comma_space_space_middle | MANDELLI-LOPEZ, ANA PAULA | ^(\w+-\w+)\s*,\s*(\w+)\s+(\w+)$ | \2 \3 \1 |