Harmonizing Person Names Part I: Creating Re-coding Table

R Excel Regex Github Text Data Cleaning

I show how I create a re-coding table for harmonizing variation in person names entered without a clear format instructions.

Joshua Scriven
2023-03-27

Introduction

When data validation is not used to enforce a certain format for collecting person names, certain variations in data entry for the same person might emerge according to different naming conventions. These variations might lie in the name order (i.e., “LASTNAME, FIRSTNAME” vs “FIRSTNAME LASTNAME”), its content (e.g., including “JR” or “Dr.”), the type of delimiter used (e.g., “FIRSTNAME LASTNAME” vs “FIRSTNAME-LASTNAME”), or different ways of writing first and last names made up of of multiple words. In this post, I show how I create a re-coding table for names, in a subsequent post, I show how a re-coding table can be used to correct existing and new names as a part of an automated process.

packageloader(c(
"tidyverse"
,"dplyr"
,"randomNames"
,"stringr"
,"stringi"
,"purrr"
,"janitor"
,"kableExtra"
,"flextable"
))
[1] "packages loaded"

To use source data across multiple projects, I centralize in a single folder then easily retrieve into projects with my custom filemaker() function.

data_dir <- "../data/"
data <-read.csv(filemaker("test_data_names","csv"))

Declare Assumptions

Assumptions about the source data affect the rules that should be applied. These assumptions enter the rule logic in the order specified below.

Quick Look at Data

Taking a look at the data, there is wide variation in the names and name formatting. I use my name_anonymizer() function to anonymize the names in the data. See my other post on anonymizing names to see the code for it.

input <- name_anonymizer(toupper(data$worker))

tabyl(input)[sample(x = (1:nrow(tabyl(input))),25),]%>%
  head(10) %>%
  select(input,n,percent ) %>%
  regulartable() %>% 
  autofit()

Classify Patterns in Data

I identify the distinct patterns in the corpus and assign a name for each. I translate each pattern into a regex expression and use the expression to find an example case. This step is exploratory for new data and confirmatory for pre-examined data.

Step 1.

The length of name parts is irrelevant except for the distinction between a middle name and a compound first name. I restrict my regular expressions to identify the former (the commented out parts show the expressions I would need if the distinction was not important). To visualize the patterns that exist, every substring of letters in a name (word) can be reduced to a single character. Here I use ‘Z’. Non-alphanumeric characters are treated as delimiters and their order helps to define the pattern names.

patterns_exist <- getunique(str_replace_all(
  string = input
  , pattern = "\\w+","Z"))

patterns_exist <- data.frame(
  cbind(patterns_exist
        , as.numeric(unlist(lapply(patterns_exist,function (x) nchar(x)))))) %>%
  arrange(as.numeric(V2),patterns_exist)

patterns_exist["pattnames"] <- c("dash",
"space",
"space_space",
"comma_space",
"space_dash",
"apos_comma_space",
"dash_comma_space",
"space_comma_space",
"comma_space_apos",
"comma_space_dash",
"comma_space_space",
"space_dot_comma_space",
"apos_comma_space_space",
"dash_comma_space_space",
"space_comma_space_space",
"comma_space_space_parens",
"space_dot_comma_space_space",
"space__space_comma_space_space")

patterns_exist["pattreg"] <- c("^\\w+-\\w+$",
"^\\w+\\s\\w+$",
"^\\w+\\s{2}\\w+$",
"^\\w+,\\s\\w+$",
"^\\w+\\s\\w+-\\w+$",
"^\\w+\\'\\w+,\\s\\w+$",
"^\\w+-\\w+,\\s\\w+$",
"^\\w+\\s\\w+,\\s\\w+$",
"^\\w+,\\s\\w+'\\w+$",
"^\\w+,\\s\\w+-\\w+$",
"^\\w+,\\s\\w+\\s[A-Z]{1}$", #"^\\w+,\\s\\w+\\s\\w+$",
"^\\w+\\s\\w+\\.,\\s\\w+$",
"^\\w+\\'\\w+,\\s\\w+\\s[A-Z]{1}$", #"^\\w+\\'\\w+,\\s\\w+\\s\\w+$",
"^\\w+-\\w+,\\s\\w+\\s[A-Z]{1}$",# "^\\w+-\\w+,\\s\\w+\\s\\w+$",
"^\\w+\\s\\w+,\\s\\w+\\s[A-Z]{1}$", #"^\\w+\\s\\w+,\\s\\w+\\s\\w+$",
"^\\w+,\\s\\w+.*[()]$",
"^\\w+\\s\\w+\\.,\\s\\w+\\s[A-Z]{1}$",#"^\\w+\\s\\w+\\.,\\s\\w+\\s\\w+$",
"^\\w+\\s\\w+\\s\\w+,\\s\\w+\\s[A-Z]{1}$") #"^\\w+\\s\\w+\\s\\w+,\\s\\w+\\s\\w+$")

Here are the first 10 lines of the table so far:

Step 2.

Next, I count the unique and all occurrences of the named patterns in the data and find examples to confirm that the the names flagged by the regex pattern matches identified Z-pattern.

get_reg_examples <- function(find,obs){
  return(
    paste(na.omit(grep(find,getunique(input),val=T)[1:obs]),collapse="|"))
}

patterns_exist[["pattexamp"]] <- patterns_exist[["pattreg"]] %>% 
  map_chr(get_reg_examples,1)

get_reg_examples_count <- function(find){
  data.frame(pattexamp_instance_all=sum(grepl(find,(input)))
,pattexamp_instance_unique=sum(grepl(find,getunique(input))))
}

patterns_exist[c("pattexamp_count_all","pattexamp_count_unique")] <- 
  patterns_exist[["pattreg"]] %>% map_df(get_reg_examples_count)

Looking at counts of all the patterns, it appears that data entry workers generally relied on either comma+space or a space delimiter, which I assumed from the beginning to indicate Last Name, First Name and First Name Last Name pairs, respectively. Other patterns appear to be true errors or variation in names introduced by non-alphanumeric characters proper to a particular name (e.g., “’” and “-”).

Create Recoding Rules

Now, I create additions to the table recording my regex capture groups for name parts for each name pattern (rulecap). I also define the way in which the captured groups will be reordered to create a consistent pattern across all names (ruleout).

patterns_exist["rulecap"] <- c(
"(\\w*)-?\\s*(\\w*)",
"(\\w*)-?\\s*(\\w*)",
"(\\w*)-?\\s*(\\w*)",
"([^,]+),\\s*(.*)",
"(\\w*)\\s*(\\w*-\\w*)$",
"([^,]+),\\s*(.*)",
"([^,]+),\\s*(.*)",
"^(\\w+)\\s+(\\w+)\\s*,\\s*(\\w+)$",
"([^,]+),\\s*(.*)",
"([^,]+),\\s*(.*)",
"^(\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"([^,]+),\\s*(.*)",
"^(\\w+'\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"^(\\w+-\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"^(\\w+\\s\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"^(\\w+)\\s*,\\s*(\\w+)\\s+\\(.*",
"^(\\w+\\s\\w+\\.)\\s*,\\s*(\\w+)\\s+(\\w+)$",
"^(\\w+\\s\\w+\\s\\w+)\\s*,\\s*(\\w+)\\s+(\\w+)$")


patterns_exist["ruleout"] <- c(
"\\2 \\1",
"\\1 \\2",
"\\1 \\2",
"\\2 \\1",
"\\1 \\2",
"\\2 \\1",
"\\2 \\1",
"\\3 \\1-\\2",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1",
"\\2 \\1")

Here are the first 10 lines:

Adding Rules for Compound Names

Checking to see if any names are not flagged by my regular expressions, it appears that I have not accounted for those compound first names (remember that I assumed a middle name is identified by a middle initial whereas a compound first name consists of two words, each with more than one letter).

leftovers <- grep(paste(patterns_exist$pattreg, collapse = "|")
                  , input,val=TRUE,invert = T)
head(leftovers)
[1] "HERNANDEZ, JANETSY RUBIE" "SMITH, KADEEN OLIVIA"    
[3] "LYNCH, MATTHEW ERWI"      "JENSEN, MARY ELIZABETH"  
[5] "HERNANDEZ, CHERYL LYNNE"  "JENSEN, MARY ELIZABETH"  
(temp <- getunique(str_replace_all(
  string = leftovers
  , pattern = "\\w+","Z")))
[1] "Z, Z Z"   "Z-Z, Z Z"

I use stringr::str_replace_all() to create additional regex patterns which flag second words of length greater than 1. I then update the remaining columns by re-applying my counter and example-getter functions.

temp <- (patterns_exist %>% 
           filter(patterns_exist %in% temp)) %>% 
  mutate(
  pattreg = str_replace_all(string = pattreg
                            , pattern = "1", replacement = "2,")
  ,ruleout = str_replace_all(string = ruleout
                             , pattern = "\\\\2 \\\\1", replacement = "\\\\2 \\\\3 \\\\1")
  ,pattexamp = NA
  ,pattnames = paste0(pattnames,"_middle")
)

patterns_exist <- patterns_exist %>% bind_rows(temp) 

patterns_exist[["pattexamp"]] <- 
  patterns_exist[["pattreg"]] %>% 
  map_chr(get_reg_examples,1)

patterns_exist[c("pattexamp_count_all","pattexamp_count_unique")] <- 
  patterns_exist[["pattreg"]] %>% 
  map_df(get_reg_examples_count)

(leftovers <- grep(paste(patterns_exist$pattreg, collapse = "|")
                   , input,val=TRUE,invert = T))
[1] NA NA NA NA NA

Result

The final table can be used for documentation purposes, but also for re-writing the values matching my named patterns according to the rules in the final two columns.