Joshua Scriven Data Analyst Portfolio: How to Anonymize Names

packageloader(c(
"tidyverse"
,"dplyr"
,"randomNames"
,"stringr"
,"stringi"
,"kableExtra"
,"flextable"
))

[1] "packages loaded"

Assumptions about the source data affect the rules that should be applied. These assumptions enter the rule logic in the order specified below.

Every name in the data represents a unique individual, though slight variations of that individual’s name might exist due to input error.
The masking of names larger than 3 characters is sufficient to remove identifiability
Name suffixes (e.g., Jr. , III, etc.) should be retained because they might provide an exception to the above rule, while prefixes are dropped (Mrs., Dr., etc.).
Middle name initials are superfluous (i.e., “Joshua J Scriven” is the same person as “Joshua Scriven”), but the variations around them are retained to preserve complexity in the data..
Capitalization errors exist in the data and there is a preference for uppercase in the output.

I create two functions to get my results: name_anonymizer and name_anonymizer_unique, where the second is a subfuction of the other. name_anonymizer_unique has two modes that complete step 1 and step 2 of creating a crosswalk for old versions of names to new version. Step 1 involves the harmonization of all variations of names to a single group identifier. For this, I use all substrings of a name that are longer than 3 characters, as smaller words are either rare names or are likely to be suffixes/prefixes. Step 2 uses the group identifiers to consistently provide a single replacement for very member of the group of name variations. I create a replacement string for each of the large substrings>3 using the randomNames::randomNames () function set to first names only. Because that function returns multi-word names, I filter those from the generated vector and sample from the resulting pool for my new names. The results of step 1 and step 2 are merged using the un-replaced substrings as unique identifiers. Then, I replace each of the original names, with the version having its substrings>3 parts swapped out using a loop that preserves the original order of the originally inputted data. All of these steps are performed within my name_anonymizer() function, which receives a vectors of names as input and produces a dataframe with original and new names (in troubleshoot mode) or just the new names (default mode).

name_anonymizer <- function(data_input,mode="get"){
  data_input_bac <- data_input
  name_pool <- toupper(randomNames(len(data_input)*3,which.names = "first"))
  name_pool <- name_pool[!grepl(" ",name_pool)]

    name_anonymizer_unique <- function(name,mode="harmonize"){
    namepatt <- (str_replace_all(
      string = getunique(name)
      , pattern = "\\w{4,}","\\(\\\\w+\\)"))
    str_count(namepatt, "w")
    (name_old <- (unlist(strsplit(sub(namepatt, paste0(
                                        "\\",1:str_count(namepatt, "w")
                                        , collapse=" "),name)," "))))
    (name_old_sorted <- paste(collapse=" "
                  , sort(unlist(strsplit(sub(namepatt
                               , paste0("\\",1:str_count(namepatt, "w")
                                        , collapse=" "),name)," ")))))
    name_new <- sample(name_pool,size = str_count(name_old_sorted, " ")+1)
    if (mode=="harmonize") {
      return(
        data.frame(
          name_old = name
          ,name_old_sorted = name_old_sorted
        )
      )
    } else if (mode=="anonymize"){
      return(
        data.frame(
          # name_old = name
          name_old_sorted = name_old_sorted
          ,name_new =stri_replace_all_regex(
            str = name
            ,pattern = name_old
            ,replacement = name_new
            , vectorize_all = F)
          ,namepattern = namepatt
          ,size_old = str_count(name_old_sorted, " ")+1
        )
      )
      
    }
  }
  
  temp1 <- getunique(data_input) %>% 
    map_df(name_anonymizer_unique,"harmonize")
  temp2 <- getunique(temp1$name_old_sorted) %>% 
    map_df(name_anonymizer_unique,"anonymize")
  names_unique <- merge(temp1,temp2,"name_old_sorted") 
  (names_unique)

  name_new_tab <- str_split(names_unique$name_new, " ", simplify = T)

  for(n_nq in 1:nrow(names_unique) ){
    patt_row <- names_unique %>% filter(name_old==names_unique$name_old[n_nq])
    keeps <- grep(paste0("^",names_unique$name_old[n_nq],"$"),data_input)
    name_new_reconfig <- stri_replace_all_regex(
      str = names_unique$name_old[n_nq]
      ,pattern = c(str_split(
        sub(names_unique$namepattern[n_nq]
            ,paste0("\\",1:str_count(names_unique$namepattern[n_nq], "w"), collapse=" ")
            ,names_unique$name_old_sorted[n_nq]), " ", simplify = T))
      ,replacement = name_new_tab[n_nq,][name_new_tab[n_nq,]!=""]
      ,vectorize_all = F
        )
    data_input[keeps] <- name_new_reconfig
  }  

    return(
    if (mode=="troubleshoot"){
      data.frame(
        name_old = data_input_bac
        ,name_new = data_input)
    } else {
      data_input
    })
  
}

input <- c("SCRIVEN, JOSHUA"
           ,"SCRIVEN, JOSHUA J"
           ,"SCRIVEN, JOSHUA J"
           , "JOSHUA J SCRIVEN"
           , "KATRINA O THOMAS"
           , "KENNETH EARN JONES III"
           , "JAMES COOPER"
           , "TOMMY CARTER JR"
           , "JENNIFER LISA SMITH")

name_anonymizer(input,"troubleshoot")  %>% 
  head() %>%
  regulartable() %>% 
  autofit()

name_old	name_new
SCRIVEN, JOSHUA	DANIEL, ISIAH
SCRIVEN, JOSHUA J	DANIEL, ISIAH J
SCRIVEN, JOSHUA J	DANIEL, ISIAH J
JOSHUA J SCRIVEN	ISIAH J DANIEL
KATRINA O THOMAS	VERONICA O EVAN
KENNETH EARN JONES III	NICHOLAS PAUL JESSICA III