How to Anonymize Names

R Regex Github Data Privacy

Some times we need to preserve the complexity of human entry error in a given set of names in our data, while simultaneously protecting the identity of the named individuals. In this project, I show how I anonymize named data.

Joshua Scriven
2023-03-25
packageloader(c(
"tidyverse"
,"dplyr"
,"randomNames"
,"stringr"
,"stringi"
,"kableExtra"
,"flextable"
))
[1] "packages loaded"

Assumptions about the source data affect the rules that should be applied. These assumptions enter the rule logic in the order specified below.

I create two functions to get my results: name_anonymizer and name_anonymizer_unique, where the second is a subfuction of the other. name_anonymizer_unique has two modes that complete step 1 and step 2 of creating a crosswalk for old versions of names to new version. Step 1 involves the harmonization of all variations of names to a single group identifier. For this, I use all substrings of a name that are longer than 3 characters, as smaller words are either rare names or are likely to be suffixes/prefixes. Step 2 uses the group identifiers to consistently provide a single replacement for very member of the group of name variations. I create a replacement string for each of the large substrings>3 using the randomNames::randomNames () function set to first names only. Because that function returns multi-word names, I filter those from the generated vector and sample from the resulting pool for my new names. The results of step 1 and step 2 are merged using the un-replaced substrings as unique identifiers. Then, I replace each of the original names, with the version having its substrings>3 parts swapped out using a loop that preserves the original order of the originally inputted data. All of these steps are performed within my name_anonymizer() function, which receives a vectors of names as input and produces a dataframe with original and new names (in troubleshoot mode) or just the new names (default mode).

name_anonymizer <- function(data_input,mode="get"){
  data_input_bac <- data_input
  name_pool <- toupper(randomNames(len(data_input)*3,which.names = "first"))
  name_pool <- name_pool[!grepl(" ",name_pool)]

    name_anonymizer_unique <- function(name,mode="harmonize"){
    namepatt <- (str_replace_all(
      string = getunique(name)
      , pattern = "\\w{4,}","\\(\\\\w+\\)"))
    str_count(namepatt, "w")
    (name_old <- (unlist(strsplit(sub(namepatt, paste0(
                                        "\\",1:str_count(namepatt, "w")
                                        , collapse=" "),name)," "))))
    (name_old_sorted <- paste(collapse=" "
                  , sort(unlist(strsplit(sub(namepatt
                               , paste0("\\",1:str_count(namepatt, "w")
                                        , collapse=" "),name)," ")))))
    name_new <- sample(name_pool,size = str_count(name_old_sorted, " ")+1)
    if (mode=="harmonize") {
      return(
        data.frame(
          name_old = name
          ,name_old_sorted = name_old_sorted
        )
      )
    } else if (mode=="anonymize"){
      return(
        data.frame(
          # name_old = name
          name_old_sorted = name_old_sorted
          ,name_new =stri_replace_all_regex(
            str = name
            ,pattern = name_old
            ,replacement = name_new
            , vectorize_all = F)
          ,namepattern = namepatt
          ,size_old = str_count(name_old_sorted, " ")+1
        )
      )
      
    }
  }
  
  temp1 <- getunique(data_input) %>% 
    map_df(name_anonymizer_unique,"harmonize")
  temp2 <- getunique(temp1$name_old_sorted) %>% 
    map_df(name_anonymizer_unique,"anonymize")
  names_unique <- merge(temp1,temp2,"name_old_sorted") 
  (names_unique)

  name_new_tab <- str_split(names_unique$name_new, " ", simplify = T)

  for(n_nq in 1:nrow(names_unique) ){
    patt_row <- names_unique %>% filter(name_old==names_unique$name_old[n_nq])
    keeps <- grep(paste0("^",names_unique$name_old[n_nq],"$"),data_input)
    name_new_reconfig <- stri_replace_all_regex(
      str = names_unique$name_old[n_nq]
      ,pattern = c(str_split(
        sub(names_unique$namepattern[n_nq]
            ,paste0("\\",1:str_count(names_unique$namepattern[n_nq], "w"), collapse=" ")
            ,names_unique$name_old_sorted[n_nq]), " ", simplify = T))
      ,replacement = name_new_tab[n_nq,][name_new_tab[n_nq,]!=""]
      ,vectorize_all = F
        )
    data_input[keeps] <- name_new_reconfig
  }  

    return(
    if (mode=="troubleshoot"){
      data.frame(
        name_old = data_input_bac
        ,name_new = data_input)
    } else {
      data_input
    })
  
}

input <- c("SCRIVEN, JOSHUA"
           ,"SCRIVEN, JOSHUA J"
           ,"SCRIVEN, JOSHUA J"
           , "JOSHUA J SCRIVEN"
           , "KATRINA O THOMAS"
           , "KENNETH EARN JONES III"
           , "JAMES COOPER"
           , "TOMMY CARTER JR"
           , "JENNIFER LISA SMITH")

name_anonymizer(input,"troubleshoot")  %>% 
  head() %>%
  regulartable() %>% 
  autofit()