Some times we need to preserve the complexity of human entry error in a given set of names in our data, while simultaneously protecting the identity of the named individuals. In this project, I show how I anonymize named data.
packageloader(c(
"tidyverse"
,"dplyr"
,"randomNames"
,"stringr"
,"stringi"
,"kableExtra"
,"flextable"
))
[1] "packages loaded"
Assumptions about the source data affect the rules that should be applied. These assumptions enter the rule logic in the order specified below.
Every name in the data represents a unique individual, though slight variations of that individual’s name might exist due to input error.
The masking of names larger than 3 characters is sufficient to remove identifiability
Name suffixes (e.g., Jr. , III, etc.) should be retained because they might provide an exception to the above rule, while prefixes are dropped (Mrs., Dr., etc.).
Middle name initials are superfluous (i.e., “Joshua J Scriven” is the same person as “Joshua Scriven”), but the variations around them are retained to preserve complexity in the data..
Capitalization errors exist in the data and there is a preference for uppercase in the output.
I create two functions to get my results: name_anonymizer
and name_anonymizer_unique
, where the second is a subfuction of the other. name_anonymizer_unique
has two modes that complete step 1 and step 2 of creating a crosswalk for old versions of names to new version. Step 1 involves the harmonization of all variations of names to a single group identifier. For this, I use all substrings of a name that are longer than 3 characters, as smaller words are either rare names or are likely to be suffixes/prefixes. Step 2 uses the group identifiers to consistently provide a single replacement for very member of the group of name variations. I create a replacement string for each of the large substrings>3 using the randomNames::randomNames ()
function set to first names only. Because that function returns multi-word names, I filter those from the generated vector and sample from the resulting pool for my new names. The results of step 1 and step 2 are merged using the un-replaced substrings as unique identifiers. Then, I replace each of the original names, with the version having its substrings>3 parts swapped out using a loop that preserves the original order of the originally inputted data. All of these steps are performed within my name_anonymizer()
function, which receives a vectors of names as input and produces a dataframe with original and new names (in troubleshoot mode) or just the new names (default mode).
name_anonymizer <- function(data_input,mode="get"){
data_input_bac <- data_input
name_pool <- toupper(randomNames(len(data_input)*3,which.names = "first"))
name_pool <- name_pool[!grepl(" ",name_pool)]
name_anonymizer_unique <- function(name,mode="harmonize"){
namepatt <- (str_replace_all(
string = getunique(name)
, pattern = "\\w{4,}","\\(\\\\w+\\)"))
str_count(namepatt, "w")
(name_old <- (unlist(strsplit(sub(namepatt, paste0(
"\\",1:str_count(namepatt, "w")
, collapse=" "),name)," "))))
(name_old_sorted <- paste(collapse=" "
, sort(unlist(strsplit(sub(namepatt
, paste0("\\",1:str_count(namepatt, "w")
, collapse=" "),name)," ")))))
name_new <- sample(name_pool,size = str_count(name_old_sorted, " ")+1)
if (mode=="harmonize") {
return(
data.frame(
name_old = name
,name_old_sorted = name_old_sorted
)
)
} else if (mode=="anonymize"){
return(
data.frame(
# name_old = name
name_old_sorted = name_old_sorted
,name_new =stri_replace_all_regex(
str = name
,pattern = name_old
,replacement = name_new
, vectorize_all = F)
,namepattern = namepatt
,size_old = str_count(name_old_sorted, " ")+1
)
)
}
}
temp1 <- getunique(data_input) %>%
map_df(name_anonymizer_unique,"harmonize")
temp2 <- getunique(temp1$name_old_sorted) %>%
map_df(name_anonymizer_unique,"anonymize")
names_unique <- merge(temp1,temp2,"name_old_sorted")
(names_unique)
name_new_tab <- str_split(names_unique$name_new, " ", simplify = T)
for(n_nq in 1:nrow(names_unique) ){
patt_row <- names_unique %>% filter(name_old==names_unique$name_old[n_nq])
keeps <- grep(paste0("^",names_unique$name_old[n_nq],"$"),data_input)
name_new_reconfig <- stri_replace_all_regex(
str = names_unique$name_old[n_nq]
,pattern = c(str_split(
sub(names_unique$namepattern[n_nq]
,paste0("\\",1:str_count(names_unique$namepattern[n_nq], "w"), collapse=" ")
,names_unique$name_old_sorted[n_nq]), " ", simplify = T))
,replacement = name_new_tab[n_nq,][name_new_tab[n_nq,]!=""]
,vectorize_all = F
)
data_input[keeps] <- name_new_reconfig
}
return(
if (mode=="troubleshoot"){
data.frame(
name_old = data_input_bac
,name_new = data_input)
} else {
data_input
})
}
input <- c("SCRIVEN, JOSHUA"
,"SCRIVEN, JOSHUA J"
,"SCRIVEN, JOSHUA J"
, "JOSHUA J SCRIVEN"
, "KATRINA O THOMAS"
, "KENNETH EARN JONES III"
, "JAMES COOPER"
, "TOMMY CARTER JR"
, "JENNIFER LISA SMITH")
name_anonymizer(input,"troubleshoot") %>%
head() %>%
regulartable() %>%
autofit()
name_old | name_new |
SCRIVEN, JOSHUA | DANIEL, ISIAH |
SCRIVEN, JOSHUA J | DANIEL, ISIAH J |
SCRIVEN, JOSHUA J | DANIEL, ISIAH J |
JOSHUA J SCRIVEN | ISIAH J DANIEL |
KATRINA O THOMAS | VERONICA O EVAN |
KENNETH EARN JONES III | NICHOLAS PAUL JESSICA III |