Harmonizing Person Names Part II: Using my Custom Re-coding Table

R Regex Github Text Data Cleaning

I show how I use a re-coding table to harmonize variation in person names entered without clear format instructions.

Joshua Scriven
2023-03-30

Introduction

When data validation is not used to enforce a format for collecting person names, the same person's name can be entered in several ways, depending on the naming convention each data-entry person follows. In this post, I show how I use a custom re-coding table for names. In a subsequent post, I show how a re-coding table can be used to correct existing and new names as part of an automated process.

packageloader(c(
"tidyverse"
,"dplyr"
,"randomNames"
,"stringr"
,"stringi"
,"purrr"
,"janitor"
,"kableExtra"
,"flextable"
))
[1] "packages loaded"

To use source data across multiple projects, I centralize it in a single folder and then retrieve it into each project with my custom filemaker() function.

load("recodetab.Rdata")
load("data_input.Rdata")

Review Assumptions and Table Validity

The assumptions in Part I of this post are translated into the rules recorded in the recoding table, which identify patterns in names and rearrange name parts.

I verify that no names exist in the data that aren’t flagged by my table patterns.

[1] 0
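A check of this kind could be sketched roughly as follows. The table and names here are hypothetical stand-ins, since the contents of recodetab.Rdata and data_input.Rdata are not shown in this post. The column names `name` and `pattern`, the regular expressions, and the helper `matches_any()` are all assumptions made for illustration, and the sketch uses base R rather than stringr to stay self-contained.

```r
# Hypothetical stand-in for the recoding table (the real one is not shown here).
# The regular expressions are illustrative guesses at what each named pattern means.
recodetab_demo <- data.frame(
  name    = c("space", "space_dash", "comma"),
  pattern = c("^[A-Z]+ [A-Z]+$",
              "^[A-Z]+ [A-Z]+-[A-Z]+$",
              "^[A-Z]+, [A-Z]+$"),
  stringsAsFactors = FALSE
)

# Hypothetical example names
names_demo <- c("JANE DOE", "MARY SMITH-JONES", "DOE, JANE")

# TRUE if a name matches at least one pattern in the table
matches_any <- function(nm, patterns) {
  any(vapply(patterns, function(p) grepl(p, nm, perl = TRUE), logical(1)))
}

flagged <- vapply(names_demo, matches_any, logical(1),
                  patterns = recodetab_demo$pattern)

# Count of names not flagged by any pattern; zero means the table covers the data
n_unmatched <- sum(!flagged)
n_unmatched
```

A non-zero count would signal that a new pattern row is needed in the table before harmonizing.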

Because the space and space_dash patterns are already equivalent to the desired output pattern, rulecap is empty for them and ruleout returns the entire string. This simple code (below) is all that’s needed to harmonize any set of names, given that the count of unmatched names (above) is zero.

Because some of the rulecap expressions could create capture groups from more than one named pattern, I restrict each expression to capture only from the named pattern it shares a line with in the recoding table.
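The idea of matching a name to one pattern row and then applying only that row's rulecap and ruleout might look something like the sketch below. It is not the post's actual code: the table contents, the regular expressions, and the helper `harmonize_one()` are hypothetical, and treating an empty rulecap as "return the name unchanged" is my assumption about how the empty entries behave.

```r
# Hypothetical recoding table: one row per named pattern, with its own
# capture rule (rulecap) and output rule (ruleout).
recodetab_demo <- data.frame(
  name    = c("space", "comma"),
  pattern = c("^[A-Z]+ [A-Z]+$", "^[A-Z]+, [A-Z]+$"),
  rulecap = c("", "^([A-Z]+), ([A-Z]+)$"),
  ruleout = c("", "\\2 \\1"),
  stringsAsFactors = FALSE
)

# Harmonize a single name using only the row whose pattern it matches
harmonize_one <- function(nm, tab) {
  hit <- which(vapply(tab$pattern, function(p) grepl(p, nm, perl = TRUE),
                      logical(1)))
  if (length(hit) == 0) return(NA_character_)  # no pattern flags this name
  i <- hit[[1]]                                # first matching row only
  if (!nzchar(tab$rulecap[i])) return(nm)      # assumption: empty rule keeps the name
  sub(tab$rulecap[i], tab$ruleout[i], nm, perl = TRUE)
}

old <- c("JANE DOE", "DOE, JANE")
new <- vapply(old, harmonize_one, character(1),
              tab = recodetab_demo, USE.NAMES = FALSE)
data.frame(old, new)
```

Tying each rulecap to its own row means a capture expression never runs against a name that was flagged by a different pattern.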

First Format

The modified versions of my example names (old) are shown in the new column below.

If I wanted to apply a different rule, I would simply need to rearrange the now standardized name parts in ruleout. Here I reformat the names from “FIRST LAST” to “LAST, FIRST”.

Then, I would run the same lines of code as before, this time using the second ruleout column.
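As a rough illustration of swapping in a second output rule, the sketch below keeps the capture rules fixed and chooses between two output columns. The column name `ruleout2`, the helper `reformat_one()`, and the table contents are hypothetical. Unlike the table described earlier, every row here has a non-empty rulecap so that the name parts can be rearranged.

```r
# Hypothetical recoding table with two alternative output rules per pattern
recodetab_demo <- data.frame(
  name     = c("space", "comma"),
  pattern  = c("^[A-Z]+ [A-Z]+$", "^[A-Z]+, [A-Z]+$"),
  rulecap  = c("^([A-Z]+) ([A-Z]+)$", "^([A-Z]+), ([A-Z]+)$"),
  ruleout  = c("\\1 \\2", "\\2 \\1"),    # FIRST LAST
  ruleout2 = c("\\2, \\1", "\\1, \\2"),  # LAST, FIRST
  stringsAsFactors = FALSE
)

# Apply the output rule stored in column `out_col` for the matching row
reformat_one <- function(nm, tab, out_col) {
  hit <- which(vapply(tab$pattern, function(p) grepl(p, nm, perl = TRUE),
                      logical(1)))
  if (length(hit) == 0) return(NA_character_)  # no pattern flags this name
  i <- hit[[1]]
  sub(tab$rulecap[i], tab[[out_col]][i], nm, perl = TRUE)
}

old <- c("JANE DOE", "DOE, JANE")
new_first_last <- vapply(old, reformat_one, character(1),
                         tab = recodetab_demo, out_col = "ruleout",
                         USE.NAMES = FALSE)
new_last_first <- vapply(old, reformat_one, character(1),
                         tab = recodetab_demo, out_col = "ruleout2",
                         USE.NAMES = FALSE)
data.frame(old, new_first_last, new_last_first)
```

Only the output column changes between the two runs, which is what keeps the harmonizing code itself the same.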

Second Format