Introduction

When data validation is not used to enforce a fixed format for collecting person names, the same person's name can be entered in several different ways, depending on the naming convention each person entering data follows. In this post, I show how I use a custom re-coding table for names. In a subsequent post, I show how a re-coding table can be used to correct existing and new names as part of an automated process.

packageloader(c(
"tidyverse"
,"dplyr"
,"randomNames"
,"stringr"
,"stringi"
,"purrr"
,"janitor"
,"kableExtra"
,"flextable"
))

[1] "packages loaded"

To use source data across multiple projects, I centralize it in a single folder and then retrieve it into each project with my custom filemaker() function.

load("recodetab.Rdata")
load("data_input.Rdata")

Review Assumptions and Table Validity

The assumptions in Part I of this post are translated into the rules recorded in the recoding table for identifying patterns in names and rearranging name parts.

I verify that no names exist in the data that aren’t flagged by my table patterns.

[1] 0
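A minimal sketch of one way to run such a check, written in base R so it runs without extra packages. It assumes the recoding table stores each known pattern as a separator skeleton in patterns_exist (as in the tables in this post). The objects recodetab_demo and names_demo and the helper name_skeleton() are hypothetical stand-ins, not the objects loaded above.

```r
# Hypothetical stand-in for the recoding table: only the skeleton column is needed here
recodetab_demo <- data.frame(
  patterns_exist = c("Z Z", "Z-Z", "Z, Z", "Z, Z Z", "Z'Z, Z"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the input names
names_demo <- c(
  "LATAYVIA ALEXANDER",
  "RAMIREZ-ANETA",
  "PERRY, FREDDRICKA",
  "MOORE, BRIDGET N",
  "O'BRIEN, CAROLINE"
)

# Collapse every run of letters to "Z" so that only the separators remain
name_skeleton <- function(x) gsub("[A-Za-z]+", "Z", x)

# A name is unmatched when its skeleton is not listed in the table
skeletons <- name_skeleton(names_demo)
unmatched <- names_demo[!skeletons %in% recodetab_demo$patterns_exist]

# Count of names not covered by any pattern in the table
length(unmatched)
```

If the count is not zero, printing unmatched shows which names need a new row in the recoding table.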

Because the space and space_dash patterns are already equivalent to the desired output pattern, rulecap is empty for them and ruleout returns the entire string. This simple code (below) is all that’s needed to harmonize any set of names, given that the count of unmatched names (above) is zero.

Because some of the rulecap expressions could create capture groups from more than one named pattern, I restrict each expression to capture only from the named pattern it shares a row with in the recoding table.
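The row-wise restriction can be sketched as follows. This is an illustration under stated assumptions, not the code used to produce the tables in this post: the regular expressions in rulecap are invented for the example, an empty rulecap is assumed to mean "capture the whole string", and recode_demo, names_demo, and harmonize() are hypothetical names.

```r
# Hypothetical recoding table: one capture rule and one output rule per named pattern
recode_demo <- data.frame(
  pattnames = c("space", "dash", "comma_space", "comma_space_space"),
  rulecap   = c("",
                "^([A-Z]+)-([A-Z]+)$",
                "^([A-Z]+), ([A-Z]+)$",
                "^([A-Z]+), ([A-Z]+) [A-Z]+$"),
  ruleout   = c("\\1", "\\2 \\1", "\\2 \\1", "\\2 \\1"),
  stringsAsFactors = FALSE
)

# Hypothetical input: each name already carries the named pattern it was flagged with
names_demo <- data.frame(
  old       = c("LATAYVIA ALEXANDER", "RAMIREZ-ANETA",
                "PERRY, FREDDRICKA", "MOORE, BRIDGET N"),
  pattnames = c("space", "dash", "comma_space", "comma_space_space"),
  stringsAsFactors = FALSE
)

harmonize <- function(old, pattname, table) {
  # Look up the single table row that shares the name's pattern,
  # so a rulecap is never tried against names flagged with another pattern
  row <- match(pattname, table$pattnames)
  cap <- table$rulecap[row]
  out <- table$ruleout[row]
  # Assumption: an empty rulecap means the whole string is the only capture group
  cap <- ifelse(nzchar(cap), cap, "^(.*)$")
  # Apply each name's own capture and output rule
  mapply(function(x, p, r) sub(p, r, x, perl = TRUE),
         old, cap, out, USE.NAMES = FALSE)
}

names_demo$new <- harmonize(names_demo$old, names_demo$pattnames, recode_demo)
names_demo[, c("old", "new")]
```

Looking the rule up by pattnames, rather than trying every rulecap against every name, keeps a broad expression from rearranging a name that belongs to a different pattern.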

First Format

The original example names are in the old column below, and their harmonized versions are in the new column.

old	new	patterns_exist	pattnames	ruleout
LATAYVIA ALEXANDER	LATAYVIA ALEXANDER	Z Z	space	\1
RAMIREZ-ANETA	ANETA RAMIREZ	Z-Z	dash	\2 \1
PERRY, FREDDRICKA	FREDDRICKA PERRY	Z, Z	comma_space	\2 \1
MEGAN WALKER	WALKER MEGAN	Z Z	space_space	\2 \1
MERCEDES DIAZ-PEREZ	MERCEDES DIAZ-PEREZ	Z Z-Z	space_dash	\1
MOORE, BRIDGET N	BRIDGET MOORE	Z, Z Z	comma_space_space	\2 \1
DIAZ PEREZ, MERCEDES	MERCEDES DIAZ-PEREZ	Z Z, Z	space_comma_space	\3 \1-\2
CLASSENS-SOTO, FERNANDO	FERNANDO CLASSENS-SOTO	Z-Z, Z	dash_comma_space	\2 \1
HERNANDEZ, JANETSY RUBIE	JANETSY RUBIE HERNANDEZ	Z, Z Z	comma_space_space_middle	\2 \3 \1
O'BRIEN, CAROLINE	CAROLINE O'BRIEN	Z'Z, Z	apos_comma_space	\2 \1
SIMON, KASI-ANN	KASI-ANN SIMON	Z, Z-Z	comma_space_dash	\2 \1
PETTY, TE'MOY	TE'MOY PETTY	Z, Z'Z	comma_space_apos	\2 \1
WELLS JR., RICK	RICK WELLS JR.	Z Z., Z	space_dot_comma_space	\2 \1
BOICE II, JOHN E	JOHN BOICE II	Z Z, Z Z	space_comma_space_space	\2 \1
MARTINEZ-MORALES, ADRIANA P	ADRIANA MARTINEZ-MORALES	Z-Z, Z Z	dash_comma_space_space	\2 \1
MYERS, DEBORAH (DEBRA)	DEBORAH MYERS	Z, Z (Z)	comma_space_space_parens	\2 \1
MANDELLI-LOPEZ, ANA PAULA	ANA PAULA MANDELLI-LOPEZ	Z-Z, Z Z	dash_comma_space_space_middle	\2 \3 \1
D'ALESSANDRO, AMANDA N	AMANDA D'ALESSANDRO	Z'Z, Z Z	apos_comma_space_space	\2 \1
BENNETT JR., STACEY L	STACEY BENNETT JR.	Z Z., Z Z	space_dot_comma_space_space	\2 \1
ALERS DE AZA, JASMINE E	JASMINE ALERS DE AZA	Z Z Z, Z Z	space__space_comma_space_space	\2 \1

If I wanted to apply a different rule, I would simply need to rearrange the now-standardized name parts in ruleout. Here I reformat the names from “FIRST LAST” to “LAST, FIRST”.

Then, I would run the same lines of code as before, this time using the second ruleout column.
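As a rough illustration of switching output formats, the sketch below keeps the capture rules fixed and only swaps the output column. The rulecap expressions, the demo data frame, and apply_rule() are assumptions made for this example; the ruleout and ruleout2 values are taken from the tables in this post.

```r
# Hypothetical table: same capture rules, two alternative output rules
demo <- data.frame(
  old      = c("RAMIREZ-ANETA", "PERRY, FREDDRICKA", "DIAZ PEREZ, MERCEDES"),
  rulecap  = c("^([A-Z]+)-([A-Z]+)$",
               "^([A-Z]+), ([A-Z]+)$",
               "^([A-Z]+) ([A-Z]+), ([A-Z]+)$"),
  ruleout  = c("\\2 \\1", "\\2 \\1", "\\3 \\1-\\2"),   # FIRST LAST
  ruleout2 = c("\\1, \\2", "\\1, \\2", "\\1-\\2, \\3"), # LAST, FIRST
  stringsAsFactors = FALSE
)

# Apply each row's capture pattern and the chosen output rule to its own name
apply_rule <- function(x, pattern, replacement) {
  mapply(function(s, p, r) sub(p, r, s, perl = TRUE),
         x, pattern, replacement, USE.NAMES = FALSE)
}

demo$new  <- apply_rule(demo$old, demo$rulecap, demo$ruleout)
demo$new2 <- apply_rule(demo$old, demo$rulecap, demo$ruleout2)
demo[, c("old", "new", "new2")]
```

Only the replacement column changes between the two calls, which is why a new output format needs no changes to the pattern-matching side of the table.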

Harmonizing Person Names Part II: Using my Custom Re-coding Table

Second Format

old	new	pattnames	ruleout	ruleout2
LATAYVIA ALEXANDER	, LATAYVIA ALEXANDER	space	\1	, \1
RAMIREZ-ANETA	RAMIREZ, ANETA	dash	\2 \1	\1, \2
PERRY, FREDDRICKA	PERRY, FREDDRICKA	comma_space	\2 \1	\1, \2
MEGAN WALKER	MEGAN, WALKER	space_space	\2 \1	\1, \2
MERCEDES DIAZ-PEREZ	, MERCEDES DIAZ-PEREZ	space_dash	\1	, \1
MOORE, BRIDGET N	MOORE, BRIDGET	comma_space_space	\2 \1	\1, \2
DIAZ PEREZ, MERCEDES	DIAZ-PEREZ, MERCEDES	space_comma_space	\3 \1-\2	\1-\2, \3
CLASSENS-SOTO, FERNANDO	CLASSENS-SOTO, FERNANDO	dash_comma_space	\2 \1	\1, \2
HERNANDEZ, JANETSY RUBIE	HERNANDEZ, RUBIE JANETSY	comma_space_space_middle	\2 \3 \1	\1, \3 \2
O'BRIEN, CAROLINE	O'BRIEN, CAROLINE	apos_comma_space	\2 \1	\1, \2
SIMON, KASI-ANN	SIMON, KASI-ANN	comma_space_dash	\2 \1	\1, \2
PETTY, TE'MOY	PETTY, TE'MOY	comma_space_apos	\2 \1	\1, \2
WELLS JR., RICK	WELLS JR., RICK	space_dot_comma_space	\2 \1	\1, \2
BOICE II, JOHN E	BOICE II, JOHN	space_comma_space_space	\2 \1	\1, \2
MARTINEZ-MORALES, ADRIANA P	MARTINEZ-MORALES, ADRIANA	dash_comma_space_space	\2 \1	\1, \2
MYERS, DEBORAH (DEBRA)	MYERS, DEBORAH	comma_space_space_parens	\2 \1	\1, \2
MANDELLI-LOPEZ, ANA PAULA	MANDELLI-LOPEZ, PAULA ANA	dash_comma_space_space_middle	\2 \3 \1	\1, \3 \2
D'ALESSANDRO, AMANDA N	D'ALESSANDRO, AMANDA	apos_comma_space_space	\2 \1	\1, \2
BENNETT JR., STACEY L	BENNETT JR., STACEY	space_dot_comma_space_space	\2 \1	\1, \2
ALERS DE AZA, JASMINE E	ALERS DE AZA, JASMINE	space__space_comma_space_space	\2 \1	\1, \2