I show how I use a re-coding table to harmonize variation in person names that were entered without clear formatting instructions.
When data validation is not used to enforce a format for collecting person names, variations in data entry for the same person can emerge from different naming conventions. In this post, I show how I use a custom re-coding table for names. In a subsequent post, I show how a re-coding table can be used to correct existing and new names as part of an automated process.
packageloader(c(
"tidyverse"
,"dplyr"
,"randomNames"
,"stringr"
,"stringi"
,"purrr"
,"janitor"
,"kableExtra"
,"flextable"
))
[1] "packages loaded"
To use source data across multiple projects, I centralize it in a single folder and then retrieve it into each project with my custom filemaker() function.
The Assumptions in Part I of this post are translated into the rules recorded in the recoding table for identifying patterns in names and rearranging name parts.
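As a rough illustration of what such a table might look like, here is a minimal base-R sketch. The column names follow the tables later in this post, but the helper `name_skeleton()`, the three rows, and the regular expressions are my assumptions for illustration, not the actual recoding table.

```r
# Hypothetical helper: collapse every run of letters to "Z" so that names
# with the same layout share one skeleton, e.g. "PERRY, FREDDRICKA" -> "Z, Z".
name_skeleton <- function(x) gsub("[A-Za-z]+", "Z", x)

# Hypothetical recoding table, one row per named pattern.
# rulecap holds the capture groups; ruleout says how to reassemble them.
recode_tbl <- data.frame(
  patterns_exist = c("Z Z", "Z-Z", "Z, Z"),
  pattnames      = c("space", "dash", "comma_space"),
  rulecap        = c("", "^([A-Z]+)-([A-Z]+)$", "^([A-Z]+), ([A-Z]+)$"),
  ruleout        = c("\\1", "\\2 \\1", "\\2 \\1"),
  stringsAsFactors = FALSE
)

# Look up the named pattern for each example name via its skeleton.
example_names <- c("LATAYVIA ALEXANDER", "RAMIREZ-ANETA", "PERRY, FREDDRICKA")
found <- recode_tbl$pattnames[
  match(name_skeleton(example_names), recode_tbl$patterns_exist)
]
found
```

A skeleton alone cannot separate patterns such as comma_space_space and comma_space_space_middle, which share the layout "Z, Z Z", so a real table would need a finer test (for example, the length of the last name part) to tell them apart.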
I verify that no names exist in the data that aren’t flagged by my table patterns.
[1] 0
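A check of this kind can be sketched as follows. This is only an assumption about how the count could be produced: the example names, the vector `known_skeletons`, and the letter-collapsing step are hypothetical stand-ins for the real data and table.

```r
# Hypothetical input names and the layouts the recoding table is assumed to cover.
names_in <- c("PERRY, FREDDRICKA", "RAMIREZ-ANETA", "O'BRIEN, CAROLINE")
known_skeletons <- c("Z, Z", "Z-Z", "Z'Z, Z")

# Collapse each run of letters to "Z" to get the layout of every name.
skeletons <- gsub("[A-Za-z]+", "Z", names_in)

# Count names whose layout is not covered by any row of the table.
n_unmatched <- sum(!skeletons %in% known_skeletons)
n_unmatched
```

If the count were above zero, `names_in[!skeletons %in% known_skeletons]` would list the names that still need a rule.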
Because the space and space_dash patterns are equivalent to the desired output pattern, rulecap is empty and ruleout returns the entire string. This simple code (below) is all that’s needed to harmonize any set of names, given that the count of unmatched names (above) is zero.
Because some of the rulecap expressions can create capture groups from multiple named patterns, I restrict each expression to capture only from the named pattern it shares a row with in the recoding table.
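One way to picture this step is the base-R sketch below. It is not the code used for the results in this post: `apply_rule()`, the four example rows, and the regular expressions are assumptions. Each name is paired with the rule from its own pattern row through `match()`, so a rulecap expression never touches a name flagged under a different pattern.

```r
# Hypothetical recoding table (regular expressions are illustrative).
recode_tbl <- data.frame(
  pattnames = c("space", "comma_space", "comma_space_space",
                "comma_space_space_middle"),
  rulecap   = c("",
                "^([A-Z]+), ([A-Z]+)$",
                "^([A-Z]+), ([A-Z]+) [A-Z]$",
                "^([A-Z]+), ([A-Z]+) ([A-Z]{2,})$"),
  ruleout   = c("\\1", "\\2 \\1", "\\2 \\1", "\\2 \\3 \\1"),
  stringsAsFactors = FALSE
)

# Hypothetical names, each already flagged with its named pattern.
names_df <- data.frame(
  old       = c("LATAYVIA ALEXANDER", "PERRY, FREDDRICKA",
                "MOORE, BRIDGET N", "HERNANDEZ, JANETSY RUBIE"),
  pattnames = c("space", "comma_space", "comma_space_space",
                "comma_space_space_middle"),
  stringsAsFactors = FALSE
)

# Apply one rule to one name; an empty rulecap keeps the whole string.
apply_rule <- function(name, cap, out) {
  if (cap == "") return(name)
  sub(cap, out, name, perl = TRUE)
}

# Pair every name with the rule on its own pattern row, then apply it.
idx <- match(names_df$pattnames, recode_tbl$pattnames)
names_df$new <- mapply(apply_rule, names_df$old,
                       recode_tbl$rulecap[idx], recode_tbl$ruleout[idx],
                       USE.NAMES = FALSE)
names_df[, c("old", "new")]
```

Treating an empty rulecap as "return the name unchanged" is a choice made for this sketch. It mirrors the behavior described for the space and space_dash patterns without relying on how a regex engine handles an empty pattern.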
The modified versions of my example names (old) are shown in the new column below.
old | new | patterns_exist | pattnames | ruleout |
LATAYVIA ALEXANDER | LATAYVIA ALEXANDER | Z Z | space | \1 |
RAMIREZ-ANETA | ANETA RAMIREZ | Z-Z | dash | \2 \1 |
PERRY, FREDDRICKA | FREDDRICKA PERRY | Z, Z | comma_space | \2 \1 |
MEGAN WALKER | WALKER MEGAN | Z Z | space_space | \2 \1 |
MERCEDES DIAZ-PEREZ | MERCEDES DIAZ-PEREZ | Z Z-Z | space_dash | \1 |
MOORE, BRIDGET N | BRIDGET MOORE | Z, Z Z | comma_space_space | \2 \1 |
DIAZ PEREZ, MERCEDES | MERCEDES DIAZ-PEREZ | Z Z, Z | space_comma_space | \3 \1-\2 |
CLASSENS-SOTO, FERNANDO | FERNANDO CLASSENS-SOTO | Z-Z, Z | dash_comma_space | \2 \1 |
HERNANDEZ, JANETSY RUBIE | JANETSY RUBIE HERNANDEZ | Z, Z Z | comma_space_space_middle | \2 \3 \1 |
O'BRIEN, CAROLINE | CAROLINE O'BRIEN | Z'Z, Z | apos_comma_space | \2 \1 |
SIMON, KASI-ANN | KASI-ANN SIMON | Z, Z-Z | comma_space_dash | \2 \1 |
PETTY, TE'MOY | TE'MOY PETTY | Z, Z'Z | comma_space_apos | \2 \1 |
WELLS JR., RICK | RICK WELLS JR. | Z Z., Z | space_dot_comma_space | \2 \1 |
BOICE II, JOHN E | JOHN BOICE II | Z Z, Z Z | space_comma_space_space | \2 \1 |
MARTINEZ-MORALES, ADRIANA P | ADRIANA MARTINEZ-MORALES | Z-Z, Z Z | dash_comma_space_space | \2 \1 |
MYERS, DEBORAH (DEBRA) | DEBORAH MYERS | Z, Z (Z) | comma_space_space_parens | \2 \1 |
MANDELLI-LOPEZ, ANA PAULA | ANA PAULA MANDELLI-LOPEZ | Z-Z, Z Z | dash_comma_space_space_middle | \2 \3 \1 |
D'ALESSANDRO, AMANDA N | AMANDA D'ALESSANDRO | Z'Z, Z Z | apos_comma_space_space | \2 \1 |
BENNETT JR., STACEY L | STACEY BENNETT JR. | Z Z., Z Z | space_dot_comma_space_space | \2 \1 |
ALERS DE AZA, JASMINE E | JASMINE ALERS DE AZA | Z Z Z, Z Z | space__space_comma_space_space | \2 \1 |
If I wanted to apply a different rule, I would simply need to rearrange the now standardized name parts in ruleout.
Here I reformat the names from “FIRST LAST” to “LAST, FIRST”.
Then, I would run the same lines of code as before, this time using the second ruleout column.
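To make the idea concrete, here is a small base-R sketch with a second output rule sitting next to the first. The regular expressions and the helper `harmonize()` are assumptions for illustration. Only the replacement column changes between the two calls.

```r
# Hypothetical rules: same capture expression, two alternative output rules.
rules <- data.frame(
  old      = c("RAMIREZ-ANETA", "PERRY, FREDDRICKA", "DIAZ PEREZ, MERCEDES"),
  rulecap  = c("^([A-Z]+)-([A-Z]+)$",
               "^([A-Z]+), ([A-Z]+)$",
               "^([A-Z]+) ([A-Z]+), ([A-Z]+)$"),
  ruleout  = c("\\2 \\1", "\\2 \\1", "\\3 \\1-\\2"),   # FIRST LAST
  ruleout2 = c("\\1, \\2", "\\1, \\2", "\\1-\\2, \\3"), # LAST, FIRST
  stringsAsFactors = FALSE
)

# Apply each row's capture expression with the chosen output rule.
harmonize <- function(name, cap, out) {
  mapply(function(n, cp, o) sub(cp, o, n, perl = TRUE),
         name, cap, out, USE.NAMES = FALSE)
}

rules$first_last <- harmonize(rules$old, rules$rulecap, rules$ruleout)
rules$last_first <- harmonize(rules$old, rules$rulecap, rules$ruleout2)
rules[, c("old", "first_last", "last_first")]
```

Rows whose rulecap is empty expose no separate name parts, so a rule like `, \1` can only prepend to the whole string. The space and space_dash rows in the table below show this.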
old | new | pattnames | ruleout | ruleout2 |
LATAYVIA ALEXANDER | , LATAYVIA ALEXANDER | space | \1 | , \1 |
RAMIREZ-ANETA | RAMIREZ, ANETA | dash | \2 \1 | \1, \2 |
PERRY, FREDDRICKA | PERRY, FREDDRICKA | comma_space | \2 \1 | \1, \2 |
MEGAN WALKER | MEGAN, WALKER | space_space | \2 \1 | \1, \2 |
MERCEDES DIAZ-PEREZ | , MERCEDES DIAZ-PEREZ | space_dash | \1 | , \1 |
MOORE, BRIDGET N | MOORE, BRIDGET | comma_space_space | \2 \1 | \1, \2 |
DIAZ PEREZ, MERCEDES | DIAZ-PEREZ, MERCEDES | space_comma_space | \3 \1-\2 | \1-\2, \3 |
CLASSENS-SOTO, FERNANDO | CLASSENS-SOTO, FERNANDO | dash_comma_space | \2 \1 | \1, \2 |
HERNANDEZ, JANETSY RUBIE | HERNANDEZ, RUBIE JANETSY | comma_space_space_middle | \2 \3 \1 | \1, \3 \2 |
O'BRIEN, CAROLINE | O'BRIEN, CAROLINE | apos_comma_space | \2 \1 | \1, \2 |
SIMON, KASI-ANN | SIMON, KASI-ANN | comma_space_dash | \2 \1 | \1, \2 |
PETTY, TE'MOY | PETTY, TE'MOY | comma_space_apos | \2 \1 | \1, \2 |
WELLS JR., RICK | WELLS JR., RICK | space_dot_comma_space | \2 \1 | \1, \2 |
BOICE II, JOHN E | BOICE II, JOHN | space_comma_space_space | \2 \1 | \1, \2 |
MARTINEZ-MORALES, ADRIANA P | MARTINEZ-MORALES, ADRIANA | dash_comma_space_space | \2 \1 | \1, \2 |
MYERS, DEBORAH (DEBRA) | MYERS, DEBORAH | comma_space_space_parens | \2 \1 | \1, \2 |
MANDELLI-LOPEZ, ANA PAULA | MANDELLI-LOPEZ, PAULA ANA | dash_comma_space_space_middle | \2 \3 \1 | \1, \3 \2 |
D'ALESSANDRO, AMANDA N | D'ALESSANDRO, AMANDA | apos_comma_space_space | \2 \1 | \1, \2 |
BENNETT JR., STACEY L | BENNETT JR., STACEY | space_dot_comma_space_space | \2 \1 | \1, \2 |
ALERS DE AZA, JASMINE E | ALERS DE AZA, JASMINE | space__space_comma_space_space | \2 \1 | \1, \2 |