The package lychee
helps to link and join data frames with key variables that are are similiar but not identical (e.g., a variable with geographic names spelled slightly different or nearby geographic coordinates). Different from the fuzzyjoin
package, the package lychee
does not output all matches given some definition of sufficient similarity, but constructs optimal one-to-one matches minimizing the total difference across all matches.
The function linkr()
stacks two data frames and finds an optimal one-to-one pairing of rows in one data frame with rows in the other data frame. The output is a data frame with as many rows as there are in the two datasets and a common identifier for the pairs. The complementary function is joinr()
which, instead of stacking and assigning a common identifier, joins two data frames similar to the base::merge()
or dplyr::full_join()
function.
# Install development version from GitHub remotes::install_github("sumtxt/lychee")
For more details and to learn how to use this package: Getting Started with lychee.
The example below shows how joinr
finds optimal matches in two data frames (elec94
and elec09
) within two groups. The first data frame (elec94
) lists the strongholds of Germany’s green party in the 1994 Federal election (election=BTW
) and the 1994 European election (election=EP
). The second data frame lists such strongholds for the 2009 elections. joinr
merges the two data frames correctly even though the city names are spelled slightly different in the two data frames preventing to merge them via base::merge()
or dplyr::full_join()
.
elec94 #> # A tibble: 6 x 3 #> city election greens #> <chr> <chr> <dbl> #> 1 Tübingen BTW 15.1 #> 2 Heidelberg, Stadt BTW 18.4 #> 3 Freiburg im Breisgau, Stadt BTW 21.9 #> 4 Münster, Stadt EP 20.7 #> 5 Heidelberg, Stadt EP 21.9 #> 6 Freiburg im Breisgau, Stadt EP 29 elec09 #> # A tibble: 6 x 3 #> city election greens #> <chr> <chr> <dbl> #> 1 Darmstadt, Wissenschaftsstadt BTW 20.9 #> 2 Heidelberg BTW 22.4 #> 3 Freiburg (Breisgau) BTW 25.4 #> 4 Heidelberg EP 28.6 #> 5 Lüchow-Dannenberg EP 29.9 #> 6 Freiburg (Breisgau) EP 32.5 joinr(elec94,elec09, strata="election", by="city", suffix=c("94","09"), add_distance=TRUE, caliper=12, method='lcs', full=TRUE) #> # A tibble: 8 x 6 #> election city94 greens94 match_dist city09 greens09 #> <chr> <chr> <dbl> <dbl> <chr> <dbl> #> 1 BTW Tübingen 15.1 NA <NA> NA #> 2 BTW Heidelberg, Stadt 18.4 7 Heidelberg 22.4 #> 3 BTW Freiburg im Breisgau, … 21.9 12 Freiburg (Breisgau) 25.4 #> 4 BTW <NA> NA NA Darmstadt, Wissenschaft… 20.9 #> 5 EP Münster, Stadt 20.7 NA <NA> NA #> 6 EP Heidelberg, Stadt 21.9 7 Heidelberg 28.6 #> 7 EP Freiburg im Breisgau, … 29 12 Freiburg (Breisgau) 32.5 #> 8 EP <NA> NA NA Lüchow-Dannenberg 29.9