Join and link two data frames by columns that are similar under a chosen distance metric, such that the similarity across all matches is maximized and each observation is matched to at most one other observation. The function linkr stacks two data frames and finds an optimal one-to-one pairing of rows in one data frame with rows in the other. The output is a data frame with as many rows as the two data frames combined and a common identifier for each matched pair. The complementary function joinr, instead of stacking and assigning a common identifier, joins the two data frames, similar to the merge function or full_join.
    joinr(
      x, y, by,
      strata = NULL,
      method = "osa",
      assignment = TRUE,
      add_distance = FALSE,
      suffix = c(".x", ".y"),
      full = TRUE,
      na_matches = "na",
      caliper = Inf,
      C = 1,
      verbose = FALSE,
      ...
    )

    linkr(
      x, y, by,
      strata = NULL,
      method = "osa",
      assignment = TRUE,
      add_distance = FALSE,
      na_matches = "na",
      caliper = Inf,
      C = 1,
      verbose = FALSE,
      ...
    )

Argument | Description |
---|---|
x, y | data frames to join |
by | character vector of the key variable(s) to join by. To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x$a to y$b. |
strata | character vector of variables to join exactly if any. Can be a named vector. |
method | the name of the distance metric to measure the similarity between the key variables. |
assignment | should one-to-one matches be constructed? |
add_distance | add a distance column to the final data frame? |
suffix | character vector of length 2 used to disambiguate non-joined duplicate variables in x and y. |
full | retain all unjoined observations from the shorter data frame? |
na_matches | should NA and NaN values match one another in any exact join? |
caliper | caliper value on the same scale as the distance matrix (before it is scaled) |
C | scaling parameter for the distance matrix |
verbose | print summary statistics of the distances? |
... | parameters passed to distance metric function |
Matches are constructed using a fast version of the Hungarian method as implemented in the assignment function. Only the integer part of the distance matrix is used. To increase precision, scale the distances with the parameter C. A warning is printed if the distance matrix contains real values rather than integers ("Warning in assignment(m): Matrix 'cmat' not integer; will take floor of it.").
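To see why scaling matters, the following base R sketch truncates a small, hypothetical distance matrix with and without a scaling factor and then searches for the cheapest one-to-one pairing. The brute-force search is only a stand-in for the Hungarian method, and the matrix, the function name brute_force_assignment, and the value 4 for the scaling factor are assumptions made for illustration.

```r
# Hypothetical 2 x 2 distance matrix (rows: records in x, columns: records in y).
d <- matrix(c(0.25, 0.75,
              0.75, 0.50), nrow = 2, byrow = TRUE)

# Brute-force stand-in for the Hungarian method: try every one-to-one pairing
# and keep the cheapest. Only feasible for very small matrices.
brute_force_assignment <- function(cost) {
  n <- nrow(cost)
  grid <- as.matrix(expand.grid(rep(list(seq_len(n)), n)))
  is_perm <- apply(grid, 1, function(p) anyDuplicated(p) == 0)
  perms <- grid[is_perm, , drop = FALSE]
  totals <- apply(perms, 1, function(p) sum(cost[cbind(seq_len(n), p)]))
  best <- which.min(totals)
  list(match = as.integer(perms[best, ]), total = unname(totals[best]))
}

truncated <- floor(d)      # integer part of the unscaled distances
scaled    <- floor(4 * d)  # integer part after scaling by a hypothetical factor of 4

all(truncated == 0)        # TRUE: every pairing ties at cost 0
res <- brute_force_assignment(scaled)
res$match                  # 1 2: row 1 pairs with column 1, row 2 with column 2
res$total                  # 3
```

Without scaling, all entries truncate to zero and the two possible pairings cannot be told apart. After scaling, the integer costs preserve the ordering of the original distances.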
If strata is not NULL, optimal one-to-one matches are constructed within the strata defined by the variables in strata.
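The idea of matching within strata can be sketched in base R as follows. The data frames, the state column, and the use of adist (a generalized Levenshtein distance from base R) are assumptions chosen for illustration and are not taken from the package internals.

```r
# Hypothetical data: 'state' is the stratum, 'city' the fuzzy key.
x <- data.frame(state = c("HE", "BW", "BW"),
                city  = c("Darmstadt", "Heidelberg", "Freiburg"),
                stringsAsFactors = FALSE)
y <- data.frame(state = c("BW", "BW", "HE"),
                city  = c("Freiburg im Breisgau", "Heidelberg, Stadt",
                          "Darmstadt, Stadt"),
                stringsAsFactors = FALSE)

# One distance matrix per stratum: rows of x are only ever compared with
# rows of y that share the same value of 'state'.
strata_values <- intersect(x$state, y$state)
dist_by_stratum <- lapply(strata_values, function(s) {
  xs <- x$city[x$state == s]
  ys <- y$city[y$state == s]
  m <- adist(xs, ys)  # stand-in string distance from base R
  dimnames(m) <- list(xs, ys)
  m
})
names(dist_by_stratum) <- strata_values

dim(dist_by_stratum[["BW"]])  # 2 2
dim(dist_by_stratum[["HE"]])  # 1 1
```

A one-to-one assignment would then be solved separately on each of these matrices, so a record can never be paired with a record from another stratum.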
The method for computing distance can be any of the string distances implemented in the stringdist package (see stringdist-metrics for a list), a geographic distance from the geosphere package (distGeo, distCosine, distHaversine, distVincentySphere, distVincentyEllipsoid), or a distance metric registered in the proxy package (run summary(proxy::pr_DB) for a list). Users may also supply their own distance metric.
For geographic distances, by must be of length 2 and name the variables that hold the longitude/latitude coordinates (longitude first, latitude second). For string distances, by must be of length 1.
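The longitude-first convention can be illustrated with a small base R sketch that builds a great-circle distance matrix from two coordinate columns. The data frames, the approximate coordinates, the helper haversine, and the Earth radius of 6378137 m are assumptions for illustration. The package itself delegates to the geosphere functions listed above.

```r
# Hypothetical data frames with approximate coordinates in decimal degrees.
x <- data.frame(city = "Heidelberg", lon = 8.67, lat = 49.40)
y <- data.frame(city = c("Freiburg im Breisgau", "Heidelberg, Stadtkreis"),
                lon = c(7.84, 8.67), lat = c(48.00, 49.40))

by <- c("lon", "lat")  # longitude first, latitude second

# Great-circle (haversine) distance in metres between lon/lat points.
haversine <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Distance matrix: one row per record in x, one column per record in y.
d <- outer(seq_len(nrow(x)), seq_len(nrow(y)), function(i, j) {
  haversine(x[[by[1]]][i], x[[by[2]]][i], y[[by[1]]][j], y[[by[2]]][j])
})
dimnames(d) <- list(x$city, y$city)
round(d / 1000)  # distances in kilometres
```

Swapping the two names in by would feed latitudes into the longitude argument and give wrong distances, which is why the order matters.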
    library(dplyr)

    data(greens3)

    # Federal election (BTW) results for 2017 and 2013
    btw17 <- filter(greens3, year == 2017 & election == "BTW") %>%
      select(-year, -election, -city_clean)
    btw13 <- filter(greens3, year == 2013 & election == "BTW") %>%
      select(-year, -election, -city_clean)

    # Join both years on approximately matching city names
    joinr(btw13, btw17, by = c("city"), suffix = c("94", "17"),
          method = "lcs", caliper = 12, add_distance = TRUE)
    #> # A tibble: 4 x 5
    #>   city94                  greens94 match_dist city17                    greens17
    #>   <chr>                      <dbl>      <dbl> <chr>                        <dbl>
    #> 1 Darmstadt, Wissenschaf…     17.8         NA NA                            NA
    #> 2 Heidelberg                  18.9         12 Heidelberg, Stadtkreis        21.9
    #> 3 Freiburg im Breisgau        22.1         12 Freiburg im Breisgau, St…     23.3
    #> 4 NA                          NA           NA Tübingen                      19.5

    # Stacked output with a common match_id for each matched pair
    #> # A tibble: 6 x 4
    #>   city                             greens match_id match_dist
    #>   <chr>                             <dbl>    <int>      <dbl>
    #> 1 Darmstadt, Wissenschaftsstadt      17.8        4         NA
    #> 2 Heidelberg                         18.9        2         12
    #> 3 Freiburg im Breisgau               22.1        3         12
    #> 4 Tübingen                           19.5        5         NA
    #> 5 Heidelberg, Stadtkreis             21.9        2         12
    #> 6 Freiburg im Breisgau, Stadtkreis   23.3        3         12