Optimal Joining And Linking Of Two Data Frames

Join or link two data frames on columns whose values are similar rather than identical, such that the total similarity across all matches is maximized and each observation is matched to at most one other observation. The function linkr stacks two data frames and finds an optimal one-to-one pairing of rows in one data frame with rows in the other. Its output is a data frame with as many rows as the two data frames combined and a common identifier for each matched pair. The complementary function joinr, instead of stacking and assigning a common identifier, joins the two data frames side by side, similar to the merge function or full_join.

joinr(
  x,
  y,
  by,
  strata = NULL,
  method = "osa",
  assignment = TRUE,
  add_distance = FALSE,
  suffix = c(".x", ".y"),
  full = TRUE,
  na_matches = "na",
  caliper = Inf,
  C = 1,
  verbose = FALSE,
  ...
)

linkr(
  x,
  y,
  by,
  strata = NULL,
  method = "osa",
  assignment = TRUE,
  add_distance = FALSE,
  na_matches = "na",
  caliper = Inf,
  C = 1,
  verbose = FALSE,
  ...
)

Arguments

x, y	data frames to join
by	character vector of the key variable(s) to join by. To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x$a to y$b.
strata	character vector of variables to join on exactly, if any. Can be a named vector.
method	the name of the distance metric to measure the similarity between the key variables.
assignment	should one-to-one matches be constructed?
add_distance	add a distance column to the final data frame?
suffix	character vector of length 2 used to disambiguate non-joined duplicate variables in x and y.
full	retain all unjoined observations from the shorter data frame?
na_matches	should NA and NaN values match one another for any exact join defined by `strata`?
caliper	caliper value on the same scale as the distance matrix (before it is multiplied by `C`).
C	scaling parameter for the distance matrix
verbose	print distance summary statistic
...	parameters passed to distance metric function

Details

Matches are constructed using a fast version of the Hungarian method as implemented in the assignment function. Only the integer part of each entry in the distance matrix is used, so fractional differences between distances are lost. To retain more precision, set the scaling parameter C to a value greater than 1. A warning is printed if the distance matrix contains non-integer values ("Warning in assignment(m): Matrix 'cmat' not integer; will take floor of it.").
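The effect of taking the integer part, and of scaling by C first, can be sketched in base R. This is only an illustration under stated assumptions: the 2 x 2 distance matrix and the helper assignment_costs are hypothetical, and the sketch compares the two possible assignments directly rather than calling the package's solver.

```r
# Hypothetical 2 x 2 distance matrix with fractional values
# (rows: observations in x, columns: observations in y).
m <- matrix(c(0.25, 0.75,
              0.75, 0.50), nrow = 2, byrow = TRUE)

# Hypothetical helper: total cost of the two possible one-to-one
# assignments of a 2 x 2 matrix.
assignment_costs <- function(d) {
  c(diagonal = d[1, 1] + d[2, 2],  # x1-y1 and x2-y2
    swapped  = d[1, 2] + d[2, 1])  # x1-y2 and x2-y1
}

# Without scaling, the integer part of every cell is 0, so both
# assignments look equally good.
costs_unscaled <- assignment_costs(floor(m))

# Scaling by C = 100 before taking the integer part keeps two decimals.
C <- 100
costs_scaled <- assignment_costs(floor(m * C))

print(costs_unscaled)
print(costs_scaled)
```

With the unscaled matrix both assignments cost 0; after scaling, the diagonal assignment is clearly cheaper. Distances that are already integers, such as edit distances, lose nothing at C = 1.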

If strata is not NULL, optimal one-to-one matches are constructed within the strata defined by the variables in strata.
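What matching within strata means can be sketched in base R. The data frames below are hypothetical, and utils::adist (a generalized Levenshtein distance) stands in for the package's string metrics; this is not the package's own implementation, only an illustration of which pairs are ever compared.

```r
# Hypothetical data: city names plus a state column that must match exactly.
x <- data.frame(city  = c("Springfield", "Salem", "Salem"),
                state = c("IL", "IL", "OR"),
                stringsAsFactors = FALSE)
y <- data.frame(city  = c("Salem city", "Springfield city", "Salem town"),
                state = c("IL", "IL", "OR"),
                stringsAsFactors = FALSE)

# Strata present in both data frames; rows in other strata stay unmatched.
strata_levels <- intersect(unique(x$state), unique(y$state))

# One distance matrix per stratum: only rows that share a state are compared.
dist_by_stratum <- lapply(strata_levels, function(s) {
  xs <- x$city[x$state == s]
  ys <- y$city[y$state == s]
  d <- utils::adist(xs, ys)      # edit distance between all pairs
  dimnames(d) <- list(xs, ys)
  d
})
names(dist_by_stratum) <- strata_levels

print(dist_by_stratum)
```

The Oregon "Salem" is never compared with the Illinois "Salem city", because the one-to-one assignment is solved separately for each matrix. Going by the usage shown above, the corresponding package call would presumably be joinr(x, y, by = "city", strata = "state").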

The method for computing distance can be any of the string distances implemented as part of the stringdist package (see stringdist-metrics for a list), a geographic distance from the geosphere package (distGeo, distCosine, distHaversine, distVincentySphere, distVincentyEllipsoid), or a distance metric from the proxy package's registry (run summary(proxy::pr_DB) for a list). Users may also supply their own distance metric.

For geographic distances, by must be of length 2 with the names of the variables that include the longitude/latitude coordinates (first one is longitude, second is latitude). For string distances, by must be of length 1.
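The longitude-first convention for by can be illustrated with a base-R haversine distance matrix. This is a stand-in sketch, not the geosphere code the package relies on: the function haversine_matrix, the coordinates, and the Earth radius of 6378137 metres are all assumptions made for the example.

```r
# Hypothetical points; column names and coordinates are illustrative only.
x <- data.frame(id = c("a", "b"), lon = c(0, 10), lat = c(0, 50))
y <- data.frame(id = c("c", "d"), lon = c(0, 10), lat = c(1, 50))

# Hypothetical helper: great-circle (haversine) distance in metres between
# every row of x and every row of y. by[1] is longitude, by[2] is latitude.
haversine_matrix <- function(x, y, by, r = 6378137) {
  to_rad <- pi / 180
  lon_x <- x[[by[1]]] * to_rad
  lat_x <- x[[by[2]]] * to_rad
  lon_y <- y[[by[1]]] * to_rad
  lat_y <- y[[by[2]]] * to_rad
  outer(seq_len(nrow(x)), seq_len(nrow(y)), function(i, j) {
    dlat <- lat_y[j] - lat_x[i]
    dlon <- lon_y[j] - lon_x[i]
    a <- sin(dlat / 2)^2 + cos(lat_x[i]) * cos(lat_y[j]) * sin(dlon / 2)^2
    2 * r * asin(pmin(1, sqrt(a)))
  })
}

d <- haversine_matrix(x, y, by = c("lon", "lat"))
print(round(d))
```

Rows of the result correspond to x and columns to y. Because the entries here are in metres, a caliper on the same scale as this matrix would also be expressed in metres; whether that holds for a particular geosphere function depends on that function's defaults.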

Examples


library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
data(greens3)

btw17 <- filter(greens3,
   year==2017 &
   election=="BTW") %>%
 select(-year, -election,
    -city_clean)

btw13 <- filter(greens3,
   year==2013 &
   election=="BTW") %>%
 select(-year, -election,
    -city_clean)

joinr(btw13,btw17,by=c("city"),
   suffix=c("94","17"),
   method='lcs',
   caliper=12,
   add_distance=TRUE)
#> Loading required package: stringdist
#> # A tibble: 4 x 5
#>   city94                  greens94 match_dist city17                    greens17
#>   <chr>                      <dbl>      <dbl> <chr>                        <dbl>
#> 1 Darmstadt, Wissenschaf…     17.8         NA NA                            NA  
#> 2 Heidelberg                  18.9         12 Heidelberg, Stadtkreis        21.9
#> 3 Freiburg im Breisgau        22.1         12 Freiburg im Breisgau, St…     23.3
#> 4 NA                          NA           NA Tübingen                      19.5

linkr(btw13,btw17,by=c("city"),
   method='lcs',
   caliper=12,
   add_distance=TRUE)
#> # A tibble: 6 x 4
#>   city                             greens match_id match_dist
#>   <chr>                             <dbl>    <int>      <dbl>
#> 1 Darmstadt, Wissenschaftsstadt      17.8        4         NA
#> 2 Heidelberg                         18.9        2         12
#> 3 Freiburg im Breisgau               22.1        3         12
#> 4 Tübingen                           19.5        5         NA
#> 5 Heidelberg, Stadtkreis             21.9        2         12
#> 6 Freiburg im Breisgau, Stadtkreis   23.3        3         12
