Locally Optimal Linking Of Many Data Frames

Links a series of data frames sequentially. At each iteration, the function selects one element from each already matched tuple (found by linking data frames 1...d) and links these elements to the next data frame d+1, until no more data frames are available. All elements of a tuple are assigned the same identifier in the stacked data frame. Each tuple includes at most one element from every data frame d. Because each step is optimized only locally, the result is an approximation to the globally optimal solution.

linkr_multi(
  df,
  by,
  slice,
  strata = NULL,
  method = "osa",
  assignment = TRUE,
  na_matches = "na",
  pool = "last",
  caliper = Inf,
  C = 1,
  verbose = FALSE,
  ...
)

Arguments

df	data frame to link.
by	character vector of the key variable(s) to join by. To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x$a to y$b.
slice	name of the variable used to split `df` into a list of data frames (for example, slice = "year").
strata	character vector of variables to join exactly if any. Can be a named vector as for `by`.
method	the name of the distance metric to measure the similarity between the key columns.
assignment	should one-to-one assignments be constructed?
na_matches	should NA and NaN values match one another for any exact join defined by `strata`?
pool	one of four string values: "previous", "average", "last" or "random" (see details).
caliper	caliper value on the same scale as the distance matrix (before it is multiplied by `C`).
C	scaling parameter for the distance matrix.
verbose	print distance summary statistic.
...	parameters passed to distance metric function.

Details

Splits df by slice into a list of data frames (indexed 1,...,d,...,D) and applies linkr to every element of this list. Each data frame d is linked to a pool of candidates. The candidate pool is defined by one observation from each matched tuple (which might only have a single element, i.e. a singleton) found in the data frames indexed 1...(d-1). By default, the last observation for each matched tuple is used (pool='last'). Other options to construct the candidate pool include:

pool='random': pool includes a randomly drawn element from each matched tuple.
pool='previous': pool includes all observations from the data frame indexed d-1.
pool='average': pool includes a new observation with the average value per key variable for every matched tuple. This option will only work when the variable(s) defined by the parameter by are numeric.
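
The four pool options can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper `build_pool()`, the toy data frame `matched` and its column `value` are hypothetical names introduced only for illustration, and the sketch assumes a single numeric key and that `year` plays the role of `slice`.

```r
# Hypothetical toy data: tuples already matched across three slices (years).
matched <- data.frame(
  match_id = c(1, 1, 1, 2, 2, 3),
  year     = c(1994, 1998, 2002, 1994, 2002, 1998),
  value    = c(10, 12, 14, 20, 24, 30)
)

# Illustrative helper (an assumption, not part of the package):
# builds the candidate pool that the next data frame would be linked to.
build_pool <- function(matched,
                       pool = c("last", "random", "average", "previous")) {
  pool <- match.arg(pool)
  if (pool == "previous") {
    # all observations from the most recent data frame only
    return(matched[matched$year == max(matched$year), , drop = FALSE])
  }
  groups <- split(matched, matched$match_id)
  picked <- lapply(groups, function(g) {
    g <- g[order(g$year), , drop = FALSE]
    switch(pool,
      # most recent observation of the tuple
      last = g[nrow(g), , drop = FALSE],
      # one randomly drawn observation of the tuple
      random = g[sample.int(nrow(g), 1), , drop = FALSE],
      # synthetic observation holding the mean of the numeric key
      average = data.frame(
        match_id = g$match_id[1],
        year     = NA_real_,
        value    = mean(g$value)
      )
    )
  })
  do.call(rbind, picked)
}

pool_last     <- build_pool(matched, "last")
pool_average  <- build_pool(matched, "average")
pool_previous <- build_pool(matched, "previous")
```

In this toy example, "last", "random" and "average" each yield one candidate per tuple, while "previous" drops tuple 3 because it has no observation in the most recent year.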

For more details see the help file of linkr.
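
The sequential procedure described in this section can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's code: `link_sequentially()` is a hypothetical function, the key is a single numeric column compared by absolute difference, the pool is always built as with pool='last', and the one-to-one assignment is a greedy closest-pair rule rather than whatever assignment routine linkr uses.

```r
# Toy sequential linking on one numeric key (illustrative assumption only).
link_sequentially <- function(df, key, slice, caliper = Inf) {
  parts <- split(df, df[[slice]])        # one data frame per slice value
  stacked <- parts[[1]]
  stacked$match_id <- seq_len(nrow(stacked))
  for (d in seq_along(parts)[-1]) {
    cur <- parts[[d]]
    cur$match_id <- NA_integer_
    # candidate pool: most recent observation of every tuple found so far
    ord  <- stacked[order(stacked[[slice]]), , drop = FALSE]
    pool <- ord[!duplicated(ord$match_id, fromLast = TRUE), , drop = FALSE]
    # distance matrix: rows = current data frame, columns = pool
    dist <- abs(outer(cur[[key]], pool[[key]], "-"))
    dist[dist > caliper] <- Inf          # pairs beyond the caliper are forbidden
    # greedy one-to-one assignment: closest remaining pair first
    while (any(is.finite(dist))) {
      idx <- which(dist == min(dist), arr.ind = TRUE)[1, ]
      cur$match_id[idx[1]] <- pool$match_id[idx[2]]
      dist[idx[1], ] <- Inf
      dist[, idx[2]] <- Inf
    }
    # unmatched rows start new singleton tuples
    new <- is.na(cur$match_id)
    cur$match_id[new] <- max(stacked$match_id) + seq_len(sum(new))
    stacked <- rbind(stacked, cur)
  }
  stacked
}

# Hypothetical toy input: three slices, one numeric key `x`.
toy <- data.frame(year = c(1, 1, 2, 2, 3), x = c(1.0, 5.0, 5.2, 1.1, 9.0))
res <- link_sequentially(toy, key = "x", slice = "year", caliper = 1)
```

Because every tuple contributes exactly one candidate to the pool and each candidate is assigned at most once, a tuple can never receive two elements from the same data frame, which mirrors the property stated in the description.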

Examples


library(dplyr)
data(greens3)

linkr_multi(
  df = filter(greens3, election == "BTW"),
  by = "city",
  slice = "year",
  method = "lcs",
  caliper = 15
) %>%
  arrange(match_id, year) %>%
  data.frame()
#>                                city city_clean year election greens match_id
#> 1                          Tübingen   Tübingen 1994      BTW   15.1        1
#> 2                         Tuebingen   Tübingen 1998      BTW   17.0        1
#> 3                          Tübingen   Tübingen 2005      BTW   18.3        1
#> 4                          Tübingen   Tübingen 2017      BTW   19.5        1
#> 5                 Heidelberg, Stadt Heidelberg 1994      BTW   18.4        2
#> 6                 Heidelberg, Stadt Heidelberg 1998      BTW   18.2        2
#> 7                 Heidelberg, Stadt Heidelberg 2002      BTW   22.9        2
#> 8                       Heidelberg  Heidelberg 2005      BTW   19.9        2
#> 9                        Heidelberg Heidelberg 2009      BTW   22.4        2
#> 10                       Heidelberg Heidelberg 2013      BTW   18.9        2
#> 11           Heidelberg, Stadtkreis Heidelberg 2017      BTW   21.9        2
#> 12      Freiburg im Breisgau, Stadt   Freiburg 1994      BTW   21.9        3
#> 13      Freiburg im Breisgau, Stadt   Freiburg 1998      BTW   24.1        3
#> 14      Freiburg im Breisgau, Stadt   Freiburg 2002      BTW   28.7        3
#> 15             Freiburg im Breisgau   Freiburg 2005      BTW   26.2        3
#> 16              Freiburg (Breisgau)   Freiburg 2009      BTW   25.4        3
#> 17             Freiburg im Breisgau   Freiburg 2013      BTW   22.1        3
#> 18 Freiburg im Breisgau, Stadtkreis   Freiburg 2017      BTW   23.3        3
#> 19                 Darmstadt, Stadt  Darmstadt 2002      BTW   20.3        4
#> 20    Darmstadt, Wissenschaftsstadt  Darmstadt 2009      BTW   20.9        4
#> 21    Darmstadt, Wissenschaftsstadt  Darmstadt 2013      BTW   17.8        4
