Links a series of data frames sequentially. At each iteration, the function selects one element from each already matched tuple (found by linking data frames 1, ..., d) and links it to the next data frame, d+1, until no data frames remain. All elements of a tuple are assigned the same identifier in the stacked data frame, and each tuple includes at most one element from each data frame d. Because the data frames are linked one at a time, the solution is a local approximation to the globally optimal solution.

linkr_multi(
  df,
  by,
  slice,
  strata = NULL,
  method = "osa",
  assignment = TRUE,
  na_matches = "na",
  pool = "last",
  caliper = Inf,
  C = 1,
  verbose = FALSE,
  ...
)

Arguments

df

data frame to link.

by

character vector of the key variable(s) to join by. To join by different variables on x and y, use a named vector. For example, by = c("a" = "b") will match x$a to y$b.

slice

used to split df into a list of data frames.
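
As a rough illustration of what splitting by slice means, the following base R sketch splits a toy data frame by one column. The data and column names are made up for illustration, and whether linkr_multi orders the resulting data frames in exactly this way is an assumption.

```r
# Toy data; the values and column names are illustrative only.
df <- data.frame(
  city = c("Heidelberg", "Freiburg", "Heidelberg, Stadt", "Freiburg im Breisgau"),
  year = c(1994, 1994, 1998, 1998),
  stringsAsFactors = FALSE
)

# One data frame per distinct value of the slice variable,
# ordered by the sorted values of that variable.
slices <- split(df, df[["year"]])

names(slices)        # labels of the resulting data frames
sapply(slices, nrow) # number of rows in each data frame
```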

strata

character vector of variables to join on exactly, if any. Can be a named vector, as for by.

method

the name of the distance metric to measure the similarity between the key columns.

assignment

should one-to-one assignments be constructed?

na_matches

should NA and NaN values match one another for any exact join defined by strata?

pool

one of four string values: "previous", "average", "last" or "random" (see details).

caliper

caliper value on the same scale as the distance matrix (before it is multiplied by C).

C

scaling parameter for the distance matrix.

verbose

should summary statistics of the distances be printed?

...

parameters passed to distance metric function.

Details

Splits df by slice into a list of data frames (indexed 1,...,d,...,D) and applies linkr to every element of this list. Each data frame d is linked to a pool of candidates. The candidate pool is defined by one observation from each matched tuple (which might only have a single element, i.e. a singleton) found in the data frames indexed 1...(d-1). By default, the last observation for each matched tuple is used (pool='last'). Other options to construct the candidate pool include:

  • pool='random': pool includes a randomly drawn element from each matched tuple.

  • pool='previous': pool includes all observations from the data frame indexed d-1.

  • pool='average': pool includes a new observation with the average value per key variable for every matched tuple. This option will only work when the variable(s) defined by the parameter by are numeric.
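
The four pool options can be sketched in base R as follows. This is a minimal illustration based on the descriptions above, not the package's implementation: build_pool is a hypothetical helper, and the toy data frame linked (with a numeric key x and a slice variable year) stands in for the rows already linked in data frames 1...(d-1).

```r
# Hypothetical state after linking data frames 1..(d-1):
# every row already carries a match_id.
linked <- data.frame(
  match_id = c(1, 1, 1, 2, 2, 3),
  year     = c(1994, 1998, 2002, 1994, 2002, 1998),
  x        = c(10, 12, 14, 50, 54, 90)
)

# Hypothetical helper (not part of the package): build the candidate pool.
build_pool <- function(linked, pool = c("last", "random", "previous", "average")) {
  pool <- match.arg(pool)
  linked <- linked[order(linked$match_id, linked$year), ]
  groups <- split(linked, linked$match_id)
  out <- switch(
    pool,
    # last: the most recent observation of each tuple
    last = lapply(groups, function(g) g[nrow(g), ]),
    # random: one randomly drawn observation per tuple
    random = lapply(groups, function(g) g[sample.int(nrow(g), 1), ]),
    # average: a new observation holding the mean of the numeric key
    average = lapply(groups, function(g) {
      data.frame(match_id = g$match_id[1], year = NA_real_, x = mean(g$x))
    }),
    # previous: all rows of the immediately preceding data frame
    previous = list(linked[linked$year == max(linked$year), ])
  )
  pool_df <- do.call(rbind, out)
  rownames(pool_df) <- NULL
  pool_df
}

build_pool(linked, "last")
build_pool(linked, "previous")
build_pool(linked, "average")
```

In this toy example, pool='previous' drops tuple 3 from the pool because it has no observation in the most recent data frame, whereas the other options keep one candidate per tuple.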

For more details see the help file of linkr.
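
To make the sequential procedure concrete, here is a simplified base R sketch for a single numeric key. It is not the package's algorithm: link_sequential is a hypothetical helper, the distance is a plain absolute difference, the pool is built as in pool='last', and a greedy closest-pair-first rule stands in for whatever one-to-one assignment linkr performs.

```r
# Hypothetical helper (not part of the package): link slices one at a time.
link_sequential <- function(df, key, slice, caliper = Inf) {
  slices <- split(df, df[[slice]])
  linked <- slices[[1]]
  linked$match_id <- seq_len(nrow(linked))  # first slice: all singletons
  for (d in seq_along(slices)[-1]) {
    cur <- slices[[d]]
    cur$match_id <- NA_integer_
    # Candidate pool: most recent row of each tuple found so far
    ord <- linked[order(linked$match_id, linked[[slice]]), ]
    pool <- ord[!duplicated(ord$match_id, fromLast = TRUE), ]
    # Distances between pool rows (rows) and current rows (columns)
    dist <- abs(outer(pool[[key]], cur[[key]], "-"))
    dist[dist > caliper] <- Inf  # pairs beyond the caliper cannot match
    # Greedy one-to-one assignment, closest pair first
    while (any(is.finite(dist))) {
      idx <- which(dist == min(dist), arr.ind = TRUE)[1, ]
      cur$match_id[idx[["col"]]] <- pool$match_id[idx[["row"]]]
      dist[idx[["row"]], ] <- Inf
      dist[, idx[["col"]]] <- Inf
    }
    # Unmatched rows start new tuples
    new <- is.na(cur$match_id)
    cur$match_id[new] <- max(linked$match_id) + seq_len(sum(new))
    linked <- rbind(linked, cur)
  }
  linked
}

toy <- data.frame(year = c(1, 1, 2, 2, 3, 3, 3),
                  x    = c(10, 50, 11, 52, 12, 90, 53))
res <- link_sequential(toy, key = "x", slice = "year", caliper = 5)
res
```

Each data frame is matched only against the current pool, so an early assignment is never revisited. This is why the result is a local approximation rather than a globally optimal linkage.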

See also

Examples

library(dplyr)
data(greens3)
linkr_multi(
  df = filter(greens3, election == "BTW"),
  by = 'city',
  slice = 'year',
  method = 'lcs',
  caliper = 15
) %>%
  arrange(match_id, year) %>%
  data.frame
#>                                city city_clean year election greens match_id
#> 1                          Tübingen   Tübingen 1994      BTW   15.1        1
#> 2                         Tuebingen   Tübingen 1998      BTW   17.0        1
#> 3                          Tübingen   Tübingen 2005      BTW   18.3        1
#> 4                          Tübingen   Tübingen 2017      BTW   19.5        1
#> 5                 Heidelberg, Stadt Heidelberg 1994      BTW   18.4        2
#> 6                 Heidelberg, Stadt Heidelberg 1998      BTW   18.2        2
#> 7                 Heidelberg, Stadt Heidelberg 2002      BTW   22.9        2
#> 8                        Heidelberg Heidelberg 2005      BTW   19.9        2
#> 9                        Heidelberg Heidelberg 2009      BTW   22.4        2
#> 10                       Heidelberg Heidelberg 2013      BTW   18.9        2
#> 11           Heidelberg, Stadtkreis Heidelberg 2017      BTW   21.9        2
#> 12      Freiburg im Breisgau, Stadt   Freiburg 1994      BTW   21.9        3
#> 13      Freiburg im Breisgau, Stadt   Freiburg 1998      BTW   24.1        3
#> 14      Freiburg im Breisgau, Stadt   Freiburg 2002      BTW   28.7        3
#> 15             Freiburg im Breisgau   Freiburg 2005      BTW   26.2        3
#> 16              Freiburg (Breisgau)   Freiburg 2009      BTW   25.4        3
#> 17             Freiburg im Breisgau   Freiburg 2013      BTW   22.1        3
#> 18 Freiburg im Breisgau, Stadtkreis   Freiburg 2017      BTW   23.3        3
#> 19                 Darmstadt, Stadt  Darmstadt 2002      BTW   20.3        4
#> 20    Darmstadt, Wissenschaftsstadt  Darmstadt 2009      BTW   20.9        4
#> 21    Darmstadt, Wissenschaftsstadt  Darmstadt 2013      BTW   17.8        4