Joining datasets

Lecture 7

2025-03-10

Warm up

Goals

It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.

  • Mutating joins, which add new variables to one data frame from matching observations in another.
  • Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another (conditional match.

For more information see this link https://r4ds.hadley.nz/joins.html

Setup

Keys

Primary key

  • Every join involves a pair of keys: a primary key (or keys) and a foreign key (or keys).

  • A primary key is a variable or set of variables that uniquely identifies each observation.

  • When more than one variable is needed, the key is called a compound key.

Foreign key

A foreign key is a variable (or set of variables) that corresponds to a primary key in another table.

Visualizing key relationships

Often it helps to visualize the keys and their relationships.

Best practices

  • Have the primary and foreign keys have the same names accross the dataframes.

  • Have the other columns different names (to avoid confusion after joining).

  • Verify that keys do indeed uniquely identify each observation.

  • Sometimes you will have to create a key by combining columns generating a surrogate key.

Basic joins

Lets use some real data

library(nycflights13)

Airlines dataset

airlines
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         

Airports dataset

airports
# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,448 more rows

Planes dataset

planes
# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows

Weather dataset

weather
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>

Key overview

  • flights$tailnum is a foreign key that corresponds to the primary key planes$tailnum.

  • flights$carrier is a foreign key that corresponds to the primary key airlines$carrier.

  • flights$origin-flights$time_hour is a compound foreign key that corresponds to the compound primary key weather$origin-weather$time_hour.

Mutating joins

  • mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other.

  • Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones. 

Overview

The joins shown here are the so-called equi joins, where rows match if the keys are equal (inner and outer joins)

Inner join (A)

  • Rows are retained if and only if the keys are equal.

Inner join (B)

There are three possible outcomes for a row in x:

  • If it doesn’t match anything, it’s dropped.

  • If it matches 1 row in y, it’s preserved.

  • If it matches more than 1 row in y, it’s duplicated once for each match.

Outer joins

  • Left join

  • Right join

  • Full join

Left join

  • left join keeps all observations in x

  • Every row of x is preserved in the output because it can fall back to matching a row of NAs in y.

Right join

  • right join keeps all observations in y

  • Every row of y is preserved in the output because it can fall back to matching a row of NAs in x.

  • The output still matches x as much as possible; any extra rows from y are added to the end.

Full join

  • full join keeps all observations that appear in x or y

  • Every row of x and y is included in the output because both x and y have a fall back row of NAs.

  • Again, the output starts with all rows from x, followed by the remaining unmatched y rows.

Filtering joins

Filtering joins never duplicate rows like mutating joins do.

Semi join

  • The semi-join keeps rows in x that have one or more matches in y

Anti join

  • The anti-join keeps rows in x that match zero rows in y,

Non-equi joins (A)

So far you’ve only seen equi joins, joins where the rows match if the x key equals the y key. 

Introduction to non-equi

Basic idea of non-equi joins

Types of non-equi joins

  • Cross joins match every pair of rows.

  • Inequality joins use <, <=, >, and >= instead of ==.

  • Rolling joins are similar to inequality joins but only find the closest match.

  • Overlap joins are a special type of inequality join designed to work with ranges.

Cross join

  • A cross join matches everything, generating the Cartesian product of rows.

  • This means the output will have nrow(x) * nrow(y) rows.

  • Sometimes called a self-join.

Inequality joins

  • Inequality joins use <<=>=, or > to restrict the set of possible matches

Rolling joins

  • Rolling joins are a special type of inequality join where instead of getting every row that satisfies the inequality, you get just the closest row.

Overlap joins

  • between(x, y_lower, y_upper) is short for x >= y_lower, x <= y_upper.

  • within(x_lower, x_upper, y_lower, y_upper) is short for x_lower >= y_lower, x_upper <= y_upper.

  • overlaps(x_lower, x_upper, y_lower, y_upper) is short for x_lower <= y_upper, x_upper >= y_lower.