How to merge your data!
Joining Data
We often need to take several datasets and merge them into a single dataset that we can use for analysis. This “join” can at times be tricky, and can be made more so if we aren’t exactly sure what data matches up between the two datasets, or if the same data is named differently in each dataframe. Sometimes joining data can resemble more of a gridlock than a clean and seamless merge.
The trick that makes a join possible is to have one or more shared variables across each dataset. This may be an ID column, or a specific data class. Importantly, these data can actually have different column names, but typically each needs to have the same class of data (i.e., character
and character
) and some shared observations across each dataset in order to join the datasets.
While this may seem daunting at first, it’s not! R has some excellent tools built into the {dplyr}
package that make joining datasets fairly straightforward. First let’s import some data we can use to join. Here we will build off of the datasets we used in the previous {dplyr} module.
First let’s load the libraries and data we’ll be using. We’ll be continuing to use some groundwater data as shown in previous modules, including a dataframe of stations
, measurements
through time, and information on perforations
. We’ll also join these data to the CalEnviroscreen dataset which “identifies California communities by census tract that are disproportionately burdened by, and vulnerable to, multiple sources of pollution”. One of the most notable uses of the CalEnviroscreen dataset has been to inform CalEPA’s identification of disadvantaged communities. We’ll be using CalEnviroscreen v3.0 in the code that follows.
library(tidyverse)
# GW stations data
gw_stations <- read_csv("data/gwl/stations.csv")
# GW measurements data for Sac, El Dorado, Placer Counties
gw_measurements <- read_csv("data/gwl/measurements_sep.csv")
# GW depths
gw_perf <- read_csv("data/gwl/perforations.csv")
# CalEnviroscreen Data
calenviro <- read_csv("data/calenviroscreen/ces3results_data.csv.zip")
Challenge 1: Find the Commonalities
gw_measurments
, gw_perf
, and gw_stations
) and find potential common columns/variables we can use to join. How many are there?There are many ways to view our data. We want to not only know the column names, but also what kind of data exists in each. What common columns can we use?
SITE_CODE
and STN_ID
is in each of the gw_measurments
, gw_perf
, and gw_stations
datasets.A great package to use for cleaning and checking things like this is the {janitor
} package. We can use the compare_df_cols
function to find out what data types exist across the dataframes we want to potentially join.
library(janitor)
# tells us the column class in each dataframe, NA's for non-existent cols
compare_df_cols(gw_stations, gw_perf, gw_measurements)
column_name gw_stations gw_perf gw_measurements
1 BASIN_CODE character <NA> <NA>
2 BASIN_NAME character <NA> <NA>
3 BOT_PRF <NA> numeric <NA>
4 COOP_AGENCY_ORG_ID <NA> <NA> numeric
5 COOP_ORG_NAME <NA> <NA> character
6 COUNTY_NAME character <NA> <NA>
7 GSE_WSE <NA> <NA> numeric
8 LATITUDE numeric <NA> <NA>
9 LONGITUDE numeric <NA> <NA>
10 MSMT_CMT <NA> <NA> character
11 MSMT_DATE <NA> <NA> POSIXct, POSIXt
12 RDNG_RP <NA> <NA> numeric
13 RDNG_WS <NA> <NA> numeric
14 RPE_WSE <NA> <NA> numeric
15 SITE_CODE character character character
16 STN_ID numeric numeric numeric
17 SWN character <NA> <NA>
18 TOP_PRF <NA> numeric <NA>
19 WCR_NO character <NA> <NA>
20 WELL_DEPTH numeric <NA> <NA>
21 WELL_NAME character <NA> <NA>
22 WELL_TYPE character <NA> <NA>
23 WELL_USE character <NA> <NA>
24 WLM_ACC character <NA> <NA>
25 WLM_ACC_DESC <NA> <NA> character
26 WLM_DESC <NA> <NA> character
27 WLM_GSE <NA> <NA> numeric
28 WLM_ID <NA> <NA> numeric
29 WLM_METHOD character <NA> <NA>
30 WLM_ORG_ID <NA> <NA> numeric
31 WLM_ORG_NAME <NA> <NA> character
32 WLM_QA_DESC <NA> <NA> character
33 WLM_RPE <NA> <NA> numeric
34 WSE <NA> <NA> numeric
# note SITE_CODE and STN_ID both exist across all 3 datasets
Now we’d like to know if there are similar data columns between the groundwater datasets and the CalEnviroscreen data. To do this let’s inspect our stations
and calenviro
data with either head
or names
or str
. All will tell us similar information. At this point, we know we can tie the groundwater data together, but we need to find something to crosswalk or join the CalEnviroscreen data to the groundwater data. In this case, the CalEnviroscreen data are at a census tract level, and contain data with ZIP codes, counties, and census tracts.
There are a lot of variables in the CalEnviroscreen dataset, so we’re only showing the first 10 here.
[1] "Census Tract"
[2] "Total Population"
[3] "California County"
[4] "ZIP"
[5] "Nearby City \n(to help approximate location only)"
[6] "Longitude"
[7] "Latitude"
[8] "CES 3.0 Score"
[9] "CES 3.0 Percentile"
[10] "CES 3.0 \nPercentile Range"
names(gw_stations)
[1] "STN_ID" "SITE_CODE" "SWN" "WELL_NAME"
[5] "LATITUDE" "LONGITUDE" "WLM_METHOD" "WLM_ACC"
[9] "BASIN_CODE" "BASIN_NAME" "COUNTY_NAME" "WELL_DEPTH"
[13] "WELL_USE" "WELL_TYPE" "WCR_NO"
What do we see? Looks like we have county names in both of these datasets, but they have different column names (COUNTY_NAME
in stations
and California County
in calenviro
. So we know we can join our three groundwater datasets together, and we know we can join the CalEnviroscreen data with the stations
dataset by county. Let’s talk about the types of joins we may use now!
There are quite a few different join types that are available via the {dplyr} package. Here are some great animations by Garrick Aden-Buie that help illustrate the various join types.
Returns all rows from x
, and all columns from x
and y
. Rows in x
with no match in y
will have NA
values in the new columns. Probably one of the most common joins, where we want to keep everything from our first dataframe (x
), and anything that matches x
from the second dataframe (y
).
The matching join, returns all rows from x
where there are matching values in y
, and all columns from x
and y
.
Returns all rows and all columns from both x
and y
. Where there are not matching values, the function returns NA
for that observation. This is a full merge, where no data is lost from either dataset, and any non-matching data gets an NA
.
Returns all rows from x
where there are matching values in y
, keeping only columns from x
. Also known as a filtering join because you are basically filtering one dataframe, using another dataframe as the match. This can be useful when you don’t actually want to join with data, but you do want to subset the data to match observations in another dataset.
Also a filtering join. This keeps all rows from x
where there are not matching values in y
, keeping just columns from x
. This join is useful to find out which data is not joining. Note, the dataframe you put first matters: if you switch x
and y
you may get different answers.
Let’s demonstrate some joins with our groundwater dataframes. First, we’ll filter
the station data to only stations in Sacramento County, and then join these data with the other groundwater data.
sac_stations <- gw_stations %>%
filter(COUNTY_NAME == "Sacramento")
dim(sac_stations) # get number of ROWS , COLUMNS
[1] 494 15
It looks like we have 494 groundwater stations in Sacramento County. Let’s use this subset of data to join to our other groundwater dataframes.
The various joins
outlined above are very flexible, but the default options are also very good. That means, if we don’t specify any details, the functions will often try to match and join based on common columns (that have identical names and data types), though this isn’t a great habit to set because that may not always be what we want 👾!
Let’s left_join()
our Sacramento stations with the gw_measurments
data. Note, we expect to get the same number of stations as the left part of our join (the x
data.frame), because left_join
should keep everything on the left side, and only keep what matches from the right side (y
).
# default join, only give x, y
sac_stations_measurements <- left_join(sac_stations, gw_measurements)
dim(sac_stations_measurements)
[1] 28208 32
What happened? Since there are two matching columns, the join used both STN_ID
and SITE_CODE
. We ended up with lots more observations, and additional columns. It’s important to check and validate things worked.
Challenge 2: Verify the Join
unique()
or distinct()
to prove we only have groundwater stations from Sacramento County in our joined dataset?It’s always a good idea to double check both visually and numerically that things are working as expected. In the example above, there are many gw_measurements
per station so it may seem difficult to know this worked. However, we can double check the number of unique stations in our joined dataset…it should still equal the total number stations in our sac_stations
dataset (n = 494).
# look at how many unique SITE_CODE are in orig sac only dataset
distinct(sac_stations, SITE_CODE) %>% nrow()
[1] 494
# look at how many unique are in joined dataset
distinct(sac_stations_measurements, SITE_CODE) %>% nrow()
[1] 494
# look at top 10 stations with the most measurements:
sac_stations_measurements %>%
group_by(SITE_CODE) %>% # group by SITE_CODE
tally() %>% # count how many measurements by STATION
arrange(desc(n)) %>% # arrange by count (n), descending
head(10) # top 10 records only
# A tibble: 10 × 2
SITE_CODE n
<chr> <int>
1 384121N1212102W001 1582
2 383264N1213191W001 642
3 382548N1212908W001 595
4 384082N1213845W001 541
5 385567N1214751W001 510
6 386016N1213761W001 508
7 383204N1214430W001 499
8 384403N1212921W001 358
9 386151N1214467W001 321
10 384147N1214507W001 311
If we use an inner_join()
, do we get a different number of stations? Why?
# default join, only give x, y
sac_stations_inner <- inner_join(sac_stations, gw_measurements)
# check station number
distinct(sac_stations_inner, SITE_CODE) %>% nrow()
[1] 420
There’s a different number of stations here! What happened? Remember, inner_join()
only keeps the rows that match in both data frames. So, this means while the original sac_stations
dataset had 494 stations, the gw_measurements
dataset does not have data for every station in the sac_stations
dataset, thus n = 420
instead n = 494
. When using left_join()
, these missing observations get filled with NA
, whereas with inner_join()
, they are dropped.
We saw how we can join two datasets, but let’s try piping these together so we can join all three groundwater datasets together in one single chain of code. We’ll start with our Sacramento County stations only.
Here we’ll pipe the data to the function, which means we only need to add the y
argument. In R, a “.
” is like a placeholder that represents the data we are piping along. It’s not strictly required (try running the function without it!), but here it helps us see the pieces that go into the function (x
and y
).
# note, we aren't specifying the columns we want to join on (keep defaults)
sac_gw_data_joined <- sac_stations %>%
left_join(., gw_measurements) %>% # passing this result to the next join
left_join(., gw_perf)
# quick check of stations, should be n=494, because left_join!
distinct(sac_gw_data_joined, SITE_CODE) %>% nrow()
[1] 494
What about joining dataframes with different variable names? This is much more common when trying to merge different datasets. Let’s try to join the CalEnviroscreen data with the Sacramento County groundwater stations
data.
calenviro_stations <- left_join(calenviro, sac_stations)
Error: `by` must be supplied when `x` and `y` have no common variables.
ℹ use by = character()` to perform a cross-join.
Run `rlang::last_error()` to see where the error occurred.
Uh-oh, an error message! But if we look more closely, this error message is pretty descriptive. It tells us the two pieces we are working with (x = calenviro
and y = sac_stations
) don’t have a common variable name. Then it gives us a suggestion, use by = character()
. Let’s try specifying the column names we want to join using the by
argument. Note, we need to quote the variable names here (a character()
vector), and they need to match the same order the dataframes were provided (x
and y
).
calenviro_stations <- inner_join(sac_stations, calenviro,
by = c("COUNTY_NAME" = "California County"))
dim(calenviro_stations)
[1] 156598 71
Great this worked! But let’s figure out why there are so many records if there were only n = 494 stations.
# A tibble: 317 × 2
`Census Tract` n
<dbl> <int>
1 6067000100 494
2 6067000200 494
3 6067000300 494
4 6067000400 494
5 6067000500 494
6 6067000600 494
7 6067000700 494
8 6067000800 494
9 6067001101 494
10 6067001200 494
# … with 307 more rows
# A tibble: 317 × 2
`Census Tract` n
<dbl> <int>
1 6067000100 494
2 6067000200 494
3 6067000300 494
4 6067000400 494
5 6067000500 494
6 6067000600 494
7 6067000700 494
8 6067000800 494
9 6067001101 494
10 6067001200 494
# … with 307 more rows
So there are 317 census tracts in Sacramento County, and 494 unique groundwater stations. If we multiply 317 * 494, we get the number of observations in our dataset (n = 156,598
). So, it seems every single station is being joined to every single census tract…that’s because we only have county to join on.
For these sorts of operations, we actually want a more spatially explicit join to reduce this sort of duplication. Thankfully we’ve done this already. Let’s grab this data now so we can use it to make a better join of the data. Download it here, and see the spatial mapmaking module to see more on spatial joins. Let’s join our full joined Sacramento County groundwater dataset from earlier using the census tracts instead of county.
# can use URL here too:
xwalk <- read_csv("data/calenviroscreen/sac_county_crosswalk_to_gw_stations.csv") %>%
# filter to just columns we need
select(SITE_CODE, tract)
# join with station data...adding "tract" to dataframe
sac_stations_w_tracts <- left_join(sac_stations, xwalk)
# now join to CalEnviroScreen data
sac_station_calenviro_by_tracts <- left_join(sac_stations_w_tracts, calenviro,
by = c("tract" = "Census Tract"))
# check dim
dim(sac_station_calenviro_by_tracts)
[1] 495 72
Ok, so what happened? We used an intermediate table (xwalk
) to crosswalk using a finer resolution join for our stations. So now each census tract is associated with a specific groundwater station in Sacramento County, based on a spatial join of the CalEnviroScreen data with the stations data.
Let’s close the loop, and join the station-calenviroscreen data back to our gw_perf
and gw_measurement
data so we have one single dataset for Sacramento County!
Remember how to do this? Let’s save this using an R specific format, .rds
, because it will retain the formatting and allow us to import/read in the data using whatever name we prefer. Here we’ll use the write_rds()
function from the {readr}
package (part of the {tidyverse}
).
# remember to use a relative path!
write_rds(sac_gw_all, file = "data/sacramento_gw_data_w_calenviro.rds")
Previous module:
9. Functions Next module:
11. Spatial Data
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".