From 1,000 point-clicks to 1 script…
Using {sf} with aquatic data from NHD sources
Learning objectives
We can use some of the same tools and data we used previously, but we can now also add a few options to download and actively use USGS/NHD data. These data include river stage and flow, water quality, station locations, watershed and streamline attributes, etc. Particularly for water scientists, it can be immensely useful to tie in additional data, or use standard datasets that are continuously updated (like the USGS gage network).
First we’ll need a few packages we’ve not used yet. Please install these with install.packages() if you don’t have them.
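For example (a sketch; install whichever of these you don’t already have):
# packages most likely to be new for this lesson
install.packages(c("dataRetrieval", "tmap", "tmaptools", "tigris", "units"))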
# GENERAL PACKAGES
library(tidyverse) # data wrangling & viz
library(purrr) # iteration
library(janitor) # name cleaning
library(glue) # pasting stuff together
# SPATIAL PACKAGES
library(sf) # analysis tools
library(mapview) # interactive maps!
mapviewOptions(fgb = FALSE) # to save interactive maps
library(tigris) # county & state boundaries
library(units) # for converting spatial and other units
library(dataRetrieval) # download USGS river data
library(tmap) # mapping
library(tmaptools) # mapping tools
We’ve demonstrated how to join data and crop data in R, but let’s use some alternate options to download new data. We’ll focus on surface water here, and look at how we can download and map flowlines in R. Much of the USGS data network can be queried and downloaded in R. This may include data on water quality, river discharge, water temperature, spatial basins, and NHD flowlines. The {dataRetrieval} package is an excellent option for these operations.
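As a quick taste of what {dataRetrieval} can do, we can pull metadata for a single gage with readNWISsite(). This is just a sketch; gage 11446500 (the American River at Fair Oaks) is one we’ll meet again later in this lesson.
# look up site metadata for one USGS gage (requires internet)
fair_oaks <- dataRetrieval::readNWISsite(siteNumbers = "11446500")
fair_oaks$station_nm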
Let’s find a few groundwater stations to use. Here we’ll grab one close to the American River and one close to the Cosumnes River, just for demonstration purposes, but this could be any X/Y point, or set of points, you are interested in.
Iteration…remember {purrr}? Let’s use it here!
# get counties for CA as {sf}
ca_co <- counties("CA", cb = TRUE, progress_bar = FALSE)
# read the stations
gw_files <- list.files(path = "data/gwl/county",
full.names = TRUE, pattern = "*.csv")
# read all files into dataframes and combine with purrr
gw_df <- map_df(gw_files, ~read.csv(.x))
# the readr package will also do this same thing by default
# when passed a list of files with the same data types
gw_df <- read_csv(gw_files)
# now make "spatial" as sf objects
gw_df <- st_as_sf(gw_df, coords = c("LONGITUDE", "LATITUDE"),
                  crs = 4326, remove = FALSE) %>%
  # and transform!
  st_transform(4269)
# get just sacramento: here we read in a shapefile:
sac_co <- st_read("data/shp/sac/sac_county.shp")
Reading layer `sac_county' from data source
`/Users/richpauloo/Documents/GitHub/r4wrds/intermediate/data/shp/sac/sac_county.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 9 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: -13565710 ymin: 4582007 xmax: -13472670 ymax: 4683976
Projected CRS: WGS 84 / Pseudo-Mercator
# check CRS
st_crs(sac_co)$epsg
[1] 3857
# match with other data
sac_co <- st_transform(sac_co, st_crs(ca_co))
st_crs(sac_co) == st_crs(ca_co) # should be TRUE!
[1] TRUE
# make a box around sacramento county
# (a grid with an n=1) for inset
sac_box <- st_make_grid(sac_co, n = 1)
Now we can filter to our locations of interest.
# filter to only Sacramento
gw_sac <- gw_df %>%
filter(COUNTY_NAME == "Sacramento")
# use this layer to filter to only locations (stations of interest)
sac_loi <- gw_sac %>% filter(STN_ID %in% c("52418", "5605"))
mapview(sac_co, alpha.regions = 0,
        color = "black", lwd = 3, legend = FALSE) +
  mapview(sac_loi, col.regions = "purple", layer.name = "Locations of Interest")
findNLDI
The findNLDI() function allows us to pass a single spatial point as well as a few different parameters, like whether to search upstream or downstream and what features we want to find, and it then returns a list of items (see the function’s help page for details), leveraging the hydro-network linked data index (NLDI)1. Note, we’ll need internet connectivity here for these functions to run successfully.
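Before mapping over multiple points, it helps to see a single call. Here’s a minimal sketch using a bare longitude/latitude pair (the coordinates are illustrative, near the American River); with no navigation arguments, findNLDI() returns just the origin flowline and its COMID.
# a single-point lookup (requires internet); location is c(lon, lat) in WGS84
one_pt <- findNLDI(location = c(-121.22, 38.63))
one_pt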
Let’s look only at mainstem flowlines from our locations of interest, and return the nearest NWIS sites as well as the NHD flowlines (streamlines). We’ll use the map() function to pass a list of stations along (here only 2, but this is flexible, and in practice we can map over a much larger number of stations). First, we want to look up the COMID (the NHD location identifier) for our locations of interest.
library(dataRetrieval)
# Need to convert our locations of interest to WGS84
sac_loi <- sac_loi %>%
st_transform(4326)
# now we can go get flowline data!
us_nwis <- map(sac_loi$geometry,
               ~ findNLDI(location = .x,
                          nav = c("UM"),
                          find = c("nwis", "flowlines"),
                          distance_km = 120))
Great, now we have a list of three or more things for each sf object we passed to findNLDI. In this case, we should have (for each location of interest we used):

- origin: the flowline segment that the original LOI was linked to, based on the nearest-distance algorithm the function used. Note, this is an sf LINESTRING data.frame.
- UM_nwissite: all the NWIS sites that were identified upstream of our origin point.
- UM_flowlines: the Upstream Mainstem (UM) flowlines from our origin point.

Let’s play with these data and make some maps.
There are a few options, and it depends on what your goal is. Here we show a few simple ways to pull these data out or collapse them. Remember, we can access things in lists with our [[ ]] too!
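For example, before we name the list elements below, we can inspect the first LOI’s results by position:
# peek at the pieces returned for the first location
us_nwis[[1]]$origin              # the flowline segment the LOI snapped to
us_nwis[[1]]$UM_nwissite         # NWIS sites upstream of that point
nrow(us_nwis[[1]]$UM_flowlines)  # number of upstream mainstem flowlines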
# we can split these data into separate data frames
# and add them as objects to the global environment (.GlobalEnv).
# first add names based on our station IDs:
us_nwis <- set_names(us_nwis, nm=glue("id_{sac_loi$STN_ID}"))
# then split into separate dataframes
# us_nwis %>% list2env(.GlobalEnv)
# Or we can combine with map_df
us_flowlines <- map_df(us_nwis, ~rbind(.x$UM_flowlines))
us_nwissite <- map_df(us_nwis, ~rbind(.x$UM_nwissite))
mapview(sac_loi, col.regions = "purple", legend = TRUE,
        cex = 3, layer.name = "GW LOI") +
mapview(us_nwissite, col.regions="orange",
legend = TRUE, layer.name="UM NWIS Sites") +
mapview(us_flowlines, color="steelblue", lwd=2,
layer.name="UM Flowlines", legend=FALSE)
Next, let’s filter to NWIS USGS stations that have flow data (generally these have 8-digit identifiers, while the longer codes tend to be water-quality monitoring sites), and pull streamflow data for the nearest station.
# use the stringr package, part of tidyverse to trim characters
usgs_stations <- us_nwissite %>%
filter(stringr::str_count(identifier) < 14)
# double check?
mapview(sac_loi, col.regions = "purple", legend = TRUE,
        cex = 3, layer.name = "GW LOI") +
mapview(us_nwissite, col.regions="orange", cex=2,
legend = TRUE, layer.name="UM NWIS Sites") +
mapview(usgs_stations, col.regions="cyan4",
legend = TRUE, layer.name="USGS Gages") +
mapview(us_flowlines, color="steelblue", lwd=2,
layer.name="UM Flowlines", legend=FALSE)
The final filter involves snapping our LOI points (n = 2) to the nearest USGS stations from the stations we filtered to above. We can then use these data to generate some analysis and exploratory plots.
Snapping spatial data can be tricky, mainly because decimal precision can cause problems. One solution is to add a slight buffer around points or lines to improve successful pairing.
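As an aside, here’s a minimal sketch of that buffering idea (the 2 km distance and the loi_buff name are arbitrary choices for illustration): buffer the points in a projected CRS, then test which stations fall inside.
# buffer the LOI points by 2 km in a projected CRS (EPSG:3310, meters)
loi_buff <- st_buffer(st_transform(sac_loi, 3310), dist = 2000)
# which stations intersect each buffered point?
st_intersects(loi_buff, st_transform(usgs_stations, 3310))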
For this example, we’ll use st_nearest_feature(), which gives us an index of the nearest feature (row) between two sets of spatial data. In this case, we have two sets of points.
# get row index of nearest feature between points:
usgs_nearest_index <- st_nearest_feature(sac_loi, usgs_stations)
# now filter using the row index
usgs_stations_final <- usgs_stations[usgs_nearest_index, ]
# get vector of distances from each LOI to its nearest USGS station
dist_to_loi <- st_distance(sac_loi,
usgs_stations_final,
by_element = TRUE)
# use units package to convert units to miles or km
(dist_to_loi_mi <- units::set_units(dist_to_loi, miles))
Units: [miles]
[1] 13.696934 1.202355
(dist_to_loi_km <- units::set_units(dist_to_loi, km))
Units: [km]
[1] 22.043079 1.935003
# bind back to final dataset:
usgs_stations_final <- usgs_stations_final %>%
cbind(dist_to_loi_mi, dist_to_loi_km)
# now plot!
mapview(usgs_stations, cex = 2.75, col.regions = "gray",
layer.name = "USGS Stations") +
mapview(us_flowlines, legend = FALSE, color = "steelblue") +
mapview(usgs_stations_final, col.regions = "yellow",
layer.name = "Nearest USGS Station to LOI") +
mapview(sac_loi, col.regions="purple",
layer.name = "GW LOI")
Notice anything? How could we approach this differently so we pulled at least one gage per river instead of two in one river and none in the other?
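One possible approach (a sketch, not the only answer): since us_nwis already holds each LOI’s own upstream sites, we can pick the nearest station within each list element, which guarantees one gage per river. The nearest_per_loi name is just illustrative.
# pick the nearest upstream NWIS site separately for each LOI
nearest_per_loi <- map_df(seq_along(us_nwis), function(i) {
  sites <- us_nwis[[i]]$UM_nwissite
  sites[st_nearest_feature(sac_loi[i, ], sites), ]
})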
If we want to select more than a single point based on a threshold distance, we can use a non-overlapping join and specify a distance. For many spatial operations, using a projected CRS is important because it generally provides a more accurate calculation, since it is based on a “flat” surface and uses a linear grid. It has the additional advantage that we tend to process and understand grid-based information more easily than curvilinear (degree-based) information, so a distance of 100 yards or 100 meters makes sense when compared with 0.001 degrees.
Therefore, first we transform our data into a projected CRS, then we do our join and distance calculations, then we can transform back to our lat/lon CRS.
usgs_stations <- st_transform(usgs_stations, 3310)
sac_loi <- st_transform(sac_loi, 3310)
# use a search within 15km to select stations
usgs_stations_15km <- st_join(sac_loi,
usgs_stations,
st_is_within_distance,
dist = 15000) %>%
st_drop_geometry() %>%
filter(!is.na(X)) %>% # can't have NA's
st_as_sf(coords = c("X","Y"), crs = 4326)
mapview(usgs_stations_15km, col.regions = "yellow") +
mapview(sac_loi, col.regions = "purple") +
mapview(us_flowlines, legend = FALSE, color = "steelblue")
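It’s worth a quick check of what the join returned (a sketch; STN_ID comes from sac_loi and survives the join):
# how many stations landed within 15 km, and how many per LOI?
nrow(usgs_stations_15km)
table(usgs_stations_15km$STN_ID)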
Now that we have our stations of interest, let’s download river flow and water temperature data with the {dataRetrieval} package.
# strip out the "USGS" from our identifier with "separate"
usgs_stations_15km <- usgs_stations_15km %>%
tidyr::separate(col = identifier, # the column we want to separate
into = c("usgs", "usgs_id"), # the 2 cols to create
remove = FALSE) %>% # keep the original column
select(-usgs) # drop this column
# see if there's daily discharge/water temperature data available ("dv"):
dataRetrieval::whatNWISdata(siteNumber = usgs_stations_15km$usgs_id,
service = "dv",
parameterCd = c("00060", "00010"),
statCd = "00003")
agency_cd site_no
163 USGS 11446500
166 USGS 11446500
243 USGS 11446700
260 USGS 11446980
station_nm site_tp_cd
163 AMERICAN R A FAIR OAKS CA ST
166 AMERICAN R A FAIR OAKS CA ST
243 AMERICAN R A WILLIAM B POND PARK A CARMICHAEL CA ST
260 AMERICAN R BL WATT AVE BRDG NR CARMICHAEL CA ST
dec_lat_va dec_long_va coord_acy_cd dec_coord_datum_cd alt_va
163 38.63546 -121.2277 F NAD83 71.53
166 38.63546 -121.2277 F NAD83 71.53
243 38.59129 -121.3327 S NAD83 45.00
260 38.56713 -121.3883 S NAD83 25.00
alt_acy_va alt_datum_cd huc_cd data_type_cd parm_cd stat_cd
163 0.01 NGVD29 18020111 dv 00010 00003
166 0.01 NGVD29 18020111 dv 00060 00003
243 2.00 NGVD29 18020111 dv 00010 00003
260 2.50 NGVD29 18020111 dv 00010 00003
ts_id loc_web_ds medium_grp_cd parm_grp_cd srs_id access_cd
163 234322 NA wat <NA> 1645597 0
166 10977 NA wat <NA> 1645423 0
243 234323 NA wat <NA> 1645597 0
260 234324 NA wat <NA> 1645597 0
begin_date end_date count_nu
163 1971-07-20 2023-05-18 3146
166 1904-10-01 2023-05-17 43328
243 2016-10-01 2023-05-17 2418
260 2016-10-01 2023-05-18 2393
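If you ever forget what a parameter code like "00060" stands for, {dataRetrieval} can decode it with readNWISpCode():
# decode USGS parameter codes (00060 = discharge, 00010 = water temperature)
dataRetrieval::readNWISpCode(c("00060", "00010"))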
# Extract Streamflow for identified sites
usgs_Q <- readNWISdv(usgs_stations_15km$usgs_id,
parameterCd = "00060",
startDate = "2016-10-01") %>%
renameNWISColumns()
# extract water temp
usgs_wTemp <- readNWISdv(usgs_stations_15km$usgs_id,
parameterCd = "00010",
startDate = "2016-10-01") %>%
renameNWISColumns()
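Before plotting, it can help to confirm what came back (a quick sketch using {dplyr}, already loaded via {tidyverse}):
# summarize record length and date range per site
usgs_Q %>%
  group_by(site_no) %>%
  summarize(n_days = n(), start = min(Date), end = max(Date))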
Now that we have the data, let’s plot!
# Plot!
(hydro <- ggplot() +
geom_line(data = usgs_Q, aes(x = Date, y = Flow, col = site_no),
size = .5) +
scale_color_brewer(palette = "Set1") +
facet_wrap(~site_no, scales = "free_x") +
theme_classic() +
labs(title="USGS Discharge Data",
x="", y="Discharge (cfs)") +
theme(legend.position = "none"))
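Discharge can span orders of magnitude, so a log-scaled y-axis is sometimes easier to read; here’s an optional variant (whether it suits your data is a judgment call):
# same hydrograph with a log10 y-axis
hydro + scale_y_log10()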
# Plot temp
(gg_temp <- ggplot() +
geom_path(data = usgs_wTemp,
aes(x = Date, y = Wtemp, col = site_no),
size = .5) +
facet_wrap(~site_no) +
theme_bw() +
labs(title="USGS Water Temperature Data",
x="", y="Water Temperature (C)") +
scale_color_viridis_d() +
theme(legend.position = "none"))
Challenge
In the plots above, we see the gaps in data are connected when using a line plot. Ideally, we would prefer to visualize these data with gaps (no line) where there is no data. To do this, we can leverage handy functions from the {tidyr} package: complete() and fill().
# load the package
library(tidyr)
# fill all unique combinations of Date in our data
usgs_wTemp2 <- usgs_wTemp %>%
group_by(site_no) %>% # group by our gages first
complete(Date = seq.Date(min(Date), max(Date), by="day")) %>%
# then list the cols whose (constant) values we want filled in
fill(site_no, agency_cd)
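The reason this works: complete() inserts a row with NA for Wtemp on each missing date, and geom_path() breaks the line wherever it hits an NA. A quick check of how many gap rows were added (a sketch):
# count inserted gap rows (Wtemp is NA on dates with no record)
sum(is.na(usgs_wTemp2$Wtemp))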
# now regenerate plot!
# Plot temp
(gg_temp2 <- ggplot() +
geom_path(data = usgs_wTemp2,
aes(x = Date, y = Wtemp, col = site_no),
size = .5) +
facet_wrap(~site_no) +
theme_bw() +
labs(title="USGS Water Temperature Data",
x="", y="Water Temperature (C)") +
scale_color_viridis_d() +
theme(legend.position = "none"))
{tmap}
One final component that we haven’t covered much is how to create a publication-ready map. We can do this using the {ggplot2} package in conjunction with geom_sf(), or we can use some alternate packages which are built specifically to work with spatial data and use a similar code structure to {ggplot2}.
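Here’s a minimal sketch of that geom_sf() approach using layers we’ve already built (the styling choices are illustrative, not prescriptive):
# a quick map with {ggplot2} + geom_sf(); coord_sf() aligns the layers' CRS
ggplot() +
  geom_sf(data = sac_co, fill = "gray90", color = "gray50") +
  geom_sf(data = us_flowlines, color = "steelblue") +
  geom_sf(data = sac_loi, color = "purple", size = 3) +
  theme_minimal()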
The {tmap} and {tmaptools} packages are excellent options to create a nice map that can be used in any report or publication; we loaded both at the top of this lesson.
Now we build our layers using a similar structure to {ggplot2}.
final_tmap <-
# counties
tm_shape(sac_co) +
tm_polygons(border.col = "gray50", col = "gray50",
alpha = 0.1, border.alpha = 0.9, lwd = 0.5, lty = 1) +
# rivers
tm_shape(us_flowlines) +
tm_lines(col = "steelblue", lwd = 2) +
# points: LOI stations
tm_shape(sac_loi) +
tm_symbols(col = "orange3", border.col = "gray20",
shape = 21, size = 1.5, alpha = 1) +
tm_add_legend('symbol',shape = 21, col='orange3', border.col='black', size=1,
labels=c(' LOI')) +
# points usgs
tm_shape(usgs_stations_15km) +
tm_symbols(col = "cyan4", border.col = "gray20",
shape = 23, size = 0.5) +
tm_add_legend('symbol',shape = 23, col='cyan4', border.col='black', size=1,
labels=c(' USGS Stations')) +
# layout
tm_layout(
frame = FALSE,
legend.outside = FALSE, attr.outside = FALSE,
inner.margins = 0.01, outer.margins = (0.01),
legend.position = c(0.2,0.8)) +
tm_compass(type = "4star", position = c("right","bottom")) +
tm_scale_bar(position = c("right","bottom"))
final_tmap
To save this map, we use a function similar to ggsave(); in this case, it’s tmap::tmap_save().
tmap::tmap_save(final_tmap,
filename = "images/map_of_sites.jpg",
height = 11, width = 8, units = "in", dpi = 300)
1. For more info on the NLDI: https://labs.waterdata.usgs.gov/about-nldi/index.html