Use, re-use, and write code to do stuff!
Functions
What is a function? A function is simply a set of instructions or code written for a computer to follow. These can be very simple (multiply one column of data by another) to much more complex (multi-step model and visualization). Packages are usually sets of functions that focus on similar topics or themes, sort of like a cookbook for a specific kind of food.
Functions ultimately make our code more reusable and reproducible, because we can use functions for repeated processes (like cleaning and tidying data!).
In R, functions have a name, an input which may consist of one or more arguments (sometimes called parameters) and a output or the thing that the function returns.
Here’s an example function to convert discharge from cubic meters per second (cms) to cubic feet per second (cfs). Can you identify the different parts?
cms_to_cfs <- function(discharge) {
cfs_out <- discharge * 35.3146662
return(cfs_out)
}
Using functions requires we know what pieces go where inside the function (the arguments), our data is of the right class or shape for the function (input), and we know what the function should return to our Environment! Once we have this basic understanding, there are thousands of functions that are already written and part of packages we can use. In fact, we’ve been using functions in every lesson thus far. In this module, we focus on building our own functions to meet our data science needs.
To use the cms_to_cfs
function we wrote above, we need to provide a value to the discharge
argument. Let’s convert 10 cms to cfs, using our function.
cms_to_cfs(discharge = 10)
[1] 353.1467
The function returns a numeric value to our console. If we wanted to save this value, we could assign it (<-
) to an object. Functions allow us to iterate over data, or repeat the same task on different datasets without needing to re-write code every time.
It’s important to adopt consistent and descriptive names in R. This isn’t just for functions, but it’s a particularly good habit to practice when naming functions. For example, a general recommendation is to 🐍 use_snake_case
🐍. Other options include:
MaybeUseCamelCase
or.separate.by.periods
orJust_try.everythingBecause.CHAOS
We’ve already been doing this throughout the previous lessons with things like read.csv()
, dplyr::filter()
, or ggplot2::ggplot()
, but let’s use a new package to pull some surface water river flow data. Here we’ll use the {dataRetrieval}
package.
{dataRetrieval}
The {dataRetrieval} package is a way to access hydrologic and water quality data from the USGS and EPA. There are several vignettes (or tutorials) available on the package website. Here, we will show how to download some surface water data from the American River near Sacramento. First we load the package with the functions we want to use. We load packages once per session.
# load the functions via packages!
library(tidyverse)
library(dataRetrieval) # a spellbook of water data functions
# use a function to download data:
flow_site <- readNWISdata(site = "11427000", # USGS Station ID
service = "iv", # instantaneous measurements
parameterCd = "00060", # discharge in cfs
startDate = "2019-10-01",
endDate = "2021-09-30",
tz = "America/Los_Angeles")
There are a lot of arguments in that function! How did we know what each one was or what to put there? There are a few different options. For functions or packages you aren’t familiar with, it’s worth using built-in help.
?readNWISdata
and look at the Usage
and Arguments
sections!Always visualize and inspect your data… we should have an object in our environment called flow_site
. Let’s take a look at this data, tidy it, and plot!
# check data
summary(flow_site)
agency_cd site_no dateTime
Length:70080 Length:70080 Min. :2019-10-01 00:00:00
Class :character Class :character 1st Qu.:2020-03-31 13:26:15
Mode :character Mode :character Median :2020-09-30 13:07:30
Mean :2020-09-30 12:39:55
3rd Qu.:2021-04-01 11:48:45
Max. :2021-09-30 23:45:00
X_00060_00000 X_00060_00000_cd tz_cd
Min. : 9.68 Length:70080 Length:70080
1st Qu.: 55.50 Class :character Class :character
Median : 115.00 Mode :character Mode :character
Mean : 323.72
3rd Qu.: 411.00
Max. :4580.00
head(flow_site)
agency_cd site_no dateTime X_00060_00000
1 USGS 11427000 2019-10-01 00:00:00 101
2 USGS 11427000 2019-10-01 00:15:00 106
3 USGS 11427000 2019-10-01 00:30:00 106
4 USGS 11427000 2019-10-01 00:45:00 106
5 USGS 11427000 2019-10-01 01:00:00 106
6 USGS 11427000 2019-10-01 01:15:00 106
X_00060_00000_cd tz_cd
1 A America/Los_Angeles
2 A America/Los_Angeles
3 A America/Los_Angeles
4 A America/Los_Angeles
5 A America/Los_Angeles
6 A America/Los_Angeles
# fix column names so we can use a more informative parameter name
flow_site <- dataRetrieval::renameNWISColumns(flow_site)
# check data
names(flow_site)
[1] "agency_cd" "site_no" "dateTime" "Flow_Inst"
[5] "Flow_Inst_cd" "tz_cd"
# visualize:
ggplot(data = flow_site) +
geom_line(aes(x = dateTime, y = Flow_Inst), color = "darkblue") +
theme_classic() +
labs(x = "", y = "Flow (cfs)",
title = "Discharge from USGS Station 11427000",
caption = "data pulled using {dataRetrieval} package in R")
Awesome! We plotted some discharge data from a river. That’s great. We could save this data if we wanted (see the import/export module), or do additional analysis (see next module on joins!). Let’s try using a different function here.
Challenge 1: You Try!
Using the code template below, create a flow_huc
object using the readNWISdata()
function but fill in the arguments with the following information:
site
, let’s use huc
. We’ll try getting stations from HUC 18020111.iv
(instantaneous), let’s use dv
.startDate
, endDate
, and parameterCd
as we used above.%>%
the output from readNWISdata()
directly to the renameNWISColumns()
function and then to the addWaterYear()
function.How many unique site_no
are there?
Can you visualize the data using a ggplot()
with geom_line
and use a different color for each site_no
?
Code template:
flow_huc <- readNWISdata(___,
___,
parameterCd = "00060",
startDate = "2019-10-01",
endDate = "2021-09-30") %>%
___ %>%
___
# look for stations with daily values of discharge in the whole watershed
flow_huc <- readNWISdata(huc = "18020111", # using a huc id number now
service = "dv", # daily values
parameterCd = "00060",
startDate = "2019-10-01",
endDate = "2021-09-30") %>%
renameNWISColumns() %>% # the pipe infers that the data is passed through
addWaterYear()
names(flow_huc) # names look ok!
[1] "agency_cd" "site_no" "dateTime" "waterYear" "Flow"
[6] "Flow_cd" "tz_cd"
unique(flow_huc$site_no) # 2 stations
[1] "11446500" "11447360"
# visualize:
ggplot() +
geom_line(data = flow_huc,
aes(x = dateTime, y = Flow, color = site_no),
show.legend = FALSE) +
theme_classic() +
labs(x = "", y = "Flow (cfs)",
title = "Discharge from USGS Stations: 11446500, 11447360",
caption = "data pulled using {dataRetrieval} package in R") +
# a new way to view data with facets...add a free scale!
facet_grid(rows = vars(site_no), scales = "free")
Since we already wrote a custom function to convert cms to cfs, let’s make another similar function but for cfs to cms, and apply it to the flow data we just downloaded. Let’s look at how we might use a function on an existing data frame or dataset that we have in R.
Let’s use some of the dplyr
we looked at in the last lesson. Here we want to make a new column that is our Flow
data, but instead of in cfs, our units will be in cms. Let’s use the flow_site
data frame.
# the function
cfs_to_cms <- function(discharge) {
cms_out <- discharge * 0.028316847
return(cms_out)
}
# add a new column to existing dataset with mutate
flow_site <- flow_site %>%
mutate(Flow_cms = cfs_to_cms(Flow_Inst)) # apply the function to the column
# quick plot to check!
ggplot() +
geom_line(data = flow_site,
aes(x = dateTime, y = Flow_cms), color = "darkblue") +
theme_classic() +
labs(x = "", y = "Flow (cms)",
title = "Discharge from USGS Station 11427000",
caption = "data pulled using {dataRetrieval} package in R")
So what happened here? We made a function cfs_to_cms()
to apply to our data. In this case, we wanted to apply our function to a column in our dataframe, which if we remember the earlier lesson on data, in a dataframe a column is a vector of the same type of data. That means the function can very quickly iterate through every observation in the vector and convert the data from cfs to cms.
This section uses some more functions, but with some more advanced applications. We’ll want to have the {sf}
package installed for the following code to work fully.
What if we wanted to find all the possible water quality stations that are upstream or downstream of the gage we pulled data from earlier (USGS: 11446500
)? A great function that will not only help us figure that out, but also help us pull in spatial data we can use to make maps is the findNLDI()
function from the {dataRetrieval}
package.
The function findNLDI()
has arguments for the type of data we want to use (nwis
, huc
,location
), the direction along a stream to look (upstream or downstream, or in mainstem or tributaries), the data we want to get back (find
), and the distance we should search upstream or downstream (distance_km
).
Let’s look only in the mainstem river for 50 km upstream or downstream of our USGS gage and return any NWIS (water quality or flow) sites and the associated flowlines we searched.
library(sf)
# get NWIS and flowline data 50km downstream of the USGS gage
amer_ds_nimbus <- findNLDI(
nwis = "11446500", # Below Nimbus Dam
nav = c("UM","DM"),
find = c("nwis", "flowlines"),
distance_km = 50)
names(amer_ds_nimbus)
[1] "origin" "UM_nwissite" "UM_flowlines" "DM_nwissite"
[5] "DM_flowlines"
Note, this returned a named list of dataframes. We can access individual parts of a list just as we can access a single column in a dataframe using the $
. This function returned 5 parts, the origin
which is the site we used to start our search, the UM_nwissite
which is the NWIS sites upstream on the mainstem from our USGS gage, the DM_nwissite
which is the NWIS sites downstream of our USGS gage, and finally, a mainstem flowline upstream and downstream of our gage. These are dataframes with an underlying spatial component, which is nice because it allows us to easily map and visualize these data if we want. See the mapmaking lesson for more info.The dataframes are an sf
class, from the simple features {sf} package, but they can also be converted to into a dataframe.
# check the object class
class(amer_ds_nimbus$DM_nwissite)
[1] "sf" "tbl_df" "tbl" "data.frame"
# get just downstream NWIS sites and make into dataframe
nwis_sites <- amer_ds_nimbus$DM_nwissite %>%
st_drop_geometry() # drop the sf class
# verify the class has changed
class(nwis_sites)
[1] "tbl_df" "tbl" "data.frame"
Let’s visualize these sites with a fancy map! See the mapmaking lesson for more details on making maps in R.
Finally, let’s go ahead and save (export) our list of NWIS sites, and all our spatial components from the American River below Nimbus Dam so we can use it for future lessons.
There are many packages that work with water or water data. Here’s a list of just a few that you may find interesting or helpful. The R CRAN repository has a list of all packages that relate to Hydrology that is also a good place to look.
Previous module:
8. Spreadsheets & Pivots Next module:
10. Joins and Binds
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".