How to automate routine reporting
Learning objectives
Parameterize .Rmd files with params
Iterate over reports with {rmarkdown}
“Your success in life will be determined largely by your ability to speak, your ability to write, and the quality of your ideas, in that order.” — Patrick Winston (1943-2019)
Data science is the art of combining domain knowledge, statistics, math, programming, and visualization to find order and meaning in disorganized information. Communicating the results of your analyses, or the ability to make data “speak”, is of utmost importance. The modern open source package ecosystem is full of powerful ways to give voice to our analyses.
Analyses typically lead to figures, tables, and interpretation of this information. The {rmarkdown} package provides R users with a standardized approach for turning R analyses into reports, documents, presentations, dashboards, and websites. In this module, we assume familiarity with {rmarkdown}, and extend the previous modules on iteration, functional programming, and reproducible workflows to demonstrate how to iterate over reports.
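According to R Markdown: The Definitive Guide, some example use cases for creating a parameterized report include: showing results for a specific geographic location; running a report that covers a specific time period; running multiple versions of a report for distinct sets of core assumptions; and controlling the behavior of knitr (for example, whether code is displayed).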
In this module, we will focus on the first case, and build parameterized reports for a set of geographic locations. Throughout this course, we’ve been working with groundwater elevation data across California counties. Let’s imagine that we want to generate a report on groundwater level trends for a set of counties.
Although the RStudio Integrated Development Environment (IDE) encourages knitting R Markdown documents by clicking a button, we can also knit documents from code with rmarkdown::render(). Iterating over render() is the key to scaling parameterized reports.
To iterate over R Markdown reports, we must first understand how to use params.
params
A parameterized .Rmd file takes a set of params (short for “parameters”) in the YAML header. These are bound into a named list called params, which code within the .Rmd file accesses with params$<parameter-name>. For example, consider the following YAML:
title: "My awesome parameterized report"
output: html_document
params:
  start_date: 2021-01-01
  watershed: Yuba
  data: gwl_yuba.csv
In the code, we could then access the value "2021-01-01" with params$start_date. Similarly, params$watershed will equal "Yuba" and params$data will equal "gwl_yuba.csv".
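The values in the YAML header act as defaults. As a minimal sketch of how they can be overridden from code, the block below writes a small throwaway .Rmd to a temporary directory and renders it with rmarkdown::render(), passing a different watershed through the params argument. The file name demo_report.Rmd, the output name demo_bear.html, and the watershed "Bear" are made up for illustration, and rendering is skipped if {rmarkdown} or pandoc is not available.

```r
# Hypothetical throwaway report; file names and values are illustrative only
rmd <- file.path(tempdir(), "demo_report.Rmd")
writeLines(c(
  "---",
  "title: \"Demo report\"",
  "output: html_document",
  "params:",
  "  watershed: Yuba",
  "---",
  "",
  "This report covers the `r params$watershed` watershed."
), rmd)

# Render only if rmarkdown and pandoc are available on this machine
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  # `params` replaces the YAML default ("Yuba") for this render only
  out <- rmarkdown::render(
    input = rmd,
    output_file = "demo_bear.html",
    params = list(watershed = "Bear"),
    quiet = TRUE
  )
  print(out) # path of the rendered HTML file
}
```

Rendering without the params argument would fall back to the default declared in the YAML header.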
params
Let’s apply params to our task and generate an html_document for a set of counties. To illustrate, we will use a simplified, pre-processed dataset of 3 counties (Sacramento, Yolo, and San Joaquin). If you’re motivated to do so, you can use the entire groundwater level dataset of more than 2 million records to scale the process to all counties. Read in the data and take a look at the fields.
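The real dataset is not reproduced here. As a stand-in, the sketch below builds a tiny data frame with the column names that the report template in this module refers to (COUNTY_NAME, SITE_CODE, MSMT_DATE, WSE, WELL_DEPTH) and inspects it. The object name gwl_toy and every value in it are invented; the actual gwl object is presumably a spatial ({sf}) object, since the report maps it and drops its geometry.

```r
# Invented stand-in data; only the column names mirror the report template
gwl_toy <- data.frame(
  COUNTY_NAME = c("Sacramento", "Sacramento", "Yolo", "San Joaquin"),
  SITE_CODE   = c("S1", "S1", "Y1", "J1"),
  MSMT_DATE   = as.Date(c("2019-03-01", "2020-03-01", "2020-04-15", "2021-01-10")),
  WSE         = c(12.4, 11.9, 35.2, -4.1),
  WELL_DEPTH  = c(200, 200, 350, 120)
)

# Inspect the fields and their types
str(gwl_toy)

# The values we would iterate over, and how many records each has
unique(gwl_toy$COUNTY_NAME)
table(gwl_toy$COUNTY_NAME)
```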
We will iterate over COUNTY_NAME to create three reports, one for each county. Copy and paste the following code into a new file, reports/gwl_report.Rmd:
---
title: "`r paste(params$county, 'Groundwater Levels')`"
output: html_document
params:
  county: "placeholder"
---
<br>
```{r, echo = FALSE, message = FALSE, error = FALSE, warning = FALSE}
library(tidyverse)
library(here)
library(sf)
library(mapview)
library(maps)
library(DT)
mapviewOptions(fgb = FALSE)
knitr::opts_chunk$set(warning = FALSE, message = FALSE, out.width = "100%")
# filter all groundwater level data (already loaded into memory) by
# the supplied county
d <- filter(gwl, COUNTY_NAME == params$county)
# extract the county spatial file
# maps::map() is called explicitly because purrr (loaded by tidyverse) also exports map()
county_sf <- st_as_sf(maps::map("county", plot = FALSE, fill = TRUE)) %>%
  filter(ID == paste0("california,", tolower(params$county)))
```
This report shows groundwater levels in `r params$county` county.
Dates range from `r min(d$MSMT_DATE, na.rm = TRUE)` to `r max(d$MSMT_DATE, na.rm = TRUE)`.
Data source: [DWR Periodic Groundwater Level Database](https://data.cnra.ca.gov/dataset/periodic-groundwater-level-measurements).
<br>
## Distribution of measurements over time
50% of measured values occur on or after `r median(d$MSMT_DATE, na.rm = TRUE)`.
```{r hist, echo = FALSE}
d %>%
  ggplot() +
  geom_histogram(aes(MSMT_DATE)) +
  theme_minimal() +
  labs(title = "", x = "", y = "Count")
```
<br>
## Monitoring sites
```{r map, echo = FALSE}
# mapview of county outline
county_mv <- mapview(
  county_sf, layer.name = paste(params$county, "county"),
  lwd = 2, color = "red", alpha.regions = 0
)
# mapview of monitoring points
points_mv <- d %>%
  group_by(SITE_CODE) %>%
  slice(1) %>%
  select(-MSMT_DATE) %>% # remove msmt date b/c it's irrelevant
  mapview(layer.name = "Monitoring stations")
county_mv + points_mv
```
<br>
## All groundwater levels
```{r plot, echo = FALSE}
# interactive hydrograph
p <- ggplot(d, aes(MSMT_DATE, WSE, color = SITE_CODE)) +
  geom_line(alpha = 0.5) +
  guides(color = "none") # hide the legend; `FALSE` is deprecated in recent ggplot2
plotly::ggplotly(p)
```
<br>
```{r dt, echo = FALSE}
# data table of median groundwater level per site, per year
d %>%
  select(-c("COUNTY_NAME", "WELL_DEPTH")) %>%
  st_drop_geometry() %>%
  mutate(YEAR = lubridate::year(MSMT_DATE)) %>%
  group_by(SITE_CODE, YEAR) %>%
  summarise(WSE_MEDIAN = median(WSE, na.rm = TRUE)) %>%
  ungroup() %>%
  DT::datatable(
    extensions = 'Buttons', options = list(
      dom = 'Bfrtip',
      buttons =
        list('copy', 'print', list(
          extend = 'collection',
          buttons = c('csv', 'excel', 'pdf'),
          text = 'Download'
        ))
    )
  )
```
***
Report generated on `r Sys.Date()`.
Pause and think
Take a moment to read the .Rmd file above and see what it does. Notice where params$county appears in the document. In particular, the first code chunk uses it to filter the groundwater level data down to the county parameter (the data are assumed to be in memory, so we load them only once rather than every time we run this script).
d <- filter(gwl, COUNTY_NAME == params$county)
Next, how might you write an .Rmd file like the one above and test that everything looks the way you want it to before calling it done? In other words, would you start by writing params$county everywhere it is needed, or start with one county, make sure everything works, and then substitute in params$county?
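One way to work through this question is to define a stand-in params list in the console, develop each chunk against it, and only knit once the pieces behave. The sketch below assumes a small made-up data frame (gwl_toy) in place of the real data and uses base R subsetting instead of filter() so that it runs without extra packages.

```r
# Invented stand-in for the groundwater level data
gwl_toy <- data.frame(
  COUNTY_NAME = c("Sacramento", "Sacramento", "Yolo", "San Joaquin"),
  MSMT_DATE   = as.Date(c("2019-03-01", "2020-03-01", "2020-04-15", "2021-01-10")),
  WSE         = c(12.4, 11.9, 35.2, -4.1)
)

# Stand-in for the list that knitting builds from the YAML header
params <- list(county = "Yolo")

# Same logic as the report's first chunk, written with base R subsetting
d <- gwl_toy[gwl_toy$COUNTY_NAME == params$county, ]
n_yolo <- nrow(d)
range(d$MSMT_DATE, na.rm = TRUE)

# Swap the value to check another county before knitting anything
params$county <- "Sacramento"
d <- gwl_toy[gwl_toy$COUNTY_NAME == params$county, ]
n_sacramento <- nrow(d)

# Remove the stand-in so it cannot clash with params supplied at render time
rm(params)
```

Removing the stand-in matters here: rmarkdown::render() can refuse to run when an object named params already exists in the environment it knits in.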
Finally, we create a vector of counties we want to write reports for and iterate over them. We also need to specify the output location of each file. Since we are writing html_document reports, the file extension is .html. Using walk2() from our functional programming toolkit, we can pass the counties vector and the output file paths into rmarkdown::render() and silently write the files.
# unique counties to write reports for
counties <- unique(gwl$COUNTY_NAME)
# output file names
files_out <- tolower(counties) %>%
  str_replace_all(" ", "_") %>%
  paste0(., ".html")
# silently (walk) over the county names and file names,
# creating a report for each combination
walk2(
  counties,
  files_out,
  ~rmarkdown::render(
    input = "reports/gwl_report.Rmd",
    output_file = here("reports", .y),
    params = list(county = .x)
  )
)
Open and explore each of the files that were written.
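Before launching a long batch, a dry run that only builds and prints the county-to-file mapping can catch naming problems early. This sketch hard-codes the three county names used in this module and uses base R string functions in place of str_replace_all(), so it runs without loading any packages; the jobs data frame is an illustrative helper, not part of the original workflow.

```r
# County names used in this module, hard-coded for the dry run
counties <- c("Sacramento", "Yolo", "San Joaquin")

# Lowercase, replace spaces with underscores, and add the .html extension
files_out <- paste0(gsub(" ", "_", tolower(counties), fixed = TRUE), ".html")

# One row per report that the iteration would write
jobs <- data.frame(county = counties, output_file = files_out)
print(jobs)

# File names must be unique, otherwise one report would overwrite another
stopifnot(anyDuplicated(files_out) == 0)
```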
Pause and think
If we wanted to automate reports like this and have them published online or emailed to our team every morning at 7AM, what tools would we need?
Hint: see the automation module section on task schedulers.
R
Within an .Rmd we can insert R code in-line using the following syntax:
`r <expression>`
So for instance we can write a string like:
The mean of 1 and 3 is `r mean(c(1,3))`.
And when the document knits, we get: The mean of 1 and 3 is 2.
Consider this as an approach to add specific output about each site in the text narrative.
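To see inline evaluation without creating a file, the sketch below passes a one-line string to knitr::knit() through its text argument and prints the knitted result. It assumes {knitr} is installed (it is a dependency of {rmarkdown}); the county value is a made-up stand-in for a report parameter.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  county <- "Yolo" # made-up stand-in for a value such as params$county

  # Markdown source containing two inline R expressions
  src <- "The mean of 1 and 3 is `r mean(c(1, 3))`. This text is about `r county` county."

  # knit() evaluates the inline expressions and returns the resulting text
  knitted <- knitr::knit(text = src, quiet = TRUE)
  cat(knitted, "\n")
}
```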
Although we only demonstrated one type of output report in this module, the html_document, there are many other output formats that you can parameterize and iterate over, including Word documents, PDFs, flexdashboards, and presentations.
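As a hedged sketch of switching formats, rmarkdown::render() accepts an output_format argument, so the same parameterized source can be rendered to Word instead of HTML. The throwaway file demo_formats.Rmd, the output name demo_yolo.docx, and the county value are made up for illustration, and rendering is skipped when {rmarkdown} or pandoc is missing.

```r
# Hypothetical throwaway report; file names and values are illustrative only
rmd <- file.path(tempdir(), "demo_formats.Rmd")
writeLines(c(
  "---",
  "title: \"Demo report\"",
  "params:",
  "  county: \"placeholder\"",
  "---",
  "",
  "Groundwater levels in `r params$county` county."
), rmd)

# Render only if rmarkdown and pandoc are available on this machine
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (can_render) {
  # Same source and params mechanism, different output format
  out_docx <- rmarkdown::render(
    input = rmd,
    output_format = "word_document",
    output_file = "demo_yolo.docx",
    params = list(county = "Yolo"),
    quiet = TRUE
  )
  print(out_docx) # path of the rendered Word file
}
```

Interactive elements such as the mapview map or the plotly hydrograph in this module's template are HTML widgets, so a report aimed at Word or PDF would likely need static alternatives.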
To dig deeper, see the official R Markdown guide for parameterized reports.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".