Multiple ways to iterate in R

Learning objectives:

- `for` loops
- `lapply()` and the `{purrr}` family of functions
- the `list` object type

One of the wonderful things about R is that there are often many ways to achieve the same result. This makes R expressive, and as you practice and expand your R vocabulary, you'll inevitably find multiple ways to obtain the same results. Some ways may be long and meandering, and others may be more direct and easy to remember. In this module we will practice two approaches to iteration, the art of not repeating yourself: `for` loops¹ and functions (functional programming).
For most parallel problems (i.e., where the current operation does not depend on the one that came before), functional programming offers significant advantages over `for` loops. However, both loops and functions are powerful ways to automate processes and workflows. We will review how and when to use each, and their pros and cons.
The core objective of this lesson is to practice harnessing the power of iteration in your workflows, and to demonstrate ways to iterate with functional programming. If you can define a function to work with one object in R, functional programming allows you to scale that function to any number of objects (assuming you have enough computing resources). Together, functional programming and iteration allow automation at scale, and put R a cut above GUI-based data workflows that require "clicking through" an analysis.
In this module, let's imagine you're a data manager who needs to provide data to a group of users. We'll practice iteration on reading and writing data to illustrate a transition from `for` loops to functional programming.
`for` loops

"Don't repeat yourself. It's not only repetitive, it's redundant, and people have heard it before." -Lemony Snicket
The `for` loop is ubiquitous across programming languages and is a fundamental tool for obeying a core principle of programming: don't repeat yourself. The `for` loop is a good place to begin a discussion of iteration because it makes iteration very explicit: you can see exactly what takes place in each iteration, or loop.
A common problem we might face is reading multiple data frames into R. In the `/data/gwl` folder, we have station data for El Dorado, Placer, and Sacramento counties. We can read these in one by one by copying and pasting code, but then we're repeating ourselves. This may not matter for only 3 counties, but if we were to use all 58 counties with groundwater level data, this would be a non-scalable approach prone to human error.
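The copy-and-paste version might look something like this (a sketch; the exact csv file names under `data/gwl/county` are assumptions for illustration):

```r
library(readr)

# reading each file by copying and pasting: it works, but every
# new county means another line to edit by hand
eldorado   <- read_csv("data/gwl/county/El Dorado.csv")
placer     <- read_csv("data/gwl/county/Placer.csv")
sacramento <- read_csv("data/gwl/county/Sacramento.csv")
```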
We can replace this code with a `for` loop and read all of these items into a `list`².
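A minimal version of that loop, matching the full workflow shown later in this module:

```r
library(readr)

# vector of csv file paths to read
files_in <- fs::dir_ls("data/gwl/county")

# initialize an empty list with one slot per file
l <- vector("list", length = length(files_in))

# read each file into its corresponding list element
for (i in seq_along(l)) {
  l[[i]] <- read_csv(files_in[i])
}
```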
What happened above is that the loop evaluated first with 1 in place of `i`, and continued to `length(l)`, which is 3, each time substituting the integer wherever `i` appears in the loop body. Although our index is `i`, we can use any other unquoted character string, like `j`, `k`, or even `index`; its only purpose is to hold the index that we iterate through.
After the loop evaluates, we can access each element of the list with double bracket `[[` notation, subsetting either by an integer index, or by a name if one exists. This list doesn't have names, but we could set them by assigning a vector of names to `names(l)`.
```r
# access the first list element - the El Dorado county dataframe
l[[1]]
```
`for` loops happen sequentially, and they require us to think of code in terms of objects that we iterate through, index by index. This can result in slower and more verbose (duplicated) code. In addition to writing efficient code, a core rule of good programming is to not repeat oneself.
Let's imagine now that you needed to write each of these data to a separate file, split by the unique ID of each station (e.g., the `SITE_CODE`). Your entire loop-based workflow would look like this:
```r
# initialize a list of defined length
l <- vector("list", length = length(files_in))

# loop over all files and read them into each element of the list
for (i in seq_along(l)) {
  l[[i]] <- read_csv(files_in[i])
}

# combine all list elements into a single dataframe,
# then split into another list by SITE_CODE
ldf <- bind_rows(l)
ldf <- split(ldf, ldf$SITE_CODE)

# create the output directory
fs::dir_create("data/gwl/site_code")

# build a vector of output file names ( names(ldf) is a vector )
files_out <- glue::glue("data/gwl/site_code/{names(ldf)}.csv")

# loop over each list element and write a csv file
for (i in seq_along(ldf)) {
  write_csv(ldf[[i]], files_out[i])
}
```
This is a lot of code for a relatively standard iterative workflow: reading in a directory of files, combining them, and writing them back out. Functional programming can simplify loop-based workflows, run faster, and encourage you to think conceptually about the transformations at play, without worrying about tracking a changing index. Moreover, loops in R are usually unnecessary unless the result of the *i*th iteration depends on the (*i*-1)th (previous) iteration.
lapply()
Base R gives us a toolkit for functional programming via the `apply` family of functions, specifically `lapply()` and `mapply()`. The "l" in `lapply()` stands for "`list`", and it can be read as "list apply". The `apply` functions are designed to iterate over lists, matrices, rows, or columns, and they are very flexible. For example, we can simplify the `for` loops above as:
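A sketch of that simplification, assuming the `files_in` and `files_out` vectors from the loop workflow above:

```r
library(readr)
library(dplyr)

# read all files: lapply() returns a list, no need to pre-allocate one
l <- lapply(files_in, read_csv)

# combine and split by station, as before
ldf <- bind_rows(l)
ldf <- split(ldf, ldf$SITE_CODE)

# write one csv per station: mapply() iterates over two inputs
# in parallel, here using an anonymous function
mapply(function(df, file) write_csv(df, file), ldf, files_out)
```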
We can make this clearer by extracting the anonymous function³ above, assigning it to an identifier, and calling that identifier in `mapply()`:
Notice that we don't need to initialize a list to store the output, or keep track of indices. The emphasis is on the transformation taking place, not on index bookkeeping or setting up looping patterns. The result is the same, and we have a much clearer way to approach the problem.
map()
The `map` family of functions in the `{purrr}` package improves on base R's `apply` functions with simplified syntax, type-specific output that makes it harder to accidentally create errors, and convenience functions for common operations. When we combine `map()` with pipes (`%>%`), we can greatly simplify our code.
First, to mimic what we did with `lapply()` above:
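A sketch, assuming the `files_in` vector from before:

```r
library(purrr)
library(readr)

# map() iterates read_csv() over each file path and returns a
# list, just like lapply(); the ~ begins an anonymous function
# and .x stands in for each element of files_in
l <- map(files_in, ~ read_csv(.x))
```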
The `.x` signifies each of the individual elements passed into the function, and can be thought of as a placeholder for the input. We begin `read_csv()` with a `~` to indicate the beginning of an anonymous function.
We can further simplify our code with `map_df()`, which automatically row-binds the list elements into one dataframe. Moreover, we can pipe these statements together to avoid creating the intermediate `ldf` object.
Let's break down what we did above:

1. We mapped `read_csv()` over the vector of file paths, `files_in`. Although this function is simple, in practice we can map a large and complex function in the same way. The `_df` in `map_df()` means we pass the list otherwise returned by `map()` into `bind_rows()`, and thus return one combined dataframe of all the csv files we read instead of a list of dataframes.
2. We used `group_split()` to split the combined dataframe by `SITE_CODE`, which returns a list of dataframes ordered by the unique values of the grouping variable. This is identical to base R's `split()` except that it doesn't return a named list.
3. We `walk2()`ed over the list of dataframes (one for each `SITE_CODE`) and the `files_out` vector from above, and wrote a csv for each pair of objects (`.x` = dataframe and `.y` = output file path).

Extra information
What would happen if we used `map2()` instead of `walk2()` in the code above?
`walk2()` is a special case of `walk()` that takes 2 vector inputs instead of 1. The first and second vector inputs are referenced in the function with `.x` and `.y`, like so: `walk2(input_1, input_2, ~ some_function(.x, .y))`. We use `walk()` instead of `map()` whenever we want only the side effect of the function, like writing a file. We could also use `map()` here, but it would unnecessarily print each of the 723 dataframes.
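You can see the difference with a toy example: `map()` returns a list of results (which prints at the console), while `walk()` calls the function only for its side effects and invisibly returns its input:

```r
library(purrr)

# map() returns a list of results (and prints it at the console)
res_map <- map(1:3, ~ .x * 2)

# walk() calls the function only for its side effects and
# invisibly returns its input, so nothing is printed
res_walk <- walk(1:3, ~ .x * 2)

identical(res_map, list(2, 4, 6))  # TRUE
identical(res_walk, 1:3)           # TRUE: walk() passes its input through
```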
You may now be wondering if there is a `map3()`, `map4()`, and so on. To map over more than 2 inputs at once, check out the "parallel" map function, `purrr::pmap()`, which is similar to base R's `mapply()`.
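For example, `pmap()` takes a list of input vectors and iterates over them in parallel (a toy example):

```r
library(purrr)

# pmap() iterates over any number of vectors in parallel;
# here we sum across three vectors, element by element
pmap(list(1:2, 3:4, 5:6), sum)
# returns list(9, 12): sum(1, 3, 5) and sum(2, 4, 6)
```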
You may notice that we used an intermediate object (`files_out`) from above. We could re-write our chain without this object by creating the output paths within the function call:
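One way to do this (a sketch; the output path template mirrors the one used earlier, and each group's single `SITE_CODE` value names its own file):

```r
library(tidyverse)

map_df(files_in, ~ read_csv(.x)) %>%
  group_split(SITE_CODE) %>%
  walk(~ write_csv(
    .x,
    glue::glue("data/gwl/site_code/{unique(.x$SITE_CODE)}.csv")
  ))
```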
The `~`, `.x`, and `.y` syntax may seem confusing at first, but with some practice it becomes easier, and it provides a consistent syntax to express complex ideas about your code. More importantly, the emphasis is on keeping track of functions rather than creating and managing the scaffolding of `for` loops.
Taking things one step further, imagine you were asked to:

- filter to `WELL_USE` types equal to "Observation"
- convert `WELL_DEPTH` from feet to meters

We could capture these transform steps in a function, store it in our `/functions` folder as described in the project management module, and in this way clean up our workspace so we can keep track of the functions applied to our data and keep our scripts short and readable.
Store this in a file `/scripts/functions/f_import_clean_export.R`:
```r
# import dataframe, filter to observation wells, convert well
# depth from feet to meters, project to EPSG 3310, & export the data
f_import_clean_export <- function(file_in, file_out){
  read_csv(file_in) %>%
    filter(WELL_USE == "Observation") %>%
    mutate(well_depth_m = WELL_DEPTH * 0.3048) %>%
    sf::st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4269) %>%
    sf::st_transform(3310) %>%
    sf::st_write(file_out, delete_layer = TRUE)
}
```
Now we can `walk()` inputs over this function with ease in our main script, without keeping track of loops and indices, or even the function internals, which are neatly stored in their own file.
```r
source("scripts/functions/f_import_clean_export.R")

# create a directory to store results
fs::dir_create("results")

# vectors with function args: input (.x) & output (.y) files
files_in  <- fs::dir_ls("data/gwl/county")
files_out <- here("results", str_replace_all(basename(files_in), ".csv", ".shp"))

walk2(files_in, files_out, ~ f_import_clean_export(.x, .y))
```
Now we can check the `/results` directory to verify that it is populated with the shapefiles we just wrote.
Iteration is a core skill that will allow you to scale your workflows from small to large while maintaining reproducibility. Combined with a proficiency in functional programming, you will more easily develop and store functions, declutter your workspace to focus on important transformations, streamline tracking down and fixing bugs, and keep track of data pipelines. Your ability to perform and re-perform arbitrarily complex workflows will exponentially increase.
We recommend the following places to start to learn more about functional programming in R:
Lesson adapted from R for Data Science.
¹ `for` loops are an example of imperative programming, which differs from functional programming in that it emphasizes the steps taken to change the state of a computer rather than composing and applying functions.↩︎
² For a review of R's `list` data structure, see the data structures module and Hadley Wickham's excellent "R for Data Science" chapter on vectors.↩︎
³ An anonymous function is a function that is not assigned to an identifier. In other words, it is created and used but doesn't exist in the Global Environment as a function you can call by name. The benefit of using anonymous functions is that they allow you to quickly write single-use functions without storing and managing them. For example, in the expression `lapply(1:5, function(x) print(x))`, the anonymous function is `function(x) print(x)`. To make it a regular, named function, we assign it to an identifier, like `print_value <- function(x) print(x)`. Then it can be called by name like so: `lapply(1:5, print_value)`.↩︎
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/r4wrds/r4wrds, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".