You love {purrr}, but have you tried speeding up things {furrr}-ther?

Roberto Villegas-Diaz
Data Manager @ University of Liverpool

Aim

  • Convince the audience that {furrr} is not (too) scary.

When a regular cat is not enough, you get the cat a wig?


In reality:

“The goal of {furrr} is to combine {purrr}’s family of mapping functions with {future}’s parallel processing capabilities.”

What are you {purrr}-ing about?

Shout out to Tom Smith @ Nottingham University Hospitals NHS Trust:

Learning to purrr

Hello {furrr}!

purrr::map(c("hello", "{purrr}!"), ~.x)
[[1]]
[1] "hello"

[[2]]
[1] "{purrr}!"
furrr::future_map(c("hello", "{furrr}!"), ~.x)
[[1]]
[1] "hello"

[[2]]
[1] "{furrr}!"

Note: Replacing a map function with its future_map equivalent does not auto-magically parallelise your code! 🥲

First steps

# Set a "plan" for how the code should run.
future::plan(future::multisession, workers = 2)

# This does run in parallel!
furrr::future_map(c("hello", "{purrr}!"), ~.x)
[[1]]
[1] "hello"

[[2]]
[1] "{purrr}!"
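
One way to check that a plan actually took effect is to compare process IDs. The sketch below is not from the talk: it assumes {future} and {furrr} are installed and uses base R's Sys.getpid() to report which R process handled each element.

```r
main_pid <- Sys.getpid()

# Under the sequential plan (the default), every element runs in this process.
future::plan(future::sequential)
seq_pids <- furrr::future_map_int(1:2, ~ Sys.getpid())
all(seq_pids == main_pid)

# Under multisession, elements run in background R sessions with other PIDs.
future::plan(future::multisession, workers = 2)
par_pids <- furrr::future_map_int(1:2, ~ Sys.getpid())
any(par_pids == main_pid)

# Return to sequential so later code is unaffected.
future::plan(future::sequential)
```

If the second check reports that a worker shares the main PID, the plan was probably not set before the call.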

Other functions:

future_imap(), future_imap_chr(), …,
future_map2(), future_map2_chr(), …,
future_walk(), future_map_chr(), …, and more.

Reference: https://furrr.futureverse.org/reference
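
A brief sketch of how a few of these variants behave, mirroring their {purrr} counterparts. The `dishes` and `minutes` vectors are made-up demo data, not part of the talk.

```r
future::plan(future::multisession, workers = 2)

dishes  <- c(starter = "soup", dessert = "cake")  # hypothetical demo data
minutes <- c(20, 45)

# future_map_chr(): like future_map(), but returns a character vector.
shouted <- furrr::future_map_chr(dishes, toupper)

# future_map2_chr(): walk over two inputs element by element.
labels <- furrr::future_map2_chr(dishes, minutes, ~ paste0(.x, ": ", .y, " min"))

# future_imap_chr(): `.x` is the value, `.y` is its name (or index if unnamed).
named <- furrr::future_imap_chr(dishes, ~ paste0(.y, " = ", .x))

# future_walk(): called for its side effects; returns its input invisibly.
furrr::future_walk(dishes, ~ message("Serving ", .x))

future::plan(future::sequential)
```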

future::planning

  • sequential: uses the current R process
  • multisession: uses separate R sessions
  • multicore: uses separate forked R processes
  • cluster: uses separate R sessions on one or more machines

Reference: https://future.futureverse.org/reference/plan.html
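
A small sketch of switching between plans and inspecting the result, assuming {future} is installed. Note that multicore relies on forking, so it is not available on Windows and is disabled by default inside RStudio; future::supportsMulticore() reports whether it can be used in the current session.

```r
# plan() returns the previous plan invisibly, so it can be restored later.
old_plan <- future::plan(future::multisession, workers = 2)

# How many workers does the current plan provide?
n_workers <- future::nbrOfWorkers()
n_workers

# Can this session use forked processes (the multicore plan)?
can_fork <- future::supportsMulticore()
can_fork

# Restore whatever plan was active before.
future::plan(old_plan)
```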

For testing at home:

future::plan("sequential")
demo("mandelbrot", package = "future", ask = FALSE)


future::plan("multisession")
demo("mandelbrot", package = "future", ask = FALSE)

Another example

bake <- function(dish) {
  Sys.sleep(2)                      # do your thing
  return(paste0(dish, ": baked!"))  # result
}

# demo inputs
cakes <- c("Cake 1", "Cake 2", "Cake 3", "Cake 4")

Sequential:

future::plan(future::sequential)
tictoc::tic()
seq_res <- furrr::future_map(cakes, bake)
tictoc::toc()
8.029 sec elapsed

Multisession:

future::plan(future::multisession, workers = 4)
tictoc::tic()
multi_sesh_res <- furrr::future_map(cakes, bake)
tictoc::toc()
2.558 sec elapsed
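
These timings follow from simple arithmetic. As a rough sketch, assuming every task takes the same time and ignoring the cost of starting workers and moving data, the best-case elapsed time is the number of "rounds" needed times the task duration. The helper name `ideal_elapsed()` is hypothetical.

```r
# Best-case elapsed time: tasks are processed in rounds of `workers` at a time.
ideal_elapsed <- function(n_tasks, task_secs, workers) {
  ceiling(n_tasks / workers) * task_secs
}

ideal_elapsed(4, 2, workers = 1)  # 8
ideal_elapsed(4, 2, workers = 4)  # 2
ideal_elapsed(4, 2, workers = 2)  # 4
```

The measured 2.558 seconds sits about half a second above the 2-second ideal; that gap is the overhead of dispatching work to the background sessions.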

Useful commands/tips

  • To find the available CPUs (i.e., max number of workers for the plan function):

    future::availableCores()

  • To add a progress bar, include .progress = TRUE in the function call:

    furrr::future_map(x, fx, .progress = TRUE)

    ⚠️ The {furrr} documentation warns that .progress will be deprecated in favour of the {progressr} framework.
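
A minimal sketch of progress reporting with {progressr}, assuming that package is installed alongside {furrr}. `slow_square()` is a hypothetical stand-in for real work; the progressor `p` is called once per element to signal that a step has finished.

```r
slow_square <- function(x) {  # hypothetical stand-in for real work
  Sys.sleep(0.1)
  x^2
}
xs <- 1:10

future::plan(future::multisession, workers = 2)

res <- progressr::with_progress({
  # One progress step per element of `xs`.
  p <- progressr::progressor(steps = length(xs))
  furrr::future_map_dbl(xs, function(x) {
    p()             # signal that one step is done
    slow_square(x)
  })
})

future::plan(future::sequential)
```

In a non-interactive session the bar may not be displayed, but the results are computed all the same.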

A “real world” example (1)

Imagine we want to compute some spatial indicator X at UPRN (Unique Property Reference Number) level. How long will that take?

Some UPRN stats:

  • ONS-UPRN directory: 4,564,476 [in North West England]
  • NHS Cheshire & Merseyside ICB: 1,568,275

UPRNs are available under the Open Government Licence (OGL) from the Ordnance Survey Data Hub.

A “real world” example (2)

A “real world” example (3)

access_to_green_spaces <- function(uprn, ...) {
  Sys.sleep(1E-3) # do your thing
  return(uprn)    # result
}

# Load datasets derived with the R/uprn_example.R script
ons_uprn_nw_cm_icb <- 
  readr::read_rds("../data/ons_uprn_nw_cm_icb.Rds")
sub_icb_boundaries_cm <- 
  readr::read_rds("../data/sub_icb_boundaries_cm.Rds")

Code: R/uprn_example.R

A “real world” example (4)

Sequential:

future::plan(future::sequential)
tictoc::tic()
seq_res <- ons_uprn_nw_cm_icb |>
  furrr::future_pmap(access_to_green_spaces)
tictoc::toc()
2672.98 sec elapsed

Multisession:

future::plan(future::multisession, workers = 8)
tictoc::tic()
multi_sesh_res <- ons_uprn_nw_cm_icb |>
  furrr::future_pmap(access_to_green_spaces)
tictoc::toc()
292.05 sec elapsed
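
For readers new to the pmap family: when given a data frame, future_pmap() calls the function once per row and passes each column as a named argument. That is why access_to_green_spaces() takes `uprn` plus `...` to absorb the remaining columns. The sketch below illustrates this on a made-up three-row table; the column names `easting` and `northing` and the helper `echo_uprn()` are hypothetical.

```r
# Hypothetical miniature of the UPRN table: one row per property.
toy_uprns <- data.frame(
  uprn     = c(1001, 1002, 1003),
  easting  = c(335000, 336000, 337000),
  northing = c(390000, 391000, 392000)
)

# Same shape as access_to_green_spaces(): use `uprn`, ignore other columns.
echo_uprn <- function(uprn, ...) uprn

future::plan(future::multisession, workers = 2)
toy_res <- furrr::future_pmap_dbl(toy_uprns, echo_uprn)
toy_res
future::plan(future::sequential)
```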

Common pitfalls: Argument evaluation

# setup
x <- rep(0, 3)
plus <- function(x, y) x + y
options <- furrr::furrr_options(seed = 123)

# set execution plan
future::plan(future::multisession, workers = 2)

# `runif(1)` is evaluated once, before mapping, so every element gets the same `y`
furrr::future_map_dbl(x, plus, runif(1))
[1] 0.2080856 0.2080856 0.2080856


# draw a new `y` for each element (seeded for reproducibility)
furrr::future_map_dbl(x, ~ plus(.x, runif(1)), .options = options)
[1] 0.1552317 0.4877356 0.5330014
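
A short sketch of what the `seed` option is for, assuming the behaviour described in the {furrr} documentation: with a fixed integer seed, repeated calls should yield the same random draws, and the result should not depend on the plan in use. The name `opts` is used here to avoid masking base R's options().

```r
future::plan(future::multisession, workers = 2)
opts <- furrr::furrr_options(seed = 123)

# Two identical calls with the same seed.
first  <- furrr::future_map_dbl(1:3, ~ runif(1), .options = opts)
second <- furrr::future_map_dbl(1:3, ~ runif(1), .options = opts)
identical(first, second)

# Same seed under a different plan.
future::plan(future::sequential)
third <- furrr::future_map_dbl(1:3, ~ runif(1), .options = opts)
identical(first, third)
```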

Common pitfalls: Function environments and large objects (1)

# setup
future::plan(future::multisession, workers = 2)
my_fast_fn <- function() {
  furrr::future_map(1:5, ~.x)
}
my_slow_fn <- function() {
  # Massive object - but we don't want it in `.f`
  big <- 1:1e8 + 0L
  furrr::future_map(1:5, ~.x)
}
system.time(my_fast_fn())
   user  system elapsed 
  0.024   0.001   0.282 
system.time(my_slow_fn())
   user  system elapsed 
  0.342   0.502   1.191 

Common pitfalls: Function environments and large objects (2)

A possible solution: instead of creating an anonymous function in the same environment as the “large” object, define the function separately:

# setup
future::plan(future::multisession, workers = 2)
fn <- function(x) {
  x
}
my_not_so_slow_fn <- function() {
  big <- 1:1e8 + 0L
  
  furrr::future_map(1:5, fn)
}
system.time(my_not_so_slow_fn())
   user  system elapsed 
  0.297   0.055   0.590 
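
Why this helps can be checked in base R, without any parallel code. A function created inside another function keeps a reference to that function's environment, and when R serialises the function to send it to a worker, that environment travels with it. A function defined at the top level does not carry `big` along. The names `make_inner()`, `inner` and `standalone` below are hypothetical, and `big` is kept tiny for illustration.

```r
make_inner <- function() {
  big <- seq_len(10)  # stand-in for a massive object
  function(x) x       # created here, so its environment holds `big`
}
inner <- make_inner()

standalone <- function(x) x  # created at the top level

# Does each function's enclosing environment contain `big`?
"big" %in% ls(environment(inner))
"big" %in% ls(environment(standalone))
```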

Thank you!

r.villegas-diaz@liverpool.ac.uk