Writing Reusable and Modular Code

class: inverse, center, middle

# Writing Reusable and Modular Code
### Natya Hans
### Academic Research Consulting and Services, University of Florida
### (updated: 2024-02-16)

---
class: inverse, center, middle

## "Any fool can write code that a computer can understand. Good programmers write code that humans can understand."

from "Refactoring: Improving the Design of Existing Code" by Martin Fowler

---
# Motivations

* Beyond **working correctly**, code should ideally also be:
  - easy to read and understand
  - easy to maintain or change
  - easy to verify (did it do what was expected?)
  - aesthetically pleasing (optionally)

---
# Learning Outcomes

By the end of the workshop, participants will be able to:

* organize programming tasks into modular functions
* communicate code intent using comments
* recognize and fix basic code smells

---
# A Note

* These concepts are universal, but code examples are in **`R`**.
* Like any other skill, **_effective_** practice matters!
* **Refactoring** - rewriting code without changing its functionality, to make it more readable and easier to maintain.
  - Similar to revising a paper for clarity.

---
class: inverse, center, middle

# Breaking Code into Functions

---
# What are functions?

* **Functions** let you refer to another piece of code by a (hopefully informative!) name

```r
  mean()
  # computes arithmetic mean of input
```

* You can write your own functions, too!

```r
celsius_to_fahrenheit <- function(x) {
  9/5 * x + 32
}
```
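
As a quick illustration (the input values are made up), the same function can be reused on any numeric input, including whole vectors:

```r
celsius_to_fahrenheit <- function(x) {
  9/5 * x + 32
}

# reuse the same function on a single value ...
celsius_to_fahrenheit(100)

# ... or on a whole vector (arithmetic in R is vectorized)
celsius_to_fahrenheit(c(-40, 0, 37))
```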

---
# Why write your own functions?

* If my script already works, writing a function takes time...

--
* Functions enable you to:
  - perform the same task on different inputs (e.g. new data, parameter values, etc.)
  - isolate code that is known to work, so it can be tested once and reused
  - organize your analysis for readability

---
# You can repeat code ...

```r
df <- data.frame( a = rnorm(10),
 b = rnorm(10),
 c = rnorm(10))

# rescale all the columns to [0, 1]
df$a <- (df$a - min(df$a)) / 
 (max(df$a) - min(df$a))
df$b <- (df$b - min(df$b)) / 
 (max(df$b) - min(df$a)) # copy-paste slip: should be min(df$b)
df$c <- (df$c - min(df$c)) / 
 (max(df$c) - min(df$c))
```

---
# Or define a function!

```r
rescale_to_01 <- function(x) {
 (x - min(x)) / (max(x) - min(x))
}

# rescale the columns of df to [0, 1]
df$a <- rescale_to_01(df$a)
df$b <- rescale_to_01(df$b)
df$c <- rescale_to_01(df$c)

# or with dplyr
library(dplyr)
df <- df %>% mutate(
  across(c("a", "b", "c"), rescale_to_01))
```

---
# The **DRY** Principle: "Don't Repeat Yourself"

* Changing the calculation only needs to be done once (in the function definition).
* The function name helps us find where to change the code.
* A good function name helps someone understand what the code does without seeing its implementation.
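
A minimal sketch of the first point, assuming we later decide that missing values should be ignored (the `na.rm = TRUE` choice is an illustrative assumption, not part of the original example). The fix goes in one place, and every call picks it up:

```r
rescale_to_01 <- function(x) {
  # range() returns c(min, max); na.rm = TRUE ignores missing values
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# the calling code does not change
rescale_to_01(c(2, 4, NA, 10))
```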

---
# Workflow structure

.tiny[modified from "Reproducible research best practices @JupyterCon" (version 18) by Rachael Tatman, https://www.kaggle.com/rtatman/reproducible-research-best-practices-jupytercon]

---
# Example Code

.small[
<pre>
library(tidyverse)

source("analysis_functions.R")
source("plotting_functions.R")

data_raw <- readRDS("readings.dat")
data_proc <- preprocess_data(data_raw)

fitted_model <- run_model(data_proc)

plot_model_forecasts(fitted_model,
 "figure-1_abundance-forecasts.pdf")
plot_abundance_residuals(fitted_model,
 "figure-2_abundance-residuals.pdf")
plot_abundance_histogram(data_proc,
 "figure-3_abundance-histogram.pdf")
</pre>
]

---
# Notes

* The code matches the steps of the analysis.
* It is easy to find where to make changes:
  - to modify a plot - edit the function
  - to add a new figure - write a new function and add it to the workflow script
* Possible improvement: 
  - save the function outputs to files, so that slow steps do not need to be re-run
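
One possible sketch of this improvement, using base R's `saveRDS()` and `readRDS()`. The helper name `run_cached()` and the file name are hypothetical, and the `lm()` call is only a stand-in for a slow step such as `run_model()`:

```r
# hypothetical helper: compute a result once, then reuse the saved copy
run_cached <- function(cache_file, compute) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- compute()
  saveRDS(result, cache_file)
  result
}

cache_file <- file.path(tempdir(), "fitted_model.rds")

# stand-in for a slow model fit
fitted_model <- run_cached(cache_file,
                           function() lm(mpg ~ wt, data = mtcars))
```

Delete the cache file to force a re-run after the inputs or the code change.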
  
---
# Tips for writing functions

* name things well
* plan for (some) flexibility
* split large, complex tasks into smaller units, each with a clear purpose
* use data structures to store complex objects, as needed

---
# Tip 1: Naming Things

Function names should be verbs

```r
# bad
row_adder()
permutation()

# good
add_row()
permute()
```

.small[examples from https://style.tidyverse.org/functions.html#naming]
---
# Tip 2: Plan for flexibility

```r
plot_abundance_histogram <- 
  function(data_proc, filename, 
           width = 6, height = 6) { 
    # {{code}} 
  }
```
* `data_proc` and `filename` are required inputs
* `width` and `height` are adjustable, but have default values that work
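
A minimal sketch of how this could look with base graphics. The function body, the `abundance` column, and the data are assumptions for illustration; the original `{{code}}` is not shown.

```r
plot_abundance_histogram <- function(data_proc, filename,
                                     width = 6, height = 6) {
  # assumes data_proc has a numeric column named `abundance`
  pdf(filename, width = width, height = height)
  on.exit(dev.off())  # close the PDF device even if plotting fails
  hist(data_proc$abundance, main = "Abundance", xlab = "abundance")
  invisible(filename)
}

# made-up data for demonstration
data_proc <- data.frame(abundance = rpois(100, lambda = 5))

# use the defaults ...
default_file <- file.path(tempdir(), "hist.pdf")
plot_abundance_histogram(data_proc, default_file)

# ... or override them by name for a wide figure
wide_file <- file.path(tempdir(), "hist-wide.pdf")
plot_abundance_histogram(data_proc, wide_file, width = 10, height = 4)
```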

---
# Tip 3: How to subdivide

1. start with the main goal of your program
2. split the goal into separate subgoals
  - e.g. read in data, clean data, fit model, do statistics, make plots
3. repeat step 2 - continue subdividing until you reach individual tasks
  - write a **function for each task**
  - choose initial implementation details to get things working

---
# Notes on Tip 3

* Get a simple version working first!
  - You can always make changes to the details and/or add flexibility later.
* Each function should have a single well-defined task
* Functions should ideally be 50 lines or fewer
  - this is a **guideline**: divide work into functions sensibly!

---
# More Notes on Tip 3

* If a line or set of lines of code is complicated, a new function might be needed

```r
# bad
if (class(x)[[1]] == "numeric" || 
    class(x)[[1]] == "integer")

# good
if (is.numeric(x))
```
.small[examples from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=36]
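
When no built-in helper exists, a small named function can play the same role. A sketch, with the hypothetical name `is_probability()`:

```r
# hypothetical helper: is x a single, non-missing number in [0, 1]?
is_probability <- function(x) {
  is.numeric(x) && length(x) == 1 && !is.na(x) && x >= 0 && x <= 1
}

p <- 0.25
if (is_probability(p)) {
  message("p is a valid probability")
}
```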

---
# Tip 4: Use data structures

Most programming languages let you create data structures to store complex data.
* e.g. in **`R`**, you can create a list to include data, settings, and results. This list can be returned from a function:

```r
list(data = mtcars, 
     var_1 = "mpg", 
     var_2 = "cyl", 
     rho = cor(mtcars$mpg, mtcars$cyl))
```
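
For instance, a function can bundle its inputs and result into one list that the caller unpacks by name. The function name `compute_correlation()` is a hypothetical choice for this sketch:

```r
compute_correlation <- function(df, var_1, var_2) {
  list(data = df,
       var_1 = var_1,
       var_2 = var_2,
       rho = cor(df[[var_1]], df[[var_2]]))
}

result <- compute_correlation(mtcars, "mpg", "cyl")

# access elements by name
result$rho
names(result)
```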

---
class: inverse, center, middle

# Comments

---

<blockquote class="twitter-tweet">Developers commenting their code <a href="https://t.co/jKURCVR9ds">pic.twitter.com/jKURCVR9ds</a>&mdash; Ricardo Ferreira (@riferrei) <a href="https://twitter.com/riferrei/status/1358180071081713670?ref_src=twsrc%5Etfw">February 6, 2021</a></blockquote>

---
.small[(via "Notes on Programming in C", Rob Pike)]

There is a famously bad comment style:

```c
i=i+1;             /* Add one to i */
```

and there are worse ways to do it:

```c
/**********************************
 *                                *
 * Add one to i                   *
 *                                *
 **********************************/

i=i+1;
```

Don’t laugh now, wait until you see it in real life.

---
# Comment Dos & Don'ts

* **Do** use comments to describe **why**, not **how**
* **Do** document the inputs, outputs, and purpose of each function
* **Do** store notes/references in comments
* **Don't** turn code off and on with comments

---
# Comment Dos & Don'ts

**Do** use comments to describe **why**, not **how**

Bad:

```r
# Set lib as (1, [2/3 * n])
lib <- c(1, floor(2/3 * n))
```

Good:

```r
# set aside 2/3 of data to train model
lib <- c(1, floor(2/3 * n))
```

---
# Comment Dos & Don'ts

**Do** document the inputs, outputs, and purpose of each function

```r
plot_abundance_histogram <- 
  function(data_proc, filename, 
           width = 6, height = 6) { 
    # {{code}} 
  }
```
* What kind of object is `data_proc`? What fields/columns are used to make the plot?
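
One common way to answer such questions in R is a roxygen2-style comment block above the function. A sketch for the earlier `rescale_to_01()` function (the wording of the documentation is an assumption about its intended use):

```r
#' Rescale a numeric vector to the range [0, 1]
#'
#' @param x numeric vector with at least two distinct values
#' @return numeric vector of the same length as `x`, where the minimum
#'   of `x` maps to 0 and the maximum maps to 1
rescale_to_01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
```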

---
# Comment Dos & Don'ts

**Do** store notes/references in comments

```r
# Cholesky algorithm from Rasmussen & 
# Williams (2006, algorithm 2.1)
R <- chol(Sigma)
alpha <- backsolve(R, forwardsolve(t(R), y_lib - mean_y))
L_inv <- forwardsolve(t(R), diag(NROW(x_lib)))
Sigma_inv <- t(L_inv) %*% L_inv
```

---
# Comment Dos & Don'ts

**Don't** turn code off and on with comments. You will not remember why it was commented out (was it buggy? did you want to test something?)

```r
# cat("The value of x is", x)
```

- use conditional logic instead

```r
if (DEBUG_MODE) {
  cat("The value of x is", x)
}
```

Set `DEBUG_MODE` at the top of the script.
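
A sketch of the full pattern (the value of `x` is made up):

```r
# set once, at the top of the script
DEBUG_MODE <- FALSE

x <- 42
if (DEBUG_MODE) {
  cat("The value of x is", x, "\n")
}
```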

---
class: inverse, center, middle

# Code Smells

---
# What are "code smells"?

* Code smells: characteristics of code that hint at a deeper problem
  - code smells do not necessarily mean incorrect code, but are signs that the code may be more prone to errors
* recognizing code smells is a skill
  - depends on experience, and personal taste
* use a good style guide to avoid common issues!

---
# Some common code smells

* functions that are too long
* lots of single-letter variable names
* too much indentation
* no indentation
* duplicated (or near-identical) code fragments
* magic numbers

---
# TMI = too much indentation

.tiny[
<pre>
get_some_data <- function(config, outfile) {
  if (config_ok(config)) {
    if (can_write(outfile)) {
      if (can_open_network_connection(config)) {
        data <- parse_something_from_network()
        if(makes_sense(data)) {
          data <- beautify(data)
          write_it(data, outfile)
          return(TRUE)
        } else {
          return(FALSE)
        }
      } else {
        stop("Can't access network")
      }
    } else {
      \# uhm. What was this else for again?
    }
  } else {
    \# maybe, some bad news about ... the config? 
  }
}
</pre>
]
.small[example from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=42]

---
# Simplify the logic!

.tiny[
<pre>
get_some_data <- function(config, outfile) {
  if (config_bad(config)) {
    stop("Bad config")
  }
  if (!can_write(outfile)) {
    stop("Can't write outfile")
  }
  if (!can_open_network_connection(config)) {
    stop("Can't access network")
  }
 
  data <- parse_something_from_network()
  if(!makes_sense(data)) {
    return(FALSE)
  }
  data <- beautify(data)
  write_it(data, outfile)
  TRUE
}
</pre>
]
.small[example from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=43]

---
# Magic Numbers

**Magic numbers** are literal values in the code whose meaning has to be inferred from context.

* confusing - what do they represent?
* brittle - the value may need to change later, and every copy has to be found and updated
* can usually be replaced with named constants
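
For example, the `2/3` in the earlier train/test split is a magic number. A sketch with a named constant (the name `TRAIN_FRACTION` and the value of `n` are made up for illustration):

```r
# fraction of the data set aside to train the model
TRAIN_FRACTION <- 2/3

n <- 100
lib <- c(1, floor(TRAIN_FRACTION * n))
lib
```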

---
# Example

```r
cat("The correlation between mpg and cyl is", 
    cor(mtcars[,1], mtcars[,2]))
```

```
The correlation between mpg and cyl is -0.852162
```
--
* What if we insert a column or change the order of columns?
* What if we want to change the variables?
  - is the text label still correct?

---
# Example (improved)

* define variables to replace the "magic numbers"

```r
var_1 <- "mpg"
var_2 <- "cyl"
cat("The correlation between", var_1, 
    "and", var_2, "is", 
    cor(mtcars[,var_1], mtcars[,var_2]))
```

```
The correlation between mpg and cyl is -0.852162
```

---
# Example (as a function)

* We can even turn it into a function!

```r
print_correlation <- function(df = mtcars, var_1 = "mpg", 
                              var_2 = "cyl") {
  cat("The correlation between", var_1, 
      "and", var_2, "is", 
      cor(df[,var_1], df[,var_2]))
}

print_correlation()
```

```
The correlation between mpg and cyl is -0.852162
```
---
# General Strategies

* Embrace **refactoring**: rewriting code without changing its behavior (e.g. to make it faster, cleaner, or easier to use)
* follow a coding **style guide**
  - links in the [resources page](https://uf-repro.github.io/writing-reusable-code/resources.html)
* use technological tools like automatic indentation and linters
  - many modern IDEs (e.g. RStudio) have these built-in

---
# Thanks
* Let me know what content you'd like to see
* Contact me for additional questions or consultation requests!
* Email: nhans@ufl.edu
* Check back in on the libguide for more modules and contact info:
  - https://guides.uflib.ufl.edu/reproducibility
* Original slides courtesy of Hao Ye