class: inverse, center, middle

# Writing Reusable and Modular Code
### Natya Hans
### Academic Research Consulting and Services, University of Florida
### (updated: 2024-02-16)

---
class: inverse, center, middle

## "Any fool can write code that a computer can understand. <br />Good programmers write code that humans can understand."

<p style="text-align:right">from "Refactoring: Improving the Design of Existing Code" by Martin Fowler</p>

---
# Motivations

* In addition to **working correctly**, code should ideally also be:
  - easy to read and understand
  - easy to maintain or change
  - easy to verify (did it do what was expected?)
  - aesthetically pleasing (optionally)

---
# Learning Outcomes

By the end of the workshop, participants will be able to:
* organize programming tasks into modular functions
* communicate code intent using comments
* recognize and fix basic code smells

---
# A Note

* These concepts are universal, but code examples are in **`R`**.
* Like any other skill, **_effective_** practice matters!
* **Refactoring** - rewriting code without changing its functionality, but making it more readable and easier to maintain.
  - Similar to revising a paper for clarity.

---
class: inverse, center, middle

# Breaking Code into Functions

---
# What are functions?

* **Functions** let you refer to another piece of code by a (hopefully informative!) name

```r
mean() # computes arithmetic mean of input
```

* You can write your own functions, too!

```r
celsius_to_fahrenheit <- function(x) {
  9/5 * x + 32
}
```

---
# Why write your own functions?

* If my script already works, writing a function takes time...

--

* Functions enable you to:
  - perform the same task on different inputs (e.g. new data, parameter values, etc.)
  - isolate code that you know is correct
  - organize your analysis for readability

---
# You can repeat code ...

```r
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10))

# rescale all the columns to [0, 1]
df$a <- (df$a - min(df$a)) / (max(df$a) - min(df$a))
df$b <- (df$b - min(df$b)) / (max(df$b) - min(df$a))
df$c <- (df$c - min(df$c)) / (max(df$c) - min(df$c))
```

---
# Or define a function!

```r
rescale_to_01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# rescale the columns of df to [0, 1]
df$a <- rescale_to_01(df$a)
df$b <- rescale_to_01(df$b)
df$c <- rescale_to_01(df$c)

# or with dplyr (after library(dplyr))
df <- df %>%
  mutate(
    across(c("a", "b", "c"), rescale_to_01))
```

---
# The **DRY** Principle: "Don't Repeat Yourself"

* Changing the calculation only needs to be done once (in the function definition).
* The function name helps us find where to change the code.
* A good function name helps someone understand what the code does without seeing its implementation.
* Copy-and-paste slips become less likely: in the repeated version, the `df$b` line subtracts `min(df$a)` in the denominator by mistake.

---
# Workflow structure

<img src="project-diagram.svg" alt="Diagram of the workflow in a hypothetical data analysis project with boxes representing code and data/output files. 'Raw data' goes into 'Pre-processing' and then 'Pre-processed data'. 'Pre-processed data' goes directly into 'Figure 3' (code) and then 'Figure 3' (Data file), but also into 'Analysis/Modelling'. The 'Model' output from 'Analysis/Modelling' is used in code for 'Figure 1' and 'Figure 2', which generate files 'Figure 1' and 'Figure 2'."
     width="60%" />

.tiny[modified from "Reproducible research best practices @JupyterCon" (version 18) by Rachael Tatman, https://www.kaggle.com/rtatman/reproducible-research-best-practices-jupytercon]

---
# Example Code

.small[
<pre>
library(tidyverse)
source("analysis_functions.R")
source("plotting_functions.R")
data_raw <- readRDS("readings.dat")
data_proc <- preprocess_data(data_raw)
fitted_model <- run_model(data_proc)
plot_model_forecasts(fitted_model, "figure-1_abundance-forecasts.pdf")
plot_abundance_residuals(fitted_model, "figure-2_abundance-residuals.pdf")
plot_abundance_histogram(data_proc, "figure-3_abundance-histogram.pdf")
</pre>
]

---
# Notes

* The code matches the steps of the analysis.
* It is easy to find where to make changes:
  - to modify a plot: edit the function
  - to add a new figure: write a new function and add it to the workflow script
* Possible improvement:
  - save the function outputs to a file, so that the functions do not need to be re-run

---
# Tips for writing functions

* name things well
* plan for (some) flexibility
* split large, complex tasks into smaller units, each with a clear purpose
* use data structures to store complex objects, as needed

---
# Tip 1: Naming Things

Function names should be verbs

```r
# bad
row_adder()
permutation()

# good
add_row()
permute()
```

.small[examples from https://style.tidyverse.org/functions.html#naming]

---
# Tip 2: Plan for flexibility

```r
plot_abundance_histogram <- function(data_proc, filename,
                                     width = 6, height = 6)
{
  # {{code}}
}
```

* `data_proc` and `filename` are required inputs
* `width` and `height` are adjustable, but have default values that work

---
# Tip 3: How to subdivide

1. start with the main goal of your program
2. split the goal into separate subgoals
   - e.g. read in data, clean data, fit model, do statistics, make plots
3. 
   repeat step 2
   - continue subdividing until you reach individual tasks
   - write a **function for each task**
   - choose initial implementation details to get things working

---
# Notes on Tip 3

* Get a simple version working first!
  - You can always make changes to the details and/or add flexibility later.
* Each function should have a single well-defined task
* Functions should ideally be 50 lines or fewer
  - this is a **guideline**; divide work into functions sensibly!

---
# More Notes on Tip 3

* If a line or set of lines of code is complicated, a new function might be needed

```r
# bad
if (class(x)[[1]] == "numeric" || class(x)[[1]] == "integer")

# good
if (is.numeric(x))
```

.small[examples from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=36]

---
# Tip 4: Use data structures

Most programming languages let you create data structures to store complex data.

* e.g. in **`R`**, you can create a list to include data, settings, and results. This list can be returned from a function:

```r
list(data = mtcars,
     var_1 = "mpg",
     var_2 = "cyl",
     rho = cor(mtcars$mpg, mtcars$cyl))
```

---
class: inverse, center, middle

# Comments

---

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Developers commenting their code <a href="https://t.co/jKURCVR9ds">pic.twitter.com/jKURCVR9ds</a></p>&mdash; Ricardo Ferreira (@riferrei) <a href="https://twitter.com/riferrei/status/1358180071081713670?ref_src=twsrc%5Etfw">February 6, 2021</a></blockquote>

---

.small[(via "Notes on Programming in C", Rob Pike)]

There is a famously bad comment style:

```c
i=i+1;           /* Add one to i */
```

and there are worse ways to do it:

```c
/**********************************
 *                                *
 *          Add one to i          *
 *                                *
 **********************************/

              i=i+1;
```

Don’t laugh now, wait until you see it in real life.

---
# Comment Dos & Don'ts

* **Do** use comments to describe **why**, not **how**
* **Do** document the inputs, outputs, and purpose of each function
* **Do** store notes/references in comments
* **Don't** turn code off and on with comments

---
# Comment Dos & Don'ts

**Do** use comments to describe **why**, not **how**

Bad:
```r
# Set lib as (1, [2/3 * n])
lib <- c(1, floor(2/3 * n))
```

Good:
```r
# set aside 2/3 of data to train model
lib <- c(1, floor(2/3 * n))
```

---
# Comment Dos & Don'ts

**Do** document the inputs, outputs, and purpose of each function

```r
plot_abundance_histogram <- function(data_proc, filename,
                                     width = 6, height = 6)
{
  # {{code}}
}
```

* What kind of object is `data_proc`, and what fields/columns are used to make the plot?

---
# Comment Dos & Don'ts

**Do** store notes/references in comments

```r
# Cholesky algorithm from Rasmussen &
#   Williams (2006, algorithm 2.1)
R <- chol(Sigma)
alpha <- backsolve(R, forwardsolve(t(R), y_lib - mean_y))
L_inv <- forwardsolve(t(R), diag(NROW(x_lib)))
Sigma_inv <- t(L_inv) %*% L_inv
```

---
# Comment Dos & Don'ts

**Don't** turn code off and on with comments

--

- you will not remember why it was commented out (was it buggy? did you want to test something?)

```r
# cat("The value of x is", x)
```

--

- use conditional logic instead

```r
if (DEBUG_MODE)
{
  cat("The value of x is", x)
}
```

Set `DEBUG_MODE` at the top of the script (e.g. `DEBUG_MODE <- FALSE`).

---
class: inverse, center, middle

# Code Smells

---
# What are "code smells"?

* Code smells: properties of code that seem less than ideal
  - code smells do not necessarily mean incorrect code, but are signs that the code may be more prone to errors
* recognizing code smells is a skill
  - it depends on experience and personal taste
* use a good style guide to avoid common issues!

---
# Some common code smells

* functions that are too long
* lots of single-letter variable names
* too much indentation
* no indentation
* duplicated (or near-identical) code fragments
* magic numbers

---
# TMI = too much indentation

.tiny[
<pre>
get_some_data <- function(config, outfile) {
  if (config_ok(config)) {
    if (can_write(outfile)) {
      if (can_open_network_connection(config)) {
        data <- parse_something_from_network()
        if(makes_sense(data)) {
          data <- beautify(data)
          write_it(data, outfile)
          return(TRUE)
        } else {
          return(FALSE)
        }
      } else {
        stop("Can't access network")
      }
    } else {
      \# uhm. What was this else for again?
    }
  } else {
    \# maybe, some bad news about ... the config?
  }
}
</pre>
]

.small[example from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=42]

---
# Simplify the logic!

.tiny[
<pre>
get_some_data <- function(config, outfile) {
  if (config_bad(config)) {
    stop("Bad config")
  }
  if (!can_write(outfile)) {
    stop("Can't write outfile")
  }
  if (!can_open_network_connection(config)) {
    stop("Can't access network")
  }
  <br />
  data <- parse_something_from_network()
  if(!makes_sense(data)) {
    return(FALSE)
  }
  data <- beautify(data)
  write_it(data, outfile)
  TRUE
}
</pre>
]

.small[example from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=43]

---
# Magic Numbers

**Magic numbers** are values in the code where the meaning of the number is derived from context.

* confusing - what do they represent?
* brittle - the value isn't always fixed
* can usually be replaced with named constants

---
# Example

```r
cat("The correlation between mpg and cyl is",
    cor(mtcars[,1], mtcars[,2]))
```

```
The correlation between mpg and cyl is -0.852162
```

--

* What if we insert a column or change the order of columns?
* What if we want to change the variables?
  - is the text label still correct?
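
---
# Example (why it is brittle)

What happens if the column order changes? A minimal sketch, assuming we move `hp` to the front of the built-in `mtcars` data (the name `reordered` is hypothetical and used only for illustration):

```r
# ASSUMPTION for illustration: hp is moved to the first column,
# so mpg becomes column 2 and cyl becomes column 3
reordered <- mtcars[, c("hp", setdiff(names(mtcars), "hp"))]

# the label still says "mpg and cyl", but columns 1 and 2 are now hp and mpg
cat("The correlation between mpg and cyl is",
    cor(reordered[, 1], reordered[, 2]))

# selecting columns by name gives the intended value regardless of order
cor(reordered[, "mpg"], reordered[, "cyl"])
```

* The positional version still runs without any warning, so the mislabeled result is easy to miss.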

---
# Example (improved)

* define variables to replace the "magic numbers"

```r
var_1 <- "mpg"
var_2 <- "cyl"

cat("The correlation between", var_1, "and", var_2, "is",
    cor(mtcars[,var_1], mtcars[,var_2]))
```

```
The correlation between mpg and cyl is -0.852162
```

---
# Example (as a function)

* We can even turn it into a function!

```r
print_correlation <- function(df = mtcars, var_1 = "mpg", var_2 = "cyl")
{
  cat("The correlation between", var_1, "and", var_2, "is",
      cor(df[,var_1], df[,var_2]))
}

print_correlation()
```

```
The correlation between mpg and cyl is -0.852162
```

---
# General Strategies

* Embrace **refactoring**: rewriting code without changing its behavior (e.g. to make it faster, cleaner, or easier to use)
* follow a coding **style guide**
  - links in the [resources page](https://uf-repro.github.io/writing-reusable-code/resources.html)
* use tools like automatic indentation and linters
  - many modern IDEs (e.g. RStudio) have these built-in

---
# Thanks

* Let me know what content you'd like to see
* Contact me for additional questions or consultation requests!
* Email: nhans@ufl.edu
* Check back in on the libguide for more modules and contact info:
  - https://guides.uflib.ufl.edu/reproducibility
* Original slides courtesy of Hao Ye