Syllabus

Natya Hans Academic Research Consulting and Services, University of Florida (updated: 2024-02-16)

Intro

“Any fool can write code that a computer can understand.
Good programmers write code that humans can understand.”

from “Refactoring: Improving the Design of Existing Code” by Martin Fowler

Motivations
- Beyond working correctly, code should ideally also be:
  - easy to read and understand
  - easy to maintain or change
  - easy to verify (did it do what was expected?)
  - aesthetically pleasing (optionally)
Learning Outcomes
By the end of the workshop, participants will be able to:
- organize programming tasks into modular functions
- communicate code intent using comments
- recognize and fix basic code smells
A Note
- These concepts are universal, but code examples are in R.
- Like any other skill, effective practice matters!
- Refactoring: rewriting code to make it more readable and easier to maintain, without changing its functionality.
  - Similar to revising a paper for clarity.

Breaking Code into Functions

What are functions?
- Functions let you refer to another piece of code by a (hopefully informative!) name
```
  mean()
  # computes arithmetic mean of input
```
- You can write your own functions, too!
```
  celsius_to_fahrenheit <- function(x) {
    9/5 * x + 32
  }
```
Why write your own functions?
- If my script already works, writing a function takes time…
- Functions enable you to:
  - perform the same task on different inputs (e.g. new data, parameter values, etc.)
  - isolate correct code
  - organize your analysis for readability
You can repeat code …
```r
df <- data.frame(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10))

# rescale all the columns to [0, 1]
df$a <- (df$a - min(df$a)) / (max(df$a) - min(df$a))
df$b <- (df$b - min(df$b)) / (max(df$b) - min(df$a))  # copy-paste slip: min(df$a) should be min(df$b)
df$c <- (df$c - min(df$c)) / (max(df$c) - min(df$c))
```
Or define a function!
```r
rescale_to_01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# rescale the columns of df to [0, 1]
df$a <- rescale_to_01(df$a)
df$b <- rescale_to_01(df$b)
df$c <- rescale_to_01(df$c)

# or with dplyr
library(dplyr)
df <- df %>%
  mutate(across(c("a", "b", "c"), rescale_to_01))
```
The DRY Principle: “Don’t Repeat Yourself”
- Changing the calculation only needs to be done once (in the function definition).
- The function name helps us find where to change the code.
- A good function name helps someone understand what the code does without seeing its implementation.
Workflow structure modified from “Reproducible research best practices @JupyterCon” (version 18) by Rachael Tatman, https://www.kaggle.com/rtatman/reproducible-research-best-practices-jupytercon

Example Code

library(tidyverse)
source("analysis_functions.R")
source("plotting_functions.R")
data_raw

Notes
- The code matches the steps of the analysis.
- It is easy to find where to make changes:
  - to modify a plot - edit the function
  - add a new figure - write a new function and add it to the workflow script
- Possible improvement:
  - save the function outputs to a file, so that they do not need to be re-run
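The possible improvement above could be sketched with a small caching helper. This is only an illustration under stated assumptions: `cached()`, `fit_slow_model()`, and the cache file name are hypothetical names that do not come from the original workflow; the sketch relies on base R's `saveRDS()` and `readRDS()`.

```r
# Hypothetical helper: run `compute()` once and store the result on disk;
# on later runs, read the stored result instead of recomputing it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Hypothetical stand-in for an expensive analysis step.
fit_slow_model <- function(data) {
  lm(mpg ~ wt, data = data)
}

cache_file <- file.path(tempdir(), "fitted_model.rds")

# First call fits the model and writes the file.
fitted_model <- cached(cache_file, function() fit_slow_model(mtcars))

# Second call finds the file, so the compute function is never called.
fitted_again <- cached(cache_file, function() stop("not re-run"))
```

Deleting the cache file forces the step to run again, which is worth doing whenever the input data or the function itself changes.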
Tips for writing functions
- name things well
- plan for (some) flexibility
- split large, complex tasks into smaller units, each with a clear purpose
- use data structures to store complex objects, as needed
Tip 1: Naming Things
Function names should be verbs
```r
# bad
row_adder()
permutation()

# good
add_row()
permute()
```
examples from https://style.tidyverse.org/functions.html#naming

Tip 2: Plan for flexibility

plot_abundance_histogram <- 
  function(data_proc, filename, 
           width = 6, height = 6) { 
    # {{code}} 
  }

- data_proc and filename are required inputs
- width and height are adjustable, but have default values that work
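A runnable sketch of the same pattern, using a hypothetical function (`describe_values()` and its arguments are made up for illustration): one required input, plus optional arguments whose defaults cover the common case.

```r
# Hypothetical function: `values` is required; `digits` and `na_rm`
# are optional arguments with defaults that work for typical use.
describe_values <- function(values, digits = 2, na_rm = TRUE) {
  round(c(mean = mean(values, na.rm = na_rm),
          sd = sd(values, na.rm = na_rm)),
        digits)
}

describe_values(mtcars$mpg)              # uses both defaults
describe_values(mtcars$mpg, digits = 0)  # overrides one default by name
```

Callers who are happy with the defaults never need to think about them, and overriding by name keeps the call readable.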

Tip 3: How to subdivide
1. start with the main goal of your program
2. split the goal into separate subgoals
  - e.g. read in data, clean data, fit model, do statistics, make plots
3. repeat step 2 - continue subdividing until you reach individual tasks
  - write a function for each task
  - choose initial implementation details to get things working
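The steps above can be sketched on a toy problem. Everything here is an assumption made for illustration: the goal (report how well one variable predicts another), the function names, and the choice of `mtcars` as data.

```r
# Main goal: report how well one variable predicts another.
# Subgoals: clean the data, fit a model, summarize the result.
# All function names are hypothetical.

clean_data <- function(data, vars) {
  # keep only the columns of interest and drop incomplete rows
  data[complete.cases(data[, vars]), vars]
}

fit_model <- function(data, response, predictor) {
  # build the formula `response ~ predictor` from strings
  lm(reformulate(predictor, response = response), data = data)
}

summarize_fit <- function(model) {
  summary(model)$r.squared
}

# The top-level function reads like the list of subgoals.
run_analysis <- function(data, response, predictor) {
  data_clean <- clean_data(data, c(response, predictor))
  model <- fit_model(data_clean, response, predictor)
  summarize_fit(model)
}

r_squared <- run_analysis(mtcars, response = "mpg", predictor = "wt")
```

Each helper starts with the simplest implementation that works; any one of them can later be replaced without touching the others.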
Notes on Tip 3
- Get a simple version working first!
  - You can always make changes to the details and/or add flexibility later.
- Each function should have a single well-defined task
- Functions should ideally be 50 lines or fewer
  - this is only a guideline; divide work into functions sensibly!
More Notes on Tip 3
- If a line or set of lines of code is complicated, a new function might be needed
```r
# bad
if (class(x)[[1]] == "numeric" || class(x)[[1]] == "integer")

# good
if (is.numeric(x))
```
examples from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=36
Tip 4: Use data structures
Most programming languages let you create data structures to store complex data.
- e.g. in R, you can create a list to include data, settings, and results. This list can be returned from a function:
```
list(data = mtcars, 
     var_1 = "mpg", 
     var_2 = "cyl", 
     rho = cor(mtcars$mpg, mtcars$cyl))
```
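As a sketch of how such a list might be returned from a function and then used, assuming a hypothetical wrapper named `compute_correlation()`:

```r
# Hypothetical function that bundles data, settings, and results
compute_correlation <- function(data, var_1, var_2) {
  list(data = data,
       var_1 = var_1,
       var_2 = var_2,
       rho = cor(data[[var_1]], data[[var_2]]))
}

result <- compute_correlation(mtcars, "mpg", "cyl")
result$rho      # the computed correlation
names(result)   # "data" "var_1" "var_2" "rho"
```

Because the settings travel with the result, later code can label output correctly without relying on variables defined elsewhere in the script.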

Comments

Developers commenting their code pic.twitter.com/jKURCVR9ds
— Ricardo Ferreira (@riferrei) February 6, 2021

(via “Notes on Programming in C”, Rob Pike)

There is a famously bad comment style:

i=i+1; /* Add one to i */

and there are worse ways to do it:

/**********************************
 *                                *
 *          Add one to i          *
 *                                *
 **********************************/

i=i+1;

Don’t laugh now, wait until you see it in real life.

Comment Dos & Don’ts
- Do use comments to describe why, not how
- Do document the inputs, outputs, and purpose of each function
- Do store notes/references in comments
- Don’t turn code off and on with comments

Do use comments to describe why, not how

Bad:

# Set lib as (1, [2/3 * n])
lib <- c(1, floor(2/3 * n))

Good:

# set aside 2/3 of data to train model
lib <- c(1, floor(2/3 * n))

Do document the inputs, outputs, and purpose of each function

plot_abundance_histogram <-
  function(data_proc, filename,
           width = 6, height = 6) {
    # {{code}}
  }

- What kind of object is data_proc, and what fields/columns are used to make the plot?

Do store notes/references in comments

# cholesky algorithm from Rasmussen &
# Williams (2006, algorithm 2.1)
R <- chol(Sigma)
alpha <- backsolve(R, forwardsolve(t(R), y_lib - mean_y))
L_inv <- forwardsolve(t(R), diag(NROW(x_lib)))
Sigma_inv <- t(L_inv) %*% L_inv

Don’t turn code off and on with comments
- you will not remember why it was commented out (was it buggy? did you want to test something?)

# cat("The value of x is", x)

- use conditional logic instead

if (DEBUG_MODE) {
  cat("The value of x is", x)
}

Set DEBUG_MODE at the top of the script.
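One possible way to document the inputs, outputs, and purpose of a function is shown below for `rescale_to_01()` from earlier in the workshop. The `#'` prefix and the `@param` / `@return` tags follow the roxygen2 convention; choosing that convention here is an assumption, and plain `#` comments carrying the same information would serve equally well.

```r
#' Rescale a numeric vector to the range [0, 1]
#'
#' @param x numeric vector with at least two distinct values
#' @return numeric vector the same length as x, where the smallest
#'   value maps to 0 and the largest to 1
rescale_to_01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

rescale_to_01(c(2, 4, 6))  # 0.0 0.5 1.0
```

The comment states a requirement the code does not enforce (at least two distinct values); without it, a reader has to work out from the formula that a constant vector causes division by zero.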

Code Smells

What are “code smells”?
- Code smells: properties of code that seem less than ideal
  - code smells do not necessarily mean incorrect code, but are signs that the code may be more prone to errors
- recognizing code smells is a skill
  - depends on experience, and personal taste
- use a good style guide to avoid common issues!
Some common code smells
- functions that are too long
- lots of single-letter variable names
- too much indentation
- no indentation
- duplicated (or near-identical) code fragments
- magic numbers

TMI = too much indentation

get_some_data

example from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=42

Simplify the logic!

get_some_data data

example from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=43
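To make the contrast concrete, here is a small sketch of the same idea. It is not the example from the linked slides: the function names and checks are hypothetical, chosen only to show deep nesting next to an equivalent flat version that exits early.

```r
# Nested version: the main work is buried three levels deep.
save_summary_nested <- function(data, outfile) {
  if (is.data.frame(data)) {
    if (nrow(data) > 0) {
      if (dir.exists(dirname(outfile))) {
        write.csv(data, outfile, row.names = FALSE)
        return(TRUE)
      } else {
        stop("output directory does not exist")
      }
    } else {
      stop("data has no rows")
    }
  } else {
    stop("data must be a data frame")
  }
}

# Flat version: check each requirement up front and exit early.
save_summary <- function(data, outfile) {
  if (!is.data.frame(data)) stop("data must be a data frame")
  if (nrow(data) == 0) stop("data has no rows")
  if (!dir.exists(dirname(outfile))) stop("output directory does not exist")

  write.csv(data, outfile, row.names = FALSE)
  TRUE
}
```

Both versions behave the same, but in the flat one each error message sits next to the condition that triggers it, and the main task is no longer indented.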

Magic Numbers
Magic numbers are values in the code where the meaning of the number is derived from context.
- confusing - what do they represent?
- brittle - the value isn’t always fixed
- can usually be replaced with named constants
Example
```
cat("The correlation between mpg and cyl is", 
    cor(mtcars[,1], mtcars[,2]))
```
```
The correlation between mpg and cyl is -0.852162
```
- What if we insert a column or change the order of columns?
- What if we want to change the variables?
  - is the text label still correct?

Example (improved)

define variables to replace the “magic numbers”

var_1 <- "mpg"
var_2 <- "cyl"
cat("The correlation between", var_1, 
    "and", var_2, "is", 
    cor(mtcars[,var_1], mtcars[,var_2]))

The correlation between mpg and cyl is -0.852162

Example (as a function)

We can even turn it into a function!

f <- function(df = mtcars, var_1 = "mpg", 
              var_2 = "cyl") {
  cat("The correlation between", var_1, 
      "and", var_2, "is", 
    cor(df[,var_1], df[,var_2]))
}
f()

The correlation between mpg and cyl is -0.852162

General Strategies
- Embrace refactoring: rewriting code without changing its behavior (e.g. to make it faster, cleaner, or easier to use)
- follow a coding style guide
  - links in the resources page
- use technological tools like automatic indentation and linters
  - many modern IDEs (e.g. RStudio) have these built-in

Thanks
Let me know what content you’d like to see
Contact me for additional questions or consultation requests!
Email: nhans@ufl.edu
Check back in on the libguide for more modules and contact info:
- https://guides.uflib.ufl.edu/reproducibility
Original slides courtesy of Hao Ye
