Natya Hans Academic Research Consulting and Services, University of Florida (updated: 2024-02-16)

Intro

“Any fool can write code that a computer can understand.
Good programmers write code that humans can understand.”

from “Refactoring: Improving the Design of Existing Code” by Martin Fowler

  • Motivations
    • Beyond working correctly, code should ideally also be:
      • easy to read and understand
      • easy to maintain or change
      • easy to verify (did it do what was expected?)
      • aesthetically pleasing (optionally)
  • Learning Outcomes. By the end of the workshop, participants will be able to:
    • organize programming tasks into modular functions
    • communicate code intent using comments
    • recognize and fix basic code smells
  • A Note
    • These concepts are universal, but code examples are in R.
    • Like any other skill, effective practice matters!
    • Refactoring: rewriting code without changing its functionality, to make it more readable and easier to maintain.
      • Similar to revising a paper for clarity.

Breaking Code into Functions

  • What are functions?

    • Functions let you refer to another piece of code by a (hopefully informative!) name
      mean()
      # computes arithmetic mean of input
    • You can write your own functions, too!
      celsius_to_fahrenheit <- function(x) {
        9/5 * x + 32
      }
  • Why write your own functions?

    • If my script already works, writing a function takes time…
    • Functions enable you to:
      • perform the same task on different inputs (e.g. new data, parameter values, etc.)
      • isolate correct code
      • organize your analysis for readability
  • You can repeat code …
    ```r
    df <- data.frame(
      a = rnorm(10), b = rnorm(10), c = rnorm(10))

    # rescale all the columns to [0, 1]
    df$a <- (df$a - min(df$a)) / (max(df$a) - min(df$a))
    df$b <- (df$b - min(df$b)) / (max(df$b) - min(df$a))
    # note: min(df$a) in the line above is a copy-paste error that is easy to miss
    df$c <- (df$c - min(df$c)) / (max(df$c) - min(df$c))
    ```

  • Or define a function!
    ```r
    rescale_to_01 <- function(x) {
      (x - min(x)) / (max(x) - min(x))
    }

    # rescale the columns of df to [0, 1]
    df$a <- rescale_to_01(df$a)
    df$b <- rescale_to_01(df$b)
    df$c <- rescale_to_01(df$c)

    # or with dplyr
    library(dplyr)
    df <- df %>%
      mutate(across(c("a", "b", "c"), rescale_to_01))
    ```

  • The DRY Principle: “Don’t Repeat Yourself”

    • Changing the calculation only needs to be done once (in the function definition).
    • The function name helps us find where to change the code.
    • A good function name helps someone understand what the code does without seeing its implementation.
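
As a minimal sketch of the first point, suppose missing values should be ignored when rescaling. With a function, that is a single edit to the definition rather than one edit per column. The `na.rm` argument below is an illustrative assumption, not part of the original slides.

```r
# Illustrative variant of rescale_to_01(): `na.rm` is an assumed extra
# argument, added to show a change made in exactly one place.
rescale_to_01 <- function(x, na.rm = TRUE) {
  x_range <- range(x, na.rm = na.rm)  # c(minimum, maximum)
  (x - x_range[1]) / (x_range[2] - x_range[1])
}

# every call site picks up the new behavior without being edited
rescaled <- rescale_to_01(c(2, 4, NA, 6))
print(rescaled)
```

Missing inputs stay missing in the output; only the minimum and maximum are computed without them.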
  • Workflow structure

    • [Figure] Diagram of the workflow in a hypothetical data analysis project, with boxes representing code and data/output files. “Raw data” goes into “Pre-processing” and then “Pre-processed data”. “Pre-processed data” goes directly into “Figure 3” (code) and then “Figure 3” (data file), but also into “Analysis/Modelling”. The “Model” output from “Analysis/Modelling” is used in code for “Figure 1” and “Figure 2”, which generate files “Figure 1” and “Figure 2”.
    • modified from “Reproducible research best practices @JupyterCon” (version 18) by Rachael Tatman, https://www.kaggle.com/rtatman/reproducible-research-best-practices-jupytercon

  • Example Code

    library(tidyverse)
    source("analysis_functions.R")
    source("plotting_functions.R")
    data_raw 

  • Notes

    • The code matches the steps of the analysis.
    • It is easy to find where to make changes:
      • to modify a plot - edit the function
      • add a new figure - write a new function and add it to the workflow script
    • Possible improvement:
      • save the function outputs to a file, so that they do not need to be re-run
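
One possible way to implement this improvement is to cache each expensive result with base R's `saveRDS()` and `readRDS()`. The helper name `cache_rds`, the file name, and the model are all hypothetical; this is a sketch, not the workflow from the slides.

```r
# Hypothetical helper: run `compute_fn` only if `path` does not exist yet;
# otherwise load the previously saved result.
cache_rds <- function(path, compute_fn) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute_fn()
  saveRDS(result, path)
  result
}

# assumed example: cache a model fit in a temporary directory
model_file <- file.path(tempdir(), "model_fit.rds")
fit <- cache_rds(model_file, function() lm(mpg ~ cyl, data = mtcars))
```

Deleting the file forces the step to run again, which is needed whenever the upstream code or data changes.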
  • Tips for writing functions

    • name things well
    • plan for (some) flexibility
    • split large, complex tasks into smaller units, each with a clear purpose
    • use data structures to store complex objects, as needed
  • Tip 1: Naming Things

    • Function names should be verbs
    ```r
    # bad
    row_adder()
    permutation()

    # good
    add_row()
    permute()
    ```
    examples from https://style.tidyverse.org/functions.html#naming

  • Tip 2: Plan for flexibility

    plot_abundance_histogram <- 
      function(data_proc, filename, 
               width = 6, height = 6) { 
        # {{code}} 
      }
    • data_proc and filename are required inputs
    • width and height are adjustable, but have default values that work
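
To make the role of the defaults concrete, here is a sketch with a filled-in body. The body, the `abundance` column, and the output file names are assumptions for illustration; the slides leave the implementation unspecified.

```r
# Sketch only: the body and the `abundance` column are assumed.
plot_abundance_histogram <-
  function(data_proc, filename,
           width = 6, height = 6) {
    pdf(filename, width = width, height = height)
    on.exit(dev.off())  # close the file even if plotting fails
    hist(data_proc$abundance, main = "Abundance", xlab = "abundance")
    invisible(filename)
  }

data_proc <- data.frame(abundance = rpois(100, lambda = 5))
out_default <- file.path(tempdir(), "abundance_default.pdf")
out_wide <- file.path(tempdir(), "abundance_wide.pdf")

plot_abundance_histogram(data_proc, out_default)           # 6 x 6 defaults
plot_abundance_histogram(data_proc, out_wide, width = 10)  # override width only
```

Most callers can ignore `width` and `height`; the few who need a different size name only the argument they want to change.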
  • Tip 3: How to subdivide

    1. start with the main goal of your program
    2. split the goal into separate subgoals
       • e.g. read in data, clean data, fit model, do statistics, make plots
    3. repeat step 2 - continue subdividing until you reach individual tasks
       • write a function for each task
       • choose initial implementation details to get things working
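
The steps above can be sketched as a top-level function that only calls one function per subgoal. All function names here are hypothetical, and `mtcars` stands in for real data.

```r
# Each subgoal gets its own small function (names are hypothetical).
read_data <- function() {
  mtcars  # stand-in for reading a real file
}

clean_data <- function(data_raw) {
  data_raw[complete.cases(data_raw), ]  # drop rows with missing values
}

fit_model <- function(data_proc) {
  lm(mpg ~ wt, data = data_proc)
}

summarize_model <- function(model) {
  coef(summary(model))  # coefficient table as a matrix
}

# The main goal reads as a list of its subgoals.
run_analysis <- function() {
  data_raw <- read_data()
  data_proc <- clean_data(data_raw)
  model <- fit_model(data_proc)
  summarize_model(model)
}

analysis_result <- run_analysis()
print(analysis_result)
```

Swapping the model or the cleaning rule later means editing one small function, while `run_analysis()` stays unchanged.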
  • Notes on Tip 3

    • Get a simple version working first!
      • You can always make changes to the details and/or add flexibility later.
    • Each function should have a single well-defined task
    • Functions should ideally be 50 lines or less
      • this is a guideline—divide work into functions sensibly!
  • More Notes on Tip 3

    • If a line or set of lines of code is complicated, a new function might be needed
    ```r
    # bad
    if (class(x)[[1]] == "numeric" || class(x)[[1]] == "integer")

    # good
    if (is.numeric(x))
    ```
    examples from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=36

  • Tip 4: Use data structures

    • Most programming languages let you create data structures to store complex data.

    • e.g. in R, you can create a list to include data, settings, and results. This list can be returned from a function:
    list(data = mtcars, 
         var_1 = "mpg", 
         var_2 = "cyl", 
         rho = cor(mtcars$mpg, mtcars$cyl))
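
A short sketch of returning such a list from a function and reading its parts afterwards. The function name `compute_correlation` is hypothetical.

```r
# Hypothetical function that bundles inputs, settings, and result in one list.
compute_correlation <- function(data, var_1, var_2) {
  list(data = data,
       var_1 = var_1,
       var_2 = var_2,
       rho = cor(data[[var_1]], data[[var_2]]))
}

result <- compute_correlation(mtcars, "mpg", "cyl")
result$rho     # the correlation itself
names(result)  # everything else travels with it
```

Because the settings are stored next to the result, later code such as a plot label can use `result$var_1` instead of repeating the variable name.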

Comments

(via “Notes on Programming in C”, Rob Pike)

There is a famously bad comment style:

    i=i+1;           /* Add one to i */

and there are worse ways to do it:

    /**********************************
     *                                *
     *          Add one to i          *
     *                                *
     **********************************/

                   i=i+1;

Don’t laugh now, wait until you see it in real life.

  • Comment Dos & Don’ts

    • Do use comments to describe why, not how
    • Do document the inputs, outputs, and purpose of each function
    • Do store notes/references in comments
    • Don’t turn code off and on with comments
  • Comment Dos & Don’ts: Do use comments to describe why, not how

    • Bad:
      # Set lib as (1, [2/3 * n])
      lib <- c(1, floor(2/3 * n))
    • Good:
      # set aside 2/3 of data to train model
      lib <- c(1, floor(2/3 * n))
  • Comment Dos & Don’ts: Do document the inputs, outputs, and purpose of each function

    plot_abundance_histogram <- 
      function(data_proc, filename, 
               width = 6, height = 6) { 
        # {{code}} 
      }
    • What kind of object is data_proc, what fields/columns are used to make the plot?
  • Comment Dos & Don’ts: Do store notes/references in comments

    # cholesky algorithm from Rasmussen &
    #   Williams (2006, algorithm 2.1)
    R <- chol(Sigma)
    alpha <- backsolve(R, forwardsolve(t(R), y_lib - mean_y))
    L_inv <- forwardsolve(t(R), diag(NROW(x_lib)))
    Sigma_inv <- t(L_inv) %*% L_inv
  • Comment Dos & Don’ts: Don’t turn code off and on with comments

    • you will not remember why it was commented out (was it buggy? did you want to test something?)
      # cat("The value of x is", x)
    • use conditional logic instead
      if (DEBUG_MODE) {
        cat("The value of x is", x)
      }
    • Set DEBUG_MODE at the top of the script.
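
One common convention for documenting inputs, outputs, and purpose in R is a roxygen2-style comment header placed directly above the function. The sketch below applies it to the `rescale_to_01()` function from earlier; the wording of the documentation is an assumption about how that function is meant to be used.

```r
#' Rescale a numeric vector to the interval [0, 1]
#'
#' @param x numeric vector with at least two distinct values and no
#'   missing values
#' @return numeric vector of the same length as `x`, in which the
#'   minimum of `x` maps to 0 and the maximum maps to 1
rescale_to_01 <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

rescale_to_01(c(10, 15, 20))
```

The header states what the function expects and returns, so a reader does not need to study the body to use it safely.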

Code Smells

  • What are “code smells”?

    • Code smells: properties of code that seem less than ideal
      • code smells do not necessarily mean incorrect code, but are signs that the code may be more prone to errors
    • recognizing code smells is a skill
      • depends on experience, and personal taste
    • use a good style guide to avoid common issues!
  • Some common code smells

    • functions that are too long
    • lots of single-letter variable names
    • too much indentation
    • no indentation
    • duplicated (or near-identical) code fragments
    • magic numbers
  • TMI = too much indentation

    get_some_data 

    example from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=42

  • Simplify the logic!

    get_some_data 
      data 

    example from https://speakerdeck.com/jennybc/code-smells-and-feels?slide=43
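
A hypothetical illustration of the same idea (not the code from the cited slides): checking for problems first and exiting early keeps the main work at the top indentation level. Both function names are made up for this sketch.

```r
# Nested version: the main work is buried three levels deep.
mean_of_column_nested <- function(df, column) {
  if (is.data.frame(df)) {
    if (column %in% names(df)) {
      if (is.numeric(df[[column]])) {
        mean(df[[column]])
      } else {
        stop("column is not numeric")
      }
    } else {
      stop("column not found")
    }
  } else {
    stop("df is not a data frame")
  }
}

# Flat version: handle each problem up front, then do the work.
mean_of_column <- function(df, column) {
  if (!is.data.frame(df)) stop("df is not a data frame")
  if (!column %in% names(df)) stop("column not found")
  if (!is.numeric(df[[column]])) stop("column is not numeric")

  mean(df[[column]])
}

mean_of_column(mtcars, "mpg")
```

Both versions behave the same; in the flat one each error message sits next to the condition that triggers it.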

  • Magic Numbers

    • Magic numbers are values in the code where the meaning of the number is derived from context.

    • confusing - what do they represent?
    • brittle - the value isn’t always fixed
    • can usually be replaced with named constants
  • Example

    cat("The correlation between mpg and cyl is", 
        cor(mtcars[,1], mtcars[,2]))
    The correlation between mpg and cyl is -0.852162
    • What if we insert a column or change the order of columns?
    • What if we want to change the variables?
      • is the text label still correct?
  • Example (improved)

    • define variables to replace the “magic numbers”
    var_1 <- "mpg"
    var_2 <- "cyl"
    cat("The correlation between", var_1, 
        "and", var_2, "is", 
        cor(mtcars[,var_1], mtcars[,var_2]))
    The correlation between mpg and cyl is -0.852162
  • Example (as a function)

    • We can even turn it into a function!
    f <- function(df = mtcars, var_1 = "mpg", 
                  var_2 = "cyl") {
      cat("The correlation between", var_1, 
          "and", var_2, "is", 
        cor(df[,var_1], df[,var_2]))
    }
    f()
    The correlation between mpg and cyl is -0.852162
  • General Strategies

    • Embrace refactoring: rewriting code without changing its behavior (e.g. to make it faster, cleaner, or easier to use)
    • follow a coding style guide
    • use technological tools like automatic indentation and linters
      • many modern IDEs (e.g. RStudio) have these built-in

Thanks

  • Let me know what content you’d like to see

  • Contact me for additional questions or consultation requests!

  • Email:

  • Check back in on the libguide for more modules and contact info:

  • Original slides courtesy of Hao Ye