Data Organization in Spreadsheets

class: inverse, center, middle

# Data Organization in Spreadsheets
### Hao Ye
### Health Science Center Libraries, University of Florida
### (updated: 2023-04-19)

---

# Motivations

Why should you care about this topic?

Organizing your projects and data will:
* improve reproducibility
* enable you to scale up projects
* make it easier to build off it later
--

**this includes yourself at a later point!**

---

# Learning Outcomes

By the end of the workshop, participants will be able to:

* implement recommended practices for file and folder organization in projects
* use and apply file naming conventions
* understand the principles of tidy data for structuring tabular data

---
class: inverse, center, middle

# File and Folder Organization

---

# Principles

* You don't want to spend time looking for stuff.
  Neither do your collaborators.
* Keep all files associated with a project in **one** place.
  - project description
  - data
  - code
  - figures
  - manuscript

---

# Recommendations

* Use sub-folders to organize data, figures, manuscript, etc.
  - separate raw and processed data
* Write `README` files to document:
  - who did the experiments
  - data formatting
    (more on this later)
  - when/how/why

---

# Structuring a Project Folder

<pre class = "hljs remark-code">
project/
├── data/
├── figures/
├── output/
├── paper/
├── R/
└── README.md
</pre>

--
* more examples in [Marwick et al. 2018 "Packaging Data Analytical Work Reproducibly Using R (and Friends)"](https://www.tandfonline.com/doi/abs/10.1080/00031305.2017.1375986)

---
# A Project README

A plain text file containing:
* a short summary of project & goals
* guidance for interacting with the project
  - (software) how to install and run it
  - (paper) which file is the final report/paper
  - (data) the summary figure or main file
* funding acknowledgments / how to cite

---
# Summary

* keep project stuff together in the same place
  - files related to the project
  - **key information** about the project (creating files for this if necessary!)
* give simple instructions to engage with your project
  - think of it as a free sample that helps people decide if your project is useful to them!

---
class: inverse, center, middle

# Naming Things

---

# Principles

Choose names for files that:
* are machine-readable
* are human-readable
* sort nicely

---

# Examples (NOT SO GOOD)

<pre class = "hljs remark-code">
Hao's data with spaces & punctuation.xlsx
01.R
figure 1.png
fig 2.jpg
report-final FINAL HY-comments ver3.docx
</pre>

---

# Examples (BETTER)

<pre class = "hljs remark-code">
hao_learned_about_underscores.xlsx
01_import-data.R
fig01_scatterplot_length-vs-interest.png
fig02_histogram_talk-attendance.png
yearly-report_2020-06-01_hy-comments.docx
</pre>

---
# Machine-Readable

* Avoid spaces, punctuation, accented characters, mixing CAPS and lowercase.
  - computers can struggle with these
  - simple names are easier to remember
* Separate words with hyphens `'-'`
* Separate groupings with underscores `'_'`

---

# Human-Readable

* Filenames do not need to be very detailed
* Filenames *should* help you find it later when you need to
* For code, use nouns for data objects and verbs for actions:
  - `load-data.R`
  - `data_mtcars.csv`
  - `make-scatterplot.R`

---

# Using Default Ordering

* Files can be sorted alphabetically (and usually)
  - we can make use of this!
* Left-pad with numbers to impose an order on files:
  - e.g. `01_load-data.R`, `02_process-data.R`
* Use RFC-3339 for dates:
  - `YYYY-MM-DD` format (alphabetical sorting makes it chronological, too!)

---
class: center
<img src="https://imgs.xkcd.com/comics/iso_8601_2x.png" width="60%" />

[xkcd/1179](https://xkcd.com/1179/)

---

# Example

<pre class = "hljs remark-code">
siteA_2020-04-01_animal-count.xls
siteA_2020-05-01_animal-count.xls
siteB_2020-04-01_animal-count.xls
siteB_2020-04-01_weather.dat
siteB_2020-05-01_animal-count.xls
siteB_2020-05-01_weather.dat
</pre>

--
files are organized by site -> date -> type of measurement

---

# Summary

* you probably use names to organize files already!
  - use these tips to improve your system
* consistency is important!

---
class: inverse, center, middle

# Structuring Tabular Data (e.g. spreadsheets)

---

# Principles

Make it easy for anyone to work with your data:
* Structure data for analysis (i.e. "Tidy Data")
* Store important information explicitly
* Keep an archive of raw data
* Include a Data Dictionary

---

# Tidy Data

Properties of *tidy data*:
* each column is a variable
* each row is an observation
* each table is a single observational unit

---

# Common issues

* data values are stored in column headers
  - e.g. treatment values, dates of sampling
* multiple variables are stored in one column
* variables are stored in rows and columns
* "1 observational unit = 1 table" is violated
  - e.g. subject info and measurements are stored in the same table

---

# Example: column labels encoding data

* wide-format data

|name|Thin Mints|Samoas|Tagalongs|
|--|--|--|--|
|A|4|0|0|
|B|2|0|2|
|c|0|3|1|

---

# Example: re-structured columns

* long-format data

|name|flavor|boxes|
|--|--|--|
|A|Thin Mints|4|
|B|Thin Mints|2|
|B|Tagalongs|2|
|C|Samoas|3|
|C|Tagalongs|1|

---

# Example: multiple observational units

|name|address|flavor|boxes|
|--|--|--|--|
|A|3828 Piermont Dr|Thin Mints|4|
|B|221B Baker St|Thin Mints|2|
|B|221B Baker St|Tagalongs|2|
|C|124 Conch St|Samoas|3|
|C|124 Conch St|Tagalongs|1|

---

# Example: separate tables

.pull-left[
|name|flavor|boxes|
|--|--|--|
|A|Thin Mints|4|
|B|Thin Mints|2|
|B|Tagalongs|2|
|C|Samoas|3|
|C|Tagalongs|1|
]

.pull-right[
|name|address|
|--|--|
|A|3828 Piermont Dr|
|B|221B Baker St|
|C|124 Conch St|
]

---
# Spreadsheet Formatting

* for missing data, use empty cells or `'NA'`
  - *be careful about `'NA'` for certain types of data (e.g. country code for "Namibia")*
* do NOT space out data with empty rows/columns
* do NOT use text formatting (e.g. bold/italics) to store information
* WATCH OUT: Excel LOVES to convert data into a date format

---
# Example: spreadsheet formatting

![Figure 10 from Broman & Woo "Data Organization in Spreadsheets", showing a data table with an outlier highlighted in red as an example of bad formatting, and then an alternative where the outlier status is encoded in its own column.](spreadsheet_formatting.png)

Instead of highlighting the cells with outliers, encode outlier status as its own column.

---

# Raw Data

* store raw data for reproducibility
  - copy and modify files instead
  - and/or use scripts to clean/wrangle raw data
* document processing steps in code or text
  - someone else can see what processing was done, and reproduce it
  - OpenRefine is a great tool for cleaning messy data (and records the steps for you)!  
  https://uf-repro.github.io/OpenRefine-intro/

---

# Data Dictionary

* define your rows and columns
  - e.g. do rows correspond to:
     + individual subjects
     + OR data collection sessions
     + OR individual samples from a measurement device
* define codes, categories, acronyms
* define relationships between multiple tables
* [provide metadata details](https://guides.uflib.ufl.edu/datamanagement/metadata)

---

# Example Data Dictionary

(modified from Figure 9 of Broman & Woo "Data Organization in Spreadsheets")

---

# Thanks

* Contact me for additional questions or consultation requests!
* Check back in on the libguide for more modules and contact info:
  - https://guides.uflib.ufl.edu/reproducibility