class: inverse, center, middle # Data Organization in Spreadsheets ### Hao Ye ### Health Science Center Libraries, University of Florida ### (updated: 2023-04-19) --- # Motivations Why should you care about this topic? -- Organizing your projects and data will: * improve reproducibility * enable you to scale up projects * make it easier to build off it later -- **this includes yourself at a later point!** --- # Learning Outcomes By the end of the workshop, participants will be able to: * implement recommended practices for file and folder organization in projects * use and apply file naming conventions * understand the principles of tidy data for structuring tabular data --- class: inverse, center, middle # File and Folder Organization --- # Principles * You don't want to spend time looking for stuff. Neither do your collaborators. * Keep all files associated with a project in **one** place. - project description - data - code - figures - manuscript --- # Recommendations * Use sub-folders to organize data, figures, manuscript, etc. - separate raw and processed data * Write `README` files to document: - who did the experiments - data formatting (more on this later) - when/how/why --- # Structuring a Project Folder <pre class = "hljs remark-code"> project/ ├── data/ ├── figures/ ├── output/ ├── paper/ ├── R/ └── README.md </pre> -- * more examples in [Marwick et al. 2018 "Packaging Data Analytical Work Reproducibly Using R (and Friends)"](https://www.tandfonline.com/doi/abs/10.1080/00031305.2017.1375986) --- # A Project README A plain text file containing: * a short summary of project & goals * guidance for interacting with the project - (software) how to install and run it - (paper) which file is the final report/paper - (data) the summary figure or main file * funding acknowledgments / how to cite --- # Summary * keep project stuff together in the same place - files related to the project - **key information** about the project (creating files for this if necessary!) * give simple instructions to engage with your project - think of it as a free sample that helps people decide if your project is useful to them! --- class: inverse, center, middle # Naming Things --- # Principles Choose names for files that: * are machine-readable * are human-readable * sort nicely --- # Examples (NOT SO GOOD) <pre class = "hljs remark-code"> Hao's data with spaces & punctuation.xlsx 01.R figure 1.png fig 2.jpg report-final FINAL HY-comments ver3.docx </pre> --- # Examples (BETTER) <pre class = "hljs remark-code"> hao_learned_about_underscores.xlsx 01_import-data.R fig01_scatterplot_length-vs-interest.png fig02_histogram_talk-attendance.png yearly-report_2020-06-01_hy-comments.docx </pre> --- # Machine-Readable * Avoid spaces, punctuation, accented characters, mixing CAPS and lowercase. - computers can struggle with these - simple names are easier to remember * Separate words with hyphens `'-'` * Separate groupings with underscores `'_'` --- # Human-Readable * Filenames do not need to be very detailed * Filenames *should* help you find it later when you need to * For code, use nouns for data objects and verbs for actions: - `load-data.R` - `data_mtcars.csv` - `make-scatterplot.R` --- # Using Default Ordering * Files can be sorted alphabetically (and usually) - we can make use of this! * Left-pad with numbers to impose an order on files: - e.g. `01_load-data.R`, `02_process-data.R` * Use RFC-3339 for dates: - `YYYY-MM-DD` format (alphabetical sorting makes it chronological, too!) --- class: center <img src="https://imgs.xkcd.com/comics/iso_8601_2x.png" width="60%" /> [xkcd/1179](https://xkcd.com/1179/) --- # Example <pre class = "hljs remark-code"> siteA_2020-04-01_animal-count.xls siteA_2020-05-01_animal-count.xls siteB_2020-04-01_animal-count.xls siteB_2020-04-01_weather.dat siteB_2020-05-01_animal-count.xls siteB_2020-05-01_weather.dat </pre> -- files are organized by site -> date -> type of measurement --- # Summary * you probably use names to organize files already! - use these tips to improve your system * consistency is important! --- class: inverse, center, middle # Structuring Tabular Data (e.g. spreadsheets) --- # Principles Make it easy for anyone to work with your data: * Structure data for analysis (i.e. "Tidy Data") * Store important information explicitly * Keep an archive of raw data * Include a Data Dictionary --- # Tidy Data Properties of *tidy data*: * each column is a variable * each row is an observation * each table is a single observational unit --- # Common issues * data values are stored in column headers - e.g. treatment values, dates of sampling * multiple variables are stored in one column * variables are stored in rows and columns * "1 observational unit = 1 table" is violated - e.g. subject info and measurements are stored in the same table --- # Example: column labels encoding data * wide-format data |name|Thin Mints|Samoas|Tagalongs| |--|--|--|--| |A|4|0|0| |B|2|0|2| |c|0|3|1| --- # Example: re-structured columns * long-format data |name|flavor|boxes| |--|--|--| |A|Thin Mints|4| |B|Thin Mints|2| |B|Tagalongs|2| |C|Samoas|3| |C|Tagalongs|1| --- # Example: multiple observational units |name|address|flavor|boxes| |--|--|--|--| |A|3828 Piermont Dr|Thin Mints|4| |B|221B Baker St|Thin Mints|2| |B|221B Baker St|Tagalongs|2| |C|124 Conch St|Samoas|3| |C|124 Conch St|Tagalongs|1| --- # Example: separate tables .pull-left[ |name|flavor|boxes| |--|--|--| |A|Thin Mints|4| |B|Thin Mints|2| |B|Tagalongs|2| |C|Samoas|3| |C|Tagalongs|1| ] .pull-right[ |name|address| |--|--| |A|3828 Piermont Dr| |B|221B Baker St| |C|124 Conch St| ] --- # Spreadsheet Formatting * for missing data, use empty cells or `'NA'` - *be careful about `'NA'` for certain types of data (e.g. country code for "Namibia")* * do NOT space out data with empty rows/columns * do NOT use text formatting (e.g. bold/italics) to store information * WATCH OUT: Excel LOVES to convert data into a date format --- # Example: spreadsheet formatting ![Figure 10 from Broman & Woo "Data Organization in Spreadsheets", showing a data table with an outlier highlighted in red as an example of bad formatting, and then an alternative where the outlier status is encoded in its own column.](spreadsheet_formatting.png) Instead of highlighting the cells with outliers, encode outlier status as its own column. --- # Raw Data * store raw data for reproducibility - copy and modify files instead - and/or use scripts to clean/wrangle raw data * document processing steps in code or text - someone else can see what processing was done, and reproduce it - OpenRefine is a great tool for cleaning messy data (and records the steps for you)! https://uf-repro.github.io/OpenRefine-intro/ --- # Data Dictionary * define your rows and columns - e.g. do rows correspond to: + individual subjects + OR data collection sessions + OR individual samples from a measurement device * define codes, categories, acronyms * define relationships between multiple tables * [provide metadata details](https://guides.uflib.ufl.edu/datamanagement/metadata) --- # Example Data Dictionary .compact-table[ |name|plot_name|group|description|type| |--|--|--|--|--| |mouse|Mouse|demographic|Animal identifier|text| |sex|Sex|demographic|Male (M) or Female (F)|factor| |sac_date|Date of sac|demographic|Date mouse was sacrificed|date| |partial_inflation|Partial inflation|clinical|Indicates if mouse showed partial pancreatic inflation|logical| |coat_color|Coat color|demographic|Coat color, by visual inspection|factor| |crumblers|Crumblers|clinical|Indicates if mouse stored food in their bedding|logical| |diet_days|Days on diet|clinical|Number of days on high-fat diet|numeric| ] (modified from Figure 9 of Broman & Woo "Data Organization in Spreadsheets") --- # Thanks * Contact me for additional questions or consultation requests! * Check back in on the libguide for more modules and contact info: - https://guides.uflib.ufl.edu/reproducibility