PRoMAn – J. Grady Heller

Context

I have ADHD, but even without those challenges it’s pretty obvious that the academy has some issues with organization in general. Academic work is a mess at the base. Being a former soldier, I feel like I know the value of a clear modus operandi. Perhaps I should just ask: what scientist doesn’t want to have some of the mess just handled? I think we’re all tired of administrative or superfluous tasks taking priority over more valuable uses of our time.

In my work, I have tried to navigate this for myself, as we all do, and I developed a decent project management and note-taking concept and some helper functions in R. These ideas are a mix of old military project management ideas, my own observations and trial and error, and ideas I exchanged with other academics in my field who face challenges similar to mine. So, I now have a package full of small tools that try to cut down on the busywork involved in data analysis and academic project work.

I have done my best to present these tools in an R package that I call PRoMAn, for Project Management (Academic). The notes on this page introduce the main ideas in the package, so that you can see whether it is for you. We all have our own unique ways of working with our research, but I hope you find this helpful.

This document is an introductory overview for the ideas in the package.

PRoMAn is available at my GitHub if you would like to see updates or install the package.

Core Ideas

PRoMAn does a number of things that I rely on when working as an academic epidemiologist and clinical researcher:

Creates project folders with a standardized organization scheme and knows where all my files are
Logs and reports work done within project folders so you’re never lost
Creates fallback project statuses - an emergency restore button
Provides functions to help with data cleaning and file management within the project folders
Expedites applying code to multiple files via simple workflows
Provides functions to easily manage figures
Includes functions for my approach to writing which help simulate data based on notes, manage sources, and prepare different parts of research products more quickly (I call the approach “Blitzschreiben”)

The Project Directory

The foundation of the package lies in a standardized organization of project folders. I don’t ever have to wonder where my notes, data, etc. are, and PRoMAn knows where everything is. This is also how all my projects are organized so switching between projects is not difficult. I know we all have our preferences and this isn’t new, but I feel like in general this format is a good basis and it is nice to just push a button and everything is set as expected. Also, this creates a PRoMAn project, which is more than just a directory.

So, whenever I have a new project I just run the corresponding PRoMAn function and I get this:

CURRENT PROJECT STRUCTURE:
------------------------------
test_project
├── data/ ( 0 files)
│   └── source/ ( 2 files)
│   └── cleaned/ ( 0 files)
├── r_programs/ ( 1 files)
├── personal_notes/ ( 0 files)
├── sources/ ( 0 files)
├── old/ ( 0 files)

There is a place for our data (data), both source and cleaned data (source is raw from the source - collected or downloaded, and cleaned is processed, prepped and ready to go), and a folder for R programs (r_programs). Those are probably self-explanatory.

Personal notes is for anything that is not going to be presented to others, and thus isn’t important for writing or reporting. It is a “brain dump” that is just for you. This is great for those of us with ADHD, and in general keeps the project directory clean. Research is a complex and exploratory process that requires creativity, and sometimes that involves some odd things that just don’t make it to collaborators as-is.

The sources folder is intended to hold PDF files of journal articles - literature sources (not data). The old folder is for pushing old and no longer used files out of working areas. It’s basically a project archive while the project is ongoing.

This seems to be a generally acceptable basic structure that I and others can readily use, and technically the project can contain everything thanks to the personal_notes folder.

For reports, abstracts, and other products, I keep those in the root project directory (here it’s “test_project”).

File Management

To start off, PRoMAn creates the standard project directory when told, given a project name. Just enter a name in the create_project() function, and then there you go. You’ll have the exact setup shown above.
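To give a sense of what setting up such a scaffold involves, here is a minimal sketch in base R. It is only an illustration and not PRoMAn's actual code: the helper name make_skeleton() and its arguments are hypothetical, and only the folder names are taken from the tree shown above.

```r
# Hypothetical helper, not part of PRoMAn: builds the folder layout shown above.
make_skeleton <- function(name, root = tempdir()) {
  project_root <- file.path(root, name)
  subfolders <- c(
    file.path("data", "source"),
    file.path("data", "cleaned"),
    "r_programs",
    "personal_notes",
    "sources",
    "old"
  )
  for (folder in subfolders) {
    # recursive = TRUE also creates the parents (the project root and data/)
    dir.create(file.path(project_root, folder),
               recursive = TRUE, showWarnings = FALSE)
  }
  invisible(project_root)
}

# Build a throwaway project in the session's temp directory and list its folders
project_root <- make_skeleton("test_project")
list.dirs(project_root, full.names = FALSE, recursive = TRUE)
```

The sketch only creates folders. A real PRoMAn project also carries its own project information, which this example does not attempt to reproduce.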

When developing PRoMAn, I wanted to be able to tell R: “here are my files in a pile, now just give me what I want from the pile.” This led me to create functions in PRoMAn that track your project directory, keep track of your file paths, and then allow for easy calling of data and scripts while working in R. Just run the set_project() function, and you’re set: PRoMAn is then properly oriented and has all the info it needs for your project. There is actually a hidden lexicon for each project in the project root, .proman. It keeps track of where your folders are, as well as other important information that can be used to make producing scientific works easier.

PRoMAn allows for recognition of sets of files within data folders, and you can pull a set when conducting analysis. I.e., PRoMAn allows you to pull what you want without having to even look in the folder or worry about the folder itself being organized. It can be completely disorganized and you’ll still get specifically what you want if you make use of the file set system.

PRoMAn does this by using keywords and filenames. If you use the function select_files_by_keyword() and provide it a keyword (say “covid_counts”), it will define all the files that contain “covid_counts” in the name as a file set. You can also select files to add a keyword to. So, if you happen to have covid_counts files for each state, you don’t have to do anything to pull them into your R environment other than define the set with, for example, files <- select_files_by_keyword("covid_counts"). This is especially helpful if, for example, you are looping through files for US states to build a dataset, performing cleaning and variable creation, then outputting a file into the cleaned folder for each state. Just loop to your heart’s content, then don’t worry about the mess as long as you use descriptive filenames.
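The keyword idea can be approximated in a few lines of base R. This is a rough sketch of the concept rather than PRoMAn's implementation: files_with_keyword() is a hypothetical stand-in, and the folder and file names are made up for the demonstration.

```r
# Hypothetical stand-in for a keyword-based file set; not PRoMAn's own code.
files_with_keyword <- function(folder, keyword) {
  all_files <- list.files(folder, full.names = TRUE)
  # fixed = TRUE treats the keyword as literal text, not a regular expression
  all_files[grepl(keyword, basename(all_files), fixed = TRUE)]
}

# Demo in a throwaway folder with made-up file names
demo_dir <- file.path(tempdir(), "keyword_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("covid_counts_ohio.csv",
                                  "covid_counts_texas.csv",
                                  "population.csv")))

covid_files <- files_with_keyword(demo_dir, "covid_counts")
basename(covid_files)
```

Matching on the file name alone is what makes the state of the folder irrelevant: anything with the keyword in its name is picked up, wherever it sits among the other files.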

There is also a function to load a file set into your R environment, if you want to do that directly. Otherwise, it might simply be easier to loop over a list of files. Either way, PRoMAn can speed that process up.

PRoMAn also has a save function, which will save your files with a timestamp. The timestamps include date and time, so you can initiate a save at any time without overwriting previous versions. These timestamps also help with report generation and other functions in PRoMAn, so don’t overlook the save function.
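A timestamped save is easy to picture with a small base R sketch. The helper save_with_timestamp() and the file name pattern below are assumptions for illustration; PRoMAn's own save function may name and store files differently.

```r
# Hypothetical helper, not PRoMAn's save function.
save_with_timestamp <- function(df, folder, stem) {
  # Date and time down to the second, so repeated saves get distinct names
  stamp <- format(Sys.time(), "%Y%m%d_%H%M%S")
  path <- file.path(folder, paste0(stem, "_", stamp, ".csv"))
  write.csv(df, path, row.names = FALSE)
  invisible(path)
}

example_df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
saved_path <- save_with_timestamp(example_df, tempdir(), "cars_cleaned")
basename(saved_path)
```

Putting the stamp in the file name keeps every version on disk and makes it possible to sort versions by name alone.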

Additionally, to keep the project directory clean, there is a function to gather items and put them in the folder for old things. If you get to the end of a period of work, and you see things have gotten a bit messy, just use archive() with the names of all the things you want to push to the old folder.
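Pushing files into the old folder boils down to a move operation. The sketch below uses a hypothetical move_to_old() helper in base R and is not the code behind archive(); it also does not guard against name clashes inside old.

```r
# Hypothetical helper, not PRoMAn's archive(): moves files into an "old" folder.
move_to_old <- function(paths, project_root) {
  old_dir <- file.path(project_root, "old")
  dir.create(old_dir, recursive = TRUE, showWarnings = FALSE)
  targets <- file.path(old_dir, basename(paths))
  moved <- file.rename(paths, targets)   # TRUE for each file that was moved
  invisible(targets[moved])
}

# Demo with a throwaway project folder and a made-up draft file
archive_root <- file.path(tempdir(), "archive_demo")
dir.create(archive_root, showWarnings = FALSE)
draft <- file.path(archive_root, "draft_v1.txt")
writeLines("old draft", draft)

archived <- move_to_old(draft, archive_root)
```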

Project Work Logging

PRoMAn generates reports that show what you’ve accomplished in the last day by default, or over however many days you specify. It does not generate them on its own schedule: a report is produced when you ask for one, and when you close R, PRoMAn will prompt you with the option to generate one.

This report will do a few things:

Examine file changes
Print the current directory tree (like the tree shown above for the test_project example), so if you add files or move folders somehow, it will show up there
Detect and report file sets
Provide a summary of recent activity
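The file-change part of such a report can be sketched with file modification times. This is an illustrative base R example under assumed names (recent_files() is hypothetical), not the code PRoMAn uses to build its logs.

```r
# Hypothetical helper, not PRoMAn's report code: lists files changed in the last n days.
recent_files <- function(folder, days = 1) {
  paths <- list.files(folder, recursive = TRUE, full.names = TRUE)
  info <- file.info(paths)
  cutoff <- Sys.time() - days * 24 * 60 * 60   # look-back window in seconds
  recent <- info[info$mtime >= cutoff, , drop = FALSE]
  data.frame(
    file = basename(rownames(recent)),
    modified = format(recent$mtime, "%m/%d %H:%M"),
    stringsAsFactors = FALSE
  )
}

# Demo: write a file now, then ask what changed in the last day
log_dir <- file.path(tempdir(), "log_demo")
dir.create(log_dir, showWarnings = FALSE)
writeLines("x <- 1", file.path(log_dir, "analysis.R"))

activity <- recent_files(log_dir, days = 1)
```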

Here’s an example report text file:

PROJECT STATUS LOG - sales_analysis
Generated: 2023-12-01 14:30:15
==================================================

RECENT FILE ACTIVITY:
-------------------------
  Data (source) : 3 files modified
    - raw_sales.csv ( 12/01 09:15 )
    - customer_lookup.csv ( 12/01 10:22 )
  R Programs : 2 files modified
    - data_cleaning.R ( 12/01 13:45 )
    - analysis.R ( 12/01 14:15 )

CURRENT PROJECT STRUCTURE:
------------------------------
sales_analysis
├── data/ ( 2 files)
│   └── source/ ( 3 files)
│   └── cleaned/ ( 1 files)
├── r_programs/ ( 2 files)
├── personal_notes/ ( 1 files)
├── sources/ ( 0 files)
├── old/ ( 0 files)

FILE SETS DETECTED:
--------------------
Source Data :
  - Customer files : 2 files
    * customer_data.csv
    * customer_lookup.csv

RECENT ACTIVITY SUMMARY:
-------------------------
📊 Data pipeline active: 3 new source files, 1 processed files
🔬 Analysis in progress: 2 R scripts modified
🔄 Full workflow active: collecting data AND running analysis

These log files are put into the personal notes folder, in a subfolder called logs.

Folder Tree Limitations

The logs folder is not going to show up in the tree diagram of the log. The tree doesn’t show everything, or else it would be massive; some things, such as the logs folder, were left out in order to keep it readable and concise.

The tree may also not always represent the current status - but if not, PRoMAn should still have correct pathways in its lexicon and functionality should never break, even if you accidentally move folders around in the project directory.

Fallback System

PRoMAn is supposed to do a good job of keeping the project directory in order, but things happen. Things get moved or deleted or corrupted. Just in case, PRoMAn has a function called set_fallback() which will save a snapshot of your entire project directory. If anything unexpected and untoward happens, you can call that fallback and your project will be restored to the status it had when you set the fallback.
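Conceptually, a fallback is a full copy of the project that can be copied back later. The base R sketch below shows that idea with two hypothetical helpers, snapshot_project() and restore_project(); it is not how set_fallback() is implemented, and it assumes the snapshot folder lives outside the project directory.

```r
# Hypothetical helpers, not PRoMAn's set_fallback(): copy a project to and from a snapshot.
snapshot_project <- function(project_root, snapshot_dir) {
  unlink(snapshot_dir, recursive = TRUE)        # keep only the latest snapshot
  dir.create(snapshot_dir, recursive = TRUE)
  items <- list.files(project_root, all.files = TRUE, no.. = TRUE, full.names = TRUE)
  file.copy(items, snapshot_dir, recursive = TRUE)
  invisible(snapshot_dir)
}

restore_project <- function(project_root, snapshot_dir) {
  unlink(project_root, recursive = TRUE)        # destructive: wipes the current state
  dir.create(project_root, recursive = TRUE)
  items <- list.files(snapshot_dir, all.files = TRUE, no.. = TRUE, full.names = TRUE)
  file.copy(items, project_root, recursive = TRUE)
  invisible(project_root)
}

# Demo: snapshot, simulate an accident, restore
fb_root <- file.path(tempdir(), "fallback_demo")
fb_snap <- file.path(tempdir(), "fallback_demo_snapshot")
unlink(fb_root, recursive = TRUE)
dir.create(file.path(fb_root, "data"), recursive = TRUE)
writeLines("a,b", file.path(fb_root, "data", "raw.csv"))

snapshot_project(fb_root, fb_snap)
file.remove(file.path(fb_root, "data", "raw.csv"))
restore_project(fb_root, fb_snap)
```

Restoring replaces the whole project with the snapshot, so anything created after the snapshot was taken is lost in this sketch.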

Working with Data

For the most part, PRoMAn was not meant to focus on data cleaning, but there are a few helpful functions included that can examine data quality, produce basic reports, and give quick glances at data. It can also compare datasets and find differences, which helps with debugging or with checking analysis between source and cleaned data.

One of my earliest and favorite functions I added to PRoMAn was a cleaning function that cleans strings within variable values - a simple function but extremely helpful when trying to merge or join data using character variables.
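To show why string cleaning matters for joins, here is a small base R sketch. The clean_strings() helper and its specific rules (trim, lower-case, drop punctuation, collapse spaces) are assumptions chosen for illustration; PRoMAn's cleaning function may apply different rules.

```r
# Hypothetical helper, not PRoMAn's cleaning function.
clean_strings <- function(x) {
  x <- trimws(x)                          # drop leading/trailing whitespace
  x <- tolower(x)                         # ignore case differences
  x <- gsub("[[:punct:]]", "", x)         # drop punctuation
  x <- gsub("[[:space:]]+", " ", x)       # collapse repeated whitespace
  x
}

# Two made-up tables whose keys differ only in case, spacing, and punctuation
left  <- data.frame(state = c(" New York", "OHIO ", "n. carolina"),
                    cases = c(10, 20, 30), stringsAsFactors = FALSE)
right <- data.frame(state = c("new york", "Ohio", "N Carolina"),
                    pop = c(1, 2, 3), stringsAsFactors = FALSE)

left$state  <- clean_strings(left$state)
right$state <- clean_strings(right$state)
merged <- merge(left, right, by = "state")
```

Without the cleaning step, none of the three keys would match exactly across the two tables, and the merge would return no rows.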

A function that I end up using more often than I expected, though, is compare_datasets(). It does what it sounds like: it compares two files and marks differences in variables and counts.
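A comparison along these lines can be sketched in base R. The compare_frames() helper below is hypothetical and works on data frames already in memory; it illustrates the kind of differences worth checking, not the actual output of compare_datasets().

```r
# Hypothetical helper, not PRoMAn's compare_datasets().
compare_frames <- function(a, b) {
  shared <- intersect(names(a), names(b))
  list(
    only_in_a = setdiff(names(a), names(b)),
    only_in_b = setdiff(names(b), names(a)),
    row_counts = c(a = nrow(a), b = nrow(b)),
    # shared variables whose class differs between the two data frames
    class_changes = Filter(
      function(v) !identical(class(a[[v]]), class(b[[v]])),
      shared
    )
  )
}

# Made-up source and cleaned data
source_df  <- data.frame(id = 1:4, age = c("31", "45", "52", NA), site = "A",
                         stringsAsFactors = FALSE)
cleaned_df <- data.frame(id = 1:3, age = c(31, 45, 52), bmi = c(22.1, 27.4, 30.2))

diffs <- compare_frames(source_df, cleaned_df)
```

Here the result flags that site was dropped, bmi was added, age changed from character to numeric, and one row was lost between source and cleaned data.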

Perhaps the largest time-saver in this section, though, is PRoMAn’s system for rudimentary workflows. There are a number of workflow packages in R, but PRoMAn takes a much simpler approach that might be more approachable or convenient. If you write a chunk of code and want that code to be applied to multiple files, PRoMAn has functions that can read in a marked chunk, parse it as a component to apply, and chain these components in a master script if desired. So, I often just create a file set, play around until I get a bit of code that works, then mark it and apply it to the file set. You don’t have to translate it into a loop or function, and if you don’t need the extras in a proper workflow/pipeline package, it’s much faster. Just mark and fire.
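The "mark and fire" idea can be illustrated in base R with parse() and eval(). Everything specific in this sketch is an assumption: the "# >>>" and "# <<<" markers, the helper names, and the convention that the chunk works on an object called dat are all hypothetical, and PRoMAn's own marking syntax may differ.

```r
# Hypothetical markers and helpers; PRoMAn's own marking syntax may differ.
extract_marked_chunk <- function(script_lines, name) {
  start <- which(script_lines == paste("# >>>", name))
  end   <- which(script_lines == paste("# <<<", name))
  # exactly one start and one end marker, with at least one line between them
  stopifnot(length(start) == 1, length(end) == 1, end - start > 1)
  parse(text = script_lines[(start + 1):(end - 1)])
}

apply_chunk <- function(chunk, files) {
  lapply(files, function(path) {
    env <- new.env()
    env$dat <- read.csv(path)   # the chunk is assumed to work on an object named `dat`
    eval(chunk, envir = env)    # run the marked code against this file's data
    env$dat
  })
}

# A script with one marked chunk (kept as text here for the demo)
script <- c(
  "# exploratory code above the marker is ignored",
  "# >>> add_rate",
  "dat$rate <- dat$cases / dat$pop * 1000",
  "# <<< add_rate"
)

# Two made-up input files
wf_dir <- file.path(tempdir(), "workflow_demo")
dir.create(wf_dir, showWarnings = FALSE)
write.csv(data.frame(cases = c(5, 10), pop = c(1000, 4000)),
          file.path(wf_dir, "counts_a.csv"), row.names = FALSE)
write.csv(data.frame(cases = 2, pop = 500),
          file.path(wf_dir, "counts_b.csv"), row.names = FALSE)

chunk <- extract_marked_chunk(script, "add_rate")
results <- apply_chunk(chunk, list.files(wf_dir, pattern = "^counts_", full.names = TRUE))
```

Evaluating the chunk in a fresh environment per file keeps one file's intermediate objects from leaking into the next, which is the main thing a hand-written loop tends to get wrong.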

The Blitzschreiben Functions

I have a particular approach to project batch processing and writing work, which I call Blitzschreiben. This is an entirely separate concept that is largely described elsewhere, so I’ll keep it short here.

Briefly, Blitzschreiben is an attempt to break the process of writing into core components, efficiently manage multiple projects, and produce scientific deliverables. It has three main stages: I) idea triage and development, II) analysis (mock and real), and III) writing by synthesizing earlier stages and expanding.

Each stage is intended to be done quickly (well, relative to the task… I am afraid writing still does take time) but also in batches, so that it is more reasonable to produce two things at once rather than one. I’ll mention a few highlights, but the “Blitzschreiben” post on my site explains the idea in detail.

In general, PRoMAn can generate documents for working in this mode. These docs are intended to be human-readable and PRoMAn-parseable, so if you just fill them out, the information can be carried forward. One of my favorite examples is that there is a data table in the first stage document, in which you can write your needed and expected variables and information about them for your analysis plan. It’s something many of us do already anyway. If you do fill this out, the second stage document will have code in it to simulate a dataset based on those variables, so you can start to test your analyses right away.
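Simulating a dataset from a variable table might look something like the following base R sketch. The layout of variable_plan, the column names, and the simulate_from_plan() helper are all hypothetical; the code PRoMAn places in the second stage document may be organized quite differently.

```r
# Hypothetical variable plan; PRoMAn's stage documents may store this differently.
variable_plan <- data.frame(
  name   = c("age", "sex", "sbp"),
  type   = c("numeric", "categorical", "numeric"),
  mean   = c(50, NA, 130),
  sd     = c(12, NA, 15),
  levels = c(NA, "female|male", NA),
  stringsAsFactors = FALSE
)

# Hypothetical simulator: one column per row of the plan
simulate_from_plan <- function(plan, n, seed = 1) {
  set.seed(seed)   # reproducible mock data
  columns <- lapply(seq_len(nrow(plan)), function(i) {
    if (plan$type[i] == "numeric") {
      rnorm(n, mean = plan$mean[i], sd = plan$sd[i])
    } else {
      choices <- strsplit(plan$levels[i], "|", fixed = TRUE)[[1]]
      sample(choices, n, replace = TRUE)
    }
  })
  names(columns) <- plan$name
  as.data.frame(columns, stringsAsFactors = FALSE)
}

mock <- simulate_from_plan(variable_plan, n = 100)
str(mock)
```

Mock data like this carries no real signal, but it has the right variable names and types, which is enough to write and debug analysis code before the real data arrive.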

Another thing these functions can do is generate a BibTeX bibliography. If your PDF sources are in your sources folder, as they likely would be, then why should you need to do anything else to have citations for those sources in this day and age? So, there’s a function that reads all of those as best it can and generates a bibliography. Even paid reference management applications make errors, but at least this one is just “push button, get bibliography.” Naturally, you can also readily edit the entries and create an annotated bibliography for your use: annotated_bibliography() creates a Quarto document that lists your sources and allows you to add annotations. When you are done, you can run update_bibliography() to put any new information or changes from this Quarto document (or from any other BibTeX files you might add to your sources folder from a reference management application) into your BibTeX document. If you are concerned about its accuracy: this is the one place where dependencies are used in PRoMAn, so that PDFs are parsed as well as possible and information from CrossRef via DOI lookup is prioritized.

The functions in PRoMAn for Blitzschreiben are often based on the idea that it’s now 2025 and if the information already exists in our project folder, can we not just access that when we need it?

Ending Notes

There are a few other little helper functions in this toolbox for R, but here I just wanted to hit some highlights and introduce it; the vignette and other posts/docs provide more information and examples.

In addition to a standard vignette, PRoMAn has a “cheat sheet” function that will pull up an abbreviated list of functions and a few words on what they do. It will appear in a separate window, so it can be moved adjacent to your R workspace and used as a reference.

I’m also not sure when you are reading this, but it is very obvious that R has issues with packages that fall apart after release. PRoMAn was made with as few dependencies as possible, so it is hopefully robust to deprecation, and the functions in PRoMAn are tools that I personally use regularly so they should be updated as a matter of course.

LLM Disclaimer

This is a one-man project. A lot of the code in PRoMAn was written with the help (or hindrance) of Anthropic’s Claude models.