Note: If you want to follow along, the GitHub repo can be found here.
The GitHub OCTO team recently released their first project: Flat Data. The project aims to offer "a simple pattern for bringing working datasets into your repositories and versioning them." And it succeeds in doing so! I recently incorporated Flat Data into one of my projects, allowing me to finally stop manually updating the data on a semiregular basis (yikes!). While working, I couldn't find any documentation on using R with Flat Data. Here, I'll explain the steps I took to incorporate R scripts into a Flat Data pipeline.
What's Flat Data?
Flat Data solves the problem of carrying out the same repetitive tasks (retrieving, cleaning, and then republishing data) that commonly affects developers who want to present rapidly updating data, such as COVID-19 data that updates daily. And although alternative solutions exist, Flat Data is easy, intuitive, and integrated directly with your GitHub repository (via GitHub Actions):
The idea, as seen above, is essentially to read in data (data.json), conduct some postprocessing (process.js), and output some better data (processed-data.json).
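To make the pattern concrete, here is a minimal TypeScript sketch of what a postprocessing step might do. The record shape (`name`, `value`), the sample rows, and the function name `processData` are all hypothetical, chosen only for illustration; a real pipeline would read the text from data.json and write the result to processed-data.json.

```typescript
// Hypothetical record shapes; a real dataset would have its own fields.
interface RawRecord {
  name: string;
  value: string;
}

interface CleanRecord {
  name: string;
  value: number;
}

// Takes the raw JSON text (the contents of data.json) and returns
// the cleaned JSON text (what would be written to processed-data.json).
function processData(rawJson: string): string {
  const raw = JSON.parse(rawJson) as RawRecord[];

  const clean: CleanRecord[] = raw
    // Drop rows whose value is blank, since Number("") would become 0.
    .filter((row) => row.value.trim() !== "")
    .map((row) => ({ name: row.name.trim(), value: Number(row.value) }))
    // Drop rows whose value could not be parsed as a number.
    .filter((row) => Number.isFinite(row.value));

  return JSON.stringify(clean, null, 2);
}

// Made-up sample input: one usable row and one row with an unparseable value.
const sample =
  '[{"name": " alpha ", "value": "12"}, {"name": "beta", "value": "n/a"}]';
console.log(processData(sample));
```

Keeping the transformation as a pure function of text in and text out makes it easy to try locally before wiring it into a workflow.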
Doing it in R
The most essential step of a Flat Data project is postprocessing. This occurs after data retrieval and before data output, and it can be done in a few different languages. By default, the OCTO team's examples are done in JavaScript/TypeScript, and one user has given an example of postprocessing in Python. To the best of my knowledge, though, there aren't any examples of including R in the postprocessing stage, hence the reason for this post!
Using R in a Flat Data pipeline is as simple as installing the necessary packages and then sourcing your R cleaning script from a postprocessing TypeScript file. Let's explore how that works.
We'll be grabbing data from the Mapping Police Violence homepage, tidying it up, and then republishing it. (This cleaned data is the source for my visualization on police violence.) Here's a sample of the final data output:
01. Set up flat.yml
The first step in any Flat Data pipeline is to create .github/workflows/flat.yml, which will include the configuration for your project. You can do so by using GitHub's VSCode extension, or by creating your own YAML file manually. The YAML file we use in this project is remarkably similar to the boilerplate file, with a few differences:
flat.yml
The tweaks you would make to this workflow are most likely in http_url and schedule. To confirm, visit GitHub's documentation.
02. Postprocess
We pick up at the last line of code in the previous chunk:
flat.yml
Here, we reference a TypeScript file titled postprocess.ts. Upon completion of the data download, GitHub will run this script for any additional processing steps. This file must be a .js or .ts file.
Those who are skilled in data wrangling with JavaScript might be able to write their additional processing in JavaScript itself, but few of us are skilled in data wrangling with JavaScript. Moreover, some users want to migrate their existing projects and workflows to Flat Data, and so including languages other than JavaScript (in this case, R) is essential.
The postprocess.ts file I use in my workflow looks like this (it might help to see how Deno works):
postprocess.ts
The above script is rather simple: it 1) installs packages, and 2) runs the processing script, titled clean.R.
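The order of those two steps matters: the cleaning script should not run if package installation fails. One way to sketch that sequencing in TypeScript is shown here. The `Step` and `Runner` types, the `runSteps` helper, and the package names are all hypothetical. The runner is injected so the logic can be tried without R installed, whereas a real postprocess.ts would hand each command to Deno's subprocess API and wait for its exit status.

```typescript
// Hypothetical types for illustration; not part of Flat Data or Deno.
type Step = { label: string; cmd: string[] };
type Runner = (cmd: string[]) => number; // returns the process exit code

// Runs each step in order and stops at the first non-zero exit code.
function runSteps(steps: Step[], run: Runner): string[] {
  const completed: string[] = [];
  for (const step of steps) {
    const code = run(step.cmd);
    if (code !== 0) {
      throw new Error(`${step.label} failed with exit code ${code}`);
    }
    completed.push(step.label);
  }
  return completed;
}

// The two steps described in the text; the package names are placeholders.
const steps: Step[] = [
  {
    label: "install packages",
    cmd: ["sudo", "Rscript", "-e", "install.packages(c('dplyr', 'readr'))"],
  },
  { label: "run clean.R", cmd: ["Rscript", "./clean.R"] },
];

// A stand-in runner that only logs the command and pretends it succeeded.
const fakeRunner: Runner = (cmd) => {
  console.log(`would run: ${cmd.join(" ")}`);
  return 0;
};

console.log(runSteps(steps, fakeRunner));
```

Failing loudly on a non-zero exit code also makes the problem visible in the workflow logs, instead of letting a later step fail for a less obvious reason.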
The first step is important. Package management was the biggest issue I ran into while setting up this workflow; if you're having issues, pay attention to this step. You'll need to identify all the packages that are required in your R processing script, but you can't install those packages in the script itself, due to virtual machine permissions. You instead have to install them via the command line, using sudo Rscript -e, as I do above (in step 1).
The command sudo Rscript -e precedes any regular function or command that you would run in an R script. Rscript -e evaluates the R expression that follows it directly from the command line, rather than from within a script file. (We add sudo to overcome system user permission problems.) For more, see this page.
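Because the whole R expression travels as a single command-line argument, it can help to build it from a list of package names rather than typing the string by hand. The following sketch assumes a hypothetical helper, `buildInstallCommand`, and placeholder package names. It only constructs the argument array and does not run anything.

```typescript
// Hypothetical helper: builds the argument array for
// sudo Rscript -e "install.packages(c(...))".
function buildInstallCommand(packages: string[]): string[] {
  if (packages.length === 0) {
    throw new Error("at least one package name is required");
  }
  // Simplified check: a letter first, then only letters, digits, and dots.
  // This also keeps quotes and spaces out of the generated R expression.
  const validName = /^[A-Za-z][A-Za-z0-9.]*$/;
  for (const pkg of packages) {
    if (!validName.test(pkg)) {
      throw new Error(`invalid R package name: ${pkg}`);
    }
  }
  const quoted = packages.map((pkg) => `'${pkg}'`).join(", ");
  return ["sudo", "Rscript", "-e", `install.packages(c(${quoted}))`];
}

// Placeholder package names; list whatever your clean.R actually loads.
const installCmd = buildInstallCommand(["dplyr", "readr", "stringr"]);
console.log(installCmd.join(" "));
```

Keeping the package list in one array means that adding a dependency to clean.R only requires touching one line of the postprocessing file.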
03. Clean the data!
My clean.R script, which I reference at the bottom of postprocess.ts, looks like this:
clean.R
Obviously, the content in the above cleaning script is irrelevant. It functions as any other R script would: it reads in data (the file downloaded earlier in the workflow, before postprocess.ts runs), does some cleaning, and then outputs the new data. The real script is around 55 lines. Now you know why keeping the postprocessing in R was preferable!
In sum
Upon completing these steps and pushing the above to a repository, GitHub will automatically set up the action and run it on a daily basis. You can then examine the logs for each run in the Actions tab. This tab will be helpful for debugging, and you can force workflow executions manually here as well. In sum, the process of carrying out a GitHub Flat Data workflow, with the addition of an R postprocessing script, looks something like this:
Thanks for reading! You might learn more by perusing the GitHub repository that accompanies this post; otherwise, please send any questions via Twitter.