If you want to know how I crunched the numbers and generated the original graphs in my last graph-loaded blog post, here is an overview:

What Were The Data Sources?

I used the Bureau of Labor Statistics “Current Employment” (CE) datasets as my exclusive data source. The collection consists of over 25,000 individual time series covering almost 100 different data metrics for over 800 industries or industry groupings. It encompasses both seasonally adjusted and non-seasonally adjusted data going back to 1939. The total number of records for the June 2020 data release clocked in at over 7.7 million.
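For context, the CE data is published as a set of tab-delimited flat files. Below is a minimal sketch of how such files can be loaded and filtered in base R; the file names (ce.series, ce.data.0.AllCESSeries) and column names (series_id, seasonal) are assumptions based on the published flat-file layout and should be checked against the BLS documentation before use.

```r
# Minimal sketch: load the CE flat files after downloading them locally from
# the BLS flat-file area. File and column names are assumptions based on the
# published layout; verify them against the BLS documentation.
series_meta <- read.delim("ce.series", stringsAsFactors = FALSE, strip.white = TRUE)
all_data    <- read.delim("ce.data.0.AllCESSeries", stringsAsFactors = FALSE, strip.white = TRUE)

# Keep only seasonally adjusted series and attach the metadata to every
# observation, so each row carries its industry and data-type codes.
sa_meta <- series_meta[series_meta$seasonal == "S", ]
ce      <- merge(all_data, sa_meta, by = "series_id")

nrow(ce)   # rough record count for the seasonally adjusted subset
```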

What Tools Were Used to Generate the Graphs?

The graphs were 100% automatically generated with R and Python 3.8, using only standard libraries and coding techniques found on python.org and in the R help files, manuals, and built-in code examples.
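To give a flavor of what “standard libraries only” looks like on the R side, here is a minimal base-graphics sketch of the kind of barplot involved. The values are placeholders for illustration, not actual BLS figures.

```r
# Illustrative only: a horizontal barplot using nothing beyond base R graphics.
# The values below are placeholders, not real BLS estimates.
emp_change <- c(Construction = -60, Manufacturing = -140,
                "Retail trade" = 105, "Leisure and hospitality" = 2090)  # thousands

png("employment_change.png", width = 900, height = 600)
par(mar = c(5, 14, 4, 2))                 # widen the left margin for long labels
barplot(emp_change, horiz = TRUE, las = 1, col = "steelblue",
        main = "Change in employment by industry (placeholder data)",
        xlab = "Thousands of jobs")
dev.off()
```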

I used a default Linux text editor for coding in R; I used the Eclipse IDE with PyDev for development of the Python components.

What Skills/Ingredients Were Required for the Programming?

Advanced proficiency in OOP concepts, code management, and version control; proficiency in normalized database structures; knowledge of college-level data scraping techniques; an understanding of economic time series, economic jargon, and the theory that underlies all of it; the patience to hunt for information in massive technical documents without getting distracted; a touch of thyme.

Was There an Alternative Approach That Could Have Been Used?

If you simply want to explore U.S. employment data, I would recommend the St. Louis Federal Reserve’s web-based FRED tools. If you want to recreate the graphs I produced, I would recommend downloading selected datasets from FRED and then building the barplots in LibreOffice Calc instead of trying to recreate them within R: R is really only “user-friendly” if you are a moderately experienced programmer and you make obsessive use of the help functions.

The St. Louis Fed FRED web tools are really great, and there is even a limited (albeit incomplete) tabular version of the BLS-categorized labor market data you can click through. While I did find this table to be a bit error-prone and many job categories were missing, what they provide should still be good enough for most casual data sleuths.

It’s worth emphasizing that the raw BLS Current Employment data is not beginner-friendly, and R requires extensive data engineering before it can produce sophisticated graphs from complex time series datasets (like the BLS CE data). Complex data manipulation is needed even after the raw data has been extracted and pre-processed.
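As a concrete, hedged example of what that manipulation looks like, the sketch below continues from the earlier loading snippet: the year and period columns have to be turned into proper dates, annual-average rows dropped, and individual series pulled out before anything can be plotted. The period convention (M01 through M12, with M13 for annual averages) and the example series ID are assumptions drawn from the CE documentation.

```r
# Continuing from the earlier sketch (the merged data frame `ce`): turn the
# year/period columns into real dates and pull out a single series to plot.
ce <- ce[ce$period != "M13", ]                      # M13 = annual averages, drop them
ce$date  <- as.Date(sprintf("%d-%02d-01", ce$year,
                            as.integer(sub("M", "", ce$period))))
ce$value <- as.numeric(ce$value)

# Example: all employees, total nonfarm, seasonally adjusted (assumed series ID).
total_nonfarm <- ce[ce$series_id == "CES0000000001", ]
total_nonfarm <- total_nonfarm[order(total_nonfarm$date), ]
```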

Using FRED in combination with your preferred spreadsheet application really is the way to go for most.
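That said, if you do end up pulling a FRED export into R rather than a spreadsheet, the single-series happy path is short. This sketch assumes a one-series CSV export; the file name and exact column headers vary, so it simply treats the first column as the date and the second as the value.

```r
# Sketch: plot a single-series CSV exported from FRED. "PAYEMS.csv" is just an
# assumed file name; the first column is taken as the date, the second as the value.
fred   <- read.csv("PAYEMS.csv", stringsAsFactors = FALSE)
dates  <- as.Date(fred[[1]])
values <- as.numeric(fred[[2]])

plot(dates, values, type = "l",
     main = "Total nonfarm payrolls (FRED export)",
     xlab = "", ylab = "Thousands of persons")
```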

So How Was the Data ‘Engineered’?

The methods are not a secret: meticulous data categorization and classification. It is the same thing that forms the basis of all quality economic analysis, the foundation of the Kimball Data Warehouse approach, and the point emphasized in every good data guidebook out there.
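In practical terms, for the CE data that categorization mostly means joining the lookup (dimension) tables that ship with the flat files onto the series catalog. The sketch below continues from the earlier loading snippet; the lookup file names (ce.industry, ce.datatype) and their column names are assumptions based on the published flat-file layout.

```r
# Sketch: attach the CE lookup (dimension) tables so every series is labelled
# with readable industry and data-type names instead of opaque codes.
# File and column names are assumptions; verify them against the BLS docs.
industry <- read.delim("ce.industry", stringsAsFactors = FALSE, strip.white = TRUE)
datatype <- read.delim("ce.datatype", stringsAsFactors = FALSE, strip.white = TRUE)

catalog <- merge(series_meta, industry, by = "industry_code")
catalog <- merge(catalog, datatype, by = "data_type_code")

# Every series can now be grouped, filtered, and documented by category.
head(catalog[, c("series_id", "industry_name", "data_type_text")])
```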

Taking the time to understand the context the data exists in, and to properly document and sort through its complexity, is 90% of basic data processing and analysis. The remaining 10% will likely be code optimization work and maybe even some “predictive” modeling, on those days when you want to play psychic.

In the end, it’s not the data engineering that is the most difficult task: what is difficult is keeping people’s attention and avoiding data overload. It’s enormously difficult to do this, and often requires experimenting with different presentation and visualization approaches. But as they say: nothing worth doing is easy.