The Downfall of the Data Engineer

This post follows up on The Rise of the Data Engineer, a recent post that attempted to define data engineering and described how this new role relates to historical and modern roles in the data space.

In this post, I want to expose the challenges and risks that cripple data engineers and enumerate the forces that work against this discipline as it goes through its adolescence.

While the title of this post is sensationalistic and the content quite pessimistic, keep in mind that I strongly believe in data engineering — I needed a strong title that contrasts with my previous article. Understanding and exposing the adversity that the role is facing is a first step towards finding solutions.

Also note that the views expressed here are my own, and are based on observations made while talking to people from dozens of data teams across Silicon Valley. These views are not the views of my employer, or directly related to my current position.

Boredom & context switching

When the idle time between iteration cycles is counted in hours, it becomes tempting to work around the clock to keep your “plates spinning”. When 5–10 minutes of work at 11:30pm can save you 2–4 hours the next day, it tends to lead to an unhealthy work-life balance.

Consensus seeking

Historically, people used the pejorative term “data silo” to designate issues related to heterogeneous analytics that would be scattered across platforms or use incompatible reference data. Silos naturally spawn into existence as projects get started, as teams drift and, inevitably, as acquisitions occur. It’s the task of the business intelligence (now data engineering) teams to solve these issues with methodologies that enforce consensus, like Master Data Management (MDM), data integration, and an ambitious data warehousing program. Nowadays, at modern fast-paced companies, the silo problem quickly grows out of proportion, to the point where you could use the term “dark matter” to qualify the result of the expansion of chaos that is taking place. With an army of not-so-qualified people pitching in, the resulting network of pipelines can quickly become chaotic, inconsistent and wasteful. If the data engineer is the “librarian of the data warehouse”, they might feel like their mission is akin to classifying publications in a gigantic recycling plant.

In a world where dashboard lifecycles are counted in weeks, consensus becomes a background process that can hardly keep up with the rate of change and shifting focus of the business. Traditionalists would suggest starting a data stewardship and ownership program, but at a certain scale and pace, these efforts are a weak force that is no match for the expansion taking place.

Change Management

Since pipelines are typically large and expensive, adequate unit or integration testing can be expected to be proportionally large and expensive. The point being: there’s only so much you can validate with sampled data and dry-runs. And if you thought a single environment was more chaos than you could handle, try to stay sane while throwing in a dev and a staging environment that use intricately different code and data! In my experience, it’s rare to find any sort of decent dev or test environment in the big data world. In many cases, the best you’ll find are some namespaced “sandboxes” that people use to support whatever undocumented process they see fit.

Data engineering has missed the boat on the “devops movement” and rarely benefits from the sanity and peace of mind it provides to modern engineers. Data engineers didn’t miss the boat because they didn’t show up; they missed the boat because the ticket was too expensive for their cargo.

The worst seat at the table

If there’s a data engineer that is part of the conversation at all, it’s probably to help the data scientists and analysts gather the data they need. If the data of interest isn’t already available in the structured part of the data warehouse, chances are that the analyst will proceed with a short-term solution querying raw data, while the data engineer may help in properly logging and eventually carrying that data into the warehouse. Most likely an answer is required in a timely fashion, and by the time the new dimensions and metrics are backfilled into the warehouse, it’s already old news and everyone has moved on. The analyst will get the glory for the insight, and everyone else may question the need for the slow background process of consolidating this new piece of information in the warehouse.

While “impact” (which implies velocity and disruption) is the most sought-after word in employees’ performance reviews, data engineering is condemned to being a slow background process with little short-term impact. Data engineers are many degrees removed from those who are “moving the needle”.

Operational creep

Since data engineering typically comes with a fairly high maintenance burden, operational creep comes fast and disarms engineers faster than you can hire them. Yes, modern tooling helps people be more productive, but arguably that’s only machinery that allows pipeline builders to keep more plates spinning at once.

Moreover, operational creep can lead to high employee turnover, which ultimately leads to low-quality, inconsistent, unmaintainable messes.

Real software engineers?

The role, for the reasons depicted in this article, can suffer from a bad reputation that spins that vicious circle.

But wait — there’s still hope!

With numerous companies plateauing on their data ROI and feeling the frustration of “data operational peak”, it’s inevitable that upcoming innovation will address the pain points described here, and eventually create a new era in data engineering.

One could argue that a possible path forward is de-specialization. If the proper tooling is made available, perhaps simple tasks can be deferred to information workers. Perhaps more complex workloads can become a dimension of common software engineering work, much like what happened to QA and release engineers as continuous delivery technologies and methodologies emerged.

In any case, proper tooling and methodology will define the path forward for the role, and I’m hopeful that it is possible to address most of the root causes leading to the concerns expressed in this post.

I’m planning an upcoming blog post titled “Next generation, data-aware ETL” where I’ll be proposing a design for a new framework that has accessibility and maintainability at its very core. This yet-to-be-built framework would have a set of hard constraints, but in return would provide strong guarantees while enforcing best practices. Stay tuned!

Founder and CEO at Preset, creator of Apache Superset and Apache Airflow
