::The Whole Data Stack Industry Is Focused on the Wrong Problem

The tools designed to help are now the problem. Data engineers spend more time maintaining the infrastructure of their stack, and learning the many tools built to fix the issues it created, than working with data. Most new data stack products are designed to fix problems caused by the stack itself.

The Philosophy That Started It All

The modern data stack was built on one philosophy: ingest everything, figure out value later.

This is source-first thinking. It seemed obvious in 2012 when cloud data warehouses dropped costs from $100,000/year to $160/month. The race was on to move data faster, store more of it, and worry about value creation downstream.

Source-first. SQL-only. Developer-focused. These assumptions shaped every tool that followed.

As a Result

Data engineers now spend more time fixing their stack and the issues it creates than working with data and delivering results.

Stat	Source
897 apps average per enterprise	MuleSoft 2025
Only 28% of apps integrated	MuleSoft 2025
40% of IT time spent on integration	MuleSoft 2025
70% of data leaders say stack is "too complex"	Modern Data 101
Most BigQuery customers store less than 1TB	Jordan Tigani

The industry built infrastructure for hyperscale when most teams needed simplicity. We optimized for "how fast can we ingest raw data" when the right question was "what data products does the business actually need?"

What Went Wrong

Source-first - ingest everything, figure out value later. Storage is cheap, ingestion is often free or close to it, so why not? The problem isn't getting data in. It's what happens after: every time that data gets queried, moved, transformed, or exported, consumption-based charges stack up.

SQL & Python only - the entire stack assumes you speak SQL or Python. Every transformation is a query or a script, and every answer needs a technical translator. Business users can't get their own answers. Analysts become bottlenecks. The gap between a question and an answer gets measured in days, not seconds.

Developer-focused - these tools were built by engineers, for engineers. Config in YAML and JSON, debugging in terminals, logs everywhere. The people who actually live closest to the data problems - ops, finance, marketing - have no way in. They can't touch any of it.

Every tool category that followed - ingestion, transformation, orchestration, quality, lineage, observability, reverse ETL - exists to patch a gap left by these original assumptions.

"We got distracted by circular problems of our own making. We created pipelines to shuffle data around, and orchestrators to coordinate those pipelines, and observability dashboards to monitor the orchestrators, and incident managers to organize the observability incidents."

— Benn Stancil, Co-founder Mode Analytics

"Most BigQuery customers store less than 1TB."

— Jordan Tigani, BigQuery Founding Engineer

::Data Engineer Burnout

39% considering quitting - 78% wish their job came with a therapist

The Human Cost

The wrong philosophy didn't just create technical debt. It broke the people doing the work.

Data engineers were hired to work with data. Instead, they spend their days debugging Airflow DAGs, managing Kubernetes clusters, writing glue code between tools, and firefighting pipeline failures at 2 AM.

The tools designed to help became the full-time job.

"Data teams spent more time maintaining infrastructure than delivering insights."

— Modern Data 101

Stat	Source
39% of data engineers considering quitting due to burnout	Immuta
78% wish their job came with a therapist	Industry survey
77% report heavier workloads despite AI tools	Upwork 2024
67 monthly data incidents average per org	Wakefield Research 2023
15 hours average to resolve an incident (up from 5.5)	Wakefield Research 2023
40% of time spent on integration, not data work	MuleSoft 2025

The Daily Reality

Hired to build data products. Actually doing: YAML debugging. Credential rotation. Dependency conflicts. Version mismatches. Backfill jobs. Incident triage. Vendor management. Cost optimization. Security audits. Compliance documentation.

The "modern" in modern data stack didn't mean modern work. It meant more work.

"Integration nightmares multiplied - teams became 'glue code developers.'"

— Modern Data 101

::Data Quality Issues

Most data quality tools exist because most systems don't validate data before pushing it further down the pipeline - and almost nobody runs data-product-specific quality checks when a data product is created or updated.

Bad data lands in your warehouse or lake, and if you're lucky, or you've paid for the right tooling, you might catch the problem before it causes a production error. That's not usually how it goes.

So quality has become its own separate product category - its own tooling, its own cost, its own team to run it. Another layer bolted onto the stack to compensate for something the stack should have handled in the first place.
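
As a rough illustration of validating records before they move further down a pipeline, here is a minimal sketch in Python. The "orders" data product, its field names, and the three rules are all hypothetical, chosen only to show the shape of a product-specific check. A real rule set would come from whoever owns the data product.

```python
from typing import Any, Callable

# Hypothetical rules for an illustrative "orders" data product.
# Each rule takes a record and returns True when the record passes.
Rule = Callable[[dict], bool]

RULES: dict[str, Rule] = {
    "order_id is present": lambda r: bool(r.get("order_id")),
    "amount is a non-negative number": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ),
    "currency is a 3-letter code": lambda r: (
        isinstance(r.get("currency"), str) and len(r["currency"]) == 3
    ),
}


def validate(record: dict) -> list[str]:
    """Return the names of the rules the record fails (empty list means valid)."""
    return [name for name, rule in RULES.items() if not rule(record)]


def split_batch(records: list[dict]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Separate a batch into accepted records and rejected records with reasons."""
    accepted, rejected = [], []
    for record in records:
        failures = validate(record)
        if failures:
            rejected.append((record, failures))
        else:
            accepted.append(record)
    return accepted, rejected


batch = [
    {"order_id": "A1", "amount": 10.0, "currency": "USD"},
    {"order_id": "", "amount": -5, "currency": "US"},
]
accepted, rejected = split_batch(batch)
# Only `accepted` would be pushed downstream; `rejected` keeps the reasons
# so the source owner can fix the data instead of the warehouse absorbing it.
```

The design choice is that rejection happens at the boundary, with reasons attached, rather than being discovered later by a separate quality tool.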

Stat	Source
$12.9M annual cost of poor data quality	Gartner
67% don't trust their data for decisions	Precisely/Drexel 2024
64% say quality is top challenge (up from 50%)	Precisely 2024

"Data quality is usually one of the goals of effective data management. Yet too often organisations treat it like an afterthought."

— Gartner

"Most organisations decide to address issues in a piecemeal fashion... No wonder this is only a tactical solution; sooner or later, we need to start working on another tactical project to resolve the issues caused by the previous tactical project."

— Dan Sutherland, Senior Director, Protiviti

::Consumption Pricing

Every major vendor runs on consumption pricing. Warehouses charge per credit, compute platforms charge per unit, ingestion tools charge per row. On top of that, egress costs - what you pay to move data out of a cloud environment - can dwarf the storage costs themselves, and are often the last thing teams think about until the bill arrives. Then there are the skills costs: the platform certifications, the specialists you need to hire or retain just to operate and optimise these tools. The bill goes up the more you store, process, and move data around, regardless of whether any of it is delivering value.

Stat	Source
62% exceeded cloud budget in 2024	Wasabi Index 2025
86% of CIOs planning some repatriation	Barclays CIO Q4 2024
$100B+ market cap lost to cloud costs	Andreessen Horowitz

"Close to half of cloud buyers spent more on cloud than they expected in 2023, with 59% anticipating similar overruns in 2024."

— Daniel Saroff, Group VP, IDC

Minimum charges punish small queries. Cloud infrastructure costs typically exceed platform charges by 50-200%. Pricing changes trigger overnight cost increases. CFOs can't forecast. Finance teams treat data as an unpredictable expense.

The incentives are misaligned. Vendors profit from volume. Customers benefit from value. Storing everything "just in case" is expensive for you and profitable for them.

Case Studies

37signals reduced AWS spend from $3.2M to $1.3M annually, projecting $10M+ savings over 5 years by leaving the cloud.
GEICO achieved 50% compute cost reduction and 60% storage cost reduction through cloud repatriation.

"We're paying an at times almost absurd premium for the possibility that workloads could spike. It's like paying a quarter of your house's value for earthquake insurance when you don't live anywhere near a fault line."

— DHH, CTO 37signals (Basecamp/HEY)

::AI / The Chatbot Problem

Pretty much every data vendor has bolted on the same AI feature: a chatbot that writes SQL, generates DAGs, or builds pipelines. They call it a "copilot" or "analyst" or "assistant" depending on the marketing budget. They all do roughly the same thing, and they all run into the same wall: the underlying architecture was never built with AI in mind.

Putting a natural language interface on top of a complex system doesn't make it intelligent. It just makes it a bit easier to poke at while hiding everything broken underneath - and it increases the security risk footprint, since you're now exposing data access through an AI layer that wasn't part of the original threat model.

Stat	Source
95% of AI pilot projects delivered no measurable P&L impact	MIT "GenAI Divide" Study, July 2025
30% of GenAI projects abandoned after POC	Gartner 2025
77% of employees say AI tools added to their workload	Upwork, July 2024
40% of European "AI startups" had no real AI	PwC/MMC Ventures
$400K in SEC fines for AI washing claims	SEC, March 2024
GenAI now in "Trough of Disillusionment"	Gartner Hype Cycle 2025
For every 33 AI POCs launched, only 4 reach production	IDC

The AI-Native vs AI-Augmented Distinction

AI-augmented/powered: a traditional system with AI stuck on top. The core product still depends on the old architecture, just with a third-party model plugged in somewhere. The original design constrains everything it can do.

AI-native: built from scratch with intelligence as a core part of how it works. Take the AI out and the product stops making sense. It's not a feature, it's the foundation.

Technical comparisons show AI-native architectures running at 2-5x better performance on latency and throughput vs. the bolted-on approach. They also carry significantly less risk of hallucination and a lower chance of corrupting production data - both of which are real failure modes when an LLM is generating queries or pipelines against live systems. That gap in both performance and safety tends to widen as usage scales.

"When AI simply makes the product easier to use, that's AI-washing. A natural language front-end for a workflow - yes, technically it's AI - but it's just masking the complexity of the workflow with a more accessible customer interface."

— CMSWire

"LLMs can write SQL, but they are often prone to making up tables, making up fields, and generally just writing SQL that if executed against your database would not actually be valid."

— LangChain Documentation
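
One common guard against made-up tables and fields is to compile generated SQL against an empty copy of the schema before it gets near a live system. The sketch below uses Python's standard `sqlite3` module and SQLite's `EXPLAIN`, which prepares a statement without running the underlying query. The `orders` schema and the `check_sql` helper are hypothetical. A real warehouse would need its own dialect and its own dry-run mechanism.

```python
import sqlite3
from typing import Optional


def check_sql(schema_ddl: str, query: str) -> Optional[str]:
    """Return None if the query compiles against the schema, else the error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)      # empty copy of the schema, no data
        conn.execute(f"EXPLAIN {query}")    # compile only; the query itself is not run
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()


# Hypothetical schema used only for illustration.
SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);"

ok = check_sql(SCHEMA, "SELECT id, amount FROM orders")
bad_column = check_sql(SCHEMA, "SELECT total FROM orders")
bad_table = check_sql(SCHEMA, "SELECT * FROM customers")
```

This only catches references that do not exist. It says nothing about whether a valid query answers the question that was asked, which is the harder problem.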

::Data Sovereignty

Data sovereignty is no longer optional. Regulations require data to stay within borders. Cloud providers operate across jurisdictions. Compliance teams can't guarantee where data lives, who can access it, or which government can subpoena it.

The cloud promised flexibility. For regulated industries, it delivered risk.

Stat	Source
84% of European orgs using/planning sovereign cloud	IDC 2024
86% of CIOs planning some repatriation	Barclays CIO Q4 2024
GEICO: 50% compute reduction, 60% storage reduction	The Stack 2024

Most cloud providers spread workloads across regions by default, which conflicts with residency rules in a growing number of jurisdictions.

"The European cloud market is growing exponentially; however, the share held by European providers is diminishing. This trend poses a significant concern for Europe's technological sovereignty."

— Manuel Mateo Goyet, DG Connect, European Commission

"Ten years into that cloud journey, GEICO still hadn't migrated everything to the cloud, their bills went up 2.5X, and their reliability challenges went up quite a lot too."

— Rebecca Weekly, VP Platform Engineering, GEICO

::DevOps & Developer

SQL & Python required - Business users locked out entirely

The Problem

The entire data stack assumes you're a developer. Every tool requires SQL or Python. Configuration lives in YAML and JSON. Debugging happens in terminals and logs.

If you can't code, you can't participate. The people closest to the data problems - operations, finance, marketing - are locked out of the tools designed to solve them.

Stat	Source
897 apps average per enterprise	MuleSoft 2025
Only 28% of apps integrated	MuleSoft 2025
70% of data leaders say stack is "too complex"	Modern Data 101

Who Gets Left Behind

Business users can't self-serve - they file tickets. Analysts become translators instead of analysts. Operations teams closest to data problems can't touch the tools. Finance waits days for reports that should take seconds.

The people who need data most are furthest from it.

The Stack They Expect You To Know

::SQL ::Python ::YAML ::JSON ::Git ::Docker ::Kubernetes ::Terraform ::Airflow ::dbt ::Spark

Each tool has its own syntax, its own mental model, its own ways of failing. The barrier isn't just high, it's deliberately technical, and it keeps getting taller. And across all these layers there is no provenance of changes - no reliable way to trace what was changed, by which tool, at which point in the chain.

"The 'freedom to choose' that once characterized the Modern Data Stack is quietly giving way to a controlled substrate that vendors can both standardize and monetize."

— Modern Data 101

::Orchestration

Orchestration tools exist because pipeline components don't coordinate themselves. You need a central scheduler to manage dependencies, sequence tasks, and handle failures. The scheduler becomes another system to maintain, debug, and keep running.
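
To make the scheduler's job concrete, here is a minimal sketch of dependency ordering and failure handling using Python's standard `graphlib` module. The four-task pipeline and the `run` helper are hypothetical. This is not how Airflow or any specific orchestrator is implemented, only the core idea they share.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "publish": {"join"},
}


def run(pipeline, actions):
    """Run tasks in dependency order; skip anything downstream of a failure."""
    status = {}
    for task in TopologicalSorter(pipeline).static_order():
        # Dependencies always come earlier in the order, so their status is known.
        if any(status.get(dep) != "ok" for dep in pipeline[task]):
            status[task] = "skipped"
            continue
        try:
            actions[task]()
            status[task] = "ok"
        except Exception:
            status[task] = "failed"
    return status


def failing_extract():
    raise RuntimeError("source API unavailable")


actions = {task: (lambda: None) for task in PIPELINE}
actions["extract_customers"] = failing_extract

result = run(PIPELINE, actions)
```

Even this toy version needs a dependency graph, a status store, and a failure policy. Retries, scheduling, and backfills are what turn it into a system of its own.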

Stat	Source
70% rate pipeline management "complex"	Matillion 2025
89% of Airflow users expect more revenue-generating or external solutions this year	Astronomer 2026
32% of Airflow users have GenAI or MLOps in production	Astronomer 2026

"All in all, Airflow is far from perfect, and many of us have merely learned to deal with its limitations."

— Ben Rogojan (Seattle Data Guy), Independent Consultant

"Despite any criticism, Airflow continues to play a pivotal role, much like PHP of the data sector - often criticized but extensively relied upon."

— Ben Rogojan

::Observability

67 monthly data incidents avg - 74% say business finds issues first

The Problem

Observability tools exist because pipelines fail silently. Something breaks at 2 AM. Nobody notices for three days. Data is missing. Downstream systems already consumed what was there and made decisions on it.

Stat	Source
67 monthly data incidents average	Wakefield Research 2023
15 hours average to resolve an incident	Wakefield Research 2023
74% say business finds issues first	Wakefield Research 2023
31% of revenue impacted by data issues	Wakefield Research 2023
61 data incidents per month on average	Monte Carlo 2022
40% of the workday spent firefighting bad data	Monte Carlo 2022

Pipelines break quietly. Something goes wrong at 2 AM, nobody catches it until Wednesday, and by then three days of downstream reports have been built on bad data.
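
The simplest check that would surface a quiet failure like this is a freshness test: compare the last successful load time against an agreed maximum age. The sketch below is a minimal Python version. The `is_stale` helper, the 24-hour threshold, and the timestamps are assumptions for illustration, not values from any specific tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_loaded_at: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True when the most recent successful load is older than the allowed age."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > max_age


# Hypothetical scenario: the last good load was three days before the check.
last_load = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
checked_at = datetime(2024, 1, 4, 2, 0, tzinfo=timezone.utc)

stale = is_stale(last_load, timedelta(hours=24), now=checked_at)
fresh = is_stale(last_load, timedelta(hours=24), now=last_load + timedelta(hours=1))
```

A check like this only helps if something runs it on a schedule and alerts a person, which is where the extra tooling comes in.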

"Looking past the market fragmentation and maturity, there is significant demand among data and analytics leaders to address their growing data operations complexity."

— Gartner Market Guide for DataOps 2024

::Lineage

80% of data governance initiatives will fail by 2027

The Problem

Lineage tools exist because nothing tracks where data came from. They reconstruct history by parsing SQL queries and scanning logs. The reconstruction is incomplete, often wrong, and requires constant maintenance.
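
A deliberately naive sketch shows why reconstruction from SQL text is fragile. The regex and the `upstream_tables` helper below are hypothetical and far simpler than what real lineage tools do (they generally use full SQL parsers), but the failure mode is the same in kind: the parser sees text, not intent.

```python
import re

# Naive assumption: any identifier after FROM or JOIN is an upstream table.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def upstream_tables(sql: str) -> list[str]:
    """Return the sorted, de-duplicated identifiers that follow FROM or JOIN."""
    return sorted({name.lower() for name in TABLE_REF.findall(sql)})


simple = upstream_tables(
    "SELECT * FROM raw.orders o JOIN raw.customers c ON o.cid = c.id"
)

# A common table expression is reported as if it were a real table.
with_cte = upstream_tables(
    "WITH recent AS (SELECT * FROM raw.orders) SELECT * FROM recent"
)
```

The first query resolves correctly. The second reports `recent` as an upstream table even though it is only a CTE. SQL assembled at runtime in application code never appears in the query text at all, so no amount of parsing recovers it.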

Stat	Source
80% of D&A governance initiatives to fail by 2027	Gartner
Only 25% measure data quality metrics	Precisely 2024

Lineage is the path data took; provenance is the fuller story behind the data, and it is something most data platforms do not track well, if at all.

"By 2027, 80% of data and analytics governance initiatives will fail."

— Saul Judah, VP Analyst, Gartner

"We are extremely adept at generating data, not so much at extracting value from those data, and very challenged to destroy any data at all. Data hoarding, data sprawl, and data decay are all significant problems."

— IDC

::Ingestion

95% report integration barriers - 897 apps avg, only 28% integrated

The Problem

Ingestion tools move data from source systems to warehouses. They promise "connect once, sync forever." Reality: schema changes break pipelines, API rate limits cause data loss, and you're charged per row whether the data is useful or not.
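
Schema changes are easier to reason about with a concrete check. The sketch below compares an expected column-to-type mapping against what the source currently returns. The `diff_schema` helper, the column names, and the type labels are hypothetical. In practice the observed schema would come from the source system's metadata.

```python
def diff_schema(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Report columns that were added, removed, or changed type."""
    shared = expected.keys() & observed.keys()
    return {
        "added": sorted(observed.keys() - expected.keys()),
        "removed": sorted(expected.keys() - observed.keys()),
        "retyped": sorted(c for c in shared if expected[c] != observed[c]),
    }


# Hypothetical schemas: the source renamed one column and changed one type.
expected = {"id": "int", "email": "str", "signup_date": "date"}
observed = {"id": "str", "email_address": "str", "signup_date": "date"}

drift = diff_schema(expected, observed)
breaking = bool(drift["removed"] or drift["retyped"])
```

Running a comparison like this before each sync turns a silent pipeline break into an explicit decision about whether to halt, map the new column, or accept the change.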

Stat	Source
95% report integration challenges as barriers to AI	Salesforce 2024
70% of data workers rate pipeline management "complex"	Matillion 2025

Current ingestion systems also reprocess source data many times over, especially those that push everything to the cloud before processing it.

"In 2017, Y Combinator funded 15 analytics, data engineering, and AI/ML companies. In 2021, they funded 100. It's impossible to make sense of this many tools, much less manage even a fraction of them in a single stack."

— Benn Stancil

::CDC (Change Data Capture)

89% report pipeline scaling issues - Kafka operational overhead

The Problem

CDC tools track changes in source databases. They require database-level access, often involve Kafka clusters for streaming, and create operational overhead that exceeds the original data engineering problem.

Stat	Source
89% report scaling issues with pipelines	Matillion 2025
20-30% write path overhead for trigger-based CDC	System Overflow
10-60 seconds latency for query-based CDC; deletes can be missed	System Overflow

CDC in the modern data stack is generally database-oriented. IBM is one of the few major platforms that documents broader provenance and element-level tracking concepts, while Palantir focuses more on dataset lineage and ontology rather than CDC-style element-level change capture.
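
The missed-delete problem noted for query-based CDC can be reproduced in a few lines. The sketch below uses Python's standard `sqlite3` module with a hypothetical `customers` table and an integer `updated_at` version column standing in for a timestamp. It is a minimal illustration of polling with a watermark, not any vendor's implementation.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Query-based CDC: fetch rows modified since the last watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)", [(1, "Ada", 1), (2, "Grace", 2)]
)

first, watermark = poll_changes(conn, 0)  # initial sync sees both rows

# One update and one delete happen at the source between polls.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 3 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

second, watermark = poll_changes(conn, watermark)
# `second` contains the update only. The deleted row simply stops appearing,
# so the downstream copy keeps customer 2 unless deletes are tracked separately.
```

Working around this usually means soft-delete flags, periodic full comparisons, or log-based capture, each of which adds its own operational cost.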

"The thing to know about merchants of complexity is that they never go away, they merely migrate. From WS-DeathStar to microservices to premature k8s to auth services to GraphQL and beyond."

— DHH, CTO 37signals

::Reverse ETL

Exists because the stack moves data in the wrong direction

The Problem

Reverse ETL tools exist to get data out of warehouses - because the whole stack moved data in one direction, and now you need it back in Salesforce, HubSpot, and the other tools where work actually happens.

Stage	Tools Involved
Source -> Warehouse	Fivetran, Stitch, Airbyte
Warehouse -> Transform	dbt, Dataform
Warehouse -> Destination	Census, Hightouch, Polytomic
Data Warehouses	Databricks, Snowflake

"The problem wasn't the market around the products we were building; the problem was the product itself."

— Benn Stancil

Summary

::Final words

The current data stack is fragmented, expensive, and operationally heavy.

The same tools that were meant to simplify data work have created new layers of complexity. Teams spend more time wiring systems together, managing vendor sprawl, and handling operational overhead than they do improving the actual data products people rely on.

The result is more cost, more risk, and less control over data, governance, and compliance. That is the core problem this report has been pointing at throughout.