::The whole Data Stack industry is focused on the Wrong Problem
The tools designed to help are now the problem. Data engineers spend more time working on the infrastructure of their stack, and learning the many tools that solve the issues it created, than working with the data itself. Most new data stack products are designed to fix problems caused by the stack itself.
The Philosophy That Started It All
The modern data stack was built on one philosophy: ingest everything first, and figure out the value later.
This is source-first thinking. It seemed obvious in 2012 when cloud data warehouses dropped costs from $100,000/year to $160/month. The race was on to move data faster, store more of it, and worry about value creation downstream.
Source-first. SQL-only. Developer-focused. These assumptions shaped every tool that followed.
As a Result
Data engineers now spend more time fixing their stack and the issues it creates than actually working with data and delivering results.
| Stat | Source |
|---|---|
| 897 apps average per enterprise | MuleSoft 2025 |
| Only 28% of apps integrated | MuleSoft 2025 |
| 40% of IT time spent on integration | MuleSoft 2025 |
| 70% of data leaders say stack is "too complex" | Modern Data 101 |
| Most BigQuery customers store less than 1TB | Jordan Tigani |
The industry built infrastructure for hyperscale when most teams needed simplicity. We optimized for "how fast can we ingest raw data" when the right question was "what data products does the business actually need?"
What Went Wrong
Source-first - ingest everything, figure out value later. Storage is cheap, ingestion is often free or close to it, so why not? The problem isn't getting data in. It's what happens after. Every time that data gets queried, moved, transformed, or exported, consumption-based charges stack up.
SQL & Python only - the entire stack assumes you speak SQL or Python. Every transformation is a query or a script, and every answer needs a technical translator. Business users can't get their own answers. Analysts become bottlenecks. The gap between a question and an answer gets measured in days, not seconds.
Developer-focused - these tools were built by engineers, for engineers. Config in YAML and JSON, debugging in terminals, logs everywhere. The people who actually live closest to the data problems - ops, finance, marketing - have no way in. They can't touch any of it.
Every tool category that followed - ingestion, transformation, orchestration, quality, lineage, observability, reverse ETL - exists to patch a gap left by these original assumptions.
"We got distracted by circular problems of our own making. We created pipelines to shuffle data around, and orchestrators to coordinate those pipelines, and observability dashboards to monitor the orchestrators, and incident managers to organize the observability incidents."
"Most BigQuery customers store less than 1TB."
::Data Engineer Burnout
39% considering quitting - 78% wish their job came with a therapist
The Human Cost
The wrong philosophy didn't just create technical debt. It broke the people doing the work.
Data engineers were hired to work with data. Instead, they spend their days debugging Airflow DAGs, managing Kubernetes clusters, writing glue code between tools, and firefighting pipeline failures at 2 AM.
The tools designed to help became the full-time job.
"Data teams spent more time maintaining infrastructure than delivering insights."
| Stat | Source |
|---|---|
| 39% of data engineers considering quitting due to burnout | Immuta |
| 78% wish their job came with a therapist | Industry survey |
| 77% report heavier workloads despite AI tools | Upwork 2024 |
| 67 monthly data incidents average per org | Wakefield Research 2023 |
| 15 hours average to resolve an incident (up from 5.5) | Wakefield Research 2023 |
| 40% of time spent on integration, not data work | MuleSoft 2025 |
The Daily Reality
Hired to build data products. Actually doing: YAML debugging. Credential rotation. Dependency conflicts. Version mismatches. Backfill jobs. Incident triage. Vendor management. Cost optimization. Security audits. Compliance documentation.
The "modern" in modern data stack didn't mean modern work. It meant more work.
"Integration nightmares multiplied - teams became 'glue code developers.'"
::Data Quality Issues
Most data quality tools exist because most systems don't validate data before pushing it further down the pipeline - and almost nobody runs data-product-specific quality checks when a data product is created or updated.
Bad data lands in your warehouse or lake, and if you're lucky, or you've paid for the right tooling, you might catch the problem before it causes a production error. That's not usually how it goes.
So quality has become its own separate product category - its own tooling, its own cost, its own team to run it. Another layer bolted onto the stack to compensate for something the stack should have handled in the first place.
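As a rough illustration of what validating data before it moves downstream could look like, here is a minimal Python sketch. The `orders` fields and the three rules are hypothetical, chosen only for the example; a real data product would define its own checks.

```python
from typing import Any, Callable

Record = dict[str, Any]
Rule = Callable[[Record], bool]

# Hypothetical, product-specific rules for an "orders" data product.
RULES: dict[str, Rule] = {
    "order_id is present": lambda r: bool(r.get("order_id")),
    "amount is a non-negative number": lambda r: (
        isinstance(r.get("amount"), (int, float))
        and not isinstance(r.get("amount"), bool)
        and r["amount"] >= 0
    ),
    "currency is a 3-letter code": lambda r: (
        isinstance(r.get("currency"), str) and len(r["currency"]) == 3
    ),
}


def validate(record: Record) -> list[str]:
    """Return the names of the rules the record fails (empty list means valid)."""
    return [name for name, rule in RULES.items() if not rule(record)]


def split_batch(records: list[Record]) -> tuple[list[Record], list[tuple[Record, list[str]]]]:
    """Separate records allowed downstream from records to quarantine."""
    accepted, rejected = [], []
    for record in records:
        failures = validate(record)
        if failures:
            rejected.append((record, failures))
        else:
            accepted.append(record)
    return accepted, rejected


batch = [
    {"order_id": "A1", "amount": 10.5, "currency": "EUR"},
    {"order_id": "", "amount": -3, "currency": "EURO"},
]
accepted, rejected = split_batch(batch)
```

Only `accepted` would be pushed further down the pipeline; `rejected` keeps each bad record together with the reasons it failed, so the problem is visible at the point of entry rather than in production.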
| Stat | Source |
|---|---|
| $12.9M annual cost of poor data quality | Gartner |
| 67% don't trust their data for decisions | Precisely/Drexel 2024 |
| 64% say quality is top challenge (up from 50%) | Precisely 2024 |
"Data quality is usually one of the goals of effective data management. Yet too often organisations treat it like an afterthought."
"Most organisations decide to address issues in a piecemeal fashion... No wonder this is only a tactical solution; sooner or later, we need to start working on another tactical project to resolve the issues caused by the previous tactical project."
::Consumption Pricing
Every major vendor runs on consumption pricing. Warehouses charge per credit, compute platforms charge per unit, ingestion tools charge per row. On top of that, egress costs - what you pay to move data out of a cloud environment - can dwarf the storage costs themselves, and are often the last thing teams think about until the bill arrives. Then there are the skills costs: the platform certifications, the specialists you need to hire or retain just to operate and optimise these tools. The bill goes up the more you store, process, and move data around, regardless of whether any of it is delivering value.
| Stat | Source |
|---|---|
| 62% exceeded cloud budget in 2024 | Wasabi Index 2025 |
| 86% of CIOs planning some repatriation | Barclays CIO Q4 2024 |
| $100B+ market cap lost to cloud costs | Andreessen Horowitz |
"Close to half of cloud buyers spent more on cloud than they expected in 2023, with 59% anticipating similar overruns in 2024."
Minimum charges punish small queries. Cloud infrastructure costs typically exceed platform charges by 50-200%. Pricing changes trigger overnight cost increases. CFOs can't forecast. Finance teams treat data as an unpredictable expense.
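To make the minimum-charge effect concrete, here is a small Python sketch. The 60-second minimum and the 3-second query workload are assumptions picked for illustration, not any specific vendor's terms.

```python
def billed_seconds(runtime_seconds: float, minimum_seconds: float = 60.0) -> float:
    """Seconds charged for one query when a per-query minimum applies (assumed 60s)."""
    return max(runtime_seconds, minimum_seconds)


def overpayment_factor(runtime_seconds: float, minimum_seconds: float = 60.0) -> float:
    """How many times more compute is billed than was actually used."""
    return billed_seconds(runtime_seconds, minimum_seconds) / runtime_seconds


# Illustrative workload: 1,000 dashboard queries that each run for 3 seconds.
queries = [3.0] * 1000
used = sum(queries)
billed = sum(billed_seconds(q) for q in queries)
```

Under these assumed numbers the workload uses 3,000 seconds of compute but is billed for 60,000, a 20x gap that comes entirely from the minimum, not from the work done.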
The incentives are misaligned. Vendors profit from volume. Customers benefit from value. Storing everything "just in case" is expensive for you and profitable for them.
Case Studies
- 37signals reduced AWS spend from $3.2M to $1.3M annually, projecting $10M+ savings over 5 years by leaving the cloud.
- GEICO achieved 50% compute cost reduction and 60% storage cost reduction through cloud repatriation.
"We're paying an at times almost absurd premium for the possibility that workloads could spike. It's like paying a quarter of your house's value for earthquake insurance when you don't live anywhere near a fault line."
::AI / The Chatbot Problem
Pretty much every data vendor has bolted on the same AI feature: a chatbot that writes SQL, generates DAGs, or builds pipelines. They call it a "copilot" or "analyst" or "assistant" depending on the marketing budget. They all do roughly the same thing, and they all run into the same wall: the underlying architecture was never built with AI in mind.
Putting a natural language interface on top of a complex system doesn't make it intelligent. It just makes it a bit easier to poke at while hiding everything broken underneath - and it increases the security risk footprint, since you're now exposing data access through an AI layer that wasn't part of the original threat model.
| Stat | Source |
|---|---|
| 95% of AI pilot projects delivered no measurable P&L impact | MIT "GenAI Divide" Study, July 2025 |
| 30% of GenAI projects abandoned after POC | Gartner 2025 |
| 77% of employees say AI tools added to their workload | Upwork, July 2024 |
| 40% of European "AI startups" had no real AI | PwC/MMC Ventures |
| $400K in SEC fines for AI washing claims | SEC, March 2024 |
| GenAI now in "Trough of Disillusionment" | Gartner Hype Cycle 2025 |
| For every 33 AI POCs launched, only 4 reach production | IDC |
The AI-Native vs AI-Augmented Distinction
AI-augmented/powered: a traditional system with AI stuck on top. The core product still depends on the old architecture, just with a third-party model plugged in somewhere. The original design constrains everything it can do.
AI-native: built from scratch with intelligence as a core part of how it works. Take the AI out and the product stops making sense. It's not a feature, it's the foundation.
Technical comparisons show AI-native architectures running at 2-5x better performance on latency and throughput vs. the bolted-on approach. They also carry significantly less risk of hallucination and a lower chance of corrupting production data - both of which are real failure modes when an LLM is generating queries or pipelines against live systems. That gap in both performance and safety tends to widen as usage scales.
"When AI simply makes the product easier to use, that's AI-washing. A natural language front-end for a workflow - yes, technically it's AI - but it's just masking the complexity of the workflow with a more accessible customer interface."
"LLMs can write SQL, but they are often prone to making up tables, making up fields, and generally just writing SQL that if executed against your database would not actually be valid."
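The failure mode in that last quote can be caught before anything runs. The sketch below is one possible guard, not a description of any vendor's product: it uses Python's built-in `sqlite3` module and SQLite's `EXPLAIN` to compile a generated statement against the real schema without executing it. The `orders` table and the sample queries are hypothetical.

```python
import sqlite3
from typing import Optional


def check_sql(conn: sqlite3.Connection, sql: str) -> Optional[str]:
    """Return None if SQLite can compile the statement, else the error message.

    EXPLAIN compiles the statement against the current schema but does not run it,
    so made-up tables and columns surface as errors before any data is touched.
    """
    try:
        conn.execute(f"EXPLAIN {sql}")
    except sqlite3.Error as exc:
        return str(exc)
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")

good = "SELECT order_id, amount FROM orders WHERE amount > 100"
made_up_table = "SELECT * FROM customer_orders"
made_up_column = "SELECT discount FROM orders"

results = {sql: check_sql(conn, sql) for sql in (good, made_up_table, made_up_column)}
```

This only proves that the names resolve. A query can compile cleanly and still answer the wrong question, which is the harder problem a chat interface does not solve.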
::Data Sovereignty
Data sovereignty is no longer optional. Regulations require data to stay within borders. Cloud providers operate across jurisdictions. Compliance teams can't guarantee where data lives, who can access it, or which government can subpoena it.
The cloud promised flexibility. For regulated industries, it delivered risk.
| Stat | Source |
|---|---|
| 84% of European orgs using/planning sovereign cloud | IDC 2024 |
| 86% of CIOs planning some repatriation | Barclays CIO Q4 2024 |
| GEICO: 50% compute reduction, 60% storage reduction | The Stack 2024 |
Regulations in a growing number of jurisdictions require data to stay within borders, but most cloud providers spread workloads across regions by default.
"The European cloud market is growing exponentially; however, the share held by European providers is diminishing. This trend poses a significant concern for Europe's technological sovereignty."
"Ten years into that cloud journey, GEICO still hadn't migrated everything to the cloud, their bills went up 2.5X, and their reliability challenges went up quite a lot too."
::DevOps & Developer
SQL & Python required - Business users locked out entirely
The Problem
The entire data stack assumes you're a developer. Every tool requires SQL or Python. Configuration lives in YAML and JSON. Debugging happens in terminals and logs.
If you can't code, you can't participate. The people closest to the data problems - operations, finance, marketing - are locked out of the tools designed to solve them.
| Stat | Source |
|---|---|
| 897 apps average per enterprise; only 28% of apps integrated | MuleSoft 2025 |
| 70% of data leaders say stack is "too complex" | Modern Data 101 |
Who Gets Left Behind
Business users can't self-serve - they file tickets. Analysts become translators instead of analysts. Operations teams closest to data problems can't touch the tools. Finance waits days for reports that should take seconds.
The people who need data most are furthest from it.
The Stack They Expect You To Know
Each tool has its own syntax, its own mental model, its own ways of failing. The barrier isn't just high, it's deliberately technical, and it keeps getting taller. And across all these layers there is no provenance of changes - no reliable way to trace what was changed, by which tool, at which point in the chain.
"The 'freedom to choose' that once characterized the Modern Data Stack is quietly giving way to a controlled substrate that vendors can both standardize and monetize."
::Orchestration
Orchestration tools exist because pipeline components don't coordinate themselves. You need a central scheduler to manage dependencies, sequence tasks, and handle failures. The scheduler becomes another system to maintain, debug, and keep running.
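The core job described here, sequencing tasks by dependency and handling failure, can be sketched in a few lines with Python's standard-library `graphlib`. The task names are hypothetical, and this toy runner stops at the first failure; it is not how Airflow or any other scheduler is implemented.

```python
from graphlib import TopologicalSorter
from typing import Callable, Optional

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE: dict[str, set[str]] = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "join_orders_customers": {"ingest_orders", "ingest_customers"},
    "publish_report": {"join_orders_customers"},
}


def run_pipeline(
    tasks: dict[str, set[str]], actions: dict[str, Callable[[], None]]
) -> tuple[list[str], Optional[str], list[str]]:
    """Run tasks in dependency order.

    Returns (completed, failed_task, skipped). Stops at the first failure so that
    nothing downstream runs on missing input.
    """
    order = list(TopologicalSorter(tasks).static_order())
    completed: list[str] = []
    for position, name in enumerate(order):
        try:
            actions[name]()
        except Exception:
            return completed, name, order[position + 1:]
        completed.append(name)
    return completed, None, []


def broken_join() -> None:
    raise RuntimeError("join failed")


actions = {name: (lambda: None) for name in PIPELINE}
actions["join_orders_customers"] = broken_join
completed, failed, skipped = run_pipeline(PIPELINE, actions)
```

Even this toy version needs a dependency graph, an execution order, and a failure policy. Retries, schedules, backfills, and state storage are what turn it into a system of its own.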
| Stat | Source |
|---|---|
| 70% rate pipeline management "complex" | Matillion 2025 |
| 89% of Airflow users expect more revenue-generating or external solutions this year | Astronomer 2026 |
| 32% of Airflow users have GenAI or MLOps in production | Astronomer 2026 |
"All in all, Airflow is far from perfect, and many of us have merely learned to deal with its limitations."
"Despite any criticism, Airflow continues to play a pivotal role, much like PHP of the data sector - often criticized but extensively relied upon."
::Observability
67 monthly data incidents avg - 74% say business finds issues first
The Problem
Observability tools exist because pipelines fail silently. Something breaks at 2 AM. Nobody notices for three days. Data is missing. Downstream systems already consumed what was there and made decisions on it.
| Stat | Source |
|---|---|
| 67 monthly data incidents average; 15 hours to resolve; 74% say business finds issues first; 31% of revenue impacted by data issues | Wakefield Research 2023 |
| 61 data incidents per month on average; 40% of the workday spent firefighting bad data | Monte Carlo 2022 |
Pipelines break quietly. Something goes wrong at 2 AM, nobody catches it until Wednesday, and by then three days of downstream reports have been built on bad data.
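The simplest check that would catch a failure like this is a freshness test: compare each table's last successful load against how old it is allowed to be. A minimal Python sketch follows, with hypothetical table names and an assumed 24-hour threshold.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_tables(
    last_loaded: dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return the tables whose last successful load is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return sorted(name for name, loaded_at in last_loaded.items() if now - loaded_at > max_age)


# Hypothetical load log: 'customers' silently stopped loading three days ago.
now = datetime(2024, 1, 3, 9, 0, tzinfo=timezone.utc)
last_loaded = {
    "orders": datetime(2024, 1, 3, 2, 0, tzinfo=timezone.utc),
    "customers": datetime(2023, 12, 31, 2, 0, tzinfo=timezone.utc),
}
overdue = stale_tables(last_loaded, max_age=timedelta(hours=24), now=now)
```

Here `overdue` contains only `customers`. The check is trivial; the point is that in most stacks it lives in a separate product rather than in the pipeline that loads the data.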
"Looking past the market fragmentation and maturity, there is significant demand among data and analytics leaders to address their growing data operations complexity."
::Lineage
80% of data governance initiatives will fail by 2027
The Problem
Lineage tools exist because nothing tracks where data came from. They reconstruct history by parsing SQL queries and scanning logs. The reconstruction is incomplete, often wrong, and requires constant maintenance.
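To show why reconstruction from SQL text is fragile, here is a deliberately naive Python sketch that extracts upstream tables with a regular expression. Real lineage tools use proper SQL parsers; this is an assumption-laden toy, and the queries are invented for the example.

```python
import re

# Naive rule: whatever identifier follows FROM or JOIN is treated as an upstream table.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def naive_upstream_tables(sql: str) -> list[str]:
    """Guess the upstream tables of a query from its text alone."""
    return sorted(set(TABLE_REF.findall(sql)))


# Works on the simple case.
simple = naive_upstream_tables(
    "SELECT o.id FROM sales.orders o JOIN sales.customers c ON o.cid = c.id"
)

# Wrong: 'recent' is a CTE defined inside the query, not a real table.
with_cte = naive_upstream_tables(
    "WITH recent AS (SELECT * FROM raw.events) SELECT * FROM recent"
)

# Incomplete: the comma join hides table 'b'.
comma_join = naive_upstream_tables("SELECT * FROM a, b")
```

The first query yields both real tables, the second reports a table that does not exist, and the third misses one that does. Better parsers shrink these gaps, but anything that infers lineage after the fact is still guessing at what the pipeline actually did.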
| Stat | Source |
|---|---|
| 80% of D&A governance to fail by 2027 | Gartner |
| Only 25% measure data quality metrics | Precisely 2024 |
Lineage is the path data took; provenance is the fuller story behind the data, and it is something most data platforms do not track well, if at all.
"By 2027, 80% of data and analytics governance initiatives will fail."
"We are extremely adept at generating data, not so much at extracting value from those data, and very challenged to destroy any data at all. Data hoarding, data sprawl, and data decay are all significant problems."
::Ingestion
95% report integration barriers - 897 apps avg, only 28% integrated
The Problem
Ingestion tools move data from source systems to warehouses. They promise "connect once, sync forever." Reality: schema changes break pipelines, API rate limits cause data loss, and you're charged per row whether the data is useful or not.
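Schema changes are the most mechanical of these failures to detect. A minimal Python sketch of a drift check between the schema a pipeline expects and the one the source now returns is shown below; the column names and types are hypothetical.

```python
def schema_drift(expected: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare two column-name -> type mappings and report what changed."""
    shared = expected.keys() & observed.keys()
    return {
        "added": sorted(observed.keys() - expected.keys()),
        "removed": sorted(expected.keys() - observed.keys()),
        "retyped": sorted(col for col in shared if expected[col] != observed[col]),
    }


# Hypothetical source table before and after an upstream release.
expected = {"id": "int", "email": "text", "signup_date": "date"}
observed = {"id": "text", "email": "text", "created_at": "timestamp"}

drift = schema_drift(expected, observed)
```

Running a check like this before a sync lets the pipeline stop with a clear message (`signup_date` removed, `created_at` added, `id` retyped) instead of failing somewhere downstream.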
| Stat | Source |
|---|---|
| 95% report integration challenges as barriers to AI | Salesforce 2024 |
| 70% of data workers rate pipeline management "complex" | Matillion 2025 |
Current ingestion systems also typically reprocess source data many times, especially those that push everything to the cloud before processing it.
"In 2017, Y Combinator funded 15 analytics, data engineering, and AI/ML companies. In 2021, they funded 100. It's impossible to make sense of this many tools, much less manage even a fraction of them in a single stack."
::CDC (Change Data Capture)
89% report pipeline scaling issues - Kafka operational overhead
The Problem
CDC tools track changes in source databases. They require database-level access, often involve Kafka clusters for streaming, and create operational overhead that exceeds the original data engineering problem.
| Stat | Source |
|---|---|
| 89% report scaling issues with pipelines | Matillion 2025 |
| 20-30% write path overhead for trigger-based CDC | System Overflow |
| 10-60 seconds latency for query-based CDC; deletes can be missed | System Overflow |
CDC in the modern data stack is generally database-oriented. IBM is one of the few major platforms that documents broader provenance and element-level tracking concepts, while Palantir focuses more on dataset lineage and ontology rather than CDC-style element-level change capture.
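The "deletes can be missed" row in the table above is easy to reproduce. The sketch below simulates query-based CDC with Python's built-in `sqlite3`: it polls for rows whose `updated_at` is past a watermark. The `customers` table and the integer timestamps are assumptions made for the example.

```python
import sqlite3


def poll_changes(conn: sqlite3.Connection, watermark: int) -> tuple[list[tuple], int]:
    """Query-based CDC: fetch rows modified after the watermark, return the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 101)],
)

# First poll sees both rows.
first_batch, watermark = poll_changes(conn, 0)

# Between polls: one row is updated, one row is deleted.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 102 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

# Second poll sees the update, but the deleted row simply never shows up.
second_batch, watermark = poll_changes(conn, watermark)
```

The second poll returns only the updated row. Nothing tells the consumer that customer 2 is gone, so the downstream copy keeps a record the source no longer has, unless soft deletes, triggers, or log-based capture are added on top.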
"The thing to know about merchants of complexity is that they never go away, they merely migrate. From WS-DeathStar to microservices to premature k8s to auth services to GraphQL and beyond."
::Reverse ETL
Exists because the stack moves data the wrong direction
The Problem
Reverse ETL tools exist to get data OUT of warehouses - because the whole stack moved data one direction, and now you need it back in Salesforce, HubSpot, and the other tools where work actually happens.
| Stage | Tools Involved |
|---|---|
| Source -> Warehouse | Fivetran, Stitch, Airbyte |
| Warehouse -> Transform | dbt, Dataform |
| Warehouse -> Destination | Census, Hightouch, Polytomic |
| Data Warehouses | Databricks, Snowflake |
"The problem wasn't the market around the products we were building; the problem was the product itself."
::Final words
The current data stack is fragmented, expensive, and operationally heavy.
The same tools that were meant to simplify data work have created new layers of complexity. Teams spend more time wiring systems together, managing vendor sprawl, and handling operational overhead than they do improving the actual data products people rely on.
The result is more cost, more risk, and less control over data, governance, and compliance. That is the core problem this report has been pointing at throughout.
This report's research was aided by AI, then a human curated and wrote the report.