::The Whole Data Stack Industry Is Focused on the Wrong Problem
The tools designed to help are now the problem. Data engineers spend more time maintaining their stack's infrastructure, and learning the many tools built to tame it, than working with data itself. In reality, most new data stack products exist to fix issues the stack itself created.
The Philosophy That Started It All
The modern data stack was built on one philosophy:
Ingest data sources as fast as possible. Store everything centrally. Figure out what to do with it later.
This is source-first thinking. It seemed obvious in 2012 when cloud data warehouses dropped costs from $100,000/year to $160/month. The race was on to move data faster, store more of it, and worry about value creation downstream.
Source-first. SQL-only. Developer-focused. These assumptions shaped every tool that followed.
"We got distracted by circular problems of our own making. We created pipelines to shuffle data around, and orchestrators to coordinate those pipelines, and observability dashboards to monitor the orchestrators, and incident managers to organize the observability incidents."
"Most BigQuery customers store less than 1TB."
The Result
Data Engineers now spend more time fixing their stack and the issues it creates than actually working with data and delivering results.
| Stat | Source |
|---|---|
| 897 apps average per enterprise | MuleSoft 2025 |
| Only 28% of apps integrated | MuleSoft 2025 |
| 40% of IT time spent on integration | MuleSoft 2025 |
| 70% of data leaders say stack is "too complex" | Modern Data 101 |
| Most BigQuery customers store less than 1TB | Jordan Tigani |
The industry built infrastructure for hyperscale when most teams needed simplicity. We optimized for "how fast can we ingest raw data" when the right question was "what data products does the business actually need?"
What Went Wrong
Source-first — Ingest everything, figure out value later. Storage is cheap, so why not? Because cheap storage created expensive complexity. You pay for every row ingested, every table stored, every query run—whether anyone uses the output or not.
SQL & Python only — The entire stack assumes you speak SQL or Python. Every transformation is a query or a script. Every answer requires a technical translator. Business users can't self-serve. Analysts become bottlenecks. The gap between question and answer is measured in days, not seconds.
Developer-focused — Tools built by engineers, for engineers. Configuration in YAML and JSON. Debugging in terminals and logs. Business users locked out entirely. The people closest to the data problems—operations, finance, marketing—can't touch the tools designed to solve them.
Every tool category that followed—ingestion, transformation, orchestration, quality, lineage, observability, reverse ETL—exists to patch a gap left by these original assumptions.
::Data Engineer Burnout
39% considering quitting — 78% wish their job came with a therapist
The Human Cost
The wrong philosophy didn't just create technical debt. It broke the people doing the work.
Data engineers were hired to work with data. Instead, they spend their days debugging Airflow DAGs, managing Kubernetes clusters, writing glue code between tools, and firefighting pipeline failures at 2 AM.
The tools designed to help became the full-time job.
"Data teams spent more time maintaining infrastructure than delivering insights."
"Integration nightmares multiplied—teams became 'glue code developers.'"
| Stat | Source |
|---|---|
| 39% of data engineers considering quitting due to burnout | Immuta |
| 78% wish their job came with a therapist | Industry survey |
| 77% report heavier workloads despite AI tools | Upwork 2024 |
| 67 monthly data incidents average per org | Wakefield Research 2023 |
| 15 hours average to resolve an incident (up from 5.5) | Wakefield Research 2023 |
| 40% of time spent on integration, not data work | MuleSoft 2025 |
The Daily Reality
Hired to build data products. Actually doing: YAML debugging. Credential rotation. Dependency conflicts. Version mismatches. Backfill jobs. Incident triage. Vendor management. Cost optimization. Security audits. Compliance documentation.
The "modern" in modern data stack didn't mean modern work. It meant more work.
::Data Quality
$12.9M/yr avg cost of poor data quality — 67% don't trust their data
The Problem
Data quality tools exist because ingestion doesn't validate. Bad data lands in your warehouse. You detect it after the fact—if you detect it at all. Quality becomes a separate purchase, a separate team, a separate problem.
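What "ingestion doesn't validate" means in practice can be shown with a minimal sketch. The field names and rules below are hypothetical, not taken from any specific tool; the point is that a few lines of checks at ingestion time catch what a separate quality product detects only after the bad rows have landed.

```python
# Hypothetical schema: field names and rules are illustrative only.
REQUIRED_FIELDS = {"order_id": str, "amount": float, "created_at": str}

def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row is clean."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            problems.append(f"bad type for {field}: {type(row[field]).__name__}")
    if not problems and row["amount"] < 0:
        problems.append("negative amount")
    return problems

row = {"order_id": "A-1", "amount": -5.0, "created_at": "2024-01-01"}
print(validate_row(row))  # → ['negative amount']
```

Rejecting or quarantining a row at this point costs almost nothing; finding it three joins downstream is where the $12.9M goes.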
"Data quality is usually one of the goals of effective data management. Yet too often organizations treat it like an afterthought."
"Most organizations decide to address issues in a piecemeal fashion... No wonder this is only a tactical solution; sooner or later, we need to start working on another tactical project to resolve the issues caused by the previous tactical project."
| Stat | Source |
|---|---|
| $12.9M annual cost of poor data quality | Gartner |
| 67% don't trust their data for decisions | Precisely/Drexel 2024 |
| 64% say quality is top challenge (up from 50%) | Precisely 2024 |
::Consumption Pricing
62% exceeded cloud budget in 2024 — 86% of CIOs planning repatriation
"We're paying an at times almost absurd premium for the possibility that workloads could spike. It's like paying a quarter of your house's value for earthquake insurance when you don't live anywhere near a fault line."
| Stat | Source |
|---|---|
| 62% exceeded cloud budget in 2024 | Wasabi Index 2025 |
| 86% of CIOs planning some repatriation | Barclays CIO Q4 2024 |
| $100B+ market cap lost to cloud costs | Andreessen Horowitz |
Every major vendor uses consumption-based pricing. Warehouses charge per credit. Compute platforms charge per unit. Ingestion tools charge per row. The more you store and process, the more you pay.
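The forecasting problem is easy to see with a toy cost model. The rates below are made up for illustration; real vendor pricing varies. The same team produces wildly different bills depending on whether a month includes a backfill or a traffic spike:

```python
# Illustrative consumption-pricing model with hypothetical rates.
PER_MILLION_ROWS = 10.0   # $ per million rows ingested (made up)
PER_CREDIT = 3.0          # $ per warehouse compute credit (made up)

def monthly_cost(rows_ingested: int, credits_used: float) -> float:
    """Bill scales with volume processed, not value delivered."""
    return rows_ingested / 1_000_000 * PER_MILLION_ROWS + credits_used * PER_CREDIT

# A quiet month vs. a backfill month: same team, very different bill.
print(monthly_cost(50_000_000, 200))    # → 1100.0
print(monthly_cost(500_000_000, 1500))  # → 9500.0
```

Nothing in the model references whether anyone used the output, which is exactly the misalignment described below.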
"Close to half of cloud buyers spent more on cloud than they expected in 2023, with 59% anticipating similar overruns in 2024."
Minimum charges punish small queries. Cloud infrastructure costs typically exceed platform charges by 50-200%. Pricing changes trigger overnight cost increases. CFOs can't forecast. Finance teams treat data as an unpredictable expense.
The incentives are misaligned. Vendors profit from volume. Customers benefit from value. Storing everything "just in case" is expensive for you and profitable for them.
Case Studies
37signals reduced AWS spend from $3.2M to $1.3M annually—projecting $10M+ savings over 5 years by leaving the cloud.
GEICO achieved 50% compute cost reduction and 60% storage cost reduction through cloud repatriation.
::AI / The Chatbot Problem
95% of AI pilots show zero P&L impact — Everyone built the same text-to-SQL chatbot
The Problem
Every data vendor bolted on the same AI feature: a chatbot that writes SQL. They call it a "copilot" or "analyst" or "assistant." They all do roughly the same thing—and they all have the same limitation: the underlying architecture wasn't designed for AI.
A natural language front-end doesn't make a complex system intelligent. It makes it slightly easier to use while masking the complexity underneath.
"When AI simply makes the product easier to use, that's AI-washing. A natural language front-end for a workflow—yes, technically it's AI—but it's just masking the complexity of the workflow with a more accessible customer interface."
"LLMs can write SQL, but they are often prone to making up tables, making up fields, and generally just writing SQL that if executed against your database would not actually be valid."
"Right now, [agent] is being slapped on everything from simple scripts to sophisticated AI workflows. There's no shared definition, which leaves plenty of room for companies to market basic automation as something much more advanced."
"They just don't work. They don't have enough intelligence, they're not multimodal enough."
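The hallucinated-SQL problem described above is at least partly checkable before execution. Here is a deliberately naive sketch (a real guardrail would use a proper SQL parser, and the catalog of known tables is hypothetical) of validating model-generated SQL against the warehouse's actual schema:

```python
import re

# Hypothetical catalog of tables that actually exist in the warehouse.
KNOWN_TABLES = {"orders", "customers", "payments"}

def referenced_tables(sql: str) -> set[str]:
    """Naive extraction of identifiers after FROM/JOIN."""
    return set(re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)", sql, re.IGNORECASE))

def invented_tables(sql: str) -> set[str]:
    """Return tables the model made up; an empty set means safe to attempt."""
    return referenced_tables(sql) - KNOWN_TABLES

sql = "SELECT * FROM orders JOIN customer_ltv ON orders.id = customer_ltv.id"
print(invented_tables(sql))  # → {'customer_ltv'}
```

A check like this catches made-up tables but not made-up columns, wrong joins, or subtly invalid logic, which is why a chatbot layer alone does not make the system trustworthy.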
| Stat | Source |
|---|---|
| 95% of AI pilot projects delivered no measurable P&L impact | MIT "GenAI Divide" Study, July 2025 |
| 30% of GenAI projects abandoned after POC | Gartner 2025 |
| 77% of employees say AI tools added to their workload | Upwork, July 2024 |
| 40% of European "AI startups" had no real AI | PwC/MMC Ventures |
| $400K in SEC fines for AI washing claims | SEC, March 2024 |
| GenAI now in "Trough of Disillusionment" | Gartner Hype Cycle 2025 |
| For every 33 AI POCs launched, only 4 reach production | IDC |
The AI-Native vs AI-Augmented Distinction
AI-augmented/powered: Traditional systems with AI layered on top. These retrofitted solutions depend on third-party APIs or cloud models, and the original architecture limits what the AI can do.
AI-native: Built from the ground up with intelligence as foundation—if you remove the AI, the product loses its core value.
Technical analysis shows AI-native architectures demonstrate 2-5x performance improvements in latency and throughput compared to bolted-on systems.
::Data Sovereignty
84% of European orgs using/planning sovereign cloud — GEICO repatriated to cut costs 50-60%
The Problem
Data sovereignty is no longer optional. Regulations require data to stay within borders. Cloud providers operate across jurisdictions. Compliance teams can't guarantee where data lives, who can access it, or which government can subpoena it.
The cloud promised flexibility. For regulated industries, it delivered risk.
"The European cloud market is growing exponentially; however, the share held by European providers is diminishing. This trend poses a significant concern for Europe's technological sovereignty."
"Ten years into that cloud journey, GEICO still hadn't migrated everything to the cloud, their bills went up 2.5X, and their reliability challenges went up quite a lot too."
| Stat | Source |
|---|---|
| 84% of European orgs using/planning sovereign cloud | IDC 2024 |
| 86% of CIOs planning some repatriation | Barclays CIO Q4 2024 |
| GEICO: 50% compute reduction, 60% storage reduction | The Stack 2024 |
::DevOps & Developer
SQL & Python required — Business users locked out entirely
The Problem
The entire data stack assumes you're a developer. Every tool requires SQL or Python. Configuration lives in YAML and JSON. Debugging happens in terminals and logs.
If you can't code, you can't participate. The people closest to the data problems—operations, finance, marketing—are locked out of the tools designed to solve them.
"The 'freedom to choose' that once characterized the Modern Data Stack is quietly giving way to a controlled substrate that vendors can both standardize and monetize."
| Stat | Source |
|---|---|
| 897 apps average per enterprise; only 28% of apps integrated | MuleSoft 2025 |
| 70% of data leaders say stack is "too complex" | Modern Data 101 |
Who Gets Left Behind
Business users can't self-serve—they file tickets. Analysts become translators instead of analysts. Operations teams closest to data problems can't touch the tools. Finance waits days for reports that should take seconds.
The people who need data most are furthest from it.
The Stack They Expect You To Know
SQL. Python. YAML. JSON. Git. Docker. Kubernetes. Terraform. Airflow. dbt. Spark.
Each tool has its own syntax, its own mental model, its own failure modes. The barrier to entry isn't just high—it's intentionally technical.
::Orchestration
70% rate pipeline management "complex" — Airflow = "PHP of data sector"
The Problem
Orchestration tools exist because pipeline components don't coordinate themselves. You need a central scheduler to manage dependencies, sequence tasks, and handle failures. The scheduler becomes another system to maintain, debug, and keep running.
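The core job of the scheduler, sequencing tasks so nothing runs before its dependencies, fits in a few lines of standard-library Python. The task names below are hypothetical; everything an orchestrator adds on top of this ordering (retries, schedules, backfills, failure handling) is the operational surface you then have to maintain.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: task names are illustrative.
dag = {
    "transform": {"ingest_orders", "ingest_customers"},
    "quality_checks": {"transform"},
    "publish_dashboard": {"transform"},
}

# static_order() yields each task only after its dependencies,
# which is the essential problem orchestrators solve.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The ordering itself is trivial; the cost comes from running the machine that computes and executes it 24/7.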
"All in all, Airflow is far from perfect, and many of us have merely learned to deal with its limitations."
"Despite any criticism, Airflow continues to play a pivotal role, much like PHP of the data sector—often criticized but extensively relied upon."
| Stat | Source |
|---|---|
| 70% rate pipeline management "complex" | Matillion 2025 |
| 89% of Airflow users expect more revenue-generating or external solutions this year | Astronomer 2026 |
| 32% of Airflow users have GenAI or MLOps in production | Astronomer 2026 |
::Observability
67 monthly data incidents avg — 74% say business finds issues first
The Problem
Observability tools exist because pipelines fail silently. Something breaks at 2 AM. Nobody notices for three days. Data is missing. Downstream systems already consumed what was there and made decisions on it.
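The most common observability check, data freshness, can be sketched simply. Table names and SLA windows here are hypothetical; the point is that silent failure means nobody runs even this check until a vendor sells it back as a monitoring product:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness SLAs per table; values are illustrative.
FRESHNESS_SLA = {"orders": timedelta(hours=1), "customers": timedelta(hours=24)}

def stale_tables(last_loaded: dict, now: datetime) -> list[str]:
    """Return tables whose most recent load breaches their freshness SLA."""
    return [t for t, sla in FRESHNESS_SLA.items()
            if now - last_loaded[t] > sla]

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
last_loaded = {
    "orders": now - timedelta(hours=3),     # breached: 3h > 1h SLA
    "customers": now - timedelta(hours=6),  # fine: 6h < 24h SLA
}
print(stale_tables(last_loaded, now))  # → ['orders']
```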
"Looking past the market fragmentation and maturity, there is significant demand among data and analytics leaders to address their growing data operations complexity."
| Stat | Source |
|---|---|
| 67 monthly data incidents average; 15 hours to resolve; 74% say business finds issues first; 31% of revenue impacted by data issues | Wakefield Research 2023 |
| 61 data incidents per month on average; 40% of the workday spent firefighting bad data | Monte Carlo 2022 |
::Lineage
80% of data governance initiatives will fail by 2027
The Problem
Lineage tools exist because nothing tracks where data came from. They reconstruct history by parsing SQL queries and scanning logs. The reconstruction is incomplete, often wrong, and requires constant maintenance.
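To see why reconstructed lineage is fragile, consider a best-effort sketch that derives lineage edges from a `CREATE TABLE ... AS SELECT` statement. The table names are hypothetical, and the regex approach fails on CTEs, subqueries, and views, which is precisely the incompleteness described above:

```python
import re

def lineage_edges(sql: str) -> list[tuple[str, str]]:
    """Best-effort (source, target) edges from a CREATE TABLE AS SELECT.
    Real queries defeat regexes quickly; lineage tools use full SQL
    parsers and still produce incomplete graphs."""
    target = re.search(r"create\s+table\s+(\w+)", sql, re.IGNORECASE)
    sources = re.findall(r"\b(?:from|join)\s+(\w+)", sql, re.IGNORECASE)
    if not target:
        return []
    return [(src, target.group(1)) for src in sources]

sql = "CREATE TABLE revenue AS SELECT * FROM orders JOIN payments ON orders.id = payments.order_id"
print(lineage_edges(sql))  # → [('orders', 'revenue'), ('payments', 'revenue')]
```

Because the graph is rebuilt from query text after the fact, every query the parser misreads is an edge the lineage tool silently drops.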
"By 2027, 80% of data and analytics governance initiatives will fail."
"We are extremely adept at generating data, not so much at extracting value from those data, and very challenged to destroy any data at all. Data hoarding, data sprawl, and data decay are all significant problems."
| Stat | Source |
|---|---|
| 80% of D&A governance to fail by 2027 | Gartner |
| Only 25% measure data quality metrics | Precisely 2024 |
Lineage is the path data took; provenance is the fuller story behind the data, and it is something most data platforms do not track well, if at all.
::Ingestion
95% report integration barriers — 897 apps avg, only 28% integrated
The Problem
Ingestion tools move data from source systems to warehouses. They promise "connect once, sync forever." Reality: schema changes break pipelines, API rate limits cause data loss, and you're charged per row whether the data is useful or not.
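Schema drift, the most common pipeline breaker named above, reduces to a set comparison. The column names below are hypothetical; a source renaming or adding one field is enough to break a "sync forever" connector:

```python
# Hypothetical schema snapshots: column names are illustrative.
def schema_drift(expected: dict, observed: dict) -> dict:
    """Compare a pipeline's expected schema to what the source now returns."""
    return {
        "added": sorted(observed.keys() - expected.keys()),
        "removed": sorted(expected.keys() - observed.keys()),
        "retyped": sorted(c for c in expected.keys() & observed.keys()
                          if expected[c] != observed[c]),
    }

expected = {"id": "int", "email": "str", "signup_date": "str"}
observed = {"id": "int", "email": "str", "signup_ts": "str", "plan": "str"}
print(schema_drift(expected, observed))
# → {'added': ['plan', 'signup_ts'], 'removed': ['signup_date'], 'retyped': []}
```

The detection is trivial; what ingestion tools rarely automate well is the decision about what to do when drift is found.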
"In 2017, Y Combinator funded 15 analytics, data engineering, and AI/ML companies. In 2021, they funded 100. It's impossible to make sense of this many tools, much less manage even a fraction of them in a single stack."
| Stat | Source |
|---|---|
| 95% report integration challenges as barriers to AI | Salesforce 2024 |
| 70% of data workers rate pipeline management "complex" | Matillion 2025 |
Note that current ingestion systems reprocess source data many times, especially those that push everything to the cloud before any processing happens.
::CDC (Change Data Capture)
89% report pipeline scaling issues — Kafka operational overhead
The Problem
CDC tools track changes in source databases. They require database-level access, often involve Kafka clusters for streaming, and create operational overhead that exceeds the original data engineering problem.
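Query-based CDC, the simplest of the approaches above, is just polling against a watermark. The row shape here is hypothetical, but the sketch makes the classic limitation concrete: a hard-deleted row never appears in the result, so deletes are silently missed.

```python
from datetime import datetime, timezone

def poll_changes(rows: list[dict], watermark: datetime) -> list[dict]:
    """Query-based CDC: return rows updated since the last watermark.
    Hard deletes never show up here, so they are silently lost."""
    return [r for r in rows if r["updated_at"] > watermark]

watermark = datetime(2024, 1, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "updated_at": datetime(2023, 12, 31, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]
print([r["id"] for r in poll_changes(rows, watermark)])  # → [2]
```

Avoiding that gap is what pushes teams toward log-based CDC, and with it the Kafka clusters and database-level access that create the operational overhead.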
"The thing to know about merchants of complexity is that they never go away, they merely migrate. From WS-DeathStar to microservices to premature k8s to auth services to GraphQL and beyond."
| Stat | Source |
|---|---|
| 89% report scaling issues with pipelines | Matillion 2025 |
| 20-30% write path overhead for trigger-based CDC | System Overflow |
| 10-60 seconds latency for query-based CDC; deletes can be missed | System Overflow |
CDC in the modern data stack is generally database-oriented. IBM is one of the few major platforms that documents broader provenance and element-level tracking concepts, while Palantir focuses more on dataset lineage and ontology rather than CDC-style element-level change capture.
::Reverse ETL
Exists because the stack moves data the wrong direction
The Problem
Reverse ETL tools exist to get data OUT of warehouses—because the whole stack moved data one direction, and now you need it back in Salesforce, HubSpot, and the other tools where work actually happens.
"The problem wasn't the market around the products we were building; the problem was the product itself."
| Pipeline Stage | Tools Involved |
|---|---|
| Source → Warehouse | Fivetran, Stitch, Airbyte |
| Warehouse → Transform | dbt, Dataform |
| Warehouse → Destination | Census, Hightouch, Polytomic |
| Data Warehouses | Databricks, Snowflake |
::Final words
The current data stack is fragmented, expensive, and operationally heavy.
The same tools that were meant to simplify data work have created new layers of complexity. Teams spend more time wiring systems together, managing vendor sprawl, and handling operational overhead than they do improving the actual data products people rely on.
The result is more cost, more risk, and less control over data, governance, and compliance. That is the core problem this report has been pointing at throughout.