The Bitter Lesson for the AI Stack: 4 Audit Points Before…

By Tobias Massow April 2, 2026 8 min read

Most enterprise AI stacks are over-engineered contraptions: thousands of tokens in system prompts, multi-stage RAG pipelines, hard-coded business rules, and manual code review acting as the bottleneck. That approach worked when models hovered around 85 % accuracy. With every new model generation the balance shifts–and complexity becomes the drag. Four concrete audit points show where IT teams need to simplify right now.

Key Takeaways

Rich Sutton’s “Bitter Lesson” (2019) applies to AI stacks as well: human-designed scaffolding loses to model intelligence once scaling kicks in (Sutton, University of Alberta).
Context windows have ballooned from 4,000 tokens (GPT-3, 2020) to over 1 million tokens (2025/26)–a 250× jump in five years. That upends retrieval architectures at their core.
Procedural system prompts running 3,000+ tokens can often be trimmed by 30–50 % on newer models without quality loss (Anthropic Prompt Engineering Guide, 2025).
In October 2024, Google’s Big Sleep (Project Zero + DeepMind) uncovered a previously unknown vulnerability in SQLite before it reached an official release: the first publicly documented case of an AI agent discovering an unknown security flaw in widely used real-world software.
Frontier models cost 5–10× more per token than their predecessors. Efficient prompts are no longer just a quality issue; they’re a cost issue.

Why Scaling Forces Simplification

In March 2019, Rich Sutton, professor at the University of Alberta and co-founder of modern reinforcement-learning research, published an essay titled “The Bitter Lesson.” His thesis: over 70 years of AI history, methods relying on raw computing power have consistently outperformed approaches that incorporate human domain knowledge. Not because human knowledge is worthless–but because it cannot keep pace with scaling.

Six years later, the same pattern is evident in work with large language models. Teams build systems around models: multi-stage prompt chains, hard-coded decision trees, manually curated retrieval pipelines. This made sense when GPT-3 operated with a 4,000-token context window and hallucinated frequently. But models have improved faster than the systems around them.

The scaling laws from Kaplan et al. (2020, arXiv:2001.08361) and the Chinchilla results from Hoffmann et al. (2022, arXiv:2203.15556) show that model performance improves predictably with compute, data, and parameter count. In practice, this means each new model generation renders part of the human-designed complexity obsolete. Not all of it, but enough to warrant regular reassessment of existing architectures.

250x: context-window growth since 2020
30–50 %: prompt reduction without quality loss
5–10x: cost jump for frontier models

Audit 1: Streamlining Prompt Scaffolding

The first question for any production-grade AI stack: how much of the system prompt describes the desired outcome, and how much prescribes the route to get there? In most production systems the split is roughly 20 to 80: twenty percent goal, eighty percent procedure.

A typical customer-support example: a 3,000-token system prompt that mandates intent classification across 14 categories, defines retrieval steps, enforces hallucination checks, and locks response formats. That procedural specification was necessary because earlier models skipped steps without explicit guidance. With more capable models it becomes a straitjacket: the model follows the prescribed path even when it knows a better route.

Anthropic’s Prompt Engineering Guide puts it plainly: add complexity only when it demonstrably improves results. OpenAI’s Codex documentation echoes the same principle: describe the goal, not the path.

Aspect	Procedural Prompt (status quo)	Outcome Prompt (target state)
Intent	“Classify into 14 categories, then route to handler”	“Resolve the customer’s concern”
Retrieval	“Top 5 KB articles via hybrid search, alpha=0.7”	“Use our knowledge base and policies”
Validation	“Check for hallucinated URLs, then fact-check”	“Answer must comply with our return policy”
Token Usage	~3,000 tokens	~800 tokens

The takeaway: go through every prompt line by line. For each instruction ask: is this here because the model needs it–or because I assumed it did? Teams preparing their developer-experience stack for the next model generation should start here.

Audit 2: Simplifying Retrieval Architecture

RAG isn’t dead. But the question of who controls the retrieval logic is shifting. With a 4,000-token context window, precise chunking, re-ranking, and filtering were essential for survival. With a million tokens, the calculation changes.

When a model can process 500 pages of text at once, the question “Which 5 chunks are relevant?” loses its urgency. Instead, the decisive architectural decision becomes: “Which repo or document collection does the model receive?” Retrieval intelligence migrates from pipeline code into the model itself.

The evolution of context windows illustrates this: GPT-3 launched in 2020 with 4,096 tokens. GPT-4 Turbo arrived in late 2023 with 128,000 tokens. Google’s Gemini reached 1 million tokens in 2024. By early 2026, several models operate beyond the million-token mark. This isn’t linear growth: it’s a roughly 250-fold increase in five years. Every tenfold expansion of the context window renders part of the retrieval pipeline obsolete because the model can process more raw data directly.

That doesn’t mean vector databases disappear. For corpora beyond the context window, retrieval remains necessary. But the logic simplifies: instead of multi-stage re-ranking pipelines with manually tuned thresholds, it’s increasingly enough to present the model with a well-organized, searchable repository and let the model handle selection. The effort shifts from the pipeline to the document structure.
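The decision described above, whether a collection fits into the context window or still needs retrieval, can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Document` type, the `plan_context` function, the 4-characters-per-token heuristic, and the `reserve` value are all assumptions made for this example. A real system would count tokens with the tokenizer of the model in use.

```python
from dataclasses import dataclass

# Assumption: roughly 4 characters per token for English text.
# Real tokenizers differ, so treat this as a coarse estimate only.
CHARS_PER_TOKEN = 4


@dataclass
class Document:
    path: str
    text: str


def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (heuristic, not a tokenizer)."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def plan_context(docs: list[Document], context_window: int, reserve: int = 20_000) -> dict:
    """Decide whether the whole collection fits or retrieval is still needed.

    `reserve` (hypothetical default) keeps room for the system prompt,
    the user question, and the model's answer.
    """
    budget = context_window - reserve
    total = sum(estimate_tokens(d.text) for d in docs)
    if total <= budget:
        # Hand over the full collection and let the model select what matters.
        return {"mode": "full_collection", "tokens": total, "docs": [d.path for d in docs]}
    # Corpus exceeds the window: a retrieval step is still required.
    return {"mode": "retrieval", "tokens": total, "docs": []}
```

With a million-token window, a documentation set of a few hundred pages lands in the `full_collection` branch. The same set against a 4,096-token window falls back to `retrieval`.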

For platform-engineering teams building internal developer platforms with AI assistants, this has a practical consequence: invest in the quality and structure of your documentation rather than the complexity of your retrieval pipeline. A cleanly organized Confluence wiki or a well-structured Git repository delivers more than a sophisticated re-ranking model.

Audit 3: Hardcoded Domain Knowledge vs. Model Inference

How many business rules have you hardcoded into system prompts? Count them. Then ask for each one: can the model infer this rule from context if it has access to the relevant documents?

Example: a reporting system that defines house style for customer reports as a 15-line instruction in the prompt–style, structure, phrasing rules, formatting. A capable model infers all of this from a single sample report with higher accuracy than from an abstract rule description. This is exactly the mechanism Sutton describes: scaling laws don’t render human-coded knowledge worthless, but increasingly redundant because the model can derive it itself.

“Anyone who needed a 3,000-token system prompt in 2024 will achieve better results in 2026 with 800 tokens–provided they describe the destination instead of the route and give the model access instead of prescriptions.”

cloudmagazin editorial assessment

What must remain hardcoded: compliance rules that cannot be violated (return policies, regulatory mandates). Security boundaries where any breach is unacceptable. Everything else deserves a test: prompt with rule vs. prompt without rule. If the results are equally good, the rule can go.
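The "prompt with rule vs. prompt without rule" test can be sketched as a leave-one-out comparison against an evaluation set. Everything here is an assumption made for illustration: `removable_rules`, the `run_model` and `score` callables, and the shape of the eval cases are hypothetical placeholders for whatever model client and metric a team already uses. Compliance and security rules should not be passed in as candidates at all.

```python
from statistics import mean
from typing import Callable, Sequence


def removable_rules(
    goal: str,
    rules: Sequence[str],
    eval_set: Sequence[dict],
    run_model: Callable[[str, str], str],   # (system_prompt, user_input) -> output
    score: Callable[[str, dict], float],    # (output, eval_case) -> quality score
    tolerance: float = 0.0,
) -> list[str]:
    """Return the rules whose individual removal does not lower the mean score."""

    def evaluate(active_rules: Sequence[str]) -> float:
        prompt = "\n".join([goal, *active_rules])
        return mean([score(run_model(prompt, case["input"]), case) for case in eval_set])

    baseline = evaluate(rules)
    removable = []
    for i, rule in enumerate(rules):
        # Build the prompt with exactly one rule left out.
        without = [r for j, r in enumerate(rules) if j != i]
        if evaluate(without) >= baseline - tolerance:
            removable.append(rule)
    return removable
```

One caveat on the design: each rule is tested in isolation, so two rules that are individually removable may still matter in combination. Re-running the evaluation after actually deleting the candidates is a sensible final check.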

Audit 4: One Eval-Gate Instead of Multiple Checkpoints

Intermediate evaluation steps in AI pipelines were a response to unreliable models: after each step, check whether the intermediate result is correct before the next step begins. Intent classified? Check. Retrieval relevant? Check. Response hallucination-free? Check.

With models that work correctly in 99 percent of cases, the cost-benefit calculation shifts. Every intermediate check adds latency, tokens, and complexity. If the final result is correct in the vast majority of cases, a single comprehensive eval-gate at the end is more efficient than five partial checks along the way.

This is especially relevant for software development. Google’s Big Sleep (a collaboration between Project Zero and DeepMind) discovered an unknown stack-buffer-underflow vulnerability in SQLite in October 2024–the first publicly documented case of an AI agent uncovering a real zero-day in widely used open-source software. If AI models can find vulnerabilities that experienced security researchers missed, they can also take on code reviews and regression tests.

The practical recommendation: an eval script at the end of the pipeline that comprehensively tests functional requirements, non-functional requirements, and edge cases. If all tests pass, the result is released. If not, it goes back to the model. No manual intermediate steps, no human review as a bottleneck.
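A single eval-gate of this kind might look like the following sketch. The names `run_with_eval_gate`, `generate`, and `checks` are hypothetical, and the attempt limit is an assumption added here so the loop cannot run forever. What happens after the limit is reached (escalation, logging, fallback) is left to the caller.

```python
from typing import Callable, List, Optional

# A check returns an error message, or None if the candidate passes.
Check = Callable[[str], Optional[str]]


def run_with_eval_gate(
    generate: Callable[[List[str]], str],  # receives failure messages from the last attempt
    checks: List[Check],
    max_attempts: int = 3,                 # assumption: bounded retries
) -> dict:
    """Generate a result, run all checks once at the end, and retry on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: List[str] = []
    candidate = ""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        # Run every check against the final result and collect failure messages.
        feedback = [msg for check in checks if (msg := check(candidate)) is not None]
        if not feedback:
            return {"released": True, "attempts": attempt, "output": candidate}
    # Gate never passed within the attempt budget: do not release.
    return {"released": False, "attempts": max_attempts, "output": candidate, "failures": feedback}
```

The checks stand in for functional tests, non-functional tests, and edge cases. Failure messages are passed back to `generate`, so the model sees why the previous attempt was rejected.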

Costs and Multi-Model Routing

Frontier models are expensive. NVIDIA’s GB200 platform (Blackwell architecture, unveiled at GTC in March 2024) and its GB300 successors (Blackwell Ultra, GTC March 2025) push training costs into the hundreds of millions of euros per model. That trickles down to inference costs: frontier models cost 5 to 10 times more per token than their predecessors. Sending all traffic through a frontier model burns budget. Delegating everything to the cheapest model sacrifices quality on complex tasks.

The answer is multi-model routing: delegate simple tasks (classification, extraction, formatting) to inexpensive models, forward complex tasks (reasoning, code generation, security audits) to frontier models. The ability to route problems correctly will become one of the most important skills in API-first architectures in 2026.
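In its simplest form, such a router is a lookup on the task type. The sketch below uses the task categories named above. The model identifiers are hypothetical placeholders, and sending unknown task types to the frontier model is an assumption that favors quality over cost.

```python
# Task categories taken from the text; the identifiers are illustrative.
SIMPLE_TASKS = {"classification", "extraction", "formatting"}
COMPLEX_TASKS = {"reasoning", "code_generation", "security_audit"}

# Hypothetical model names, not real product identifiers.
CHEAP_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"


def route(task_type: str) -> str:
    """Pick a model tier for a task type."""
    if task_type in SIMPLE_TASKS:
        return CHEAP_MODEL
    if task_type in COMPLEX_TASKS:
        return FRONTIER_MODEL
    # Assumption: unknown tasks go to the frontier model,
    # trading higher cost for lower quality risk.
    return FRONTIER_MODEL
```

In practice the hard part is assigning `task_type` correctly in the first place. A static table like this only works when requests arrive already labeled.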

Simplifying prompts is not just a quality issue; it’s also a cost lever. A 3,000-token system prompt trimmed to 800 tokens saves 2.2 million tokens across a thousand daily API calls. At frontier prices of €15 per million input tokens, that’s €33 per day–nearly €1,000 per month. Simplification and cost efficiency go hand in hand.
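The arithmetic behind these figures is easy to reproduce for your own traffic. The function name and the 30-day month are assumptions made for this sketch. The input values are the ones from the paragraph above.

```python
def prompt_savings(
    old_tokens: int,
    new_tokens: int,
    calls_per_day: int,
    price_per_million: float,
    days: int = 30,  # assumption: 30-day month
) -> tuple[int, float, float]:
    """Return (tokens saved per day, cost saved per day, cost saved per month)."""
    tokens_per_day = (old_tokens - new_tokens) * calls_per_day
    cost_per_day = tokens_per_day * price_per_million / 1_000_000
    return tokens_per_day, cost_per_day, cost_per_day * days


tokens, per_day, per_month = prompt_savings(3_000, 800, 1_000, 15)
print(tokens, per_day, per_month)  # 2200000 33.0 990.0
```

Only input tokens for the system prompt are counted here. Output tokens and the user message are unaffected by the trimming.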

Conclusion

The Bitter Lesson applies not only to AI researchers. It applies to every team running AI models in production. Four audits–prompt scaffolding, retrieval architecture, hard-coded domain knowledge, and evaluation pipelines–show exactly where complexity becomes a brake. Models are improving faster than most surrounding systems can adapt. Teams that simplify now will be ready when the next generation arrives. Teams clinging to a 5,000-token prompt honed over years will discover that a one-liner delivers better results.

Frequently Asked Questions

What exactly does Rich Sutton’s “Bitter Lesson” state?

In 2019, Rich Sutton argued that across more than 70 years of AI history, methods relying on scaling compute consistently outperformed approaches that baked in human domain knowledge. For AI stacks, the takeaway is clear: instead of layering on ever more rules and scaffolding, give the model more freedom and measure the results.

Should I delete my entire system prompt?

No. Compliance rules, safety guardrails, and non-negotiable business logic stay in the prompt. What you can remove are procedural step-by-step instructions that tell the model how to solve the task instead of defining the goal. Quick test: compare outputs with and without the rule. No drop in quality? Remove the rule.

Is RAG redundant with large context windows?

Not necessarily. For corpora that exceed the context window, retrieval remains essential. However, the retrieval logic simplifies: instead of multi-stage re-ranking pipelines, it’s increasingly enough to give the model a well-structured repository and let it handle the selection. The investment shifts from pipeline complexity to document quality.

How did Google’s Big Sleep uncover the SQLite vulnerability?

Big Sleep is a collaboration between Google Project Zero and Google DeepMind. In October 2024, the AI agent identified a stack buffer underflow in SQLite, an issue present in a development branch and caught before an official release. It was the first publicly documented case of an AI agent discovering an unknown security flaw in widely used software.

How do I start a prompt audit for my existing AI stack?

Three steps: first, go through every system prompt line by line and label each instruction as either “goal” or “process.” Second, remove all process instructions one by one and measure output quality against an evaluation set. Third, re-introduce only those instructions whose removal causes measurable quality drops. Most teams find that 30 to 50 percent of process instructions no longer have any measurable impact.

Source header image: AI-generated via Cloudflare FLUX.2 / cloudmagazin editorial team

Image source: AI-generated (May 2026), C2PA certificate embedded in image
