Agentic AI at Scale: Why Most Enterprises Stumble
The heartburn CTOs and CAIOs are facing today:
After a significant amount of organizational effort, you have finally rolled out an Agentic AI solution for your contact center, intended to enhance the employee experience and, in turn, the customer experience. You expected measurable improvements in CSAT, handle time, deflection, and agent productivity.
However, the solution didn’t yield any real benefits; instead, it created more work and dissatisfaction among contact center users and, in turn, end customers. It hallucinated, returning incomplete, conflicting, unnecessarily long, and at times flat-out wrong responses. You also incurred significant token costs without delivering real value.
So, what happened? Why did the solution that worked perfectly well in PoC and dev/test fail miserably in the real world?
Here’s why:
The Agentic application skillfully utilized LLMs to extract data from the organization’s various data stores, leveraging several internal and external tools to execute multi-step workflows. The developers wrote clever and extremely descriptive prompts, and in doing so they inadvertently kept filling the LLM’s large context window. What they didn’t realize was that more is not always better.
Irrelevant and extraneous context dilutes the model’s attention, increases token cost and noise, and biases its responses to recently added context. This results in inaccurate responses and degraded workflows. LLMs suffer from attention decay, meaning older—but still critical—information becomes overshadowed by recently added noise. In most naïve deployments, 60–80% of tokens consumed are irrelevant to the task at hand.
What exactly does context contain?
1. Instructions – Prompts, few-shot examples, tool descriptions, etc.
2. Knowledge – Facts, memories, conversation history, user preferences, information retrieved from documents or databases, and real-time data.
3. Tools – Available tools, their definitions, feedback from tools, API responses, etc.
Managing all this requires your system to maintain both short-term memory and long-term state, and to optimize the context by removing, pruning, or compressing it when appropriate.
This practice of managing context—maintaining short-term memory and long-term state, pruning when appropriate, and carefully selecting what information to provide to the LLM at the right time—is known as context engineering.
Here’s what Andrej Karpathy says, “Context engineering is the delicate art and science of filling the context window with just the right information for the next step.”
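To make those three ingredient types concrete, here is a minimal, framework-agnostic sketch of what ends up in the window on every model call. The class and field names are illustrative, not any particular SDK:

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    # 1. Instructions: system prompt, few-shot examples, behavioral guidance
    instructions: list[str] = field(default_factory=list)
    # 2. Knowledge: retrieved documents, memories, conversation history
    knowledge: list[str] = field(default_factory=list)
    # 3. Tools: tool definitions plus the results they return
    tool_definitions: list[str] = field(default_factory=list)
    tool_results: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Flatten everything into the text the LLM actually sees.
        Every item appended here consumes tokens and attention on every call."""
        parts = self.instructions + self.tool_definitions + self.knowledge + self.tool_results
        return "\n\n".join(parts)
```

Everything returned by render() is billed and attended to on every call, which is why the curation decisions below matter.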
Let’s understand what happens if the context is not managed correctly.
If context is allowed to grow uncontrollably, it can quickly escalate the system’s costs, increase latency, and degrade overall performance, resulting in a poor user experience and diminished business benefits. When context is not curated or pruned, four failure modes emerge:
Context Poisoning – Erroneous information enters the context and is then treated as fact. The agent builds on it, creating flawed workflows, behaving erratically, and failing to meet its goals, or worse, chasing goals that can never be met.
Context Distraction – As workflows execute, agents accumulate tool results, conversation history, and operational state—all piling into context. Instead of helping, this growing context becomes noise. The model fixates on what it just did rather than what it should do next, leading agents to repeat past actions instead of adapting.
Context Confusion – Superfluous content dilutes the model’s attention. A recent study illustrates this perfectly: Llama 3.1 8B with 46 tools loaded in context performed worse than the same model with only 19 tools. More isn’t better—it’s just noisier.
Context Clash – New information contradicts what’s already in context, creating internal conflicts the model can’t resolve. This amplifies all the other failure modes simultaneously.
Fortunately, these risks and failure modes are preventable with well-designed strategies to manage context.
What strategies can you implement for better context engineering?
Here are five proven strategies for managing context effectively:
1. RAG – Implement a RAG architecture to retrieve only the relevant information and tool definitions, rather than loading everything into the LLM’s context window. The vector database should store references to data stores and functionality, such as web links, file locations, API endpoints, and database connection strings, so the agent can load them dynamically only when needed. Agents can decide what to keep in the context and what to persist externally (see the retrieval sketch after this list).
2. Context Quarantine – Isolate contexts in their own threads, each using a separate LLM call with its own context window. One way to implement this is to break the work into individual jobs, each executed by a different agent in parallel, with each agent focusing on a specific aspect of the workflow (see the quarantine sketch below). This minimizes context confusion, distraction, and clash, and improves speed and accuracy.
3. Context Pruning – Remove unneeded and irrelevant information from the context. This reduces model latency, lowers token costs, and enhances overall accuracy.
4. Context Summarization – Compress the context to its most essential parts. Knowing what to summarize, and how, is a critical skill for agent builders. Many use a separate agent, backed by a different LLM specialized in summarization, to condense the context and pass the result back to the main agent.
5. Context Offloading – Models write less-relevant context and progress notes to an external store, such as a scratchpad, for later reference. Context can also be offloaded to a vector database and retrieved via RAG, along with other relevant information, at the appropriate time (pruning, summarization, and offloading are combined in the compaction sketch after this list).
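For example, instead of loading every tool definition up front, the agent can embed tool descriptions once and pull in only the handful relevant to the current request. A minimal retrieval sketch, assuming you supply your own embedding function (the embed parameter below is a placeholder, not a specific library):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_tools(query: str, tool_index: list[dict], embed, top_k: int = 5) -> list[dict]:
    """Return only the top_k tool definitions most relevant to the query.

    tool_index holds {"definition": ..., "embedding": ...} entries built offline;
    embed is whatever embedding function your stack provides (placeholder here).
    Only the selected definitions are placed in the model's context.
    """
    query_vec = embed(query)
    scored = sorted(tool_index, key=lambda t: cosine(query_vec, t["embedding"]), reverse=True)
    return [t["definition"] for t in scored[:top_k]]
```

The same pattern applies to documents and memories: keep the full corpus external and let the retriever decide what earns a place in the window.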
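Context quarantine can be sketched as a fan-out/fan-in pattern: each sub-agent starts from a clean, narrow context, and only its condensed result flows back to the orchestrator. The call_llm parameter below stands in for whatever async model client you use:

```python
import asyncio


async def run_subtask(subtask: str, shared_brief: str, call_llm) -> str:
    """Each sub-agent sees only the shared brief and its own subtask --
    no sibling history and no unrelated tool output."""
    return await call_llm(f"{shared_brief}\n\nYour task: {subtask}")


async def quarantine_and_merge(subtasks: list[str], shared_brief: str, call_llm) -> str:
    """Run sub-agents in parallel, then synthesize from their condensed results.
    Full working histories never re-enter the orchestrator's context."""
    results = await asyncio.gather(
        *(run_subtask(task, shared_brief, call_llm) for task in subtasks)
    )
    findings = "\n".join(f"- {task}: {result}" for task, result in zip(subtasks, results))
    return await call_llm(
        f"{shared_brief}\n\nSub-agent findings:\n{findings}\n\nSynthesize a final answer."
    )
```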
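Pruning, summarization, and offloading often operate on the same running message history. A rough compaction sketch, where summarize_llm is a placeholder for a cheaper model dedicated to compression, scratchpad is any external store, and the tool_result_raw role name is purely illustrative:

```python
def compact_history(messages: list[dict], summarize_llm, scratchpad: dict,
                    keep_recent: int = 10) -> list[dict]:
    """Keep recent turns verbatim; prune, offload, and summarize the rest."""
    if len(messages) <= keep_recent:
        return messages

    older, recent = messages[:-keep_recent], messages[-keep_recent:]

    # Pruning: drop bulky raw tool payloads that are no longer needed verbatim.
    older = [m for m in older if m.get("role") != "tool_result_raw"]

    # Offloading: persist the full older turns outside the context window.
    scratchpad["archived_turns"] = scratchpad.get("archived_turns", []) + older

    # Summarization: a separate, cheaper model compresses what remains.
    summary = summarize_llm(
        "Summarize the key facts, decisions, and open questions:\n"
        + "\n".join(m["content"] for m in older)
    )
    return [{"role": "system", "content": f"Summary of earlier work: {summary}"}] + recent
```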
While these strategies provide the tactical toolkit, successful implementation requires continuous measurement and validation.
The strategies above work—but only if your teams actually implement them. How do you ensure context engineering doesn’t become another best practice that gets ignored in production?
The answer is observability. Without visibility into how your agents manage context, you’re deploying blind. Observability helps teams understand:
1. What the agent did
2. How it did it
3. Why it did it
4. When and where failures occur
Your teams should capture the information below (a minimal instrumentation sketch follows the list).
1. Metrics: Token usage, latency, success/failure rate, resource utilization, etc.
2. Logs: User-agent interactions, tool calls, model prompts/responses, decision logs
3. Traces: Step-by-step agent workflows, multi-agent interactions, fallback loops
4. Evaluations: Task adherence, intent resolution, tool call accuracy, hallucination detection
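A lightweight way to start, before adopting a full tracing stack, is to wrap every agent step and emit a structured record covering the signals above. This sketch uses only the Python standard library; the field names and the optional usage attribute on the model response are assumptions, not any specific vendor's schema:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.observability")


def observe_step(step_name: str, trace_id: str, fn, *args, **kwargs):
    """Run one agent step and log metrics (latency, token count when available),
    the trace id that ties multi-step workflows together, and the outcome."""
    start = time.perf_counter()
    record = {"trace_id": trace_id, "step": step_name, "span_id": uuid.uuid4().hex}
    try:
        result = fn(*args, **kwargs)
        record["status"] = "success"
        # Many model clients expose a usage object; treat it as optional here.
        usage = getattr(result, "usage", None)
        record["tokens"] = getattr(usage, "total_tokens", None)
        return result
    except Exception as exc:
        record["status"] = "failure"
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        log.info(json.dumps(record, default=str))
```

Even this much, piped into your existing log pipeline, makes runaway token counts and repeated tool calls visible per trace.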
[More about AI agent observability in a future article]
Every irrelevant token burned is latency added, money wasted, and attention diverted from what matters. Enterprises that actively engineer context see measurable reductions in hallucinations, token cost, latency, and workflow errors — and unlock the actual value of agentic AI in production.
References:
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
https://blog.langchain.com/context-engineering-for-agents/
https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html
https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html
https://www.anthropic.com/engineering/multi-agent-research-system
https://arxiv.org/abs/2501.16214 (Provence: efficient and robust context pruning for retrieval-augmented generation)


