Building Production-Grade AI Agents - Part 1
Insights for building high-performing, smart, production-grade AI agents and automations
You’re not building “AI”. At best, you’re mimicking it
Over the past two years, I’ve designed and built AI agents ranging from hobby projects to enterprise-grade applications, collectively used by hundreds of thousands of users. For the last few months, I’ve been leading engineering at Breakout, where we’re creating a cutting-edge AI SDR agent with the hope of one day serving millions of users.
This journey—from small-scale experiments to scalable, production-ready AI solutions—has been an incredible learning experience. In this post for GGPUSH, I’ll share the insights I’ve gained along the way. The focus here is exclusively on building applications using existing large language models (LLMs). While topics like training, self-hosting, and fine-tuning are important parts of this space, they fall outside the scope of this discussion. They are, however, a natural extension of the concepts discussed here and are indeed important to creating powerful AI agents.
The first and most important thing to internalise when building generative AI applications is this: we are not building AI or AGI. A handful of researchers at organisations like OpenAI, Google, or Microsoft might be working on it, but for the rest of us, we’re simply mimicking intelligence using state-of-the-art (SOTA) foundational models to achieve specific outcomes. To do this effectively, we must fully grasp both the capabilities and limitations of large language models (LLMs).
At their core, LLMs are advanced next-token generators. The larger the model, the better it is at predicting and generating text based on its input. However, an LLM is not capable of actual thinking or reasoning. What we perceive as reasoning is modelled solely through prompt engineering. A prompt is more than an instruction set; it is a way of guiding a model’s generated output. The better you are at writing (or generating) prompts, the better your agent behaves.
(Pro advice: apocalyptic threats work. Kind of funny and ironic, but they work.)
Take the Chain of Thought (CoT) prompting technique as an example. In CoT, you provide the model with a step-by-step example of problem-solving. Because the model generates its outputs based on patterns in its input, this structured approach encourages it to produce step-by-step responses, simulating logical reasoning. While this doesn’t mean the LLM is truly thinking, it can lead to outcomes that appear intelligent and coherent.
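To make this concrete, here’s a minimal sketch of a CoT-style few-shot prompt. The worked example and wording are illustrative, not from any particular production system:

```python
# A minimal Chain-of-Thought prompt: the worked example teaches the model
# to emit intermediate steps before the final answer. The problem and
# wording here are illustrative.
COT_PROMPT = """Answer the question by reasoning step by step.

Q: A warehouse has 120 boxes. 30% ship on Monday and half of the
remainder ship on Tuesday. How many boxes are left?
A: Monday ships 30% of 120 = 36 boxes, leaving 120 - 36 = 84.
Tuesday ships half of 84 = 42, leaving 84 - 42 = 42.
The answer is 42.

Q: {question}
A:"""

def build_cot_prompt(question: str) -> str:
    return COT_PROMPT.format(question=question)
```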
We can even impose structure on these outputs (using JSON mode) so the model acts as an interface layer in our programs, thereby giving us the ability to build advanced AI agents capable of performing actions based on these simulated thoughts.
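Here’s a hedged sketch of what that interface layer can look like, assuming the OpenAI Python SDK and JSON mode. The thought/action/arguments schema is my own convention, not anything the API enforces:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Ask the model to emit a machine-readable "simulated thought" that our
# program can branch on.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": (
            "Reply ONLY with JSON: "
            '{"thought": str, "action": "search" | "reply", "arguments": object}'
        )},
        {"role": "user", "content": "What changed in our refund policy last week?"},
    ],
)

decision = json.loads(resp.choices[0].message.content)
if decision["action"] == "search":
    ...  # route to a retrieval step using decision["arguments"]
```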
Here are a few key tenets I always keep in mind when designing AI agents:
LLMs are not knowledge sources.
LLMs are inherently limited to the data they were trained on and lack “awareness” of events or information beyond their last training date. This limitation makes them unreliable as standalone knowledge sources, as there’s no guarantee of accuracy or relevance in their responses. For this reason, one should avoid using LLMs directly for knowledge-intensive tasks like Q&A. Instead, rely on Retrieval-Augmented Generation (RAG) approaches, where the agent taps into its own knowledge base to retrieve relevant information and provides it as context in prompts.
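A minimal sketch of that retrieve-then-generate pattern, with `retrieve` and `llm_call` standing in for your own search and model-call functions:

```python
# Minimal retrieve-then-generate loop. `retrieve` stands in for whatever
# search your knowledge base supports (vector, keyword, hybrid), and
# `llm_call` for your model call.
def answer_with_rag(question: str, retrieve, llm_call) -> str:
    chunks = retrieve(question, top_k=5)
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_call(prompt)
```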
For the knowledge base, we can typically use a combination of file storage, traditional databases, and vector databases. The success of a RAG system depends on two critical stages: index creation and retrieval.
When building the knowledge base or index, it’s essential to store only the most relevant and complete information in well-structured chunks (look up chunking techniques like proposition-based chunking). In multimodal setups, contextual connections between related data points must also be established. For this, you can use a GraphDB to represent relationships effectively, though similar contextual connections can be built in a VectorDB using metadata (label) overlaps. These contextual connections further augment the retrieval step by making sure we look at the right piece of information at the right time. We can also use the labels to reduce our search space, which benefits both performance and retrieval quality.
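Here’s a toy sketch of that label-based pre-filtering, using plain NumPy in place of a real VectorDB; the index structure and labels are illustrative:

```python
import numpy as np

# Toy index: each chunk carries an embedding plus metadata labels.
index = [
    {"text": "...", "embedding": np.random.rand(384), "labels": {"pricing", "q3"}},
    # ...
]

def search(query_emb: np.ndarray, required_labels: set, top_k: int = 5):
    # 1. Use label overlap to shrink the search space before scoring.
    candidates = [c for c in index if c["labels"] & required_labels]

    # 2. Rank the survivors by cosine similarity.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return sorted(candidates,
                  key=lambda c: cosine(query_emb, c["embedding"]),
                  reverse=True)[:top_k]
```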
Retrieval quality also depends on achieving both high precision and high recall. While semantic similarity scoring is useful, it’s not always sufficient for identifying the most relevant results. A lower similarity threshold might boost recall but can compromise precision, bringing irrelevant data into the mix. In my experience, score thresholds often don’t mean much on their own; it is difficult to tune this parameter alone to achieve better retrieval.
To address this, we can incorporate rerankers and graders into the retrieval pipeline. For example, an LLM or a scoring function can be used to evaluate retrieved results, eliminating unreliable contexts. While this adds latency, it significantly improves the quality of the output.
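A minimal grader sketch, with `llm_call` as a placeholder for your model call:

```python
def grade_chunks(question: str, chunks: list, llm_call) -> list:
    """Keep only the chunks an LLM judges relevant. This adds one call per
    chunk (latency and cost), so batch or parallelise in practice."""
    kept = []
    for chunk in chunks:
        verdict = llm_call(
            "Does the passage help answer the question? Reply yes or no.\n"
            f"Question: {question}\nPassage: {chunk['text']}"
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(chunk)
    return kept
```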
Another technique to leverage is query fusion, where queries are dynamically rewritten based on the context, increasing the likelihood of retrieving highly relevant results. The results are then “fused” together and re-ranked. Additionally, a percentile-based filtering approach can further refine precision, and it works especially well in combination with query fusion. (I recently learned that Dropbox uses a similar approach for Q&A and summarisation of files. Read the article here)
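A sketch of query fusion with reciprocal rank fusion (RRF) and a percentile cutoff. `rewrite_queries` and `retrieve` are placeholders, and RRF is one common fusion choice, not necessarily the one Dropbox uses:

```python
from collections import defaultdict
import numpy as np

def fused_retrieve(question: str, rewrite_queries, retrieve, percentile=75):
    """Rewrite the query into variants, retrieve for each, fuse the
    rankings with reciprocal rank fusion, then keep only results above a
    score percentile. All helper functions are placeholders."""
    scores = defaultdict(float)
    docs = {}
    for q in rewrite_queries(question):            # e.g. LLM-generated variants
        for rank, chunk in enumerate(retrieve(q, top_k=10)):
            scores[chunk["id"]] += 1.0 / (60 + rank)   # 60 is the usual RRF constant
            docs[chunk["id"]] = chunk
    cutoff = np.percentile(list(scores.values()), percentile)
    return [docs[cid] for cid, s in sorted(scores.items(), key=lambda kv: -kv[1])
            if s >= cutoff]
```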
Ultimately, a strong RAG system depends on building a robust knowledge base (index) and designing a retrieval process that balances precision, recall, and computational efficiency. When implemented effectively, this approach ensures that the AI agent delivers reliable, contextually accurate responses.
LLMs don’t think
LLMs are not inherently good at reasoning; their reasoning capabilities are primarily enhanced as a result of Reinforcement Learning from Human Feedback (RLHF) applied to synthetic instruction datasets. This limitation is why we should generally avoid building ReAct (Reasoning + Action) agents for scenarios that require highly deterministic agent behaviour. Techniques like few-shot prompting and Chain of Thought (CoT) can provide some level of control over reasoning outputs, but in my experience, they fall short for production systems that demand minimal tolerance for hallucinations.
That said, if you’re building an agent for open-ended automation tasks, such as a web research assistant, these approaches can be useful—provided you have robust graders and guardrails in place to keep the agent on track. However, relying on these methods in production for creating highly consistent agents is risky unless you’re willing to compromise on performance and incorporate extensive validation mechanisms at every step to ensure reliability.
A more effective strategy is to model your agent's behaviour as a series of smaller, modular tasks that are chained together to achieve the desired outcome. This approach helps create a predictable workflow for your agent, making it easier to test and debug (a minimal chain is sketched after the list below). Each task should be:
Well-defined in terms of functionality.
Modular enough to be independently tested and iterated upon.
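Here’s the minimal chain referenced above; the step names and prompts are illustrative:

```python
# A minimal task chain: each step is a small, independently testable unit
# with one well-defined job.
def classify_intent(email: str, llm_call) -> str:
    return llm_call(
        "Classify this email's intent as one of "
        f"[question, objection, meeting_request]:\n{email}"
    )

def draft_reply(email: str, intent: str, llm_call) -> str:
    return llm_call(f"Write a short reply to this {intent} email:\n{email}")

def run_pipeline(email: str, llm_call) -> str:
    intent = classify_intent(email, llm_call)    # step 1: one decision
    return draft_reply(email, intent, llm_call)  # step 2: one generation
```

Because each step is an ordinary function, it can be unit-tested with canned LLM outputs and swapped or re-prompted without touching the rest of the chain.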
Another critical metric we can evaluate in LLM-specific tasks is cognitive load—the number of independently different instructions the model needs to process in a single call (prompt). Higher cognitive load often leads to poorer performance and increased hallucination. To mitigate this, break down complex tasks into smaller, manageable components and design prompts with reduced cognitive demands.
As a rule of thumb, aim to make LLMs handle only dependent outcomes in a single prompt. When you notice high hallucination rates in cognitive-intensive tasks, it’s a signal to simplify or split them further.
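To illustrate, compare an overloaded prompt with a split version (the prompts are illustrative):

```python
# Overloaded: one call asked to extract, classify, score, AND draft.
# Independent instructions compete, raising hallucination risk.
overloaded = (
    "Extract the sender's company, classify sentiment, score the lead "
    "1-10, and draft a reply to this email: {email}"
)

# Split: each call carries one instruction; only genuinely dependent
# outcomes (the reply depends on the sentiment) share a prompt.
extract = "Extract the sender's company from this email: {email}"
score = "Score this lead 1-10 based on the email: {email}"
classify_and_reply = (
    "Classify the sentiment of this email, then draft a reply "
    "matching that sentiment: {email}"
)
```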
It is fair to assume larger and more capable models will improve on these fronts. This could allow us to combine orthogonally different tasks within a single call, leading to better performance and cost efficiency. Keeping future scalability in mind while designing workflows today can ensure your agents are well-positioned to leverage advancements in LLM capabilities effectively. (I’ll discuss these trade-offs in more detail in Part 2.)
Note: this is what makes the Breakout agent stand apart; we have minimal tolerance for hallucination, and our agent is designed to perform consistently across a variety of situations.
Function Calling rules.
I love function calling—it’s arguably one of the most powerful tools in our arsenal for building agentic systems. Function calling enables agents to interact with a broad spectrum of tools, significantly expanding their ability to achieve desired outcomes. For instance, integrating web search gives real-time knowledge to LLMs, which are otherwise limited by their training-data cutoff.
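A minimal function-calling sketch using the OpenAI-style tools schema; the `web_search` tool and its parameters are my own example, and the actual search execution is elided:

```python
from openai import OpenAI

client = OpenAI()

# Declare a web-search tool. The model decides whether to call it and
# with what arguments; running the search is up to our code.
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    tools=tools,
    messages=[{"role": "user", "content": "Who won the latest F1 race?"}],
)

for call in resp.choices[0].message.tool_calls or []:
    if call.function.name == "web_search":
        ...  # run the search, then feed results back as a "tool" message
```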
I’m also very optimistic about the future of function calling. With the potential for generating synthetic data tailored for this use case, we can use RLHF to shape foundational models into domain-specific instruction models. This would empower LLMs to execute more complex functions and become increasingly autonomous and capable.
In my implementations, I often incorporate function calling through a dynamic planner, which acts as the initial step in my agent’s workflow. The planner determines the callable functions and their parameters based on the current stage of the agent’s execution. These stages are managed through a dynamic memory, which updates asynchronously as function results are processed. However, managing race conditions in this setup requires careful attention to ensure consistency and reliability. (I’ll go into more details on this in Part 2 of this blog)
The resulting state changes are important—they control the agent’s behaviour, allowing for greater predictability and flexibility. This architecture also enables me to design the agent as a dynamic workflow graph. Each stage of the graph independently executes specific functions based on state variables, progressing toward the desired outcome. It also lets us bring a more deterministic Multi-Agentic behaviour into the system as we now have more control over the “Actions” of the Agent(s).
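A toy sketch of this planner-plus-state pattern; the stages and functions are illustrative, and a production version would handle asynchronous updates and guard against the race conditions mentioned above:

```python
# Shared state drives the workflow: the planner inspects it and decides
# which function runs next; each function mutates state, then we re-plan.
state = {"stage": "research", "lead": None, "draft": None}

def planner(state: dict) -> str:
    if state["stage"] == "research" and state["lead"] is None:
        return "fetch_lead"
    if state["stage"] == "research" and state["lead"] is not None:
        return "write_draft"
    return "done"

actions = {
    "fetch_lead": lambda s: s.update(lead={"name": "Acme"}, stage="research"),
    "write_draft": lambda s: s.update(draft="Hi Acme...", stage="done"),
}

while (step := planner(state)) != "done":
    actions[step](state)   # each stage executes one function, then re-plans
```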
In this first post, I wanted to kick things off with some fundamental tenets of designing an AI agent. In the next part, we’ll talk more in detail about design and execution, and some interesting mental models for trade-offs. I’ll also share tips on testing AI agents to ensure they stick to their task and expected behaviours and don’t go rogue and take over the world and enslave humanity.
If you’re finding this insightful, subscribe for more updates, or drop a message or comment for a deeper chat. I’m always up for geeking out over AI.
At Breakout, we are putting these ideas into practice and building the smartest AI SDR in the world. If this sounds like a challenging task to you, do reach out to me at ashfakh@getbreakout.ai