Building Production-Grade AI Agents - Part 2
Insights for building high-performing, smart, production-grade AI agents and automation.
Read Part 1 here
I know it’s been a while since I promised this second part—but time really does fly when you’re busy building cool things! I hope the wait is worth it, as this post could be your first step towards building reliable, production-grade agents that deliver real value to society—without accidentally sparking an AI takeover.
**Let it be noted in Internet Archives that I positively contributed to building AI agents. 🔍**
In the last post, we briefly discussed LLMs and their capabilities, along with some tenets to keep in mind while building an AI agent. In this post, we’ll take a deeper dive into the technical implementation of workflow-based AI agents—designed with a clear focus on accomplishing specific tasks within an acceptable operating policy. We’ll begin by exploring some useful mental models for development, followed by a detailed breakdown of their architecture.
A key thing to understand while building agents is the trade-offs between different design choices to achieve various outcomes. To illustrate this, I like to draw a parallel with the CAP theorem in distributed systems. The CAP theorem states that a distributed system can guarantee only two out of three properties: consistency, availability, and partition tolerance.
This concept can be loosely adapted to AI applications (as a thought exercise to appreciate the inherent trade-offs in building AI agents). In my AI-focused version of the CAP theorem, the "C-A-P" represents Consistency, Agentic Capability, and Performance.
Consistency - How consistent the agent is in its responses and outcomes.
Agentic capabilities - The range of capabilities the agent has.
Performance - How fast the agent executes its duties, how resource-intensive its tasks are, and how costly it is to run.
Any Agentic system we create can be evaluated along these three axes, but achieving all three simultaneously is constrained by inherent trade-offs.
Consistency + Agentic Outcomes = Low Performance: A highly consistent agent with robust, targeted outcomes will typically sacrifice performance. This is because achieving such outcomes often requires multiple processing steps, including rankers, graders, and evaluators to prevent hallucinations or deviation from the desired path (read about using LLMs as evaluators here). These additional steps slow down the system.
Agentic Outcomes + High Performance = Low Consistency: An agent designed for high performance and broad capabilities will often compromise consistency. To achieve speed and handle diverse tasks, it becomes necessary to skip some grader or evaluator steps and rely on fused (combined) LLM calls, which can increase the likelihood of errors or inconsistencies. Fused LLM calls, whether run in parallel or in series, have their outputs combined to produce the final result, and in the absence of proper evaluators, hallucinations can creep in very easily.
Consistency + High Performance = Limited Outcomes: If the goal is to create an agent that balances strong consistency with high performance, its capabilities must be narrowed. The agent will need to focus on a smaller set of outcomes to minimize complexity, enabling faster and more accurate responses.
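To make the fused-call pattern from the second trade-off concrete, here is a minimal sketch: two LLM calls run concurrently and their outputs are merged directly, with no grader in between. The `summarize` and `extract_entities` functions are hypothetical stand-ins for real model calls.

```python
import asyncio

# Hypothetical LLM call stubs; in practice these would hit a model API.
async def summarize(text: str) -> str:
    return f"summary({text})"

async def extract_entities(text: str) -> str:
    return f"entities({text})"

async def fused_call(text: str) -> str:
    # Run both calls in parallel and fuse the outputs directly,
    # skipping grader/evaluator steps: faster, but less consistent.
    summary, entities = await asyncio.gather(summarize(text), extract_entities(text))
    return f"{summary} | {entities}"

result = asyncio.run(fused_call("Q3 report"))
print(result)  # summary(Q3 report) | entities(Q3 report)
```

The speed comes from `asyncio.gather` overlapping the calls; the consistency risk comes from merging raw outputs without an evaluator pass.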
Here is example code for an agentic workflow that supports multiple outcomes and has to stay consistent.
# Building a consistent agent capable of doing many things
class ConsistentMultiCapabilityAgent:
    async def execute_task(self, input_data):
        # Performance impact: multiple validation steps
        validated_input = await self.validate_input(input_data)
        # Run the tasks to achieve multiple outcomes
        results = await self.llm_process(validated_input)
        # Grader to evaluate the task outcome
        validation_score = await self.grade_output(results)
        if validation_score < self.threshold:
            return await self.fallback_handler(input_data)
        # Ranker to order the candidate results
        ranked_results = await self.rank_output(results)
        # TODO: introduce a loop with observed grades to improve the outcome
        return ranked_results
Although these trade-offs exist right now, it’s reasonable to assume that advancements in model development—such as larger models and faster inference techniques—will independently improve both consistency and performance over time. Given this trajectory, betting on broad capability not only challenges us to develop more versatile, horizontally integrated agents but also makes the agent significantly more valuable to users. From a go-to-market (GTM) perspective, it deeply embeds the agent into users’ workflows, increasing its utility and fostering greater dependency. The more things your agent can do, the more powerful it gets, as it has larger control over the entire workflow. For example, if you’re building an AI accounting agent, don’t just build one that manages transactions—integrate bookkeeping, statements, budgeting, and so on. The dependencies between these workflows can then be easily managed via a single system.
This focus on complex, high-value outcomes is what will truly set your agent apart, making it a distinctive and impactful solution in a competitive landscape. The more horizontal capabilities we build into our agents, the more likely they are to stand apart and create value for users, and for us.
Now, how does one go about building these outcomes in an effective manner?
How do you build effective Agentic outcomes?
Break Down Outcomes into Tasks/Functions
Start by deconstructing your agentic outcomes into granular, well-defined, simple tasks or functions. A task or function represents an atomic operation necessary for your agent to achieve its intended outcome. These tasks may or may not rely on an LLM for execution. Think of them as modular, composable building blocks that can be reused across various parts of your system, making the design more flexible and scalable. A composable pattern in task design will help your agentic system scale well, bringing more reliability and robustness to the system. Another thing to keep in mind while breaking down tasks is to decrease cognitive load (as discussed in Part 1) to avoid hallucinations and undesirable outcomes.
Don’t make one task do everything. Let it just pass the butter.
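As a sketch of this composable pattern, each task below is a small async function with a single responsibility that reads and writes a shared state and can be chained or reused. The task names and the `SharedState` type are hypothetical examples, not part of any framework.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class SharedState:
    """Hypothetical shared context passed between tasks."""
    data: dict = field(default_factory=dict)

# Each task is an atomic, reusable unit with one responsibility.
async def fetch_invoice(state: SharedState) -> SharedState:
    state.data["invoice"] = {"id": "INV-1", "amount": 120.0}
    return state

async def validate_invoice(state: SharedState) -> SharedState:
    state.data["valid"] = state.data["invoice"]["amount"] > 0
    return state

async def run_pipeline(state: SharedState, tasks) -> SharedState:
    # Compose tasks by chaining them over the shared state.
    for task in tasks:
        state = await task(state)
    return state

state = asyncio.run(run_pipeline(SharedState(), [fetch_invoice, validate_invoice]))
print(state.data["valid"])  # True
```

Because each task has the same `SharedState -> SharedState` shape, new outcomes are just new orderings of existing blocks.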
Implement Thread-Safe Shared Memory
A thread-safe shared memory is essential for maintaining the state of the system across tasks. This memory allows tasks to access and update the shared context independently, enabling them to be context-aware without creating conflicts. This shared state is the glue that holds your workflow together, ensuring seamless transitions and information flow between tasks. In a fully distributed system, you can use something like Redis to hold this shared memory; you can also keep it in application memory in a thread-safe manner. It is very important to look out for race conditions while accessing or updating the memory. In a fully asynchronous system, race conditions can cause unpredictable behaviour.
If you think about it, all you need to build state machines are condition variables and a shared memory space. This is CS101, but it’s still an important concept when creating synchronous systems (read more here). Just don’t get run over by the race conditions.
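Here is a minimal in-process sketch of such a shared memory, guarded by an `asyncio.Lock`. In a distributed setup you would swap the dict for Redis; the class and method names here are illustrative, not from any library.

```python
import asyncio

class SharedMemory:
    """In-process shared state guarded by a lock to avoid race conditions."""
    def __init__(self):
        self._state: dict = {}
        self._lock = asyncio.Lock()

    async def get(self, key, default=None):
        async with self._lock:
            return self._state.get(key, default)

    async def update(self, key, fn, default=0):
        # Read-modify-write under the lock, so the update stays atomic
        # even when tasks interleave (e.g. around awaited network calls).
        async with self._lock:
            self._state[key] = fn(self._state.get(key, default))
            return self._state[key]

async def main():
    mem = SharedMemory()
    # 100 concurrent increments; an unguarded read-modify-write could lose updates.
    await asyncio.gather(*(mem.update("count", lambda v: v + 1) for _ in range(100)))
    return await mem.get("count")

count = asyncio.run(main())
print(count)  # 100
```

The same interface works whether the backing store is a dict or Redis, which keeps tasks decoupled from the storage choice.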
Chain Tasks into a Workflow
Once you have defined your tasks, the next step is to organise them into a workflow. You don’t need any specialised agentic framework for this, as most don’t do anything more than chain these tasks together. This can be done easily by defining functions for your tasks and declaring the dependencies between them directly in your code, or using a graph modelling language. You can even define these dependencies in configuration files using YAML, XML, or JSON.
For example, see the following pseudo-code showing how three tasks with dependencies between them can be executed serially:

from typing import Callable, List
import asyncio

class TaskNode:
    def __init__(self, task_func: Callable, task_id: str):
        self.task_func = task_func
        self.task_id = task_id
        self.dependencies: List['TaskNode'] = []

    def add_dependency(self, task_node: 'TaskNode'):
        self.dependencies.append(task_node)

    async def execute(self):
        # Execute all dependencies first
        for dependency in self.dependencies:
            await dependency.execute()
        # Execute this task
        print(f"Executing task: {self.task_id}")
        await self.task_func(self.task_id)

# Define tasks
async def task_1(task_id: str):
    print(f"Task 1 logic for {task_id}")

async def task_2(task_id: str):
    print(f"Task 2 logic for {task_id}")

async def task_3(task_id: str):
    print(f"Task 3 logic for {task_id}")

# Create task nodes
node_1 = TaskNode(task_1, "Task1")
node_2 = TaskNode(task_2, "Task2")
node_3 = TaskNode(task_3, "Task3")

# Chain tasks
node_2.add_dependency(node_1)  # Task2 depends on Task1
node_3.add_dependency(node_2)  # Task3 depends on Task2

# Execute the graph
asyncio.run(node_3.execute())  # This will execute Task1 -> Task2 -> Task3
Representing this workflow as a graph provides significant advantages. A graph-based model enhances scalability, simplifies debugging, and ensures predictable outcomes. Each execution of the agent corresponds to a specific path traced through the graph, giving you better control and visibility over how outcomes are achieved. You always know what to expect.
You can further design your monitoring and evals on top of these graphs. Telemetry can collect metrics on each node in the graph, and you can independently monitor and evaluate tasks based on these nodal values.
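One lightweight way to collect such nodal telemetry is to wrap each task with a decorator that records call counts and latency. This is a sketch; the `metrics` dict stands in for whatever telemetry backend you actually use.

```python
import asyncio
import time
from functools import wraps

metrics: dict = {}  # stand-in metric sink; in production this would be your telemetry backend

def instrument(task_id: str):
    """Record call count and latency for each graph node."""
    def decorator(fn):
        @wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await fn(*args, **kwargs)
            finally:
                node = metrics.setdefault(task_id, {"calls": 0, "total_s": 0.0})
                node["calls"] += 1
                node["total_s"] += time.perf_counter() - start
        return wrapper
    return decorator

@instrument("retrieve")
async def retrieve(query: str) -> list:
    await asyncio.sleep(0.01)  # simulated work
    return [query]

docs = asyncio.run(retrieve("refund policy"))
print(metrics["retrieve"]["calls"])  # 1
```

Because the decorator keys metrics by task ID, each node in the graph can be monitored and evaluated independently.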
This graph-based approach is the foundation of many advanced agent frameworks. It’s also why tools like LangChain have evolved into LangGraph, recognising the power of graph-based modelling for building robust and efficient agents.
At Breakout, we have designed our own orchestration framework for agentic workflows, and an actor-based system for multi-agent workflows (more on that later). The reason we didn’t use any existing agentic framework is that we never felt it gave any significant advantage over defining the workflows ourselves. Some of them were also bloated software with unnecessary abstractions and no significant advantages.
While building robust scalable systems, simplicity is key.
To elaborate further, tasks within an agentic system can be broadly categorised into three types:
Deterministic Tasks
Semi-Deterministic Tasks
Non-Deterministic Tasks
These categories reflect varying levels of predictability in task outputs for a given input.
1. Deterministic Tasks
These tasks produce consistent and predictable outputs for specific inputs. Traditional software development primarily deals with deterministic tasks, making it relatively straightforward to build reliable systems.
def retrieve_data(self, text: str) -> List[Document]:
    # Deterministic: the same query always returns the same documents
    return self.retrieve(text)
2. Semi-Deterministic Tasks
These tasks involve some variability but operate within a defined structure, often incorporating LLMs. For instance, an LLM call that generates structured outputs (e.g., filling predefined fields or responding within a strict schema) falls into this category. Outputs are reasonably predictable but still influenced by the model's inherent probabilistic nature.
async def planner(text: str, state_variables: dict) -> Dict[str, str]:
    response = await llm.plan(
        prompt=f"Plan Function: {text}",
        prompt_variables=state_variables,
        output_schema={
            "functions": List,
            "confidence": float,
            "lookup": List,
        },
    )
    return response
3. Non-Deterministic Tasks
These tasks have high variability and unpredictable outputs, often arising from less structured interactions with LLMs. For example, content generation without strict guidelines or parameters typically belongs in this category.
async def generate_response(context: str) -> str:
return await llm.predict(f"Generate helpful response: {context}")
Design Considerations
When working with Agentic systems, it’s critical to plan, develop, and test each type of task according to its nature. Dependencies between tasks must also be carefully managed. For instance, chaining multiple semi- or non-deterministic tasks can lead to unpredictable or inconsistent workflows. This unpredictability may undermine the reliability of the overall system. Remember the game Chinese whispers? Yeah, you don’t want your agent to do that, ever.
Striking a Balance in Production-Grade Agents
In production-grade agents, achieving a balance between flexibility and predictability is key. Try to design workflows that prioritize deterministic and semi-deterministic tasks while minimizing reliance on non-deterministic ones. This approach allows for a more predictable and coherent system that still appears intelligent and flexible enough to achieve complex outcomes.
By carefully selecting and structuring tasks, you can create agents that strike the right balance between adaptability and reliability, ensuring that they perform effectively in real-world scenarios.
Breaking down your workflow into smaller, manageable tasks isn’t just about modularity—it’s also essential for improving the testability of your agentic system. Adopting a test-driven development approach ensures each task operates within well-defined boundaries, governed by a clear operating policy.
By testing each task against these policies, you can guarantee that as long as tasks remain within their specified limits, the overall workflow will behave predictably. This reduces the risk of hallucinations or unpredictable outputs that could lead to unfavorable outcomes.
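As an illustration of testing a task against its operating policy, the sketch below checks that a (hypothetical) planner output only emits allowed functions and a confidence within range. The policy set and plan shapes are invented for the example; in a real suite these would be pytest cases run against recorded LLM outputs.

```python
ALLOWED_FUNCTIONS = {"lookup_account", "create_invoice", "send_summary"}  # hypothetical operating policy

def check_plan_policy(plan: dict) -> bool:
    """Return True iff the plan stays within the operating policy."""
    functions_ok = set(plan.get("functions", [])) <= ALLOWED_FUNCTIONS
    confidence_ok = 0.0 <= plan.get("confidence", -1.0) <= 1.0
    return functions_ok and confidence_ok

# A plan inside the policy bounds, and one that steps outside them.
good_plan = {"functions": ["lookup_account"], "confidence": 0.92}
bad_plan = {"functions": ["delete_database"], "confidence": 0.99}

assert check_plan_policy(good_plan)
assert not check_plan_policy(bad_plan)
print("policy tests passed")
```

If every semi- and non-deterministic task has a policy check like this, the workflow as a whole stays within known bounds even when individual model outputs vary.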
Moreover, this modular approach allows each task to be independently managed, developed, and scaled. It ensures that improvements or updates to one part of the system don’t unintentionally disrupt the entire workflow. In essence, a well-structured, testable system is key to building robust, reliable, and scalable Agentic systems. I can’t stress enough how important these tests are for building reliable Agentic systems. In fact, I’d argue that, in truly test-driven development fashion, you should first come up with the test cases while breaking down your tasks. This methodical approach significantly reduces the risk of failure, enabling the agent to deliver consistent, high-quality outcomes that stand out in a competitive landscape.
As we push the boundaries of what AI agents can achieve, the guiding principles discussed here—balancing the C-A-P trade-offs, breaking down workflows into testable tasks, and adopting modular, graph-based designs—become more and more important in building dependable, production-grade systems.
The shift from deterministic to non-deterministic tasks is not just a technical challenge in development; it’s an opportunity to redefine what "intelligence" means in the context of software. By anchoring our workflows in solid engineering practices and focusing on composability, we ensure that our agents not only perform well but do so predictably and at scale.
The agents that will truly stand out are the ones that can seamlessly orchestrate complex, multi-step processes while maintaining reliability and trustworthiness.
By embracing modularity, prioritising testability, and leaning into the evolving capabilities of LLMs, we can create solutions that do more than automate tasks—they amplify human potential. So, as you embark on building your own agents, remember: the key to standing out isn’t just innovation; it’s thoughtful design, relentless iteration, and a commitment to solving meaningful problems.
At Breakout, we are putting these ideas into practice and building the smartest AI Sales Rep in the world. If this sounds like a challenging task to you, do reach out to me at ashfakh@getbreakout.ai