Deploying AI Agents in Production Using Open Source Architecture

AI agents are becoming core parts of modern applications, but deploying them in production at scale is challenging. The diversity of agent frameworks, unpredictable latency, and the need for streaming responses make it hard to use simple REST-only patterns. Most proof-of-concept deployments break down when real workloads, traffic, and observability requirements hit.

This blog post explains how to design a production-ready, open-source architecture for AI agents using FastAPI, Celery, Redis, Kubernetes, KEDA, Prometheus, Grafana, LangFuse, and LangGraph.

Why Deploying AI Agents Is Hard

  • Agent frameworks differ widely in required infrastructure.

  • Latency varies from milliseconds to minutes depending on workflow complexity.

  • Real-time streaming is needed for modern AI UX.

  • REST-only patterns can’t handle long execution, retries, or async scheduling.

  • Scaling compute-heavy agents is fundamentally different from scaling API servers.

Why the Traditional REST Architecture Fails

While Flask or FastAPI can handle simple POCs, they break under production requirements:

  • API workers block on long tasks.

  • No job scheduling, retries, or queueing.

  • No ability to scale compute separately from the API layer.

  • Limited visibility into agent internals.

  • No support for streaming partial output.

To scale AI agents to thousands or millions of users, you need a dual-layer architecture: an API layer for coordination and a worker layer for execution.

An Enterprise-Ready Architecture for AI Agents



Below is a reliable, cloud-native pattern observed in real production deployments.

1. Remote Execution with FastAPI, Celery, and Redis

Pattern: FastAPI receives a request, creates a job, and immediately returns a task_id. Celery workers handle execution asynchronously.

Key points:

  • FastAPI is the control plane.

  • Celery handles asynchronous execution.

  • Redis acts as both message broker and result store.

  • Clients retrieve progress via API or WebSocket.

This keeps the API responsive while the heavy work happens elsewhere.


2. Handling Long-Running Agents with Celery

Enterprise agents often:

  • run multi-step reasoning

  • call multiple LLMs

  • access several external systems

Celery can handle long and complex workflows with features like:

  • delayed acknowledgments for safe processing

  • automatic retries

  • workflow patterns (chains, groups, chords)

Infrastructure guidelines:

  • Deploy Celery workers as a dedicated Kubernetes Deployment.

  • Use a Redis StatefulSet for persistence.

  • Configure PodDisruptionBudgets for uptime.

  • Spread Redis replicas across nodes using anti-affinity rules.

3. Framework-Agnostic Agent Runtime

The architecture should run any agent framework without modification.

Supported frameworks include:

  • LangGraph

  • CrewAI

  • Agno

  • ADK

  • Custom agent runtimes

Each agent is simply wrapped in an asynchronous task executed by Celery.

4. Traceability and Observability

Production systems require deep visibility into agent execution.

Recommended tools:

  • LangFuse: trace events, token usage, cost, and errors.

  • Prometheus: track queue depth, worker load, API latency.

  • Grafana: dashboards and alerting.

This allows you to detect bottlenecks, optimize costs, and debug agent behavior.
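As an illustration of the Prometheus side, a worker can expose custom metrics with the official `prometheus_client` library (metric names and the wrapper function here are assumptions, not a fixed convention):

```python
# Worker-side metrics with prometheus_client (a sketch).
import time
from prometheus_client import Counter, Histogram, start_http_server

AGENT_RUNS = Counter("agent_runs_total", "Agent executions", ["status"])
AGENT_LATENCY = Histogram("agent_latency_seconds", "Agent execution time")

def run_with_metrics(fn, *args):
    start = time.perf_counter()
    try:
        result = fn(*args)
        AGENT_RUNS.labels(status="success").inc()
        return result
    except Exception:
        AGENT_RUNS.labels(status="error").inc()
        raise
    finally:
        AGENT_LATENCY.observe(time.perf_counter() - start)

# start_http_server(9100)  # expose /metrics for Prometheus to scrape
```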

5. Real-Time Streaming Using Redis Pub/Sub and WebSockets

AI agents often generate output incrementally. To support this:

  • Celery workers publish intermediate updates to Redis Pub/Sub.

  • A WebSocket endpoint subscribes to these channels.

  • The UI receives partial updates in real time.

This pattern supports:

  • token streaming

  • intermediate reasoning results

  • state and progress updates

It also avoids blocking API threads during long-running tasks.

Celery worker publishing events:
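A minimal sketch of the publishing side (the `agent:<task_id>` channel convention and payload shape are assumptions; `client` stands in for a redis-py connection):

```python
# Worker side: publish incremental agent output to a per-task Pub/Sub channel.
import json

def make_message(event: str, data: str) -> str:
    return json.dumps({"event": event, "data": data})

def publish_update(client, task_id: str, event: str, data: str) -> None:
    # `client` is expected to be a redis.Redis connection (redis-py).
    client.publish(f"agent:{task_id}", make_message(event, data))

# Inside a Celery task, as tokens arrive:
#   publish_update(redis_client, task_id, "token", chunk)
```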

FastAPI WebSocket subscriber:

6. Horizontal Scalability

AI agent workloads scale in two separate dimensions.

API Scaling

  • Scale FastAPI using Kubernetes HPA.

  • Metrics: CPU, memory, or requests-per-second.

  • APIs remain stateless and lightweight.

Worker Scaling

  • Scale Celery workers with KEDA using Redis queue depth.

  • Automatically increase workers during high load.

  • Isolate agent execution from user-facing traffic.

This dual-scaling model is essential for enterprise workloads.

KEDA Snippet
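An illustrative `ScaledObject` that scales the worker Deployment on Redis queue depth (the Deployment name, replica bounds, and thresholds are assumptions; `celery` is Celery's default queue key in Redis):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: celery-worker-scaler
spec:
  scaleTargetRef:
    name: celery-worker        # the Celery worker Deployment (assumed name)
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: celery       # Redis list backing the Celery queue
        listLength: "10"       # target pending tasks per worker replica
```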

7. Monitoring, Metrics, and Alerting

A production deployment should expose:

  • agent execution time

  • failure rates

  • queue depth over time

  • worker concurrency

  • LLM cost per request

Prometheus and Grafana provide a reliable end-to-end observability stack.

Putting Everything Together

End-to-end architecture:

  1. Client sends request → FastAPI

  2. FastAPI → Redis queue (Celery)

  3. Celery worker runs agent workflow

  4. Worker pushes intermediate output to Redis Pub/Sub

  5. WebSocket streams updates to UI

  6. Traces → LangFuse / LangGraph

  7. Metrics → Prometheus → Grafana

  8. Scaling: HPA for API, KEDA for workers

This architecture supports:

  • long-running tasks

  • real-time streaming

  • millions of users

  • full observability

  • agent framework independence

  • cloud-native scalability

Conclusion

AI agents are powerful, but deploying them at scale requires more than a simple REST endpoint. With the right open-source components (FastAPI, Celery, Redis, Kubernetes, KEDA, Prometheus, Grafana, and LangFuse), you can build a scalable, reliable, and observable production system without relying on proprietary platforms.

If you have questions or want help applying this architecture to your use case, feel free to comment!
