Deploying AI Agents in Production Using Open Source Architecture

AI agents are becoming core parts of modern applications, but deploying them in production at scale is challenging. The diversity of agent frameworks, unpredictable latency, and the need for streaming responses make it hard to use simple REST-only patterns. Most proof-of-concept deployments break down when real workloads, traffic, and observability requirements hit.

This blog post explains how to design a production-ready, open-source architecture for AI agents using FastAPI, Celery, Redis, Kubernetes, KEDA, Prometheus, Grafana, LangFuse, and LangGraph.

Why Deploying AI Agents Is Hard

  • Agent frameworks differ widely in required infrastructure.

  • Latency varies from milliseconds to minutes depending on workflow complexity.

  • Real-time streaming is needed for modern AI UX.

  • REST-only patterns can’t handle long execution, retries, or async scheduling.

  • Scaling compute-heavy agents is fundamentally different from scaling API servers.

Why the Traditional REST Architecture Fails

While Flask or FastAPI can handle simple POCs, they break under production requirements:

  • API workers block on long tasks.

  • No job scheduling, retries, or queueing.

  • No ability to scale compute separately from the API layer.

  • Limited visibility into agent internals.

  • No support for streaming partial output.

To scale AI agents to thousands or millions of users, you need a dual-layer architecture: an API layer for coordination and a worker layer for execution.

An Enterprise-Ready Architecture for AI Agents



Below is a reliable, cloud-native pattern observed in real production deployments.

1. Remote Execution with FastAPI, Celery, and Redis

Pattern: FastAPI receives a request, creates a job, and immediately returns a task_id. Celery workers handle execution asynchronously.

Key points:

  • FastAPI is the control plane.

  • Celery handles asynchronous execution.

  • Redis acts as both message broker and result store.

  • Clients retrieve progress via API or WebSocket.

This keeps the API responsive while the heavy work happens elsewhere.


2. Handling Long-Running Agents with Celery

Enterprise agents often:

  • run multi-step reasoning

  • call multiple LLMs

  • access several external systems

Celery can handle long and complex workflows with features like:

  • delayed acknowledgments for safe processing

  • automatic retries

  • workflow patterns (chains, groups, chords)

Infrastructure guidelines:

  • Deploy Celery workers as a dedicated Kubernetes Deployment.

  • Use a Redis StatefulSet for persistence.

  • Configure PodDisruptionBudgets for uptime.

  • Spread Redis replicas across nodes using anti-affinity rules.

3. Framework-Agnostic Agent Runtime

The architecture should run any agent framework without modification.

Supported frameworks include:

  • LangGraph

  • CrewAI

  • Agno

  • ADK

  • Custom agent runtimes

Each agent is simply wrapped in an asynchronous task executed by Celery.

4. Traceability and Observability

Production systems require deep visibility into agent execution.

Recommended tools:

  • LangFuse: trace events, token usage, cost, and errors.

  • Prometheus: track queue depth, worker load, API latency.

  • Grafana: dashboards and alerting.

This allows you to detect bottlenecks, optimize costs, and debug agent behavior.
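As an illustration of the Prometheus side, a worker can expose custom metrics with the official `prometheus_client` library (metric names and the wrapper function here are assumptions, not a fixed convention):

```python
# Worker-side metrics with prometheus_client (a sketch).
import time
from prometheus_client import Counter, Histogram, start_http_server

AGENT_RUNS = Counter("agent_runs_total", "Agent executions", ["status"])
AGENT_LATENCY = Histogram("agent_latency_seconds", "Agent execution time")

def run_with_metrics(fn, *args):
    start = time.perf_counter()
    try:
        result = fn(*args)
        AGENT_RUNS.labels(status="success").inc()
        return result
    except Exception:
        AGENT_RUNS.labels(status="error").inc()
        raise
    finally:
        AGENT_LATENCY.observe(time.perf_counter() - start)

# start_http_server(9100)  # expose /metrics for Prometheus to scrape
```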

5. Real-Time Streaming Using Redis Pub/Sub and WebSockets

AI agents often generate output incrementally. To support this:

  • Celery workers publish intermediate updates to Redis Pub/Sub.

  • A WebSocket endpoint subscribes to these channels.

  • The UI receives partial updates in real time.

This pattern supports:

  • token streaming

  • intermediate reasoning results

  • state and progress updates

It also avoids blocking API threads during long-running tasks.

Celery worker publishing events:
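A minimal sketch of the publishing side (the `agent:<task_id>` channel convention and payload shape are assumptions; `client` stands in for a redis-py connection):

```python
# Worker side: publish incremental agent output to a per-task Pub/Sub channel.
import json

def make_message(event: str, data: str) -> str:
    return json.dumps({"event": event, "data": data})

def publish_update(client, task_id: str, event: str, data: str) -> None:
    # `client` is expected to be a redis.Redis connection (redis-py).
    client.publish(f"agent:{task_id}", make_message(event, data))

# Inside a Celery task, as tokens arrive:
#   publish_update(redis_client, task_id, "token", chunk)
```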

FastAPI WebSocket subscriber:

6. Horizontal Scalability

AI agent workloads scale in two separate dimensions.

API Scaling

  • Scale FastAPI using Kubernetes HPA.

  • Metrics: CPU, memory, or requests-per-second.

  • APIs remain stateless and lightweight.

Worker Scaling

  • Scale Celery workers with KEDA using Redis queue depth.

  • Automatically increase workers during high load.

  • Isolate agent execution from user-facing traffic.

This dual-scaling model is essential for enterprise workloads.

KEDA Snippet
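An illustrative `ScaledObject` that scales the worker Deployment on Redis queue depth (the Deployment name, replica bounds, and thresholds are assumptions; `celery` is Celery's default queue key in Redis):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: celery-worker-scaler
spec:
  scaleTargetRef:
    name: celery-worker        # the Celery worker Deployment (assumed name)
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: redis:6379
        listName: celery       # Redis list backing the Celery queue
        listLength: "10"       # target pending tasks per worker replica
```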

7. Monitoring, Metrics, and Alerting

A production deployment should expose:

  • agent execution time

  • failure rates

  • queue depth over time

  • worker concurrency

  • LLM cost per request

Prometheus and Grafana provide a reliable end-to-end observability stack.

Putting Everything Together

End-to-end architecture:

  1. Client sends request → FastAPI

  2. FastAPI → Redis queue (Celery)

  3. Celery worker runs agent workflow

  4. Worker pushes intermediate output to Redis Pub/Sub

  5. WebSocket streams updates to UI

  6. Traces → LangFuse / LangGraph

  7. Metrics → Prometheus → Grafana

  8. Scaling: HPA for API, KEDA for workers

This architecture supports:

  • long-running tasks

  • real-time streaming

  • millions of users

  • full observability

  • agent framework independence

  • cloud-native scalability

Conclusion

AI agents are powerful, but deploying them at scale requires more than a simple REST endpoint. With the right open-source components (FastAPI, Celery, Redis, Kubernetes, KEDA, Prometheus, Grafana, and LangFuse), you can build a scalable, reliable, and observable production system without relying on proprietary platforms.

If you have questions or want help applying this architecture to your use case, feel free to comment!
