Operations

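This page covers managing Reflex in production: health checks, the dead-letter queue, event replay, observability, and a runbook for common incidents.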

💓 Health Monitoring

Basic Health Check

curl http://localhost:8000/health

{"status": "healthy"}

Detailed Health Check

curl http://localhost:8000/health/detailed

{
  "status": "healthy",
  "indicators": [
    {"name": "database", "status": "healthy", "latency_ms": 1.5},
    {"name": "event_queue", "status": "healthy", "message": "42 pending"},
    {"name": "dlq", "status": "healthy", "message": "0 in DLQ"}
  ]
}

Load Balancer Health Checks

Use /health for load balancer probes (fast, simple response). Use /health/detailed for monitoring dashboards.
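For a dashboard or scheduled check, the detailed payload can be reduced to the indicators that are not healthy. The following is a minimal sketch that assumes the response shape shown above. The helper names `unhealthy_indicators` and `fetch_detailed_health` are illustrative and not part of Reflex, and the base URL is the local default used in the examples.

```python
import json
from urllib.request import urlopen


def unhealthy_indicators(payload: dict) -> list:
    """Return the names of indicators whose status is not "healthy"."""
    return [
        indicator.get("name", "<unnamed>")
        for indicator in payload.get("indicators", [])
        if indicator.get("status") != "healthy"
    ]


def fetch_detailed_health(base_url: str = "http://localhost:8000") -> dict:
    """Fetch and decode /health/detailed (assumes the endpoint returns JSON)."""
    with urlopen(f"{base_url}/health/detailed", timeout=5) as response:
        return json.load(response)
```

Calling `unhealthy_indicators(fetch_detailed_health())` against a running instance returns an empty list when every indicator reports healthy.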

📬 Dead-Letter Queue (DLQ)

Events that still fail after the maximum number of retries are moved to the DLQ, where they wait for manual intervention.

List DLQ Events

python scripts/dlq.py list

Retry Events

Single event:

python scripts/dlq.py retry <event-id>

All events:

python scripts/dlq.py retry-all

Before Retrying

Investigate why events failed before retrying. Check:

Application logs for error details
Event payload for malformed data
External service availability

🔄 Event Replay

Replay historical events for debugging or reprocessing:

python scripts/replay.py

Use cases:

Debugging agent behavior with specific events
Reprocessing events after bug fixes
Testing new trigger configurations

🔭 Observability

Reflex integrates with Logfire for observability.

Automatic Tracing

Traces are captured automatically for:

HTTP requests
WebSocket connections
Event store operations
Agent tool calls

Custom Spans

import logfire

# Keyword arguments are recorded as span attributes; "key" is a placeholder name
with logfire.span("my-operation", key="value"):
    # Your code here
    ...

Configuration

Set LOGFIRE_TOKEN in your environment:

LOGFIRE_TOKEN=your-token-here
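A missing token is easy to overlook until traces fail to appear. As a hedged sketch, a startup check along these lines could surface the problem early. The function name `logfire_token_configured` is hypothetical, and how Reflex itself reacts to a missing token is not covered here.

```python
import os
from typing import Mapping


def logfire_token_configured(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when LOGFIRE_TOKEN is present and not blank."""
    return bool(env.get("LOGFIRE_TOKEN", "").strip())
```

Passing the environment as a parameter keeps the check easy to test without modifying the real process environment.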

📖 Runbook

High DLQ Count

Symptoms

/health/detailed shows a high DLQ count

Steps:

Check recent deployments for bugs
Review DLQ events: python scripts/dlq.py list
Check external service status
Fix root cause before retrying
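To turn this symptom into an alert, the DLQ size can be read from the `dlq` indicator in the /health/detailed response. This sketch assumes the message keeps the "N in DLQ" form shown in the sample payload. The function names and the default threshold of 10 are illustrative choices, not Reflex settings.

```python
import re
from typing import Optional


def dlq_count(payload: dict) -> Optional[int]:
    """Extract the DLQ size from the "dlq" indicator message, e.g. "0 in DLQ"."""
    for indicator in payload.get("indicators", []):
        if indicator.get("name") == "dlq":
            match = re.search(r"(\d+)\s+in DLQ", indicator.get("message") or "")
            return int(match.group(1)) if match else None
    return None  # no "dlq" indicator in the payload


def dlq_is_high(payload: dict, threshold: int = 10) -> bool:
    """Return True when the reported DLQ size exceeds the chosen threshold."""
    count = dlq_count(payload)
    return count is not None and count > threshold
```

Returning `None` when the message cannot be parsed keeps an unexpected format from being mistaken for an empty queue.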

High Event Latency

Symptoms

Events take a long time to process

Steps:

Check agent loop logs for slow operations
Review Logfire traces for bottlenecks
Consider scaling horizontally
Check database query performance
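One rough signal of rising latency is a backlog that keeps growing between polls of /health/detailed. The sketch below assumes the `event_queue` indicator message keeps the "N pending" form from the sample payload. Both helper names are hypothetical.

```python
import re
from typing import Optional, Sequence


def pending_events(payload: dict) -> Optional[int]:
    """Extract the backlog size from the "event_queue" message, e.g. "42 pending"."""
    for indicator in payload.get("indicators", []):
        if indicator.get("name") == "event_queue":
            match = re.search(r"(\d+)\s+pending", indicator.get("message") or "")
            return int(match.group(1)) if match else None
    return None


def backlog_growing(samples: Sequence[int]) -> bool:
    """Return True when every sample is larger than the one before it."""
    return len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )
```

A backlog that grows across several consecutive polls suggests events arrive faster than they are processed, which is when scaling out or profiling slow operations becomes worthwhile.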

Database Connection Exhaustion

Symptoms

Connection pool errors in logs

Steps:

Check DB_POOL_MAX against the number of running instances (total connections scale with both)
Verify PostgreSQL max_connections
Look for connection leaks (unclosed sessions)
Increase pool size or reduce instances
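The first two steps come down to simple arithmetic. The sketch below assumes each instance opens one pool capped at DB_POOL_MAX, so the worst case is DB_POOL_MAX multiplied by the instance count. The function names are illustrative, and `reserved=3` mirrors PostgreSQL's default `superuser_reserved_connections`.

```python
def fits_connection_budget(
    db_pool_max: int, instances: int, max_connections: int, reserved: int = 3
) -> bool:
    """Return True when all pools at full size fit under the server limit."""
    return db_pool_max * instances <= max_connections - reserved


def max_instances(db_pool_max: int, max_connections: int, reserved: int = 3) -> int:
    """Largest instance count whose pools fit under the server limit."""
    return (max_connections - reserved) // db_pool_max
```

For example, with DB_POOL_MAX=20 and PostgreSQL's default max_connections of 100, six instances could request up to 120 connections, which exceeds the 97 available, while four instances (80 connections) fit.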