Operations¶
Managing Reflex in production.
💓 Health Monitoring¶
Basic Health Check¶
Detailed Health Check¶
{
"status": "healthy",
"indicators": [
{"name": "database", "status": "healthy", "latency_ms": 1.5},
{"name": "event_queue", "status": "healthy", "message": "42 pending"},
{"name": "dlq", "status": "healthy", "message": "0 in DLQ"}
]
}
Load Balancer Health Checks
Use /health for load balancer probes (fast, simple response). Use /health/detailed for monitoring dashboards.
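As an illustration of how a dashboard or alerting script might consume the /health/detailed response, the sketch below parses the payload shown above and lists every indicator whose status is not "healthy". It assumes the response always has the shape of that sample. The function name unhealthy_indicators is hypothetical and not part of Reflex.

```python
import json


def unhealthy_indicators(payload: str) -> list[str]:
    """Return the names of indicators whose status is not 'healthy'."""
    report = json.loads(payload)
    return [
        item["name"]
        for item in report.get("indicators", [])
        if item.get("status") != "healthy"
    ]


# Sample body copied from the detailed health check response.
SAMPLE = """
{
  "status": "healthy",
  "indicators": [
    {"name": "database", "status": "healthy", "latency_ms": 1.5},
    {"name": "event_queue", "status": "healthy", "message": "42 pending"},
    {"name": "dlq", "status": "healthy", "message": "0 in DLQ"}
  ]
}
"""

if __name__ == "__main__":
    print(unhealthy_indicators(SAMPLE))
```

An empty list means every indicator reported healthy. Any other result names the components to investigate.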
📬 Dead-Letter Queue (DLQ)¶
Events that still fail after the maximum number of retries are moved to the DLQ for manual intervention.
List DLQ Events¶
python scripts/dlq.py list
Retry Events¶
Before Retrying
Investigate why events failed before retrying. Check:
- Application logs for error details
- Event payload for malformed data
- External service availability
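The checks above are easier when failures are grouped by cause first. The following is a minimal sketch of that triage step, assuming each DLQ record can be loaded as a dictionary with an "error" field. The record shape and the summarize_failures helper are hypothetical, so adapt them to the actual DLQ schema.

```python
from collections import Counter


def summarize_failures(events: list[dict]) -> list[tuple[str, int]]:
    """Count DLQ events per error message, most frequent first."""
    counts = Counter(event.get("error", "unknown") for event in events)
    return counts.most_common()


# Hypothetical DLQ records; the real schema may differ.
dlq_events = [
    {"id": "evt-1", "type": "order.created", "error": "upstream timeout"},
    {"id": "evt-2", "type": "order.created", "error": "upstream timeout"},
    {"id": "evt-3", "type": "user.updated", "error": "invalid payload"},
]

if __name__ == "__main__":
    for error, count in summarize_failures(dlq_events):
        print(f"{count:>3}  {error}")
```

A single dominant error usually points at one root cause, such as an unavailable external service. Many distinct errors suggest malformed payloads or a recent regression.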
🔄 Event Replay¶
Replay historical events for debugging or reprocessing.
Use cases:
- Debugging agent behavior with specific events
- Reprocessing events after bug fixes
- Testing new trigger configurations
🔍 Observability¶
Reflex integrates with Logfire for observability.
Automatic Tracing¶
Traces are captured automatically for:
- HTTP requests
- WebSocket connections
- Event store operations
- Agent tool calls
Custom Spans¶
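A custom span wraps a unit of work so that it appears as its own entry in a trace. The sketch below assumes the logfire Python SDK and uses its span context manager. The process_event function, the span name, and the attributes are hypothetical. So that the example still runs where the SDK is not installed, it falls back to a no-op context manager.

```python
from contextlib import nullcontext

try:
    import logfire

    # Export only when LOGFIRE_TOKEN is set; otherwise nothing is sent.
    logfire.configure(send_to_logfire="if-token-present")
    span = logfire.span
except ImportError:
    # Fallback for environments without the SDK: spans become no-ops.
    def span(name, **attributes):
        return nullcontext()


def process_event(event_id: str, payload: dict) -> int:
    """Hypothetical handler: returns the number of fields it processed."""
    with span("process_event {event_id}", event_id=event_id):
        # Work done inside the block is attributed to this span.
        return len(payload)


if __name__ == "__main__":
    print(process_event("evt-1", {"kind": "order.created", "amount": 42}))
```

Passing identifiers such as event_id as span attributes makes it possible to search traces for a specific event later.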
Configuration¶
Set LOGFIRE_TOKEN in your environment so that traces are sent to Logfire.
📖 Runbook¶
High DLQ Count¶
Symptoms
/health/detailed shows high DLQ count
Steps:
- Check recent deployments for bugs
- Review DLQ events: python scripts/dlq.py list
- Check external service status
- Fix root cause before retrying
High Event Latency¶
Symptoms
Events take a long time to process
Steps:
- Check agent loop logs for slow operations
- Review Logfire traces for bottlenecks
- Consider scaling horizontally
- Check database query performance
Database Connection Exhaustion¶
Symptoms
Connection pool errors in logs
Steps:
- Check DB_POOL_MAX vs running instances
- Verify PostgreSQL max_connections
- Look for connection leaks (unclosed sessions)
- Increase pool size or reduce instances
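The first two steps come down to simple arithmetic: the total connections all instances can open must fit within what PostgreSQL allows. The sketch below works through that calculation. It assumes each instance opens at most DB_POOL_MAX connections, which may not hold if an instance creates more than one pool. The pool_budget function is hypothetical. The defaults of 100 for max_connections and 3 reserved superuser connections are PostgreSQL's stock defaults, so check the values on your server.

```python
def pool_budget(
    db_pool_max: int,
    instances: int,
    max_connections: int = 100,
    reserved: int = 3,
) -> int:
    """Return spare connections; a negative value means over-subscription."""
    usable = max_connections - reserved
    demand = db_pool_max * instances
    return usable - demand


if __name__ == "__main__":
    # 6 instances with DB_POOL_MAX=20 can demand 120 connections against 97 usable.
    print(pool_budget(db_pool_max=20, instances=6))
    # 4 instances with DB_POOL_MAX=10 demand 40, leaving headroom.
    print(pool_budget(db_pool_max=10, instances=4))
```

A negative result means the pools can exhaust the server under load, so either lower DB_POOL_MAX, run fewer instances, or raise max_connections.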