Monitor your Infrahub deployment to ensure reliability, identify performance bottlenecks, and troubleshoot issues. This page covers monitoring strategies, metrics, logging, and observability.
Monitoring components
Health checks
Infrahub provides built-in health check endpoints:
API server health:
curl http://localhost:8000/api/health
Returns HTTP 200 if healthy.
Configuration endpoint:
curl http://localhost:8000/api/config
Returns version and configuration details.
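The health endpoint can also be polled from a script, for example in a deployment pipeline. The following is a minimal sketch using only the Python standard library; the function name `check_health` and its `opener` parameter are illustrative and not part of Infrahub.

```python
import http.client
import urllib.error
import urllib.request


def check_health(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True when GET <base_url>/api/health answers with HTTP 200.

    `opener` is injectable (hypothetical helper parameter) so the check can
    be exercised without a running server.
    """
    url = base_url.rstrip("/") + "/api/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

Calling `check_health("http://localhost:8000")` returns `True` or `False`, which a wrapper script can map to an exit code.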
Component health checks:
Container health
Docker Compose includes built-in health checks:
# Check all services
docker compose ps
# Services should show (healthy) status
Kubernetes liveness and readiness probes:
livenessProbe:
  httpGet:
    path: /api/health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /api/health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 5
Metrics collection
Prometheus metrics
Enable OpenTelemetry metrics export:
docker-compose.override.yml
services:
  infrahub-server:
    environment:
      INFRAHUB_TRACE_ENABLE: "true"
      INFRAHUB_TRACE_EXPORTER_TYPE: otlp
      INFRAHUB_TRACE_EXPORTER_PROTOCOL: grpc
      INFRAHUB_TRACE_EXPORTER_ENDPOINT: http://prometheus:9090
Key metrics to monitor
API server metrics:
- Request rate (requests/second)
- Response time (p50, p95, p99)
- Error rate (4xx, 5xx responses)
- Active connections
- Worker utilization
Database metrics:
- Query execution time
- Transaction rate
- Cache hit ratio
- Connection pool usage
- Page cache usage
- Heap memory usage
Task worker metrics:
- Active tasks
- Task queue depth
- Task failure rate
- Task execution time
- Worker concurrency
System metrics:
- CPU usage
- Memory usage
- Disk I/O
- Network throughput
- Disk space usage
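To make the response-time percentiles and error rate listed above concrete, here is a small sketch that computes them from raw samples. It uses the nearest-rank percentile definition, which is an assumption chosen for simplicity; monitoring backends such as Prometheus estimate quantiles from histogram buckets instead. The function names are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(status_codes, lowest_error=500):
    """Fraction of responses whose status code is at or above lowest_error."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= lowest_error)
    return errors / len(status_codes)
```

Passing `lowest_error=400` counts 4xx responses as errors too, which matches the "4xx, 5xx" wording in the list.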
Neo4j metrics
Query Neo4j metrics:
// Database size
CALL db.stats.retrieve('GRAPH COUNTS');
// Query performance
CALL dbms.queryJmx('org.neo4j:instance=kernel#0,name=Transactions');
// Page cache
CALL dbms.queryJmx('org.neo4j:instance=kernel#0,name=Page cache');
Expose Neo4j metrics to Prometheus:
docker-compose.override.yml
services:
  database:
    environment:
      NEO4J_metrics_prometheus_enabled: "true"
      NEO4J_metrics_prometheus_endpoint: "0.0.0.0:2004"
    ports:
      - "2004:2004"
RabbitMQ metrics
Access the RabbitMQ management interface at http://localhost:15672 (the same address used for the queue check under Troubleshooting):
Credentials: infrahub / infrahub
Export Prometheus metrics:
docker-compose.override.yml
services:
  message-queue:
    environment:
      RABBITMQ_PROMETHEUS_PLUGIN: "true"
    ports:
      - "15692:15692"
Scrape endpoint:
scrape_configs:
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['message-queue:15692']
Redis metrics
Query Redis info:
# Memory usage
docker compose exec cache redis-cli info memory
# Stats
docker compose exec cache redis-cli info stats
# All info
docker compose exec cache redis-cli info
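The `info stats` output includes the `keyspace_hits` and `keyspace_misses` counters, from which a cache hit ratio can be derived. The sketch below parses the text printed by `redis-cli info` and computes that ratio; the function names are illustrative.

```python
def parse_redis_info(text):
    """Parse `redis-cli info` output into a dict of field name to string value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Stats".
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition(":")
        if separator:
            fields[key] = value
    return fields


def cache_hit_ratio(info):
    """Return hits / (hits + misses), or None when no lookups were recorded."""
    hits = int(info.get("keyspace_hits", 0))
    misses = int(info.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None
```

For example, pipe the output of `docker compose exec cache redis-cli info stats` into a script that calls `cache_hit_ratio(parse_redis_info(sys.stdin.read()))`.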
Export to Prometheus using redis_exporter:
docker-compose.override.yml
services:
  redis-exporter:
    image: oliver006/redis_exporter:latest
    environment:
      REDIS_ADDR: cache:6379
    ports:
      - "9121:9121"
Logging
Log levels
Configure log verbosity:
# Set log level
INFRAHUB_LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR, CRITICAL
Centralized logging
Aggregate logs from all containers in a central backend such as Loki, Elasticsearch, or CloudWatch.
Structured logging
Infrahub logs are JSON-formatted for easy parsing:
{
  "timestamp": "2025-03-02T12:00:00Z",
  "level": "INFO",
  "logger": "infrahub.server",
  "message": "Request processed",
  "request_id": "abc123",
  "duration_ms": 45
}
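Parse with jq. The --no-log-prefix flag drops the service-name prefix that Compose adds to each line, and fromjson? skips any line that is not valid JSON:
docker compose logs --no-log-prefix infrahub-server | jq -R 'fromjson? | objects | select(.level == "ERROR")'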
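The same filtering can be done in Python when jq is not available. This sketch tolerates a Compose-style `service | ` prefix by parsing from the first `{` on each line and silently skips lines that are not JSON objects. The field name `level` follows the sample record above; the function names are illustrative.

```python
import json


def iter_json_records(lines):
    """Yield the JSON object found on each line, ignoring non-JSON lines."""
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def filter_by_level(lines, level="ERROR"):
    """Return all records whose "level" field equals the given level."""
    return [r for r in iter_json_records(lines) if r.get("level") == level]
```

Because `filter_by_level` accepts any iterable of strings, it can read directly from `sys.stdin` or from an open log file.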
Log retention
Configure Docker log rotation:
docker-compose.override.yml
services:
  infrahub-server:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
Distributed tracing
Enable OpenTelemetry tracing:
docker-compose.override.yml
services:
  infrahub-server:
    environment:
      INFRAHUB_TRACE_ENABLE: "true"
      INFRAHUB_TRACE_EXPORTER_TYPE: otlp
      INFRAHUB_TRACE_EXPORTER_ENDPOINT: http://jaeger:4317
      OTEL_RESOURCE_ATTRIBUTES: service.name=infrahub-server
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # UI
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
Access the Jaeger UI at http://localhost:16686, the UI port published by the jaeger service.
Alerting
Prometheus alerts
Define alerts for critical conditions:
groups:
  - name: infrahub
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
      - alert: HighResponseTime
        expr: histogram_quantile(0.95, sum by (le, instance) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time on {{ $labels.instance }}"
      - alert: DatabaseDown
        expr: up{job="neo4j"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Neo4j database is down"
      - alert: HighMemoryUsage
        expr: container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.container_label_com_docker_compose_service }}"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
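Note that the HighErrorRate expression compares an absolute per-second rate, not a percentage: 0.05 responses per second over a 5-minute window corresponds to about 15 server errors. The sketch below is a simplified model of what `rate()` computes, plus a ratio variant for teams that prefer to alert on the share of failed requests. Prometheus itself also handles counter resets and extrapolates to the window edges, which this sketch does not; the function names are illustrative.

```python
def per_second_rate(samples):
    """Average per-second increase between the first and last sample.

    samples: list of (unix_timestamp_seconds, counter_value), oldest first.
    Simplified: ignores counter resets and window-boundary extrapolation.
    """
    if len(samples) < 2:
        return 0.0
    (start_time, start_value), (end_time, end_value) = samples[0], samples[-1]
    elapsed = end_time - start_time
    if elapsed <= 0:
        raise ValueError("samples must be ordered by increasing timestamp")
    return (end_value - start_value) / elapsed


def error_ratio(error_samples, total_samples):
    """Share of requests that failed over the window, or 0.0 with no traffic."""
    total = per_second_rate(total_samples)
    return per_second_rate(error_samples) / total if total > 0 else 0.0
```

With 15 additional 5xx responses counted over 300 seconds, `per_second_rate` sits right at the 0.05 threshold, so the alert condition only becomes true once errors arrive faster than that.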
Alert notifications
Configure Alertmanager:
route:
  receiver: 'default'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
  - name: 'default'
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#infrahub-alerts'
Dashboards
Grafana setup
Deploy Grafana:
docker-compose.override.yml
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./grafana/datasources:/etc/grafana/provisioning/datasources
volumes:
  grafana_data:
Access Grafana at http://localhost:3000 and sign in as admin with the password set in GF_SECURITY_ADMIN_PASSWORD.
Sample dashboard panels
API request rate:
rate(http_requests_total[5m])
Database query latency:
histogram_quantile(0.95, rate(neo4j_cypher_query_duration_seconds_bucket[5m]))
Task queue depth:
rabbitmq_queue_messages{queue="infrahub"}
Memory usage:
container_memory_usage_bytes{container_label_com_docker_compose_service="infrahub-server"}
Enable query logging:
docker-compose.override.yml
services:
  infrahub-server:
    environment:
      INFRAHUB_MISC_PRINT_QUERY_DETAILS: "true"
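View slow queries (longer than 1000 ms). The --no-log-prefix flag drops the service-name prefix so that jq receives plain JSON lines:
docker compose logs --no-log-prefix infrahub-server | grep "query_duration" | jq -R 'fromjson? | objects | select(.query_duration_ms > 1000)'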
Database profiling
Profile Neo4j queries:
// Explain query plan
EXPLAIN MATCH (n:Node) WHERE n.name = 'example' RETURN n;
// Profile query execution
PROFILE MATCH (n:Node) WHERE n.name = 'example' RETURN n;
Resource utilization
Monitor container resource usage:
# Docker stats
docker stats
# Specific container
docker stats infrahub-server-1
Kubernetes resource metrics:
# Pod metrics
kubectl top pods -n infrahub
# Node metrics
kubectl top nodes
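Container statistics can also be checked programmatically. The sketch below assumes the output of `docker stats --no-stream --format '{{json .}}'`, which is expected to print one JSON object per container with `Name` and `MemPerc` fields; verify the field names against your Docker version. The 90% default mirrors the HighMemoryUsage alert threshold, and the function names are illustrative.

```python
import json


def containers_over_memory(stats_output, threshold_percent=90.0):
    """Return (name, memory_percent) for containers above the threshold."""
    flagged = []
    for line in stats_output.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        try:
            # MemPerc is a string such as "93.50%".
            memory_percent = float(row["MemPerc"].rstrip("%"))
        except (KeyError, ValueError):
            # Containers without a usable reading (for example "--") are skipped.
            continue
        if memory_percent > threshold_percent:
            flagged.append((row.get("Name", ""), memory_percent))
    return flagged
```

An empty result means no container exceeded the threshold at the time of the snapshot.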
Troubleshooting
Common issues
High memory usage:
# Check Neo4j heap
docker compose exec database cypher-shell -u neo4j \
  "CALL dbms.queryJmx('java.lang:type=Memory');"
# Increase heap size
NEO4J_dbms_memory_heap_max__size=4G
Slow API responses:
# Check query cache hit rate
curl http://localhost:8000/api/metrics | grep cache_hit_rate
# Select the Redis database index used for the cache (this setting does not change the cache size)
INFRAHUB_CACHE_DATABASE=1
Task queue backlog:
# Check queue depth
curl -u infrahub:infrahub http://localhost:15672/api/queues | jq '.[] | {name, messages}'
# Scale workers
docker compose up -d --scale task-worker=4
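To decide when to scale workers, the queue listing returned by the RabbitMQ management API can be checked against a threshold. This sketch takes the JSON body of `/api/queues` (the same `name` and `messages` fields used in the jq command above) and returns the queues with a backlog. The threshold of 1000 messages is an arbitrary example value, and the function name is illustrative.

```python
import json


def backlogged_queues(queues_json, threshold=1000):
    """Return (queue_name, message_count) pairs above threshold, largest first."""
    queues = json.loads(queues_json)
    backlog = [
        (queue["name"], queue.get("messages", 0))
        for queue in queues
        if queue.get("messages", 0) > threshold
    ]
    return sorted(backlog, key=lambda item: item[1], reverse=True)
```

Feed it the output of the `curl` command above, for example by piping the response body to a script that reads standard input.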