Monitoring & Observability Guide
For: Operators managing production systems
Level: Advanced
Time to read: 35 minutes
Tools: Prometheus, Grafana, Jaeger, Loki
This guide covers observability patterns for monitoring workflows, debugging issues, and optimizing performance.
Observability Stack
   ┌────────────────────────────────────────────┐
   │            Grafana (Dashboards)            │
   └─────────────────────┬──────────────────────┘
                         │
     ┌────────────┬──────┴──────┬────────────┐
     ▼            ▼             ▼            ▼
 Prometheus     Jaeger         Loki     AlertManager
 (Metrics)     (Traces)       (Logs)      (Alerts)
     │            │             │            │
     └────────────┴──────┬──────┴────────────┘
                         ▲
                 Cascade Platform

Metrics Collection
Key Metrics
# prometheus.yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cascade'
    static_configs:
      - targets: ['localhost:9090']

Workflow Metrics:
cascade_workflow_duration_seconds{workflow="ProcessOrder"}
cascade_workflow_executions_total{workflow="ProcessOrder", status="completed"}
cascade_workflow_errors_total{workflow="ProcessOrder", error_type="timeout"}
cascade_workflow_active{workflow="ProcessOrder"}

Activity Metrics:
cascade_activity_duration_seconds{activity="validate_order"}
cascade_activity_executions_total{activity="validate_order", status="success"}
cascade_activity_errors_total{activity="validate_order"}

System Metrics:
cascade_database_connections
cascade_cache_hit_rate
cascade_queue_depth
cascade_memory_bytes_total
cascade_cpu_seconds_total
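The metric names above map directly onto prometheus/client_golang collectors. The sketch below shows one way a worker could register and expose two of them for the scrape job in prometheus.yaml; the label sets, the recordWorkflow helper, and the listen address are illustrative assumptions, not part of the platform.

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names mirror the list above; the label sets are illustrative.
var (
    workflowDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "cascade_workflow_duration_seconds",
        Help:    "End-to-end workflow execution time.",
        Buckets: prometheus.DefBuckets,
    }, []string{"workflow"})

    workflowExecutions = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "cascade_workflow_executions_total",
        Help: "Workflow executions by terminal status.",
    }, []string{"workflow", "status"})
)

// recordWorkflow is a hypothetical helper called when a workflow finishes.
func recordWorkflow(name string, start time.Time, status string) {
    workflowDuration.WithLabelValues(name).Observe(time.Since(start).Seconds())
    workflowExecutions.WithLabelValues(name, status).Inc()
}

func main() {
    // Expose the scrape endpoint referenced by the prometheus.yaml job above.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}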
Dashboards

Executive Dashboard
┌────────────────────────────────────────────┐
│   Cascade Platform - Executive Overview    │
├────────────────────────────────────────────┤
│                                            │
│  Throughput:   1,234 workflows/min  ↑ 5%   │
│  Success Rate: 99.8%                ↓ 0.1% │
│  Avg Latency:  234ms               → stable │
│                                            │
│  ┌──────────────┐  ┌──────────────┐        │
│  │ Errors (24h) │  │ Throughput   │        │
│  │ 125 errors   │  │ (last hour)  │        │
│  │ (0.05%)      │  │ 1,234 wf/min │        │
│  └──────────────┘  └──────────────┘        │
│                                            │
│  Recent Issues:                            │
│  • 3x timeouts on PaymentAPI (1h ago)      │
│  • Database CPU high for 15 mins           │
└────────────────────────────────────────────┘

Technical Dashboard
┌────────────────────────────────────────────┐
│        Cascade - Technical Details         │
├────────────────────────────────────────────┤
│                                            │
│  Activity Performance (P95)                │
│  ├─ validate_order:    45ms                │
│  ├─ check_credit:     123ms                │
│  ├─ process_payment:  234ms                │
│  └─ send_email:        12ms                │
│                                            │
│  Resource Usage:                           │
│  ├─ CPU:             65% (2000m/3000m)     │
│  ├─ Memory:          72% (1.8GB/2.5GB)     │
│  ├─ Connections:     18/20                 │
│  └─ Cache Hit Rate:  94.2%                 │
│                                            │
│  Top Issues (P99 Latency):                 │
│  ├─ state: HumanTask        [523ms]        │
│  ├─ activity: check_credit  [234ms]        │
│  └─ state: Parallel         [156ms]        │
└────────────────────────────────────────────┘

Distributed Tracing
Traces with Jaeger
# View trace for specific execution
cascade trace execution-123 --format jaeger

# Output shows:
# ├─ Workflow Start [0ms]
# │  ├─ State: ValidateOrder [2ms]
# │  │  └─ Activity: validate_order [15ms]
# │  │     ├─ DB Query [8ms]
# │  │     └─ Processing [7ms]
# │  ├─ State: CheckCredit [45ms]
# │  │  └─ Activity: check_credit [42ms]
# │  │     ├─ External API [38ms]
# │  │     └─ Processing [4ms]
# │  └─ State: ProcessPayment [234ms]
# │     └─ Activity: process_payment [231ms]
# └─ Workflow End [296ms total]

Trace Instrumentation
import "go.opentelemetry.io/otel"
func ProcessOrder(ctx context.Context, input *OrderInput) (*OrderOutput, error) {
tracer := otel.Tracer("cascade/activities")
// Create span
ctx, span := tracer.Start(ctx, "ProcessOrder")
defer span.End()
// DB query
ctx, dbSpan := tracer.Start(ctx, "DatabaseQuery")
result, err := db.GetOrder(ctx, input.OrderID)
dbSpan.End()
// External API
ctx, apiSpan := tracer.Start(ctx, "ExternalAPI")
approved, err := paymentAPI.Approve(ctx, result)
apiSpan.End()
return &OrderOutput{Approved: approved}, nil
}Logs Analysis
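Spans only reach Jaeger once a tracer provider with an exporter is installed at startup. A minimal sketch using the OpenTelemetry SDK over OTLP/HTTP is shown below; the collector endpoint, the service name, and the initTracing function are assumptions for the example.

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing installs a global tracer provider that batches spans to an
// OTLP endpoint (recent Jaeger versions accept OTLP on port 4318).
func initTracing(ctx context.Context) (func(context.Context) error, error) {
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("localhost:4318"), // assumed collector address
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewSchemaless(
            attribute.String("service.name", "cascade-worker"), // assumed service name
        )),
    )
    otel.SetTracerProvider(tp)

    // The returned shutdown function flushes any buffered spans on exit.
    return tp.Shutdown, nil
}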
Logs Analysis

Loki Setup
# loki-config.yaml
auth_enabled: false

ingester:
  chunk_idle_period: 3m
  max_chunk_age: 1h

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: cascade_index_
        period: 24h

Log Queries
# All logs for a workflow
{workflow="ProcessOrder"}

# Error logs (run with the query range set to the last hour)
{severity="error"} | json

# Logs with duration > 1s
{workflow="ProcessOrder"} | json | duration_ms > 1000

# Count errors by type over the last hour
sum by (error_type) (count_over_time({severity="error"} | json [1h]))
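Outside Grafana, the same LogQL can be run against Loki's HTTP API (/loki/api/v1/query_range). A minimal Go sketch follows; the Loki address, the one-hour window, and the queryLoki helper are assumptions.

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "strconv"
    "time"
)

// queryLoki runs a LogQL query over the last hour and returns the raw JSON response.
func queryLoki(logQL string) (string, error) {
    params := url.Values{}
    params.Set("query", logQL)
    params.Set("start", strconv.FormatInt(time.Now().Add(-time.Hour).UnixNano(), 10))
    params.Set("end", strconv.FormatInt(time.Now().UnixNano(), 10))
    params.Set("limit", "100")

    resp, err := http.Get("http://localhost:3100/loki/api/v1/query_range?" + params.Encode())
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("loki returned %s: %s", resp.Status, body)
    }
    return string(body), nil
}

For example, queryLoki(`{workflow="ProcessOrder"} | json | duration_ms > 1000`) returns the slow-order log lines as JSON.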
import "go.uber.org/zap"
func ProcessOrder(ctx context.Context, input *OrderInput) (*OrderOutput, error) {
logger := log.FromContext(ctx)
logger.Info("Processing order",
zap.String("order_id", input.OrderID),
zap.String("customer_id", input.CustomerID),
)
if err := validate(input); err != nil {
logger.Error("Validation failed",
zap.String("order_id", input.OrderID),
zap.Error(err),
)
return nil, err
}
logger.Info("Order validated",
zap.String("order_id", input.OrderID),
zap.Duration("validation_time", time.Since(start)),
)
return result, nil
}Alerting
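log.FromContext in the example above is platform plumbing; one possible implementation is sketched below so the logger travels with the request context and emits JSON that the | json stage in the queries above can parse. The package layout and context key are assumptions.

package log

import (
    "context"

    "go.uber.org/zap"
)

type ctxKey struct{}

// WithContext returns a copy of ctx carrying the given logger.
func WithContext(ctx context.Context, logger *zap.Logger) context.Context {
    return context.WithValue(ctx, ctxKey{}, logger)
}

// FromContext returns the logger stored in ctx, falling back to a JSON
// production logger so callers never receive nil.
func FromContext(ctx context.Context) *zap.Logger {
    if logger, ok := ctx.Value(ctxKey{}).(*zap.Logger); ok {
        return logger
    }
    logger, _ := zap.NewProduction()
    return logger
}

The workflow engine would typically call WithContext(ctx, logger.With(zap.String("workflow", "ProcessOrder"))) before invoking an activity, so every line the activity logs carries the workflow name as a structured field.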
Alerting

Alert Rules
# prometheus-rules.yaml
groups:
  - name: cascade_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(cascade_workflow_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical   # severity labels drive the AlertManager routing below
        annotations:
          summary: "High error rate detected"
          action: "Investigate error logs"

      - alert: HighLatency
        expr: |
          cascade_workflow_duration_seconds{quantile="0.95"} > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 1s"
          action: "Check resource usage"

      - alert: QueueBacklog
        expr: |
          cascade_queue_depth > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Queue depth critical"
          action: "Scale up activity workers"
Notification Channels

# alertmanager.yml
route:
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
  - name: slack
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
Health Checks

Liveness Probe
// Is the service running?
func livenessProbe(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("alive"))
}

Readiness Probe
// Is the service ready to accept traffic?
func readinessProbe(w http.ResponseWriter, r *http.Request) {
    if !db.IsConnected() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    if !cache.IsHealthy() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}
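To serve these probes on the paths and port the Kubernetes manifest below expects, the handlers only need to be registered on an HTTP mux; the standalone main below is an illustrative sketch.

import (
    "log"
    "net/http"
)

func main() {
    mux := http.NewServeMux()

    // Paths and port match the livenessProbe/readinessProbe settings below.
    mux.HandleFunc("/health/live", livenessProbe)
    mux.HandleFunc("/health/ready", readinessProbe)

    log.Fatal(http.ListenAndServe(":8080", mux))
}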
Kubernetes Probes

spec:
  containers:
    - name: cascade
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5

Best Practices
✅ DO:
- Collect structured metrics
- Use distributed tracing
- Set meaningful alerts
- Monitor trends
- Test alerts regularly
- Document dashboards
- Rotate logs
❌ DON’T:
- Alert on every metric
- Ignore low-priority logs
- Mix metrics & logs
- Set unrealistic thresholds
- Forget context
- Disable monitoring
Updated: October 29, 2025
Version: 1.0
Stack: Prometheus, Jaeger, Loki, Grafana