
Monitoring & Observability Guide

For: Operators managing production systems
Level: Advanced
Time to read: 35 minutes
Tools: Prometheus, Grafana, Jaeger, Loki

This guide covers observability patterns for monitoring workflows, debugging issues, and optimizing performance.


Observability Stack

          ┌─────────────────────────────────────┐
          │        Grafana (Dashboards)         │
          └──────────────┬──────────────────────┘
               ┌─────────┼─────────┬────────────┐
               ▼         ▼         ▼            ▼
          Prometheus  Jaeger     Loki     AlertManager
           (Metrics) (Traces)   (Logs)      (Alerts)
               │         │         │            │
               └─────────┴─────────┴────────────┘
                         Cascade Platform

Metrics Collection

Key Metrics

# prometheus.yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cascade'
    static_configs:
      - targets: ['localhost:9090']
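The scrape job above expects a Prometheus endpoint at localhost:9090. If you run your own Cascade workers, a minimal sketch of exposing that endpoint with the Go client library follows; the port and handler wiring are assumptions to match the config above, not built-in platform behavior.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Expose the default registry at /metrics for the 'cascade' scrape job.
	http.Handle("/metrics", promhttp.Handler())

	// The port must match the scrape target in prometheus.yaml (localhost:9090).
	log.Fatal(http.ListenAndServe(":9090", nil))
}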

Workflow Metrics:

cascade_workflow_duration_seconds{workflow="ProcessOrder"}
cascade_workflow_executions_total{workflow="ProcessOrder", status="completed"}
cascade_workflow_errors_total{workflow="ProcessOrder", error_type="timeout"}
cascade_workflow_active{workflow="ProcessOrder"}

Activity Metrics:

cascade_activity_duration_seconds{activity="validate_order"}
cascade_activity_executions_total{activity="validate_order", status="success"}
cascade_activity_errors_total{activity="validate_order"}

System Metrics:

cascade_database_connections
cascade_cache_hit_rate
cascade_queue_depth
cascade_memory_bytes_total
cascade_cpu_seconds_total
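If you instrument custom workers, the workflow metric names above can be reproduced with prometheus/client_golang. The sketch below is illustrative; the registration pattern and the RecordCompletion helper are assumptions, not the platform's actual code.

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Histogram of workflow wall-clock time, labelled by workflow name.
	workflowDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "cascade_workflow_duration_seconds",
		Help:    "Workflow execution time in seconds.",
		Buckets: prometheus.DefBuckets,
	}, []string{"workflow"})

	// Counter of finished workflows, labelled by terminal status.
	workflowExecutions = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "cascade_workflow_executions_total",
		Help: "Total workflow executions by status.",
	}, []string{"workflow", "status"})

	// Gauge of currently running workflows.
	workflowActive = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "cascade_workflow_active",
		Help: "Number of workflows currently executing.",
	}, []string{"workflow"})
)

func init() {
	prometheus.MustRegister(workflowDuration, workflowExecutions, workflowActive)
}

// RecordCompletion observes one finished workflow run and updates the active gauge.
func RecordCompletion(workflow, status string, seconds float64) {
	workflowDuration.WithLabelValues(workflow).Observe(seconds)
	workflowExecutions.WithLabelValues(workflow, status).Inc()
	workflowActive.WithLabelValues(workflow).Dec()
}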

Dashboards

Executive Dashboard

┌────────────────────────────────────────────┐
│ Cascade Platform - Executive Overview      │
├────────────────────────────────────────────┤
│                                            │
│ Throughput:   1,234 workflows/min  ↑ 5%    │
│ Success Rate: 99.8%                ↓ 0.1%  │
│ Avg Latency:  234ms                → stable│
│                                            │
│ ┌──────────────┐  ┌──────────────┐         │
│ │ Errors (24h) │  │ Throughput   │         │
│ │ 125 errors   │  │ (last hour)  │         │
│ │ (0.05%)      │  │ 1,234 wf/min │         │
│ └──────────────┘  └──────────────┘         │
│                                            │
│ Recent Issues:                             │
│ • 3x timeouts on PaymentAPI (1h ago)       │
│ • Database CPU high for 15 mins            │
└────────────────────────────────────────────┘

Technical Dashboard

┌────────────────────────────────────────────┐
│ Cascade - Technical Details                │
├────────────────────────────────────────────┤
│                                            │
│ Activity Performance (P95)                 │
│ ├─ validate_order:    45ms                 │
│ ├─ check_credit:     123ms                 │
│ ├─ process_payment:  234ms                 │
│ └─ send_email:        12ms                 │
│                                            │
│ Resource Usage:                            │
│ ├─ CPU:    65% (2000m/3000m)               │
│ ├─ Memory: 72% (1.8GB/2.5GB)               │
│ ├─ Connections: 18/20                      │
│ └─ Cache Hit Rate: 94.2%                   │
│                                            │
│ Top Issues (P99 Latency):                  │
│ ├─ state: HumanTask [523ms]                │
│ ├─ activity: check_credit [234ms]          │
│ └─ state: Parallel [156ms]                 │
└────────────────────────────────────────────┘

Distributed Tracing

Traces with Jaeger

# View trace for specific execution
cascade trace execution-123 --format jaeger

# Output shows:
# ├─ Workflow Start [0ms]
# │  ├─ State: ValidateOrder [2ms]
# │  │  └─ Activity: validate_order [15ms]
# │  │     ├─ DB Query [8ms]
# │  │     └─ Processing [7ms]
# │  ├─ State: CheckCredit [45ms]
# │  │  └─ Activity: check_credit [42ms]
# │  │     ├─ External API [38ms]
# │  │     └─ Processing [4ms]
# │  └─ State: ProcessPayment [234ms]
# │     └─ Activity: process_payment [231ms]
# └─ Workflow End [296ms total]

Trace Instrumentation

import "go.opentelemetry.io/otel" func ProcessOrder(ctx context.Context, input *OrderInput) (*OrderOutput, error) { tracer := otel.Tracer("cascade/activities") // Create span ctx, span := tracer.Start(ctx, "ProcessOrder") defer span.End() // DB query ctx, dbSpan := tracer.Start(ctx, "DatabaseQuery") result, err := db.GetOrder(ctx, input.OrderID) dbSpan.End() // External API ctx, apiSpan := tracer.Start(ctx, "ExternalAPI") approved, err := paymentAPI.Approve(ctx, result) apiSpan.End() return &OrderOutput{Approved: approved}, nil }

Logs Analysis

Loki Setup

# loki-config.yaml
auth_enabled: false

ingester:
  chunk_idle_period: 3m
  max_chunk_age: 1h

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: cascade_index_
        period: 24h

Log Queries

# All logs for a workflow
{workflow="ProcessOrder"}

# Error logs (the time window is set by the query range, e.g. last 1h)
{severity="error"} | json

# Logs with duration > 1s
{workflow="ProcessOrder"} | json | duration_ms > 1000

# Count errors by type over the last hour
sum by (error_type) (count_over_time({severity="error"} | json [1h]))
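These queries can also be run programmatically: Loki exposes them over its HTTP API at /loki/api/v1/query_range. A hedged sketch, assuming Loki is reachable on its default port 3100:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Query the last hour of slow ProcessOrder logs.
	q := url.Values{}
	q.Set("query", `{workflow="ProcessOrder"} | json | duration_ms > 1000`)
	q.Set("start", fmt.Sprint(time.Now().Add(-time.Hour).UnixNano()))
	q.Set("end", fmt.Sprint(time.Now().UnixNano()))
	q.Set("limit", "100")

	resp, err := http.Get("http://localhost:3100/loki/api/v1/query_range?" + q.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON result streams with the matching log lines
}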

Structured Logging

import "go.uber.org/zap" func ProcessOrder(ctx context.Context, input *OrderInput) (*OrderOutput, error) { logger := log.FromContext(ctx) logger.Info("Processing order", zap.String("order_id", input.OrderID), zap.String("customer_id", input.CustomerID), ) if err := validate(input); err != nil { logger.Error("Validation failed", zap.String("order_id", input.OrderID), zap.Error(err), ) return nil, err } logger.Info("Order validated", zap.String("order_id", input.OrderID), zap.Duration("validation_time", time.Since(start)), ) return result, nil }

Alerting

Alert Rules

# prometheus-rules.yaml
groups:
  - name: cascade_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(cascade_workflow_errors_total[5m]) > 0.01
        for: 5m
        annotations:
          summary: "High error rate detected"
          action: "Investigate error logs"

      - alert: HighLatency
        expr: |
          cascade_workflow_duration_seconds{quantile="0.95"} > 1
        for: 10m
        annotations:
          summary: "P95 latency > 1s"
          action: "Check resource usage"

      - alert: QueueBacklog
        expr: |
          cascade_queue_depth > 1000
        for: 5m
        annotations:
          summary: "Queue depth critical"
          action: "Scale up activity workers"

Notification Channels

# alertmanager.yaml
route:
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
  - name: slack
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
  # (email receiver defined separately)

Health Checks

Liveness Probe

// Is the service running?
func livenessProbe(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
	w.Write([]byte("alive"))
}

Readiness Probe

// Is the service ready to accept traffic?
func readinessProbe(w http.ResponseWriter, r *http.Request) {
	if !db.IsConnected() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	if !cache.IsHealthy() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
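For the Kubernetes manifest below to work, these handlers must be served on the container port. A minimal wiring sketch, using the two handlers above; the mux setup is an assumption, while the paths and port mirror the probe definitions that follow.

package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()
	// Paths and port match the livenessProbe/readinessProbe manifest below.
	mux.HandleFunc("/health/live", livenessProbe)
	mux.HandleFunc("/health/ready", readinessProbe)
	log.Fatal(http.ListenAndServe(":8080", mux))
}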

Kubernetes Probes

spec:
  containers:
    - name: cascade
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5

Best Practices

DO:

  • Collect structured metrics
  • Use distributed tracing
  • Set meaningful alerts
  • Monitor trends
  • Test alerts regularly
  • Document dashboards
  • Rotate logs

DON’T:

  • Alert on every metric
  • Ignore low-priority logs
  • Mix metrics & logs
  • Set unrealistic thresholds
  • Forget context
  • Disable monitoring

Updated: October 29, 2025
Version: 1.0
Stack: Prometheus, Jaeger, Loki, Grafana
