Monitoring & Observability Guide
For: Operators managing production systems
Level: Advanced
Time to read: 35 minutes
Tools: Prometheus, Grafana, Jaeger, Loki
This guide covers observability patterns for monitoring workflows, debugging issues, and optimizing performance.
Observability Stack
   ┌────────────────────────────────────────────┐
   │            Grafana (Dashboards)            │
   └─────────────────────┬──────────────────────┘
                         │
     ┌────────────┬──────┴──────┬────────────┐
     ▼            ▼             ▼            ▼
 Prometheus     Jaeger         Loki     AlertManager
 (Metrics)     (Traces)       (Logs)      (Alerts)
     │            │             │            │
     └────────────┴──────┬──────┴────────────┘
                         ▲
                 Cascade Platform

Metrics Collection
Key Metrics
# prometheus.yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cascade'
    static_configs:
      - targets: ['localhost:9090']

Workflow Metrics:
cascade_workflow_duration_seconds{workflow="ProcessOrder"}
cascade_workflow_executions_total{workflow="ProcessOrder", status="completed"}
cascade_workflow_errors_total{workflow="ProcessOrder", error_type="timeout"}
cascade_workflow_active{workflow="ProcessOrder"}

Activity Metrics:
cascade_activity_duration_seconds{activity="validate_order"}
cascade_activity_executions_total{activity="validate_order", status="success"}
cascade_activity_errors_total{activity="validate_order"}

System Metrics:
cascade_database_connections
cascade_cache_hit_rate
cascade_queue_depth
cascade_memory_bytes_total
cascade_cpu_seconds_total
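The metric names above map directly onto prometheus/client_golang collectors. The sketch below shows one way a worker could register and expose two of them for the scrape job in prometheus.yaml; the label sets, the recordWorkflow helper, and the listen address are illustrative assumptions, not part of the platform.

package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Metric names mirror the list above; the label sets are illustrative.
var (
    workflowDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "cascade_workflow_duration_seconds",
        Help:    "End-to-end workflow execution time.",
        Buckets: prometheus.DefBuckets,
    }, []string{"workflow"})

    workflowExecutions = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "cascade_workflow_executions_total",
        Help: "Workflow executions by terminal status.",
    }, []string{"workflow", "status"})
)

// recordWorkflow is a hypothetical helper called when a workflow finishes.
func recordWorkflow(name string, start time.Time, status string) {
    workflowDuration.WithLabelValues(name).Observe(time.Since(start).Seconds())
    workflowExecutions.WithLabelValues(name, status).Inc()
}

func main() {
    // Expose the scrape endpoint referenced by the prometheus.yaml job above.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}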
Dashboards

Executive Dashboard
┌────────────────────────────────────────────┐
│   Cascade Platform - Executive Overview    │
├────────────────────────────────────────────┤
│                                            │
│  Throughput:   1,234 workflows/min  ↑ 5%   │
│  Success Rate: 99.8%                ↓ 0.1% │
│  Avg Latency:  234ms               → stable │
│                                            │
│  ┌──────────────┐  ┌──────────────┐        │
│  │ Errors (24h) │  │ Throughput   │        │
│  │ 125 errors   │  │ (last hour)  │        │
│  │ (0.05%)      │  │ 1,234 wf/min │        │
│  └──────────────┘  └──────────────┘        │
│                                            │
│  Recent Issues:                            │
│  • 3x timeouts on PaymentAPI (1h ago)      │
│  • Database CPU high for 15 mins           │
└────────────────────────────────────────────┘

Technical Dashboard
┌────────────────────────────────────────────┐
│        Cascade - Technical Details         │
├────────────────────────────────────────────┤
│                                            │
│  Activity Performance (P95)                │
│  ├─ validate_order:    45ms                │
│  ├─ check_credit:     123ms                │
│  ├─ process_payment:  234ms                │
│  └─ send_email:        12ms                │
│                                            │
│  Resource Usage:                           │
│  ├─ CPU:             65% (2000m/3000m)     │
│  ├─ Memory:          72% (1.8GB/2.5GB)     │
│  ├─ Connections:     18/20                 │
│  └─ Cache Hit Rate:  94.2%                 │
│                                            │
│  Top Issues (P99 Latency):                 │
│  ├─ state: HumanTask        [523ms]        │
│  ├─ activity: check_credit  [234ms]        │
│  └─ state: Parallel         [156ms]        │
└────────────────────────────────────────────┘

Distributed Tracing
Traces with Jaeger
# View trace for specific execution
cascade trace execution-123 --format jaeger

# Output shows:
# ├─ Workflow Start [0ms]
# │  ├─ State: ValidateOrder [2ms]
# │  │  └─ Activity: validate_order [15ms]
# │  │     ├─ DB Query [8ms]
# │  │     └─ Processing [7ms]
# │  ├─ State: CheckCredit [45ms]
# │  │  └─ Activity: check_credit [42ms]
# │  │     ├─ External API [38ms]
# │  │     └─ Processing [4ms]
# │  └─ State: ProcessPayment [234ms]
# │     └─ Activity: process_payment [231ms]
# └─ Workflow End [296ms total]

Trace Instrumentation
import "go.opentelemetry.io/otel"
func ProcessOrder(ctx context.Context, input *OrderInput) (*OrderOutput, error) {
tracer := otel.Tracer("cascade/activities")
// Create span
ctx, span := tracer.Start(ctx, "ProcessOrder")
defer span.End()
// DB query
ctx, dbSpan := tracer.Start(ctx, "DatabaseQuery")
result, err := db.GetOrder(ctx, input.OrderID)
dbSpan.End()
// External API
ctx, apiSpan := tracer.Start(ctx, "ExternalAPI")
approved, err := paymentAPI.Approve(ctx, result)
apiSpan.End()
return &OrderOutput{Approved: approved}, nil
}Logs Analysis
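Spans only reach Jaeger once a tracer provider with an exporter is installed at startup. A minimal sketch using the OpenTelemetry SDK over OTLP/HTTP is shown below; the collector endpoint, the service name, and the initTracing function are assumptions for the example.

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing installs a global tracer provider that batches spans to an
// OTLP endpoint (recent Jaeger versions accept OTLP on port 4318).
func initTracing(ctx context.Context) (func(context.Context) error, error) {
    exporter, err := otlptracehttp.New(ctx,
        otlptracehttp.WithEndpoint("localhost:4318"), // assumed collector address
        otlptracehttp.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewSchemaless(
            attribute.String("service.name", "cascade-worker"), // assumed service name
        )),
    )
    otel.SetTracerProvider(tp)

    // The returned shutdown function flushes any buffered spans on exit.
    return tp.Shutdown, nil
}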
Logs Analysis

Loki Setup
# loki-config.yaml
auth_enabled: false

ingester:
  chunk_idle_period: 3m
  max_chunk_age: 1h

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: cascade_index_
        period: 24h

Log Queries
# All logs for a workflow
{workflow="ProcessOrder"}

# Error logs (run with the query range set to the last hour)
{severity="error"} | json

# Logs with duration > 1s
{workflow="ProcessOrder"} | json | duration_ms > 1000

# Count errors by type over the last hour
sum by (error_type) (count_over_time({severity="error"} | json [1h]))
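Outside Grafana, the same LogQL can be run against Loki's HTTP API (/loki/api/v1/query_range). A minimal Go sketch follows; the Loki address, the one-hour window, and the queryLoki helper are assumptions.

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "strconv"
    "time"
)

// queryLoki runs a LogQL query over the last hour and returns the raw JSON response.
func queryLoki(logQL string) (string, error) {
    params := url.Values{}
    params.Set("query", logQL)
    params.Set("start", strconv.FormatInt(time.Now().Add(-time.Hour).UnixNano(), 10))
    params.Set("end", strconv.FormatInt(time.Now().UnixNano(), 10))
    params.Set("limit", "100")

    resp, err := http.Get("http://localhost:3100/loki/api/v1/query_range?" + params.Encode())
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return "", err
    }
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("loki returned %s: %s", resp.Status, body)
    }
    return string(body), nil
}

For example, queryLoki(`{workflow="ProcessOrder"} | json | duration_ms > 1000`) returns the slow-order log lines as JSON.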
import "go.uber.org/zap"
func ProcessOrder(ctx context.Context, input *OrderInput) (*OrderOutput, error) {
logger := log.FromContext(ctx)
logger.Info("Processing order",
zap.String("order_id", input.OrderID),
zap.String("customer_id", input.CustomerID),
)
if err := validate(input); err != nil {
logger.Error("Validation failed",
zap.String("order_id", input.OrderID),
zap.Error(err),
)
return nil, err
}
logger.Info("Order validated",
zap.String("order_id", input.OrderID),
zap.Duration("validation_time", time.Since(start)),
)
return result, nil
}Alerting
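log.FromContext in the example above is platform plumbing; one possible implementation is sketched below so the logger travels with the request context and emits JSON that the | json stage in the queries above can parse. The package layout and context key are assumptions.

package log

import (
    "context"

    "go.uber.org/zap"
)

type ctxKey struct{}

// WithContext returns a copy of ctx carrying the given logger.
func WithContext(ctx context.Context, logger *zap.Logger) context.Context {
    return context.WithValue(ctx, ctxKey{}, logger)
}

// FromContext returns the logger stored in ctx, falling back to a JSON
// production logger so callers never receive nil.
func FromContext(ctx context.Context) *zap.Logger {
    if logger, ok := ctx.Value(ctxKey{}).(*zap.Logger); ok {
        return logger
    }
    logger, _ := zap.NewProduction()
    return logger
}

The workflow engine would typically call WithContext(ctx, logger.With(zap.String("workflow", "ProcessOrder"))) before invoking an activity, so every line the activity logs carries the workflow name as a structured field.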
Alerting

Alert Rules
# prometheus-rules.yaml
groups:
  - name: cascade_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(cascade_workflow_errors_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical   # severity labels drive the AlertManager routing below
        annotations:
          summary: "High error rate detected"
          action: "Investigate error logs"

      - alert: HighLatency
        expr: |
          cascade_workflow_duration_seconds{quantile="0.95"} > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency > 1s"
          action: "Check resource usage"

      - alert: QueueBacklog
        expr: |
          cascade_queue_depth > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Queue depth critical"
          action: "Scale up activity workers"
Notification Channels

# alertmanager.yml
route:
  routes:
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    - match:
        severity: warning
      receiver: slack
    - match:
        severity: info
      receiver: email

receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: ${PAGERDUTY_KEY}
  - name: slack
    slack_configs:
      - api_url: ${SLACK_WEBHOOK}
Health Checks

Liveness Probe
// Is the service running?
func livenessProbe(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("alive"))
}

Readiness Probe
// Is the service ready to accept traffic?
func readinessProbe(w http.ResponseWriter, r *http.Request) {
    if !db.IsConnected() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    if !cache.IsHealthy() {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}
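To serve these probes on the paths and port the Kubernetes manifest below expects, the handlers only need to be registered on an HTTP mux; the standalone main below is an illustrative sketch.

import (
    "log"
    "net/http"
)

func main() {
    mux := http.NewServeMux()

    // Paths and port match the livenessProbe/readinessProbe settings below.
    mux.HandleFunc("/health/live", livenessProbe)
    mux.HandleFunc("/health/ready", readinessProbe)

    log.Fatal(http.ListenAndServe(":8080", mux))
}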
Kubernetes Probes

spec:
  containers:
    - name: cascade
      livenessProbe:
        httpGet:
          path: /health/live
          port: 8080
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health/ready
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 5

Best Practices
✅ DO:
- Collect structured metrics
- Use distributed tracing
- Set meaningful alerts
- Monitor trends
- Test alerts regularly
- Document dashboards
- Rotate logs
❌ DON’T:
- Alert on every metric
- Ignore low-priority logs
- Mix metrics & logs
- Set unrealistic thresholds
- Forget context
- Disable monitoring
Updated: October 29, 2025
Version: 1.0
Stack: Prometheus, Jaeger, Loki, Grafana