Best Practices: Error Handling & Recovery

For: Developers building resilient workflows
Level: Advanced
Time to read: 40 minutes
Reference: CSL_semantics.dspec.yaml, errors.dspec.yaml

This guide covers error classification, retry strategies, and recovery patterns based on CSL semantics.


Error Classification Model

Permanent vs Transient Errors

From the CSL Error Handling Model (specs/shared/csl_semantics.dspec.yaml):

                ┌──────────────┐
                │ Error Occurs │
                └──────┬───────┘
          ┌────────────┴────────────┐
          │                         │
          ▼                         ▼
     PERMANENT                 TRANSIENT
  (Do NOT retry)          (Retry with backoff)
          │                         │
          ├─ ValidationError        ├─ ServiceUnavailable
          ├─ AuthorizationError     ├─ TimeoutError
          ├─ NotFoundError          ├─ NetworkError
          └─ InvalidInput           └─ ConnectionError

Behavior:

  • Permanent Errors: transition immediately to the Catch handler (no retry)
  • Transient Errors: retry according to the policy; once retries are exhausted, fall through to the Catch handler (see the classification sketch below)
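
On the worker side it helps to make this split explicit. The following is a minimal sketch that assumes the error type names from the diagram above; Classify is a hypothetical helper, not part of the Cascade SDK, and the authoritative list of error types lives in errors.dspec.yaml.

package csl

// Classification mirrors the permanent/transient split in the CSL error model.
type Classification int

const (
	Permanent Classification = iota // do NOT retry; go straight to Catch
	Transient                       // retry with backoff, then Catch
)

// Classify maps an error type name to its classification. Hypothetical helper;
// the authoritative mapping is defined in errors.dspec.yaml.
func Classify(errType string) Classification {
	switch errType {
	case "ValidationError", "AuthorizationError", "NotFoundError", "InvalidInput":
		return Permanent
	case "ServiceUnavailable", "TimeoutError", "NetworkError", "ConnectionError":
		return Transient
	default:
		// Unknown error types default to transient so the retry policy still
		// runs before the Catch handler fires (an assumption, not spec).
		return Transient
	}
}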

Retry Policy Configuration

Retry → Catch Precedence

From specs/shared/csl_semantics.dspec.yaml:

  1. Activity fails
  2. Retry policy is applied (retries N times with backoff)
  3. After MaxAttempts is exhausted, Catch handlers are evaluated
  4. The first matching Catch handler transitions to its error recovery state
  5. If no Catch matches, the workflow fails
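
The same ordering can be expressed as a small driver loop. This is a conceptual sketch of the engine's precedence, not Cascade's actual scheduler; runActivity, wait, and matchCatch are placeholders, and the assumption that max_attempts counts retries (one initial try plus N retries) is inferred from the default-policy timeline in the next subsection.

package engine

import "errors"

// executeState sketches the retry → catch ordering from the spec:
// retries run first; catch handlers are consulted only after
// max_attempts is exhausted.
func executeState(
	runActivity func() error,
	maxAttempts int,
	wait func(retry int), // sleeps according to the backoff policy
	matchCatch func(err error) (next string, ok bool),
) (string, error) {
	var err error
	// Assumption: max_attempts counts retries, so one initial try + maxAttempts retries.
	for attempt := 0; attempt <= maxAttempts; attempt++ {
		if attempt > 0 {
			wait(attempt) // step 2: backoff between attempts
		}
		if err = runActivity(); err == nil {
			return "next", nil // success: follow the state's normal transition
		}
	}
	// Steps 3–4: retries exhausted, evaluate catch handlers in order.
	if next, ok := matchCatch(err); ok {
		return next, nil
	}
	return "", errors.New("no catch matched: workflow fails") // step 5
}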

Default Retry Policy

retries:
  initial_interval: 1s
  max_interval: 60s
  multiplier: 2.0
  default:
    max_attempts: 3

Timeline:

Attempt 1: 0s (immediate)
Attempt 2: after 1s
Attempt 3: after 2s
Attempt 4: after 4s  (retries stop: max_attempts exhausted)
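
Each wait is initial_interval × multiplier^(n−1), capped at max_interval. A quick sketch of that arithmetic (illustrative only, not engine code):

package engine

import "time"

// backoffSchedule returns the wait before each retry, assuming
// interval = initial * multiplier^(retry-1), capped at max.
func backoffSchedule(initial, max time.Duration, multiplier float64, retries int) []time.Duration {
	waits := make([]time.Duration, 0, retries)
	interval := initial
	for i := 0; i < retries; i++ {
		waits = append(waits, interval)
		interval = time.Duration(float64(interval) * multiplier)
		if interval > max {
			interval = max
		}
	}
	return waits
}

// backoffSchedule(1*time.Second, 60*time.Second, 2.0, 3) → [1s 2s 4s],
// matching the timeline above for the default policy.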

Customized Retry Strategy

# ✅ Good: Conservative for external APIs
- name: CallPaymentAPI
  type: Task
  timeout: 30s
  retries:
    max_attempts: 3
    backoff:
      initial_interval: 2s
      max_interval: 30s
      multiplier: 2.0

# ❌ Bad: Too aggressive, wastes resources
- name: CallPaymentAPI
  type: Task
  timeout: 30s
  retries:
    max_attempts: 10             # Too many
    backoff:
      initial_interval: 100ms    # Too short
      multiplier: 1.5            # Too slow progression

Catch Handlers

Error Catching Pattern

states:
  - name: ProcessPayment
    type: Task
    resource: urn:cascade:activity:charge_card
    parameters:
      amount: "{{ $.total }}"
    timeout: 30s
    retries:
      max_attempts: 3
      backoff:
        initial_interval: 2s
        max_interval: 30s
        multiplier: 2.0
    # Catch handlers for different errors
    catch:
      - error_equals: ["ValidationError"]
        result_path: $.error
        next: HandleValidationError
      - error_equals: ["TimeoutError", "ConnectionError"]
        result_path: $.error
        next: RetryLater
      - error_equals: ["States.ALL"]    # Catch-all
        result_path: $.error
        next: HandleUnexpectedError
    next: ProcessSuccess

  - name: HandleValidationError
    type: Task
    resource: urn:cascade:activity:notify_invalid_payment
    end: true

  - name: RetryLater
    type: Wait
    duration: 5m
    next: ProcessPayment    # Retry the payment

  - name: ProcessSuccess
    type: Task
    end: true

Error Recovery Patterns

Pattern 1: Graceful Degradation

states:
  - name: GetUserPreferences
    type: Task
    resource: urn:cascade:activity:fetch_preferences
    timeout: 5s
    catch:
      - error_equals: ["TimeoutError"]
        result_path: $.error
        next: UseDefaults
    next: ProcessWithPreferences

  - name: UseDefaults
    type: Task
    resource: urn:cascade:activity:apply_default_preferences
    next: ProcessWithPreferences

  - name: ProcessWithPreferences
    type: Task
    end: true

Pattern 2: Fallback Service

states:
  - name: CallPrimaryService
    type: Task
    resource: urn:cascade:activity:primary_api_call
    timeout: 10s
    catch:
      - error_equals: ["ServiceUnavailable", "TimeoutError"]
        next: CallFallbackService
    next: Success

  - name: CallFallbackService
    type: Task
    resource: urn:cascade:activity:fallback_api_call
    timeout: 30s
    catch:
      - error_equals: ["States.ALL"]
        next: DegradedMode
    next: Success

  - name: DegradedMode
    type: Task
    resource: urn:cascade:activity:cached_response
    next: Success

  - name: Success
    type: Task
    end: true

Pattern 3: Compensation (Rollback)

states:
  - name: TransferFunds
    type: Task
    resource: urn:cascade:activity:debit_account
    parameters:
      from_account: "{{ $.source }}"
      amount: "{{ $.amount }}"
    result: $.debit_transaction
    catch:
      - error_equals: ["InsufficientFunds"]
        next: TransferFailed
    next: CreditAccount

  - name: CreditAccount
    type: Task
    resource: urn:cascade:activity:credit_account
    parameters:
      to_account: "{{ $.destination }}"
      amount: "{{ $.amount }}"
    catch:
      - error_equals: ["States.ALL"]
        next: CompensateDebit    # Rollback the debit
    next: Success

  - name: CompensateDebit
    type: Task
    resource: urn:cascade:activity:refund_account
    parameters:
      account: "{{ $.source }}"
      amount: "{{ $.amount }}"
      original_transaction: "{{ $.debit_transaction }}"
    next: TransferFailed

  - name: TransferFailed
    type: Task
    end: true

  - name: Success
    type: Task
    end: true

Idempotency for Safe Retries

Idempotency Key Pattern

func ChargePayment(ctx context.Context, input *PaymentInput) (*ChargeOutput, error) {
	// Use an idempotency key to prevent duplicate charges on retry.
	idempotencyKey := input.IdempotencyKey // Same on retry

	// Check if we already processed this charge.
	existing, err := db.GetChargeByIdempotencyKey(ctx, idempotencyKey)
	if err == nil {
		// Already processed: return the cached result.
		return existing, nil
	}

	// First time: perform the charge.
	result, err := stripe.Charge(ctx, &stripe.ChargeParams{
		Amount:     input.Amount,
		Idempotent: idempotencyKey, // Stripe handles idempotency
	})
	if err != nil {
		return nil, err
	}

	// Store the result for retry safety.
	db.SaveCharge(ctx, idempotencyKey, result)
	return result, nil
}

// In CDL:
// - name: ChargePayment
//   type: Task
//   parameters:
//     amount: "{{ $.amount }}"
//     idempotency_key: "{{ workflow.execution_id }}-payment"

Circuit Breaker Pattern

Protecting Against Cascading Failures

import "github.com/grpc-ecosystem/go-grpc-middleware/retry" type CircuitBreaker struct { MaxFailures int ResetTimeout time.Duration ConsecutiveFails int LastFailTime time.Time State string // "CLOSED", "OPEN", "HALF_OPEN" } func (cb *CircuitBreaker) Call(fn func() error) error { switch cb.State { case "OPEN": if time.Since(cb.LastFailTime) > cb.ResetTimeout { cb.State = "HALF_OPEN" cb.ConsecutiveFails = 0 } else { return fmt.Errorf("circuit breaker OPEN (opens for %v)", cb.ResetTimeout-time.Since(cb.LastFailTime)) } } err := fn() if err != nil { cb.ConsecutiveFails++ cb.LastFailTime = time.Now() if cb.ConsecutiveFails >= cb.MaxFailures { cb.State = "OPEN" } return err } // Success: reset cb.ConsecutiveFails = 0 cb.State = "CLOSED" return nil } // Usage in activity func CallExternalAPI(ctx context.Context, input *APIInput) (*APIOutput, error) { return circuitBreaker.Call(func() error { return externalAPI.Call(ctx, input) }) }

Timeouts

Timeout Strategy

# ✅ Good: Realistic timeouts
states:
  - name: FastOperation
    type: Task
    timeout: 5s      # Local operation
  - name: APICall
    type: Task
    timeout: 30s     # External API
  - name: HumanTask
    type: HumanTask
    timeout: 24h     # User decision
  - name: Database
    type: Task
    timeout: 10s     # Database operation

# ❌ Bad: Unrealistic timeouts
  - name: SlowAPI
    type: Task
    timeout: 100ms   # Too short for external API
  - name: DataProcess
    type: Task
    timeout: 1s      # Too short for processing
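
A state-level timeout only helps if the activity itself gives up when the budget expires. The sketch below assumes the worker receives (or derives) that budget and propagates it as a context deadline; callWithBudget is a hypothetical helper, not a Cascade API.

package activities

import (
	"context"
	"time"
)

// callWithBudget bounds fn to the state's timeout. Whether Cascade already
// derives the context deadline from the CDL `timeout` field is an assumption;
// if it does, this inner WithTimeout is simply a tighter bound.
func callWithBudget(ctx context.Context, budget time.Duration, fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()
	// fn must observe ctx.Done() or pass ctx to its outbound calls,
	// otherwise the timeout cannot actually cancel the work.
	return fn(ctx)
}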

Timeout Action

- name: LongRunningTask
  type: HumanTask
  timeout: 24h
  timeout_action: ESCALATE_TO_MANAGER
  # Options: ESCALATE_*, CANCEL, AUTO_APPROVE, AUTO_DENY

Error Monitoring

Error Metrics

# Error rate by activity
sum by (activity) (rate(cascade_activity_errors_total[5m]))

# Error rate by type
sum by (error_type) (rate(cascade_activity_errors_total[5m]))

# Permanent vs transient errors
cascade_activity_errors_total{error_classification="permanent"}
cascade_activity_errors_total{error_classification="transient"}

# Recovery success rate
rate(cascade_workflow_recovered_total[5m])
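
These queries assume the worker exports a counter with activity, error_type, and error_classification labels. A minimal sketch with the standard Prometheus Go client, under the assumption that Cascade does not already emit these series for you:

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// activityErrors backs the cascade_activity_errors_total queries above.
// The label names mirror the PromQL; whether Cascade emits exactly these
// series is an assumption.
var activityErrors = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "cascade_activity_errors_total",
		Help: "Activity errors by activity, error type, and classification.",
	},
	[]string{"activity", "error_type", "error_classification"},
)

// RecordError increments the counter for one failed activity invocation.
func RecordError(activity, errType, classification string) {
	activityErrors.WithLabelValues(activity, errType, classification).Inc()
}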

Error Alerts

alerts:
  - name: HighErrorRate
    condition: "rate(cascade_activity_errors_total[5m]) > 0.05"
    severity: warning

  - name: PermanentErrorIncrease
    condition: |
      rate(cascade_activity_errors_total{error_classification="permanent"}[5m]) > 0.01
    severity: critical
    action: page_oncall

  - name: RecoveryFailure
    condition: |
      cascade_workflow_failed_total - cascade_workflow_recovered_total > 10
    severity: critical

Best Practices

DO:

  • Classify errors as permanent/transient
  • Retry transient errors only
  • Use exponential backoff
  • Implement idempotency
  • Set realistic timeouts
  • Use circuit breakers
  • Catch errors explicitly
  • Log error context

DON’T:

  • Retry permanent errors
  • Use linear backoff
  • Retry without idempotency
  • Forget error logging
  • Set very long timeouts
  • Retry indefinitely
  • Catch only generic errors (States.ALL) without specific handlers first
  • Suppress error details

Updated: October 29, 2025
Version: 1.0
Reference: CSL_semantics.dspec.yaml
