Best Practices: Error Handling & Recovery

For: Developers building resilient workflows
Level: Advanced
Time to read: 40 minutes
Reference: CSL_semantics.dspec.yaml, errors.dspec.yaml

This guide covers error classification, retry strategies, and recovery patterns based on CSL semantics.

Error Classification Model

Permanent vs Transient Errors


From the CSL Error Handling Model (specs/shared/csl_semantics.dspec.yaml):

┌────────────────────────────┐
│ Error Occurs               │
└─────────────┬──────────────┘
              │
    ┌─────────┴─────────┐
    │                   │
    ▼                   ▼
PERMANENT            TRANSIENT
(Do NOT retry)       (Retry with backoff)
    │                   │
    ├─ ValidationError  ├─ ServiceUnavailable
    ├─ AuthorizedError  ├─ TimeoutError
    ├─ NotFoundError    ├─ NetworkError
    └─ InvalidInput     └─ ConnectionError

Behavior:

Permanent Errors: Transition immediately to the Catch handler (no retry)
Transient Errors: Retry according to the retry policy; once retries are exhausted, transition to the Catch handler
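
The classification above can be made machine-checkable inside an activity. The following is a minimal sketch, not the CSL implementation: the `WorkflowError` type and the `IsTransient` helper are hypothetical names, and treating unknown errors as permanent is an assumption made here for safety. Only the error names come from the diagram.

```go
package main

import "errors"

// WorkflowError is a hypothetical error type that carries one of the error
// names from the classification diagram.
type WorkflowError struct {
	Name string // e.g. "TimeoutError"
	Msg  string
}

func (e *WorkflowError) Error() string { return e.Name + ": " + e.Msg }

// transientNames lists the error names the diagram marks as TRANSIENT.
var transientNames = map[string]bool{
	"ServiceUnavailable": true,
	"TimeoutError":       true,
	"NetworkError":       true,
	"ConnectionError":    true,
}

// IsTransient reports whether err (or any error it wraps) is a WorkflowError
// with a transient name. Anything else, including unknown errors, is treated
// as permanent in this sketch and therefore not retried.
func IsTransient(err error) bool {
	var we *WorkflowError
	if errors.As(err, &we) {
		return transientNames[we.Name]
	}
	return false
}
```

Because `errors.As` unwraps the error chain, an activity can wrap a `WorkflowError` with extra context (for example via `fmt.Errorf("charging card: %w", err)`) without losing its classification.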

Retry Policy Configuration

Retry → Catch Precedence


From specs/shared/csl_semantics.dspec.yaml:

1. Activity fails
2. Retry policy is applied (transient errors are retried up to N times with backoff; permanent errors skip this step)
3. After max_attempts is exhausted, Catch handlers are evaluated in order
4. First matching Catch handler transitions to error recovery state
5. If no Catch matches, workflow fails
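
The five steps above can be sketched as a small function. This is an illustration under stated assumptions rather than the engine's actual code: `Catcher` and `Resolve` are hypothetical names, errors are represented by their names only, `maxRetries` counts retries after the first attempt, and the permanent/transient check and the backoff sleep are left out to keep the precedence visible.

```go
package main

// Catcher mirrors one entry of a catch list: the error names it matches and
// the state to transition to. Field names are illustrative.
type Catcher struct {
	ErrorEquals []string
	Next        string
}

// Resolve runs the activity, retries it up to maxRetries times, and only then
// consults the catch handlers in order. The activity returns "" on success or
// an error name on failure. ok is false when the workflow fails.
func Resolve(activity func() string, maxRetries int, catchers []Catcher, onSuccess string) (next string, ok bool) {
	var errName string
	for attempt := 0; attempt <= maxRetries; attempt++ {
		errName = activity() // steps 1 and 2: run, then retry on failure
		if errName == "" {
			return onSuccess, true
		}
		// A real engine would wait for the backoff interval here.
	}
	// Steps 3 and 4: retries exhausted, first matching handler wins.
	for _, c := range catchers {
		for _, name := range c.ErrorEquals {
			if name == errName || name == "States.ALL" {
				return c.Next, true
			}
		}
	}
	return "", false // step 5: no Catch matched, workflow fails
}
```

Since the first match wins, a `States.ALL` entry placed before a specific handler would shadow it, which is why the catch-all goes last.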

Default Retry Policy


retries:
  initial_interval: 1s
  max_interval: 60s
  multiplier: 2.0
  default:
    max_attempts: 3

Timeline (each delay is counted from the previous failed attempt):


Attempt 1: 0s (immediate)
Attempt 2: after 1s
Attempt 3: after 2s
Attempt 4: after 4s
(stops here: the initial attempt plus max_attempts = 3 retries)
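
The delays in the timeline follow from the policy fields: each wait is the previous one times `multiplier`, capped at `max_interval`. A minimal sketch of that arithmetic, where `BackoffDelays` is a hypothetical helper and not part of any CSL API:

```go
package main

import "time"

// BackoffDelays returns the wait before each retry:
// delay(n) = min(initial * multiplier^(n-1), maxInterval).
func BackoffDelays(initial, maxInterval time.Duration, multiplier float64, retries int) []time.Duration {
	var delays []time.Duration
	next := float64(initial)
	for i := 0; i < retries; i++ {
		if next > float64(maxInterval) {
			next = float64(maxInterval) // never wait longer than the cap
		}
		delays = append(delays, time.Duration(next))
		next *= multiplier
	}
	return delays
}
```

Calling `BackoffDelays(time.Second, 60*time.Second, 2.0, 3)` yields 1s, 2s, 4s, matching the timeline. With more retries, the cap takes effect from the eighth retry onward, where 64s is clamped to 60s.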

Customized Retry Strategy


# ✅ Good: Conservative for external APIs
- name: CallPaymentAPI
  type: Task
  timeout: 30s
  retries:
    max_attempts: 3
    backoff:
      initial_interval: 2s
      max_interval: 30s
      multiplier: 2.0
 
# ❌ Bad: Too aggressive, wastes resources
- name: CallPaymentAPI
  type: Task
  timeout: 30s
  retries:
    max_attempts: 10  # Too many
    backoff:
      initial_interval: 100ms  # Too short
      multiplier: 1.5  # Too slow progression

Catch Handlers

Error Catching Pattern


states:
  - name: ProcessPayment
    type: Task
    resource: urn:cascade:activity:charge_card
    parameters:
      amount: "{{ $.total }}"
    timeout: 30s
    retries:
      max_attempts: 3
      backoff:
        initial_interval: 2s
        max_interval: 30s
        multiplier: 2.0
    
    # Catch handlers for different errors
    catch:
      - error_equals: ["ValidationError"]
        result_path: $.error
        next: HandleValidationError
      
      - error_equals: ["TimeoutError", "ConnectionError"]
        result_path: $.error
        next: RetryLater
      
      - error_equals: ["States.ALL"]  # Catch-all
        result_path: $.error
        next: HandleUnexpectedError
    
    next: ProcessSuccess
 
  - name: HandleValidationError
    type: Task
    resource: urn:cascade:activity:notify_invalid_payment
    end: true
 
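  - name: RetryLater
    type: Wait
    duration: 5m
    next: ProcessPayment  # Retry the payment

  # Target of the catch-all handler above (resource name is illustrative)
  - name: HandleUnexpectedError
    type: Task
    resource: urn:cascade:activity:notify_unexpected_error
    end: true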
 
  - name: ProcessSuccess
    type: Task
    end: true

Error Recovery Patterns

Pattern 1: Graceful Degradation


states:
  - name: GetUserPreferences
    type: Task
    resource: urn:cascade:activity:fetch_preferences
    timeout: 5s
    catch:
      - error_equals: ["TimeoutError"]
        result_path: $.error
        next: UseDefaults
    next: ProcessWithPreferences
 
  - name: UseDefaults
    type: Task
    resource: urn:cascade:activity:apply_default_preferences
    next: ProcessWithPreferences
 
  - name: ProcessWithPreferences
    type: Task
    end: true

Pattern 2: Fallback Service


states:
  - name: CallPrimaryService
    type: Task
    resource: urn:cascade:activity:primary_api_call
    timeout: 10s
    catch:
      - error_equals: ["ServiceUnavailable", "TimeoutError"]
        next: CallFallbackService
    next: Success
 
  - name: CallFallbackService
    type: Task
    resource: urn:cascade:activity:fallback_api_call
    timeout: 30s
    catch:
      - error_equals: ["States.ALL"]
        next: DegradedMode
    next: Success
 
  - name: DegradedMode
    type: Task
    resource: urn:cascade:activity:cached_response
    next: Success
 
  - name: Success
    type: Task
    end: true
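
The same fallback chain can be expressed inside a single activity when a separate workflow state per service is not needed. A minimal sketch, assuming hypothetical names (`Source`, `FirstSuccessful`, `ErrUnavailable`); unlike the workflow above, which falls back from the primary service only on `ServiceUnavailable` and `TimeoutError`, this simplified version falls back on any error.

```go
package main

import "errors"

// ErrUnavailable stands in for the ServiceUnavailable error name.
var ErrUnavailable = errors.New("ServiceUnavailable")

// Source is one way of obtaining a response: primary API, fallback API,
// cached response, and so on.
type Source func() (string, error)

// FirstSuccessful tries each source in order and returns the first successful
// response together with the index of the source that produced it. If every
// source fails, it returns index -1 and all errors joined together.
func FirstSuccessful(sources ...Source) (string, int, error) {
	if len(sources) == 0 {
		return "", -1, errors.New("no sources configured")
	}
	var errs []error
	for i, source := range sources {
		resp, err := source()
		if err == nil {
			return resp, i, nil
		}
		errs = append(errs, err)
	}
	return "", -1, errors.Join(errs...)
}
```

Returning the index lets the caller record which tier served the request, which is useful for spotting a primary service that is quietly failing over all the time.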

Pattern 3: Compensation (Rollback)


states:
  - name: TransferFunds
    type: Task
    resource: urn:cascade:activity:debit_account
    parameters:
      from_account: "{{ $.source }}"
      amount: "{{ $.amount }}"
    result_path: $.debit_transaction
    catch:
      - error_equals: ["InsufficientFunds"]
        next: TransferFailed
    next: CreditAccount
 
  - name: CreditAccount
    type: Task
    resource: urn:cascade:activity:credit_account
    parameters:
      to_account: "{{ $.destination }}"
      amount: "{{ $.amount }}"
    catch:
      - error_equals: ["States.ALL"]
        next: CompensateDebit  # Rollback the debit
    next: Success
 
  - name: CompensateDebit
    type: Task
    resource: urn:cascade:activity:refund_account
    parameters:
      account: "{{ $.source }}"
      amount: "{{ $.amount }}"
      original_transaction: "{{ $.debit_transaction }}"
    next: TransferFailed
 
  - name: TransferFailed
    type: Task
    end: true
 
  - name: Success
    type: Task
    end: true
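
The compensation flow generalizes to any number of steps: when a step fails, undo the steps that already succeeded, most recent first. A minimal sketch of that idea, with hypothetical names (`Step`, `RunSaga`, `ErrCreditFailed`) and no persistence, so it illustrates only the ordering and not a durable implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrCreditFailed is an example failure used to trigger compensation.
var ErrCreditFailed = errors.New("credit failed")

// Step pairs an action with the compensation that undoes it.
type Step struct {
	Name       string
	Do         func() error
	Compensate func() error // nil when there is nothing to undo
}

// RunSaga executes steps in order. If a step fails, it runs the compensations
// of the steps that already succeeded, most recent first, and returns the
// original error. A failing compensation aborts the rollback and is reported.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		err := s.Do()
		if err == nil {
			continue
		}
		for j := i - 1; j >= 0; j-- {
			undo := steps[j].Compensate
			if undo == nil {
				continue
			}
			if cerr := undo(); cerr != nil {
				return fmt.Errorf("step %s failed (%v); compensating %s also failed: %w",
					s.Name, err, steps[j].Name, cerr)
			}
		}
		return fmt.Errorf("step %s failed: %w", s.Name, err)
	}
	return nil
}
```

In the transfer example, the debit would be the first step with the refund as its compensation, and the credit would be the second step. Compensations should themselves be idempotent, because a retried rollback may run them more than once.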

Idempotency for Safe Retries

Idempotency Key Pattern


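// Requires imports: "context", "errors", "fmt".
// ErrChargeNotFound is an assumed sentinel that the db layer returns when no
// charge exists for the key.
func ChargePayment(ctx context.Context, input *PaymentInput) (*ChargeOutput, error) {
    // Use idempotency key to prevent duplicate charges on retry
    idempotencyKey := input.IdempotencyKey  // Same on retry

    // Check if we already processed this
    existing, err := db.GetChargeByIdempotencyKey(ctx, idempotencyKey)
    if err == nil {
        // Already processed, return cached result
        return existing, nil
    }
    if !errors.Is(err, ErrChargeNotFound) {
        // The lookup itself failed: surface the error instead of charging blindly
        return nil, fmt.Errorf("look up charge %q: %w", idempotencyKey, err)
    }

    // First time: perform charge
    result, err := stripe.Charge(ctx, &stripe.ChargeParams{
        Amount: input.Amount,
        Idempotent: idempotencyKey,  // Stripe handles idempotency
    })
    if err != nil {
        return nil, err
    }

    // Store result for retry safety
    if err := db.SaveCharge(ctx, idempotencyKey, result); err != nil {
        // The charge went through; a retry reuses the same key, so it cannot double-charge
        return nil, fmt.Errorf("save charge %q: %w", idempotencyKey, err)
    }

    return result, nil
}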
 
// In CDL:
// - name: ChargePayment
//   type: Task
//   parameters:
//     amount: "{{ $.amount }}"
//     idempotency_key: "{{ workflow.execution_id }}-payment"

Circuit Breaker Pattern

Protecting Against Cascading Failures


import (
    "context"
    "fmt"
    "sync"
    "time"
)

type CircuitBreaker struct {
    mu               sync.Mutex // guards the fields below
    MaxFailures      int
    ResetTimeout     time.Duration
    ConsecutiveFails int
    LastFailTime     time.Time
    State            string // "CLOSED", "OPEN", "HALF_OPEN"
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    cb.mu.Lock()
    if cb.State == "OPEN" {
        if wait := cb.ResetTimeout - time.Since(cb.LastFailTime); wait > 0 {
            cb.mu.Unlock()
            return fmt.Errorf("circuit breaker OPEN (retry in %v)", wait)
        }
        // Reset timeout elapsed: let a trial call through
        cb.State = "HALF_OPEN"
    }
    cb.mu.Unlock()

    // Run the call without holding the lock so slow calls do not block others
    err := fn()

    cb.mu.Lock()
    defer cb.mu.Unlock()

    if err != nil {
        cb.ConsecutiveFails++
        cb.LastFailTime = time.Now()

        // A failed trial call re-opens the circuit immediately
        if cb.State == "HALF_OPEN" || cb.ConsecutiveFails >= cb.MaxFailures {
            cb.State = "OPEN"
        }
        return err
    }

    // Success: reset
    cb.ConsecutiveFails = 0
    cb.State = "CLOSED"
    return nil
}

// Usage in activity
// Assumes externalAPI.Call returns (*APIOutput, error).
func CallExternalAPI(ctx context.Context, input *APIInput) (*APIOutput, error) {
    var output *APIOutput
    err := circuitBreaker.Call(func() error {
        var callErr error
        output, callErr = externalAPI.Call(ctx, input)
        return callErr
    })
    if err != nil {
        return nil, err
    }
    return output, nil
}

Timeouts

Timeout Strategy


# ✅ Good: Realistic timeouts
states:
  - name: FastOperation
    type: Task
    timeout: 5s      # Local operation
  
  - name: APICall
    type: Task
    timeout: 30s     # External API
  
  - name: HumanTask
    type: HumanTask
    timeout: 24h     # User decision
  
  - name: Database
    type: Task
    timeout: 10s     # Database operation
 
# ❌ Bad: Unrealistic timeouts
  - name: SlowAPI
    type: Task
    timeout: 100ms   # Too short for external API
  
  - name: DataProcess
    type: Task
    timeout: 1s      # Too short for processing
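
Inside an activity, the same budget is usually enforced with a context deadline so that the work stops when the workflow-level timeout would fire. A minimal sketch using the standard `context` package; `RunWithTimeout`, `ErrTimeout`, and the two example activities are hypothetical names, and mapping the deadline to a `TimeoutError`-style sentinel is an assumption about how an activity would report it.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// ErrTimeout stands in for the TimeoutError name used in catch handlers.
var ErrTimeout = errors.New("TimeoutError")

// RunWithTimeout runs fn with a deadline of d and returns ErrTimeout if the
// deadline passes first. fn should watch ctx so it stops working once the
// deadline is reached; otherwise its goroutine keeps running in the background.
func RunWithTimeout(d time.Duration, fn func(ctx context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks
	go func() { done <- fn(ctx) }()

	select {
	case err := <-done:
		if errors.Is(err, context.DeadlineExceeded) {
			return ErrTimeout // fn noticed the deadline itself
		}
		return err
	case <-ctx.Done():
		return ErrTimeout // deadline passed while fn was still running
	}
}

// SlowActivity is an example activity that never finishes on its own; it
// returns only when its context is cancelled.
func SlowActivity(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// FastActivity is an example activity that succeeds immediately.
func FastActivity(_ context.Context) error { return nil }
```

For example, `RunWithTimeout(5*time.Second, FastActivity)` returns nil, while `RunWithTimeout(20*time.Millisecond, SlowActivity)` returns `ErrTimeout`. Keeping the in-code deadline slightly below the state's `timeout` gives the activity a chance to report the error itself.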

Timeout Action


- name: LongRunningTask
  type: HumanTask
  timeout: 24h
  timeout_action: ESCALATE_TO_MANAGER
  # Options: ESCALATE_*, CANCEL, AUTO_APPROVE, AUTO_DENY

Error Monitoring

Error Metrics


# Error rate by activity
sum by (activity) (rate(cascade_activity_errors_total[5m]))

# Error rate by type
sum by (error_type) (rate(cascade_activity_errors_total[5m]))

# Permanent vs transient errors
cascade_activity_errors_total{error_classification="permanent"}
cascade_activity_errors_total{error_classification="transient"}

# Recovery rate (recovered workflows per second)
rate(cascade_workflow_recovered_total[5m])

Error Alerts


alerts:
  - name: HighErrorRate
    condition: "rate(cascade_activity_errors_total[5m]) > 0.05"
    severity: warning
  
  - name: PermanentErrorIncrease
    condition: |
      rate(cascade_activity_errors_total{error_classification="permanent"}[5m]) > 0.01
    severity: critical
    action: page_oncall
  
  - name: RecoveryFailure
    condition: |
      cascade_workflow_failed_total - cascade_workflow_recovered_total > 10
    severity: critical

Best Practices

✅ DO:

Classify errors as permanent/transient
Retry transient errors only
Use exponential backoff
Implement idempotency
Set realistic timeouts
Use circuit breakers
Catch errors explicitly
Log error context

❌ DON’T:

Retry permanent errors
Use linear backoff
Retry without idempotency
Forget error logging
Set very long timeouts
Retry indefinitely
Rely on a catch-all alone (list specific errors first, States.ALL last)
Suppress error details

Updated: October 29, 2025
Version: 1.0