Best Practices: Error Handling & Recovery
For: Developers building resilient workflows
Level: Advanced
Time to read: 40 minutes
Reference: CSL_semantics.dspec.yaml, errors.dspec.yaml
This guide covers error classification, retry strategies, and recovery patterns based on CSL semantics.
Error Classification Model
Permanent vs Transient Errors
From the CSL Error Handling Model (specs/shared/csl_semantics.dspec.yaml):
    ┌────────────────────────────┐
    │        Error Occurs        │
    └─────────────┬──────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
        ▼                   ▼
    PERMANENT           TRANSIENT
 (Do NOT retry)   (Retry with backoff)
        │                   │
        ├─ ValidationError  ├─ ServiceUnavailable
        ├─ AuthorizedError  ├─ TimeoutError
        ├─ NotFoundError    ├─ NetworkError
        └─ InvalidInput     └─ ConnectionError
Behavior:
- Permanent Errors: Immediately transition to Catch handler (no retry)
- Transient Errors: Retry according to policy, then Catch handler
Retry Policy Configuration
Retry → Catch Precedence
From specs/shared/csl_semantics.dspec.yaml:
1. Activity fails
2. Retry policy is applied (retries N times with backoff)
3. After MaxAttempts exhausted, Catch handler is evaluated
4. First matching Catch handler transitions to error recovery state
5. If no Catch matches, workflow fails
Default Retry Policy
retries:
initial_interval: 1s
max_interval: 60s
multiplier: 2.0
default:
max_attempts: 3
Timeline:
Attempt 1: 0s (immediate)
Attempt 2: 1s after attempt 1 fails
Attempt 3: 2s after attempt 2 fails
(stops here: max_attempts is 3, so the next interval of 4s is never used)
Customized Retry Strategy
# ✅ Good: Conservative for external APIs
- name: CallPaymentAPI
  type: Task
  timeout: 30s
  retries:
    max_attempts: 3
    backoff:
      initial_interval: 2s
      max_interval: 30s
      multiplier: 2.0
# ❌ Bad: Too aggressive, wastes resources
- name: CallPaymentAPI
  type: Task
  timeout: 30s
  retries:
    max_attempts: 10 # Too many
    backoff:
      initial_interval: 100ms # Too short
      multiplier: 1.5 # Too slow progression
Catch Handlers
Error Catching Pattern
states:
  - name: ProcessPayment
    type: Task
    resource: urn:cascade:activity:charge_card
    parameters:
      amount: "{{ $.total }}"
    timeout: 30s
    retries:
      max_attempts: 3
      backoff:
        initial_interval: 2s
        max_interval: 30s
        multiplier: 2.0
    # Catch handlers for different errors
    catch:
      - error_equals: ["ValidationError"]
        result_path: $.error
        next: HandleValidationError
      - error_equals: ["TimeoutError", "ConnectionError"]
        result_path: $.error
        next: RetryLater
      - error_equals: ["States.ALL"] # Catch-all
        result_path: $.error
        next: HandleUnexpectedError
    next: ProcessSuccess
  - name: HandleValidationError
    type: Task
    resource: urn:cascade:activity:notify_invalid_payment
    end: true
  - name: RetryLater
    type: Wait
    duration: 5m
    next: ProcessPayment # Retry the payment
  - name: ProcessSuccess
    type: Task
    end: true
Error Recovery Patterns
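Every pattern in this section depends on the Retry → Catch precedence: retries run first, and only then is the first matching Catch handler chosen. The following Go sketch illustrates that control flow under stated assumptions. `NamedError`, `CatchHandler`, and `RunState` are hypothetical names, backoff sleeps are omitted, and `States.ALL` is treated as a wildcard as in the YAML examples.

```go
package main

import "errors"

// NamedError is a hypothetical error type carrying the error name used in
// error_equals and a flag saying whether the error is transient.
type NamedError struct {
	Name      string
	Transient bool
}

func (e *NamedError) Error() string { return e.Name }

// CatchHandler mirrors one entry of a catch list.
type CatchHandler struct {
	ErrorEquals []string
	Next        string
}

// matchCatch returns the Next state of the first handler that lists
// errName or the States.ALL wildcard.
func matchCatch(handlers []CatchHandler, errName string) (string, bool) {
	for _, h := range handlers {
		for _, name := range h.ErrorEquals {
			if name == errName || name == "States.ALL" {
				return h.Next, true
			}
		}
	}
	return "", false
}

// RunState runs activity up to maxAttempts times, retrying only transient
// errors, then evaluates the Catch handlers in order. It returns the name
// of the next state, or an error if no handler matches (workflow fails).
func RunState(activity func() error, maxAttempts int, handlers []CatchHandler, onSuccess string) (string, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = activity(); err == nil {
			return onSuccess, nil
		}
		var ne *NamedError
		if !errors.As(err, &ne) || !ne.Transient {
			break // permanent error: skip the remaining retries
		}
		// A real engine would sleep for the backoff interval here.
	}
	name := "Unknown"
	var ne *NamedError
	if errors.As(err, &ne) {
		name = ne.Name
	}
	if next, ok := matchCatch(handlers, name); ok {
		return next, nil
	}
	return "", err
}
```

Because matching stops at the first hit, a `States.ALL` entry placed before a specific entry would shadow it, so the catch-all belongs last in the list.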
Pattern 1: Graceful Degradation
states:
  - name: GetUserPreferences
    type: Task
    resource: urn:cascade:activity:fetch_preferences
    timeout: 5s
    catch:
      - error_equals: ["TimeoutError"]
        result_path: $.error
        next: UseDefaults
    next: ProcessWithPreferences
  - name: UseDefaults
    type: Task
    resource: urn:cascade:activity:apply_default_preferences
    next: ProcessWithPreferences
  - name: ProcessWithPreferences
    type: Task
    end: true
Pattern 2: Fallback Service
states:
  - name: CallPrimaryService
    type: Task
    resource: urn:cascade:activity:primary_api_call
    timeout: 10s
    catch:
      - error_equals: ["ServiceUnavailable", "TimeoutError"]
        next: CallFallbackService
    next: Success
  - name: CallFallbackService
    type: Task
    resource: urn:cascade:activity:fallback_api_call
    timeout: 30s
    catch:
      - error_equals: ["States.ALL"]
        next: DegradedMode
    next: Success
  - name: DegradedMode
    type: Task
    resource: urn:cascade:activity:cached_response
    next: Success
  - name: Success
    type: Task
    end: true
Pattern 3: Compensation (Rollback)
states:
  - name: TransferFunds
    type: Task
    resource: urn:cascade:activity:debit_account
    parameters:
      from_account: "{{ $.source }}"
      amount: "{{ $.amount }}"
    result: $.debit_transaction
    catch:
      - error_equals: ["InsufficientFunds"]
        next: TransferFailed
    next: CreditAccount
  - name: CreditAccount
    type: Task
    resource: urn:cascade:activity:credit_account
    parameters:
      to_account: "{{ $.destination }}"
      amount: "{{ $.amount }}"
    catch:
      - error_equals: ["States.ALL"]
        next: CompensateDebit # Rollback the debit
    next: Success
  - name: CompensateDebit
    type: Task
    resource: urn:cascade:activity:refund_account
    parameters:
      account: "{{ $.source }}"
      amount: "{{ $.amount }}"
      original_transaction: "{{ $.debit_transaction }}"
    next: TransferFailed
  - name: TransferFailed
    type: Task
    end: true
  - name: Success
    type: Task
    end: true
Idempotency for Safe Retries
Idempotency Key Pattern
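func ChargePayment(ctx context.Context, input *PaymentInput) (*ChargeOutput, error) {
	// Use an idempotency key to prevent duplicate charges on retry.
	// The key is the same on every retry of the same logical charge.
	idempotencyKey := input.IdempotencyKey
	// Check if we already processed this key.
	existing, err := db.GetChargeByIdempotencyKey(ctx, idempotencyKey)
	if err == nil {
		// Already processed, return the stored result
		return existing, nil
	}
	// ErrChargeNotFound is a hypothetical sentinel for "no record with this key".
	// Any other lookup error does not prove the charge is new, so do not charge.
	if !errors.Is(err, ErrChargeNotFound) {
		return nil, fmt.Errorf("look up charge %q: %w", idempotencyKey, err)
	}
	// First time: perform the charge
	result, err := stripe.Charge(ctx, &stripe.ChargeParams{
		Amount:     input.Amount,
		Idempotent: idempotencyKey, // Stripe handles idempotency
	})
	if err != nil {
		return nil, err
	}
	// Store the result for retry safety. If the save fails, report it: a retry
	// reuses the same key, so the provider does not charge a second time.
	if err := db.SaveCharge(ctx, idempotencyKey, result); err != nil {
		return nil, fmt.Errorf("save charge %q: %w", idempotencyKey, err)
	}
	return result, nil
}
// In CDL:
// - name: ChargePayment
//   type: Task
//   parameters:
//     amount: "{{ $.amount }}"
//     idempotency_key: "{{ workflow.execution_id }}-payment"
Circuit Breaker Pattern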
Protecting Against Cascading Failures
import (
	"context"
	"fmt"
	"sync"
	"time"
)
type CircuitBreaker struct {
	MaxFailures      int
	ResetTimeout     time.Duration
	ConsecutiveFails int
	LastFailTime     time.Time
	State            string // "CLOSED", "OPEN", "HALF_OPEN"
	mu               sync.Mutex // Guards the mutable fields above
}
func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	if cb.State == "OPEN" {
		remaining := cb.ResetTimeout - time.Since(cb.LastFailTime)
		if remaining > 0 {
			cb.mu.Unlock()
			return fmt.Errorf("circuit breaker OPEN (retry in %v)", remaining)
		}
		// Reset timeout elapsed: allow a trial call
		cb.State = "HALF_OPEN"
	}
	cb.mu.Unlock()
	// Run the call outside the lock so a slow call does not block other callers
	err := fn()
	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.ConsecutiveFails++
		cb.LastFailTime = time.Now()
		// A failed trial call in HALF_OPEN reopens the circuit immediately
		if cb.State == "HALF_OPEN" || cb.ConsecutiveFails >= cb.MaxFailures {
			cb.State = "OPEN"
		}
		return err
	}
	// Success: reset
	cb.ConsecutiveFails = 0
	cb.State = "CLOSED"
	return nil
}
// Usage in activity (assumes externalAPI.Call returns (*APIOutput, error))
func CallExternalAPI(ctx context.Context, input *APIInput) (*APIOutput, error) {
	var output *APIOutput
	err := circuitBreaker.Call(func() error {
		var callErr error
		output, callErr = externalAPI.Call(ctx, input)
		return callErr
	})
	if err != nil {
		return nil, err
	}
	return output, nil
}
Timeouts
Timeout Strategy
# ✅ Good: Realistic timeouts
states:
  - name: FastOperation
    type: Task
    timeout: 5s # Local operation
  - name: APICall
    type: Task
    timeout: 30s # External API
  - name: HumanTask
    type: HumanTask
    timeout: 24h # User decision
  - name: Database
    type: Task
    timeout: 10s # Database operation
  # ❌ Bad: Unrealistic timeouts
  - name: SlowAPI
    type: Task
    timeout: 100ms # Too short for external API
  - name: DataProcess
    type: Task
    timeout: 1s # Too short for processing
Timeout Action
- name: LongRunningTask
  type: HumanTask
  timeout: 24h
  timeout_action: ESCALATE_TO_MANAGER
  # Options: ESCALATE_*, CANCEL, AUTO_APPROVE, AUTO_DENY
Error Monitoring
Error Metrics
# Error rate by activity
sum by (activity) (rate(cascade_activity_errors_total[5m]))
# Error rate by type
sum by (error_type) (rate(cascade_activity_errors_total[5m]))
# Permanent vs transient errors
cascade_activity_errors_total{error_classification="permanent"}
cascade_activity_errors_total{error_classification="transient"}
# Recovery rate (recovered workflows per second)
rate(cascade_workflow_recovered_total[5m])
Error Alerts
alerts:
  - name: HighErrorRate
    condition: "rate(cascade_activity_errors_total[5m]) > 0.05"
    severity: warning
  - name: PermanentErrorIncrease
    condition: |
      rate(cascade_activity_errors_total{error_classification="permanent"}[5m]) > 0.01
    severity: critical
    action: page_oncall
  - name: RecoveryFailure
    condition: |
      cascade_workflow_failed_total - cascade_workflow_recovered_total > 10
    severity: critical
Best Practices
✅ DO:
- Classify errors as permanent/transient
- Retry transient errors only
- Use exponential backoff
- Implement idempotency
- Set realistic timeouts
- Use circuit breakers
- Catch errors explicitly
- Log error context
❌ DON’T:
- Retry permanent errors
- Use linear backoff
- Retry without idempotency
- Forget error logging
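- Set unrealistic timeouts (too short, or needlessly long for automated tasks)
- Retry indefinitely
- Rely on a catch-all alone (list specific errors first, catch-all last)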
- Suppress error details
Updated: October 29, 2025
Version: 1.0