Error Recovery Guide

Configure automatic error recovery, retry logic, and fault tolerance

Error Recovery & Retry Logic

When something goes wrong in your workflow (an API timeout, a service outage, bad data), you don't want the entire run to fail. DeepChain has built-in retry and error recovery to keep things running.

Status: Production-Ready ✅ Version: 1.0 Last Updated: November 2025

How Retries Work

When a node fails, you can automatically retry it:

Node executes
    ↓
❌ Fails
    ↓
Wait (exponential backoff)
    ↓
Retry 1 → Fails
    ↓
Wait longer
    ↓
Retry 2 → Fails
    ↓
Wait even longer
    ↓
Retry 3 → Succeeds!

Each retry waits longer than the previous one, so you don't hammer a service that's temporarily down.
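
DeepChain applies this pattern for you, but if you want to see the mechanics, here is a minimal Python sketch of the same loop. The function and parameter names are ours for illustration, not a DeepChain API:

import time

def call_with_retries(call, max_attempts=3, initial_delay_ms=1000,
                      backoff_multiplier=2, max_delay_ms=30000):
    # Conceptual sketch of retry with exponential backoff; DeepChain
    # runs this logic internally based on your retryConfig.
    delay_ms = initial_delay_ms
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the error
            time.sleep(delay_ms / 1000)
            # Wait longer before the next attempt, but never more than max_delay_ms
            delay_ms = min(delay_ms * backoff_multiplier, max_delay_ms)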

Basic Retry Configuration

Add retry logic to any node:

node:
  # Your normal node config...

  # Retry settings
  retryConfig:
    enabled: true
    maxAttempts: 3        # Try up to 3 times
    initialDelayMs: 1000  # Start with 1 second wait
    backoffMultiplier: 2  # Double the wait each time
    maxDelayMs: 30000     # Never wait more than 30 seconds

Example: First retry waits 1s, second waits 2s, third waits 4s.
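
In other words, the wait before retry n is initialDelayMs × backoffMultiplier^(n−1), capped at maxDelayMs. That formula is an assumption on our part, but it matches the 1s/2s/4s example above and the 30-second cap in the config. A quick Python sketch of the arithmetic:

initial_delay_ms = 1000
backoff_multiplier = 2
max_delay_ms = 30000

for n in range(1, 4):
    # Wait before retry n, capped so it never exceeds maxDelayMs
    delay = min(initial_delay_ms * backoff_multiplier ** (n - 1), max_delay_ms)
    print(f"Retry {n}: wait {delay} ms")  # 1000, 2000, 4000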

Common Retry Patterns

Pattern 1: API Calls (Default)

APIs often have temporary blips. Use moderate retries:

HTTP Request Node:
  retryConfig:
    enabled: true
    maxAttempts: 3
    initialDelayMs: 1000
    backoffMultiplier: 2

Why: APIs recover quickly. 3 attempts cover most transient failures.

Pattern 2: Database Queries

Databases can hit lock contention or maintenance windows. Use more retries:

Database Query Node:
  retryConfig:
    enabled: true
    maxAttempts: 5
    initialDelayMs: 500
    backoffMultiplier: 1.5

Why: Database issues might take longer to resolve. More retries = better success rate.

Pattern 3: External Services

Third-party services are less predictable. Be aggressive:

External Service Node:
  retryConfig:
    enabled: true
    maxAttempts: 5
    initialDelayMs: 2000
    backoffMultiplier: 2
    maxDelayMs: 60000

Why: You don't control these services, so more retries help.

Pattern 4: Critical Path (No Retry)

For nodes that should fail fast:

Validation Node:
  retryConfig:
    enabled: false

Why: Invalid data won't be fixed by retrying. Fail immediately.

Retry Conditions

Control WHEN to retry:

node:
  retryConfig:
    enabled: true
    maxAttempts: 3
    retryOn:
      - "TIMEOUT"           # Timeout errors
      - "CONNECTION_ERROR"  # Network issues
      - "SERVICE_UNAVAILABLE"  # 503 errors
    # Don't retry on:
    dontRetryOn:
      - "UNAUTHORIZED"      # Auth won't fix by retrying
      - "NOT_FOUND"         # 404 won't appear on retry

Tip: Only retry errors that might be transient. Don't retry auth errors.
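
Conceptually, the decision looks like the sketch below. The exact precedence between retryOn and dontRetryOn is an assumption here (dontRetryOn wins, and only codes listed in retryOn are retried); the function is illustrative, not a DeepChain API.

def should_retry(error_code, retry_on, dont_retry_on):
    # Illustrative decision logic for retryOn / dontRetryOn.
    if error_code in dont_retry_on:
        return False                 # auth failures, 404s, etc.: retrying won't help
    return error_code in retry_on    # only listed transient errors are retried

retry_on = ["TIMEOUT", "CONNECTION_ERROR", "SERVICE_UNAVAILABLE"]
dont_retry_on = ["UNAUTHORIZED", "NOT_FOUND"]

print(should_retry("SERVICE_UNAVAILABLE", retry_on, dont_retry_on))  # True
print(should_retry("UNAUTHORIZED", retry_on, dont_retry_on))         # False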

Error Handling

When retries are exhausted, what happens next?

Option 1: Fail the Workflow

node:
  retryConfig:
    enabled: true
    maxAttempts: 3
    fallback: "fail"  # Stop entire workflow

Option 2: Use a Default Value

node:
  retryConfig:
    enabled: true
    maxAttempts: 3
    fallback: "default"
    defaultValue:
      status: "unknown"
      data: []

Option 3: Route to Error Path

Use the node's error output port:

┌─────────────┐
│ HTTP Node   │
└──────┬──────┘
   ✓ Success
      ↓
   Default path

   ✗ Retries exhausted
      ↓
   Error output port → Error handling node
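
Put together, the three options amount to the decision below once retries are exhausted. This is an illustrative Python sketch of the behavior, not DeepChain code; only the fallback values mirror the config keys above.

def handle_exhausted_retries(error, fallback, default_value=None, error_handler=None):
    # Illustrative only: DeepChain applies the configured strategy itself.
    if fallback == "fail":
        raise error                  # Option 1: stop the entire workflow
    if fallback == "default":
        return default_value         # Option 2: continue with a safe default value
    return error_handler(error)      # Option 3: route to the error handling node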

Example Workflows

Example 1: Weather API with Fallback

Get Weather Node:
  url: "https://api.weather.example.com/forecast"
  retryConfig:
    enabled: true
    maxAttempts: 3
    initialDelayMs: 1000
    backoffMultiplier: 2

    # If all retries fail
    fallback: "default"
    defaultValue:
      temp: null
      condition: "unknown"
      error: "Weather service unavailable"

Example 2: Database with Aggressive Retry

Database Node:
  query: "SELECT * FROM users WHERE id = {{ $trigger.userId }}"
  retryConfig:
    enabled: true
    maxAttempts: 5
    initialDelayMs: 500
    backoffMultiplier: 1.5
    retryOn:
      - "TIMEOUT"
      - "DEADLOCK"

Example 3: Critical Path (No Retry)

Validate Email:
  expression: "{{ $trigger.email.includes('@') }}"
  retryConfig:
    enabled: false  # Validation logic doesn't need retry

Monitoring Retries

Check what's being retried:

In your logs, you'll see:

[INFO] Executing node: Get User Data
[WARN] Attempt 1 failed: timeout. Retrying in 1000ms...
[WARN] Attempt 2 failed: timeout. Retrying in 2000ms...
[INFO] Attempt 3 succeeded.

Check your workflow metrics:

  • How many nodes are being retried?
  • Which nodes retry most often?
  • Are retries actually succeeding?

Tip: If a node retries on every execution, that's a sign something is wrong. Investigate!
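
If your metrics don't break retries out per node, you can get a rough count from the logs themselves. A minimal Python sketch, assuming the log format shown above:

import re
from collections import Counter

def count_retries(log_lines):
    # Tally "Retrying in ...ms" warnings per node, using the preceding
    # "[INFO] Executing node: <name>" line to know which node retried.
    retries = Counter()
    current_node = None
    for line in log_lines:
        match = re.search(r"Executing node: (.+)", line)
        if match:
            current_node = match.group(1).strip()
        elif "Retrying in" in line and current_node:
            retries[current_node] += 1
    return retries

sample_log = [
    "[INFO] Executing node: Get User Data",
    "[WARN] Attempt 1 failed: timeout. Retrying in 1000ms...",
    "[WARN] Attempt 2 failed: timeout. Retrying in 2000ms...",
    "[INFO] Attempt 3 succeeded.",
]
print(count_retries(sample_log))  # Counter({'Get User Data': 2})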

Best Practices

  • ✅ Retry transient errors — Timeouts, network errors, service temporarily down
  • ✅ Set reasonable limits — Don't retry forever
  • ✅ Use exponential backoff — Don't hammer a struggling service
  • ✅ Monitor retry rates — High retry rates = underlying problem
  • ✅ Test error paths — Verify the fallback works
  • ❌ Don't retry validation errors — Bad data won't fix itself
  • ❌ Don't retry auth failures — Wrong credentials won't work on retry
  • ❌ Don't retry forever — Set max attempts

Common Mistakes

Mistake 1: Retrying everything

❌ Wrong:

retryConfig:
  enabled: true  # Retries everything, even validation errors!
  maxAttempts: 10

✅ Correct:

retryConfig:
  enabled: true
  maxAttempts: 3
  retryOn: ["TIMEOUT", "CONNECTION_ERROR"]

Mistake 2: Retry without exponential backoff

❌ Wrong:

retryConfig:
  enabled: true
  backoffMultiplier: 1  # Same delay every time!

✅ Correct:

retryConfig:
  enabled: true
  backoffMultiplier: 2  # Double the wait each time

Mistake 3: No fallback strategy

❌ Wrong:

retryConfig:
  enabled: true
  maxAttempts: 3
  # What happens if all retries fail? Unknown!

✅ Correct:

retryConfig:
  enabled: true
  maxAttempts: 3
  fallback: "default"
  defaultValue: { status: "unavailable" }

Next Steps

  • Add retry logic — Enable retries on API calls
  • Set appropriate limits — Different nodes need different retry configs
  • Monitor retry rates — Watch for patterns
  • Test error scenarios — Disable a service and verify retries work
  • Create fallback paths — Use error output ports for handling exhausted retries
