Error Recovery Guide
Configure automatic error recovery, retry logic, and fault tolerance
Error Recovery & Retry Logic
When things go wrong in your workflow (API timeout, service down, bad data), you don't want everything to fail. DeepChain has built-in retry and error recovery to keep things running.
Status: Production-Ready ✅ Version: 1.0 Last Updated: November 2025
How Retries Work
When a node fails, you can automatically retry it:
Node executes
↓
❌ Fails
↓
Wait (exponential backoff)
↓
Retry 1 → Fails
↓
Wait longer
↓
Retry 2 → Fails
↓
Wait even longer
↓
Retry 3 → Succeeds!
Each retry waits longer than the previous one, so you don't hammer a service that's temporarily down.
Basic Retry Configuration
Add retry logic to any node:
node:
  # Your normal node config...

  # Retry settings
  retryConfig:
    enabled: true
    maxAttempts: 3          # Try up to 3 times
    initialDelayMs: 1000    # Start with 1 second wait
    backoffMultiplier: 2    # Double the wait each time
    maxDelayMs: 30000       # Never wait more than 30 seconds
Example: with the settings above, the first retry waits 1 second, the second waits 2 seconds, and so on, doubling each time until maxDelayMs caps the wait.
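Assuming the usual exponential-backoff formula (a reasonable reading of these settings; check the DeepChain reference for the exact schedule), the wait before each retry works out to:

# n = retry number (1, 2, 3, ...)
delay(n) = min(initialDelayMs * backoffMultiplier^(n - 1), maxDelayMs)

# With initialDelayMs: 1000, backoffMultiplier: 2, maxDelayMs: 30000:
#   retry 1 → 1000 ms, retry 2 → 2000 ms, retry 3 → 4000 ms, ...
#   from retry 6 onward the wait stays capped at 30000 ms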
Common Retry Patterns
Pattern 1: API Calls (Default)
APIs often have temporary blips. Use moderate retries:
HTTP Request Node:
  retryConfig:
    enabled: true
    maxAttempts: 3
    initialDelayMs: 1000
    backoffMultiplier: 2
Why: APIs recover quickly. 3 attempts cover most transient failures.
Pattern 2: Database Queries
Databases might have locks or maintenance. Use more retries:
Database Query Node:
  retryConfig:
    enabled: true
    maxAttempts: 5
    initialDelayMs: 500
    backoffMultiplier: 1.5
Why: Database issues might take longer to resolve. More retries = better success rate.
Pattern 3: External Services
Third-party services are less predictable. Be aggressive:
External Service Node:
  retryConfig:
    enabled: true
    maxAttempts: 5
    initialDelayMs: 2000
    backoffMultiplier: 2
    maxDelayMs: 60000
Why: You don't control these services, so more retries help.
Pattern 4: Critical Path (No Retry)
For nodes that should fail fast:
Validation Node:
  retryConfig:
    enabled: false
Why: Invalid data won't be fixed by retrying. Fail immediately.
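Putting the four patterns side by side, a single workflow might tune each node differently. The node names below are made up for illustration; the retryConfig values are the suggestions from the patterns above, not requirements:

Fetch Orders API Node:        # Pattern 1: moderate retries for an HTTP call
  retryConfig:
    enabled: true
    maxAttempts: 3
    initialDelayMs: 1000
    backoffMultiplier: 2

Orders Database Node:         # Pattern 2: more, shorter retries for the database
  retryConfig:
    enabled: true
    maxAttempts: 5
    initialDelayMs: 500
    backoffMultiplier: 1.5

Payment Provider Node:        # Pattern 3: aggressive retries for a third-party service
  retryConfig:
    enabled: true
    maxAttempts: 5
    initialDelayMs: 2000
    backoffMultiplier: 2
    maxDelayMs: 60000

Validate Order Node:          # Pattern 4: fail fast, bad data won't fix itself
  retryConfig:
    enabled: false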
Retry Conditions
Control WHEN to retry:
node:
  retryConfig:
    enabled: true
    maxAttempts: 3
    retryOn:
      - "TIMEOUT"              # Timeout errors
      - "CONNECTION_ERROR"     # Network issues
      - "SERVICE_UNAVAILABLE"  # 503 errors
    # Don't retry on:
    dontRetryOn:
      - "UNAUTHORIZED"         # Bad credentials won't be fixed by retrying
      - "NOT_FOUND"            # A missing resource won't appear on retry
Tip: Only retry errors that might be transient. Don't retry auth errors.
Error Handling
When retries are exhausted, what happens next?
Option 1: Fail the Workflow
node:
  retryConfig:
    enabled: true
    maxAttempts: 3
    fallback: "fail"    # Stop entire workflow
Option 2: Use a Default Value
node:
  retryConfig:
    enabled: true
    maxAttempts: 3
    fallback: "default"
    defaultValue:
      status: "unknown"
      data: []
Option 3: Route to Error Path
Use the node's error output port:
┌─────────────┐
│  HTTP Node  │
└──────┬──────┘
       │
       ├─ ✓ Success ───────────────→ Default path
       │
       └─ ✗ Retries exhausted ─────→ Error output port → Error handling node
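In config form, the wiring might look like the sketch below. The outputs, default, and error keys are hypothetical placeholders for illustration; in practice the error port is usually connected in the workflow editor:

Get User Data Node:
  retryConfig:
    enabled: true
    maxAttempts: 3
  # Hypothetical wiring keys, for illustration only:
  outputs:
    default: "Process User Data"    # taken on success
    error: "Log And Alert"          # taken only after all retries are exhausted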
Example Workflows
Example 1: Weather API with Fallback
Get Weather Node:
  url: "https://api.weather.example.com/forecast"
  retryConfig:
    enabled: true
    maxAttempts: 3
    initialDelayMs: 1000
    backoffMultiplier: 2
    # If all retries fail
    fallback: "default"
    defaultValue:
      temp: null
      condition: "unknown"
      error: "Weather service unavailable"
Example 2: Database with Aggressive Retry
Database Node:
  query: "SELECT * FROM users WHERE id = {{ $trigger.userId }}"
  retryConfig:
    enabled: true
    maxAttempts: 5
    initialDelayMs: 500
    backoffMultiplier: 1.5
    retryOn:
      - "TIMEOUT"
      - "DEADLOCK"
Example 3: Critical Path (No Retry)
Validate Email:
  expression: "{{ $trigger.email.includes('@') }}"
  retryConfig:
    enabled: false    # Validation logic doesn't need retries
Monitoring Retries
Check what's being retried. In your logs, you'll see entries like:
[INFO] Executing node: Get User Data
[WARN] Attempt 1 failed: timeout. Retrying in 1000ms...
[WARN] Attempt 2 failed: timeout. Retrying in 2000ms...
[INFO] Attempt 3 succeeded.
Check your workflow metrics:
- How many nodes are being retried?
- Which nodes retry most often?
- Are retries actually succeeding?
Tip: If a node needs retries on every execution, that's a sign of an underlying problem. Investigate!
Best Practices
- ✅ Retry transient errors — Timeouts, network errors, service temporarily down
- ✅ Set reasonable limits — Don't retry forever
- ✅ Use exponential backoff — Don't hammer a struggling service
- ✅ Monitor retry rates — High retry rates = underlying problem
- ✅ Test error paths — Verify the fallback works (see the sketch after this list)
- ❌ Don't retry validation errors — Bad data won't fix itself
- ❌ Don't retry auth failures — Wrong credentials won't work on retry
- ❌ Don't retry forever — Set max attempts
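One low-effort way to exercise an error path, assuming you can point a copy of the node at an address that can't respond (the node name and URL below are made up): run it once and confirm the fallback value flows to the next node.

Get Weather Node (test copy):
  url: "http://localhost:9/unreachable"    # Hypothetical dead endpoint to force failures
  retryConfig:
    enabled: true
    maxAttempts: 2          # Keep the test fast
    initialDelayMs: 100
    fallback: "default"
    defaultValue:
      temp: null
      condition: "unknown"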
Common Mistakes
Mistake 1: Retrying everything
❌ Wrong:
retryConfig:
  enabled: true       # Retries every error, even validation errors!
  maxAttempts: 10
✅ Correct:
retryConfig:
  enabled: true
  maxAttempts: 3
  retryOn: ["TIMEOUT", "CONNECTION_ERROR"]
Mistake 2: Retry without exponential backoff
❌ Wrong:
retryConfig:
  enabled: true
  backoffMultiplier: 1    # Same delay every time!
✅ Correct:
retryConfig:
  enabled: true
  backoffMultiplier: 2    # Double the wait each time
Mistake 3: No fallback strategy
❌ Wrong:
retryConfig:
  enabled: true
  maxAttempts: 3
  # What happens if all retries fail? Unknown!
✅ Correct:
retryConfig:
  enabled: true
  maxAttempts: 3
  fallback: "default"
  defaultValue: { status: "unavailable" }
Next Steps
- Add retry logic — Enable retries on API calls
- Set appropriate limits — Different nodes need different retry configs
- Monitor retry rates — Watch for patterns
- Test error scenarios — Disable a service and verify retries work
- Create fallback paths — Use error output ports for handling exhausted retries
Related:
- Error Output Ports — Handle node errors
- Monitoring Guide — Track retry patterns
- Approval Workflows — Human approval before critical actions