Monitoring & Observability
Set up system monitoring, health checks, and observability for DeepChain
Monitoring Your Workflows
Keep your DeepChain workflows running smoothly with built-in monitoring and health checks.
Status: Production-Ready ✅ Version: 1.0 Last Updated: November 2025
What Gets Monitored
DeepChain automatically tracks:
- Workflow executions — When workflows run, how long they take, success/failure rates
- Node performance — Which nodes are slow, which fail most often
- System health — API server, database, message queue, workers
- Error rates — Which errors are most common
- Retry activity — How often retries succeed/fail
- Cost metrics (for AI nodes) — Token usage and expenses
Quick Start: Set Up Monitoring
1. Enable Execution Metrics
# In your workflow config
monitoring:
  enabled: true
  trackExecutionTime: true
  trackErrorRates: true
  alertOnFailure: true
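The options above cover execution time and error rates. If you also want retry activity and AI cost tracking from the start, the shape might look like the sketch below; `trackRetries` and `trackCosts` are illustrative assumptions, not confirmed option names.
# Hypothetical extension of the workflow config above;
# trackRetries and trackCosts are assumed flag names, not documented options.
monitoring:
  enabled: true
  trackExecutionTime: true
  trackErrorRates: true
  trackRetries: true   # assumed: retry success/failure counts
  trackCosts: true     # assumed: token usage and spend for AI nodes
  alertOnFailure: true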
2. Check Dashboard
Navigate to your DeepChain dashboard:
- Executions — Recent workflow runs with status
- Errors — Most common errors across all workflows
- Performance — Slowest nodes, longest workflows
- Health — System component status
3. Set Alerts
Get notified when things go wrong:
alerts:
  - name: "High Error Rate"
    condition: "errorRate > 10%"
    notify: "email@company.com"
  - name: "Slow Workflow"
    condition: "executionTime > 60s"
    notify: "slack-channel"
  - name: "Service Down"
    condition: "systemHealth.api == down"
    notify: "pagerduty"
Understanding Metrics
Execution Metrics
Success Rate — % of executions that completed successfully
- Good: > 95%
- Acceptable: 90-95%
- Warning: < 90%

Execution Time — How long workflows take to complete
- Good: < 5 seconds
- Acceptable: 5-30 seconds
- Warning: > 30 seconds

Error Rate — % of executions that failed
- Good: < 5%
- Acceptable: 5-10%
- Warning: > 10%
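One way to put these thresholds to work is to alert at the warning boundaries. A minimal sketch, reusing the alert fields shown in this guide; the `successRate` metric name in a condition is an assumption.
# Workflow-level alerts at the warning thresholds above.
# successRate is an assumed metric name; errorRate and executionTime appear elsewhere in this guide.
alerts:
  - name: "Success Rate Below 90%"
    condition: "successRate < 90%"
    notify: "team@company.com"
  - name: "Executions Slower Than 30s"
    condition: "executionTime > 30s"
    notify: "team@company.com"
  - name: "Error Rate Above 10%"
    condition: "errorRate > 10%"
    notify: "team@company.com"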
Node Metrics
For each node, you can see:
- Call count — How many times it executed
- Success rate — % times it succeeded
- Average duration — Mean execution time
- P95 duration — 95th percentile (95% of calls finish at or below this time)
- Error types — What errors occurred
Example:
Get User (API Node)
├─ Calls: 1,523
├─ Success: 98.5%
├─ Avg time: 350ms
├─ P95: 1.2s
└─ Errors:
   ├─ timeout (1.2%)
   └─ connection_error (0.3%)
Common Monitoring Patterns
Pattern 1: Monitor API Calls
Check the health of external APIs:
monitoring:
  nodes:
    - name: "Get User"
      alerts:
        - name: "API Error Rate High"
          condition: "errorRate > 5%"
          notify: "team@company.com"
        - name: "API Slow"
          condition: "p95Duration > 2000ms"
When error rate spikes, something's wrong with the API or your integration.
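If the spike comes from one error type rather than overall failures, a narrower condition can separate timeouts from connection problems. This is a sketch under the assumption that conditions can reference a specific error type; `errorTypes.timeout` is not a confirmed field.
# Hypothetical: alert on a single error type for one node.
# errorTypes.timeout is an assumed condition path, not a documented field.
monitoring:
  nodes:
    - name: "Get User"
      alerts:
        - name: "Timeouts Rising"
          condition: "errorTypes.timeout > 2%"
          notify: "team@company.com"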
Pattern 2: Monitor Approval Workflows
Track approval performance:
monitoring:
  nodes:
    - name: "Manager Approval"
      alerts:
        - name: "Approval Timeout"
          condition: "pending > 24h"
          notify: "managers@company.com"
        - name: "High Rejection Rate"
          condition: "rejectionRate > 20%"
Track approval times to improve workflow efficiency.
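To track approval times directly, an additional alert on average time-to-decision could sit alongside the two above. A sketch; `avgApprovalTime` is an assumed metric name used only for illustration.
# Hypothetical addition to the "Manager Approval" node above;
# avgApprovalTime is an assumed metric name.
monitoring:
  nodes:
    - name: "Manager Approval"
      alerts:
        - name: "Approvals Taking Too Long"
          condition: "avgApprovalTime > 8h"
          notify: "managers@company.com"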
Pattern 3: Monitor AI Costs
Keep AI expenses under control:
monitoring:
  nodes:
    - name: "AI Agent"
      alerts:
        - name: "Cost Spike"
          condition: "dailyCost > $100"
          notify: "finance@company.com"
        - name: "High Token Usage"
          condition: "tokensPerExecution > 5000"
Monitor token usage to spot inefficient prompts.
Pattern 4: Monitor Retry Patterns
Catch recurring problems:
monitoring:
  alerts:
    - name: "High Retry Rate"
      condition: "retrySuccessRate < 70%"
      notify: "devops@company.com"
      details: "Some nodes are failing and then recovering"
    - name: "Persistent Failures"
      condition: "retryExhausted > 10"
      notify: "on-call@company.com"
      details: "Retries not helping - need investigation"
If retries aren't helping, there's an underlying issue.
Dashboard Widgets
Execution Overview
This Week: 2,345 executions
✅ Success: 2,298 (98.0%)
❌ Failed: 47 (2.0%)
⏱️ Avg Time: 2.3s
Error Top 5
1. timeout (28 errors)
2. database_lock (15 errors)
3. rate_limit (12 errors)
4. connection_refused (8 errors)
5. invalid_response (4 errors)
Slowest Nodes
1. "Get User Data" — 1.2s average
2. "AI Analysis" — 4.5s average
3. "Database Query" — 650ms average
4. "Send Email" — 2.1s average
5. "Format Report" — 890ms average
System Health
API Server: ✅ Healthy
Database: ✅ Healthy
Message Queue: ✅ Healthy
Workers: ✅ Healthy (12 active)
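The quick-start example alerts on `systemHealth.api`; the same pattern can presumably cover the other components shown above. A sketch in which the `database`, `queue`, and `workers` keys are assumptions extrapolated from that example.
# Sketch: component-down alerts mirroring the systemHealth.api condition shown earlier.
# The database, queue, and workers keys are assumed, not confirmed.
alerts:
  - name: "Database Down"
    condition: "systemHealth.database == down"
    notify: "pagerduty"
  - name: "Message Queue Down"
    condition: "systemHealth.queue == down"
    notify: "pagerduty"
  - name: "No Active Workers"
    condition: "systemHealth.workers == down"
    notify: "pagerduty"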
Setting Alerts
Email Alert
alerts:
  - name: "Workflow Failed"
    condition: "executionStatus == failed"
    notify:
      type: "email"
      recipients: ["team@company.com"]
      subject: "DeepChain: Workflow {{ workflow }} failed"
Slack Alert
alerts:
  - name: "High Error Rate"
    condition: "errorRate > 10%"
    notify:
      type: "slack"
      channel: "#alerts"
      message: "Error rate is {{ errorRate }}% (threshold: 10%)"
PagerDuty Alert (Critical)
alerts:
  - name: "API Server Down"
    condition: "systemHealth.api == down"
    severity: "critical"
    notify:
      type: "pagerduty"
      serviceId: "abc123"
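These channels can be combined into a simple escalation: route the warning-level threshold to Slack and a more severe one to PagerDuty. A sketch using only fields shown above; the `severity: "warning"` value is an assumption by analogy with `"critical"`.
# Sketch: two alerts on the same metric at different thresholds, routed to different channels.
# severity: "warning" is assumed by analogy with the "critical" example above.
alerts:
  - name: "Error Rate Elevated"
    condition: "errorRate > 10%"
    severity: "warning"
    notify:
      type: "slack"
      channel: "#alerts"
  - name: "Error Rate Critical"
    condition: "errorRate > 25%"
    severity: "critical"
    notify:
      type: "pagerduty"
      serviceId: "abc123"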
Interpreting Metrics
"Success Rate is 95% — Is That Good?"
Context matters:
- 95% for an API call? → Probably acceptable, depends on tolerance
- 95% for approval workflow? → Might be missing approvals
- 95% for critical payment? → Not acceptable, investigate
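For that last case, one option is to tighten the alerts in the payment workflow's own config. A minimal sketch, assuming the alert fields shown earlier in this guide apply per workflow config:
# In the payment workflow's config (sketch): stricter thresholds than the defaults.
monitoring:
  enabled: true
  alertOnFailure: true
alerts:
  - name: "Any Payment Failure"
    condition: "executionStatus == failed"
    severity: "critical"
    notify: "on-call@company.com"
  - name: "Payment Error Rate Above 1%"
    condition: "errorRate > 1%"
    notify: "on-call@company.com"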
"One Node is Slow"
Check:
- Is it consistently slow or just sometimes? (a sketch for the occasional case follows this list)
- Did it just start being slow?
- Does it depend on data size?
- Is the external API slow?
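For that first check, comparing P95 against the average helps separate "always slow" from "occasionally slow". This sketch assumes conditions can combine two metrics; `avgDuration` and arithmetic inside conditions are not confirmed features.
# Hypothetical: flag nodes whose tail latency diverges from their average.
# avgDuration and arithmetic inside conditions are assumptions.
monitoring:
  nodes:
    - name: "Get User Data"
      alerts:
        - name: "Occasionally Slow"
          condition: "p95Duration > 3 * avgDuration"
          notify: "team@company.com"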
"Error Rate Spiked"
Action:
- Look at error types — what changed?
- Check when spike started
- Correlate with deployments or external changes
- Look at detailed error logs
Best Practices
- ✅ Set baseline metrics — Know what normal looks like
- ✅ Monitor trends — Watch for gradual degradation
- ✅ Alert on anomalies — Sudden changes matter
- ✅ Regular reviews — Monthly metric reviews catch issues
- ✅ Track business metrics — Not just technical ones (costs, approvals, etc.)
- ❌ Don't ignore errors — Even "rare" ones (1%) matter at scale
- ❌ Don't make alerts too sensitive — Alert fatigue makes you ignore them
- ❌ Don't forget about performance — Speed matters as much as success
Common Issues & Solutions
Issue: "Success rate is high but workflows are slow"
Cause: Individual nodes taking too long, not failures. Solution: Check P95 duration on each node. Optimize the slowest ones.
Issue: "Same error keeps recurring"
Cause: Not transient — systematic problem. Solution: Check retry success rate. If retries don't help, fix root cause.
Issue: "Alerts aren't working"
Cause: Alert condition is too strict or the notifier is misconfigured. Solution: Test with a manual alert. Check notification channels (Slack auth, email).
Issue: "Can't find a specific error in logs"
Cause: Logs scrolled off or error is infrequent. Solution: Check comprehensive error dashboard. Search by workflow, date range.
Next Steps
- Check your dashboard — See what's currently running
- Set baseline metrics — Understand normal performance
- Create initial alerts — Start with critical issues
- Review weekly — Build a habit of checking metrics
- Optimize slowest nodes — Fix performance issues
- Investigate spikes — Understand what changed
Related:
- Error Recovery — Retry and fallback strategies
- Approval Workflows — Track approval performance
- AI Agent Node — Monitor AI costs and performance