Monitoring & Observability

Set up system monitoring, health checks, and observability for DeepChain

Monitoring Your Workflows

Keep your DeepChain workflows running smoothly with built-in monitoring and health checks.

Status: Production-Ready ✅ Version: 1.0 Last Updated: November 2025

What Gets Monitored

DeepChain automatically tracks the following (see the config sketch after this list):

  • Workflow executions — When workflows run, how long they take, success/failure rates
  • Node performance — Which nodes are slow, which fail most often
  • System health — API server, database, message queue, workers
  • Error rates — Which errors are most common
  • Retry activity — How often retries succeed/fail
  • Cost metrics (for AI nodes) — Token usage and expenses
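
The exact set of configuration keys may vary by DeepChain version. As a rough sketch, the categories above might map onto toggles like this; only enabled, trackExecutionTime, trackErrorRates, and alertOnFailure appear in the Quick Start below, and the remaining keys are illustrative assumptions rather than documented fields.

# Sketch only: keys marked "assumed" are not confirmed by this guide
monitoring:
  enabled: true
  trackExecutionTime: true      # workflow executions
  trackErrorRates: true         # error rates
  trackNodePerformance: true    # assumed: per-node duration and failure stats
  trackRetries: true            # assumed: retry success/failure activity
  trackCosts: true              # assumed: token usage and AI spend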

Quick Start: Set Up Monitoring

1. Enable Execution Metrics

# In your workflow config
monitoring:
  enabled: true
  trackExecutionTime: true
  trackErrorRates: true
  alertOnFailure: true

2. Check Dashboard

Navigate to your DeepChain dashboard:

  • Executions — Recent workflow runs with status
  • Errors — Most common errors across all workflows
  • Performance — Slowest nodes, longest workflows
  • Health — System component status

3. Set Alerts

Get notified when things go wrong:

alerts:
  - name: "High Error Rate"
    condition: "errorRate > 10%"
    notify: "email@company.com"

  - name: "Slow Workflow"
    condition: "executionTime > 60s"
    notify: "slack-channel"

  - name: "Service Down"
    condition: "systemHealth.api == down"
    notify: "pagerduty"

Understanding Metrics

Execution Metrics

Success Rate — % of executions that completed successfully

Good: > 95%
Acceptable: 90-95%
Warning: < 90%

Execution Time — How long workflows take to complete

Good: < 5 seconds
Acceptable: 5-30 seconds
Warning: > 30 seconds

Error Rate — % of executions that failed

Good: < 5%
Acceptable: 5-10%
Warning: > 10%
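
These thresholds translate directly into alert rules. Here is a minimal sketch reusing the alert syntax from the Quick Start; it sticks to the errorRate and executionTime fields shown there, since a dedicated success-rate field does not appear anywhere in this guide (that threshold is simply the inverse of the error-rate one).

alerts:
  - name: "Error Rate At Warning Level"
    condition: "errorRate > 10%"       # mirrors the warning threshold above
    notify: "team@company.com"

  - name: "Executions At Warning Level"
    condition: "executionTime > 30s"   # mirrors the warning threshold above
    notify: "team@company.com"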

Node Metrics

For each node, you can see:

  • Call count — How many times it executed
  • Success rate — % of calls that succeeded
  • Average duration — Mean execution time
  • P95 duration — 95th percentile (95% of calls finish within this time; the slowest 5% take longer)
  • Error types — What errors occurred

Example:

Get User (API Node)
├─ Calls: 1,523
├─ Success: 98.5%
├─ Avg time: 350ms
├─ P95: 1.2s
└─ Errors:
   ├─ timeout (1.2%)
   └─ connection_error (0.3%)

Common Monitoring Patterns

Pattern 1: Monitor API Calls

Check the health of external APIs:

monitoring:
  nodes:
    - name: "Get User"
      alerts:
        - name: "API Error Rate High"
          condition: "errorRate > 5%"
          notify: "team@company.com"
        - name: "API Slow"
          condition: "p95Duration > 2000ms"

When error rate spikes, something's wrong with the API or your integration.

Pattern 2: Monitor Approval Workflows

Track approval performance:

monitoring:
  nodes:
    - name: "Manager Approval"
      alerts:
        - name: "Approval Timeout"
          condition: "pending > 24h"
          notify: "managers@company.com"
        - name: "High Rejection Rate"
          condition: "rejectionRate > 20%"

Track approval times to improve workflow efficiency.

Pattern 3: Monitor AI Costs

Keep AI expenses under control:

monitoring:
  nodes:
    - name: "AI Agent"
      alerts:
        - name: "Cost Spike"
          condition: "dailyCost > $100"
          notify: "finance@company.com"
        - name: "High Token Usage"
          condition: "tokensPerExecution > 5000"

Monitor token usage to spot inefficient prompts.

Pattern 4: Monitor Retry Patterns

Catch recurring problems:

monitoring:
  alerts:
    - name: "High Retry Rate"
      condition: "retrySuccessRate < 70%"
      notify: "devops@company.com"
      details: "Some nodes are failing and then recovering"

    - name: "Persistent Failures"
      condition: "retryExhausted > 10"
      notify: "on-call@company.com"
      details: "Retries not helping - need investigation"

If retries aren't helping, there's an underlying issue.

Dashboard Widgets

Execution Overview

This Week: 2,345 executions
✅ Success: 2,298 (98.0%)
❌ Failed: 47 (2.0%)
⏱️ Avg Time: 2.3s

Error Top 5

1. timeout (28 errors)
2. database_lock (15 errors)
3. rate_limit (12 errors)
4. connection_refused (8 errors)
5. invalid_response (4 errors)

Slowest Nodes

1. "Get User Data"1.2s average
2. "AI Analysis"4.5s average
3. "Database Query"650ms average
4. "Send Email"2.1s average
5. "Format Report"890ms average

System Health

API Server:     ✅ Healthy
Database:       ✅ Healthy
Message Queue:  ✅ Healthy
Workers:        ✅ Healthy (12 active)

Setting Alerts

Email Alert

alerts:
  - name: "Workflow Failed"
    condition: "executionStatus == failed"
    notify:
      type: "email"
      recipients: ["team@company.com"]
      subject: "DeepChain: Workflow {{ workflow }} failed"

Slack Alert

alerts:
  - name: "High Error Rate"
    condition: "errorRate > 10%"
    notify:
      type: "slack"
      channel: "#alerts"
      message: "Error rate is {{ errorRate }}% (threshold: 10%)"

PagerDuty Alert (Critical)

alerts:
  - name: "API Server Down"
    condition: "systemHealth.api == down"
    severity: "critical"
    notify:
      type: "pagerduty"
      serviceId: "abc123"

Interpreting Metrics

"Success Rate is 95% — Is That Good?"

Context matters (a threshold sketch follows this list):

  • 95% for an API call? → Probably acceptable, depending on your error tolerance
  • 95% for an approval workflow? → Might mean approvals are being missed
  • 95% for critical payments? → Not acceptable; investigate
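
One way to encode that context is to scope alerts per workflow, with tighter thresholds on critical paths. This is a sketch: the workflows: scoping key is an assumption that mirrors the nodes: scoping used in the patterns above, and the workflow names are hypothetical.

monitoring:
  workflows:                            # assumed key, mirroring "nodes:" above
    - name: "Process Payment"           # hypothetical critical workflow
      alerts:
        - name: "Payment Failures"
          condition: "errorRate > 1%"   # near-zero tolerance
          notify: "on-call@company.com"
    - name: "Enrich Profile"            # hypothetical best-effort workflow
      alerts:
        - name: "Enrichment Degraded"
          condition: "errorRate > 10%"  # looser threshold is fine here
          notify: "team@company.com"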

"One Node is Slow"

Check:

  1. Is it consistently slow or just sometimes?
  2. Did it just start being slow?
  3. Does it depend on data size?
  4. Is the external API slow?

"Error Rate Spiked"

Action:

  1. Look at error types — what changed?
  2. Check when spike started
  3. Correlate with deployments or external changes
  4. Look at detailed error logs

Best Practices

  • ✅ Set baseline metrics — Know what normal looks like (see the sketch after this list)
  • ✅ Monitor trends — Watch for gradual degradation
  • ✅ Alert on anomalies — Sudden changes matter
  • ✅ Regular reviews — Monthly metric reviews catch issues
  • ✅ Track business metrics — Not just technical ones (costs, approvals, etc.)
  • ❌ Don't ignore errors — Even "rare" ones (1%) matter at scale
  • ❌ Don't set alert thresholds too tight — Alert fatigue trains you to ignore them
  • ❌ Don't forget about performance — Speed matters as much as success
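
A practical way to combine the first three practices, reusing the alert syntax shown earlier: measure a few weeks of normal traffic, then set thresholds at roughly two to three times the baseline instead of guessing round numbers. The baseline figures in the comments are illustrative.

# Illustrative baseline over ~4 weeks: ~2.3s average execution time, ~2% error rate
alerts:
  - name: "Error Rate Above Baseline"
    condition: "errorRate > 6%"        # roughly 3x the observed ~2% baseline
    notify: "team@company.com"

  - name: "Executions Slower Than Baseline"
    condition: "executionTime > 7s"    # roughly 3x the observed ~2.3s average
    notify: "team@company.com"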

Common Issues & Solutions

Issue: "Success rate is high but workflows are slow"

Cause: Individual nodes are taking too long, not failing. Solution: Check the P95 duration of each node and optimize the slowest ones.

Issue: "Same error keeps recurring"

Cause: Not transient — a systemic problem. Solution: Check the retry success rate. If retries don't help, fix the root cause.

Issue: "Alerts aren't working"

Cause: Alert condition is too strict or the notifier is misconfigured. Solution: Test with a manual alert, such as the sketch below, and check your notification channels (Slack auth, email).
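
One way to verify the notification path end to end is a temporary test alert. This is a sketch reusing the alert syntax above; the condition is deliberately loose so it fires on the next execution, and it should be removed once delivery is confirmed.

alerts:
  - name: "TEST: notification delivery check"
    condition: "executionTime > 0s"    # fires on any execution
    notify:
      type: "slack"
      channel: "#alerts"
      message: "Test alert from DeepChain: notification channel is working"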

Issue: "Can't find a specific error in logs"

Cause: Logs scrolled off or the error is infrequent. Solution: Check the Errors dashboard and search by workflow and date range.

Next Steps

  • Check your dashboard — See what's currently running
  • Set baseline metrics — Understand normal performance
  • Create initial alerts — Start with critical issues
  • Review weekly — Build habit of checking metrics
  • Optimize slowest nodes — Fix performance issues
  • Investigate spikes — Understand what changed

Related: