Deployment Lifecycle

CI/CD pipelines, versioning strategies, and rollback procedures

Deployment Lifecycle & Operations

Learn how DeepChain manages deployments: health checks, automatic recovery when things fail, and rollback if needed. This guide covers production-grade operations practices.

The Deployment Journey

Your code goes through these stages:

Code Change
    ↓
Build & Test
    ↓
Deploy (restart containers)
    ↓
Health Checks (auto-verify all services work)
    ↓
Running (monitor continuously)
    ↓
If problem detected → Auto-recover or rollback

Key Concepts

Health Checks

Every 30 seconds, DeepChain checks:

  • Can the database respond?
  • Is the message queue healthy?
  • Are API endpoints working?
  • Is the frontend serving?
  • Are workers processing jobs?

If any check fails, automatic recovery kicks in.
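
To make the loop concrete, here is a minimal sketch of what such a monitor could look like in shell. It assumes ./deepchain health exits non-zero when any check fails (a reasonable assumption, not confirmed by this guide); the real monitor's internals may differ.

# Illustrative monitoring loop -- not DeepChain's actual implementation
while true; do
  if ./deepchain health >/dev/null 2>&1; then
    echo "$(date '+%H:%M:%S') All services healthy"
  else
    echo "$(date '+%H:%M:%S') Health check failed - recovery would start here"
  fi
  sleep 30
done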

Auto-Recovery

When a service fails:

  1. Restart the service (usually fixes it)
  2. Wait 10 seconds
  3. Re-check health
  4. Repeat up to 3 times
  5. If still broken, trigger rollback (optional)

Success rate: ~85% of failures are fixed by restart.
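
Sketched in shell, the retry loop looks roughly like this. It uses only the ./deepchain restart and ./deepchain health commands shown in this guide and assumes health exits non-zero on failure; treat it as an illustration, not the monitor's actual code.

# Illustrative recovery loop for a single service
SERVICE=api
for attempt in 1 2 3; do
  echo "Attempting recovery for $SERVICE (attempt $attempt/3)"
  ./deepchain restart "$SERVICE"
  sleep 10
  if ./deepchain health >/dev/null 2>&1; then
    echo "$SERVICE recovered"
    break
  fi
  [ "$attempt" -eq 3 ] && echo "Max retries reached - escalating (rollback or alert)"
done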

Rollback

If a service won't recover, revert to the previous working version:

  1. Stop current container
  2. Start previous container
  3. Verify it works
  4. Alert the team

Rollback success rate: ~90% — your old version almost always works.

What Services Are Monitored?

Service      What it does                           If it fails
PostgreSQL   Stores workflows, executions, users    Workflows can't run; users can't log in
RabbitMQ     Message queue for workflow jobs        Workflow runs are delayed/lost
API          REST API for all operations            Integrations break; UI stops working
Frontend     Web UI (localhost:3000)                Users can't access DeepChain
Workers      Execute your workflows                 No workflows run

All of these have continuous health checks and auto-recovery.


📊 Deployment States

State Transitions

┌──────────────┐
│   NOT        │
│  DEPLOYED    │
└──────┬───────┘
       │ ./deepchain deploy
       ↓
┌──────────────┐
│  DEPLOYING   │
└──────┬───────┘
       │ Build & Start
       ↓
┌──────────────┐     Health Check Fails
│   HEALTHY    │ ────────────────────────→ ┌──────────────┐
│  (Running)   │                            │  UNHEALTHY   │
└──────┬───────┘                            └──────┬───────┘
       │                                           │
       │ ./deepchain update                        │ Auto-Recovery
       ↓                                           │
┌──────────────┐                                   ↓
│  UPDATING    │                            ┌──────────────┐
└──────┬───────┘                            │  RECOVERING  │
       │                                    └──────┬───────┘
       │ Update Success                            │ Success
       ↓                                           │
┌──────────────┐                                   │
│   HEALTHY    │ ←─────────────────────────────────┤
│ (New Version)│                                   │ Recovery Fails
└──────────────┘                                   │ (after max retries)
                                                   ↓
                                            ┌──────────────┐
                                            │  ROLLING     │
                                            │     BACK     │
                                            └──────┬───────┘
                                                   │ Rollback Success
                                                   ↓
                                            ┌──────────────┐
                                            │   HEALTHY    │
                                            │(Old Version) │
                                            └──────────────┘

State Descriptions

NOT DEPLOYED

  • Status: No containers running
  • Action: Run ./deepchain deploy
  • Next State: DEPLOYING

DEPLOYING

  • Status: Building images and starting containers
  • Duration: 2-5 minutes (depending on caching)
  • Logs: Watch with ./deepchain logs
  • Next State: HEALTHY or UNHEALTHY

HEALTHY

  • Status: All services passing health checks
  • Indicators:
    • PostgreSQL: pg_isready succeeds
    • RabbitMQ: Diagnostics ping succeeds
    • API: HTTP 200 from /health
    • Frontend: HTTP 200/304 from web root
    • Workers: Containers running with recent log activity
  • Monitor: ./deepchain monitor for continuous checking

UNHEALTHY

  • Status: One or more services failing health checks
  • Trigger: Health check failure detected by monitor
  • Auto-Action: Initiate recovery if monitor is running
  • Manual Action: ./deepchain diagnose to investigate

RECOVERING

  • Status: Automatic recovery in progress
  • Actions:
    1. Restart failed service
    2. Wait 10 seconds
    3. Re-run health check
    4. Increment failure counter
  • Max Attempts: 3 (configurable)
  • Next State: HEALTHY (success) or ROLLING BACK (failure)

ROLLING BACK

  • Status: Reverting to previous working version
  • Trigger: Recovery failed after max retries (if --rollback-on-fail enabled)
  • Process:
    1. Read previous image from deployment state
    2. Pull previous Docker image
    3. Restart service with old image
    4. Verify health
  • Next State: HEALTHY (old version) or FAILED

FAILED

  • Status: All recovery attempts failed
  • Alert: Critical notification sent
  • Required: Manual intervention
  • Actions:
    • Check logs: ./deepchain logs <service>
    • Run diagnostics: ./deepchain diagnose
    • Check error log: logs/health-monitor/errors.log

🚀 Deployment Process

Initial Deployment

# 1. Initialize configuration
./deepchain init

# 2. Build images
./deepchain build

# 3. Deploy
./deepchain deploy

# 4. Verify
./deepchain status
./deepchain health

# 5. Start monitoring (optional but recommended)
./deepchain monitor --rollback-on-fail

Update Deployment

# 1. Stop monitor (if running)
pkill -f health-monitor

# 2. Update code
git pull origin main

# 3. Build new images
./deepchain build

# 4. Deploy update
./deepchain update

# 5. Verify health
./deepchain health

# 6. Restart monitor
./deepchain monitor --rollback-on-fail

Development Workflow

# Make code changes
vim api_server/lib/src/routes/...

# Rebuild specific service
./deepchain build api

# Restart service
./deepchain restart api

# Check logs
./deepchain logs api -f

# Test
curl http://localhost:8080/health

Production Workflow

# 1. Test in development first
./deepchain env  # Verify you're in development
./deepchain deploy
# ... test thoroughly ...

# 2. Switch to production
./deepchain project switch production

# 3. Deploy to production
./deepchain env  # Verify you're in production
./deepchain deploy

# 4. Monitor closely
./deepchain monitor --rollback-on-fail &
./deepchain logs -f

🏥 Health Monitoring

What Gets Monitored

PostgreSQL Database

  • Check: pg_isready -U deepchain -d deepchain_db
  • Frequency: Every 30 seconds
  • Failure Criteria: Command exits with non-zero status
  • Impact: Critical - all services depend on database

RabbitMQ Message Queue

  • Check: rabbitmq-diagnostics -q ping
  • Frequency: Every 30 seconds
  • Failure Criteria: Ping fails or times out
  • Impact: Critical - workers can't receive jobs

API Server

  • Check: HTTP GET to http://localhost:8080/health
  • Frequency: Every 30 seconds
  • Expected: HTTP 200
  • Failure Criteria: Non-200 status or timeout (>10s)
  • Impact: Critical - frontend can't communicate

Frontend Application

  • Check: HTTP GET to http://localhost:3000
  • Frequency: Every 30 seconds
  • Expected: HTTP 200 or 304
  • Failure Criteria: Non-200/304 status or timeout (>10s)
  • Impact: High - users can't access UI

Worker Processes

  • Check: Container running + recent log activity
  • Frequency: Every 60 seconds
  • Failure Criteria: No running containers or no logs in 5 min
  • Impact: High - workflows won't execute
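
If you want to run the same checks by hand (for example, when the monitor's verdict surprises you), the commands below mirror the checks described above. The deepchain-postgres and deepchain-worker container names are assumptions; deepchain-api and deepchain-rabbitmq appear elsewhere in this guide.

# Manual versions of the documented health checks
docker exec deepchain-postgres pg_isready -U deepchain -d deepchain_db   # container name assumed
docker exec deepchain-rabbitmq rabbitmq-diagnostics -q ping
curl -i --max-time 10 http://localhost:8080/health
curl -i --max-time 10 http://localhost:3000
docker ps --filter name=deepchain-worker --format '{{.Names}}: {{.Status}}'   # container name assumed
docker logs --since 5m deepchain-worker 2>&1 | tail -5                        # any recent activity?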

Monitoring Output

Real-time terminal output shows:

✓ 14:30:15 All services healthy (check #42)
⚠ 14:30:45 Some services unhealthy - checking details...
⚠ 14:30:45 Restarting api...
⚡ 14:30:45 Attempting recovery for api (attempt 1/3)
✓ 14:31:00 api recovered successfully
✓ 14:31:00 All services healthy (check #43)

🔄 Auto-Recovery System

How It Works

  1. Detection Phase

    Health Check → FAIL → Log Failure → Update Counter
  2. Decision Phase

    Check Failure Count → < Max Retries? → Restart
                        → ≥ Max Retries? → Rollback
  3. Recovery Phase

    Restart Service → Wait 10s → Re-check Health → Success/Failure
  4. Verification Phase

    Success → Reset Counter → Continue Monitoring
    Failure → Increment Counter → Repeat or Escalate
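
Put together, the four phases behave like the loop below. This is a shell illustration using the commands documented in this guide, shown for a single service (api) and assuming ./deepchain health exits non-zero on failure; the real monitor tracks each service separately.

# Illustrative detection/decision/recovery loop with a failure counter
MAX_RETRIES=3
failures=0
while true; do
  if ./deepchain health >/dev/null 2>&1; then
    failures=0                               # Verification: success resets the counter
  else
    failures=$((failures + 1))               # Detection: count the failure
    if [ "$failures" -lt "$MAX_RETRIES" ]; then
      ./deepchain restart api                # Recovery: restart, wait, re-check next pass
      sleep 10
    else
      echo "Max retries reached - escalate (rollback or alert)"
    fi
  fi
  sleep 30
done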

Recovery Strategies

Level 1: Service Restart

  • Trigger: First health check failure
  • Action: ./deepchain restart <service>
  • Wait: 10 seconds
  • Verification: Re-run health check
  • Success Rate: ~80%

Level 2: Dependency Restart

  • Trigger: Service restart failed (2nd attempt)
  • Action: Restart service + dependencies
  • Example: API fails → restart api + postgres + rabbitmq
  • Success Rate: ~15%

Level 3: Rollback

  • Trigger: Max retries reached (3 failures)
  • Condition: --rollback-on-fail enabled
  • Action: Revert to previous Docker image
  • Success Rate: ~90%

Level 4: Manual Intervention

  • Trigger: Rollback failed or not enabled
  • Action: Send critical alert
  • Required: Human investigation

Configuration Options

# Basic monitoring (no rollback)
./deepchain monitor

# With rollback
./deepchain monitor --rollback-on-fail

# Aggressive checking
./deepchain monitor --interval 15 --max-retries 5

# Conservative (production)
./deepchain monitor --interval 60 --max-retries 2 --rollback-on-fail

# Custom log location
./deepchain monitor --log-dir /var/log/deepchain

⏮️ Rollback Procedures

Automatic Rollback

Enabled with --rollback-on-fail:

./deepchain monitor --rollback-on-fail

Process:

  1. Detect persistent failure (3 failed recovery attempts)
  2. Read previous image from .deepchain/deployment-state.json
  3. Pull previous image: docker pull gcr.io/aicedc/deepchain-api:20251227-120000
  4. Stop current container
  5. Start container with previous image
  6. Wait 20 seconds for stabilization
  7. Verify health
  8. Log outcome
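
The same steps, sketched as shell commands for the API service. The previous_image field name and the bare docker run are assumptions for illustration; in a real deployment the container must be recreated with the same network, environment, and volume settings as the original (for example via your compose file), and the image tag is whatever your deployment state actually records.

# Illustrative rollback sketch -- not the CLI's actual implementation
PREV=$(jq -r '.api.previous_image' .deepchain/deployment-state.json)   # field name assumed
docker pull "$PREV"
docker stop deepchain-api && docker rm deepchain-api
docker run -d --name deepchain-api "$PREV"    # real runs need the original env/network/volume flags
sleep 20
curl -fsS --max-time 10 http://localhost:8080/health && echo "Rollback healthy"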

Manual Rollback

# 1. Check deployment state
./deepchain config

# 2. View deployment history
cat .deepchain/projects/*/deployment-state.json

# 3. Rollback to previous version
# The CLI automatically tracks previous versions
./deepchain update  # With rollback support

# 4. Verify
./deepchain health
./deepchain status

Rollback Best Practices

  1. Always Keep 2 Versions

    • Current: Latest deployment
    • Previous: Last known-good version
    • Don't delete old images immediately
  2. Test Rollback Procedure

    • Practice rollback in staging
    • Verify data compatibility
    • Document any manual steps
  3. Database Considerations

    • Backup before deployment
    • Ensure backward compatibility
    • Plan migration rollbacks
  4. Monitor After Rollback

    • Watch logs closely
    • Verify all functionality
    • Understand why rollback was needed

🐛 Troubleshooting

Common Issues

Issue: Services Keep Restarting

Symptoms:

⚠ Restarting api...
⚡ Attempting recovery for api (attempt 1/3)
✗ api restart failed
⚠ Restarting api...
⚡ Attempting recovery for api (attempt 2/3)

Diagnosis:

# Check service logs
./deepchain logs api --tail=100

# Look for errors in health monitor
tail -50 logs/health-monitor/errors.log

# Check system resources
docker stats

# Run full diagnostics
./deepchain diagnose

Solutions:

  • Database connection issues → Fix database
  • Out of memory → Increase resources
  • Configuration errors → Check .env file
  • Port conflicts → Check netstat -tuln

Issue: Rollback Not Triggered

Symptoms:

  • Service fails 3 times
  • No rollback initiated
  • Monitor shows max retries reached

Diagnosis:

# Check if rollback is enabled
ps aux | grep health-monitor
# Look for --rollback-on-fail flag

# Check deployment state file
cat .deepchain/deployment-state.json

Solutions:

  • Enable rollback: ./deepchain monitor --rollback-on-fail
  • Verify deployment state file exists
  • Ensure previous images are available

Issue: Health Checks Failing But Service Works

Symptoms:

  • Service appears functional
  • Health checks consistently fail
  • False positives

Diagnosis:

# Test health endpoint manually
curl -v http://localhost:8080/health

# Check response time
time curl http://localhost:8080/health

# Verify service is actually running
docker ps | grep deepchain-api

Solutions:

  • Increase health check timeout in monitor script
  • Adjust health check interval: --interval 60
  • Fix slow health endpoint
  • Review health check logic

Issue: Database Permission Errors

Symptoms:

ERROR: could not open file "pg_log/...": Permission denied
FATAL: data directory has invalid permissions

Diagnosis:

# Check permissions
ls -la data/postgres/

# Check ownership
stat data/postgres/

Solutions:

# Fix ownership (postgres user is UID 999)
sudo chown -R 999:999 data/postgres
sudo chmod -R 700 data/postgres

# Restart postgres
./deepchain restart postgres

Issue: RabbitMQ Connection Failures

Symptoms:

ERROR: Failed to connect to RabbitMQ
Connection refused (Connection refused)

Diagnosis:

# Check RabbitMQ status
./deepchain logs rabbitmq

# Test connection
docker exec deepchain-rabbitmq rabbitmq-diagnostics ping

# Check management UI
curl http://localhost:15672

Solutions:

# Restart RabbitMQ
./deepchain restart rabbitmq

# Use fix command
./deepchain fix rabbitmq

# Check credentials in .env
grep RABBITMQ_ .env

🎯 Best Practices

1. Always Monitor Production

# Run monitor as background service
nohup ./deepchain monitor --rollback-on-fail >> monitor.log 2>&1 &

# Or use systemd (recommended)
sudo systemctl enable deepchain-monitor
sudo systemctl start deepchain-monitor
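
If you go the systemd route, a unit file along these lines works; the install path (/opt/deepchain, matching the cron example later in this guide) and the unit contents are assumptions to adapt to your environment.

# Example unit file -- adjust paths before using
sudo tee /etc/systemd/system/deepchain-monitor.service >/dev/null <<'EOF'
[Unit]
Description=DeepChain health monitor
After=docker.service

[Service]
WorkingDirectory=/opt/deepchain
ExecStart=/opt/deepchain/deepchain monitor --rollback-on-fail
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload   # then enable and start as shown above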

2. Review Logs Regularly

# Weekly review
tail -1000 logs/health-monitor/recovery-actions.log | grep "FAIL"
tail -1000 logs/health-monitor/deployments.log | grep "ROLLBACK"

# Check for patterns
grep "FAIL" logs/health-monitor/health-checks.log | \
    awk '{print $3}' | sort | uniq -c | sort -nr

3. Test Recovery Procedures

# Simulate API failure
docker stop deepchain-api

# Watch monitor recover
# Verify in logs/health-monitor/

# Test rollback manually
# ... follow manual rollback steps ...

4. Backup Before Major Changes

# Always backup before deployment
./deepchain db backup

# Keep backups for 30 days
ls -lht backups/ | head -10

# Automate with cron
0 2 * * * cd /opt/deepchain && ./deepchain db backup

5. Use Deployment Windows

For production:

  • Deploy during low-traffic periods
  • Have team available for 1 hour post-deployment
  • Monitor closely for 24 hours
  • Keep rollback procedure ready

6. Document Custom Procedures

Create DEPLOYMENT_NOTES.md:

## Pre-Deployment Checklist
- [ ] Backup database
- [ ] Review recent changes
- [ ] Test in staging
- [ ] Notify team

## Post-Deployment Checklist
- [ ] Verify health checks
- [ ] Test critical flows
- [ ] Monitor for 1 hour
- [ ] Update deployment log

7. Monitor Resource Usage

# Check disk space
df -h

# Check Docker resource usage
docker stats --no-stream

# Set alerts for:
# - Disk usage > 80%
# - Memory usage > 80%
# - High CPU sustained > 5 min
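
A simple cron-able sketch of those disk and memory thresholds (GNU df and free assumed; wire the echo into whatever alerting you already use):

# Illustrative threshold check for the alerts listed above
disk=$(df --output=pcent / | tail -1 | tr -dc '0-9')
[ "$disk" -gt 80 ] && echo "ALERT: disk usage at ${disk}%"
mem=$(free | awk '/Mem:/ {printf "%d", $3/$2*100}')
[ "$mem" -gt 80 ] && echo "ALERT: memory usage at ${mem}%"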

8. Version Everything

## Version Tracking

### Images
- API: gcr.io/aicedc/deepchain-api:20251228-143000
- Worker: gcr.io/aicedc/deepchain-worker:20251228-143000
- Frontend: gcr.io/aicedc/deepchain-frontend:20251228-143000

### Database Schema
- Version: 1.2.3
- Last Migration: 20251228_add_workflow_versions

### Configuration
- Commit: abc123f
- Date: 2025-12-28 14:30:00


🤝 Support

For issues or questions:

  1. Check logs: ./deepchain logs and logs/health-monitor/
  2. Run diagnostics: ./deepchain diagnose
  3. Review this documentation
  4. Check GitHub issues
  5. Contact the team

Last Updated: 2025-12-28 Version: 1.0.0