Deployment Lifecycle & Operations
Learn how DeepChain manages deployments: health checks, automatic recovery when things fail, and rollback if needed. This guide covers production-grade operations practices.
The Deployment Journey
Your code goes through these stages:
Code Change
↓
Build & Test
↓
Deploy (restart containers)
↓
Health Checks (auto-verify all services work)
↓
Running (monitor continuously)
↓
If problem detected → Auto-recover or rollback
Key Concepts
Health Checks
Every 30 seconds, DeepChain checks:
- Can the database respond?
- Is the message queue healthy?
- Are API endpoints working?
- Is the frontend serving?
- Are workers processing jobs?
If any check fails, automatic recovery kicks in.
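These are the same commands you can run by hand when spot-checking a deployment. A minimal sketch, assuming the container names deepchain-postgres and deepchain-rabbitmq (the rabbitmq name appears in the troubleshooting section; the postgres name is assumed to follow the same pattern) and the ports documented later in this guide:
# Database: should report "accepting connections"
docker exec deepchain-postgres pg_isready -U deepchain -d deepchain_db
# Message queue: exits 0 when the node responds
docker exec deepchain-rabbitmq rabbitmq-diagnostics -q ping
# API and frontend: expect 200 (the frontend may also return 304)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000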
Auto-Recovery
When a service fails:
- Restart the service (usually fixes it)
- Wait 10 seconds
- Re-check health
- Repeat up to 3 times
- If still broken, trigger rollback (optional)
Success rate: ~85% of failures are fixed by restart.
Rollback
If a service won't recover, revert to the previous working version:
- Stop current container
- Start previous container
- Verify it works
- Alert the team
Rollback success rate: ~90% — your old version almost always works.
What Services Are Monitored?
| Service | What it does | If it fails |
|---|---|---|
| PostgreSQL | Stores workflows, executions, users | Workflows can't run; users can't login |
| RabbitMQ | Message queue for workflow jobs | Workflow runs are delayed/lost |
| API | REST API for all operations | Integrations break; UI stops working |
| Frontend | Web UI (localhost:3000) | Users can't access DeepChain |
| Workers | Execute your workflows | No workflows run |
All of these have continuous health checks and auto-recovery.
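To see all five at a glance, list the containers (assuming they follow the deepchain-<service> naming used in the troubleshooting section) or use the CLI summary:
# List DeepChain containers with their status and ports
docker ps --filter "name=deepchain" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Or let the CLI summarize
./deepchain status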
📊 Deployment States
State Transitions
┌──────────────┐
│ NOT │
│ DEPLOYED │
└──────┬───────┘
│ ./deepchain deploy
↓
┌──────────────┐
│ DEPLOYING │
└──────┬───────┘
│ Build & Start
↓
┌──────────────┐ Health Check Fails
│ HEALTHY │ ────────────────────────→ ┌──────────────┐
│ (Running) │ │ UNHEALTHY │
└──────┬───────┘ └──────┬───────┘
│ │
│ ./deepchain update │ Auto-Recovery
↓ │
┌──────────────┐ ↓
│ UPDATING │ ┌──────────────┐
└──────┬───────┘ │ RECOVERING │
│ └──────┬───────┘
│ Update Success │
↓ │ Success
┌──────────────┐ ↓
│ HEALTHY │ ←──────────────────────────────────
│ (New Version)│
└──────┬───────┘
│ Recovery Fails (after max retries)
↓
┌──────────────┐
│ ROLLING │
│ BACK │
└──────┬───────┘
│ Rollback Success
↓
┌──────────────┐
│ HEALTHY │
│(Old Version) │
└──────────────┘
State Descriptions
NOT DEPLOYED
- Status: No containers running
- Action: Run ./deepchain deploy
- Next State: DEPLOYING
DEPLOYING
- Status: Building images and starting containers
- Duration: 2-5 minutes (depending on caching)
- Logs: Watch with ./deepchain logs
- Next State: HEALTHY or UNHEALTHY
HEALTHY
- Status: All services passing health checks
- Indicators:
  - PostgreSQL: pg_isready succeeds
  - RabbitMQ: Diagnostics ping succeeds
  - API: HTTP 200 from /health
  - Frontend: HTTP 200/304 from web root
  - Workers: Containers running with recent log activity
- Monitor: ./deepchain monitor for continuous checking
UNHEALTHY
- Status: One or more services failing health checks
- Trigger: Health check failure detected by monitor
- Auto-Action: Initiate recovery if monitor is running
- Manual Action: ./deepchain diagnose to investigate
RECOVERING
- Status: Automatic recovery in progress
- Actions:
  - Restart failed service
  - Wait 10 seconds
  - Re-run health check
  - Increment failure counter
- Max Attempts: 3 (configurable)
- Next State: HEALTHY (success) or ROLLING BACK (failure)
ROLLING BACK
- Status: Reverting to previous working version
- Trigger: Recovery failed after max retries (if --rollback-on-fail enabled)
- Process:
  - Read previous image from deployment state
  - Pull previous Docker image
  - Restart service with old image
  - Verify health
- Next State: HEALTHY (old version) or FAILED
FAILED
- Status: All recovery attempts failed
- Alert: Critical notification sent
- Required: Manual intervention
- Actions:
  - Check logs: ./deepchain logs <service>
  - Run diagnostics: ./deepchain diagnose
  - Check error log: logs/health-monitor/errors.log
🚀 Deployment Process
Initial Deployment
# 1. Initialize configuration
./deepchain init
# 2. Build images
./deepchain build
# 3. Deploy
./deepchain deploy
# 4. Verify
./deepchain status
./deepchain health
# 5. Start monitoring (optional but recommended)
./deepchain monitor --rollback-on-fail
Update Deployment
# 1. Stop monitor (if running)
pkill -f health-monitor
# 2. Update code
git pull origin main
# 3. Build new images
./deepchain build
# 4. Deploy update
./deepchain update
# 5. Verify health
./deepchain health
# 6. Restart monitor
./deepchain monitor --rollback-on-fail
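If you update often, the six steps above can be wrapped in a small script so the monitor is never left running against a half-updated stack. A minimal sketch; the script name and the set -e error handling are illustrative, not part of the CLI, and it assumes ./deepchain health exits non-zero when a check fails (otherwise inspect its output before restarting the monitor):
#!/usr/bin/env bash
# update.sh - convenience wrapper around the update steps above
set -euo pipefail
pkill -f health-monitor || true            # stop the monitor; ignore if it is not running
git pull origin main                       # fetch the new code
./deepchain build                          # build new images
./deepchain update                         # deploy the update
./deepchain health                         # stop here if anything is unhealthy
./deepchain monitor --rollback-on-fail &   # restart monitoring in the background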
Development Workflow
# Make code changes
vim api_server/lib/src/routes/...
# Rebuild specific service
./deepchain build api
# Restart service
./deepchain restart api
# Check logs
./deepchain logs api -f
# Test
curl http://localhost:8080/health
Production Workflow
# 1. Test in development first
./deepchain env # Verify you're in development
./deepchain deploy
# ... test thoroughly ...
# 2. Switch to production
./deepchain project switch production
# 3. Deploy to production
./deepchain env # Verify you're in production
./deepchain deploy
# 4. Monitor closely
./deepchain monitor --rollback-on-fail &
./deepchain logs -f
🏥 Health Monitoring
What Gets Monitored
PostgreSQL Database
- Check: pg_isready -U deepchain -d deepchain_db
- Frequency: Every 30 seconds
- Failure Criteria: Command exits with non-zero status
- Impact: Critical - all services depend on database
RabbitMQ Message Queue
- Check: rabbitmq-diagnostics -q ping
- Frequency: Every 30 seconds
- Failure Criteria: Ping fails or times out
- Impact: Critical - workers can't receive jobs
API Server
- Check: HTTP GET to http://localhost:8080/health
- Frequency: Every 30 seconds
- Expected: HTTP 200
- Failure Criteria: Non-200 status or timeout (>10s)
- Impact: Critical - frontend can't communicate
Frontend Application
- Check: HTTP GET to http://localhost:3000
- Frequency: Every 30 seconds
- Expected: HTTP 200 or 304
- Failure Criteria: Non-200/304 status or timeout (>10s)
- Impact: High - users can't access UI
Worker Processes
- Check: Container running + recent log activity
- Frequency: Every 60 seconds
- Failure Criteria: No running containers or no logs in 5 min
- Impact: High - workflows won't execute
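The worker check is the only one that is not a simple ping; a hand-run equivalent looks roughly like this (the container name deepchain-worker is an assumption, following the naming of the other services):
# A worker is healthy if its container is running...
docker ps --filter "name=deepchain-worker" --filter "status=running"
# ...and it has logged something within the last 5 minutes
docker logs --since 5m deepchain-worker 2>&1 | tail -n 5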
Monitoring Output
Real-time terminal output shows:
✓ 14:30:15 All services healthy (check #42)
⚠ 14:30:45 Some services unhealthy - checking details...
⚠ 14:30:45 Restarting api...
⚡ 14:30:45 Attempting recovery for api (attempt 1/3)
✓ 14:31:00 api recovered successfully
✓ 14:31:00 All services healthy (check #43)
🔄 Auto-Recovery System
How It Works
Detection Phase
Health Check → FAIL → Log Failure → Update Counter
Decision Phase
Check Failure Count → (< Max Retries → Restart) or (≥ Max Retries → Rollback)
Recovery Phase
Restart Service → Wait 10s → Re-check Health → Success/Failure
Verification Phase
Success → Reset Counter → Continue Monitoring
Failure → Increment Counter → Repeat or Escalate
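Taken together, the four phases amount to a retry loop. A simplified sketch for a single service, assuming ./deepchain health exits non-zero when a check fails; the real monitor also applies the dependency-restart and rollback levels described below:
# Illustrative recovery loop (not the actual monitor script)
SERVICE=api
MAX_RETRIES=3
attempt=0
until ./deepchain health; do               # detection: a health check failed
  attempt=$((attempt + 1))                 # decision: count the failure
  if [ "$attempt" -gt "$MAX_RETRIES" ]; then
    echo "max retries reached for $SERVICE - escalating (rollback or alert)"
    break
  fi
  echo "recovery attempt $attempt/$MAX_RETRIES for $SERVICE"
  ./deepchain restart "$SERVICE"           # recovery: restart the service
  sleep 10                                 # wait, then the loop re-checks health
done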
Recovery Strategies
Level 1: Service Restart
- Trigger: First health check failure
- Action: ./deepchain restart <service>
- Wait: 10 seconds
- Verification: Re-run health check
- Success Rate: ~80%
Level 2: Dependency Restart
- Trigger: Service restart failed (2nd attempt)
- Action: Restart service + dependencies
- Example: API fails → restart api + postgres + rabbitmq
- Success Rate: ~15%
Level 3: Rollback
- Trigger: Max retries reached (3 failures)
- Condition: --rollback-on-fail enabled
- Action: Revert to previous Docker image
- Success Rate: ~90%
Level 4: Manual Intervention
- Trigger: Rollback failed or not enabled
- Action: Send critical alert
- Required: Human investigation
Configuration Options
# Basic monitoring (no rollback)
./deepchain monitor
# With rollback
./deepchain monitor --rollback-on-fail
# Aggressive checking
./deepchain monitor --interval 15 --max-retries 5
# Conservative (production)
./deepchain monitor --interval 60 --max-retries 2 --rollback-on-fail
# Custom log location
./deepchain monitor --log-dir /var/log/deepchain
⏮️ Rollback Procedures
Automatic Rollback
Enabled with --rollback-on-fail:
./deepchain monitor --rollback-on-fail
Process:
- Detect persistent failure (3 failed recovery attempts)
- Read previous image from .deepchain/deployment-state.json
- Pull previous image: docker pull gcr.io/aicedc/deepchain-api:20251227-120000
- Stop current container
- Start container with previous image
- Wait 20 seconds for stabilization
- Verify health
- Log outcome
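After an automatic rollback completes, confirm which image is actually running before digging into why the new version failed (container name deepchain-api as used in the troubleshooting section):
# Which image is the API container running now?
docker inspect --format '{{.Config.Image}}' deepchain-api
# And is it answering health checks?
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health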
Manual Rollback
# 1. Check deployment state
./deepchain config
# 2. View deployment history
cat .deepchain/projects/*/deployment-state.json
# 3. Rollback to previous version
# The CLI automatically tracks previous versions
./deepchain update # With rollback support
# 4. Verify
./deepchain health
./deepchain status
Rollback Best Practices
Always Keep 2 Versions
- Current: Latest deployment
- Previous: Last known-good version
- Don't delete old images immediately
Test Rollback Procedure
- Practice rollback in staging
- Verify data compatibility
- Document any manual steps
Database Considerations
- Backup before deployment
- Ensure backward compatibility
- Plan migration rollbacks
Monitor After Rollback
- Watch logs closely
- Verify all functionality
- Understand why rollback was needed
🐛 Troubleshooting
Common Issues
Issue: Services Keep Restarting
Symptoms:
⚠ Restarting api...
⚡ Attempting recovery for api (attempt 1/3)
✗ api restart failed
⚠ Restarting api...
⚡ Attempting recovery for api (attempt 2/3)
Diagnosis:
# Check service logs
./deepchain logs api --tail=100
# Look for errors in health monitor
tail -50 logs/health-monitor/errors.log
# Check system resources
docker stats
# Run full diagnostics
./deepchain diagnose
Solutions:
- Database connection issues → Fix database
- Out of memory → Increase resources
- Configuration errors → Check .env file
- Port conflicts → Check netstat -tuln
Issue: Rollback Not Triggered
Symptoms:
- Service fails 3 times
- No rollback initiated
- Monitor shows max retries reached
Diagnosis:
# Check if rollback is enabled
ps aux | grep health-monitor
# Look for --rollback-on-fail flag
# Check deployment state file
cat .deepchain/deployment-state.json
Solutions:
- Enable rollback: ./deepchain monitor --rollback-on-fail
- Verify deployment state file exists
- Ensure previous images are available
Issue: Health Checks Failing But Service Works
Symptoms:
- Service appears functional
- Health checks consistently fail
- Failure alerts are false alarms
Diagnosis:
# Test health endpoint manually
curl -v http://localhost:8080/health
# Check response time
time curl http://localhost:8080/health
# Verify service is actually running
docker ps | grep deepchain-api
Solutions:
- Increase health check timeout in monitor script
- Adjust health check interval: --interval 60
- Fix slow health endpoint
- Review health check logic
Issue: Database Permission Errors
Symptoms:
ERROR: could not open file "pg_log/...": Permission denied
FATAL: data directory has invalid permissions
Diagnosis:
# Check permissions
ls -la data/postgres/
# Check ownership
stat data/postgres/
Solutions:
# Fix ownership (postgres user is UID 999)
sudo chown -R 999:999 data/postgres
sudo chmod -R 700 data/postgres
# Restart postgres
./deepchain restart postgres
Issue: RabbitMQ Connection Failures
Symptoms:
ERROR: Failed to connect to RabbitMQ
Connection refused (Connection refused)
Diagnosis:
# Check RabbitMQ status
./deepchain logs rabbitmq
# Test connection
docker exec deepchain-rabbitmq rabbitmq-diagnostics ping
# Check management UI
curl http://localhost:15672
Solutions:
# Restart RabbitMQ
./deepchain restart rabbitmq
# Use fix command
./deepchain fix rabbitmq
# Check credentials in .env
grep RABBITMQ_ .env
🎯 Best Practices
1. Always Monitor Production
# Run monitor as background service
nohup ./deepchain monitor --rollback-on-fail >> monitor.log 2>&1 &
# Or use systemd (recommended)
sudo systemctl enable deepchain-monitor
sudo systemctl start deepchain-monitor
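The systemctl commands above assume a deepchain-monitor unit exists. A minimal sketch of one, assuming the installation lives in /opt/deepchain as in the cron example below; adjust the paths and service user for your host:
# Create the unit file (paths are placeholders)
sudo tee /etc/systemd/system/deepchain-monitor.service >/dev/null <<'EOF'
[Unit]
Description=DeepChain health monitor
After=docker.service

[Service]
WorkingDirectory=/opt/deepchain
ExecStart=/opt/deepchain/deepchain monitor --rollback-on-fail
Restart=always

[Install]
WantedBy=multi-user.target
EOF
# Reload systemd, then run the enable/start commands above
sudo systemctl daemon-reload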
2. Review Logs Regularly
# Weekly review
tail -1000 logs/health-monitor/recovery-actions.log | grep "FAIL"
tail -1000 logs/health-monitor/deployments.log | grep "ROLLBACK"
# Check for patterns
grep "FAIL" logs/health-monitor/health-checks.log | \
awk '{print $3}' | sort | uniq -c | sort -nr
3. Test Recovery Procedures
# Simulate API failure
docker stop deepchain-api
# Watch monitor recover
# Verify in logs/health-monitor/
# Test rollback manually
# ... follow manual rollback steps ...
4. Backup Before Major Changes
# Always backup before deployment
./deepchain db backup
# Keep backups for 30 days
ls -lht backups/ | head -10
# Automate with cron
0 2 * * * cd /opt/deepchain && ./deepchain db backup
5. Use Deployment Windows
For production:
- Deploy during low-traffic periods
- Have team available for 1 hour post-deployment
- Monitor closely for 24 hours
- Keep rollback procedure ready
6. Document Custom Procedures
Create DEPLOYMENT_NOTES.md:
## Pre-Deployment Checklist
- [ ] Backup database
- [ ] Review recent changes
- [ ] Test in staging
- [ ] Notify team
## Post-Deployment Checklist
- [ ] Verify health checks
- [ ] Test critical flows
- [ ] Monitor for 1 hour
- [ ] Update deployment log
7. Monitor Resource Usage
# Check disk space
df -h
# Check Docker resource usage
docker stats --no-stream
# Set alerts for:
# - Disk usage > 80%
# - Memory usage > 80%
# - High CPU sustained > 5 min
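A cron-friendly sketch for the disk alert; the 80% threshold matches the guideline above, and the logger call is a placeholder for whatever alerting channel you already use:
#!/usr/bin/env bash
# disk-alert.sh - warn when the filesystem holding DeepChain passes 80%
USAGE=$(df --output=pcent /opt/deepchain | tail -1 | tr -dc '0-9')
if [ "$USAGE" -gt 80 ]; then
  echo "DeepChain host disk usage at ${USAGE}%" | logger -t deepchain-alert
fi
# Run it from cron, e.g.: */15 * * * * /opt/deepchain/disk-alert.sh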
8. Version Everything
## Version Tracking
### Images
- API: gcr.io/aicedc/deepchain-api:20251228-143000
- Worker: gcr.io/aicedc/deepchain-worker:20251228-143000
- Frontend: gcr.io/aicedc/deepchain-frontend:20251228-143000
### Database Schema
- Version: 1.2.3
- Last Migration: 20251228_add_workflow_versions
### Configuration
- Commit: abc123f
- Date: 2025-12-28 14:30:00
📚 Related Documentation
- Auto Recovery Guide - Detailed auto-recovery system docs
- Deployment Guide - Production configuration
- Monitoring Guide - Health monitoring
🤝 Support
For issues or questions:
- Check logs: ./deepchain logs and logs/health-monitor/
- Run diagnostics: ./deepchain diagnose
- Review this documentation
- Check GitHub issues
- Contact the team
Last Updated: 2025-12-28 Version: 1.0.0