Deployment Lifecycle & Operations
Learn how DeepChain manages deployments: health checks, automatic recovery when things fail, and rollback if needed. This guide covers production-grade operations practices.
The Deployment Journey
Your code goes through these stages:
Code Change
↓
Build & Test
↓
Deploy (restart containers)
↓
Health Checks (auto-verify all services work)
↓
Running (monitor continuously)
↓
If problem detected → Auto-recover or rollback
Key Concepts
Health Checks
Every 30 seconds, DeepChain checks:
- Can the database respond?
- Is the message queue healthy?
- Are API endpoints working?
- Is the frontend serving?
- Are workers processing jobs?
If any check fails, automatic recovery kicks in.
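These are the same commands you can run by hand when spot-checking a deployment. A minimal sketch, assuming the container names deepchain-postgres and deepchain-rabbitmq (the rabbitmq name appears in the troubleshooting section; the postgres name is assumed to follow the same pattern) and the ports documented later in this guide:
# Database: should report "accepting connections"
docker exec deepchain-postgres pg_isready -U deepchain -d deepchain_db
# Message queue: exits 0 when the node responds
docker exec deepchain-rabbitmq rabbitmq-diagnostics -q ping
# API and frontend: expect 200 (the frontend may also return 304)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000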
Auto-Recovery
When a service fails:
- Restart the service (usually fixes it)
- Wait 10 seconds
- Re-check health
- Repeat up to 3 times
- If still broken, trigger rollback (optional)
Success rate: ~85% of failures are fixed by restart.
Rollback
If a service won't recover, revert to the previous working version:
- Stop current container
- Start previous container
- Verify it works
- Alert the team
Rollback success rate: ~90% — your old version almost always works.
What Services Are Monitored?
| Service | What it does | If it fails |
|---|---|---|
| PostgreSQL | Stores workflows, executions, users | Workflows can't run; users can't login |
| RabbitMQ | Message queue for workflow jobs | Workflow runs are delayed/lost |
| API | REST API for all operations | Integrations break; UI stops working |
| Frontend | Web UI (localhost:3000) | Users can't access DeepChain |
| Workers | Execute your workflows | No workflows run |
All of these have continuous health checks and auto-recovery.
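To see all five at a glance, list the containers (assuming they follow the deepchain-<service> naming used in the troubleshooting section) or use the CLI summary:
# List DeepChain containers with their status and ports
docker ps --filter "name=deepchain" --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Or let the CLI summarize
./deepchain status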
📊 Deployment States
State Transitions
┌──────────────┐
│ NOT │
│ DEPLOYED │
└──────┬───────┘
│ ./deepchain deploy
↓
┌──────────────┐
│ DEPLOYING │
└──────┬───────┘
│ Build & Start
↓
┌──────────────┐ Health Check Fails
│ HEALTHY │ ────────────────────────→ ┌──────────────┐
│ (Running) │ │ UNHEALTHY │
└──────┬───────┘ └──────┬───────┘
│ │
│ ./deepchain update │ Auto-Recovery
↓ │
┌──────────────┐ ↓
│ UPDATING │ ┌──────────────┐
└──────┬───────┘ │ RECOVERING │
│ └──────┬───────┘
│ Update Success │
↓ │ Success
┌──────────────┐ ↓
│ HEALTHY │ ←──────────────────────────────────
│ (New Version)│
└──────┬───────┘
│ Recovery Fails (after max retries)
↓
┌──────────────┐
│ ROLLING │
│ BACK │
└──────┬───────┘
│ Rollback Success
↓
┌──────────────┐
│ HEALTHY │
│(Old Version) │
└──────────────┘
State Descriptions
NOT DEPLOYED
- Status: No containers running
- Action: Run ./deepchain deploy
- Next State: DEPLOYING
DEPLOYING
- Status: Building images and starting containers
- Duration: 2-5 minutes (depending on caching)
- Logs: Watch with ./deepchain logs
- Next State: HEALTHY or UNHEALTHY
HEALTHY
- Status: All services passing health checks
- Indicators:
  - PostgreSQL: pg_isready succeeds
  - RabbitMQ: Diagnostics ping succeeds
  - API: HTTP 200 from /health
  - Frontend: HTTP 200/304 from web root
  - Workers: Containers running with recent log activity
- Monitor: ./deepchain monitor for continuous checking
UNHEALTHY
- Status: One or more services failing health checks
- Trigger: Health check failure detected by monitor
- Auto-Action: Initiate recovery if monitor is running
- Manual Action: ./deepchain diagnose to investigate
RECOVERING
- Status: Automatic recovery in progress
- Actions:
  - Restart failed service
  - Wait 10 seconds
  - Re-run health check
  - Increment failure counter
- Max Attempts: 3 (configurable)
- Next State: HEALTHY (success) or ROLLING BACK (failure)
ROLLING BACK
- Status: Reverting to previous working version
- Trigger: Recovery failed after max retries (if --rollback-on-fail enabled)
- Process:
  - Read previous image from deployment state
  - Pull previous Docker image
  - Restart service with old image
  - Verify health
- Next State: HEALTHY (old version) or FAILED
FAILED
- Status: All recovery attempts failed
- Alert: Critical notification sent
- Required: Manual intervention
- Actions:
  - Check logs: ./deepchain logs <service>
  - Run diagnostics: ./deepchain diagnose
  - Check error log: logs/health-monitor/errors.log
🚀 Deployment Process
Initial Deployment
# 1. Initialize configuration
./deepchain init
# 2. Build images
./deepchain build
# 3. Deploy
./deepchain deploy
# 4. Verify
./deepchain status
./deepchain health
# 5. Start monitoring (optional but recommended)
./deepchain monitor --rollback-on-fail
Update Deployment
# 1. Stop monitor (if running)
pkill -f health-monitor
# 2. Update code
git pull origin main
# 3. Build new images
./deepchain build
# 4. Deploy update
./deepchain update
# 5. Verify health
./deepchain health
# 6. Restart monitor
./deepchain monitor --rollback-on-fail
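If you update often, the six steps above can be wrapped in a small script so the monitor is never left running against a half-updated stack. A minimal sketch; the script name and the set -e error handling are illustrative, not part of the CLI, and it assumes ./deepchain health exits non-zero when a check fails (otherwise inspect its output before restarting the monitor):
#!/usr/bin/env bash
# update.sh - convenience wrapper around the update steps above
set -euo pipefail
pkill -f health-monitor || true            # stop the monitor; ignore if it is not running
git pull origin main                       # fetch the new code
./deepchain build                          # build new images
./deepchain update                         # deploy the update
./deepchain health                         # stop here if anything is unhealthy
./deepchain monitor --rollback-on-fail &   # restart monitoring in the background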
Development Workflow
# Make code changes
vim api_server/lib/src/routes/...
# Rebuild specific service
./deepchain build api
# Restart service
./deepchain restart api
# Check logs
./deepchain logs api -f
# Test
curl http://localhost:8080/health
Production Workflow
# 1. Test in development first
./deepchain env # Verify you're in development
./deepchain deploy
# ... test thoroughly ...
# 2. Switch to production
./deepchain project switch production
# 3. Deploy to production
./deepchain env # Verify you're in production
./deepchain deploy
# 4. Monitor closely
./deepchain monitor --rollback-on-fail &
./deepchain logs -f
🏥 Health Monitoring
What Gets Monitored
PostgreSQL Database
- Check: pg_isready -U deepchain -d deepchain_db
- Frequency: Every 30 seconds
- Failure Criteria: Command exits with non-zero status
- Impact: Critical - all services depend on database
RabbitMQ Message Queue
- Check: rabbitmq-diagnostics -q ping
- Frequency: Every 30 seconds
- Failure Criteria: Ping fails or times out
- Impact: Critical - workers can't receive jobs
API Server
- Check: HTTP GET to http://localhost:8080/health
- Frequency: Every 30 seconds
- Expected: HTTP 200
- Failure Criteria: Non-200 status or timeout (>10s)
- Impact: Critical - frontend can't communicate
Frontend Application
- Check: HTTP GET to http://localhost:3000
- Frequency: Every 30 seconds
- Expected: HTTP 200 or 304
- Failure Criteria: Non-200/304 status or timeout (>10s)
- Impact: High - users can't access UI
Worker Processes
- Check: Container running + recent log activity
- Frequency: Every 60 seconds
- Failure Criteria: No running containers or no logs in 5 min
- Impact: High - workflows won't execute
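The worker check is the only one that is not a simple ping; a hand-run equivalent looks roughly like this (the container name deepchain-worker is an assumption, following the naming of the other services):
# A worker is healthy if its container is running...
docker ps --filter "name=deepchain-worker" --filter "status=running"
# ...and it has logged something within the last 5 minutes
docker logs --since 5m deepchain-worker 2>&1 | tail -n 5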
Monitoring Output
Real-time terminal output shows:
✓ 14:30:15 All services healthy (check #42)
⚠ 14:30:45 Some services unhealthy - checking details...
⚠ 14:30:45 Restarting api...
⚡ 14:30:45 Attempting recovery for api (attempt 1/3)
✓ 14:31:00 api recovered successfully
✓ 14:31:00 All services healthy (check #43)
🔄 Auto-Recovery System
How It Works
Detection Phase
Health Check → FAIL → Log Failure → Update Counter
Decision Phase
Check Failure Count → (< Max Retries → Restart) or (≥ Max Retries → Rollback)
Recovery Phase
Restart Service → Wait 10s → Re-check Health → Success/Failure
Verification Phase
Success → Reset Counter → Continue Monitoring
Failure → Increment Counter → Repeat or Escalate
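Taken together, the four phases amount to a retry loop. A simplified sketch for a single service, assuming ./deepchain health exits non-zero when a check fails; the real monitor also applies the dependency-restart and rollback levels described below:
# Illustrative recovery loop (not the actual monitor script)
SERVICE=api
MAX_RETRIES=3
attempt=0
until ./deepchain health; do               # detection: a health check failed
  attempt=$((attempt + 1))                 # decision: count the failure
  if [ "$attempt" -gt "$MAX_RETRIES" ]; then
    echo "max retries reached for $SERVICE - escalating (rollback or alert)"
    break
  fi
  echo "recovery attempt $attempt/$MAX_RETRIES for $SERVICE"
  ./deepchain restart "$SERVICE"           # recovery: restart the service
  sleep 10                                 # wait, then the loop re-checks health
done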
Recovery Strategies
Level 1: Service Restart
- Trigger: First health check failure
- Action: ./deepchain restart <service>
- Wait: 10 seconds
- Verification: Re-run health check
- Success Rate: ~80%
Level 2: Dependency Restart
- Trigger: Service restart failed (2nd attempt)
- Action: Restart service + dependencies
- Example: API fails → restart api + postgres + rabbitmq
- Success Rate: ~15%
Level 3: Rollback
- Trigger: Max retries reached (3 failures)
- Condition: --rollback-on-fail enabled
- Action: Revert to previous Docker image
- Success Rate: ~90%
Level 4: Manual Intervention
- Trigger: Rollback failed or not enabled
- Action: Send critical alert
- Required: Human investigation
Configuration Options
# Basic monitoring (no rollback)
./deepchain monitor
# With rollback
./deepchain monitor --rollback-on-fail
# Aggressive checking
./deepchain monitor --interval 15 --max-retries 5
# Conservative (production)
./deepchain monitor --interval 60 --max-retries 2 --rollback-on-fail
# Custom log location
./deepchain monitor --log-dir /var/log/deepchain
⏮️ Rollback Procedures
Automatic Rollback
Enabled with --rollback-on-fail:
./deepchain monitor --rollback-on-fail
Process:
- Detect persistent failure (3 failed recovery attempts)
- Read previous image from .deepchain/deployment-state.json
- Pull previous image: docker pull gcr.io/aicedc/deepchain-api:20251227-120000
- Stop current container
- Start container with previous image
- Wait 20 seconds for stabilization
- Verify health
- Log outcome
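After an automatic rollback completes, confirm which image is actually running before digging into why the new version failed (container name deepchain-api as used in the troubleshooting section):
# Which image is the API container running now?
docker inspect --format '{{.Config.Image}}' deepchain-api
# And is it answering health checks?
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/health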
Manual Rollback
# 1. Check deployment state
./deepchain config
# 2. View deployment history
cat .deepchain/projects/*/deployment-state.json
# 3. Rollback to previous version
# The CLI automatically tracks previous versions
./deepchain update # With rollback support
# 4. Verify
./deepchain health
./deepchain status
Rollback Best Practices
Always Keep 2 Versions
- Current: Latest deployment
- Previous: Last known-good version
- Don't delete old images immediately
Test Rollback Procedure
- Practice rollback in staging
- Verify data compatibility
- Document any manual steps
Database Considerations
- Backup before deployment
- Ensure backward compatibility
- Plan migration rollbacks
Monitor After Rollback
- Watch logs closely
- Verify all functionality
- Understand why rollback was needed
🐛 Troubleshooting
Common Issues
Issue: Services Keep Restarting
Symptoms:
⚠ Restarting api...
⚡ Attempting recovery for api (attempt 1/3)
✗ api restart failed
⚠ Restarting api...
⚡ Attempting recovery for api (attempt 2/3)
Diagnosis:
# Check service logs
./deepchain logs api --tail=100
# Look for errors in health monitor
tail -50 logs/health-monitor/errors.log
# Check system resources
docker stats
# Run full diagnostics
./deepchain diagnose
Solutions:
- Database connection issues → Fix database
- Out of memory → Increase resources
- Configuration errors → Check .env file
- Port conflicts → Check netstat -tuln
Issue: Rollback Not Triggered
Symptoms:
- Service fails 3 times
- No rollback initiated
- Monitor shows max retries reached
Diagnosis:
# Check if rollback is enabled
ps aux | grep health-monitor
# Look for --rollback-on-fail flag
# Check deployment state file
cat .deepchain/deployment-state.json
Solutions:
- Enable rollback: ./deepchain monitor --rollback-on-fail
- Verify deployment state file exists
- Ensure previous images are available
Issue: Health Checks Failing But Service Works
Symptoms:
- Service appears functional
- Health checks consistently fail
- Failure alerts are false alarms
Diagnosis:
# Test health endpoint manually
curl -v http://localhost:8080/health
# Check response time
time curl http://localhost:8080/health
# Verify service is actually running
docker ps | grep deepchain-api
Solutions:
- Increase health check timeout in monitor script
- Adjust health check interval: --interval 60
- Fix slow health endpoint
- Review health check logic
Issue: Database Permission Errors
Symptoms:
ERROR: could not open file "pg_log/...": Permission denied
FATAL: data directory has invalid permissions
Diagnosis:
# Check permissions
ls -la data/postgres/
# Check ownership
stat data/postgres/
Solutions:
# Fix ownership (postgres user is UID 999)
sudo chown -R 999:999 data/postgres
sudo chmod -R 700 data/postgres
# Restart postgres
./deepchain restart postgres
Issue: RabbitMQ Connection Failures
Symptoms:
ERROR: Failed to connect to RabbitMQ
Connection refused (Connection refused)
Diagnosis:
# Check RabbitMQ status
./deepchain logs rabbitmq
# Test connection
docker exec deepchain-rabbitmq rabbitmq-diagnostics ping
# Check management UI
curl http://localhost:15672
Solutions:
# Restart RabbitMQ
./deepchain restart rabbitmq
# Use fix command
./deepchain fix rabbitmq
# Check credentials in .env
grep RABBITMQ_ .env
🎯 Best Practices
1. Always Monitor Production
# Run monitor as background service
nohup ./deepchain monitor --rollback-on-fail >> monitor.log 2>&1 &
# Or use systemd (recommended)
sudo systemctl enable deepchain-monitor
sudo systemctl start deepchain-monitor
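The systemctl commands above assume a deepchain-monitor unit exists. A minimal sketch of one, assuming the installation lives in /opt/deepchain as in the cron example below; adjust the paths and service user for your host:
# Create the unit file (paths are placeholders)
sudo tee /etc/systemd/system/deepchain-monitor.service >/dev/null <<'EOF'
[Unit]
Description=DeepChain health monitor
After=docker.service

[Service]
WorkingDirectory=/opt/deepchain
ExecStart=/opt/deepchain/deepchain monitor --rollback-on-fail
Restart=always

[Install]
WantedBy=multi-user.target
EOF
# Reload systemd, then run the enable/start commands above
sudo systemctl daemon-reload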
2. Review Logs Regularly
# Weekly review
tail -1000 logs/health-monitor/recovery-actions.log | grep "FAIL"
tail -1000 logs/health-monitor/deployments.log | grep "ROLLBACK"
# Check for patterns
grep "FAIL" logs/health-monitor/health-checks.log | \
awk '{print $3}' | sort | uniq -c | sort -nr
3. Test Recovery Procedures
# Simulate API failure
docker stop deepchain-api
# Watch monitor recover
# Verify in logs/health-monitor/
# Test rollback manually
# ... follow manual rollback steps ...
4. Backup Before Major Changes
# Always backup before deployment
./deepchain db backup
# Keep backups for 30 days
ls -lht backups/ | head -10
# Automate with cron
0 2 * * * cd /opt/deepchain && ./deepchain db backup
5. Use Deployment Windows
For production:
- Deploy during low-traffic periods
- Have team available for 1 hour post-deployment
- Monitor closely for 24 hours
- Keep rollback procedure ready
6. Document Custom Procedures
Create DEPLOYMENT_NOTES.md:
## Pre-Deployment Checklist
- [ ] Backup database
- [ ] Review recent changes
- [ ] Test in staging
- [ ] Notify team
## Post-Deployment Checklist
- [ ] Verify health checks
- [ ] Test critical flows
- [ ] Monitor for 1 hour
- [ ] Update deployment log
7. Monitor Resource Usage
# Check disk space
df -h
# Check Docker resource usage
docker stats --no-stream
# Set alerts for:
# - Disk usage > 80%
# - Memory usage > 80%
# - High CPU sustained > 5 min
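A cron-friendly sketch for the disk alert; the 80% threshold matches the guideline above, and the logger call is a placeholder for whatever alerting channel you already use:
#!/usr/bin/env bash
# disk-alert.sh - warn when the filesystem holding DeepChain passes 80%
USAGE=$(df --output=pcent /opt/deepchain | tail -1 | tr -dc '0-9')
if [ "$USAGE" -gt 80 ]; then
  echo "DeepChain host disk usage at ${USAGE}%" | logger -t deepchain-alert
fi
# Run it from cron, e.g.: */15 * * * * /opt/deepchain/disk-alert.sh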
8. Version Everything
## Version Tracking
### Images
- API: gcr.io/aicedc/deepchain-api:20251228-143000
- Worker: gcr.io/aicedc/deepchain-worker:20251228-143000
- Frontend: gcr.io/aicedc/deepchain-frontend:20251228-143000
### Database Schema
- Version: 1.2.3
- Last Migration: 20251228_add_workflow_versions
### Configuration
- Commit: abc123f
- Date: 2025-12-28 14:30:00
📚 Related Documentation
- Auto Recovery Guide - Detailed auto-recovery system docs
- Deployment Guide - Production configuration
- Monitoring Guide - Health monitoring
🤝 Support
For issues or questions:
- Check logs: ./deepchain logs and logs/health-monitor/
- Run diagnostics: ./deepchain diagnose
- Review this documentation
- Check GitHub issues
- Contact the team
Last Updated: 2025-12-28 Version: 1.0.0