# Quick Reference
Quick reference for on-call engineers and developers working with MonoTask monitoring.
## 🚨 Alert Response

### When You Receive an Alert
1. **Check severity:**
   - 🔴 CRITICAL: immediate action required (15-minute response time)
   - 🟡 WARNING: investigate within 1 hour
   - 🔵 INFO: informational only; review during business hours
2. **Access the dashboard:** https://dash.cloudflare.com/[account]/workers/analytics
3. **Find the relevant runbook:**

| Alert Type | Runbook |
|---|---|
| Error rate elevated | high-error-rate.md |
| Queue backup | queue-backup.md |
| Slow database | database-slow.md |
| Worker timeout | worker-timeout.md |
| Sandbox stuck | sandbox-stuck.md |
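The severity-to-response-time mapping and the runbook lookup above can be sketched as a small routing helper. All names here are illustrative, not part of the actual monitoring stack:

```typescript
// Hypothetical helper mapping alert severity and type to a response plan.
type Severity = 'critical' | 'warning' | 'info';

const RESPONSE_MINUTES: Record<Severity, number | null> = {
  critical: 15, // immediate action required
  warning: 60,  // investigate within 1 hour
  info: null,   // review during business hours
};

const RUNBOOKS: Record<string, string> = {
  'error-rate-elevated': 'high-error-rate.md',
  'queue-backup': 'queue-backup.md',
  'database-slow': 'database-slow.md',
  'worker-timeout': 'worker-timeout.md',
  'sandbox-stuck': 'sandbox-stuck.md',
};

function routeAlert(type: string, severity: Severity) {
  return {
    runbook: RUNBOOKS[type] ?? 'unknown - escalate to Level 2',
    respondWithinMinutes: RESPONSE_MINUTES[severity],
  };
}

console.log(routeAlert('queue-backup', 'critical'));
// e.g. { runbook: 'queue-backup.md', respondWithinMinutes: 15 }
```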
## 📋 Common Commands

### Check Worker Health
```bash
# View worker logs
wrangler tail [worker-name]

# Filter for errors only
wrangler tail [worker-name] --status error

# View a specific worker with pretty output
wrangler tail monotask-api-gateway --format pretty
```

### Check Queue Status
```bash
# List all queues
wrangler queues list

# Get stats for a specific queue
wrangler queues get agent-queue

# Check the dead-letter queue
wrangler queues get agent-dlq
```

### Database Commands
```bash
# Execute a query
wrangler d1 execute monotask-production --command "SELECT COUNT(*) FROM tasks"

# Check database size
wrangler d1 info monotask-production

# List tables
wrangler d1 execute monotask-production --command "SELECT name FROM sqlite_master WHERE type='table'"
```

### Deployment Commands
```bash
# List recent deployments
wrangler deployments list --name monotask-api-gateway

# Rollback to the previous version
wrangler rollback monotask-api-gateway

# Deploy to staging first
wrangler deploy --env staging
```

## 📊 Key Metrics
### Healthy System Indicators
| Metric | Healthy Range | Alert Threshold |
|---|---|---|
| Error Rate | < 0.5% | > 1% (warning), > 5% (critical) |
| P95 API Latency | < 150ms | > 200ms |
| P95 Task Latency | < 400ms | > 500ms |
| Queue Depth | < 50 messages | > 100 (warning), > 500 (critical) |
| Active Sandboxes | < 5 | > 20 |
| D1 Query Time (P95) | < 30ms | > 50ms (warning), > 100ms (critical) |
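Metrics with two thresholds escalate from warning to critical; a minimal sketch of how such a check could be encoded (threshold values copied from the table; the function itself is illustrative, not the actual alerting config):

```typescript
// Hypothetical threshold classifier mirroring the table above.
type Status = 'healthy' | 'warning' | 'critical';

interface Thresholds {
  warning: number;
  critical?: number; // some metrics define only a single alert threshold
}

function classify(value: number, t: Thresholds): Status {
  if (t.critical !== undefined && value > t.critical) return 'critical';
  if (value > t.warning) return 'warning';
  return 'healthy';
}

// Error rate: > 1% warning, > 5% critical
console.log(classify(0.3, { warning: 1, critical: 5 })); // 'healthy'
console.log(classify(2.0, { warning: 1, critical: 5 })); // 'warning'
console.log(classify(7.5, { warning: 1, critical: 5 })); // 'critical'
```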
### SLO Targets
- API Availability: 99.9% (43.2 min downtime/month)
- Error Budget Remaining: Should be > 20%
- Queue Processing: P95 < 5 seconds
- Database Queries: P95 < 50ms
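The 99.9% availability target implies the 43.2 minutes of monthly downtime quoted above. A quick sketch of the arithmetic, assuming a 30-day month (function names are illustrative):

```typescript
// Error budget for a 99.9% availability SLO over a 30-day month.
const SLO = 0.999;
const MINUTES_PER_MONTH = 30 * 24 * 60; // 43,200

// Total allowed downtime per month: (1 - 0.999) * 43,200 ≈ 43.2 minutes
const budgetMinutes = (1 - SLO) * MINUTES_PER_MONTH;

// Fraction of the budget remaining after some observed downtime.
function budgetRemaining(downtimeMinutes: number): number {
  return 1 - downtimeMinutes / budgetMinutes;
}

console.log(budgetMinutes);       // ≈ 43.2
console.log(budgetRemaining(30)); // ≈ 0.31 — still above the 20% floor
```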
## 🛠️ Quick Fixes

### High Error Rate
```bash
# 1. Check recent deployments
wrangler deployments list --name [worker]

# 2. Rollback if needed
wrangler rollback [worker]

# 3. Verify the fix
curl https://[worker].workers.dev/health
```

### Queue Backup
```bash
# 1. Check queue depth
wrangler queues get [queue-name]

# 2. Increase batch size (edit wrangler.toml)
max_batch_size = 20  # increase from 10

# 3. Redeploy the consumer
wrangler deploy [worker]
```

### Slow Database
```sql
-- 1. Check the query plan
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ?;

-- 2. Add an index if needed
CREATE INDEX IF NOT EXISTS idx_tasks_project ON tasks(project_id);

-- 3. Refresh query planner statistics
ANALYZE;
```

### Worker Timeout
```typescript
// 1. Add a timeout to long operations
const result = await Promise.race([
  longOperation(),
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Timeout')), 5000)
  ),
]);

// 2. Move work to a queue if it takes longer than 5s
await env.QUEUE.send({ operation: 'long-running', data });
```

## 📞 Escalation Contacts
### Level 1 - On-Call Engineer (15 min response)
- Slack: #monotask-oncall
- PagerDuty: Primary on-call rotation
- Scope: Standard incidents, runbook procedures
### Level 2 - Engineering Lead (30 min response)
- Slack: @engineering-lead
- Phone: [Contact in PagerDuty]
- Scope: Complex issues, architectural decisions
### Level 3 - CTO/VP Eng (1 hour response)
- Scope: Major outages, business-critical issues
- Criteria: error rate > 10% or downtime > 1 hour
## 🔗 Important Links
- Dashboard: https://dash.cloudflare.com/[account]/workers/analytics
- SLO Dashboard: See cloudflare-dashboard.json
- Runbooks: /monitoring/runbooks/
- Logs: Cloudflare Dashboard > Workers > Logs
- Status Page: https://status.monotask.dev (if configured)
## 📝 Incident Response Checklist
When responding to an incident:
- [ ] Acknowledge alert in PagerDuty/Slack
- [ ] Check dashboard for current metrics
- [ ] Identify affected components
- [ ] Follow relevant runbook
- [ ] Communicate status in #incidents
- [ ] Document actions taken
- [ ] Verify resolution
- [ ] Update stakeholders
- [ ] Schedule post-mortem (if severity > warning)
## 💡 Tips

### Before Deploying
```bash
# Always deploy to staging first
wrangler deploy --env staging

# Run smoke tests
curl https://[worker]-staging.workers.dev/health

# Monitor for 5-10 minutes before deploying to production
```

### During Incidents
- Don't panic: Follow the runbook step-by-step
- Communicate: Post updates every 15-30 minutes
- Document: Record all commands and observations
- Ask for help: Escalate if unsure or stuck > 30 min
### After Incidents
- Write post-mortem within 24 hours
- Update runbooks with lessons learned
- Create GitHub issues for follow-up tasks
- Share with team in weekly review
## 🎯 Performance Optimization Checklist
If performance is degrading:
- [ ] Check recent deployments for changes
- [ ] Review slow query log for new patterns
- [ ] Verify cache hit rates are normal
- [ ] Check external API response times
- [ ] Monitor resource usage (CPU, memory)
- [ ] Look for N+1 query patterns
- [ ] Verify indexes exist for common queries
- [ ] Check queue depths for backups
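For the N+1 check in particular, the usual fix is replacing a per-row query loop with a single batched query. A minimal sketch of the idea; the table and column names are illustrative:

```typescript
// N+1 anti-pattern: one database round trip per task id — avoid this.
// for (const id of taskIds) {
//   await db.prepare('SELECT * FROM tasks WHERE id = ?').bind(id).first();
// }

// Batched alternative: one query with an IN clause sized to the id list.
function buildBatchQuery(taskIds: number[]): string {
  const placeholders = taskIds.map(() => '?').join(', ');
  return `SELECT * FROM tasks WHERE id IN (${placeholders})`;
}

console.log(buildBatchQuery([1, 2, 3]));
// SELECT * FROM tasks WHERE id IN (?, ?, ?)
```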
## 🧪 Testing Alert Delivery
```bash
# Test Slack webhook
curl -X POST [SLACK_WEBHOOK_URL] \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test alert from MonoTask monitoring"}'

# Test email (if configured)
# Use your email service API

# Trigger a test alert from a worker
curl -X POST https://[worker].workers.dev/api/test-alert
```

**Keep This Guide Handy!**
- Bookmark this page
- Print for on-call rotation
- Review monthly
Questions? Ask in #sre or #monitoring
Last Updated: 2025-10-26