# Quick Reference
Quick reference for on-call engineers and developers working with MonoTask monitoring.
## 🚨 Alert Response

### When You Receive an Alert
1. **Check severity:**
   - 🔴 CRITICAL: immediate action required (15-minute response time)
   - 🟡 WARNING: investigate within 1 hour
   - 🔵 INFO: informational only; review during business hours
2. **Access the dashboard:** https://dash.cloudflare.com/[account]/workers/analytics
3. **Find the relevant runbook:**

| Alert Type | Runbook |
|---|---|
| Error rate elevated | high-error-rate.md |
| Queue backup | queue-backup.md |
| Slow database | database-slow.md |
| Worker timeout | worker-timeout.md |
| Sandbox stuck | sandbox-stuck.md |
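The severity-to-response-time mapping and the runbook lookup above can be sketched as a small routing helper. All names here are illustrative, not part of the actual monitoring stack:

```typescript
// Hypothetical helper mapping alert severity and type to a response plan.
type Severity = 'critical' | 'warning' | 'info';

const RESPONSE_MINUTES: Record<Severity, number | null> = {
  critical: 15, // immediate action required
  warning: 60,  // investigate within 1 hour
  info: null,   // review during business hours
};

const RUNBOOKS: Record<string, string> = {
  'error-rate-elevated': 'high-error-rate.md',
  'queue-backup': 'queue-backup.md',
  'database-slow': 'database-slow.md',
  'worker-timeout': 'worker-timeout.md',
  'sandbox-stuck': 'sandbox-stuck.md',
};

function routeAlert(type: string, severity: Severity) {
  return {
    runbook: RUNBOOKS[type] ?? 'unknown - escalate to Level 2',
    respondWithinMinutes: RESPONSE_MINUTES[severity],
  };
}

console.log(routeAlert('queue-backup', 'critical'));
// e.g. { runbook: 'queue-backup.md', respondWithinMinutes: 15 }
```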
## 📋 Common Commands

### Check Worker Health
```bash
# View worker logs
wrangler tail [worker-name]

# Filter for errors only
wrangler tail [worker-name] --status error

# View a specific worker with pretty output
wrangler tail monotask-api-gateway --format pretty
```

### Check Queue Status
```bash
# List all queues
wrangler queues list

# Get stats for a specific queue
wrangler queues get agent-queue

# Check the dead-letter queue
wrangler queues get agent-dlq
```

### Database Commands
```bash
# Execute a query
wrangler d1 execute monotask-production --command "SELECT COUNT(*) FROM tasks"

# Check database size
wrangler d1 info monotask-production

# List tables
wrangler d1 execute monotask-production --command "SELECT name FROM sqlite_master WHERE type='table'"
```

### Deployment Commands
```bash
# List recent deployments
wrangler deployments list --name monotask-api-gateway

# Rollback to the previous version
wrangler rollback monotask-api-gateway

# Deploy to staging first
wrangler deploy --env staging
```

## 📊 Key Metrics
### Healthy System Indicators
| Metric | Healthy Range | Alert Threshold |
|---|---|---|
| Error Rate | < 0.5% | > 1% (warning), > 5% (critical) |
| P95 API Latency | < 150ms | > 200ms |
| P95 Task Latency | < 400ms | > 500ms |
| Queue Depth | < 50 messages | > 100 (warning), > 500 (critical) |
| Active Sandboxes | < 5 | > 20 |
| D1 Query Time (P95) | < 30ms | > 50ms (warning), > 100ms (critical) |
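Metrics with two thresholds escalate from warning to critical; a minimal sketch of how such a check could be encoded (threshold values copied from the table; the function itself is illustrative, not the actual alerting config):

```typescript
// Hypothetical threshold classifier mirroring the table above.
type Status = 'healthy' | 'warning' | 'critical';

interface Thresholds {
  warning: number;
  critical?: number; // some metrics define only a single alert threshold
}

function classify(value: number, t: Thresholds): Status {
  if (t.critical !== undefined && value > t.critical) return 'critical';
  if (value > t.warning) return 'warning';
  return 'healthy';
}

// Error rate: > 1% warning, > 5% critical
console.log(classify(0.3, { warning: 1, critical: 5 })); // 'healthy'
console.log(classify(2.0, { warning: 1, critical: 5 })); // 'warning'
console.log(classify(7.5, { warning: 1, critical: 5 })); // 'critical'
```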
### SLO Targets
- API Availability: 99.9% (43.2 min downtime/month)
- Error Budget Remaining: Should be > 20%
- Queue Processing: P95 < 5 seconds
- Database Queries: P95 < 50ms
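The 99.9% availability target implies the 43.2 minutes of monthly downtime quoted above. A quick sketch of the arithmetic, assuming a 30-day month (function names are illustrative):

```typescript
// Error budget for a 99.9% availability SLO over a 30-day month.
const SLO = 0.999;
const MINUTES_PER_MONTH = 30 * 24 * 60; // 43,200

// Total allowed downtime per month: (1 - 0.999) * 43,200 ≈ 43.2 minutes
const budgetMinutes = (1 - SLO) * MINUTES_PER_MONTH;

// Fraction of the budget remaining after some observed downtime.
function budgetRemaining(downtimeMinutes: number): number {
  return 1 - downtimeMinutes / budgetMinutes;
}

console.log(budgetMinutes);       // ≈ 43.2
console.log(budgetRemaining(30)); // ≈ 0.31 — still above the 20% floor
```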
## 🛠️ Quick Fixes

### High Error Rate
```bash
# 1. Check recent deployments
wrangler deployments list --name [worker]

# 2. Rollback if needed
wrangler rollback [worker]

# 3. Verify the fix
curl https://[worker].workers.dev/health
```

### Queue Backup
```bash
# 1. Check queue depth
wrangler queues get [queue-name]

# 2. Increase batch size (edit wrangler.toml)
max_batch_size = 20  # increase from 10

# 3. Redeploy the consumer
wrangler deploy [worker]
```

### Slow Database
```sql
-- 1. Check the query plan
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ?;

-- 2. Add an index if needed
CREATE INDEX IF NOT EXISTS idx_tasks_project ON tasks(project_id);

-- 3. Refresh query planner statistics
ANALYZE;
```

### Worker Timeout
```typescript
// 1. Add a timeout to long operations
const result = await Promise.race([
  longOperation(),
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Timeout')), 5000)
  ),
]);

// 2. Move work to a queue if it takes longer than 5s
await env.QUEUE.send({ operation: 'long-running', data });
```

## 📞 Escalation Contacts
### Level 1 - On-Call Engineer (15 min response)
- Slack: #monotask-oncall
- PagerDuty: Primary on-call rotation
- Scope: Standard incidents, runbook procedures
### Level 2 - Engineering Lead (30 min response)
- Slack: @engineering-lead
- Phone: [Contact in PagerDuty]
- Scope: Complex issues, architectural decisions
### Level 3 - CTO/VP Eng (1 hour response)
- Scope: Major outages, business-critical issues
- Criteria: error rate > 10% or downtime > 1 hour
## 🔗 Important Links
- Dashboard: https://dash.cloudflare.com/[account]/workers/analytics
- SLO Dashboard: See cloudflare-dashboard.json
- Runbooks: /monitoring/runbooks/
- Logs: Cloudflare Dashboard > Workers > Logs
- Status Page: https://status.monotask.dev (if configured)
## 📝 Incident Response Checklist
When responding to an incident:
- [ ] Acknowledge alert in PagerDuty/Slack
- [ ] Check dashboard for current metrics
- [ ] Identify affected components
- [ ] Follow relevant runbook
- [ ] Communicate status in #incidents
- [ ] Document actions taken
- [ ] Verify resolution
- [ ] Update stakeholders
- [ ] Schedule post-mortem (if severity > warning)
## 💡 Tips

### Before Deploying
```bash
# Always deploy to staging first
wrangler deploy --env staging

# Run smoke tests
curl https://[worker]-staging.workers.dev/health

# Monitor for 5-10 minutes before deploying to production
```

### During Incidents
- Don't panic: Follow the runbook step-by-step
- Communicate: Post updates every 15-30 minutes
- Document: Record all commands and observations
- Ask for help: Escalate if unsure or stuck > 30 min
### After Incidents
- Write post-mortem within 24 hours
- Update runbooks with lessons learned
- Create GitHub issues for follow-up tasks
- Share with team in weekly review
## 🎯 Performance Optimization Checklist
If performance is degrading:
- [ ] Check recent deployments for changes
- [ ] Review slow query log for new patterns
- [ ] Verify cache hit rates are normal
- [ ] Check external API response times
- [ ] Monitor resource usage (CPU, memory)
- [ ] Look for N+1 query patterns
- [ ] Verify indexes exist for common queries
- [ ] Check queue depths for backups
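For the N+1 check in particular, the usual fix is replacing a per-row query loop with a single batched query. A minimal sketch of the idea; the table and column names are illustrative:

```typescript
// N+1 anti-pattern: one database round trip per task id — avoid this.
// for (const id of taskIds) {
//   await db.prepare('SELECT * FROM tasks WHERE id = ?').bind(id).first();
// }

// Batched alternative: one query with an IN clause sized to the id list.
function buildBatchQuery(taskIds: number[]): string {
  const placeholders = taskIds.map(() => '?').join(', ');
  return `SELECT * FROM tasks WHERE id IN (${placeholders})`;
}

console.log(buildBatchQuery([1, 2, 3]));
// SELECT * FROM tasks WHERE id IN (?, ?, ?)
```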
## 🧪 Testing Alert Delivery
```bash
# Test Slack webhook
curl -X POST [SLACK_WEBHOOK_URL] \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test alert from MonoTask monitoring"}'

# Test email (if configured)
# Use your email service API

# Trigger a test alert from a worker
curl -X POST https://[worker].workers.dev/api/test-alert
```

**Keep This Guide Handy!**
- Bookmark this page
- Print for on-call rotation
- Review monthly
Questions? Ask in #sre or #monitoring
Last Updated: 2025-10-26