
# Quick Reference

Quick reference for on-call engineers and developers working with MonoTask monitoring.

## 🚨 Alert Response

### When You Receive an Alert

1. Check Severity:
   - 🔴 CRITICAL: Immediate action required (15 min response time)
   - 🟡 WARNING: Investigate within 1 hour
   - 🔵 INFO: Informational only, review during business hours
2. Access Dashboard:

   https://dash.cloudflare.com/[account]/workers/analytics

3. Find Relevant Runbook:

   | Alert Type | Runbook |
   | --- | --- |
   | Error rate elevated | high-error-rate.md |
   | Queue backup | queue-backup.md |
   | Slow database | database-slow.md |
   | Worker timeout | worker-timeout.md |
   | Sandbox stuck | sandbox-stuck.md |

๐Ÿ” Common Commands โ€‹

Check Worker Health โ€‹

bash
# View worker logs
wrangler tail [worker-name]

# Filter for errors only
wrangler tail [worker-name] --status error

# View specific worker
wrangler tail monotask-api-gateway --format pretty

### Check Queue Status

```bash
# List all queues
wrangler queues list

# Get stats for a specific queue
wrangler queues get agent-queue

# Check the dead-letter queue (DLQ)
wrangler queues get agent-dlq
```

### Database Commands

```bash
# Execute a query
wrangler d1 execute monotask-production --command "SELECT COUNT(*) FROM tasks"

# Check database size
wrangler d1 info monotask-production

# List tables
wrangler d1 execute monotask-production --command "SELECT name FROM sqlite_master WHERE type='table'"
```

### Deployment Commands

```bash
# List recent deployments
wrangler deployments list --name monotask-api-gateway

# Roll back to the previous version
wrangler rollback monotask-api-gateway

# Deploy to staging first
wrangler deploy --env staging
```

## 📊 Key Metrics

### Healthy System Indicators

| Metric | Healthy Range | Alert Threshold |
| --- | --- | --- |
| Error Rate | < 0.5% | > 1% (warning), > 5% (critical) |
| P95 API Latency | < 150ms | > 200ms |
| P95 Task Latency | < 400ms | > 500ms |
| Queue Depth | < 50 messages | > 100 (warning), > 500 (critical) |
| Active Sandboxes | < 5 | > 20 |
| D1 Query Time (P95) | < 30ms | > 50ms (warning), > 100ms (critical) |
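The warning/critical split above can be expressed as a tiny helper. This is a hypothetical sketch (not an existing MonoTask function); the example values come from the Error Rate row of the table:

```typescript
type Severity = "ok" | "warning" | "critical";

// Classify a metric reading against its warning and critical thresholds.
// Thresholds mirror the table above; the helper itself is illustrative.
function classify(value: number, warn: number, crit: number): Severity {
  if (value > crit) return "critical";
  if (value > warn) return "warning";
  return "ok";
}

// Example: a 0.7% error rate is below the 1% warning threshold.
const errorRateStatus = classify(0.7, 1, 5); // "ok"
```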

### SLO Targets

- API Availability: 99.9% (43.2 min downtime/month)
- Error Budget Remaining: should stay above 20%
- Queue Processing: P95 < 5 seconds
- Database Queries: P95 < 50ms
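The 43.2-minute figure follows directly from the 99.9% target over a 30-day month. A sketch of the arithmetic (hypothetical helpers, not part of the MonoTask tooling):

```typescript
// Convert an availability SLO into a monthly downtime budget (30-day month).
function downtimeBudgetMinutes(sloPercent: number, daysInMonth = 30): number {
  const totalMinutes = daysInMonth * 24 * 60; // 43,200 minutes
  return totalMinutes * (1 - sloPercent / 100);
}

// Percent of the error budget still unspent, given observed availability.
function errorBudgetRemaining(sloPercent: number, observed: number): number {
  const budget = 100 - sloPercent; // e.g. 0.1 for a 99.9% SLO
  const spent = 100 - observed;    // unavailability actually observed
  return Math.max(0, (1 - spent / budget) * 100);
}

// 99.9% over 30 days leaves 43.2 minutes of allowed downtime.
console.log(downtimeBudgetMinutes(99.9)); // ≈ 43.2
```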

๐Ÿ› ๏ธ Quick Fixes โ€‹

High Error Rate โ€‹

bash
# 1. Check recent deployments
wrangler deployments list --name [worker]

# 2. Rollback if needed
wrangler rollback [worker]

# 3. Verify fix
curl https://[worker].workers.dev/health

### Queue Backup

```bash
# 1. Check queue depth
wrangler queues get [queue-name]

# 2. Increase the batch size in wrangler.toml:
#      max_batch_size = 20   # up from 10

# 3. Deploy the updated configuration
wrangler deploy [worker]
```
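For reference, the batch-size setting lives in the queue consumer block of wrangler.toml. A sketch with placeholder values; field names follow Cloudflare's queue consumer configuration, but the queue names here are taken from the commands above and the other values are illustrative:

```toml
# Queue consumer configuration in wrangler.toml (illustrative values)
[[queues.consumers]]
queue = "agent-queue"
max_batch_size = 20        # raised from 10 to drain the backlog faster
max_batch_timeout = 5      # seconds to wait before flushing a partial batch
max_retries = 3
dead_letter_queue = "agent-dlq"
```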

### Slow Database

```sql
-- 1. Check the query plan
EXPLAIN QUERY PLAN
SELECT * FROM tasks WHERE project_id = ?;

-- 2. Add an index if needed
CREATE INDEX IF NOT EXISTS idx_tasks_project ON tasks(project_id);

-- 3. Refresh the query planner statistics
ANALYZE;
```

### Worker Timeout

```typescript
// 1. Add a timeout to long operations
const result = await Promise.race([
  longOperation(),
  new Promise((_, reject) =>
    setTimeout(() => reject(new Error('Timeout')), 5000)
  ),
]);

// 2. Move the work to a queue if it takes > 5s
await env.QUEUE.send({ operation: 'long-running', data });
```
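One subtlety with the bare `Promise.race` pattern: the timer keeps running even after `longOperation()` settles. A reusable variant that clears it — a sketch, not an existing helper in this codebase:

```typescript
// Wrap any promise with a timeout, clearing the timer once the promise
// settles so the timeout callback never fires after a normal completion.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timeout after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```

Usage: `const result = await withTimeout(longOperation(), 5000);`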

## 📞 Escalation Contacts

### Level 1 - On-Call Engineer (15 min response)

- Slack: #monotask-oncall
- PagerDuty: Primary on-call rotation
- Scope: Standard incidents, runbook procedures

### Level 2 - Engineering Lead (30 min response)

- Slack: @engineering-lead
- Phone: [Contact in PagerDuty]
- Scope: Complex issues, architectural decisions

### Level 3 - CTO/VP Eng (1 hour response)

- Scope: Major outages, business-critical issues
- Criteria: Error rate > 10%, downtime > 1 hour

๐Ÿ“ Incident Response Checklist โ€‹

When responding to an incident:

  • [ ] Acknowledge alert in PagerDuty/Slack
  • [ ] Check dashboard for current metrics
  • [ ] Identify affected components
  • [ ] Follow relevant runbook
  • [ ] Communicate status in #incidents
  • [ ] Document actions taken
  • [ ] Verify resolution
  • [ ] Update stakeholders
  • [ ] Schedule post-mortem (if severity > warning)

## 💡 Tips

### Before Deploying

```bash
# Always deploy to staging first
wrangler deploy --env staging

# Run smoke tests
curl https://[worker]-staging.workers.dev/health

# Monitor for 5-10 minutes before deploying to production
```

### During Incidents

- Don't panic: follow the runbook step by step
- Communicate: post updates every 15-30 minutes
- Document: record all commands and observations
- Ask for help: escalate if unsure or stuck for more than 30 minutes

### After Incidents

- Write a post-mortem within 24 hours
- Update runbooks with lessons learned
- Create GitHub issues for follow-up tasks
- Share findings with the team in the weekly review

## 🎯 Performance Optimization Checklist

If performance is degrading:

- [ ] Check recent deployments for changes
- [ ] Review the slow query log for new patterns
- [ ] Verify cache hit rates are normal
- [ ] Check external API response times
- [ ] Monitor resource usage (CPU, memory)
- [ ] Look for N+1 query patterns
- [ ] Verify indexes exist for common queries
- [ ] Check queue depths for backups

## 🧪 Testing Alert Delivery

```bash
# Test the Slack webhook
curl -X POST [SLACK_WEBHOOK_URL] \
  -H 'Content-Type: application/json' \
  -d '{"text": "Test alert from MonoTask monitoring"}'

# Test email (if configured) via your email service's API

# Trigger a test alert from the worker
curl -X POST https://[worker].workers.dev/api/test-alert
```
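The Slack test can also be scripted instead of run through curl. A sketch assuming Node 18+ (built-in `fetch`); the webhook URL remains a placeholder and these helpers are not part of the MonoTask codebase:

```typescript
// Build the JSON request the Slack incoming webhook expects.
function buildTestAlert(text: string) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  };
}

// Send the test alert; returns the HTTP status (Slack replies 200 on success).
async function sendTestAlert(webhookUrl: string): Promise<number> {
  const res = await fetch(webhookUrl, buildTestAlert("Test alert from MonoTask monitoring"));
  return res.status;
}
```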

Keep This Guide Handy!

- Bookmark this page
- Print it for the on-call rotation
- Review it monthly

Questions? Ask in #sre or #monitoring

Last Updated: 2025-10-26

MonoKernel MonoTask Documentation