MonoTask Monitoring and Alerting System

This directory contains the monitoring and alerting configuration, dashboard definitions, SLO definitions, and operational runbooks for the MonoTask Cloudflare Workers infrastructure.

📁 Directory Structure

```
monitoring/
├── cloudflare-dashboard.json  # Dashboard configuration for Cloudflare Analytics
├── slos.yaml                  # Service Level Indicators and Objectives
└── runbooks/                  # Operational runbooks for incident response
    ├── high-error-rate.md
    ├── queue-backup.md
    ├── database-slow.md
    ├── worker-timeout.md
    └── sandbox-stuck.md
```

🎯 Overview

Monitoring Infrastructure

The monitoring system provides:

  • Error Tracking & Alerting: Automatic error categorization, alerting, and notification routing
  • Performance Monitoring: Request duration, database queries, external API latency tracking
  • Resource Monitoring: CPU, memory, D1/R2/KV operation tracking
  • Queue Monitoring: Queue depth, processing time, DLQ, retry rates
  • Sandbox Monitoring: Active sandboxes, provision time, timeout tracking

Key Components

1. Analytics Configuration (wrangler.toml)

All workers are configured with:

  • Analytics Engine: Custom metrics collection
  • Logpush: Structured logs sent to R2 for retention
  • Sampling Rate: 10% of requests sampled for detailed metrics

2. Error Alerting System (packages/cloudflare-workers/monitoring/error-alerter.ts)

Features:

  • Automatic error severity classification (CRITICAL, WARNING, INFO)
  • Multi-channel alert routing (Email, Slack, PagerDuty)
  • Alert deduplication to prevent alert fatigue
  • Contextual error tracking with stack traces
  • Configurable alert thresholds per worker

Usage:

```typescript
import { createErrorAlerter } from '@monotask/monitoring';

const alerter = createErrorAlerter(env, 'api-gateway');
const error = new Error('Database connection failed');
const monitoringError = alerter.createErrorContext(error, request, {
  userId: 'user_123',
  endpoint: '/api/tasks',
});
await alerter.sendAlert(monitoringError);
```

3. Performance Tracking (packages/cloudflare-workers/monitoring/performance-tracker.ts)

Capabilities:

  • Request duration tracking with P50/P95/P99 percentiles
  • Database query performance monitoring
  • External API latency tracking (GitHub, Claude APIs)
  • Queue processing time measurement
  • Automatic slow request detection

Usage:

```typescript
import { createPerformanceTracker } from '@monotask/monitoring';

const tracker = createPerformanceTracker(env, 'task-worker');

tracker.startRequest();

// Track database query
await tracker.wrapDbQuery(() => db.query('SELECT * FROM tasks'));

// Track external API
await tracker.wrapExternalApi(() => fetch('https://api.github.com'));

await tracker.endRequest(request, response, requestId);
```

4. Middleware

Error Tracking Middleware:

```typescript
import { createErrorTracker } from '@monotask/monitoring/middleware/error-tracker';

const errorTracker = createErrorTracker(env, {
  workerName: 'api-gateway',
  captureStackTraces: true,
});

try {
  // Your handler code
} catch (error) {
  return await errorTracker.onError(error, request, { requestId });
}
```

Performance Middleware:

```typescript
import { createPerformanceMiddleware } from '@monotask/monitoring/middleware/performance-middleware';

const perfMiddleware = createPerformanceMiddleware(env, {
  workerName: 'api-gateway',
});

return await perfMiddleware.trackPerformance(request, async (tracker) => {
  // Handler receives tracker for custom metrics
  const result = await handleRequest(request, tracker);
  return new Response(JSON.stringify(result));
});
```

📊 Dashboard Configuration

The cloudflare-dashboard.json defines a comprehensive monitoring dashboard with the following sections:

1. Worker Health

  • Uptime Percentage: 24-hour rolling uptime with 99.9% SLO line
  • Error Rates: Errors/min by worker and severity
  • Active Requests: Current request load gauge
  • Request Rate: Requests per second with sparkline

2. Queue Metrics

  • Queue Depth: Messages in queue by queue name
  • Processing Rate: Messages processed per second
  • Processing Time Distribution: Histogram of processing durations
  • DLQ Count: Dead letter queue message accumulation
  • Retry Rate: Percentage of messages being retried

3. Performance Metrics

  • P50/P95/P99 Latency: Response time percentiles by endpoint
  • Database Query Time: Average D1 query duration
  • External API Latency: P95 latency for GitHub and Claude APIs
  • Cache Hit Rate: KV cache effectiveness

4. Resource Usage

  • CPU Time: CPU milliseconds consumed by worker
  • Memory Usage: Average memory consumption
  • D1/R2/KV Operation Counts: Storage operation rates

5. Sandbox Metrics

  • Active Sandboxes: Current count with capacity alerts
  • Provision Time: Histogram of sandbox startup duration
  • Timeout Frequency: Sandbox timeout rate over time
  • Resource Utilization: Sandbox resource usage percentage
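
These panels are fed by the Analytics Engine dataset that the workers write to (see Metrics Collection below). As a rough illustration, the sketch below pulls per-worker request volume and latency through the Analytics Engine SQL API; the account ID, API token, dataset name (monotask_metrics), and column mapping are placeholders and assumptions, not values taken from this repository.

```typescript
// Hedged sketch: query the Workers Analytics Engine SQL API for dashboard data.
// blob2 = worker name and double1 = request duration, matching the
// writeDataPoint() layout shown under "Metrics Collection" below.
const ACCOUNT_ID = '<cloudflare-account-id>'; // placeholder
const API_TOKEN = '<analytics-read-token>';   // placeholder

const query = `
  SELECT blob2 AS worker,
         sum(_sample_interval) AS requests,
         avg(double1) AS avg_duration_ms
  FROM monotask_metrics
  WHERE timestamp > NOW() - INTERVAL '1' HOUR
  GROUP BY worker
  FORMAT JSON
`;

const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/analytics_engine/sql`,
  {
    method: 'POST',
    headers: { Authorization: `Bearer ${API_TOKEN}` },
    body: query,
  },
);
const { data } = (await response.json()) as { data: Record<string, string | number>[] };
```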

🎯 Service Level Objectives (SLOs)

Defined in slos.yaml, our SLOs establish performance and reliability targets:

Availability SLOs

  • API Availability: 99.9% uptime (30-day window)
  • Database Availability: 99.95% query success rate

Latency SLOs

  • API Gateway: P95 < 200ms
  • Task Operations: P95 < 500ms
  • Agent Execution: P95 < 30s, P99 < 60s
  • Queue Processing: P95 < 5s
  • D1 Queries: P95 < 50ms, P99 < 100ms

Error Rate SLOs

  • Overall Error Rate: < 1% (1-hour window)
  • Queue Success Rate: > 99% (24-hour window)

Resource SLOs

  • Sandbox Provision Time: P95 < 2s
  • Sandbox Timeout Rate: < 2% (24-hour window)

Error Budget

  • Monthly Budget: 43.2 minutes downtime (0.1%)
  • Burn Rate Alerts: Alert if burning > 10x normal rate
  • Budget Exhaustion: Alert when < 10% remaining
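
Burn rate compares the observed error ratio to the ratio the error budget allows. A minimal sketch of that arithmetic (constant and helper names are illustrative, not taken from the codebase):

```typescript
// Error budget math behind the burn-rate alerts above.
const SLO_TARGET = 0.999;            // 99.9% availability
const ERROR_BUDGET = 1 - SLO_TARGET; // 0.1% of requests may fail

// 30-day budget expressed as downtime: 30 * 24 * 60 * 0.001 = 43.2 minutes
const monthlyBudgetMinutes = 30 * 24 * 60 * ERROR_BUDGET;

function burnRate(failedRequests: number, totalRequests: number): number {
  const observedErrorRatio = failedRequests / totalRequests;
  // 1.0 means the budget is consumed at exactly the sustainable pace;
  // >10 means a 30-day budget would be exhausted in under 3 days.
  return observedErrorRatio / ERROR_BUDGET;
}

const shouldPage = burnRate(1_200, 100_000) > 10; // 1.2% errors -> 12x burn rate
```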

📖 Operational Runbooks

Comprehensive step-by-step guides for common incident scenarios:

1. High Error Rate (high-error-rate.md)

When to use: Error rate > 1% sustained for 5+ minutes

Covers:

  • Error type classification and investigation
  • Database, code, external API, and rate limiting issues
  • Rollback procedures
  • Circuit breaker implementation
  • Escalation paths

2. Queue Backup (queue-backup.md)

When to use: Queue depth > 100 messages sustained

Covers:

  • Queue congestion detection and analysis
  • Consumer performance tuning
  • Scaling worker capacity
  • DLQ processing
  • Poison message handling

3. Database Slow Queries (database-slow.md)

When to use: P95 query latency > 100ms

Covers:

  • Query performance analysis
  • Missing index detection and creation
  • N+1 query problem resolution
  • Query optimization techniques
  • Write lock contention handling

4. Worker Timeout (worker-timeout.md)

When to use: Timeout rate > 2% or consistent 504 errors

Covers:

  • Timeout pattern identification
  • CPU time profiling
  • External API timeout handling
  • Queue offloading strategies
  • Code optimization techniques

5. Sandbox Stuck (sandbox-stuck.md)

When to use: Active sandboxes > 20 or timeout rate > 5%

Covers:

  • Sandbox lifecycle debugging
  • Infinite loop detection
  • Manual and bulk cleanup procedures
  • State reconciliation
  • Resource exhaustion handling

🚨 Alert Channels

Alerts are routed based on severity:

Critical Alerts

  • Channels: Email + Slack + PagerDuty
  • Examples:
    • Error rate > 5%
    • API availability < 99%
    • Queue depth > 500
    • Database errors > 5%

Warning Alerts

  • Channels: Slack
  • Examples:
    • Error rate > 1%
    • P95 latency exceeds SLO
    • Queue depth > 100
    • Slow database queries

Info Alerts

  • Channels: Logged only
  • Examples:
    • Validation errors
    • Rate limit responses
    • Individual request failures
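
A hedged sketch of how severity-to-channel routing can look; the type and channel names below are illustrative rather than the error-alerter's actual API:

```typescript
// Map alert severity to notification channels, mirroring the routing above.
type AlertSeverity = 'CRITICAL' | 'WARNING' | 'INFO';
type AlertChannel = 'email' | 'slack' | 'pagerduty' | 'log';

function channelsFor(severity: AlertSeverity): AlertChannel[] {
  switch (severity) {
    case 'CRITICAL':
      return ['email', 'slack', 'pagerduty'];
    case 'WARNING':
      return ['slack'];
    case 'INFO':
    default:
      return ['log'];
  }
}
```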

🔧 Configuration

Environment Variables

Required in all worker wrangler.toml files:

```toml
[vars]
LOG_LEVEL = "info"                # info, warn, error
METRICS_SAMPLING_RATE = "0.1"     # 10% sampling
SLACK_WEBHOOK_URL = "https://..."  # For Slack alerts
ALERT_EMAIL = "[email protected]"
```
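
The monitoring helpers read these values, together with the storage bindings, from the worker's Env. A hedged sketch of the expected shape, assuming @cloudflare/workers-types; ALERTS_KV is an assumed binding name for the deduplication namespace described below:

```typescript
// Sketch of the Env shape the monitoring helpers expect. ANALYTICS matches the
// binding used under "Metrics Collection"; ALERTS_KV is an assumed name.
interface Env {
  ANALYTICS: AnalyticsEngineDataset;   // Analytics Engine binding
  ALERTS_KV: KVNamespace;              // alert deduplication state (assumed name)
  LOG_LEVEL: 'info' | 'warn' | 'error';
  METRICS_SAMPLING_RATE: string;       // e.g. "0.1" for 10% sampling
  SLACK_WEBHOOK_URL: string;
  ALERT_EMAIL: string;
}
```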

Alert Deduplication

Alerts are deduplicated with a 1-hour cooldown period using KV storage:

```typescript
// Deduplication key format
const dedupeKey = `alert:${ruleName}:${workerName}`;

// Cooldown: 3600 seconds (1 hour)
```
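
A minimal sketch of how that cooldown can be enforced with a KV read followed by a write with expirationTtl; the ALERTS_KV binding name and helper signature are assumptions:

```typescript
// Returns true if no alert for this rule/worker pair fired within the last hour.
async function shouldSendAlert(env: Env, ruleName: string, workerName: string): Promise<boolean> {
  const key = `alert:${ruleName}:${workerName}`;
  if (await env.ALERTS_KV.get(key)) {
    return false; // alert already fired within the cooldown window
  }
  // The entry expires automatically after 3600s, re-arming the alert.
  await env.ALERTS_KV.put(key, String(Date.now()), { expirationTtl: 3600 });
  return true;
}
```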

📈 Metrics Collection

Analytics Engine Datasets

All workers write to the ANALYTICS binding:

```typescript
env.ANALYTICS.writeDataPoint({
  blobs: [requestId, workerName, endpoint, method, String(statusCode)],
  doubles: [durationMs, dbQueryTimeMs, externalApiTimeMs, queueProcessingTimeMs],
  // Analytics Engine currently accepts a single index per data point
  indexes: [workerName],
});
```

Sampling Strategy

  • Default: 10% of requests sampled for detailed metrics
  • Always Sampled:
    • Errors (status >= 400)
    • Slow requests (above SLO threshold)
    • Critical operations (agent execution, task state transitions)
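
A hedged sketch of that sampling decision; the slow-request threshold constant and function name are illustrative:

```typescript
// Decide whether a request's metrics should be recorded in full detail.
const SLOW_REQUEST_THRESHOLD_MS = 200; // assumed P95 SLO for the API gateway

function shouldSample(env: Env, status: number, durationMs: number, critical: boolean): boolean {
  if (status >= 400) return true;                           // always keep errors
  if (durationMs > SLOW_REQUEST_THRESHOLD_MS) return true;  // always keep slow requests
  if (critical) return true;                                // agent runs, task state transitions
  return Math.random() < Number(env.METRICS_SAMPLING_RATE); // default 10%
}
```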

Retention

  • Live Metrics: 30 days in Analytics Engine
  • Logs: Retained in R2 via Logpush (90 days)
  • Aggregated Metrics: Permanent retention in dashboard

🎓 Best Practices

1. Error Handling

```typescript
// Always use try-catch with error tracking
try {
  await operation();
} catch (error) {
  await errorTracker.trackError(error, request, {
    operation: 'operation_name',
    userId,
    context: additionalData,
  });
  throw error;
}
```

2. Performance Tracking

```typescript
// Track critical operations
const tracker = createPerformanceTracker(env, workerName);

tracker.startRequest();
await tracker.wrapDbQuery(() => dbOperation());
await tracker.wrapExternalApi(() => apiCall());
await tracker.endRequest(request, response, requestId);
```

3. Custom Metrics

```typescript
// Track business metrics
await tracker.trackCustomMetric(
  'tasks_completed',
  1,
  'count',
  { project_id: projectId, state: 'completed' }
);
```

4. Structured Logging

```typescript
// Use structured logs for better querying
console.log(JSON.stringify({
  level: 'info',
  timestamp: Date.now(),
  workerName: 'api-gateway',
  requestId,
  message: 'Task completed',
  metadata: { taskId, duration, state },
}));
```

📞 Support and Escalation

On-Call Rotation

  • Slack: #monotask-oncall
  • PagerDuty: Automatic escalation for critical alerts

Escalation Levels

Level 1 - On-Call Engineer (15 min response time)

  • Initial investigation
  • Standard runbook procedures
  • Most incidents resolved at this level

Level 2 - Team Lead (30 min response time)

  • Complex issues requiring architectural decisions
  • Cross-team coordination
  • Capacity planning

Level 3 - Engineering Management (1 hour response time)

  • Major outages
  • Customer-impacting issues
  • Business-critical escalations

🔄 Maintenance

Weekly Tasks

  • Review SLO compliance reports
  • Check error budget consumption
  • Update dashboard for new metrics
  • Review and acknowledge alerts

Monthly Tasks

  • Run recovery drill (test backup/restore)
  • Review and update runbooks
  • Audit slow queries and add indexes
  • Optimize monitoring costs

Quarterly Tasks

  • SLO review and adjustment
  • Dashboard redesign based on usage
  • Alert threshold tuning
  • Runbook effectiveness review

Last Updated: 2025-10-26
Owner: SRE Team
Reviewers: Engineering Team, Product Team
