MonoTask Monitoring and Alerting System

This directory contains the monitoring and alerting configuration, dashboard definitions, SLO definitions, and operational runbooks for the MonoTask Cloudflare Workers infrastructure.

📁 Directory Structure

```
monitoring/
├── cloudflare-dashboard.json  # Dashboard configuration for Cloudflare Analytics
├── slos.yaml                  # Service Level Indicators and Objectives
└── runbooks/                  # Operational runbooks for incident response
    ├── high-error-rate.md
    ├── queue-backup.md
    ├── database-slow.md
    ├── worker-timeout.md
    └── sandbox-stuck.md
```

🎯 Overview

Monitoring Infrastructure

The monitoring system provides:

  • Error Tracking & Alerting: Automatic error categorization, alerting, and notification routing
  • Performance Monitoring: Request duration, database queries, external API latency tracking
  • Resource Monitoring: CPU, memory, D1/R2/KV operation tracking
  • Queue Monitoring: Queue depth, processing time, DLQ, retry rates
  • Sandbox Monitoring: Active sandboxes, provision time, timeout tracking

Key Components

1. Analytics Configuration (wrangler.toml)

All workers are configured with:

  • Analytics Engine: Custom metrics collection
  • Logpush: Structured logs sent to R2 for retention
  • Sampling Rate: 10% of requests sampled for detailed metrics

2. Error Alerting System (packages/cloudflare-workers/monitoring/error-alerter.ts)

Features:

  • Automatic error severity classification (CRITICAL, WARNING, INFO)
  • Multi-channel alert routing (Email, Slack, PagerDuty)
  • Alert deduplication to prevent alert fatigue
  • Contextual error tracking with stack traces
  • Configurable alert thresholds per worker

Usage:

```typescript
import { createErrorAlerter } from '@monotask/monitoring';

const alerter = createErrorAlerter(env, 'api-gateway');
const error = new Error('Database connection failed');
const monitoringError = alerter.createErrorContext(error, request, {
  userId: 'user_123',
  endpoint: '/api/tasks',
});
await alerter.sendAlert(monitoringError);
```

3. Performance Tracking (packages/cloudflare-workers/monitoring/performance-tracker.ts)

Capabilities:

  • Request duration tracking with P50/P95/P99 percentiles
  • Database query performance monitoring
  • External API latency tracking (GitHub, Claude APIs)
  • Queue processing time measurement
  • Automatic slow request detection

Usage:

```typescript
import { createPerformanceTracker } from '@monotask/monitoring';

const tracker = createPerformanceTracker(env, 'task-worker');

tracker.startRequest();

// Track database query
await tracker.wrapDbQuery(() => db.query('SELECT * FROM tasks'));

// Track external API
await tracker.wrapExternalApi(() => fetch('https://api.github.com'));

await tracker.endRequest(request, response, requestId);
```

4. Middleware

Error Tracking Middleware:

```typescript
import { createErrorTracker } from '@monotask/monitoring/middleware/error-tracker';

const errorTracker = createErrorTracker(env, {
  workerName: 'api-gateway',
  captureStackTraces: true,
});

try {
  // Your handler code
} catch (error) {
  return await errorTracker.onError(error, request, { requestId });
}
```

Performance Middleware:

```typescript
import { createPerformanceMiddleware } from '@monotask/monitoring/middleware/performance-middleware';

const perfMiddleware = createPerformanceMiddleware(env, {
  workerName: 'api-gateway',
});

return await perfMiddleware.trackPerformance(request, async (tracker) => {
  // Handler receives tracker for custom metrics
  const result = await handleRequest(request, tracker);
  return new Response(JSON.stringify(result));
});
```

📊 Dashboard Configuration

The cloudflare-dashboard.json defines a comprehensive monitoring dashboard with the following sections:

1. Worker Health

  • Uptime Percentage: 24-hour rolling uptime with 99.9% SLO line
  • Error Rates: Errors/min by worker and severity
  • Active Requests: Current request load gauge
  • Request Rate: Requests per second with sparkline

2. Queue Metrics

  • Queue Depth: Messages in queue by queue name
  • Processing Rate: Messages processed per second
  • Processing Time Distribution: Histogram of processing durations
  • DLQ Count: Dead letter queue message accumulation
  • Retry Rate: Percentage of messages being retried

3. Performance Metrics

  • P50/P95/P99 Latency: Response time percentiles by endpoint
  • Database Query Time: Average D1 query duration
  • External API Latency: P95 latency for GitHub and Claude APIs
  • Cache Hit Rate: KV cache effectiveness

4. Resource Usage

  • CPU Time: CPU milliseconds consumed by worker
  • Memory Usage: Average memory consumption
  • D1/R2/KV Operation Counts: Storage operation rates

5. Sandbox Metrics

  • Active Sandboxes: Current count with capacity alerts
  • Provision Time: Histogram of sandbox startup duration
  • Timeout Frequency: Sandbox timeout rate over time
  • Resource Utilization: Sandbox resource usage percentage
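
These panels are fed by the Analytics Engine dataset that the workers write to (see Metrics Collection below). As a rough illustration, the sketch below pulls per-worker request volume and latency through the Analytics Engine SQL API; the account ID, API token, dataset name (monotask_metrics), and column mapping are placeholders and assumptions, not values taken from this repository.

```typescript
// Hedged sketch: query the Workers Analytics Engine SQL API for dashboard data.
// blob2 = worker name and double1 = request duration, matching the
// writeDataPoint() layout shown under "Metrics Collection" below.
const ACCOUNT_ID = '<cloudflare-account-id>'; // placeholder
const API_TOKEN = '<analytics-read-token>';   // placeholder

const query = `
  SELECT blob2 AS worker,
         sum(_sample_interval) AS requests,
         avg(double1) AS avg_duration_ms
  FROM monotask_metrics
  WHERE timestamp > NOW() - INTERVAL '1' HOUR
  GROUP BY worker
  FORMAT JSON
`;

const response = await fetch(
  `https://api.cloudflare.com/client/v4/accounts/${ACCOUNT_ID}/analytics_engine/sql`,
  {
    method: 'POST',
    headers: { Authorization: `Bearer ${API_TOKEN}` },
    body: query,
  },
);
const { data } = (await response.json()) as { data: Record<string, string | number>[] };
```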

🎯 Service Level Objectives (SLOs)

Defined in slos.yaml, our SLOs establish performance and reliability targets:

Availability SLOs

  • API Availability: 99.9% uptime (30-day window)
  • Database Availability: 99.95% query success rate

Latency SLOs

  • API Gateway: P95 < 200ms
  • Task Operations: P95 < 500ms
  • Agent Execution: P95 < 30s, P99 < 60s
  • Queue Processing: P95 < 5s
  • D1 Queries: P95 < 50ms, P99 < 100ms

Error Rate SLOs

  • Overall Error Rate: < 1% (1-hour window)
  • Queue Success Rate: > 99% (24-hour window)

Resource SLOs

  • Sandbox Provision Time: P95 < 2s
  • Sandbox Timeout Rate: < 2% (24-hour window)

Error Budget

  • Monthly Budget: 43.2 minutes downtime (0.1%)
  • Burn Rate Alerts: Alert if burning > 10x normal rate
  • Budget Exhaustion: Alert when < 10% remaining
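
Burn rate compares the observed error ratio to the ratio the error budget allows. A minimal sketch of that arithmetic (constant and helper names are illustrative, not taken from the codebase):

```typescript
// Error budget math behind the burn-rate alerts above.
const SLO_TARGET = 0.999;            // 99.9% availability
const ERROR_BUDGET = 1 - SLO_TARGET; // 0.1% of requests may fail

// 30-day budget expressed as downtime: 30 * 24 * 60 * 0.001 = 43.2 minutes
const monthlyBudgetMinutes = 30 * 24 * 60 * ERROR_BUDGET;

function burnRate(failedRequests: number, totalRequests: number): number {
  const observedErrorRatio = failedRequests / totalRequests;
  // 1.0 means the budget is consumed at exactly the sustainable pace;
  // >10 means a 30-day budget would be exhausted in under 3 days.
  return observedErrorRatio / ERROR_BUDGET;
}

const shouldPage = burnRate(1_200, 100_000) > 10; // 1.2% errors -> 12x burn rate
```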

📖 Operational Runbooks

Comprehensive step-by-step guides for common incident scenarios:

1. High Error Rate (high-error-rate.md)

When to use: Error rate > 1% sustained for 5+ minutes

Covers:

  • Error type classification and investigation
  • Database, code, external API, and rate limiting issues
  • Rollback procedures
  • Circuit breaker implementation
  • Escalation paths

2. Queue Backup (queue-backup.md)

When to use: Queue depth > 100 messages sustained

Covers:

  • Queue congestion detection and analysis
  • Consumer performance tuning
  • Scaling worker capacity
  • DLQ processing
  • Poison message handling

3. Database Slow Queries (database-slow.md)

When to use: P95 query latency > 100ms

Covers:

  • Query performance analysis
  • Missing index detection and creation
  • N+1 query problem resolution
  • Query optimization techniques
  • Write lock contention handling

4. Worker Timeout (worker-timeout.md)

When to use: Timeout rate > 2% or consistent 504 errors

Covers:

  • Timeout pattern identification
  • CPU time profiling
  • External API timeout handling
  • Queue offloading strategies
  • Code optimization techniques

5. Sandbox Stuck (sandbox-stuck.md)

When to use: Active sandboxes > 20 or timeout rate > 5%

Covers:

  • Sandbox lifecycle debugging
  • Infinite loop detection
  • Manual and bulk cleanup procedures
  • State reconciliation
  • Resource exhaustion handling

🚨 Alert Channels

Alerts are routed based on severity:

Critical Alerts

  • Channels: Email + Slack + PagerDuty
  • Examples:
    • Error rate > 5%
    • API availability < 99%
    • Queue depth > 500
    • Database errors > 5%

Warning Alerts

  • Channels: Slack
  • Examples:
    • Error rate > 1%
    • P95 latency exceeds SLO
    • Queue depth > 100
    • Slow database queries

Info Alerts

  • Channels: Logged only
  • Examples:
    • Validation errors
    • Rate limit responses
    • Individual request failures
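
A hedged sketch of how severity-to-channel routing can look; the type and channel names below are illustrative rather than the error-alerter's actual API:

```typescript
// Map alert severity to notification channels, mirroring the routing above.
type AlertSeverity = 'CRITICAL' | 'WARNING' | 'INFO';
type AlertChannel = 'email' | 'slack' | 'pagerduty' | 'log';

function channelsFor(severity: AlertSeverity): AlertChannel[] {
  switch (severity) {
    case 'CRITICAL':
      return ['email', 'slack', 'pagerduty'];
    case 'WARNING':
      return ['slack'];
    case 'INFO':
    default:
      return ['log'];
  }
}
```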

🔧 Configuration

Environment Variables

Required in all worker wrangler.toml files:

```toml
[vars]
LOG_LEVEL = "info"                # info, warn, error
METRICS_SAMPLING_RATE = "0.1"     # 10% sampling
SLACK_WEBHOOK_URL = "https://..."  # For Slack alerts
ALERT_EMAIL = "[email protected]"
```
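
The monitoring helpers read these values, together with the storage bindings, from the worker's Env. A hedged sketch of the expected shape, assuming @cloudflare/workers-types; ALERTS_KV is an assumed binding name for the deduplication namespace described below:

```typescript
// Sketch of the Env shape the monitoring helpers expect. ANALYTICS matches the
// binding used under "Metrics Collection"; ALERTS_KV is an assumed name.
interface Env {
  ANALYTICS: AnalyticsEngineDataset;   // Analytics Engine binding
  ALERTS_KV: KVNamespace;              // alert deduplication state (assumed name)
  LOG_LEVEL: 'info' | 'warn' | 'error';
  METRICS_SAMPLING_RATE: string;       // e.g. "0.1" for 10% sampling
  SLACK_WEBHOOK_URL: string;
  ALERT_EMAIL: string;
}
```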

Alert Deduplication

Alerts are deduplicated with a 1-hour cooldown period using KV storage:

```typescript
// Deduplication key format
const dedupeKey = `alert:${ruleName}:${workerName}`;

// Cooldown: 3600 seconds (1 hour)
```
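
A minimal sketch of how that cooldown can be enforced with a KV read followed by a write with expirationTtl; the ALERTS_KV binding name and helper signature are assumptions:

```typescript
// Returns true if no alert for this rule/worker pair fired within the last hour.
async function shouldSendAlert(env: Env, ruleName: string, workerName: string): Promise<boolean> {
  const key = `alert:${ruleName}:${workerName}`;
  if (await env.ALERTS_KV.get(key)) {
    return false; // alert already fired within the cooldown window
  }
  // The entry expires automatically after 3600s, re-arming the alert.
  await env.ALERTS_KV.put(key, String(Date.now()), { expirationTtl: 3600 });
  return true;
}
```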

📈 Metrics Collection

Analytics Engine Datasets

All workers write to the ANALYTICS binding:

```typescript
env.ANALYTICS.writeDataPoint({
  blobs: [requestId, workerName, endpoint, method, String(statusCode)],
  doubles: [durationMs, dbQueryTimeMs, externalApiTimeMs, queueProcessingTimeMs],
  // Analytics Engine currently accepts a single index per data point
  indexes: [workerName],
});
```

Sampling Strategy

  • Default: 10% of requests sampled for detailed metrics
  • Always Sampled:
    • Errors (status >= 400)
    • Slow requests (above SLO threshold)
    • Critical operations (agent execution, task state transitions)
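
A hedged sketch of that sampling decision; the slow-request threshold constant and function name are illustrative:

```typescript
// Decide whether a request's metrics should be recorded in full detail.
const SLOW_REQUEST_THRESHOLD_MS = 200; // assumed P95 SLO for the API gateway

function shouldSample(env: Env, status: number, durationMs: number, critical: boolean): boolean {
  if (status >= 400) return true;                           // always keep errors
  if (durationMs > SLOW_REQUEST_THRESHOLD_MS) return true;  // always keep slow requests
  if (critical) return true;                                // agent runs, task state transitions
  return Math.random() < Number(env.METRICS_SAMPLING_RATE); // default 10%
}
```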

Retention

  • Live Metrics: 30 days in Analytics Engine
  • Logs: Retained in R2 via Logpush (90 days)
  • Aggregated Metrics: Permanent retention in dashboard

🎓 Best Practices

1. Error Handling

```typescript
// Always use try-catch with error tracking
try {
  await operation();
} catch (error) {
  await errorTracker.trackError(error, request, {
    operation: 'operation_name',
    userId,
    context: additionalData,
  });
  throw error;
}
```

2. Performance Tracking

```typescript
// Track critical operations
const tracker = createPerformanceTracker(env, workerName);

tracker.startRequest();
await tracker.wrapDbQuery(() => dbOperation());
await tracker.wrapExternalApi(() => apiCall());
await tracker.endRequest(request, response, requestId);
```

3. Custom Metrics

```typescript
// Track business metrics
await tracker.trackCustomMetric(
  'tasks_completed',
  1,
  'count',
  { project_id: projectId, state: 'completed' }
);
```

4. Structured Logging

```typescript
// Use structured logs for better querying
console.log(JSON.stringify({
  level: 'info',
  timestamp: Date.now(),
  workerName: 'api-gateway',
  requestId,
  message: 'Task completed',
  metadata: { taskId, duration, state },
}));
```

📞 Support and Escalation

On-Call Rotation

  • Slack: #monotask-oncall
  • PagerDuty: Automatic escalation for critical alerts

Escalation Levels

Level 1 - On-Call Engineer (15 min response time)

  • Initial investigation
  • Standard runbook procedures
  • Most incidents resolved at this level

Level 2 - Team Lead (30 min response time)

  • Complex issues requiring architectural decisions
  • Cross-team coordination
  • Capacity planning

Level 3 - Engineering Management (1 hour response time)

  • Major outages
  • Customer-impacting issues
  • Business-critical escalations

🔄 Maintenance

Weekly Tasks

  • Review SLO compliance reports
  • Check error budget consumption
  • Update dashboard for new metrics
  • Review and acknowledge alerts

Monthly Tasks

  • Run recovery drill (test backup/restore)
  • Review and update runbooks
  • Audit slow queries and add indexes
  • Optimize monitoring costs

Quarterly Tasks

  • SLO review and adjustment
  • Dashboard redesign based on usage
  • Alert threshold tuning
  • Runbook effectiveness review

Last Updated: 2025-10-26
Owner: SRE Team
Reviewers: Engineering Team, Product Team
