
This document outlines the monitoring and alerting configuration for the Campaign Architecture v2.0 deployment.


Overview

We monitor four critical dimensions:

  1. Error Rates - Application health
  2. AI Costs - Budget control
  3. Latency - Performance
  4. Job Execution - Background task health

1. Error Rate Monitoring

Google Cloud Logging

View Recent Errors:

bash
gcloud logging read \
  "resource.type=cloud_run_revision AND \
   resource.labels.service_name=tendsocial-api AND \
   severity>=ERROR" \
  --limit 50 \
  --format json

Real-time Error Streaming:

bash
gcloud alpha logging tail \
  "resource.type=cloud_run_revision AND \
   resource.labels.service_name=tendsocial-api AND \
   severity>=ERROR"

Alert Configuration

Create an alert policy for error rate threshold:

bash
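# These flags compare the 5xx request rate against the threshold. To alert on a true
# percentage of all requests, define a ratio condition (denominatorFilter) in a policy
# file and pass it with --policy-from-file.
gcloud alpha monitoring policies create \
  --notification-channels=CHANNEL_ID \
  --display-name="TendSocial API - High Error Rate" \
  --combiner=OR \
  --condition-display-name="Error rate > 5%" \
  --condition-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="tendsocial-api" AND metric.type="run.googleapis.com/request_count" AND metric.labels.response_code_class="5xx"' \
  --aggregation='{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_RATE"}' \
  --if="> 5" \
  --duration=300s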

Alert Thresholds:

  • Warning: > 5% error rate over 5 minutes
  • Critical: > 10% error rate over 5 minutes
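
The two thresholds above amount to a simple classification of the error percentage in a 5-minute window. The TypeScript sketch below is illustrative only: the name `classifyErrorRate` and the idea of passing raw request counts are assumptions, and the real evaluation is expected to happen inside Cloud Monitoring rather than in application code.

```typescript
// Hypothetical helper: maps request counts from one 5-minute window to a severity.
type Severity = "ok" | "warning" | "critical";

const WARNING_PCT = 5;   // Warning: > 5% error rate over 5 minutes
const CRITICAL_PCT = 10; // Critical: > 10% error rate over 5 minutes

function classifyErrorRate(errorCount: number, totalCount: number): Severity {
  // No traffic in the window means there is no rate to evaluate.
  if (totalCount <= 0) return "ok";
  // Multiply before dividing to keep whole-number inputs exact.
  const pct = (errorCount * 100) / totalCount;
  if (pct > CRITICAL_PCT) return "critical";
  if (pct > WARNING_PCT) return "warning";
  return "ok";
}

// Example: 6 errors out of 100 requests crosses the warning threshold only.
const exampleSeverity: Severity = classifyErrorRate(6, 100);
```

Both comparisons are strict, so a window sitting at exactly 5% or 10% does not escalate, matching the ">" in the thresholds.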

2. AI Cost Monitoring

Database Queries

Daily Cost Summary:

sql
SELECT 
  DATE(created_at) as date,
  COUNT(*) as total_requests,
  SUM(total_cost_cents) / 100.0 as total_cost_usd,
  AVG(total_cost_cents) / 100.0 as avg_cost_usd,
  MAX(total_cost_cents) / 100.0 as max_cost_usd
FROM "AIUsageLog"
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;

Cost by Task:

sql
SELECT 
  task,
  COUNT(*) as requests,
  SUM(total_cost_cents) / 100.0 as total_cost_usd,
  AVG(total_cost_cents) / 100.0 as avg_cost_usd
FROM "AIUsageLog"
WHERE created_at >= NOW() - INTERVAL '1 day'
GROUP BY task
ORDER BY total_cost_usd DESC;

Cost by Model:

sql
SELECT 
  provider,
  model,
  COUNT(*) as requests,
  SUM(total_cost_cents) / 100.0 as total_cost_usd
FROM "AIUsageLog"
WHERE created_at >= NOW() - INTERVAL '1 day'
GROUP BY provider, model
ORDER BY total_cost_usd DESC;

Top Spending Companies:

sql
SELECT 
  c.name as company,
  COUNT(*) as requests,
  SUM(l.total_cost_cents) / 100.0 as total_cost_usd
FROM "AIUsageLog" l
JOIN "Company" c ON l."companyId" = c.id
WHERE l.created_at >= NOW() - INTERVAL '1 day'
GROUP BY c.name
ORDER BY total_cost_usd DESC
LIMIT 10;

Cost Alert Script

Run the cost alert script daily to detect anomalies:

bash
tsx src/scripts/cost-alerts.ts

Alert Thresholds:

  • Daily Budget: $50
  • Hourly Spike: > 200% of 24-hour average
  • Per-Request Anomaly: > $1.00
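
As a rough illustration of how the three thresholds above could be checked, here is a minimal TypeScript sketch. It is not the contents of `src/scripts/cost-alerts.ts`: the `CostSnapshot` shape, the function name `evaluateCostAlerts`, and the alert labels are all hypothetical, and the "24-hour average" is assumed to mean the average hourly spend over the trailing 24 hours.

```typescript
// Hypothetical input shape; the real script may read different fields from "AIUsageLog".
interface CostSnapshot {
  dailyTotalCents: number;   // spend so far today
  lastHourCents: number;     // spend in the most recent hour
  trailing24hCents: number;  // spend over the trailing 24 hours
  maxRequestCents: number;   // most expensive single request today
}

const DAILY_BUDGET_CENTS = 5000;     // Daily Budget: $50
const HOURLY_SPIKE_FACTOR = 2;       // Hourly Spike: > 200% of 24-hour average
const PER_REQUEST_LIMIT_CENTS = 100; // Per-Request Anomaly: > $1.00

function evaluateCostAlerts(s: CostSnapshot): string[] {
  const alerts: string[] = [];
  if (s.dailyTotalCents > DAILY_BUDGET_CENTS) alerts.push("daily_budget");

  // Assumption: the baseline is the mean hourly spend over the last 24 hours.
  const hourlyAverage = s.trailing24hCents / 24;
  if (hourlyAverage > 0 && s.lastHourCents > HOURLY_SPIKE_FACTOR * hourlyAverage) {
    alerts.push("hourly_spike");
  }

  if (s.maxRequestCents > PER_REQUEST_LIMIT_CENTS) alerts.push("per_request_anomaly");
  return alerts;
}

// Example: $60 spent today and a $3 hour against a $1/hour average.
const exampleAlerts = evaluateCostAlerts({
  dailyTotalCents: 6000,
  lastHourCents: 300,
  trailing24hCents: 2400,
  maxRequestCents: 50,
});
```

Working in integer cents throughout avoids rounding surprises when comparing against the dollar thresholds.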

3. Latency Monitoring

Cloud Run Metrics

View Request Latency:

bash
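# Query the Cloud Monitoring API directly; replace PROJECT_ID with your project (uses GNU date).
curl -s -G "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode 'filter=metric.type="run.googleapis.com/request_latencies" AND resource.labels.service_name="tendsocial-api"' \
  --data-urlencode "interval.startTime=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --data-urlencode "interval.endTime=$(date -u +%Y-%m-%dT%H:%M:%SZ)"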

Database Query Performance

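Slowest Queries (requires the pg_stat_statements extension):

sql
-- Column names are for PostgreSQL 13+; on 12 and earlier use
-- total_time, mean_time, and max_time instead.
SELECT 
  query,
  calls,
  total_exec_time,
  mean_exec_time,
  max_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;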

AI Generation Latency:

sql
SELECT 
  task,
  AVG(latency_ms) as avg_latency_ms,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) as p95_latency_ms,
  PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) as p99_latency_ms,
  MAX(latency_ms) as max_latency_ms
FROM "AIUsageLog"
WHERE created_at >= NOW() - INTERVAL '1 day'
GROUP BY task
ORDER BY avg_latency_ms DESC;

Alert Thresholds:

  • P95 Latency: > 5 seconds
  • P99 Latency: > 10 seconds
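
For checking these thresholds outside the database, the following TypeScript sketch computes percentiles with linear interpolation, which is the definition PostgreSQL's `PERCENTILE_CONT` uses in the query above. The function names `percentileCont` and `latencyBreaches` and the in-memory array of samples are assumptions made for illustration.

```typescript
// Continuous percentile with linear interpolation between neighbouring samples.
function percentileCont(values: number[], fraction: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  // Zero-based fractional rank of the requested percentile.
  const rank = fraction * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  return sorted[lower] + (rank - lower) * (sorted[upper] - sorted[lower]);
}

const P95_LIMIT_MS = 5000;  // P95 Latency: > 5 seconds
const P99_LIMIT_MS = 10000; // P99 Latency: > 10 seconds

// Hypothetical helper: returns which latency thresholds a set of samples breaches.
function latencyBreaches(latenciesMs: number[]): string[] {
  const breaches: string[] = [];
  if (percentileCont(latenciesMs, 0.95) > P95_LIMIT_MS) breaches.push("p95");
  if (percentileCont(latenciesMs, 0.99) > P99_LIMIT_MS) breaches.push("p99");
  return breaches;
}

// Example: the median of two samples is interpolated halfway between them.
const exampleMedian = percentileCont([10, 20], 0.5);
```

With only a handful of samples, P95 and P99 sit very close to the maximum, so these checks are most meaningful over a full day of requests per task.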

4. Job Execution Monitoring

Job Status Queries

Profile Analysis Jobs (Last 24 Hours):

sql
SELECT 
  status,
  COUNT(*) as count,
  AVG("durationMs") / 1000 as avg_duration_seconds,
  SUM("usersProcessed") as total_users_processed,
  SUM("postsAnalyzed") as total_posts_analyzed
FROM "ProfileAnalysisJob"
WHERE "createdAt" >= NOW() - INTERVAL '1 day'
GROUP BY status;

Performance Sync Jobs (Last 24 Hours):

sql
SELECT 
  status,
  COUNT(*) as count,
  SUM("postsProcessed") as total_posts,
  SUM("metricsUpdated") as total_metrics,
  SUM("apiCallsMade") as total_api_calls
FROM "PerformanceSyncJob"
WHERE "createdAt" >= NOW() - INTERVAL '1 day'
GROUP BY status;

Recent Job Failures:

sql
SELECT 
  'ProfileAnalysisJob' as job_type,
  id,
  "companyId",
  status,
  "errorMessage",
  "createdAt"
FROM "ProfileAnalysisJob"
WHERE status = 'failed' AND "createdAt" >= NOW() - INTERVAL '7 days'
UNION ALL
SELECT 
  'PerformanceSyncJob' as job_type,
  id,
  "companyId",
  status,
  "errorMessage",
  "createdAt"
FROM "PerformanceSyncJob"
WHERE status = 'failed' AND "createdAt" >= NOW() - INTERVAL '7 days'
ORDER BY "createdAt" DESC;

Alert Conditions:

  • Job Failure Rate: > 20% over 1 hour
  • No Jobs Executed: No jobs completed in the last 2 hours (during business hours)
  • Job Duration Spike: > 300% of average
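
The three conditions above can be sketched as a single check over a summary of recent job activity. This TypeScript example is a sketch under stated assumptions: the `JobWindow` shape, the function name `evaluateJobAlerts`, and the alert labels are hypothetical, and the caller is assumed to decide what counts as business hours.

```typescript
// Hypothetical summary of recent job activity, e.g. built from the queries above.
interface JobWindow {
  completedLastHour: number;          // successful jobs in the last hour
  failedLastHour: number;             // failed jobs in the last hour
  minutesSinceLastCompletion: number; // time since any job last completed
  latestDurationMs: number;           // duration of the most recent job
  averageDurationMs: number;          // historical average duration
  businessHours: boolean;             // assumption: determined by the caller
}

function evaluateJobAlerts(w: JobWindow): string[] {
  const alerts: string[] = [];

  // Job Failure Rate: > 20% over 1 hour (integer arithmetic avoids float rounding).
  const total = w.completedLastHour + w.failedLastHour;
  if (total > 0 && w.failedLastHour * 100 > 20 * total) alerts.push("job_failure_rate");

  // No Jobs Executed: nothing completed in the last 2 hours, business hours only.
  if (w.businessHours && w.minutesSinceLastCompletion > 120) alerts.push("no_jobs_executed");

  // Job Duration Spike: > 300% of average.
  if (w.averageDurationMs > 0 && w.latestDurationMs > 3 * w.averageDurationMs) {
    alerts.push("job_duration_spike");
  }
  return alerts;
}

// Example: 3 failures out of 10 jobs is a 30% failure rate.
const exampleJobAlerts = evaluateJobAlerts({
  completedLastHour: 7,
  failedLastHour: 3,
  minutesSinceLastCompletion: 10,
  latestDurationMs: 1000,
  averageDurationMs: 1000,
  businessHours: true,
});
```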

5. Dashboard Setup (Optional)

Google Cloud Monitoring Dashboard

Create a custom dashboard with the following widgets:

  1. Error Rate Chart (Line chart)

    • Metric: run.googleapis.com/request_count filtered by response_code_class != "2xx"
    • Aggregation: Rate
  2. Request Latency (Line chart)

    • Metric: run.googleapis.com/request_latencies
    • Aggregation: P95, P99
  3. Container Instance Count (Area chart)

    • Metric: run.googleapis.com/container/instance_count
  4. AI Cost Trend (Custom - requires BigQuery export)

    • Export AIUsageLog to BigQuery
    • Create chart from BigQuery data source

Admin Dashboard

The application includes a built-in admin analytics dashboard at:

  • URL: https://app.tendsocial.com/admin/analytics
  • Access: Super admin only
  • Features:
    • Daily cost trends
    • Requests by task/model
    • Company-level drill-down
    • AI vs Human performance comparison

6. Alert Notification Channels

Email Notifications

Configure email alerts for critical issues:

bash
# Create notification channel
gcloud alpha monitoring channels create \
  --display-name="Engineering Team" \
  --type=email \
  --channel-labels=email_address=engineering@tendsocial.com

Slack Integration (Optional)

For real-time alerts:

  1. Create a Slack incoming webhook URL
  2. Configure a Cloud Logging sink for the error logs
  3. Set up a Cloud Function to forward the log entries to Slack
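
The forwarding step mostly consists of turning a log entry into the JSON body that a Slack incoming webhook accepts (an object with a `text` field). The TypeScript sketch below shows only that transformation. The `LogEntryLike` type is a hypothetical subset of a Cloud Logging entry, `toSlackMessage` is an illustrative name, and the code that receives the entry from the sink and POSTs the result to the webhook URL is deliberately left out.

```typescript
// Hypothetical subset of a Cloud Logging entry; real entries carry many more fields.
interface LogEntryLike {
  severity: string;
  timestamp: string;
  textPayload?: string;
  resource?: { labels?: { service_name?: string } };
}

// Builds the JSON body for a Slack incoming webhook, which expects a "text" field.
function toSlackMessage(entry: LogEntryLike): { text: string } {
  const service = entry.resource?.labels?.service_name ?? "unknown-service";
  const body = entry.textPayload ?? "(no text payload)";
  return { text: `[${entry.severity}] ${service} at ${entry.timestamp}: ${body}` };
}

// Example entry shaped like an error from the tendsocial-api service.
const exampleMessage = toSlackMessage({
  severity: "ERROR",
  timestamp: "2025-11-30T00:00:00Z",
  textPayload: "Database connection refused",
  resource: { labels: { service_name: "tendsocial-api" } },
});
```

Keeping the formatting in a pure function makes it easy to unit test without a Slack workspace or a deployed function.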

7. Regular Monitoring Tasks

Daily Tasks (Automated)

  • ✅ Run cost-alerts script (automated via cron)
  • ✅ Generate performance snapshots (automated via cron)
  • ✅ Update A/B test results (automated via cron)

Weekly Tasks (Manual)

  • 📊 Review error logs for patterns
  • 💰 Analyze cost trends by company
  • ⚡ Check P95/P99 latency trends
  • 🔄 Review job failure logs

Monthly Tasks (Manual)

  • 📈 Compare month-over-month AI costs
  • 🎯 Review A/B test results
  • 🔍 Audit top spending companies
  • 📝 Update monitoring thresholds based on trends

8. Troubleshooting Common Issues

High Error Rate

  1. Check Cloud Run logs: gcloud logging read ...
  2. Identify error pattern (auth, database, AI gateway, etc.)
  3. Check recent deployments
  4. Verify environment variables
  5. Check database connection pool

AI Cost Spike

  1. Run cost analysis queries
  2. Identify which task/model is expensive
  3. Check for unusual company activity
  4. Verify the A/B test isn't skewed toward an expensive model
  5. Check for retry loops

Job Failures

  1. Query job failure logs
  2. Check company-specific issues
  3. Verify social platform API status
  4. Check rate limits
  5. Verify Cloud Tasks queue health

High Latency

  1. Check database query performance
  2. Check AI gateway response times
  3. Check Cloud Run auto-scaling
  4. Verify that cold starts are not inflating P99
  5. Check for cache misses

Last Updated: 2025-11-30
Owner: Engineering Team
