This document outlines the monitoring and alerting configuration for the Campaign Architecture v2.0 deployment.
Overview
We monitor four critical dimensions:
- Error Rates - Application health
- AI Costs - Budget control
- Latency - Performance
- Job Execution - Background task health
1. Error Rate Monitoring
Google Cloud Logging
View Recent Errors:
gcloud logging read \
"resource.type=cloud_run_revision AND \
resource.labels.service_name=tendsocial-api AND \
severity>=ERROR" \
--limit 50 \
--format json
Real-time Error Streaming:
gcloud logging tail \
"resource.type=cloud_run_revision AND \
resource.labels.service_name=tendsocial-api AND \
severity>=ERROR"
Alert Configuration
Create an alert policy for error rate threshold:
gcloud alpha monitoring policies create \
--notification-channels=CHANNEL_ID \
--display-name="TendSocial API - High Error Rate" \
--condition-display-name="Error rate > 5%" \
--if="> 5" \
--duration=300s \
--condition-filter='resource.type="cloud_run_revision" AND resource.labels.service_name="tendsocial-api" AND severity>=ERROR'
Alert Thresholds:
- Warning: > 5% error rate over 5 minutes
- Critical: > 10% error rate over 5 minutes
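The two thresholds above can be expressed as a small classification helper. This is only a sketch: the function and type names are hypothetical, and it assumes the caller has already counted errors and total requests for the 5-minute window.

```typescript
// Hypothetical helper; the names below are illustrative and not part of the TendSocial codebase.
type ErrorSeverity = "ok" | "warning" | "critical";

// Thresholds from this document, as percentages of requests in a 5-minute window.
const WARNING_ERROR_RATE_PCT = 5;
const CRITICAL_ERROR_RATE_PCT = 10;

function classifyErrorRate(errorCount: number, totalRequests: number): ErrorSeverity {
  // With no traffic in the window there is no rate to evaluate.
  if (totalRequests <= 0) return "ok";
  // Compare errorCount / totalRequests against each threshold using multiplication,
  // which avoids floating-point rounding exactly at the boundary.
  if (errorCount * 100 > CRITICAL_ERROR_RATE_PCT * totalRequests) return "critical";
  if (errorCount * 100 > WARNING_ERROR_RATE_PCT * totalRequests) return "warning";
  return "ok";
}
```

With this sketch, 12 errors in 100 requests classifies as critical and 7 in 100 as warning. Exactly 5 in 100 stays ok because both thresholds are strict inequalities.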
2. AI Cost Monitoring
Database Queries
Daily Cost Summary:
SELECT
DATE(created_at) as date,
COUNT(*) as total_requests,
SUM(total_cost_cents) / 100.0 as total_cost_usd,
AVG(total_cost_cents) / 100.0 as avg_cost_usd,
MAX(total_cost_cents) / 100.0 as max_cost_usd
FROM "AIUsageLog"
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(created_at)
ORDER BY date DESC;
Cost by Task:
SELECT
task,
COUNT(*) as requests,
SUM(total_cost_cents) / 100.0 as total_cost_usd,
AVG(total_cost_cents) / 100.0 as avg_cost_usd
FROM "AIUsageLog"
WHERE created_at >= NOW() - INTERVAL '1 day'
GROUP BY task
ORDER BY total_cost_usd DESC;
Cost by Model:
SELECT
provider,
model,
COUNT(*) as requests,
SUM(total_cost_cents) / 100.0 as total_cost_usd
FROM "AIUsageLog"
WHERE created_at >= NOW() - INTERVAL '1 day'
GROUP BY provider, model
ORDER BY total_cost_usd DESC;
Top Spending Companies:
SELECT
c.name as company,
COUNT(*) as requests,
SUM(l.total_cost_cents) / 100.0 as total_cost_usd
FROM "AIUsageLog" l
JOIN "Company" c ON l."companyId" = c.id
WHERE l.created_at >= NOW() - INTERVAL '1 day'
GROUP BY c.name
ORDER BY total_cost_usd DESC
LIMIT 10;
Cost Alert Script
Run the cost alert script daily to detect anomalies:
tsx src/scripts/cost-alerts.ts
Alert Thresholds:
- Daily Budget: $50
- Hourly Spike: > 200% of 24-hour average
- Per-Request Anomaly: > $1.00
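The contents of cost-alerts.ts are not shown here, so the following is only a sketch of how the three thresholds above might be evaluated. The `UsageRow` shape, the function names, and the reading of "daily" as the trailing 24 hours are all assumptions.

```typescript
// Hypothetical sketch; the real src/scripts/cost-alerts.ts is not shown in this document.
interface UsageRow {
  createdAt: Date;
  totalCostCents: number;
}

type CostAlertKind = "daily_budget" | "hourly_spike" | "per_request";

const DAILY_BUDGET_CENTS = 5000; // $50 daily budget
const HOURLY_SPIKE_FACTOR = 2; // last hour > 200% of the 24-hour hourly average
const PER_REQUEST_LIMIT_CENTS = 100; // single request > $1.00
const COST_HOUR_MS = 60 * 60 * 1000;

function sumCents(rows: UsageRow[]): number {
  return rows.reduce((total, row) => total + row.totalCostCents, 0);
}

function checkCostAlerts(rows: UsageRow[], now: Date): CostAlertKind[] {
  const nowMs = now.getTime();
  // Assumption: "daily" means the trailing 24 hours, not the calendar day.
  const lastDay = rows.filter((r) => {
    const t = r.createdAt.getTime();
    return t > nowMs - 24 * COST_HOUR_MS && t <= nowMs;
  });
  const lastHour = lastDay.filter((r) => r.createdAt.getTime() > nowMs - COST_HOUR_MS);

  const dailyTotal = sumCents(lastDay);
  const hourlyAverage = dailyTotal / 24;
  const alerts: CostAlertKind[] = [];

  if (dailyTotal > DAILY_BUDGET_CENTS) alerts.push("daily_budget");
  if (hourlyAverage > 0 && sumCents(lastHour) > HOURLY_SPIKE_FACTOR * hourlyAverage) {
    alerts.push("hourly_spike");
  }
  if (lastDay.some((r) => r.totalCostCents > PER_REQUEST_LIMIT_CENTS)) {
    alerts.push("per_request");
  }
  return alerts;
}
```

Working in integer cents keeps the comparisons exact. The hourly average here includes the most recent hour, which slightly dampens the spike check; excluding that hour would be a reasonable alternative.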
3. Latency Monitoring
Cloud Run Metrics
View Request Latency:
gcloud monitoring time-series list \
--filter='metric.type="run.googleapis.com/request_latencies" AND resource.labels.service_name="tendsocial-api"' \
--format="table(metric.labels.response_code_class, value)"
Database Query Performance
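Slow Query Log (requires the pg_stat_statements extension; on PostgreSQL 13 and later the timing columns are total_exec_time, mean_exec_time, and max_exec_time):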
SELECT
query,
calls,
total_time,
mean_time,
max_time
FROM pg_stat_statements
ORDER BY mean_time DESC
LIMIT 20;
AI Generation Latency:
SELECT
task,
AVG(latency_ms) as avg_latency_ms,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) as p95_latency_ms,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) as p99_latency_ms,
MAX(latency_ms) as max_latency_ms
FROM "AIUsageLog"
WHERE created_at >= NOW() - INTERVAL '1 day'
GROUP BY task
ORDER BY avg_latency_ms DESC;
Alert Thresholds:
- P95 Latency: > 5 seconds
- P99 Latency: > 10 seconds
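For checking these thresholds outside the database, the sketch below mirrors the linear interpolation that PERCENTILE_CONT performs and compares the results against the 5-second and 10-second limits. The function names and the return shape are hypothetical.

```typescript
// Hypothetical sketch of a latency threshold check in application code.
const P95_LIMIT_MS = 5000; // P95 latency alert threshold: 5 seconds
const P99_LIMIT_MS = 10000; // P99 latency alert threshold: 10 seconds

// Continuous percentile with linear interpolation between the two nearest ranks,
// the same approach PostgreSQL's PERCENTILE_CONT uses.
function percentileCont(values: number[], fraction: number): number {
  if (values.length === 0) throw new Error("percentileCont requires at least one value");
  const sorted = [...values].sort((a, b) => a - b);
  const position = fraction * (sorted.length - 1);
  const lower = Math.floor(position);
  const upper = Math.ceil(position);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
}

interface LatencyCheck {
  p95Ms: number;
  p99Ms: number;
  p95Breached: boolean;
  p99Breached: boolean;
}

function checkLatency(latenciesMs: number[]): LatencyCheck {
  const p95Ms = percentileCont(latenciesMs, 0.95);
  const p99Ms = percentileCont(latenciesMs, 0.99);
  return {
    p95Ms,
    p99Ms,
    p95Breached: p95Ms > P95_LIMIT_MS,
    p99Breached: p99Ms > P99_LIMIT_MS,
  };
}
```

A sample with a small number of very slow requests can breach the P99 limit while P95 stays healthy, which is why both percentiles are tracked.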
4. Job Execution Monitoring
Job Status Queries
Profile Analysis Jobs (Last 24 Hours):
SELECT
status,
COUNT(*) as count,
AVG("durationMs") / 1000 as avg_duration_seconds,
SUM("usersProcessed") as total_users_processed,
SUM("postsAnalyzed") as total_posts_analyzed
FROM "ProfileAnalysisJob"
WHERE "createdAt" >= NOW() - INTERVAL '1 day'
GROUP BY status;
Performance Sync Jobs (Last 24 Hours):
SELECT
status,
COUNT(*) as count,
SUM("postsProcessed") as total_posts,
SUM("metricsUpdated") as total_metrics,
SUM("apiCallsMade") as total_api_calls
FROM "PerformanceSyncJob"
WHERE "createdAt" >= NOW() - INTERVAL '1 day'
GROUP BY status;
Recent Job Failures:
SELECT
'ProfileAnalysisJob' as job_type,
id,
"companyId",
status,
"errorMessage",
"createdAt"
FROM "ProfileAnalysisJob"
WHERE status = 'failed' AND "createdAt" >= NOW() - INTERVAL '7 days'
UNION ALL
SELECT
'PerformanceSyncJob' as job_type,
id,
"companyId",
status,
"errorMessage",
"createdAt"
FROM "PerformanceSyncJob"
WHERE status = 'failed' AND "createdAt" >= NOW() - INTERVAL '7 days'
ORDER BY "createdAt" DESC;
Alert Conditions:
- Job Failure Rate: > 20% over 1 hour
- No Jobs Executed: No jobs completed in last 2 hours (during business hours)
- Job Duration Spike: > 300% of average
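The three conditions above could be evaluated over job rows along the following lines. This is a sketch under several assumptions: only the 'failed' status appears in this document, so the other status values are guesses; creation time stands in for completion time; and the baseline average duration and the business-hours flag are supplied by the caller because neither is defined here.

```typescript
// Hypothetical sketch; only the 'failed' status appears in this document,
// so the other status values here are assumptions.
interface JobRun {
  status: "completed" | "failed" | "running";
  createdAt: Date;
  durationMs: number | null;
}

type JobAlertKind = "failure_rate" | "no_jobs_completed" | "duration_spike";

const JOB_HOUR_MS = 60 * 60 * 1000;

function evaluateJobHealth(
  jobs: JobRun[],
  now: Date,
  baselineAvgDurationMs: number,
  isBusinessHours: boolean
): JobAlertKind[] {
  const nowMs = now.getTime();
  const createdWithin = (job: JobRun, windowMs: number): boolean => {
    const t = job.createdAt.getTime();
    return t > nowMs - windowMs && t <= nowMs;
  };
  const lastHour = jobs.filter((j) => createdWithin(j, JOB_HOUR_MS));
  const lastTwoHours = jobs.filter((j) => createdWithin(j, 2 * JOB_HOUR_MS));
  const alerts: JobAlertKind[] = [];

  // Failure rate > 20% over 1 hour, written as failed * 100 > 20 * total.
  const failed = lastHour.filter((j) => j.status === "failed").length;
  if (lastHour.length > 0 && failed * 100 > 20 * lastHour.length) {
    alerts.push("failure_rate");
  }

  // No jobs completed in the last 2 hours, only checked during business hours.
  const completed = lastTwoHours.filter((j) => j.status === "completed").length;
  if (isBusinessHours && completed === 0) alerts.push("no_jobs_completed");

  // Any job in the last hour that ran longer than 300% of the baseline average.
  const spiked = lastHour.some(
    (j) => j.durationMs !== null && j.durationMs > 3 * baselineAvgDurationMs
  );
  if (baselineAvgDurationMs > 0 && spiked) alerts.push("duration_spike");

  return alerts;
}
```

The same function could be run separately over ProfileAnalysisJob and PerformanceSyncJob rows, since both tables expose status and createdAt in the queries above.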
5. Dashboard Setup (Optional)
Google Cloud Monitoring Dashboard
Create a custom dashboard with the following widgets:
Error Rate Chart (Line chart)
- Metric: run.googleapis.com/request_count filtered by response_code_class != "2xx"
- Aggregation: Rate
Request Latency (Line chart)
- Metric: run.googleapis.com/request_latencies
- Aggregation: P95, P99
Container Instance Count (Area chart)
- Metric: run.googleapis.com/container/instance_count
AI Cost Trend (Custom - requires BigQuery export)
- Export AIUsageLog to BigQuery
- Create chart from BigQuery data source
Admin Dashboard
The application includes a built-in admin analytics dashboard at:
- URL: https://app.tendsocial.com/admin/analytics
- Access: Super admin only
- Features:
  - Daily cost trends
  - Requests by task/model
  - Company-level drill-down
  - AI vs Human performance comparison
6. Alert Notification Channels
Email Notifications
Configure email alerts for critical issues:
# Create notification channel
gcloud alpha monitoring channels create \
--display-name="Engineering Team" \
--type=email \
--channel-labels=email_address=engineering@tendsocial.com
Slack Integration (Optional)
For real-time alerts:
- Create Slack webhook URL
- Configure Cloud Logging sink
- Set up Cloud Function to forward to Slack
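The forwarding step in the list above could format each log entry roughly as follows before posting it to the webhook. This is a sketch, not the deployed function: it assumes the Cloud Function has already decoded the log entry delivered by the sink, the `LogEntryLike` type covers only the fields used here, and `jsonPayload.message` is a guess at the application's structured log shape. Slack incoming webhooks accept a JSON body with a `text` field, which is what the helper returns.

```typescript
// Hypothetical subset of a Cloud Logging entry; only the fields used below are modelled.
interface LogEntryLike {
  severity?: string;
  timestamp?: string;
  textPayload?: string;
  jsonPayload?: { message?: string }; // assumed shape of structured application logs
  resource?: { labels?: Record<string, string> };
}

// Builds the JSON body for a Slack incoming webhook from a single log entry.
function buildSlackMessage(entry: LogEntryLike): { text: string } {
  const service = entry.resource?.labels?.["service_name"] ?? "unknown-service";
  const severity = entry.severity ?? "DEFAULT";
  const when = entry.timestamp ?? "unknown time";
  const detail = entry.textPayload ?? entry.jsonPayload?.message ?? "(no message)";
  return { text: `[${severity}] ${service} at ${when}: ${detail}` };
}
```

The returned object would be serialized with JSON.stringify and sent as an HTTP POST to the webhook URL. Keeping the formatting in a pure function makes it easy to test without network access.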
7. Regular Monitoring Tasks
Daily Tasks (Automated)
- ✅ Run cost-alerts script (automated via cron)
- ✅ Generate performance snapshots (automated via cron)
- ✅ Update A/B test results (automated via cron)
Weekly Tasks (Manual)
- 📊 Review error logs for patterns
- 💰 Analyze cost trends by company
- ⚡ Check P95/P99 latency trends
- 🔄 Review job failure logs
Monthly Tasks (Manual)
- 📈 Compare month-over-month AI costs
- 🎯 Review A/B test results
- 🔍 Audit top spending companies
- 📝 Update monitoring thresholds based on trends
8. Troubleshooting Common Issues
High Error Rate
- Check Cloud Run logs: gcloud logging read ...
- Identify error pattern (auth, database, AI gateway, etc.)
- Check recent deployments
- Verify environment variables
- Check database connection pool
AI Cost Spike
- Run cost analysis queries
- Identify which task/model is expensive
- Check for unusual company activity
- Verify A/B test isn't skewed to expensive model
- Check for retry loops
Job Failures
- Query job failure logs
- Check company-specific issues
- Verify social platform API status
- Check rate limits
- Verify Cloud Tasks queue health
High Latency
- Check database query performance
- Check AI gateway response times
- Check Cloud Run auto-scaling
- Verify no cold starts impacting P99
- Check for cache misses
Last Updated: 2025-11-30
Owner: Engineering Team