Alerting & Rules
Configure intelligent, real-time alerting policies to monitor your CloudNexus infrastructure. Define metric thresholds, route notifications to multiple channels, and automate escalation workflows.
Overview
CloudNexus Alerting provides a unified rules engine that evaluates metrics, logs, and custom events across your infrastructure. When conditions are met, the system triggers notifications through your preferred channels and can automatically execute runbooks or scaling policies.
Sub-Second Evaluation
Rules are evaluated against a streaming data pipeline, with under 500 ms of latency for critical infrastructure metrics.
Multi-Channel Routing
Route alerts to Slack, PagerDuty, Email, SMS, or custom Webhooks based on severity and team ownership.
Stateful Deduplication
Prevents alert fatigue with intelligent grouping, suppression windows, and automatic resolution tracking.
Migration Notice
Legacy alert rules (v1) are deprecated. All new configurations must use the alert_rules_v2 schema. Migration tools are available in the console.
Rule Architecture
Every alert rule consists of four core components: Trigger Conditions, Evaluation Window, Routing Targets, and Actions. The engine evaluates rules asynchronously against your data store and maintains state across evaluation cycles.
JSON Syntax & Schema
Rules are defined as JSON objects. Below is a complete example of a production-ready rule for CPU throttling detection:
"alert_rule": { "id": "cpu-throttle-prod-01", "name": "Production CPU Throttle Detection", "severity": "critical", "enabled": true, "evaluation_interval": "30s", "for": "5m", "condition": { "metric": "cloudnexus.compute.cpu.throttled", "operator": "above", "threshold": 15.0, "aggregation": "avg", "dimensions": { "environment": "production", "instance_type": "t3.xlarge" } }, "routing": { "channels": ["slack-ops", "pagerduty-sre"], "escalation_policy": "oncall-24x7" } }
| Parameter | Type | Description |
|---|---|---|
| evaluation_interval | Duration | How often the rule is evaluated (e.g., 15s, 1m, 5m) |
| for | Duration | Condition must persist for this duration before firing |
| operator | String | Comparison logic: above, below, equal, contains, absent |
| aggregation | String | Metric function: avg, max, p95, rate, count |
| dimensions | Object | Filter scope for evaluation across tags, regions, or environments |
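Checking a rule locally before deploying it can catch typos in these fields early. The following is a minimal sketch, not the CloudNexus validator: `parse_duration` and `validate_rule` are hypothetical helper names, the duration grammar (an integer followed by `s`, `m`, or `h`) is an assumption inferred from the examples in the table, and the function expects the object nested under `alert_rule`.

```python
import re

# Allowed values copied from the parameter table above.
OPERATORS = {"above", "below", "equal", "contains", "absent"}
AGGREGATIONS = {"avg", "max", "p95", "rate", "count"}

# Assumption: a duration is an integer followed by s, m, or h.
_DURATION = re.compile(r"(\d+)([smh])")
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text) -> int:
    """Convert a duration such as '30s' or '5m' to seconds."""
    match = _DURATION.fullmatch(text) if isinstance(text, str) else None
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def validate_rule(rule: dict) -> list:
    """Return a list of problems found in a rule; an empty list means it passed."""
    problems = []
    for field in ("evaluation_interval", "for"):
        try:
            parse_duration(rule.get(field))
        except ValueError as exc:
            problems.append(f"{field}: {exc}")
    condition = rule.get("condition", {})
    if condition.get("operator") not in OPERATORS:
        problems.append(f"operator: unsupported value {condition.get('operator')!r}")
    if condition.get("aggregation") not in AGGREGATIONS:
        problems.append(f"aggregation: unsupported value {condition.get('aggregation')!r}")
    if not isinstance(condition.get("dimensions", {}), dict):
        problems.append("dimensions: must be an object")
    return problems
```

Calling `validate_rule` on the `alert_rule` object from the example above should return an empty list. A misspelled operator such as `"greater"` would be reported as a problem string instead.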
Notification Channels
CloudNexus supports native integrations and webhook forwarding. Channels are configured globally and referenced by ID in alert rules.
Slack & Teams
Rich formatting with embedded runbooks, resolution buttons, and channel-level filtering.
PagerDuty & Opsgenie
Seamless incident creation with automatic severity mapping and callback handling.
Custom Webhooks
Send alerts to any HTTP endpoint. Supports OAuth2, API keys, and retry/backoff policies.
Email & SMS
Fallback channels with digest options for non-critical warnings and daily summaries.
Rate Limiting
Webhook channels are limited to 50 requests/second. Exceeding this triggers automatic throttling and fallback to queue storage.
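To build intuition for how a 50 requests/second limit with queue fallback might behave, here is a small token-bucket sketch. It illustrates the general technique only. CloudNexus does not document its internal throttling algorithm here, so the class name `ThrottledSender`, its methods, and the refill behavior are all assumptions.

```python
from collections import deque


class ThrottledSender:
    """Token bucket that queues deliveries once the per-second budget is spent."""

    def __init__(self, rate_per_second: float = 50.0):
        self.rate = rate_per_second
        self.tokens = rate_per_second  # start with a full bucket
        self.last_refill = 0.0         # timestamp of the previous refill, in seconds
        self.queue = deque()           # deliveries waiting for capacity

    def _refill(self, now: float) -> None:
        # Add tokens in proportion to elapsed time, capped at one second's budget.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def submit(self, payload, now: float) -> str:
        """Send immediately if capacity allows; otherwise park the payload."""
        self._refill(now)
        if self.tokens >= 1.0 and not self.queue:
            self.tokens -= 1.0
            return "sent"
        self.queue.append(payload)
        return "queued"

    def drain(self, now: float) -> list:
        """Release queued payloads, oldest first, as capacity returns."""
        self._refill(now)
        released = []
        while self.queue and self.tokens >= 1.0:
            self.tokens -= 1.0
            released.append(self.queue.popleft())
        return released
```

In this sketch, a burst of 60 submissions at the same instant sends the first 50 and queues the remaining 10, which `drain` releases in order once the bucket refills. Timestamps are passed in explicitly so the behavior is easy to test; a real caller would pass `time.monotonic()`.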
Escalation Policies
Define how alerts progress through on-call rotations when unresolved. Policies support time-based triggers, retry intervals, and custom business hours.
- L1 - Initial Alert: Sent to primary on-call engineer via Slack & Email
- L2 - Escalation (15m): If unresolved, notifies shift lead and pages via PagerDuty
- L3 - Critical (30m): Triggers management notification and initiates auto-remediation runbook
- Resolution: Automatic silence until metrics return to baseline for 10m
Configure escalation windows using the schedule object in your routing configuration. IANA time zones are supported, and DST adjustments are applied automatically.
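The time-based progression above can be expressed as a simple lookup. This is an illustrative sketch only: the thresholds come from the tiers listed above, while `ESCALATION_STEPS` and `escalation_level` are hypothetical names, and it ignores business hours and retry intervals.

```python
from datetime import timedelta

# (time unresolved, tier) pairs taken from the escalation tiers listed above.
ESCALATION_STEPS = [
    (timedelta(minutes=0), "L1"),
    (timedelta(minutes=15), "L2"),
    (timedelta(minutes=30), "L3"),
]


def escalation_level(unresolved_for: timedelta) -> str:
    """Return the highest tier whose threshold has been reached."""
    level = ESCALATION_STEPS[0][1]
    for threshold, name in ESCALATION_STEPS:
        if unresolved_for >= threshold:
            level = name
    return level
```

For example, an alert that has been unresolved for 20 minutes maps to the second tier, and one unresolved for 45 minutes stays at the third tier because no later step is defined.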
Configuration Workflow
Rules can be managed via the Console UI, Terraform provider, or REST API. The recommended workflow for production environments:
    # Deploy rule via CLI
    curl -X POST https://api.cloudnexus.io/v2/alert-rules \
      -H "Authorization: Bearer $API_KEY" \
      -H "Content-Type: application/json" \
      -d '@alert-rule.json'
Once deployed, the rule enters a PENDING state while the evaluation pipeline initializes. After the first successful cycle, it transitions to ACTIVE. Use the dry-run flag in the console to validate conditions against historical data without triggering notifications.
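For teams that deploy rules from scripts rather than a shell, the same request can be built with the Python standard library. This sketch mirrors the curl call above: the URL, method, and headers are taken from that command, while `build_deploy_request` is a hypothetical helper name. The response format is not described here, so the sketch stops at constructing the request.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.cloudnexus.io/v2/alert-rules"


def build_deploy_request(rule: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that deploys a rule."""
    body = json.dumps(rule).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes it possible to inspect or log the payload first.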
Best Practices
- Use meaningful annotations: Every rule should include runbook links and context for responders.
- Set a for duration: Requiring the condition to persist prevents flapping on transient spikes. Use 2-5m for CPU/memory and 30s-1m for error rates.
- Group by service/region: Leverage group_by to send one alert per affected component instead of per-instance floods.
- Test before production: Use the simulate endpoint to verify thresholds against recent data.
- Rotate secrets: Webhook URLs and API keys should be stored in CloudNexus Secrets Manager, not inline in rule definitions.
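To see why a for duration suppresses flapping, consider this sketch of the underlying idea: an alert fires only after the condition has been continuously true for the whole hold period, and any healthy sample resets the clock. The class `ForWindow` and the state labels `"ok"`, `"holding"`, and `"firing"` are illustrative assumptions, not CloudNexus engine states.

```python
from typing import Optional


class ForWindow:
    """Tracks how long a condition has been continuously true."""

    def __init__(self, hold_seconds: float):
        self.hold_seconds = hold_seconds
        self.breach_started: Optional[float] = None  # when the current breach began

    def observe(self, breached: bool, now: float) -> str:
        """Record one evaluation result and return the resulting state."""
        if not breached:
            # Any healthy sample resets the timer, so short spikes never fire.
            self.breach_started = None
            return "ok"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.hold_seconds:
            return "firing"
        return "holding"
```

With a 5m hold (300 seconds), a spike that clears after three minutes returns to `"ok"` without firing, while a breach that lasts the full five minutes reaches `"firing"`.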
Troubleshooting
If alerts aren't firing as expected:
- Check the Rule Health Dashboard for pipeline latency or evaluation errors.
- Verify metric cardinality. Rules that match more than 10k unique dimension combinations may require aggregation tuning.
- Confirm channel connectivity via the /test-connection endpoint.
- Review suppression_logs to see if duplicate alerts were filtered by the state manager.
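For the cardinality check, a rough estimate can be computed from a sample of exported metric points. This sketch assumes each sample is available as a flat dictionary of dimension names to string values. The function names are hypothetical, and the 10k figure is the one mentioned in the checklist above.

```python
CARDINALITY_LIMIT = 10_000  # the ">10k" figure from the checklist above


def dimension_cardinality(samples: list) -> int:
    """Count distinct dimension combinations across metric samples."""
    # Sorting the items makes key order irrelevant when comparing samples.
    return len({tuple(sorted(sample.items())) for sample in samples})


def needs_aggregation_tuning(samples: list) -> bool:
    """Flag a rule whose matched samples exceed the cardinality limit."""
    return dimension_cardinality(samples) > CARDINALITY_LIMIT
```

If the count is high, grouping by fewer dimensions or narrowing the rule's dimension filters are the usual ways to bring it down.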
Need help configuring complex rules?
Our SRE architects can review your alerting topology and optimize for noise reduction and SLA compliance.