CloudNexus Alerting & Rules Documentation

Overview

CloudNexus Alerting provides a unified rules engine that evaluates metrics, logs, and custom events across your infrastructure. When conditions are met, the system triggers notifications through your preferred channels and can automatically execute runbooks or scaling policies.

⚡

Sub-Second Evaluation

Rules are evaluated against a streaming data pipeline with <500ms latency for critical infrastructure metrics.

🔗

Multi-Channel Routing

Route alerts to Slack, PagerDuty, Email, SMS, or custom Webhooks based on severity and team ownership.

🔄

Stateful Deduplication

Prevents alert fatigue with intelligent grouping, suppression windows, and automatic resolution tracking.

ℹ️

Migration Notice

Legacy alert rules (v1) are deprecated. All new configurations must use the alert_rules_v2 schema. Migration tools are available in the console.

Rule Architecture

Every alert rule consists of four core components: Trigger Conditions, Evaluation Window, Routing Targets, and Actions. The engine evaluates rules asynchronously against your data store and maintains state across evaluation cycles.

JSON Syntax & Schema

Rules are defined as JSON objects. Below is a complete example of a production-ready rule for CPU throttling detection:

JSON

{
  "alert_rule": {
    "id": "cpu-throttle-prod-01",
    "name": "Production CPU Throttle Detection",
    "severity": "critical",
    "enabled": true,
    "evaluation_interval": "30s",
    "for": "5m",
    "condition": {
      "metric": "cloudnexus.compute.cpu.throttled",
      "operator": "above",
      "threshold": 15.0,
      "aggregation": "avg",
      "dimensions": {
        "environment": "production",
        "instance_type": "t3.xlarge"
      }
    },
    "routing": {
      "channels": ["slack-ops", "pagerduty-sre"],
      "escalation_policy": "oncall-24x7"
    }
  }
}

Parameter	Type	Description
evaluation_interval	Duration	How often the rule is evaluated (e.g., 15s, 1m, 5m)
for	Duration	Condition must persist for this duration before firing
operator	String	Comparison logic: above, below, equal, contains, absent
aggregation	String	Metric function: avg, max, p95, rate, count
dimensions	Object	Filter scope for evaluation across tags, regions, or environments
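
As a rough illustration of how the parameters in the table interact, the Python sketch below aggregates one window of samples, applies the operator, and fires only once the condition has held for the full "for" duration. The helper names (parse_duration, condition_met, should_fire) are hypothetical, only a subset of the documented operators and aggregations is covered, and treating "for" as a count of consecutive breaching evaluations is an assumption rather than the engine's confirmed behavior.

```python
from statistics import mean

# Hypothetical lookup tables; only a subset of the documented values is covered.
AGGREGATIONS = {"avg": mean, "max": max, "count": len}

OPERATORS = {
    "above": lambda value, threshold: value > threshold,
    "below": lambda value, threshold: value < threshold,
    "equal": lambda value, threshold: value == threshold,
}


def parse_duration(text):
    """Convert a duration such as '30s' or '5m' to seconds."""
    units = {"s": 1, "m": 60}
    return int(text[:-1]) * units[text[-1]]


def condition_met(samples, condition):
    """Aggregate one window of samples and compare it with the threshold."""
    value = AGGREGATIONS[condition["aggregation"]](samples)
    return OPERATORS[condition["operator"]](value, condition["threshold"])


def should_fire(results, evaluation_interval, for_duration):
    """Assumed semantics: fire when the latest consecutive breaches cover 'for'."""
    needed = max(1, parse_duration(for_duration) // parse_duration(evaluation_interval))
    recent = results[-needed:]
    return len(recent) == needed and all(recent)
```

With evaluation_interval set to 30s and for set to 5m, this sketch requires ten consecutive breaching evaluations before the rule fires, so a single clean evaluation resets the count.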

Notification Channels

CloudNexus supports native integrations and webhook forwarding. Channels are configured globally and referenced by ID in alert rules.

💬

Slack & Teams

Rich formatting with embedded runbooks, resolution buttons, and channel-level filtering.

🚨

PagerDuty & Opsgenie

Seamless incident creation with automatic severity mapping and callback handling.

🔌

Custom Webhooks

Send alerts to any HTTP endpoint. Supports OAuth2, API keys, and retry/backoff policies.

📧

Email & SMS

Fallback channels with digest options for non-critical warnings and daily summaries.

⚠️

Rate Limiting

Webhook channels are limited to 50 requests/second. Exceeding this triggers automatic throttling and fallback to queue storage.
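
A sender that wants to stay under the 50 requests/second limit could apply a similar budget on its own side. The following is a minimal token-bucket sketch, assuming a caller-supplied clock; the WebhookThrottle class and its in-memory queue are hypothetical stand-ins and do not describe how CloudNexus implements throttling or queue storage internally.

```python
from collections import deque


class WebhookThrottle:
    """Token bucket that queues deliveries once the per-second budget is spent."""

    def __init__(self, rate_per_second=50):
        self.rate = rate_per_second
        self.tokens = float(rate_per_second)
        self.last_refill = 0.0
        self.queue = deque()  # hypothetical stand-in for queue storage

    def _refill(self, now):
        # Add tokens in proportion to elapsed time, capped at the bucket size.
        elapsed = now - self.last_refill
        self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def submit(self, payload, now):
        """Return 'sent' if within budget, otherwise queue and return 'queued'."""
        self._refill(now)
        if self.tokens >= 1:
            self.tokens -= 1
            return "sent"
        self.queue.append(payload)
        return "queued"
```

Passing the timestamp explicitly keeps the sketch deterministic; a real caller would likely pass time.monotonic() and drain the queue as tokens become available.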

Escalation Policies

Define how alerts progress through on-call rotations when unresolved. Policies support time-based triggers, retry intervals, and custom business hours.

L1 - Initial Alert: Sent to primary on-call engineer via Slack & Email
L2 - Escalation (15m): If unresolved, notifies shift lead and pages via PagerDuty
L3 - Critical (30m): Triggers management notification and initiates auto-remediation runbook
Resolution: Automatic silence until metrics return to baseline for 10m

Configure escalation windows using the schedule object in your routing configuration. Time zones follow IANA identifiers, and DST adjustments are applied automatically.
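
The time-based progression through the tiers listed above can be sketched in Python as follows. The ESCALATION_TIERS table and the current_tier function are illustrative names only, and the sketch ignores business hours and retry intervals; the 15 and 30 minute thresholds are taken from the list.

```python
from datetime import timedelta

# Thresholds copied from the tier list; the structure itself is hypothetical.
ESCALATION_TIERS = [
    (timedelta(minutes=0), "L1"),
    (timedelta(minutes=15), "L2"),
    (timedelta(minutes=30), "L3"),
]


def current_tier(unresolved_for):
    """Return the highest tier whose threshold has already elapsed."""
    tier = ESCALATION_TIERS[0][1]
    for threshold, name in ESCALATION_TIERS:
        if unresolved_for >= threshold:
            tier = name
    return tier
```

For example, current_tier(timedelta(minutes=20)) falls between the second and third thresholds and so maps to L2.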

Configuration Workflow

Rules can be managed via the Console UI, Terraform provider, or REST API. The recommended workflow for production environments:

cURL

# Deploy rule via the REST API
curl -X POST https://api.cloudnexus.io/v2/alert-rules \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '@alert-rule.json'

Once deployed, the rule enters a PENDING state while the evaluation pipeline initializes. After the first successful cycle, it transitions to ACTIVE. Use the dry-run flag in the console to validate conditions against historical data without triggering notifications.
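
For teams that script deployments, the same POST request can be assembled with Python's standard library. This is a sketch under the assumption that the endpoint and headers match the cURL example; build_deploy_request is a hypothetical helper, and the response format is not documented here, so the sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_deploy_request(rule, api_key):
    """Build (but do not send) the POST request for a rule definition."""
    return urllib.request.Request(
        "https://api.cloudnexus.io/v2/alert-rules",
        data=json.dumps(rule).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit it. Keeping construction separate from sending makes it easier to inspect the payload in review or in a test before anything reaches production.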

Best Practices

Use meaningful annotations: Every rule should include runbook links and context for responders.
Set a "for" window: Requiring the condition to persist prevents flapping on transient spikes. Use 2-5m for CPU/Memory, 30s-1m for error rates.
Group by service/region: Leverage group_by to send one alert per affected component instead of per-instance floods.
Test before production: Use the simulate endpoint to verify thresholds against recent data.
Rotate secrets: Webhook URLs and API keys should be stored in CloudNexus Secrets Manager, not inline in rule definitions.
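
To make the grouping recommendation concrete, the Python sketch below collapses per-instance alerts into one entry per service and region. The alert shape (an "instance" field plus a "dimensions" mapping) and the group_alerts function are assumptions for illustration; the actual group_by behavior is defined by the rules engine.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Collapse per-instance alerts into one entry per unique group key."""
    groups = defaultdict(list)
    for alert in alerts:
        # Build the key from the requested dimensions, in the order given.
        key = tuple(alert["dimensions"].get(name) for name in group_by)
        groups[key].append(alert["instance"])
    return dict(groups)
```

Three instances failing across two service/region pairs would produce two grouped entries, each listing its affected instances, rather than three separate notifications.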

Troubleshooting

If alerts aren't firing as expected:

Check the Rule Health Dashboard for pipeline latency or evaluation errors.
Verify metric cardinality. Rules matching more than 10k unique dimension combinations may require aggregation tuning.
Confirm channel connectivity via the /test-connection endpoint.
Review suppression_logs to see if duplicate alerts were filtered by the state manager.
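
For the cardinality check, one way to estimate how many series a rule matches is to count distinct dimension combinations in an exported sample. The sketch below assumes each series is represented as a plain dictionary of dimension names to values; the function names are hypothetical, and the 10k limit is the figure quoted in the checklist.

```python
CARDINALITY_LIMIT = 10_000  # figure quoted in the troubleshooting checklist


def dimension_cardinality(series):
    """Count distinct dimension combinations across a list of dimension dicts."""
    return len({tuple(sorted(dims.items())) for dims in series})


def needs_aggregation_tuning(series, limit=CARDINALITY_LIMIT):
    """True when the sample exceeds the given cardinality limit."""
    return dimension_cardinality(series) > limit
```

Sorting the items before hashing means two series with the same dimensions in a different key order are counted once.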

Need help configuring complex rules?

Our SRE architects can review your alerting topology and optimize for noise reduction and SLA compliance.
