Monitoring & Observability

Platform Overview

CloudNexus Observability provides a unified platform for infrastructure and application monitoring. Built on a columnar time-series database and distributed log storage, it delivers sub-second query performance at petabyte scale.

📊 High-Resolution Metrics

Collect custom and system metrics at 1s intervals with automatic downsampling and retention policies.

📜 Structured Logs

Parse, index, and query logs with full-text search. Correlate log events with metrics and traces automatically.

🔍 Distributed Tracing

End-to-end request tracing with OpenTelemetry native support, service maps, and span-level analysis.

🚨 Intelligent Alerting

Threshold, anomaly-detection, and composite alert rules, with Slack, PagerDuty, and webhook integrations.
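
To make the downsampling mentioned under High-Resolution Metrics concrete, here is a minimal sketch that assumes the rollup is a plain average over fixed time windows. The function name `downsample` and the `{ t, v }` sample shape are illustrative and not part of the CloudNexus API; the rollup functions and retention tiers the platform actually uses are not described here.

```javascript
// Illustrative only: averages raw samples into fixed-width time buckets.
// `t` is a Unix timestamp in seconds, `v` is the sample value (assumed shape).
function downsample(samples, windowSeconds) {
  const buckets = new Map();
  for (const { t, v } of samples) {
    // Align each sample to the start of its window.
    const start = Math.floor(t / windowSeconds) * windowSeconds;
    const agg = buckets.get(start) ?? { sum: 0, count: 0 };
    agg.sum += v;
    agg.count += 1;
    buckets.set(start, agg);
  }
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([t, { sum, count }]) => ({ t, v: sum / count }));
}

// Six 1s samples collapse into two 3s points.
const rawSamples = [
  { t: 0, v: 10 }, { t: 1, v: 20 }, { t: 2, v: 30 },
  { t: 3, v: 40 }, { t: 4, v: 50 }, { t: 5, v: 60 },
];
const rolledUp = downsample(rawSamples, 3);
console.log(rolledUp);
```

Averaging is only one possible rollup; a real system would typically also keep min, max, and count so that percentile-style questions remain answerable after downsampling.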

Architecture

The observability pipeline uses a distributed collector architecture. The CloudNexus Agent (CN-Agent) scrapes metrics, tails logs, and aggregates traces locally before forwarding them to the regional ingestion gateway. Data is sharded across availability zones for high availability and low-latency queries.

cn-agent.yaml

# CloudNexus Agent Configuration
global:
  endpoint: https://ingest.cloudnexus.io/v2
  api_key: ${CN_API_KEY}
  region: us-east-1
  flush_interval: 10s

metrics:
  collection_interval: 15s
  scrape_targets:
    - localhost:9100  # node_exporter
    - localhost:9090  # cloudnexus_metrics

logs:
  sources:
    - type: file
      path: /var/log/app/*.log
      parser: json
      fields:
        service: payment-gateway
        environment: production
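
The relationship between `collection_interval` and `flush_interval` in the configuration above can be sketched as a small buffer that collects points and forwards them in batches. This is a hedged illustration of the general buffering technique, not the CN-Agent's actual code: `parseDurationMs`, `Batcher`, and the `send` callback are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```javascript
// Illustrative only: parse durations such as "10s" or "15s" from cn-agent.yaml.
function parseDurationMs(text) {
  const match = /^(\d+)(ms|s|m|h)$/.exec(text);
  if (!match) throw new Error(`unsupported duration: ${text}`);
  const unitMs = { ms: 1, s: 1000, m: 60000, h: 3600000 };
  return Number(match[1]) * unitMs[match[2]];
}

// Buffers data points and hands them to `send` once per flush interval.
class Batcher {
  constructor(flushInterval, send) {
    this.flushMs = parseDurationMs(flushInterval);
    this.send = send; // hypothetical transport, e.g. an HTTP POST to the gateway
    this.buffer = [];
    this.lastFlush = 0;
  }

  add(point, nowMs) {
    this.buffer.push(point);
    if (nowMs - this.lastFlush >= this.flushMs) this.flush(nowMs);
  }

  flush(nowMs) {
    if (this.buffer.length > 0) this.send(this.buffer);
    this.buffer = [];
    this.lastFlush = nowMs;
  }
}

const sentBatches = [];
const batcher = new Batcher('10s', (batch) => sentBatches.push(batch));
batcher.add({ name: 'cpu', value: 0.4 }, 0);     // buffered
batcher.add({ name: 'cpu', value: 0.5 }, 5000);  // buffered
batcher.add({ name: 'cpu', value: 0.6 }, 10000); // interval elapsed: one batch of three
console.log(sentBatches.length, sentBatches[0].length);
```

Batching trades a few seconds of delivery latency for fewer, larger requests to the ingestion gateway.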

Metrics Collection & Querying

CloudNexus supports Prometheus exposition format, StatsD, and OpenMetrics. All metrics are automatically tagged with host, region, and instance metadata. Use our PromQL-compatible query language for complex aggregations.

Supported Metric Types

Type	Description	Use Case
Gauge	Instantaneous value	CPU usage, memory, queue depth
Counter	Monotonically increasing value	HTTP requests, error counts
Histogram	Distribution of observations across buckets	Request latency, payload size
Summary	Client-side quantiles	Service response times
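
The practical difference between a gauge and a counter is that a counter is only meaningful as a rate of change. The sketch below shows one simplified way to turn counter samples into a per-second rate while tolerating resets. It is an assumption-laden illustration: `counterRate` and the `{ t, v }` sample shape are hypothetical, and it omits the window-boundary extrapolation that Prometheus's `rate()` applies.

```javascript
// Illustrative only: per-second rate of a counter over a window of samples.
// A drop in value is treated as a counter reset (for example, a process restart).
function counterRate(samples) {
  if (samples.length < 2) return null;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // After a reset the counter restarts from zero, so the new value is the increase.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : null;
}

// The counter resets between t=15 and t=30.
const requestSamples = [
  { t: 0, v: 100 }, { t: 15, v: 130 }, { t: 30, v: 10 }, { t: 45, v: 40 },
];
const perSecond = counterRate(requestSamples); // (30 + 10 + 30) / 45
console.log(perSecond);
```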

Query Language

Write PromQL expressions to slice, dice, and aggregate your metrics. Native functions include `rate()`, `histogram_quantile()`, `absent()`, and custom window functions.

Query Example

# 95th percentile HTTP response time per service
histogram_quantile(0.95, 
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
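
For readers unfamiliar with how `histogram_quantile()` arrives at a number, the sketch below reproduces the usual estimation technique: find the bucket containing the target rank, then interpolate linearly inside it. It follows the behaviour documented for Prometheus and assumes CloudNexus's PromQL-compatible engine behaves the same way; `histogramQuantile` and the `{ le, count }` bucket shape are illustrative names, and the sketch assumes `0 < q <= 1` and at least one finite bucket.

```javascript
// Illustrative only: estimate a quantile from cumulative histogram buckets.
// Each bucket is { le, count }: `le` is the upper bound, `count` is cumulative.
function histogramQuantile(q, buckets) {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  // If the rank falls in the +Inf bucket, report the highest finite bound.
  if (sorted[i].le === Infinity) return sorted[i - 1].le;
  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  // Linear interpolation assumes observations are spread evenly within the bucket.
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}

const latencyBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: Infinity, count: 100 },
];
// Rank 95 falls in the (0.5, 1] bucket: 0.5 + 0.5 * (5 / 8), roughly 0.81 seconds.
const p95 = histogramQuantile(0.95, latencyBuckets);
console.log(p95);
```

Because the result is interpolated, its accuracy depends on bucket boundaries: narrow buckets around the latencies you care about give tighter estimates.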

Log Management & Pipelines

Ingest, parse, and query logs with full-text search and structured field extraction. Logs are automatically correlated with metrics and traces for root cause analysis.

Log Pipelines

Transform logs before storage using our declarative pipeline syntax. Drop noise, enrich with geo-IP or DNS data, redact PII, and route to different retention tiers.

pipeline.yaml

pipeline:
  name: prod-logs-processor
  input:
    type: cloudnexus_ingest
    match: "environment:production"

  stages:
    - type: parse
      format: grok
      pattern: "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}"
    
    - type: filter
      drop: "level:DEBUG"
    
    - type: redact
      fields: ["email", "credit_card"]
      replacement: "***REDACTED***"

  output:
    storage: hot_tier
    retention: 90d
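
The stages in pipeline.yaml run in order, and each one sees the output of the previous stage. The sketch below models that flow as plain functions. It is an illustration under stated assumptions rather than the pipeline engine itself: the regular expression is a simplified stand-in for the grok pattern, and `parseLine`, `redact`, `runPipeline`, and the `{ raw, ... }` event shape are hypothetical.

```javascript
// Illustrative only: the parse -> filter -> redact stages as plain functions.
// This regex stands in for the grok pattern; it is not a full grok implementation.
const LINE_PATTERN = /^(\S+)\s+([A-Z]+)\s+(.*)$/;

function parseLine(line) {
  const m = LINE_PATTERN.exec(line);
  return m ? { timestamp: m[1], level: m[2], message: m[3] } : null;
}

function redact(record, fields, replacement) {
  const copy = { ...record };
  for (const field of fields) {
    if (field in copy) copy[field] = replacement;
  }
  return copy;
}

function runPipeline(events) {
  const out = [];
  for (const event of events) {
    const parsed = parseLine(event.raw);
    if (parsed === null) continue;          // unparseable lines are skipped in this sketch
    if (parsed.level === 'DEBUG') continue; // filter stage: drop "level:DEBUG"
    out.push(redact({ ...event, ...parsed }, ['email', 'credit_card'], '***REDACTED***'));
  }
  return out;
}

const processed = runPipeline([
  { raw: '2024-05-01T12:00:00Z DEBUG cache warm', email: 'a@example.com' },
  { raw: '2024-05-01T12:00:01Z ERROR charge failed', email: 'b@example.com' },
]);
console.log(processed);
```

Placing the filter before the redaction stage means dropped events never incur redaction cost, which is one reason stage order matters.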

Search Syntax

level:error AND service:auth - Boolean operators
message:*timeout* - Wildcard search
trace_id:abc123 - Cross-pillar correlation
duration:[5s TO 10s] - Range queries
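
As a rough mental model of how the Boolean and wildcard forms above are evaluated against a structured log record, consider the following sketch. It covers only `field:value` terms joined by `AND`, with `*` as a wildcard; `OR`, range queries, and full-text scoring are out of scope. The helper names `termToPredicate` and `matches` are hypothetical, and case-sensitive matching is an assumption.

```javascript
// Illustrative only: evaluates "field:value AND field:value" with * wildcards.
function termToPredicate(term) {
  const idx = term.indexOf(':');
  const field = term.slice(0, idx);
  const pattern = term.slice(idx + 1);
  // Escape regex metacharacters, then turn * into "match anything".
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*');
  const regex = new RegExp(`^${escaped}$`);
  return (record) => regex.test(String(record[field] ?? ''));
}

function matches(query, record) {
  return query
    .split(/\s+AND\s+/)
    .every((term) => termToPredicate(term)(record));
}

const logRecord = { level: 'error', service: 'auth', message: 'upstream timeout after 5s' };
console.log(matches('level:error AND service:auth', logRecord));
console.log(matches('message:*timeout*', logRecord));
```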

Distributed Tracing

Trace requests across microservices, serverless functions, and external dependencies. Native OpenTelemetry SDK integration with automatic instrumentation for 20+ runtimes.
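
Following a request across services relies on each hop forwarding trace context. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header, so the sketch below shows how such a header breaks down into its parts. Whether CloudNexus relies on this header is an assumption based on its stated OpenTelemetry support, and `parseTraceparent` is a hypothetical helper; in practice the SDK handles propagation for you.

```javascript
// Parses a version-00 W3C Trace Context `traceparent` header:
//   00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
function parseTraceparent(header) {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, traceId, parentId, flags] = m;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // The lowest flag bit marks the trace as sampled.
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const incoming = parseTraceparent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'
);
console.log(incoming);
```

The `traceId` stays constant for the whole request, while each service replaces `parentId` with its own span ID before calling downstream.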

Span Attributes

Each span captures timing, status, and custom attributes. CloudNexus automatically extracts HTTP, gRPC, and database context attributes. Enrich spans with user IDs, session IDs, and deployment versions.

trace-explorer.js

import { trace } from '@cloudnexus/otel-sdk';

const tracer = trace.getTracer('payment-service');

// Assumes `stripe` is an initialized Stripe client available in this module.
async function processPayment(orderId) {
  return await tracer.startActiveSpan('payment.charge', async (span) => {
    span.setAttribute('order.id', orderId);
    try {
      const result = await stripe.charges.create({ amount: 5000 });
      span.setAttribute('stripe.id', result.id);
      return result;
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: 2 }); // 2 = ERROR status code
      throw err;
    } finally {
      // Spans opened with startActiveSpan are not ended automatically.
      span.end();
    }
  });
}

Service Maps

Automatically generated topology graphs show service dependencies, error rates, and latency percentiles per edge. Click any node to drill into detailed span lists and error logs.
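
One way to picture how a topology graph can be derived from trace data is to group parent-child span pairs by the services on either side. The sketch below does exactly that; it is an illustration of the general approach, not CloudNexus's algorithm, and the span shape `{ spanId, parentId, service, error }` and the function `buildServiceEdges` are assumptions made for this example.

```javascript
// Illustrative only: derive service-to-service edges from a flat list of spans.
function buildServiceEdges(spans) {
  const byId = new Map(spans.map((s) => [s.spanId, s]));
  const edges = new Map();
  for (const span of spans) {
    const parent = byId.get(span.parentId);
    // Skip root spans and calls that stay within one service.
    if (!parent || parent.service === span.service) continue;
    const key = `${parent.service}->${span.service}`;
    const edge = edges.get(key) ?? { calls: 0, errors: 0 };
    edge.calls += 1;
    if (span.error) edge.errors += 1;
    edges.set(key, edge);
  }
  return [...edges].map(([edge, e]) => ({ edge, ...e, errorRate: e.errors / e.calls }));
}

const serviceEdges = buildServiceEdges([
  { spanId: 'a', parentId: null, service: 'gateway', error: false },
  { spanId: 'b', parentId: 'a', service: 'payment', error: false },
  { spanId: 'c', parentId: 'a', service: 'payment', error: true },
  { spanId: 'd', parentId: 'b', service: 'payment', error: false },
]);
console.log(serviceEdges);
```

Latency percentiles per edge could be added the same way by collecting span durations alongside the call and error counts.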

Alerting & Notification Rules

Define alert conditions using PromQL, log queries, or trace error rates. Supports evaluation windows, grouping, inhibition rules, and multi-channel routing.

alert-rule.yaml

apiVersion: cloudnexus.io/v1
kind: AlertRule
metadata:
  name: high-error-rate
  namespace: production
spec:
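  # Aggregate per service so both sides of the division carry the same labels.
  query: sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m])) > 0.05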
  for: 3m
  severity: critical
  labels:
    team: platform
    runbook_url: https://wiki.cloudnexus.io/alerts/5xx
  annotations:
    summary: "{{ $labels.service }} error rate exceeds 5%"
    description: "{{ $value | humanizePercentage }} of requests failing"
  routes:
    - channel: slack
      webhook: ${SLACK_WEBHOOK}
    - channel: pagerduty
      service_key: ${PD_KEY}
      escalate_after: 5m
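
The `for: 3m` field means the condition must hold continuously for three minutes before the alert notifies anyone. The sketch below models that behaviour with the inactive, pending, and firing states used by Prometheus-style alerting; that CloudNexus uses the same state names is an assumption, and `evaluateAlert` is a hypothetical helper.

```javascript
// Illustrative only: a rule fires once its condition has held for `forSeconds`.
function evaluateAlert(evaluations, threshold, forSeconds) {
  let breachedSince = null;
  return evaluations.map(({ t, value }) => {
    if (value > threshold) {
      if (breachedSince === null) breachedSince = t;
      const held = t - breachedSince;
      return { t, state: held >= forSeconds ? 'firing' : 'pending' };
    }
    // Any evaluation below the threshold resets the timer.
    breachedSince = null;
    return { t, state: 'inactive' };
  });
}

// Error ratio sampled once a minute, threshold 5%, for: 3m (180 seconds).
const alertStates = evaluateAlert(
  [
    { t: 0, value: 0.02 },
    { t: 60, value: 0.06 },
    { t: 120, value: 0.07 },
    { t: 180, value: 0.08 },
    { t: 240, value: 0.09 },
    { t: 300, value: 0.01 },
  ],
  0.05,
  180
).map((e) => e.state);
console.log(alertStates);
```

In this run the ratio first exceeds the threshold at t=60, so the alert stays pending until t=240 and then fires; a short spike that recovers within three minutes would never leave the pending state.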

Supported Integrations

Native exporters and SDKs for major ecosystems. Configure once, deploy across environments.

Kubernetes

Auto-discovery of pods, services, and deployments. Heapster & cAdvisor integration.

OpenTelemetry

SDKs for Go, Python, Java, Node.js, .NET. Context propagation & batching.

Cloud Providers

AWS CloudWatch, Google Cloud Monitoring (formerly Stackdriver), and Azure Monitor metric forwarding.

CI/CD

Terraform provider, Kubernetes Operator, GitHub Actions integration.

Quick Start

Deploy the CloudNexus Agent in under 60 seconds:

1. Generate an API key from the Console → Settings → Integrations.
2. Apply the Kubernetes manifest or run the Docker container.
3. Verify ingestion in the Metrics Explorer dashboard.
4. Configure your first alert rule via the CLI or API.