Observability

SRE Portal exposes custom Prometheus metrics on the controller-runtime /metrics endpoint alongside the built-in Go runtime and controller-runtime metrics. A pre-built Grafana dashboard is included in the repository.

Metrics Endpoint

The metrics endpoint is configured via the --metrics-bind-address flag:

# Disabled by default
--metrics-bind-address=0

# HTTP
--metrics-bind-address=:8080

# HTTPS (auto-generated or cert-manager certificates)
--metrics-bind-address=:8443 --metrics-secure=true

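When --metrics-secure=true, the endpoint is protected with Kubernetes authentication and authorization (authn/authz) through the controller-runtime FilterProvider, so scrapers must present a token that the cluster accepts.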
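Because the secured endpoint authorizes callers through the Kubernetes API, the scraping identity usually needs permission to read /metrics. The following is a minimal sketch modeled on the kubebuilder "metrics reader" pattern. The role name, the Prometheus service account name (prometheus-k8s), and its namespace (monitoring) are assumptions, not values taken from this project.

```yaml
# Sketch only: grants "get" on the /metrics non-resource URL.
# All names below are hypothetical; adjust them to your cluster.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sreportal-metrics-reader        # assumed name
rules:
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sreportal-metrics-reader        # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sreportal-metrics-reader
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s                # assumption: your Prometheus service account
    namespace: monitoring               # assumption: your monitoring namespace
```

If the project's own manifests already ship an equivalent role, bind that one instead of creating a new role.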

Custom Metrics

All custom metrics use the sreportal_ prefix and are defined in internal/metrics/metrics.go.

Controller Metrics

Reconciliation performance and error tracking across all controllers (dns, portal, alertmanager, release).

| Metric | Type | Labels | Description |
|---|---|---|---|
| sreportal_controller_reconcile_total | Counter | controller, result | Reconciliation count by result (success, error) |
| sreportal_controller_reconcile_duration_seconds | Histogram | controller | Reconciliation latency distribution |
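As an illustration of how the two controller metrics can be used, here is a sketch of a Prometheus Operator PrometheusRule with one alert on the reconcile error ratio and one recording rule for p99 latency. The rule names, thresholds, and time windows are arbitrary assumptions, and the _bucket series assumes the histogram is exposed as a classic (non-native) Prometheus histogram.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sreportal-controller-rules      # hypothetical name
  namespace: sreportal-system
spec:
  groups:
    - name: sreportal.controller
      rules:
        # Fires when more than 10% of reconciliations fail (illustrative threshold).
        - alert: SREPortalReconcileErrors
          expr: |
            sum by (controller) (rate(sreportal_controller_reconcile_total{result="error"}[5m]))
              /
            sum by (controller) (rate(sreportal_controller_reconcile_total[5m]))
              > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Controller {{ $labels.controller }} is failing reconciliations"
        # Precomputes p99 reconcile latency per controller.
        - record: sreportal:controller_reconcile_duration_seconds:p99
          expr: |
            histogram_quantile(0.99,
              sum by (controller, le) (rate(sreportal_controller_reconcile_duration_seconds_bucket[5m])))
```

Depending on how your Prometheus resource is configured, the PrometheusRule may also need labels that match its ruleSelector.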

DNS Metrics

Track the volume of DNS data managed by the operator.

| Metric | Type | Labels | Description |
|---|---|---|---|
| sreportal_dns_fqdns_total | Gauge | portal, source | Number of FQDNs per portal and source (manual, external-dns, remote) |
| sreportal_dns_groups_total | Gauge | portal | Number of DNS groups per portal |

Source Metrics

Monitor the external-dns source collection pipeline.

| Metric | Type | Labels | Description |
|---|---|---|---|
| sreportal_source_endpoints_collected | Gauge | source_type | Endpoints collected per source type (service, ingress, dnsendpoint, etc.) |
| sreportal_source_errors_total | Counter | source_type | Cumulative source collection errors |

Alertmanager Metrics

Monitor alert fetching from Alertmanager instances.

| Metric | Type | Labels | Description |
|---|---|---|---|
| sreportal_alertmanager_alerts_active | Gauge | portal, alertmanager | Number of active alerts per Alertmanager resource |
| sreportal_alertmanager_fetch_errors_total | Counter | alertmanager | Cumulative alert fetch errors |

Portal Metrics

Track portal inventory and remote synchronization health.

| Metric | Type | Labels | Description |
|---|---|---|---|
| sreportal_portal_total | Gauge | type | Number of portals by type (local, remote) |
| sreportal_portal_remote_sync_errors_total | Counter | portal | Cumulative remote portal sync errors |
| sreportal_portal_remote_fqdns_synced | Gauge | portal | FQDNs synced from each remote portal |
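The cumulative error counters in the Source, Alertmanager, and Portal sections lend themselves to simple "any new errors" alerts. The sketch below uses increase() over a 15-minute window. The alert names, windows, and severities are assumptions chosen for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sreportal-error-rules           # hypothetical name
  namespace: sreportal-system
spec:
  groups:
    - name: sreportal.errors
      rules:
        # Source collection errors, broken down by source type.
        - alert: SREPortalSourceErrors
          expr: sum by (source_type) (increase(sreportal_source_errors_total[15m])) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Source {{ $labels.source_type }} reported collection errors"
        # Alert fetch errors, per Alertmanager resource.
        - alert: SREPortalAlertmanagerFetchErrors
          expr: sum by (alertmanager) (increase(sreportal_alertmanager_fetch_errors_total[15m])) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Fetching alerts from {{ $labels.alertmanager }} is failing"
        # Remote portal synchronization errors, per portal.
        - alert: SREPortalRemoteSyncErrors
          expr: sum by (portal) (increase(sreportal_portal_remote_sync_errors_total[15m])) > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Remote sync for portal {{ $labels.portal }} is failing"
```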

HTTP Server Metrics

Request-level metrics for the web server (Connect API, MCP, static files).

| Metric | Type | Labels | Description |
|---|---|---|---|
| sreportal_http_requests_total | Counter | method, handler, code | HTTP requests by method, handler, and status code |
| sreportal_http_request_duration_seconds | Histogram | method, handler | HTTP request latency distribution |
| sreportal_http_requests_in_flight | Gauge | (none) | Number of HTTP requests currently being processed |

The handler label uses low-cardinality values: connect (gRPC/Connect API), mcp (MCP servers), api (health endpoints), swagger (Swagger UI), static (web UI files).
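Because the handler label is low-cardinality, per-handler aggregates are cheap to precompute. The following sketch records request rate and p95 latency per handler and alerts on the 5xx ratio. It assumes the code label holds the numeric HTTP status (for example "500") and that the histogram is exposed with classic _bucket series. Rule names and the 5% threshold are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sreportal-http-rules            # hypothetical name
  namespace: sreportal-system
spec:
  groups:
    - name: sreportal.http
      rules:
        # Requests per second by handler and status code.
        - record: sreportal:http_requests:rate5m
          expr: sum by (handler, code) (rate(sreportal_http_requests_total[5m]))
        # p95 request latency by handler.
        - record: sreportal:http_request_duration_seconds:p95
          expr: |
            histogram_quantile(0.95,
              sum by (handler, le) (rate(sreportal_http_request_duration_seconds_bucket[5m])))
        # Share of 5xx responses per handler (assumes numeric "code" values).
        - alert: SREPortalHTTP5xxRatioHigh
          expr: |
            sum by (handler) (rate(sreportal_http_requests_total{code=~"5.."}[5m]))
              /
            sum by (handler) (rate(sreportal_http_requests_total[5m]))
              > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Handler {{ $labels.handler }} is returning more than 5% server errors"
```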

MCP Server Metrics

Track MCP tool usage and session activity.

| Metric | Type | Labels | Description |
|---|---|---|---|
| sreportal_mcp_tool_calls_total | Counter | server, tool | MCP tool invocations |
| sreportal_mcp_tool_call_duration_seconds | Histogram | server, tool | MCP tool call latency |
| sreportal_mcp_tool_call_errors_total | Counter | server, tool | MCP tool call errors |
| sreportal_mcp_sessions_active | Gauge | server | Active MCP sessions (dns, alerts, metrics, releases) |
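One way to combine the MCP counters is an error ratio per server and tool. The sketch below assumes that failed calls are counted in both sreportal_mcp_tool_calls_total and sreportal_mcp_tool_call_errors_total; if errors are counted separately in the implementation, the denominator would need adjusting. The names and the 20% threshold are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sreportal-mcp-rules             # hypothetical name
  namespace: sreportal-system
spec:
  groups:
    - name: sreportal.mcp
      rules:
        # Error ratio per MCP server and tool. With no calls the ratio is NaN,
        # the comparison is false, and the alert stays silent.
        - alert: SREPortalMCPToolErrors
          expr: |
            sum by (server, tool) (rate(sreportal_mcp_tool_call_errors_total[10m]))
              /
            sum by (server, tool) (rate(sreportal_mcp_tool_calls_total[10m]))
              > 0.2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "MCP tool {{ $labels.tool }} on server {{ $labels.server }} is failing"
```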

Built-in Metrics

The /metrics endpoint also exposes standard metrics from controller-runtime and the Go runtime:

| Category | Examples |
|---|---|
| Reconciliation | controller_runtime_reconcile_total, controller_runtime_reconcile_time_seconds |
| Work queue | workqueue_adds_total, workqueue_depth, workqueue_queue_duration_seconds |
| REST client | rest_client_requests_total, rest_client_request_duration_seconds |
| Leader election | leader_election_master_status |
| Go runtime | go_goroutines, go_memstats_*, go_gc_duration_seconds |
| Process | process_cpu_seconds_total, process_resident_memory_bytes, process_open_fds |

Grafana Dashboard

A pre-built Grafana dashboard is available at config/grafana/sreportal-dashboard.json.

Import

  1. Open Grafana
  2. Go to Dashboards → Import
  3. Upload config/grafana/sreportal-dashboard.json or paste the JSON content
  4. Select your Prometheus datasource

Variables

The dashboard includes two template variables:

| Variable | Type | Description |
|---|---|---|
| datasource | Datasource | Prometheus datasource picker; select from all available Prometheus datasources |
| job | Query | Prometheus job filter; auto-discovered from sreportal_controller_reconcile_total, multi-select with “All” |

Dashboard Layout

The dashboard is organized into two rows:

Row 1 — Application Metrics

| Panel | Visualization | Content |
|---|---|---|
| Reconciliations / sec | Time series | Rate of sreportal_controller_reconcile_total by controller and result |
| Reconciliation Duration | Time series | p50 / p95 / p99 of reconcile_duration_seconds by controller |
| FQDNs Total | Stat | Sum of sreportal_dns_fqdns_total per portal |
| DNS Groups | Stat | Sum of sreportal_dns_groups_total per portal |
| Active Alerts | Stat | Sum of sreportal_alertmanager_alerts_active per portal/alertmanager (thresholds: green → orange → red) |
| Remote FQDNs Synced | Stat | sreportal_portal_remote_fqdns_synced per portal |
| HTTP Requests / sec | Time series (stacked) | Rate of sreportal_http_requests_total by handler and status code |
| HTTP Latency | Time series | p50 / p95 / p99 of http_request_duration_seconds by handler |
| HTTP In-Flight / MCP Sessions | Time series | requests_in_flight and mcp_sessions_active |
| MCP Tool Calls / sec | Time series (bars) | Rate of mcp_tool_calls_total by server/tool |
| Source Endpoints Collected | Time series | source_endpoints_collected by source type |
| Errors / sec | Time series | Combined error rates: source, alertmanager fetch, remote sync, MCP tool errors |
| Portals | Gauge | sreportal_portal_total by type (local / remote) |

Row 2 — System Metrics

| Panel | Visualization | Content |
|---|---|---|
| CPU Usage | Time series | rate(process_cpu_seconds_total) |
| Memory Usage | Time series | RSS, heap alloc, heap in-use, stack in-use |
| Goroutines | Time series | go_goroutines |
| Open File Descriptors | Time series | process_open_fds vs process_max_fds |
| GC Duration | Time series | go_gc_duration_seconds p50 / p75 / p100 |
| K8s REST Client Requests / sec | Time series (stacked) | rest_client_requests_total by method and status code |
| Workqueue Depth | Time series | workqueue_depth per controller queue |

Provisioning

To auto-provision the dashboard via Grafana’s provisioning system, add it to your Grafana provisioning configuration:

# grafana/provisioning/dashboards/sreportal.yaml
apiVersion: 1
providers:
  - name: sreportal
    type: file
    options:
      path: /var/lib/grafana/dashboards/sreportal

Then mount or copy config/grafana/sreportal-dashboard.json into the configured path.
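As one possible way to do the mounting, here is a Docker Compose sketch that places both the provider file and the dashboard JSON into a Grafana container. Docker Compose itself, the service name, and the host-side paths are assumptions. /etc/grafana/provisioning is Grafana's default provisioning directory, and the dashboard target matches the path configured in the provider above.

```yaml
# Sketch only: host paths are relative to a hypothetical compose file location.
services:
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    volumes:
      # Provider definition, read by Grafana at startup.
      - ./grafana/provisioning/dashboards/sreportal.yaml:/etc/grafana/provisioning/dashboards/sreportal.yaml:ro
      # Dashboard JSON, placed in the directory the provider points to.
      - ./config/grafana/sreportal-dashboard.json:/var/lib/grafana/dashboards/sreportal/sreportal-dashboard.json:ro
```

On Kubernetes, the same two files are commonly delivered through ConfigMaps mounted at these paths.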

Prometheus Scrape Configuration

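Example ServiceMonitor for Prometheus Operator. It scrapes the HTTPS endpoint (--metrics-secure=true), authenticates with the Prometheus service account token, and skips TLS certificate verification, which is typically needed when the certificate is auto-generated: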

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sreportal
  namespace: sreportal-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: sreportal
  endpoints:
    - port: metrics
      scheme: https
      tlsConfig:
        insecureSkipVerify: true
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
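A ServiceMonitor selects Services, not Pods, so the example above needs a Service that carries the app.kubernetes.io/name: sreportal label and exposes a port named metrics. A minimal sketch follows. The Service name, the pod selector, and port 8443 (taken from the HTTPS flag example) are assumptions about the deployment.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sreportal-metrics               # hypothetical name
  namespace: sreportal-system
  labels:
    app.kubernetes.io/name: sreportal   # matched by the ServiceMonitor selector
spec:
  selector:
    app.kubernetes.io/name: sreportal   # assumption: label on the controller pods
  ports:
    - name: metrics                     # matched by "port: metrics" in the ServiceMonitor
      port: 8443
      targetPort: 8443                  # assumption: --metrics-bind-address=:8443
      protocol: TCP
```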