Architecture

SRE Portal runs as a single container that combines a Kubernetes controller, a gRPC/Connect API, and a web UI server.

High-Level Overview

    graph TB
    Browser["Browser"] --> WebUI
    AIAssistant["AI Assistant\n(Claude / Cursor)"] --> MCP

    subgraph Pod["SRE Portal Pod (single container)"]
        Controllers["Controllers\n(controller-runtime)"]
        ReadStore["ReadStore\n(in-memory projections)"]
        API["Connect gRPC API\n(h2c)"]
        WebUI["Web UI\n(React SPA / Echo v5)"]
        MCP["MCP Servers\n(/mcp, /mcp/dns, /mcp/alerts,\n/mcp/metrics, /mcp/releases,\n/mcp/netpol, /mcp/status)"]
    end

    Controllers -->|write| ReadStore
    Controllers --> K8s["Kubernetes API Server"]
    ReadStore -->|read| API
    ReadStore -->|read| MCP
    WebUI --> API
  

The five components share the same process:

  • Controllers reconcile CRDs using controller-runtime and push projections into the ReadStore
  • ReadStore holds pre-aggregated read models in-memory, decoupling reads from the Kubernetes API
  • Connect API serves gRPC-compatible endpoints over HTTP/2 (h2c), reading from the ReadStore
  • Web UI serves the React SPA as static files via Echo v5
  • MCP Servers expose Model Context Protocol at /mcp and /mcp/dns (DNS/portals), /mcp/alerts (alerts), /mcp/metrics (Prometheus metrics), /mcp/releases (release tracking), /mcp/netpol (network flows), and /mcp/status (status page) for AI assistant integration

Custom Resource Definitions

SRE Portal defines eight CRDs that work together:

    graph TD
    Portal -->|portalRef| DNS["DNS\n(manual entries)"]
    Portal -->|portalRef| DNSRecord["DNSRecord\n(auto-discovered)"]
    Portal -->|portalRef| Alertmanager["Alertmanager\n(alert source)"]
    Portal -->|portalRef| Component["Component\n(platform status)"]
    Portal -->|portalRef| Maintenance["Maintenance\n(scheduled windows)"]
    Portal -->|portalRef| Incident["Incident\n(declared incidents)"]
    Maintenance -->|"components[]"| Component
    Incident -->|"components[]"| Component
    Release["Release\n(release events per day)"]
  

Portal

Defines a named web dashboard view. Each portal has a title, an optional subpath, and a main flag. The operator creates a default main portal on startup.

A portal can optionally set spec.remote to fetch DNS data from a remote SRE Portal instance instead of collecting it locally. Remote portals are periodically synchronized (every 5 minutes) and their FQDNs appear with source remote in the DNS status.

DNS

Contains manually defined DNS entry groups linked to a portal via spec.portalRef. The DNS controller aggregates these manual entries with auto-discovered endpoints into status.groups.

DNSRecord

Created and managed automatically by the source controller. Each DNSRecord represents endpoints discovered from a specific source type (Service, Ingress, etc.) for a specific portal.

Alertmanager

Links an Alertmanager instance to a portal via spec.portalRef. The spec defines url.local (used by the controller to fetch active alerts from the Alertmanager API) and optional url.remote (for dashboard links). The Alertmanager controller periodically fetches alerts and stores them in status.activeAlerts.

Release

Stores release events (deployments, rollbacks, hotfixes, feature flags, etc.) organized by day. Each Release CR is named release-YYYY-MM-DD and holds an array of entries with type, version, origin, date, and optional author, message, and link fields. Release CRs are not linked to a portal — they are global and shown on the main portal’s Releases page.

The Release controller automatically deletes CRs older than the configured TTL (default: 30 days).

Component

Represents a monitored platform component (e.g., GKE cluster, Cloud SQL, API Gateway) linked to a portal via spec.portalRef. The user declares the operational status in spec.status (operational, degraded, partial_outage, major_outage, unknown). The controller computes status.computedStatus, which may be overridden to maintenance when an active Maintenance targets this component.

Maintenance

Represents a scheduled maintenance window linked to a portal. The spec defines scheduledStart, scheduledEnd, a list of affected components[], and the affectedStatus to apply during the window. The controller automatically computes the phase (upcoming → in_progress → completed) and sets RequeueAfter so that each transition is reconciled at the exact scheduled time.

Incident

Represents a declared incident linked to a portal. The spec contains severity (critical, major, minor), affected components[], and a timeline of updates[] appended via kubectl edit/patch. The controller derives status.currentPhase from the latest update, and computes startedAt, resolvedAt, and durationMinutes when resolved.

Design Principles

Domain-Driven Design (DDD)

Domain logic lives in internal/domain/ with no external dependencies. Infrastructure concerns (Kubernetes API, gRPC, HTTP) are isolated in adapters and controllers.

Clean Architecture

Dependencies point inward: controllers depend on the domain layer, but the domain never imports controller or infrastructure packages.

Idempotent Reconciliation

All controllers are safe to run multiple times. They compute desired state from the current state and converge toward it without side effects from repeated runs.

ReadStore (CQRS Read Path)

All gRPC and MCP services read from in-memory ReadStores instead of querying the Kubernetes API directly. Controllers write projections into these stores during reconciliation.

    flowchart LR
    Controller -->|"writer.Replace(key, models)"| Store["ReadStore<br/>(map + RWMutex + broadcast)"]
    Store -->|"reader.List(ctx, filters)"| gRPC
    Store -->|"reader.Subscribe()"| Stream["gRPC Stream"]
    Store -->|"reader.List(ctx, filters)"| MCP
  

Each bounded context has its own domain interfaces (internal/domain/<ctx>/reader.go, writer.go) and store implementation (internal/readstore/<ctx>/), backed by a shared generic readstore.Store[T].

| Context | Store | Key | Reader | Writer |
| --- | --- | --- | --- | --- |
| DNS | FQDNStore | DNS CR resource key | FQDNReader | FQDNWriter |
| Portal | PortalStore | Portal resource key | PortalReader | PortalWriter |
| Alertmanager | AlertmanagerStore | Alertmanager resource key | AlertmanagerReader | AlertmanagerWriter |
| Network Policy | FlowGraphStore | NFD name (dual node/edge stores + portalRef mapping) | FlowGraphReader | FlowGraphWriter |
| Release | ReleaseStore | Day string (2026-03-25) | ReleaseReader | ReleaseWriter |
| Component | ComponentStore | Component resource key | ComponentReader | ComponentWriter |
| Maintenance | MaintenanceStore | Maintenance resource key | MaintenanceReader | MaintenanceWriter |
| Incident | IncidentStore | Incident resource key | IncidentReader | IncidentWriter |

Mutations broadcast to subscribers via a channel-close pattern, enabling event-driven streams (e.g. StreamFQDNs) without polling.
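
The map + RWMutex + broadcast design can be illustrated with a minimal sketch. It is not the project's readstore.Store[T]: the type and method signatures below are simplified assumptions (no context, no filters), and only the channel-close broadcast is taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Store is a hypothetical stand-in for the generic readstore.Store[T]:
// a map guarded by an RWMutex, plus a channel that is closed on every mutation.
type Store[T any] struct {
	mu      sync.RWMutex
	items   map[string][]T
	changed chan struct{}
}

func NewStore[T any]() *Store[T] {
	return &Store[T]{items: make(map[string][]T), changed: make(chan struct{})}
}

// Replace swaps all models stored under key and wakes every subscriber.
func (s *Store[T]) Replace(key string, models []T) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[key] = append([]T(nil), models...) // copy so callers cannot mutate stored data
	close(s.changed)                            // broadcast: every waiting reader unblocks
	s.changed = make(chan struct{})             // fresh channel for the next mutation
}

// List returns a flattened copy of all models, ordered by key for determinism.
func (s *Store[T]) List() []T {
	s.mu.RLock()
	defer s.mu.RUnlock()
	keys := make([]string, 0, len(s.items))
	for k := range s.items {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var out []T
	for _, k := range keys {
		out = append(out, s.items[k]...)
	}
	return out
}

// Subscribe returns a channel that is closed on the next mutation.
func (s *Store[T]) Subscribe() <-chan struct{} {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.changed
}

func main() {
	store := NewStore[string]()
	updates := store.Subscribe()
	go store.Replace("default/dns", []string{"api.example.com"})
	<-updates // blocks until Replace closes the channel; no polling involved
	fmt.Println(store.List())
}
```

Closing a channel wakes any number of waiters at once, which is why a streaming handler can simply re-subscribe after each notification instead of polling on a timer.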

Controllers

DNS Controller (Chain of Responsibility)

The DNS controller uses a generic Chain-of-Responsibility framework (internal/reconciler/handler.go) that executes handlers sequentially, short-circuiting on error or requeue.

The chain has four steps:

| Step | Handler | Description |
| --- | --- | --- |
| 1 | CollectManualEntries | Extract manual groups from DNS.spec.groups |
| 2 | AggregateFQDNs | Convert to FQDNGroupStatus with source markers, sort by name |
| 3 | ResolveDNS | Parallel DNS lookup per FQDN (10 workers, 5s timeout). Skipped if disableDNSCheck: true |
| 4 | UpdateStatus | Write the aggregated groups to DNS.status and project to the ReadStore |

See the DNS Controller Flow for a detailed step-by-step diagram.

Source Controller (Chain of Responsibility, Runnable)

The source controller implements manager.Runnable for periodic reconciliation (configurable interval). It uses the same Chain-of-Responsibility pattern as the DNS and Alertmanager controllers, with six sequential handlers:

| Step | Handler | Description |
| --- | --- | --- |
| 1 | RebuildSources | Build external-dns sources from operator config (lazy: only if none exist) |
| 2 | BuildPortalIndex | List Portal CRs and build a name→portal lookup index |
| 3 | CollectEndpoints | Fetch endpoints from each source, enrich with sreportal.io/* annotations, route to (portal, sourceType) buckets |
| 4 | Deduplicate | Remove duplicate FQDNs across source types using configured sources.priority order |
| 5 | ReconcileDNSRecords | Create or update a DNSRecord CR per (portal, sourceType) pair. Computes SHA-256 hash of endpoints and skips status writes when unchanged |
| 6 | DeleteOrphaned | Delete DNSRecord CRs for disabled sources or portals with no endpoints |
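
The Deduplicate step can be pictured with the following sketch. The Endpoint struct and the rule that unlisted source types rank last are assumptions; only the idea of resolving duplicates by sources.priority order comes from the table.

```go
package main

import "fmt"

// Endpoint is a pared-down, hypothetical view of a discovered endpoint.
type Endpoint struct {
	FQDN       string
	SourceType string
}

// Deduplicate keeps one endpoint per FQDN, preferring the source type that
// appears earliest in priority. Source types missing from priority rank last.
func Deduplicate(endpoints []Endpoint, priority []string) []Endpoint {
	rank := make(map[string]int, len(priority))
	for i, p := range priority {
		rank[p] = i
	}
	rankOf := func(source string) int {
		if r, ok := rank[source]; ok {
			return r
		}
		return len(priority)
	}

	index := make(map[string]int) // FQDN -> position in out
	var out []Endpoint
	for _, ep := range endpoints {
		i, seen := index[ep.FQDN]
		switch {
		case !seen:
			index[ep.FQDN] = len(out)
			out = append(out, ep)
		case rankOf(ep.SourceType) < rankOf(out[i].SourceType):
			out[i] = ep // a higher-priority source replaces the earlier entry
		}
	}
	return out
}

func main() {
	endpoints := []Endpoint{
		{FQDN: "api.example.com", SourceType: "service"},
		{FQDN: "api.example.com", SourceType: "ingress"},
		{FQDN: "db.example.com", SourceType: "service"},
	}
	fmt.Println(Deduplicate(endpoints, []string{"ingress", "service"}))
}
```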

See the DNS Source Flow for a detailed step-by-step diagram.
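
The hash-and-skip optimization in ReconcileDNSRecords might work along these lines. The Endpoint fields and the canonical encoding are assumptions for the example; the source only states that a SHA-256 hash of the endpoints is compared to avoid redundant status writes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// Endpoint is a hypothetical, simplified endpoint record.
type Endpoint struct {
	DNSName    string
	RecordType string
	Targets    []string
}

// HashEndpoints returns a SHA-256 digest that does not depend on the order
// of the endpoints or of their targets, so reordering never looks like a change.
func HashEndpoints(endpoints []Endpoint) string {
	lines := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		targets := append([]string(nil), ep.Targets...) // copy before sorting
		sort.Strings(targets)
		lines = append(lines, ep.DNSName+"|"+ep.RecordType+"|"+strings.Join(targets, ","))
	}
	sort.Strings(lines)
	sum := sha256.Sum256([]byte(strings.Join(lines, "\n")))
	return hex.EncodeToString(sum[:])
}

func main() {
	previous := HashEndpoints([]Endpoint{{"api.example.com", "A", []string{"192.0.2.2", "192.0.2.1"}}})
	current := HashEndpoints([]Endpoint{{"api.example.com", "A", []string{"192.0.2.1", "192.0.2.2"}}})
	if previous == current {
		fmt.Println("endpoints unchanged, skipping status write")
	}
}
```

Sorting before hashing matters because sources may return endpoints in a different order on each run; without it, every reconciliation could trigger a status write.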

Portal Controller

A simple controller that sets status.ready = true with a Ready condition. It also runs an EnsureMainPortalRunnable that creates the default main portal on startup if none exists.

Alertmanager Controller (Chain)

Uses the same Chain-of-Responsibility pattern as the DNS controller. The chain has two steps: FetchAlerts (HTTP GET to spec.url.local, Alertmanager API v2) and UpdateStatus (write activeAlerts, conditions, lastReconcileTime to status). Reconciles periodically (default ~2 minutes). See the Alertmanager Flow for details.

Release Controller (ReadStore + TTL Cleanup)

The Release controller watches Release CRs and performs two duties:

  1. ReadStore projection: converts Release CR entries to domain read models and pushes them to the ReadStore. External modifications (e.g. kubectl edit, CI pipelines) are detected immediately via the watch.

  2. TTL cleanup: checks whether the CR’s day is older than the configured TTL (default: 30 days). Expired CRs are automatically deleted. Each CR is re-checked every 12 hours via RequeueAfter.

Component Controller (Chain)

Uses the Chain-of-Responsibility pattern with three steps: ValidatePortalRef (verify Portal exists), ComputeStatus (read MaintenanceReader for active maintenance override), UpdateStatus (patch computedStatus, label, project to ReadStore). Also watches Maintenance CRs to re-enqueue affected components. See the Component Flow for details.
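
As an illustration of the ComputeStatus step, the sketch below applies the maintenance override. The Maintenance struct and the ComputeStatus signature are assumptions, and the sketch always overrides to maintenance rather than modeling affectedStatus.

```go
package main

import "fmt"

// Maintenance is a hypothetical read model of a maintenance window.
type Maintenance struct {
	Phase      string   // "upcoming", "in_progress", or "completed"
	Components []string // names of the affected components
}

// ComputeStatus returns "maintenance" when an in-progress window targets the
// component, and the user-declared status otherwise.
func ComputeStatus(component, declared string, windows []Maintenance) string {
	for _, m := range windows {
		if m.Phase != "in_progress" {
			continue // only active windows override the declared status
		}
		for _, c := range m.Components {
			if c == component {
				return "maintenance"
			}
		}
	}
	return declared
}

func main() {
	windows := []Maintenance{
		{Phase: "in_progress", Components: []string{"gke-cluster"}},
		{Phase: "upcoming", Components: []string{"cloud-sql"}},
	}
	fmt.Println(ComputeStatus("gke-cluster", "operational", windows))
	fmt.Println(ComputeStatus("cloud-sql", "degraded", windows))
}
```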

Maintenance Controller (Chain)

Uses two chain steps: ComputePhase (upcoming/in_progress/completed from current time) and UpdateStatus (patch phase, project to ReadStore, set strategic RequeueAfter). No fixed polling — requeues at exact transition times (scheduledStart, scheduledEnd). See the Maintenance Flow for details.

Incident Controller (Chain)

Uses two chain steps: ComputeStatus (derive currentPhase from latest update, compute duration if resolved) and UpdateStatus (patch status fields, project to ReadStore). Uses the domain function incident.ComputeStatus() for all phase/duration computation.

Connect API

The API uses the Connect protocol (gRPC-compatible over HTTP/1.1 and HTTP/2) with protobuf definitions in proto/sreportal/v1/.

All Connect handlers share a unary interceptor (internal/grpc/interceptor.go) that logs handler errors at WARN level. Connect often returns HTTP 200 even when the RPC fails with a coded error, so this makes failures visible in logs alongside the HTTP access log.

DNSService

| RPC | Description |
| --- | --- |
| ListFQDNs | Lists all FQDNs with optional filters (namespace, source, search, portal) |
| StreamFQDNs | Server-streaming RPC that pushes FQDN updates (polls every 5s) |

PortalService

| RPC | Description |
| --- | --- |
| ListPortals | Lists all portals |

AlertmanagerService

| RPC | Description |
| --- | --- |
| ListAlerts | Lists Alertmanager resources with active alerts (filters: portal, namespace, search, state) |

ReleaseService

| RPC | Description |
| --- | --- |
| AddRelease | Append a release entry to the day’s Release CR (type validated against configured allowlist when release.types is set) |
| ListReleases | List release entries for a day (pagination within the day; previous_day / next_day for adjacent days with data) |
| ListReleaseDays | Return all days that have Release CRs and the TTL window (ttl_days) for the UI |

StatusService

| RPC | Description |
| --- | --- |
| ListComponents | Lists platform components with status (filters: portal_ref, group) |
| ListMaintenances | Lists maintenance windows (filters: portal_ref, phase) |
| ListIncidents | Lists incidents (filters: portal_ref, phase) |

VersionService

| RPC | Description |
| --- | --- |
| GetVersion | Return build metadata (version, commit, date) |

MetricsService

| RPC | Description |
| --- | --- |
| ListMetrics | List Prometheus metrics from the operator’s metrics registry |

MCP Servers

The operator includes six built-in Model Context Protocol (MCP) servers on the web server port, using Streamable HTTP transport:

DNS / Portals (mounted at /mcp and /mcp/dns; /mcp is kept for backward compatibility):

| Tool | Description |
| --- | --- |
| search_fqdns | Search FQDNs by query, source, group, portal, or namespace |
| list_portals | List all available portals |
| get_fqdn_details | Get detailed information about a specific FQDN |

Alerts (mounted at /mcp/alerts):

| Tool | Description |
| --- | --- |
| list_alerts | List Alertmanager resources and their active alerts (optional filters: portal, search, state) |

Metrics (mounted at /mcp/metrics):

| Tool | Description |
| --- | --- |
| list_metrics | List Prometheus metrics from the operator’s metrics registry |

Releases (mounted at /mcp/releases):

| Tool | Description |
| --- | --- |
| list_releases | List release entries for a day (previous_day / next_day in the result) |

Network Policies (mounted at /mcp/netpol):

| Tool | Description |
| --- | --- |
| list_network_flows | List network flow nodes and edges (optional filters: portal, namespace, search with 1-hop expansion) |
| get_service_flows | Get all incoming and outgoing flows for a specific service |

Status Page (mounted at /mcp/status):

| Tool | Description |
| --- | --- |
| list_components | List platform components with operational status (filters: portal, group) |
| list_maintenances | List maintenance windows (filters: portal, phase) |
| list_incidents | List incidents (filters: portal, phase) |
| get_platform_status | Aggregated platform health summary with status counts |

Data Flow

The complete flow from Kubernetes resource to web dashboard:

    flowchart TD
    K8s["K8s Resources\n(Service, Ingress, ...)"]
    Source["Source Controller\n(periodic)"]
    DNSRecord["DNSRecord CRs\n(one per portal+sourceType)"]
    DNSCtrl["DNS Controller\n(chain)"]
    Store["ReadStore\n(in-memory)"]
    API["Connect API / MCP"]
    WebUI["Web UI\n(React SPA)"]

    K8s -->|"external-dns.alpha.kubernetes.io/hostname annotation"| Source
    Source -->|"Enrich with sreportal.io/portal, sreportal.io/groups, sreportal.io/ignore\nRoute to portal (ignored endpoints dropped)"| DNSRecord
    DNSRecord -->|"Watched by DNS controller"| DNSCtrl
    DNSCtrl -->|"Push projections"| Store
    Store -->|"Read by"| API
    API --> WebUI
  

Owner References

The source controller sets a Kubernetes owner reference on each DNSRecord it manages, so Kubernetes garbage-collects the DNSRecord automatically when its portal is deleted.