Ctrl Plane

Health Checks

Configure and run health checks against instances using HTTP, TCP, gRPC, or custom commands.

The health subsystem lets you configure checks per instance, run them on a schedule, and query historical results. Health status feeds into the telemetry dashboard and can trigger alerts through the event bus.

Check types

Ctrl Plane ships with four built-in health checkers:

| Type | What it does |
| --- | --- |
| http | Sends an HTTP request to a URL and checks the status code |
| tcp | Opens a TCP connection to a host:port |
| grpc | Calls the gRPC health checking protocol |
| command | Executes a command inside the instance container |
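
As a rough illustration of what the http and tcp check types do, the sketch below uses only the Go standard library. It is not Ctrl Plane's implementation: probeHTTP, probeTCP, and demo are hypothetical names, and treating any 2xx response as a pass is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeHTTP sends a GET request and reports whether the status code is 2xx.
// Hypothetical helper; the pass criterion (any 2xx) is an assumption.
func probeHTTP(ctx context.Context, url string, timeout time.Duration) (bool, int, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300, resp.StatusCode, nil
}

// probeTCP reports whether a TCP connection to addr (host:port) can be opened.
func probeTCP(ctx context.Context, addr string, timeout time.Duration) error {
	d := net.Dialer{Timeout: timeout}
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		return err
	}
	return conn.Close()
}

// demo starts a throwaway local server and probes it both ways.
func demo() (httpOK bool, code int, tcpErr error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	httpOK, code, _ = probeHTTP(context.Background(), srv.URL+"/healthz", 2*time.Second)
	tcpErr = probeTCP(context.Background(), srv.Listener.Addr().String(), 2*time.Second)
	return httpOK, code, tcpErr
}

func main() {
	ok, code, tcpErr := demo()
	fmt.Println(ok, code, tcpErr)
}
```

The TCP probe only proves that the port accepts connections, while the HTTP probe also exercises the application's request handling.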

Configuring a check

check, err := cp.Health.Configure(ctx, health.ConfigureRequest{
    InstanceID: instanceID,
    Name:       "api-health",
    Type:       health.CheckHTTP,
    Target:     "http://localhost:8080/healthz",
    Interval:   30 * time.Second,
    Timeout:    5 * time.Second,
    Retries:    3,
})

The check runs automatically at the configured interval once the background HealthRunner worker is running.
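
The text above does not spell out how Timeout and Retries interact. One plausible reading, sketched below with standard-library code only, is that each attempt gets its own Timeout and a check is reported as failed only after every attempt has failed. The names runWithRetries, probe, flaky, and demo are hypothetical, and the retry semantics are an assumption rather than confirmed behavior.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probe is a hypothetical single health check attempt.
type probe func(ctx context.Context) error

// runWithRetries calls p up to attempts times (at least once), giving each
// attempt its own timeout. It returns the number of attempts made and nil on
// the first success, or the last error if every attempt failed.
// Assumption: this is one possible meaning of Retries.
func runWithRetries(ctx context.Context, p probe, attempts int, timeout time.Duration) (int, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 1; i <= attempts; i++ {
		attemptCtx, cancel := context.WithTimeout(ctx, timeout)
		lastErr = p(attemptCtx)
		cancel()
		if lastErr == nil {
			return i, nil
		}
		if ctx.Err() != nil { // parent context cancelled, stop early
			return i, ctx.Err()
		}
	}
	return attempts, lastErr
}

// flaky returns a probe that fails its first n calls and then succeeds.
func flaky(n int) probe {
	calls := 0
	return func(ctx context.Context) error {
		calls++
		if calls <= n {
			return errors.New("connection refused")
		}
		return nil
	}
}

// demo runs a flaky probe with 3 attempts and a 5 second per-attempt timeout.
func demo(failures int) (int, error) {
	return runWithRetries(context.Background(), flaky(failures), 3, 5*time.Second)
}

func main() {
	used, err := demo(2)
	fmt.Println(used, err)
}
```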

Running a check manually

result, err := cp.Health.RunCheck(ctx, checkID)
// result.Status is health.StatusHealthy, StatusDegraded, StatusUnhealthy, or StatusUnknown
// result.Latency is the time taken
// result.Message has details on failure

Health status

Each check execution produces a HealthResult:

type HealthResult struct {
    ctrlplane.Entity
    CheckID    id.ID         `json:"check_id"    db:"check_id"`
    InstanceID id.ID         `json:"instance_id"  db:"instance_id"`
    Status     Status        `json:"status"       db:"status"`
    Latency    time.Duration `json:"latency"      db:"latency"`
    Message    string        `json:"message"      db:"message"`
    StatusCode int           `json:"status_code"  db:"status_code"`
    CheckedAt  time.Time     `json:"checked_at"   db:"checked_at"`
}

Status values:

| Status | Meaning |
| --- | --- |
| healthy | Check passed |
| degraded | Check passed but with warnings (e.g., high latency) |
| unhealthy | Check failed |
| unknown | Check could not be executed |

Aggregate health

Get the overall health of an instance across all its configured checks:

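// Named instanceHealth so the variable does not shadow the health package.
instanceHealth, err := cp.Health.GetHealth(ctx, instanceID)
// instanceHealth.Status is the worst status across all checks
// instanceHealth.Checks is a list of individual check summaries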
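
"Worst status" implies an ordering over the four status values. The sketch below shows one way such an aggregation could work. The Status type here is a local stand-in, and both the position of unknown in the ranking and the result for an instance with no checks are assumptions made for illustration.

```go
package main

import "fmt"

// Status mirrors the four status strings listed earlier. This local type is a
// stand-in, not the library's definition.
type Status string

const (
	StatusHealthy   Status = "healthy"
	StatusDegraded  Status = "degraded"
	StatusUnhealthy Status = "unhealthy"
	StatusUnknown   Status = "unknown"
)

// severity ranks statuses from best (0) to worst (3). Where unknown sits in
// this ranking is an assumption made for this sketch.
var severity = map[Status]int{
	StatusHealthy:   0,
	StatusDegraded:  1,
	StatusUnknown:   2,
	StatusUnhealthy: 3,
}

// worst returns the most severe status in the list, or unknown when the list
// is empty.
func worst(statuses []Status) Status {
	if len(statuses) == 0 {
		return StatusUnknown
	}
	result := statuses[0]
	for _, s := range statuses[1:] {
		if severity[s] > severity[result] {
			result = s
		}
	}
	return result
}

func main() {
	fmt.Println(worst([]Status{StatusHealthy, StatusDegraded, StatusHealthy}))
}
```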

Check history

Query past results for a specific check:

results, err := cp.Health.GetHistory(ctx, checkID, health.HistoryOptions{
    Limit: 100,
    Since: time.Now().Add(-24 * time.Hour),
})
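
Once a window of results has been fetched, it can be reduced to simple indicators. The sketch below computes the share of healthy results and the mean latency. The result type is a local stand-in carrying only two HealthResult fields, summarize and sample are hypothetical helpers, and the sample values are made up.

```go
package main

import (
	"fmt"
	"time"
)

// result is a local stand-in with only the HealthResult fields used here.
type result struct {
	Status  string
	Latency time.Duration
}

// summarize returns the fraction of results whose status is "healthy" and the
// mean latency across all results. Both are zero for an empty slice.
func summarize(results []result) (healthyRatio float64, meanLatency time.Duration) {
	if len(results) == 0 {
		return 0, 0
	}
	var healthy int
	var total time.Duration
	for _, r := range results {
		if r.Status == "healthy" {
			healthy++
		}
		total += r.Latency
	}
	n := len(results)
	return float64(healthy) / float64(n), total / time.Duration(n)
}

// sample builds four results with made-up values for illustration.
func sample() []result {
	return []result{
		{"healthy", 10 * time.Millisecond},
		{"healthy", 20 * time.Millisecond},
		{"unhealthy", 50 * time.Millisecond},
		{"healthy", 40 * time.Millisecond},
	}
}

func main() {
	ratio, mean := summarize(sample())
	fmt.Printf("healthy: %.0f%%, mean latency: %s\n", ratio*100, mean)
}
```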

Custom checkers

Implement the health.Checker interface and register it with the health service:

type Checker interface {
    Type() CheckType
    Check(ctx context.Context, check *HealthCheck) (*HealthResult, error)
}

// Register during setup
cp.Health.RegisterChecker(myCustomChecker)
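
A minimal sketch of a type that satisfies this interface follows. To keep it self-contained, it declares local stand-ins for CheckType, Status, HealthCheck, and HealthResult with only the fields it needs, so the real definitions may differ. The fileChecker type and its "file" check type are hypothetical, and reporting a missing file as an unhealthy result rather than as an error is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"
)

// Local stand-ins for the health package types, reduced to the fields this
// sketch needs. The real definitions may differ.
type CheckType string
type Status string

type HealthCheck struct {
	Name   string
	Target string
}

type HealthResult struct {
	Status    Status
	Latency   time.Duration
	Message   string
	CheckedAt time.Time
}

// Checker matches the interface shown above.
type Checker interface {
	Type() CheckType
	Check(ctx context.Context, check *HealthCheck) (*HealthResult, error)
}

// fileChecker is a hypothetical custom checker: it passes when the path named
// by check.Target exists.
type fileChecker struct{}

func (fileChecker) Type() CheckType { return "file" }

func (fileChecker) Check(ctx context.Context, check *HealthCheck) (*HealthResult, error) {
	if err := ctx.Err(); err != nil {
		return nil, err // context already cancelled, the check was not run
	}
	start := time.Now()
	res := &HealthResult{Status: "healthy", CheckedAt: start}
	if _, err := os.Stat(check.Target); err != nil {
		res.Status = "unhealthy"
		res.Message = err.Error()
	}
	res.Latency = time.Since(start)
	return res, nil
}

// demo checks one path that exists (the current directory) and one that does not.
func demo() (existing, missing Status) {
	var c Checker = fileChecker{}
	r1, _ := c.Check(context.Background(), &HealthCheck{Name: "cwd", Target: "."})
	r2, _ := c.Check(context.Background(), &HealthCheck{Name: "gone", Target: "./no-such-file-ctrlplane-example"})
	return r1.Status, r2.Status
}

func main() {
	existing, missing := demo()
	fmt.Println(existing, missing)
}
```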

Events

| Event | When |
| --- | --- |
| HealthCheckPassed | A check transitions to healthy |
| HealthCheckFailed | A check transitions to unhealthy |
| HealthDegraded | A check transitions to degraded |
| HealthRecovered | A check transitions from unhealthy back to healthy |
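
The event table can be read as a mapping from status transitions to event names. The sketch below encodes that reading. The eventsFor function is hypothetical, and two points are assumptions: that no event fires when the status is unchanged, and that a change from unhealthy to healthy produces both HealthCheckPassed and HealthRecovered.

```go
package main

import "fmt"

// Status is a local stand-in for the library's status type.
type Status string

// eventsFor returns the event names from the table above that a status change
// would produce. Assumptions: no event fires when the status is unchanged, and
// unhealthy -> healthy emits both HealthCheckPassed and HealthRecovered.
func eventsFor(prev, cur Status) []string {
	if prev == cur {
		return nil
	}
	switch cur {
	case "healthy":
		if prev == "unhealthy" {
			return []string{"HealthCheckPassed", "HealthRecovered"}
		}
		return []string{"HealthCheckPassed"}
	case "unhealthy":
		return []string{"HealthCheckFailed"}
	case "degraded":
		return []string{"HealthDegraded"}
	}
	return nil // a transition to unknown has no event in the table
}

func main() {
	fmt.Println(eventsFor("degraded", "healthy"))
	fmt.Println(eventsFor("unhealthy", "healthy"))
	fmt.Println(eventsFor("healthy", "healthy"))
}
```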
