Scheduler
The scheduler is a background service that runs a reconciliation loop. It matches pending job demand to available RISC-V node capacity, provisions runner pods, syncs worker state with Kubernetes and GitHub, and cleans up terminated pods.
Source: container/scheduler.py
Reconciliation loop
The scheduler is woken by PostgreSQL LISTEN/NOTIFY (the ghfe webhook handler emits a `queue_event` notification when a new job is recorded) or by a 15-second timeout. Each iteration runs three operations:
- Job sync (`sync_jobs_state`): reconciles job rows in PostgreSQL with GitHub’s actual job status. Catches missed or out-of-order webhooks.
- Worker sync (`sync_workers_state`): a single transaction that holds `LOCK TABLE workers IN EXCLUSIVE MODE` for its duration and runs five phases: orphan sweep, k8s pod phase → worker status sync, health checks (kill stuck-pending and never-registered runners), GitHub-side cleanup of terminal/orphaned runners, and deletion of pods past the 6-hour grace period.
- Demand match (`demand_match`): provisions new runner pods where demand exceeds supply.
Only one scheduler at a time may run `sync_workers_state`. If a second instance is deployed, it blocks on the table lock until the first commits.
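The wake-or-timeout shape of the loop can be sketched as follows. This is a minimal stdlib-only illustration: a `threading.Event` stands in for the PostgreSQL `queue_event` notification (the real scheduler would `select()` on the database connection), and the three operation methods are stubs with hypothetical names matching the list above.

```python
# Sketch of the reconciliation loop's wake-or-timeout shape. A threading.Event
# stands in for the LISTEN/NOTIFY wakeup; the operation methods are stubs.
import threading

POLL_INTERVAL_SECONDS = 15

class Scheduler:
    def __init__(self) -> None:
        self.wake = threading.Event()  # set when a queue_event NOTIFY arrives
        self.iterations = 0

    def notify(self) -> None:
        """Called by the notification listener when a new job is recorded."""
        self.wake.set()

    def run(self, max_iterations: int) -> None:
        while self.iterations < max_iterations:
            # Block until notified or the 15-second timeout elapses,
            # then run one full iteration regardless of which one fired.
            self.wake.wait(timeout=POLL_INTERVAL_SECONDS)
            self.wake.clear()
            self.sync_jobs_state()
            self.sync_workers_state()
            self.demand_match()
            self.iterations += 1

    # Stubs standing in for the real operations.
    def sync_jobs_state(self) -> None: ...
    def sync_workers_state(self) -> None: ...
    def demand_match(self) -> None: ...
```

Running one iteration unconditionally after every wakeup (rather than per notification) means a burst of `queue_event` notifications collapses into a single pass, which keeps the loop idempotent.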
Demand matching
For each pending job (processed FIFO by `created_at`):
- Demand check. Compute `demand = COUNT(jobs WHERE entity_id, job_labels match AND status IN (pending, running))` and `supply = COUNT(workers WHERE entity_id, job_labels match AND status IN (pending, running))`. Skip if `supply >= demand`.
- Max workers cap. Skip if the entity (organization or personal account) has reached its `max_workers` limit across all pools.
- Capacity check. Query Kubernetes for available `riseproject.com/runner` slots on nodes matching the pool’s node selector. Skip if no slots are free.
- Provision. If all checks pass:
  - Reserve a worker name in PostgreSQL.
  - Authenticate with the correct GitHub App (org app or personal app, based on entity type).
  - For organizations: ensure a runner group named “RISE RISC-V Runners” exists, then create an org-scoped JIT runner config.
  - For personal accounts: create a repo-scoped JIT runner config.
  - Create a Kubernetes pod with the JIT config injected as `RUNNER_JITCONFIG`.
Demand and supply are matched by (entity_id, job_labels) rather than by pool. This avoids stuck workers when different label sets map to the same pool but a workflow expects matching runner labels.
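The three checks can be sketched as a pure decision function. This is an illustration only: the real scheduler computes demand and supply with SQL `COUNT`s, while here in-memory records and the helper names (`Record`, `should_provision`) are assumed for the example.

```python
# Sketch of the per-job provisioning decision, matching on (entity_id, job_labels).
from dataclasses import dataclass

ACTIVE = {"pending", "running"}

@dataclass(frozen=True)
class Record:
    entity_id: str
    job_labels: tuple[str, ...]
    status: str

def should_provision(job, jobs, workers, entity_worker_count, max_workers, free_slots):
    key = (job.entity_id, job.job_labels)
    demand = sum(1 for j in jobs
                 if (j.entity_id, j.job_labels) == key and j.status in ACTIVE)
    supply = sum(1 for w in workers
                 if (w.entity_id, w.job_labels) == key and w.status in ACTIVE)
    if supply >= demand:
        return False  # demand check: enough workers already match this key
    if entity_worker_count >= max_workers:
        return False  # max workers cap across all pools for the entity
    if free_slots <= 0:
        return False  # capacity check: no riseproject.com/runner slots free
    return True
```

Because the key is `(entity_id, job_labels)` rather than the pool, two label sets that map to the same pool are counted separately, which is exactly what avoids the stuck-worker case described above.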
Pod provisioning
Pods are created via the Kubernetes API with:
- Node selector:
riseproject.dev/board: {pool}(targets the correct hardware). - Resource limit:
riseproject.com/runner: 1(enforces one pod per node via the device plugin). - Active deadline: 525,600 seconds (~6 days) (prevents stuck pods).
- Security context:
privileged: true(required by the in-pod Docker daemon). - Environment:
RUNNER_JITCONFIG(the base64 JIT token). - No init containers, no volumes. Runner pods are a single container; the image’s entrypoint launches the GitHub Actions runner directly. See Container Images.
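Assembled as a manifest, the fields above look roughly like this. The builder function, its parameters, and the pod/image names are illustrative; the real code submits an equivalent spec through the Kubernetes API.

```python
# Sketch of the runner pod manifest, built as a plain dict.
def build_runner_pod(name: str, pool: str, jit_config_b64: str, image: str) -> dict:
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Target the correct hardware for the pool.
            "nodeSelector": {"riseproject.dev/board": pool},
            # ~6 days; prevents stuck pods.
            "activeDeadlineSeconds": 525_600,
            "containers": [{
                "name": "runner",
                "image": image,
                # Required by the in-pod Docker daemon.
                "securityContext": {"privileged": True},
                # The base64 JIT token consumed by the runner entrypoint.
                "env": [{"name": "RUNNER_JITCONFIG", "value": jit_config_b64}],
                # One pod per node, enforced via the device plugin.
                "resources": {"limits": {"riseproject.com/runner": "1"}},
            }],
        },
    }
```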
Health checks
Two health checks run inside `sync_workers_state`. Rather than deleting the pod directly, the scheduler patches `spec.activeDeadlineSeconds = 1`. The kubelet then transitions the pod to `Failed` (reason `DeadlineExceeded`), so it enters the normal grace-and-delete flow and its logs and events remain inspectable.
- `runner_never_registered`: pod has been `Running` for more than `RUNNER_REGISTRATION_TIMEOUT_SECONDS` (120s) but the runner never appeared in the GitHub API. The worker is marked `failed` with full diagnostics in `failure_info`, then the pod is killed so its slot frees up for a retry.
- `pod_stuck_pending`: pod has been `Pending` for more than `POD_PENDING_TIMEOUT_SECONDS` (600s), typically due to missing capacity or an image-pull failure.
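The two checks and the soft-kill patch can be sketched as small pure functions. Assumptions: pod phase, pod age, and runner-registration state are already fetched; the function names are illustrative, and the constants follow the configuration table below.

```python
# Sketch of the two health-check timeouts and the soft-kill patch body.
RUNNER_REGISTRATION_TIMEOUT_SECONDS = 120
POD_PENDING_TIMEOUT_SECONDS = 600

def health_check_failure(phase: str, age_seconds: float, runner_registered: bool):
    """Return the failure reason for a worker's pod, or None if healthy."""
    if phase == "Pending" and age_seconds > POD_PENDING_TIMEOUT_SECONDS:
        # Typically missing capacity or an image-pull failure.
        return "pod_stuck_pending"
    if (phase == "Running" and not runner_registered
            and age_seconds > RUNNER_REGISTRATION_TIMEOUT_SECONDS):
        # Pod started but the runner never appeared in the GitHub API.
        return "runner_never_registered"
    return None

def soft_kill_patch() -> dict:
    # Patch body: let the kubelet fail the pod (DeadlineExceeded) instead of
    # deleting it, so logs and events stay inspectable.
    return {"spec": {"activeDeadlineSeconds": 1}}
```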
If GitHub refuses to delete a runner that is still busy, `sync_workers_state` aborts the cleanup for that worker and retries on the next loop iteration.
Cleanup
Terminated pods (Succeeded or Failed) are kept for POD_DELETE_GRACE_SECONDS (6 hours) so logs and events remain accessible via kubectl. The worker row in PostgreSQL is updated to completed/failed immediately on phase transition, so pool supply accounting stays accurate throughout the grace period. After the grace period the pod is deleted; the worker row is never deleted.
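The grace-period decision reduces to a small predicate. This is a sketch under assumed inputs: a pod phase string and a termination timestamp, rather than the real Kubernetes status fields.

```python
# Sketch of the grace-period check for terminated pods.
from datetime import datetime, timedelta, timezone

POD_DELETE_GRACE_SECONDS = 6 * 3600  # 6 hours

def should_delete_pod(phase: str, terminated_at: datetime, now: datetime) -> bool:
    # Only terminated pods are eligible; the worker row is never deleted.
    if phase not in ("Succeeded", "Failed"):
        return False
    return now - terminated_at >= timedelta(seconds=POD_DELETE_GRACE_SECONDS)
```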
Configuration
| Setting | Value | Source |
|---|---|---|
| Poll interval | 15 seconds | scheduler.py |
| Max workers per entity | Configurable per org/account | constants.py (ENTITY_CONFIG) |
| Pod active deadline | 525,600 seconds | k8s.py |
| Pod delete grace | 6 hours | scheduler.py |
| Runner registration timeout | 120 seconds | constants.py |
| Pod pending timeout | 600 seconds | constants.py |
Related files
- container/scheduler.py: reconciliation loop, demand matching, worker state sync, cleanup.
- container/k8s.py: Kubernetes pod provisioning, capacity checks, failure-info collection.
- container/github.py: GitHub API (JIT config, runner groups, job status).
- container/db.py: PostgreSQL operations for jobs and workers.