Operations and Scalability Analysis

TALOS v0.3.0 | April 2026 | Engineering Research Document


1. Current Operational Profile

TALOS v0.3.0 runs three services on Fly.io (AMS region):

Component Fly VM RAM Public endpoint
talos-core shared-cpu-1x 1 GB HTTPS :443
talos-director shared-cpu-1x 512 MB None (worker)
talos-broker shared-cpu-1x 128 MB (compose) TCP 1883, WS 9001
talos-db Fly Postgres 10 GB vol Internal only

The system today handles a small deployment: single-digit stations, single-digit campaigns, and a handful of concurrent dashboard users. The Director runs a physics loop at 2 Hz (LOOP_INTERVAL=0.5s) driving rotator/rig commands and satellite visualization.

Per-tick work (steady state, N stations, C campaigns):

Operation Frequency Cost per call
get_active_assignments() Every tick (2 Hz) 1 SQL query + N+C eager loads
Az/El + Doppler per station Every tick 2 SGP4 propagations + vector math
get_campaign_transmitters() Per tracking station 1 SQL query per station with LOS
update_assignment_status() On AOS/LOS transitions 1 SQL write
predict_passes() Every 10 s per assignment find_events() over 24 h window
calculate_ground_track() Once per campaign (cached) 48 SGP4 propagations
MQTT publishes 2-4 per station per tick QoS 0 for rot/rig, QoS 1 for session
Heartbeat Every tick 1 MQTT publish (QoS 0)
Visualization Per campaign per tick footprint calc + JSON serialize

Current MQTT message rates (N stations, C campaigns, all tracking):

  • Rotator commands: N * 2/s = 2N msg/s
  • Rig commands: N * 2/s = 2N msg/s (when transmitter selected)
  • Viz payloads: C * 2/s = 2C msg/s
  • Heartbeats: 2 msg/s (constant)
  • Pass predictions: N * C / 10 msg/s (throttled)
  • Session start/stop: rare (AOS/LOS events only)

For 5 stations and 2 campaigns with rig commands active: ~27 msg/s outbound from the Director.
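The per-topic rates above combine into a quick budgeting helper. This is a sketch using only the formulas listed; treating rig commands as always active makes it an upper bound (the worst case assumed in Appendix B).

```python
def director_msg_rate(n_stations: int, n_campaigns: int) -> float:
    """Steady-state outbound msg/s from the Director, all stations tracking.

    Upper bound: rig commands are treated as always active, and every
    station-campaign pair is assumed to generate prediction traffic.
    """
    rot = 2 * n_stations          # rotator commands at 2 Hz
    rig = 2 * n_stations          # rig commands at 2 Hz
    viz = 2 * n_campaigns         # per-campaign viz at 2 Hz
    heartbeat = 2                 # constant
    predictions = n_stations * n_campaigns / 10   # 10 s throttle
    return rot + rig + viz + heartbeat + predictions
```

The same formula reproduces the Appendix B totals (51 msg/s at 10 stations / 3 campaigns, 272 at 50 / 10).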


2. Performance Bottlenecks

2.1 Director Loop Timing

The tick function (director.py:404) does all work synchronously on a single thread. The critical path per tick:

  1. Database round-trips dominate. get_active_assignments() opens a session and eagerly loads Assignment, Campaign, Station, and Organization rows. For N active assignments this is 1 + 3N individual session.get() calls. With PostgreSQL on Fly private network (~1 ms RTT), 50 assignments costs ~150 ms just in eager loading.

  2. get_campaign_transmitters() is called for every station that has LOS on every tick. At 2 Hz with 10 stations tracking, that is 20 SQL queries/s.

  3. Pass prediction (predict_passes) calls sat.find_events() over a 24-hour window. Load tests show this takes ~20-50 ms per station. At 100 stations this is 2-5 s every 10 seconds -- enough to cause loop drift.

  4. No batching. MQTT publishes happen one at a time inside the loop. The paho-mqtt client queues them, but serialization of model_dump_json() for each station adds up.

2.2 Database Query Patterns

core/main.py (1630 lines) serves the FastAPI API. Key concerns:

  • N+1 query patterns. The dashboard endpoint (/dashboard, line 501) fetches campaigns and then iterates c.assignments, triggering lazy loads. The org settings and members pages follow the same pattern.

  • SatNOGS sync on startup. sync_satnogs_data() downloads the full TLE catalog (~8000+ entries), compiles EarthSatellite objects in chunks of 500, and holds them all in GLOBAL_SAT_REGISTRY (list in memory). This consumed enough RAM to trigger OOM on the original 512 MB VM, prompting the upgrade to 1 GB and a 30-second deferred start.

  • No connection pooling tuning. Both core and director use create_engine(DATABASE_URL, pool_pre_ping=True) with default pool size (5) and overflow (10). Under concurrent API load this is adequate, but the director occupies a connection on every tick (2/s), potentially starving the API pool.

  • SQLite dev database. The dev database (talos_dev.db, 348 KB) is tiny, masking query performance issues that appear only against Postgres at scale.
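The pool parameters mentioned above are standard SQLAlchemy QueuePool arguments; a sketch of a tuned starting point follows. The specific values are suggestions, not measured optima.

```python
# Suggested starting point for pool tuning (hypothetical values --
# the SQLAlchemy defaults are pool_size=5, max_overflow=10).
ENGINE_KWARGS = {
    "pool_pre_ping": True,   # already set in both services
    "pool_size": 10,         # headroom for the director's 2 queries/s
    "max_overflow": 20,      # burst capacity for concurrent API traffic
    "pool_recycle": 1800,    # recycle idle connections after 30 min
}

# engine = create_engine(DATABASE_URL, **ENGINE_KWARGS)
```

Separate values per service would be reasonable: the director needs few, long-lived connections; the API needs burst capacity.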

2.3 MQTT Fan-Out

Mosquitto is configured with 128 MB (docker-compose.yml) and runs as a single process. Topic structure is per-station (talos/gs/{station_id}/cmd/rot), so the broker does not perform fan-out on command topics -- each message goes to exactly one subscriber (the station agent). The broadcast topics (talos/mission/viz, talos/director/heartbeat) fan out to all connected dashboard WebSocket clients.

The v0.2 org-scoped topics (talos/{org}/gs/{sid}/cmd/rot) add a namespace layer but do not change the 1:1 delivery pattern for commands. Dashboard clients subscribing to talos/{org}/campaign/{id}/viz will receive per-campaign viz at 2 Hz each -- with 50 campaigns that is 100 msg/s to every dashboard.


3. Scaling Analysis

3.1 Physics Loop Budget

The Director must complete one tick within 500 ms (the loop interval). Load test results from test_load.py provide measured per-station physics costs:

Stations Az/El + Doppler (total) Per-station cost Pass prediction (10 s cycle)
10 < 0.5 s (asserted) ~5-10 ms ~200-500 ms
50 < 2.0 s (asserted) ~5-10 ms ~1-2.5 s
100 logged, no assert ~5-10 ms ~2-5 s

The physics computation itself scales linearly -- SGP4 propagation is ~0.1 ms per call, and each station needs 2 propagations (az/el + Doppler). The real bottleneck is database I/O and pass prediction.

3.2 Scenario Projections

10 stations, 3 campaigns (current target)

  • Tick budget: ~50 ms physics + ~50 ms DB = ~100 ms. Comfortable.
  • MQTT: ~51 msg/s (see Appendix B). Mosquitto handles this trivially.
  • DB: ~60 queries/s across both director and API. Fine with default pool.
  • RAM: Director ~100 MB, Core ~400 MB (with TLE registry). Within limits.

50 stations, 10 campaigns

  • Tick budget: ~250 ms physics + ~500 ms DB (eager loads) = ~750 ms. Exceeds 500 ms budget.
  • MQTT: ~220 msg/s. Still within Mosquitto single-process capacity (~50K msg/s).
  • DB: ~300 queries/s. Connection pool saturation likely (default pool_size=5).
  • Pass prediction cycle: ~2.5 s for 50 stations every 10 s. Blocks the main loop.
  • Mitigation required: batch DB queries, run pass prediction in a thread pool.

100 stations, 20 campaigns

  • Tick: physics alone roughly fills the 500 ms budget, and DB I/O pushes the total to ~1.5 s. Loop drift.
  • MQTT: ~440 msg/s. Fine.
  • DB: ~600 queries/s. Need pool_size >= 15 and query batching.
  • Director RAM: ~200 MB (100 station objects, 20 TLE managers). Fits in 512 MB.
  • Pass prediction: ~5 s blocking. Must be moved to async/threaded.
  • Architecture change needed: Director needs async DB queries or a job queue.

500 stations, 50 campaigns

  • Tick: completely infeasible in a single synchronous loop.
  • MQTT: ~2200 msg/s. Still within Mosquitto capacity, but broker VM needs upgrade.
  • DB: ~3000 queries/s. Requires read replicas or query caching (Redis).
  • Architecture change: Director must shard by org or campaign. Multiple director instances, each responsible for a subset of stations. Needs MQTT topic partitioning and a coordination layer (or simply one director per org).

1000 stations, 100 campaigns

  • MQTT: ~4400 msg/s commands + ~200 msg/s viz. Mosquitto on a dedicated 2-CPU VM can handle this, but WebSocket fan-out to dashboards becomes the bottleneck.
  • DB: ~6000+ queries/s. Requires connection pooler (PgBouncer), read replicas, and aggressive caching.
  • Director: must be horizontally sharded. Minimum 5-10 director instances.
  • Core API: must scale to 3+ instances behind a load balancer (Fly already supports flyctl scale count).
  • Total monthly infrastructure: ~$200-400/mo on Fly.io (see Section 7).

3.3 Summary Table

Scale Stations Campaigns Tick fits 500ms? Primary bottleneck Architecture
S 10 3 Yes None Current
M 50 10 No DB eager loads Batch queries
L 100 20 No DB + pass prediction Threaded prediction
XL 500 50 No Single-process director Sharded directors
XXL 1000 100 No Everything Full distributed arch

4. CI/CD Pipeline Review

4.1 Pipeline Structure

The .gitlab-ci.yml defines 7 stages with approximately 18 jobs:

Stage Jobs Key characteristics
lint ruff, mypy ~30 s each, parallel
test test-unit, test-smoke, test-physics, test-campaign, test-integration, test-load, test-agent-hardware 7 jobs, DAG deps on lint
security sast, secret-detection, dependency-scanning GitLab-managed templates
build build-core, build-director, build-agent Docker builds, main branch only
release release-images, create-release Tag-triggered (v*.*.*)
deploy deploy-broker, deploy-core, deploy-director Fly.io, main branch
pages pages MkDocs, main branch + tags

4.2 Build Times and Reliability

  • Test jobs use python:3.10-slim and install dependencies from scratch each run (no pre-built image). The before_script pip installs add 30-60 s to each job. A pre-built CI image would eliminate this overhead.

  • Integration test requires Postgres + Mosquitto services. Service startup adds ~10-15 s. The test itself is the longest single job.

  • Load test is marked slow and runs in CI with a Mosquitto service. It exercises physics benchmarks without hard assertions at 100 stations, which means regressions could go unnoticed.

  • Docker builds use BuildKit but no layer caching between runs. Each build rebuilds from scratch. Adding --cache-from with the registry would cut build times by 50-70%.

  • Retry policy: all jobs retry on runner_system_failure and stuck_or_timeout_failure (max 2). This is appropriate for shared runners.

  • Artifact retention: test reports expire in 30 days; code quality in 7 days. Reasonable for a project this size.

4.3 Missing CI Capabilities

  • No performance regression gate -- load test results are printed but not compared to baselines.
  • No database migration testing -- Alembic migrations are not validated in CI.
  • No end-to-end Docker Compose test (the test_e2e_docker.sh script exists but is not wired into the pipeline).
  • No canary or staged deployment -- main branch deploys directly to production.

5. Observability Gaps

5.1 Logging

Both core and director use Python logging at INFO level with timestamped output. Fly.io captures stdout/stderr. This is functional for debugging but lacks:

  • Structured logging (JSON). Plain text logs are hard to query in log aggregation tools. Switching to JSON format with fields like station_id, campaign_id, elapsed_ms would enable filtering and alerting.

  • Log levels are coarse. The director logs every station registration and tracking state change at INFO. At 100 stations this produces ~200 log lines per AOS/LOS event. No DEBUG-level detail for physics computations.

  • No request ID or correlation ID in the API. Tracing a user action through core -> MQTT -> director -> agent requires manual timestamp matching.

5.2 Metrics

There is no metrics collection (Prometheus, StatsD, or equivalent). Key metrics that should be tracked:

  • Director tick duration (histogram): detect loop drift before it causes missed commands.
  • DB query latency (per-query type): identify N+1 regressions.
  • MQTT publish rate and latency: detect broker backpressure.
  • TLE age per campaign: alert when tracking accuracy degrades.
  • Active station count / tracking sessions: capacity planning baseline.
  • API request latency (p50, p95, p99): SLO tracking.

5.3 Health Checks

  • Core exposes /health (Fly checks every 30 s with 120 s grace period).
  • Director uses pgrep -f 'director.director' -- this checks only that the process is alive, not that the physics loop is running or that the MQTT connection is healthy.
  • Broker uses a mosquitto_sub self-test.

Gap: No deep health check for the director. If the MQTT connection drops or the database becomes unreachable, the director process stays alive but stops functioning. The heartbeat topic (talos/director/heartbeat) could be monitored by core as a liveness signal, but this is not implemented.


5.4 Alerting

No alerting is configured. Fly.io provides basic machine-level alerts, but application-level conditions (loop drift, TLE staleness, DB connection failures) are only visible in logs.


6. Disaster Recovery

6.1 Database Backup

Fly Postgres provides automatic daily snapshots and WAL-based PITR. The current deployment uses --initial-cluster-size 1 (no HA). A single-node failure means:

  • RPO: last WAL segment (~minutes of data).
  • RTO: time to restore from snapshot + replay WAL. Typically 5-15 minutes on Fly.io, but could be longer if the volume is degraded.
  • Recommendation: --initial-cluster-size 2 for automatic failover.

6.2 State Loss Scenarios

Failure Impact Recovery
Director crash Stations stop receiving commands. No data loss. Auto-restart via Fly. Stations safe-mode (park rotator).
Core crash Dashboard offline. API unavailable. Auto-restart. MQTT continues, director unaffected.
Broker crash All real-time communication stops. Auto-restart. Clients auto-reconnect (reconnect_delay_set). QoS 1 messages redelivered.
DB crash (single node) Both core and director fail to query. Fly snapshot restore. 5-15 min outage.
DB corruption Potential data loss. Restore from snapshot. Campaign/assignment state may be stale.
TLE API unavailable Director uses cached TLEs. Accuracy degrades over hours. Graceful degradation built in (TLEManager._fallback()).

6.3 What Is Not Backed Up

  • MQTT message queues: Mosquitto persistence is enabled in production config but not in Fly deployment (no volume mount for /mosquitto/data). A broker restart loses all QoS 1 pending messages.
  • In-memory state: StationManager, MultiTLEManager, and GLOBAL_SAT_REGISTRY are rebuilt from DB/API on restart. Ground track cache is recomputed. This adds 30-60 seconds to recovery.
  • Session cookies: signed with SECRET_KEY. If the key rotates, all users are logged out.

7. Cost Analysis

7.1 Current Infrastructure (Fly.io)

Resource Spec Monthly cost (est.)
Core VM shared-cpu-1x, 1 GB ~$5.70
Director VM shared-cpu-1x, 512 MB ~$3.57
Broker VM shared-cpu-1x, 256 MB ~$2.28
Postgres shared-cpu-1x, 1 GB, 10 GB vol ~$7.00
Outbound transfer < 1 GB/mo at current scale Free tier
Total ~$19/mo

7.2 Projected Costs at Scale

Scale Stations Infra changes Monthly cost (est.)
S (10) 10 None ~$19
M (50) 50 Core 2 GB, Director 1 GB, pool tuning ~$30
L (100) 100 2x Core, Director 2 GB, PgBouncer ~$60
XL (500) 500 3x Core, 5x Director (sharded), Postgres HA, Redis ~$180
XXL (1000) 1000 5x Core, 10x Director, dedicated Postgres, MQTT cluster ~$350

Agent-side costs are borne by station operators (Raspberry Pi or equivalent). Each agent connects to the broker over TCP 1883. Bandwidth per agent is minimal (~1 KB/s inbound commands, ~0.5 KB/s outbound telemetry).

7.3 SatNOGS API Dependency

The TLE sync downloads ~8000 satellite entries on each startup. This is a single HTTP call to the SatNOGS API (no rate limiting documented, but courtesy limits apply). At current scale this is fine, but if multiple director instances each sync independently, the request volume scales linearly with instance count. A shared TLE cache (Redis or a dedicated microservice) would reduce external API calls.


8. Recommendations

Prioritized by impact and effort. Items 1-3 address the most pressing scaling bottlenecks; items 4-7 are important operational improvements; items 8-11 are medium-term architecture work.

P0 -- Critical (before 50 stations)

1. Batch database queries in the Director tick. Replace the N+1 eager-load pattern in get_active_assignments() with a single joined query: SELECT a.*, c.*, s.*, o.* FROM assignment a JOIN campaign c .... This reduces per-tick DB round-trips from 1 + 3N to 1. Estimated effort: 4 hours.
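The single-query pattern can be sketched with plain SQL; the table and column names below are illustrative, not TALOS's actual schema. With the ORM, SQLAlchemy's selectinload()/joinedload() options achieve the same effect.

```python
import sqlite3

# Minimal sketch: one JOIN replaces 1 + 3N individual session.get() calls.
# (Schema is illustrative; in-memory SQLite stands in for Postgres.)
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE organization (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE campaign (id INTEGER PRIMARY KEY, org_id INTEGER, name TEXT);
    CREATE TABLE station (id INTEGER PRIMARY KEY, org_id INTEGER, name TEXT);
    CREATE TABLE assignment (
        id INTEGER PRIMARY KEY, campaign_id INTEGER,
        station_id INTEGER, status TEXT
    );
    INSERT INTO organization VALUES (1, 'demo-org');
    INSERT INTO campaign VALUES (1, 1, 'leo-campaign');
    INSERT INTO station VALUES (1, 1, 'gs-ams');
    INSERT INTO assignment VALUES (1, 1, 1, 'active');
""")

# One round-trip fetches assignment + campaign + station + organization.
rows = db.execute("""
    SELECT a.id, a.status, c.name, s.name, o.name
    FROM assignment a
    JOIN campaign c     ON c.id = a.campaign_id
    JOIN station s      ON s.id = a.station_id
    JOIN organization o ON o.id = c.org_id
    WHERE a.status = 'active'
""").fetchall()
```

At ~1 ms RTT, this turns ~150 ms of eager loading at 50 assignments into a single ~1 ms query.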

2. Move pass prediction to a background thread. predict_passes() blocks the main loop for 20-50 ms per station. At 50 stations this exceeds the tick budget. Run prediction in a ThreadPoolExecutor and cache results. The main loop reads cached predictions; the background thread refreshes them every 10 seconds. Estimated effort: 8 hours.
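The cached-prediction pattern might look like the sketch below; the class and method names are illustrative, with the existing predict_passes() passed in as predict_fn.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class PassPredictionCache:
    """Background pass prediction; the tick loop only reads the cache."""

    def __init__(self, max_workers: int = 4):
        self._cache: dict = {}
        self._lock = threading.Lock()
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    def refresh(self, assignment_id, predict_fn):
        """Submit a background refresh; never blocks the caller."""
        self._executor.submit(self._compute, assignment_id, predict_fn)

    def _compute(self, assignment_id, predict_fn):
        result = predict_fn(assignment_id)   # the slow find_events() call
        with self._lock:
            self._cache[assignment_id] = (time.monotonic(), result)

    def get(self, assignment_id):
        """Called from the tick: last cached result, or None if not ready."""
        with self._lock:
            entry = self._cache.get(assignment_id)
        return entry[1] if entry else None
```

The tick reads get() every cycle and calls refresh() when the cached timestamp is older than 10 seconds, so a slow prediction can never stall the 500 ms loop.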

3. Add Director deep health check. Expose a simple HTTP endpoint (or publish to a sentinel MQTT topic) that confirms the physics loop completed its last tick within 2x the expected interval. Core can monitor this and alert/restart if the director is stuck.
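The loop-freshness check reduces to stamping each completed tick and comparing against the interval; a minimal sketch (names are illustrative):

```python
import time

class LoopWatchdog:
    """Deep liveness: is the physics loop actually ticking?"""

    def __init__(self, interval_s: float = 0.5):
        self.interval_s = interval_s
        self._last_tick = time.monotonic()

    def tick_completed(self):
        """Call at the end of every successful tick."""
        self._last_tick = time.monotonic()

    def healthy(self) -> bool:
        """True if the last tick landed within 2x the loop interval."""
        return (time.monotonic() - self._last_tick) <= 2 * self.interval_s
```

healthy() can back a tiny HTTP endpoint or a sentinel MQTT publish; either way, a stalled loop (or a dead MQTT/DB connection that stops ticks from completing) now fails the check even though the process is alive.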

P1 -- Important (before 100 stations)

4. Structured JSON logging. Switch both services to JSON-formatted logs with contextual fields. This enables log-based alerting (e.g., "loop drift > 200 ms") and integration with Grafana Loki, Datadog, or similar.
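A stdlib-only formatter is enough to start; structlog or python-json-logger would be the usual production choice. The contextual field names below match the ones suggested in Section 5.1.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Minimal JSON log formatter (stdlib only)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Contextual fields passed via `extra=` land on the record.
        for key in ("station_id", "campaign_id", "elapsed_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

logger = logging.getLogger("director")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("tick complete", extra={"station_id": 7, "elapsed_ms": 42})
```

Each line is then one JSON object, directly queryable in Loki, Datadog, or `jq` over Fly logs.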

5. Add Prometheus metrics. Instrument tick duration, DB query count/latency, MQTT publish rate, active stations, and TLE age. Expose a /metrics endpoint on both core and director. The existing Fly.io Grafana integration can scrape these.
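As a stand-in for prometheus_client's Histogram, the tick-duration metric can be sketched with the stdlib; the class and threshold here are illustrative only.

```python
from collections import deque

class TickDurationTracker:
    """Rolling window of tick durations (prometheus_client's Histogram
    would replace this in production)."""

    def __init__(self, window: int = 1200):   # ~10 min of ticks at 2 Hz
        self.samples = deque(maxlen=window)

    def observe(self, seconds: float):
        self.samples.append(seconds)

    def quantile(self, q: float) -> float:
        ordered = sorted(self.samples)
        idx = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[idx]

tracker = TickDurationTracker()
for ms in (40, 45, 50, 55, 480):              # one slow outlier tick
    tracker.observe(ms / 1000)

drift_risk = tracker.quantile(0.95) > 0.5     # alert against the 500 ms budget
```

The same observe()/quantile() shape applies to DB query latency and MQTT publish latency; the alert condition is always a quantile against a budget.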

6. Docker build caching in CI. Add --cache-from $CI_REGISTRY_IMAGE/core:latest to the Docker build steps. This should reduce build times from ~3-5 minutes to ~1-2 minutes per image.

7. Performance regression gates. Record load test results as CI artifacts and compare against a baseline. Fail the pipeline if the 50-station tick time regresses by more than 20%.
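The gate itself is small; a sketch follows, with the file names and JSON key being illustrative (the baseline would live in the repo, the current result come from the test-load job).

```python
import json

THRESHOLD = 0.20   # fail the pipeline on >20% regression

def check_regression(baseline_path: str, current_path: str) -> bool:
    """True if the current 50-station tick time is within threshold."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)
    key = "tick_time_50_stations_s"   # illustrative metric name
    regression = (current[key] - baseline[key]) / baseline[key]
    return regression <= THRESHOLD
```

Run it as the last step of test-load and exit nonzero on failure; improvements update the committed baseline in review.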

P2 -- Medium Term (before 500 stations)

8. Director sharding. Partition the workload so each director instance handles a subset of organizations or campaigns. The simplest approach: one director per organization, selected by ORG_ID environment variable. The director queries only assignments for its org. MQTT topics are already org-scoped (v0.2 topic structure), so no broker changes are needed.
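The shard selection reduces to a filter keyed on the ORG_ID environment variable; a sketch (assignment shape is illustrative):

```python
import os

# Unset ORG_ID => legacy single-director mode handling everything.
ORG_ID = os.environ.get("ORG_ID")

def shard_assignments(assignments, org_id=None):
    """Keep only assignments belonging to this shard's org."""
    if org_id is None:
        return list(assignments)
    return [a for a in assignments if a["org_id"] == org_id]

# In the tick: active = shard_assignments(get_active_assignments(), ORG_ID)
```

In practice the filter belongs in SQL (WHERE org_id = :org_id on the assignment query) so each shard never loads other orgs' rows; the Python version just shows the partitioning rule.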

9. Connection pooling with PgBouncer. Deploy PgBouncer between the application and Postgres to handle connection multiplexing. Set the director pool to transaction mode (short-lived sessions) and the API pool to session mode (for long-running requests).

10. Upgrade Fly Postgres to HA. Switch to --initial-cluster-size 2 for automatic primary failover. Cost increase: ~$7/mo. Eliminates the single-node DB as a SPOF.

11. MQTT broker persistence volume. Mount a Fly volume to /mosquitto/data so QoS 1 messages survive broker restarts. Currently, a broker restart drops all undelivered messages.


Appendix A: File Reference

File Lines Role
director/director.py 821 Main director loop, MQTT callbacks, DB queries
director/physics.py 140 SGP4 propagation, Doppler, pass prediction, ground track
director/station_manager.py 188 Thread-safe station registry with multi-campaign tracking
director/tle_manager.py 209 TLE fetch/cache with multi-campaign support
core/main.py 1630 FastAPI API, auth, RBAC, SatNOGS sync, dashboard
core/database.py ~150 SQLModel definitions (Organization, Station, Campaign, etc.)
ops/docker-compose.yml 105 Production-like local deployment topology
fly/core.toml 51 Fly.io Core config (1 GB, shared-cpu-1x)
fly/director.toml 25 Fly.io Director config (512 MB, shared-cpu-1x)
.gitlab-ci.yml 458 7 stages, ~18 jobs, DAG dependencies
tests/test_integration/test_load.py 447 Physics and MQTT load benchmarks

Appendix B: MQTT Topic Budget at Scale

Messages per second from Director at steady state (all stations tracking):

N stations C campaigns Rot cmd Rig cmd Viz Heartbeat Predictions Total
10 3 20 20 6 2 3 51
50 10 100 100 20 2 50 272
100 20 200 200 40 2 200 642
500 50 1000 1000 100 2 2500 4602
1000 100 2000 2000 200 2 10000 14202

Note: predictions column assumes all station-campaign pairs are active. In practice, only stations assigned to a campaign generate prediction traffic, and the 10-second throttle bounds the burst rate. Mosquitto benchmarks show single-instance throughput of ~50,000 msg/s for small payloads (< 1 KB), so the broker itself is not the bottleneck until well past 1000 stations.