Operations and Scalability Analysis¶
TALOS v0.3.0 | April 2026 | Engineering Research Document
1. Current Operational Profile¶
TALOS v0.3.0 runs three application services on Fly.io (AMS region), backed by a managed Fly Postgres database:
| Component | Fly VM | RAM | Public endpoint |
|---|---|---|---|
| `talos-core` | shared-cpu-1x | 1 GB | HTTPS :443 |
| `talos-director` | shared-cpu-1x | 512 MB | None (worker) |
| `talos-broker` | shared-cpu-1x | 128 MB (compose) | TCP 1883, WS 9001 |
| `talos-db` | Fly Postgres | 10 GB vol | Internal only |
The system today handles a small deployment: single-digit stations, single-digit
campaigns, and a handful of concurrent dashboard users. The Director runs a
physics loop at 2 Hz (LOOP_INTERVAL=0.5s) driving rotator/rig commands and
satellite visualization.
Per-tick work (steady state, N stations, C campaigns):
| Operation | Frequency | Cost per call |
|---|---|---|
| `get_active_assignments()` | Every tick (2 Hz) | 1 SQL query + N+C eager loads |
| Az/El + Doppler per station | Every tick | 2 SGP4 propagations + vector math |
| `get_campaign_transmitters()` | Per tracking station | 1 SQL query per station with LOS |
| `update_assignment_status()` | On AOS/LOS transitions | 1 SQL write |
| `predict_passes()` | Every 10 s per assignment | `find_events()` over 24 h window |
| `calculate_ground_track()` | Once per campaign (cached) | 48 SGP4 propagations |
| MQTT publishes | 2-4 per station per tick | QoS 0 for rot/rig, QoS 1 for session |
| Heartbeat | Every tick | 1 MQTT publish (QoS 0) |
| Visualization | Per campaign per tick | Footprint calc + JSON serialize |
Current MQTT message rates (N stations, C campaigns, all tracking):
- Rotator commands: N * 2/s = 2N msg/s
- Rig commands: N * 2/s = 2N msg/s (when transmitter selected)
- Viz payloads: C * 2/s = 2C msg/s
- Heartbeats: 2 msg/s (constant)
- Pass predictions: N * C / 10 msg/s (throttled)
- Session start/stop: rare (AOS/LOS events only)
For 5 stations and 2 campaigns: ~24 msg/s outbound from the Director (a small calculator reproducing these rates appears below).
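As a sanity check, the per-tick rates above can be folded into a small calculator. This is a sketch, not project code: the function name is illustrative, and it assumes every station has a transmitter selected and every station-campaign pair generates throttled predictions (the same worst-case assumptions as Appendix B).

```python
def director_msg_rate(stations: int, campaigns: int) -> float:
    """Rough steady-state outbound msg/s from the Director.

    Assumes every station is tracking with a transmitter selected and
    every station-campaign pair generates throttled pass predictions.
    """
    rot = 2 * stations                         # rotator commands at 2 Hz per station
    rig = 2 * stations                         # rig commands at 2 Hz per station
    viz = 2 * campaigns                        # one viz payload per campaign per tick
    heartbeat = 2                              # director heartbeat, every tick
    predictions = stations * campaigns / 10    # 10 s throttle per station-campaign pair
    return rot + rig + viz + heartbeat + predictions

# Matches the Appendix B budget: 10 stations / 3 campaigns -> 51 msg/s,
# 50 stations / 10 campaigns -> 272 msg/s.
print(director_msg_rate(10, 3), director_msg_rate(50, 10))
```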
2. Performance Bottlenecks¶
2.1 Director Loop Timing¶
The tick function (director.py:404) does all work synchronously on a single
thread. The critical path per tick:
- Database round-trips dominate. `get_active_assignments()` opens a session and eagerly loads Assignment, Campaign, Station, and Organization rows. For N active assignments this is 1 + 3N individual `session.get()` calls. With PostgreSQL on the Fly private network (~1 ms RTT), 50 assignments cost ~150 ms in eager loading alone.
- `get_campaign_transmitters()` is called for every station that has LOS on every tick. At 2 Hz with 10 stations tracking, that is 20 SQL queries/s.
- Pass prediction (`predict_passes`) calls `sat.find_events()` over a 24-hour window. Load tests show this takes ~20-50 ms per station. At 100 stations this is 2-5 s every 10 seconds -- enough to cause loop drift.
- No batching. MQTT publishes happen one at a time inside the loop. The `paho-mqtt` client queues them, but serializing `model_dump_json()` for each station adds up. (A per-phase timing sketch follows this list.)
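A simple way to verify that DB I/O dominates is to time each phase of the tick. The sketch below is a diagnostic harness, not TALOS code: `load_assignments`, `compute_pointing`, and `publish_commands` are placeholders for the real DB, physics, and MQTT phases in director.py.

```python
import logging
import time

log = logging.getLogger("director.tick")

def timed_tick(load_assignments, compute_pointing, publish_commands,
               budget_s: float = 0.5) -> None:
    """Run one tick, logging how long each phase takes.

    The three callables are placeholders for the real DB, physics,
    and MQTT phases of the Director loop.
    """
    phases = {}
    t0 = time.monotonic()

    assignments = load_assignments()           # DB round-trips
    phases["db_ms"] = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    targets = compute_pointing(assignments)    # SGP4 / az-el / Doppler
    phases["physics_ms"] = (time.monotonic() - t1) * 1000

    t2 = time.monotonic()
    publish_commands(targets)                  # MQTT publishes
    phases["mqtt_ms"] = (time.monotonic() - t2) * 1000

    total_ms = (time.monotonic() - t0) * 1000
    if total_ms > budget_s * 1000:
        log.warning("tick over budget: %.1f ms (%s)", total_ms, phases)
    else:
        log.info("tick %.1f ms (%s)", total_ms, phases)
```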
2.2 Database Query Patterns¶
`core/main.py` (1630 lines) serves the FastAPI API. Key concerns:
- N+1 query patterns. The dashboard endpoint (`/dashboard`, line 501) fetches campaigns and then iterates `c.assignments`, triggering lazy loads. The org settings and members pages follow the same pattern. (A `selectinload` sketch follows this list.)
- SatNOGS sync on startup. `sync_satnogs_data()` downloads the full TLE catalog (~8,000+ entries), compiles `EarthSatellite` objects in chunks of 500, and holds them all in `GLOBAL_SAT_REGISTRY` (an in-memory list). This consumed enough RAM to trigger OOM on the original 512 MB VM, prompting the upgrade to 1 GB and a 30-second deferred start.
- No connection pooling tuning. Both core and director use `create_engine(DATABASE_URL, pool_pre_ping=True)` with the default pool size (5) and overflow (10). Under concurrent API load this is adequate, but the director occupies a connection on every tick (2/s), potentially starving the API pool.
- SQLite dev database. The dev database (`talos_dev.db`, 348 KB) is tiny, masking query performance issues that only appear against Postgres at scale.
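Both the N+1 pattern and the pool sizing can be addressed with small changes. A minimal sketch, assuming the SQLModel models in core/database.py expose the `Campaign.assignments` relationship implied by the lazy-load pattern above; the import path and pool numbers are illustrative, not tuned values.

```python
import os

from sqlalchemy.orm import selectinload
from sqlmodel import Session, create_engine, select

from core.database import Campaign   # import path assumed

# Larger pool so the director's steady 2 queries/s do not starve API
# requests (SQLAlchemy defaults: pool_size=5, max_overflow=10).
engine = create_engine(
    os.environ["DATABASE_URL"],
    pool_pre_ping=True,
    pool_size=15,
    max_overflow=20,
)

def dashboard_campaigns(session: Session) -> list[Campaign]:
    """Fetch campaigns plus their assignments in two queries total,
    instead of one query plus one lazy load per campaign."""
    stmt = select(Campaign).options(selectinload(Campaign.assignments))
    return list(session.exec(stmt).all())
```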
2.3 MQTT Fan-Out¶
Mosquitto is configured with a 128 MB memory limit (docker-compose.yml) and runs as a single
process. Topic structure is per-station (`talos/gs/{station_id}/cmd/rot`), so
the broker does not perform fan-out on command topics -- each message goes to
exactly one subscriber (the station agent). The broadcast topics
(`talos/mission/viz`, `talos/director/heartbeat`) fan out to all connected
dashboard WebSocket clients.
The v0.2 org-scoped topics (`talos/{org}/gs/{sid}/cmd/rot`) add a namespace
layer but do not change the 1:1 delivery pattern for commands. Dashboard clients
subscribing to `talos/{org}/campaign/{id}/viz` will receive per-campaign viz at
2 Hz each -- with 50 campaigns that is 100 msg/s to every dashboard.
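One way to bound the dashboard-side fan-out is to subscribe only to the campaigns a given dashboard actually displays rather than a wildcard. A paho-mqtt sketch with illustrative host, org, and campaign values (paho-mqtt 2.x additionally requires a `CallbackAPIVersion` argument to `Client()`):

```python
import paho.mqtt.client as mqtt

ORG = "demo-org"                            # illustrative org slug
VISIBLE_CAMPAIGNS = {12, 17}                # only the campaigns shown on this dashboard
BROKER_HOST = "broker.example.internal"     # illustrative hostname

def on_message(client, userdata, msg):
    # Stand-in for pushing the viz payload to the browser session.
    print(msg.topic, len(msg.payload), "bytes")

client = mqtt.Client()                      # paho-mqtt 1.x constructor
client.on_message = on_message
client.connect(BROKER_HOST, 1883)

# Subscribing per campaign keeps received viz traffic proportional to what
# the user is viewing (2 msg/s per subscribed campaign) rather than 2C msg/s.
# Subscribing inside on_connect instead would survive broker reconnects.
for cid in VISIBLE_CAMPAIGNS:
    client.subscribe(f"talos/{ORG}/campaign/{cid}/viz", qos=0)

client.loop_forever()
```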
3. Scaling Analysis¶
3.1 Physics Loop Budget¶
The Director must complete one tick within 500 ms (the loop interval). Load test
results from test_load.py provide measured per-station physics costs:
| Stations | Az/El + Doppler (total) | Per-station cost | Pass prediction (10 s cycle) |
|---|---|---|---|
| 10 | < 0.5 s (asserted) | ~5-10 ms | ~200-500 ms |
| 50 | < 2.0 s (asserted) | ~5-10 ms | ~1-2.5 s |
| 100 | logged, no assert | ~5-10 ms | ~2-5 s |
The physics computation itself scales linearly -- SGP4 propagation is ~0.1 ms per call, and each station needs 2 propagations (az/el + Doppler). The real bottleneck is database I/O and pass prediction.
3.2 Scenario Projections¶
10 stations, 3 campaigns (current target)
- Tick budget: ~50 ms physics + ~50 ms DB = ~100 ms. Comfortable.
- MQTT: ~64 msg/s. Mosquitto handles this trivially.
- DB: ~60 queries/s across both director and API. Fine with default pool.
- RAM: Director ~100 MB, Core ~400 MB (with TLE registry). Within limits.
50 stations, 10 campaigns
- Tick budget: ~250 ms physics + ~500 ms DB (eager loads) = ~750 ms. Exceeds 500 ms budget.
- MQTT: ~220 msg/s. Still within Mosquitto single-process capacity (~50K msg/s).
- DB: ~300 queries/s. Connection pool saturation likely (default pool_size=5).
- Pass prediction cycle: ~2.5 s for 50 stations every 10 s. Blocks the main loop.
- Mitigation required: batch DB queries, run pass prediction in a thread pool.
100 stations, 20 campaigns
- Tick: physics alone (~500 ms) already consumes the full budget, and DB I/O pushes the total to ~1.5 s. Loop drift.
- MQTT: ~440 msg/s. Fine.
- DB: ~600 queries/s. Need pool_size >= 15 and query batching.
- Director RAM: ~200 MB (100 station objects, 20 TLE managers). Fits in 512 MB.
- Pass prediction: ~5 s blocking. Must be moved to async/threaded.
- Architecture change needed: Director needs async DB queries or a job queue.
500 stations, 50 campaigns
- Tick: completely infeasible in a single synchronous loop.
- MQTT: ~2200 msg/s. Still within Mosquitto capacity, but broker VM needs upgrade.
- DB: ~3000 queries/s. Requires read replicas or query caching (Redis).
- Architecture change: Director must shard by org or campaign. Multiple director instances, each responsible for a subset of stations. Needs MQTT topic partitioning and a coordination layer (or simply one director per org).
1000 stations, 100 campaigns
- MQTT: ~4400 msg/s commands + ~200 msg/s viz. Mosquitto on a dedicated 2-CPU VM can handle this, but WebSocket fan-out to dashboards becomes the bottleneck.
- DB: ~6000+ queries/s. Requires connection pooler (PgBouncer), read replicas, and aggressive caching.
- Director: must be horizontally sharded. Minimum 5-10 director instances.
- Core API: must scale to 3+ instances behind a load balancer (Fly already supports `flyctl scale count`).
- Total monthly infrastructure: ~$200-400/mo on Fly.io (see Section 7).
3.3 Summary Table¶
| Scale | Stations | Campaigns | Tick fits 500ms? | Primary bottleneck | Architecture |
|---|---|---|---|---|---|
| S | 10 | 3 | Yes | None | Current |
| M | 50 | 10 | No | DB eager loads | Batch queries |
| L | 100 | 20 | No | DB + pass prediction | Threaded prediction |
| XL | 500 | 50 | No | Single-process director | Sharded directors |
| XXL | 1000 | 100 | No | Everything | Full distributed arch |
4. CI/CD Pipeline Review¶
4.1 Pipeline Structure¶
The `.gitlab-ci.yml` defines 7 stages with approximately 18 jobs:
| Stage | Jobs | Key characteristics |
|---|---|---|
| lint | `ruff`, `mypy` | ~30 s each, parallel |
| test | `test-unit`, `test-smoke`, `test-physics`, `test-campaign`, `test-integration`, `test-load`, `test-agent-hardware` | 7 jobs, DAG deps on lint |
| security | `sast`, `secret-detection`, `dependency-scanning` | GitLab-managed templates |
| build | `build-core`, `build-director`, `build-agent` | Docker builds, main branch only |
| release | `release-images`, `create-release` | Tag-triggered (`v*.*.*`) |
| deploy | `deploy-broker`, `deploy-core`, `deploy-director` | Fly.io, main branch |
| pages | `pages` | MkDocs, main branch + tags |
4.2 Build Times and Reliability¶
- Test jobs use `python:3.10-slim` and install dependencies from scratch on each run (no pre-built image). The `before_script` pip installs add 30-60 s to each job. A pre-built CI image would eliminate this overhead.
- The integration test requires Postgres + Mosquitto services. Service startup adds ~10-15 s. The test itself is the longest single job.
- The load test is marked `slow` and runs in CI with a Mosquitto service. It exercises physics benchmarks without hard assertions at 100 stations, which means regressions could go unnoticed.
- Docker builds use BuildKit but no layer caching between runs. Each build rebuilds from scratch. Adding `--cache-from` with the registry would cut build times by 50-70%.
- Retry policy: all jobs retry on `runner_system_failure` and `stuck_or_timeout_failure` (max 2). This is appropriate for shared runners.
- Artifact retention: test reports expire in 30 days; code quality reports in 7 days. Reasonable for a project this size.
4.3 Missing CI Capabilities¶
- No performance regression gate -- load test results are printed but not compared to baselines.
- No database migration testing -- Alembic migrations are not validated in CI.
- No end-to-end Docker Compose test (the `test_e2e_docker.sh` script exists but is not wired into the pipeline).
- No canary or staged deployment -- the main branch deploys directly to production.
5. Observability Gaps¶
5.1 Logging¶
Both core and director use Python logging at INFO level with timestamped output.
Fly.io captures stdout/stderr. This is functional for debugging but lacks:
- Structured logging (JSON). Plain-text logs are hard to query in log aggregation tools. Switching to JSON format with fields like `station_id`, `campaign_id`, and `elapsed_ms` would enable filtering and alerting. (A minimal sketch follows this list.)
- Log levels are coarse. The director logs every station registration and tracking state change at INFO. At 100 stations this produces ~200 log lines per AOS/LOS event. There is no DEBUG-level detail for physics computations.
- No request ID or correlation ID in the API. Tracing a user action through core -> MQTT -> director -> agent requires manual timestamp matching.
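A stdlib-only sketch of that JSON format; the logger name and context fields mirror the suggestions above and nothing here is tied to existing TALOS code.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON with optional context fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Context fields passed via `extra=` show up as record attributes.
        for field in ("station_id", "campaign_id", "elapsed_ms"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("director")
log.info("tracking state change", extra={"station_id": 7, "elapsed_ms": 12.4})
```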
5.2 Metrics¶
There is no metrics collection (Prometheus, StatsD, or equivalent). Key metrics that should be tracked (an instrumentation sketch follows the list):
- Director tick duration (histogram): detect loop drift before it causes missed commands.
- DB query latency (per-query type): identify N+1 regressions.
- MQTT publish rate and latency: detect broker backpressure.
- TLE age per campaign: alert when tracking accuracy degrades.
- Active station count / tracking sessions: capacity planning baseline.
- API request latency (p50, p95, p99): SLO tracking.
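A sketch of what that instrumentation could look like with `prometheus_client` (which would be a new dependency); metric and label names are illustrative.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TICK_DURATION = Histogram(
    "talos_director_tick_seconds",
    "Wall-clock duration of one physics loop tick",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0),
)
MQTT_PUBLISHES = Counter(
    "talos_director_mqtt_publishes_total",
    "MQTT messages published by the director",
    ["topic_class"],            # e.g. rot, rig, viz, heartbeat
)
ACTIVE_STATIONS = Gauge(
    "talos_director_active_stations",
    "Stations currently registered with the director",
)
TLE_AGE = Gauge(
    "talos_tle_age_seconds",
    "Age of the newest TLE per campaign",
    ["campaign_id"],
)

def run_metrics_server(port: int = 9090) -> None:
    """Expose a scrape endpoint for the Fly.io Prometheus/Grafana integration."""
    start_http_server(port)

# Usage inside the loop (sketch; `tick` and `station_manager` are placeholders):
#   with TICK_DURATION.time():
#       tick()
#   MQTT_PUBLISHES.labels(topic_class="rot").inc()
#   ACTIVE_STATIONS.set(active_station_count)
```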
5.3 Health Checks¶
- Core exposes `/health` (Fly checks every 30 s with a 120 s grace period).
- Director uses `pgrep -f 'director.director'` -- this only checks that the process is alive, not that the physics loop is running or the MQTT connection is healthy.
- Broker uses a `mosquitto_sub` self-test.
Gap: No deep health check for the director. If the MQTT connection drops or
the database becomes unreachable, the director process stays alive but stops
functioning. The heartbeat topic (talos/director/heartbeat) could be monitored
by core as a liveness signal, but this is not implemented.
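A sketch of such a monitor on the core side, assuming core gains an MQTT subscription of its own; the staleness threshold and the daemon-thread `loop_forever` are illustrative choices, and paho-mqtt 2.x additionally requires a `CallbackAPIVersion` argument to `Client()`.

```python
import json
import threading
import time

import paho.mqtt.client as mqtt

STALE_AFTER_S = 5.0              # several missed 0.5 s ticks
_last_heartbeat = time.monotonic()

def _on_heartbeat(client, userdata, msg):
    global _last_heartbeat
    _last_heartbeat = time.monotonic()
    _ = json.loads(msg.payload)  # payload could carry a tick counter or loop duration

def director_is_healthy() -> bool:
    """True if a heartbeat arrived within the staleness window."""
    return (time.monotonic() - _last_heartbeat) < STALE_AFTER_S

def start_heartbeat_monitor(broker_host: str) -> None:
    client = mqtt.Client()       # paho-mqtt 1.x constructor
    client.on_message = _on_heartbeat
    client.connect(broker_host, 1883)
    client.subscribe("talos/director/heartbeat", qos=0)
    threading.Thread(target=client.loop_forever, daemon=True).start()
```

The result of `director_is_healthy()` could be folded into the existing `/health` response so Fly's checks catch a stuck loop, not just a dead process.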
5.4 Alerting¶
No alerting is configured. Fly.io provides basic machine-level alerts, but application-level conditions (loop drift, TLE staleness, DB connection failures) are only visible in logs.
6. Disaster Recovery¶
6.1 Database Backup¶
Fly Postgres provides automatic daily snapshots and WAL-based PITR. The current
deployment uses --initial-cluster-size 1 (no HA). A single-node failure means:
- RPO: last WAL segment (~minutes of data).
- RTO: time to restore from snapshot + replay WAL. Typically 5-15 minutes on Fly.io, but could be longer if the volume is degraded.
- Recommendation: `--initial-cluster-size 2` for automatic failover.
6.2 State Loss Scenarios¶
| Failure | Impact | Recovery |
|---|---|---|
| Director crash | Stations stop receiving commands. No data loss. | Auto-restart via Fly. Stations safe-mode (park rotator). |
| Core crash | Dashboard offline. API unavailable. | Auto-restart. MQTT continues, director unaffected. |
| Broker crash | All real-time communication stops. | Auto-restart. Clients auto-reconnect (reconnect_delay_set). QoS 1 messages redelivered. |
| DB crash (single node) | Both core and director fail to query. | Fly snapshot restore. 5-15 min outage. |
| DB corruption | Potential data loss. | Restore from snapshot. Campaign/assignment state may be stale. |
| TLE API unavailable | Director uses cached TLEs. Accuracy degrades over hours. | Graceful degradation built in (TLEManager._fallback()). |
6.3 What Is Not Backed Up¶
- MQTT message queues: Mosquitto persistence is enabled in the production config but not in the Fly deployment (no volume mount for `/mosquitto/data`). A broker restart loses all pending QoS 1 messages.
- In-memory state: `StationManager`, `MultiTLEManager`, and `GLOBAL_SAT_REGISTRY` are rebuilt from the DB/API on restart, and the ground track cache is recomputed. This adds 30-60 seconds to recovery.
- Session cookies: signed with `SECRET_KEY`. If the key rotates, all users are logged out.
7. Cost Analysis¶
7.1 Current Infrastructure (Fly.io)¶
| Resource | Spec | Monthly cost (est.) |
|---|---|---|
| Core VM | shared-cpu-1x, 1 GB | ~$5.70 |
| Director VM | shared-cpu-1x, 512 MB | ~$3.57 |
| Broker VM | shared-cpu-1x, 256 MB | ~$2.28 |
| Postgres | shared-cpu-1x, 1 GB, 10 GB vol | ~$7.00 |
| Outbound transfer | < 1 GB/mo at current scale | Free tier |
| Total | | | ~$19/mo |
7.2 Projected Costs at Scale¶
| Scale | Stations | Infra changes | Monthly cost (est.) |
|---|---|---|---|
| S (10) | 10 | None | ~$19 |
| M (50) | 50 | Core 2 GB, Director 1 GB, pool tuning | ~$30 |
| L (100) | 100 | 2x Core, Director 2 GB, PgBouncer | ~$60 |
| XL (500) | 500 | 3x Core, 5x Director (sharded), Postgres HA, Redis | ~$180 |
| XXL (1000) | 1000 | 5x Core, 10x Director, dedicated Postgres, MQTT cluster | ~$350 |
Agent-side costs are borne by station operators (Raspberry Pi or equivalent). Each agent connects to the broker over TCP 1883. Bandwidth per agent is minimal (~1 KB/s inbound commands, ~0.5 KB/s outbound telemetry).
7.3 SatNOGS API Dependency¶
The TLE sync downloads ~8000 satellite entries on each startup. This is a single HTTP call to the SatNOGS API (no rate limiting documented, but courtesy limits apply). At current scale this is fine, but if multiple director instances each sync independently, the request volume scales linearly with instance count. A shared TLE cache (Redis or a dedicated microservice) would reduce external API calls.
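A sketch of the shared-cache option using `redis-py`; the endpoint URL, cache key, and TTL are illustrative, and the real SatNOGS sync may require an API token and pagination.

```python
import json
import os

import redis
import requests

SATNOGS_TLE_URL = "https://db.satnogs.org/api/tle/"   # illustrative endpoint
CACHE_KEY = "talos:tle:catalog"
CACHE_TTL_S = 6 * 3600          # refresh a few times a day; TLEs age slowly

r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))

def get_tle_catalog() -> list[dict]:
    """Return the TLE catalog, hitting SatNOGS only when the cache is cold.

    Multiple director instances share one cached copy instead of each
    downloading ~8000 entries at startup.
    """
    cached = r.get(CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    resp = requests.get(SATNOGS_TLE_URL, timeout=30)
    resp.raise_for_status()
    catalog = resp.json()
    r.setex(CACHE_KEY, CACHE_TTL_S, json.dumps(catalog))
    return catalog
```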
8. Recommendations¶
Prioritized by impact and effort. Items 1-3 address the most pressing scaling bottlenecks; items 4-8 are medium-term operational improvements.
P0 -- Critical (before 50 stations)¶
1. Batch database queries in the Director tick.
Replace the N+1 eager-load pattern in `get_active_assignments()` with a single
joined query: `SELECT a.*, c.*, s.*, o.* FROM assignment a JOIN campaign c ...`.
This reduces per-tick DB round-trips from 1 + 3N to 1. Estimated effort: 4 hours.
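A sketch of the batched form using SQLAlchemy eager-load options rather than raw SQL; the model names come from the eager-load list in Section 2.1, but the relationship names, import path, and status filter are assumptions.

```python
from sqlalchemy.orm import joinedload
from sqlmodel import Session, select

from core.database import Assignment, Campaign   # import path and models assumed

def get_active_assignments(session: Session) -> list[Assignment]:
    """Load active assignments plus campaign, station, and org in one query."""
    stmt = (
        select(Assignment)
        .where(Assignment.status == "active")    # status value illustrative
        .options(
            joinedload(Assignment.campaign).joinedload(Campaign.organization),
            joinedload(Assignment.station),
        )
    )
    return list(session.exec(stmt).all())
```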
2. Move pass prediction to a background thread.
`predict_passes()` blocks the main loop for 20-50 ms per station. At 50 stations
this exceeds the tick budget. Run prediction in a `ThreadPoolExecutor` and cache
results. The main loop reads cached predictions; the background thread refreshes
them every 10 seconds. Estimated effort: 8 hours.
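A sketch of the refresh pattern; `predict_passes` stands in for the existing physics call, and the cache key, worker count, and interval are illustrative.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class PassPredictionCache:
    """Refresh pass predictions off the main loop and serve cached results."""

    def __init__(self, predict_passes, refresh_interval_s: float = 10.0,
                 max_workers: int = 4):
        self._predict = predict_passes          # existing physics function
        self._interval = refresh_interval_s
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._cache: dict[tuple[int, int], object] = {}

    def get(self, station_id: int, campaign_id: int):
        """Non-blocking read used by the 2 Hz loop; may be None until warm."""
        with self._lock:
            return self._cache.get((station_id, campaign_id))

    def _refresh_one(self, station, campaign):
        result = self._predict(station, campaign)
        with self._lock:
            self._cache[(station.id, campaign.id)] = result

    def refresh_loop(self, pairs_fn):
        """Run in a background thread; pairs_fn yields (station, campaign) pairs."""
        while True:
            futures = [self._executor.submit(self._refresh_one, s, c)
                       for s, c in pairs_fn()]
            for f in futures:
                f.result()      # surface exceptions rather than losing them
            time.sleep(self._interval)
```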
3. Add Director deep health check. Expose a simple HTTP endpoint (or publish to a sentinel MQTT topic) that confirms the physics loop completed its last tick within 2x the expected interval. Core can monitor this and alert/restart if the director is stuck.
P1 -- Important (before 100 stations)¶
4. Structured JSON logging. Switch both services to JSON-formatted logs with contextual fields. This enables log-based alerting (e.g., "loop drift > 200 ms") and integration with Grafana Loki, Datadog, or similar.
5. Add Prometheus metrics.
Instrument tick duration, DB query count/latency, MQTT publish rate, active
stations, and TLE age. Expose a /metrics endpoint on both core and director.
The existing Fly.io Grafana integration can scrape these.
6. Docker build caching in CI.
Add --cache-from $CI_REGISTRY_IMAGE/core:latest to the Docker build steps.
This should reduce build times from ~3-5 minutes to ~1-2 minutes per image.
7. Performance regression gates. Record load test results as CI artifacts and compare against a baseline. Fail the pipeline if the 50-station tick time regresses by more than 20%.
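One possible gate, assuming the load test is extended to write its timings to a JSON artifact; the file names and keys below are hypothetical.

```python
"""Fail CI if the 50-station tick time regresses more than 20% vs baseline.

File names and JSON keys are hypothetical; the load test would need to be
updated to emit `load_results.json` alongside its existing console output.
"""
import json
import sys

THRESHOLD = 1.20   # allow up to 20% regression

def main(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)

    key = "tick_seconds_50_stations"
    if current[key] > baseline[key] * THRESHOLD:
        print(f"FAIL: {key} {current[key]:.3f}s vs baseline "
              f"{baseline[key]:.3f}s (> {THRESHOLD - 1:.0%} regression)")
        return 1
    print(f"OK: {key} {current[key]:.3f}s within budget")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```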
P2 -- Medium Term (before 500 stations)¶
8. Director sharding.
Partition the workload so each director instance handles a subset of organizations
or campaigns. The simplest approach: one director per organization, selected by
ORG_ID environment variable. The director queries only assignments for its org.
MQTT topics are already org-scoped (v0.2 topic structure), so no broker changes
are needed.
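The selection mechanism could be as small as an environment-variable filter applied to the assignment query. A sketch; model and column names (`campaign_id`, `org_id`) are assumptions based on the schema described in Appendix A.

```python
import os

from sqlmodel import Session, select

from core.database import Assignment, Campaign   # import path and models assumed

ORG_ID = os.environ["ORG_ID"]   # which organization this director instance owns

def get_active_assignments_for_org(session: Session) -> list[Assignment]:
    """Same query as the single-director case, restricted to one org."""
    stmt = (
        select(Assignment)
        .join(Campaign, Assignment.campaign_id == Campaign.id)
        .where(Campaign.org_id == ORG_ID)
    )
    return list(session.exec(stmt).all())
```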
9. Connection pooling with PgBouncer. Deploy PgBouncer between the application and Postgres to handle connection multiplexing. Set the director pool to transaction mode (short-lived sessions) and the API pool to session mode (for long-running requests).
10. Upgrade Fly Postgres to HA.
Switch to --initial-cluster-size 2 for automatic primary failover. Cost
increase: ~$7/mo. Eliminates the single-node DB as a SPOF.
11. MQTT broker persistence volume.
Mount a Fly volume to /mosquitto/data so QoS 1 messages survive broker
restarts. Currently, a broker restart drops all undelivered messages.
Appendix A: File Reference¶
| File | Lines | Role |
|---|---|---|
| `director/director.py` | 821 | Main director loop, MQTT callbacks, DB queries |
| `director/physics.py` | 140 | SGP4 propagation, Doppler, pass prediction, ground track |
| `director/station_manager.py` | 188 | Thread-safe station registry with multi-campaign tracking |
| `director/tle_manager.py` | 209 | TLE fetch/cache with multi-campaign support |
| `core/main.py` | 1630 | FastAPI API, auth, RBAC, SatNOGS sync, dashboard |
| `core/database.py` | ~150 | SQLModel definitions (Organization, Station, Campaign, etc.) |
| `ops/docker-compose.yml` | 105 | Production-like local deployment topology |
| `fly/core.toml` | 51 | Fly.io Core config (1 GB, shared-cpu-1x) |
| `fly/director.toml` | 25 | Fly.io Director config (512 MB, shared-cpu-1x) |
| `.gitlab-ci.yml` | 458 | 7 stages, ~18 jobs, DAG dependencies |
| `tests/test_integration/test_load.py` | 447 | Physics and MQTT load benchmarks |
Appendix B: MQTT Topic Budget at Scale¶
Messages per second from Director at steady state (all stations tracking):
| N stations | C campaigns | Rot cmd | Rig cmd | Viz | Heartbeat | Predictions | Total |
|---|---|---|---|---|---|---|---|
| 10 | 3 | 20 | 20 | 6 | 2 | 3 | 51 |
| 50 | 10 | 100 | 100 | 20 | 2 | 50 | 272 |
| 100 | 20 | 200 | 200 | 40 | 2 | 200 | 642 |
| 500 | 50 | 1000 | 1000 | 100 | 2 | 2500 | 4602 |
| 1000 | 100 | 2000 | 2000 | 200 | 2 | 10000 | 14202 |
Note: predictions column assumes all station-campaign pairs are active. In practice, only stations assigned to a campaign generate prediction traffic, and the 10-second throttle bounds the burst rate. Mosquitto benchmarks show single-instance throughput of ~50,000 msg/s for small payloads (< 1 KB), so the broker itself is not the bottleneck until well past 1000 stations.