Operations and Scalability Analysis¶
TALOS v0.3.0 | April 2026 | Engineering Research Document
1. Current Operational Profile¶
TALOS v0.3.0 runs three application services on Fly.io (AMS region), backed by a managed Fly Postgres database:
| Component | Fly VM | RAM | Public endpoint |
|---|---|---|---|
| `talos-core` | shared-cpu-1x | 1 GB | HTTPS :443 |
| `talos-director` | shared-cpu-1x | 512 MB | None (worker) |
| `talos-broker` | shared-cpu-1x | 128 MB (compose) | TCP 1883, WS 9001 |
| `talos-db` | Fly Postgres | 10 GB vol | Internal only |
The system today handles a small deployment: single-digit stations, single-digit
campaigns, and a handful of concurrent dashboard users. The Director runs a
physics loop at 2 Hz (LOOP_INTERVAL=0.5s) driving rotator/rig commands and
satellite visualization.
Per-tick work (steady state, N stations, C campaigns):
| Operation | Frequency | Cost per call |
|---|---|---|
| `get_active_assignments()` | Every tick (2 Hz) | 1 SQL query + N+C eager loads |
| Az/El + Doppler per station | Every tick | 2 SGP4 propagations + vector math |
| `get_campaign_transmitters()` | Per tracking station | 1 SQL query per station with LOS |
| `update_assignment_status()` | On AOS/LOS transitions | 1 SQL write |
| `predict_passes()` | Every 10 s per assignment | `find_events()` over 24 h window |
| `calculate_ground_track()` | Once per campaign (cached) | 48 SGP4 propagations |
| MQTT publishes | 2-4 per station per tick | QoS 0 for rot/rig, QoS 1 for session |
| Heartbeat | Every tick | 1 MQTT publish (QoS 0) |
| Visualization | Per campaign per tick | Footprint calc + JSON serialize |
Current MQTT message rates (N stations, C campaigns, all tracking):
- Rotator commands: N * 2/s = 2N msg/s
- Rig commands: N * 2/s = 2N msg/s (when transmitter selected)
- Viz payloads: C * 2/s = 2C msg/s
- Heartbeats: 2 msg/s (constant)
- Pass predictions: N * C / 10 msg/s (throttled)
- Session start/stop: rare (AOS/LOS events only)
For 5 stations and 2 campaigns: ~24 msg/s outbound from the Director (a small calculator reproducing these rates appears below).
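As a sanity check, the per-tick rates above can be folded into a small calculator. This is a sketch, not project code: the function name is illustrative, and it assumes every station has a transmitter selected and every station-campaign pair generates throttled predictions (the same worst-case assumptions as Appendix B).

```python
def director_msg_rate(stations: int, campaigns: int) -> float:
    """Rough steady-state outbound msg/s from the Director.

    Assumes every station is tracking with a transmitter selected and
    every station-campaign pair generates throttled pass predictions.
    """
    rot = 2 * stations                         # rotator commands at 2 Hz per station
    rig = 2 * stations                         # rig commands at 2 Hz per station
    viz = 2 * campaigns                        # one viz payload per campaign per tick
    heartbeat = 2                              # director heartbeat, every tick
    predictions = stations * campaigns / 10    # 10 s throttle per station-campaign pair
    return rot + rig + viz + heartbeat + predictions

# Matches the Appendix B budget: 10 stations / 3 campaigns -> 51 msg/s,
# 50 stations / 10 campaigns -> 272 msg/s.
print(director_msg_rate(10, 3), director_msg_rate(50, 10))
```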
2. Performance Bottlenecks¶
2.1 Director Loop Timing¶
The tick function (director.py:404) does all work synchronously on a single
thread. The critical path per tick:
- Database round-trips dominate. `get_active_assignments()` opens a session and eagerly loads Assignment, Campaign, Station, and Organization rows. For N active assignments this is 1 + 3N individual `session.get()` calls. With PostgreSQL on the Fly private network (~1 ms RTT), 50 assignments cost ~150 ms in eager loading alone.
- `get_campaign_transmitters()` is called for every station that has LOS on every tick. At 2 Hz with 10 stations tracking, that is 20 SQL queries/s.
- Pass prediction (`predict_passes`) calls `sat.find_events()` over a 24-hour window. Load tests show this takes ~20-50 ms per station. At 100 stations this is 2-5 s every 10 seconds -- enough to cause loop drift.
- No batching. MQTT publishes happen one at a time inside the loop. The `paho-mqtt` client queues them, but serializing `model_dump_json()` for each station adds up. (A per-phase timing sketch follows this list.)
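A simple way to verify that DB I/O dominates is to time each phase of the tick. The sketch below is a diagnostic harness, not TALOS code: `load_assignments`, `compute_pointing`, and `publish_commands` are placeholders for the real DB, physics, and MQTT phases in director.py.

```python
import logging
import time

log = logging.getLogger("director.tick")

def timed_tick(load_assignments, compute_pointing, publish_commands,
               budget_s: float = 0.5) -> None:
    """Run one tick, logging how long each phase takes.

    The three callables are placeholders for the real DB, physics,
    and MQTT phases of the Director loop.
    """
    phases = {}
    t0 = time.monotonic()

    assignments = load_assignments()           # DB round-trips
    phases["db_ms"] = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    targets = compute_pointing(assignments)    # SGP4 / az-el / Doppler
    phases["physics_ms"] = (time.monotonic() - t1) * 1000

    t2 = time.monotonic()
    publish_commands(targets)                  # MQTT publishes
    phases["mqtt_ms"] = (time.monotonic() - t2) * 1000

    total_ms = (time.monotonic() - t0) * 1000
    if total_ms > budget_s * 1000:
        log.warning("tick over budget: %.1f ms (%s)", total_ms, phases)
    else:
        log.info("tick %.1f ms (%s)", total_ms, phases)
```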
2.2 Database Query Patterns¶
`core/main.py` (1630 lines) serves the FastAPI API. Key concerns:
- N+1 query patterns. The dashboard endpoint (`/dashboard`, line 501) fetches campaigns and then iterates `c.assignments`, triggering lazy loads. The org settings and members pages follow the same pattern. (A `selectinload` sketch follows this list.)
- SatNOGS sync on startup. `sync_satnogs_data()` downloads the full TLE catalog (~8,000+ entries), compiles `EarthSatellite` objects in chunks of 500, and holds them all in `GLOBAL_SAT_REGISTRY` (an in-memory list). This consumed enough RAM to trigger OOM on the original 512 MB VM, prompting the upgrade to 1 GB and a 30-second deferred start.
- No connection pooling tuning. Both core and director use `create_engine(DATABASE_URL, pool_pre_ping=True)` with the default pool size (5) and overflow (10). Under concurrent API load this is adequate, but the director occupies a connection on every tick (2/s), potentially starving the API pool.
- SQLite dev database. The dev database (`talos_dev.db`, 348 KB) is tiny, masking query performance issues that only appear against Postgres at scale.
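Both the N+1 pattern and the pool sizing can be addressed with small changes. A minimal sketch, assuming the SQLModel models in core/database.py expose the `Campaign.assignments` relationship implied by the lazy-load pattern above; the import path and pool numbers are illustrative, not tuned values.

```python
import os

from sqlalchemy.orm import selectinload
from sqlmodel import Session, create_engine, select

from core.database import Campaign   # import path assumed

# Larger pool so the director's steady 2 queries/s do not starve API
# requests (SQLAlchemy defaults: pool_size=5, max_overflow=10).
engine = create_engine(
    os.environ["DATABASE_URL"],
    pool_pre_ping=True,
    pool_size=15,
    max_overflow=20,
)

def dashboard_campaigns(session: Session) -> list[Campaign]:
    """Fetch campaigns plus their assignments in two queries total,
    instead of one query plus one lazy load per campaign."""
    stmt = select(Campaign).options(selectinload(Campaign.assignments))
    return list(session.exec(stmt).all())
```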
2.3 MQTT Fan-Out¶
Mosquitto is configured with a 128 MB memory limit (docker-compose.yml) and runs as a single
process. Topic structure is per-station (`talos/gs/{station_id}/cmd/rot`), so
the broker does not perform fan-out on command topics -- each message goes to
exactly one subscriber (the station agent). The broadcast topics
(`talos/mission/viz`, `talos/director/heartbeat`) fan out to all connected
dashboard WebSocket clients.
The v0.2 org-scoped topics (`talos/{org}/gs/{sid}/cmd/rot`) add a namespace
layer but do not change the 1:1 delivery pattern for commands. Dashboard clients
subscribing to `talos/{org}/campaign/{id}/viz` will receive per-campaign viz at
2 Hz each -- with 50 campaigns that is 100 msg/s to every dashboard.
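One way to bound the dashboard-side fan-out is to subscribe only to the campaigns a given dashboard actually displays rather than a wildcard. A paho-mqtt sketch with illustrative host, org, and campaign values (paho-mqtt 2.x additionally requires a `CallbackAPIVersion` argument to `Client()`):

```python
import paho.mqtt.client as mqtt

ORG = "demo-org"                            # illustrative org slug
VISIBLE_CAMPAIGNS = {12, 17}                # only the campaigns shown on this dashboard
BROKER_HOST = "broker.example.internal"     # illustrative hostname

def on_message(client, userdata, msg):
    # Stand-in for pushing the viz payload to the browser session.
    print(msg.topic, len(msg.payload), "bytes")

client = mqtt.Client()                      # paho-mqtt 1.x constructor
client.on_message = on_message
client.connect(BROKER_HOST, 1883)

# Subscribing per campaign keeps received viz traffic proportional to what
# the user is viewing (2 msg/s per subscribed campaign) rather than 2C msg/s.
# Subscribing inside on_connect instead would survive broker reconnects.
for cid in VISIBLE_CAMPAIGNS:
    client.subscribe(f"talos/{ORG}/campaign/{cid}/viz", qos=0)

client.loop_forever()
```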
3. Scaling Analysis¶
3.1 Physics Loop Budget¶
The Director must complete one tick within 500 ms (the loop interval). Load test
results from test_load.py provide measured per-station physics costs:
| Stations | Az/El + Doppler (total) | Per-station cost | Pass prediction (10 s cycle) |
|---|---|---|---|
| 10 | < 0.5 s (asserted) | ~5-10 ms | ~200-500 ms |
| 50 | < 2.0 s (asserted) | ~5-10 ms | ~1-2.5 s |
| 100 | logged, no assert | ~5-10 ms | ~2-5 s |
The physics computation itself scales linearly -- SGP4 propagation is ~0.1 ms per call, and each station needs 2 propagations (az/el + Doppler). The real bottleneck is database I/O and pass prediction.
3.2 Scenario Projections¶
10 stations, 3 campaigns (current target)
- Tick budget: ~50 ms physics + ~50 ms DB = ~100 ms. Comfortable.
- MQTT: ~64 msg/s. Mosquitto handles this trivially.
- DB: ~60 queries/s across both director and API. Fine with default pool.
- RAM: Director ~100 MB, Core ~400 MB (with TLE registry). Within limits.
50 stations, 10 campaigns
- Tick budget: ~250 ms physics + ~500 ms DB (eager loads) = ~750 ms. Exceeds 500 ms budget.
- MQTT: ~220 msg/s. Still within Mosquitto single-process capacity (~50K msg/s).
- DB: ~300 queries/s. Connection pool saturation likely (default pool_size=5).
- Pass prediction cycle: ~2.5 s for 50 stations every 10 s. Blocks the main loop.
- Mitigation required: batch DB queries, run pass prediction in a thread pool.
100 stations, 20 campaigns
- Tick: physics alone (~500 ms) already consumes the full budget, and DB I/O pushes the total to ~1.5 s. Loop drift.
- MQTT: ~440 msg/s. Fine.
- DB: ~600 queries/s. Need pool_size >= 15 and query batching.
- Director RAM: ~200 MB (100 station objects, 20 TLE managers). Fits in 512 MB.
- Pass prediction: ~5 s blocking. Must be moved to async/threaded.
- Architecture change needed: Director needs async DB queries or a job queue.
500 stations, 50 campaigns
- Tick: completely infeasible in a single synchronous loop.
- MQTT: ~2200 msg/s. Still within Mosquitto capacity, but broker VM needs upgrade.
- DB: ~3000 queries/s. Requires read replicas or query caching (Redis).
- Architecture change: Director must shard by org or campaign. Multiple director instances, each responsible for a subset of stations. Needs MQTT topic partitioning and a coordination layer (or simply one director per org).
1000 stations, 100 campaigns
- MQTT: ~4400 msg/s commands + ~200 msg/s viz. Mosquitto on a dedicated 2-CPU VM can handle this, but WebSocket fan-out to dashboards becomes the bottleneck.
- DB: ~6000+ queries/s. Requires connection pooler (PgBouncer), read replicas, and aggressive caching.
- Director: must be horizontally sharded. Minimum 5-10 director instances.
- Core API: must scale to 3+ instances behind a load balancer (Fly already supports `flyctl scale count`).
- Total monthly infrastructure: ~$200-400/mo on Fly.io (see Section 7).
3.3 Summary Table¶
| Scale | Stations | Campaigns | Tick fits 500ms? | Primary bottleneck | Architecture |
|---|---|---|---|---|---|
| S | 10 | 3 | Yes | None | Current |
| M | 50 | 10 | No | DB eager loads | Batch queries |
| L | 100 | 20 | No | DB + pass prediction | Threaded prediction |
| XL | 500 | 50 | No | Single-process director | Sharded directors |
| XXL | 1000 | 100 | No | Everything | Full distributed arch |
4. CI/CD Pipeline Review¶
4.1 Pipeline Structure¶
The `.gitlab-ci.yml` defines 7 stages with approximately 18 jobs:
| Stage | Jobs | Key characteristics |
|---|---|---|
| lint | `ruff`, `mypy` | ~30 s each, parallel |
| test | `test-unit`, `test-smoke`, `test-physics`, `test-campaign`, `test-integration`, `test-load`, `test-agent-hardware` | 7 jobs, DAG deps on lint |
| security | `sast`, `secret-detection`, `dependency-scanning` | GitLab-managed templates |
| build | `build-core`, `build-director`, `build-agent` | Docker builds, main branch only |
| release | `release-images`, `create-release` | Tag-triggered (`v*.*.*`) |
| deploy | `deploy-broker`, `deploy-core`, `deploy-director` | Fly.io, main branch |
| pages | `pages` | MkDocs, main branch + tags |
4.2 Build Times and Reliability¶
- Test jobs use `python:3.10-slim` and install dependencies from scratch on each run (no pre-built image). The `before_script` pip installs add 30-60 s to each job. A pre-built CI image would eliminate this overhead.
- The integration test requires Postgres + Mosquitto services. Service startup adds ~10-15 s. The test itself is the longest single job.
- The load test is marked `slow` and runs in CI with a Mosquitto service. It exercises physics benchmarks without hard assertions at 100 stations, which means regressions could go unnoticed.
- Docker builds use BuildKit but no layer caching between runs. Each build rebuilds from scratch. Adding `--cache-from` with the registry would cut build times by 50-70%.
- Retry policy: all jobs retry on `runner_system_failure` and `stuck_or_timeout_failure` (max 2). This is appropriate for shared runners.
- Artifact retention: test reports expire in 30 days; code quality reports in 7 days. Reasonable for a project this size.
4.3 Missing CI Capabilities¶
- No performance regression gate -- load test results are printed but not compared to baselines.
- No database migration testing -- Alembic migrations are not validated in CI.
- No end-to-end Docker Compose test (the `test_e2e_docker.sh` script exists but is not wired into the pipeline).
- No canary or staged deployment -- the main branch deploys directly to production.
5. Observability Gaps¶
5.1 Logging¶
Both core and director use Python logging at INFO level with timestamped output.
Fly.io captures stdout/stderr. This is functional for debugging but lacks:
- Structured logging (JSON). Plain-text logs are hard to query in log aggregation tools. Switching to JSON format with fields like `station_id`, `campaign_id`, and `elapsed_ms` would enable filtering and alerting. (A minimal sketch follows this list.)
- Log levels are coarse. The director logs every station registration and tracking state change at INFO. At 100 stations this produces ~200 log lines per AOS/LOS event. There is no DEBUG-level detail for physics computations.
- No request ID or correlation ID in the API. Tracing a user action through core -> MQTT -> director -> agent requires manual timestamp matching.
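A stdlib-only sketch of that JSON format; the logger name and context fields mirror the suggestions above and nothing here is tied to existing TALOS code.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON with optional context fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Context fields passed via `extra=` show up as record attributes.
        for field in ("station_id", "campaign_id", "elapsed_ms"):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("director")
log.info("tracking state change", extra={"station_id": 7, "elapsed_ms": 12.4})
```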
5.2 Metrics¶
There is no metrics collection (Prometheus, StatsD, or equivalent). Key metrics that should be tracked (an instrumentation sketch follows the list):
- Director tick duration (histogram): detect loop drift before it causes missed commands.
- DB query latency (per-query type): identify N+1 regressions.
- MQTT publish rate and latency: detect broker backpressure.
- TLE age per campaign: alert when tracking accuracy degrades.
- Active station count / tracking sessions: capacity planning baseline.
- API request latency (p50, p95, p99): SLO tracking.
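A sketch of what that instrumentation could look like with `prometheus_client` (which would be a new dependency); metric and label names are illustrative.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

TICK_DURATION = Histogram(
    "talos_director_tick_seconds",
    "Wall-clock duration of one physics loop tick",
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.0),
)
MQTT_PUBLISHES = Counter(
    "talos_director_mqtt_publishes_total",
    "MQTT messages published by the director",
    ["topic_class"],            # e.g. rot, rig, viz, heartbeat
)
ACTIVE_STATIONS = Gauge(
    "talos_director_active_stations",
    "Stations currently registered with the director",
)
TLE_AGE = Gauge(
    "talos_tle_age_seconds",
    "Age of the newest TLE per campaign",
    ["campaign_id"],
)

def run_metrics_server(port: int = 9090) -> None:
    """Expose a scrape endpoint for the Fly.io Prometheus/Grafana integration."""
    start_http_server(port)

# Usage inside the loop (sketch; `tick` and `station_manager` are placeholders):
#   with TICK_DURATION.time():
#       tick()
#   MQTT_PUBLISHES.labels(topic_class="rot").inc()
#   ACTIVE_STATIONS.set(active_station_count)
```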
5.3 Health Checks¶
- Core exposes `/health` (Fly checks every 30 s with a 120 s grace period).
- Director uses `pgrep -f 'director.director'` -- this only checks that the process is alive, not that the physics loop is running or the MQTT connection is healthy.
- Broker uses a `mosquitto_sub` self-test.
Gap: No deep health check for the director. If the MQTT connection drops or
the database becomes unreachable, the director process stays alive but stops
functioning. The heartbeat topic (talos/director/heartbeat) could be monitored
by core as a liveness signal, but this is not implemented.
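A sketch of such a monitor on the core side, assuming core gains an MQTT subscription of its own; the staleness threshold and the daemon-thread `loop_forever` are illustrative choices, and paho-mqtt 2.x additionally requires a `CallbackAPIVersion` argument to `Client()`.

```python
import json
import threading
import time

import paho.mqtt.client as mqtt

STALE_AFTER_S = 5.0              # several missed 0.5 s ticks
_last_heartbeat = time.monotonic()

def _on_heartbeat(client, userdata, msg):
    global _last_heartbeat
    _last_heartbeat = time.monotonic()
    _ = json.loads(msg.payload)  # payload could carry a tick counter or loop duration

def director_is_healthy() -> bool:
    """True if a heartbeat arrived within the staleness window."""
    return (time.monotonic() - _last_heartbeat) < STALE_AFTER_S

def start_heartbeat_monitor(broker_host: str) -> None:
    client = mqtt.Client()       # paho-mqtt 1.x constructor
    client.on_message = _on_heartbeat
    client.connect(broker_host, 1883)
    client.subscribe("talos/director/heartbeat", qos=0)
    threading.Thread(target=client.loop_forever, daemon=True).start()
```

The result of `director_is_healthy()` could be folded into the existing `/health` response so Fly's checks catch a stuck loop, not just a dead process.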
5.4 Alerting¶
No alerting is configured. Fly.io provides basic machine-level alerts, but application-level conditions (loop drift, TLE staleness, DB connection failures) are only visible in logs.
6. Disaster Recovery¶
6.1 Database Backup¶
Fly Postgres provides automatic daily snapshots and WAL-based PITR. The current
deployment uses --initial-cluster-size 1 (no HA). A single-node failure means:
- RPO: last WAL segment (~minutes of data).
- RTO: time to restore from snapshot + replay WAL. Typically 5-15 minutes on Fly.io, but could be longer if the volume is degraded.
- Recommendation: `--initial-cluster-size 2` for automatic failover.
6.2 State Loss Scenarios¶
| Failure | Impact | Recovery |
|---|---|---|
| Director crash | Stations stop receiving commands. No data loss. | Auto-restart via Fly. Stations safe-mode (park rotator). |
| Core crash | Dashboard offline. API unavailable. | Auto-restart. MQTT continues, director unaffected. |
| Broker crash | All real-time communication stops. | Auto-restart. Clients auto-reconnect (reconnect_delay_set). QoS 1 messages redelivered. |
| DB crash (single node) | Both core and director fail to query. | Fly snapshot restore. 5-15 min outage. |
| DB corruption | Potential data loss. | Restore from snapshot. Campaign/assignment state may be stale. |
| TLE API unavailable | Director uses cached TLEs. Accuracy degrades over hours. | Graceful degradation built in (TLEManager._fallback()). |
6.3 What Is Not Backed Up¶
- MQTT message queues: Mosquitto persistence is enabled in the production config but not in the Fly deployment (no volume mount for `/mosquitto/data`). A broker restart loses all pending QoS 1 messages.
- In-memory state: `StationManager`, `MultiTLEManager`, and `GLOBAL_SAT_REGISTRY` are rebuilt from the DB/API on restart, and the ground track cache is recomputed. This adds 30-60 seconds to recovery.
- Session cookies: signed with `SECRET_KEY`. If the key rotates, all users are logged out.
7. Cost Analysis¶
7.1 Current Infrastructure (Fly.io)¶
| Resource | Spec | Monthly cost (est.) |
|---|---|---|
| Core VM | shared-cpu-1x, 1 GB | ~$5.70 |
| Director VM | shared-cpu-1x, 512 MB | ~$3.57 |
| Broker VM | shared-cpu-1x, 256 MB | ~$2.28 |
| Postgres | shared-cpu-1x, 1 GB, 10 GB vol | ~$7.00 |
| Outbound transfer | < 1 GB/mo at current scale | Free tier |
| Total | | | ~$19/mo |
7.2 Projected Costs at Scale¶
| Scale | Stations | Infra changes | Monthly cost (est.) |
|---|---|---|---|
| S (10) | 10 | None | ~$19 |
| M (50) | 50 | Core 2 GB, Director 1 GB, pool tuning | ~$30 |
| L (100) | 100 | 2x Core, Director 2 GB, PgBouncer | ~$60 |
| XL (500) | 500 | 3x Core, 5x Director (sharded), Postgres HA, Redis | ~$180 |
| XXL (1000) | 1000 | 5x Core, 10x Director, dedicated Postgres, MQTT cluster | ~$350 |
Agent-side costs are borne by station operators (Raspberry Pi or equivalent). Each agent connects to the broker over TCP 1883. Bandwidth per agent is minimal (~1 KB/s inbound commands, ~0.5 KB/s outbound telemetry).
7.3 SatNOGS API Dependency¶
The TLE sync downloads ~8000 satellite entries on each startup. This is a single HTTP call to the SatNOGS API (no rate limiting documented, but courtesy limits apply). At current scale this is fine, but if multiple director instances each sync independently, the request volume scales linearly with instance count. A shared TLE cache (Redis or a dedicated microservice) would reduce external API calls.
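A sketch of the shared-cache option using `redis-py`; the endpoint URL, cache key, and TTL are illustrative, and the real SatNOGS sync may require an API token and pagination.

```python
import json
import os

import redis
import requests

SATNOGS_TLE_URL = "https://db.satnogs.org/api/tle/"   # illustrative endpoint
CACHE_KEY = "talos:tle:catalog"
CACHE_TTL_S = 6 * 3600          # refresh a few times a day; TLEs age slowly

r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))

def get_tle_catalog() -> list[dict]:
    """Return the TLE catalog, hitting SatNOGS only when the cache is cold.

    Multiple director instances share one cached copy instead of each
    downloading ~8000 entries at startup.
    """
    cached = r.get(CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    resp = requests.get(SATNOGS_TLE_URL, timeout=30)
    resp.raise_for_status()
    catalog = resp.json()
    r.setex(CACHE_KEY, CACHE_TTL_S, json.dumps(catalog))
    return catalog
```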
8. Recommendations¶
Prioritized by impact and effort. Items 1-3 address the most pressing scaling bottlenecks; items 4-8 are medium-term operational improvements.
P0 -- Critical (before 50 stations)¶
1. Batch database queries in the Director tick.
Replace the N+1 eager-load pattern in `get_active_assignments()` with a single
joined query: `SELECT a.*, c.*, s.*, o.* FROM assignment a JOIN campaign c ...`.
This reduces per-tick DB round-trips from 1 + 3N to 1. Estimated effort: 4 hours.
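A sketch of the batched form using SQLAlchemy eager-load options rather than raw SQL; the model names come from the eager-load list in Section 2.1, but the relationship names, import path, and status filter are assumptions.

```python
from sqlalchemy.orm import joinedload
from sqlmodel import Session, select

from core.database import Assignment, Campaign   # import path and models assumed

def get_active_assignments(session: Session) -> list[Assignment]:
    """Load active assignments plus campaign, station, and org in one query."""
    stmt = (
        select(Assignment)
        .where(Assignment.status == "active")    # status value illustrative
        .options(
            joinedload(Assignment.campaign).joinedload(Campaign.organization),
            joinedload(Assignment.station),
        )
    )
    return list(session.exec(stmt).all())
```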
2. Move pass prediction to a background thread.
`predict_passes()` blocks the main loop for 20-50 ms per station. At 50 stations
this exceeds the tick budget. Run prediction in a `ThreadPoolExecutor` and cache
results. The main loop reads cached predictions; the background thread refreshes
them every 10 seconds. Estimated effort: 8 hours.
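A sketch of the refresh pattern; `predict_passes` stands in for the existing physics call, and the cache key, worker count, and interval are illustrative.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class PassPredictionCache:
    """Refresh pass predictions off the main loop and serve cached results."""

    def __init__(self, predict_passes, refresh_interval_s: float = 10.0,
                 max_workers: int = 4):
        self._predict = predict_passes          # existing physics function
        self._interval = refresh_interval_s
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._cache: dict[tuple[int, int], object] = {}

    def get(self, station_id: int, campaign_id: int):
        """Non-blocking read used by the 2 Hz loop; may be None until warm."""
        with self._lock:
            return self._cache.get((station_id, campaign_id))

    def _refresh_one(self, station, campaign):
        result = self._predict(station, campaign)
        with self._lock:
            self._cache[(station.id, campaign.id)] = result

    def refresh_loop(self, pairs_fn):
        """Run in a background thread; pairs_fn yields (station, campaign) pairs."""
        while True:
            futures = [self._executor.submit(self._refresh_one, s, c)
                       for s, c in pairs_fn()]
            for f in futures:
                f.result()      # surface exceptions rather than losing them
            time.sleep(self._interval)
```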
3. Add Director deep health check. Expose a simple HTTP endpoint (or publish to a sentinel MQTT topic) that confirms the physics loop completed its last tick within 2x the expected interval. Core can monitor this and alert/restart if the director is stuck.
P1 -- Important (before 100 stations)¶
4. Structured JSON logging. Switch both services to JSON-formatted logs with contextual fields. This enables log-based alerting (e.g., "loop drift > 200 ms") and integration with Grafana Loki, Datadog, or similar.
5. Add Prometheus metrics.
Instrument tick duration, DB query count/latency, MQTT publish rate, active
stations, and TLE age. Expose a /metrics endpoint on both core and director.
The existing Fly.io Grafana integration can scrape these.
6. Docker build caching in CI.
Add --cache-from $CI_REGISTRY_IMAGE/core:latest to the Docker build steps.
This should reduce build times from ~3-5 minutes to ~1-2 minutes per image.
7. Performance regression gates. Record load test results as CI artifacts and compare against a baseline. Fail the pipeline if the 50-station tick time regresses by more than 20%.
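One possible gate, assuming the load test is extended to write its timings to a JSON artifact; the file names and keys below are hypothetical.

```python
"""Fail CI if the 50-station tick time regresses more than 20% vs baseline.

File names and JSON keys are hypothetical; the load test would need to be
updated to emit `load_results.json` alongside its existing console output.
"""
import json
import sys

THRESHOLD = 1.20   # allow up to 20% regression

def main(baseline_path: str, current_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(current_path) as f:
        current = json.load(f)

    key = "tick_seconds_50_stations"
    if current[key] > baseline[key] * THRESHOLD:
        print(f"FAIL: {key} {current[key]:.3f}s vs baseline "
              f"{baseline[key]:.3f}s (> {THRESHOLD - 1:.0%} regression)")
        return 1
    print(f"OK: {key} {current[key]:.3f}s within budget")
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```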
P2 -- Medium Term (before 500 stations)¶
8. Director sharding.
Partition the workload so each director instance handles a subset of organizations
or campaigns. The simplest approach: one director per organization, selected by
ORG_ID environment variable. The director queries only assignments for its org.
MQTT topics are already org-scoped (v0.2 topic structure), so no broker changes
are needed.
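The selection mechanism could be as small as an environment-variable filter applied to the assignment query. A sketch; model and column names (`campaign_id`, `org_id`) are assumptions based on the schema described in Appendix A.

```python
import os

from sqlmodel import Session, select

from core.database import Assignment, Campaign   # import path and models assumed

ORG_ID = os.environ["ORG_ID"]   # which organization this director instance owns

def get_active_assignments_for_org(session: Session) -> list[Assignment]:
    """Same query as the single-director case, restricted to one org."""
    stmt = (
        select(Assignment)
        .join(Campaign, Assignment.campaign_id == Campaign.id)
        .where(Campaign.org_id == ORG_ID)
    )
    return list(session.exec(stmt).all())
```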
9. Connection pooling with PgBouncer. Deploy PgBouncer between the application and Postgres to handle connection multiplexing. Set the director pool to transaction mode (short-lived sessions) and the API pool to session mode (for long-running requests).
10. Upgrade Fly Postgres to HA.
Switch to --initial-cluster-size 2 for automatic primary failover. Cost
increase: ~$7/mo. Eliminates the single-node DB as a SPOF.
11. MQTT broker persistence volume.
Mount a Fly volume to /mosquitto/data so QoS 1 messages survive broker
restarts. Currently, a broker restart drops all undelivered messages.
Appendix A: File Reference¶
| File | Lines | Role |
|---|---|---|
| `director/director.py` | 821 | Main director loop, MQTT callbacks, DB queries |
| `director/physics.py` | 140 | SGP4 propagation, Doppler, pass prediction, ground track |
| `director/station_manager.py` | 188 | Thread-safe station registry with multi-campaign tracking |
| `director/tle_manager.py` | 209 | TLE fetch/cache with multi-campaign support |
| `core/main.py` | 1630 | FastAPI API, auth, RBAC, SatNOGS sync, dashboard |
| `core/database.py` | ~150 | SQLModel definitions (Organization, Station, Campaign, etc.) |
| `ops/docker-compose.yml` | 105 | Production-like local deployment topology |
| `fly/core.toml` | 51 | Fly.io Core config (1 GB, shared-cpu-1x) |
| `fly/director.toml` | 25 | Fly.io Director config (512 MB, shared-cpu-1x) |
| `.gitlab-ci.yml` | 458 | 7 stages, ~18 jobs, DAG dependencies |
| `tests/test_integration/test_load.py` | 447 | Physics and MQTT load benchmarks |
Appendix B: MQTT Topic Budget at Scale¶
Messages per second from Director at steady state (all stations tracking):
| N stations | C campaigns | Rot cmd | Rig cmd | Viz | Heartbeat | Predictions | Total |
|---|---|---|---|---|---|---|---|
| 10 | 3 | 20 | 20 | 6 | 2 | 3 | 51 |
| 50 | 10 | 100 | 100 | 20 | 2 | 50 | 272 |
| 100 | 20 | 200 | 200 | 40 | 2 | 200 | 642 |
| 500 | 50 | 1000 | 1000 | 100 | 2 | 2500 | 4602 |
| 1000 | 100 | 2000 | 2000 | 200 | 2 | 10000 | 14202 |
Note: predictions column assumes all station-campaign pairs are active. In practice, only stations assigned to a campaign generate prediction traffic, and the 10-second throttle bounds the burst rate. Mosquitto benchmarks show single-instance throughput of ~50,000 msg/s for small payloads (< 1 KB), so the broker itself is not the bottleneck until well past 1000 stations.