TALOS Operations & Scalability Analysis¶
Date: 2026-04-01
Scope: `core/`, `director/`, `agent/`, `ops/`
1. Deployment & DevOps¶
Docker Compose Configuration¶
The current docker-compose.yml in ops/ has several structural problems that prevent it from being used reliably beyond a single-developer workstation.
Hardcoded credentials everywhere. The PostgreSQL password (talos_password) and database name appear as plaintext in three places: the db service environment, the core connection string, and the director connection string. The JWT secret in main.py is the literal string super_secret_mission_key. The MQTT broker runs with allow_anonymous true on both port 1883 and the WebSocket port 9001, meaning any device on the network can subscribe to all topics or publish arbitrary commands to ground stations.
No health checks on any service. The core and director services use depends_on without condition: service_healthy, so they will attempt to connect to PostgreSQL and Mosquitto before those services are ready. The restart: always directive papers over this by crash-looping until the dependencies happen to be ready, but this produces noisy logs and unpredictable startup timing.
Source code volume mounts in the production-shaped stack. Both core and director mount ../core:/app, which means the running containers use whatever is on the developer's disk rather than the image built by the Dockerfile. Combined with uvicorn.run(..., reload=True) on line 257 of main.py, this is effectively a development environment masquerading as a deployment. If someone runs this stack on a server, a stray file save would cause the web API to restart mid-operation.
No CI/CD pipeline. There is no GitHub Actions workflow, no Makefile, and no build script. Images are never pushed to a registry. Deployments are presumably done by cloning the repo and running docker compose up, which means there is no versioning, no rollback capability, and no way to reproduce a known-good state.
No environment separation. There is one compose file used for everything. There are no .env files, no profiles, no override files. The same hardcoded credentials and dev-mode settings would be used in any environment.
Recommendations (Priority: HIGH)¶
- Move all secrets to a `.env` file excluded from version control, or use Docker secrets. At minimum: database password, JWT secret key, and MQTT credentials.
- Add health checks to the `db` and `broker` services. Use `pg_isready` for PostgreSQL and a TCP check for Mosquitto. Change `depends_on` to use `condition: service_healthy`.
- Remove the volume mount of source code. Build the image with `COPY . .` (which the Dockerfile already does) and use it as-is. Create a separate `docker-compose.dev.yml` override for development that adds the volume mount and reload flags.
- Remove `reload=True` from `uvicorn.run()` or gate it behind an environment variable.
- Stop exposing PostgreSQL port 5432 to the host in production. Only the internal `talos_net` network needs access.
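Concretely, the health-check and secrets changes might look like the following compose sketch. This is a minimal illustration, not the project's actual file: the service names (`db`, `broker`, `core`) follow the stack described above, but the environment-variable names are assumptions, and the `nc` TCP check assumes the Mosquitto image's busybox tools are available.

```yaml
services:
  db:
    image: postgres:16
    environment:
      # Fails fast if the variable is missing instead of baking in a default
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?set in .env}
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 10

  broker:
    image: eclipse-mosquitto:2
    healthcheck:
      # Simple TCP reachability check; busybox nc ships in the alpine image
      test: ["CMD-SHELL", "nc -z localhost 1883"]
      interval: 5s
      timeout: 3s
      retries: 10

  core:
    build: ../core
    environment:
      DATABASE_URL: ${DATABASE_URL:?set in .env}
      JWT_SECRET: ${JWT_SECRET:?set in .env}
    depends_on:
      db:
        condition: service_healthy
      broker:
        condition: service_healthy
```

With `condition: service_healthy` in place, `restart: always` stops being a startup-ordering mechanism and returns to its real job of recovering crashed services.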
2. Observability¶
Current State: Effectively Blind¶
The entire system uses `print()` statements with emoji prefixes as its only form of logging. There is no logging framework, no log levels, no structured output, no timestamps, and no correlation IDs.
There is zero monitoring. No metrics are collected from any component. There is no way to know:
- How many ground stations are currently connected
- Whether the physics loop is keeping up with its 0.5-second cadence
- How long SGP4 calculations take per tick
- How many MQTT messages are being published per second
- Database connection pool usage or query latency
- Whether the SatNOGS sync succeeded or failed
- Memory usage of `GLOBAL_SAT_REGISTRY` (which holds the entire TLE catalog in RAM)
There is no alerting. If the Mission Director crashes, the only indication is that ground stations stop receiving commands. If the database fills up, nothing notifies anyone. If MQTT connections drop, the bare except: pass blocks in mission_director.py (lines 57, 173) silently swallow all errors.
The Silent Failure Problem¶
The codebase contains at least five bare except: pass blocks that silently discard errors:
- `mission_director.py` line 57: TLE fetch failure
- `mission_director.py` line 173: station handshake parsing
- `main.py` line 163: SatNOGS station import
- `main.py` line 77: individual TLE parse in sync
- `agent.py` line 53: rotator command send
Any of these could be failing continuously with no visible indication.
Recommendations (Priority: HIGH)¶
- Replace all `print()` calls with Python's `logging` module. Use structured JSON logging (e.g., `python-json-logger`) so logs can be ingested by any log aggregator. Include timestamps, component names, and severity levels.
- Replace bare `except: pass` with specific exception handling that logs the error at WARNING or ERROR level.
- Add Prometheus metrics to the web API via `prometheus-fastapi-instrumentator` and to the Mission Director via a dedicated `/metrics` endpoint or push gateway. Key metrics:
  - `director_loop_duration_seconds` (histogram)
  - `director_stations_active` (gauge)
  - `director_mqtt_messages_published_total` (counter)
  - `director_sgp4_calculation_seconds` (histogram)
  - `web_satnogs_sync_duration_seconds` (histogram)
  - `web_global_sat_registry_size` (gauge)
- Add a Grafana + Prometheus stack to `docker-compose.yml` (or a separate monitoring compose file).
- Implement basic alerting: Director heartbeat missing for >10 seconds, database connection failures, MQTT broker unreachable.
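As a sketch of the first two recommendations using only the standard library (the field names and `mission_director` component name are illustrative; in practice `python-json-logger` would replace the hand-rolled formatter):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "component": record.name,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def get_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


log = get_logger("mission_director")

# Instead of `except: pass`, catch the narrow failure and log it:
try:
    raise TimeoutError("SatNOGS TLE fetch timed out")  # stand-in for the real call
except (TimeoutError, ValueError) as exc:
    log.warning("TLE fetch failed, keeping last known TLE: %s", exc)
```

Each line this emits is machine-parseable, so any log aggregator can filter by `component` and `level` without regex gymnastics.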
3. Scalability Analysis¶
At 10 Ground Stations (Current Target)¶
The system works, with caveats. The Mission Director's main loop takes the registry lock and iterates over all stations every 0.5 seconds. For 10 stations, the SGP4 position calculation, Doppler computation, and pass prediction are each fast individually. However, predict_passes calls satellite.find_events() which is relatively expensive, and it runs for every station every 10 seconds. At 10 stations this is likely under 1 second total, so the system keeps up.
MQTT message volume at 10 stations: approximately 30 messages per 0.5-second tick (rotator command + rig command + viz per station) plus 10 telemetry messages per second from agents. Roughly 80 messages/second, well within Mosquitto's capacity.
Database pressure is low. The Director opens a new Session via get_active_mission() every 0.5 seconds but the queries are simple indexed lookups.
At 100 Ground Stations¶
The physics loop breaks. The 0.5-second tick budget becomes critical. Each station requires: one SGP4 position evaluation, one topocentric coordinate transform, one Doppler calculation, and two to three MQTT publishes. The predict_passes call (every 10 seconds) involves find_events() over a 24-hour window for each station. At 100 stations, this pass prediction sweep alone could take 5-10 seconds, causing the main loop to stall and ground stations to receive stale pointing data.
MQTT fan-out grows linearly. Each tick publishes roughly 300 messages. The notify_system function in main.py creates a brand new MQTT client connection, publishes one message, and disconnects -- every single time. At 100 stations generating frequent events, this creates significant connection churn on the broker.
The GLOBAL_SAT_REGISTRY scan becomes dangerous. The /api/debug/overhead endpoint iterates over the entire satellite catalog (potentially 5,000+ objects) computing SGP4 positions for each one. A single request takes hundreds of milliseconds. Multiple concurrent requests from 100 stations would saturate the web server.
Database connection pattern is wasteful. Both main.py and mission_director.py create their own engine instances (the engine is created at module load in database.py and also again in mission_director.py line 18). The Director calls get_active_mission() every 0.5 seconds, creating a new session each time. At 100 stations with more frequent mission changes, this becomes noticeable.
At 1,000 Ground Stations¶
The system is fundamentally unworkable at this scale without a redesign.
Single-threaded physics loop is the hard bottleneck. SGP4 for 1,000 stations at 2 Hz means 2,000 position evaluations per second plus Doppler calculations. The GIL-bound Python loop cannot keep up. Pass predictions for 1,000 stations would take minutes, during which no pointing updates are sent.
MQTT topic explosion. Each station has at least seven topics (info, cmd/config, cmd/session, cmd/rot, cmd/rig, schedule, telemetry/rot). At 1,000 stations this is 7,000+ active topic subscriptions. The Director publishes to each station individually in a loop. Message volume exceeds 6,000 messages/second on the Director alone. A single Mosquitto instance may cope, but the Python MQTT client library becomes the bottleneck.
Memory. GLOBAL_SAT_REGISTRY holds Skyfield EarthSatellite objects for the entire catalog. Each object is relatively heavy. At 5,000+ satellites, this is hundreds of megabytes. The station_registry dict is trivial by comparison.
Single PostgreSQL instance. No connection pooling (no PgBouncer), no read replicas, no partitioning strategy.
Recommendations (Priority: MEDIUM-HIGH)¶
- Vectorize SGP4 calculations. Use the `sgp4` library's batch propagation or numpy-based vectorized computation instead of per-station sequential loops. This alone could give a 10-50x speedup.
- Decouple pass prediction from the realtime loop. Run pass predictions in a background thread or separate worker process on a longer interval (every 60 seconds is sufficient). The main loop should only do position/Doppler calculations.
- Replace the per-call MQTT client in `notify_system`. Create a single persistent MQTT client at web API startup and reuse it for all notifications.
- Shard the Director by geographic region at 100+ stations. Each Director instance handles a subset of stations. Use MQTT topic prefixes or separate broker virtual hosts to partition traffic.
- Move `GLOBAL_SAT_REGISTRY` out of process memory. Use Redis or a dedicated service for the satellite catalog. The `/api/debug/overhead` endpoint should be rate-limited or moved to an async job.
- Use connection pooling. Configure the SQLAlchemy engine with `pool_size`, `max_overflow`, and `pool_pre_ping`. Consider PgBouncer in front of PostgreSQL for the multi-process case.
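The pass-prediction decoupling can be sketched with the standard library alone. In this sketch, `predict_fn` is a stand-in for the real per-station `find_events` sweep, and the 60-second default follows the recommendation above:

```python
import threading
import time


class PassPredictionWorker:
    """Run the expensive pass-prediction sweep off the realtime loop.

    The 2 Hz main loop only *reads* the latest snapshot; the sweep
    itself runs in a daemon thread on its own, longer interval.
    """

    def __init__(self, predict_fn, interval_s: float = 60.0):
        self._predict_fn = predict_fn    # e.g. returns {station_id: passes}
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._latest = {}                # last completed prediction sweep
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def latest(self) -> dict:
        # Cheap shallow copy; safe to call from the hot loop every tick.
        with self._lock:
            return dict(self._latest)

    def _run(self):
        while not self._stop.is_set():
            result = self._predict_fn()  # slow: find_events per station
            with self._lock:
                self._latest = result
            self._stop.wait(self._interval_s)
```

Even if one sweep takes ten seconds at 100 stations, the main loop never blocks on it; stations simply keep the previous schedule until the next sweep lands.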
4. Reliability & Fault Tolerance¶
Single Points of Failure¶
| Component | Failure Mode | Impact | Recovery |
|---|---|---|---|
| PostgreSQL | Crash / disk full | All mission data lost if volume is on ephemeral storage. Director and web API both crash-loop. | restart: always brings it back, but data integrity is unknown. No backups configured. |
| Mosquitto | Crash | All ground station communication stops instantly. Director cannot send commands. Agents cannot report telemetry. | restart: always recovers the process, but all MQTT sessions are lost. Agents must reconnect and re-handshake. |
| Mission Director | Crash / hang | Ground stations receive no pointing updates. Rotators stop tracking. Active passes are missed. | restart: always restarts it, but all in-memory state (station registry, current satellite, TLE cache) is lost and must be rebuilt from scratch. |
| Web API | Crash | Dashboard inaccessible, cannot create stations or missions. Does not affect active tracking (Director is independent). | restart: always recovers. GLOBAL_SAT_REGISTRY must be rebuilt via SatNOGS sync. |
| SatNOGS API | Unavailable | Cannot sync satellite catalog, cannot fetch fresh TLEs. | The Director's fetch_fresh_tle has a bare except: pass so it silently fails and the satellite variable stays None, causing the entire tracking loop to do nothing. |
No Graceful Degradation¶
The Mission Director has no concept of degraded operation. If TLE fetch fails (line 229 of mission_director.py), it sleeps 10 seconds and retries in the next loop iteration, but current_satellite remains None so no tracking happens at all. A better approach would be to use the last known good TLE with increasing staleness warnings.
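A last-known-good fallback might look like the following sketch; `fetch_fn` is a stand-in for the SatNOGS call, and the 24-hour staleness threshold is an assumption, not a value from the codebase:

```python
import logging
import time

log = logging.getLogger("tle_cache")


class TLECache:
    """Fall back to the last good TLE when the upstream fetch fails."""

    def __init__(self, fetch_fn, stale_after_s: float = 24 * 3600):
        self._fetch_fn = fetch_fn        # returns (line1, line2) or raises
        self._stale_after_s = stale_after_s
        self._tle = None
        self._fetched_at = 0.0

    def get(self):
        try:
            self._tle = self._fetch_fn()
            self._fetched_at = time.monotonic()
        except Exception as exc:         # narrow this to the real fetch errors
            if self._tle is None:
                log.error("TLE fetch failed and no cached TLE exists: %s", exc)
                return None
            age = time.monotonic() - self._fetched_at
            # Escalate from WARNING to ERROR as the cached TLE goes stale.
            level = logging.ERROR if age > self._stale_after_s else logging.WARNING
            log.log(level, "TLE fetch failed; using %.0f s old cached TLE: %s", age, exc)
        return self._tle
```

The key difference from the current behavior: a transient SatNOGS outage degrades pointing accuracy gradually instead of halting tracking outright.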
The agent (agent.py) has no reconnection logic for the rotator socket. If the rotator connection drops (line 53, bare except: pass), commands are silently lost until the agent is manually restarted.
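Reconnection with exponential backoff is only a few lines; here `connect_fn` is a hypothetical stand-in for the agent's rotator socket connect:

```python
import time


def connect_with_backoff(connect_fn, max_delay_s: float = 30.0, sleep=time.sleep):
    """Retry a rotator-socket connection with exponential backoff.

    `connect_fn` should return a connected socket or raise OSError.
    `sleep` is injectable so the backoff schedule is testable.
    """
    delay = 0.5
    while True:
        try:
            return connect_fn()
        except OSError:
            sleep(delay)
            delay = min(delay * 2, max_delay_s)  # 0.5, 1, 2, ... capped at 30 s
```

Commands queued while disconnected still need a policy (drop-with-log is usually right for pointing commands, since stale angles are worse than none), but at least the failure becomes visible and self-healing.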
There is no data persistence for the Director's runtime state. A restart means:
- Station registry is rebuilt from the database (good)
- Current satellite TLE must be re-fetched from SatNOGS (external dependency)
- Pass predictions are recalculated from scratch
- All station bindings (`is_bound`) are lost, causing duplicate START/STOP session messages
Recommendations (Priority: HIGH)¶
- Add PostgreSQL backups. At minimum, a daily `pg_dump` cron job writing to a mounted volume. For production, use WAL archiving.
- Implement an MQTT Last Will and Testament. Have the Director register a "going offline" LWT with the broker so agents and dashboards are notified immediately when the Director drops.
- Cache the last good TLE. Store TLE data in PostgreSQL and fall back to the cached version when the SatNOGS API is unreachable. Log a warning about TLE staleness.
- Add reconnection logic to the agent. If the rotator socket drops, attempt reconnection with exponential backoff rather than silently discarding all commands.
- Persist Director state. Write the current mission ID and station bindings to the database or a Redis key so that restarts are seamless.
5. Performance¶
The Physics Loop Bottleneck¶
The main loop in mission_director.py runs every 0.5 seconds and does the following synchronously, holding the GIL:
1. Checks the hot-reload flag
2. Publishes a heartbeat
3. Queries the database for the active mission (`get_active_mission()` -- a new Session every tick)
4. If the mission changed: fetches the TLE from SatNOGS over HTTP (blocking, up to a 10 s timeout)
5. Calculates the satellite footprint (one SGP4 evaluation)
6. Calculates the ground track (48 SGP4 evaluations for the 95-minute orbit visualization)
7. Publishes visualization data
8. Every 10 seconds: predicts passes for ALL stations (an expensive `find_events` call per station)
9. For each station: calculates position, determines LOS, publishes rotator/rig commands
Step 6 is particularly wasteful -- it recalculates the full ground track every 0.5 seconds even though the orbit does not change on that timescale. This should be cached and recalculated only when the mission or TLE changes.
Step 8 scales linearly with station count and involves numerical root-finding under the hood (find_events). At 10 stations this takes roughly 0.5-1 second. At 50 stations it would exceed the 10-second interval and start creating a backlog.
Step 4 makes a blocking HTTP request in the main loop. If SatNOGS is slow, the entire system pauses.
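The Step 6 fix is mechanical: memoize the track and key it on something that only changes when the orbit does. A minimal sketch (keying on the TLE epoch is an assumption; the mission ID would work equally well):

```python
class GroundTrackCache:
    """Recompute the 48-point ground track only when the TLE changes."""

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn    # stands in for the 48 SGP4 evaluations
        self._key = None
        self._track = None

    def get(self, tle_epoch):
        if tle_epoch != self._key:       # mission/TLE changed -> recompute once
            self._track = self._compute_fn()
            self._key = tle_epoch
        return self._track
```

Over one minute of 0.5-second ticks this turns 5,760 SGP4 evaluations (120 ticks x 48 points) into 48.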
MQTT Throughput¶
The Director publishes all messages synchronously in a loop using QoS 0 for rotator commands (fire-and-forget) and QoS 1 for session commands and schedules (at-least-once). The mix is reasonable but the sequential publish pattern means each station adds latency to the loop.
The notify_system() function in main.py is egregiously wasteful:
```python
def notify_system(event_type: str, extra_data: dict = None):
    client = mqtt.Client(...)     # New client every call
    client.connect(BROKER, ...)   # TCP handshake every call
    client.publish(...)
    client.disconnect()           # Teardown every call
```

Every station creation, mission activation, and sync completion creates a new TCP connection to the broker, sends one message, and tears it down.
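A persistent-client replacement can be sketched without committing to a specific MQTT library. Here `client_factory` stands in for creating and connecting a paho-mqtt client once at web-API startup, and the `system/events` topic name is illustrative:

```python
import json
import threading


class PersistentNotifier:
    """Reuse one broker connection for all notify_system() calls."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = None
        self._lock = threading.Lock()    # publishes may come from several requests

    def notify(self, event_type: str, extra_data: dict = None):
        payload = json.dumps({"event": event_type, **(extra_data or {})})
        with self._lock:
            if self._client is None:     # connect once, lazily
                self._client = self._client_factory()
            self._client.publish("system/events", payload)
```

In the real code the factory would also register an on-disconnect handler to reconnect, but the structural point stands: one TCP handshake total instead of one per event.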
Database Connection Patterns¶
Two separate SQLAlchemy engines exist: one created in database.py (used by main.py) and another created in mission_director.py line 18. Neither configures pool parameters, so they use SQLAlchemy defaults (pool_size=5, max_overflow=10). The Director's get_active_mission() opens and closes a session every 0.5 seconds (120 session open/close cycles per minute), which is functional but wasteful.
The /api/debug/overhead endpoint does a full table scan of GLOBAL_SAT_REGISTRY (an in-memory list, not a DB query) but then does N individual session.get(SatelliteCache, sat_id) calls in a loop -- one database round trip per candidate satellite.
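The fix is a single batched query. A sketch using stdlib `sqlite3` for illustration (the table name and sample rows are invented; with SQLAlchemy the same shape is roughly `select(SatelliteCache).where(SatelliteCache.sat_id.in_(ids))`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sat_cache (sat_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO sat_cache VALUES (?, ?)",
    [(25544, "ISS (ZARYA)"), (43013, "NOAA-20"), (40069, "METEOR-M 2")],
)

def fetch_batch(sat_ids):
    """One `WHERE sat_id IN (...)` round trip instead of N session.get() calls."""
    marks = ",".join("?" * len(sat_ids))
    rows = conn.execute(
        f"SELECT sat_id, name FROM sat_cache WHERE sat_id IN ({marks})", sat_ids
    )
    return dict(rows.fetchall())
```

For a few thousand candidate satellites this replaces thousands of round trips with one, and unknown IDs simply fall out of the result set.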
Recommendations (Priority: MEDIUM)¶
- Cache the ground track. Recalculate only on mission or TLE change, not every tick. This eliminates 48 SGP4 evaluations per tick.
- Move pass prediction to a background thread with its own timer. The main loop should only do position/Doppler at 2 Hz.
- Move TLE fetch out of the main loop. Fetch TLEs in a background thread or on a timer. Cache in the database.
- Batch MQTT publishes. Group all station commands into fewer, larger payloads or use a dedicated publish thread.
- Fix `notify_system` to use a persistent client. Create the MQTT client once at startup.
- Batch database lookups in the overhead scanner. Use a single `WHERE sat_id IN (...)` query instead of N individual gets.
6. Operational Recommendations Summary¶
Priority 1 -- Do Immediately (Safety/Security)¶
| Item | Effort | Impact |
|---|---|---|
| Move secrets to `.env` / Docker secrets | 1 hour | Prevents credential leaks in version control |
| Enable MQTT authentication | 1 hour | Prevents unauthorized command injection to ground stations |
| Replace hardcoded JWT secret | 15 min | Prevents trivial authentication bypass |
| Stop exposing PostgreSQL to host network | 5 min | Reduces attack surface |
| Remove `allow_anonymous true` from mosquitto.conf | 30 min | Requires adding ACLs and agent credentials |
Priority 2 -- Do This Week (Reliability)¶
| Item | Effort | Impact |
|---|---|---|
| Add Docker health checks | 30 min | Clean startup ordering, no crash-loop roulette |
| Replace `print()` with `logging` | 2 hours | Structured logs, severity levels, timestamps |
| Replace bare `except: pass` | 1 hour | Visible error reporting instead of silent failures |
| Fix `notify_system` to use a persistent MQTT client | 30 min | Eliminates connection churn on broker |
| Cache last good TLE in database | 1 hour | System continues tracking when SatNOGS is down |
| Remove `reload=True` from production | 5 min | Prevents random restarts from file changes |
| Separate dev and prod compose files | 1 hour | Clean environment separation |
Priority 3 -- Do This Month (Scalability)¶
| Item | Effort | Impact |
|---|---|---|
| Cache ground track calculation | 1 hour | Eliminates 48 wasted SGP4 calls per tick |
| Move pass prediction to background thread | 2 hours | Unblocks main loop, enables scaling past 10 stations |
| Add Prometheus metrics to Director and web API | 4 hours | Visibility into system performance |
| Configure SQLAlchemy connection pool parameters | 30 min | Controlled database resource usage |
| Add PostgreSQL backup job | 1 hour | Data recovery capability |
| Add MQTT Last Will and Testament | 30 min | Immediate failure detection |
Priority 4 -- Plan for Future (Architecture)¶
| Item | Effort | Impact |
|---|---|---|
| Vectorize SGP4 with batch propagation | 1 week | 10-50x physics throughput improvement |
| Shard Director by region | 2 weeks | Enables 100+ station operation |
| Move satellite registry to Redis | 3 days | Shared state, reduced memory per process |
| Add agent reconnection with backoff | 2 days | Self-healing ground stations |
| CI/CD pipeline with image registry | 2 days | Reproducible builds, versioned deployments, rollback |
| Horizontal web API scaling behind load balancer | 1 week | Handle concurrent dashboard users |
Appendix: Architecture Risk Matrix¶
```
                     Low Impact         High Impact
           +------------------+------------------+
High       | Ground track     | Physics loop     |
Likelihood | recalculated     | exceeds 0.5s     |
of Failure | every tick       | budget at scale  |
           |                  |                  |
           | notify_system    | Silent exception |
           | connection churn | swallowing       |
           +------------------+------------------+
Low        | GLOBAL_SAT_      | SatNOGS API      |
Likelihood | REGISTRY OOM     | outage stalls    |
of Failure | (needs 5000+     | all tracking     |
           | satellites)      |                  |
           |                  | DB credential    |
           |                  | leak from repo   |
           +------------------+------------------+
```
The most dangerous combination is the silent exception swallowing at high likelihood -- problems are occurring today that nobody can see.