
TALOS Operations & Scalability Analysis

Date: 2026-04-01
Scope: core/, director/, agent/, ops/


1. Deployment & DevOps

Docker Compose Configuration

The current docker-compose.yml in ops/ has several structural problems that prevent it from being used reliably beyond a single-developer workstation.

Hardcoded credentials everywhere. The PostgreSQL password (talos_password) and database name appear as plaintext in three places: the db service environment, the core connection string, and the director connection string. The JWT secret in main.py is the literal string super_secret_mission_key. The MQTT broker runs with allow_anonymous true on both port 1883 and the WebSocket port 9001, meaning any device on the network can subscribe to all topics or publish arbitrary commands to ground stations.

No health checks on any service. The core and director services use depends_on without condition: service_healthy, so they will attempt to connect to PostgreSQL and Mosquitto before those services are ready. The restart: always directive papers over this by crash-looping until the dependencies happen to be ready, but this produces noisy logs and unpredictable startup timing.

Source code volume mounts in the production-shaped stack. Both core and director mount ../core:/app, which means the running containers use whatever is on the developer's disk rather than the image built by the Dockerfile. Combined with uvicorn.run(..., reload=True) on line 257 of main.py, this is effectively a development environment masquerading as a deployment. If someone runs this stack on a server, a stray file save would cause the web API to restart mid-operation.

No CI/CD pipeline. There is no GitHub Actions workflow, no Makefile, and no build script. Images are never pushed to a registry. Deployments are presumably done by cloning the repo and running docker compose up, which means there is no versioning, no rollback capability, and no way to reproduce a known-good state.

No environment separation. There is one compose file used for everything. There are no .env files, no profiles, no override files. The same hardcoded credentials and dev-mode settings would be used in any environment.

Recommendations (Priority: HIGH)

  1. Move all secrets to a .env file excluded from version control, or use Docker secrets. At minimum: database password, JWT secret key, and MQTT credentials.
  2. Add health checks to db and broker services. Use pg_isready for PostgreSQL and a TCP check for Mosquitto. Change depends_on to use condition: service_healthy.
  3. Remove the volume mount of source code. Build the image with COPY . . (which the Dockerfile already does) and use it as-is. Create a separate docker-compose.dev.yml override for development that adds the volume mount and reload flags.
  4. Remove reload=True from uvicorn.run() or gate it behind an environment variable.
  5. Stop exposing PostgreSQL port 5432 to the host in production. Only the internal talos_net needs access.
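A minimal sketch of recommendations 1 and 2 in compose form. The service names (db, broker, core) are taken from this report; the image tags and the busybox nc probe are assumptions — if the Mosquitto image lacks nc, a mosquitto_sub probe against $SYS topics works instead.

```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}   # supplied via .env, not hardcoded
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 3s
      retries: 5

  broker:
    image: eclipse-mosquitto:2
    healthcheck:
      test: ["CMD-SHELL", "nc -z localhost 1883"]   # TCP check; assumes busybox nc
      interval: 5s
      timeout: 3s
      retries: 5

  core:
    build: ../core          # image contains the code via COPY; no source mount
    depends_on:
      db:
        condition: service_healthy
      broker:
        condition: service_healthy
```

With this in place, restart: always becomes a safety net rather than the startup mechanism.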

2. Observability

Current State: Effectively Blind

The entire system uses print() statements with emoji prefixes as its only form of logging. There is no logging framework, no log levels, no structured output, no timestamps, and no correlation IDs.

There is zero monitoring. No metrics are collected from any component. There is no way to know:

  • How many ground stations are currently connected
  • Whether the physics loop is keeping up with its 0.5-second cadence
  • How long SGP4 calculations take per tick
  • How many MQTT messages are being published per second
  • Database connection pool usage or query latency
  • Whether the SatNOGS sync succeeded or failed
  • Memory usage of GLOBAL_SAT_REGISTRY (which holds the entire TLE catalog in RAM)

There is no alerting. If the Mission Director crashes, the only indication is that ground stations stop receiving commands. If the database fills up, nothing notifies anyone. If MQTT connections drop, the bare except: pass blocks in mission_director.py (lines 57, 173) silently swallow all errors.

The Silent Failure Problem

The codebase contains at least five bare except: pass blocks that silently discard errors:

  • mission_director.py line 57: TLE fetch failure
  • mission_director.py line 173: station handshake parsing
  • main.py line 163: SatNOGS station import
  • main.py line 77: individual TLE parse in sync
  • agent.py line 53: rotator command send

Any of these could be failing continuously with no visible indication.

Recommendations (Priority: HIGH)

  1. Replace all print() calls with Python's logging module. Use structured JSON logging (e.g., python-json-logger) so logs can be ingested by any log aggregator. Include timestamps, component names, and severity levels.
  2. Replace bare except: pass with specific exception handling that logs the error at WARNING or ERROR level.
  3. Add Prometheus metrics to the web API via prometheus-fastapi-instrumentator and to the Mission Director via a dedicated /metrics endpoint or push gateway. Key metrics:
     • director_loop_duration_seconds (histogram)
     • director_stations_active (gauge)
     • director_mqtt_messages_published_total (counter)
     • director_sgp4_calculation_seconds (histogram)
     • web_satnogs_sync_duration_seconds (histogram)
     • web_global_sat_registry_size (gauge)
  4. Add a Grafana + Prometheus stack to docker-compose.yml (or a separate monitoring compose file).
  5. Implement basic alerting: Director heartbeat missing for >10 seconds, database connection failures, MQTT broker unreachable.
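Recommendation 1 needs no external dependency: the standard logging module plus a hand-rolled JSON formatter is enough to get timestamps, levels, and component names on every line. A sketch (the logger name talos.director is illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Minimal structured formatter: one JSON object per log line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "component": record.name,
            "msg": record.getMessage(),
        })

def get_logger(name):
    """Return a logger emitting structured JSON to stdout."""
    logger = logging.getLogger(name)
    if not logger.handlers:                     # configure once per process
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger

log = get_logger("talos.director")
log.warning("TLE fetch failed, will retry")     # instead of a bare except: pass
```

Swapping in python-json-logger later is a one-line change in get_logger; the call sites stay the same.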

3. Scalability Analysis

At 10 Ground Stations (Current Target)

The system works, with caveats. The Mission Director's main loop takes the registry lock and iterates over all stations every 0.5 seconds. For 10 stations, the SGP4 position calculation, Doppler computation, and pass prediction are each fast individually. However, predict_passes calls satellite.find_events() which is relatively expensive, and it runs for every station every 10 seconds. At 10 stations this is likely under 1 second total, so the system keeps up.

MQTT message volume at 10 stations: approximately 30 messages per 0.5-second tick (rotator command + rig command + viz per station) plus 10 telemetry messages per second from agents. Roughly 70 messages/second, well within Mosquitto's capacity.

Database pressure is low. The Director opens a new Session via get_active_mission() every 0.5 seconds but the queries are simple indexed lookups.

At 100 Ground Stations

The physics loop breaks. The 0.5-second tick budget becomes critical. Each station requires: one SGP4 position evaluation, one topocentric coordinate transform, one Doppler calculation, and two to three MQTT publishes. The predict_passes call (every 10 seconds) involves find_events() over a 24-hour window for each station. At 100 stations, this pass prediction sweep alone could take 5-10 seconds, causing the main loop to stall and ground stations to receive stale pointing data.

MQTT fan-out grows linearly. Each tick publishes roughly 300 messages. The notify_system function in main.py creates a brand new MQTT client connection, publishes one message, and disconnects -- every single time. At 100 stations generating frequent events, this creates significant connection churn on the broker.

The GLOBAL_SAT_REGISTRY scan becomes dangerous. The /api/debug/overhead endpoint iterates over the entire satellite catalog (potentially 5,000+ objects) computing SGP4 positions for each one. A single request takes hundreds of milliseconds. Multiple concurrent requests from 100 stations would saturate the web server.

Database connection pattern is wasteful. Both main.py and mission_director.py create their own engine instances (the engine is created at module load in database.py and also again in mission_director.py line 18). The Director calls get_active_mission() every 0.5 seconds, creating a new session each time. At 100 stations with more frequent mission changes, this becomes noticeable.

At 1,000 Ground Stations

The system is fundamentally unworkable at this scale without a redesign.

Single-threaded physics loop is the hard bottleneck. SGP4 for 1,000 stations at 2 Hz means 2,000 position evaluations per second plus Doppler calculations. The GIL-bound Python loop cannot keep up. Pass predictions for 1,000 stations would take minutes, during which no pointing updates are sent.

MQTT topic explosion. Each station has at least seven topics (info, cmd/config, cmd/session, cmd/rot, cmd/rig, schedule, telemetry/rot). At 1,000 stations this is 7,000+ active topic subscriptions. The Director publishes to each station individually in a loop. Message volume exceeds 6,000 messages/second on the Director alone. A single Mosquitto instance may cope, but the Python MQTT client library becomes the bottleneck.

Memory. GLOBAL_SAT_REGISTRY holds Skyfield EarthSatellite objects for the entire catalog. Each object is relatively heavy. At 5,000+ satellites, this is hundreds of megabytes. The station_registry dict is trivial by comparison.

Single PostgreSQL instance. No connection pooling (no PgBouncer), no read replicas, no partitioning strategy.

Recommendations (Priority: MEDIUM-HIGH)

  1. Vectorize SGP4 calculations. Use sgp4 library's batch propagation or numpy-based vectorized computation instead of per-station sequential loops. This alone could give a 10-50x speedup.
  2. Decouple pass prediction from the realtime loop. Run pass predictions in a background thread or separate worker process on a longer interval (every 60 seconds is sufficient). The main loop should only do position/Doppler calculations.
  3. Replace the per-call MQTT client in notify_system. Create a single persistent MQTT client at web API startup and reuse it for all notifications.
  4. Shard the Director by geographic region at 100+ stations. Each Director instance handles a subset of stations. Use MQTT topic prefixes or separate broker virtual hosts to partition traffic.
  5. Move GLOBAL_SAT_REGISTRY out of process memory. Use Redis or a dedicated service for the satellite catalog. The /api/debug/overhead endpoint should be rate-limited or moved to an async job.
  6. Use connection pooling. Configure SQLAlchemy engine with pool_size, max_overflow, and pool_pre_ping. Consider PgBouncer in front of PostgreSQL for the multi-process case.
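Recommendation 1 can be sketched with plain numpy: since one satellite is tracked against many stations, the per-station Doppler loop collapses into a single vectorized computation. The function below is a sketch, assuming satellite and station positions/velocities are already in a common ECEF frame; the names are illustrative, not the codebase's.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def doppler_shifts(sat_pos, sat_vel, station_pos, f0_hz):
    """Observed downlink frequency for every station in one shot.

    sat_pos, sat_vel : (3,) satellite position (m) and velocity (m/s), ECEF
    station_pos      : (N, 3) station positions (m), ECEF (assumed static)
    f0_hz            : transmit frequency in Hz
    """
    los = sat_pos - station_pos                # (N, 3) line-of-sight vectors
    ranges = np.linalg.norm(los, axis=1)       # (N,) slant ranges
    # Range rate = satellite velocity projected onto each LOS unit vector;
    # positive means the satellite is receding from that station.
    range_rates = (los @ sat_vel) / ranges     # (N,) m/s
    return f0_hz * (1.0 - range_rates / C)     # classical Doppler approximation
```

The same pattern applies to azimuth/elevation: compute all N topocentric vectors with one broadcasted subtraction instead of N Skyfield calls per tick.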

4. Reliability & Fault Tolerance

Single Points of Failure

  • PostgreSQL -- failure mode: crash / disk full. Impact: all mission data lost if the volume is on ephemeral storage; the Director and web API both crash-loop. Recovery: restart: always brings it back, but data integrity is unknown and no backups are configured.
  • Mosquitto -- failure mode: crash. Impact: all ground station communication stops instantly; the Director cannot send commands and agents cannot report telemetry. Recovery: restart: always recovers the process, but all MQTT sessions are lost and agents must reconnect and re-handshake.
  • Mission Director -- failure mode: crash / hang. Impact: ground stations receive no pointing updates, rotators stop tracking, active passes are missed. Recovery: restart: always restarts it, but all in-memory state (station registry, current satellite, TLE cache) is lost and must be rebuilt from scratch.
  • Web API -- failure mode: crash. Impact: dashboard inaccessible, cannot create stations or missions; active tracking is unaffected (the Director is independent). Recovery: restart: always recovers it, but GLOBAL_SAT_REGISTRY must be rebuilt via SatNOGS sync.
  • SatNOGS API -- failure mode: unavailable. Impact: cannot sync the satellite catalog or fetch fresh TLEs. Recovery: none; the Director's fetch_fresh_tle has a bare except: pass, so it fails silently, the satellite variable stays None, and the entire tracking loop does nothing.

No Graceful Degradation

The Mission Director has no concept of degraded operation. If TLE fetch fails (line 229 of mission_director.py), it sleeps 10 seconds and retries in the next loop iteration, but current_satellite remains None so no tracking happens at all. A better approach would be to use the last known good TLE with increasing staleness warnings.

The agent (agent.py) has no reconnection logic for the rotator socket. If the rotator connection drops (line 53, bare except: pass), commands are silently lost until the agent is manually restarted.

There is no data persistence for the Director's runtime state. A restart means:

  • Station registry is rebuilt from the database (good)
  • Current satellite TLE must be re-fetched from SatNOGS (external dependency)
  • Pass predictions are recalculated from scratch
  • All station bindings (is_bound) are lost, causing duplicate START/STOP session messages

Recommendations (Priority: HIGH)

  1. Add PostgreSQL backups. At minimum, a daily pg_dump cron job writing to a mounted volume. For production, use WAL archiving.
  2. Implement MQTT Last Will and Testament. Have the Director publish a "going offline" LWT message so agents and dashboards know immediately when the Director drops.
  3. Cache the last good TLE. Store TLE data in PostgreSQL and fall back to the cached version when the SatNOGS API is unreachable. Log a warning about TLE staleness.
  4. Add reconnection logic to the agent. If the rotator socket drops, attempt reconnection with exponential backoff rather than silently discarding all commands.
  5. Persist Director state. Write the current mission ID and station bindings to the database or a Redis key so that restarts are seamless.
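Recommendation 3 can be sketched in a few lines. The shapes here are assumptions: fetch_fn stands in for the existing SatNOGS fetch, cache for a table- or Redis-backed store, and the 48-hour staleness threshold is a tunable placeholder.

```python
import time

TLE_STALE_AFTER_S = 48 * 3600  # warn once a cached TLE is older than 48h (tunable)

def get_tle(sat_id, fetch_fn, cache, log=print):
    """Return a fresh TLE if possible, else the last known good one.

    fetch_fn(sat_id) -> (line1, line2), raises on failure (hypothetical shape)
    cache            -> dict-like: sat_id -> (line1, line2, fetched_at_epoch)
    """
    try:
        l1, l2 = fetch_fn(sat_id)
        cache[sat_id] = (l1, l2, time.time())   # refresh the fallback copy
        return l1, l2
    except Exception as exc:                    # narrow this in real code
        log(f"TLE fetch failed for {sat_id}: {exc}")
        if sat_id not in cache:
            raise                               # nothing to fall back to
        l1, l2, fetched_at = cache[sat_id]
        age_h = (time.time() - fetched_at) / 3600
        if age_h * 3600 > TLE_STALE_AFTER_S:
            log(f"TLE for {sat_id} is {age_h:.1f}h old -- pointing accuracy degraded")
        return l1, l2
```

With this shape, a SatNOGS outage degrades tracking accuracy gradually instead of stopping it outright.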

5. Performance

The Physics Loop Bottleneck

The main loop in mission_director.py runs every 0.5 seconds and does the following synchronously, holding the GIL:

  1. Checks for hot reload flag
  2. Publishes a heartbeat
  3. Queries the database for the active mission (get_active_mission() -- a new Session every tick)
  4. If mission changed: fetches TLE from SatNOGS over HTTP (blocking, up to 10s timeout)
  5. Calculates satellite footprint (one SGP4 evaluation)
  6. Calculates ground track (48 SGP4 evaluations for the 95-minute orbit visualization)
  7. Publishes visualization data
  8. Every 10 seconds: predicts passes for ALL stations (expensive find_events call per station)
  9. For each station: calculates position, determines LOS, publishes rotator/rig commands

Step 6 is particularly wasteful -- it recalculates the full ground track every 0.5 seconds even though the orbit does not change on that timescale. This should be cached and recalculated only when the mission or TLE changes.
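The cache in question is tiny. A sketch, where compute_track is a stand-in for the existing 48-evaluation SGP4 sweep and the key is the mission plus TLE epoch:

```python
_track_cache = {}

def cached_ground_track(mission_id, tle_epoch, compute_track):
    """Recompute the ground track only when the mission or TLE changes."""
    key = (mission_id, tle_epoch)
    if key not in _track_cache:
        _track_cache.clear()               # one active mission: keep a single entry
        _track_cache[key] = compute_track()
    return _track_cache[key]
```

At 2 Hz this turns 96 SGP4 evaluations per second into zero between TLE updates.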

Step 8 scales linearly with station count and involves numerical root-finding under the hood (find_events). At 10 stations this takes roughly 0.5-1 second. At 50 stations it would exceed the 10-second interval and start creating a backlog.

Step 4 makes a blocking HTTP request in the main loop. If SatNOGS is slow, the entire system pauses.

MQTT Throughput

The Director publishes all messages synchronously in a loop using QoS 0 for rotator commands (fire-and-forget) and QoS 1 for session commands and schedules (at-least-once). The mix is reasonable but the sequential publish pattern means each station adds latency to the loop.

The notify_system() function in main.py is egregiously wasteful:

def notify_system(event_type: str, extra_data: dict = None):
    client = mqtt.Client(...)   # New client every call
    client.connect(BROKER, ...)  # TCP handshake every call
    client.publish(...)
    client.disconnect()          # Teardown every call

Every station creation, mission activation, and sync completion creates a new TCP connection to the broker, sends one message, and tears it down.
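The fix is to build the connection once and share it. A sketch of the pattern, where client_factory stands in for constructing and connecting a paho-mqtt client at startup (the class name is illustrative):

```python
import threading

class PersistentNotifier:
    """Reuse one broker connection for every event instead of one per call."""

    def __init__(self, client_factory):
        # client_factory() must return a connected client exposing publish()
        self._factory = client_factory
        self._client = None
        self._lock = threading.Lock()

    def notify(self, topic, payload):
        with self._lock:                         # handlers may run concurrently
            if self._client is None:
                self._client = self._factory()   # one connect, ever
            self._client.publish(topic, payload)
```

notify_system() then shrinks to a single notifier.notify(...) call; real code would also handle reconnects, which paho's network loop can manage.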

Database Connection Patterns

Two separate SQLAlchemy engines exist: one created in database.py (used by main.py) and another created in mission_director.py line 18. Neither configures pool parameters, so they use SQLAlchemy defaults (pool_size=5, max_overflow=10). The Director's get_active_mission() opens and closes a session every 0.5 seconds (120 session open/close cycles per minute), which is functional but wasteful.

The /api/debug/overhead endpoint iterates over the entire GLOBAL_SAT_REGISTRY (an in-memory structure, not a DB query) but then performs N individual session.get(SatelliteCache, sat_id) calls in a loop -- one database round trip per candidate satellite.

Recommendations (Priority: MEDIUM)

  1. Cache the ground track. Recalculate only on mission or TLE change, not every tick. This eliminates 48 SGP4 evaluations per tick.
  2. Move pass prediction to a background thread with its own timer. The main loop should only do position/Doppler at 2 Hz.
  3. Move TLE fetch out of the main loop. Fetch TLEs in a background thread or on a timer. Cache in the database.
  4. Batch MQTT publishes. Group all station commands into fewer, larger payloads or use a dedicated publish thread.
  5. Fix notify_system to use a persistent client. Create the MQTT client once at startup.
  6. Batch database lookups in the overhead scanner. Use a single WHERE sat_id IN (...) query instead of N individual gets.
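Recommendation 2 can be sketched with stdlib threading: a daemon worker recomputes pass predictions on its own interval and hands the realtime loop a lock-protected snapshot. predict_fn stands in for the existing per-station find_events sweep.

```python
import threading

class PassPredictionWorker:
    """Runs the expensive per-station pass sweep off the realtime loop."""

    def __init__(self, predict_fn, interval_s=60.0):
        self._predict_fn = predict_fn      # stand-in for the find_events sweep
        self._interval_s = interval_s
        self._lock = threading.Lock()
        self._latest = {}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def latest(self):
        """Called from the 2 Hz loop: a dict copy, never a recompute."""
        with self._lock:
            return dict(self._latest)

    def _run(self):
        while not self._stop.is_set():
            result = self._predict_fn()    # may take seconds; loop is unaffected
            with self._lock:
                self._latest = result
            self._stop.wait(self._interval_s)
```

The main loop's "every 10 seconds, predict for ALL stations" branch becomes a call to latest(), and prediction latency no longer stalls pointing updates.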

6. Operational Recommendations Summary

Priority 1 -- Do Immediately (Safety/Security)

Item | Effort | Impact
Move secrets to .env / Docker secrets | 1 hour | Prevents credential leaks in version control
Enable MQTT authentication | 1 hour | Prevents unauthorized command injection to ground stations
Replace hardcoded JWT secret | 15 min | Prevents trivial authentication bypass
Stop exposing PostgreSQL to host network | 5 min | Reduces attack surface
Remove allow_anonymous true from mosquitto.conf | 30 min | Requires adding ACLs and agent credentials

Priority 2 -- Do This Week (Reliability)

Item | Effort | Impact
Add Docker health checks | 30 min | Clean startup ordering, no crash-loop roulette
Replace print() with logging | 2 hours | Structured logs, severity levels, timestamps
Replace bare except: pass | 1 hour | Visible error reporting instead of silent failures
Fix notify_system to use persistent MQTT client | 30 min | Eliminates connection churn on broker
Cache last good TLE in database | 1 hour | System continues tracking when SatNOGS is down
Remove reload=True from production | 5 min | Prevents random restarts from file changes
Separate dev and prod compose files | 1 hour | Clean environment separation

Priority 3 -- Do This Month (Scalability)

Item | Effort | Impact
Cache ground track calculation | 1 hour | Eliminates 48 wasted SGP4 calls per tick
Move pass prediction to background thread | 2 hours | Unblocks main loop, enables scaling past 10 stations
Add Prometheus metrics to Director and web API | 4 hours | Visibility into system performance
Configure SQLAlchemy connection pool parameters | 30 min | Controlled database resource usage
Add PostgreSQL backup job | 1 hour | Data recovery capability
Add MQTT Last Will and Testament | 30 min | Immediate failure detection

Priority 4 -- Plan for Future (Architecture)

Item | Effort | Impact
Vectorize SGP4 with batch propagation | 1 week | 10-50x physics throughput improvement
Shard Director by region | 2 weeks | Enables 100+ station operation
Move satellite registry to Redis | 3 days | Shared state, reduced memory per process
Add agent reconnection with backoff | 2 days | Self-healing ground stations
CI/CD pipeline with image registry | 2 days | Reproducible builds, versioned deployments, rollback
Horizontal web API scaling behind load balancer | 1 week | Handle concurrent dashboard users

Appendix: Architecture Risk Matrix

                    Low Impact          High Impact
                 +------------------+------------------+
  High           | Ground track     | Physics loop     |
  Likelihood     | recalculated     | exceeds 0.5s     |
  of Failure     | every tick       | budget at scale  |
                 |                  |                  |
                 | notify_system    | Silent exception |
                 | connection churn | swallowing       |
                 +------------------+------------------+
  Low            | GLOBAL_SAT_      | SatNOGS API      |
  Likelihood     | REGISTRY OOM     | outage stalls    |
  of Failure     | (needs 5000+     | all tracking     |
                 | satellites)      |                  |
                 |                  | DB credential    |
                 |                  | leak from repo   |
                 +------------------+------------------+

The most dangerous combination is the silent exception swallowing at high likelihood -- problems are occurring today that nobody can see.