
TALOS Ground Station Network -- Architecture Review

Date: 2026-04-01
Scope: core/, director/, agent/, ops/
Status: Honest assessment with actionable recommendations


1. Current Architecture Assessment

What Works Well

The fundamental architecture -- a FastAPI backend coordinating edge agents over MQTT -- is a sound choice for this problem domain. Specific strengths:

  • Technology selection is appropriate. FastAPI for the control plane, MQTT for the message bus, PostgreSQL for persistence, and Python on Raspberry Pi edge nodes are all reasonable picks for a satellite ground station controller. MQTT in particular is the right protocol here: lightweight, pub/sub, designed for constrained devices and unreliable networks.
  • The three-repo split (core, agent, ops) reflects real deployment boundaries. The backend runs on a server, the agent runs on a Pi at the antenna site, and ops ties them together. This is a natural decomposition that maps to physical reality.
  • SGP4 propagation and Doppler correction in the Mission Director demonstrate genuine domain expertise. Running a 0.5-second physics loop for satellite tracking is the correct approach -- you need sub-second updates to keep a rotator accurately pointed at a moving LEO target.
  • SQLModel for the ORM layer is a pragmatic choice that combines Pydantic validation with SQLAlchemy's query capabilities, avoiding the boilerplate of maintaining separate API models and DB models.
  • Docker Compose orchestration makes the system reproducible and deployable. For a ground station controller that needs to run reliably at remote sites, containerization is the right call.

What Is Concerning

The system has the characteristics of a prototype that grew into a production system without a hardening pass. The issues fall into three categories: security, reliability, and maintainability.

Security issues are severe. A hardcoded SECRET_KEY in main.py, no password hashing on cookie-based auth, unencrypted MQTT with no authentication, and a browser-facing MQTT WebSocket connection with no CSRF protection collectively mean the system has no real security boundary. Anyone on the network can observe and inject MQTT messages, impersonate users, or take control of the rotator. For a system that physically moves an antenna, this is not a theoretical concern.

Reliability issues will cause operational failures. The Mission Director runs a blocking while True loop in a thread with no error recovery. If the loop crashes (network blip, database timeout, malformed TLE data), tracking stops silently. There is no watchdog, no heartbeat, no alerting. The agent connects to rotctld via raw TCP sockets with (presumably) minimal error handling -- if the socket drops mid-pass, the rotator could be left in an arbitrary position.

Maintainability will degrade quickly. Global mutable state (GLOBAL_SAT_REGISTRY), inline JavaScript in dashboard templates, new MQTT client creation per notification, and no database migrations mean that changes to any one part of the system risk breaking others in non-obvious ways.


2. Separation of Concerns

The main.py / mission_director.py Split

The current split puts HTTP API handling in main.py and the physics/tracking loop in mission_director.py. This is directionally correct -- the web API and the real-time tracking engine have different concerns, different performance profiles, and different failure modes.

However, the separation leaks in both directions:

  • main.py holds global state (GLOBAL_SAT_REGISTRY) that the Mission Director depends on. This creates a tight coupling: the Mission Director cannot be tested, restarted, or scaled independently of the web process. The registry is essentially shared mutable state across threads with no synchronization primitives mentioned.
  • main.py creates MQTT clients for notifications, meaning the web layer has direct knowledge of the message bus topology. The API layer should publish events; a separate component should decide how to deliver notifications.
  • mission_director.py does direct DB polling instead of receiving commands through a defined interface. This means the tracking engine is coupled to the database schema and the polling interval becomes a hidden latency parameter.

Monolith vs. Microservices

The current system is a monolith in a trench coat. It runs as a single process (FastAPI + threaded Mission Director) but pretends to be distributed by having MQTT in the middle. This is actually worse than either a clean monolith or proper microservices because you get the complexity of message-based communication without the isolation benefits.

Recommendation: For the current scale (likely single-digit ground stations), stay monolithic but make it a clean monolith. Extract the Mission Director into a separate process that communicates with the API exclusively through MQTT and the database. This gives you:

  1. Independent restart capability -- the Mission Director can crash and recover without taking down the API.
  2. Independent scaling -- you can run multiple Mission Director instances for different satellite groups.
  3. A clear contract between the components (MQTT topics and DB schema).

Do not pursue a full microservices architecture. The operational overhead would dwarf the benefits at this scale.


3. State Management

This is the area with the highest risk of subtle, hard-to-diagnose bugs.

Global Mutable State

GLOBAL_SAT_REGISTRY in main.py is a dictionary (or similar structure) that appears to be written by the API layer (when users configure satellites) and read by the Mission Director thread. This is a textbook race condition scenario in CPython.

While CPython's GIL prevents true parallel execution of Python bytecode, it does not prevent:

  • Torn reads during dictionary iteration. If one thread is iterating over the registry while another adds/removes entries, you get RuntimeError: dictionary changed size during iteration.
  • Stale reads. The Mission Director may operate on satellite data that was partially updated by an API request.
  • Lost updates. Two concurrent API requests modifying the registry could overwrite each other's changes.

The fact that this likely works most of the time (due to low concurrency) makes it more dangerous, not less -- the bugs will surface during a critical pass when load is highest.
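The iteration hazard is easy to reproduce in isolation. This minimal sketch (single-threaded, so the failure is deterministic; in the real system the "concurrent write" would come from an API request thread) mutates a stand-in registry mid-iteration:

```python
registry = {"ISS": "tle-1", "NOAA-19": "tle-2"}  # stand-in for GLOBAL_SAT_REGISTRY
error = None

try:
    for name in registry:                # the Mission Director's iteration
        registry["METEOR-M2"] = "tle-3"  # a concurrent API write
except RuntimeError as exc:
    error = exc

print(error)  # dictionary changed size during iteration
```

In the threaded case the same error appears only under unlucky timing, which is exactly why it survives testing and surfaces in the field.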

Threading Issues

The Mission Director runs in a daemon thread with a while True loop. Specific concerns:

  • No graceful shutdown. If the FastAPI process receives a SIGTERM, the Mission Director thread is killed mid-iteration. If it was in the middle of sending a rotator command, the rotator could be left in an unknown state.
  • No backpressure. If the 0.5-second loop iteration takes longer than 0.5 seconds (due to slow DB queries, network issues, or tracking many satellites), iterations will stack up or drift. There is no mechanism to detect or handle this.
  • Exception handling. If any unhandled exception escapes the loop body, the thread dies silently. The API continues serving requests, the dashboard looks normal, but tracking has stopped. Nobody knows until a pass is missed.

Replace GLOBAL_SAT_REGISTRY with one of:

  1. Database as source of truth + local cache with versioning. The Mission Director reads from the DB on startup and subscribes to an MQTT topic for change notifications. Each notification includes a version number; the Mission Director re-reads from DB when it detects a version change. This eliminates shared in-process state entirely.
  2. Thread-safe data structure. If staying single-process, use a threading.Lock around all registry access, or better, use copy-on-write semantics where the Mission Director holds an immutable snapshot that gets replaced atomically.

Option 1 is strongly preferred because it also solves the separation of concerns problem.
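If option 2 is chosen as an interim step, copy-on-write semantics look roughly like this sketch (names are illustrative): writers copy-and-swap an immutable snapshot under a lock, and the tracking loop grabs the current snapshot once per iteration, which it can then iterate safely no matter what the API does:

```python
import threading
from types import MappingProxyType

class SatRegistry:
    """Copy-on-write registry sketch: writers replace the snapshot under a
    lock; readers get an immutable view that is never mutated in place."""

    def __init__(self):
        self._lock = threading.Lock()
        self._view = MappingProxyType({})

    def set(self, norad_id: int, tle: str):
        with self._lock:
            updated = dict(self._view)
            updated[norad_id] = tle
            self._view = MappingProxyType(updated)  # atomic reference swap

    def delete(self, norad_id: int):
        with self._lock:
            updated = dict(self._view)
            updated.pop(norad_id, None)
            self._view = MappingProxyType(updated)

    def snapshot(self):
        """Safe to iterate: later writes produce new snapshots."""
        return self._view
```

The key property: a snapshot taken before a write still reflects the old state, so a mid-pass iteration never sees a half-applied update.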


4. Communication Patterns

MQTT Topic Design

The system uses MQTT for two distinct purposes that should have clearly separated topic hierarchies:

  1. Control plane: Configuration pushes, command dispatch, agent registration.
  2. Data plane: Telemetry, tracking updates, status reports.

A recommended topic structure:

talos/control/{station_id}/config     -- Configuration pushes to agents
talos/control/{station_id}/command    -- Imperative commands (start tracking, stop, park)
talos/control/{station_id}/status     -- Agent status reports (online, tracking, error)
talos/data/{station_id}/telemetry     -- Rotator position, signal strength, etc.
talos/data/{station_id}/tracking      -- Current satellite, AOS/LOS events
talos/system/heartbeat/{station_id}   -- Agent heartbeats
talos/system/announce                 -- System-wide announcements

Key design principles:

  • Station ID in the topic, not the payload. This allows MQTT ACLs and topic-based filtering.
  • Separate control and data. Control messages can use QoS 1 (at least once) while telemetry uses QoS 0 (at most once). You never want a stale rotator command replayed after a reconnect, but you also do not want to lose a "stop tracking" command.
  • Retained messages for configuration. Agent config topics should use the retained flag so that when an agent reconnects after a network drop, it immediately gets its current configuration without requiring the backend to detect the reconnection and re-send.
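The hierarchy above is worth centralizing in a single helper so the backend, agent, and any future dashboard proxy cannot drift apart on topic names. A sketch (this function does not exist in TALOS today):

```python
# Valid plane/channel combinations, mirroring the recommended hierarchy
VALID_CHANNELS = {
    "control": {"config", "command", "status"},
    "data": {"telemetry", "tracking"},
}

def topic(plane: str, station_id: str, channel: str) -> str:
    """Build a TALOS topic string, rejecting unknown plane/channel combos."""
    if channel not in VALID_CHANNELS.get(plane, set()):
        raise ValueError(f"unknown topic {plane}/{channel}")
    return f"talos/{plane}/{station_id}/{channel}"

def heartbeat_topic(station_id: str) -> str:
    return f"talos/system/heartbeat/{station_id}"
```

A shared module like this also gives MQTT ACL generation a single source of truth for which topics each station may touch.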

QoS Considerations

The choice of MQTT QoS level matters significantly for a system that physically controls hardware:

  • Rotator position commands -- QoS 0. Stale commands are worse than missed ones; the next loop iteration will send a fresh command.
  • Start/stop tracking -- QoS 1. These are imperative and must be delivered, but idempotent enough to tolerate duplicates.
  • Configuration updates -- QoS 1 + retained. Must be delivered; retained ensures delivery even after reconnect.
  • Telemetry -- QoS 0. High frequency, latest-value-wins; missed samples are acceptable.
  • Heartbeats -- QoS 0. Absence of heartbeats is the signal, not the content.
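These per-message-type decisions can live in one policy mapping that every publisher consults, so QoS and retain flags are decided in a single place rather than scattered across call sites (a sketch; the message-type names are illustrative):

```python
# (qos, retain) per message type
QOS_POLICY = {
    "rotator_position": (0, False),  # stale is worse than missed
    "tracking_command": (1, False),  # must arrive; duplicates tolerated
    "config":           (1, True),   # retained: re-delivered on reconnect
    "telemetry":        (0, False),  # latest-value-wins
    "heartbeat":        (0, False),  # absence is the signal
}

def publish_args(message_type: str) -> dict:
    """Keyword arguments for a paho-style client.publish() call."""
    qos, retain = QOS_POLICY[message_type]
    return {"qos": qos, "retain": retain}
```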

The "New Client Per Notification" Problem

Creating a new MQTT client for each notification in main.py is expensive and fragile. Each client creation involves TCP connection setup, MQTT handshake, and (if TLS is ever added) a TLS negotiation. At scale, this will exhaust broker connections and file descriptors.

Fix: Create a single, long-lived MQTT client at application startup. Inject it as a FastAPI dependency. If the client disconnects, implement reconnection with exponential backoff. This is a P0 fix -- it is the kind of issue that works fine in testing and fails under real operational load.
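The singleton pattern is a few lines. This sketch uses a stand-in client class so it is self-contained; a real implementation would use paho-mqtt's Client with connect() and loop_start(), and handlers would receive the instance via Depends(get_mqtt_client):

```python
import threading

class MqttClient:
    """Stand-in for a real MQTT client (e.g. paho-mqtt); illustrative only."""
    def __init__(self, host: str):
        self.host = host
        self.connected = False

    def connect(self):
        # Real code: TCP connect, MQTT handshake, start the network loop
        self.connected = True

    def publish(self, topic: str, payload: bytes, qos: int = 0):
        assert self.connected, "publish before connect"

_client = None
_client_lock = threading.Lock()

def get_mqtt_client() -> MqttClient:
    """FastAPI dependency: one long-lived client shared by all requests."""
    global _client
    with _client_lock:
        if _client is None:
            _client = MqttClient("mqtt.internal")  # hypothetical hostname
            _client.connect()
    return _client
```

Every request that declares the dependency gets the same connected client; reconnection with exponential backoff would live inside the client, not at the call sites.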


5. Data Flow

Current Flow (Reconstructed)

User (Browser)
  |
  | HTTP (cookie auth)
  v
FastAPI (main.py)
  |
  |--- Reads/writes ---> PostgreSQL (SQLModel, no migrations)
  |
  |--- Writes to GLOBAL_SAT_REGISTRY (in-memory, no sync)
  |
  |--- Creates new MQTT client ---> Mosquitto broker
  |                                    |
  |                                    | (no TLS, no auth)
  |                                    v
  |                              Agent (agent.py, on Raspberry Pi)
  |                                    |
  |                                    | Raw TCP socket
  |                                    v
  |                              rotctld/rigctld (Hamlib)
  |                                    |
  |                                    v
  |                              Physical rotator/radio
  |
  +--- Mission Director thread
         |
         |--- Polls PostgreSQL directly
         |--- SGP4 propagation (0.5s loop)
         |--- Publishes tracking commands via MQTT
         |--- Reads GLOBAL_SAT_REGISTRY (unsynchronized)

Dashboard (browser)
  |
  | MQTT WebSocket (direct to broker, no CSRF)
  v
Mosquitto broker

Issues in the Data Flow

  1. The browser has a direct WebSocket to the MQTT broker. This means the browser is a first-class MQTT client. If the broker has no ACLs (and it likely does not, given the lack of MQTT auth), the browser can publish to any topic -- including control topics. An XSS vulnerability or a malicious browser extension could send rotator commands directly.

  2. Two paths to the database. Both main.py (via API requests) and the Mission Director thread access PostgreSQL. Without explicit transaction isolation or application-level coordination, concurrent writes can produce inconsistent state. For example: a user updates satellite TLE data via the API while the Mission Director is mid-propagation using the old TLE.

  3. No event sourcing or audit trail. Commands flow through the system but are not logged in a queryable way. If a rotator moves unexpectedly, there is no way to determine what command caused it, when, or from which component. For a system controlling physical hardware, command auditability is not optional.

  4. Agent-to-rotctld is a single point of failure with no feedback loop. The agent sends position commands over a raw TCP socket. If rotctld crashes, the socket write may succeed (TCP buffer) but the rotator will not move. The agent needs to read back the actual rotator position and compare it to the commanded position. A persistent deviation indicates a fault condition.
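A sketch of the feedback check, assuming Hamlib rotctld's default protocol in which the 'p' (get position) command returns azimuth and elevation as two numeric lines (the parsing and error math are shown here; socket plumbing stays in the agent):

```python
def parse_position(reply: str) -> tuple[float, float]:
    """Parse rotctld's reply to 'p' -- two lines: azimuth, elevation."""
    lines = [line for line in reply.splitlines() if line.strip()]
    return float(lines[0]), float(lines[1])

def pointing_error(commanded: tuple[float, float],
                   actual: tuple[float, float]) -> tuple[float, float]:
    """Azimuth error on the shortest arc, plus elevation error, in degrees."""
    az_err = abs(commanded[0] - actual[0]) % 360.0
    az_err = min(az_err, 360.0 - az_err)
    el_err = abs(commanded[1] - actual[1])
    return az_err, el_err

def on_target(commanded, actual, tolerance_deg: float = 2.0) -> bool:
    """Persistent False here should trip a fault condition in the agent."""
    az_err, el_err = pointing_error(commanded, actual)
    return max(az_err, el_err) <= tolerance_deg
```

The shortest-arc handling matters: commanding 359 degrees and reading back 1 degree is a 2-degree error, not 358.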

Proposed Flow

User (Browser)
  |
  | HTTPS (JWT auth, CSRF tokens)
  v
FastAPI API Server (main.py)
  |
  |--- Reads/writes ---> PostgreSQL (Alembic migrations)
  |
  |--- Publishes events ---> MQTT (TLS, ACLs, single client)
  |                            |
  v                            v
Mission Director           Agent (agent.py)
(separate process)           |
  |                          | Hamlib client library (not raw TCP)
  |--- Reads DB              v
  |--- Subscribes MQTT     rotctld/rigctld
  |--- Publishes commands     |
  |                          | Position feedback loop
  v                          v
MQTT broker               Physical rotator/radio

Dashboard (browser)
  |
  | HTTPS (Server-Sent Events or WebSocket via FastAPI, not direct MQTT)
  v
FastAPI (proxies MQTT data to authenticated WebSocket)

Key changes: the browser never touches MQTT directly, the Mission Director is a separate process, and the agent implements a feedback loop.


6. Best Practices Recommendations

P0 -- Fix Before Next Operational Use

These issues can cause security breaches, data loss, or physical equipment damage.

  • Hardcoded SECRET_KEY -- Current: plaintext in main.py. Fix: move to an environment variable; generate with secrets.token_hex(32); rotate on any suspected compromise.
  • No password hashing -- Current: cookie-based auth stores and compares plaintext. Fix: use bcrypt or argon2 via passlib; hash on registration, verify on login; never store plaintext.
  • No MQTT authentication -- Current: any network client can connect. Fix: enable Mosquitto's password_file or plugin-based auth; give every client (agent, backend, dashboard proxy) unique credentials.
  • No MQTT TLS -- Current: all messages in cleartext on the wire. Fix: configure Mosquitto with TLS certificates and use mqtts:// (port 8883); for agents on local networks, at minimum use pre-shared keys.
  • Mission Director crash recovery -- Current: thread dies silently on exception. Fix: wrap the loop body in try/except, log the exception, and implement a restart mechanism; add a heartbeat that the API monitors and alert when it stops.
  • New MQTT client per notification -- Current: connection storm under load. Fix: create a singleton MQTT client at startup with auto-reconnect and share it via FastAPI dependency injection.
  • Browser direct MQTT access -- Current: dashboard connects to the broker's WebSocket. Fix: route dashboard data through FastAPI via Server-Sent Events or an authenticated WebSocket; the browser should never be an MQTT client.
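The crash-recovery fix above can be sketched as a supervisor wrapper (illustrative, not existing code): the loop body is isolated in try/except, failures are reported rather than fatal, and a heartbeat callback fires on every healthy iteration so the API can alarm when the loop goes quiet:

```python
import time

def run_supervised(step, should_stop, period=0.5,
                   on_heartbeat=lambda: None, on_error=lambda exc: None):
    """Run step() periodically; survive exceptions instead of dying silently."""
    while not should_stop():
        try:
            step()
            on_heartbeat()   # e.g. publish talos/system/heartbeat/<station_id>
        except Exception as exc:
            on_error(exc)    # log, count, alert -- but keep looping
        time.sleep(period)
```

A malformed TLE or a transient DB timeout then costs one iteration, not the rest of the pass.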

P1 -- Fix Before Scaling Beyond Single Station

These issues will cause operational problems as the system grows.

  • Global mutable state -- Current: GLOBAL_SAT_REGISTRY shared across threads. Fix: use the database as the source of truth; the Mission Director subscribes to change notifications via MQTT; eliminate in-process shared state.
  • No database migrations -- Current: presumably SQLModel's create_all. Fix: add Alembic, generate an initial migration from the current schema, and route all future schema changes through migration scripts.
  • No health checks in Docker -- Current: containers run but might be broken inside. Fix: add HEALTHCHECK directives -- FastAPI: HTTP GET /health; PostgreSQL: pg_isready; Mosquitto: mosquitto_sub -t '$SYS/broker/uptime'.
  • Hardcoded passwords in docker-compose.yml -- Current: plaintext database credentials. Fix: use Docker secrets or a gitignored .env file; never commit credentials.
  • No resource limits in Docker -- Current: a runaway process can starve others. Fix: set mem_limit, cpus, and restart: unless-stopped for all services; the Mission Director especially needs memory limits -- a TLE parsing bug could cause unbounded growth.
  • Agent raw TCP to rotctld -- Current: direct socket manipulation. Fix: use the python-hamlib bindings, or at minimum wrap socket operations in a class with connection management, timeouts, and retry logic; implement position readback for closed-loop control.
  • No CSRF protection -- Current: dashboard has no token validation. Fix: add CSRF middleware to FastAPI -- use Starlette's CSRF or implement the double-submit cookie pattern.
  • Inline JavaScript in dashboard -- Current: business logic embedded in HTML templates. Fix: extract to separate .js files, with a minimal build step if needed; this improves cacheability, testability, and Content Security Policy compatibility.

P2 -- Improve for Long-Term Maintainability

These are engineering quality improvements that reduce technical debt.

  • No structured logging -- Current: presumably print() or basic logging. Fix: use structlog or python-json-logger; include correlation IDs (station ID, pass ID, satellite NORAD ID) in every log line; ship logs to a central store.
  • No metrics/observability -- Current: blind to system behavior. Fix: add Prometheus metrics -- tracking loop iteration time, MQTT message rates, rotator position error, agent connection status -- exposed via a /metrics endpoint.
  • No integration tests -- Current: (presumed) manual testing. Fix: write tests that spin up the full stack in Docker, simulate an agent, and verify end-to-end pass execution; use testcontainers for PostgreSQL and Mosquitto.
  • Mission Director DB polling -- Current: direct queries on a timer. Fix: publish configuration changes as MQTT events; the Mission Director subscribes and updates its local state reactively rather than polling.
  • No command audit trail -- Current: commands are fire-and-forget. Fix: log every command (source, timestamp, target, payload) to a dedicated database table or append-only log; essential for debugging operational anomalies.
  • Blocking 0.5s physics loop -- Current: time.sleep(0.5) in a while loop (presumed). Fix: use asyncio with a periodic task, or at minimum use a monotonic clock to account for iteration duration (next_tick = last_tick + 0.5; sleep(max(0, next_tick - now))); this prevents drift under load.
  • Agent hardcoded broker address -- Current: broker IP/hostname in source. Fix: move to a configuration file (agent.yaml) or environment variables; support mDNS/DNS-SD for zero-configuration discovery on local networks.
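The monotonic-clock fix expands to only a few lines (a sketch): each tick is scheduled from the previous target time rather than from "now", so a slow iteration does not accumulate drift, and an overrun re-anchors instead of firing a burst of catch-up ticks:

```python
import time

def run_periodic(step, period=0.5, should_stop=lambda: False):
    """Drift-free periodic loop on the monotonic clock."""
    next_tick = time.monotonic() + period
    while not should_stop():
        step()
        now = time.monotonic()
        if now >= next_tick:
            next_tick = now + period      # overran: re-anchor, don't burst
        else:
            time.sleep(next_tick - now)   # sleep only the remainder
            next_tick += period
```

With plain time.sleep(0.5), a loop whose body takes 100 ms actually ticks every 600 ms; this version holds the 500 ms cadence as long as the body fits within it.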

Summary

TALOS has a solid architectural foundation -- the technology choices are appropriate, the physical deployment model maps well to the problem, and the domain logic (SGP4 propagation, Doppler correction) demonstrates real expertise. The system clearly works for its current use case.

The risk lies in the gap between "works in the lab" and "works reliably in the field." The P0 items are not hypothetical -- a hardcoded secret key is a guaranteed compromise, an unrecoverable Mission Director crash is a guaranteed missed pass, and unauthenticated MQTT is a guaranteed avenue for interference.

The recommended approach is incremental hardening, not a rewrite. Start with the P0 security and reliability fixes, which can each be done in isolation without rearchitecting. Then tackle P1 items as the system expands to multiple stations. P2 items are ongoing engineering hygiene that should be addressed as part of normal development.

The single most impactful change would be extracting the Mission Director into a separate process with proper MQTT-based communication. This one change addresses the global state problem, the crash recovery problem, the threading problem, and the separation of concerns problem simultaneously.