TALOS v0.3.0 Architecture Review

Date: April 2026
Scope: Honest internal assessment of the TALOS codebase as of v0.3.0
Author: Engineering (automated review)

1. System Architecture Assessment

TALOS uses a 5-component distributed architecture: Core API, Director, Agent, MQTT Broker, and PostgreSQL. The component boundaries are well-chosen for the domain -- separating the real-time physics loop (Director) from the web-facing CRUD layer (Core) is the right call, and MQTT is a natural fit for the pub/sub command and telemetry patterns of ground station control.

What works well:

The Director runs as a standalone process with its own DB connection and MQTT client, decoupled from the HTTP request/response cycle. This means the 2 Hz physics loop is never blocked by web traffic.
The Agent is genuinely lightweight (109 lines in agent/agent.py) -- appropriate for Raspberry Pi deployment.
The shared modules (shared/schemas.py, shared/topics.py) provide a single source of truth for MQTT topic strings and payload shapes, preventing the silent-misroute bugs that plague distributed systems.
The SatNOGSClient (shared/satnogs_client.py, 199 lines) consolidates all external API calls with caching, stale-fallback, and connection pooling. This is well-engineered.

What does not work well:

core/main.py is a 1,630-line monolith. It contains authentication, RBAC, SatNOGS sync, station provisioning, campaign CRUD, assignment management, legacy mission endpoints, a public satellite tracker, and inline HTML email templates. This file does the work of at least 6 modules.
The Core and Director both create their own SQLAlchemy engines independently (core/database.py line 19, director/director.py line 90). The DATABASE_URL parsing (the postgres:// to postgresql:// fix) is duplicated in both places.
core/tracker.py is a dead file -- an early prototype with hardcoded Celestrak URLs and emoji-laden print statements. It is not imported anywhere.

Component line counts (Python source only):

Component	File	Lines
Core API	`core/main.py`	1,630
Core DB	`core/database.py`	211
Director	`director/director.py`	821
Physics	`director/physics.py`	139
Station Mgr	`director/station_manager.py`	187
TLE Mgr	`director/tle_manager.py`	208
Agent	`agent/agent.py`	109
Schemas	`shared/schemas.py`	249
Topics	`shared/topics.py`	164
SatNOGS Client	`shared/satnogs_client.py`	199
Total production		~3,920
Total tests		~4,750

The test-to-production ratio (1.2:1) is healthy.

2. Code Quality

Module Organization

The Director side is well-factored. director/director.py orchestrates; director/physics.py provides pure stateless functions; director/station_manager.py and director/tle_manager.py manage specific concerns. This is clean separation.

The Core side is not factored at all. core/main.py defines Pydantic request models inline (lines 341-374), contains the full RBAC implementation (lines 265-308), has 7 page-rendering endpoints that each independently do the same auth/org/membership lookup pattern, and hosts a public satellite tracker that has nothing to do with mission control.

Code Duplication

The following patterns are duplicated and should be extracted:

Auth + org + membership lookup. The sequence get_current_user -> select User -> select Org -> select Membership -> check role appears in every page-rendering endpoint (/dashboard, /org/{slug}/settings, /org/{slug}/members, /org/{slug}/stations, /org/{slug}/campaigns). The require_role() dependency already exists for API routes but is not used by page routes, which all reimplement the logic manually (lines 502-548, 554-603, 609-653, 659-696, 702-739).
Station provisioning logic. The station creation code (name sanitization, SatNOGS lookup, ID/key generation) is duplicated between provision_station() (lines 894-930) and create_station_legacy() (lines 1352-1390). These are nearly identical.
Transmitter linking. The transmitter-fetch-and-persist loop appears in both create_campaign() (lines 1004-1023) and add_mission() (lines 1410-1432) with the same structure.
DATABASE_URL parsing. The postgres:// to postgresql:// replacement appears in both core/database.py (line 18) and director/director.py (line 73).
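
The last item is the simplest to consolidate. A minimal sketch of a shared helper follows, assuming a hypothetical function name `normalize_database_url` placed somewhere both processes can import from (for example under `shared/`); neither the name nor the location exists in the codebase today.

```python
def normalize_database_url(url: str) -> str:
    """Rewrite the legacy postgres:// scheme to postgresql://.

    Hypothetical shared helper: Core and Director would both call this
    instead of each carrying its own copy of the replacement.
    """
    legacy_prefix = "postgres://"
    if url.startswith(legacy_prefix):
        # Only the scheme is touched; credentials, host, and path pass through.
        return "postgresql://" + url[len(legacy_prefix):]
    return url
```

Matching on the prefix rather than replacing the substring anywhere in the URL keeps the helper idempotent and avoids touching a password or database name that happens to contain the same text.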

Error Handling

Error handling is inconsistent. The Director uses broad except Exception with logger.exception() throughout (e.g., get_active_assignments(), tick()), which is appropriate for a long-running process that must not crash. The Core API routes sometimes raise HTTPException with appropriate status codes but sometimes swallow errors silently (e.g., _apply_safe_migrations() in database.py line 210 catches all exceptions and passes).

The Agent has no error handling for JSON decode failures on incoming MQTT messages (agent/agent.py line 65 will crash on malformed payloads).
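
One possible shape for the fix is a small decoding function that the Agent's message callback calls before acting on a command. This is a sketch under assumptions: the name `decode_command` and the logger name are hypothetical, and it assumes command payloads are UTF-8 JSON objects, which this review does not confirm.

```python
import json
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger("talos.agent")  # hypothetical logger name


def decode_command(payload: bytes) -> Optional[Dict[str, Any]]:
    """Return the decoded JSON object, or None if the payload is unusable."""
    try:
        data = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        # Malformed input is logged and dropped instead of crashing the process.
        logger.warning("Dropping malformed MQTT payload: %s", exc)
        return None
    if not isinstance(data, dict):
        # Valid JSON that is not an object (e.g. a bare list) is also rejected.
        logger.warning("Dropping non-object MQTT payload of type %s", type(data).__name__)
        return None
    return data
```

The callback would then return early when `decode_command()` yields `None`, so a single bad message costs one log line rather than the whole agent.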

3. Data Flow Analysis

MQTT Message Flow

The MQTT topic hierarchy is well-designed. shared/topics.py defines all topic strings via the Topics class, so call sites reference named attributes and methods instead of hand-typed strings, which removes a common source of topic typos. The org-scoped topic scheme (talos/{org_slug}/gs/{station_id}/...) introduced in v0.2 coexists cleanly with legacy topics.

However, there are two concerns:

Topic duplication in Topics class. The class defines both legacy and org-scoped versions of every topic (e.g., station_rotator_cmd() and org_station_cmd_rot()), and the Director has wrapper functions (_topic_rot, _topic_rig, etc. at lines 311-338) to select between them. This dual-path code increases the surface area for bugs.
The Agent only subscribes to legacy topics. agent/agent.py line 58 subscribes to talos/gs/{STATION_ID}/cmd/# -- it does not know about org-scoped topics. This means the Director must publish to legacy topics for the Agent to receive commands, even though the Director's multi-campaign code path uses org-scoped topics. This is a latent bug for any deployment using organizations.
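
To illustrate the second concern, the sketch below builds the subscription filters an Agent would need in order to receive commands on both paths. The function name `command_subscriptions` is hypothetical, and in the real codebase the strings would presumably come from `shared/topics.py` rather than being formatted inline.

```python
from typing import List, Optional


def command_subscriptions(station_id: str, org_slug: Optional[str] = None) -> List[str]:
    """Build the command-topic filters for one station.

    The legacy filter is always included; the org-scoped filter is added
    only when the Agent has been told which organization it belongs to.
    """
    topics = [f"talos/gs/{station_id}/cmd/#"]  # legacy scheme (what the Agent uses today)
    if org_slug:
        topics.append(f"talos/{org_slug}/gs/{station_id}/cmd/#")  # org-scoped scheme
    return topics
```

Subscribing to both filters during a migration period would let the Director publish on either path without commands being dropped, at the cost of keeping the dual-path logic alive a little longer.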

Database Access Patterns

The Director opens a new Session for every database query (see get_all_stations(), get_active_assignments(), get_station_config(), etc.). This is correct for a long-running process -- sessions should not be held across ticks. However, get_active_assignments() performs N+1 queries: it fetches all assignments, then calls session.get() individually for each campaign, station, and organization (lines 153-159). For 10 assignments this is 31 queries per tick (every 0.5s).
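
The query arithmetic (1 + 3N) can be reproduced with a self-contained stand-in. The sketch below uses the standard-library `sqlite3` module and an invented four-table schema, not TALOS's SQLAlchemy models; the table and column names are assumptions, and the query counts are tallied by the functions themselves.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, str, str, str]


def build_demo_db(n_assignments: int) -> sqlite3.Connection:
    """Create an in-memory database with one campaign, station, and org per assignment."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organization (id INTEGER PRIMARY KEY, slug TEXT);
        CREATE TABLE campaign (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE station (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE assignment (
            id INTEGER PRIMARY KEY,
            campaign_id INTEGER, station_id INTEGER, org_id INTEGER
        );
        """
    )
    for i in range(1, n_assignments + 1):
        conn.execute("INSERT INTO organization VALUES (?, ?)", (i, f"org-{i}"))
        conn.execute("INSERT INTO campaign VALUES (?, ?)", (i, f"campaign-{i}"))
        conn.execute("INSERT INTO station VALUES (?, ?)", (i, f"station-{i}"))
        conn.execute("INSERT INTO assignment VALUES (?, ?, ?, ?)", (i, i, i, i))
    return conn


def load_per_row(conn: sqlite3.Connection) -> Tuple[List[Row], int]:
    """N+1 pattern: one query for the assignments, then three lookups per row."""
    queries = 1
    assignments = conn.execute(
        "SELECT id, campaign_id, station_id, org_id FROM assignment ORDER BY id"
    ).fetchall()
    result: List[Row] = []
    for a_id, c_id, s_id, o_id in assignments:
        campaign = conn.execute("SELECT name FROM campaign WHERE id = ?", (c_id,)).fetchone()[0]
        station = conn.execute("SELECT name FROM station WHERE id = ?", (s_id,)).fetchone()[0]
        org = conn.execute("SELECT slug FROM organization WHERE id = ?", (o_id,)).fetchone()[0]
        queries += 3
        result.append((a_id, campaign, station, org))
    return result, queries


def load_joined(conn: sqlite3.Connection) -> Tuple[List[Row], int]:
    """Same data in a single round trip using JOINs."""
    rows = conn.execute(
        """
        SELECT a.id, c.name, s.name, o.slug
        FROM assignment AS a
        JOIN campaign AS c ON c.id = a.campaign_id
        JOIN station AS s ON s.id = a.station_id
        JOIN organization AS o ON o.id = a.org_id
        ORDER BY a.id
        """
    ).fetchall()
    return rows, 1
```

With `build_demo_db(10)`, `load_per_row()` issues 31 queries and `load_joined()` issues one, and both return the same rows. In SQLAlchemy the equivalent would be an explicit join or an eager-loading option such as `selectinload`, provided the Assignment model declares the relevant relationships, which this review does not confirm.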

The _detach_assignment() function (lines 174-183) creates an anonymous _A class to copy assignment fields outside the session. This works but is fragile -- a dataclass or named tuple would be more appropriate.
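
A frozen dataclass is one way to make the detached copy explicit. The sketch below is illustrative only: the class name `DetachedAssignment` and its four fields are assumptions, since this review does not list which attributes `_detach_assignment()` actually copies.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DetachedAssignment:
    """Immutable copy of the assignment fields needed after the session closes.

    The field list is hypothetical; the real one should mirror whatever the
    anonymous _A class copies today.
    """
    id: int
    campaign_id: int
    station_id: str
    status: str


def detach_assignment(row: Any) -> DetachedAssignment:
    """Copy plain attribute values off an ORM row while the session is still open."""
    return DetachedAssignment(
        id=row.id,
        campaign_id=row.campaign_id,
        station_id=row.station_id,
        status=row.status,
    )
```

Compared with an anonymous class, this gives a stable type name for annotations, a readable `repr` in logs, equality by value, and an error on accidental mutation or on a misspelled field.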

API Design

The API is org-scoped (/api/orgs/{slug}/...) with proper RBAC, which is well-designed. Response bodies, however, are constructed inline as dicts rather than through Pydantic response models, so outputs are not schema-validated and no OpenAPI documentation is generated for responses. shared/schemas.py does define response schemas such as CampaignResponse and AssignmentResponse (lines 197-237), but core/main.py never imports them; in practice the schemas in that module are used only for MQTT payloads.

4. State Management

State is distributed across five locations, which creates consistency challenges:

State	Location	Lifecycle
Campaigns, assignments, stations	PostgreSQL	Persistent
Tracking state (which station tracks which campaign)	`StationManager` in-memory (`director/station_manager.py`)	Process lifetime, lost on restart
TLE cache	`MultiTLEManager` in-memory (`director/tle_manager.py`)	Process lifetime
Station config, session state	MQTT retained messages	Broker lifetime
Satellite catalog + TLE physics models	`GLOBAL_SAT_REGISTRY` list in `core/main.py` (line 84)	Process lifetime, rebuilt on sync

The critical gap is that the Director's tracking state (StationManager._tracking) is purely in-memory. If the Director restarts mid-pass, it loses all knowledge of which stations are currently tracking. The next tick will re-evaluate line-of-sight and re-send START commands, but there is a window where agents receive no pointing updates. This is acceptable at the current scale but should be addressed before production use.
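
A minimal sketch of what persisting that state could look like is shown below. It makes several assumptions: that the tracking state can be represented as a mapping from station ID to campaign ID, that a table named `director_tracking` is acceptable, and that `sqlite3` is a fair stand-in for the real PostgreSQL database. None of these names come from the codebase.

```python
import sqlite3
from typing import Dict

# Hypothetical table: one row per station that is currently tracking.
_TABLE_DDL = (
    "CREATE TABLE IF NOT EXISTS director_tracking ("
    "station_id TEXT PRIMARY KEY, campaign_id INTEGER NOT NULL)"
)


def save_tracking(conn: sqlite3.Connection, tracking: Dict[str, int]) -> None:
    """Replace the stored snapshot with the current in-memory tracking map."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(_TABLE_DDL)
        conn.execute("DELETE FROM director_tracking")
        conn.executemany(
            "INSERT INTO director_tracking (station_id, campaign_id) VALUES (?, ?)",
            list(tracking.items()),
        )


def load_tracking(conn: sqlite3.Connection) -> Dict[str, int]:
    """Return the last saved snapshot, or an empty map on first start."""
    with conn:
        conn.execute(_TABLE_DDL)
        rows = conn.execute(
            "SELECT station_id, campaign_id FROM director_tracking"
        ).fetchall()
    return dict(rows)
```

Writing the snapshot only when the tracking map changes, rather than on every tick, would keep the extra database load small; on startup the Director could seed its in-memory state from `load_tracking()` before the first tick.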

The GLOBAL_SAT_REGISTRY in core/main.py is a global mutable list that can hold 8,000+ Skyfield EarthSatellite objects. The atomic-swap pattern (line 197) is correct for reads, but the 30-second deferred load on startup (line 222) means the /api/debug/overhead endpoint returns empty results for the first 30+ seconds after boot.

5. Testing Architecture

Coverage

The test suite is integration-heavy with 8 test files totaling ~4,750 lines:

Test File	Lines	What it Tests
`test_director_e2e.py`	830	Physics loop, pass prediction, Doppler, multi-station
`test_dashboard_realtime.py`	884	WebSocket relay, MQTT viz updates, campaign display
`test_agent_hardware.py`	609	Hamlib protocol, rotator/rig command translation
`test_campaign_e2e.py`	556	Campaign CRUD, assignment lifecycle, activation
`test_smoke_routes.py`	553	HTTP route smoke tests, auth flow
`test_e2e.py`	534	End-to-end flows (legacy mission path)
`test_load.py`	446	Performance under concurrent load
`test_satnogs_client.py`	243	SatNOGS API client caching, error handling

There are no unit tests. The tests/test_core/, tests/test_director/, and tests/test_agent/ directories exist but are empty (contain only __init__.py). All tests are in tests/test_integration/ or tests/test_shared/.

Test Isolation

The integration test conftest (tests/test_integration/conftest.py) provides session-scoped fixtures for Skyfield objects (timescale, ISS satellite, station locations) with hardcoded TLE data. This is good -- tests use a fixed TLE epoch (2024-01-01) so propagation results are deterministic regardless of when tests run.

However, there is no database fixture infrastructure visible. The campaign and smoke route tests presumably require a running PostgreSQL instance, which means they cannot run in CI unless the pipeline provisions one (for example, as a Docker service container).

6. Design Debt

Deprecated Code Still Present

Mission model (core/database.py lines 123-133): Marked as deprecated with a comment ("Use Campaign instead"), but still has active endpoints in core/main.py (/missions/add, /missions/{mid}/activate, /missions/{mid}/transmitters at lines 1343-1453). The Director still contains a complete legacy code path (_tick_legacy(), lines 606-751) and the dashboard still queries for active_mission (line 534).
core/tracker.py: A 30+ line prototype file with hardcoded TARGET_NAME = "ISS (ZARYA)" and Celestrak URLs. Not imported anywhere. Should be deleted.
AssignmentCreate and AssignmentResponse schemas in shared/schemas.py (lines 216-237) still reference window_start and window_end fields that were removed from the Assignment model (the migration at database.py line 189 explicitly drops these columns).
Legacy topic subscriptions: The Director subscribes to the non-org-scoped Topics.SUB_ALL_STATION_INFO (line 346) and Topics.MISSION_SELECT (line 347). These should be org-scoped.

Hardcoded Values

ADMIN_EMAIL = "pierros@papadeas.gr" in core/main.py line 1508. This auto-creates an admin user on public endpoint access. Should be an environment variable.
radius: 2500000 in physics.py line 99. The satellite footprint radius is hardcoded to 2,500 km regardless of orbital altitude. Should be calculated from altitude.
Agent hardcodes BROKER = "localhost" (agent/agent.py line 25) instead of reading from environment.
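
For the broker and admin-email items, a sketch of environment-based configuration follows. The variable names `TALOS_BROKER_HOST`, `TALOS_BROKER_PORT`, and `TALOS_ADMIN_EMAIL` are hypothetical, as are the function names; the default port assumes the Agent connects over plain MQTT on 1883, which this review does not state.

```python
import os
from typing import Mapping, Optional


def agent_broker_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read broker host and port from the environment, keeping today's default host."""
    env = os.environ if env is None else env
    return {
        "host": env.get("TALOS_BROKER_HOST", "localhost"),  # hypothetical variable name
        "port": int(env.get("TALOS_BROKER_PORT", "1883")),  # 1883: standard unencrypted MQTT port
    }


def admin_email(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the configured admin address, or None when it is unset or blank."""
    env = os.environ if env is None else env
    value = env.get("TALOS_ADMIN_EMAIL", "").strip()  # hypothetical variable name
    return value or None
```

Passing the mapping in as a parameter keeps both functions testable without touching the real process environment. Returning `None` for a missing admin address lets the caller skip admin auto-creation entirely rather than falling back to a baked-in value.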

Missing Features Marked as Stubs

schedule_campaign() (line 1201): Returns a stub response with "note": "Auto-scheduling not yet implemented".
No campaign deletion endpoint exists.
No assignment deletion/pause endpoint exists.

7. Recommendations

Prioritized by impact and effort:

P0 -- Fix Before Next Release

Fix Agent org-scoped topic subscription. The Agent subscribes only to legacy topics. Any org-scoped deployment will silently fail to deliver commands to agents. Estimated fix: add --org CLI argument to Agent, subscribe to talos/{org}/gs/{station_id}/cmd/#.
Add JSON error handling to Agent. A single malformed MQTT message crashes the agent process. Wrap on_message in try/except.
Delete core/tracker.py. Dead code that could confuse contributors.

P1 -- Next Sprint

Split core/main.py into a proper FastAPI application package. Suggested structure:
core/app.py -- FastAPI app factory, lifespan, middleware
core/auth.py -- login, verify, logout, cookie management
core/routes/orgs.py -- organization CRUD
core/routes/campaigns.py -- campaign CRUD + activation
core/routes/stations.py -- station provisioning
core/routes/pages.py -- HTML page rendering
core/routes/legacy.py -- deprecated Mission endpoints (with deprecation warnings)
core/sync.py -- SatNOGS sync logic
Extract the duplicated auth/org/membership lookup from page-rendering endpoints into a reusable dependency (extend require_role() to return template context).
Remove stale schemas. Delete AssignmentCreate.window_start/window_end and AssignmentResponse.window_start/window_end from shared/schemas.py. They reference columns that no longer exist.

P2 -- Next Quarter

Fix N+1 queries in get_active_assignments(). Use a joined/eager load instead of per-row session.get() calls. At 10 assignments this produces 31 queries per tick (62/second).
Use Pydantic response models for API endpoints. The response schemas exist in shared/schemas.py but are unused. Wire them into FastAPI's response_model parameter for automatic validation and OpenAPI documentation.
Add unit tests. The empty tests/test_core/, tests/test_director/, and tests/test_agent/ directories suggest unit tests were planned but never written. Priority targets:
director/physics.py (pure functions, easy to test)
director/station_manager.py (stateful but no external deps)
shared/satnogs_client.py (mock HTTP, test caching logic)
Move ADMIN_EMAIL to an environment variable. Hardcoded email addresses in source code are a maintenance and security concern.

P3 -- Future

Remove legacy Mission code path. The dual-mode Director (tick() vs _tick_legacy()) adds ~150 lines of parallel logic. Once all deployments are migrated to Campaigns, delete the Mission model, legacy endpoints, and legacy Director path.
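Calculate footprint radius from altitude instead of hardcoding 2,500 km. The slant range from the satellite to the geometric horizon is sqrt(2 * R_earth * alt + alt^2), which varies from ~1,600 km at 200 km altitude to ~3,300 km at 800 km. The corresponding ground footprint radius, R_earth * acos(R_earth / (R_earth + alt)), varies from ~1,580 km to ~3,040 km over the same range (R_earth = 6,371 km).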
Implement Director state persistence. Write tracking state to Redis or a DB table so Director restarts do not lose mid-pass tracking context.
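
For the footprint recommendation, the geometry can be sketched as two small functions. The function names are hypothetical and the model is deliberately simple: a spherical Earth with a mean radius of 6,371 km and a 0-degree elevation mask, with no allowance for terrain or atmospheric refraction.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def horizon_slant_range_km(alt_km: float) -> float:
    """Straight-line distance from the satellite to its geometric horizon."""
    return math.sqrt(2.0 * EARTH_RADIUS_KM * alt_km + alt_km ** 2)


def footprint_radius_km(alt_km: float) -> float:
    """Ground-arc distance from the sub-satellite point to the horizon circle."""
    # Earth central angle between the sub-satellite point and the horizon.
    central_angle = math.acos(EARTH_RADIUS_KM / (EARTH_RADIUS_KM + alt_km))
    return EARTH_RADIUS_KM * central_angle
```

The ground-arc value is the one a map overlay would normally use, since the footprint is drawn on the surface. Under this model the hardcoded 2,500 km corresponds to an altitude a little above 500 km, so it overstates the footprint for lower orbits and understates it for higher ones.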

Summary

TALOS has a sound distributed architecture with clear component boundaries, a well-designed MQTT topic hierarchy, and a healthy test-to-code ratio. The Director subsystem is particularly well-factored. The main structural issues are the monolithic core/main.py (which needs to be split into ~8 modules), duplicated code patterns in the Core layer, and an Agent that does not support org-scoped topics. The deprecated Mission model and its parallel code paths should be removed once migration is complete. None of these issues are blocking, but addressing P0 items before the next release will prevent silent failures in multi-org deployments.