TALOS v0.3.0 Architecture Review
Date: April 2026
Scope: Honest internal assessment of the TALOS codebase as of v0.3.0
Author: Engineering (automated review)
1. System Architecture Assessment
TALOS uses a 5-component distributed architecture: Core API, Director, Agent, MQTT Broker, and PostgreSQL. The component boundaries are well-chosen for the domain -- separating the real-time physics loop (Director) from the web-facing CRUD layer (Core) is the right call, and MQTT is a natural fit for the pub/sub command and telemetry patterns of ground station control.
What works well:
- The Director runs as a standalone process with its own DB connection and MQTT client, decoupled from the HTTP request/response cycle. This means the 2 Hz physics loop is never blocked by web traffic.
- The Agent is genuinely lightweight (109 lines in agent/agent.py) -- appropriate for Raspberry Pi deployment.
- The shared modules (shared/schemas.py, shared/topics.py) provide a single source of truth for MQTT topic strings and payload shapes, preventing the silent-misroute bugs that plague distributed systems.
- The SatNOGSClient (shared/satnogs_client.py, 199 lines) consolidates all external API calls with caching, stale-fallback, and connection pooling. This is well-engineered.
What does not work well:
- core/main.py is a 1,630-line monolith. It contains authentication, RBAC, SatNOGS sync, station provisioning, campaign CRUD, assignment management, legacy mission endpoints, a public satellite tracker, and inline HTML email templates. This file does the work of at least 6 modules.
- The Core and Director both create their own SQLAlchemy engines independently (core/database.py line 19, director/director.py line 90). The DATABASE_URL parsing (the postgres:// to postgresql:// fix) is duplicated in both places.
- core/tracker.py is a dead file -- an early prototype with hardcoded Celestrak URLs and emoji-laden print statements. It is not imported anywhere.
Component line counts (Python source only):
| Component | File | Lines |
|---|---|---|
| Core API | core/main.py | 1,630 |
| Core DB | core/database.py | 211 |
| Director | director/director.py | 821 |
| Physics | director/physics.py | 139 |
| Station Mgr | director/station_manager.py | 187 |
| TLE Mgr | director/tle_manager.py | 208 |
| Agent | agent/agent.py | 109 |
| Schemas | shared/schemas.py | 249 |
| Topics | shared/topics.py | 164 |
| SatNOGS Client | shared/satnogs_client.py | 199 |
| Total production | | ~3,920 |
| Total tests | | ~4,750 |
The test-to-production ratio (1.2:1) is healthy.
2. Code Quality
Module Organization
The Director side is well-factored. director/director.py orchestrates; director/physics.py provides pure stateless functions; director/station_manager.py and director/tle_manager.py manage specific concerns. This is clean separation.
The Core side is not factored at all. core/main.py defines Pydantic request models inline (lines 341-374), contains the full RBAC implementation (lines 265-308), has 7 page-rendering endpoints that each independently do the same auth/org/membership lookup pattern, and hosts a public satellite tracker that has nothing to do with mission control.
Code Duplication
The following patterns are duplicated and should be extracted:
- Auth + org + membership lookup. The sequence get_current_user -> select User -> select Org -> select Membership -> check role appears in every page-rendering endpoint (/dashboard, /org/{slug}/settings, /org/{slug}/members, /org/{slug}/stations, /org/{slug}/campaigns). The require_role() dependency already exists for API routes but is not used by page routes, which all reimplement the logic manually (lines 502-548, 554-603, 609-653, 659-696, 702-739).
- Station provisioning logic. The station creation code (name sanitization, SatNOGS lookup, ID/key generation) is duplicated between provision_station() (lines 894-930) and create_station_legacy() (lines 1352-1390). These are nearly identical.
- Transmitter linking. The transmitter-fetch-and-persist loop appears in both create_campaign() (lines 1004-1023) and add_mission() (lines 1410-1432) with the same structure.
- DATABASE_URL parsing. The postgres:// to postgresql:// replacement appears in both core/database.py (line 18) and director/director.py (line 73).
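As a rough illustration of the last item, the scheme rewrite could live in one shared helper that both Core and Director call. The function name normalize_database_url is hypothetical and not taken from the codebase; this is a sketch, not the project's actual fix.

```python
def normalize_database_url(url: str) -> str:
    """Rewrite the postgres:// scheme to postgresql://; leave other URLs untouched.

    Hypothetical shared helper: the name and its location are assumptions.
    """
    legacy_prefix = "postgres://"
    if url.startswith(legacy_prefix):
        return "postgresql://" + url[len(legacy_prefix):]
    return url
```

Both core/database.py and director/director.py could then pass their DATABASE_URL through this one function before creating an engine.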
Error Handling
Error handling is inconsistent. The Director uses broad except Exception with logger.exception() throughout (e.g., get_active_assignments(), tick()), which is appropriate for a long-running process that must not crash. The Core API routes sometimes raise HTTPException with appropriate status codes but sometimes swallow errors silently (e.g., _apply_safe_migrations() in database.py line 210 catches all exceptions and passes).
The Agent has no error handling for JSON decode failures on incoming MQTT messages (agent/agent.py line 65 will crash on malformed payloads).
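One way to harden the Agent is to decode payloads through a small guard function and drop anything malformed. The sketch below uses only the standard library; the function name decode_command and the logger name are assumptions, and the real on_message callback may need additional handling.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("talos.agent")  # logger name is an assumption


def decode_command(payload: bytes) -> Optional[dict]:
    """Return the decoded JSON object, or None if the payload is unusable."""
    try:
        data = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Malformed bytes or invalid JSON: log and drop instead of crashing.
        logger.warning("Dropping malformed MQTT payload (%d bytes)", len(payload))
        return None
    if not isinstance(data, dict):
        # Valid JSON, but not the object shape a command is expected to have.
        logger.warning("Dropping non-object MQTT payload")
        return None
    return data
```

The message callback would then return early whenever decode_command() yields None, so a single bad message no longer terminates the process.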
3. Data Flow Analysis
MQTT Message Flow
The MQTT topic hierarchy is well-designed. shared/topics.py defines all topic strings via the Topics class, making typos impossible at the Python level. The org-scoped topic scheme (talos/{org_slug}/gs/{station_id}/...) introduced in v0.2 coexists cleanly with legacy topics.
However, there are two concerns:
- Topic duplication in Topics class. The class defines both legacy and org-scoped versions of every topic (e.g., station_rotator_cmd() and org_station_cmd_rot()), and the Director has wrapper functions (_topic_rot, _topic_rig, etc. at lines 311-338) to select between them. This dual-path code increases the surface area for bugs.
- The Agent only subscribes to legacy topics. agent/agent.py line 58 subscribes to talos/gs/{STATION_ID}/cmd/# -- it does not know about org-scoped topics. This means the Director must publish to legacy topics for the Agent to receive commands, even though the Director's multi-campaign code path uses org-scoped topics. This is a latent bug for any deployment using organizations.
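A minimal sketch of how the Agent could derive its subscription list so that it covers both schemes. The helper name command_topics and the optional org_slug parameter are assumptions; the topic patterns follow the legacy and org-scoped layouts described above.

```python
from typing import List, Optional


def command_topics(station_id: str, org_slug: Optional[str] = None) -> List[str]:
    """Return the command topic filters an agent should subscribe to."""
    # Legacy filter, matching what the Agent subscribes to today.
    topics = [f"talos/gs/{station_id}/cmd/#"]
    if org_slug:
        # Org-scoped filter, following the talos/{org_slug}/gs/{station_id}/... scheme.
        topics.append(f"talos/{org_slug}/gs/{station_id}/cmd/#")
    return topics
```

Subscribing to both filters during a migration period would let the Director move to org-scoped publishing without breaking agents that have not yet been configured with an organization.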
Database Access Patterns
The Director opens a new Session for every database query (see get_all_stations(), get_active_assignments(), get_station_config(), etc.). This is correct for a long-running process -- sessions should not be held across ticks. However, get_active_assignments() performs N+1 queries: it fetches all assignments, then calls session.get() individually for each campaign, station, and organization (lines 153-159). For 10 assignments this is 31 queries per tick (every 0.5s).
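The query arithmetic and the single-query alternative can be sketched with the standard library's sqlite3 module. The table and column names below are invented for illustration and do not reflect the real TALOS schema; in SQLAlchemy the same effect is usually obtained with eager-loading options such as joinedload or selectinload.

```python
import sqlite3


def n_plus_one_query_count(assignments: int, lookups_per_row: int = 3) -> int:
    """One query for the assignment list plus one lookup per related row."""
    return 1 + lookups_per_row * assignments


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a made-up, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organizations (id INTEGER PRIMARY KEY, slug TEXT);
        CREATE TABLE campaigns (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE assignments (
            id INTEGER PRIMARY KEY, campaign_id INTEGER,
            station_id INTEGER, org_id INTEGER, active INTEGER
        );
        INSERT INTO organizations VALUES (1, 'demo-org');
        INSERT INTO campaigns VALUES (1, 'demo-campaign');
        INSERT INTO stations VALUES (1, 'gs-a'), (2, 'gs-b');
        INSERT INTO assignments VALUES (1, 1, 1, 1, 1), (2, 1, 2, 1, 1), (3, 1, 2, 1, 0);
        """
    )
    return conn


def load_active_assignments(conn: sqlite3.Connection) -> list:
    """Fetch active assignments and their related rows in a single JOIN query."""
    return conn.execute(
        """
        SELECT a.id, c.name, s.name, o.slug
        FROM assignments AS a
        JOIN campaigns AS c ON c.id = a.campaign_id
        JOIN stations AS s ON s.id = a.station_id
        JOIN organizations AS o ON o.id = a.org_id
        WHERE a.active = 1
        ORDER BY a.id
        """
    ).fetchall()
```

With the JOIN, the per-tick cost stays at one query regardless of how many assignments are active, instead of growing as 1 + 3N.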
The _detach_assignment() function (lines 174-183) creates an anonymous _A class to copy assignment fields outside the session. This works but is fragile -- a dataclass or named tuple would be more appropriate.
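A frozen dataclass is one possible replacement for the anonymous class. The field names below are illustrative guesses, since the actual attributes copied by _detach_assignment() are not listed in this review.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class AssignmentSnapshot:
    """Immutable copy of the assignment fields needed outside the session."""
    assignment_id: int
    campaign_id: int
    station_id: str


def detach_assignment(row) -> AssignmentSnapshot:
    """Copy plain values off an ORM row so nothing references the session."""
    return AssignmentSnapshot(
        assignment_id=row.id,
        campaign_id=row.campaign_id,
        station_id=row.station_id,
    )


# Usage with a stand-in object in place of a real ORM row.
snapshot = detach_assignment(SimpleNamespace(id=1, campaign_id=7, station_id="gs-1"))
```

Compared with an ad hoc class, the dataclass gives a fixed field list, equality comparison, and a readable repr, and frozen=True prevents accidental mutation after the session closes.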
API Design
The API is org-scoped (/api/orgs/{slug}/...) with proper RBAC, which is well-designed. Response bodies are constructed inline as dicts rather than using Pydantic response models, which means there is no schema validation on outputs and no auto-generated OpenAPI documentation for responses. shared/schemas.py does define response schemas such as CampaignResponse and AssignmentResponse (lines 197-237), but core/main.py never imports them, so in practice the module is used only for MQTT payloads.
4. State Management
State is distributed across four locations, which creates consistency challenges:
| State | Location | Lifecycle |
|---|---|---|
| Campaigns, assignments, stations | PostgreSQL | Persistent |
| Tracking state (which station tracks which campaign) | StationManager in-memory (director/station_manager.py) | Process lifetime, lost on restart |
| TLE cache | MultiTLEManager in-memory (director/tle_manager.py) | Process lifetime |
| Station config, session state | MQTT retained messages | Broker lifetime |
| Satellite catalog + TLE physics models | GLOBAL_SAT_REGISTRY list in core/main.py (line 84) | Process lifetime, rebuilt on sync |
The critical gap is that the Director's tracking state (StationManager._tracking) is purely in-memory. If the Director restarts mid-pass, it loses all knowledge of which stations are currently tracking. The next tick will re-evaluate line-of-sight and re-send START commands, but there is a window where agents receive no pointing updates. This is acceptable at the current scale but should be addressed before production use.
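A minimal sketch of persisting that tracking state so it survives a restart. It assumes the state can be reduced to a mapping of station ID to campaign ID, which may not match the real shape of StationManager._tracking, and it uses SQLite purely as a stand-in for whichever store the Director would actually use.

```python
import sqlite3
from typing import Dict

_SCHEMA = (
    "CREATE TABLE IF NOT EXISTS tracking_state ("
    "station_id TEXT PRIMARY KEY, campaign_id INTEGER NOT NULL)"
)


def save_tracking(conn: sqlite3.Connection, tracking: Dict[str, int]) -> None:
    """Replace the stored snapshot with the current in-memory mapping."""
    with conn:  # commits on success, rolls back on error
        conn.execute(_SCHEMA)
        conn.execute("DELETE FROM tracking_state")
        conn.executemany(
            "INSERT INTO tracking_state (station_id, campaign_id) VALUES (?, ?)",
            list(tracking.items()),
        )


def load_tracking(conn: sqlite3.Connection) -> Dict[str, int]:
    """Read the last stored snapshot; empty if nothing was ever saved."""
    with conn:
        conn.execute(_SCHEMA)
    rows = conn.execute("SELECT station_id, campaign_id FROM tracking_state")
    return {station_id: campaign_id for station_id, campaign_id in rows}
```

On startup the Director could call load_tracking() before its first tick and resume pointing updates for stations that were mid-pass, rather than waiting for line-of-sight to be re-evaluated from scratch.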
The GLOBAL_SAT_REGISTRY in core/main.py is a global mutable list that holds up to 8,000+ Skyfield EarthSatellite objects. The atomic-swap pattern (line 197) is correct for reads, but the 30-second deferred load on startup (line 222) means the /api/debug/overhead endpoint returns empty results for the first 30+ seconds after boot.
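The atomic-swap idea, together with a readiness check that a handler could consult during the startup window, might look like the sketch below. All names are hypothetical, and plain strings stand in for the Skyfield EarthSatellite objects.

```python
from typing import List

_registry: List[str] = []  # stand-in for GLOBAL_SAT_REGISTRY


def rebuild_registry(raw_names: List[str]) -> None:
    """Build the new list off to the side, then publish it with one rebinding."""
    global _registry
    fresh = [name.strip() for name in raw_names if name.strip()]
    # Readers holding the old list keep a complete, consistent view;
    # new readers see the complete new list. Nobody sees a half-built one.
    _registry = fresh


def registry_snapshot() -> List[str]:
    """Return the currently published list."""
    return _registry


def registry_ready() -> bool:
    """False until the first load has completed."""
    return bool(_registry)
```

An endpoint could use registry_ready() to report that the catalog is still loading instead of returning an empty result that looks like "no satellites overhead".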
5. Testing Architecture
Coverage
The test suite is integration-heavy with 7 test files totaling ~4,750 lines:
| Test File | Lines | What it Tests |
|---|---|---|
| test_director_e2e.py | 830 | Physics loop, pass prediction, Doppler, multi-station |
| test_dashboard_realtime.py | 884 | WebSocket relay, MQTT viz updates, campaign display |
| test_agent_hardware.py | 609 | Hamlib protocol, rotator/rig command translation |
| test_campaign_e2e.py | 556 | Campaign CRUD, assignment lifecycle, activation |
| test_smoke_routes.py | 553 | HTTP route smoke tests, auth flow |
| test_e2e.py | 534 | End-to-end flows (legacy mission path) |
| test_load.py | 446 | Performance under concurrent load |
| test_satnogs_client.py | 243 | SatNOGS API client caching, error handling |
There are no unit tests. The tests/test_core/, tests/test_director/, and tests/test_agent/ directories exist but are empty (contain only __init__.py). All tests are in tests/test_integration/ or tests/test_shared/.
Test Isolation
The integration test conftest (tests/test_integration/conftest.py) provides session-scoped fixtures for Skyfield objects (timescale, ISS satellite, station locations) with hardcoded TLE data. This is good -- tests use a fixed TLE epoch (2024-01-01) so propagation results are deterministic regardless of when tests run.
However, there is no database fixture infrastructure visible. The campaign and smoke route tests presumably require a running PostgreSQL instance, which means they cannot run in CI without Docker.
6. Design Debt
Deprecated Code Still Present
- Mission model (core/database.py lines 123-133): Marked as deprecated with a comment ("Use Campaign instead"), but still has active endpoints in core/main.py (/missions/add, /missions/{mid}/activate, /missions/{mid}/transmitters at lines 1343-1453). The Director still contains a complete legacy code path (_tick_legacy(), lines 606-751) and the dashboard still queries for active_mission (line 534).
- core/tracker.py: A 30+ line prototype file with hardcoded TARGET_NAME = "ISS (ZARYA)" and Celestrak URLs. Not imported anywhere. Should be deleted.
- AssignmentCreate and AssignmentResponse schemas in shared/schemas.py (lines 216-237) still reference window_start and window_end fields that were removed from the Assignment model (the migration at database.py line 189 explicitly drops these columns).
- Legacy topic subscriptions: The Director subscribes to the non-org-scoped Topics.SUB_ALL_STATION_INFO (line 346) and Topics.MISSION_SELECT (line 347). These should be org-scoped.
Hardcoded Values
- ADMIN_EMAIL = "pierros@papadeas.gr" in core/main.py line 1508. This auto-creates an admin user on public endpoint access. Should be an environment variable.
- radius: 2500000 in physics.py line 99. The satellite footprint radius is hardcoded to 2,500 km regardless of orbital altitude. Should be calculated from altitude.
- Agent hardcodes BROKER = "localhost" (agent/agent.py line 25) instead of reading from environment.
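For the broker host and admin email, reading from the environment with a safe fallback could look like the following sketch. The variable names TALOS_BROKER and TALOS_ADMIN_EMAIL are invented for illustration and are not existing TALOS settings.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect deployment-specific values from the environment.

    The environment variable names used here are assumptions.
    """
    if env is None:
        env = os.environ
    return {
        # Falls back to the value currently hardcoded in the Agent.
        "broker": env.get("TALOS_BROKER", "localhost"),
        # No default: leaving this unset means no admin is auto-created.
        "admin_email": env.get("TALOS_ADMIN_EMAIL"),
    }
```

Accepting the mapping as a parameter keeps the function easy to test without touching the real process environment.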
Missing Features Marked as Stubs
- schedule_campaign() (line 1201): Returns a stub response with "note": "Auto-scheduling not yet implemented".
- No campaign deletion endpoint exists.
- No assignment deletion/pause endpoint exists.
7. Recommendations
Prioritized by impact and effort:
P0 -- Fix Before Next Release
- Fix Agent org-scoped topic subscription. The Agent subscribes only to legacy topics. Any org-scoped deployment will silently fail to deliver commands to agents. Estimated fix: add --org CLI argument to Agent, subscribe to talos/{org}/gs/{station_id}/cmd/#.
- Add JSON error handling to Agent. A single malformed MQTT message crashes the agent process. Wrap on_message in try/except.
- Delete core/tracker.py. Dead code that could confuse contributors.
P1 -- Next Sprint
- Split core/main.py into a proper FastAPI application package. Suggested structure:
  - core/app.py -- FastAPI app factory, lifespan, middleware
  - core/auth.py -- login, verify, logout, cookie management
  - core/routes/orgs.py -- organization CRUD
  - core/routes/campaigns.py -- campaign CRUD + activation
  - core/routes/stations.py -- station provisioning
  - core/routes/pages.py -- HTML page rendering
  - core/routes/legacy.py -- deprecated Mission endpoints (with deprecation warnings)
  - core/sync.py -- SatNOGS sync logic
- Extract the duplicated auth/org/membership lookup from page-rendering endpoints into a reusable dependency (extend require_role() to return template context).
- Remove stale schemas. Delete AssignmentCreate.window_start/window_end and AssignmentResponse.window_start/window_end from shared/schemas.py. They reference columns that no longer exist.
P2 -- Next Quarter
- Fix N+1 queries in get_active_assignments(). Use a joined/eager load instead of per-row session.get() calls. At 10 assignments this produces 31 queries per tick (62/second).
- Use Pydantic response models for API endpoints. The response schemas exist in shared/schemas.py but are unused. Wire them into FastAPI's response_model parameter for automatic validation and OpenAPI documentation.
- Add unit tests. The empty tests/test_core/, tests/test_director/, and tests/test_agent/ directories suggest unit tests were planned but never written. Priority targets:
  - director/physics.py (pure functions, easy to test)
  - director/station_manager.py (stateful but no external deps)
  - shared/satnogs_client.py (mock HTTP, test caching logic)
- Move ADMIN_EMAIL to an environment variable. Hardcoded email addresses in source code are a maintenance and security concern.
P3 -- Future
- Remove legacy Mission code path. The dual-mode Director (tick() vs _tick_legacy()) adds ~150 lines of parallel logic. Once all deployments are migrated to Campaigns, delete the Mission model, legacy endpoints, and legacy Director path.
- Calculate footprint radius from altitude instead of hardcoding 2,500 km. The geometric horizon distance is sqrt(2 * R_earth * alt + alt^2), which varies from ~1,600 km at 200 km altitude to ~3,300 km at 800 km.
- Implement Director state persistence. Write tracking state to Redis or a DB table so Director restarts do not lose mid-pass tracking context.
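The footprint calculation from the second item can be sketched as follows, assuming a spherical Earth with a mean radius of 6,371 km and a 0 degree elevation mask. The formula quoted above gives the straight-line distance from the satellite to its horizon; the radius measured along the ground is somewhat smaller, so both are shown. Function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical-Earth assumption


def horizon_slant_range_km(alt_km: float) -> float:
    """Straight-line distance from the satellite to its geometric horizon."""
    return math.sqrt(2.0 * EARTH_RADIUS_KM * alt_km + alt_km ** 2)


def footprint_ground_radius_km(alt_km: float) -> float:
    """Great-circle distance from the sub-satellite point to the horizon."""
    central_angle = math.acos(EARTH_RADIUS_KM / (EARTH_RADIUS_KM + alt_km))
    return EARTH_RADIUS_KM * central_angle


for alt in (200.0, 500.0, 800.0):
    print(
        f"alt={alt:.0f} km  slant={horizon_slant_range_km(alt):.0f} km  "
        f"ground={footprint_ground_radius_km(alt):.0f} km"
    )
```

Under these assumptions the fixed 2,500 km value only matches an orbit at roughly 470 to 560 km altitude, depending on which of the two distances is intended, so it overstates coverage for lower satellites and understates it for higher ones.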
Summary
TALOS has a sound distributed architecture with clear component boundaries, a well-designed MQTT topic hierarchy, and a healthy test-to-code ratio. The Director subsystem is particularly well-factored. The main structural issues are the monolithic core/main.py (which needs to be split into ~8 modules), duplicated code patterns in the Core layer, and an Agent that does not support org-scoped topics. The deprecated Mission model and its parallel code paths should be removed once migration is complete. None of these issues are blocking, but addressing P0 items before the next release will prevent silent failures in multi-org deployments.