
TALOS v0.3.0 Architecture Review

Date: April 2026
Scope: Honest internal assessment of the TALOS codebase as of v0.3.0
Author: Engineering (automated review)


1. System Architecture Assessment

TALOS uses a 5-component distributed architecture: Core API, Director, Agent, MQTT Broker, and PostgreSQL. The component boundaries are well-chosen for the domain -- separating the real-time physics loop (Director) from the web-facing CRUD layer (Core) is the right call, and MQTT is a natural fit for the pub/sub command and telemetry patterns of ground station control.

What works well:

  • The Director runs as a standalone process with its own DB connection and MQTT client, decoupled from the HTTP request/response cycle. This means the 2 Hz physics loop is never blocked by web traffic.
  • The Agent is genuinely lightweight (109 lines in agent/agent.py) -- appropriate for Raspberry Pi deployment.
  • The shared modules (shared/schemas.py, shared/topics.py) provide a single source of truth for MQTT topic strings and payload shapes, preventing the silent-misroute bugs that plague distributed systems.
  • The SatNOGSClient (shared/satnogs_client.py, 199 lines) consolidates all external API calls with caching, stale-fallback, and connection pooling. This is well-engineered.

What does not work well:

  • core/main.py is a 1,630-line monolith. It contains authentication, RBAC, SatNOGS sync, station provisioning, campaign CRUD, assignment management, legacy mission endpoints, a public satellite tracker, and inline HTML email templates. This file does the work of at least 6 modules.
  • The Core and Director both create their own SQLAlchemy engines independently (core/database.py line 19, director/director.py line 90). The DATABASE_URL parsing (the postgres:// to postgresql:// fix) is duplicated in both places.
  • core/tracker.py is a dead file -- an early prototype with hardcoded Celestrak URLs and emoji-laden print statements. It is not imported anywhere.

Component line counts (Python source only):

| Component | File | Lines |
|---|---|---|
| Core API | core/main.py | 1,630 |
| Core DB | core/database.py | 211 |
| Director | director/director.py | 821 |
| Physics | director/physics.py | 139 |
| Station Mgr | director/station_manager.py | 187 |
| TLE Mgr | director/tle_manager.py | 208 |
| Agent | agent/agent.py | 109 |
| Schemas | shared/schemas.py | 249 |
| Topics | shared/topics.py | 164 |
| SatNOGS Client | shared/satnogs_client.py | 199 |
| Total production | | ~3,920 |
| Total tests | | ~4,750 |

The test-to-production ratio (1.2:1) is healthy.


2. Code Quality

Module Organization

The Director side is well-factored. director/director.py orchestrates; director/physics.py provides pure stateless functions; director/station_manager.py and director/tle_manager.py manage specific concerns. This is clean separation.

The Core side is not factored at all. core/main.py defines Pydantic request models inline (lines 341-374), contains the full RBAC implementation (lines 265-308), has 7 page-rendering endpoints that each independently do the same auth/org/membership lookup pattern, and hosts a public satellite tracker that has nothing to do with mission control.

Code Duplication

The following patterns are duplicated and should be extracted:

  1. Auth + org + membership lookup. The sequence get_current_user -> select User -> select Org -> select Membership -> check role appears in every page-rendering endpoint (/dashboard, /org/{slug}/settings, /org/{slug}/members, /org/{slug}/stations, /org/{slug}/campaigns). The require_role() dependency already exists for API routes but is not used by page routes, which all reimplement the logic manually (lines 502-548, 554-603, 609-653, 659-696, 702-739).

  2. Station provisioning logic. The station creation code (name sanitization, SatNOGS lookup, ID/key generation) is duplicated between provision_station() (lines 894-930) and create_station_legacy() (lines 1352-1390). These are nearly identical.

  3. Transmitter linking. The transmitter-fetch-and-persist loop appears in both create_campaign() (lines 1004-1023) and add_mission() (lines 1410-1432) with the same structure.

  4. DATABASE_URL parsing. The postgres:// to postgresql:// replacement appears in both core/database.py (line 18) and director/director.py (line 73).
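The fourth duplication is the simplest to remove: both processes could import one helper from the shared package. The following is a minimal sketch only. The function name normalize_database_url and its placement in a shared module are assumptions, not existing code.

```python
def normalize_database_url(url: str) -> str:
    """Rewrite the legacy ``postgres://`` scheme to ``postgresql://``.

    Hypothetical shared helper: Core and Director would both call this
    before creating their SQLAlchemy engines, instead of each carrying
    its own copy of the string replacement.
    """
    legacy_prefix = "postgres://"
    if url.startswith(legacy_prefix):
        # Only the scheme is rewritten; credentials, host and path are untouched.
        return "postgresql://" + url[len(legacy_prefix):]
    return url
```

Matching on the prefix rather than calling str.replace() on the whole string avoids touching a password or database name that happens to contain the same substring.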

Error Handling

Error handling is inconsistent. The Director uses broad except Exception with logger.exception() throughout (e.g., get_active_assignments(), tick()), which is appropriate for a long-running process that must not crash. The Core API routes sometimes raise HTTPException with appropriate status codes but sometimes swallow errors silently (e.g., _apply_safe_migrations() in database.py line 210 catches all exceptions and passes).

The Agent has no error handling for JSON decode failures on incoming MQTT messages (agent/agent.py line 65 will crash on malformed payloads).
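A defensive decode step could look roughly like the sketch below. It assumes the Agent uses a paho-mqtt style on_message(client, userdata, msg) callback. The names decode_command and handle_command are hypothetical stand-ins for the Agent's real dispatch code.

```python
from __future__ import annotations

import json
import logging
from typing import Any, Optional

logger = logging.getLogger("talos.agent")


def decode_command(payload: bytes) -> Optional[dict[str, Any]]:
    """Return the decoded JSON object, or None if the payload is unusable."""
    try:
        data = json.loads(payload.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        logger.warning("Dropping malformed MQTT payload: %s", exc)
        return None
    if not isinstance(data, dict):
        # Valid JSON, but not the object shape a command is expected to have.
        logger.warning("Dropping non-object MQTT payload: %r", data)
        return None
    return data


def handle_command(topic: str, command: dict[str, Any]) -> None:
    """Placeholder for the Agent's existing command dispatch (hypothetical)."""
    logger.info("Command on %s: %s", topic, command)


def on_message(client, userdata, msg) -> None:
    """Callback that logs and skips bad messages instead of crashing."""
    command = decode_command(msg.payload)
    if command is None:
        return
    handle_command(msg.topic, command)
```

Keeping the decode logic in its own function means it can be unit-tested without a broker connection.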


3. Data Flow Analysis

MQTT Message Flow

The MQTT topic hierarchy is well-designed. shared/topics.py defines all topic strings via the Topics class, making typos impossible at the Python level. The org-scoped topic scheme (talos/{org_slug}/gs/{station_id}/...) introduced in v0.2 coexists cleanly with legacy topics.

However, there are two concerns:

  1. Topic duplication in Topics class. The class defines both legacy and org-scoped versions of every topic (e.g., station_rotator_cmd() and org_station_cmd_rot()), and the Director has wrapper functions (_topic_rot, _topic_rig, etc. at lines 311-338) to select between them. This dual-path code increases the surface area for bugs.

  2. The Agent only subscribes to legacy topics. agent/agent.py line 58 subscribes to talos/gs/{STATION_ID}/cmd/# -- it does not know about org-scoped topics. This means the Director must publish to legacy topics for the Agent to receive commands, even though the Director's multi-campaign code path uses org-scoped topics. This is a latent bug for any deployment using organizations.
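To make the second concern concrete, the sketch below builds the topic filters an org-aware Agent would subscribe to. The two topic patterns come from this review. The function names and the --station-id flag are assumptions for illustration, and the Agent currently reads its station ID from a module-level constant.

```python
from __future__ import annotations

import argparse


def command_topic_filters(station_id: str, org_slug: str | None = None) -> list[str]:
    """Return the MQTT filters for a station's command topics.

    The legacy filter is always included so existing deployments keep
    working; the org-scoped filter is added only when an org slug is given.
    """
    filters = [f"talos/gs/{station_id}/cmd/#"]
    if org_slug:
        filters.append(f"talos/{org_slug}/gs/{station_id}/cmd/#")
    return filters


def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
    """Parse the hypothetical Agent command line."""
    parser = argparse.ArgumentParser(description="TALOS agent (sketch)")
    parser.add_argument("--station-id", required=True)
    parser.add_argument("--org", default=None, help="organization slug (optional)")
    return parser.parse_args(argv)
```

In the connect callback the Agent would then subscribe to each returned filter in turn. Subscribing to both paths during the migration keeps the Agent compatible with a Director that still publishes on legacy topics.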

Database Access Patterns

The Director opens a new Session for every database query (see get_all_stations(), get_active_assignments(), get_station_config(), etc.). This is correct for a long-running process -- sessions should not be held across ticks. However, get_active_assignments() performs N+1 queries: it fetches all assignments, then calls session.get() individually for each campaign, station, and organization (lines 153-159). For 10 assignments this is 31 queries per tick (every 0.5s).
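The N+1 pattern can be replaced by a single joined query. The sketch below shows the idea with the standard-library sqlite3 module rather than the project's ORM, so that it runs on its own. The table and column names are invented for the demonstration and do not reflect the real TALOS schema.

```python
import sqlite3

# Hypothetical, simplified schema used only for this illustration.
SCHEMA = """
CREATE TABLE organization (id INTEGER PRIMARY KEY, slug TEXT);
CREATE TABLE campaign (id INTEGER PRIMARY KEY, org_id INTEGER, name TEXT);
CREATE TABLE station (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE assignment (
    id INTEGER PRIMARY KEY, campaign_id INTEGER, station_id INTEGER, active INTEGER
);
"""

# One statement returns each active assignment with its related rows.
JOINED_QUERY = """
SELECT a.id, c.name, s.name, o.slug
FROM assignment AS a
JOIN campaign AS c ON c.id = a.campaign_id
JOIN station AS s ON s.id = a.station_id
JOIN organization AS o ON o.id = c.org_id
WHERE a.active = 1
ORDER BY a.id
"""


def load_active_assignments(conn: sqlite3.Connection) -> list:
    """Fetch all active assignments and their relations in one round trip."""
    return conn.execute(JOINED_QUERY).fetchall()


def build_demo_db(n_assignments: int = 10) -> sqlite3.Connection:
    """Create an in-memory database with n active assignments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organization VALUES (1, 'demo-org')")
    conn.execute("INSERT INTO campaign VALUES (1, 1, 'demo-campaign')")
    for i in range(1, n_assignments + 1):
        conn.execute("INSERT INTO station VALUES (?, ?)", (i, f"station-{i}"))
        conn.execute("INSERT INTO assignment VALUES (?, 1, ?, 1)", (i, i))
    return conn
```

With 10 assignments this issues one query per tick instead of 31. In the Director the same effect would come from an eager or joined load on the existing models rather than hand-written SQL.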

The _detach_assignment() function (lines 174-183) creates an anonymous _A class to copy assignment fields outside the session. This works but is fragile -- a dataclass or named tuple would be more appropriate.
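A frozen dataclass is one possible replacement for the anonymous class. In this sketch the field names are illustrative guesses, not the actual Assignment columns, and AssignmentSnapshot is a hypothetical name.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AssignmentSnapshot:
    """Immutable copy of the assignment fields needed outside the session."""

    # Illustrative fields only; the real Assignment model may differ.
    id: int
    campaign_id: int
    station_id: str
    status: str


def detach_assignment(assignment) -> AssignmentSnapshot:
    """Copy the declared fields off an ORM row into a plain snapshot."""
    values = {f.name: getattr(assignment, f.name) for f in fields(AssignmentSnapshot)}
    return AssignmentSnapshot(**values)
```

Because the copy is driven by the dataclass field list, a missing attribute fails loudly with AttributeError at detach time, and the frozen flag prevents accidental mutation later in the tick.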

API Design

The API is org-scoped (/api/orgs/{slug}/...) with proper RBAC, which is well-designed. Response bodies are constructed inline as dicts rather than using Pydantic response models, which means there is no schema validation on outputs and no auto-generated OpenAPI documentation for responses. shared/schemas.py does define response schemas such as CampaignResponse and AssignmentResponse (lines 197-237), but core/main.py never imports them, so in practice the shared schemas are used only for MQTT payloads.


4. State Management

State is distributed across five locations, which creates consistency challenges:

| State | Location | Lifecycle |
|---|---|---|
| Campaigns, assignments, stations | PostgreSQL | Persistent |
| Tracking state (which station tracks which campaign) | StationManager in-memory (director/station_manager.py) | Process lifetime, lost on restart |
| TLE cache | MultiTLEManager in-memory (director/tle_manager.py) | Process lifetime |
| Station config, session state | MQTT retained messages | Broker lifetime |
| Satellite catalog + TLE physics models | GLOBAL_SAT_REGISTRY list in core/main.py (line 84) | Process lifetime, rebuilt on sync |

The critical gap is that the Director's tracking state (StationManager._tracking) is purely in-memory. If the Director restarts mid-pass, it loses all knowledge of which stations are currently tracking. The next tick will re-evaluate line-of-sight and re-send START commands, but there is a window where agents receive no pointing updates. This is acceptable at the current scale but should be addressed before production use.

The GLOBAL_SAT_REGISTRY in core/main.py is a global mutable list that holds up to 8,000+ Skyfield EarthSatellite objects. The atomic-swap pattern (line 197) is correct for reads, but the 30-second deferred load on startup (line 222) means the /api/debug/overhead endpoint returns empty results for the first 30+ seconds after boot.


5. Testing Architecture

Coverage

The test suite is integration-heavy with 8 test files totaling ~4,750 lines:

| Test File | Lines | What it Tests |
|---|---|---|
| test_director_e2e.py | 830 | Physics loop, pass prediction, Doppler, multi-station |
| test_dashboard_realtime.py | 884 | WebSocket relay, MQTT viz updates, campaign display |
| test_agent_hardware.py | 609 | Hamlib protocol, rotator/rig command translation |
| test_campaign_e2e.py | 556 | Campaign CRUD, assignment lifecycle, activation |
| test_smoke_routes.py | 553 | HTTP route smoke tests, auth flow |
| test_e2e.py | 534 | End-to-end flows (legacy mission path) |
| test_load.py | 446 | Performance under concurrent load |
| test_satnogs_client.py | 243 | SatNOGS API client caching, error handling |

There are no unit tests. The tests/test_core/, tests/test_director/, and tests/test_agent/ directories exist but are empty (contain only __init__.py). All tests are in tests/test_integration/ or tests/test_shared/.

Test Isolation

The integration test conftest (tests/test_integration/conftest.py) provides session-scoped fixtures for Skyfield objects (timescale, ISS satellite, station locations) with hardcoded TLE data. This is good -- tests use a fixed TLE epoch (2024-01-01) so propagation results are deterministic regardless of when tests run.

However, there is no database fixture infrastructure visible. The campaign and smoke route tests presumably require a running PostgreSQL instance, which means they cannot run in CI without Docker.


6. Design Debt

Deprecated Code Still Present

  1. Mission model (core/database.py lines 123-133): Marked as deprecated with a comment ("Use Campaign instead"), but still has active endpoints in core/main.py (/missions/add, /missions/{mid}/activate, /missions/{mid}/transmitters at lines 1343-1453). The Director still contains a complete legacy code path (_tick_legacy(), lines 606-751) and the dashboard still queries for active_mission (line 534).

  2. core/tracker.py: A 30+ line prototype file with hardcoded TARGET_NAME = "ISS (ZARYA)" and Celestrak URLs. Not imported anywhere. Should be deleted.

  3. AssignmentCreate and AssignmentResponse schemas in shared/schemas.py (lines 216-237) still reference window_start and window_end fields that were removed from the Assignment model (the migration at database.py line 189 explicitly drops these columns).

  4. Legacy topic subscriptions: The Director subscribes to the non-org-scoped Topics.SUB_ALL_STATION_INFO (line 346) and Topics.MISSION_SELECT (line 347). These should be org-scoped.

Hardcoded Values

  • ADMIN_EMAIL = "pierros@papadeas.gr" in core/main.py line 1508. This auto-creates an admin user on public endpoint access. Should be an environment variable.
  • radius: 2500000 in physics.py line 99. The satellite footprint radius is hardcoded to 2,500 km regardless of orbital altitude. Should be calculated from altitude.
  • Agent hardcodes BROKER = "localhost" (agent/agent.py line 25) instead of reading from environment.
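The first and third values above could be read from the environment along these lines. This is a sketch under assumptions: the variable names TALOS_ADMIN_EMAIL and TALOS_MQTT_BROKER are invented, and treating a missing admin email as "skip auto-creation" is a suggested behavior, not current behavior.

```python
import os


def load_settings(env=None) -> dict:
    """Read deployment-specific values from the environment.

    ``env`` defaults to os.environ; passing a plain dict makes the
    function easy to test.
    """
    env = os.environ if env is None else env
    return {
        # No default: when unset, the caller should skip admin auto-creation.
        "admin_email": env.get("TALOS_ADMIN_EMAIL"),
        # Falls back to the value currently hardcoded in the Agent.
        "broker_host": env.get("TALOS_MQTT_BROKER", "localhost"),
    }
```

Keeping "localhost" as the fallback preserves today's behavior for single-machine setups while letting a Raspberry Pi agent point at a remote broker.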

Missing Features Marked as Stubs

  • schedule_campaign() (line 1201): Returns a stub response with "note": "Auto-scheduling not yet implemented".
  • No campaign deletion endpoint exists.
  • No assignment deletion/pause endpoint exists.

7. Recommendations

Prioritized by impact and effort:

P0 -- Fix Before Next Release

  1. Fix Agent org-scoped topic subscription. The Agent subscribes only to legacy topics. Any org-scoped deployment will silently fail to deliver commands to agents. Estimated fix: add --org CLI argument to Agent, subscribe to talos/{org}/gs/{station_id}/cmd/#.

  2. Add JSON error handling to Agent. A single malformed MQTT message crashes the agent process. Wrap on_message in try/except.

  3. Delete core/tracker.py. Dead code that could confuse contributors.

P1 -- Next Sprint

  1. Split core/main.py into a proper FastAPI application package. Suggested structure:

     • core/app.py -- FastAPI app factory, lifespan, middleware
     • core/auth.py -- login, verify, logout, cookie management
     • core/routes/orgs.py -- organization CRUD
     • core/routes/campaigns.py -- campaign CRUD + activation
     • core/routes/stations.py -- station provisioning
     • core/routes/pages.py -- HTML page rendering
     • core/routes/legacy.py -- deprecated Mission endpoints (with deprecation warnings)
     • core/sync.py -- SatNOGS sync logic

  2. Extract the duplicated auth/org/membership lookup from page-rendering endpoints into a reusable dependency (extend require_role() to return template context).

  3. Remove stale schemas. Delete AssignmentCreate.window_start/window_end and AssignmentResponse.window_start/window_end from shared/schemas.py. They reference columns that no longer exist.

P2 -- Next Quarter

  1. Fix N+1 queries in get_active_assignments(). Use a joined/eager load instead of per-row session.get() calls. At 10 assignments this produces 31 queries per tick (62/second).

  2. Use Pydantic response models for API endpoints. The response schemas exist in shared/schemas.py but are unused. Wire them into FastAPI's response_model parameter for automatic validation and OpenAPI documentation.

  3. Add unit tests. The empty tests/test_core/, tests/test_director/, and tests/test_agent/ directories suggest unit tests were planned but never written. Priority targets:

     • director/physics.py (pure functions, easy to test)
     • director/station_manager.py (stateful but no external deps)
     • shared/satnogs_client.py (mock HTTP, test caching logic)

  4. Move ADMIN_EMAIL to an environment variable. Hardcoded email addresses in source code are a maintenance and security concern.

P3 -- Future

  1. Remove legacy Mission code path. The dual-mode Director (tick() vs _tick_legacy()) adds ~150 lines of parallel logic. Once all deployments are migrated to Campaigns, delete the Mission model, legacy endpoints, and legacy Director path.

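  2. Calculate footprint radius from altitude instead of hardcoding 2,500 km. The slant range from the satellite to the geometric horizon is sqrt(2 * R_earth * alt + alt^2), which varies from ~1,600 km at 200 km altitude to ~3,300 km at 800 km. The corresponding radius measured along the ground, R_earth * acos(R_earth / (R_earth + alt)), is slightly smaller (~1,600 km to ~3,000 km over the same altitude range).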

  3. Implement Director state persistence. Write tracking state to Redis or a DB table so that Director restarts do not lose mid-pass tracking context.
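The footprint calculation from item 2 can be sketched as two pure functions. The sketch assumes a spherical Earth with a mean radius of 6,371 km and a zero-degree elevation mask. The function names are hypothetical and do not exist in director/physics.py.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def horizon_slant_range_km(altitude_km: float) -> float:
    """Straight-line distance from the satellite to its geometric horizon."""
    r = EARTH_RADIUS_KM
    return math.sqrt(2.0 * r * altitude_km + altitude_km ** 2)


def footprint_ground_radius_km(altitude_km: float) -> float:
    """Footprint radius measured along the Earth's surface.

    This is the arc length from the sub-satellite point to the horizon,
    which is the quantity a map overlay normally needs.
    """
    r = EARTH_RADIUS_KM
    return r * math.acos(r / (r + altitude_km))
```

For a 400 km orbit the ground radius comes out at roughly 2,200 km. The hardcoded value in physics.py appears to be in metres, so the result would need to be multiplied by 1,000 before use there.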


Summary

TALOS has a sound distributed architecture with clear component boundaries, a well-designed MQTT topic hierarchy, and a healthy test-to-code ratio. The Director subsystem is particularly well-factored. The main structural issues are the monolithic core/main.py (which needs to be split into ~8 modules), duplicated code patterns in the Core layer, and an Agent that does not support org-scoped topics. The deprecated Mission model and its parallel code paths should be removed once migration is complete. None of these issues are blocking, but addressing P0 items before the next release will prevent silent failures in multi-org deployments.