TALOS v0.3.0 Technology Roadmap¶
Date: April 2026
Version: 0.3.0
Authors: Libre Space Foundation
Scope: Forward-looking technology assessment, migration paths, and prioritized roadmap for the TALOS ground station network
Table of Contents¶
- Current Tech Stack Assessment
- Orbital Propagation
- Visualization
- Messaging
- Scheduling
- Frontend Architecture
- Edge Computing
- AI/ML Opportunities
- Prioritized Roadmap
1. Current Tech Stack Assessment¶
What is working well¶
| Component | Technology | Verdict |
|---|---|---|
| API server | FastAPI 0.100+ / Uvicorn | Excellent. Async-native, OpenAPI docs for free, strong ecosystem. No reason to move. |
| ORM / schema | SQLModel 0.0.14+ / Pydantic v2 | Good fit. Single model definition for DB + API. Alembic migrations in place. |
| Orbital math | Skyfield + NumPy | Correct and well-tested. Pure-Python SGP4 with validated ephemeris loading. |
| Design system | Custom CSS on Astro UXDS tokens | Professional space-operations aesthetic. Lightweight, no framework lock-in. |
| CI/CD | GitLab CI with DAG stages, Docker 24, Fly.io deploy | Mature pipeline: lint, unit, integration, security scanning, Pages docs. |
| Auth | Magic-link email + JWT sessions | Passwordless, simple, appropriate for current scale. |
| Database | PostgreSQL 15 via psycopg2-binary | Solid choice. Room to leverage JSONB, PostGIS, and pg_cron as complexity grows. |
What needs attention¶
| Component | Technology | Concern |
|---|---|---|
| Agent | Pure Python, synchronous threads, raw sockets | High resource use on Raspberry Pi. No reconnect logic. Global mutable state. |
| Frontend | Server-rendered Jinja2 + vanilla JS (~800 lines inline) | Dashboard JS is monolithic. No component reuse, no type checking, no bundler. |
| MQTT client (browser) | Paho MQTT JS 1.0.1 (Eclipse, 2014) | Unmaintained. No MQTT 5.0 support. WebSocket-only with manual reconnect. |
| Propagation | Serial Skyfield calls per satellite | O(n) wall-clock time for n satellites. No batch propagation. |
| Scheduling | Manual station-mission assignment via UI | No automated conflict detection, no optimization, no multi-pass campaign planning. |
| Map | Leaflet 1.9.4 (2D) | Adequate for ground tracks, but no 3D orbit visualization, no terrain, no sensor cones. |
2. Orbital Propagation¶
Current: Skyfield SGP4¶
TALOS uses Skyfield's EarthSatellite for all orbital computations: pass prediction (find_events), Doppler shift, ground tracks, and footprint circles. The code in director/physics.py is clean and stateless, but each satellite is propagated serially. With 48 time steps per ground track and potentially hundreds of satellites across campaigns, this becomes a bottleneck.
Option A: dSGP4 (ESA) for batch/GPU propagation¶
dSGP4 (PyTorch-based, released 2024 by ESA's Advanced Concepts Team) offers:
- Batch CPU propagation: Propagate 1,000 TLEs across 100 time steps in a single vectorized call. On CPU alone this is 10-50x faster than serial Skyfield loops.
- GPU acceleration: With a CUDA-capable GPU, batch propagation scales to 10,000+ objects trivially. Relevant if TALOS grows to full-catalog scheduling.
- ML-corrected model: Neural network corrections trained on historical SP data reduce SGP4's known drag-modeling errors by up to 40% for LEO objects.
- Differentiability: Enables gradient-based orbit determination if TALOS ever ingests station observations.
Trade-offs: Adds PyTorch as a dependency (~800 MB). The ML correction model requires periodic retraining. API is lower-level than Skyfield -- no built-in find_events, so pass prediction logic must be reimplemented using root-finding on elevation angles.
Recommended migration path: Keep Skyfield for single-satellite interactive operations (Doppler, real-time tracking). Introduce dSGP4 as an optional backend in the Director for bulk pass prediction across campaigns. Abstract propagation behind a Propagator protocol so both backends are swappable.
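One possible shape for that abstraction -- a sketch only; the Propagator and PassWindow names do not exist in the current codebase:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Protocol, Sequence


@dataclass(frozen=True)
class PassWindow:
    """A single contact opportunity (field names are illustrative)."""
    satellite_id: str
    station_id: str
    rise: datetime
    culminate: datetime
    set_time: datetime
    max_elevation_deg: float


class Propagator(Protocol):
    """Backend-agnostic interface the Director codes against.

    A SkyfieldPropagator and a Dsgp4Propagator would both satisfy this
    protocol, making the backends swappable without touching callers.
    """

    def predict_passes(
        self,
        tles: Sequence[tuple[str, str, str]],  # (name, line1, line2)
        station_id: str,
        start: datetime,
        end: datetime,
        min_elevation_deg: float = 10.0,
    ) -> list[PassWindow]:
        ...


def select_backend(name: str) -> str:
    """Mirror of the proposed TALOS_PROPAGATOR env-var switch (sketch)."""
    if name not in ("skyfield", "dsgp4"):
        raise ValueError(f"unknown propagator backend: {name}")
    return name
```

Because Protocol uses structural typing, neither backend needs to import or subclass anything from the other.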
Option B: sgp4-rs (Rust SGP4 with Python bindings)¶
The sgp4 crate on crates.io provides a Rust-native SGP4 implementation with Python bindings via PyO3. Benchmarks show 3-5x speedup over the C-extension SGP4 used by Skyfield, with zero additional dependencies beyond a small compiled wheel.
Trade-off: Faster than Skyfield but still serial per TLE. No batch vectorization. Best suited for the edge agent where PyTorch is impractical.
Recommendation¶
Phase 1 (v0.4): Extract a Propagator interface in director/physics.py. Keep Skyfield as default.
Phase 2 (v0.5): Add dSGP4 batch backend for campaign-wide pass prediction. Gate behind TALOS_PROPAGATOR=dsgp4 env var.
Phase 3 (v0.6+): Evaluate sgp4-rs for the edge agent if it is rewritten in Rust.
3. Visualization¶
Current: Leaflet 1.9.4 (2D)¶
The dashboard uses Leaflet with a dark tile layer for a 2D Mercator map. Ground tracks are polylines, stations are markers, and footprints are circles. This works well for operational awareness but cannot represent orbital altitude, sensor cones, or 3D conjunction geometry.
Option A: CesiumJS for 3D globe¶
CesiumJS (open-source, Apache 2.0) provides a WebGL-based 3D globe with:
- CZML streaming: Native format for time-dynamic satellite positions. The Director could emit CZML documents over MQTT, and the browser renders interpolated orbits without per-frame position messages.
- Sensor volumes: Visualize station antenna patterns as 3D cones, showing coverage overlap.
- Terrain and imagery: Built-in Cesium World Terrain and various imagery providers.
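To make the CZML idea concrete, a minimal document the Director could emit -- the packet layout follows the CZML format, while the satellite ID, time window, and position samples are made up for illustration:

```python
import json
from datetime import datetime, timedelta, timezone


def iso(t: datetime) -> str:
    return t.isoformat().replace("+00:00", "Z")


epoch = datetime(2026, 4, 1, tzinfo=timezone.utc)
window = f"{iso(epoch)}/{iso(epoch + timedelta(hours=2))}"

czml = [
    # Every CZML stream starts with a document packet.
    {"id": "document", "name": "talos-campaign", "version": "1.0",
     "clock": {"interval": window, "multiplier": 60}},
    # One packet per satellite. Cesium interpolates between the samples,
    # so the browser needs no per-frame position messages.
    {"id": "sat/12345", "name": "DEMO-SAT",
     "availability": window,
     "position": {
         "epoch": iso(epoch),
         "interpolationAlgorithm": "LAGRANGE",
         "interpolationDegree": 5,
         "referenceFrame": "INERTIAL",
         # Flat list of [seconds-since-epoch, x, y, z] in ECI metres.
         "cartesian": [0, 6871e3, 0, 0,
                       300, 6800e3, 950e3, 120e3],
     },
     "path": {"show": True}},
]

payload = json.dumps(czml)  # publish to e.g. a talos/czml/{campaign_id} topic
```

The Director would regenerate and republish the document only when TLEs change, not on every telemetry tick.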
Trade-offs: Large library (~3 MB gzipped). Requires a Cesium Ion token for terrain (free tier: 75,000 monthly tiles). Steeper learning curve than Leaflet. Mobile performance degrades with many entities.
Option B: Leaflet + deck.gl overlay¶
Keep Leaflet for the base map but add deck.gl layers for GPU-accelerated rendering of thousands of satellite points and arcs. Lighter than CesiumJS, stays in 2D, but handles higher entity counts than native Leaflet markers.
Option C: Resium (React + CesiumJS)¶
If the frontend moves to React (see Section 6), Resium provides declarative React components wrapping CesiumJS. This is the cleanest integration path if both visualization and frontend architecture change together.
Recommendation¶
Phase 1 (v0.4): Keep Leaflet. Optimize ground track rendering by caching polylines and updating only on TLE change (already noted in physics.py comments).
Phase 2 (v0.5): Add CesiumJS as an opt-in "3D view" alongside the existing 2D map. Serve CZML from the Director. Do not replace Leaflet yet -- operators may prefer the faster 2D view for daily operations.
Phase 3 (v0.7+): If the frontend migrates to React/Svelte, evaluate Resium or a Svelte CesiumJS wrapper as the primary visualization.
4. Messaging¶
Current: MQTT 3.1.1 via Eclipse Mosquitto¶
TALOS uses Mosquitto as the MQTT broker with Paho Python clients (v2.0) in the Core, Director, and Agent. The browser uses Paho MQTT JS 1.0.1 over WebSocket. The topic hierarchy (talos/gs/{id}/cmd/#, talos/gs/{id}/telemetry/#) is well-structured.
Option A: Upgrade to MQTT 5.0¶
Mosquitto already supports MQTT 5.0. The Python Paho 2.0 client supports MQTT 5.0 properties. Benefits:
- Shared subscriptions ($share/group/topic): Allow multiple Director instances to load-balance command processing without duplicate handling. Critical for horizontal scaling.
- Request/response correlation: Built-in correlation IDs replace the current pattern of publishing a command and waiting for a separate status topic.
- Message expiry: Stale rotator commands auto-expire instead of being delivered to a station that reconnects after an outage.
- User properties: Attach metadata (mission ID, priority) to messages without encoding them in the topic or payload.
Trade-off: The browser Paho MQTT JS library does not support MQTT 5.0 and is unmaintained. Must replace with MQTT.js (npm mqtt, actively maintained, MQTT 5.0 support, ~50 KB gzipped).
Option B: Migrate to NATS¶
NATS offers higher throughput, built-in request/reply, and JetStream for persistence. However, it loses MQTT compatibility with existing agents and the SatNOGS ecosystem. The NATS Python client (nats-py) is less battle-tested in IoT contexts than Paho. Only justified if TALOS scales to hundreds of stations where NATS's queue groups outperform MQTT 5.0 shared subscriptions.
Recommendation¶
Phase 1 (v0.4): Replace browser Paho MQTT JS with MQTT.js. Enable MQTT 5.0 on the Mosquitto broker. Update Python clients to use MQTT 5.0 properties (shared subscriptions, message expiry, correlation IDs).
Phase 2 (v0.6+): Evaluate NATS only if MQTT 5.0 shared subscriptions prove insufficient for multi-Director scaling. The SatNOGS ecosystem alignment strongly favors staying on MQTT.
5. Scheduling¶
Current: Manual assignment¶
Station-mission assignments are created manually through the UI. There is no conflict detection (double-booking a station), no optimization across campaigns, and no awareness of pass windows during assignment.
Approach: Google OR-Tools constraint solver¶
OR-Tools (BSD-3, actively maintained by Google) provides a CP-SAT solver suitable for satellite pass scheduling:
Problem formulation:
- Variables: Binary assignment of each (pass, station) pair.
- Constraints: No overlapping passes on the same station. Minimum elevation mask. Rotator slew time between consecutive passes. Station availability windows.
- Objective: Maximize total scheduled contact time (or priority-weighted coverage).
Integration path:
1. The Director computes all pass windows for a campaign's satellites over a time horizon (using batch propagation from Section 2).
2. Pass windows and station constraints are fed to the CP-SAT solver.
3. The solver returns optimal assignments, which are written to the database as Assignment records.
4. Operators review and approve the schedule via the UI.
Performance: CP-SAT solves a 50-station, 200-satellite, 24-hour problem in under 10 seconds. For larger horizons, the solver supports time limits with best-found-so-far solutions.
Trade-off: OR-Tools is a ~50 MB compiled dependency. The constraint model must be carefully designed to avoid over-constraining or under-constraining. Requires validation against manually scheduled passes.
Alternative: Greedy scheduler first¶
For v0.4, a greedy scheduler that assigns passes by priority and rejects conflicts avoids the OR-Tools dependency while eliminating double-booking.
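A minimal sketch of such a greedy scheduler -- field names are illustrative, not the actual Assignment model:

```python
from typing import NamedTuple


class Pass(NamedTuple):
    station: str
    start: float   # minutes from horizon start
    end: float
    priority: int


def greedy_schedule(passes: list[Pass]) -> list[Pass]:
    """Take passes in priority order; reject any that double-book a station."""
    booked: dict[str, list[Pass]] = {}
    # Highest priority first; earlier start breaks ties.
    for p in sorted(passes, key=lambda p: (-p.priority, p.start)):
        station = booked.setdefault(p.station, [])
        # Accept only if it overlaps nothing already on this station.
        if all(p.end <= q.start or p.start >= q.end for q in station):
            station.append(p)
    return sorted((p for ps in booked.values() for p in ps),
                  key=lambda p: p.start)


demo = [Pass("gs-1", 0, 10, 3), Pass("gs-1", 5, 15, 5), Pass("gs-2", 0, 12, 2)]
print(greedy_schedule(demo))  # keeps the priority-5 pass, drops its overlap
```

This is not optimal (a high-priority pass can block two lower-priority passes whose combined value is greater), but it eliminates double-booking with zero new dependencies.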
Recommendation¶
Phase 1 (v0.4): Implement conflict detection on assignment creation (database constraint + API validation). Add a greedy auto-scheduler for single campaigns.
Phase 2 (v0.5): Introduce OR-Tools CP-SAT for multi-campaign optimization. Expose solver parameters (time limit, priority weights) in the campaign settings UI.
Phase 3 (v0.7+): Add rolling-horizon scheduling that re-optimizes as TLEs update and station availability changes.
6. Frontend Architecture¶
Current: Server-rendered Jinja2 + inline JavaScript¶
The dashboard is a single Jinja2 template with ~800 lines of inline JavaScript handling MQTT subscriptions, Leaflet map updates, station card rendering, countdown timers, and modal interactions. There is no module system, no type checking, and no component reuse across pages.
Option A: HTMX + Alpine.js (incremental enhancement)¶
Replace inline JS with HTMX for server-driven partial updates and Alpine.js for client-side reactivity. Benefits:
- Minimal rewrite: Keep Jinja2 templates. Add hx-* attributes for dynamic content (station cards, pass countdowns). Use Alpine.js for local state (modal open/close, form validation).
- No build step: Ship plain HTML/JS. Works with the existing FastAPI static file serving.
- Trade-off: HTMX relies on HTTP round-trips for updates, which adds latency compared to MQTT-pushed state. The real-time dashboard (1 Hz telemetry updates) is a poor fit for HTMX polling. MQTT-driven elements must remain in vanilla JS or Alpine.js reactive stores.
Option B: Svelte SPA with MQTT¶
Replace Jinja2 dashboard with a Svelte single-page application:
- Reactive by design: Svelte's reactive declarations naturally handle MQTT-pushed state updates. A writable store backed by MQTT.js gives components automatic re-rendering.
- Small bundle: Svelte compiles to vanilla JS with no runtime. A full dashboard would likely be 80-120 KB gzipped.
- Component reuse: Station cards, telemetry panels, map views become reusable .svelte components.
- Trade-off: Requires a build step (Vite). The Core must serve the SPA and provide a JSON API (FastAPI already does both). Two deployment artifacts (API + SPA static files). Learning curve for contributors unfamiliar with Svelte.
Option C: React SPA¶
React has the largest ecosystem (including Resium for CesiumJS), but the runtime overhead is larger than Svelte and the boilerplate-to-value ratio is worse for a focused operational dashboard. Not recommended unless CesiumJS/Resium becomes the primary visualization.
Recommendation¶
Phase 1 (v0.4): Extract inline JS into ES modules served from /static/js/. No framework change, just module boundaries: map.js, mqtt.js, stations.js, countdowns.js. Add JSDoc type annotations.
Phase 2 (v0.5): Introduce HTMX for non-real-time pages (stations list, campaign management, settings). Keep the dashboard as vanilla JS + ES modules.
Phase 3 (v0.6+): Evaluate Svelte for the dashboard if component complexity grows beyond manageable vanilla JS. The JSON API already exists in FastAPI; the migration is primarily a frontend concern.
7. Edge Computing¶
Current: Python agent with threading¶
The agent (agent/agent.py) is a ~110-line Python script using Paho MQTT, raw TCP sockets to rotctld, and a threaded telemetry loop. It runs on Raspberry Pi 3/4 at ground stations.
Problems: Python 3.10 runtime requires ~50-70 MB resident memory. No reconnection logic for MQTT or rotctld -- a transient network failure kills the session. Global mutable state makes testing difficult. No watchdog or health monitoring.
Option A: Rust agent¶
Rewrite using rumqttc (async MQTT) and tokio: ~2 MB binary, ~5 MB resident (10x reduction). Built-in reconnection with exponential backoff. Cross-compile for armv7/aarch64 from CI with no Python runtime on device.
Trade-off: Steep learning curve. Fewer Libre Space contributors are Rust-proficient. Debugging on remote Pi hardware is harder without Python's REPL.
Option B: Go agent¶
Go offers a middle ground: compiled binary (~8 MB), garbage collected, strong concurrency via goroutines, and trivial cross-compilation (GOOS=linux GOARCH=arm). Larger binary than Rust but simpler to learn. A fallback if the Rust path proves too steep.
Option C: Improve the Python agent¶
Refactor to use asyncio with aiomqtt (async Paho wrapper) and structured concurrency. Add reconnection, health checks, and a systemd watchdog notifier. This keeps the existing language and contributor familiarity.
- Trade-off: Still ~50 MB resident and still requires a Python runtime on the Pi, but this is the lowest-effort migration path.
Recommendation¶
Phase 1 (v0.4): Refactor the Python agent to async (asyncio + aiomqtt). Add reconnection logic, structured state management (dataclass instead of globals), and systemd integration. This is a 1-2 week effort.
Phase 2 (v0.6+): Prototype a Rust agent for one station. Compare memory, CPU, and reliability over a 30-day trial. If validated, migrate all stations over v0.7-v0.8.
Deferred: Go agent is the fallback if the Rust prototype proves too difficult to maintain.
8. AI/ML Opportunities¶
8.1 Anomaly detection on telemetry¶
Station telemetry (rotator position, signal strength, temperature) streams at 1 Hz over MQTT. A lightweight anomaly detector could flag:
- Rotator stalls (position not tracking commanded azimuth/elevation).
- Signal dropouts during expected passes.
- Thermal excursions in outdoor enclosures.
Approach: A simple statistical model (z-score on rolling window, or Isolation Forest from scikit-learn) running in the Director. No deep learning needed for structured 1 Hz time series with clear failure modes. Alerts published to talos/alerts/{station_id}.
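A sketch of the rolling z-score variant -- window size, warm-up length, and threshold are starting points, not tuned values:

```python
from collections import deque
from statistics import mean, stdev


class ZScoreDetector:
    """Flag 1 Hz telemetry samples that deviate from a rolling baseline."""

    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples: deque = deque(maxlen=window)
        self.threshold = threshold

    def update(self, value: float) -> bool:
        """Feed one sample; return True if it is anomalous vs. the window."""
        anomalous = False
        if len(self.samples) >= 10:  # wait for a minimal baseline
            mu, sigma = mean(self.samples), stdev(self.samples)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True  # would publish to talos/alerts/{station_id}
        self.samples.append(value)
        return anomalous


det = ZScoreDetector()
quiet = [180.0 + 0.1 * (i % 3) for i in range(60)]  # nominal azimuth jitter
alerts = [det.update(v) for v in quiet]
print(any(alerts), det.update(270.0))  # steady data: no alert; jump: alert
```

Running one detector per telemetry channel per station keeps state small enough to live inside the Director process.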
Effort: 2-3 weeks. Requires labeled examples of known failures for validation.
8.2 Signal classification¶
If TALOS integrates IQ sample capture (via SatNOGS flowgraphs or GNU Radio), ML classifiers can identify modulation type, detect interference, or flag unexpected signals. Existing open-source models:
- TorchSig (MIT, 2024): PyTorch library for RF signal classification. Pre-trained models for common modulation schemes.
- SigMF + SatNOGS DB: The SatNOGS observation database contains labeled waterfall plots that could serve as training data.
Trade-off: Requires IQ sample pipeline, which TALOS does not currently have. This is a v1.0+ capability dependent on GNU Radio integration.
8.3 Predictive maintenance¶
Given 3-6 months of telemetry history, predict rotator bearing wear, LNA degradation, or cable weathering from trend analysis on rolling signal-to-noise ratio correlated with rotator age and weather data. Requires telemetry persistence (currently ephemeral MQTT) -- add TimescaleDB before ML models can train.
Recommendation¶
Phase 1 (v0.5): Add telemetry persistence to PostgreSQL with TimescaleDB. Implement z-score anomaly detection on rotator telemetry.
Phase 2 (v0.7): Evaluate TorchSig integration if IQ sample capture is available.
Phase 3 (v1.0+): Predictive maintenance models once 6+ months of telemetry history exists.
9. Prioritized Roadmap¶
Phase 1: v0.4 -- Foundation hardening (Q3 2026, ~6 weeks)¶
| Task | Effort | Depends on | Impact |
|---|---|---|---|
| Extract dashboard JS into ES modules | 1 week | None | Maintainability, testability |
| Replace Paho MQTT JS with MQTT.js | 3 days | JS modularization | MQTT 5.0 support, maintained library |
| Enable MQTT 5.0 on Mosquitto broker | 1 day | None | Shared subscriptions, message expiry |
| Refactor Python agent to asyncio | 2 weeks | None | Reliability, reconnection, testability |
| Add assignment conflict detection | 1 week | None | Data integrity |
| Extract Propagator protocol in Director | 3 days | None | Enables future backend swaps |
| Greedy single-campaign auto-scheduler | 1 week | Conflict detection | Reduces manual work |
Phase 2: v0.5 -- Intelligence layer (Q4 2026, ~8 weeks)¶
| Task | Effort | Depends on | Impact |
|---|---|---|---|
| dSGP4 batch propagation backend | 3 weeks | Propagator protocol | 10-50x faster campaign planning |
| OR-Tools CP-SAT multi-campaign scheduler | 3 weeks | Batch propagation, conflict detection | Optimal pass scheduling |
| TimescaleDB telemetry persistence | 1 week | None | Enables ML, historical analysis |
| Z-score anomaly detection on rotator telemetry | 2 weeks | TimescaleDB | Early failure detection |
| HTMX for non-real-time pages | 2 weeks | JS modularization | Reduced page-load JS, server-driven UI |
| CesiumJS opt-in 3D view | 3 weeks | None | 3D orbit visualization |
Phase 3: v0.6-v0.7 -- Scale and polish (H1 2027, ~12 weeks)¶
| Task | Effort | Depends on | Impact |
|---|---|---|---|
| Rust edge agent prototype | 4 weeks | None | 10x memory reduction on Pi |
| Svelte dashboard (if complexity warrants) | 6 weeks | MQTT.js, JSON API | Component reuse, type safety |
| NATS evaluation (if MQTT 5.0 insufficient) | 2 weeks | Multi-Director deployment | Higher-throughput messaging |
| Rolling-horizon re-optimization | 3 weeks | OR-Tools scheduler | Adaptive scheduling |
| SatNOGS API v2 integration | 2 weeks | None | Updated satellite/transmitter data |
Phase 4: v1.0+ -- Advanced capabilities (H2 2027+)¶
| Task | Effort | Depends on | Impact |
|---|---|---|---|
| IQ sample capture pipeline | 8 weeks | GNU Radio integration | Signal analysis capability |
| TorchSig signal classification | 4 weeks | IQ pipeline | Automated modulation ID |
| Predictive maintenance models | 4 weeks | 6+ months telemetry | Proactive maintenance |
| WebRTC waterfall streaming | 4 weeks | IQ pipeline | Live signal monitoring |
| Full-catalog SSA integration | 3 weeks | dSGP4 batch backend | Conjunction screening |
Technology Decision Matrix¶
| Decision | Current | Recommended | When | Risk |
|---|---|---|---|---|
| Propagation | Skyfield (serial) | Skyfield + dSGP4 (batch) | v0.5 | Medium: PyTorch dependency size |
| Browser MQTT | Paho JS 1.0.1 | MQTT.js | v0.4 | Low: drop-in replacement |
| Broker protocol | MQTT 3.1.1 | MQTT 5.0 | v0.4 | Low: Mosquitto already supports it |
| Scheduling | Manual | Greedy, then OR-Tools CP-SAT | v0.4, v0.5 | Medium: constraint model design |
| 3D visualization | None | CesiumJS (opt-in) | v0.5 | Low: additive, not replacement |
| Frontend | Jinja2 + inline JS | ES modules, then HTMX/Svelte | v0.4, v0.6 | Medium: migration effort |
| Edge agent | Python threads | Async Python, then Rust | v0.4, v0.6 | High: Rust contributor availability |
| Telemetry storage | Ephemeral MQTT | TimescaleDB | v0.5 | Low: PostgreSQL extension |
| Messaging backbone | Mosquitto MQTT | Stay MQTT 5.0; evaluate NATS later | v0.4, v0.7 | Low: incremental upgrade |
Guiding Principles¶
- Incremental migration over big rewrites. Every phase must leave the system deployable and functional. No "dark periods" where the dashboard is half-Svelte, half-Jinja2 without a working state.
- Optimize the bottleneck, not the framework. Batch propagation and automated scheduling deliver more user value than a frontend rewrite. Prioritize compute and operations over aesthetics.
- SatNOGS ecosystem alignment. TALOS exists within the Libre Space ecosystem. Technology choices (MQTT over NATS, Python over Go, open data formats) should maintain interoperability with SatNOGS Network, DB, and community tools.
- Measure before migrating. Before adopting dSGP4, Rust agents, or NATS, run benchmarks against the current stack with realistic workloads (50 stations, 200 satellites, 24-hour horizon). Migrate only when the measured bottleneck justifies the complexity.
- Contributor accessibility. Libre Space is a volunteer-driven community. Prefer technologies with broad adoption (Python, TypeScript, PostgreSQL) over niche tools that limit the contributor pool.