TALOS v0.5 -- Technology Decisions

Date: April 2026
Scope: Technology evaluation and decisions for v0.5 and beyond
Author: Engineering (automated review)


1. Overview

This document records the technology evaluations and decisions made for the TALOS v0.5+ roadmap. Each section presents the problem, evaluates alternatives, and states the decision with rationale. These decisions are revisable as the system scales and requirements change.


2. Satellite Propagation: dSGP4 vs Skyfield

2.1 Problem

TALOS needs SGP4 propagation for two workloads:

  1. Real-time tracking -- Propagate current satellite positions at 2 Hz for each active campaign.
  2. Batch prediction -- Compute pass windows, ground tracks, and scheduling data for many satellites over 24+ hour horizons.

Skyfield handles workload (1) adequately but is slow for workload (2) at scale.

2.2 Evaluation

| Criterion | Skyfield | dSGP4 |
|---|---|---|
| Single-point speed | ~1.5 ms per satellite | ~2 ms per satellite (PyTorch overhead) |
| Batch speed (100 sats x 1000 steps) | ~12 s (sequential) | ~0.12 s GPU, ~1.2 s CPU |
| Dependencies | NumPy, jplephem (~50 MB) | PyTorch (~200 MB CPU, ~2 GB CUDA) |
| API maturity | Excellent (stable since 2014) | Good (v1.1.5, stable API) |
| Earth orientation data | Built-in (IERS) | None (pure SGP4) |
| Topocentric computation | Built-in | Manual (TEME-to-topocentric conversion) |
| GPU support | No | Yes (CUDA, Metal) |
| conda-forge | Yes | Yes |

2.3 Decision

Keep both. Use Skyfield as default; dSGP4 as optional batch backend.

Rationale:

  • Skyfield is superior for real-time single-point propagation (richer API, built-in topocentric/Doppler computation, lighter dependency).
  • dSGP4 is superior for batch workloads (10-100x faster for campaign planning, pass prediction, ground track generation).
  • The PropagatorProtocol interface (introduced in v0.4) enables runtime backend selection.
  • dSGP4's PyTorch dependency is large; making it optional keeps the default Docker image small.

Configuration:

# Default: Skyfield for real-time, Skyfield for batch
TALOS_PROPAGATOR=skyfield

# Optional: Skyfield for real-time, dSGP4 for batch
TALOS_PROPAGATOR=skyfield
TALOS_BATCH_PROPAGATOR=dsgp4

# Full dSGP4 (requires PyTorch)
TALOS_PROPAGATOR=dsgp4
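
These variables can be resolved once at startup. A minimal sketch of the selection logic, assuming the environment variable names above; the PropagatorProtocol shape shown here is illustrative and may differ from the actual v0.4 interface:

```python
import os
from typing import Protocol


class PropagatorProtocol(Protocol):
    """Illustrative sketch; the real v0.4 method names may differ."""

    def propagate(
        self, tle_line1: str, tle_line2: str, minutes_since_epoch: float
    ) -> tuple[float, float, float]: ...


def select_backend(batch: bool = False) -> str:
    """Resolve the backend name from TALOS_PROPAGATOR /
    TALOS_BATCH_PROPAGATOR, defaulting to Skyfield."""
    default = os.environ.get("TALOS_PROPAGATOR", "skyfield")
    if batch:
        # Batch jobs may override the real-time backend
        return os.environ.get("TALOS_BATCH_PROPAGATOR", default)
    return default
```

With only TALOS_PROPAGATOR set, both workloads use the same backend; setting TALOS_BATCH_PROPAGATOR=dsgp4 diverts batch work without touching real-time tracking.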

3. Scheduling: OR-Tools CP-SAT

3.1 Problem

Station-to-campaign assignment is manual. Automated scheduling needs a constraint solver that handles interval scheduling with overlap constraints, priority optimization, and antenna slew time transitions.

3.2 Evaluation

| Solver | Type | Interval scheduling | Python API | Install size | License |
|---|---|---|---|---|---|
| OR-Tools CP-SAT | CP + SAT hybrid | Native (IntervalVar, NoOverlap) | Excellent | ~60 MB | Apache 2.0 |
| PuLP + CBC | Linear programming | Manual encoding | Good | ~20 MB | MIT / EPL |
| Gurobi | Mixed integer programming | Via constraints | Good | ~500 MB | Commercial |
| OptaPlanner | Constraint satisfaction | Java-native | JVM only | N/A | Apache 2.0 |
| Custom greedy | Heuristic | Manual implementation | N/A | 0 | N/A |

3.3 Decision

OR-Tools CP-SAT for multi-campaign optimization. Greedy fallback for single campaigns.

Rationale:

  • CP-SAT has native interval scheduling primitives (NewIntervalVar, AddNoOverlap) that directly model the pass scheduling problem.
  • The Python API is well-documented and widely used in production scheduling systems.
  • Solves problems with 10,000+ variables in seconds.
  • Apache 2.0 license is compatible with TALOS (AGPL-3.0).
  • 60 MB install size is acceptable.
  • The greedy fallback ensures the system works even if OR-Tools is not installed (e.g., constrained edge deployments).
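The greedy fallback mentioned above can be very simple: sort candidate passes by priority, then accept each pass that does not conflict with one already accepted. A sketch under assumed field names (station, start, end, priority), which are illustrative, not the actual TALOS schema:

```python
def greedy_schedule(passes: list[dict]) -> list[dict]:
    """Greedy fallback scheduler: highest priority first, earliest start
    breaking ties; accept a pass only if it does not overlap an
    already-accepted pass on the same station."""
    accepted: list[dict] = []
    for p in sorted(passes, key=lambda p: (-p["priority"], p["start"])):
        conflict = any(
            a["station"] == p["station"]
            and p["start"] < a["end"]
            and a["start"] < p["end"]
            for a in accepted
        )
        if not conflict:
            accepted.append(p)
    return accepted
```

Unlike CP-SAT, this gives no optimality guarantee, but it needs no dependencies and is adequate for a single campaign with few conflicts.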

Key API patterns:

from ortools.sat.python import cp_model

model = cp_model.CpModel()

# Candidate passes for one station: (start, duration, weight)
# in integer solver time units (values illustrative)
passes = [(0, 600, 3), (300, 600, 5), (1200, 400, 2)]
n = len(passes)

# One optional interval per candidate pass; the Boolean selects it
is_assigned = [model.new_bool_var(f"assign_{i}") for i in range(n)]
intervals = [
    model.new_optional_fixed_size_interval_var(
        start=start,
        size=duration,
        is_present=is_assigned[i],
        name=f"pass_{i}",
    )
    for i, (start, duration, _) in enumerate(passes)
]

# No-overlap ensures one pass at a time per station
model.add_no_overlap(intervals)

# Objective: maximize weighted pass quality
model.maximize(sum(passes[i][2] * is_assigned[i] for i in range(n)))

solver = cp_model.CpSolver()
solver.parameters.max_time_in_seconds = 10.0
status = solver.solve(model)

4. 3D Visualization: CesiumJS

4.1 Problem

The Leaflet 2D map cannot convey orbital altitude, coverage geometry, or the spatial relationship between satellites and ground stations. A 3D globe provides more intuitive visualization for operators managing a satellite ground station network.

4.2 Evaluation

| Library | Rendering | Globe | Satellite support | Size | License |
|---|---|---|---|---|---|
| CesiumJS | WebGL | Full 3D globe | CZML format, native orbits | ~30 MB | Apache 2.0 |
| Three.js + globe | WebGL | Custom sphere | Manual orbit rendering | ~5 MB | MIT |
| Leaflet (current) | Canvas/SVG | 2D projection | Marker + polyline | ~200 KB | BSD-2 |
| Mapbox GL | WebGL | 2.5D tilt | No native satellite support | ~3 MB | BSD-3 (open) |
| Cesium for Unreal | Unreal Engine | Photorealistic | Full | ~1 GB+ | Custom |

4.3 Decision

CesiumJS for v0.6. Keep Leaflet as default 2D view.

Rationale:

  • CesiumJS is the industry standard for web-based satellite visualization. Organizations like AGI (now Ansys), NASA, and ESA use it.
  • CZML format provides a clean data contract between the TALOS Director (Python, server-side) and the visualization layer (JavaScript, client-side).
  • The czml3 Python library generates valid CZML documents without requiring CesiumJS server-side.
  • Apache 2.0 license is compatible.
  • 30 MB asset size is acceptable for an opt-in feature (loaded only when user selects 3D view).
  • Leaflet remains the default for users on constrained hardware or slow connections.

CZML generation pipeline:

Director
    |-- Propagate satellite positions (existing)
    |-- Format as CZML (new: czml3 library)
    |
    v
FastAPI endpoint: GET /api/v1/org/{slug}/czml
    |
    v
CesiumJS viewer (client-side, loaded on demand)
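
CZML itself is plain JSON: a list of packets whose first element must be the document packet declaring the version. A minimal sketch of the "Format as CZML" step without any library dependency; the field names of the satellites input (name, epoch, samples) are illustrative, not the actual Director schema:

```python
import json


def build_czml(satellites: list[dict]) -> str:
    """Serialize propagated positions as a CZML document string.
    Each satellite dict carries an epoch and a flat list of
    [t_offset_s, x, y, z, ...] cartesian samples in meters."""
    packets: list[dict] = [
        # Mandatory document packet, always first
        {"id": "document", "name": "TALOS", "version": "1.0"}
    ]
    for sat in satellites:
        packets.append(
            {
                "id": f"sat/{sat['name']}",
                "position": {
                    "epoch": sat["epoch"],
                    "cartesian": sat["samples"],
                },
                "point": {"pixelSize": 6},
            }
        )
    return json.dumps(packets)
```

In practice the czml3 library mentioned above would produce these packets with validation; the hand-rolled version only illustrates the data contract the FastAPI endpoint serves.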

5. Message Broker: NATS vs MQTT 5.0

5.1 Problem

As the station network grows beyond 100 stations, the MQTT broker becomes a potential bottleneck. MQTT 5.0 introduces shared subscriptions (load balancing across subscribers), but for high-throughput distributed systems its ecosystem tooling is less mature than NATS's.

5.2 Evaluation

| Criterion | MQTT 5.0 (Mosquitto/EMQX) | NATS + JetStream |
|---|---|---|
| Protocol | MQTT 5.0 (pub/sub, QoS 0/1/2) | NATS (pub/sub, request/reply, streams) |
| Shared subscriptions | Yes (MQTT 5.0 spec) | Yes (queue groups, native) |
| Persistence | QoS 1/2, retained messages | JetStream (durable streams, replay) |
| Clustering | EMQX native; Mosquitto via bridge | Built-in (Raft consensus) |
| Edge compatibility | Excellent (lightweight protocol, constrained devices) | Good (nats-server is lightweight, but less IoT tooling) |
| Python client | paho-mqtt, aiomqtt (mature) | nats-py (mature) |
| Rust client | rumqttc (excellent) | nats.rs (excellent) |
| Auth | Username/password, TLS client certs, ACLs | Token, NKey, JWT, accounts |
| Existing TALOS integration | Full (agent, director, core all use MQTT) | None |
| Typical throughput | ~100K msg/s (EMQX), ~10K msg/s (Mosquitto) | ~10M msg/s |

5.3 Decision

Stay with MQTT 5.0 for now. Evaluate NATS at 500+ stations.

Rationale:

  • TALOS has significant MQTT investment: topic hierarchy, ACLs, agent protocol, director publishing, WebSocket relay. Migration cost is high.
  • Mosquitto handles 10K msg/s. At 200 stations publishing at 2 Hz, the message rate is ~400 msg/s -- well within capacity.
  • MQTT 5.0 shared subscriptions (supported by EMQX) provide horizontal scaling for the Director if needed.
  • NATS is technically superior for high-throughput distributed systems but the migration cost is not justified until MQTT becomes a demonstrated bottleneck.
  • The agent protocol abstraction (MQTT topics and schemas in shared/) makes a future migration feasible without rewriting business logic.
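
Shared subscriptions require no special client-library support: per the MQTT 5.0 specification, the group is encoded in the topic filter itself using the reserved $share/{group}/{filter} form. A small sketch; the telemetry topic filter shown is illustrative:

```python
def shared_subscription(group: str, topic_filter: str) -> str:
    """Build an MQTT 5.0 shared-subscription filter. Every subscriber
    in `group` receives a share of the matching messages instead of
    each receiving every message."""
    return f"$share/{group}/{topic_filter}"


# Two Director replicas subscribing with this filter would split the load:
director_filter = shared_subscription("directors", "talos/+/gs/+/telemetry")
```

Note that shared subscriptions are broker-dependent in practice: EMQX supports them; Mosquitto added support only in recent releases.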

Trigger for NATS evaluation:

| Metric | Threshold |
|---|---|
| Active stations | > 500 |
| Broker CPU utilization | > 80% sustained |
| Message delivery latency | p99 > 50 ms |
| Need for request/reply pattern | Yes |
| Multi-region clustering required | Yes |

6. Edge Agent: Rust Agent

6.1 Problem

The Python agent works well on Raspberry Pi 4 (1+ GB RAM, quad-core ARM). On more constrained hardware (Pi Zero, microcontrollers), Python's memory footprint (~30 MB) and startup time (~2 seconds) are significant.

6.2 Evaluation

| Criterion | Python (current) | Rust |
|---|---|---|
| Memory footprint | ~30 MB | ~2 MB |
| Startup time | ~2 s | ~50 ms |
| Binary size | N/A (interpreted) | ~5 MB (statically linked) |
| MQTT client | aiomqtt (mature) | rumqttc (mature, async) |
| Hamlib bindings | Python ctypes | hamlib-sys (FFI) |
| Cross-compilation | N/A | cross-rs (ARM, MIPS, RISC-V) |
| Development speed | Fast | Moderate |
| Deployment | pip install + venv | Single binary, no runtime |

6.3 Decision

Defer Rust agent to v0.7+. Document the architecture pattern.

Rationale:

  • The Python agent is 109 lines and works on all target hardware (Pi 3/4/5).
  • No immediate requirement for constrained hardware support.
  • The Rust rewrite effort (~2 weeks) is not justified until there is a concrete deployment target that cannot run Python.
  • Documenting the architecture pattern now (tokio + rumqttc + hamlib-sys) ensures the design is ready when needed.

Documented architecture:

// Cargo.toml dependencies
[dependencies]
tokio = { version = "1", features = ["full"] }
rumqttc = "0.24"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tracing = "0.1"
tracing-subscriber = "0.3"

// Main structure (sketch; broker address and handle_command are placeholders)
use rumqttc::{AsyncClient, Event, Incoming, MqttOptions, QoS};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure and create the MQTT client
    let options = MqttOptions::new("talos-agent", "broker.local", 1883);
    let (client, mut eventloop) = AsyncClient::new(options, 100);

    // Subscribe to command topics
    client
        .subscribe("talos/+/gs/+/command/pointing", QoS::AtLeastOnce)
        .await?;
    client
        .subscribe("talos/+/gs/+/command/radio", QoS::AtLeastOnce)
        .await?;

    // Event loop: dispatch incoming publishes to the command handler
    while let Ok(event) = eventloop.poll().await {
        if let Event::Incoming(Incoming::Publish(msg)) = event {
            handle_command(&msg).await;
        }
    }
    Ok(())
}

7. Telemetry Persistence: TimescaleDB

7.1 Problem

TALOS publishes tracking telemetry over MQTT but does not persist it. Prometheus captures Director metrics but not per-station measurement data. Post-pass analysis, tracking accuracy assessment, and historical trend analysis are not possible.

7.2 Evaluation

| Database | Time-series optimized | SQL compatible | PostgreSQL extension | Compression | Continuous aggregates |
|---|---|---|---|---|---|
| TimescaleDB | Yes | Full SQL | Yes (extension) | 90%+ | Yes (materialized views) |
| InfluxDB | Yes | InfluxQL/Flux | No (standalone) | Yes | Continuous queries |
| QuestDB | Yes | SQL subset | No (standalone) | Yes | No |
| Plain PostgreSQL | No (manual partitioning) | Full SQL | N/A | TOAST only | No |
| Prometheus | Yes (metrics only) | PromQL | No | Yes | Recording rules |

7.3 Decision

TimescaleDB for telemetry persistence.

Rationale:

  • TALOS already uses PostgreSQL. TimescaleDB is a PostgreSQL extension, not a separate database. This means:
    • Same connection string, same SQLAlchemy engine, same Alembic migrations.
    • No additional operational overhead (backup, monitoring, connection pooling).
    • Full SQL compatibility for ad-hoc queries.
  • sqlalchemy-timescaledb (v0.4.1) provides hypertable support in the ORM layer.
  • Continuous aggregates replace manual aggregation queries for dashboard statistics.
  • Compression achieves 90%+ reduction on older data, making multi-month retention feasible.
  • InfluxDB and QuestDB would require a second database in the infrastructure, second backup strategy, and a second query language.
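
A continuous aggregate is ordinary DDL. As a sketch, a helper that renders the TimescaleDB statement for an hourly per-station rollup; the view and column names are illustrative, not the actual TALOS schema:

```python
def continuous_aggregate_ddl(view: str, source: str, bucket: str = "1 hour") -> str:
    """Render TimescaleDB DDL for a continuous aggregate over a
    hypertable, bucketing rows with time_bucket()."""
    return (
        f"CREATE MATERIALIZED VIEW {view}\n"
        f"WITH (timescaledb.continuous) AS\n"
        f"SELECT time_bucket('{bucket}', timestamp) AS bucket,\n"
        f"       station_id,\n"
        f"       count(*) AS samples\n"
        f"FROM {source}\n"
        f"GROUP BY bucket, station_id;"
    )
```

The rendered statement would be executed once as a migration; TimescaleDB then keeps the view incrementally up to date, replacing the manual aggregation queries mentioned above.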

Schema highlights:

# Hypertable declared via sqlalchemy-timescaledb: the extension
# partitions rows into chunks on the timestamp column
from sqlalchemy import Column, DateTime, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TrackingMeasurement(Base):
    __tablename__ = "tracking_measurements"
    timestamp = Column(DateTime(timezone=True), primary_key=True)
    station_id = Column(String(64), primary_key=True)
    # ... measurement fields ...

    __table_args__ = (
        {"timescaledb_hypertable": {"time_column_name": "timestamp"}},
    )

Data volume estimates:

| Stations | Rows/hour | Rows/day | Compressed size/day |
|---|---|---|---|
| 10 | 72,000 | 1.7M | ~10 MB |
| 50 | 360,000 | 8.6M | ~50 MB |
| 100 | 720,000 | 17.3M | ~100 MB |
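
These row counts follow directly from the 2 Hz per-station publish rate; a trivial check of the arithmetic:

```python
def telemetry_rows(stations: int, rate_hz: float = 2.0) -> tuple[int, int]:
    """Rows per hour and per day for `stations` stations each
    publishing one measurement row at `rate_hz` (2 Hz in TALOS)."""
    per_hour = int(stations * rate_hz * 3600)
    return per_hour, per_hour * 24
```

For example, 10 stations yield 72,000 rows/hour and about 1.7M rows/day, matching the estimates above.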

8. Decision Summary

| Technology | Decision | Version | Rationale |
|---|---|---|---|
| dSGP4 | v0.5, optional | v0.5 | 10-100x batch speedup; keep Skyfield as default |
| OR-Tools CP-SAT | v0.5, required | v0.5 | Native interval scheduling; greedy fallback |
| CesiumJS | v0.6, opt-in | v0.6 | Industry-standard 3D satellite visualization |
| HTMX | v0.6, incremental | v0.6 | Simplifies CRUD pages without architectural change |
| CCSDS OMM | v0.5 support, v0.6 migration | v0.5-v0.6 | Catalog number overflow deadline July 2026 |
| CCSDS TDM | v0.5, export only | v0.5 | Standard tracking data interchange |
| NATS | Deferred | v0.7+ | MQTT sufficient to 500 stations; migration cost high |
| Rust agent | Deferred | v0.7+ | Python agent sufficient; no constrained hardware target |
| TimescaleDB | v0.5, required | v0.5 | PostgreSQL extension; no new infrastructure |
| SoapySDR | v0.5, optional | v0.5 | IQ capture without GNU Radio dependency |

9. Dependency Impact

9.1 New Required Dependencies (v0.5)

| Package | Size | Purpose |
|---|---|---|
| ortools | ~60 MB | Scheduling solver |
| sqlalchemy-timescaledb | ~100 KB | Hypertable ORM support |
| httpx | already installed | CelesTrak client |

9.2 New Optional Dependencies (v0.5)

| Package | Size | Purpose | Required when |
|---|---|---|---|
| dsgp4 | ~2 MB | Batch propagation | TALOS_BATCH_PROPAGATOR=dsgp4 |
| torch (CPU) | ~200 MB | dSGP4 runtime | dSGP4 enabled |
| SoapySDR | ~5 MB | IQ capture | Agent with SDR hardware |
| matplotlib | ~30 MB | Waterfall generation | IQ capture enabled |

9.3 New Dependencies (v0.6)

| Package | Size | Purpose |
|---|---|---|
| czml3 | ~50 KB | CZML document generation |
| cesium (frontend) | ~30 MB | 3D globe (static asset) |
| htmx.org (frontend) | ~14 KB | HTML-over-the-wire |

9.4 Docker Image Size Impact

| Image | Current | After v0.5 | After v0.6 |
|---|---|---|---|
| talos-core | ~350 MB | ~420 MB (+ortools) | ~460 MB (+czml3, htmx assets) |
| talos-director | ~300 MB | ~370 MB (+ortools) | ~400 MB (+czml3) |
| talos-agent | ~150 MB | ~150 MB (no change) | ~180 MB (+SoapySDR optional) |
| talos-agent-dsgp4 | N/A | ~380 MB (new, with torch CPU) | Same |

10. Review Schedule

These decisions should be revisited at the following milestones:

| Milestone | Review |
|---|---|
| 50 active stations | Re-evaluate MQTT broker capacity; confirm dSGP4 needed for batch |
| 100 active stations | Re-evaluate NATS migration; assess regional sharding need |
| July 2026 | Confirm OMM migration complete; verify 6-digit catalog numbers work |
| 500 active stations | Full architecture review; NATS, Rust agent, regional sharding decisions |

Summary

The technology decisions for v0.5+ follow a principle of incremental adoption with clear migration paths. dSGP4 and OR-Tools address immediate performance and scheduling needs. CesiumJS and HTMX improve the user experience in v0.6. NATS and Rust are deferred until concrete scaling thresholds are reached. TimescaleDB is the highest-confidence decision: it is a PostgreSQL extension that adds time-series capabilities without new infrastructure. The CCSDS OMM migration is the most time-critical decision due to the July 2026 catalog number overflow deadline.