TALOS v0.5 -- Technology Decisions

Date: April 2026
Scope: Technology evaluation and decisions for v0.5 and beyond
Author: Engineering (automated review)


1. Overview

This document records the technology evaluations and decisions made for the TALOS v0.5+ roadmap. Each section presents the problem, evaluates alternatives, and states the decision with rationale. These decisions are revisable as the system scales and requirements change.


2. Satellite Propagation: dSGP4 vs Skyfield

2.1 Problem

TALOS needs SGP4 propagation for two workloads:

  1. Real-time tracking -- Propagate current satellite positions at 2 Hz for each active campaign.
  2. Batch prediction -- Compute pass windows, ground tracks, and scheduling data for many satellites over 24+ hour horizons.

Skyfield handles workload (1) adequately but is slow for workload (2) at scale.

2.2 Evaluation

| Criterion | Skyfield | dSGP4 |
|---|---|---|
| Single-point speed | ~1.5 ms per satellite | ~2 ms per satellite (PyTorch overhead) |
| Batch speed (100 sats x 1000 steps) | ~12 s (sequential) | ~0.12 s GPU, ~1.2 s CPU |
| Dependencies | NumPy, jplephem (~50 MB) | PyTorch (~200 MB CPU, ~2 GB CUDA) |
| API maturity | Excellent (stable since 2014) | Good (v1.1.5, stable API) |
| Earth orientation data | Built-in (IERS) | None (pure SGP4) |
| Topocentric computation | Built-in | Manual (TEME-to-topocentric conversion) |
| GPU support | No | Yes (CUDA, Metal) |
| conda-forge | Yes | Yes |

2.3 Decision

Keep both. Use Skyfield as default; dSGP4 as optional batch backend.

Rationale:

  • Skyfield is superior for real-time single-point propagation (richer API, built-in topocentric/Doppler computation, lighter dependency).
  • dSGP4 is superior for batch workloads (10-100x faster for campaign planning, pass prediction, ground track generation).
  • The PropagatorProtocol interface (introduced in v0.4) enables runtime backend selection.
  • dSGP4's PyTorch dependency is large; making it optional keeps the default Docker image small.

Configuration:

# Default: Skyfield for real-time, Skyfield for batch
TALOS_PROPAGATOR=skyfield

# Optional: Skyfield for real-time, dSGP4 for batch
TALOS_PROPAGATOR=skyfield
TALOS_BATCH_PROPAGATOR=dsgp4

# Full dSGP4 (requires PyTorch)
TALOS_PROPAGATOR=dsgp4
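
These variables can be resolved once at startup. A minimal sketch of the selection logic, assuming the environment variable names above; the PropagatorProtocol shape shown here is illustrative and may differ from the actual v0.4 interface:

```python
import os
from typing import Protocol


class PropagatorProtocol(Protocol):
    """Illustrative sketch; the real v0.4 method names may differ."""

    def propagate(
        self, tle_line1: str, tle_line2: str, minutes_since_epoch: float
    ) -> tuple[float, float, float]: ...


def select_backend(batch: bool = False) -> str:
    """Resolve the backend name from TALOS_PROPAGATOR /
    TALOS_BATCH_PROPAGATOR, defaulting to Skyfield."""
    default = os.environ.get("TALOS_PROPAGATOR", "skyfield")
    if batch:
        # Batch jobs may override the real-time backend
        return os.environ.get("TALOS_BATCH_PROPAGATOR", default)
    return default
```

With only TALOS_PROPAGATOR set, both workloads use the same backend; setting TALOS_BATCH_PROPAGATOR=dsgp4 diverts batch work without touching real-time tracking.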

3. Scheduling: OR-Tools CP-SAT

3.1 Problem

Station-to-campaign assignment is manual. Automated scheduling needs a constraint solver that handles interval scheduling with overlap constraints, priority optimization, and antenna slew time transitions.

3.2 Evaluation

| Solver | Type | Interval scheduling | Python API | Install size | License |
|---|---|---|---|---|---|
| OR-Tools CP-SAT | CP + SAT hybrid | Native (IntervalVar, NoOverlap) | Excellent | ~60 MB | Apache 2.0 |
| PuLP + CBC | Linear programming | Manual encoding | Good | ~20 MB | MIT / EPL |
| Gurobi | Mixed integer programming | Via constraints | Good | ~500 MB | Commercial |
| OptaPlanner | Constraint satisfaction | Java-native | JVM only | N/A | Apache 2.0 |
| Custom greedy | Heuristic | Manual implementation | N/A | 0 | N/A |

3.3 Decision

OR-Tools CP-SAT for multi-campaign optimization. Greedy fallback for single campaigns.

Rationale:

  • CP-SAT has native interval scheduling primitives (NewIntervalVar, AddNoOverlap) that directly model the pass scheduling problem.
  • The Python API is well-documented and widely used in production scheduling systems.
  • Solves problems with 10,000+ variables in seconds.
  • Apache 2.0 license is compatible with TALOS (AGPL-3.0).
  • 60 MB install size is acceptable.
  • The greedy fallback ensures the system works even if OR-Tools is not installed (e.g., constrained edge deployments).
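The greedy fallback mentioned above can be very simple: sort candidate passes by priority, then accept each pass that does not conflict with one already accepted. A sketch under assumed field names (station, start, end, priority), which are illustrative, not the actual TALOS schema:

```python
def greedy_schedule(passes: list[dict]) -> list[dict]:
    """Greedy fallback scheduler: highest priority first, earliest start
    breaking ties; accept a pass only if it does not overlap an
    already-accepted pass on the same station."""
    accepted: list[dict] = []
    for p in sorted(passes, key=lambda p: (-p["priority"], p["start"])):
        conflict = any(
            a["station"] == p["station"]
            and p["start"] < a["end"]
            and a["start"] < p["end"]
            for a in accepted
        )
        if not conflict:
            accepted.append(p)
    return accepted
```

Unlike CP-SAT, this gives no optimality guarantee, but it needs no dependencies and is adequate for a single campaign with few conflicts.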

Key API patterns:

from ortools.sat.python import cp_model

model = cp_model.CpModel()

# Candidate passes for one station: (start, duration, weight)
# in integer solver time units (values illustrative)
passes = [(0, 600, 3), (300, 600, 5), (1200, 400, 2)]
n = len(passes)

# One optional interval per candidate pass; the Boolean selects it
is_assigned = [model.new_bool_var(f"assign_{i}") for i in range(n)]
intervals = [
    model.new_optional_fixed_size_interval_var(
        start=start,
        size=duration,
        is_present=is_assigned[i],
        name=f"pass_{i}",
    )
    for i, (start, duration, _) in enumerate(passes)
]

# No-overlap ensures one pass at a time per station
model.add_no_overlap(intervals)

# Objective: maximize weighted pass quality
model.maximize(sum(passes[i][2] * is_assigned[i] for i in range(n)))

solver = cp_model.CpSolver()
solver.parameters.max_time_in_seconds = 10.0
status = solver.solve(model)

4. 3D Visualization: CesiumJS

4.1 Problem

The Leaflet 2D map cannot convey orbital altitude, coverage geometry, or the spatial relationship between satellites and ground stations. A 3D globe provides more intuitive visualization for operators managing a satellite ground station network.

4.2 Evaluation

| Library | Rendering | Globe | Satellite support | Size | License |
|---|---|---|---|---|---|
| CesiumJS | WebGL | Full 3D globe | CZML format, native orbits | ~30 MB | Apache 2.0 |
| Three.js + globe | WebGL | Custom sphere | Manual orbit rendering | ~5 MB | MIT |
| Leaflet (current) | Canvas/SVG | 2D projection | Marker + polyline | ~200 KB | BSD-2 |
| Mapbox GL | WebGL | 2.5D tilt | No native satellite support | ~3 MB | BSD-3 (open) |
| Cesium for Unreal | Unreal Engine | Photorealistic | Full | ~1 GB+ | Custom |

4.3 Decision

CesiumJS for v0.6. Keep Leaflet as default 2D view.

Rationale:

  • CesiumJS is the industry standard for web-based satellite visualization. Organizations like AGI (now Ansys), NASA, and ESA use it.
  • CZML format provides a clean data contract between the TALOS Director (Python, server-side) and the visualization layer (JavaScript, client-side).
  • The czml3 Python library generates valid CZML documents without requiring CesiumJS server-side.
  • Apache 2.0 license is compatible.
  • 30 MB asset size is acceptable for an opt-in feature (loaded only when user selects 3D view).
  • Leaflet remains the default for users on constrained hardware or slow connections.

CZML generation pipeline:

Director
    |-- Propagate satellite positions (existing)
    |-- Format as CZML (new: czml3 library)
    |
    v
FastAPI endpoint: GET /api/v1/org/{slug}/czml
    |
    v
CesiumJS viewer (client-side, loaded on demand)
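
CZML itself is plain JSON: a list of packets whose first element must be the document packet declaring the version. A minimal sketch of the "Format as CZML" step without any library dependency; the field names of the satellites input (name, epoch, samples) are illustrative, not the actual Director schema:

```python
import json


def build_czml(satellites: list[dict]) -> str:
    """Serialize propagated positions as a CZML document string.
    Each satellite dict carries an epoch and a flat list of
    [t_offset_s, x, y, z, ...] cartesian samples in meters."""
    packets: list[dict] = [
        # Mandatory document packet, always first
        {"id": "document", "name": "TALOS", "version": "1.0"}
    ]
    for sat in satellites:
        packets.append(
            {
                "id": f"sat/{sat['name']}",
                "position": {
                    "epoch": sat["epoch"],
                    "cartesian": sat["samples"],
                },
                "point": {"pixelSize": 6},
            }
        )
    return json.dumps(packets)
```

In practice the czml3 library mentioned above would produce these packets with validation; the hand-rolled version only illustrates the data contract the FastAPI endpoint serves.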

5. Message Broker: NATS vs MQTT 5.0

5.1 Problem

As the station network grows beyond 100 stations, the MQTT broker becomes a potential bottleneck. MQTT 5.0 introduces shared subscriptions (load balancing across subscribers), but for high-throughput distributed systems its ecosystem tooling is less mature than NATS's.

5.2 Evaluation

| Criterion | MQTT 5.0 (Mosquitto/EMQX) | NATS + JetStream |
|---|---|---|
| Protocol | MQTT 5.0 (pub/sub, QoS 0/1/2) | NATS (pub/sub, request/reply, streams) |
| Shared subscriptions | Yes (MQTT 5.0 spec) | Yes (queue groups, native) |
| Persistence | QoS 1/2, retained messages | JetStream (durable streams, replay) |
| Clustering | EMQX native; Mosquitto via bridge | Built-in (Raft consensus) |
| Edge compatibility | Excellent (lightweight protocol, constrained devices) | Good (nats-server is lightweight, but less IoT tooling) |
| Python client | paho-mqtt, aiomqtt (mature) | nats-py (mature) |
| Rust client | rumqttc (excellent) | nats.rs (excellent) |
| Auth | Username/password, TLS client certs, ACLs | Token, NKey, JWT, accounts |
| Existing TALOS integration | Full (agent, director, core all use MQTT) | None |
| Typical throughput | ~100K msg/s (EMQX), ~10K msg/s (Mosquitto) | ~10M msg/s |

5.3 Decision

Stay with MQTT 5.0 for now. Evaluate NATS at 500+ stations.

Rationale:

  • TALOS has significant MQTT investment: topic hierarchy, ACLs, agent protocol, director publishing, WebSocket relay. Migration cost is high.
  • Mosquitto handles 10K msg/s. At 200 stations publishing at 2 Hz, the message rate is ~400 msg/s -- well within capacity.
  • MQTT 5.0 shared subscriptions (supported by EMQX) provide horizontal scaling for the Director if needed.
  • NATS is technically superior for high-throughput distributed systems but the migration cost is not justified until MQTT becomes a demonstrated bottleneck.
  • The agent protocol abstraction (MQTT topics and schemas in shared/) makes a future migration feasible without rewriting business logic.
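
Shared subscriptions require no special client-library support: per the MQTT 5.0 specification, the group is encoded in the topic filter itself using the reserved $share/{group}/{filter} form. A small sketch; the telemetry topic filter shown is illustrative:

```python
def shared_subscription(group: str, topic_filter: str) -> str:
    """Build an MQTT 5.0 shared-subscription filter. Every subscriber
    in `group` receives a share of the matching messages instead of
    each receiving every message."""
    return f"$share/{group}/{topic_filter}"


# Two Director replicas subscribing with this filter would split the load:
director_filter = shared_subscription("directors", "talos/+/gs/+/telemetry")
```

Note that shared subscriptions are broker-dependent in practice: EMQX supports them; Mosquitto added support only in recent releases.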

Trigger for NATS evaluation:

| Metric | Threshold |
|---|---|
| Active stations | > 500 |
| Broker CPU utilization | > 80% sustained |
| Message delivery latency | p99 > 50 ms |
| Need for request/reply pattern | Yes |
| Multi-region clustering required | Yes |

6. Edge Agent: Rust Agent

6.1 Problem

The Python agent works well on Raspberry Pi 4 (1+ GB RAM, quad-core ARM). On more constrained hardware (Pi Zero, microcontrollers), Python's memory footprint (~30 MB) and startup time (~2 seconds) are significant.

6.2 Evaluation

| Criterion | Python (current) | Rust |
|---|---|---|
| Memory footprint | ~30 MB | ~2 MB |
| Startup time | ~2 s | ~50 ms |
| Binary size | N/A (interpreted) | ~5 MB (statically linked) |
| MQTT client | aiomqtt (mature) | rumqttc (mature, async) |
| Hamlib bindings | Python ctypes | hamlib-sys (FFI) |
| Cross-compilation | N/A | cross-rs (ARM, MIPS, RISC-V) |
| Development speed | Fast | Moderate |
| Deployment | pip install + venv | Single binary, no runtime |

6.3 Decision

Defer Rust agent to v0.7+. Document the architecture pattern.

Rationale:

  • The Python agent is 109 lines and works on all target hardware (Pi 3/4/5).
  • No immediate requirement for constrained hardware support.
  • The Rust rewrite effort (~2 weeks) is not justified until there is a concrete deployment target that cannot run Python.
  • Documenting the architecture pattern now (tokio + rumqttc + hamlib-sys) ensures the design is ready when needed.

Documented architecture:

// Cargo.toml dependencies
[dependencies]
tokio = { version = "1", features = ["full"] }
rumqttc = "0.24"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tracing = "0.1"
tracing-subscriber = "0.3"

// Main structure (sketch; broker address and handle_command are placeholders)
use rumqttc::{AsyncClient, Event, Incoming, MqttOptions, QoS};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Configure and create the MQTT client
    let options = MqttOptions::new("talos-agent", "broker.local", 1883);
    let (client, mut eventloop) = AsyncClient::new(options, 100);

    // Subscribe to command topics
    client
        .subscribe("talos/+/gs/+/command/pointing", QoS::AtLeastOnce)
        .await?;
    client
        .subscribe("talos/+/gs/+/command/radio", QoS::AtLeastOnce)
        .await?;

    // Event loop: dispatch incoming publishes to the command handler
    while let Ok(event) = eventloop.poll().await {
        if let Event::Incoming(Incoming::Publish(msg)) = event {
            handle_command(&msg).await;
        }
    }
    Ok(())
}

7. Telemetry Persistence: TimescaleDB

7.1 Problem

TALOS publishes tracking telemetry over MQTT but does not persist it. Prometheus captures Director metrics but not per-station measurement data. Post-pass analysis, tracking accuracy assessment, and historical trend analysis are not possible.

7.2 Evaluation

| Database | Time-series optimized | SQL compatible | PostgreSQL extension | Compression | Continuous aggregates |
|---|---|---|---|---|---|
| TimescaleDB | Yes | Full SQL | Yes (extension) | 90%+ | Yes (materialized views) |
| InfluxDB | Yes | InfluxQL/Flux | No (standalone) | Yes | Continuous queries |
| QuestDB | Yes | SQL subset | No (standalone) | Yes | No |
| Plain PostgreSQL | No (manual partitioning) | Full SQL | N/A | TOAST only | No |
| Prometheus | Yes (metrics only) | PromQL | No | Yes | Recording rules |

7.3 Decision

TimescaleDB for telemetry persistence.

Rationale:

  • TALOS already uses PostgreSQL. TimescaleDB is a PostgreSQL extension, not a separate database. This means:
    • Same connection string, same SQLAlchemy engine, same Alembic migrations.
    • No additional operational overhead (backup, monitoring, connection pooling).
    • Full SQL compatibility for ad-hoc queries.
  • sqlalchemy-timescaledb (v0.4.1) provides hypertable support in the ORM layer.
  • Continuous aggregates replace manual aggregation queries for dashboard statistics.
  • Compression achieves 90%+ reduction on older data, making multi-month retention feasible.
  • InfluxDB and QuestDB would require a second database in the infrastructure, second backup strategy, and a second query language.
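
A continuous aggregate is ordinary DDL. As a sketch, a helper that renders the TimescaleDB statement for an hourly per-station rollup; the view and column names are illustrative, not the actual TALOS schema:

```python
def continuous_aggregate_ddl(view: str, source: str, bucket: str = "1 hour") -> str:
    """Render TimescaleDB DDL for a continuous aggregate over a
    hypertable, bucketing rows with time_bucket()."""
    return (
        f"CREATE MATERIALIZED VIEW {view}\n"
        f"WITH (timescaledb.continuous) AS\n"
        f"SELECT time_bucket('{bucket}', timestamp) AS bucket,\n"
        f"       station_id,\n"
        f"       count(*) AS samples\n"
        f"FROM {source}\n"
        f"GROUP BY bucket, station_id;"
    )
```

The rendered statement would be executed once as a migration; TimescaleDB then keeps the view incrementally up to date, replacing the manual aggregation queries mentioned above.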

Schema highlights:

# Hypertable declared via sqlalchemy-timescaledb: the extension
# partitions rows into chunks on the timestamp column
from sqlalchemy import Column, DateTime, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TrackingMeasurement(Base):
    __tablename__ = "tracking_measurements"
    timestamp = Column(DateTime(timezone=True), primary_key=True)
    station_id = Column(String(64), primary_key=True)
    # ... measurement fields ...

    __table_args__ = (
        {"timescaledb_hypertable": {"time_column_name": "timestamp"}},
    )

Data volume estimates:

| Stations | Rows/hour | Rows/day | Compressed size/day |
|---|---|---|---|
| 10 | 72,000 | 1.7M | ~10 MB |
| 50 | 360,000 | 8.6M | ~50 MB |
| 100 | 720,000 | 17.3M | ~100 MB |
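
These row counts follow directly from the 2 Hz per-station publish rate; a trivial check of the arithmetic:

```python
def telemetry_rows(stations: int, rate_hz: float = 2.0) -> tuple[int, int]:
    """Rows per hour and per day for `stations` stations each
    publishing one measurement row at `rate_hz` (2 Hz in TALOS)."""
    per_hour = int(stations * rate_hz * 3600)
    return per_hour, per_hour * 24
```

For example, 10 stations yield 72,000 rows/hour and about 1.7M rows/day, matching the estimates above.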

8. Decision Summary

| Technology | Decision | Version | Rationale |
|---|---|---|---|
| dSGP4 | v0.5, optional | v0.5 | 10-100x batch speedup; keep Skyfield as default |
| OR-Tools CP-SAT | v0.5, required | v0.5 | Native interval scheduling; greedy fallback |
| CesiumJS | v0.6, opt-in | v0.6 | Industry-standard 3D satellite visualization |
| HTMX | v0.6, incremental | v0.6 | Simplifies CRUD pages without architectural change |
| CCSDS OMM | v0.5 support, v0.6 migration | v0.5-v0.6 | Catalog number overflow deadline July 2026 |
| CCSDS TDM | v0.5, export only | v0.5 | Standard tracking data interchange |
| NATS | Deferred | v0.7+ | MQTT sufficient to 500 stations; migration cost high |
| Rust agent | Deferred | v0.7+ | Python agent sufficient; no constrained hardware target |
| TimescaleDB | v0.5, required | v0.5 | PostgreSQL extension; no new infrastructure |
| SoapySDR | v0.5, optional | v0.5 | IQ capture without GNU Radio dependency |

9. Dependency Impact

9.1 New Required Dependencies (v0.5)

| Package | Size | Purpose |
|---|---|---|
| ortools | ~60 MB | Scheduling solver |
| sqlalchemy-timescaledb | ~100 KB | Hypertable ORM support |
| httpx | already installed | CelesTrak client |

9.2 New Optional Dependencies (v0.5)

| Package | Size | Purpose | Required when |
|---|---|---|---|
| dsgp4 | ~2 MB | Batch propagation | TALOS_BATCH_PROPAGATOR=dsgp4 |
| torch (CPU) | ~200 MB | dSGP4 runtime | dSGP4 enabled |
| SoapySDR | ~5 MB | IQ capture | Agent with SDR hardware |
| matplotlib | ~30 MB | Waterfall generation | IQ capture enabled |

9.3 New Dependencies (v0.6)

| Package | Size | Purpose |
|---|---|---|
| czml3 | ~50 KB | CZML document generation |
| cesium (frontend) | ~30 MB | 3D globe (static asset) |
| htmx.org (frontend) | ~14 KB | HTML-over-the-wire |

9.4 Docker Image Size Impact

| Image | Current | After v0.5 | After v0.6 |
|---|---|---|---|
| talos-core | ~350 MB | ~420 MB (+ortools) | ~460 MB (+czml3, htmx assets) |
| talos-director | ~300 MB | ~370 MB (+ortools) | ~400 MB (+czml3) |
| talos-agent | ~150 MB | ~150 MB (no change) | ~180 MB (+SoapySDR optional) |
| talos-agent-dsgp4 | N/A | ~380 MB (new, with torch CPU) | Same |

10. Review Schedule

These decisions should be revisited at the following milestones:

| Milestone | Review |
|---|---|
| 50 active stations | Re-evaluate MQTT broker capacity; confirm dSGP4 needed for batch |
| 100 active stations | Re-evaluate NATS migration; assess regional sharding need |
| July 2026 | Confirm OMM migration complete; verify 6-digit catalog numbers work |
| 500 active stations | Full architecture review; NATS, Rust agent, regional sharding decisions |

Summary

The technology decisions for v0.5+ follow a principle of incremental adoption with clear migration paths. dSGP4 and OR-Tools address immediate performance and scheduling needs. CesiumJS and HTMX improve the user experience in v0.6. NATS and Rust are deferred until concrete scaling thresholds are reached. TimescaleDB is the highest-confidence decision: it is a PostgreSQL extension that adds time-series capabilities without new infrastructure. The CCSDS OMM migration is the most time-critical decision due to the July 2026 catalog number overflow deadline.