TALOS v0.5 -- Technology Decisions¶
Date: April 2026
Scope: Technology evaluation and decisions for v0.5 and beyond
Author: Engineering (automated review)
1. Overview¶
This document records the technology evaluations and decisions made for the TALOS v0.5+ roadmap. Each section presents the problem, evaluates alternatives, and states the decision with rationale. These decisions are revisable as the system scales and requirements change.
2. Satellite Propagation: dSGP4 vs Skyfield¶
2.1 Problem¶
TALOS needs SGP4 propagation for two workloads:
1. Real-time tracking -- propagate current satellite positions at 2 Hz for each active campaign.
2. Batch prediction -- compute pass windows, ground tracks, and scheduling data for many satellites over 24+ hour horizons.
Skyfield handles workload (1) adequately but is slow for workload (2) at scale.
2.2 Evaluation¶
| Criterion | Skyfield | dSGP4 |
|---|---|---|
| Single-point speed | ~1.5ms per satellite | ~2ms per satellite (overhead of PyTorch) |
| Batch speed (100 sats x 1000 steps) | ~12s (sequential) | ~0.12s GPU, ~1.2s CPU |
| Dependencies | NumPy, jplephem (~50 MB) | PyTorch (~200 MB CPU, ~2 GB CUDA) |
| API maturity | Excellent (stable since 2014) | Good (v1.1.5, stable API) |
| Earth orientation data | Built-in (IERS) | None (pure SGP4) |
| Topocentric computation | Built-in | Manual (TEME to topocentric conversion) |
| GPU support | No | Yes (CUDA, Metal) |
| conda-forge | Yes | Yes |
2.3 Decision¶
Keep both. Use Skyfield as default; dSGP4 as optional batch backend.
Rationale:
- Skyfield is superior for real-time single-point propagation (richer API, built-in topocentric/Doppler computation, lighter dependency).
- dSGP4 is superior for batch workloads (10-100x faster for campaign planning, pass prediction, ground track generation).
- The PropagatorProtocol interface (introduced in v0.4) enables runtime backend selection.
- dSGP4's PyTorch dependency is large; making it optional keeps the default Docker image small.
Configuration:
# Default: Skyfield for real-time, Skyfield for batch
TALOS_PROPAGATOR=skyfield
# Optional: Skyfield for real-time, dSGP4 for batch
TALOS_PROPAGATOR=skyfield
TALOS_BATCH_PROPAGATOR=dsgp4
# Full dSGP4 (requires PyTorch)
TALOS_PROPAGATOR=dsgp4
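As an illustration of the selection logic these variables drive, a minimal sketch -- the resolver function and its error handling are hypothetical; only the `PropagatorProtocol` name, the environment variables, and the backend names come from this document:

```python
import os


def resolve_backend(var: str = "TALOS_PROPAGATOR", default: str = "skyfield") -> str:
    """Resolve a propagator backend name from the environment.

    Batch code would call resolve_backend("TALOS_BATCH_PROPAGATOR",
    default=resolve_backend()), so dSGP4 is only used where explicitly
    configured and everything else falls back to Skyfield.
    """
    name = os.environ.get(var, default)
    if name not in ("skyfield", "dsgp4"):
        raise ValueError(f"unknown propagator backend: {name!r}")
    return name
```

With no variables set, both the real-time and batch paths resolve to `skyfield`, matching the default configuration above.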
3. Scheduling: OR-Tools CP-SAT¶
3.1 Problem¶
Station-to-campaign assignment is manual. Automated scheduling needs a constraint solver that handles interval scheduling with overlap constraints, priority optimization, and antenna slew time transitions.
3.2 Evaluation¶
| Solver | Type | Interval scheduling | Python API | Install size | License |
|---|---|---|---|---|---|
| OR-Tools CP-SAT | CP + SAT hybrid | Native (IntervalVar, NoOverlap) | Excellent | ~60 MB | Apache 2.0 |
| PuLP + CBC | Linear programming | Manual encoding | Good | ~20 MB | MIT / EPL |
| Gurobi | Mixed integer programming | Via constraints | Good | ~500 MB | Commercial |
| OptaPlanner | Constraint satisfaction | Java-native | JVM only | N/A | Apache 2.0 |
| Custom greedy | Heuristic | Manual implementation | N/A | 0 | N/A |
3.3 Decision¶
OR-Tools CP-SAT for multi-campaign optimization. Greedy fallback for single campaigns.
Rationale:
- CP-SAT has native interval scheduling primitives (new_interval_var, add_no_overlap) that directly model the pass scheduling problem.
- The Python API is well-documented and widely used in production scheduling systems.
- Solves problems with 10,000+ variables in seconds.
- Apache 2.0 license is compatible with TALOS (AGPL-3.0).
- 60 MB install size is acceptable.
- The greedy fallback ensures the system works even if OR-Tools is not installed (e.g., constrained edge deployments).
Key API patterns:
from ortools.sat.python import cp_model

model = cp_model.CpModel()

# Interval variables represent pass windows
interval = model.new_optional_fixed_size_interval_var(
    start=start_time,
    size=duration,
    is_present=is_assigned_var,
    name="pass_1",
)

# No-overlap ensures one pass at a time per station
model.add_no_overlap([interval_1, interval_2, interval_3])

# Objective: maximize weighted pass quality
model.maximize(sum(weight[i] * is_assigned[i] for i in range(n)))

solver = cp_model.CpSolver()
solver.parameters.max_time_in_seconds = 10.0
status = solver.solve(model)
4. 3D Visualization: CesiumJS¶
4.1 Problem¶
The Leaflet 2D map cannot convey orbital altitude, coverage geometry, or the spatial relationship between satellites and ground stations. A 3D globe provides more intuitive visualization for operators managing a satellite ground station network.
4.2 Evaluation¶
| Library | Rendering | Globe | Satellite support | Size | License |
|---|---|---|---|---|---|
| CesiumJS | WebGL | Full 3D globe | CZML format, native orbits | ~30 MB | Apache 2.0 |
| Three.js + globe | WebGL | Custom sphere | Manual orbit rendering | ~5 MB | MIT |
| Leaflet (current) | Canvas/SVG | 2D projection | Marker + polyline | ~200 KB | BSD-2 |
| Mapbox GL | WebGL | 2.5D tilt | No native satellite support | ~3 MB | BSD-3 (open) |
| Cesium for Unreal | Unreal Engine | Photorealistic | Full | ~1 GB+ | Custom |
4.3 Decision¶
CesiumJS for v0.6. Keep Leaflet as default 2D view.
Rationale:
- CesiumJS is the industry standard for web-based satellite visualization. Organizations like AGI (now Ansys), NASA, and ESA use it.
- CZML format provides a clean data contract between the TALOS Director (Python, server-side) and the visualization layer (JavaScript, client-side).
- The czml3 Python library generates valid CZML documents without requiring CesiumJS server-side.
- Apache 2.0 license is compatible.
- 30 MB asset size is acceptable for an opt-in feature (loaded only when user selects 3D view).
- Leaflet remains the default for users on constrained hardware or slow connections.
CZML generation pipeline:
Director
|-- Propagate satellite positions (existing)
|-- Format as CZML (new: czml3 library)
|
v
FastAPI endpoint: GET /api/v1/org/{slug}/czml
|
v
CesiumJS viewer (client-side, loaded on demand)
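Since CZML is plain JSON, the shape of the document the endpoint would return can be sketched without czml3. The packet fields follow the CZML specification; the satellite id, epoch, and coordinates below are illustrative:

```python
import json


def build_czml(sat_id: str, epoch_iso: str, cartesian_samples: list) -> str:
    """Assemble a minimal CZML document: a preamble packet plus one
    satellite packet whose position is a time-tagged cartesian series
    ([t_offset_seconds, x, y, z, ...] per the CZML position property)."""
    doc = [
        {"id": "document", "name": "TALOS", "version": "1.0"},
        {
            "id": sat_id,
            "position": {
                "epoch": epoch_iso,
                "referenceFrame": "INERTIAL",
                "cartesian": cartesian_samples,
            },
        },
    ]
    return json.dumps(doc)
```

In practice the cartesian series would come straight from the Director's existing propagation step, with czml3 handling document assembly and validation.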
5. Message Broker: NATS vs MQTT 5.0¶
5.1 Problem¶
As the station network grows beyond 100 stations, the MQTT broker becomes a potential bottleneck. MQTT 5.0 introduces shared subscriptions (load balancing across subscribers) but the ecosystem tooling is less mature than NATS for high-throughput distributed systems.
5.2 Evaluation¶
| Criterion | MQTT 5.0 (Mosquitto/EMQX) | NATS + JetStream |
|---|---|---|
| Protocol | MQTT 5.0 (pub/sub, QoS 0/1/2) | NATS (pub/sub, request/reply, streams) |
| Shared subscriptions | Yes (MQTT 5.0 spec) | Yes (queue groups, native) |
| Persistence | QoS 1/2, retained messages | JetStream (durable streams, replay) |
| Clustering | EMQX native; Mosquitto via bridge | Built-in (RAFT consensus) |
| Edge compatibility | Excellent (lightweight protocol, constrained devices) | Good (nats-server is lightweight, but less IoT tooling) |
| Python client | paho-mqtt, aiomqtt (mature) | nats-py (mature) |
| Rust client | rumqttc (excellent) | nats.rs (excellent) |
| Auth | Username/password, TLS client certs, ACLs | Token, NKey, JWT, accounts |
| Existing TALOS integration | Full (agent, director, core all use MQTT) | None |
| Typical throughput | ~100K msg/s (EMQX), ~10K msg/s (Mosquitto) | ~10M msg/s |
5.3 Decision¶
Stay with MQTT 5.0 for now. Evaluate NATS at 500+ stations.
Rationale:
- TALOS has significant MQTT investment: topic hierarchy, ACLs, agent protocol, director publishing, WebSocket relay. Migration cost is high.
- Mosquitto handles 10K msg/s. At 200 stations publishing at 2 Hz, the message rate is ~400 msg/s -- well within capacity.
- MQTT 5.0 shared subscriptions (supported by EMQX) provide horizontal scaling for the Director if needed.
- NATS is technically superior for high-throughput distributed systems but the migration cost is not justified until MQTT becomes a demonstrated bottleneck.
- The agent protocol abstraction (MQTT topics and schemas in shared/) makes a future migration feasible without rewriting business logic.
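The shared-subscription scaling path mentioned above uses the MQTT 5.0 `$share/<group>/<filter>` syntax, in which the broker delivers each matching message to exactly one member of the group. A small helper sketch (the group and topic names are illustrative, not from the TALOS topic hierarchy):

```python
def shared_filter(group: str, topic_filter: str) -> str:
    """Wrap a topic filter in MQTT 5.0 shared-subscription syntax.

    A broker delivers each message matching topic_filter to exactly one
    member of the named share group, which is how multiple Director
    workers could split telemetry load without duplicate processing.
    """
    if not group or any(c in group for c in "/+#"):
        raise ValueError("invalid share group name")
    return f"$share/{group}/{topic_filter}"


# Hypothetical Director pool subscribing to station telemetry:
DIRECTOR_SUB = shared_filter("director-pool", "talos/+/gs/+/telemetry")
```

Each Director worker subscribes with the same filter; the broker round-robins messages across the pool.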
Trigger for NATS evaluation:
| Metric | Threshold |
|---|---|
| Active stations | > 500 |
| Broker CPU utilization | > 80% sustained |
| Message delivery latency p99 | > 50ms |
| Need for request/reply pattern | Yes |
| Multi-region clustering required | Yes |
6. Edge Agent: Rust Agent¶
6.1 Problem¶
The Python agent works well on Raspberry Pi 4 (1+ GB RAM, quad-core ARM). On more constrained hardware (Pi Zero, microcontrollers), Python's memory footprint (~30 MB) and startup time (~2 seconds) are significant.
6.2 Evaluation¶
| Criterion | Python (current) | Rust |
|---|---|---|
| Memory footprint | ~30 MB | ~2 MB |
| Startup time | ~2s | ~50ms |
| Binary size | N/A (interpreted) | ~5 MB (statically linked) |
| MQTT client | aiomqtt (mature) | rumqttc (mature, async) |
| Hamlib bindings | Python ctypes | hamlib-sys (FFI) |
| Cross-compilation | N/A | cross-rs (ARM, MIPS, RISC-V) |
| Development speed | Fast | Moderate |
| Deployment | pip install + venv | Single binary, no runtime |
6.3 Decision¶
Defer Rust agent to v0.7+. Document the architecture pattern.
Rationale:
- The Python agent is 109 lines and works on all target hardware (Pi 3/4/5).
- No immediate requirement for constrained hardware support.
- The Rust rewrite effort (~2 weeks) is not justified until there is a concrete deployment target that cannot run Python.
- Documenting the architecture pattern now (tokio + rumqttc + hamlib-sys) ensures the design is ready when needed.
Documented architecture:
# Cargo.toml dependencies
[dependencies]
tokio = { version = "1", features = ["full"] }
rumqttc = "0.24"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tracing = "0.1"
tracing-subscriber = "0.3"

// Main structure
use rumqttc::{AsyncClient, Event, Incoming, QoS};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize MQTT client (options: MqttOptions built from agent config)
    let (client, mut eventloop) = AsyncClient::new(options, 100);

    // Subscribe to command topics
    client.subscribe("talos/+/gs/+/command/pointing", QoS::AtLeastOnce).await?;
    client.subscribe("talos/+/gs/+/command/radio", QoS::AtLeastOnce).await?;

    // Event loop: dispatch incoming commands to the agent's handler
    while let Ok(event) = eventloop.poll().await {
        match event {
            Event::Incoming(Incoming::Publish(msg)) => {
                handle_command(&msg).await;
            }
            _ => {}
        }
    }
    Ok(())
}
7. Telemetry Persistence: TimescaleDB¶
7.1 Problem¶
TALOS publishes tracking telemetry over MQTT but does not persist it. Prometheus captures Director metrics but not per-station measurement data. Post-pass analysis, tracking accuracy assessment, and historical trend analysis are not possible.
7.2 Evaluation¶
| Database | Time-series optimized | SQL compatible | PostgreSQL extension | Compression | Continuous aggregates |
|---|---|---|---|---|---|
| TimescaleDB | Yes | Full SQL | Yes (extension) | 90%+ | Yes (materialized views) |
| InfluxDB | Yes | InfluxQL/Flux | No (standalone) | Yes | Continuous queries |
| QuestDB | Yes | SQL subset | No (standalone) | Yes | No |
| Plain PostgreSQL | No (manual partitioning) | Full SQL | N/A | TOAST only | No |
| Prometheus | Yes (metrics only) | PromQL | No | Yes | Recording rules |
7.3 Decision¶
TimescaleDB for telemetry persistence.
Rationale:
- TALOS already uses PostgreSQL. TimescaleDB is a PostgreSQL extension, not a separate database. This means:
- Same connection string, same SQLAlchemy engine, same Alembic migrations.
- No additional operational overhead (backup, monitoring, connection pooling).
- Full SQL compatibility for ad-hoc queries.
- sqlalchemy-timescaledb (v0.4.1) provides hypertable support in the ORM layer.
- Continuous aggregates replace manual aggregation queries for dashboard statistics.
- Compression achieves 90%+ reduction on older data, making multi-month retention feasible.
- InfluxDB and QuestDB would require a second database in the infrastructure, second backup strategy, and a second query language.
Schema highlights:
# Hypertable partitioned on the timestamp column
from sqlalchemy import Column, DateTime, String

class TrackingMeasurement(Base):  # Base: the project's declarative base
    __tablename__ = "tracking_measurements"

    timestamp = Column(DateTime(timezone=True), primary_key=True)
    station_id = Column(String(64), primary_key=True)
    # ... measurement fields ...

    __table_args__ = (
        {"timescaledb_hypertable": {"time_column_name": "timestamp"}},
    )
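One of the continuous aggregates mentioned in the rationale could look like the following DDL, which a migration would execute against the hypertable. The view name, bucket width, and aggregated columns are illustrative, not from the source; the syntax follows TimescaleDB's continuous-aggregate documentation:

```python
# Hypothetical hourly rollup for dashboard statistics. TimescaleDB keeps
# the view refreshed incrementally instead of re-aggregating raw rows on
# every dashboard query.
HOURLY_STATS_DDL = """
CREATE MATERIALIZED VIEW tracking_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', timestamp) AS bucket,
       station_id,
       count(*) AS n_measurements
FROM tracking_measurements
GROUP BY bucket, station_id;
"""
```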
Data volume estimates:
| Stations | Rows/hour | Rows/day | Compressed size/day |
|---|---|---|---|
| 10 | 72,000 | 1.7M | ~10 MB |
| 50 | 360,000 | 8.6M | ~50 MB |
| 100 | 720,000 | 17.3M | ~100 MB |
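The row counts above follow directly from the 2 Hz publish rate, assuming one measurement row per station per tick (which is what the table implies). A quick arithmetic check:

```python
def telemetry_volume(stations: int, rate_hz: float = 2.0) -> tuple:
    """Rows per hour and per day for a given station count, assuming
    one measurement row per station per publish tick."""
    per_hour = int(stations * rate_hz * 3600)
    return per_hour, per_hour * 24
```

Ten stations yield 72,000 rows/hour and about 1.7M rows/day, matching the table; volume scales linearly with station count.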
8. Decision Summary¶
| Technology | Decision | Version | Rationale |
|---|---|---|---|
| dSGP4 | v0.5, optional | v0.5 | 10-100x batch speedup; keep Skyfield as default |
| OR-Tools CP-SAT | v0.5, required | v0.5 | Native interval scheduling; greedy fallback |
| CesiumJS | v0.6, opt-in | v0.6 | Industry-standard 3D satellite visualization |
| HTMX | v0.6, incremental | v0.6 | Simplifies CRUD pages without architectural change |
| CCSDS OMM | v0.5 support, v0.6 migration | v0.5-v0.6 | Catalog number overflow deadline July 2026 |
| CCSDS TDM | v0.5, export only | v0.5 | Standard tracking data interchange |
| NATS | Deferred | v0.7+ | MQTT sufficient to 500 stations; migration cost high |
| Rust agent | Deferred | v0.7+ | Python agent sufficient; no constrained hardware target |
| TimescaleDB | v0.5, required | v0.5 | PostgreSQL extension; no new infrastructure |
| SoapySDR | v0.5, optional | v0.5 | IQ capture without GNU Radio dependency |
9. Dependency Impact¶
9.1 New Required Dependencies (v0.5)¶
| Package | Size | Purpose |
|---|---|---|
| ortools | ~60 MB | Scheduling solver |
| sqlalchemy-timescaledb | ~100 KB | Hypertable ORM support |
| httpx | already installed | CelesTrak client |
9.2 New Optional Dependencies (v0.5)¶
| Package | Size | Purpose | Required when |
|---|---|---|---|
| dsgp4 | ~2 MB | Batch propagation | TALOS_BATCH_PROPAGATOR=dsgp4 |
| torch (CPU) | ~200 MB | dSGP4 runtime | dSGP4 enabled |
| SoapySDR | ~5 MB | IQ capture | Agent with SDR hardware |
| matplotlib | ~30 MB | Waterfall generation | IQ capture enabled |
9.3 New Dependencies (v0.6)¶
| Package | Size | Purpose |
|---|---|---|
| czml3 | ~50 KB | CZML document generation |
| cesium (frontend) | ~30 MB | 3D globe (static asset) |
| htmx.org (frontend) | ~14 KB | HTML-over-the-wire |
9.4 Docker Image Size Impact¶
| Image | Current | After v0.5 | After v0.6 |
|---|---|---|---|
| talos-core | ~350 MB | ~420 MB (+ortools) | ~460 MB (+czml3, htmx assets) |
| talos-director | ~300 MB | ~370 MB (+ortools) | ~400 MB (+czml3) |
| talos-agent | ~150 MB | ~150 MB (no change) | ~180 MB (+SoapySDR optional) |
| talos-agent-dsgp4 | N/A | ~380 MB (new, with torch CPU) | Same |
10. Review Schedule¶
These decisions should be revisited at the following milestones:
| Milestone | Review |
|---|---|
| 50 active stations | Re-evaluate MQTT broker capacity; confirm dSGP4 needed for batch |
| 100 active stations | Re-evaluate NATS migration; assess regional sharding need |
| July 2026 | Confirm OMM migration complete; verify 6-digit catalog numbers work |
| 500 active stations | Full architecture review; NATS, Rust agent, regional sharding decisions |
Summary¶
The technology decisions for v0.5+ follow a principle of incremental adoption with clear migration paths. dSGP4 and OR-Tools address immediate performance and scheduling needs. CesiumJS and HTMX improve the user experience in v0.6. NATS and Rust are deferred until concrete scaling thresholds are reached. TimescaleDB is the highest-confidence decision: it is a PostgreSQL extension that adds time-series capabilities without new infrastructure. The CCSDS OMM migration is the most time-critical decision due to the July 2026 catalog number overflow deadline.