
Gemini WebSocket Terminal

Nov 2025 · c++ · websocket · lock-free · low-latency · order-book

A command-line WebSocket client that connects to the Gemini exchange, subscribes to one or more instruments on the v2 market data feed, and maintains an in-memory order book per symbol. The producer thread reads frames off the wire and applies updates. The consumer thread (main) renders the books with ncurses and computes basic latency stats. Built as a self-directed learning project covering TLS handshakes, lock-free SPSC queues, and the trade-offs between mutex, atomic load/store, and wait-free coordination on a critical path.

See it running

Gemini WebSocket Terminal demo running in macOS Terminal, showing order book updates streaming in real time

What’s interesting about this

The producer/consumer boundary on the order book is configurable at compile time between three strategies, and the trade-offs map directly to what I deal with day to day in production feed handlers:

//#define ENABLE_MUTEX              // baseline: lock the shared book on every update
//#define ENABLE_ATOMIC_LOADSTORE   // publish a fresh book via an atomic shared_ptr
#define ENABLE_ATOMIC_WAITFREE      // SPSC ring buffer of deltas (the default here)

Each one represents a different point on the spectrum between simplicity, correctness guarantees, and producer-side latency. Subscribing to six instruments simultaneously (BTCUSD, ETHUSD, SOLUSD, XRPUSD, DOGEUSD, DOTUSD) gives a rough producer-side latency comparison across the three.

Numbers are from the in-app stats panel, not a rigorous benchmark, but the ordering and magnitudes match what theory predicts.

Stack

C++17, Boost.Beast (WebSocket and async I/O), OpenSSL 3.6.2 (TLS), nlohmann/json (parsing), ncurses (display). Originally built on macOS Apple Silicon, then ported to Ubuntu 24.04 under WSL.

Design decisions

Three concurrency strategies, selectable at compile time

The natural first instinct is std::mutex. It’s correct, easy to reason about, and the contention window is short. But on a hot path that processes every L2 update, taking and releasing a lock that can park the thread in the kernel under contention is the kind of thing that forecloses any chance of pushing latency lower. So the mutex variant exists mainly as a baseline.
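In sketch form, with simplified stand-in types rather than the repo’s actual ones, the mutex boundary looks roughly like this:

#include <map>
#include <mutex>
#include <utility>

// Simplified sketch of the ENABLE_MUTEX strategy: one lock guards the shared
// book; the producer takes it on every L2 update, the consumer on every render.
struct SharedBook {
    std::mutex mtx;
    std::map<double, double> bids, asks;   // price -> quantity

    void apply(bool is_bid, double price, double qty) {   // producer thread
        std::lock_guard<std::mutex> lock(mtx);
        auto& side = is_bid ? bids : asks;
        if (qty == 0.0) side.erase(price); else side[price] = qty;
    }

    std::pair<double, double> top_of_book() {             // consumer thread
        std::lock_guard<std::mutex> lock(mtx);
        return {bids.empty() ? 0.0 : bids.rbegin()->first,
                asks.empty() ? 0.0 : asks.begin()->first};
    }
};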

The atomic load/store variant uses std::atomic<std::shared_ptr<OrderBook>> so the producer publishes a fresh book pointer after each update and the consumer loads it without locking. This works conceptually but requires a full copy of the book on every update because the producer can’t mutate in place while the consumer might be reading. That copy cost shows up as 1 ms producer time and makes the variant a poor fit for anything beyond demonstrating that the load/store works.
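Sketched with the C++17 spelling of the same idea (the std::atomic_load / std::atomic_store free functions on shared_ptr; std::atomic<std::shared_ptr> proper only arrives in C++20) and simplified types:

#include <atomic>
#include <map>
#include <memory>
#include <utility>

// Simplified sketch of ENABLE_ATOMIC_LOADSTORE: the producer copies the
// current book, mutates the copy, and publishes it; the consumer loads a
// snapshot without locking. The full copy per update is the cost described above.
struct OrderBook { std::map<double, double> bids, asks; };   // price -> quantity

std::shared_ptr<OrderBook> g_book = std::make_shared<OrderBook>();

void publish_update(bool is_bid, double price, double qty) {  // producer thread
    auto next = std::make_shared<OrderBook>(*std::atomic_load(&g_book));
    auto& side = is_bid ? next->bids : next->asks;
    if (qty == 0.0) side.erase(price); else side[price] = qty;
    std::atomic_store(&g_book, std::move(next));
}

std::shared_ptr<const OrderBook> snapshot() {                 // consumer thread
    return std::atomic_load(&g_book);                         // snapshot stays valid while held
}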

The wait-free variant uses a single-producer single-consumer ring buffer for order book deltas. The producer enqueues updates and never blocks. The consumer dequeues and applies them to its own local view of the book. Because there’s exactly one writer and one reader, neither side ever waits for the other. The producer time drops to under 0.12 ms because the producer’s only job per update is “append to a ring.” All bookkeeping is paid by the consumer, which is fine because the consumer is bounded by display refresh, not feed rate.
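In sketch form (not the repo’s implementation: a power-of-two capacity, acquire/release indices, and a fixed 64-byte alignment where the real code picks the size per architecture), the queue looks roughly like this:

#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Minimal SPSC ring of book deltas. Head and tail live on separate cache
// lines to avoid false sharing; the only synchronisation is one release
// store per enqueue and one per dequeue.
struct BookDelta {
    bool   is_bid;
    double price;
    double quantity;   // zero means "remove this price level"
};

template <std::size_t N>   // N must be a power of two
class SpscRing {
public:
    bool try_push(const BookDelta& d) {          // producer thread only
        const auto head = head_.load(std::memory_order_relaxed);
        if (head - tail_.load(std::memory_order_acquire) == N)
            return false;                        // full: caller decides the policy
        slots_[head & (N - 1)] = d;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    std::optional<BookDelta> try_pop() {         // consumer thread only
        const auto tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt;                 // empty
        BookDelta d = slots_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return d;
    }

private:
    alignas(64) std::atomic<std::size_t> head_{0};   // written by producer
    alignas(64) std::atomic<std::size_t> tail_{0};   // written by consumer
    std::array<BookDelta, N> slots_{};
};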

The order book itself

Two balanced maps for bids and asks, plus a hash-map cache from order ID to map iterator for O(1) removals. Vectors would give better cache locality and are on the list of things to try next, but the map version makes price-level removals and ordered traversal for the top-of-book straightforward, which mattered while I was still getting the rest of the system stable.
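A simplified sketch of that shape, showing only the bid side, with the cache keyed by an integer id standing in for whatever the feed keys removals on:

#include <cstdint>
#include <functional>
#include <map>
#include <unordered_map>

// Bids sort descending so begin() is top-of-book, plus a hash-map cache from
// an id to the level's map iterator so removals are O(1). std::map iterators
// stay valid until the element they point at is erased, which is what makes
// caching them safe.
class Book {
public:
    void upsert_bid(std::uint64_t id, double price, double qty) {
        auto [it, inserted] = bids_.insert_or_assign(price, qty);
        if (inserted) bid_cache_[id] = it;
    }

    void remove_bid(std::uint64_t id) {            // quantity-zero update
        auto it = bid_cache_.find(id);
        if (it == bid_cache_.end()) return;
        bids_.erase(it->second);                   // O(1) via the cached iterator
        bid_cache_.erase(it);
    }

    double best_bid() const { return bids_.empty() ? 0.0 : bids_.begin()->first; }

private:
    using BidMap = std::map<double, double, std::greater<double>>;   // price -> quantity
    BidMap bids_;
    std::unordered_map<std::uint64_t, BidMap::iterator> bid_cache_;
};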

Cache line size set per architecture

The codebase compiles for both x86_64 and ARM (M-series Macs), and the appropriate CACHE_LINE_SYS_SIZE is selected at compile time. Wrong cache line assumptions lead to false sharing on the SPSC queue’s head and tail indices, which silently destroys throughput. Worth getting right.
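The selection itself is small. Something along these lines, with the macro name from the project and the usual 64-byte x86_64 / 128-byte Apple Silicon values:

#include <atomic>
#include <cstddef>

#if defined(__APPLE__) && defined(__aarch64__)
  #define CACHE_LINE_SYS_SIZE 128   // M-series cores use 128-byte lines
#else
  #define CACHE_LINE_SYS_SIZE 64    // x86_64 (and most other targets)
#endif

// Keeping the producer-written head and the consumer-written tail on separate
// cache lines is what prevents the false sharing described above.
struct QueueIndices {
    alignas(CACHE_LINE_SYS_SIZE) std::atomic<std::size_t> head{0};
    alignas(CACHE_LINE_SYS_SIZE) std::atomic<std::size_t> tail{0};
};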

Trade-offs and things deliberately scoped out

Single venue. Architecturally the project supports an LRU cache of connection endpoints for multi-venue work, but only the Gemini handshake and parsing are wired up today. Adding another venue would mean a second parser (every exchange has its own L2 update format) and reconciling differences in sequence-number semantics, snapshot delivery, and heartbeat cadence. Out of scope for a learning project, but the structure is there.

Single producer, single consumer. The whole concurrency story rests on this being SPSC. If I wanted to subscribe across multiple connections in parallel, the wait-free strategy stops applying as written, and I’d need either an MPSC queue per book or a dedicated SPSC queue and consumer per connection.

Display refresh runs on the consumer. ncurses redraw cost is visible on the consumer’s critical path, but since the consumer comfortably keeps up with the feed rate, this hasn’t been a real constraint. If it became one, the obvious move is to split rendering into its own thread.

No perf profiling yet. All measurements come from in-app timing around the producer and consumer paths. A real flamegraph from perf would probably surface things the ad-hoc timers miss.

A few things I had to debug along the way

The interesting bugs were the ones with non-obvious causes:

Crossed books. Bids appearing above asks. Cause: I was applying L2 updates without handling deletions. The Gemini feed signals price-level removal with quantity zero (documented under their balance updates section, which took some finding). Once that was wired in, the books stayed clean.

Segfaults during the atomic load/store variant. I wasn’t copying the loaded shared_ptr before dereferencing in the writer, so I was effectively mutating a book another thread might be reading. Fixed once I realised the issue, though it pushed me back to mutexes briefly while I worked out the SPSC queue properly. The deeper issue was that my OrderBook is non-copyable because it holds iterators into its own maps, so a copy constructor that preserved cache integrity was needed. Lesson: when something says “non-trivially copyable,” it’s worth pausing.
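In sketch form, the fix is a copy constructor that copies the maps and then re-resolves each cached iterator against the new maps, rather than letting the memberwise copy carry over iterators that still point into the source object:

#include <cstdint>
#include <map>
#include <unordered_map>

// Sketch of the copy-constructor fix: the default memberwise copy duplicates
// the maps but leaves the cached iterators pointing into the *source* book.
// (Only the bid side is shown.)
class Book {
public:
    Book() = default;

    Book(const Book& other) : bids_(other.bids_) {
        for (const auto& [id, src_it] : other.cache_)
            cache_[id] = bids_.find(src_it->first);   // same price, new map
    }

private:
    std::map<double, double> bids_;                                        // price -> quantity
    std::unordered_map<std::uint64_t, std::map<double, double>::iterator> cache_;
};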

Heartbeats counted as book updates. The atomic variant would occasionally segfault because the producer thought every incoming frame contained book data. Once I started filtering heartbeats out at the parsing stage, the variant stabilised. A small thing that wasn’t obvious from the docs at first read.
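The filter itself is a one-liner once the frame is parsed. Something like this, with the type string from the v2 feed’s JSON (treat the exact field names as illustrative):

#include <nlohmann/json.hpp>
#include <string>

// Only frames that actually carry book data reach the book-update path;
// heartbeats (and anything else) fall through harmlessly.
bool is_book_update(const std::string& frame) {
    const auto msg = nlohmann::json::parse(frame, nullptr, /*allow_exceptions=*/false);
    if (msg.is_discarded() || !msg.contains("type")) return false;
    return msg["type"] == "l2_updates";
}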

mvwprintw warnings on Linux. The Linux build flagged “format not a string literal and no format arguments” on every ncurses call where I’d passed a std::string’s c_str() directly as the format argument. That’s an arbitrary memory access waiting to happen. Switching to mvwprintw(win, y, x, "%s", str.c_str()) everywhere fixed it. Should have been obvious from the start.

What I’d do differently

A spinlock-based variant would round out the concurrency comparison nicely. Vectors instead of maps for the order book is the next clear performance win, particularly with cache locality in mind. Real perf profiling would replace the ad-hoc producer-time stats. And the multi-venue support that the LRU cache hints at is the most interesting extension, because it forces a clean parser abstraction and multi-connection lifecycle management.

Source

The repository lives on a self-hosted Gitea instance and isn’t publicly accessible. A read-only snapshot synced at build time is browsable here:

Browse the source →

Happy to walk through the code in person too, or share specific files on request.