Architecture decisions and optimization techniques from building a CUDA technical analysis library and GPU-first backtest optimization workflows.
Every article includes benchmarks, profiling data, and implementation details from real VectorAlpha projects.
GPU-Accelerated Technical Indicators: Current Benchmark Snapshot
How the VectorAlpha technical analysis library uses CUDA to accelerate indicator workloads on modern GPUs.
Walks through tiled ALMA kernels, shared-memory layouts, and multi-series paths, and uses real benchmarks to show a heavy ALMA batch reaching roughly 20x the throughput of an AVX512 CPU kernel on the same workload.
How the VectorAlpha technical analysis library uses AVX2 and AVX512 kernels to accelerate heavy indicator workloads across more than 300 functions.
Covers kernel selection, windowing patterns, streaming APIs, and batch parameter sweeps built on top of a shared SIMD dot-product core.
Architecture of a GPU-Resident Backtest Optimization App Handling 1B+ Events per Second
Complete architectural breakdown of VectorAlpha's backtest optimization app, designed to sustain 1B+ events per second.
Covers VRAM-resident data pipelines, GPU parameter sweeps, and the end-to-end, on-device computation flow.
Architecture • Optimization
Coming Q1 2026
Lock-Free Data Structures for Real-Time Market Data
Design and implementation of wait-free ring buffers and lock-free order books that handle 10M+ updates per second.
Includes latency profiling, memory-ordering considerations, and a comparison with traditional mutex-based approaches.
Low Latency • C++
Follow Our Technical Work
Watch our repositories for implementation details, benchmarks, and performance analysis updates.