Architecture decisions and optimization techniques from building a CUDA technical analysis
library and GPU-first backtest optimization workflows. Every article includes benchmarks,
profiling data, and implementation details from real VectorAlpha projects.
How the VectorAlpha technical analysis library uses CUDA to accelerate indicator
workloads on modern GPUs. Walks through tiled ALMA kernels, shared-memory layouts, and
multi-series paths, and uses real benchmarks to show how a heavy ALMA batch reaches
roughly 20x the throughput of an AVX512 CPU kernel on the same workload.
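ALMA's weighting scheme itself is public (a normalized Gaussian placed inside the window), so the indicator the benchmark batches can be sketched in scalar form. This is an illustrative sketch, not VectorAlpha's API; the function names are made up for the example.

```python
import math

def alma_weights(window: int, offset: float = 0.85, sigma: float = 6.0) -> list[float]:
    """Gaussian weights for the Arnaud Legoux Moving Average (ALMA)."""
    m = offset * (window - 1)   # where the Gaussian peak sits inside the window
    s = window / sigma          # width of the Gaussian
    w = [math.exp(-((i - m) ** 2) / (2.0 * s * s)) for i in range(window)]
    total = sum(w)
    return [x / total for x in w]   # normalize so the weights sum to 1

def alma(series: list[float], window: int,
         offset: float = 0.85, sigma: float = 6.0) -> list[float]:
    """Apply ALMA over a series; each output is aligned to its window's last bar."""
    w = alma_weights(window, offset, sigma)
    return [
        sum(series[i + j] * w[j] for j in range(window))
        for i in range(len(series) - window + 1)
    ]
```

The GPU version the article describes tiles many of these sliding dot products across threads; the math per output element is the same.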
How the VectorAlpha technical analysis library uses AVX2 and AVX512 kernels to
accelerate heavy indicator workloads across 340 functions. Covers kernel
selection, windowing patterns, streaming APIs, and batch parameter sweeps built on top
of a shared SIMD dot-product core.
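The "shared dot-product core" idea can be shown with a scalar stand-in: many windowed indicators reduce to a sliding dot product against a fixed weight vector, so one core can serve both single indicators and batch parameter sweeps. The names here are hypothetical, and the real core uses AVX2/AVX512 intrinsics rather than a Python loop.

```python
def dot(xs: list[float], ws: list[float]) -> float:
    """The shared core: in the real library this is a SIMD kernel; scalar here."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc += x * w
    return acc

def weighted_ma(series: list[float], weights: list[float]) -> list[float]:
    """Any fixed-weight moving average is a sliding dot product over the series."""
    n = len(weights)
    return [dot(series[i:i + n], weights) for i in range(len(series) - n + 1)]

def param_sweep(series: list[float],
                weight_sets: dict[str, list[float]]) -> dict[str, list[float]]:
    """Batch parameter sweep: reuse the same core for every weight vector."""
    return {name: weighted_ma(series, w) for name, w in weight_sets.items()}
```

Swapping the scalar `dot` for a vectorized one accelerates every indicator built on it at once, which is the point of sharing the core.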
VRAM-Resident CUDA Dispatch for Technical Indicators
How VectorTA's CUDA dispatch layer separates the host compatibility path from the
device-native path, uses validated device views for pointer-in and pointer-out
execution, and makes upload-once, dispatch-many workflows practical. Includes the
3.129 ms ALMA benchmark and the 58,300-backtest Tauri demo result.
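The upload-once, dispatch-many shape can be modeled without a GPU: data crosses the host/device boundary exactly twice, and everything in between consumes and produces device-side views. This is a minimal sketch with a plain Python class standing in for a validated device view; none of these names are VectorTA's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeviceView:
    """Hypothetical stand-in for a validated view over a device-resident buffer.
    A real dispatch layer would wrap a device pointer plus length/dtype checks;
    here an immutable tuple models data that stays 'in VRAM'."""
    data: tuple

    def __len__(self) -> int:
        return len(self.data)

def upload(host_series: list[float]) -> DeviceView:
    """The single host-to-device copy at the start of the pipeline."""
    return DeviceView(tuple(float(x) for x in host_series))

def sma_device(view: DeviceView, window: int) -> DeviceView:
    """Pointer-in, pointer-out: takes a device view, returns a device view,
    with no intermediate download back to the host."""
    out = [sum(view.data[i:i + window]) / window
           for i in range(len(view) - window + 1)]
    return DeviceView(tuple(out))

def download(view: DeviceView) -> list[float]:
    """The single device-to-host copy at the very end."""
    return list(view.data)
```

Chaining many `*_device` calls between one `upload` and one `download` is what makes large dispatch counts cheap: the per-call cost is a kernel launch, not a round trip.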
How VectorGrid turns exact grid search into the default product path by keeping price
data, indicators, and backtest execution in VRAM. Includes the 58,300-pair ALMA
benchmark on 200,000 bars, the tighter 1 GB VRAM-budget run, and the CPU fallback
story.
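"Exact" grid search here means exhaustively scoring every parameter pair rather than sampling or pruning. A CPU toy version shows the shape of the sweep; the crossover score below is a made-up stand-in for a real backtest, and on the GPU each pair would be evaluated by a kernel instead of this loop.

```python
from itertools import product

def backtest_score(series: list[float], fast: int, slow: int) -> int:
    """Toy stand-in for a backtest: score a fast/slow moving-average crossover
    by counting bars where the fast average sits above the slow one."""
    def sma(w: int) -> list[float]:
        return [sum(series[i:i + w]) / w for i in range(len(series) - w + 1)]
    f, s = sma(fast), sma(slow)
    f = f[len(f) - len(s):]            # align both outputs to the same end bars
    return sum(1 for a, b in zip(f, s) if a > b)

def exact_grid_search(series: list[float], fast_range, slow_range):
    """Exhaustive sweep over every (fast, slow) pair -- no sampling, no pruning.
    Returns (best_score, fast, slow)."""
    best = None
    for fast, slow in product(fast_range, slow_range):
        if fast >= slow:
            continue                   # skip degenerate pairs
        score = backtest_score(series, fast, slow)
        if best is None or score > best[0]:
            best = (score, fast, slow)
    return best
```

Keeping `series` and every intermediate indicator on the device is what lets a sweep like this scale to tens of thousands of pairs without being bottlenecked on PCIe transfers.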