SIMD vectorization for technical indicators

The VectorAlpha technical analysis library uses AVX2 and AVX512 kernels to speed up heavy indicator workloads on the CPU while keeping a clear scalar reference for correctness.

Design goals

The TA library implements several hundred indicators in Rust. Each indicator has a scalar path that is straightforward to read and easy to test. SIMD kernels sit beside that scalar code and exist only to shrink constant factors rather than change the algorithm. For a given input and set of parameters, the SIMD path is expected to agree with the scalar path within a very small numeric tolerance.

The SIMD work focuses on three patterns that appear across many indicators: sliding window dot products, streaming updates over live data and batch parameter sweeps. Once these patterns are covered by a solid SIMD core, individual indicators can reuse the same machinery instead of each one growing its own bespoke vectorized path.

Why this library was built

Most open source technical analysis libraries still rely on scalar code and whatever auto vectorization a compiler happens to apply. That is often enough for single indicator calls, but it starts to break down when backtesting needs millions of candles, hundreds of indicators and large parameter sweeps. In practice, indicator calculation becomes the part of the system that sets the upper limit on how fast you can explore strategies.

This library exists to push that limit. By adding explicit AVX2 and AVX512 kernels for the heavy inner loops, it brings deliberate SIMD optimization into a space that has mostly stayed scalar. The goal is not to chase peak theoretical throughput but to make it practical to run more backtests in the same wall clock time and to keep the cost of richer indicator sets under control.

The design also looks forward to where CPUs are heading. Newer Intel designs introduce AVX10, which keeps wide vector instructions available, and recent AMD processors continue to support AVX512 style vectors. Investing in clear SIMD kernels today means that as wider lanes become standard across desktop and server hardware, the same indicator code can take advantage of that extra width without major changes.

Common building blocks

Many indicators reduce to a weighted sum over a window of recent samples. ALMA is one example with a gaussian weight curve, but the same idea appears in moving averages, bands and filters. The inner loop looks like a dot product between a window of data and a vector of weights.

The library keeps this dot product in a small set of functions that act as the foundation for AVX2 and AVX512 code. Scalar helpers such as dot_scalar_unrolled_safe implement a simple four wide unrolled loop and act as a baseline. When SIMD features are available, the code paths branch into dot_avx2 or dot_avx512, which load contiguous blocks of doubles and accumulate them with fused multiply add instructions.
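To make the baseline concrete, here is a minimal sketch of a four wide unrolled scalar dot product in the spirit of dot_scalar_unrolled_safe. The function name and exact structure are illustrative, not the library's code; the key idea is four independent accumulators plus a scalar tail.

```rust
// Illustrative sketch of a four wide unrolled scalar dot product
// (in the spirit of dot_scalar_unrolled_safe; not the library's code).
fn dot_unrolled4(data: &[f64], weights: &[f64]) -> f64 {
    assert_eq!(data.len(), weights.len());
    let n = data.len();
    let chunks = n / 4;
    // Four independent accumulators shorten the addition dependency chain.
    let (mut s0, mut s1, mut s2, mut s3) = (0.0, 0.0, 0.0, 0.0);
    for c in 0..chunks {
        let b = c * 4;
        s0 += data[b] * weights[b];
        s1 += data[b + 1] * weights[b + 1];
        s2 += data[b + 2] * weights[b + 2];
        s3 += data[b + 3] * weights[b + 3];
    }
    // Scalar tail for lengths that are not a multiple of four.
    let mut tail = 0.0;
    for i in chunks * 4..n {
        tail += data[i] * weights[i];
    }
    (s0 + s1) + (s2 + s3) + tail
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0, 5.0];
    let weights = [0.1, 0.2, 0.3, 0.2, 0.1];
    println!("{}", dot_unrolled4(&data, &weights));
}
```

The unrolling matters even in scalar code: independent accumulators let the CPU overlap additions instead of serializing them through one register.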

All of this is hidden behind a Kernel enum that describes the available execution modes such as Scalar, Avx2, Avx512 and their batch variants. Callers either pick a kernel explicitly or request Kernel::Auto, which uses feature detection and simple heuristics to choose a reasonable default for the current platform.
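A minimal sketch of that dispatch shape, assuming a variant set like the one described above; the resolve helper and its heuristics are illustrative, but the runtime detection uses the standard is_x86_feature_detected macro:

```rust
// Sketch of a Kernel dispatch enum with runtime feature detection.
// Variant names mirror the article; the resolve() API is an assumption.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kernel {
    Auto,
    Scalar,
    Avx2,
    Avx512,
}

fn resolve(requested: Kernel) -> Kernel {
    match requested {
        Kernel::Auto => {
            #[cfg(target_arch = "x86_64")]
            {
                // Prefer the widest vector unit the CPU actually has.
                if is_x86_feature_detected!("avx512f") {
                    return Kernel::Avx512;
                }
                if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
                    return Kernel::Avx2;
                }
            }
            Kernel::Scalar
        }
        // Explicit requests are honored as-is.
        explicit => explicit,
    }
}

fn main() {
    println!("auto resolves to {:?}", resolve(Kernel::Auto));
}
```

Because detection happens once at dispatch time, the hot inner loops themselves never branch on CPU features.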

Preparing data for SIMD

SIMD kernels work best when data is aligned and windows are contiguous. A preparation step for each indicator handles this setup in one place. Using ALMA as a concrete case, the alma_prepare function first validates the input length, period, sigma and offset and returns clear errors when any parameter is inconsistent with the data.

It then finds the first non NaN sample and determines how many valid points are available from that position onward. With that information it builds the weight vector for the requested parameters, stores the result in an aligned AVec<f64> and rounds the effective period up to a multiple of eight so that both AVX2 and AVX512 kernels can load full vectors with simple pointer arithmetic.
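As a hedged sketch of that preparation step, the following builds the standard ALMA gaussian weights (normalized to sum to one) and pads the stored length up to a multiple of eight so full width loads never read past the real weights. The function name and the zero padding detail are illustrative assumptions, not the library's exact code:

```rust
// Sketch of ALMA weight preparation: gaussian weights normalized to sum
// to one, padded to a multiple of eight lanes. Name and padding choice
// are illustrative; the gaussian form is the standard ALMA definition.
fn alma_weights_padded(period: usize, sigma: f64, offset: f64) -> Vec<f64> {
    let m = offset * (period as f64 - 1.0); // center of the gaussian window
    let s = period as f64 / sigma;          // width of the gaussian
    let mut w: Vec<f64> = (0..period)
        .map(|i| {
            let d = i as f64 - m;
            (-d * d / (2.0 * s * s)).exp()
        })
        .collect();
    let norm: f64 = w.iter().sum();
    for x in &mut w {
        *x /= norm;
    }
    // Round the stored length up to a multiple of eight; zero padding
    // contributes nothing to the dot product but keeps loads full width.
    let padded = (period + 7) / 8 * 8;
    w.resize(padded, 0.0);
    w
}

fn main() {
    let w = alma_weights_padded(9, 6.0, 0.85);
    println!("stored {} weights, sum = {}", w.len(), w.iter().sum::<f64>());
}
```

The real library stores the result in an aligned AVec<f64>; a plain Vec is used here only to keep the sketch dependency free.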

The last decision in this preparation step is the choice of kernel. Callers can specify a concrete value such as Kernel::Scalar or Kernel::Avx512, or they can request Kernel::Auto. The detection path then picks the best available option for the current platform and feature set.

Other indicators reuse the same approach. They generate their own weights or transformation coefficients, store them in aligned buffers and pass both input data and weights into a shared compute function that knows how to dispatch to scalar, AVX2 or AVX512 implementations.

AVX2 kernels

On CPUs that support AVX2 and FMA, the library can process four double precision values in a single instruction. The ALMA implementation shows the general pattern. A top level alma_avx2 function chooses between short and long kernels based on the period length so that small and large windows both stay efficient.

The short kernel keeps control flow simple. For each output index it computes the start of the window, loads blocks of four prices and four weights with _mm256_loadu_pd and combines them with _mm256_fmadd_pd. Any remaining elements that do not fit into a full vector are handled with _mm256_maskload_pd and the same fused multiply add operation. A small horizontal reduction then turns the accumulator vector into a single sum.
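The core of that pattern can be sketched as follows. This is a simplified stand-in, not the library's kernel: it uses an explicit shuffle based horizontal reduction and a scalar tail where the real code uses _mm256_maskload_pd, and it falls back to a plain scalar loop when AVX2 and FMA are not detected:

```rust
// Simplified sketch of the AVX2 dot product pattern: four wide loads,
// fused multiply add, horizontal reduce. Scalar tail stands in for the
// masked-load tail used by the real kernel.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_avx2(data: &[f64], weights: &[f64]) -> f64 {
    use std::arch::x86_64::*;
    unsafe {
        let n = data.len().min(weights.len());
        let chunks = n / 4;
        let mut acc = _mm256_setzero_pd();
        for c in 0..chunks {
            let d = _mm256_loadu_pd(data.as_ptr().add(c * 4));
            let w = _mm256_loadu_pd(weights.as_ptr().add(c * 4));
            acc = _mm256_fmadd_pd(d, w, acc); // acc += d * w, fused
        }
        // Horizontal reduction of the 4-lane accumulator into one f64.
        let hi = _mm256_extractf128_pd(acc, 1);
        let lo = _mm256_castpd256_pd128(acc);
        let pair = _mm_add_pd(lo, hi);
        let one = _mm_add_sd(pair, _mm_unpackhi_pd(pair, pair));
        let mut total = _mm_cvtsd_f64(one);
        // Scalar tail (the real kernel uses _mm256_maskload_pd here).
        for i in chunks * 4..n {
            total += data[i] * weights[i];
        }
        total
    }
}

fn dot(data: &[f64], weights: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            return unsafe { dot_avx2(data, weights) };
        }
    }
    data.iter().zip(weights).map(|(d, w)| d * w).sum()
}

fn main() {
    let data: Vec<f64> = (0..10).map(|i| i as f64).collect();
    let weights = vec![0.1; 10];
    println!("{}", dot(&data, &weights));
}
```

Note that FMA changes rounding slightly relative to separate multiply and add, which is exactly why the library compares against the scalar path with a tolerance rather than bit for bit.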

The long kernel follows the same logic but unrolls the loop to improve instruction level parallelism. It maintains two independent accumulators, each responsible for half of the chunks. This gives the CPU more work to schedule in parallel and helps hide latency when the period spans many cache lines.

Other indicators with windowed sums can plug into this pattern. As long as they can express their inner loop as a weighted sum over contiguous slices, the AVX2 kernels can reuse the same structure and only the weight generation logic needs to change.

AVX512 kernels

When AVX512 is available, the library can operate on eight double precision values per instruction. The design mirrors the AVX2 path but uses __m512d vectors and mask registers to handle tails.

For shorter periods the alma_avx512_short function walks over the window in eight element blocks, multiplies and accumulates with _mm512_fmadd_pd and handles the remaining elements through _mm512_maskz_loadu_pd. A helper called hsum_pd_zmm wraps _mm512_reduce_add_pd to perform the horizontal reduction in a single intrinsic call.

Longer periods benefit from preload and unrolling. The alma_avx512_long path builds an array of weight vectors in advance, either on the stack or the heap depending on the number of chunks. The inner loop then loads price blocks, multiplies by the preloaded weight registers and accumulates into several partial sums at once. A tree style reduction combines those partial sums at the end of the window.
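The shape of that multi accumulator unroll and tree style reduction can be shown without intrinsics. This safe Rust sketch does in scalar lanes what the real kernel does in __m512d registers; the function name and the four way split are illustrative:

```rust
// Safe-Rust sketch of the long-kernel structure: several independent
// partial sums, then a tree style reduction. The real path does the same
// with __m512d registers and preloaded weight vectors.
fn dot_four_accumulators(data: &[f64], weights: &[f64]) -> f64 {
    let n = data.len().min(weights.len());
    // Four independent partial sums so consecutive additions do not form
    // one long serial dependency chain.
    let mut acc = [0.0f64; 4];
    for i in 0..n {
        acc[i % 4] += data[i] * weights[i];
    }
    // Tree style reduction: combine pairs, then combine the pair sums.
    (acc[0] + acc[1]) + (acc[2] + acc[3])
}

fn main() {
    let data: Vec<f64> = (1..=8).map(|i| i as f64).collect();
    let weights = vec![1.0; 8];
    println!("{}", dot_four_accumulators(&data, &weights));
}
```

The tree shape matters for latency: pairwise combination needs only log2 steps at the end of the window instead of a serial chain of additions.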

This structure is general enough for other heavy indicators that operate on wide windows. By centralizing the preload and unrolled reduction logic, new indicators can reuse a tested AVX512 path rather than each one managing its own masked tail handling and weight storage.

Streaming indicators with mirrored ring buffers

Batch processing is only part of the story. Many users want to feed a live price stream into indicators and get one output per tick without recomputing the entire history. The AlmaStream type is a concrete example of how the library handles this while staying SIMD friendly.

The stream keeps two buffers: a canonical ring that stores the last period values and a second buffer that mirrors that ring back to back. Every new value is written to both places. This layout means that the active window always exists as a single contiguous slice in memory and never wraps, which is well suited to SIMD loads.
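A minimal sketch of the mirrored ring idea, with illustrative names: every sample is written at position head and again at head plus period, so the most recent period values are always readable as one contiguous slice:

```rust
// Sketch of a mirrored ring buffer: a 2 * period backing store where each
// value is written twice, so the active window never wraps. Names are
// illustrative, not the library's AlmaStream internals.
struct MirroredRing {
    buf: Vec<f64>, // length 2 * period
    period: usize,
    head: usize,   // next write position in the first copy
    count: usize,  // samples seen so far
}

impl MirroredRing {
    fn new(period: usize) -> Self {
        Self { buf: vec![0.0; 2 * period], period, head: 0, count: 0 }
    }

    fn push(&mut self, value: f64) {
        self.buf[self.head] = value;
        self.buf[self.head + self.period] = value; // mirror copy
        self.head = (self.head + 1) % self.period;
        self.count += 1;
    }

    /// The last `period` samples, oldest first, as one contiguous slice
    /// ready for SIMD loads. None while still warming up.
    fn window(&self) -> Option<&[f64]> {
        if self.count < self.period {
            return None;
        }
        Some(&self.buf[self.head..self.head + self.period])
    }
}

fn main() {
    let mut ring = MirroredRing::new(3);
    for v in [1.0, 2.0, 3.0, 4.0] {
        ring.push(v);
    }
    println!("{:?}", ring.window());
}
```

The cost is one extra store per tick; in exchange, the per tick dot product never has to branch on a wrap boundary.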

Once the stream is warm, the update method calls a helper named dot_contiguous. That helper inspects the chosen kernel and either calls a scalar dot product or one of the SIMD variants. From the point of view of the streaming code, the switch is transparent. The same interface works on CPUs with or without AVX512 support.

Other streaming indicators can use the same mirrored ring approach. Each one defines its own weight or transformation buffer and then calls into the shared dot product helpers when enough data has been collected.

Batch parameter sweeps

The library also provides batch APIs for running many parameter combinations over the same input series. For ALMA this shows up as AlmaBatchBuilder and its associated functions, but the underlying pattern is generic: build a grid of parameter sets, precompute the weights for each row and then apply a fast per row kernel that walks the price series once for every parameter combination.

Internally the batch code uses an aligned weight buffer that stores all rows back to back. AVX2 and AVX512 row functions such as alma_row_avx2 and alma_row_avx512 reuse the same sliding window and fused multiply add structure as the single series kernels. The only difference is that the window walks over a shared data slice while the weights come from different offsets in the flat weight buffer.
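A scalar sketch of that layout, with illustrative names and trivial example rows: all rows' weights sit back to back in one flat buffer, and each row function reads its weights from a fixed offset while walking the shared price slice:

```rust
// Sketch of the flat weight layout for parameter sweeps: row r's weights
// live at offset r * period in one shared buffer. A scalar inner loop
// stands in for the alma_row_avx2 / alma_row_avx512 kernels.
fn run_batch(data: &[f64], weights_flat: &[f64], period: usize, rows: usize) -> Vec<Vec<f64>> {
    assert!(weights_flat.len() >= rows * period);
    let mut out = Vec::with_capacity(rows);
    for r in 0..rows {
        let w = &weights_flat[r * period..(r + 1) * period];
        // One sliding-window pass over the shared data slice for this row.
        let row: Vec<f64> = data
            .windows(period)
            .map(|win| win.iter().zip(w).map(|(d, wi)| d * wi).sum())
            .collect();
        out.push(row);
    }
    out
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0];
    // Two parameter rows of period 2, stored back to back:
    // row 0 is a simple average, row 1 just picks the newest value.
    let weights = [0.5, 0.5, 0.0, 1.0];
    println!("{:?}", run_batch(&data, &weights, 2, 2));
}
```

Keeping all rows in one flat allocation also means the sweep touches the weight memory sequentially, which is friendlier to the prefetcher than per row allocations.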

When the target is not WebAssembly, the batch functions can run in parallel using Rayon. Rows are split across threads so that each core executes the same inner loop for a different parameter set. This makes brute force parameter sweeps practical on a single machine while still relying on the same SIMD building blocks.
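The library uses Rayon for this; as a stdlib only sketch of the same row split, std::thread::scope can hand each row to its own thread, with every thread running the same inner loop on a different weight row:

```rust
// Stdlib-only sketch of the parallel row split (the library uses Rayon).
// Each scoped thread computes one parameter row over the shared data.
fn dot_window(data: &[f64], w: &[f64]) -> Vec<f64> {
    data.windows(w.len())
        .map(|win| win.iter().zip(w).map(|(d, wi)| d * wi).sum())
        .collect()
}

fn run_rows_parallel(data: &[f64], rows: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let mut out: Vec<Vec<f64>> = vec![Vec::new(); rows.len()];
    std::thread::scope(|s| {
        for (slot, w) in out.iter_mut().zip(rows) {
            // One thread per row; a real sweep would chunk rows per core.
            s.spawn(move || *slot = dot_window(data, w));
        }
    });
    out
}

fn main() {
    let data = [1.0, 2.0, 3.0, 4.0];
    let rows = vec![vec![0.5, 0.5], vec![0.0, 1.0]];
    println!("{:?}", run_rows_parallel(&data, &rows));
}
```

Scoped threads let each worker borrow the shared data slice directly, which mirrors how Rayon's parallel iterators split the row range without copying the input series.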

Correctness and testing

SIMD code is only useful if it is correct. The library treats the scalar implementation as the reference and builds a test suite that compares SIMD outputs against scalar outputs under many conditions. For ALMA, this includes real market data, synthetic sequences, edge cases such as all NaN input, invalid parameters and reinput scenarios where an indicator is applied twice in a row.

Similar tests exist for the batch and streaming paths. They ensure that warmup periods line up, that NaN prefixes are preserved and that no sentinel poison values escape from helper functions that allocate and initialize buffers. This gives confidence that new indicators added on top of the shared SIMD primitives do not silently change numerical behavior.
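The comparison discipline itself is simple to sketch: treat the scalar output as the reference and require the vectorized output to match within a tiny tolerance, with NaN positions lining up exactly. The helper name and tolerance value here are illustrative:

```rust
// Sketch of the scalar-as-reference comparison: SIMD output must match
// within a tolerance, and NaN prefixes must line up position for position.
fn assert_close(reference: &[f64], candidate: &[f64], tol: f64) {
    assert_eq!(reference.len(), candidate.len(), "length mismatch");
    for (i, (r, c)) in reference.iter().zip(candidate).enumerate() {
        if r.is_nan() {
            // Warmup / NaN-prefix positions must be NaN in both outputs.
            assert!(c.is_nan(), "index {i}: expected NaN, got {c}");
        } else {
            assert!((r - c).abs() <= tol, "index {i}: {r} vs {c}");
        }
    }
}

fn main() {
    let scalar = [f64::NAN, f64::NAN, 1.0, 2.0];
    let simd = [f64::NAN, f64::NAN, 1.0 + 1e-15, 2.0];
    assert_close(&scalar, &simd, 1e-12);
    println!("scalar and SIMD outputs agree");
}
```

Checking NaN positions explicitly matters because a naive `(r - c).abs() <= tol` comparison is always false when either side is NaN, which would silently pass or fail for the wrong reason depending on how the assertion is written.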

What the SIMD work buys

Microbenchmarks on an AMD Ryzen 9 9950X give a concrete picture of these gains. For the ALMA indicator on a 10 000 element input, the scalar implementation takes roughly 12.4 microseconds, the AVX2 kernel comes in around 8.8 microseconds and the AVX512 kernel around 6.45 microseconds, so AVX2 delivers about 1.4 times the scalar throughput and AVX512 close to 1.9 times. In a larger batch benchmark of roughly 250 million ALMA operations on the same CPU, an AVX512 batch kernel finishes in about 131 milliseconds while the scalar batch path takes around 271 milliseconds, a little more than a 2 times speed up for that workload.

These CPU side kernels are also the backbone of a CPU only backtest optimization demo built with Rust and Tauri. On machines without a suitable GPU, that desktop app can drive large parameter sweeps and multi strategy experiments entirely on the host by leaning on the SIMD paths in the TA library and running the search loop in native code. The experience is still interactive because the hot indicator loops move at AVX2 and AVX512 speeds instead of plain scalar rates.

These gains apply to the inner indicator kernels rather than full backtests. Strategy code still spends time in other parts of the system such as signal logic, portfolio bookkeeping and I/O. The SIMD work narrows the gap in the hottest loops so that when indicators dominate the runtime, they do not become the bottleneck.