Why move indicators to the GPU
In most backtesting stacks, indicator calculation ends up as the tight inner loop. The rest of the system spends time orchestrating strategies, managing positions and writing results, but the bulk of the floating point work lives in moving averages, oscillators and other technical indicators. When a strategy uses many indicators and sweeps over a wide range of parameters, that inner loop becomes the part that sets how many backtests you can run in a day.
The SIMD kernels in the TA library push CPU performance a long way, but there is still a gap between a sixteen lane FP32 AVX512 unit and the thousands of FP32 lanes available on a modern GPU. The CUDA path for indicators is designed to close that gap for workloads that are heavy enough to justify moving data to the device, while keeping the scalar and SIMD implementations as references for correctness.
Kernel layout in the ALMA example
The ALMA moving average is a good example of how the CUDA side is structured. At its core, the indicator applies a Gaussian weight curve over a moving window and produces one output per time step. On the GPU, this becomes a dot product that can be computed by many threads in parallel across time, parameter combinations and series.
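As a concrete reference for what the kernels parallelize, the ALMA math can be sketched on the CPU. This is an illustrative reimplementation, not the library's code; the parameter names period, offset and sigma follow the common ALMA definition, and the pre-scaling of weights by the inverse norm mirrors the trick described below.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU reference sketch of the ALMA math the GPU kernels parallelize.
// Illustrative only; the library's actual signatures may differ.
std::vector<float> alma_ref(const std::vector<float>& prices,
                            int period, float offset, float sigma) {
    // Gaussian weight curve centered at m with width s.
    const float m = offset * (period - 1);
    const float s = period / sigma;
    std::vector<float> w(period);
    float norm = 0.0f;
    for (int i = 0; i < period; ++i) {
        w[i] = std::exp(-(i - m) * (i - m) / (2.0f * s * s));
        norm += w[i];
    }
    // Pre-scale by the inverse norm so later dot products need no division.
    for (int i = 0; i < period; ++i) w[i] /= norm;

    // One output per time step once the window is fully inside the series.
    std::vector<float> out(prices.size(), NAN);
    for (int t = period - 1; t < static_cast<int>(prices.size()); ++t) {
        float acc = 0.0f;
        for (int i = 0; i < period; ++i)
            acc = std::fma(w[i], prices[t - period + 1 + i], acc);
        out[t] = acc;
    }
    return out;
}
```

On the GPU, the outer loop over t is what gets spread across threads, while the weight setup runs once per block.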
The file alma_kernels.cu contains several kernels that cover the main shapes of ALMA workloads. There is an on the fly batched kernel that computes weights on the device for each parameter combination, a precomputed weight kernel that reuses weights stored in global memory across many launches, and tiled variants that stage both prices and weights in shared memory for higher reuse. There is also a many series kernel that works over time major layouts where each column is a price series and each row is a time step.
All of these kernels target a fast FP32 path, which is a deliberate choice on current gaming and workstation GPUs where FP64 throughput is much lower than FP32. Normalization and reductions rely on warp level primitives, with an optional CUB based block reduction available when it fits the block size. The code is built with a recent CUDA toolchain against an RTX 4090 class device, but the structure stays close to the scalar and SIMD versions so that it remains readable.
On the fly batched kernel
The simplest CUDA entry point for ALMA batches is alma_batch_f32_onthefly. The grid is two dimensional. The x dimension marches over time and the y dimension assigns a block to each parameter combination in the sweep. Inside each block, threads cooperate to compute gaussian weights and their normalization for a single ALMA configuration.
Weight computation happens once per block in shared memory. The helper alma_compute_weights_and_invnorm distributes work across the threads in the block, uses a tree style block reduction to accumulate the norm and writes the inverse norm back into shared memory. A follow up pass scales the weights so that later dot products do not need to multiply by the normalization factor again.
After the weights are ready, each thread walks over the time axis with a simple strided loop. For every time index where the window is fully inside the series it computes a dot product between prices and shared weights using alma_dot, which is a tight loop of fused multiply add operations. Threads write their outputs back to global memory using a fixed stride, which keeps writes coalesced for common grid sizes.
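The strided time loop can be modeled on the host to show why writes stay coalesced: adjacent thread ids touch adjacent time steps, then everyone jumps forward by the thread count. The thread count here is illustrative, not the library's launch configuration.

```cpp
#include <cassert>
#include <vector>

// Host model of the strided loop over time: thread tid starts at
// first_valid + tid and advances by the thread count, so neighboring
// threads write neighboring outputs in each round.
std::vector<int> covered_steps(int series_len, int period, int num_threads) {
    std::vector<int> hits(series_len, 0);
    const int first_valid = period - 1; // window fully inside the series
    for (int tid = 0; tid < num_threads; ++tid)
        for (int t = first_valid + tid; t < series_len; t += num_threads)
            hits[t] += 1; // stand-in for one dot product and one write
    return hits;
}
```

Every valid time step is produced exactly once, with no coordination between threads beyond the shared weight setup.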
Precomputed batched kernel
The alma_batch_f32 kernel targets workloads where the same parameter combinations are reused many times. Instead of computing weights in every launch, the host precomputes them and stores them in a flat buffer in device memory. Each row of that buffer holds one weight vector padded out to the maximum period in the sweep.
At the start of the kernel, each block loads the relevant row of weights into shared memory. A short branch picks between a vectorized float4 copy and a scalar copy depending on alignment so that common cases use fully coalesced 16 byte loads. Once the weights are in shared memory, the structure of the loop over time is similar to the on the fly version, but with less work in the setup phase.
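The alignment branch amounts to a small predicate: a float4 load moves 16 bytes at once, so it needs a 16 byte aligned source pointer and a length that divides into groups of four floats. The function name here is illustrative, not taken from the library.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the check that picks between the vectorized and scalar
// copy paths. Hypothetical helper name; the kernel makes an
// equivalent decision inline.
bool can_use_float4_copy(const float* src, int count) {
    const bool aligned16 =
        (reinterpret_cast<std::uintptr_t>(src) % 16) == 0;
    return aligned16 && (count % 4 == 0);
}
```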
Tiled kernels for better reuse
The tiled kernels go further by staging both prices and weights in shared memory. The AlmaBatchTiledPrecomputed template computes a tile origin in time and loads a contiguous block of prices that is large enough to cover the tile plus the warm up part of the window. It uses aligned shared memory and float4 paths when possible so that global loads are vectorized and steady.
Within the tile, each thread computes one output by applying alma_dot starting at its assigned offset into the shared buffer. This means that each price sample is read from global memory once per tile and then reused by many threads as they slide the window. Variants such as AlmaBatchTiledPrecomputed2X let each thread produce two consecutive outputs from the same shared tile, which increases arithmetic intensity without adding more global memory traffic.
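The arithmetic intensity gain can be estimated with a back-of-the-envelope model: a tile producing some number of outputs stages outputs + period - 1 prices from global memory, and each output costs period fused multiply adds against that shared data. This is a simplified model under those assumptions, not a measurement.

```cpp
#include <cassert>
#include <cmath>

// Rough model of tile reuse: FMAs performed per price loaded from
// global memory. Producing more outputs per tile amortizes the
// warm up portion of the window, most visibly when the period is
// comparable to the tile width.
double fmas_per_global_read(int outputs, int period) {
    const double reads = outputs + period - 1.0; // prices staged per tile
    const double fmas  = static_cast<double>(outputs) * period;
    return fmas / reads;
}
```

For a long period relative to the tile, doubling the outputs per tile, as the 2X variant does, noticeably raises the FMA-per-read ratio.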
Many series, one parameter set
Real workflows often involve many series with the same indicator parameters. The kernel alma_multi_series_one_param_f32 and its tiled counterpart address this case. They consume prices laid out in time major order, where all series for a given time step are stored next to each other, and they emit outputs in the same layout.
The tiled implementation uses a two dimensional block where the x dimension steps through time and the y dimension steps across series. A shared tile stores a stripe of prices for several time steps and several series at once. When the series count and alignment conditions are right, threads can load four adjacent series values at a time into shared memory using a float4 path, which keeps memory transactions wide and simple.
Each thread then computes a strided dot product along the time axis using alma_dot_stride. The stride equals the number of series, so the dot product walks through time while staying on the same series. Because weights live in shared memory and the tile covers several time steps at once, the kernel gets many uses out of each global read.
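The strided dot product over a time major layout can be written out as a CPU reference. The helper mirrors the role of alma_dot_stride, but its signature is assumed for illustration; the key point is that consecutive window taps for one series are num_series floats apart.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Reference sketch of a strided dot product over time major data:
// row t holds all series for time step t, so walking through time on
// one series means striding by the series count. Signature assumed.
float dot_stride(const std::vector<float>& prices,  // time major
                 const std::vector<float>& weights, // pre-normalized
                 int t, int series, int num_series) {
    const int period = static_cast<int>(weights.size());
    const int start  = (t - period + 1) * num_series + series;
    float acc = 0.0f;
    for (int k = 0; k < period; ++k)
        acc = std::fma(weights[k], prices[start + k * num_series], acc);
    return acc;
}
```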
What the GPU gains look like in practice
A batch benchmark for ALMA gives a concrete sense of the benefit of moving this work to the GPU. On the CPU side, measured on an AMD Ryzen 9 9950X, a scalar batch path needs around 271 milliseconds to process roughly 250 million ALMA operations and the AVX512 batch kernel for the same workload cuts that to about 131 milliseconds. On the GPU, a CUDA batch kernel for ALMA on an NVIDIA RTX 4090 completes the same 250 million operations in about 6.04 milliseconds.
In this test, the AVX512 batch kernel is a little more than 2 times faster than the scalar path, while the CUDA path is more than 40 times faster than scalar. Expressed against the stronger baseline, for this ALMA workload the GPU reaches roughly a 20x improvement in throughput over an already optimized AVX512 implementation on the same machine. These numbers are for a single indicator microbenchmark and real strategy code will always include other costs, but they show that heavy indicator loops are a good fit for the GPU.
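The ratios behind those claims follow directly from the quoted timings of 271 ms scalar, 131 ms AVX512 and 6.04 ms CUDA for the same workload:

```cpp
#include <cassert>

// Speedup is just the ratio of baseline time to candidate time for
// the same amount of work (~250 million ALMA operations here).
double speedup(double baseline_ms, double candidate_ms) {
    return baseline_ms / candidate_ms;
}
```

Plugging in the numbers gives roughly 2.1x for AVX512 over scalar, about 21.7x for CUDA over AVX512 and about 44.9x for CUDA over scalar.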
How this feeds back into backtesting
Faster indicators translate directly into faster backtests as long as indicator computation dominates runtime. When a large part of the time budget is spent in ALMA, RSI, MACD or band calculations, offloading those loops to a GPU frees up the CPU to orchestrate parameter grids, manage portfolios and write results to disk. The backtesting engine can then treat the TA library as a high throughput device for signal generation rather than a limiting factor.
A planned Rust Tauri desktop application is the clearest example of how this is intended to work in practice. In that design, price data and indicator state live in VRAM for the duration of a session. The host sends price matrices to the GPU once, launches many backtest and optimization runs entirely on the device, and only reads back compact results such as metrics, equity curves and best parameter sets. On top of that, a GPU only optimization layer can explore large parameter spaces directly on the device, pruning poor candidates before they ever reach the CPU.
The CUDA design mirrors the structure of the scalar and SIMD paths, so it is possible to validate correctness on the CPU and then switch to GPU execution for production runs. As with the SIMD work, the goal is not to claim speedups for every possible strategy but to remove indicator math as the obvious bottleneck when it is the part that dominates runtime.