VRAM resident CUDA dispatch for technical indicators

The CUDA side of VectorTA starts to make sense when the workload stays on the device. The important change is not just that some indicators have GPU kernels, but that the library now exposes a native device dispatch path that can accept borrowed CUDA views, keep outputs in VRAM, and let downstream GPU work continue without rebuilding the whole pipeline around host copies.

Why the CUDA API had to change

A CUDA API built around host memory is enough to prove that an indicator kernel is fast. It is not enough to build a workflow that is genuinely shaped for the GPU. If every call starts from a Rust slice on the host, uploads the full series to the device, launches a kernel and then downloads the result straight back into a Vec<f32>, the GPU is doing useful arithmetic but the surrounding dataflow is still organized around PCIe traffic.

That is why the newer CUDA work in VectorTA is centered on dispatch rather than on individual kernels in isolation. The goal is to upload market data once, keep it resident in VRAM, run indicator sweeps against borrowed device views, and only cross back to the host when there is a real reason to do so. That is the difference between a CUDA feature and a CUDA workflow.

Two entry points, two different jobs

The library now carries two CUDA entry points side by side. compute_cuda(...) is still the compatibility path. It accepts inputs from the host and remains useful when the caller already lives on the CPU side of the application. For the indicators that have been migrated, it can still route into the newer device core, but it begins from host memory and therefore still pays the upload boundary up front.

The more important addition is compute_cuda_device(...). That is the native device path for the supported VRAM resident inventory. It takes validated device views rather than host slices, dispatches directly against already resident data, and only downloads the result if the caller explicitly asks for host output. The old API was not replaced. It was kept as a bridge while a cleaner device contract was added beside it.
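The split between the two entry points can be sketched in a few lines. This is an illustrative mock, not the library's actual signatures: the types `DeviceSliceRef` and `DeviceMatrix` are local stand-ins for the real `CudaDeviceSliceF32Ref` and `DeviceMatrixF32`, and the function bodies are placeholders for the real upload/kernel/download work.

```rust
/// Stand-in for a borrowed, validated device view (real: CudaDeviceSliceF32Ref).
struct DeviceSliceRef { ptr: usize, len: usize, device_id: u32 }

/// Stand-in for a device-resident result matrix (real: DeviceMatrixF32).
struct DeviceMatrix { rows: usize, cols: usize }

/// Compatibility path: starts from a host slice, so it pays the upload
/// boundary before any kernel runs, and downloads the result at the end.
fn compute_cuda(host_input: &[f32]) -> Vec<f32> {
    // upload -> kernel -> download, all hidden inside one call
    host_input.to_vec() // placeholder for the round-tripped result
}

/// Native device path: accepts an already-resident view and can leave
/// the output in VRAM; no host copy unless the caller asks for one.
fn compute_cuda_device(view: &DeviceSliceRef) -> DeviceMatrix {
    DeviceMatrix { rows: 1, cols: view.len }
}

fn main() {
    let host = vec![1.0_f32; 8];
    let out_host = compute_cuda(&host); // crosses the bus twice
    let view = DeviceSliceRef { ptr: 0xdead_beef, len: 8, device_id: 0 };
    let out_dev = compute_cuda_device(&view); // result stays in VRAM
    assert_eq!(out_host.len(), 8);
    assert_eq!(out_dev.cols, 8);
    let _ = out_dev.rows;
}
```

The asymmetry is the point: the first function's signature forces a host round trip, while the second's signature never mentions host memory at all.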

The API shape

The execution model at the lower level is based on pointers, which is exactly what a GPU pipeline needs. Input refs backed by device memory, such as CudaDeviceSliceF32Ref and its OHLC and OHLCV variants, carry a raw device address, a length or shape, and a CUDA device id. That makes them the practical input boundary for work that is already resident on the GPU.

The public Rust surface is still wrapper first rather than raw pointer first. Owned types keep allocations alive, borrowed device refs are validated before dispatch, and explicit raw pointer construction stays behind narrow interop entry points. That split matters. It lets the library behave like a pointer-in, pointer-out system without asking normal callers to manage unsafe ownership as the default programming model.
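A device ref of this kind is essentially three fields plus a validation step before dispatch. The sketch below is hypothetical: the field names, the `validate` helper, and the exact checks are illustrative stand-ins for whatever the library actually verifies, but the shape (raw address, length, device id, checked before any kernel launch) matches the description above.

```rust
/// Sketch of what a borrowed device view carries; field names are
/// illustrative, not the library's actual layout.
#[derive(Clone, Copy)]
struct CudaSliceRef {
    device_ptr: u64, // raw device address
    len: usize,      // element count
    device_id: i32,  // which GPU owns the allocation
}

/// Hypothetical pre-dispatch validation: reject views that cannot
/// possibly be dispatched safely on the expected device.
fn validate(view: &CudaSliceRef, expected_device: i32) -> Result<(), String> {
    if view.device_ptr == 0 {
        return Err("null device pointer".into());
    }
    if view.len == 0 {
        return Err("empty view".into());
    }
    if view.device_id != expected_device {
        return Err("view lives on a different GPU".into());
    }
    Ok(())
}

fn main() {
    let ok = CudaSliceRef { device_ptr: 0x7f00_0000, len: 200_000, device_id: 0 };
    assert!(validate(&ok, 0).is_ok());
    let wrong_gpu = CudaSliceRef { device_id: 1, ..ok };
    assert!(validate(&wrong_gpu, 0).is_err());
}
```

Because validation happens at the wrapper boundary, the unsafe raw-pointer construction stays quarantined behind the interop entry points the prose describes.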

One of the cleaner design choices in this API is that upload and download operations are named and visible. The runtime exposes methods for uploading slices, OHLC data, OHLCV layouts and matrices, and it exposes separate download methods for bringing results back to the host. That keeps the residency model legible. A caller can see where bytes cross the bus instead of discovering later that every convenience wrapper hid a full series transfer.
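To make the "legible residency" idea concrete, here is a toy runtime in which every bus crossing is a named call that can be counted. The method names and the counter fields are stand-ins, not the library's actual API; the point is only that transfers appear explicitly in the program text.

```rust
/// Illustrative runtime surface where every host<->device transfer is a
/// named, visible call; method names are hypothetical stand-ins.
struct Runtime { uploads: usize, downloads: usize }

impl Runtime {
    /// Upload a host slice, returning a stand-in device handle.
    fn upload_slice(&mut self, host: &[f32]) -> u64 {
        self.uploads += 1;
        host.len() as u64 // placeholder handle
    }

    /// Download a device result back into host memory.
    fn download_slice(&mut self, _dev: u64, len: usize) -> Vec<f32> {
        self.downloads += 1;
        vec![0.0; len]
    }
}

fn main() {
    let mut rt = Runtime { uploads: 0, downloads: 0 };
    let dev = rt.upload_slice(&[1.0, 2.0, 3.0]);
    // ... any number of device-side dispatches happen here, transfer-free ...
    let back = rt.download_slice(dev, 3);
    // the transfer count is visible in the program text, not hidden
    // inside convenience wrappers
    assert_eq!((rt.uploads, rt.downloads), (1, 1));
    assert_eq!(back.len(), 3);
}
```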

The output side follows the same rule. CudaOutputTarget::HostF32 means the final result should come back as a host vector. CudaOutputTarget::DeviceF32 means the result stays in VRAM as a matrix backed by device memory. The output target is not inferred. It is part of the request, which makes it possible to build predictable GPU pipelines instead of isolated CUDA calls.
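The explicit output target can be mirrored with a small enum. The names below echo the real `CudaOutputTarget::HostF32` and `CudaOutputTarget::DeviceF32` variants, but the `dispatch` function and the `Output` type are illustrative stand-ins.

```rust
/// Illustrative mirror of the output-target idea: the caller states
/// where the result should live; nothing is inferred.
enum OutputTarget {
    HostF32,   // download the final result into a host Vec<f32>
    DeviceF32, // leave the result in VRAM as a device-backed matrix
}

/// Stand-in result type: either host data or a handle to VRAM-resident data.
enum Output {
    Host(Vec<f32>),
    Device { rows: usize, cols: usize },
}

/// Hypothetical dispatch: the target is part of the request, so the
/// residency of the result is decided by the caller, not the library.
fn dispatch(len: usize, target: OutputTarget) -> Output {
    match target {
        OutputTarget::HostF32 => Output::Host(vec![0.0; len]),
        OutputTarget::DeviceF32 => Output::Device { rows: 1, cols: len },
    }
}

fn main() {
    assert!(matches!(dispatch(4, OutputTarget::HostF32), Output::Host(_)));
    assert!(matches!(dispatch(4, OutputTarget::DeviceF32), Output::Device { .. }));
}
```

Keeping the target in the request type means a reviewer can audit a pipeline for stray downloads just by reading the call sites.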

Chaining indicators is where the API starts paying for itself

This design becomes more interesting once one indicator feeds another. A native device call can return a DeviceMatrixF32, and a caller can then derive a borrowed device view into one row of that matrix and hand it to the next indicator without staging the whole intermediate result through host memory. The library is not just accelerating isolated kernels. It is exposing enough of the right boundaries to let indicator pipelines stay on the GPU.
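The row-borrowing step can be sketched with pointer arithmetic. This is a hypothetical model of a device matrix, assuming a dense row-major f32 layout; the real `DeviceMatrixF32` may differ, but the mechanism (deriving a borrowed view into one row without any host staging) is the one described above.

```rust
/// Sketch of chaining: one indicator's device-resident output row becomes
/// the next indicator's borrowed input view, with no host copy in between.
struct DeviceMatrix { base_ptr: u64, rows: usize, cols: usize }

/// Borrowed device view into a contiguous run of f32 elements.
struct DeviceView { ptr: u64, len: usize }

impl DeviceMatrix {
    /// Borrow one row as a device view (dense row-major layout assumed).
    fn row(&self, r: usize) -> DeviceView {
        assert!(r < self.rows, "row out of bounds");
        DeviceView {
            ptr: self.base_ptr + (r * self.cols * std::mem::size_of::<f32>()) as u64,
            len: self.cols,
        }
    }
}

fn main() {
    // e.g. a parameter sweep produced 3 rows over 200_000 bars, all in VRAM
    let sweep = DeviceMatrix { base_ptr: 0x1000, rows: 3, cols: 200_000 };
    let row1 = sweep.row(1);
    // row1 would now feed the next indicator as a borrowed device input
    assert_eq!(row1.len, 200_000);
    assert_eq!(row1.ptr, 0x1000 + (200_000 * 4) as u64);
}
```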

That same idea shows up in the moving average selector work as well. Repeated sweeps over a single resident series can reuse shared runtime state, cached period buffers and persistent device allocations instead of reconstructing the whole environment for every call. The API is moving toward upload once, dispatch many, and that is the shape that very high throughput quantitative workloads actually need.

What the benchmarks are really saying

The strongest benchmark in this stack is the one that removes hand waving and just measures a large indicator workload on the hardware that the library targets. On an RTX 4090, the CUDA path can compute roughly 250 million ALMA output points in 3.129 milliseconds. On the same machine, with the CPU side measured on an AMD Ryzen 9 9950X, the corresponding batch timings are about 140.61 milliseconds for AVX512, 188.64 milliseconds for AVX2 and 386.20 milliseconds for scalar execution.

Those numbers are large enough that they stop being a micro optimization story. For that workload, the CUDA path is about 45 times faster than the AVX512 path, about 60 times faster than AVX2 and more than 123 times faster than scalar code. The point is not that the CPU side is weak. The SIMD work is already strong. The point is that once the dataflow is arranged so the GPU is not starved by repeated transfers, there is far more parallel throughput available on the device than even a very fast desktop CPU can offer.
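The speedup figures follow directly from the reported timings, which is worth checking rather than taking on faith:

```rust
// Reproducing the speedup arithmetic from the reported ALMA batch timings.
fn main() {
    let cuda_ms = 3.129_f64;  // RTX 4090, ~250M ALMA output points
    let avx512_ms = 140.61;   // Ryzen 9 9950X, AVX512 batch
    let avx2_ms = 188.64;     // Ryzen 9 9950X, AVX2 batch
    let scalar_ms = 386.20;   // Ryzen 9 9950X, scalar batch

    assert!((avx512_ms / cuda_ms - 44.9).abs() < 0.1);  // ~45x vs AVX512
    assert!((avx2_ms / cuda_ms - 60.3).abs() < 0.1);    // ~60x vs AVX2
    assert!((scalar_ms / cuda_ms - 123.4).abs() < 0.1); // >123x vs scalar

    // Implied CUDA throughput: ~250e6 points in 3.129 ms
    let pts_per_sec = 250e6 / (cuda_ms / 1e3);
    assert!(pts_per_sec > 79e9); // roughly 80 billion output points per second
}
```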

A fast indicator benchmark is useful, but the more interesting result is what happens when those kernels sit inside an optimizer loop. The Rust and Tauri backtest optimization app built on top of this library can run 58,300 backtests for a double ALMA crossover strategy over 200,000 data points in about 85.863 milliseconds on the same RTX 4090 and Ryzen 9 9950X setup. That works out to roughly 679,000 backtests per second for a measured, indicator heavy sweep rather than a toy inner loop benchmark.
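The backtests-per-second figure is likewise just the reported count over the reported wall time:

```rust
// Checking the throughput claim: 58,300 backtests in 85.863 ms.
fn main() {
    let backtests = 58_300_f64;
    let elapsed_s = 85.863e-3;
    let per_sec = backtests / elapsed_s;
    // ~679,000 backtests per second, as stated
    assert!((per_sec - 679_000.0).abs() < 1_000.0);
}
```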

Results like that do not come from CUDA in the abstract. They come from an API that lets the optimizer keep price data resident in VRAM, launch repeated indicator work against device inputs, reuse intermediate outputs on the GPU, and pull back only the smaller result sets that the desktop application actually needs to render or rank. The benchmarks are a consequence of the dispatch model.

What VRAM resident means here

In this codebase, VRAM resident does not mean every temporary allocation has been reduced to some perfect global pool, and it does not mean every indicator in the repository already has a native device wrapper. It means something more concrete and more useful: market data can live in buffers owned by the device, supported indicators can consume borrowed device views, outputs can stay in VRAM, and downstream GPU work can continue before any final host copy happens.

It is also worth being precise about the current limits. The native device inventory is broad, but it is not universal, and some wrappers still upload compact metadata from the host even when the main market series remains resident on the GPU. There is also an explicit first_valid contract on the native device side. The compatibility path can infer validity offsets from host data, but borrowed device pointers do not carry enough semantic information for that to happen for free.
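The first_valid asymmetry can be illustrated with a small sketch. The helper names here are hypothetical, but the logic matches the contract described above: a host slice can be scanned for its first non-NaN sample, while a raw device pointer cannot be inspected cheaply from the host, so the device path takes the offset as an explicit argument.

```rust
/// Host path: the validity offset can be inferred by scanning the data.
fn infer_first_valid(host: &[f32]) -> usize {
    host.iter().position(|v| !v.is_nan()).unwrap_or(host.len())
}

/// Device path sketch: a raw device pointer carries no semantic
/// information, so first_valid must be supplied by the caller.
/// Returns how many samples actually participate in the computation.
fn compute_device(_device_ptr: u64, len: usize, first_valid: usize) -> usize {
    assert!(first_valid <= len);
    len - first_valid
}

fn main() {
    // A warm-up region of NaNs, as indicator outputs often have.
    let series = [f32::NAN, f32::NAN, 1.0, 2.0, 3.0];
    let fv = infer_first_valid(&series); // host side can discover this
    assert_eq!(fv, 2);
    // Device side must be told the same offset explicitly.
    assert_eq!(compute_device(0x1000, series.len(), fv), 3);
}
```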

That is the infrastructure the CUDA side needed. It turns VectorTA from a library with some fast GPU kernels into a library that can participate in full GPU resident indicator and optimization pipelines. For CPU side context, see SIMD vectorization for technical indicators. For product context, see Technical Analysis Library and Backtesting Engine.