A research workstation rather than a benchmark shell
It is easy to look at one fast GPU benchmark and imagine that the product is just a thin wrapper around a kernel. That is not what VectorGrid is. The released desktop application is a full quant research workstation built with Vue 3 and Vite on the front end, Tauri as the desktop bridge, and Rust across optimization, validation, meta selection, reporting, artifact management, and live execution plumbing. The optimization engine is the center of gravity, but it lives inside a real workflow rather than inside an isolated benchmark harness.
In practice that means the backtest optimization application handles the whole path from loading or resampling market data to optimizing a strategy, validating shortlisted candidates, ranking saved runs, browsing artifacts, and exporting reports. It also includes a Stage 5 ML confidence layer, but that layer is used in the one place where it is actually defensible here: after optimization and validation, as a post meta filter and reranking tool rather than as a pretend alpha engine.
Why exact grid search became the product default
The interesting part is that the product did not begin with the assumption that brute force should win. Alternative optimization approaches were tried, including progressive GPU search modes designed to inspect promising regions early and stop before the full space had been evaluated. In some workloads those methods did recover the same best answer, and in a few cases they were faster. That made them worth testing seriously rather than dismissing out of hand.
But once the measurements were taken on real data, the result was less flattering to the fancy algorithms than expected. On larger AAPL workloads, exact grid search was often still faster, and in one of the harder cases the progressive path even missed the exact winner frontier and landed at rank ten. That is why the released product now standardizes the default optimization workflow on exact grid search. The team did not keep the more complicated option because it sounded more sophisticated. It kept the path that won on speed, determinism, and answer quality.
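The determinism and answer-quality point can be made concrete: an exact grid search visits every valid pair, so a sharp, isolated winner cannot be pruned away the way a region-skipping progressive search can skip past it. A minimal Rust sketch, where the function name and the toy scoring closure are illustrative and not VectorGrid's API:

```rust
// Exhaustive search over (fast, slow) window pairs. Because every valid
// pair is scored, the true optimum is found by construction.
fn exact_grid_search(
    fast_range: std::ops::Range<usize>,
    slow_range: std::ops::Range<usize>,
    score: impl Fn(usize, usize) -> f64,
) -> (usize, usize, f64) {
    let mut best = (0, 0, f64::NEG_INFINITY);
    for fast in fast_range {
        for slow in slow_range.clone() {
            if fast >= slow {
                continue; // only fast < slow pairs are valid
            }
            let s = score(fast, slow);
            if s > best.2 {
                best = (fast, slow, s);
            }
        }
    }
    best
}

fn main() {
    // Toy score with one sharp, isolated peak at (7, 40), far from the broad
    // basin around (20, 100) -- the shape a pruning search is likeliest to miss.
    let score = |f: usize, s: usize| {
        if f == 7 && s == 40 {
            10.0
        } else {
            -((f as f64 - 20.0).powi(2) + (s as f64 - 100.0).powi(2))
        }
    };
    let (fast, slow, _) = exact_grid_search(2..50, 10..200, score);
    assert_eq!((fast, slow), (7, 40)); // the exhaustive search cannot miss it
}
```

A heuristic that stops after the promising basin around (20, 100) would report a rank-ten style answer on this landscape; the exhaustive pass cannot.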
What the GPU path changes
The reason brute force becomes practical here is that the GPU path is not treated as a one off indicator accelerator. Price data is sent to VRAM up front, moving average generation stays on the device, tiled preparation and crossover evaluation stay there as well, and ranking work is handled before the host sees the final results. The CPU is not micromanaging every intermediate step. It is acting more like an orchestrator around a device resident execution pipeline.
That distinction matters. Instead of copying a large price matrix to the GPU for one calculation and then pulling everything straight back, VectorGrid keeps the calculation path in VRAM and only returns compact outputs such as metrics, shortlisted rows, equity curves, and best parameter sets. To my knowledge, that makes this the first end to end GPU executed backtest optimization software to keep price data, indicators, and backtest execution in VRAM for the full run while still offering a CPU fallback on systems without a suitable CUDA device. CUDA is where the application is most aggressive, but CPU only builds still remain valid and useful.
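The orchestrator-around-a-backend shape described above can be sketched as one interface with interchangeable implementations. Everything here is a hypothetical simplification: `Backend`, `RunSummary`, and the stand-in SMA score are illustrative names rather than VectorGrid's actual types, and the CPU variant below stands in for both execution paths.

```rust
/// Compact summary returned to the host. In the GPU case, the full indicator
/// matrices and per-candidate intermediates would never leave the device.
struct RunSummary {
    best_params: (usize, usize),
    best_score: f64,
    evaluated: usize,
}

/// One interface for both paths, so the rest of the app does not care
/// whether the work ran on CUDA or on a CPU fallback.
trait Backend {
    fn upload_prices(&mut self, prices: &[f64]); // once, up front
    fn optimize(&mut self, pairs: &[(usize, usize)]) -> RunSummary;
}

struct CpuBackend {
    prices: Vec<f64>,
}

impl Backend for CpuBackend {
    fn upload_prices(&mut self, prices: &[f64]) {
        self.prices = prices.to_vec();
    }

    fn optimize(&mut self, pairs: &[(usize, usize)]) -> RunSummary {
        let mut best = ((0, 0), f64::NEG_INFINITY);
        for &(fast, slow) in pairs {
            // Stand-in score: the real engine would run a full backtest here.
            let s = trailing_sma(&self.prices, fast) - trailing_sma(&self.prices, slow);
            if s > best.1 {
                best = ((fast, slow), s);
            }
        }
        RunSummary {
            best_params: best.0,
            best_score: best.1,
            evaluated: pairs.len(),
        }
    }
}

/// Simple moving average over the last `window` prices (illustrative helper).
fn trailing_sma(prices: &[f64], window: usize) -> f64 {
    let tail = &prices[prices.len() - window..];
    tail.iter().sum::<f64>() / window as f64
}
```

The design point is in the signatures: prices go in once, the search runs entirely inside the backend, and only a compact summary crosses back to the caller.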
What the released benchmarks look like
The headline benchmark in the current released stack is a large exact ALMA against ALMA brute force run over 200,000 bars. On an RTX 4090, VectorGrid completes 58,300 valid backtests in about 85.863 milliseconds, which works out to roughly 679,000 exact backtests per second and about 135.8 billion pair bars per second on that workload. The important part is not just that the number is high. It is that the result comes from an exact search over the full parameter space rather than from an approximate shortcut.
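Those figures are internally consistent and can be re-derived from the raw measurement of 58,300 backtests over 200,000 bars in 85.863 milliseconds:

```rust
// Throughput is just measured work over measured time, so the headline
// numbers should reproduce each other from the raw figures.
fn backtests_per_sec(backtests: f64, seconds: f64) -> f64 {
    backtests / seconds
}

fn pair_bars_per_sec(backtests: f64, bars: f64, seconds: f64) -> f64 {
    backtests * bars / seconds
}

fn main() {
    let (n, bars, secs) = (58_300.0, 200_000.0, 0.085_863);
    // roughly 679,000 exact backtests per second
    println!("{:.0}", backtests_per_sec(n, secs));
    // roughly 1.358e11, i.e. about 135.8 billion pair bars per second
    println!("{:.3e}", pair_bars_per_sec(n, bars, secs));
}
```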
The same benchmark family remains strong even when the memory budget gets tighter. Under a 1 GB VRAM budget, the 58,300 pair ALMA workload still lands around 178.48 milliseconds, which is roughly 326,000 backtests per second. A strategy overlay version of the same benchmark, using the trend_quality profile rather than a plain crossover, comes in around 127.74 milliseconds. There is also a compact SMA against SMA reference run that settles near 86.33 milliseconds, which is useful as a smaller checkpoint, but the larger ALMA benchmark is the one that best captures what makes the app unusual.
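One way to picture the tighter budget is that the workload no longer fits in a single resident batch and has to be processed in tiles. The byte costs below are purely hypothetical assumptions for illustration, not VectorGrid's actual memory layout:

```rust
// Hypothetical tiling math under a fixed VRAM budget.
fn pairs_per_tile(budget_bytes: usize, bars: usize) -> usize {
    let price_bytes = bars * 4;        // one resident f32 price series (assumed)
    let per_pair_bytes = bars * 4 * 2; // two f32 indicator series per pair (assumed)
    (budget_bytes - price_bytes) / per_pair_bytes
}

fn tiles_needed(total_pairs: usize, per_tile: usize) -> usize {
    (total_pairs + per_tile - 1) / per_tile // ceiling division
}

fn main() {
    let per_tile = pairs_per_tile(1 << 30, 200_000); // 1 GB budget, 200k bars
    let tiles = tiles_needed(58_300, per_tile);
    println!("{} pairs per tile across {} tiles", per_tile, tiles);
}
```

Under these assumed costs the run splits into dozens of tiles, each with its own launch and reduction overhead, which is consistent with throughput dropping by roughly half rather than collapsing.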
Why the rest of the stack matters
Those numbers would be far less interesting if the rest of the application collapsed the moment real workflow complexity appeared. Instead, the desktop command surface is split into focused modules for data, optimization, validation, meta analysis, reports, artifacts, system inspection, and execution state. The front end keeps large result tables virtualized and throttles progress updates so the UI does not turn into the next bottleneck. Serious research tooling needs that kind of discipline because the value is not only in finding one fast answer. It is in being able to rerun, compare, inspect, and export the results without losing track of what happened.
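The progress-throttling idea is simple enough to sketch. The interval figure and type names below are assumptions for illustration, not the application's actual code:

```rust
use std::time::{Duration, Instant};

/// Minimal progress-throttle sketch: emit at most one UI update per interval,
/// so per-backtest progress events collapse into a steady trickle.
struct ProgressThrottle {
    min_interval: Duration,
    last_emit: Option<Instant>,
}

impl ProgressThrottle {
    fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_emit: None }
    }

    /// Returns true only when enough time has passed since the last update.
    fn should_emit(&mut self, now: Instant) -> bool {
        match self.last_emit {
            Some(prev) if now.duration_since(prev) < self.min_interval => false,
            _ => {
                self.last_emit = Some(now);
                true
            }
        }
    }
}

fn main() {
    // 100 ms here is an assumed cadence, not the application's setting.
    let mut throttle = ProgressThrottle::new(Duration::from_millis(100));
    let now = Instant::now();
    assert!(throttle.should_emit(now));  // first update always goes through
    assert!(!throttle.should_emit(now)); // same instant: suppressed
}
```

Passing the timestamp in explicitly keeps the throttle deterministic and testable, which matters in an app where millions of progress events can fire during one optimization run.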
That is also why the product story is stronger than a raw speed claim. The application does exact optimization at GPU speed, but it also carries a real validation layer, holdout aware meta selection, durable artifacts, report generation, and a credible post meta ML layer. For users on machines without a suitable GPU, the same desktop application can fall back to the SIMD accelerated CPU path and still remain interactive. In other words, the release is not just fast because one kernel benchmark looks good. It is fast because the product was simplified around the path that the measurements kept proving right.
For more on the indicator side of this stack, see GPU accelerated technical indicators and VRAM resident CUDA dispatch for technical indicators. For CPU side context, see SIMD vectorization for technical indicators. For product context, see Backtesting Engine.