
Our Technology Stack

How we use GPU and SIMD acceleration to deliver production performance for quantitative workloads.

Parallel Processing at Scale

Our technology stack leans on modern hardware capabilities, including GPUs and CPU SIMD instruction sets, to speed up real quantitative workloads.

GPU Acceleration

Custom CUDA kernels process millions of data points in parallel

SIMD Optimization

Vectorized CPU instructions for when a GPU isn't available

In-VRAM Processing

Keep data on the GPU during the whole calculation to avoid slow CPU↔GPU transfers

Performance Multiplier vs Standard CPU

  • GPU (CUDA): up to 45×
  • SIMD (AVX-512): up to 3×
  • Standard CPU: 1× (baseline)

  • Up to 3×: SIMD speedup on indicators with vectorized kernels
  • 45×: peak GPU acceleration in large indicator batch tests
  • 1M+: candles processed per second

Benchmarks are illustrative and hardware dependent. SIMD gains apply only to indicators that implement vector paths, and many indicators driven by recurrence remain close to scalar CPU speed.
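
For intuition about why recurrence limits vectorization: each output depends on the previous output, a loop-carried dependency that SIMD lanes cannot compute independently. A minimal C++ illustration (our example, not VectorAlpha code):

```cpp
// Exponential moving average: out[i] depends on out[i-1], so the loop
// cannot be vectorized across time steps the way a windowed sum can.
void ema(const float* x, float* out, int n, float alpha) {
    out[0] = x[0];
    for (int i = 1; i < n; ++i)
        out[i] = alpha * x[i] + (1.0f - alpha) * out[i - 1];
}
```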

GPU Acceleration

Our CUDA approach keeps data on the GPU (in VRAM) for the entire calculation. We minimize round trips to system memory so parallel work on the device isn’t cancelled out by transfer overhead.

  • In-VRAM pipelines to avoid unnecessary CPU↔GPU transfers (sketched below)
  • Efficient memory access and batching
  • Parallel kernels tuned for technical workloads
  • Overlapped work with multiple streams when helpful
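
To make the pattern concrete, here is a minimal CUDA C++ sketch of an in-VRAM pipeline. The kernels and the 20-period window are illustrative assumptions, not VectorAlpha's actual kernels; the point is that prices are copied to the device once, the intermediate SMA buffer never leaves VRAM, and only a compact result crosses back over the bus.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: windowed simple moving average, one output per thread.
__global__ void sma_kernel(const float* prices, float* sma, int n, int window) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n - window) {
        float sum = 0.0f;
        for (int k = 0; k < window; ++k) sum += prices[i + k];
        sma[i] = sum / window;
    }
}

// Second stage reads the first stage's output directly from VRAM.
__global__ void deviation_kernel(const float* prices, const float* sma,
                                 float* dev, int m, int window) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m) dev[i] = prices[i + window - 1] - sma[i];
}

int main() {
    const int n = 1 << 20;          // ~1M candles
    const int window = 20;          // hypothetical SMA period
    const int m = n - window + 1;

    float* h_prices = new float[n];
    for (int i = 0; i < n; ++i) h_prices[i] = 100.0f + 0.01f * (i % 100);

    float *d_prices, *d_sma, *d_dev;
    cudaMalloc(&d_prices, n * sizeof(float));
    cudaMalloc(&d_sma,    m * sizeof(float));
    cudaMalloc(&d_dev,    m * sizeof(float));

    // One host-to-device copy; every intermediate stays in VRAM.
    cudaMemcpy(d_prices, h_prices, n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (m + threads - 1) / threads;
    sma_kernel<<<blocks, threads>>>(d_prices, d_sma, n, window);
    deviation_kernel<<<blocks, threads>>>(d_prices, d_sma, d_dev, m, window);

    // Return only the compact result we actually need.
    float last_dev;
    cudaMemcpy(&last_dev, d_dev + (m - 1), sizeof(float), cudaMemcpyDeviceToHost);
    printf("latest deviation from SMA: %f\n", last_dev);

    cudaFree(d_prices); cudaFree(d_sma); cudaFree(d_dev);
    delete[] h_prices;
    return 0;
}
```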

SIMD Optimization

Leveraging AVX2 and AVX-512 instruction sets for vectorized CPU computations, processing multiple data points in a single instruction.

  • Hand-tuned assembly for critical paths
  • Runtime CPU feature detection
  • Aligned memory allocation
  • Compiler intrinsics for portability (sketched below)
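
As an illustration, the C++ sketch below (our example, assuming an x86-64 target and GCC or Clang) pairs runtime feature detection with an AVX2 path that processes eight floats per instruction, falling back to scalar code on older CPUs:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Scalar fallback used when AVX2 is absent at runtime.
static float sum_scalar(const float* x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i) s += x[i];
    return s;
}

// AVX2 path: eight floats per add. The target attribute compiles this one
// function with AVX2 without enabling it for the whole binary.
__attribute__((target("avx2")))
static float sum_avx2(const float* x, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);       // horizontal reduction
    float s = lanes[0] + lanes[1] + lanes[2] + lanes[3]
            + lanes[4] + lanes[5] + lanes[6] + lanes[7];
    for (; i < n; ++i) s += x[i];       // scalar tail
    return s;
}

int main() {
    std::vector<float> prices(1'000'000, 1.0f);
    // Runtime CPU feature detection (GCC/Clang builtin).
    float s = __builtin_cpu_supports("avx2")
            ? sum_avx2(prices.data(), prices.size())
            : sum_scalar(prices.data(), prices.size());
    printf("sum = %.1f\n", s);
    return 0;
}
```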

Why keep data on the GPU?

Moving large arrays back and forth between the CPU and GPU uses a relatively slow bus. By doing the full calculation in VRAM and only returning compact results, you get the benefit of parallel hardware without paying repeated transfer costs.
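
For rough scale (illustrative numbers, not a measured benchmark): one million float32 candles is 4 MB, and at a practical PCIe 4.0 x16 rate of roughly 20–25 GB/s, each direction of that transfer takes on the order of 160–200 µs. A simple indicator kernel on a modern GPU can finish in less time than that, so a pipeline that bounces intermediate arrays over the bus can easily spend more time transferring than computing.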

Performance Benchmarks

  • Simple Moving Average: 1M candles processed, 22× faster than CPU (GPU)
  • RSX Calculation: 1M candles processed, 3× faster with SIMD
  • Bollinger Bands: 1M candles processed, 18× GPU acceleration

Benchmarked on an NVIDIA RTX 4090 (GPU) and an AMD Ryzen 9 9950X (CPU)

*Performance varies by workload, data size, and hardware configuration

Open Source Philosophy

All VectorAlpha projects are open source, fostering transparency and community collaboration in quantitative finance.

Why Open Source?

  • Transparency builds trust in financial systems
  • Community contributions improve quality
  • Reproducible research advances the field
  • Accessible tools expand market participation

Apache License 2.0

All VectorAlpha projects are released under the Apache License 2.0, which provides broad flexibility for both commercial and non-commercial use and includes an explicit patent grant.

This permissive license allows you to use, modify, and distribute our software in your own projects without copyleft obligations.

Built with Modern Technologies


Rust

Memory safety without garbage collection. Zero-cost abstractions and fearless concurrency.

No Runtime Overhead
Thread Safety
Pattern Matching
Cargo Ecosystem

CUDA

Direct GPU programming for maximum performance. Custom kernels for financial computations.

Massive Parallelism
Shared Memory
Tensor Cores
Multi-GPU Support

WebAssembly

Near-native performance in browsers. Interactive demos without server infrastructure.

Browser Native
Sandboxed Security
Language Agnostic
Portable Binary

Ready to accelerate your quantitative workflows?

Join developers using VectorAlpha for high-performance financial computing.