Performance Tuning

Performance tuning only becomes useful once the workload is stated clearly. Otherwise the process turns into a pile of flags, profiler screenshots, and benchmark fragments that fail to describe the same system. In this stack, the right order is simple: measure the real bottleneck, remove obvious waste, and only then decide whether the answer is a better algorithm, a tighter CPU path, or a GPU path that can stay resident long enough to matter.

That order is less glamorous than hunting for a dramatic speedup first, but it is what keeps the result believable. Loop speed, workflow speed, and strategy quality are separate concerns. The tuning pass should tell you which one you are actually solving.

1. Profile the real workload first

Start with the workload you actually care about, even if a narrower synthetic benchmark would produce a cleaner chart. If the question is indicator throughput, profile that. If the question is a full backtest sweep, include the sweep. If the question is a desktop workflow, profile the full run.

cargo install flamegraph

# Profile the real binary and the real path
# cargo flamegraph builds with the release profile by default
cargo flamegraph --bin my_trading_bot -- backtest --data historical.csv

# Linux alternative
perf record -F 99 -g target/release/my_trading_bot
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg
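
If the release flamegraph comes back with unreadable or missing frames, the usual cause is that release builds drop debug symbols. Assuming the project's Cargo profiles have not already been customized, a small Cargo.toml addition keeps optimizations on while restoring symbol names:

[profile.release]
debug = true   # keep debug symbols so the flamegraph shows real function names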

Flamegraphs are useful because they tell you where wall-clock time is actually going. Benchmarks are useful because they tell you whether a targeted change moved that cost in a stable way. You usually need both.

2. Fix build and benchmark hygiene before code

A surprising amount of fake optimization disappears once the build is sane. Release builds, stable inputs, and a benchmark harness that states its contract clearly are the floor. If you compare a debug path to a release path, or a warm cache run to a cold workflow, the numbers are already contaminated.

cargo build --release
cargo test --release
RUSTFLAGS="-C target-cpu=native" cargo build --release

# Use a benchmark harness for narrow hot paths
cargo bench

What matters is whether the tuned build still represents the deployment target and whether the benchmark reflects the path you intend to speed up.
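
As a concrete sketch of what a benchmark for a narrow hot path can look like, the Criterion harness below measures a hypothetical moving-average function; the function, input size, and names are illustrative stand-ins, not this project's API.

// benches/indicator.rs -- minimal Criterion sketch; `sma` is a hypothetical hot path
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Placeholder scalar implementation standing in for the real indicator.
fn sma(values: &[f64], period: usize) -> Vec<f64> {
    values
        .windows(period)
        .map(|w| w.iter().sum::<f64>() / period as f64)
        .collect()
}

fn bench_sma(c: &mut Criterion) {
    let data: Vec<f64> = (0..100_000).map(|i| (i as f64).sin()).collect();
    c.bench_function("sma_20_on_100k", |b| {
        b.iter(|| sma(black_box(&data), black_box(20)))
    });
}

criterion_group!(benches, bench_sma);
criterion_main!(benches);

Registered under a [[bench]] entry with harness = false in Cargo.toml, this runs with the same cargo bench shown above. The point of the narrow benchmark is not the absolute number but whether a targeted change moves it in a stable direction.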

3. Remove wasted work before low-level tuning

The biggest wins often come from deleting unnecessary work before accelerating anything. Recomputing the same indicator pipeline repeatedly, copying large buffers between layers, allocating inside hot loops, or rebuilding parameter grids on every run will dominate the result long before AVX or CUDA has a chance to help.

In practice that means looking hard at data layout, reuse, and execution boundaries. Ask whether the series can stay contiguous, whether warmup handling is forcing needless passes, whether the strategy is recalculating indicators it could cache, and whether the host or device boundary is being crossed more often than necessary.
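
A minimal sketch of the allocation point, with hypothetical function names: the first version allocates a fresh buffer on every call and shows up as allocator time in the flamegraph, the second reuses a caller-owned buffer so the hot loop stays allocation-free after warmup.

// Illustrative only; the function names are not this project's API.
fn returns_allocating(prices: &[f64]) -> Vec<f64> {
    // a new Vec is allocated on every call
    prices.windows(2).map(|w| w[1] / w[0] - 1.0).collect()
}

fn returns_into(prices: &[f64], out: &mut Vec<f64>) {
    // the caller-owned buffer is cleared and refilled, so no allocation after the first growth
    out.clear();
    out.extend(prices.windows(2).map(|w| w[1] / w[0] - 1.0));
}

fn main() {
    let prices = vec![100.0, 101.0, 99.5, 102.0];
    let mut scratch = Vec::with_capacity(prices.len());
    for _ in 0..1_000 {
        let _fresh = returns_allocating(&prices); // pays for a new allocation every iteration
        returns_into(&prices, &mut scratch);      // recycles the same buffer
    }
}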

4. Choose the right execution path for the workload

Once the waste is removed, the next decision is which execution path the workload actually deserves. Small or irregular workloads usually belong on the CPU. Large contiguous indicator work often benefits from SIMD. Broad sweeps and device-resident pipelines may justify CUDA. Each path earns its place by matching the workload it was built for.

Ask a narrower question: what is the smallest execution mode that removes the current bottleneck? If SIMD already solves the problem, forcing the GPU into the loop can make the workflow worse.
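
One way to keep that question explicit is a small dispatcher that picks the cheapest path covering the workload. The thresholds and variants below are illustrative assumptions, not measured crossover points for this project.

// Illustrative dispatcher: thresholds and backend names are assumptions, not tuned values.
#[derive(Debug)]
enum ExecPath {
    Scalar,
    Simd,
    Gpu,
}

fn choose_path(rows: usize, gpu_available: bool) -> ExecPath {
    match rows {
        // small or irregular work: scalar avoids all setup overhead
        0..=10_000 => ExecPath::Scalar,
        // large contiguous work: SIMD amortizes well without leaving the host
        10_001..=1_000_000 => ExecPath::Simd,
        // wide sweeps: the GPU only wins if it is available and can stay resident
        _ if gpu_available => ExecPath::Gpu,
        _ => ExecPath::Simd,
    }
}

fn main() {
    println!("{:?}", choose_path(250_000, false)); // Simd under these made-up thresholds
}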

5. Parallelize after the contract is stable

Parallelism multiplies both speed and ambiguity. If the scalar path is still unclear or the benchmark contract is weak, Rayon, SIMD, and CUDA will only make the debugging harder. Once the reference behavior is stable, parallelism becomes much more valuable because you can measure the gain against something you trust.

The same rule applies across layers. Thread-level parallelism helps when rows or parameter sets are independent. SIMD helps when the inner loop is regular. CUDA helps when the work is wide enough and resident enough to amortize the transfer cost. Those are different tools for different shapes of work.
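
For the thread-level case, the typical shape is a Rayon sweep over independent parameter sets; run_backtest below is a hypothetical stand-in, and the only structural point is that each parameter set is scored without shared mutable state.

// Sketch of a parameter sweep with Rayon; `run_backtest` is a placeholder, not real strategy logic.
use rayon::prelude::*;

fn run_backtest(period: usize, threshold: f64, prices: &[f64]) -> f64 {
    prices.iter().sum::<f64>() / (period as f64 + threshold)
}

fn main() {
    let prices: Vec<f64> = (0..10_000).map(|i| 100.0 + (i as f64).sin()).collect();
    let params: Vec<(usize, f64)> = (5..50).flat_map(|p| [(p, 0.5), (p, 1.0)]).collect();

    // Each parameter set is independent, so the sweep parallelizes without locks.
    let scores: Vec<f64> = params
        .par_iter()
        .map(|&(period, threshold)| run_backtest(period, threshold, &prices))
        .collect();

    let best = scores.iter().cloned().fold(f64::MIN, f64::max);
    println!("best score: {best}");
}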

What a good tuning pass usually changes

  • It narrows the benchmark contract so the result means one specific thing.
  • It removes recomputation, copying, and allocation before adding lower-level tricks.
  • It keeps the scalar reference intact while faster paths are introduced.
  • It picks SIMD or CUDA because the workload proved they matter, not because they sound stronger.
  • It rechecks correctness after every optimization that changes execution mode or layout.
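
That recheck can be as small as comparing the accelerated output against the scalar reference within a floating-point tolerance; the functions below are hypothetical stand-ins for whichever indicator was optimized.

// Hypothetical correctness recheck against the scalar reference.
fn sma_scalar(values: &[f64], period: usize) -> Vec<f64> {
    values
        .windows(period)
        .map(|w| w.iter().sum::<f64>() / period as f64)
        .collect()
}

// Stand-in for a SIMD or GPU implementation; here it simply delegates to the scalar path.
fn sma_fast(values: &[f64], period: usize) -> Vec<f64> {
    sma_scalar(values, period)
}

fn main() {
    let data: Vec<f64> = (0..1_000).map(|i| 100.0 + (i as f64 * 0.1).sin()).collect();
    let reference = sma_scalar(&data, 20);
    let fast = sma_fast(&data, 20);

    assert_eq!(reference.len(), fast.len());
    // Exact equality is too strict once SIMD or GPU reassociates the floating-point sums.
    for (r, f) in reference.iter().zip(&fast) {
        assert!((r - f).abs() <= 1e-9 * r.abs().max(1.0), "optimized path diverged from the reference");
    }
    println!("optimized path agrees with the scalar reference");
}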

A practical checklist before calling it tuned

  1. Profile the real workload in a release build.
  2. Benchmark the narrow hot path separately from the end-to-end workflow.
  3. Eliminate obvious recomputation, hidden allocations, and needless data movement.
  4. Confirm the optimized path still agrees with the reference behavior.
  5. Only then decide whether the next gain belongs to CPU SIMD, multi-threading, or CUDA.

Next reads

For the broader performance framing, continue with Performance Optimization. If the next question is CPU-side acceleration, read SIMD Optimization Explained and SIMD vectorization for technical indicators. If the workload is large enough that the GPU is the real next step, move to GPU Acceleration Setup and GPU accelerated technical indicators.