
2026-03-28


Unlocking SIMD in Zig: When @exp Kills Your Vectorisation

Kernel Density Estimation (KDE) is a standard technique for estimating the probability density of a dataset without assuming a particular distribution. The core computation is simple: for each query point x, sum a kernel evaluated at every sample point x_i. It's typical to use a Gaussian function (normal distribution) as the kernel for its nice mathematical properties:

f(x) = (1 / (n · σ · √(2π))) · Σ exp(-(x - x_i)² / (2σ²))

The inner sum is embarrassingly parallel: each term is independent, which makes it an obvious candidate for SIMD.


Scalar Baseline

The scalar implementation is straightforward:

  const std = @import("std");
  const math = std.math;

  pub fn normalKernelDensityEstimation(x_sample: []const f32, standard_dev: f32, x: f32) f32 {
      const n: f32 = @floatFromInt(x_sample.len);
      const scale_factor: f32 = 1.0 / (standard_dev * @sqrt(2.0 * math.pi) * n);
      const inv_2var: f32 = 1.0 / (2.0 * standard_dev * standard_dev);

      var accumulator: f32 = 0.0;
      for (x_sample) |x_i| {
          const diff = x - x_i;
          accumulator += @exp(-(diff * diff) * inv_2var);
      }
      return accumulator * scale_factor;
  }

Naive Vectorised Version

Zig's @Vector(N, T) type maps to SIMD registers. The idea is to process N sample points simultaneously, reducing the number of loop iterations by a factor of N. Using std.simd.suggestVectorLength picks the natural width for the target CPU at compile time:

  pub fn normalKernelDensityEstimationVec(x_sample: []const f32, standard_dev: f32, x: f32) f32 {
      const vec_len = std.simd.suggestVectorLength(f32) orelse 8;
      const Vec = @Vector(vec_len, f32);

      const n: f32 = @floatFromInt(x_sample.len);
      const scale_factor: f32 = 1.0 / (standard_dev * @sqrt(2.0 * math.pi) * n);
      const inv_2var: f32 = 1.0 / (2.0 * standard_dev * standard_dev);

      const x_vec: Vec = @splat(x);
      const inv_2var_vec: Vec = @splat(inv_2var);

      var acc_vec: Vec = @splat(0.0);
      const chunks = x_sample.len / vec_len;

      var i: usize = 0;
      while (i < chunks) : (i += 1) {
          const chunk: Vec = x_sample[i * vec_len ..][0..vec_len].*;
          const diff: Vec = x_vec - chunk;
          acc_vec += @exp(-(diff * diff) * inv_2var_vec);
      }

      var acc_tail: f32 = @reduce(.Add, acc_vec);
      for (x_sample[chunks * vec_len ..]) |x_i| {
          const diff = x - x_i;
          acc_tail += @exp(-(diff * diff) * inv_2var);
      }
      return acc_tail * scale_factor;
  }

The structure is correct. The subtraction, multiply, and accumulation all vectorise. But the benchmark results were not what you'd expect:

  Scalar  n=1K  49.4µs
  Vec     n=1K  64.0µs

The vectorised version is slower than scalar. When you see a disparity like that, you know you're about to open a can of worms. The engineer in me could not let that slide... This algorithm must be SIMD'able.

The only exotic primitives that could be at fault here are @exp and @sqrt. Checking the Zig docs, we see that both @exp and @sqrt accept vectors, using dedicated SIMD instructions when available.

So what gives?


Why @exp kills the gains

Zig's @exp on a @Vector(N, f32) does not always lower to a SIMD exp instruction. From what I can find online, there is no general SIMD exp instruction in the x86 ISA. LLVM's default behaviour is to scalarise: it extracts each lane, calls the scalar expf, and reinserts the result. So @exp on a vector of 8 floats makes 8 serial scalar expf calls, with the overhead of packing and unpacking around them.

The diff and multiply work was genuinely vectorised. But @exp is the dominant cost per iteration, and scalarising it eliminates every gain made elsewhere. The vector version ends up doing the same exp work as scalar, with extra overhead on top.


The fix: approximate exp in pure SIMD

Now, I am sure I am missing something; there might be some magic combination of compiler flags, or an additional library I could pull in, to get a vectorised @exp. But I am lazy, and maybe ironically, there is less cognitive overhead for me in just building an @exp approximation that vectorises well than in hunting for that magic combination of flags.

And so here it is: a polynomial approximation that never leaves the vector registers. This is, after all, what SIMD math libraries do to approximate this function. It's easy enough to implement directly in Zig.

There are many polynomial approximations for exp. The most natural to derive is the Taylor expansion of exp.

The algorithm has three steps.

Step 1 — Range reduction. Rewrite exp(x) = 2^(x/ln2). Split x into an integer multiple of ln2 plus a small remainder r ∈ [-ln2/2, ln2/2]:

n = round(x / ln2)
r = x - n·ln2
exp(x) = exp(n·ln2 + r) = 2^n · exp(r)

r is now small enough that a short polynomial accurately approximates exp(r).

Step 2 — Polynomial evaluation. A degree-6 minimax polynomial over [-ln2/2, ln2/2] approximates exp(r) to within ~1.7e-7 — within f32 precision. Evaluated as a Horner scheme it's six fused multiply-adds, all of which vectorise perfectly:

p(r) = p0 + r·(p1 + r·(p2 + r·(p3 + r·(p4 + r·(p5 + r·p6)))))

Step 3 — Reconstruct 2^n via bit manipulation. In IEEE 754, a float's biased exponent lives at bits 23–30. Setting the exponent field to n + 127 and the mantissa to zero gives exactly 2^n, with no division or transcendental call:

  const n_i: @Vector(N, i32) = @intFromFloat(n_f);
  const scale_bits: @Vector(N, u32) = @bitCast((n_i + @as(@Vector(N, i32), @splat(127))) << @as(@Vector(N, u5), @splat(23)));
  const scale: Vec = @bitCast(scale_bits);

This bitcast is free at runtime — it's a reinterpretation of the register contents.

Putting it together:

const ln2_hi: f32 = 6.93359375e-1; // high bits of ln 2 (exactly representable)
const ln2_lo: f32 = -2.12194440e-4; // low-order correction

inline fn expApprox(comptime T: type, x: T) T {
  const is_vec = @typeInfo(T) == .vector;
  const N = if (is_vec) @typeInfo(T).vector.len else 1;
  const Vec = if (is_vec) T else @Vector(1, f32);

  const v: Vec = if (is_vec) x else @as(Vec, @splat(x));

  const x_clamped = @min(@max(v, @as(Vec, @splat(-87.3))), @as(Vec, @splat(88.7)));

  const n_f: Vec = @floor(x_clamped * @as(Vec, @splat(math.log2e)) + @as(Vec, @splat(0.5)));
  const r: Vec = x_clamped - n_f * @as(Vec, @splat(ln2_hi)) - n_f * @as(Vec, @splat(ln2_lo));

  var p: Vec = @splat(@as(f32, 1.3981999e-3));
  p = p * r + @as(Vec, @splat(@as(f32, 8.3334519e-3)));
  p = p * r + @as(Vec, @splat(@as(f32, 4.1665795e-2)));
  p = p * r + @as(Vec, @splat(@as(f32, 1.6666659e-1)));
  p = p * r + @as(Vec, @splat(@as(f32, 4.9999997e-1)));
  p = p * r + @as(Vec, @splat(@as(f32, 1.0)));
  p = p * r + @as(Vec, @splat(@as(f32, 1.0)));

  const n_i: @Vector(N, i32) = @intFromFloat(n_f);
  const scale_bits: @Vector(N, u32) = @bitCast((n_i + @as(@Vector(N, i32), @splat(127))) << @as(@Vector(N, u5), @splat(23)));
  const scale: Vec = @bitCast(scale_bits);

  const result = p * scale;
  return if (is_vec) result else result[0];
}

The comptime T parameter means the same code serves both the SIMD chunks and the scalar tail - no duplication.


Results

  benchmark   Scalar   Vec (naive)   Vec (poly exp)
  n=128       6.2µs    7.6µs         440ns
  n=1K        49.4µs   64.0µs        3.2µs
  n=8K        395µs    508µs         25.6µs
  n=64K       3.2ms    4.0ms         204µs

The polynomial approximation is roughly 14–16× faster than scalar across all sizes. The naive vector version was slower than scalar at every size.


Tradeoffs

The approximation has a maximum relative error of ~1.7e-7. For f32 KDE this is acceptable — f32 itself only has ~7 decimal digits of precision, and the density estimate is already approximate by construction. If you need double precision or exact reproducibility with @exp, the polynomial approach needs re-derivation for f64 coefficients, or you accept scalar throughput.

The other thing to be aware of is that the approximation clamps inputs to roughly [-87.3, 88.7]. Below -87.3, exp underflows towards zero in f32 anyway, and above ~88.7 it overflows to infinity. For KDE this is a non-issue: any sample point far enough from x contributes essentially nothing to the density.
