Inspired by the post by

Elan Ruskin (Valve) on x86 SQRT routines I thought I would visit this for my supercomputing platform of choice, the GPU. These kinds of low level trickery I left behind after finishing with RMM/Pixel-Juice some time around 2000, having decided that 3dNow! reciprocal square root routines were more than good enough..

Anyway, a brief overview of how we can do square roots:

- Calculate it with the FPU, (however that was implemented by the chip manafacturer).

- Calculate it from newton-raphson. This allows you to control the accuracy of the sqrt. (Or typicaly rsqrt) This comes in two flavours:

These approaches typically approximate the inverse square root, so this means we need to:
- Calculate it from the inverse. This comes in two flavours:

- Calculate the reciprical, then invert it (1/rsqrt(x)), this gives you correct results

- Multiply it by the input value (x*rsqrt(x)), this gives you faulty results around 0, but saves you a costly divide.

Note:

1.0f / rsqrtf(0.0f) = 1.0f / infinity = 0.0f

0.0f * rsqrtf(0.0f) = 0.0f * infinity = NaN

Elan's results indicated the x86 SSE units rsqrtss instruction was the fastest (no suprise - it is also a rough estimate), followed by SSE rsqrt with newton-raphson iteration for improvement, then Carmack’s Magic Number rsqrt, and finally the x86 FPU's sqrt. Note that many people don't actually get to the slowest point on Elan's scale, since they don't enable intrinsics when compiling, meaning that the C compiler will use the C library sqrt routine, and not the FPU.

I decided to test three routines for the GPU:

- native sqrt
- native rsqrt
- Carmack's rsqrt

Now benchmarking and timing on the lowest level has always been somewhat of a black art (see

Ryan Geiss's article on timing), but that is even more true on the GPU - you need to worry about block sizes, as well as the type of timer, etc.

I did my best at generating reliable results by testing block sizes from 2..256 and performing 2.5 million sqrt operations. Here are the results from my nVidia 9800GX2:

Method | Total time | Max. ticks per float | Avg. ticks per float | Std. Dev. | Avg. Error |

GPU intrinsic SQRT | 1.285ms | 5.99 | 3.99 | 0.00 | 0.00% |

GPU intrinsic RSQRT * x | 1.281ms | 5.99 | 3.99 | 0.00 | 0.00% |

Carmack RSQRT * x | 2.759ms | 6.28 | 4.26 | 0.01 | 0.09% |

Total time is the total time measured by the CPU that the GPU took to launch the kernel and calculate the results. The clock ticks are meant to be more accurate measurements using the GPU's internal clock, but I find that to be dubious.

The conclusions to take from these results are simple: Carmack's inverse and other trickery isn't going to help, using the GPU RSQRT function as opposed to the inbuilt SQRT function saves you about a clock tick or two. (Probably because nVidias SQRT is implemented as 1/RSQRT, as opposed to X*RSQRT)

I'm happy to say, low level optimization tricks are still safely a thing of the past.

You can get the code for the CUDA benchmark here:

GPU SQRT snippet.