It comes from a practical problem: I need evaluate tension spline very frequently. The most expensive part is to compute: sinh(x-l), sinh(r-l), sinh(r-x).
It is simple to see that I only need evaluate exp(x-l) and exp(r-x), it leads to 40% performance boost. essentially, we replace 3 slow calls by 2 slow calls.
Caveat: mathematically, we are correctly. But numerically, it is less obvious or arguable. To be strict, compiler will not be able to assume (a+b)+c = a+(b+c).
We can defend us (and the approximation below): all calculations in computers are approximate, even hardware from different vendors have slightly different implementation. We should ONLY focus on the study if the approximation is sufficient.
See the links below for general defense line.
http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=126611&highlight_key=y
http://forums.nvidia.com/index.php?showtopic=152829
Now our task is to compute exp(x-l) and exp(r-x); or a duo exp evaluation. Recall that SSE2 has 128bit register to hold 2 doubles and perform parallel calculations on them. It should help me. In fact, my code below achieves 40% speed up, when compiled with VS 2010 Express.
Caveats
1. our calculation is approximate, and I did not study the accuracy yet. But simply compare to even rougher approximation shows: the speedup is real.
2. VS2008 Express can not give me any gain.
3. Intel C++? good idea, but need rewrite the code, as I use Microsoft C++ intrinsics.
Implementation
First, know how to evaluate one exp() in C++. Then parallelize it and write the code for double-double pair in SSE2 intrinsics.
First of all, there is no assembly instruction for exp(). We only have MUL, ADD, SQRT, etc. So we need polynomial approximation. The classic way of doing this is via Taylor expansion. Also, we need pay attention on a few coding tricks to save more time in Taylor expansion evaluation.
I don’t need write my version. Just use Cephes library from Netlib. http://www.netlib.org/cephes/
Then we can see SSE/SSE2 at SF version of Cephes as SseMath.
http://gruntthepeon.free.fr/ssemath/
But I need DF precision! Easy, do some translation. We end up with the following code.
// SSE2
#include <emmintrin.h>
#define SSE2_DOUBLE_CONST(Key, v) \
static const __declspec(align(16)) double _D_##Key[2] = { v, v }
// repeat a const to be in packed form
#define SSE2_INT_CONST(Key, v) \
static const __declspec(align(16)) int _I_##Key[4] = {v, v, v, v}
/* SSE2 defs start */
SSE2_DOUBLE_CONST(1, 1.0);
SSE2_DOUBLE_CONST(0p5, 0.5);
SSE2_DOUBLE_CONST(cephes_LOG2, 1.44269504088896341);
SSE2_DOUBLE_CONST(cephes_C1, 0.693359375);
SSE2_DOUBLE_CONST(cephes_C2, -2.12194440e-4);
SSE2_DOUBLE_CONST(cephes_p0, 1.9875691500E-4);
SSE2_DOUBLE_CONST(cephes_p1, 1.3981999507E-3);
SSE2_DOUBLE_CONST(cephes_p2, 8.3334519073E-3);
SSE2_DOUBLE_CONST(cephes_p3, 4.1665795894E-2);
SSE2_DOUBLE_CONST(cephes_p4, 1.6666665459E-1);
SSE2_DOUBLE_CONST(cephes_p5, 5.0000001201E-1);
SSE2_DOUBLE_CONST(exp_upper, 88.3762626647949);
SSE2_DOUBLE_CONST(exp_lower, -88.3762626647949);
SSE2_INT_CONST(0x7f, 0x7f);
/* SSE2 defs end */
void exp_sse2(double in[2], double out[2])
{
__m128d x = _mm_load_pd(in);
__m128d tmp = _mm_setzero_pd();
__m128d fx;
__m128i emm0;
__m128d oneD = *(__m128d*) _D_1;
x = _mm_min_pd(x, *(__m128d*)_D_exp_upper);
x = _mm_max_pd(x, *(__m128d*)_D_exp_lower);
// exp(x) = exp( z + n log(2) ) = exp(z) * n^2 , where 0<=z < 1
fx = _mm_mul_pd(x, *(__m128d*)_D_cephes_LOG2);
fx = _mm_add_pd(fx, *(__m128d*)_D_0p5);
emm0 = _mm_cvttpd_epi32(fx); //overloaded to handle _m128d
tmp = _mm_cvtepi32_pd(emm0);
__m128d flag = _mm_cmpgt_pd(tmp, fx); // ? 0xffffffffffffffff : 0x0
flag = _mm_and_pd(flag, oneD);
fx = _mm_sub_pd(tmp, flag);
tmp = _mm_mul_pd(fx, *(__m128d*)_D_cephes_C1);
__m128d tmp2 = _mm_mul_pd(fx, *(__m128d*)_D_cephes_C2);
x = _mm_sub_pd(x, tmp);
x = _mm_mul_pd(x, tmp2);
__m128d z = _mm_mul_pd(x, x);
__m128d y = *(__m128d*)_D_cephes_p0;
y = _mm_mul_pd(y, x);
y = _mm_add_pd(y, *(__m128d*)_D_cephes_p1);
y = _mm_mul_pd(y, x);
y = _mm_add_pd(y, *(__m128d*)_D_cephes_p2);
y = _mm_mul_pd(y, x);
y = _mm_add_pd(y, *(__m128d*)_D_cephes_p3);
y = _mm_mul_pd(y, x);
y = _mm_add_pd(y, *(__m128d*)_D_cephes_p4);
y = _mm_mul_pd(y, x);
y = _mm_add_pd(y, *(__m128d*)_D_cephes_p5);
y = _mm_mul_pd(y, z);
y = _mm_add_pd(y, x);
y = _mm_add_pd(y, oneD);
/* build 2^n */
emm0 = _mm_cvttpd_epi32(fx);
emm0 = _mm_add_epi32(emm0, *(__m128i*)_I_0x7f);
emm0 = _mm_slli_epi32(emm0, 23);
__m128d pow2n = _mm_castsi128_pd(emm0);
y = _mm_mul_pd(y, pow2n);
_mm_store_pd(out, y);
}
MISC
http://eigen.tuxfamily.org/index.php?title=FAQ#Vectorization
A linear algebra library, with SSE.
http://simdx86.sourceforge.net/
libSIMDx86 v0.4.0
The optimized SIMD library for x86 processors.
libSVM experiment: SSE extension
see source code for exp() kernel implementation in SSE2.
http://bazaar.launchpad.net/~olivier-grisel/simd4libsvm/libsvm.og.main/view/head:/src/kernel/kernel_sse2.cpp