MUDA: MUltiple Data Accelerator language
Project Overview
Feb 24, 2008
Syoyo FUJITA


Page 1: Muda Proposal

MUDA: MUltiple Data Accelerator language

Project Overview
Feb 24, 2008

Syoyo FUJITA

Page 2: Muda Proposal

?

Page 3: Muda Proposal

[Chart: Nikkei 225 index]

Page 4: Muda Proposal

?

Page 5: Muda Proposal

[Chart: peak Gflops, Feb 2007 -> Feb 2008]

Mac Pro (octa-core): 204 Gflops (+800 %)

GeForce 9800 GX2 (rumor): 1 TFlops? (3x of G80), or 500 GFlops? (+50% of G80)

PS3: 179.2 Gflops (no update!)

GPU slumps, CPU soars

Page 6: Muda Proposal

[Chart: Nikkei 225 index]

Page 7: Muda Proposal

[Chart: Nikkei 225 index]

Future of GPU trend

Subprime shock! Credit boom ends! US economy declines! Green IT!

Page 8: Muda Proposal

CPU GPU

GPGPU

Accelerated computing

many-core

Page 9: Muda Proposal

CPU GPU

GPGPU

Accelerated computing

many-core

NO!

GPGPU is dead!! GPU will be dead soon!!

Page 10: Muda Proposal

Why GPU -> GPGPU is BAD

• Large latency: host <-> GPU over PCI-Express

• Internal architecture is a black box

• Only the GPU maker knows it

• High cost of branching

• Debugger?

• Programs run only on a specific GPU maker's GPUs

• Not portable.

Page 11: Muda Proposal

Why CPU -> Accelerated computing is GOOD

• Easy to program

• CPU makers provide good internal spec documentation

• Fast branch execution

• gdb :-)

• Portable & Versatile

Page 12: Muda Proposal

CPU

Accelerated computing

many-core

MUDA

Page 13: Muda Proposal

MUDA’s goal

• Draw out the CPU's maximum floating-point performance for large data

• SIMD

• Cache optimized computation
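To make the "cache optimized computation" goal concrete, here is a hedged sketch (my own illustration, not MUDA output; the function name and the sizes N and B are made up): classic loop blocking, where a matrix transpose walks memory in cache-sized tiles instead of whole rows.

```c
#include <stddef.h>

/* Illustrative sketch of cache-aware computation (not MUDA output):
 * transpose an N x N float matrix in B x B tiles so each tile of
 * src and dst stays resident in cache while it is worked on. */
enum { N = 256, B = 32 };  /* sizes chosen for illustration only */

void transpose_blocked(const float *src, float *dst)
{
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = ii; i < ii + B; i++)    /* walk one tile */
                for (size_t j = jj; j < jj + B; j++)
                    dst[j * N + i] = src[i * N + j];
}
```

The naive row-by-row transpose touches dst with stride N and misses cache on nearly every store once N exceeds the cache; the tiled version reuses each fetched cache line B times.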

Page 14: Muda Proposal

MUDA example

vec sqrtmu(vec x)
{
    vec y0, y0x, y0xhalf;
    vec oneish = bit(0x3f800001);

    y0      = rsqrt(x);
    y0x     = y0 * x;
    y0xhalf = 0.5 * y0x;

    return ((oneish - y0 * y0x) * y0xhalf + y0x);
}

MUDA code

Page 15: Muda Proposal

__m128 sqrtmu(const __m128 *x)
{
    __m128 y0;
    __m128 y0x;
    __m128 y0xhalf;
    const __m128 t_vec4 = (__m128)_mm_set1_epi32(1065353217);
    __m128 oneish = t_vec4;
    const __m128 t_vec6 = (*x);
    const __m128 t_vec5 = _mm_rsqrt_ps(t_vec6);
    y0 = t_vec5;
    const __m128 t_vec8 = y0;
    const __m128 t_vec9 = (*x);
    const __m128 t_vec7 = _mm_mul_ps(t_vec8, t_vec9);
    y0x = t_vec7;
    const float t_float13 = 0.5;
    const float t_float12 = t_float13;
    const __m128 t_vec10 = _mm_set_ps1(t_float12);
    const __m128 t_vec14 = y0x;
    const __m128 t_vec11 = _mm_mul_ps(t_vec10, t_vec14);
    y0xhalf = t_vec11;
    const __m128 t_vec19 = oneish;
    const __m128 t_vec20 = y0;
    const __m128 t_vec21 = y0x;
    const __m128 t_vec15 = _mm_mul_ps(t_vec20, t_vec21);
    const __m128 t_vec16 = _mm_sub_ps(t_vec19, t_vec15);
    const __m128 t_vec22 = y0xhalf;
    const __m128 t_vec17 = _mm_mul_ps(t_vec16, t_vec22);
    const __m128 t_vec23 = y0x;
    const __m128 t_vec18 = _mm_add_ps(t_vec17, t_vec23);
    return t_vec18;
}

x86/SSE output

Page 16: Muda Proposal

Why MUDA?

Page 17: Muda Proposal

No unified way to describe SIMD ops

• SSE: _mm_add_ps()

• AltiVec: vec_add

• SPE: spu_add
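This fragmentation is exactly what a portability layer (or a code generator like MUDA) has to paper over. A hedged sketch, assuming the usual GCC-style predefined macros (__SSE__, __ALTIVEC__, __SPU__); the vec4f name and scalar fallback are mine:

```c
#include <string.h>  /* memcpy: portable way to load/store vec4f in examples */

/* One 4-wide float add, spelled per-ISA. A tool like the MUDA
 * compiler automates exactly this kind of dispatch. */
#if defined(__SSE__)
  #include <xmmintrin.h>
  typedef __m128 vec4f;
  static inline vec4f vec4f_add(vec4f a, vec4f b) { return _mm_add_ps(a, b); }
#elif defined(__ALTIVEC__)
  #include <altivec.h>
  typedef __vector float vec4f;
  static inline vec4f vec4f_add(vec4f a, vec4f b) { return vec_add(a, b); }
#elif defined(__SPU__)
  #include <spu_intrinsics.h>
  typedef vec_float4 vec4f;
  static inline vec4f vec4f_add(vec4f a, vec4f b) { return spu_add(a, b); }
#else
  /* plain-C fallback so the sketch compiles anywhere */
  typedef struct { float v[4]; } vec4f;
  static inline vec4f vec4f_add(vec4f a, vec4f b)
  {
      vec4f r;
      for (int i = 0; i < 4; i++) r.v[i] = a.v[i] + b.v[i];
      return r;
  }
#endif
```

All three vector types are 16 bytes, so operands can be loaded and stored with memcpy regardless of which backend was selected.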

Page 18: Muda Proposal

CPU ISA changes frequently

• SSE2(2000), SSE3(2004), SSE4(2006)

• SSE5 and Coming New CPU design(?)

• 8-element SIMD?, no SIMD in the future CPU?

• Keeping up with them is hard and unproductive: a waste of your time.

Page 19: Muda Proposal

MUDA -> MUDA compiler -> SSE2 C code / SSE4 C code / VMX C code / LLVM IR

In: portable, CPU-independent description. Out: CPU- or arch-dependent code.

Page 20: Muda Proposal

Status

• SSE2 backend : 75 %

• SSE4 backend : 0 %

• VMX backend : 20 %

• LLVM IR backend : 30 %

• SIMD math function for MUDA : 5 %

• Automatic optimizer : TODO

(= items I'm currently working on)

Page 21: Muda Proposal

Future direction

• Cache miss analysis and memory access optimization

• Valgrind, Cache Miss Equation(CME)

• Automatic optimization

• As FFTW, ATLAS, and Spiral do

• Automatic error measurement for floating point computation

• Interval Arithmetic, Affine Arithmetic, Gappa
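To make the last bullet concrete, here is a hedged toy sketch of interval arithmetic (my own illustration; not Gappa, and not an existing MUDA feature): every value carries a [lo, hi] range, and each operation widens the range so the true result is always bounded.

```c
/* Toy interval arithmetic (illustration only; a real implementation
 * must also direct the rounding mode so bounds stay conservative). */
typedef struct { double lo, hi; } interval;

static interval iv_add(interval a, interval b)
{
    return (interval){ a.lo + b.lo, a.hi + b.hi };
}

static interval iv_mul(interval a, interval b)
{
    /* the product's range is spanned by the four corner products */
    double p[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    interval r = { p[0], p[0] };
    for (int i = 1; i < 4; i++) {
        if (p[i] < r.lo) r.lo = p[i];
        if (p[i] > r.hi) r.hi = p[i];
    }
    return r;
}
```

Run over a whole expression such as the sqrtmu() refinement, this kind of bookkeeping would report how wide the error band of the low-precision rsqrt path can get.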

Page 22: Muda Proposal

Performance gap

[Bar chart: SIMD vs. Memory, higher is better]

Scalar : SIMD = 1 : 4

cache miss : cache hit = 1 : 100

Page 23: Muda Proposal

Performance gap


Optimizing memory access is much more important than SIMDization