[osxdev] metal
OSXDEV Open Seminar - Catching Up with WWDC (transcript)
Recap: WWDC 2014
• Swift
• Yosemite
• Metal
I’m so happy that I was too lazy to learn Objective-C
Game Industry Trend (maybe or not)
• C++: OOP, design patterns, TDD
• C: FP / PP / DOP, fast iteration, immutability
Game Industry Trend (maybe or not)
• C++ / Objective-C: OOP, design patterns, TDD
• C / Swift: FP / PP / DOP, fast iteration, immutability
Explaining Metal (seen season one before?)
Explaining Metal (season one: BAAM!)
A Word from the Boss
http://www.bloter.net/archives/195819
This Talk
• No API details
• No code (of my own)
• No demo
CPU vs GPU
[Diagram: a CPU die mostly control logic and cache with a few ALUs, beside a GPU die packed with ALUs; each attached to its own DRAM]
From ACM Queue, March/April 2008 (www.acmqueue.com):
…the duration of processing for a single frame, different stages will dominate overall execution, often resulting in bandwidth- and compute-intensive phases of execution. Maintaining an efficient mapping of the graphics pipeline to a GPU’s resources in the face of this variability is a significant challenge, as it requires processing and on-chip storage resources to be dynamically reallocated to pipeline stages, depending on current load.
Mixture of predictable and unpredictable data access. The graphics pipeline rigidly defines inter-stage data flows using streams of entities. This predictability presents opportunities for aggregate prefetching of stream data records and highly specialized hardware management of on-chip storage resources. In contrast, buffer and texture accesses performed by shaders are fine-grained memory operations on dynamically computed addresses, making prefetch difficult. As both forms of data access are critical to maintaining high throughput, shader programming models explicitly differentiate stream from buffer/texture memory accesses, permitting specialized hardware solutions for both types of accesses.
Opportunities for instruction stream sharing. While the shader programming model permits each shader invocation to follow a unique stream of control, in practice, shader execution on nearby stream elements often results in the same dynamic control-flow decisions. As a result, multiple shader invocations can likely share an instruction stream. Although GPUs must accommodate situations where this is not the case, instruction stream sharing across multiple shader invocations is a key optimization in the design of GPU processing cores and is accounted for in algorithms for pipeline scheduling.
A large fraction of a GPU’s resources exist within programmable processing cores responsible for executing shader functions. While substantial implementation differences exist across vendors and product lines, all modern GPUs maintain high efficiency through the use of multi-core designs that employ both hardware multi-threading and SIMD (single instruction, multiple data) processing. As shown in table 1, these throughput-computing techniques are not unique to GPUs (top two rows). In comparison with CPUs, however, GPU designs push these ideas to extreme scales.
Multicore + SIMD Processing = Lots of ALUs. A thread of control is realized by a stream of processor instructions that execute within a processor-managed environment, called an execution (or thread) context. This context consists of states such as a program counter, a stack pointer, general-purpose registers, and virtual memory mappings. A multicore processor replicates processing resources (both ALUs and execution contexts) and organizes them into independent cores. When an application features multiple threads of control, multicore architectures provide increased throughput by executing these instruction streams on each core in parallel. For example, an Intel Core 2 Quad contains four cores and can execute four instruction streams simultaneously. As significant parallelism exists across shader invocations, GPU designs easily push core counts higher. High-end models contain up to 16 cores per chip.
Even higher performance is possible by populating each core with multiple floating-point ALUs. This is done efficiently with SIMD processing, which uses each ALU to perform the same operation on a different piece of data. The most common implementation of SIMD processing is via explicit short-vector instructions, similar to those provided by the x86 SSE or PowerPC Altivec ISA extensions. These extensions provide a SIMD width of four, with instructions that control the operation of four ALUs. Alternative implementations, such as NVIDIA’s 8-series architecture, perform SIMD execution by implicitly sharing…
TABLE 1
Type  Processor              Cores/Chip  ALUs/Core[3]  SIMD width  Max T[4]
GPUs  AMD Radeon HD 2900          4          80            64         48
      NVIDIA GeForce 8800        16           8            32         96
CPUs  Intel Core 2 Quad[1]        4           8             4          1
      STI Cell BE[2]              8           4             4          1
      Sun UltraSPARC T2           8           1             1          4

[1] SSE processing only; does not account for the x86 FPU.
[2] Stream-processing (SPE) cores only; does not account for the PPU cores.
[3] 32-bit floating point (all ALUs are multiply-add except the Intel Core 2 Quad's).
[4] The ratio of core thread contexts to simultaneously executable threads. We use the ratio T (rather than the total number of per-core thread contexts) to describe the extent to which processor cores automatically hide thread stalls via hardware multithreading.
Apple A7
http://www.anandtech.com/show/8116/some-thoughts-on-apples-metal-api
Why do we need a driver?
• The GPU runs asynchronously
• Different address space
• Different ISA
• The display is updated per frame
Drawing a Picture
• Lay out a sheet of drawing paper
• (Think about what to draw)
• Pick brushes and paints
• Paint with the brush
• (Crumple it up, or hang it on the wall)
• Lay out a new sheet
Drawing a Picture / Graphics App
• Lay out a sheet of paper / Framebuffer setup
• (Think about what to draw) / Data setup
• Pick brushes and paints / State setup
• Paint with the brush / Draw call
• (Crumple it up, or hang it up) / Update a frame
• Lay out a new sheet / Framebuffer clear
The graphics driver provides APIs for every step of this process.
Layers of a Graphics Driver
• API interface
• State management
• Command queue management
• I/O controller
• Shader compiler
What the Graphics Driver Does (why is it expensive?)
• State validation
  ■ Confirming API usage is valid
  ■ Encoding API state to hardware state
• Shader compilation
  ■ Run-time generation of shader machine code
  ■ Interactions between state and shaders
• Sending work to the GPU
  ■ Managing resource residency
  ■ Batching commands
State Validation: OpenGL
void glTexImage2D(GLenum target, GLint level,
                  GLint internalFormat,
                  GLsizei width, GLsizei height,
                  GLint border,
                  GLenum format, GLenum type,
                  const GLvoid *data);
Shader Compilation (are you kidding?)
• No standard for pre-built shaders
• No standard shader binary format
int Init(ESContext *esContext)
{
    UserData *userData = esContext->userData;
    GLbyte vShaderStr[] =
        "attribute vec4 vPosition;    \n"
        "void main()                  \n"
        "{                            \n"
        "    gl_Position = vPosition; \n"
        "}                            \n";
    GLbyte fShaderStr[] =
        "precision mediump float;                     \n"
        "void main()                                  \n"
        "{                                            \n"
        "    gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0); \n"
        "}                                            \n";
    GLuint vertexShader;
    GLuint fragmentShader;
    GLuint programObject;
    GLint linked;
Shader: 음영(陰影), "shading"
• A shader shades - it paints an object darker
courtesy of 西川善司
Sending Work to the GPU (copy-paste)
• Batching commands and committing them
• Transferring data and textures
Metal: Design Targets
• Low CPU overhead
• More predictable performance
• Better programmability
Metal: Key Ideas
• Create and validate state up front
• Shaders can be compiled offline
• Enable versatile multi-threading
• Shared memory for CPU & GPU
• Handle synchronisation explicitly
• Tile-based deferred rendering
• C++11-based shading language
• No legacy baggage
• Compute shaders
But: A7 only. What the x
Multi-threading
Code Comparison: Metal vs. OpenGL ES
So What? (what low CPU overhead enables)
• More draw calls
• More objects
• Better physics
• Better AI
• More complex logic
• Lower battery usage
How Do I Start? (use an engine, or forget it)
• Unity 5 (next year) - free / $4,500
• Unreal 4 (maybe this year) - $19/month
• Cocos2D - free
• Xcode template
Proprietary API
• Apple is a promoter of the Khronos Group
• The OpenCL story
• Kicking away the ladder now that the game is over?
  ■ But Google is no fool (Expansion Pack)
Conclusion (no big deal if you skip it)
• Low CPU overhead
• You can do something more
• A7 only (either you can't, or you can't be bothered)
• Game-changer? Maybe or not