[osxdev]metal

Metal girl+sk8er osxdev.org

Upload: naver-d2

Post on 15-Jan-2015


DESCRIPTION

OSXDEV Open Seminar: Catching Up with WWDC

TRANSCRIPT

Page 1: [Osxdev]metal

Metal

girl+sk8er osxdev.org

Page 2: [Osxdev]metal

WWDC 2014: Recap

• Swift
• Yosemite
• Metal

Page 3: [Osxdev]metal

I’m so happy that I was too lazy to learn Objective-C

Page 4: [Osxdev]metal

Game Industry Trend: maybe or not

• C++: OOP, Design Pattern, TDD
• C: FP / PP / DOP, Fast Iteration, Immutability

Page 5: [Osxdev]metal

Game Industry Trend: maybe or not

• C++ / Objective-C: OOP, Design Pattern, TDD
• C / Swift: FP / PP / DOP, Fast Iteration, Immutability

Page 6: [Osxdev]metal

Explaining Metal: seen season one before?

Page 7: [Osxdev]metal

Explaining Metal: Season one BAAM!

Page 8: [Osxdev]metal

A Word from the Boss

http://www.bloter.net/archives/195819

Page 9: [Osxdev]metal

This talk

• No API in detail
• No code (of my own)
• No demo

Page 10: [Osxdev]metal

CPU vs GPU

[Figure: block diagrams comparing a CPU and a GPU; labels: Control, Cache, ALU, DRAM]

Excerpt source: ACM Queue, March/April 2008, p. 23 (www.acmqueue.com)

the duration of processing for a single frame, different stages will dominate overall execution, often resulting in bandwidth- and compute-intensive phases of execution. Maintaining an efficient mapping of the graphics pipeline to a GPU’s resources in the face of this variability is a significant challenge, as it requires processing and on-chip storage resources to be dynamically reallocated to pipeline stages, depending on current load.

Mixture of predictable and unpredictable data access. The graphics pipeline rigidly defines inter-stage data flows using streams of entities. This predictability presents opportunities for aggregate prefetching of stream data records and highly specialized hardware management on-chip storage resources. In contrast, buffer and texture accesses performed by shaders are fine-grained memory operations on dynamically computed addresses, making prefetch difficult. As both forms of data access are critical to maintaining high throughput, shader programming models explicitly differentiate stream from buffer/texture memory accesses, permitting specialized hardware solutions for both types of accesses.

Opportunities for instruction stream sharing. While the shader programming model permits each shader invocation to follow a unique stream of control, in practice, shader execution on nearby stream elements often results in the same dynamic control-flow decisions. As a result, multiple shader invocations can likely share an instruction stream. Although GPUs must accommodate situations where this is not the case, instruction stream sharing across multiple shader invocations is a key optimization in the design of GPU processing cores and is accounted for in algorithms for pipeline scheduling.

A large fraction of a GPU’s resources exist within programmable processing cores responsible for executing shader functions. While substantial implementation differences exist across vendors and product lines, all modern GPUs maintain high efficiency through the use of multi-core designs that employ both hardware multi-threading and SIMD (single instruction, multiple data) processing. As shown in table 1, these throughput-computing techniques are not unique to GPUs (top two rows). In comparison with CPUs, however, GPU designs push these ideas to extreme scales.

Multicore + SIMD Processing = Lots of ALUs. A thread of control is realized by a stream of processor instructions that execute within a processor-managed environment, called an execution (or thread) context. This context consists of states such as a program counter, a stack pointer, general-purpose registers, and virtual memory mappings. A multicore processor replicates processing resources (both ALUs and execution contexts) and organizes them into independent cores. When an application features multiple threads of control, multicore architectures provide increased throughput by executing these instruction streams on each core in parallel. For example, an Intel Core 2 Quad contains four cores and can execute four instruction streams simultaneously. As significant parallelism exists across shader invocations, GPU designs easily push core counts higher. High-end models contain up to 16 cores per chip.

Even higher performance is possible by populating each core with multiple floating-point ALUs. This is done efficiently with SIMD processing, which uses each ALU to perform the same operation on a different piece of data. The most common implementation of SIMD processing is via explicit short-vector instructions, similar to those provided by the x86 SSE or PowerPC Altivec ISA extensions. These extensions provide a SIMD width of four, with instructions that control the operation of four ALUs. Alternative implementations, such as NVIDIA’s 8-series architecture, perform SIMD execution by implicitly shar-

TABLE 1

Type  Processor              Cores/Chip  ALUs/Core [3]  SIMD width  MaxT [4]
GPUs  AMD Radeon HD 2900     4           80             64          48
GPUs  NVIDIA GeForce 8800    16          8              32          96
CPUs  Intel Core 2 Quad [1]  4           8              4           1
CPUs  STI Cell BE [2]        8           4              4           1
CPUs  Sun UltraSPARC T2      8           1              1           4

[1] SSE processing only, does not account for x86 FPU.
[2] Stream processing (SPE) cores only, does not account for PPU cores.
[3] 32-bit, floating point (all ALUs are multiply-add except the Intel Core 2 Quad).
[4] The ratio of core thread contexts to simultaneously executable threads. We use the ratio T (rather than the total number of per-core thread contexts) to describe the extent to which processor cores automatically hide thread stalls via hardware multithreading.
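The "Multicore + SIMD Processing = Lots of ALUs" heading in the excerpt can be checked with simple arithmetic: total ALUs per chip is cores per chip times ALUs per core. The sketch below uses the figures from Table 1; the struct and function names are illustrative only and do not come from the article.

```c
#include <stddef.h>

/* One row of Table 1: cores per chip and 32-bit floating-point ALUs per core. */
struct processor {
    const char *name;
    int cores_per_chip;
    int alus_per_core;
};

/* Figures copied from Table 1 of the excerpt. */
static const struct processor table1[] = {
    {"AMD Radeon HD 2900",  4, 80},
    {"NVIDIA GeForce 8800", 16, 8},
    {"Intel Core 2 Quad",   4,  8},
    {"STI Cell BE",         8,  4},
    {"Sun UltraSPARC T2",   8,  1},
};

/* Total ALUs on a chip = cores x ALUs per core. */
static int total_alus(const struct processor *p)
{
    return p->cores_per_chip * p->alus_per_core;
}
```

By this count the two GPUs have 320 and 128 ALUs, while the three CPUs have 32, 32, and 8, which is the gap in scale the excerpt describes.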

Page 11: [Osxdev]metal

Apple A7

http://www.anandtech.com/show/8116/some-thoughts-on-apples-metal-api

Page 12: [Osxdev]metal

Why should we use a driver?

Page 13: [Osxdev]metal

Why should we use a driver?

• The GPU runs asynchronously
• Different address space
• Different ISA
• The display is updated frame by frame

Page 14: [Osxdev]metal

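Drawing a Picture

• Unfold a sheet of drawing paper
• (Think about what to draw)
• Pick a brush and paints
• Paint the picture with the brush
• (Crumple it up or hang it on the wall)
• Unfold a fresh sheet of paper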

Page 15: [Osxdev]metal

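Drawing a Picture / Graphics App.

• Unfold a sheet of drawing paper / Framebuffer setup
• (Think about what to draw) / Data setup
• Pick a brush and paints / State setup
• Paint the picture with the brush / Draw call
• (Crumple it up or hang it on the wall) / Update a frame
• Unfold a fresh sheet of paper / Framebuffer clear

The graphics driver provides the API for every one of these steps.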
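The painting analogy maps onto a frame loop. Here is a minimal sketch in C, assuming a hypothetical `fake_driver` that only records which step was requested. None of these names belong to a real graphics API, and the split between one-time setup and per-frame work is one plausible reading of the slide.

```c
#include <stddef.h>

/* The six steps from the slide, in the order a frame goes through them. */
enum step {
    FRAMEBUFFER_SETUP,  /* unfold a sheet of drawing paper */
    DATA_SETUP,         /* think about what to draw */
    STATE_SETUP,        /* pick a brush and paints */
    DRAW_CALL,          /* paint with the brush */
    UPDATE_FRAME,       /* crumple it up or hang it on the wall */
    FRAMEBUFFER_CLEAR   /* unfold a fresh sheet */
};

#define MAX_LOG 64

/* Hypothetical stand-in for a driver: it only logs the requested steps. */
struct fake_driver {
    enum step log[MAX_LOG];
    size_t count;
};

static void submit(struct fake_driver *d, enum step s)
{
    if (d->count < MAX_LOG)
        d->log[d->count++] = s;
}

/* Render `frames` frames: set up the framebuffer once, then repeat
 * the per-frame steps in slide order. */
static void run_frames(struct fake_driver *d, int frames)
{
    submit(d, FRAMEBUFFER_SETUP);
    for (int i = 0; i < frames; i++) {
        submit(d, DATA_SETUP);
        submit(d, STATE_SETUP);
        submit(d, DRAW_CALL);
        submit(d, UPDATE_FRAME);
        submit(d, FRAMEBUFFER_CLEAR);
    }
}
```

Every `submit` here stands for at least one call into the driver, which is why the cost of each call matters so much in the slides that follow.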

Page 16: [Osxdev]metal

Layered Structure of a Graphics Driver

API Interface

State Management

Command Queue Management

I/O Controller

Shader Compiler

Page 17: [Osxdev]metal

What a Graphics Driver Does: why is it expensive?

• State validation
  ■ Confirming API usage is valid
  ■ Encoding API state to hardware state

• Shader compilation
  ■ Run-time generation of shader machine code
  ■ Interactions between state and shaders

• Sending work to GPU
  ■ Managing resource residency
  ■ Batching commands

Page 18: [Osxdev]metal

State Validation: OpenGL

void glTexImage2D(GLenum target, GLint level, GLint internalFormat,
                  GLsizei width, GLsizei height, GLint border,
                  GLenum format, GLenum type, const GLvoid *data);
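Every argument of a call like this has to be checked before the driver can act on it. The sketch below shows the kind of per-call validation meant here, under loud assumptions: `validate_tex_image_2d`, the error enum, and `MAX_TEXTURE_SIZE` are all hypothetical, and the checks are a simplified subset, not what any real driver performs.

```c
#include <stddef.h>

/* Hypothetical error codes, loosely modeled on GL_NO_ERROR / GL_INVALID_VALUE. */
enum tex_error { TEX_OK = 0, TEX_INVALID_VALUE = 1 };

/* Hypothetical implementation limit; a real driver would query the hardware. */
#define MAX_TEXTURE_SIZE 4096

/* Simplified range checks on the numeric arguments of a glTexImage2D-style
 * call. A real driver also validates target, formats, and type combinations. */
static enum tex_error validate_tex_image_2d(int level, int width,
                                            int height, int border)
{
    if (level < 0)
        return TEX_INVALID_VALUE;
    if (width < 0 || height < 0)
        return TEX_INVALID_VALUE;
    if (width > MAX_TEXTURE_SIZE || height > MAX_TEXTURE_SIZE)
        return TEX_INVALID_VALUE;
    if (border != 0)
        return TEX_INVALID_VALUE;
    return TEX_OK;
}
```

Checks like these run on the CPU on every call, valid or not, and that is the cost the "State validation" bullet refers to.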

Page 19: [Osxdev]metal

Shader Compilation: are you kidding?

• No standard for pre-built shaders
• No standard for a shader binary format

int Init(ESContext *esContext)
{
    UserData *userData = esContext->userData;

    /* Vertex shader source, kept as a C string and compiled at run time. */
    GLbyte vShaderStr[] =
        "attribute vec4 vPosition;    \n"
        "void main()                  \n"
        "{                            \n"
        "   gl_Position = vPosition;  \n"
        "}                            \n";

    /* Fragment shader source: paints every fragment solid red. */
    GLbyte fShaderStr[] =
        "precision mediump float;                     \n"
        "void main()                                  \n"
        "{                                            \n"
        "   gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);  \n"
        "}                                            \n";

    GLuint vertexShader;
    GLuint fragmentShader;
    GLuint programObject;
    GLint linked;

    /* The slide shows only the beginning of the function. */

Page 20: [Osxdev]metal

Shader: shading (陰影)

• A shader paints objects in darker tones

courtesy of 西川善司

Page 21: [Osxdev]metal

Sending Work to the GPU: copy and paste

• Batching commands and committing
• Transferring data and textures
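"Batching commands and committing" can be pictured as a small buffer that collects commands and hands them over in one go. This is a minimal sketch with hypothetical names and a made-up capacity of eight. A real driver would write hardware-specific command words into memory the GPU can read.

```c
#include <stddef.h>

#define BATCH_CAPACITY 8  /* hypothetical; real command buffers are far larger */

struct command { int opcode; int arg; };

struct command_batch {
    struct command pending[BATCH_CAPACITY];
    size_t pending_count;   /* commands waiting to be committed */
    size_t commits;         /* how many times a batch was handed over */
    size_t commands_sent;   /* total commands handed over so far */
};

static void batch_init(struct command_batch *b)
{
    b->pending_count = 0;
    b->commits = 0;
    b->commands_sent = 0;
}

/* Hand all pending commands over at once (here: just count them). */
static void commit(struct command_batch *b)
{
    if (b->pending_count == 0)
        return;
    b->commands_sent += b->pending_count;
    b->commits++;
    b->pending_count = 0;
}

/* Queue one command; commit first if the batch is already full. */
static void enqueue(struct command_batch *b, int opcode, int arg)
{
    if (b->pending_count == BATCH_CAPACITY)
        commit(b);
    b->pending[b->pending_count].opcode = opcode;
    b->pending[b->pending_count].arg = arg;
    b->pending_count++;
}
```

With this capacity, queuing 20 commands and committing at the end costs three hand-overs instead of 20. Paying the submission cost once per batch rather than once per command is the reason for batching.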

Page 22: [Osxdev]metal

Metal: design targets

• Low CPU overhead
• More predictable performance
• Better programmability

Page 23: [Osxdev]metal

Metal: key ideas

• Create and validate state up-front
• Shaders can be compiled offline
• Enable versatile multi-threading
• Shared memory for CPU & GPU
• Handle synchronisation explicitly
• Tile-based deferred rendering
• C++11 based language
• No legacy baggage
• Compute shader

But A7 only. What the x
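The first key idea, creating and validating state up-front, can be sketched in plain C. All names below are hypothetical and this is not the Metal API. It only illustrates moving validation out of the draw path into a one-time creation step.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of the state a draw depends on. */
struct pipeline_desc {
    int vertex_shader_id;
    int fragment_shader_id;
    int color_format;
};

/* State object: validated once at creation, then treated as read-only. */
struct pipeline_state {
    struct pipeline_desc desc;
    bool valid;
};

/* Counts how often the (expensive) validation runs. */
static int validations_run = 0;

/* Validate the description once and bake it into a state object. */
static bool make_pipeline_state(const struct pipeline_desc *desc,
                                struct pipeline_state *out)
{
    validations_run++;
    out->valid = desc->vertex_shader_id > 0 &&
                 desc->fragment_shader_id > 0 &&
                 desc->color_format > 0;
    if (out->valid)
        out->desc = *desc;
    return out->valid;
}

/* The draw path does no validation of its own; it relies on the
 * already-checked state object. */
static bool draw(const struct pipeline_state *state, int vertex_count)
{
    return state->valid && vertex_count > 0;
}
```

Creating one state object and issuing a thousand draws with it runs the validation once. An API that re-checks state on each draw would pay that cost on every one.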

Page 24: [Osxdev]metal

Multi-threading

Page 25: [Osxdev]metal

Code Comparison: Metal vs OpenGL ES

Page 26: [Osxdev]metal

So What: low CPU overhead enables

• more draw calls
• more objects
• better physics
• better AI
• more complex logic
• lower battery usage
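Why lower CPU overhead turns into more draw calls is a matter of frame-budget arithmetic. In the sketch below, the function names and any per-call costs passed in are purely illustrative assumptions, not measurements of Metal or OpenGL ES.

```c
/* CPU time available per frame, in whole microseconds, at a given frame rate. */
static long frame_budget_us(long frames_per_second)
{
    if (frames_per_second <= 0)
        return 0;
    return 1000000L / frames_per_second;
}

/* How many draw calls fit in the budget if each one costs `cost_us`
 * microseconds of CPU time (a hypothetical figure chosen by the caller). */
static long max_draw_calls(long budget_us, long cost_us)
{
    if (cost_us <= 0)
        return 0;
    return budget_us / cost_us;
}
```

At 60 frames per second the budget is 16,666 microseconds. Assuming 40 microseconds of CPU time per draw call, about 416 calls fit. Assuming 4 microseconds, about 4,166 fit. Alternatively, the saved time can go to physics, AI, or game logic, or be left idle to save battery.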

Page 27: [Osxdev]metal

How Do I Start? Use an engine or forget it

• Unity 5 (next year): free / $4,500
• Unreal 4 (maybe this year): $19/month
• Cocos2D: free
• Xcode template

Page 28: [Osxdev]metal

Proprietary API

• Apple is a promoter of the Khronos Group
• The OpenCL story
• Kicking away the ladder now that the field is established?
  ■ But Google is no fool (Expansion Pack)

Page 29: [Osxdev]metal

Conclusion: no harm done if you don't know it

• Low CPU overhead
• Can do something more
• A7 only (either you can't use it or it's a hassle)
• Game-changer? Maybe or not