(videogame) Rendering 102

Post on 07-Apr-2018

  • 8/3/2019 (videogame) Rendering 102

    1/32

    2/32

    3/32

    4/32

    This is where we left off: http://c0de517e.blogspot.com/2011/09/rendering-101.html

    5/32

    In the middle: the Nvidia Fermi GPU die.

    6/32

    7/32

    In the background image and in the diagram: the Pentium 4 CPU. Compare its
    complicated design, made of many different logic blocks, with the Fermi GPU in slide 5.

    See, e.g., http://www.azillionmonkeys.com/qed/cpujihad.shtml

    8/32

    The code is deliberately written in a dumb way here. We could have written
    out[i] = (in*i+105)in*i+, but we wanted to write it in a way that's closer to how
    things are translated into assembly and executed.

    9/32

    10/32

    It's interesting to notice the similarities between the manual unrolling and
    prefetching used in traditional CPU optimization, the fibers, continuations and
    async I/O used in modern servers, and GPGPU architecture.

    A note about fixed-size vectors: SIMD instructions in the language (e.g. math on
    float4s) do not translate into SIMD execution in hardware, nor are they needed for
    hardware SIMD. The hardware SIMD width may be very different from the language
    SIMD width!
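
    To make the float4 note concrete, here is a minimal sketch in Python. It assumes a
    made-up machine that is 8 lanes wide and runs one scalar instruction per lane; the
    names HW_LANES, madd_float4 and run_on_hardware are invented for illustration and
    do not describe any real GPU.

```python
# Sketch only: HW_LANES is an assumed hardware SIMD width, not a real GPU figure.
HW_LANES = 8


def madd_float4(a, b, c):
    """What the shader author writes: one float4 multiply-add (language SIMD)."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


def run_on_hardware(threads_a, threads_b, threads_c):
    """How the assumed hardware could run it: one scalar instruction per
    component (x, y, z, w), each executed for all lanes (threads) together."""
    assert len(threads_a) == HW_LANES
    out = [[0.0] * 4 for _ in range(HW_LANES)]
    issued = 0
    for comp in range(4):                 # four scalar instructions in total
        issued += 1
        for lane in range(HW_LANES):      # conceptually parallel across lanes
            out[lane][comp] = (threads_a[lane][comp] * threads_b[lane][comp]
                               + threads_c[lane][comp])
    return [tuple(v) for v in out], issued
```

    The float4 is only a convenience for the programmer here: the parallelism comes
    from running many threads side by side, not from the four components.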

    11/32

    SIGGRAPH 2008, "From Shader Code to a Teraflop: How Shader Cores Work", Kayvon
    Fatahalian, Stanford University

    http://s08.idav.ucdavis.edu/fatahalian-gpu-architecture.pdf

    12/32

    13/32

    14/32

    http://en.wikipedia.org/wiki/Parallel_computing

    15/32

    Moore's law was originally about transistor count, and processors have roughly managed to respect it. But CPUs are also following it in performance, which is odd, as performance should increase due to both the transistor count AND the CPU frequency (so faster!). GPUs are following Moore's law in transistor count but beating it in performance (as should be expected), though only on heavily data-parallel tasks where all the code runs in parallel (Amdahl's law is the limiting factor there).
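
    As a small worked example of why Amdahl's law is the limiting factor, here is the
    standard formula as a Python sketch. The 95% parallel fraction used below is just
    an illustrative number, not a measurement.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Amdahl's law: speedup when only parallel_fraction of the run time
    can be spread over `workers` units and the rest stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


if __name__ == "__main__":
    # Illustrative only: 95% parallel code on more and more units.
    for n in (1, 8, 64, 512, 4096):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

    However many units we add, code that is 95% parallel never gets more than 20 times
    faster, which is why only the fully data-parallel tasks see the whole benefit.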

    What's on the die (PC processors...)

    8086...386 --- Mostly processing power: logic units.

    486...Pentium2 --- Processing power and caches: a bit of cache, FPUs. Multiple pipelines.

    Pentium3...Pentium4 --- Caches and scheduling logic: heavy instruction decode/reorder units, branch prediction, cache prediction. Longer pipelines.

    Core2...i7 --- Multicore + big caches.

    Future --- Back to pure processing power, ALUs on most of the die (and cache). Manycore, small decode stages (in-order, shared between units) and caches (shared between units), wide hardware and logical SIMD, lower power/flops ratio (GPUs, Cell...). Manycore (GPU) integrated with multicore (CPU), sharing a cache level or a direct bus interconnection (single die or fast paths between units: Xenon/Xenos, PS3 PPU/SPU...).

    Past:
    http://www.tayloredge.com/museum/processor/processorhistory.html
    http://www.cpu-world.com/CPUs/index.html
    http://www.chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html
    http://www.thg.ru/cpu/19990809/onepage.html
    http://www.cs.clemson.edu/~mark/330/p6.html

    Future:
    www.gpgpu.org/static/s2007/slides/02-gpu-architecture-overview-s07.pdf
    s09.idav.ucdavis.edu/talks/02_kayvonf_gpuArchTalk09.pdf
    bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
    http://www.research.ibm.com/cell/
    http://en.wikipedia.org/wiki/Cell_(microprocessor)
    http://www.anandtech.com/show/4083/the-sandy-bridge-review-intel-core-i5-2600k-i5-2500k-and-core-i3-2100-tested
    http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx

    16/32

    17/32

    How much latency? On the Xbox 360 GPU, from the start of a shader (task) to the end
    (the write into the framebuffer) there are roughly 1000 GPU cycles of latency.
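
    To get a feeling for what that latency implies, here is a back-of-the-envelope
    sketch based on Little's law (work in flight = throughput x latency). The
    throughput values are assumptions picked for illustration, not Xbox 360 figures.

```python
import math


def tasks_in_flight(latency_cycles, tasks_per_cycle):
    """Little's law: how many tasks must be in flight at once to sustain
    tasks_per_cycle when each task takes latency_cycles to complete."""
    return math.ceil(latency_cycles * tasks_per_cycle)


if __name__ == "__main__":
    # Assumed throughputs, purely illustrative.
    print(tasks_in_flight(1000, 1.0))    # one task completed per cycle
    print(tasks_in_flight(1000, 0.25))   # one task completed every four cycles
```

    The point is only the order of magnitude: with latencies this long, the hardware
    has to keep hundreds of independent tasks in flight to stay busy.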

    18/32

    Just a few examples! There are many fast sequential sorts (e.g. radix sort and the
    other distribution sorts). Many are even faster if the sequence to sort has certain
    properties (e.g. uniform: Flash; almost sorted: Smooth) or if some particular
    behaviour is desirable (e.g. cache efficient: Funnel; few writes: Cycle; extracting
    the LIS: Patience; online: Library), and most of them can be parallelized (not only
    merge sort). Hybrids are also often useful (e.g. radix sort and parallel merge).
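
    As an illustration of a distribution sort, here is a minimal sequential LSD radix
    sort in Python. It is a sketch that assumes non-negative integer keys. The comments
    mark the three steps (histogram, prefix sum, scatter) that parallel versions
    typically split across workers, but nothing here runs in parallel.

```python
def radix_sort(keys, bits_per_pass=8):
    """LSD radix sort for non-negative integers (sequential sketch)."""
    out = list(keys)
    if not out:
        return out
    radix = 1 << bits_per_pass
    mask = radix - 1
    max_key = max(out)
    shift = 0
    while (max_key >> shift) > 0:
        # 1) Histogram of the current digit (could be built per chunk, then summed).
        counts = [0] * radix
        for k in out:
            counts[(k >> shift) & mask] += 1
        # 2) Exclusive prefix sum: first output slot for each digit value.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # 3) Stable scatter into the output array.
        nxt = [0] * len(out)
        for k in out:
            d = (k >> shift) & mask
            nxt[offsets[d]] = k
            offsets[d] += 1
        out = nxt
        shift += bits_per_pass
    return out
```

    No key is ever compared with another, which is what lets this family of sorts run
    in linear time for fixed-size keys.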

    www.cse.ohio-state.edu/~kerwin/MPIGpu.pdf

    theory.csail.mit.edu/classes/6.895/fall03/projects/final/youn.ppt

    http://www.inf.fh-flensburg.de/lang/algorithmen/sortieren/algoen.htm

    http://elliottback.com/wp/sorting-in-linear-time/

    http://en.wikipedia.org/wiki/Sorting_algorithm

    http://scholar.google.ca/scholar?hl=en&lr=&q=related:W94guGpq0ZIJ:scholar.google.com/&um=1&ie=UTF-8&ei=hiWvTeuCMamw0QGStaCiCw&sa=X&oi=science_links&ct=sl-related&resnum=1&ved=0CCQQzwIwAA

    19/32

    http://www.amazon.com/Programming-Massively-Parallel-Processors-Hands-/dp/0123814723/ref=sr_1_2?ie=UTF8&qid=1319062601&sr=8-2

    http://www.amazon.com/GPU-Computing-Gems-Emerald-Applications/dp/0123849888/ref=pd_bxgy_b_img_c

    http://developer.nvidia.com/category/zone/cuda-zone

    http://gpgpu.org/

    20/32

    You might also want to start experimenting with some code. SlimDX is a good C# (.NET)
    wrapper for DirectX 9/10/11 and comes with some simple examples: http://slimdx.org/.
    Note that the various APIs carry different amounts of legacy. It's probably not a bad
    idea to start with DX11: it's more complicated than, for example, old immediate-mode
    OpenGL, but it's closer to reality.

    Immediate mode refers to a way of drawing primitives that places the vertex data
    directly in the command buffer (instead of it being a resource that is created
    upfront and then bound). It's slow and mostly deprecated.

    States (switches for the fixed-function parts of the pipeline), resources and shader
    constants are very different concepts in DX9, where you can set/get a single render
    state or a single shader constant. This is expensive: it leads to lots of commands
    being sent and requires careful practices to avoid generating too many redundant
    state sets. DX10 and DX11 manage everything as resources: buffers that the CPU can
    modify, transfer to the GPU and then set, and that contain data, textures, groups of
    states and shader constants.
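
    One common "careful practice" for the DX9-style model is to filter redundant state
    sets on the CPU before they become commands. Here is a minimal sketch in Python.
    StateCache and the submit callback are hypothetical stand-ins, not part of any
    DirectX API, and the state names are made up.

```python
class StateCache:
    """Remembers the last value sent for each state and drops redundant sets."""

    def __init__(self, submit):
        self._submit = submit   # hypothetical callable that sends one command
        self._current = {}      # state name -> last value actually sent

    def set_state(self, name, value):
        """Send the state only if it differs from what was sent last."""
        if name in self._current and self._current[name] == value:
            return False        # redundant: nothing is sent
        self._current[name] = value
        self._submit((name, value))
        return True


if __name__ == "__main__":
    commands = []
    cache = StateCache(commands.append)
    for name, value in [("ZENABLE", True), ("ZENABLE", True),
                        ("CULLMODE", "CCW"), ("ZENABLE", False)]:
        cache.set_state(name, value)
    print(len(commands), "commands sent out of 4 requests")
```

    Grouping states into a few immutable objects, as DX10 and DX11 do, makes this kind
    of per-state bookkeeping much less necessary.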

    Some GPU implementations are very different from what I will sketch from here on. For
    example the PowerVR GPU, which is used in many mobile platforms, uses tile-based
    deferred rendering, which is pretty different from what I will discuss in terms of
    pipelines and rasterization, and it's pretty distant from the logical stages that the
    DirectX API exposes (even if you might find a DirectX, or equivalently OpenGL,
    implementation for such platforms).

    http://en.wikipedia.org/wiki/PowerVR

    Remember that these pipeline stages are only a logical view of the API with some logical view

    of a typical implementation, not a low-level physical view.

    Also, we won't delve deep. For further reference see:

    http://c0de517e.blogspot.com/2008/04/gpu-part-1.html

    http://c0de517e.blogspot.com/2008/04/how-gpu-works-part-2.html

    21/32

    Dump of GPU commands, captured with PIX for Windows (it comes with the DirectX SDK).

    There are a few other API capture applications. The best ones for DirectX 9 are PIX
    and Intel GPA (http://software.intel.com/en-us/articles/intel-gpa/). For DirectX 11
    you can use PIX, but also Nvidia's Parallel Nsight
    (http://developer.nvidia.com/nvidia-parallel-nsight) and AMD GPU PerfStudio
    (http://developer.amd.com/tools/PerfStudio/Pages/default.aspx).

    To peek into games, the old DXExplorer (http://www.sandboxsoftware.net/dxexplorer/)
    and 3D Ripper (http://www.deep-shadows.com/hax/3DRipperDX.htm) are fun too.

    Similar tools exist for OpenGL, for example gDEBugger (http://www.gremedy.com/).

    22/32

    Data fetching usually has a simple linear cache associated with it (pre-transform cache).

    Indexed reading is useful as triangles often share the same vertices, so we don't want to replicate the same data over and over in the source buffers; by using indexing we only replicate indices. Moreover, GPUs store a few vertices output by the vertex shader, and if you ask to shade the same vertex again they can reuse the output without re-executing the shader (post-transform cache).

    Usually indexed triangle lists are the best way to render objects on modern GPUs, and there are ways to optimize the order of the triangles in an object to maximize post-transform cache use.
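
    To see what the post-transform cache buys, here is a small simulation in Python. It
    is a sketch that assumes a FIFO cache holding 16 vertices; real caches differ in size
    and replacement policy, so treat cache_size as a made-up parameter.

```python
from collections import deque


def count_vertex_shader_runs(indices, cache_size=16):
    """Count vertex shader executions for an indexed triangle list,
    assuming a FIFO post-transform cache of cache_size entries."""
    cache = deque(maxlen=cache_size)   # the oldest entry falls out when full
    runs = 0
    for idx in indices:
        if idx not in cache:           # miss: the vertex has to be shaded
            runs += 1
            cache.append(idx)
    return runs


def acmr(indices, cache_size=16):
    """Average cache miss ratio: shaded vertices per triangle."""
    return count_vertex_shader_runs(indices, cache_size) / (len(indices) / 3)


if __name__ == "__main__":
    quad = [0, 1, 2, 2, 1, 3]          # two triangles sharing an edge
    print(count_vertex_shader_runs(quad), "shader runs instead of", len(quad))
```

    Triangle order optimizers try to push this ratio down by keeping triangles that
    share vertices close together in the index buffer.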

    Usually using interleaved inputs (a single buffer, an array of structures) is the best choice, but we might want to split the data into different buffers, for example if we need to use the same data in multiple draws but each draw needs only some of the attributes, or if part of the data needs to be modified by the CPU, etc.

    As for all the units, don't confuse the logical representation with the physical one. For example, on the Xbox 360 Xenos GPU the vertex data fetching is done by the vertex shader: vertex fetching instructions are added at the beginning of a vertex shader, and the fetching unit is only a kind of memory interface, like the texture fetching unit, and it's available to both vertex and pixel shaders (as the shader engine is unified on that GPU). The only unit that generates vertices is the unit that handles indexing and post-transform caching, and vertices are only indices that are an input to the vertex shader.

    23/32

    24/32

    Nvidia FX Composer is a good tool to experiment with shaders without having to care
    about how to draw primitives and bind resources: http://developer.nvidia.com/fx-composer

    The older 1.8 version of FX Composer is actually a better starting point than the
    newer .NET-based one.

    The DirectX shader language is called HLSL. It's very close to Nvidia's Cg, which can
    be used from both OpenGL and DirectX (and works on non-Nvidia cards too). OpenGL uses
    GLSL, which is a bit more confusing and in general messier. HLSL and Cg support an
    effect framework (FX) that allows specifying not only the shader code but also some
    GPU states, which makes it easier to data-drive most of what is needed to render a
    given object.

    25/32

    Note: in this slide we are really collapsing multiple related fixed units.
    Clipping/culling, primitive setup/assembly, interpolation and so on are different
    stages, each with its own limits (throughput) and buffers.

    There is a lot of documentation on how to write rasterizers and on rasterization
    algorithms. I'll link two software-based approaches here: a simple and mostly
    illustrative one, http://www.devmaster.net/codespotlight/show.php?id=17, and a more
    advanced one that was devised for the (now defunct) Larrabee GPU, which had a
    programmable rasterizer: http://software.intel.com/en-us/articles/rasterization-on-larrabee/

    Often rasterization happens in more than a single step. There can be, for example, a
    coarse rasterizer that generates bigger tiles (e.g. 8x8 pixels) and then does some
    early rejection (see the next slides), to then pass the results to a fine rasterizer
    that generates 2x2 quads.
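
    The two-step idea can be sketched in software with edge functions. This is a toy
    Python version, not how any specific GPU does it: it assumes the render target size
    is a multiple of the tile size, samples at pixel centres, and skips fill rules
    (pixels exactly on a shared edge would be drawn twice). The only "early rejection"
    here is the geometric one, discarding tiles that lie fully outside one edge.

```python
def edge(a, b, p):
    """Signed area test: which side of the edge a->b the point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def winding(tri):
    """+1 or -1 depending on vertex order, 0 for a degenerate triangle."""
    area = edge(tri[0], tri[1], tri[2])
    return (area > 0) - (area < 0)


def inside(tri, p):
    s = winding(tri)
    a, b, c = tri
    return s != 0 and all(s * edge(u, v, p) >= 0
                          for u, v in ((a, b), (b, c), (c, a)))


def tile_rejected(tri, x0, y0, size):
    """Coarse step: reject a tile if all four corner samples are outside
    the same edge (then every sample in the tile is outside it too)."""
    s = winding(tri)
    if s == 0:
        return True
    a, b, c = tri
    lo, hi = 0.5, size - 0.5
    corners = [(x0 + dx, y0 + dy) for dx in (lo, hi) for dy in (lo, hi)]
    return any(all(s * edge(u, v, p) < 0 for p in corners)
               for u, v in ((a, b), (b, c), (c, a)))


def rasterize(tri, width, height, tile=8):
    """Returns (quads, rejected_tiles). Each quad is (x, y, mask) with the
    coverage mask ordered (x,y), (x+1,y), (x,y+1), (x+1,y+1)."""
    quads, rejected = [], 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            if tile_rejected(tri, tx, ty, tile):
                rejected += 1
                continue
            # Fine step: 2x2 quads, emitted if at least one pixel is covered.
            for qy in range(ty, ty + tile, 2):
                for qx in range(tx, tx + tile, 2):
                    mask = [inside(tri, (qx + dx + 0.5, qy + dy + 0.5))
                            for dy in (0, 1) for dx in (0, 1)]
                    if any(mask):
                        quads.append((qx, qy, mask))
    return quads, rejected
```

    Note that a quad is emitted whole even when only one of its four pixels is covered.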

    26/32

    27/32

    Quad waste is a big problem for dense meshes, tessellation and displacement. Mesh
    level of detail is often more useful for avoiding too much waste in the shaded quads
    than for reducing the number of vertices and their associated load in the vertex
    shading stages of the pipeline. It's a problem that needs to be solved for the future
    of GPU graphics: http://graphics.stanford.edu/papers/fragmerging/shade_sig10.pdf

    It's a long discussion, and in practice very few shaders use differentials to
    integrate discontinuities (we should though, or we should try to avoid
    discontinuities and high frequencies in shaders). But the differentials are used by
    default when fetching textures, to pre-filter them (mipmaps, anisotropic filtering),
    and this alone makes a HUGE difference.
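
    Here is a sketch of how the differentials of a 2x2 quad can drive mipmap selection.
    It uses the textbook isotropic formula (log2 of the longest texel-space gradient).
    Actual hardware uses its own approximations and handles anisotropic filtering
    differently, so take quad_derivatives and mip_level as illustrative names only.

```python
import math


def quad_derivatives(uv):
    """uv[row][col] holds the (u, v) texel coordinates of a 2x2 pixel quad.
    Differentials are finite differences between neighbouring pixels."""
    ddx = (uv[0][1][0] - uv[0][0][0], uv[0][1][1] - uv[0][0][1])
    ddy = (uv[1][0][0] - uv[0][0][0], uv[1][0][1] - uv[0][0][1])
    return ddx, ddy


def mip_level(ddx, ddy, num_levels):
    """Isotropic level of detail: log2 of the largest texel footprint,
    clamped to the levels available in the mip chain."""
    rho = max(math.hypot(*ddx), math.hypot(*ddy))
    if rho <= 1.0:
        return 0.0                     # magnification: use the base level
    return min(math.log2(rho), float(num_levels - 1))


if __name__ == "__main__":
    # Each screen pixel steps 4 texels in u (horizontally) and in v (vertically).
    quad = [[(0.0, 0.0), (4.0, 0.0)],
            [(0.0, 4.0), (4.0, 4.0)]]
    print(mip_level(*quad_derivatives(quad), num_levels=10))
```

    This is also why pixels are shaded in 2x2 quads: the neighbours are needed to
    compute the differences, even when they fall outside the triangle.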

    Some pointers for further reading:

    http://en.wikipedia.org/wiki/Spatial_anti-aliasing

    http://en.wikipedia.org/wiki/Multisample_anti-aliasing

    http://www.amazon.com/Texturing-Modeling-Second-Procedural-Approach/dp/0122287304

    http://www.amazon.com/Real-Time-Rendering-Third-Tomas-Akenine-Moller/dp/1568814240/ref=sr_1_1?s=books&ie=UTF8&qid=1318980645&sr=1-1

    28/32

    29/32

    I won't go into any detail on what a stencil buffer is and how early-stencil can be
    used, or the many ways a depth buffer can be used and configured... But early
    rejections are fundamental for GPU performance and can be used for many different
    optimizations. It's essential to understand how they work on a specific GPU
    architecture, as often they will work only in certain state configurations and are
    easily invalidated. Some data here:

    http://www.gpgpu.org/w/index.php/Code_Examples#Early-z

    Also, each GPU vendor has a different name for these optimizations: Hierarchical Z
    (HiZ), HyperZ, Early-Z, Zcull and so on... The actual representation that permits
    this early test varies with the hardware, and there can even be multiple levels of
    rejection. Most of them are inspired by Hierarchical Occlusion Maps,
    http://www.cs.unc.edu/~zhangh/hom.html, but the literature is rich, e.g.

    http://citeseerx.ist.psu.edu/showciting;jsessionid=921596071E227DDD76FAFE435BCBFC89?cid=78472
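
    The general idea behind these hierarchical rejections can be sketched as follows.
    This is a toy Python model, not any vendor's scheme: it assumes a "less than" depth
    test, one coarse value per 8x8 tile (the farthest depth stored in that tile) and
    incoming geometry that covers a whole tile at constant depth. HiZBuffer and
    draw_flat_tile are made-up names.

```python
class HiZBuffer:
    """Depth buffer plus one coarse 'farthest depth' value per tile."""

    def __init__(self, tiles_x, tiles_y, tile=8, far=1.0):
        self.tile = tile
        width, height = tiles_x * tile, tiles_y * tile
        self.depth = [[far] * width for _ in range(height)]
        self.tile_max = [[far] * tiles_x for _ in range(tiles_y)]

    def draw_flat_tile(self, tx, ty, z):
        """Draw a tile-sized patch at constant depth z. Returns how many
        fragments had to go through the per-pixel depth test."""
        if z >= self.tile_max[ty][tx]:
            return 0                   # early rejection: hidden everywhere
        t = self.tile
        rows = range(ty * t, ty * t + t)
        cols = range(tx * t, tx * t + t)
        for y in rows:
            for x in cols:
                if z < self.depth[y][x]:   # the usual per-pixel depth test
                    self.depth[y][x] = z
        # Keep the coarse value conservative: farthest depth left in the tile.
        self.tile_max[ty][tx] = max(self.depth[y][x] for y in rows for x in cols)
        return t * t


if __name__ == "__main__":
    hiz = HiZBuffer(2, 2)
    print(hiz.draw_flat_tile(0, 0, 0.3))   # visible, tested per pixel
    print(hiz.draw_flat_tile(0, 0, 0.7))   # behind everything in the tile
```

    The coarse test is only valid while the coarse values stay conservative, which is
    one way to see why some state configurations invalidate it.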

    30/32

    31/32

    32/32

    Example image and z-buffer from a PIX capture of a retail version of Battlefield: Bad
    Company 2.