Java GPU Computing
Maarten Steur & Arjan Lamers

● OpenCL overview
● Simple example
● Case study
● Tips & tricks
● Questions

Why GPU Computing

Abbreviations
● CPU, GPU, APU
● Khronos: OpenCL, OpenGL
● Nvidia: CUDA
● JogAmp JOCL, JavaCL, JOCL

GPU compared with CPU
● Many simple cores
● Lots of high-bandwidth memory

Intel Core i7      GeForce GT 650M
8 cores            384 cores
180 Gflops         650 Gflops

Programming model
● Define a stream (flow)
● Run in parallel

Usage
● Algorithm:
– High concurrency
– Partitionable
● But:
– Extra latency from moving data onto and off the GPU
– Extra complexity

Components

Example (MacBook Pro)
Platform name: Apple
Platform profile: FULL_PROFILE
Platform spec version: OpenCL 1.2
Platform vendor: Apple

Device 16925696 HD Graphics 4000
Driver: 1.2(Aug 17 2014 20:29:07)
Max work group size: 512
Global mem size: 1073741824
Local mem size: 65536
Max clock freq: 1200
Max compute units: 16

Device 16918272 GeForce GT 650M
Driver: 8.26.28 310.40.55b01
Max work group size: 1024
Global mem size: 1073741824
Local mem size: 49152
Max clock freq: 900
Max compute units: 2

Device 4294967295 Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Driver: 1.1
Max work group size: 1024
Global mem size: 17179869184
Local mem size: 32768
Max clock freq: 2600
Max compute units: 8

Work & Memory

Application / Kernel
● Write .cl files in a C dialect (OpenCL C)
● Kernels are the 'public' functions
● Or start from Java bytecode:
– Aparapi (OpenCL)
– RootBeer (CUDA)

Disclaimer

Parallel sort
kernel void sort(global const float* in, global float* out, int size) {
    int i = get_global_id(0); // current thread
    float id = in[i];
    int pos = 0;
    for (int j = 0; j < size; j++) {
        float jd = in[j];
        // in[j] < in[i] ? Ties are broken by index so equal values get distinct positions.
        bool smaller = (jd < id) || (jd == id && j < i);
        pos += smaller ? 1 : 0;
    }
    out[pos] = id;
}
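
The kernel above is a rank sort: every work-item counts how many elements sort before its own element and writes that element straight to its final position. A minimal CPU sketch of the same logic in plain Java can serve as a reference when checking GPU output. The class name `RankSortReference` is hypothetical and not part of the talk.

```java
import java.util.Arrays;

/** CPU reference for the rank sort kernel (hypothetical helper, not from the slides). */
final class RankSortReference {

    static float[] sort(float[] in) {
        int size = in.length;
        float[] out = new float[size];
        // The outer loop stands in for the work-items; on the GPU each i runs in parallel.
        for (int i = 0; i < size; i++) {
            float id = in[i];
            int pos = 0;
            for (int j = 0; j < size; j++) {
                float jd = in[j];
                // Count elements that sort before in[i]; ties are broken by index.
                boolean smaller = (jd < id) || (jd == id && j < i);
                pos += smaller ? 1 : 0;
            }
            out[pos] = id;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] input = {3.0f, 1.0f, 2.0f, 1.0f};
        System.out.println(Arrays.toString(sort(input))); // prints [1.0, 1.0, 2.0, 3.0]
    }
}
```

Each element does O(n) work, so the total is O(n²). The approach only pays off because all n rank computations are independent of each other.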

Java GPU Computing
CLContext globalContext = CLContext.create();
CLDevice device = globalContext.getMaxFlopsDevice(Type.GPU);
CLContext context = CLContext.create(device);
CLCommandQueue queue = device.createCommandQueue();
CLProgram program = context.createProgram(
First8GpuComputing.class.getResourceAsStream("MyTask.cl")).build();
You can also build for specific devices: build(device)

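// Device-side buffers: one the kernel reads from, one it writes to.
CLBuffer<FloatBuffer> inBuffer = context.createFloatBuffer(input.length, READ_ONLY);
CLBuffer<FloatBuffer> outBuffer = context.createFloatBuffer(input.length, WRITE_ONLY);
mapToBuffer(inBuffer.getBuffer(), workLoad); // the authors' helper; presumably fills the host-side buffer

// Look up the kernel by name and bind its arguments in declaration order.
CLKernel kernel = program.createCLKernel("MyTask");
kernel.putArgs(inBuffer, outBuffer).putArg(workLoad.length);

// Upload (non-blocking), run the kernel, then download (blocking) so the result is ready afterwards.
queue.putWriteBuffer(inBuffer, false)
     .put1DRangeKernel(kernel, 0, globalWorkSize, localWorkSize)
     .putReadBuffer(outBuffer, true);
FloatBuffer output = outBuffer.getBuffer();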
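
The slides do not show how `globalWorkSize` and `localWorkSize` are chosen. In OpenCL 1.x the global work size must be a multiple of the local work size, so a common approach is to round the element count up to the next multiple. The sketch below assumes that approach; `WorkSizes` and `roundUp` are hypothetical names.

```java
/** Work-size helper (hypothetical, not from the slides). */
final class WorkSizes {

    /** Returns the smallest multiple of localWorkSize that is >= elementCount. */
    static int roundUp(int localWorkSize, int elementCount) {
        if (localWorkSize <= 0 || elementCount < 0) {
            throw new IllegalArgumentException("localWorkSize must be > 0 and elementCount >= 0");
        }
        int remainder = elementCount % localWorkSize;
        if (remainder == 0) {
            return elementCount;
        }
        // addExact throws instead of silently overflowing for very large inputs.
        return Math.addExact(elementCount, localWorkSize - remainder);
    }
}
```

With 1000 elements and a local size of 256, `WorkSizes.roundUp(256, 1000)` gives 1024. The 24 surplus work-items then run past the end of the data, so the kernel needs a guard such as `if (i >= size) return;`. The sort kernel shown earlier has no such guard.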

Case study
● Calculation tool supporting the Programmatische Aanpak Stikstof (the Dutch programmatic approach to nitrogen).
● http://www.aerius.nl

Tips & tricks
● Managing CL sources
– getResourceAsStream()?
– Java constants → #define
– Locale? Oops!
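
The "Java constants → #define" and "Locale" points presumably belong together. When Java values are formatted into `#define` lines, a default locale such as Dutch writes `2,5` instead of `2.5`, and the OpenCL compiler then sees something other than the intended literal. A minimal sketch, assuming the constants are kept in a map and formatted with an explicit locale (`ClDefines` and the six-decimal format are illustrative choices, not from the talk):

```java
import java.util.Locale;
import java.util.Map;

/** Turns Java constants into an OpenCL #define header (hypothetical helper). */
final class ClDefines {

    static String header(Map<String, Double> constants) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Double> entry : constants.entrySet()) {
            // Locale.ROOT forces '.' as the decimal separator, whatever the JVM default locale is.
            sb.append(String.format(Locale.ROOT, "#define %s %.6f\n",
                    entry.getKey(), entry.getValue()));
        }
        return sb.toString();
    }
}
```

The resulting header can be prepended to the kernel source text before the program is built, which keeps a single definition of each constant on the Java side.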

● Unit testing
– Separate test kernels
– Test cases in batches
kernel void testDifficultCalculation(const int testCount, global const double* distance, global double* results) {
    const int testId = get_global_id(0);
    // Guard: the global work size may be larger than the number of test cases.
    if (testId < testCount) {
        results[testId] = difficultCalculation(distance[testId]);
    }
}

Direct memory management
● -XX:MaxDirectMemorySize=??M
● ByteBuffer.allocateDirect(int capacity)
– Max 2 GB per buffer
● Garbage collection comes too late
– Triggered by heap collection
– Release manually
– ((sun.nio.ch.DirectBuffer) myBuffer).cleaner().clean(); (internal JDK API)
● VisualVM plugin for direct buffers
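
The 2 GB limit follows from `allocateDirect` taking an `int` byte count. A minimal sketch of allocating a direct float buffer with that limit checked up front is shown below. `DirectBuffers` is a hypothetical helper, and using the native byte order is an assumption about what the OpenCL binding expects.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

/** Direct buffer allocation with an explicit size check (hypothetical helper). */
final class DirectBuffers {

    /** Largest float count that fits in one direct buffer, because capacity is an int byte count. */
    static final int MAX_FLOATS = Integer.MAX_VALUE / Float.BYTES;

    static FloatBuffer allocateFloats(int count) {
        if (count < 0 || count > MAX_FLOATS) {
            throw new IllegalArgumentException("count out of range: " + count);
        }
        // Off-heap memory; counts against -XX:MaxDirectMemorySize rather than the Java heap.
        return ByteBuffer.allocateDirect(count * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }
}
```

Allocating such buffers once and reusing them across kernel runs sidesteps the late garbage collection described above without relying on the internal cleaner call.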

GPU vs CPU
● GPUs perform fewer checks than CPUs
– Division by zero
– Out-of-bounds checks
– Test on the CPU first

Portability
● OpenCL is portable, its performance is not
– Memory sizes differ
– Memory latencies differ
– Work group sizes differ
– Compute devices differ
– OpenCL implementations differ
● So develop for the production hardware

Finally
● Float vs double
– Double the precision
– Half the performance (often worse on consumer GPUs)
– Double support is optional
Conclusion
● When to use it?
– When performance is really needed
– When the problem has high concurrency
– When the problem is partitionable

Questions?
Setting up OpenCL test on Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Warming up OpenCL test
[thread 32003 also had an error]
[thread 33027 also had an error]
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV[thread 32515 also had an error] (0xb)[thread 32771 also had an error][thread 32259 also had an error] at pc=0x00000001250ded70, pid=99851, tid=29475
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# [thread 17415 also had an error]C [cl_kernels+0x1d70] sort_wrapper+0x1b0
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/arjanl/Documents/opencl/workspace/opencl-test/jogamp/hs_err_pid99851.log
[thread 31763 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#