GPGPU

Entry: Getting started
Date: Fri May 17 16:07:49 EDT 2013

Basic idea: this is part of the graphics rendering pipeline.  The
computation is a side effect of consuming and generating a texture.

There are two pipelines to use:
- vertex shader
- fragment shader / pixel shader

We use the pixel shader, and align the texture pixels with the output
array.

[1] http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html

Entry: Android Renderscript
Date: Fri May 17 17:10:47 EDT 2013

http://developer.android.com/guide/topics/renderscript/compute.html

Entry: revisit
Date: Sun Dec 11 16:18:29 EST 2016

On the Radeon HD 7800 Pitcairn:

  [    86.336] (--) RADEON(0): Chipset: "PITCAIRN" (ChipID = 0x6811)

This would be OpenCL.

https://anteru.net/blog/2012/11/03/2009/
http://stackoverflow.com/questions/21522554/how-to-setup-opencl-on-amd-videocard-with-opensource-driver

glxinfo -> OpenGL renderer string:
  Gallium 0.4 on AMD PITCAIRN (DRM 2.43.0 / 4.6.0-1-amd64, LLVM 3.8.1)

https://packages.debian.org/sid/mesa-opencl-icd

  # apt-get install mesa-opencl-icd

https://laanwj.github.io/2016/05/06/opencl-ubuntu1604.html

See also the git/opencl archive.

  tom@zoe:~/opencl$ ./devices.elf
  1. Platform
     Profile: FULL_PROFILE
     Version: OpenCL 1.1 Mesa 13.0.2
     Name: Clover
     Vendor: Mesa
     Extensions: cl_khr_icd
   1. Device: AMD PITCAIRN (DRM 2.43.0 / 4.6.0-1-amd64, LLVM 3.9.0)
    1.1 Hardware version: OpenCL 1.1 Mesa 13.0.2
    1.2 Software version: 13.0.2
    1.3 OpenCL C version: OpenCL C 1.1
    1.4 Parallel compute units: 20

What's the difference between compute units (20) and stream cores
(see Wikipedia: 1024-1280)?

https://community.amd.com/thread/166930
https://community.amd.com/community/devgurus

A CU is roughly equivalent to an independent CPU.  Each CU is
subdivided into stream cores, programmed using SIMT.

GCN = Graphics Core Next
https://en.wikipedia.org/wiki/Graphics_Core_Next

  The Graphics Core Next (officially called "Southern Islands")
  microarchitecture combines 64 shader processors with 4 TMUs and
  1 ROP into a compute unit (CU).  Each Compute Unit consists of:
  - a CU Scheduler
  - a Branch & Message Unit
  - 4 SIMD Vector Units (each 16-lane wide)
  - 4 64 KiB VGPR files
  - 1 scalar unit
  - a 4 KiB GPR file
  - a local data share of 64 KiB
  - 4 Texture Filter Units
  - 16 Texture Fetch Load/Store Units
  - a 16 KiB L1 Cache

  Four Compute Units are wired to share an Instruction Cache 16 KiB
  in size and a scalar data cache 32 KiB in size.  These are backed
  by the L2 cache.

  A SIMD-VU operates on 16 elements at a time (per cycle), while a
  SU can operate on one at a time (one/cycle).  In addition the SU
  handles some other operations like branching.

This seems interesting:
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

Entry: Recap
Date: Tue Dec 20 20:44:48 EST 2016

- OpenCL requires explicit management of compute units
- Compute units are SIMD machines

Entry: GPU architecture
Date: Wed Dec 13 01:20:58 EST 2017

( this is a winter thing apparently :)

3 ways to increase throughput per unit of area:
- simpler cores (no caches or complex execution logic)
- SIMD ALU (reduce instruction decoding cost)
- many register sets (for thread context switching, to hide memory latency)

https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf

Entry: OpenCL
Date: Wed Dec 13 01:44:56 EST 2017

So practically, how to determine the split of work based on the
number of compute units?  (A query sketch follows at the end of this
entry.)

Again:
- OpenCL Compute Units = 20
- 1280 "Stream Cores"
- 64 Stream Cores per compute unit, organized as 4 Vector Units with
  16 SIMD lanes each

https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

  In GCN, each SIMD unit is assigned its own 40-bit program counter
  and instruction buffer for 10 wavefronts.  The whole CU can thus
  have 40 wavefronts in flight, each potentially from a different
  work-group or kernel, which is substantially more flexible than
  previous designs.

A wavefront is a thread containing a SIMD instruction stream for 64
work items.  This is executed in 4 cycles using the 16-lane execution
unit.  Instructions are dispatched to the 4 SIMD units in a
round-robin fashion.  The dispatcher picks one of the 10 associated
wavefronts.

The 20-CU card I have thus has:
- 800 64-way SIMD threads
- 51200 scalar threads

Each SIMD unit has a dedicated 512-entry scalar register file, shared
across the wavefronts executing on that SIMD unit.  It seems to be
similar for the vector units.  So definitely the compiler needs to be
aware of how wavefronts interact!

Bundling work into 64-way SIMD streams is done by the compiler.
Scheduling wavefronts at the SIMD level is done by the CU.
Are wavefronts associated to CUs programmatically?

So AMD's move from VLIW to GCN is about replacing parallelism inside
a wavefront with thread switching between different wavefronts.
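Here is the query sketch referenced above: a minimal C fragment that
reads the numbers the runtime exposes and turns them into an NDRange
split, assuming a device and a built kernel are already in hand.  The
pick_split name and the sizing heuristic are just my guesses from the
notes above, not anything the OpenCL spec prescribes.

  #include <stdio.h>
  #include <CL/cl.h>

  /* Hypothetical helper: pick an NDRange split for nb_items work
     items, given a device and a built kernel.  Error handling
     omitted. */
  void pick_split(cl_device_id dev, cl_kernel kernel, size_t nb_items)
  {
      cl_uint nb_cu = 0;
      clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                      sizeof(nb_cu), &nb_cu, NULL);

      /* On GCN this is typically the wavefront size, 64. */
      size_t wg_mult = 0;
      clGetKernelWorkGroupInfo(kernel, dev,
                               CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                               sizeof(wg_mult), &wg_mult, NULL);

      /* Upper bound on the work-group size for this kernel. */
      size_t wg_max = 0;
      clGetKernelWorkGroupInfo(kernel, dev,
                               CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(wg_max), &wg_max, NULL);

      /* Guess: local size = preferred multiple (one full wavefront),
         global size = nb_items rounded up so every group is full.
         To keep all CUs busy, the number of groups should be at
         least nb_cu, preferably several times that to hide memory
         latency. */
      size_t local  = wg_mult ? wg_mult : 64;
      if (local > wg_max) local = wg_max;
      size_t global = ((nb_items + local - 1) / local) * local;

      printf("CUs=%u wg_mult=%zu local=%zu global=%zu groups=%zu\n",
             (unsigned)nb_cu, wg_mult, local, global, global / local);
  }

Whether one wavefront per group is actually the best split would need
measuring; the AMD optimization guide linked in the next entry goes
into that.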
Entry: Programming
Date: Wed Dec 13 03:17:43 EST 2017

https://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/
https://gpuopen.com/amdgcn-assembly/
https://github.com/Dakkers/OpenCL-examples/blob/master/example00/main.cpp

Entry: Haskell
Date: Wed Dec 13 18:13:02 EST 2017

https://hackage.haskell.org/package/OpenCL

Entry: opencl processing elements
Date: Sun Dec 17 23:31:36 EST 2017

You can't query the number of PEs:
https://stackoverflow.com/questions/21170154/opencl-query-number-of-processing-elements

PE = virtual scalar processor
http://downloads.ti.com/mctools/esd/docs/opencl/execution/kernels-workgroups-workitems.html

kernel      = expression for one work item
global size = number of work items (kernel duplication over data)
local size  = number of work items in a work-group

Guess: the work-group determines the SIMD mapping, i.e. a work-group
executes on a single compute unit?

So basically:
- local size large enough to fill the SIMD / SIMT unit
- enough groups to cover all compute units

It's possible to pass NULL as local size.
https://stackoverflow.com/questions/18105300/opencl-optimal-group-size
PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Groups are also important because they can share __local memory.
(A concrete kernel + enqueue sketch is at the end of this log.)

Entry: RAI?
Date: Mon Dec 18 00:32:46 EST 2017

Since it's pretty much just C, this should work.  Now, what to do
with it?

Entry: new GPUs?
Date: Mon Dec 18 00:44:35 EST 2017

The HD7870 is 5 years old.  What's on the market today?

Radeon RX 470, $350, 2048 stream processors.

Entry: haskell opencl
Date: Sat Dec 30 01:09:36 EST 2017

https://lancelet.github.io/posts/2017-12-26-opencl-helloworld.html
https://hackage.haskell.org/package/language-c-quote
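Whatever the host language ends up being, the device side stays plain
OpenCL C, and the global/local size notions from the "opencl
processing elements" entry above map directly onto the enqueue call.
A minimal sketch under those assumptions: the vadd kernel and the
enqueue_vadd helper are illustrative names of my own, not taken from
the linked examples, and the usual context / program / buffer setup
is assumed to happen elsewhere.

  #include <CL/cl.h>

  /* Kernel = expression for one work item: each work item computes
     one output element.  Passed to clCreateProgramWithSource()
     during setup. */
  static const char *vadd_src =
      "__kernel void vadd(__global const float *a,\n"
      "                   __global const float *b,\n"
      "                   __global float *c,\n"
      "                   unsigned int n) {\n"
      "    size_t i = get_global_id(0);\n"
      "    if (i < n) c[i] = a[i] + b[i];\n"
      "}\n";

  /* Hypothetical helper: enqueue vadd over n elements.  Global size
     = number of work items, local size = work-group size (or pass
     NULL and let the runtime choose).  Error handling omitted. */
  static cl_int enqueue_vadd(cl_command_queue queue, cl_kernel kernel,
                             cl_mem a, cl_mem b, cl_mem c, cl_uint n)
  {
      clSetKernelArg(kernel, 0, sizeof(cl_mem),  &a);
      clSetKernelArg(kernel, 1, sizeof(cl_mem),  &b);
      clSetKernelArg(kernel, 2, sizeof(cl_mem),  &c);
      clSetKernelArg(kernel, 3, sizeof(cl_uint), &n);

      size_t local  = 64;  /* e.g. one GCN wavefront per group */
      /* OpenCL 1.1 requires global to be a multiple of local, hence
         the round-up here and the bounds check in the kernel. */
      size_t global = ((n + local - 1) / local) * local;

      return clEnqueueNDRangeKernel(queue, kernel,
                                    1,       /* work_dim */
                                    NULL,    /* global offset */
                                    &global,
                                    &local,  /* or NULL */
                                    0, NULL, NULL);
  }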