GPGPU

Entry: Getting started
Date: Fri May 17 16:07:49 EDT 2013

Basic idea: this is part of the graphics rendering pipeline.  The
computation is a side effect of consuming and generating a texture.

There are two pipelines to use:
- vertex shader
- fragment shader / pixel shader

We use the pixel shader, and align the texture pixels with the output
array.

[1] http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html

Entry: Android Renderscript
Date: Fri May 17 17:10:47 EDT 2013

http://developer.android.com/guide/topics/renderscript/compute.html

Entry: revisit
Date: Sun Dec 11 16:18:29 EST 2016

On the Radeon HD 7800 Pitcairn:

  [    86.336] (--) RADEON(0): Chipset: "PITCAIRN" (ChipID = 0x6811)

This would be OpenCL.

https://anteru.net/blog/2012/11/03/2009/
http://stackoverflow.com/questions/21522554/how-to-setup-opencl-on-amd-videocard-with-opensource-driver

glxinfo -> OpenGL renderer string:
  Gallium 0.4 on AMD PITCAIRN (DRM 2.43.0 / 4.6.0-1-amd64, LLVM 3.8.1)

https://packages.debian.org/sid/mesa-opencl-icd

  # apt-get install mesa-opencl-icd

https://laanwj.github.io/2016/05/06/opencl-ubuntu1604.html

See also the git/opencl archive.

  tom@zoe:~/opencl$ ./devices.elf
  1. Platform
     Profile: FULL_PROFILE
     Version: OpenCL 1.1 Mesa 13.0.2
     Name: Clover
     Vendor: Mesa
     Extensions: cl_khr_icd
   1. Device: AMD PITCAIRN (DRM 2.43.0 / 4.6.0-1-amd64, LLVM 3.9.0)
    1.1 Hardware version: OpenCL 1.1 Mesa 13.0.2
    1.2 Software version: 13.0.2
    1.3 OpenCL C version: OpenCL C 1.1
    1.4 Parallel compute units: 20

What's the difference between compute units (20) and stream cores
(see Wikipedia: 1024-1280)?

https://community.amd.com/thread/166930
https://community.amd.com/community/devgurus

A CU is roughly equivalent to an independent CPU.  Each CU is
subdivided into stream cores, programmed using SIMT.

GCN = Graphics Core Next
https://en.wikipedia.org/wiki/Graphics_Core_Next

  The Graphics Core Next (officially called "Southern Islands")
  microarchitecture combines 64 shader processors with 4 TMUs and
  1 ROP into a compute unit (CU).  Each Compute Unit consists of:
  - a CU Scheduler
  - a Branch & Message Unit
  - 4 SIMD Vector Units (each 16-lane wide)
  - 4 64 KiB VGPR files
  - 1 scalar unit
  - a 4 KiB GPR file
  - a local data share of 64 KiB
  - 4 Texture Filter Units
  - 16 Texture Fetch Load/Store Units
  - a 16 KiB L1 Cache

  Four Compute Units are wired to share an Instruction Cache 16 KiB
  in size and a scalar data cache 32 KiB in size.  These are backed
  by the L2 cache.

  A SIMD-VU operates on 16 elements at a time (per cycle), while a
  SU can operate on one at a time (one/cycle).  In addition the SU
  handles some other operations like branching.

This seems interesting:
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

Entry: Recap
Date: Tue Dec 20 20:44:48 EST 2016

- OpenCL requires explicit management of compute units
- Compute units are SIMD machines

Entry: GPU architecture
Date: Wed Dec 13 01:20:58 EST 2017

( this is a winter thing apparently :)

3 ways to increase throughput per unit of area:
- simpler cores (no caches or complex execution logic)
- SIMD ALU (reduce instruction decoding cost)
- many register sets (for thread context switching, to hide memory latency)

https://www.cs.cmu.edu/afs/cs/academic/class/15462-f11/www/lec_slides/lec19.pdf

Entry: OpenCL
Date: Wed Dec 13 01:44:56 EST 2017

So practically, how to determine the split of work based on the
number of compute units?  (A query sketch follows at the end of this
entry.)

Again:
- OpenCL Compute Units = 20
- 1280 "Stream Cores"
- 64 Stream Cores per compute unit, organized as 4 Vector Units with
  16 SIMD lanes each

https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

  In GCN, each SIMD unit is assigned its own 40-bit program counter
  and instruction buffer for 10 wavefronts.  The whole CU can thus
  have 40 wavefronts in flight, each potentially from a different
  work-group or kernel, which is substantially more flexible than
  previous designs.

A wavefront is a thread containing a SIMD instruction stream for 64
work items.  This is executed in 4 cycles using the 16-lane execution
unit.  Instructions are dispatched to the 4 SIMD units in a
round-robin fashion.  The dispatcher picks one of the 10 associated
wavefronts.

The 20-CU card I have thus has:
- 800 64-way SIMD threads
- 51200 scalar threads

Each SIMD unit has a dedicated 512-entry scalar register file, shared
across the wavefronts executing on that SIMD unit.  It seems to be
similar for the vector units.  So definitely the compiler needs to be
aware of how wavefronts interact!

Bundling work into 64-way SIMD streams is done by the compiler.
Scheduling wavefronts at the SIMD level is done by the CU.
Are wavefronts associated to CUs programmatically?

So AMD's move from VLIW to GCN is about replacing parallelism inside
a wavefront with thread switching between different wavefronts.
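Here is the query sketch referenced above: a minimal C fragment that
reads the numbers the runtime exposes and turns them into an NDRange
split, assuming a device and a built kernel are already in hand.  The
pick_split name and the sizing heuristic are just my guesses from the
notes above, not anything the OpenCL spec prescribes.

  #include <stdio.h>
  #include <CL/cl.h>

  /* Hypothetical helper: pick an NDRange split for nb_items work
     items, given a device and a built kernel.  Error handling
     omitted. */
  void pick_split(cl_device_id dev, cl_kernel kernel, size_t nb_items)
  {
      cl_uint nb_cu = 0;
      clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                      sizeof(nb_cu), &nb_cu, NULL);

      /* On GCN this is typically the wavefront size, 64. */
      size_t wg_mult = 0;
      clGetKernelWorkGroupInfo(kernel, dev,
                               CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                               sizeof(wg_mult), &wg_mult, NULL);

      /* Upper bound on the work-group size for this kernel. */
      size_t wg_max = 0;
      clGetKernelWorkGroupInfo(kernel, dev,
                               CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(wg_max), &wg_max, NULL);

      /* Guess: local size = preferred multiple (one full wavefront),
         global size = nb_items rounded up so every group is full.
         To keep all CUs busy, the number of groups should be at
         least nb_cu, preferably several times that to hide memory
         latency. */
      size_t local  = wg_mult ? wg_mult : 64;
      if (local > wg_max) local = wg_max;
      size_t global = ((nb_items + local - 1) / local) * local;

      printf("CUs=%u wg_mult=%zu local=%zu global=%zu groups=%zu\n",
             (unsigned)nb_cu, wg_mult, local, global, global / local);
  }

Whether one wavefront per group is actually the best split would need
measuring; the AMD optimization guide linked in the next entry goes
into that.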
Entry: Programming
Date: Wed Dec 13 03:17:43 EST 2017

https://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/
https://gpuopen.com/amdgcn-assembly/
https://github.com/Dakkers/OpenCL-examples/blob/master/example00/main.cpp

Entry: Haskell
Date: Wed Dec 13 18:13:02 EST 2017

https://hackage.haskell.org/package/OpenCL

Entry: opencl processing elements
Date: Sun Dec 17 23:31:36 EST 2017

You can't query the number of PEs:
https://stackoverflow.com/questions/21170154/opencl-query-number-of-processing-elements

PE = virtual scalar processor
http://downloads.ti.com/mctools/esd/docs/opencl/execution/kernels-workgroups-workitems.html

kernel      = expression for one work item
global size = number of work items (kernel duplication over data)
local size  = number of work items in a work-group

Guess: the work-group determines the SIMD mapping, i.e. a work-group
executes on a single compute unit?

So basically:
- local size large enough to fill the SIMD / SIMT unit
- enough groups to cover all compute units

It's possible to pass NULL as local size.
https://stackoverflow.com/questions/18105300/opencl-optimal-group-size
PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Groups are also important because they can share __local memory.
(A concrete kernel + enqueue sketch is at the end of this log.)

Entry: RAI?
Date: Mon Dec 18 00:32:46 EST 2017

Since it's pretty much just C, this should work.  Now, what to do
with it?

Entry: new GPUs?
Date: Mon Dec 18 00:44:35 EST 2017

The HD7870 is 5 years old.  What's on the market today?

Radeon RX 470, $350, 2048 stream processors.

Entry: haskell opencl
Date: Sat Dec 30 01:09:36 EST 2017

https://lancelet.github.io/posts/2017-12-26-opencl-helloworld.html
https://hackage.haskell.org/package/language-c-quote
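Whatever the host language ends up being, the device side stays plain
OpenCL C, and the global/local size notions from the "opencl
processing elements" entry above map directly onto the enqueue call.
A minimal sketch under those assumptions: the vadd kernel and the
enqueue_vadd helper are illustrative names of my own, not taken from
the linked examples, and the usual context / program / buffer setup
is assumed to happen elsewhere.

  #include <CL/cl.h>

  /* Kernel = expression for one work item: each work item computes
     one output element.  Passed to clCreateProgramWithSource()
     during setup. */
  static const char *vadd_src =
      "__kernel void vadd(__global const float *a,\n"
      "                   __global const float *b,\n"
      "                   __global float *c,\n"
      "                   unsigned int n) {\n"
      "    size_t i = get_global_id(0);\n"
      "    if (i < n) c[i] = a[i] + b[i];\n"
      "}\n";

  /* Hypothetical helper: enqueue vadd over n elements.  Global size
     = number of work items, local size = work-group size (or pass
     NULL and let the runtime choose).  Error handling omitted. */
  static cl_int enqueue_vadd(cl_command_queue queue, cl_kernel kernel,
                             cl_mem a, cl_mem b, cl_mem c, cl_uint n)
  {
      clSetKernelArg(kernel, 0, sizeof(cl_mem),  &a);
      clSetKernelArg(kernel, 1, sizeof(cl_mem),  &b);
      clSetKernelArg(kernel, 2, sizeof(cl_mem),  &c);
      clSetKernelArg(kernel, 3, sizeof(cl_uint), &n);

      size_t local  = 64;  /* e.g. one GCN wavefront per group */
      /* OpenCL 1.1 requires global to be a multiple of local, hence
         the round-up here and the bounds check in the kernel. */
      size_t global = ((n + local - 1) / local) * local;

      return clEnqueueNDRangeKernel(queue, kernel,
                                    1,       /* work_dim */
                                    NULL,    /* global offset */
                                    &global,
                                    &local,  /* or NULL */
                                    0, NULL, NULL);
  }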