Wed Dec 13 01:44:56 EST 2017


So practically, how do you determine the split of work based on the
number of compute units?

OpenCL Compute Units = 20.
1280 "Stream Cores"
64 Stream Cores per compute unit.
Organized as 4 Vector Units with 16 SIMD lanes each


    In GCN, each SIMD unit is assigned its own 40-bit program counter
    and instruction buffer for 10 wavefronts. The whole CU can thus
    have 40 wavefronts in flight, each potentially from a different
    work-group or kernel, which is substantially more flexible than
    previous designs

A wavefront is a thread containing a SIMD instruction stream for 64
work items.  Each instruction is executed in 4 cycles on the 16-lane
execution unit.
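
The 4-cycle figure is just 64 work items spread over 16 lanes.  A
minimal Python sketch of that arithmetic.  The function names are made
up for illustration, and the mapping of work items to cycles
(consecutive groups of 16) is an assumption, not something these notes
confirm:

```python
WAVEFRONT_SIZE = 64   # work items per wavefront
SIMD_LANES = 16       # lanes per vector unit


def cycles_per_wavefront_instruction(wavefront_size=WAVEFRONT_SIZE,
                                     lanes=SIMD_LANES):
    # Ceiling division: passes over the lanes needed to cover every work item.
    return -(-wavefront_size // lanes)


def work_items_in_cycle(cycle, lanes=SIMD_LANES):
    # Assumption: cycle k handles work items k*lanes .. k*lanes + lanes - 1.
    return list(range(cycle * lanes, (cycle + 1) * lanes))


print(cycles_per_wavefront_instruction())        # 4
last = work_items_in_cycle(3)
print(last[0], last[-1])                         # 48 63
```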

Instructions are dispatched to the 4 SIMD units in a round-robin
fashion.  The dispatcher picks one of the 10 associated wavefronts.
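
A toy model of that dispatch pattern, in Python.  This is only a
sketch: the round-robin over SIMD units is from the description above,
but the policy for picking among the 10 wavefronts (here: simple
rotation) is an assumption, and real hardware also looks at readiness
and instruction type.

```python
from collections import deque

NUM_SIMDS = 4
WAVEFRONTS_PER_SIMD = 10


def simulate_issue(num_cycles):
    # One queue of wavefront slots per SIMD unit.
    queues = [deque(range(WAVEFRONTS_PER_SIMD)) for _ in range(NUM_SIMDS)]
    trace = []
    for cycle in range(num_cycles):
        simd = cycle % NUM_SIMDS        # round-robin over the SIMD units
        slot = queues[simd].popleft()   # assumed policy: take the front slot
        queues[simd].append(slot)       # and rotate it to the back
        trace.append((cycle, simd, slot))
    return trace


for cycle, simd, slot in simulate_issue(8):
    print(f"cycle {cycle}: SIMD {simd} issues from wavefront slot {slot}")
```

In this model each SIMD unit is visited once every 4 cycles, which
lines up with the 4 cycles a wavefront instruction occupies the 16
lanes.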

The 20-CU card I have can thus keep in flight:
- 800 64-way SIMD threads (20 CUs x 40 wavefronts)
- 51200 scalar threads (800 wavefronts x 64 work items)
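
The same numbers, derived in Python from the per-CU figures above.
The constant and function names are mine.  The in-flight counts are
upper bounds: register usage can lower them.  wavefronts_needed() is a
rough lower bound that ignores work-group boundaries.

```python
COMPUTE_UNITS = 20
SIMDS_PER_CU = 4
LANES_PER_SIMD = 16
WAVEFRONTS_PER_SIMD = 10
WAVEFRONT_SIZE = 64

stream_cores = COMPUTE_UNITS * SIMDS_PER_CU * LANES_PER_SIMD
wavefronts_in_flight = COMPUTE_UNITS * SIMDS_PER_CU * WAVEFRONTS_PER_SIMD
work_items_in_flight = wavefronts_in_flight * WAVEFRONT_SIZE


def wavefronts_needed(global_size, wavefront_size=WAVEFRONT_SIZE):
    # Ceiling division: a partially filled wavefront still takes a slot.
    return -(-global_size // wavefront_size)


print(stream_cores)            # 1280
print(wavefronts_in_flight)    # 800
print(work_items_in_flight)    # 51200

# Example: how many times a 1,000,000 work-item job fills the card.
print(wavefronts_needed(1_000_000) / wavefronts_in_flight)
```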

Each SIMD unit has a dedicated 512-entry scalar register file, shared
across the wavefronts executing on that SIMD unit.  The vector
registers seem to be organized similarly.  So the compiler definitely
needs to be aware of how wavefronts interact!

Bundling work into 64-way SIMD streams is done by the compiler.
Scheduling wavefronts at the SIMD level is done by the CU.
Are wavefronts associated with CUs programmatically?

So AMD's move from VLIW to GCN is about replacing parallelism inside
a wavefront with thread switching between different wavefronts.