Hisztogram közös memóriával

Atomikusok gyorsítása

A korábbi példában megnéztük hogyan lehet az atomikus műveletekkel hisztogramot számítani. A korábbi programon viszont a közös memória haszálatával még lehet fejleszteni. Ezt nézzük meg a mostani kódban.

A program teljes kódja letölthető innen, vagy megtalálható a lap alján.

Új kernel

Az új hatékonyabb kódunkhoz gyakorlatilag csak a kernelt kell módosítanunk.

Vegyük észre ugyanis, hogy ha a globális memóriában végzünk atomikus műveleteket az nem túl hatékony.

Egyrészt, ha feltesszük, hogy a számítógépben van 32 multiprocesszor, amiken amik egyszerre 32-32 aktív szálat kezelnek, akkor 32*32=1024 szál próbál növelni 256 hisztogrmam számlálót. Ez legjobb esetben is számlálónként egy négyszeres konfliktus, aminek feloldása 4*1200 = 4800 órajelt vesz igénybe.

Van viszont a multiprocesszorokban egy közös memória is, amiben tudunk tömböket tárolni, és blokkon belül közösen kezelni. Így tehát, ha a blokkok szálai a közös memóriában előállítanának egy részeredményt, akkor egyszerre csak 32 szál között lehetne konfliktus, amik 256 értéket kezelnek. Itt mégy figyelembe kell venni, hogy a közös memória 32 bankba van szervezve tehát 32 szál növel 32 bankot, de az arány akkor is 32-32 lesz.

Ehhez jön, hogy a közös memória elérése sokkal gyorsabb, mint a globális memóriáé, ugyanis a közös memória bankjai 4 órajel alatt elérhetőek (szemben a globális memória 600 órajeles késleltetésével). A számítás tehát kihozza, hogy statisztikailag egy-egy növeléshes 8 órajel lég lehet a korábbi 4800 helyett.

Az új kernel tehát először is közös memóriában tárol egy rész-hisztogramot.

    __shared__  unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

A közös hisztogram foglalása és nullázása után kiszámítjuk a részeredményt.

int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1 );
        i += stride;
    }

Végül pedig csak annyi feladatunk marad, hogy az számokat a közös memóriából kiírjuk a globális memóriába.

    __syncthreads();
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );

Ez már nem annyira időigényes feladat, mivel hisztogram számláló sokkal kevesebb van, mint pixel.

A számításban most kihasználtuk, hogy pont annyi szál fut, ahány hisztogram elem van, így egy-egy érték kiírása szálanként pont megfelelő.

Teljes kód egyben

/* This software contains source code provided by NVIDIA Corporation.
 * The program is based on the example code hist_gpu_shmem_atomics.cu
 * of Chapter 9 of the "CUDA by Example" book.
 * 
 * A program az NVIDIA Corporation által készített programkódot tartalmaz.
 * A program a "Cuda by Example" tankönyv 9. fejezetének
 * hist_gpu_shmem_atomics.cu  példaprogramjára épül.
 */

#include 
#include 

#include 
#include 
#include 

using namespace std;

#define SIZE    (100*1024*1024)


void* big_random_block( int size ) {
    unsigned char *data = (unsigned char*)malloc( size );
    for (int i=0; i<size; i++)
        data[i] = rand();

    return data;
}

__global__ void histo_kernel( unsigned char *buffer,
                              long size,
                              unsigned int *histo ) {

    // clear out the accumulation buffer called temp
    // since we are launched with 256 threads, it is easy
    // to clear that memory with one write per thread
    __shared__  unsigned int temp[256];
    temp[threadIdx.x] = 0;
    __syncthreads();

    // calculate the starting index and the offset to the next
    // block that each thread will be processing
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    while (i < size) {
        atomicAdd( &temp[buffer[i]], 1 );
        i += stride;
    }
    // sync the data from the above writes to shared memory
    // then add the shared memory values to the values from
    // the other thread blocks using global memory
    // atomic adds
    // same as before, since we have 256 threads, updating the
    // global histogram is just one write per thread!
    __syncthreads();
    atomicAdd( &(histo[threadIdx.x]), temp[threadIdx.x] );
}

int main( void ) {
    unsigned char *buffer =
                     (unsigned char*)big_random_block( SIZE );

    // capture the start time
    // starting the timer here so that we include the cost of
    // all of the operations on the GPU.  if the data were
    // already on the GPU and we just timed the kernel
    // the timing would drop from 74 ms to 15 ms.  Very fast.
    cudaEvent_t     start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );
    cudaEventRecord( start, 0 );

    // allocate memory on the GPU for the file's data
    unsigned char *dev_buffer;
    unsigned int *dev_histo;
    cudaMalloc( (void**)&dev_buffer, SIZE );
    cudaMemcpy( dev_buffer, buffer, SIZE, cudaMemcpyHostToDevice );

    cudaMalloc( (void**)&dev_histo, 256 * sizeof( int ) );
    cudaMemset( dev_histo, 0, 256 * sizeof( int ) );

    // kernel launch - 2x the number of mps gave best timing
    cudaDeviceProp  prop;
    cudaGetDeviceProperties( &prop, 0 );
    int blocks = prop.multiProcessorCount;
    histo_kernel<<<blocks*2,256>>>( dev_buffer, SIZE, dev_histo );
    
    unsigned int    histo[256];
    cudaMemcpy( histo, dev_histo, 256 * sizeof( int ), cudaMemcpyDeviceToHost );

    // get stop time, and display the timing results
    cudaEventRecord( stop, 0 );
    cudaEventSynchronize( stop );
    float   elapsedTime;
    cudaEventElapsedTime( &elapsedTime, start, stop );

    long histoCount = 0;
    for (int i=0; i<256; i++) {
        histoCount += histo[i];
        cout << i << ": " << histo[i] << endl;
    }

    cout << endl;
    cout << "Time to generate:  " << elapsedTime << " msec." << endl;
    cout << "Histogram Sum: " << histoCount << endl;

    // verify that we have the same counts via CPU
    for (int i=0; i<SIZE; i++)
        histo[buffer[i]]--;
    for (int i=0; i<256; i++) {
        if (histo[i] != 0)
            cout << "Failure at " << i << "!  Off by " << histo[i] << endl;
    }

    cudaEventDestroy( start );
    cudaEventDestroy( stop );
    cudaFree( dev_histo );
    cudaFree( dev_buffer );
    free( buffer );
    return 0;
}

Feladat

Írjuk át a belső szorzat példaprogramot, hogy a grafikus memóriában egyetlen float értékbe számításon.
- Nézzük meg,hogy ez megváltoztatta-e a futási időt!

Licensed under the Creative Commons Attribution Share Alike License 4.0