
GPGPU: General Purpose computing on Graphics Processing Units

r/gpgpu

Posted 2 months ago

I'm currently looking for a cross-platform GPU computing framework, and I'm not sure which one to use.

Right now, it seems like OpenCL, the framework for cross-vendor computing, doesn't have much of a future, leaving no unified cross-platform system to compete against CUDA.

I've found a couple of options so far, and I've roughly ranked them from supporting the most platforms to the fewest.

  1. Vulkan

    1. Pure Vulkan with Shaders

      1. This seems like a great option right now, because anything that runs Vulkan will run Vulkan compute shaders, and many platforms run Vulkan. However, my big question is how to learn to write compute shaders. Most of the time, a high-level language is compiled down to the SPIR-V bytecode format that Vulkan consumes. One popular and mature language is GLSL, used in OpenGL, which has a decent amount of learning resources. However, I've heard that there are other languages that can be used to write high-level compute shaders. Are those languages mature enough to be worth learning? And regardless, could someone recommend good resources for learning to write shaders in each language?

    2. Kompute

      1. Same as Vulkan, but it reduces the amount of boilerplate code that is needed.

  2. SYCL

    1. hipSYCL 

    2. This seems like another good option, but ultimately it doesn't support as many platforms: "only" CPUs and Nvidia, AMD, and Intel GPUs. It uses existing toolchains behind one interface. Ultimately, it's only one implementation of many in the SYCL ecosystem, which is really nice. Besides it not supporting mobile platforms and all GPUs (for example, I don't think Apple silicon would work, or the in-progress Asahi Linux graphics drivers), I think having to learn only one language would be great, without having to wade through learning compute shaders. Any thoughts?

  3. Kokkos

    1. I don't know much about Kokkos, so I can't comment here. I'd appreciate hearing about anyone's experience, too.

  4. Raja

    1. Don't know anything here either

  5. AMD HIP

    1. It's basically AMD's way of easily porting CUDA code to run on AMD GPUs or CPUs. It only supports two platforms, but I suppose the advantage is that I'd essentially be learning CUDA, which has the most resources of any GPGPU platform.

  6. ArrayFire

    1. It's higher level than something like CUDA, and supports CPU, CUDA, and OpenCL as the backends. It seems to accelerate only tensor operations too, per the ArrayFire webpage.

All in all, any thoughts on the best approach for learning GPGPU programming while also staying cross platform? I'm leaning towards hipSYCL or Vulkan Kompute right now, but SYCL is still pretty new, and Kompute requires learning some compute shader language, so I'm wary of jumping into one without being more sure which one to devote my time to learning.

Posted 3 months ago

This is a technical write-up of the challenges and obstacles I faced in making compute kernels run on Nvidia video cards.

OpenGL compute

With OpenGL 4.3 came the inclusion of compute kernels, which are supposed to be a vendor-independent way of running code on arbitrary data residing in GPU memory. The specification was released back in 2012, so I thought that every card would support this 10-year-old technology. I wanted to implement my code on the oldest spec possible to give everyone a chance to play my game, not just the owners of the newest cards.

The three big video chip vendors are AMD, Intel, and Nvidia. Sadly, Nvidia already had CUDA, their vendor-specific way of running compute on the GPU, so they implemented the OpenGL support, let's just say, sub-optimally.

How it is supposed to work

With OpenGL you ship the source code, written in the GL Shading Language (based on C), to the user's machine in text form, and use the user's video card driver to compile that source into a program executable on the video card. Data structures in GPU memory are defined in SSBOs (shader storage buffer objects). While programming the GPU you want to use "structs of arrays" instead of "arrays of structs" to get coalesced memory access.
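
For reference, the host-side flow boils down to something like the sketch below. This is not the exact code from the game: a GL 4.3 context is assumed to already exist, shader_source is a placeholder for the GLSL text, the dispatch size is arbitrary, and error handling is omitted.

// compile the shipped GLSL text with the user's driver, then dispatch it
// shader_source (const char*) is assumed to hold the GLSL kernel text
GLuint shader = glCreateShader(GL_COMPUTE_SHADER);
glShaderSource(shader, 1, &shader_source, NULL);
glCompileShader(shader);                          // the driver compiles on the user's machine

GLuint program = glCreateProgram();
glAttachShader(program, shader);
glLinkProgram(program);

glUseProgram(program);
glDispatchCompute(64, 1, 1);                      // launch 64 work groups (placeholder size)
glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);   // make SSBO writes visible afterwards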

So for example if you want to define lines and circles in shader code you can do it like this:

// structs for holding the data
// we doing compute (TM) here so we need a lot of it
struct circle_s {
    float center_x [1024];
    float center_y [1024];
    float radius   [1024];
};
struct line_s {
    float start_x [1024];
    float start_y [1024];
    float end_x   [1024];
    float end_y   [1024];
};

// the named SSBO data buffer
// instantiate struct members
layout (...) buffer gpu_data_b {
    circle_s circle;
    line_s   line;
} data;

// you can use data members in code like this
void main(){
    // set the variables of the 1st circle
    data.circle.center_x [0] = 10.0;
    data.circle.center_y [0] = 11.0;
    data.circle.radius   [0] =  5.0;
}

This is still not a lot of data, only 28 kB. It has the benefit of defining the structs before instantiating them in GPU memory, so the definitions can be reused in C/C++ code to simplify data movement between CPU and GPU! Great! This works on Intel and AMD, it compiles just fine. But it does not compile on Nvidia. The shader compiler just crashes.
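
As an aside on that reuse, here is a rough sketch of the C side. It assumes std430 layout (where scalar arrays are tightly packed, so the C and GLSL layouts line up) and a placeholder binding point 0.

// circle_s and line_s are the struct definitions from the shader, included verbatim;
// the same definitions compile as plain C, so sizes and offsets match the GPU side
struct gpu_data_s {
    struct circle_s circle;
    struct line_s   line;
};

// create the SSBO and upload the whole block in one call
struct gpu_data_s cpu_data = {0};
GLuint ssbo;
glGenBuffers(1, &ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(cpu_data), &cpu_data, GL_DYNAMIC_DRAW);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);   // 0 must match the layout(...) binding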

Nvidia quirk 1 : loop unrolls

The first thing I came across while googling my problem is how aggressively Nvidia tries to unroll loops. Okay, so it is a known problem. I can work around it. The code looked like this before:

void main(){
    for (int i = 0; i < 8; i++){
        for (int j = 0; j < 8; j++){
            // lot of computation
            // lot of code
            // nested for loops needed for thread safe memory access reasons
            // if you unroll it fully, code size becomes 64 times bigger
        }
    }
}

There are mentions of Nvidia-specific pragmas to disable loop unrolling, but these did not work for me. So I forced the compiler not to unroll:

layout (...) buffer gpu_no_unroll_b {
    int zero;
} no_unroll;

// on NVidia video cards
#define ZERO no_unroll.zero

// on AMD and Intel
#define ZERO 0

void main(){
    for (int i = 0; i < (8 + ZERO); i++){
        for (int j = 0; j < (8 + ZERO); j++){
            // ...
        }
    }
}

I fill the no_unroll.zero GPU memory with 0 at runtime from the CPU side so the Nvidia compiler has no other choice but to fetch the memory location at runtime, forcing the loop to stay in place. On AMD and Intel I set the define to constant 0, so there is no performance impact on these platforms.
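
For completeness, that CPU-side fill is just a one-int buffer upload, roughly like the sketch below (binding point 1 is a placeholder and has to match the layout(...) of gpu_no_unroll_b).

// upload a single 0 so its value is only known to the shader at runtime
GLint zero = 0;
GLuint no_unroll_ssbo;
glGenBuffers(1, &no_unroll_ssbo);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, no_unroll_ssbo);
glBufferData(GL_SHADER_STORAGE_BUFFER, sizeof(zero), &zero, GL_STATIC_DRAW);
glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, no_unroll_ssbo);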

Nvidia quirk 2 : no structs

After a lot of googling I stumbled upon this stackoverflow post. It talks about how it takes a long time to run the program, but mine would not even compile without this change. Okay, so no structs. The code looks like this now:

// the named SSBO data buffer
// instantiate "struct" members
layout (...) buffer gpu_data_b {

    float circle_center_x [1024];
    float circle_center_y [1024];
    float circle_radius   [1024];

    float line_start_x    [1024];
    float line_start_y    [1024];
    float line_end_x      [1024];
    float line_end_y      [1024];

} data;

// you can use data in code like this
void main(){
    // set the variables of the 1st circle
    data.circle_center_x [0] = 10.0;
    data.circle_center_y [0] = 11.0;
    data.circle_radius   [0] =  5.0;
}

It still only works on AMD and Intel. But the direction is right: I can "trick" the Nvidia compiler into compiling my code base. The problem is that the Nvidia compiler eats so much RAM that it gets killed by the operating system after a while. I tried to unload all the compute kernel sources as soon as possible, and even tried to unload the compiler between compilations. This helped a little bit but did not solve the problem.
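
The usual GL-level way to do that kind of cleanup looks roughly like the sketch below; this is only an illustration of the idea, and glReleaseShaderCompiler (core since GL 4.1) is just a hint the driver may ignore.

// after linking a kernel, drop everything the driver might keep around for it
glDetachShader(program, shader);
glDeleteShader(shader);        // free the shader object and its copy of the source
glReleaseShaderCompiler();     // ask the driver to unload its compiler (hint only)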

Disk cache

On all OpenGL vendors there is disk caching involved. This means that the driver caches the compiled compute kernel executable to disk, saving it as a file. If it would need to recompile the same code (for example, you exited the game and started it again), it does not recompile; it just loads the saved executable from disk.
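
The same kind of caching can also be done by hand through the program binary API (core since GL 4.1). A minimal sketch, assuming the program has already linked successfully:

// first run: save the driver-compiled executable to our own cache file
GLint length = 0;
GLsizei written = 0;
GLenum format = 0;
glGetProgramiv(program, GL_PROGRAM_BINARY_LENGTH, &length);
void *binary = malloc(length);                        // needs <stdlib.h>
glGetProgramBinary(program, length, &written, &format, binary);
// ... write format and binary to disk here ...

// later runs: skip compilation entirely and load the cached binary
GLuint cached = glCreateProgram();
glProgramBinary(cached, format, binary, written);
free(binary);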

I have multiple kernels, so starting my game several times on a machine with Nvidia video card gave me this result:

  • 1st run

    • 1st compute kernel is compiled by the driver

    • 2nd compute kernel is compiled by the driver

    • trying to compile the 3rd kernel, driver eats all the memory, gets killed, game crashes

  • 2nd run

    • 1st compute kernel is cached, loaded from disk

    • 2nd compute kernel is cached, loaded from disk

    • 3rd compute kernel is compiled by the driver

    • 4th compute kernel is compiled by the driver

    • trying to compile the 5th kernel, driver eats all the memory, gets killed, game crashes

  • 3rd run

    • 1st compute kernel is cached, loaded from disk

    • 2nd compute kernel is cached, loaded from disk

    • 3rd compute kernel is cached, loaded from disk

    • 4th compute kernel is cached, loaded from disk

    • 5th compute kernel is compiled by the driver

    • 6th compute kernel is compiled by the driver

    • This was the last compute kernel, game launches just fine

While this "game launch" was not optimal at least I had something finally running on Nvidia. I thought I could launch the game in the background with a startup script, have it crash a few times, then finally launch it in the foreground when all compute kernels are cached, but I ran into the next problem.

Nvidia quirk 3 : no big arrays

In my shader code all arrays have a compile time settable size:

#define circle_size (1024)
#define line_size   (1024)

layout (...) buffer gpu_data_b {

    float circle_center_x [circle_size];
    float circle_center_y [circle_size];
    float circle_radius   [circle_size];

    float line_start_x    [line_size];
    float line_start_y    [line_size];
    float line_end_x      [line_size];
    float line_end_y      [line_size];

} data;

When I set those defined sizes too high, the Nvidia compiler crashes yet again, without caching a single compute shader. Others have encountered this problem too: "There is a minor GLSL compiler bug whereby the compiler crashes with super-large fixed-size SSBO array definitions." A minor problem for them, a major problem for me, as it turns out "super large" is only around 4096 elements in my case. After some googling it turned out that variable-sized SSBO arrays do not crash the Nvidia compiler. So I've written a Python script that translates a fixed-size SSBO definition into a variable-sized SSBO definition, with a lot of defines added for member accesses.

#define circle_size (1024*1024)
#define line_size   (1024*1024)

layout (...) buffer gpu_data_b {
    float array[];
} data;

#define data_circle_center_x(index) data.array[(index)]
#define data_circle_center_y(index) data.array[circle_size+(index)]
#define data_circle_radius(index)   data.array[2*circle_size+(index)]

#define data_line_start_x(index)    data.array[3*circle_size+(index)]
#define data_line_start_y(index)    data.array[3*circle_size+line_size+(index)]
#define data_line_end_x(index)      data.array[3*circle_size+2*line_size+(index)]
#define data_line_end_y(index)      data.array[3*circle_size+3*line_size+(index)]

// you can use data in code like this
void main(){
    // set the variables of the 1st circle
    data_circle_center_x (0) = 10.0;
    data_circle_center_y (0) = 11.0;
    data_circle_radius   (0) =  5.0;
}

Of course, a real-world example would use ints and uints too, not just floats. As there can be only one variable-sized array per SSBO, I created 3 SSBOs, one for each data type. Luckily I had avoided using the vector types available in GLSL, because I sometimes compile the GLSL code as C code to get access to better debugging support. With this modification the Nvidia compiler was finally defeated: it accepted my code and compiled all my compute kernels without crashing! And it only took one month of googling! Hooray!

Nvidia quirk 4 : no multiply wrap

From OpenGL 4.2 to 4.3 there was a change in the specification on how integer multiplication should behave. In 4.2, overflows were required to wrap around. In 4.3 this became undefined behavior. On the hardware I tested, AMD and Intel still wrap around, but Nvidia saturates. I relied on the wrapping behavior in a linear congruential pseudorandom number generator in my shader code. This is clearly out of spec, so I needed to change it. I found xorshift RNGs to be just as fast while staying within the OpenGL 4.3 specification.
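
For reference, the replacement is tiny. Here is a sketch of one xorshift32 step, written as C (the same three lines work as GLSL with uint); it uses only shifts and xors, so no multiplication can overflow.

#include <stdint.h>

// one xorshift32 step; the state must be seeded with a nonzero value
uint32_t xorshift32(uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;   // feed the result back in as the next state
}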

Early Access now on Steam!

Check out my game EEvol on Steam if you want to see what I used this technology for! Here is a quick guide on how to make it work. It is still a work in progress, but I can't stop, won't stop until I finish my dream of a big digital aquarium with millions and millions of cells, thousands of multicellular organisms coexisting with the simplest unicellular life forms, peacefully living day by day, displayed as the main decorative element of my living room.
