How to optimize Raspberry Pi code using its GPU
Posted by Thang Le Toan on 30 April 2018 03:55 AM
When I was at Apple, I spent five years trying to get source-code access to the Nvidia and ATI graphics drivers. My job was to accelerate image-processing operations using GPUs to do the heavy lifting, and a lot of my time went into debugging crashes or strange performance issues. I could have been a lot more effective if I’d had better insights into the underlying hardware, and been able to step through and instrument the code that controlled the graphics cards. Previously I’d written custom graphics drivers for game consoles, so I knew how useful having that level of control could be.
I never got the access I’d wanted, and it left me with an unscratched itch. I love CUDA/OpenCL and high-level shader interfaces, but the underlying hardware of graphics cards is so specialized, diverse, and quirky that you can’t treat them like black boxes and expect to get the best performance. Even with CUDA, you end up having to understand the characteristics of what’s under the hood if you want to really speed things up. I understand why most GPU manufacturers hate the idea of opening up their hardware: even just the developer support you’d need to offer for a bare-metal interface would take a lot of resources. It still felt like a big missed opportunity to write more efficient software.
That all meant I was very excited when Broadcom released detailed documentation of the GPU used on the Raspberry Pi a few months ago. The Pi’s a great device to demonstrate the power of deep learning computer vision, and I’d ported my open-source library to run on it. The CPU was woefully slow on the heavy math that neural networks require, though, taking almost twenty seconds to recognize an object even with optimized assembler, so I had a real problem that I thought GPU acceleration might be able to help with.
Broadcom’s manual is a good description of the hardware interface to their GPU, but you’ll need more than that if you’re going to write code to run on it. In the end I was able to speed up object recognition from twenty seconds on the CPU to just three on the GPU, but it took a lot of head-scratching and help from others in the community to get there. In the spirit of leaving a trail of breadcrumbs through the forest, I’m going to run through some of what I learned along the way.
Broadcom’s VideoCore Reference Guide will be your bible and companion. I’m constantly referring to it to understand everything from assembly instructions to interface addresses.
The very first program you should try running is the hello_fft sample included in the latest Raspbian. If you can get this running, then at least you’re set up correctly to run GPU programs.
There’s no debugger for the GPU, at all. You can’t even log messages. In the past I’ve had to debug shaders by writing colors to the screen, but in this case there isn’t even a visible output surface to use. I’ve never regretted investing time up-front into writing debug tools, so I created a convention where one register was reserved for debug output. Its contents would be written out to main memory at the end of the program, or immediately if the code invoked a LOG_AND_EXIT() macro, and then printed to the console after the code was done. It’s still painful, but this mechanism at least let me get glimpses of what was going on internally.
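The host-side half of a convention like that can be sketched in a few lines. This is a hypothetical illustration, not the code from the library: it assumes the reserved register arrives in main memory as 16 little-endian 32-bit lanes (64 bytes), and the name `decode_debug_register` is made up. Because the same bits might hold an integer or a float, it shows both readings of every lane.

```python
import struct

LANES = 16  # assumption: one QPU register dumped as 16 x 32-bit values


def decode_debug_register(raw):
    """Return (lane, as_uint32, as_float32) for each lane of a 64-byte dump."""
    if len(raw) != LANES * 4:
        raise ValueError("expected %d bytes, got %d" % (LANES * 4, len(raw)))
    as_ints = struct.unpack("<16I", raw)    # reinterpret as unsigned integers
    as_floats = struct.unpack("<16f", raw)  # reinterpret as IEEE floats
    return [(lane, as_ints[lane], as_floats[lane]) for lane in range(LANES)]


if __name__ == "__main__":
    # Stand-in for the memory the GPU program would have written out.
    fake_dump = struct.pack("<16f", *[float(i) for i in range(LANES)])
    for lane, as_int, as_float in decode_debug_register(fake_dump):
        print("lane %2d: 0x%08x  %r" % (lane, as_int, as_float))
```

Printing both interpretations side by side avoids having to remember which type the debug register held at the point where the program bailed out.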
I also highly recommend using a regular laptop to ssh into your Pi, alongside something like sshfs so you can edit source files easily in your normal editor. You’ll be crashing the device a lot during development, so having a separate development machine makes life a lot easier.
Vertex Pipe Memory
One of the eternal problems of GPU optimization is getting data back and forth between the main processor and the graphics chip. GPUs are blazingly fast when they’re working with data in their local memory, but coordinating the transfers so they don’t stall either processor is a very hard problem. My biggest optimization wins on the PlayStation 2 came from fiddling with the DMA controller to feed the GPU more effectively, and on modern desktop GPUs grouping data into larger batches to upload is one of the most effective ways to speed things up.
The Broadcom GPU doesn’t have very much dedicated memory at all. In fact, the only RAM that’s directly accessible is 4,096 bytes in an area known as Vertex Pipe Memory, or VPM. This is designed to be used as a staging area for polygon coordinates so they can be transformed geometrically. My initial assumption was that this would have the fastest path into and out of the GPU, so I built my first implementation to rely on it for data transfer. Unfortunately, it has a few key flaws.
There are actually 12 cores inside the GPU, each one known as a QPU for Quad Processing Unit. The VPM is shared between them, so there wasn’t much available for each. I ended up using only 8 cores, and allocating 512 bytes of storage to each, which meant doing a lot of small and therefore inefficient transfers from main memory. The real killer was that a mutex lock was required before kicking off a transfer, so all of the other cores ground to a halt while one was handling an upload, which killed parallelism and overall performance.
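To put rough numbers on those small transfers, here is a back-of-the-envelope sketch. The 4,096-byte size and the 8-core split come from the text above; the 1,024 x 1,024 matrix of 32-bit floats is an arbitrary size picked for illustration, not one of the network’s real layers, and `vpm_budget` is a made-up helper name.

```python
import math

VPM_BYTES = 4096      # directly accessible memory, per the text above
CORES_USED = 8        # QPUs used in the VPM-based version
BYTES_PER_FLOAT = 4   # assumption: data staged as 32-bit floats


def vpm_budget(matrix_rows, matrix_cols):
    """Return (bytes per core, floats per transfer, transfers needed)."""
    bytes_per_core = VPM_BYTES // CORES_USED
    floats_per_transfer = bytes_per_core // BYTES_PER_FLOAT
    total_floats = matrix_rows * matrix_cols
    transfers = math.ceil(total_floats / floats_per_transfer)
    return bytes_per_core, floats_per_transfer, transfers


if __name__ == "__main__":
    per_core, per_transfer, transfers = vpm_budget(1024, 1024)
    print(per_core, "bytes per core,", per_transfer, "floats per transfer,",
          transfers, "transfers to stream the whole matrix once")
```

Each core can only stage 128 floats at a time under these assumptions, so streaming even one such matrix takes thousands of transfers, and every one of them has to wait its turn for the mutex.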
Texture and Memory Lookup Unit
After I released the initial VPM-based version of GEMM, the matrix-to-matrix multiply function that’s the most time-consuming part of the object recognition process, several people mentioned that the Texture and Memory Lookup Unit, or TMU, was a lot more efficient. The documentation only briefly mentions that you can use the TMU for general memory access, and there wasn’t any detail on how to do it, so I ended up looking at the disassembly of the hello_fft sample to see how it was done. I also received some help over email from Eben Upton himself, which was a lovely surprise! Here’s a summary of what I learned:
– There are two TMUs available to each core. If you have an algorithmic way to divide the work evenly between them, you can choose which one to use yourself by turning off ‘TMU swap’. If you leave it enabled, half the cores will be transparently rewired so that TMU0 and TMU1 are swapped.
– You write a vector of 16 addresses to registers ra56 and ra60 for TMU0 and TMU1 respectively, and that will start a fetch of the values held in those addresses.
– Setting a ldtmu0/1 code in an instruction causes the next read in the pipeline to block until the memory values are returned, and then you can read from r4 to access those values in further instructions.
– There’s a potentially long latency before those values are ready. To mitigate that, you can kick off up to four reads on each TMU before calling a ldtmu0/1. This means that memory reads can overlap with computation on the GPU, which helps performance a lot.
– To reduce extra logic-checking instructions, I don’t try to prevent overshooting on speculative reads, which means there may be accesses beyond the end of arrays (though the values aren’t used). In practice this hasn’t caused problems.
– I haven’t dived into this yet, but there’s a 4 KB direct-mapped L1 cache with 64-byte lines for the TMU. Avoiding aliasing on this will be crucial for maintaining speed, and in my case I bet it depends heavily on the matrix size and the allocation of work to different QPUs. There are performance counters available to monitor cache hits and misses, and in my past experience dividing up the data carefully so everything stays in-cache can be a big optimization.
– A lot of my data is stored as 8- or 16-bit fixed point, and the VPM had a lot more support for converting it into float vectors than the TMU does. I discovered some funky problems, like the TMU ignoring the lower two bits of addresses and only loading from 32-bit aligned words, which was tricky when I was dealing with odd matrix widths and lower precision. There isn’t much support for ‘swizzling’ between components in the 16-float vectors that are held in each register either, beyond rotating, so I ended up doing lots of masking tricks.
– Reading from nonsensical addresses can crash the system. During development I’d sometimes end up with wildly incorrect values for my read addresses, and that would cause a hang so severe I’d have to reboot.
– This isn’t TMU specific, but I’ve noticed that having a display attached to your Pi taxes the GPU, and can result in slower performance by around 25%.
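The address arithmetic behind several of those points can be modelled on the host. The sketch below is plain Python rather than QPU assembly, and every function name in it is hypothetical. It assumes little-endian byte order inside each 32-bit word, and it assumes the cache picks a line straight from the address bits, which is the usual direct-mapped scheme but is not confirmed for this hardware by anything above.

```python
LANES = 16           # one address per element of the 16-wide vector
WORD_BYTES = 4
CACHE_BYTES = 4096   # 4 KB direct-mapped L1, per the list above
LINE_BYTES = 64
CACHE_LINES = CACHE_BYTES // LINE_BYTES  # 64 lines


def lane_addresses(base, stride=WORD_BYTES):
    """The 16 addresses a single vector fetch would request."""
    return [base + lane * stride for lane in range(LANES)]


def tmu_effective_address(addr):
    """Model the TMU ignoring the lower two address bits."""
    return addr & ~(WORD_BYTES - 1)


def extract_u8(word, addr):
    """Pick the byte at addr out of the aligned word that was fetched."""
    shift = 8 * (addr & (WORD_BYTES - 1))  # little-endian assumption
    return (word >> shift) & 0xFF


def cache_line_index(addr):
    """Which cache line an address lands on under the assumed mapping."""
    return (addr // LINE_BYTES) % CACHE_LINES


if __name__ == "__main__":
    addr = 0x1001
    print(hex(tmu_effective_address(addr)))       # word actually fetched
    print(hex(extract_u8(0xAABBCCDD, addr)))      # byte wanted from it
    # Two rows exactly 4,096 bytes apart compete for the same cache line.
    print(cache_line_index(0x0) == cache_line_index(0x1000))
```

Under this model, any two buffers whose addresses differ by a multiple of 4,096 bytes evict each other, so matrix row strides near that size are the first thing worth checking against the performance counters.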
In the end I was able to perform object recognition in just three seconds with the optimized TMU code, rather than six using the VPM, which opens up a lot more potential applications!
Developing GPU code on the Raspberry Pi has come a long way in just the last few months, but it’s still in its early stages. For example, I’m hitting mysterious system hangs when I try to run my deep learning TMU example with any kind of overclocking, and there’s no obvious way to debug those kinds of problems, especially if they’re hard to reproduce in a simple example.
The community, including folks like eman, Eben, Andrew Holme, and Herman Hermitage, are constantly improving and extending the documentation, examples, and tools, so developing should continue to get easier. I recommend keeping an eye on the Raspberry Pi forums to see the latest news!
Running the example
If you want to try out the deep learning object recognition code for yourself, you can follow these steps:
Install the latest firmware by running `sudo rpi-update`.
From `raspi-config`, choose 256MB for GPU memory.
Clone qpu-asm from GitHub.
Run `make` inside the qpu-asm folder.
Create a symbolic link to the qpu-asm program, for example by running `sudo ln -s /home/pi/projects/qpu-asm/qpu-asm /usr/bin/`.
Clone DeepBeliefSDK from GitHub.
From the DeepBeliefSDK/source folder, run `make TARGET=pi GEMM=piqpu`.
Once it’s successfully completed the build, make sure the resulting library is in your path, for example by running `sudo ln -s /home/pi/projects/DeepBeliefSDK/source/libjpcnn.so /usr/lib/`.
Run `sudo ./jpcnn -i data/dog.jpg -n ../networks/jetpac.ntwk -t -m s`