Tuesday, March 18, 2014

Parallel Concerns

Computers continue to increase in power at pretty much the same rate as before, doubling their performance every 18 to 24 months. This trend is known as Moore's law. The leaders of the microprocessor industry swear they see no end to this law in the 21st century, that everything is fine. Others say it will come to an end around 2020, but that with some luck we should be able to find another exponential to ride.

Regardless of who you believe, you should be wary of the fine print. Moore's law is about transistor count; it says nothing about how fast those transistors operate. Since the sixties, programmers grew used to seeing their algorithms improve in performance with every new hardware generation. You could count on Moore's law to make your software run twice as fast every 18 months. No changes were needed, not even a recompile. The same program would just run faster.

This way of thinking came to an end around 2005. Clock frequencies hit a physical limit. Since then you cannot compute any faster, you can only compute more. To achieve faster results the only option is to compute in parallel. 

Chip-makers try not to make a big deal out of this, but for programmers it is a world of difference. If they were car-makers, it would be as if their cars had reached a physical limit of 60 miles per hour. When asked how you could do 120 miles per hour, they would suggest you take two cars.


Many programmers today ignore this monumental shift. Often it is because they write code for platforms that already deal with parallelization, like database engines or web servers. Or they work on user interfaces, where a single thread of code is already fast enough to outpace humans.

Procedural generation requires all the processing power you can get. Any algorithms you design will have to run in this new age of computing. You have no choice but to make them run in parallel.

Parallel computing is an old subject. For many years programmers have been trying to find a silver-bullet approach that will take a simple serial algorithm and reshuffle it so it can run in parallel. None has succeeded. You simply cannot ignore parallelization and rely on wishful thinking. Your algorithms have to be designed with it in mind. What is worse, your algorithm design will depend on the hardware you choose, because parallel means something different depending on where you go.

This post is about my experience with different parallel platforms.

In my view, there are three main hardware platforms worth considering today. The first is the traditional CPU, like the x86 series by Intel. Multicore CPUs are now common, and the number of cores may still grow exponentially. Ten years from now we could have hundreds of them. If you manage to break your problem into many independent chunks, you can feed each chunk to an individual core and have them run in parallel. As the number of cores grows, your program runs faster.

Let's say you want to generate procedural terrain. You could break the terrain into regular chunks by using an Octree, then process many Octree cells in parallel by feeding them to the available cores.
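To make this concrete, here is a minimal C++ sketch of the fan-out and gather pattern. The cell and mesh types and the GenerateCell function are just placeholders for illustration; the point is that the per-cell work remains plain serial code while the cells themselves are spread across cores:

#include <future>
#include <vector>

// Placeholder types: an octree cell to generate and the mesh produced from it.
struct OctreeCell { float x, y, z, size; };
struct CellMesh   { std::vector<float> vertices; };

// Plain serial code: generates the terrain for a single cell.
CellMesh GenerateCell(const OctreeCell& cell)
{
    CellMesh mesh;
    // ... voxel sampling, surface extraction, etc. would go here ...
    return mesh;
}

// Fan the cells out to the available cores, then sew the results back together.
std::vector<CellMesh> GenerateTerrain(const std::vector<OctreeCell>& cells)
{
    std::vector<std::future<CellMesh>> jobs;
    for (const OctreeCell& cell : cells)
        jobs.push_back(std::async(std::launch::async, GenerateCell, cell));

    std::vector<CellMesh> meshes;
    for (auto& job : jobs)
        meshes.push_back(job.get());
    return meshes;
}

A production version would cap the number of worker threads at the core count instead of launching one per cell, but the structure stays the same.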

The x86 platform has the nicest, most mature development tools. Also, since the number of concurrent threads is not very high, the parallelization effort goes into breaking the work into large chunks and sewing the results back together. Most of the actual computation remains serial. This is a bonus: anyone can write stable serial code, but you will pay dearly for good parallel programmers. The more old-fashioned serial code you can write within your fully parallel solution, the better.

Being so oriented towards serial, generic code is also the weakness of traditional CPUs. They devote a great deal of transistors and energy to dealing with the unpredictability of generic algorithms. If your logic is very simple and linear, all that silicon goes to waste.

The second hardware platform is the GPU. These are the video cards powering games in PCs and consoles. Graphics rendering is highly parallel. The image you see on screen is a matrix of pixels, where each pixel's color can be computed mostly in isolation from the rest. Video cards have evolved around this principle. They allow hundreds of concurrent threads to run in parallel, each one devoted to producing one fragment of the final image. Compared to today's average CPU, which allows only eight simultaneous threads, hundreds of threads may seem like a bonanza.

The catch is that all these GPU threads are required to run the same instruction at the same time. Imagine you have 100 dancers on stage. Each dancer must execute exactly the same steps. If just one dancer deviates from the routine, the other dancers stop and wait.


Perfect synchronization is desirable in ballet, not so much in general-purpose computing. A single "if" statement in the code can be enough to create divergence. What often happens when an "if" is encountered is that both branches are executed by all threads, and the results of the branch that was not supposed to run are discarded. It is like some sort of weird quantum reality where all alternate scenarios do happen, but only the outcome of one is picked afterwards. The cat is really both dead and alive in your GPU.

Loops with a variable number of iterations are a problem too. If 99 of the dancers spin twice and the one remaining dancer, for some inexplicable reason, decides to spin forty times, the 99 dancers won't be able to do anything else until the rogue dancer is done. The execution time is the same as if all the dancers had spun forty times.

So programming mainstream GPUs is great as long as you can avoid loops and conditional statements. This sounds absurd, but with the right mindset it is very much possible. The speedup compared to a multithreaded CPU implementation may be significant.
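As a small illustration of that mindset, here are two made-up kernels, written in the CUDA dialect I will get to in a moment, that clamp negative values to zero. The first branches on the data, inviting divergence; the second produces the same result with arithmetic alone, so every thread executes exactly the same instructions. For something this trivial the compiler will often do the conversion for you, but for heavier branches it is up to you:

// A data-dependent branch: threads with negative values take a different
// path than the rest, so the hardware may have to walk both paths.
__global__ void ClampDivergent(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        if (in[i] < 0.0f)
            out[i] = 0.0f;
        else
            out[i] = in[i];
    }
}

// The same result, branchless: every thread runs the same instructions
// no matter what its input looks like.
__global__ void ClampBranchless(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaxf(in[i], 0.0f);
}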

There are frameworks that allow general-purpose programs to run on GPUs. CUDA is probably the best known. It is deceptively simple: you write a single function in a language almost identical to C. Each of the many threads in the GPU will run this function at the same time, but each will read its input and write its output from a different location. Let's say you have two large arrays of the same size, A and B, and you want to compute array C as their sum. To solve this using CUDA you would write a function that looks like this:

void Add(in float A[], in float B[], out float C[])
{
    int i = getThreadIndex();
    C[i] = A[i] + B[i];
}

This is pseudocode; the actual CUDA code would look a bit different, but the idea is the same. The function processes only one element of the array. To have the entire array processed, you ask CUDA to spawn a thread for each element.
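For the curious, here is roughly what the real thing looks like, as a simplified sketch with no error checking. The kernel at the top is the per-element function; the host code below feeds the arrays to the GPU, spawns one thread per element and reads the result back:

#include <cuda_runtime.h>

__global__ void Add(const float* A, const float* B, float* C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        C[i] = A[i] + B[i];
}

void AddOnGPU(const float* A, const float* B, float* C, int n)
{
    size_t bytes = n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Feed the inputs to the video card.
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    // Spawn one thread per element, in blocks of 256 threads.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    Add<<<blocks, threads>>>(dA, dB, dC, n);

    // Read the result back.
    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}

Note how most of the host code is about moving data around. That detail matters, as we will see in a moment.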

One big drawback of CUDA is that it is a proprietary framework. It is owned by Nvidia and so far it is limited to their hardware. This means you cannot run CUDA programs on AMD GPUs.

An alternative to CUDA is OpenCL. OpenCL was proposed by Apple, but it is an open standard like OpenGL. It is almost identical to CUDA, maybe a tad more verbose, but for a good reason: not only does it run on both Nvidia and AMD GPUs, it also runs on CPUs. This is great news for developers. You can write code that will use all the computing resources available.

Even with these frameworks to aid you, GPU programming requires a whole different way of thinking. You need to frame your problem in a way that can be digested by this swarm of rather stupid worker threads. You will need to come up with the right choreography for them, otherwise they will wander aimlessly, scratching their heads. And there is one big skeleton in the closet: it is easy to write programs that run on the GPU, but it is hard to make full use of it. Often the bottleneck is between the GPU and main memory. It takes time to feed data in and read results back. Adding two arrays in the naive form, as shown in the example before, would spend 99% of the time moving data and 1% doing the actual operation. As the arrays grow in size, the GPU performs poorly compared to a CPU implementation.
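Some rough numbers show why. Take 100 million floats per array, a bus that moves around 6 GB/s between main memory and the GPU, and a GPU that can do on the order of 100 billion additions per second. These are made-up but plausible figures; the exact values depend on your hardware:

#include <cstdio>

int main()
{
    // Illustrative figures only; the real numbers depend on the hardware.
    const double elements      = 100e6;              // 100 million floats per array
    const double bytesMoved    = elements * 4 * 3;   // A and B in, C out
    const double busBandwidth  = 6e9;                // ~6 GB/s between CPU and GPU
    const double addsPerSecond = 100e9;              // ~100 billion adds per second

    const double transferTime = bytesMoved / busBandwidth;  // time spent moving data
    const double computeTime  = elements / addsPerSecond;   // time spent adding

    printf("transfer: %.0f ms, compute: %.0f ms, %.1f%% of the time is data movement\n",
           transferTime * 1000, computeTime * 1000,
           100 * transferTime / (transferTime + computeTime));
    return 0;
}

With those numbers the transfers take about 200 milliseconds and the additions about one, so data movement eats more than 99% of the total. The balance only shifts back in the GPU's favor when the data already lives on the card, or when you do much more work per element.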

So which platform should you target, CPUs or GPUs?

Soon it may not be so clear-cut anymore. CPUs and GPUs are starting to converge. CPUs may soon include a higher number of less intelligent cores. These cores will not be very good at running unpredictable generic code, but will do great with linear tasks. On the other side, GPUs will get better at handling conditionals and loops, and new architectures are already addressing the bandwidth issues. Still, it will take a few years for all this to happen and for the hardware to become mainstream.

If you were building a procedural supercomputer today, it would make sense to use a lot of GPUs. You would get more done for a smaller electricity bill. You could stick to Nvidia hardware and hire a few brainiacs to develop your algorithms in CUDA.

If you want to crowd-source your computing, that is, have your logic run by your users, GPUs also make a lot of sense. But then using a general-purpose computing framework like CUDA or OpenCL may not be a good idea. In my experience you cannot trust everyone to have the right stack of drivers installed. In the absence of a general computing framework, you would need to use graphics rendering APIs like DirectX and OpenGL to perform general-purpose computing. There is a lot of literature and research on GPGPU (General-Purpose computing on GPUs), but this path is not for the faint of heart. Things will get messy.

On the other hand, CPUs are very cheap and easy to program. It is probably better to get a lot of them running not-so-brilliant code, which after all is easier to produce and maintain. As often happens, hardware excellence does not prevail. You may achieve a lot more just because of how developer-friendly the platform is and how easy it will be to port your code.

This brings us to the third parallel platform: a network of computers (aka the cloud). Imagine you get hundreds of PCs talking to each other over a local area network. Individually they could be rather mediocre systems. They will be highly unreliable: hard drives could stop working at any time, power supply units could get busted without warning. Still, as a collective they could be cheaper, faster and simpler to program than anything else.

23 comments:

  1. It really is a very exciting time to be in right now, and the compute world is exploding at a pace I don't think anyone can keep up with at the moment. For example, the soon-to-be-released CUDA 6.0 comes with built-in abstraction and management layers for what you have correctly identified as the biggest bottleneck in GPU compute at the moment: GPU<->CPU memory communication. The problem still exists of course, but it's finally managed by something often smarter than us. :)

    It would be nice for it to start to normalize soon though as I think a lot of applications (3D engines) could be drastically improved if they were made with a modern parallel architecture.


  2. Good post ;)

    Don't forget AVX. Not long ago all we had was 128-bit SSE.

    Now with AVX we are at 256. In a year or two Intel will release AVX-512. You have to explicitly use these instructions, but they do give a nice speedup if you can make your algorithm fit.

    I do wish GPU compute had full support for double without a loss of performance.



  3. So... 10 or more years from now, where do you think the focus will be? With smarter drivers/architecture, and breakthroughs like resistive RAM, graphene, etc.?

  4. No idea. I guess it really depends on whether Moore's law continues. If it does not, my bet is on the cloud. Computing will get bulky and power hungry, and we will have no choice but to hide these monsters away in data centers and keep thinner clients for the interface.

    If Moore's law were to continue for a long time, I think there may be a shift towards power at your fingertips. Processing will move as close to the client as possible. It is more convenient, secure and private.

    In an ideal world networks are important because of the exchange of information, but we want processing power to be as close to the client as possible. It is a more resilient architecture.

    For instance, I would like our future robot nurses to be really powerful computers walking around. They should be as self-reliant as possible. On the other hand, if they are rather stupid drones controlled by a central intelligence in the cloud, what happens when the network goes down?

    Replies
    1. How far do you think cloud computing can go with voxels given our current internet speeds? Assuming the bottleneck is purely in the amount of data transferred and not the time required to generate the world. Compared to your current local worlds, how much smaller of a voxel size could you go?

    2. Not a big deal according to our stats. Smaller voxels also imply more processing if you were to generate everything in the client.

      Also note you need to send voxels only for the area the viewer can interact with. This is just the first LOD. Subsequent LODs can be just polygons which we can compress very well. For the current voxel sizes in our demos, using the cloud produces the fastest experience we have seen. It is amazingly fast.

      The drawback is you need to pay for the processor resources used by the cloud. I think a hybrid system is best, where you engage all the cores you can in the client and compensate from the cloud for what you lack.

  5. I think parallel architecture would be a good way to take full advantage of the coming quantum computers, especially when dealing with big data sets like lighting and voxels.

    Replies
    1. I have no idea how this would work. Can you explain more?

    2. I just meant that qubits are supposed to enable algorithms on very big data sets, because a single qubit can hold multiple values simultaneously and test on them at the same time, and somehow that's supposed to help it scope out very efficient routes through the data or something (computational Darwinism?). It's a kind of parallelism at the bit level. But an article I read on it (sorry, I can't find it now) mentioned some of the same issues you brought up: even when processing multiple values simultaneously, depending on the algorithm they might have to wait for other values to come down a pipeline to make progress, etc., so it's not a silver bullet by itself. That and your post made me think that some of the same lessons from parallel architecture would probably need to apply for it to live up to its potential. I think I worded it poorly in that comment.

  6. Any thoughts on the DX12 press release? Are you excited by the low level optimizations? Does this open any new doors?

    Replies
    1. Have not checked it out yet. Still, DX is proprietary, same as CUDA or Mantle... it is a bit hard for me to get excited about what happens inside a walled garden. But progress is always good. Will check it out as soon as I return from GDC.

  7. Still, the only thing programmers can do to run their code faster is to optimize it. Which brings me to a question: https://www.youtube.com/watch?v=f_ns9R9CC1U
    What tips could you give me for optimizing my own procedural voxel terrain?

    Replies
    1. Not sure if you already do it, but it is essential you start with a good profiler. I like Intel's VTune but there are others too. I am often wrong when guessing where the bottlenecks are, but the profiler will tell you for sure. In my experience it is often about how your code accesses memory, whether you are getting cache misses or stepping over cache lines. Being able to understand the assembly code the compiler is churning out helps a lot too.

      Your terrain looks cool already, if you have specific questions on how to optimize it or anything else I will do my best to help.

    2. Thanks for your reply. I will check out VTune. I didn't know that something like that exists. Reading assembly code isn't quite enjoyable, but I think there is no way around it.

      Anyway, the main concern I have with the terrain is that it is still too smooth. Also there are no overhangs. To get these kinds of features I have to process more frequencies of Perlin noise. And there is (I think) the main bottleneck: the procedural generation itself, as the surface extraction and seam creation are already quite fast.

      In earlier posts you said that you precook the geometry on a server farm. Is that still the case, or is the mesh generation now client-side?

      Either way I have to optimize the voxel creation somehow, and then try to give the terrain a more natural look. I don't know how far I will get with Perlin and Worley noise, as you also wrote about the problems of these generation techniques in one of your older posts. Maybe something like this could help: http://dmytry.com/mojoworld/erosion_fractal/home.html

    3. In one of his older posts, he said he had come up with a faster noise function than Perlin noise. I'm not sure what he meant by that, since he said his method uses several layers of white noise and linear interpolation, which sounds pretty much like Perlin noise. Hopefully he'll shed some light on this, since I've been trying to find a faster method than Perlin noise.

    4. Wow good memory. That was back in Dec 2010. I did not come up with a faster noise than Perlin. I said I had ditched Perlin noise and was using texture lookups. For a GPU implementation it was faster. Here is a link to that post and comment about how the noise was computed: http://procworld.blogspot.ca/2010/12/volcanic-world.html

    5. This comment has been removed by the author.

    6. Didn't know you could read and write to textures in OpenCL. I'll check that out. Anyway, I think it could be faster with noise and linear interpolation, as this method would use fewer calculations.

      Also what are your thoughts on erosion fractals for procedural terrain?

  8. Replies
    1. I have not tried using FPGA for procedural generation. It is an interesting idea, especially the option to use analog processing to mimic natural phenomena. I am not sure how accessible it would be as a platform. One of the goals of this project is that pretty much anyone could come up with new ways of generating content and feeding custom data to achieve that.

    2. I understand that the consumer side will not buy something like that, but on the server side, for specific applications, FPGAs will absolutely deliver way more parallelism and flexibility. By the way, I have only used the SystemVerilog language with FPGAs, but you can program FPGAs with OpenCL (you probably already know that).

      When voxelfarm allows anyone to come up with new ways of generating content, I'll be the one feeding procedurally generated content from FPGAs.

  9. Very nice post :)
    Is there any literature on parallel programming concepts that you would recommend?

  10. New press release today; it looks like Nvidia's most pressing concerns are with what you have detailed above. They are looking at improving memory transfer rates as well as opening up CPU/GPU memory for more flexible performance.

    http://blogs.nvidia.com/blog/2014/03/25/gpu-roadmap-pascal/
