Wootracer threading

I may have been a bit quiet recently, but it’s not because I’m not working on stuff. There are a few issues with the way the Wootracer works which are causing me significant annoyance at the moment, and they’re becoming bigger issues the more complex the scenes get. The main bugbears:
– Once I kick off the render I have to sit and wait until the render is finished with no visual feedback on progress.
– There’s no way to add additional samples to reduce noise after the initial render is complete.

The upshot of this is that I’ve had to get into the guts of the threading mechanism and rewrite the whole way that render management is done by the Wootracer.

Let’s start with the nitty gritty of how the system currently works. There are two types of work package within the raytracer: pixels and rows. Within a pixel there may be many samples; within a row there are as many pixels as the image is wide. Each of these needs a different type of treatment for an optimal raytracer.

First off, let’s talk about the multisampling within a single pixel. Some pixels might need 500+ samples to bring their noise down to an acceptable level. For each pixel that’s multisampled I use a class called “PatchRenderer” to jitter the samples over a grid that covers the pixel.
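
The jittering idea can be sketched roughly like this. It is not the actual PatchRenderer: the `Sample2D` struct, the `JitteredSamples` function and the use of `std::mt19937` are all assumptions made for illustration. The pixel is split into a grid and one random sample is placed somewhere inside each cell.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical sample position inside one pixel, in pixel-relative units.
struct Sample2D { float x; float y; };

// Split the pixel into gridSize x gridSize cells and place one random
// sample inside each cell (stratified, "jittered" sampling).
std::vector<Sample2D> JitteredSamples(int gridSize, std::mt19937& rng)
{
    assert(gridSize > 0);
    std::uniform_real_distribution<float> unit(0.0f, 1.0f);
    std::vector<Sample2D> samples;
    samples.reserve(static_cast<std::size_t>(gridSize) * gridSize);
    const float cell = 1.0f / gridSize;   // width of one grid cell
    for (int gy = 0; gy < gridSize; ++gy)
        for (int gx = 0; gx < gridSize; ++gx)
            samples.push_back({ (gx + unit(rng)) * cell,
                                (gy + unit(rng)) * cell });
    return samples;
}
```

Compared with purely random positions, this guarantees that every part of the pixel gets at least one sample, which tends to reduce clumping.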

These calculations tend to be closely correlated because they’re hitting a similar area of the scene, but there are edges of shadow, bright areas of indirect illumination, etc. where the contrast within a patch can be very high. Ideally these areas would be oversampled, and that’s currently handled using a minSamples, maxSamples approach. Where variance is very low, minSamples are sent through the pixel. Where variance is still high, samples are increased up to maxSamples.
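
As a rough sketch of the minSamples/maxSamples idea, something along these lines would do the job. The names `AdaptiveSample`, `PixelResult`, `sampleFn` and `varianceThreshold` are hypothetical, and the real raytracer may well measure variance differently (per colour channel, for instance). Here a single brightness value per sample is assumed, and the running variance is tracked with Welford's method.

```cpp
#include <cassert>
#include <functional>

// Result of sampling one pixel: mean brightness and samples actually used.
struct PixelResult { double mean; int samples; };

// Take at least minSamples, then keep sampling until the running variance
// drops below varianceThreshold or maxSamples is reached.
// sampleFn is a hypothetical callback returning one sample's brightness.
PixelResult AdaptiveSample(const std::function<double()>& sampleFn,
                           int minSamples, int maxSamples,
                           double varianceThreshold)
{
    assert(minSamples >= 1 && maxSamples >= minSamples);
    double mean = 0.0;
    double m2 = 0.0;   // running sum of squared deviations (Welford)
    int n = 0;
    while (n < maxSamples)
    {
        const double v = sampleFn();
        ++n;
        const double delta = v - mean;
        mean += delta / n;
        m2 += delta * (v - mean);
        // Only test the variance once the minimum has been taken.
        if (n >= minSamples && n > 1 && m2 / (n - 1) < varianceThreshold)
            break;
    }
    return { mean, n };
}
```

A flat patch stops as soon as minSamples have been taken, while a patch straddling a shadow edge keeps going until it hits maxSamples.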

Each of these patches is calculated on a single processor, but to make the most out of modern day CPUs you need to split the work up so that each CPU can calculate a segment of the work. For my current system I’m running on 2 physical cores, each with hyperthreading for a total of 4 logical cores, so I need to divide the work up into four different sections.

A naive way would give a quarter of the output image to each CPU, but scene complexity can vary dramatically across an image, so one thread may finish well ahead of the others. The most even distribution of samples is to give the 1st, 5th, 9th pixels to CPU1, the 2nd, 6th, 10th pixels to CPU2, etc. But this has a problem, in that the buffer we’re writing the pixel colours to is a single block of memory. If every CPU is writing to the same small bit of the buffer, the cores’ caches end up fighting over the same cache lines (false sharing), and performance can be crippled.

So neither of those approaches is particularly good. Ideally we’d split the screen into squares of pixels, perhaps 16×16, and then render each of these on a different CPU. The problem is that I didn’t decide to do this initially. Instead I split the screen into 5-row slices and gave each of those to a CPU. When one 5-row slice is completely rendered, I end that thread and kick off a new one to render the next slice. This kept my threading code super easy, but has led to a lot of the problems I have at the moment.

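for (i=0; i<4; i++)
{
  // args[i] tells RenderRows which 5-row slice to render
  threadhandles[i] = CreateThread(NULL, 0, RenderRows, args[i], 0, &threadId[i]);
  row += 5;
}

while (row<height)
{
  // bWaitAll = FALSE: return as soon as any one slice has finished
  DWORD done = WaitForMultipleObjects(4, threadhandles, FALSE, INFINITE) - WAIT_OBJECT_0;
  CloseHandle(threadhandles[done]);

  // reuse the free slot for the next slice (args[done] updated to start at row)
  threadhandles[done] = CreateThread(NULL, 0, RenderRows, args[done], 0, &threadId[done]);
  row += 5;
}

// wait for the last four slices to finish
WaitForMultipleObjects(4, threadhandles, TRUE, INFINITE);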

The problem with this naive solution is that pixels can vary dramatically in colour across a single row, i.e. the middle of the scene bears no correlation to the edges, so it's tough to rank rows based on complexity if you only want to resample busy areas of the image.

You're also creating height/5 threads over the course of a render, which on a modern OS isn't the end of the world, but it's not super good practice either.

The modern way to do this stuff is to use a worker thread pool to process a continuing list of work items until the image is fully rendered. To do this I've introduced a new WorkManager class which keeps track of "RenderFragments" which need raytracing. Each RenderFragment is a 16x16 block on the screen, and they accumulate additional samples when they get worked on again.
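
Carving the screen into 16x16 blocks might look something like the sketch below. The `Fragment` struct and `MakeFragments` function are hypothetical stand-ins rather than the real RenderFragment and WorkManager, and clipping the edge tiles to the image is an assumption about how partial blocks would be handled.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a RenderFragment: a rectangle of pixels plus a
// count of how many sampling passes it has received so far.
struct Fragment
{
    int x, y, width, height;
    int passes;
};

// Cover a width x height image with tileSize x tileSize fragments.
// Tiles on the right and bottom edges are clipped to the image.
std::vector<Fragment> MakeFragments(int width, int height, int tileSize = 16)
{
    assert(width > 0 && height > 0 && tileSize > 0);
    std::vector<Fragment> fragments;
    for (int y = 0; y < height; y += tileSize)
        for (int x = 0; x < width; x += tileSize)
            fragments.push_back({ x, y,
                                  std::min(tileSize, width - x),
                                  std::min(tileSize, height - y),
                                  0 });
    return fragments;
}
```

Because each fragment covers its own rectangle, two threads never write to neighbouring pixels within a row except at tile borders, which goes some way to easing the cache problem described earlier.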

This time there is no while loop and the threads just run until there's no work left to do. To run the raytracer forever you simply put the RenderFragments back on the queue after each one is rendered, so that the worker threads cycle round and round the buffer busily resampling the whole screen. By managing where the fragments are inserted back into the queue it's easy to add additional samples to fragments with high variability in their samples. It's also easy to pause the rendering, and then resume again later.
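
One simple way to manage where fragments go back into the queue is sketched below. It assumes each fragment carries the variance measured on its last pass; `QueuedFragment`, `Requeue` and `varianceThreshold` are made-up names, and the front-or-back policy is just one possible choice, not necessarily what the Wootracer does.

```cpp
#include <cassert>
#include <deque>

// Hypothetical fragment record: an id plus the variance measured on its
// most recent pass.
struct QueuedFragment { int id; double variance; };

// Put a rendered fragment back on the queue. Fragments whose variance is
// above the threshold go to the front so they are resampled sooner;
// the rest go to the back and wait their turn.
void Requeue(std::deque<QueuedFragment>& queue, const QueuedFragment& f,
             double varianceThreshold)
{
    assert(varianceThreshold >= 0.0);
    if (f.variance > varianceThreshold)
        queue.push_front(f);
    else
        queue.push_back(f);
}
```

A policy this blunt could starve the quiet fragments if a lot of the image stays noisy, so a real version would probably want something gentler, such as inserting noisy fragments part way down the queue.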

This sounds super complicated but the basic worker queue came out fairly simply.

DWorkQueue::DWorkQueue() : mStop(false)
{
	InitializeCriticalSection(&mWQCS);
}

void DWorkQueue::Add(DRenderFragment* fragment)
{
	// lock shouldn't be necessary here, but WTF
	EnterCriticalSection(&mWQCS);
	mFragments.push_back(fragment);
	LeaveCriticalSection(&mWQCS);
}

void DWorkQueue::Update(DRenderFragment* fragment)
{
	EnterCriticalSection(&mWQCS);
	mFragments.push_back(fragment);
	LeaveCriticalSection(&mWQCS);
}

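DRenderFragment* DWorkQueue::GetHead()
{
	if (mStop)
		return 0;

	DRenderFragment* ret = 0;
	EnterCriticalSection(&mWQCS);
	// front() on an empty list is undefined, so return 0 when there is no work left
	if (!mFragments.empty())
	{
		ret = mFragments.front();
		mFragments.pop_front();
	}
	LeaveCriticalSection(&mWQCS);
	return ret;
}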

void DWorkQueue::SetStop()
{
	mStop = true;
}
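
To show how worker threads might drain a queue like this, here is a portable sketch. It is not the Wootracer code: `SketchQueue` and `DrainQueue` are hypothetical names, `std::mutex` and `std::atomic` are swapped in for the CRITICAL_SECTION and the plain bool, plain ints stand in for fragment pointers, and the "rendering" is just a counter.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Portable stand-in for the work queue described above.
class SketchQueue
{
public:
    void Add(int fragment)
    {
        std::lock_guard<std::mutex> lock(mMutex);
        mFragments.push_back(fragment);
    }

    // Returns false when the queue is empty or a stop has been requested.
    bool GetHead(int& fragment)
    {
        if (mStop.load())
            return false;
        std::lock_guard<std::mutex> lock(mMutex);
        if (mFragments.empty())
            return false;
        fragment = mFragments.front();
        mFragments.pop_front();
        return true;
    }

    void SetStop() { mStop.store(true); }

private:
    std::mutex mMutex;
    std::deque<int> mFragments;
    std::atomic<bool> mStop{false};
};

// Start threadCount workers that pull fragments until GetHead reports
// nothing left, then return how many fragments were processed in total.
int DrainQueue(SketchQueue& queue, int threadCount)
{
    assert(threadCount > 0);
    std::atomic<int> processed{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < threadCount; ++i)
        workers.emplace_back([&queue, &processed]() {
            int fragment = 0;
            while (queue.GetHead(fragment))
                ++processed;   // a real worker would raytrace the fragment here
        });
    for (std::thread& t : workers)
        t.join();
    return processed.load();
}
```

To keep rendering forever, the worker would add each fragment back to the queue after processing it, and the loop would then only end once SetStop has been called.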

Now I need to rebuild the C# app which is calling the WooTracer so I can continue work on WooScript. In the meantime I've been doing a little bit of profiling. Here's a grab of a simple scene being rendered with some typical statistics alongside.

[Image: Pathtracing. Caption: New threading system in place]

The statistics are worth a mention. This scene has been rendering for 90 seconds on my laptop. In that time the system rendered 20 million samples, using 63 million ray intersection tests and 2.7 billion KD tree intersection tests (the counter wrapped). This works out at 700 thousand ray intersections per second and 150 samples per pixel.

For someone who used to write graphics demos on a 386 these numbers are pretty damn mental. Lovely. I'll leave it here for now, but shortly I'll be building the new WooScript interface with this system in place. Expect slow updates but something pretty cool in about a month's time!
