I want to share some results from my recent experiments with GPU-based occlusion culling. I decided not to use hardware occlusion queries, as they suffer from latency and scalability problems. You can probably ship them in production with some clever optimizations and tricks, but I wanted a fast implementation.
So I chose rasterization and hierarchical Z. It's easy to implement if you already have a Z-fill path, and easy to test and debug because the concept is very simple. I was also a bit inspired by this presentation from the Ubisoft Montreal studio. I've seen a more detailed internal presentation from them, and it was pretty cool. I'll give some high-level ideas here so you don't have to read the whole thing.
Basically, they render good-occluder geometry into a small viewport and build a Z-buffer from it. The Z-buffer is then converted into a hierarchical representation (higher levels contain only the largest Z depth). All tested objects are then converted into a point-sprite vertex stream carrying post-projection bounding box information. A pixel shader fetches the hierarchical Z and performs simple Z tests, outputting the visibility query results to its render target. The render target is then read back by the CPU so the visibility results can be used in scene processing. They keep pixel shader cost under control by testing the whole bounding box with only one Z sample, selecting the best-fit level.
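The "best fit level" selection is just a log2 of the box's screen extent. A tiny sketch of the idea (my naming and code, not theirs):

#include <algorithm>
#include <cmath>

// Pick the hi-Z level where the projected bounding box covers about one
// texel, so the whole box can be conservatively tested with a single Z sample.
int BestFitLevel(float minX, float minY, float maxX, float maxY)
{
    // Extent of the post-projection bounding box in level-0 (full-res) pixels.
    float extent = std::max(maxX - minX, maxY - minY);
    // Each hi-Z level halves the resolution, so the level is just log2(extent).
    return std::max(0, (int)std::ceil(std::log2(std::max(extent, 1.0f))));
}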
While on X360 this scheme can perform an enormous number of tests in a very small time frame, it has one major drawback: you need to sync with the GPU to read the results back, and the GPU sits idle while you build its job lists. This is not a big deal on X360, though.
I prototyped my implementation on PC, so readback quickly became a problem. In a test scene with pretty heavy shaders and a lot of simple objects (~20K), CPU<->GPU synchronization spent around 2.5-5 ms in the GetRenderTargetData() call - the time needed for the GPU to complete its work. I expected that rotating lockable render targets would solve the issue, but the AMD driver still syncs with the GPU on LockRect(), even for surfaces rendered a few frames back. So the bottleneck seems unavoidable and will degrade performance in simple scenes.
For this implementation I used the GPU only for Z-buffer construction, and parallelized CPU loops (TBB) for the Z-hierarchy and the post-projection bounding box tests.
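To make this concrete, here is a minimal single-threaded CPU sketch of the whole idea (all names are mine, the viewport is assumed power-of-two, and this is not my actual engine code): build a max-Z pyramid from the rendered Z-buffer, then conservatively test each post-projection bounding rectangle against its best-fit level.

#include <algorithm>
#include <vector>

// Hierarchical Z sketch. Level 0 is the full-resolution depth buffer; every
// higher level stores the MAX depth of the 2x2 block below it (assumes the
// usual "greater depth == farther" convention).
struct HiZ
{
    std::vector<std::vector<float>> levels;
    int width = 0, height = 0;

    void Build(const float* depth, int w, int h)
    {
        width = w; height = h;
        levels.assign(1, std::vector<float>(depth, depth + w * h));
        while (w > 1 || h > 1) {
            int nw = std::max(1, w / 2), nh = std::max(1, h / 2);
            std::vector<float> next(nw * nh);
            const std::vector<float>& prev = levels.back();
            for (int y = 0; y < nh; ++y)
                for (int x = 0; x < nw; ++x) {
                    int x1 = std::min(2 * x + 1, w - 1), y1 = std::min(2 * y + 1, h - 1);
                    next[y * nw + x] = std::max(
                        std::max(prev[2 * y * w + 2 * x], prev[2 * y * w + x1]),
                        std::max(prev[y1 * w + 2 * x],    prev[y1 * w + x1]));
                }
            levels.push_back(std::move(next));
            w = nw; h = nh;
        }
    }

    // Conservative test of a screen-space rect (in level-0 pixels) whose
    // nearest depth is minZ: visible if it can be nearer than the farthest
    // occluder depth stored at the best-fit level (~2x2 texel footprint).
    bool IsVisible(int minX, int minY, int maxX, int maxY, float minZ) const
    {
        int extent = std::max(maxX - minX, maxY - minY), level = 0;
        while ((extent >> level) > 1 && level + 1 < (int)levels.size()) ++level;
        int lw = std::max(1, width >> level), lh = std::max(1, height >> level);
        int x0 = std::max(0, minX >> level), x1 = std::min(maxX >> level, lw - 1);
        int y0 = std::max(0, minY >> level), y1 = std::min(maxY >> level, lh - 1);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                if (minZ <= levels[level][y * lw + x]) return true;
        return false; // fully behind occluders
    }
};

In the real version the per-object IsVisible() loop is what gets parallelized with TBB.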
You can see the final result here.
In the next implementation I will add a software rasterizer for comparison. Stay tuned.
Sunday, April 25, 2010
Thursday, April 8, 2010
Instant pipeline.
CryTek showed their Sandbox for CryEngine 3 and you can see it here:
http://nvidia.fullviewmedia.com/gdc2010/14-sean-tracy.html
Iterations are really fast and this is really cool. More iterations for the same budget! =)
I'm not really happy with their approach to building the sandbox functionality - it's programmer-driven, and the workflow suffers in places because of this. Still, the tool as a whole is great!
I'll share some interesting thoughts later on interface and workflow design for such tools, as implemented in my pet engine.
Thursday, March 25, 2010
Collision detection trick.
Imagine you need to collide large groups of soldiers that move all the time. Using spatial structures in this case becomes a pain because of update times. To overcome this we developed a simple solution (I believe it is widely used in modern physics engines for the broad phase).
Consider the picture below (we are looking at the 2D version, but the algorithm works in 3D as well). We want to collide two circles:
A simple radius check is good enough in this case. But if you want to collide 1K circles, it just becomes impractical. So what about the colored lines in the picture?
The red lines represent the left bounds of each object on each axis, and the green lines represent the right bounds. To detect possible collisions between all objects on the grid, we can do the following.
Push all bounds into two arrays, one per axis. Sort the arrays by bound position. Then iterate through the arrays with this logic (a code sketch follows the list):
- For each left bound, consider the object open.
- For each right bound, consider the object closed.
- Whenever a left bound is met:
  - For each currently open object:
    - Put a potential collision bit into the bit table.
- Repeat for the next axis.
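A minimal sketch of this sweep for one axis (names and data layout are my own; an n*n bit table is fine for a few thousand objects):

#include <algorithm>
#include <cstdint>
#include <vector>

// One object bound on one axis: its position, owner index, and side.
struct Bound { float pos; uint32_t object; bool isLeft; };

// Sort the bounds, then sweep: every time an interval opens, it potentially
// collides with every interval that is still open.
void SweepAxis(std::vector<Bound>& bounds, size_t objectCount,
               std::vector<bool>& pairBits /* objectCount * objectCount */)
{
    std::sort(bounds.begin(), bounds.end(),
              [](const Bound& a, const Bound& b) { return a.pos < b.pos; });

    std::vector<uint32_t> open; // left bound seen, right bound not yet
    for (const Bound& b : bounds) {
        if (b.isLeft) {
            for (uint32_t other : open) { // overlaps every currently open object
                pairBits[b.object * objectCount + other] = true;
                pairBits[other * objectCount + b.object] = true;
            }
            open.push_back(b.object);
        } else {
            open.erase(std::find(open.begin(), open.end(), b.object));
        }
    }
}

Run it once per axis with a fresh bit table; a pair marked on every axis is a real broad-phase candidate and goes on to the exact radius check.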
To extend this method, presort the bounds of groups of objects (for example, units of soldiers). To collide any two groups you can then do a cheap mix of the presorted arrays, avoiding sorts of very large arrays.
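The "cheap mix" can be a plain linear merge of the two presorted per-group arrays, reusing the Bound struct from the sketch above (groupA/groupB are hypothetical names; O(n + m) instead of a full re-sort):

std::vector<Bound> merged(groupA.size() + groupB.size());
std::merge(groupA.begin(), groupA.end(), groupB.begin(), groupB.end(),
           merged.begin(),
           [](const Bound& a, const Bound& b) { return a.pos < b.pos; });
// 'merged' is already sorted, so the sweep can skip its std::sort step.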
Draw a forest.
While working on our first title at MindLink Studio ("Telladar Chronicles: Decline"), we set one very ambitious goal for those times: to show something like 20K soldiers clashing on a battlefield, with a full 3D representation of everything. Back in 2002 we hadn't seen anything like this yet.
I had an obvious idea: use sprites for the lowest LODs to be able to draw such a large world. After a few years of development and experimentation I ended up with the following solution.
When you need something drawn at a distance, you ask the sprite rendering subsystem to provide a sprite representation of it. If it's already in the cache, you're done.
Otherwise, the subsystem considers the size of the thing and selects a place for it in the sprite cache. These textures are divided into rows of predefined heights, and the subsystem selects the smallest row height that still fits. It injects the sprite into the row, modifying the row's linked list. Row contents are managed by the subsystem like in any common memory manager. To wipe a row that has become too fragmented, just copy its contents to another row, updating the sprite vertices.
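A rough sketch of such a row-based cache allocator (all names and the simple bump allocation are my assumptions; the real thing tracked free spans inside each row via the linked list):

#include <cstdint>
#include <list>
#include <vector>

struct SpriteSlot { int x; int width; uint32_t spriteId; };

struct Row
{
    int y, height;
    int cursor = 0;              // next free x; a real manager reuses freed spans
    std::list<SpriteSlot> slots; // linked list of live sprites in this row
};

struct SpriteCache
{
    int atlasWidth = 1024;
    std::vector<Row> rows;       // rows of predefined heights, e.g. 16/32/64 px

    // Place a w x h sprite; returns false if no row fits (defragment or evict then).
    bool Allocate(int w, int h, uint32_t spriteId, int* outX, int* outY)
    {
        Row* best = nullptr;
        for (Row& r : rows)      // best fit: the shortest row that is tall enough
            if (r.height >= h && r.cursor + w <= atlasWidth &&
                (!best || r.height < best->height))
                best = &r;
        if (!best) return false;
        best->slots.push_back({ best->cursor, w, spriteId });
        *outX = best->cursor; *outY = best->y;
        best->cursor += w;
        return true;
    }
};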
After getting the sprite representation, the LOD system injects four vertices into dynamic storage. The great thing about this approach is that you can fire-and-forget your sprites. As long as a sprite is alive, it will be rendered in a batch with the others. And you can simply skip updates for distant objects for any number of frames (nobody will ever notice at great distances).
To erase a sprite, just ignore it when constructing the next ping-pong index buffer. To reuse its space in the vertex buffer, make sure enough frames have passed since it was last used.
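The erase-by-omission step might look like this (a sketch with assumed structures):

#include <cstdint>
#include <vector>

struct SpriteInstance { bool alive; uint32_t firstVertex; }; // 4 verts per quad

// Rebuild the next ping-pong index buffer: dead sprites are simply not
// referenced, so no vertex data has to be touched to "delete" them.
void BuildIndices(const std::vector<SpriteInstance>& sprites,
                  std::vector<uint32_t>& indices)
{
    indices.clear();
    for (const SpriteInstance& s : sprites) {
        if (!s.alive) continue;
        uint32_t v = s.firstVertex;
        const uint32_t quad[6] = { v, v + 1, v + 2, v, v + 2, v + 3 };
        indices.insert(indices.end(), quad, quad + 6);
    }
}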
This way we were able to draw forests with hundreds of thousands of trees in real time, while still providing nice 3D trees at close distances.
Soldiers were a bit trickier, as you need to animate them to look believable. After months of visual tests I came to the conclusion that only 2-3 frames of walking animation are needed for them to "shimmer" convincingly at a distance.
Forest in action:
The lighting is a bit outdated, but there are about 30K trees in the viewport at 30 FPS on SM1 hardware.
Friday, March 19, 2010
Toolset.
A quick thought after visiting one of the numerous Ubi development studios around the globe:
The only way to create an exceptional game is to have exceptional tools.
[We thank you, Cap!]
[Applauds]
Seriously. No good tools? No good AAA game for you, sorry.
And the reason behind it is pretty obvious: less time per iteration == more iterations for the same budget. More iterations mean greater quality. Greater quality means increased value for your customers and increased sales. Profit.
But there is one trick involved: the interfaces and processes for these tools should be built by UI usability experts, not by programmers or artists or anybody else. And the good news is that you can find plenty of them on the IT market. I wonder why most studios ignore this fact.
Saturday, January 16, 2010
Shaders development.
Shaders are a really important part of any graphics engine, and usually a lot of work is invested in them. But the shader creation pipelines proposed by NVidia and ATI lack one core ability: they have nothing in common with real engine shaders. You can prototype a shader to some extent in FXComposer or Render Monkey, but when it comes to engine integration you still modify it by hand and run it blindly, hoping for the best.
If your engine has a run-time editor with a "What You See Is What You Play" ability, you already have a good platform for shader development. The simplest form of assistance is viewing render target contents (which can be achieved with NVPerfHUD, for example). But to make it more interesting than simple overlay rendering, I've tried the following:
So basically I've attached the texture exploration interface (you already need one for texture compression work) to the renderer outputs. A drop-down list lets you select any render target the renderer exposes through its interface. That interface might look like:

virtual uint Debug_GetNumRenderOutputs() const;
virtual const wchar_t* Debug_GetRenderOutputName(uint index) const;
virtual const ITexture* Debug_RegisterForRenderOutput(const wchar_t* renderOutput);
virtual void Debug_UnregisterForRenderOutput(const wchar_t* renderOutput);

When someone registers for a render output, the renderer begins to capture that render target's contents. Internally it looks like:

void RegisterRenderOutput(const wchar_t* name, IDirect3DSurface9* pSurface);

This method is called in various places throughout the renderer; it checks whether the render output is registered for capturing, and if so copies the RT contents into the texture that was returned by Debug_RegisterForRenderOutput(). That texture is then displayed by the texture exploration UI.
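For example, the editor side might drive it like this (a sketch against the interface above; the renderer/UI objects and the "ShadowMap" name are made up):

// Fill the drop-down with everything the renderer exposes.
for (uint i = 0; i < pRenderer->Debug_GetNumRenderOutputs(); ++i)
    dropDown.AddItem(pRenderer->Debug_GetRenderOutputName(i));

// Subscribing returns a texture the renderer keeps updated every frame.
const ITexture* pView = pRenderer->Debug_RegisterForRenderOutput(L"ShadowMap");
textureExplorer.Show(pView);

// When the debug view closes, stop the per-frame copies.
pRenderer->Debug_UnregisterForRenderOutput(L"ShadowMap");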
This way a developer can access any render target exposed by the renderer, updated in real time. Combined with the ability to reload shaders, this creates a simple and powerful shader development pipeline.
Thursday, January 14, 2010
Color correction.
The easiest way to control the mood of the picture is color correction. No serious team ships without this feature in their beloved engine, and there are plenty of cases where you can't live without it. Every time your lead artist wants to change something globally, color correction is the first thing he will consider.
From a technical point of view, color remapping is really easy to implement. But the most obvious solution, using a volume texture to simply remap colors, suffers from banding until the volume's resolution approaches 8-bit precision (something like 256x256x256). That is prohibitively costly in terms of memory and texture cache performance, as we need to touch every screen pixel.
Fortunately, this is easily solved by storing a signed offset instead of a remapped value. So to transform red (1,0,0) into green (0,1,0) we store (-1,1,0). This way we can use a volume texture as small as 16x16x16 without any banding artifacts. Let's call it the color correction matrix.
But the technical solution is not enough; we need an artist-friendly way of controlling it. After some consideration and communication with artists, I chose the following approach. The artist takes an arbitrary number of screenshots of the level or area he wants color corrected. Then he uses any color correction tool of his choice (we used Photoshop for prototyping) to produce color-corrected versions of those screenshots. This way we don't restrict him at all.
After creating the two sets of screenshots, the artist feeds them into an engine tool that builds the color correction matrix. The processor simply accumulates the color differences into a table and compresses it into the correction matrix once all images are processed. Afterward we can add extra effects to this matrix.
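The build step can be as simple as this sketch (my names; assumes 8-bit RGB screenshots and the 16x16x16 volume from above):

#include <cstdint>
#include <vector>

struct CellAccum { float r = 0, g = 0, b = 0; uint32_t count = 0; };

// Accumulate per-cell color deltas from one original/corrected image pair.
// The source color picks the cell (top 4 bits per channel -> 16^3 cells);
// we remember how the artist's tool changed that color.
void AccumulatePair(const uint8_t* src, const uint8_t* dst, int pixelCount,
                    std::vector<CellAccum>& cells /* 16*16*16 */)
{
    for (int i = 0; i < pixelCount; ++i) {
        int r = src[i * 3 + 0], g = src[i * 3 + 1], b = src[i * 3 + 2];
        CellAccum& c = cells[(r >> 4) + ((g >> 4) << 4) + ((b >> 4) << 8)];
        c.r += dst[i * 3 + 0] - r;
        c.g += dst[i * 3 + 1] - g;
        c.b += dst[i * 3 + 2] - b;
        ++c.count;
    }
}
// After all pairs: average each cell (delta / count), remap the signed range
// [-255, 255] into the unsigned texture format, and upload the 16x16x16 volume.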
Implementation note: remember that you need to correct the image before HDR!
Shader code for applying the correction is trivial:
float4 source = tex2Dlod(samplerSource, uv);
float4 corr = tex3Dlod(samplerColorCorrection, float4(source.xyz, 0));
return source + (corr * 2.0f - 1.0f); // if you use an unsigned format
This might look like this in reality:
There is some visible difference between the corrected image and the one generated at run time. The major source of this is the 16x16x16 correction matrix; HDR contributes by adjusting luminosity as well. But it doesn't look bad in real situations.
You might notice the unfamiliar "Color shift" parameter. It shifts dark colors below a threshold toward some arbitrary color; this way you can make shadows bluish, for example. I saw this at CryTek some time ago, and I don't know who should be credited for the idea. Perhaps the CryTek guys. ;)
That's pretty much everything. The only valuable addition to this method might be the engine's ability to place zones (volumes) with arbitrary color correction on the level, for special effects and fake global illumination. Use your imagination.