I want to share some results from my recent experiments with GPU-based occlusion culling. I decided not to use occlusion queries, as they have lag and scalability problems. You can probably use them in production with some clever optimizations and tricks, but I wanted to make a fast implementation.
So I chose rasterization and hierarchical Z. It's easy to implement if you already have a Z-fill path, and easy to test and debug because the concept is very simple. I was also a bit inspired by this presentation from the Ubisoft Montreal studio. I've seen a more detailed internal presentation from them, and it was pretty cool. I'll give some high-level ideas here so you don't have to read the whole thing.
Basically, they render good-occluder geometry into a small viewport and build a Z-buffer from it. The Z-buffer is then converted into a hierarchical representation (higher levels contain only the largest Z depth of the texels below). All tested objects are then converted into a point-sprite vertex stream carrying their post-projection bounding box information. A pixel shader fetches the hierarchical Z, performs simple Z tests, and writes the visibility query results to its render target. The render target is then read back by the CPU so the visibility results can be used in scene processing. They keep pixel shader cost under control by testing a whole bounding box with only one Z sample, selecting the best-fit hierarchy level.
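To make the hierarchy construction concrete, here is a simplified CPU sketch of the idea (their version runs on the GPU; the HiZ struct and names are mine, and it assumes a power-of-two depth buffer already available as a float array). Each mip level stores the max (farthest) depth of the 2x2 block below it, so a single sample conservatively bounds a whole region:

#include <vector>
#include <algorithm>

// Sketch only: each mip level keeps the MAX (farthest) depth of the
// 2x2 texel block below it, so one fetch gives a conservative bound.
struct HiZ
{
    int width, height;                    // mip 0 dimensions (power of two assumed)
    std::vector<std::vector<float>> mips; // mips[0] is the full-resolution Z-buffer

    void Build(const float* zbuffer)
    {
        mips.clear();
        mips.emplace_back(zbuffer, zbuffer + width * height);
        int w = width, h = height;
        while (w > 1 || h > 1)
        {
            int nw = std::max(w / 2, 1), nh = std::max(h / 2, 1);
            std::vector<float> next(nw * nh);
            const std::vector<float>& prev = mips.back();
            for (int y = 0; y < nh; ++y)
                for (int x = 0; x < nw; ++x)
                {
                    // Clamp so odd-sized levels don't read out of bounds.
                    int sx = std::min(x * 2 + 1, w - 1);
                    int sy = std::min(y * 2 + 1, h - 1);
                    // Largest (farthest) depth of the 2x2 block.
                    float d = prev[y * 2 * w + x * 2];
                    d = std::max(d, prev[y * 2 * w + sx]);
                    d = std::max(d, prev[sy * w + x * 2]);
                    d = std::max(d, prev[sy * w + sx]);
                    next[y * nw + x] = d;
                }
            mips.push_back(std::move(next));
            w = nw; h = nh;
        }
    }
};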
On the X360 this scheme can perform an enormous number of tests in a very small time frame, but it has one major drawback: you need to sync with the GPU to read back the results, and the GPU sits idle while you are building job lists for it. This is not a big deal on the X360, though.
I prototyped my implementation on PC, so read-back quickly became a problem. In a test scene with pretty heavy shaders and a lot of simple objects (~20K), CPU<->GPU synchronization spent about 2.5 - 5 ms in the GetRenderTargetData() call - the time needed for the GPU to complete its work. I expected that rotating lockable render targets would solve the issue, but for me the AMD driver still syncs with the GPU on LockRect(), even for surfaces rendered a few frames earlier. So the bottleneck seems unavoidable and will degrade performance in simple scenes.
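For reference, here is a minimal sketch of the rotation I tried. The D3D9 calls are real, but the structure around them (kLatency, the surface array) is just illustrative. In theory, locking a target rendered a few frames ago should not stall; in practice, LockRect() still blocked:

#include <d3d9.h>

// Rotate between kLatency lockable render targets and read the oldest one.
static const int kLatency = 3;
static IDirect3DSurface9* g_visRT[kLatency] = {};
static int g_frame = 0;

void RenderAndReadVisibility(IDirect3DDevice9* dev, int w, int h)
{
    int writeSlot = g_frame % kLatency;
    if (!g_visRT[writeSlot])
        dev->CreateRenderTarget(w, h, D3DFMT_A8R8G8B8, D3DMULTISAMPLE_NONE,
                                0, TRUE /*lockable*/, &g_visRT[writeSlot], NULL);

    dev->SetRenderTarget(0, g_visRT[writeSlot]);
    // ... draw the point-sprite visibility tests into this target ...

    // Read the target rendered kLatency-1 frames ago; ideally the GPU has
    // long finished with it, so LockRect() should not have to wait.
    int readSlot = (g_frame + 1) % kLatency;
    if (g_visRT[readSlot])
    {
        D3DLOCKED_RECT lr;
        if (SUCCEEDED(g_visRT[readSlot]->LockRect(&lr, NULL, D3DLOCK_READONLY)))
        {
            // ... consume per-object visibility results from lr.pBits ...
            g_visRT[readSlot]->UnlockRect();
        }
    }
    ++g_frame;
}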
For this implementation I used the GPU only for Z-buffer construction, and parallelized CPU loops (via TBB) for building the Z-hierarchy and for the post-projection bounding box tests.
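The CPU test loop looks roughly like this (not my actual code, just a sketch of the approach, reusing the HiZ struct from the earlier snippet; for simplicity it takes one clamped sample and ignores boxes straddling texel boundaries, which a real implementation would handle with extra samples or a coarser level):

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cstdint>
#include <vector>
#include <algorithm>

struct ScreenBox { float minX, minY, maxX, maxY, minZ; }; // post-projection, in pixels

void CullObjects(const HiZ& hiz, const std::vector<ScreenBox>& boxes,
                 std::vector<uint8_t>& visible) // uint8_t: vector<bool> is not thread-safe
{
    visible.assign(boxes.size(), 0);
    tbb::parallel_for(tbb::blocked_range<size_t>(0, boxes.size()),
        [&](const tbb::blocked_range<size_t>& r)
        {
            for (size_t i = r.begin(); i != r.end(); ++i)
            {
                const ScreenBox& b = boxes[i];
                // Pick the coarsest level whose texel is at least as large
                // as the box, so one fetch covers the whole box.
                float size = std::max(b.maxX - b.minX, b.maxY - b.minY);
                int level = 0;
                while ((1 << level) < (int)size && level + 1 < (int)hiz.mips.size())
                    ++level;
                int w = std::max(hiz.width  >> level, 1);
                int h = std::max(hiz.height >> level, 1);
                int x = std::min(std::max((int)b.minX >> level, 0), w - 1);
                int y = std::min(std::max((int)b.minY >> level, 0), h - 1);
                // The level stores the farthest depth; if the box's nearest
                // depth is closer, the object may be visible.
                visible[i] = b.minZ <= hiz.mips[level][y * w + x];
            }
        });
}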
You can see the final result here.
In the next implementation I will add a software rasterizer for comparison. Stay tuned.
Sunday, April 25, 2010
Thanks for this post - it motivated me. I knew this technique before (from this presentation) but rejected it at first. After reading your post, however, I'll give it a try.