I'm currently writing a voxel-based game with infinite, chunked terrain à la Minecraft. Terrain volume management was implemented using a LargeVolume for storage/access. During development, though, I ran into some pain points where I feel that LargeVolume doesn't really align with the needs of infinite terrain storage/paging for Minecraft-type games. To solve them, I made some drastic modifications to PolyVox over the course of a day or two, and things are working pretty well. Basically, what I did was write my own volume. (I apologize in advance for desecrating PolyVox's elegant template metacode with me-specific manual specialization.)
Since there appears to already be discussion about threading LargeVolume, I thought I'd compile a list of the pain points I encountered with LargeVolume, as well as share some of my research with threading PolyVox volumes.
The Issues

Lack of Manual Chunk Paging Control

Minecraft-alikes with chunked terrain generally have these requirements for volume paging:
- (Near) Infinite terrain
- Sparse chunk storage. (Two players a thousand blocks apart on a server each load only the 441 chunks around them, not everything in between.)
- Uncompressed, direct access to blocks for all loaded chunks.
- The game engine needs to control which chunks it considers to be loaded
An infinite volume can be achieved with LargeVolume, but only indirectly, by setting the volume's valid region to [INT_MIN, INT_MAX]. This causes issues of its own related to what LargeVolume does when a VolumeSampler asks for voxels from a block that the game engine doesn't consider loaded, but I've dedicated a whole section to that issue in particular.
The map that LargeVolume uses to associate regions with their datablocks does a good job at sparse storage. No complaints there.
I've not found the need to keep compressed chunks around in memory. When a chunk moves out of the load radius of a player entity I just save the chunk to disk and unload it all from memory. During the game loop, all loaded game-chunks will need to consult the terrain data every frame for collision checking with entities, randomly "ticking" a block (e.g. for sapling growth, fire spread, etc) and that sort of thing. When I used LargeVolume I just tended to set the max uncompressed block count high enough that it would never actually compress a block, and manually unloaded PolyVox regions when the game considered the areas to be unloaded.
Performance and Threading

The lack of thread safety was really starting to hurt performance. With LargeVolume, my engine had four threads that needed to touch the volume: a thread for mesh extraction, a thread for resampling/loading volume data from disk, a thread for resampling volume data to write to disk, and the game loop thread, which accessed the volume for collision detection among other things.
What would happen is that when a player entity crossed a chunk boundary, 10 chunks would be queued for unloading, 10 new chunks would be queued for loading, and meshes would be queued for extraction as the chunks were loaded. All threads use a single-producer/single-consumer model and use condition variables to sleep until their input queues have jobs to process. With a single mutex protecting all method calls to LargeVolume and all VolumeResampler runs on LargeVolume, any given thread could be waiting for the lock for upwards of 120-200ms. CubicSurfaceExtracting the 16x128x16 chunks took ~40ms on average, and the rest of the wait came from the other threads. If, say, the chunk-unload thread releases the mutex, the chunk-load thread gets it next, and then hands it off to the mesh extractor thread once it's done, that's a guaranteed ~100ms wait before the game loop thread gets to use the volume for collision calculations, at which point we've already dropped 4 or 5 frames and created a bad stutter. Even if you used 16x16x16 chunks and that made surface extraction 8x shorter, the game loop thread could still miss a frame or two to mutex contention if it happens to get the mutex last. Clearly, some form of concurrent reading needs to be possible for applications where loading/unloading chunks during gameplay has to stay performant.
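For reference, the condition-variable worker pattern mentioned above looks roughly like this. This is a minimal sketch; JobQueue and its interface are illustrative names, not code from my engine or PolyVox:

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Sketch of a single-producer/single-consumer job queue. The worker thread
// sleeps on a condition variable until its input queue has jobs to process.
class JobQueue {
public:
    void push(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one(); // wake the sleeping worker
    }

    // Worker loop: blocks until a job arrives or `stop` is set.
    void run(std::atomic<bool>& stop) {
        while (!stop) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [&] { return stop || !jobs_.empty(); });
                if (jobs_.empty())
                    continue; // woken only to observe `stop`
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(); // run the job outside the queue lock
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
};
```

The queue lock here only guards the queue itself; the contention problem in the text comes from the separate, single mutex that every job then has to take before touching the LargeVolume.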
The Memory "Leak"

This was really the clincher that convinced me to write my own volume. After I got my infinite-terrain-paging code implemented, I did some "profiling" (a.k.a. looking at ksysguard) to get a quick idea of memory behaviour and make sure I wasn't leaking memory everywhere. What I saw was that memory usage was steadily increasing even as I was unloading chunks. A quick run through valgrind --leak-check=yes showed no obvious leaks that could account for the memory increase, so I ran my program through massif to get an idea of what was allocating memory. That analysis showed LargeVolume holding on to its memory even as regions were being flushed.
To be fair, LargeVolume never was actually "leaking" memory in the sense that it was losing track of pointers it had allocated. What was happening was that the mesh extraction process was causing LargeVolume to create new UncompressedBlocks for chunks that my game did not consider to be "loaded". This happened because for meshes to line up properly you generally need to grow the region you sample by one block in each direction. Since you obtain an infinite terrain effect with LargeVolume by setting its valid region to [INT_MIN, INT_MAX], LargeVolume happily allocated a new UncompressedBlock for regions my game didn't consider to be loaded yet. As a consequence, my game engine was unaware that the Volume had allocated memory for these regions and never told LargeVolume to flush them, so the UncompressedBlocks for those regions stuck around in memory forever since I controlled paging manually. This in effect left a ring of phantom UncompressedBlocks that would never be deleted around every chunk that the game engine told LargeVolume to render.
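To make the mechanism concrete, here's a toy sketch of how growing the extraction region by one voxel reaches into neighbouring chunks. The Region type and chunkOf helper are my own illustrative stand-ins, not the PolyVox classes, and assume 16-voxel-wide chunks:

```cpp
// Toy sketch: growing an extraction region by one voxel per side makes it
// overlap the neighbouring chunks, which is what caused LargeVolume to
// allocate phantom UncompressedBlocks for regions the game never loaded.
struct Region {
    int lowX, lowY, lowZ, highX, highY, highZ;
    void grow(int n) {
        lowX -= n; lowY -= n; lowZ -= n;
        highX += n; highY += n; highZ += n;
    }
};

// Which 16-voxel-wide chunk a voxel coordinate falls in (floor division).
int chunkOf(int voxel) {
    return voxel >= 0 ? voxel / 16 : (voxel - 15) / 16;
}
```

Sampling the grown region of the chunk at the origin touches voxels in chunks -1 and 1 on each axis, and with a valid region of [INT_MIN, INT_MAX] LargeVolume happily pages those in.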
To alleviate this I tried implementing a clever way of only growing the extract-region in directions where my game thought there were loaded chunks, but that turned out to be rather difficult. In between this issue and my adventures trying to thread LargeVolume, it looked like it would be a lot easier to make my own volume that behaved in the ways that I wanted it to.
Threading LargeVolume

Before writing my own volume, I tried several methods for making LargeVolume thread safe, apart from guarding every call to a LargeVolume method. First off, here is an analysis of the Ways that LargeVolume Is Not Threadsafe™:
- LargeVolume uses std::maps to associate a block position with an (Un)CompressedBlock. Accessing a voxel requires a lookup in this map, and erasure from a map invalidates an iterator if it happens to be pointing to the erased pair.
- Obviously, there's no protection from two threads accessing the same UncompressedBlock during get/setVoxel.
- LargeVolume has a cache of the UncompressedBlock pointer most recently fetched from the std::map, to avoid repeated lookups. This cache state is mutated even on reads of LargeVolume and has the potential to do some funky shit if a thread yields in the middle of mutating it. E.g. the surface extractor sometimes looks at the wrong UncompressedBlock during mesh extraction and you get a few voxels where they shouldn't be.
I briefly considered using a QReadWriteLock to protect the entire LargeVolume before realizing that LargeVolume's read-side mutation makes concurrent reads almost certainly unsafe. I tried it anyway; the extra bookkeeping a read-write lock has to do meant it wasn't much faster, and it was still thread-unsafe to boot.
My first real attempt at speeding things up involved using a mutex to protect LargeVolume's internal most-recently-used-pointer cache, so that as long as I didn't read and write the same block at once I'd be OK. This didn't speed things up much at all, because the threads were all accessing different blocks while sharing the same most-recently-used-block pointer, forcing a map lookup on nearly every call.
With that in mind, I made a small struct for the cache data and an std::map<threadid, CacheData> to give each thread its own cache within LargeVolume. Unfortunately, since that cache is queried millions of times during a mesh extraction or a resampling, the map lookup to find each thread's cache ate all the gains from the per-thread caches.
Enter SparseVolume and ChunkData

Using the knowledge I gained from trying to thread LargeVolume, I designed a volume with threading in mind. I ended up calling it SparseVolume. Here's a rundown of its design characteristics:
- Loaded chunks are stored in an associative container to achieve sparseness, like LargeVolume
- This map is protected by a mutex
- All loaded chunks are uncompressed
- The UncompressedBlock analogue for SparseVolume is called ChunkData. Its API is made public so ChunkData instances can be made outside SparseVolume
- SparseVolume does not keep a most-recently-accessed ChunkData-pointer cache. That cache has moved into the VolumeSampler, one per sampler instance.
- SparseVolume offers thread-safe arbitrary access to voxel data from ChunkData objects, though if you need more than a few voxels then it makes sense to use the VolumeSampler.
- If a VolumeSampler asks for a voxel that has not been loaded yet, it returns an empty Voxel
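As a sketch, the lookup path for that last point might look something like this. The names, the toy Voxel type, and the 2D chunk key are illustrative assumptions, not the actual SparseVolume API:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

struct Voxel { uint8_t material = 0; };  // default-constructed = "empty"
struct ChunkData { /* voxel array, per-chunk mutex, ... */ };
using ChunkPos = std::pair<int, int>;    // 2D chunk coords (volume is one chunk tall)

class SparseVolume {
public:
    Voxel getVoxelAt(int x, int y, int z) {
        std::lock_guard<std::mutex> lock(mapMutex_);
        auto it = chunks_.find({x >> 4, z >> 4});  // 16-wide chunks
        if (it == chunks_.end())
            return Voxel{};  // chunk not loaded: hand back an empty voxel
        return readFromChunk(*it->second, x, y, z);
    }
private:
    // Stub standing in for the real per-chunk read.
    Voxel readFromChunk(ChunkData&, int, int, int) { return Voxel{1}; }

    std::mutex mapMutex_;  // guards the chunk map itself
    std::map<ChunkPos, std::shared_ptr<ChunkData>> chunks_;
};
```

Returning an empty voxel instead of allocating is exactly what prevents the phantom-block problem from the LargeVolume section: sampling past the edge of the loaded world is harmless.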
Actually, ChunkData plays a huge role in how it all works, so here's a rundown of that as well:
- ChunkData objects are passed around as shared_ptrs
- Each ChunkData contains a unique_ptr to a Voxel array, and has a mutex to guard access to that array.
- Objects reading/writing to a particular ChunkData are responsible for locking its mutex
- Relatedly, none of ChunkData's member functions ever lock the mutex
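Put together, the ChunkData described by that list can be sketched as follows. This assumes a 16xHx16 chunk stored in x/z column order with y varying fastest (matching the loading loop later in the post); the exact layout is illustrative:

```cpp
#include <cstdint>
#include <memory>
#include <mutex>

struct Voxel { uint8_t material = 0; };

// Sketch of ChunkData: a unique_ptr to a flat voxel array plus a mutex.
// Callers lock `mutex` around reads/writes; member functions never lock it.
class ChunkData {
public:
    ChunkData(int width, int height)
        : width_(width), height_(height),
          voxels_(new Voxel[width * width * height]) {}  // zero-material voxels

    std::mutex mutex;  // guards the voxel array; locked by callers only

    Voxel getVoxelAt(int x, int y, int z) const {
        return voxels_[index(x, y, z)];
    }
    void setVoxelAt(int x, int y, int z, Voxel v) {
        voxels_[index(x, y, z)] = v;
    }

private:
    int index(int x, int y, int z) const {
        return (x * width_ + z) * height_ + y;  // x/z columns, y innermost
    }
    int width_, height_;
    std::unique_ptr<Voxel[]> voxels_;
};
```

Leaving the locking to the caller is what lets a sampler hold a chunk's mutex across many accesses instead of paying a lock/unlock per voxel.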
Here's the code. The Chunk class is really just a Vector2DInt for chunk coords with some helper methods. All in all, I don't know how useful this code would be to PolyVox beyond general design ideas. I got rid of a lot of the template machinery compared to the analogous PolyVox classes, the volume is always one chunk tall, etc., which made it a lot easier to prototype. The only other changes I made were to make copies of CubicSurfaceExtractor, Raycaster, and Picking that use my non-templated VolumeSampler.
Anyway, with that code in mind, here's what SparseVolume can do to make things run faster. To illustrate, I'll walk you through a ChunkData's lifetime.
The game wishes to load a chunk, and puts in a request with the chunk loader thread. The loader thread schleps the data off the disk into a QByteArray. Once the data is in memory, it creates a ChunkData instance on the heap and wraps it in a shared_ptr like so:
Code:
// volumeArray is the QByteArray read from disk: one byte per voxel,
// in x/z column order with y varying fastest.
std::shared_ptr<ChunkData> data = std::make_shared<ChunkData>(16, 128);
int i = 0;
auto arr = volumeArray.data();
for (int x = 0; x < 16; ++x) {
    for (int z = 0; z < 16; ++z) {
        for (int y = 0; y < 128; ++y) {
            Voxel voxel(arr[i++]);
            if (!voxel.getMaterial())
                continue; // don't bother setting empty voxels
            data->setVoxelAt(x, y, z, voxel);
        }
    }
}
Once this is done, the loader thread merely passes it along to the SparseVolume, which adds the ChunkData to its map. SparseVolume's mutex only needed to be locked while the shared_ptr was being added to the map. Contention for the mutex while loading data into the volume is now essentially zero, since all the bit copying happened in another thread and adding the ChunkData to the SparseVolume only involved adding a pointer to a map.
Now that the ChunkData has been added to the SparseVolume, the main thread puts in a request to the mesh extractor thread to make a mesh out of it. The CubicSurfaceExtractor uses a VolumeSampler to get voxel data from the volume. The VolumeSampler keeps a local cache of ChunkData pointers it has touched, and locks the mutex of the particular ChunkData it's sampling until it's sampling from a different ChunkData. This leaves other VolumeSamplers able to access every ChunkData object other than the one that our VolumeSampler is currently touching without mutex contention.
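The sampler-side locking just described can be sketched like so. This is a minimal sketch; enterChunk is an illustrative name for whatever the sampler's movement logic calls when it crosses a chunk boundary:

```cpp
#include <memory>
#include <mutex>

struct ChunkData { std::mutex mutex; /* voxel array ... */ };

// Sketch of a VolumeSampler's chunk cache: it holds the mutex of the one
// ChunkData it's currently sampling and only swaps locks on a chunk change,
// so other samplers contend only if they touch the very same chunk.
class VolumeSampler {
public:
    void enterChunk(const std::shared_ptr<ChunkData>& chunk) {
        if (chunk == current_)
            return;                        // same chunk: keep holding its lock
        if (current_) current_->mutex.unlock();
        current_ = chunk;
        if (current_) current_->mutex.lock();
    }
    ~VolumeSampler() {
        if (current_) current_->mutex.unlock();  // release on destruction
    }
private:
    std::shared_ptr<ChunkData> current_;
};
```

Since a surface extraction makes millions of voxel reads but only a handful of chunk crossings, the lock/unlock cost all but disappears from the hot path.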
The player has now moved and the chunk we loaded/rendered is now out of the load radius of the player. To remove the data, we immediately remove the ChunkData shared_ptr from SparseVolume's map and pass the shared_ptr to our save-to-disk thread, which happily writes the data to disk. Contention for the mutex while saving volume data to disk is now essentially zero, since the critical section consisted solely of removing the ChunkData shared_ptr from SparseVolume's map. Even if a mesh extractor thread was using the ChunkData to create a mesh when we removed it from the volume, the shared ownership semantics mean that the mesh extractor thread can at least finish its extraction safely, even if the result is then promptly discarded by the renderer on account of the chunk not being loaded anymore.
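The unload path can be sketched as follows. Names are illustrative again; the point is that the critical section is a single map erase, and shared ownership keeps the data alive for any extraction still in flight:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <utility>

struct ChunkData { /* voxel array, per-chunk mutex, ... */ };
using ChunkPos = std::pair<int, int>;

// Sketch of SparseVolume's load/unload hand-offs: both critical sections
// are just a shared_ptr copy into or out of the map.
class SparseVolume {
public:
    void addChunk(ChunkPos pos, std::shared_ptr<ChunkData> data) {
        std::lock_guard<std::mutex> lock(mapMutex_);
        chunks_[pos] = std::move(data);  // pointer copy only; no bit copying
    }
    // Detach the chunk; the returned shared_ptr goes to the save thread.
    std::shared_ptr<ChunkData> removeChunk(ChunkPos pos) {
        std::lock_guard<std::mutex> lock(mapMutex_);
        auto it = chunks_.find(pos);
        if (it == chunks_.end()) return nullptr;
        auto data = std::move(it->second);
        chunks_.erase(it);
        return data;
    }
private:
    std::mutex mapMutex_;
    std::map<ChunkPos, std::shared_ptr<ChunkData>> chunks_;
};
```

An extractor that grabbed the shared_ptr before the erase simply finishes on its private reference; the voxel data is freed when the last reference drops, not when the map entry goes away.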
With SparseVolume I get sub-1ms mutex contention for collision detection using volume data in the game thread, which is almost infinitely better than it had been before. Rendering, loading and saving all got a speed boost from that too, which was nice.
Anyway, that's that. PolyVox is really cool, thanks for it!