I'm currently writing a voxel-based game with infinite, chunked terrain à la Minecraft. Terrain volume management was implemented using a LargeVolume for storage/access. During development, though, I ran into some pain points where I feel that LargeVolume doesn't really align with the needs of infinite terrain storage/paging for Minecraft-type games. To solve them, I made some drastic modifications to PolyVox over the course of a day or two, and things are working pretty well. Basically, what I did was write my own volume. (I apologize in advance for desecrating PolyVox's elegant template metacode with me-specific manual specialization.)
Since there appears to already be discussion about threading LargeVolume, I thought I'd compile a list of the pain points I encountered with LargeVolume, as well as share some of my research with threading PolyVox volumes.
The Issues

Lack of Manual Chunk Paging Control

Minecraft-alikes with chunked terrain generally have these requirements for volume paging:
- (Near) Infinite terrain
- Sparse chunk storage. (Two players a thousand blocks apart on a server each load only the 441 chunks around them, not everything in between.)
- Uncompressed, direct access to blocks for all loaded chunks.
- The game engine needs to control which chunks it considers to be loaded
An infinite volume can be achieved with LargeVolume, but only indirectly, by setting the volume's valid region to [INT_MIN, INT_MAX]. This causes issues of its own related to what LargeVolume does when a VolumeSampler asks for voxels from a block that the game engine doesn't consider loaded, but I've dedicated a whole section to that issue in particular.
The map that LargeVolume uses to associate regions with their datablocks does a good job at sparse storage. No complaints there.
I've not found the need to keep compressed chunks around in memory. When a chunk moves out of the load radius of a player entity I just save the chunk to disk and unload it all from memory. During the game loop, all loaded game-chunks will need to consult the terrain data every frame for collision checking with entities, randomly "ticking" a block (e.g. for sapling growth, fire spread, etc) and that sort of thing. When I used LargeVolume I just tended to set the max uncompressed block count high enough that it would never actually compress a block, and manually unloaded PolyVox regions when the game considered the areas to be unloaded.
Performance and Threading

The lack of thread safety was really starting to hurt performance. With LargeVolume, my engine had four threads that needed to touch the volume: a thread for mesh extraction, a thread for resampling/loading volume data from disk, a thread for resampling volume data to write to disk, and the game loop thread, which accessed the volume for collision detection among other things.
What would happen is that when a player entity crossed a chunk boundary, 10 chunks would be queued for unloading, 10 new chunks would be queued for loading, and meshes would be queued for extraction as the chunks were loaded. All threads use a single-producer/single-consumer model and use condition variables to sleep until their input queues have jobs to process. With a single mutex protecting all method calls to LargeVolume and all VolumeResampler runs on LargeVolume, any given thread could be waiting for the lock for upwards of 120-200ms. CubicSurfaceExtracting the 16x128x16 chunks took ~40ms on average, and the rest of the wait came from the other threads. If, say, the chunk-unload thread releases the mutex, the chunk-load thread gets it next, and then hands it off to the mesh extractor thread once it's done, that's a guaranteed ~100ms wait before the game loop thread gets to use the volume for collision calculations, at which point we've already dropped 4 or 5 frames and created a bad stutter. Even if you used 16x16x16 chunks and that made surface extraction 8x shorter, the game loop thread could still miss a frame or two to mutex contention if it happens to get the mutex last. Clearly, some form of concurrent reading needs to be possible for applications where loading/unloading chunks during gameplay has to stay performant.
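For reference, the condition-variable worker pattern mentioned above looks roughly like this. This is a minimal sketch; JobQueue and its interface are illustrative names, not code from my engine or PolyVox:

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Sketch of a single-producer/single-consumer job queue. The worker thread
// sleeps on a condition variable until its input queue has jobs to process.
class JobQueue {
public:
    void push(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(m_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one(); // wake the sleeping worker
    }

    // Worker loop: blocks until a job arrives or `stop` is set.
    void run(std::atomic<bool>& stop) {
        while (!stop) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [&] { return stop || !jobs_.empty(); });
                if (jobs_.empty())
                    continue; // woken only to observe `stop`
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(); // run the job outside the queue lock
        }
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
};
```

The queue lock here only guards the queue itself; the contention problem in the text comes from the separate, single mutex that every job then has to take before touching the LargeVolume.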
The Memory "Leak"

This was really the clincher that convinced me to write my own volume. After I got my infinite-terrain-paging code implemented, I did some "profiling" (a.k.a. looking at ksysguard) to get a quick idea of memory behaviour and make sure I wasn't leaking memory everywhere. What I saw was that memory usage was steadily increasing even as I was unloading chunks. A quick run through valgrind --leak-check=yes showed no obvious leaks that could account for the memory increase, so I ran my program through massif to get an idea of what was allocating memory. That analysis showed LargeVolume holding on to its memory even as regions were being flushed.
To be fair, LargeVolume never was actually "leaking" memory in the sense that it was losing track of pointers it had allocated. What was happening was that the mesh extraction process was causing LargeVolume to create new UncompressedBlocks for chunks that my game did not consider to be "loaded". This happened because for meshes to line up properly you generally need to grow the region you sample by one block in each direction. Since you obtain an infinite terrain effect with LargeVolume by setting its valid region to [INT_MIN, INT_MAX], LargeVolume happily allocated a new UncompressedBlock for regions my game didn't consider to be loaded yet. As a consequence, my game engine was unaware that the Volume had allocated memory for these regions and never told LargeVolume to flush them, so the UncompressedBlocks for those regions stuck around in memory forever since I controlled paging manually. This in effect left a ring of phantom UncompressedBlocks that would never be deleted around every chunk that the game engine told LargeVolume to render.
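To make the mechanism concrete, here's a toy sketch of how growing the extraction region by one voxel reaches into neighbouring chunks. The Region type and chunkOf helper are my own illustrative stand-ins, not the PolyVox classes, and assume 16-voxel-wide chunks:

```cpp
// Toy sketch: growing an extraction region by one voxel per side makes it
// overlap the neighbouring chunks, which is what caused LargeVolume to
// allocate phantom UncompressedBlocks for regions the game never loaded.
struct Region {
    int lowX, lowY, lowZ, highX, highY, highZ;
    void grow(int n) {
        lowX -= n; lowY -= n; lowZ -= n;
        highX += n; highY += n; highZ += n;
    }
};

// Which 16-voxel-wide chunk a voxel coordinate falls in (floor division).
int chunkOf(int voxel) {
    return voxel >= 0 ? voxel / 16 : (voxel - 15) / 16;
}
```

Sampling the grown region of the chunk at the origin touches voxels in chunks -1 and 1 on each axis, and with a valid region of [INT_MIN, INT_MAX] LargeVolume happily pages those in.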
To alleviate this I tried implementing a clever way of only growing the extract-region in directions where my game thought there were loaded chunks, but that turned out to be rather difficult. In between this issue and my adventures trying to thread LargeVolume, it looked like it would be a lot easier to make my own volume that behaved in the ways that I wanted it to.
Threading LargeVolume

Before writing my own volume, I tried several methods for making LargeVolume thread safe, apart from guarding every call to a LargeVolume method. First off, here is an analysis of the Ways that LargeVolume Is Not Threadsafe™:
- LargeVolume uses std::maps to associate a block position with an (Un)CompressedBlock. Accessing a voxel requires a lookup in this map, and erasure from a map invalidates an iterator if it happens to be pointing to the erased pair.
- Obviously, there's no protection from two threads accessing the same UncompressedBlock during get/setVoxel.
- LargeVolume has a cache of the UncompressedBlock pointer most recently fetched from the std::map, to avoid repeated lookups. This cache state is mutated even on reads of LargeVolume and has the potential to do some funky shit if a thread yields in the middle of mutating it. E.g. the surface extractor sometimes looks at the wrong UncompressedBlock during mesh extraction and you get a few voxels where they shouldn't be.
I briefly considered using a QReadWriteLock to protect the entire LargeVolume before realizing that LargeVolume's read-side mutation makes concurrent reads almost certainly unsafe. I tried it anyway; the extra bookkeeping a read-write lock has to do meant it wasn't much faster, and it was still thread-unsafe to boot.
My first real attempt at speeding things up involved using a mutex to protect LargeVolume's internal most-recently-used-pointer cache, so that as long as I didn't read and write the same block at once I'd be OK. This didn't speed things up much at all, because the threads were all accessing different blocks while sharing the same most-recently-used-block pointer, forcing a map lookup on nearly every call.
With that in mind, I made a small struct for the cache data and an std::map<threadid, CacheData> to give each thread its own cache within LargeVolume. Unfortunately, since that cache is queried millions of times during a mesh extraction or a resampling, the map lookup to find each thread's cache ate all the gains from the per-thread caches.
Enter SparseVolume and ChunkData

Using the knowledge I gained from trying to thread LargeVolume, I designed a volume with threading in mind. I ended up calling it SparseVolume. Here's a rundown of its design characteristics:
- Loaded chunks are stored in an associative container to achieve sparseness, like LargeVolume
- This map is protected by a mutex
- All loaded chunks are uncompressed
- The UncompressedBlock analogue for SparseVolume is called ChunkData. Its API is made public so ChunkData instances can be made outside SparseVolume
- SparseVolume does not keep a most-recently-accessed ChunkData-pointer cache. That cache has moved into the VolumeSampler, one per sampler instance.
- SparseVolume offers thread-safe arbitrary access to voxel data from ChunkData objects, though if you need more than a few voxels then it makes sense to use the VolumeSampler.
- If a VolumeSampler asks for a voxel that has not been loaded yet, it returns an empty Voxel
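As a sketch, the lookup path for that last point might look something like this. The names, the toy Voxel type, and the 2D chunk key are illustrative assumptions, not the actual SparseVolume API:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

struct Voxel { uint8_t material = 0; };  // default-constructed = "empty"
struct ChunkData { /* voxel array, per-chunk mutex, ... */ };
using ChunkPos = std::pair<int, int>;    // 2D chunk coords (volume is one chunk tall)

class SparseVolume {
public:
    Voxel getVoxelAt(int x, int y, int z) {
        std::lock_guard<std::mutex> lock(mapMutex_);
        auto it = chunks_.find({x >> 4, z >> 4});  // 16-wide chunks
        if (it == chunks_.end())
            return Voxel{};  // chunk not loaded: hand back an empty voxel
        return readFromChunk(*it->second, x, y, z);
    }
private:
    // Stub standing in for the real per-chunk read.
    Voxel readFromChunk(ChunkData&, int, int, int) { return Voxel{1}; }

    std::mutex mapMutex_;  // guards the chunk map itself
    std::map<ChunkPos, std::shared_ptr<ChunkData>> chunks_;
};
```

Returning an empty voxel instead of allocating is exactly what prevents the phantom-block problem from the LargeVolume section: sampling past the edge of the loaded world is harmless.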
Actually, ChunkData plays a huge role in how it all works, so here's a rundown of that as well:
- ChunkData objects are passed around as shared_ptrs
- Each ChunkData contains a unique_ptr to a Voxel array, and has a mutex to guard access to that array.
- Objects reading/writing to a particular ChunkData are responsible for locking its mutex
- Relatedly, none of ChunkData's member functions ever lock the mutex
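Put together, the ChunkData described by that list can be sketched as follows. This assumes a 16xHx16 chunk stored in x/z column order with y varying fastest (matching the loading loop later in the post); the exact layout is illustrative:

```cpp
#include <cstdint>
#include <memory>
#include <mutex>

struct Voxel { uint8_t material = 0; };

// Sketch of ChunkData: a unique_ptr to a flat voxel array plus a mutex.
// Callers lock `mutex` around reads/writes; member functions never lock it.
class ChunkData {
public:
    ChunkData(int width, int height)
        : width_(width), height_(height),
          voxels_(new Voxel[width * width * height]) {}  // zero-material voxels

    std::mutex mutex;  // guards the voxel array; locked by callers only

    Voxel getVoxelAt(int x, int y, int z) const {
        return voxels_[index(x, y, z)];
    }
    void setVoxelAt(int x, int y, int z, Voxel v) {
        voxels_[index(x, y, z)] = v;
    }

private:
    int index(int x, int y, int z) const {
        return (x * width_ + z) * height_ + y;  // x/z columns, y innermost
    }
    int width_, height_;
    std::unique_ptr<Voxel[]> voxels_;
};
```

Leaving the locking to the caller is what lets a sampler hold a chunk's mutex across many accesses instead of paying a lock/unlock per voxel.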
Here's the code. The Chunk class is really just a Vector2DInt for chunk coords with some helper methods. All in all, I don't know how useful this code would be to PolyVox beyond general design ideas. I got rid of a lot of the template machinery compared to the analogous PolyVox classes, the volume is always one chunk tall, etc., which made it a lot easier to prototype. The only other changes I made were to make copies of CubicSurfaceExtractor, Raycaster, and Picking that use my non-templated VolumeSampler.
Anyway, with that code in mind, here's what SparseVolume can do to make things run faster. To illustrate, I'll walk you through a ChunkData's lifetime.
The game wishes to load a chunk, and puts in a request with the chunk loader thread. The loader thread schleps the data off the disk into a QByteArray. Once the data is in memory, it creates a ChunkData instance on the heap and wraps it in a shared_ptr like so:
Code:
// volumeArray is the QByteArray read from disk: one byte per voxel,
// in x/z column order with y varying fastest.
std::shared_ptr<ChunkData> data = std::make_shared<ChunkData>(16, 128);
int i = 0;
auto arr = volumeArray.data();
for (int x = 0; x < 16; ++x) {
    for (int z = 0; z < 16; ++z) {
        for (int y = 0; y < 128; ++y) {
            Voxel voxel(arr[i++]);
            if (!voxel.getMaterial())
                continue; // don't bother setting empty voxels
            data->setVoxelAt(x, y, z, voxel);
        }
    }
}
Once this is done, the loader thread merely passes it along to the SparseVolume, which adds the ChunkData to its map. SparseVolume's mutex only needed to be locked while the shared_ptr was being added to the map. Contention for the mutex while loading data into the volume is now essentially zero, since all the bit copying happened in another thread and adding the ChunkData to the SparseVolume only involved adding a pointer to a map.
Now that the ChunkData has been added to the SparseVolume, the main thread puts in a request to the mesh extractor thread to make a mesh out of it. The CubicSurfaceExtractor uses a VolumeSampler to get voxel data from the volume. The VolumeSampler keeps a local cache of ChunkData pointers it has touched, and locks the mutex of the particular ChunkData it's sampling until it's sampling from a different ChunkData. This leaves other VolumeSamplers able to access every ChunkData object other than the one that our VolumeSampler is currently touching without mutex contention.
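The sampler-side locking just described can be sketched like so. This is a minimal sketch; enterChunk is an illustrative name for whatever the sampler's movement logic calls when it crosses a chunk boundary:

```cpp
#include <memory>
#include <mutex>

struct ChunkData { std::mutex mutex; /* voxel array ... */ };

// Sketch of a VolumeSampler's chunk cache: it holds the mutex of the one
// ChunkData it's currently sampling and only swaps locks on a chunk change,
// so other samplers contend only if they touch the very same chunk.
class VolumeSampler {
public:
    void enterChunk(const std::shared_ptr<ChunkData>& chunk) {
        if (chunk == current_)
            return;                        // same chunk: keep holding its lock
        if (current_) current_->mutex.unlock();
        current_ = chunk;
        if (current_) current_->mutex.lock();
    }
    ~VolumeSampler() {
        if (current_) current_->mutex.unlock();  // release on destruction
    }
private:
    std::shared_ptr<ChunkData> current_;
};
```

Since a surface extraction makes millions of voxel reads but only a handful of chunk crossings, the lock/unlock cost all but disappears from the hot path.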
The player has now moved and the chunk we loaded/rendered is now out of the load radius of the player. To remove the data, we immediately remove the ChunkData shared_ptr from SparseVolume's map and pass the shared_ptr to our save-to-disk thread, which happily writes the data to disk. Contention for the mutex while saving volume data to disk is now essentially zero, since the critical section consisted solely of removing the ChunkData shared_ptr from SparseVolume's map. Even if a mesh extractor thread was using the ChunkData to create a mesh when we removed it from the volume, the shared ownership semantics mean that the mesh extractor thread can at least finish its extraction safely, even if the result is then promptly discarded by the renderer on account of the chunk not being loaded anymore.
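The unload path can be sketched as follows. Names are illustrative again; the point is that the critical section is a single map erase, and shared ownership keeps the data alive for any extraction still in flight:

```cpp
#include <map>
#include <memory>
#include <mutex>
#include <utility>

struct ChunkData { /* voxel array, per-chunk mutex, ... */ };
using ChunkPos = std::pair<int, int>;

// Sketch of SparseVolume's load/unload hand-offs: both critical sections
// are just a shared_ptr copy into or out of the map.
class SparseVolume {
public:
    void addChunk(ChunkPos pos, std::shared_ptr<ChunkData> data) {
        std::lock_guard<std::mutex> lock(mapMutex_);
        chunks_[pos] = std::move(data);  // pointer copy only; no bit copying
    }
    // Detach the chunk; the returned shared_ptr goes to the save thread.
    std::shared_ptr<ChunkData> removeChunk(ChunkPos pos) {
        std::lock_guard<std::mutex> lock(mapMutex_);
        auto it = chunks_.find(pos);
        if (it == chunks_.end()) return nullptr;
        auto data = std::move(it->second);
        chunks_.erase(it);
        return data;
    }
private:
    std::mutex mapMutex_;
    std::map<ChunkPos, std::shared_ptr<ChunkData>> chunks_;
};
```

An extractor that grabbed the shared_ptr before the erase simply finishes on its private reference; the voxel data is freed when the last reference drops, not when the map entry goes away.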
With SparseVolume I get sub-1ms mutex contention for collision detection using volume data in the game thread, which is almost infinitely better than it had been before. Rendering, loading and saving all got a speed boost from that too, which was nice.
Anyway, that's that. PolyVox is really cool, thanks for it!