Taos Engine ▦ Taos: Building a Modern WebGPU Game Engine

Geo Voxel World — Streamed Terrain, Voxelized on the GPU

Run the sample · geo_voxel_world.ts

The real Grand Canyon, rebuilt out of Minecraft-style cubes you can walk through, with every cube coloured by the aerial photo of the ground beneath it. Nothing here is authored: elevation streams in as raster-DEM terrain tiles (AWS Terrarium) with Esri aerial imagery draped over them, and as each tile arrives it is voxelized entirely on the GPU. The smooth terrain mesh is never drawn; the cubes are.

This sample is really about one idea pushed end to end: take streaming geospatial geometry and turn it into a GPU-driven voxel world — compute-shader voxelization, compute-shader meshing, indirect draws — with the CPU doing almost nothing per frame. The walking, the colour, the sliding view window and the memory ceiling all fall out of that.

RasterDemTileset (geo)  ──►  tile GPU mesh + draped imagery
                                  │
        ┌─────────────────────────┴── per frame, on the GPU ──┐
        │  clear grid → raster tiles → mesh chunks → indirect  │
        └──────────────────────────────┬───────────────────────┘
                                        ▼
                          deferred GBuffer (one draw / chunk)

The world streams, but it isn't drawn

The scene is anchored at Mather Point (lon −112.1071°, lat 36.0608°, rim ≈ 2100 m) through a GeoFrame, and terrain streams through a GeoScene + RasterDemTileset exactly like Geo Fox in the Rain. The twist: the terrain tileset is set invisible (geo.setVisible(awsTerrain, false)). It keeps streaming and keeps feeding heightAt, but emits no draws. We only want each tile's geometry and imagery as raw material for voxelization.

Voxelizing a tile from its own GPU mesh

The voxelizer needs three things per tile: the world-space surface, the texture coordinates into the draped imagery, and the imagery texture itself. All three are already on the GPU, and all three are reachable without any engine change:

  • The terrain tile mesh is created with storageBuffers: true (terrain.ts), so its vertex and index buffers can be bound as read-only storage in a compute shader. The layout is fixed (stride 12 floats; position at offset 0, the Web-Mercator imagery UV at offset 6 — see terrain_bake.ts).
  • The draped imagery is the tile material's albedo map, exposed publicly via (content.material as PbrMaterial).albedoMap.view.
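The fixed vertex layout can be read on the CPU the same way a compute shader would index it. This is a minimal sketch, assuming the vertex buffer has been mirrored into a Float32Array; the names FLOATS_PER_VERTEX, UV_OFFSET and readTerrainVertex are hypothetical, and the floats whose meaning the text does not state are simply ignored.

```typescript
// Layout from the text: 12 floats per vertex, position at float 0,
// Web-Mercator imagery UV at float 6. All names here are illustrative.
const FLOATS_PER_VERTEX = 12;
const POSITION_OFFSET = 0;
const UV_OFFSET = 6;

interface TerrainVertex {
  position: [number, number, number];
  uv: [number, number];
}

function readTerrainVertex(data: Float32Array, vertexIndex: number): TerrainVertex {
  const base = vertexIndex * FLOATS_PER_VERTEX;
  if (vertexIndex < 0 || base + FLOATS_PER_VERTEX > data.length) {
    throw new RangeError(`vertex ${vertexIndex} is out of bounds`);
  }
  const p = base + POSITION_OFFSET;
  const t = base + UV_OFFSET;
  return {
    position: [data[p], data[p + 1], data[p + 2]],
    uv: [data[t], data[t + 1]],
  };
}
```

In WGSL the same arithmetic applies to an array<f32> storage binding: vertex i starts at float i * 12.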

So the sample hands VoxelTerrainPass a tiny descriptor per visible tile — { vbuf, ibuf, indexCount, imagery } — and the GPU does the rest. The terrain mesh's positions are already baked into the floating-origin world frame (and the origin is never re-anchored), so they are world coordinates: no transform needed.

Stage 1: triangles → a packed height + colour grid

imagery_raster.wgsl runs one invocation per triangle. Each thread walks the triangle's XZ bounding box of voxel columns and, for every column whose centre the triangle covers, samples the aerial imagery at the interpolated UV (textureSampleLevel, which is legal in a compute shader) and writes the result into a shared grid — one u32 per world column.
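The per-triangle walk can be sketched as a CPU reference. This is not the shader itself: it assumes a barycentric coverage test in the XZ plane, a voxel size of one world unit by default, and hypothetical names (Vert, ColumnHit, rasterizeTriangle). The imagery lookup is left out, so the function only returns the interpolated height and UV that the shader would sample with.

```typescript
interface Vert { x: number; y: number; z: number; u: number; v: number }
interface ColumnHit { cx: number; cz: number; height: number; u: number; v: number }

// Twice the signed area of triangle (a, b, p) in the XZ plane.
function edge(ax: number, az: number, bx: number, bz: number, px: number, pz: number): number {
  return (bx - ax) * (pz - az) - (bz - az) * (px - ax);
}

function rasterizeTriangle(a: Vert, b: Vert, c: Vert, voxelSize = 1): ColumnHit[] {
  const hits: ColumnHit[] = [];
  const area = edge(a.x, a.z, b.x, b.z, c.x, c.z);
  if (area === 0) return hits; // vertical or zero-area triangle covers no column centre

  // Bounding box of the triangle, in voxel-column indices.
  const minX = Math.floor(Math.min(a.x, b.x, c.x) / voxelSize);
  const maxX = Math.floor(Math.max(a.x, b.x, c.x) / voxelSize);
  const minZ = Math.floor(Math.min(a.z, b.z, c.z) / voxelSize);
  const maxZ = Math.floor(Math.max(a.z, b.z, c.z) / voxelSize);

  for (let cz = minZ; cz <= maxZ; cz++) {
    for (let cx = minX; cx <= maxX; cx++) {
      const px = (cx + 0.5) * voxelSize; // column centre
      const pz = (cz + 0.5) * voxelSize;
      // Barycentric weights; dividing by the signed area makes this winding-independent.
      const wa = edge(b.x, b.z, c.x, c.z, px, pz) / area;
      const wb = edge(c.x, c.z, a.x, a.z, px, pz) / area;
      const wc = edge(a.x, a.z, b.x, b.z, px, pz) / area;
      if (wa < 0 || wb < 0 || wc < 0) continue; // centre lies outside the triangle
      hits.push({
        cx, cz,
        height: wa * a.y + wb * b.y + wc * c.y,
        u: wa * a.u + wb * b.u + wc * c.u,
        v: wa * a.v + wb * b.v + wc * c.v,
      });
    }
  }
  return hits;
}
```

The sketch has no fill rule, so a column centre lying exactly on a shared edge is reported by both triangles. With a max-combining write into the grid that duplication is harmless.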

The neat trick is how height and colour are stored together. There's no 64-bit atomic in WebGPU, so we pack both into one 32-bit cell and use atomicMax:

cell = (height16 << 16) | rgb565

The height occupies the high 16 bits, so atomicMax resolves to the highest surface at that column, and the colour (sRGB, RGB565) rides along in the low 16 bits as a free passenger. The tallest writer wins, and it brings its photo colour with it — no second pass, no lock, no read-modify-write race.
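The packing can be checked on the CPU. The sketch below assumes the height has already been quantized to a 16-bit integer (the text does not say how), and uses Math.max on a Uint32Array as a single-threaded stand-in for atomicMax; packCell, unpackCell, packRgb565 and writeCell are hypothetical names.

```typescript
// Quantize a 0..1 channel to an integer in 0..levels.
function quantize(value: number, levels: number): number {
  return Math.min(levels, Math.max(0, Math.round(value * levels)));
}

// RGB565: 5 bits red, 6 bits green, 5 bits blue.
function packRgb565(r: number, g: number, b: number): number {
  return (quantize(r, 31) << 11) | (quantize(g, 63) << 5) | quantize(b, 31);
}

// cell = (height16 << 16) | rgb565. The trailing ">>> 0" keeps the result
// unsigned, because JavaScript bitwise operators work on signed 32-bit ints.
function packCell(height16: number, rgb565: number): number {
  return (((height16 & 0xffff) << 16) | (rgb565 & 0xffff)) >>> 0;
}

function unpackCell(cell: number): { height16: number; rgb565: number } {
  return { height16: cell >>> 16, rgb565: cell & 0xffff };
}

// Stand-in for atomicMax(&grid[index], cell): the numerically larger cell wins,
// and since height sits in the high bits, that is the taller surface.
function writeCell(grid: Uint32Array, index: number, cell: number): void {
  grid[index] = Math.max(grid[index], cell);
}
```

One consequence of the layout: when two writers have the same quantized height, the larger RGB565 value wins. That is arbitrary but deterministic, so the result does not depend on thread order.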

Stage 2: grid → cube geometry + indirect draw args

voxel_mesher.wgsl turns the grid into cubes, one thread per column. A heightfield only needs the visible faces:

  • one top quad at the column's surface height, and
  • a wall quad on each of the four sides whose neighbour column is lower.

A wall is a single quad spanning the whole height drop, however tall — the side texture tiles vertically by world-Y, so a 40-block cliff face is still one quad, not 40. That keeps the per-column geometry bounded (a top + up to four walls) and is what makes the whole thing cheap.
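The face rule is small enough to state as a CPU reference. This sketch assumes a row-major square height grid, six vertices per quad (two non-indexed triangles, consistent with the 30-vertex worst case quoted later), and no walls emitted toward columns outside the grid; columnVertexCount is a hypothetical name, not the mesher's.

```typescript
const VERTS_PER_QUAD = 6; // two triangles, non-indexed (assumption)

const NEIGHBOURS: ReadonlyArray<readonly [number, number]> = [
  [1, 0], [-1, 0], [0, 1], [0, -1],
];

// Vertices one column emits: one top quad, plus one wall quad for every
// side whose neighbour column is strictly lower.
function columnVertexCount(heights: Int32Array, size: number, x: number, z: number): number {
  const h = heights[z * size + x];
  let quads = 1; // the top face
  for (const [dx, dz] of NEIGHBOURS) {
    const nx = x + dx;
    const nz = z + dz;
    if (nx < 0 || nz < 0 || nx >= size || nz >= size) continue; // outside the window
    if (heights[nz * size + nx] < h) quads++; // one quad spans the whole drop
  }
  return quads * VERTS_PER_QUAD;
}
```

Because a wall is one quad regardless of the drop, the count depends only on how many neighbours are lower, never on how much lower they are.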

Each column appends its vertices into its chunk's slot via an atomic counter, and a finalize pass writes a per-chunk drawIndirect argument buffer [vertexCount, 1, firstVertex, 0]. The vertex carries the surface colour packed as a u32 (and, for the tinted style, a block-type id in the high bits).
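The finalize step amounts to turning each chunk's counter into four u32 values. A minimal CPU sketch follows, assuming every chunk owns a fixed slot of slotVertices vertices starting at chunk * slotVertices, and that an overshooting counter is clamped to the slot; both are assumptions, and finalizeIndirectArgs is a hypothetical name.

```typescript
// drawIndirect reads four u32 values per draw:
// [vertexCount, instanceCount, firstVertex, firstInstance].
const ARGS_PER_DRAW = 4;

function finalizeIndirectArgs(chunkVertexCounts: Uint32Array, slotVertices: number): Uint32Array {
  const args = new Uint32Array(chunkVertexCounts.length * ARGS_PER_DRAW);
  for (let chunk = 0; chunk < chunkVertexCounts.length; chunk++) {
    const base = chunk * ARGS_PER_DRAW;
    // Clamp in case the atomic counter ran past the slot (assumed guard).
    args[base + 0] = Math.min(chunkVertexCounts[chunk], slotVertices);
    args[base + 1] = 1;                    // one instance
    args[base + 2] = chunk * slotVertices; // start of this chunk's slot
    args[base + 3] = 0;                    // firstInstance
  }
  return args;
}
```

On the GPU the same values are written by a compute pass, so the draw for chunk c can later read its arguments at byte offset c * 16 without the CPU ever seeing the counts.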

Stage 3: one indirect draw per chunk

voxel_color.wgsl fills the deferred GBuffer. The pass issues one drawIndirect per frustum-culled chunk, so the GPU decides how many vertices each chunk draws — the CPU never reads the counts back. The whole sequence (clear → raster → reset → mesh → finalize) chains on a single compute encoder, relying on WebGPU's automatic intra-pass hazard tracking between dispatches, exactly like ParticlePass.

A window that follows you

The voxel grid is a fixed-size square. The first version pinned it to the world origin — so you could only roam ±half a window from spawn. To walk through the world, the window has to move.

The solution is to treat the grid as scratch space and rebuild it every frame: each frame the window's origin snaps to the camera (setRegionOrigin), the grid is cleared, all currently-streamed tiles re-voxelize into it, and every chunk re-meshes. Because cubes are defined by world coordinates (not grid slots), the sliding window doesn't make them jitter: a cube at world X = 10 is always drawn at 10, regardless of where the window edge currently sits. The geo tileset already streams fresh terrain around the moving camera, so new ground voxelizes ahead of you as you go. The per-tile raster bind groups are cached by mesh-buffer identity, so re-voxelizing every frame doesn't churn allocations.
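The window arithmetic can be sketched as follows. It assumes one-unit voxel columns, a window centred on the camera, and an origin snapped down to a multiple of some step such as the chunk size; the text only says the origin snaps to the camera, so the step and the names snapRegionOrigin and worldToGridIndex are hypothetical.

```typescript
interface RegionOrigin { x: number; z: number }

// Min corner of a regionSize x regionSize window centred on the camera,
// snapped down to a multiple of `snap` so it moves in whole steps.
function snapRegionOrigin(cameraX: number, cameraZ: number, regionSize: number, snap: number): RegionOrigin {
  return {
    x: Math.floor((cameraX - regionSize / 2) / snap) * snap,
    z: Math.floor((cameraZ - regionSize / 2) / snap) * snap,
  };
}

// Grid slot for a world position, or null when it falls outside the window.
// The slot changes as the window slides; the world position does not.
function worldToGridIndex(wx: number, wz: number, origin: RegionOrigin, regionSize: number): number | null {
  const gx = Math.floor(wx) - origin.x;
  const gz = Math.floor(wz) - origin.z;
  if (gx < 0 || gz < 0 || gx >= regionSize || gz >= regionSize) return null;
  return gz * regionSize + gx;
}
```

The mesher would invert the same mapping (origin plus grid coordinate) when it emits vertices, which is why a column keeps its world position no matter which slot it occupies in a given frame.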

The trade-off is explicit: the pass re-rasterizes and re-meshes the whole window each frame rather than accumulating. For the modest tile counts here that's fine, and it sidesteps every staleness bug a moving accumulator would invite.

Walking on a heightfield, cheaply

Physics reuses crafty's PlayerController unmodified. That controller talks to the world through exactly one method — getBlockType(x, y, z) — so the sample feeds it a four-line BlockWorld adapter whose column is solid stone below the surface, air above:

getBlockType(bx, by, bz) {
  const H = colTopH(bx, bz);          // floor(geo.heightAt) + 1
  return (H !== null && by < H) ? BlockType.STONE : BlockType.NONE;
}

Crucially, the surface height comes from geo.heightAt on the CPU, with no GPU readback, so collision stays cheap regardless of how much is voxelized for the eye. The GPU and CPU derive height from the same streamed terrain, so they agree.

The fall-through bug

The player kept falling through the ground on steep cliffs. The cause was streaming, not collision: when a tile refines LOD, geo.heightAt returns null for a column for a frame or two while the new tile loads. A naive per-frame cache turned that brief null into air, and the player dropped. Two layers fix it:

  • a persistent last-known-height cache (terrain height is static, so a column once loaded stays loaded), and
  • neighbour bridging for a column never yet seen — on a null, take the highest known height of its eight neighbours (a heightfield has no real gaps, so a neighbour is a safe stand-in until the tile arrives).

A per-frame anti-sink clamp catches the remaining case where a later, higher tile streams in under a standing player.
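The two cache layers can be sketched as a small class. This is an illustration under stated assumptions, not the sample's code: the height source is passed in as a plain function standing in for geo.heightAt, columns are sampled at their centre, a fresh non-null sample refreshes the cache, and bridged values are not stored. ColumnHeightCache and HeightSource are hypothetical names; the anti-sink clamp is not shown.

```typescript
type HeightSource = (x: number, z: number) => number | null;

class ColumnHeightCache {
  private readonly known = new Map<string, number>();
  private readonly heightAt: HeightSource;

  constructor(heightAt: HeightSource) {
    this.heightAt = heightAt;
  }

  // Top of the solid column at block (bx, bz), or null if nothing is known nearby.
  colTopH(bx: number, bz: number): number | null {
    const key = `${bx},${bz}`;
    const h = this.heightAt(bx + 0.5, bz + 0.5);
    if (h !== null) {
      const top = Math.floor(h) + 1;
      this.known.set(key, top); // layer 1: remember the last known height
      return top;
    }
    const cached = this.known.get(key);
    if (cached !== undefined) return cached; // tile is mid-refinement; reuse old value

    // Layer 2: never seen this column, so borrow the highest known neighbour.
    let best: number | null = null;
    for (let dz = -1; dz <= 1; dz++) {
      for (let dx = -1; dx <= 1; dx++) {
        if (dx === 0 && dz === 0) continue;
        const n = this.known.get(`${bx + dx},${bz + dz}`);
        if (n !== undefined && (best === null || n > best)) best = n;
      }
    }
    return best;
  }
}
```

Taking the highest neighbour errs on the side of keeping the player above ground: a stand-in that is slightly too tall is corrected when the tile arrives, while one that is too low lets the player sink.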

Two looks, one geometry

Press T to switch render styles:

  • Photo — albedo is the aerial imagery, straight from the packed colour.
  • Tinted — the crafty block atlas (grass / rock / sand grain) multiplied by the imagery colour, so the cubes keep a Minecraft surface texture but wear the satellite's palette.

The tinted path is why the mesher also packs a block type, banded by elevation and slope, into the high 16 bits of the vertex's packed colour word; the photo path ignores it. A style uniform picks the branch in the fragment shader, so both looks share one pipeline and one vertex stream.

The memory ceiling (and a lesson about device limits)

The honest constraint of this design is the vertex pool. Geometry is written into fixed per-chunk slots sized for the worst case, so memory scales with the window's area. At the true worst case (a top + four full walls = 30 vertices per column) a large window needs gigabytes.

Two moves keep it allocatable:

  1. Budget 18 vertices/column, not 30. On a uniform slope a column has at most two downhill neighbours (top + 2 walls = 18). Columns needing three or four walls (pits, spikes) are clipped by the mesher's overflow guard; that is rare, and a dropped wall is far less jarring than a failed allocation.
  2. Size the window to the device, not to a constant. _computeMaxRegion picks the largest window whose pool fits the GPU's buffer limits.

That second point hides a sharp lesson. The obvious code sizes to device.limits.maxStorageBufferBindingSize, but a reported limit is only the largest size the API will accept, not a guarantee that an allocation of that size succeeds. A GPU that advertises a 2 GB binding can still fail a 1.5 GB allocation. Trusting the number gives you a slider whose top half crashes. So the pool is also clamped by a hard VERT_POOL_CAP_BYTES tuned to what real GPUs allocate, which lands the maximum view range at a verified-safe value rather than a theoretical one.
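The sizing rule reduces to one inequality: side² × 18 × bytesPerVertex must fit in the smaller of the device limit and the hard cap. The sketch below is an assumption about how such a function could look, not the body of _computeMaxRegion; the vertex size, the chunk rounding and the extra check against maxBufferSize are all illustrative choices.

```typescript
const VERTS_PER_COLUMN = 18; // budget from the text: top + 2 walls

interface BufferLimits {
  maxStorageBufferBindingSize: number;
  maxBufferSize: number;
}

// Largest square window (side length in columns, a multiple of chunkSize)
// whose vertex pool fits both the reported limits and the hard cap.
function computeMaxRegion(
  limits: BufferLimits,
  capBytes: number,
  bytesPerVertex: number,
  chunkSize: number,
): number {
  const budget = Math.min(limits.maxStorageBufferBindingSize, limits.maxBufferSize, capBytes);
  const maxColumns = Math.floor(budget / (VERTS_PER_COLUMN * bytesPerVertex));
  const side = Math.floor(Math.sqrt(maxColumns));
  return Math.floor(side / chunkSize) * chunkSize;
}
```

With a hypothetical 16-byte vertex and a 512 MiB cap, the largest window on a device reporting 2 GiB limits is 1360 columns on a side. The cap decides the answer, not the reported limit.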

The principled fix — and the natural next step — is GPU vertex compaction: count vertices per chunk, prefix-sum to per-chunk offsets, then write tightly packed. The pool would then size to actual geometry instead of worst-case slots, and the window could grow much further within the same memory.
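The middle step of that plan is an exclusive prefix sum over the per-chunk counts. A serial CPU sketch is below; a GPU version would use a parallel scan, and exclusiveScan is a hypothetical name.

```typescript
// offsets[i] = sum of counts[0..i-1]; total = sum of all counts.
function exclusiveScan(counts: Uint32Array): { offsets: Uint32Array; total: number } {
  const offsets = new Uint32Array(counts.length);
  let running = 0;
  for (let i = 0; i < counts.length; i++) {
    offsets[i] = running; // where chunk i starts in the packed pool
    running += counts[i];
  }
  return { offsets, total: running };
}
```

Each chunk's indirect draw would then use offsets[i] as firstVertex, and the pool would need room for total vertices rather than one worst-case slot per chunk.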

Controls

Desktop: click to lock the mouse, WASD to walk, Space to jump, Shift to sneak, Ctrl to sprint. F toggles a free-fly camera (crafty's CameraController); T toggles the render style; the slider sets view range.

Both controllers expose the same analog input surface (inputForward / inputStrafe / applyLookDelta), so touch and gamepad wire in with one set of callbacks that route to whichever mode is active, via the engine's setupTouchControlsLazy / setupGamepadControlsLazy helpers. The action buttons map contextually — jump becomes ascend, sneak becomes descend, sprint becomes boost — so the same overlay drives walking and flying.
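The routing idea can be sketched with a shared interface. The member names come from the text, but their shapes are assumptions: inputForward and inputStrafe are treated as writable numbers and applyLookDelta as a method. makeInputRouter and the callback names are hypothetical and do not describe the engine's helper signatures.

```typescript
// Assumed common surface of the walking and flying controllers.
interface AnalogInput {
  inputForward: number;
  inputStrafe: number;
  applyLookDelta(dx: number, dy: number): void;
}

// One set of callbacks that always targets whichever controller is active.
function makeInputRouter(getActive: () => AnalogInput) {
  return {
    onMove(forward: number, strafe: number): void {
      const controller = getActive();
      controller.inputForward = forward;
      controller.inputStrafe = strafe;
    },
    onLook(dx: number, dy: number): void {
      getActive().applyLookDelta(dx, dy);
    },
  };
}
```

Because the active controller is looked up on every call, toggling between walking and flying needs no re-registration of touch or gamepad handlers.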

Status

tsc clean, full test suite green, vite build clean. The WGSL only compiles at runtime, so the GPU pipeline itself is best verified in a browser; the fall-through, imagery-colour, streaming-window and buffer-cap behaviours were all shaped against real runs.