Chapter 29: The Road Ahead
Taos is a living project. This chapter outlines the planned features and directions for future development.
29.1 Ray Tracing with WebGPU
WebGPU has no ray tracing API today. The native wgpu Rust crate carries an experimental ray tracing backend behind a Cargo feature flag, but it is not exposed to Firefox, and Chrome has no public implementation. The topic has surfaced in the WebGPU working group several times — what an acceleration-structure builder should look like, how to expose ray queries from compute, whether ray pipelines or inline ray queries should land first — but nothing has been standardized.
If a ray tracing extension does eventually land, it will most likely be a thin abstraction over the underlying platform APIs (DXR on D3D12, VK_KHR_ray_tracing_pipeline and VK_KHR_ray_query on Vulkan, and Metal's ray tracing intersector). At that point Taos could implement:
- Hardware-accelerated ray traced shadows — replacing cascade shadow maps with single-pass shadows that stay sharp at any distance, with no peter-panning or filtering artifacts.
- Ray traced ambient occlusion — physically grounded AO without the screen-space gaps and self-occlusion artifacts of SSAO.
- Ray traced reflections — true off-screen reflections in water and shiny blocks, without the missing-geometry problems of SSR.
- Path-traced preview — a pure path tracing mode for offline-quality screenshots, useful as a ground-truth reference for the real-time renderer.
Until then, ray tracing in WebGPU has to be done entirely in compute, with the engine building and traversing its own acceleration structure. The viable options are:
- BVH over triangle meshes. Build a bounding volume hierarchy on the CPU (or in compute) and walk it from a ray-cast compute shader. Works for arbitrary geometry but the build is expensive and rebuilds per-frame for dynamic content are slow without specialized algorithms.
- Signed distance fields. Skip primitive intersection entirely — sphere-trace through a 3D SDF texture. Great for organic shapes and soft shadows, less great for sharp polygonal detail.
- Voxel DDA. For Taos specifically this is the natural fit. The world is a sparse voxel grid, and 3D-DDA traversal through chunk storage gives exact per-block ray hits in a single compute pass, with no separate acceleration structure to maintain because the chunk layout already is one. This is how Teardown and several other voxel engines implement their lighting today, and Taos's ChunkBuffer is already shaped to support it.
A pragmatic plan: prototype voxel-DDA shadows and reflections in compute now (they fit Taos's data model better than triangle BVH ray tracing anyway), and keep the hardware ray tracing path as a swap-in for when the API exists.
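To make the voxel-DDA option concrete, here is a minimal CPU-side sketch of the traversal in TypeScript. It assumes a small dense grid (`VoxelGrid`) rather than Taos's actual ChunkBuffer, whose layout is not shown in this chapter, and the names `castRay`, `isSolid`, and `RayHit` are hypothetical. A compute-shader version would follow the same loop.

```typescript
// Hypothetical dense grid used only for this sketch; Taos's ChunkBuffer layout may differ.
interface VoxelGrid {
  size: number;      // the grid spans size x size x size cells
  cells: Uint8Array; // 0 = air, anything else = solid block id
}

interface RayHit {
  cell: [number, number, number]; // integer coordinates of the block that was hit
  t: number;                      // distance along the ray to the entry face
}

function isSolid(grid: VoxelGrid, x: number, y: number, z: number): boolean {
  const n = grid.size;
  if (x < 0 || y < 0 || z < 0 || x >= n || y >= n || z >= n) return false;
  return grid.cells[x + n * (y + n * z)] !== 0;
}

// 3D-DDA: visit every cell the ray passes through, in order, one face crossing at a time.
function castRay(
  grid: VoxelGrid,
  origin: [number, number, number],
  dir: [number, number, number], // assumed normalized so that t is a distance
  maxDist: number
): RayHit | null {
  const cell: [number, number, number] = [
    Math.floor(origin[0]),
    Math.floor(origin[1]),
    Math.floor(origin[2]),
  ];
  const step = [0, 0, 0];
  const tDelta = [Infinity, Infinity, Infinity]; // ray distance between successive faces per axis
  const tMax = [Infinity, Infinity, Infinity];   // ray distance to the next face per axis
  for (let a = 0; a < 3; a++) {
    if (dir[a] > 0) {
      step[a] = 1;
      tDelta[a] = 1 / dir[a];
      tMax[a] = (cell[a] + 1 - origin[a]) / dir[a];
    } else if (dir[a] < 0) {
      step[a] = -1;
      tDelta[a] = -1 / dir[a];
      tMax[a] = (cell[a] - origin[a]) / dir[a];
    }
  }
  let t = 0;
  while (t <= maxDist) {
    if (isSolid(grid, cell[0], cell[1], cell[2])) {
      return { cell: [cell[0], cell[1], cell[2]], t };
    }
    // Cross whichever face is nearest along the ray.
    let a = 0;
    if (tMax[1] < tMax[a]) a = 1;
    if (tMax[2] < tMax[a]) a = 2;
    t = tMax[a];
    tMax[a] += tDelta[a];
    cell[a] += step[a];
  }
  return null;
}
```

A shadow query would cast one such ray from the shaded point toward the light and treat any hit as occlusion. A reflection query would shade the block returned in `cell`.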
29.2 Bindless Resources
WebGPU's bind-group model requires every texture, buffer, and sampler a shader touches to be enumerated in a BindGroupLayout at pipeline creation time, and bound through a small fixed number of BindGroup slots at draw time. This is safe and portable, but it forces the renderer into patterns that are starting to feel dated:
- The block atlas is one giant texture array because we can't index into a heap of independent textures from the shader.
- Material variation is encoded in uniform buffers and texture array layers, not by indexing into a global resource table.
- Every draw call that uses a different texture set needs a different bind group, even if the shader and pipeline are identical.
- GPU-driven rendering, where the GPU itself emits draws via multiDrawIndirect, is hamstrung because each indirect draw can't pick its own textures.
A "bindless" extension (under discussion in the WebGPU working group, modeled on Vulkan's VK_EXT_descriptor_indexing and D3D12's resource heaps) would let a shader index a large, sparse array of textures or buffers using a runtime integer, with the binding resolved per-invocation rather than per-draw.
// ── future: bindless texture access ──
@group(0) @binding(0) var textures: binding_array<texture_2d<f32>>;
@group(0) @binding(1) var samp: sampler;
struct Material {
    albedoIndex: u32,
    normalIndex: u32,
    ormIndex: u32,
    _pad: u32,
};
@group(0) @binding(2) var<storage, read> materials: array<Material>;
@fragment
fn fs_main(in: VertexOutput) -> @location(0) vec4f {
    let mat = materials[in.materialId];
    let albedo = textureSample(textures[mat.albedoIndex], samp, in.uv);
    let normal = textureSample(textures[mat.normalIndex], samp, in.uv);
    // ...
}
For an engine like Taos this would unlock several things:
- A single mega-draw for the world. All visible chunks could be drawn with one multiDrawIndirect call. The GPU's culling compute pass would write the visible chunk list, and each chunk's vertex data would carry a materialId that selects the right block textures inline.
- First-class material variety. Blocks, particles, GLTF meshes, and decals could all live in one global texture pool with per-instance material indices, instead of being siloed into per-system atlases and array slices.
- Cheaper transparent and decal passes. Many small per-object draws with unique textures (UI sprites, item drops, tile entities) collapse into instanced draws.
- Streaming friendliness. Newly loaded chunk textures or imported assets can be slotted into free indices in the texture array without rebuilding pipelines or bind group layouts.
- GPU-driven scene submission. Combined with multiDrawIndirect and compute culling, the CPU's per-frame job shrinks to "upload camera, dispatch culling, submit one draw", which is closer to a modern AAA renderer's frame structure.
The trade-offs match the desktop story: bindless makes validation looser, encourages a more data-oriented scene representation, and pushes more correctness responsibility onto the engine. Taos's current RenderItem/Pass design is already a reasonable starting point for it, since draw submission is already centralized in the render graph.
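The CPU side of such a scheme is mostly bookkeeping. The sketch below shows two pieces under stated assumptions: a free-list allocator for indices into the texture table (the "free indices" mentioned in the streaming bullet) and a packer that lays out materials with the same 16-byte stride as the WGSL `Material` struct above. `TextureSlotAllocator`, `packMaterials`, and the TypeScript `Material` interface are hypothetical names, not part of Taos.

```typescript
// Hypothetical CPU-side mirror of the WGSL Material struct shown earlier.
interface Material {
  albedoIndex: number;
  normalIndex: number;
  ormIndex: number;
}

// Hands out indices into a fixed-capacity texture table and recycles released ones.
class TextureSlotAllocator {
  private capacity: number;
  private next = 0;
  private free: number[] = [];

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  allocate(): number {
    const recycled = this.free.pop();
    if (recycled !== undefined) return recycled;
    if (this.next >= this.capacity) throw new Error("texture table is full");
    return this.next++;
  }

  release(slot: number): void {
    this.free.push(slot);
  }
}

const MATERIAL_STRIDE_WORDS = 4; // three indices plus one pad word = 16 bytes per material

function packMaterials(materials: Material[]): Uint32Array {
  const out = new Uint32Array(materials.length * MATERIAL_STRIDE_WORDS);
  for (let i = 0; i < materials.length; i++) {
    const base = i * MATERIAL_STRIDE_WORDS;
    out[base] = materials[i].albedoIndex;
    out[base + 1] = materials[i].normalIndex;
    out[base + 2] = materials[i].ormIndex;
    // out[base + 3] stays 0: it corresponds to the _pad field.
  }
  return out;
}
```

The packed array could then be uploaded to the `materials` storage buffer with `device.queue.writeBuffer`. Because slots are recycled, the engine would have to make sure no in-flight frame still samples a released index, which is part of the extra correctness burden that bindless shifts onto the engine.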
29.3 Virtualized Geometry
Virtualized geometry treats triangles the way virtual memory treats bytes — stream them in on demand at the resolution the screen actually needs, and let the GPU itself decide what to draw each frame. The result is meshes with millions of triangles that cost about the same as a few hundred thousand, with no manual LOD authoring. WebGPU has most of the raw ingredients for a stripped-down implementation, but several missing features keep it from being practical today.
The pipeline a WebGPU virtualized-geometry renderer would need:
- Meshlet preprocessing (offline). Source meshes are split into clusters of ~64–128 triangles, simplified into a DAG of LOD levels, and packed alongside per-cluster bounding spheres and screen-space error metrics. Taos's chunk mesher already produces something cluster-shaped — a real implementation would generalize this to GLTF assets.
- GPU cluster selection. A compute pass walks the cluster DAG, picks the coarsest level whose projected error is below one pixel, and culls against the frustum and a Hi-Z depth pyramid built from last frame.
- GPU draw submission. Surviving clusters are appended to an indirect draw buffer. A single multiDrawIndirect or drawIndexedIndirect per material bucket renders the whole scene.
- Software rasterization for small triangles. Triangles covering only a few pixels are rasterized in a compute shader using 64-bit atomics into a visibility buffer. This is dramatically faster than the hardware rasterizer at sub-pixel sizes, where quad overdraw dominates.
- Visibility buffer + deferred material pass. Instead of writing a G-buffer per triangle, the rasterizer writes (instanceId, triangleId). A full-screen pass then reads the visibility buffer, reconstructs the triangle, fetches its material, and shades. This collapses overdraw costs and is the only practical way to handle millions of micro-triangles.
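The cluster selection rule ("coarsest level whose projected error is below one pixel") is easy to prototype on the CPU. The following TypeScript sketch simplifies the cluster DAG to a linear chain of LOD levels and assumes a symmetric perspective projection. `projectedErrorPixels`, `selectLod`, and `LodLevel` are hypothetical names for illustration.

```typescript
// Converts a world-space geometric error into an approximate on-screen size in pixels.
function projectedErrorPixels(
  errorWorld: number,
  distance: number,
  fovYRadians: number,
  viewportHeightPx: number
): number {
  const d = Math.max(distance, 1e-6); // avoid dividing by zero at the camera position
  // Pixels covered by one world unit at distance d under a perspective projection.
  const pixelsPerUnit = viewportHeightPx / (2 * d * Math.tan(fovYRadians / 2));
  return errorWorld * pixelsPerUnit;
}

// Simplification: levels form a chain sorted from finest (index 0) to coarsest.
interface LodLevel {
  error: number; // world-space simplification error of this level
}

function selectLod(
  levels: LodLevel[],
  distance: number,
  fovYRadians: number,
  viewportHeightPx: number,
  thresholdPx: number = 1
): number {
  // Walk from coarsest to finest and stop at the first level that is accurate enough.
  for (let i = levels.length - 1; i >= 0; i--) {
    const px = projectedErrorPixels(levels[i].error, distance, fovYRadians, viewportHeightPx);
    if (px < thresholdPx) return i;
  }
  return 0; // nothing meets the threshold: fall back to the finest level
}
```

With a 90-degree vertical field of view and a 1080-pixel viewport, a level with 0.1 units of error projects to about 5.4 pixels at distance 10 and about 0.54 pixels at distance 100, so it only becomes eligible far from the camera. A GPU version would run the same test per cluster in the culling compute pass.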
// ── future: compute software rasterizer with 64-bit atomic visibility buffer ──
// atomic<u64> is not valid WGSL today; width (render-target width in pixels) is assumed to come from a uniform not shown here.
@group(0) @binding(0) var<storage, read> clusters: array<Cluster>;
@group(0) @binding(1) var<storage, read_write> visBuffer: array<atomic<u64>>;
@compute @workgroup_size(64)
fn cs_raster(@builtin(global_invocation_id) gid: vec3u) {
    let tri = loadTriangle(clusters, gid.x);
    let bbox = screenBounds(tri);
    for (var y = bbox.min.y; y < bbox.max.y; y++) {
        for (var x = bbox.min.x; x < bbox.max.x; x++) {
            if (insideTriangle(tri, vec2f(f32(x), f32(y)))) {
                let depth = interpolateDepth(tri, vec2f(f32(x), f32(y)));
                // Depth in the high 32 bits, triangle ID in the low 32: the smallest packed value is the nearest fragment.
                let packed = (u64(bitcast<u32>(depth)) << 32u) | u64(tri.id);
                atomicMin(&visBuffer[y * width + x], packed);
            }
        }
    }
}
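Why a single atomicMin doubles as a depth test is easier to see on the CPU. The TypeScript sketch below emulates the 64-bit value as a (hi, lo) pair of unsigned 32-bit words and compares pairs the way an unsigned 64-bit comparison would. It relies on the fact that the bit patterns of non-negative IEEE-754 floats sort in the same order as the floats themselves. All names here (`floatBits`, `pack`, `lessThan`, `resolveTexel`) are hypothetical, and the loop is a single-threaded stand-in for the atomic.

```typescript
// Reinterprets a 32-bit float's bits as an unsigned integer (WGSL: bitcast<u32>(depth)).
function floatBits(value: number): number {
  const f = new Float32Array(1);
  f[0] = value;
  return new Uint32Array(f.buffer)[0];
}

// Emulated 64-bit word: hi holds the depth bits, lo holds the triangle id.
interface Packed {
  hi: number;
  lo: number;
}

function pack(depth: number, triId: number): Packed {
  return { hi: floatBits(depth), lo: triId >>> 0 };
}

// Unsigned 64-bit "less than" on (hi, lo) pairs, the comparison atomicMin would perform.
function lessThan(a: Packed, b: Packed): boolean {
  return a.hi < b.hi || (a.hi === b.hi && a.lo < b.lo);
}

// Single-threaded stand-in for repeated atomicMin calls on one visibility-buffer texel.
function resolveTexel(fragments: Packed[]): Packed {
  let best: Packed = { hi: 0xffffffff, lo: 0xffffffff }; // cleared value: farther than any fragment
  for (let i = 0; i < fragments.length; i++) {
    if (lessThan(fragments[i], best)) best = fragments[i];
  }
  return best;
}
```

Because depth occupies the high word, the nearest fragment always wins regardless of its triangle id, and the id only breaks ties between fragments at identical depth. The ordering argument holds for depths in [0, 1] with a conventional (non-reversed) depth range. A reversed-Z setup would need atomicMax instead.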
WebGPU is missing several pieces that would make this dramatically simpler or faster:
- Mesh shaders. Letting a compute-shader-like stage emit primitives directly into the rasterizer would skip the indirect-buffer round trip and make cluster culling a single GPU stage. WebGPU has discussed mesh shaders but no specification exists yet.
- 64-bit atomics. The visibility buffer trick (packing depth and triangle ID into one 64-bit value and using atomicMin for depth testing) is the heart of the software rasterizer. WebGPU currently only specifies 32-bit atomics. Without 64-bit atomics, you need either two separate atomic operations (racy) or a much more elaborate per-tile binning scheme.
- Subgroup operations. Cluster culling, prefix sums for compaction, and software raster all benefit from intra-warp shuffles and ballots. Subgroups are a proposed WebGPU feature but not yet shipped uniformly.
- multi_draw_indirect with a GPU-written count. WebGPU has drawIndirect, but a virtualized-geometry renderer issues thousands of draws with the count itself decided by the GPU culling pass. Without multi_draw_indirect_count, the CPU has to read back a count or upper-bound the draw count and waste slots.
- Bindless resources. Covered in §29.2. The material resolve pass needs to index into a global texture table using the triangle's instance ID. Without bindless, every material becomes a separate pipeline or a separate pass.
- Persistent threads / work graphs. D3D12's work graphs let the GPU enqueue more work onto itself, which is a natural fit for cluster-DAG traversal. WebGPU has no equivalent.
- Sampler feedback / texture residency. Virtualized shadow maps and virtual texturing depend on knowing which texture tiles were actually sampled. WebGPU has no feedback mechanism.
- Pipeline statistics in compute. Profiling a compute rasterizer is painful without primitive counters or warp-occupancy queries — both standard on desktop APIs.
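The "upper-bound the draw count and waste slots" workaround from the multi_draw_indirect item can be sketched as follows. The five-word layout is the one WebGPU defines for drawIndexedIndirect arguments (indexCount, instanceCount, firstIndex, baseVertex, firstInstance). Everything else, including `DrawArgs`, `buildIndirectBuffer`, and filling the buffer on the CPU rather than from a culling compute pass, is an assumption made for illustration.

```typescript
// Hypothetical per-cluster draw description.
interface DrawArgs {
  indexCount: number;
  firstIndex: number;
  baseVertex: number;
  firstInstance: number;
}

const INDEXED_INDIRECT_WORDS = 5; // 5 x 4 bytes = 20 bytes per indexed indirect draw

// Builds a fixed-size argument buffer. Slots beyond the visible count stay zeroed,
// so their instanceCount is 0 and the corresponding draws render nothing.
function buildIndirectBuffer(visible: DrawArgs[], maxDraws: number): Uint32Array {
  const out = new Uint32Array(maxDraws * INDEXED_INDIRECT_WORDS);
  const count = Math.min(visible.length, maxDraws);
  for (let i = 0; i < count; i++) {
    const base = i * INDEXED_INDIRECT_WORDS;
    out[base] = visible[i].indexCount;
    out[base + 1] = 1; // instanceCount
    out[base + 2] = visible[i].firstIndex;
    out[base + 3] = visible[i].baseVertex; // signed in the API; negative values wrap to two's complement here
    out[base + 4] = visible[i].firstInstance;
  }
  return out;
}
```

The CPU would then issue `maxDraws` indirect draws every frame, each reading its arguments at byte offset `i * 20`. The wasted slots cost little GPU time but still cost CPU encoding time, which is why a GPU-written count matters at the scale of thousands of clusters.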
For Taos, the immediate payoff would be scenes that the current chunk renderer can't handle: high-poly imported assets (statues, vehicles, buildings) with no LOD authoring, dense vegetation with per-leaf geometry, and terrain features with overhangs and caves that don't fit the voxel grid. The chunk system itself wouldn't be replaced, since voxels are a different geometric representation, but everything placed inside the world could share a single virtualized geometry pipeline.
A pragmatic path: start with cluster culling + multiDrawIndirect for GLTF meshes (achievable today on top of the existing render graph), defer the software rasterizer until 64-bit atomics ship, and treat full virtualized-geometry parity as a multi-year goal that follows the WebGPU spec rather than racing ahead of it.
29.4 Compute Shader Post-Processing
Several post-processing effects could benefit from compute shader implementations:
- Compute bloom — faster separable blur via shared memory in workgroups.
- Compute DOF — tile-based depth of field with variable-radius gather.
- Compute TAA — neighborhood sampling with shared memory.
// ── future: compute-based post-processing ──
const computePipeline = device.createComputePipeline({
    layout: 'auto', // required member; 'auto' derives the pipeline layout from the shader
    compute: { module: bloomComputeShader, entryPoint: 'cs_bloom' },
});
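Two small CPU-side helpers are typically needed before a compute blur like this can be dispatched: the normalized kernel weights for the separable pass and the workgroup counts that cover the image. The sketch below assumes a Gaussian kernel and an 8x8 workgroup tile. Both choices, and the names `gaussianWeights` and `dispatchSize`, are illustrative rather than taken from Taos.

```typescript
// Normalized 1D Gaussian weights for a separable blur with 2 * radius + 1 taps.
function gaussianWeights(radius: number, sigma: number): Float32Array {
  const weights = new Float32Array(2 * radius + 1);
  let sum = 0;
  for (let i = -radius; i <= radius; i++) {
    const w = Math.exp(-(i * i) / (2 * sigma * sigma));
    weights[i + radius] = w;
    sum += w;
  }
  // Normalize so the blur preserves overall brightness.
  for (let i = 0; i < weights.length; i++) {
    weights[i] /= sum;
  }
  return weights;
}

// Number of workgroups needed to cover the image, rounding up at the edges.
function dispatchSize(width: number, height: number, tile: number = 8): [number, number] {
  return [Math.ceil(width / tile), Math.ceil(height / tile)];
}
```

For a 1920x1080 target with 8x8 tiles this yields 240 by 135 workgroups, the values that would be passed to `dispatchWorkgroups`. Because the counts round up, the shader has to bounds-check invocations that fall outside the image.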
29.5 Procedural Generation at Scale
The terrain generation system will be extended with:
- Infinite terrain — seamless generation in all directions using a hash-based coordinate system.
- Cave systems — 3D cellular automata and Perlin worm caves.
- Biome diversity — more biomes (jungle, swamp, tundra, mesa) with unique vegetation and block types.
- Structure generation — trees, villages, dungeons placed by rule-based and template-based generation.
- LOD system for generation — distant chunks use lower-octave noise for faster generation.
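Two of these items lend themselves to a short sketch. One reading of a "hash-based coordinate system" is that each chunk derives a deterministic seed from its integer coordinates and the world seed, so any chunk can be generated independently and in any order. The LOD item can be modeled as an octave budget that shrinks with distance. The TypeScript below is an assumption-laden illustration: the mixing constants, the distance thresholds, and the names `hashChunk` and `octavesForDistance` are not taken from Taos.

```typescript
// Deterministic 32-bit seed for a chunk, mixing integer coordinates with the world seed.
// The multipliers are odd constants commonly used in integer hash functions.
function hashChunk(cx: number, cy: number, cz: number, worldSeed: number): number {
  let h = worldSeed | 0;
  h = Math.imul(h ^ (cx | 0), 0x9e3779b1);
  h = Math.imul(h ^ (cy | 0), 0x85ebca6b);
  h = Math.imul(h ^ (cz | 0), 0xc2b2ae35);
  h ^= h >>> 16;
  return h >>> 0; // unsigned 32-bit result
}

// Octave budget for a chunk at a given distance (measured in chunks): one octave is
// dropped at distance 8 and again at each doubling after that, down to a floor.
function octavesForDistance(
  distanceChunks: number,
  maxOctaves: number = 6,
  minOctaves: number = 2
): number {
  let octaves = maxOctaves;
  for (let threshold = 8; threshold <= distanceChunks && octaves > minOctaves; threshold *= 2) {
    octaves--;
  }
  return octaves;
}
```

Since the seed depends only on coordinates, regenerating a chunk after it has been unloaded reproduces the same terrain. One thing this sketch does not address is the seam between neighboring chunks that were generated with different octave counts.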
29.6 Entity-Component-System Architecture
Taos today uses a classic object-oriented scene model: a GameObject holds a list of Component instances, each component is a class with methods and per-object state, and the scene walks the tree every frame to find work. This is a familiar, approachable pattern — easy to teach, easy to inspect in a debugger, and a natural fit for the way humans think about a world made of things. It also has well-understood scaling limits.
The bottleneck shows up long before the GPU does. A MeshRenderer is a heap-allocated object, its mesh and material references point off into other allocations, and the per-object Transform lives somewhere else again. Iterating ten thousand renderers means ten thousand pointer chases through scattered cache lines, with the CPU stalled most of the time waiting on memory. Virtual method dispatch, polymorphic component shapes, and the unpredictable allocation order of a long-lived scene all conspire against the prefetcher. By the time a chunked voxel world has a few thousand entities — particles, projectiles, dropped items, mobs, networked players — the per-frame Scene.update walk starts to dominate the CPU budget even though each individual update does almost nothing.
An Entity-Component-System turns the data layout inside out. An entity is just an integer ID. A component is a plain struct of data, stored contiguously in a typed array. A system is a function that iterates one or more component arrays in parallel and writes the result back. There are no objects, no inheritance, no per-entity virtual calls.
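// ── future: ECS-style component storage ──
class TransformStore {
    positionX: Float32Array;
    positionY: Float32Array;
    positionZ: Float32Array;
    // ... rotation, scale
    constructor(capacity: number) {
        // One densely packed array per field (structure-of-arrays layout).
        this.positionX = new Float32Array(capacity);
        this.positionY = new Float32Array(capacity);
        this.positionZ = new Float32Array(capacity);
    }
}
class VelocityStore {
    vx: Float32Array;
    vy: Float32Array;
    vz: Float32Array;
    constructor(capacity: number) {
        this.vx = new Float32Array(capacity);
        this.vy = new Float32Array(capacity);
        this.vz = new Float32Array(capacity);
    }
}
// A "system" is a loop, not a method on an object.
function integrateMotion(transforms: TransformStore, velocities: VelocityStore, dt: number, count: number) {
    // Index i identifies the same entity in every store.
    for (let i = 0; i < count; i++) {
        transforms.positionX[i] += velocities.vx[i] * dt;
        transforms.positionY[i] += velocities.vy[i] * dt;
        transforms.positionZ[i] += velocities.vz[i] * dt;
    }
}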
The payoff is mechanical sympathy. Every byte the loop reads is one the CPU was already fetching for the next iteration. Branches are predictable. The arrays are SIMD-friendly — the same loop vectorizes trivially, and on the GPU the same shape ports straight to a compute shader. Systems with no shared writes can run on separate workers without locking. A mature ECS routinely simulates hundreds of thousands of entities at frame rate on hardware that would choke on a tenth as many GameObjects. For Taos specifically, this is what would unlock dense particle storms, swarms of physics-driven projectiles, large mob populations, and high-entity-count multiplayer sessions without the CPU becoming the bottleneck.
The cost is a fundamentally different way of writing game code, and it is genuinely harder to think in. Object-oriented code lets you say "the zombie chases the player" — you find the zombie, call zombie.chase(player), and the method has access to everything the zombie is. ECS forces you to reframe that as "for every entity with an AIBrain and a Transform and a Hostile tag, query the nearest entity with PlayerTag, then update its Velocity." Behavior is no longer attached to objects; it lives in systems that operate on whatever happens to match a query. A handful of consequences fall out of this:
- Cross-cutting logic gets harder. A spell that freezes a target, plays a sound, spawns particles, and applies a damage-over-time is one method call in OOP. In ECS it is several systems writing to several component stores, often coordinated through event queues or deferred command buffers because you can't safely mutate the world mid-iteration.
- Queries replace references. Holding a direct reference to "the player" becomes an anti-pattern, because the entity ID may be reused and components may move between archetypes as tags are added or removed. You filter the world each frame instead, which is fast but feels wasteful to programmers raised on pointers.
- Debugging is less visual. A GameObject in a debugger shows you the whole object at once. An ECS entity is an integer scattered across a dozen disjoint component arrays, so useful inspector tooling (entity browsers, archetype viewers) is something the engine has to provide, not something the language gives for free.
- Refactoring boundaries shift. Adding a new field to a component is cheap; adding a new component type and rewiring the systems that consume it is the expensive operation. This inverts the OO instinct to extend classes freely and rewire interfaces sparingly.
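The ID-reuse hazard mentioned under "Queries replace references" is commonly handled with generational handles: each slot carries a counter that is bumped on destruction, so a stale handle can be detected instead of silently pointing at a new entity. The TypeScript sketch below is one minimal way to do it. `EntityRegistry` and `EntityHandle` are hypothetical names, not existing Taos classes.

```typescript
// A handle pairs a slot index with the generation the slot had when the entity was created.
interface EntityHandle {
  index: number;
  generation: number;
}

class EntityRegistry {
  private generations: number[] = [];
  private alive: boolean[] = [];
  private freeIndices: number[] = [];

  create(): EntityHandle {
    const recycled = this.freeIndices.pop();
    if (recycled !== undefined) {
      this.alive[recycled] = true;
      return { index: recycled, generation: this.generations[recycled] };
    }
    const index = this.generations.length;
    this.generations.push(0);
    this.alive.push(true);
    return { index, generation: 0 };
  }

  destroy(handle: EntityHandle): void {
    if (!this.isAlive(handle)) return; // stale or already destroyed: nothing to do
    this.alive[handle.index] = false;
    this.generations[handle.index]++; // invalidates every outstanding handle to this slot
    this.freeIndices.push(handle.index);
  }

  isAlive(handle: EntityHandle): boolean {
    return (
      handle.index < this.generations.length &&
      this.alive[handle.index] &&
      this.generations[handle.index] === handle.generation
    );
  }
}
```

A system that stores a handle to "the player" would call `isAlive` before using it. If the player entity was destroyed and its slot recycled for a projectile, the generation mismatch makes the check fail rather than redirecting behavior to the wrong entity.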
Hybrid approaches exist: an ECS world can sit alongside the classic component system rather than replacing it, reserving the data-oriented layout for the entities that need it while keeping objects elsewhere. A realistic Taos migration would probably keep the high-level scene tree and components as the authoring surface — that is where the API ergonomics matter most — and move the hot, numerous, homogeneous entities (particles, projectiles, blocks-as-entities, NPC swarms) onto a parallel ECS world that the systems iterate flatly. The renderer would consume both: GameObject-derived RenderItems from the scene tree, and bulk-instanced draws emitted by ECS systems, merged into the same render graph passes.
This is a deep architectural change rather than an incremental refactor, and the right time to take it on is when the current model becomes a measurable bottleneck — not before. The bet is that the kind of world Taos wants to grow into (multiplayer, dense simulation, AI-driven NPCs at scale) will eventually demand it.
29.7 WebXR
- WebXR integration: immersive VR mode using XRSession with WebGPU as the rendering backend. The renderer would output to the WebXR framebuffer with the correct projection and view matrices for each eye.
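The per-eye matrix handling can be sketched independently of the rendering backend. In WebXR, each XRView exposes a `projectionMatrix` and a `transform` whose inverse is the view matrix. The TypeScript below declares a minimal structural stand-in (`ViewLike`) for those fields so that it compiles without WebXR typings, and packs both matrices per eye into one uniform array. `ViewLike`, `packEyeUniforms`, and the 32-float layout are assumptions for illustration. How the result is bound to a WebGPU render target for XR is left out, since that binding API is still being specified.

```typescript
// Structural stand-in for the XRView fields used here.
interface ViewLike {
  eye: string;                                      // 'left', 'right', or 'none'
  projectionMatrix: Float32Array;                   // column-major 4x4
  transform: { inverse: { matrix: Float32Array } }; // inverse of the eye pose = view matrix
}

const FLOATS_PER_EYE = 32; // view matrix (16 floats) followed by projection matrix (16 floats)

// Packs view and projection matrices for every eye into one contiguous uniform array.
function packEyeUniforms(views: ViewLike[]): Float32Array {
  const out = new Float32Array(views.length * FLOATS_PER_EYE);
  for (let i = 0; i < views.length; i++) {
    const base = i * FLOATS_PER_EYE;
    out.set(views[i].transform.inverse.matrix, base);
    out.set(views[i].projectionMatrix, base + 16);
  }
  return out;
}
```

In a real session the views would come from `frame.getViewerPose(referenceSpace).views` inside the XR animation frame callback, and the renderer would run its passes once per view with the matching slice of this array.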
29.8 Closing Thoughts
Taos began as an experiment — could we build a complete, modern renderer from scratch on the web platform? The answer is yes, and the code is here for you to read, modify, and learn from.
The rendering techniques covered in this book — deferred shading, PBR, CSM, TAA, bloom, SSAO, and the rest — are not specific to WebGPU or to Taos. They are the foundation of real-time graphics in 2026, and understanding them deeply will serve you regardless of which API or engine you use next.
The source code at https://github.com/brendan-duncan/TaosEngine will continue to evolve. The book will be updated as new features are added. Contributions, issues, and forks are welcome.
29.9 Summary
Taos's future directions and closing reflections:
- Ray tracing: Hardware-accelerated shadows, AO, reflections, and path-traced preview
- Bindless resources: Sparse texture/buffer heaps indexed at runtime, enabling GPU-driven multiDrawIndirect rendering of the whole world
- Virtualized geometry: Cluster culling, software rasterization, and visibility buffer shading, gated on 64-bit atomics, mesh shaders, and subgroup ops
- Compute post-processing: Bloom, DOF, and TAA implemented as compute shaders
- Procedural generation: Infinite terrain, cave systems, biome diversity, LOD generation
- ECS architecture: Data-oriented entity storage for cache-coherent iteration over tens of thousands of entities — at the cost of having to think in systems and queries rather than methods on objects
- WebXR: Immersive VR mode via XRSession with WebGPU rendering backend
- Closing: The rendering techniques covered are the foundation of real-time graphics, applicable beyond WebGPU and Taos
Further reading:
- WebGPU specification: https://www.w3.org/TR/webgpu/
- WGSL specification: https://www.w3.org/TR/WGSL/
- Physically Based Rendering (Pharr, Jakob, Humphreys): https://pbr-book.org/