Chapter 27: Performance

This chapter covers the profiling, optimization, and culling techniques used to keep Taos running at 60+ FPS on mid-range hardware.

27.1 Frame Timing and Profiling

Figure: Command encoder timeline. Timestamp writes bracket each render pass (shadow, opaque, transparent, post) and store nanosecond counters in a query set; the results are resolved into a buffer, mapped, and converted into per-pass milliseconds.

RenderContext.update() advances deltaTime, elapsedTime, frameCount, and a smoothed fps once per frame — the in-game HUD reads these directly to display performance state:

// ── from src/renderer/render_context.ts ──
update(): boolean {
  // Update frame timing
  const now = performance.now();
  this.deltaTime = Math.max(0, (now - this.#lastTime) / 1000);
  this.#lastTime = now;
  this.elapsedTime = (now - this.#startTime) / 1000;

  // Frame counter & FPS (exponential moving average)
  this.frameCount++;
  const instFps = this.deltaTime > 0 ? 1 / this.deltaTime : 0;
  this.framesPerSecond += (instFps - this.framesPerSecond) * 0.1;
  this.fps = Math.round(this.framesPerSecond);
  // ...
}
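
To get a feel for how the 0.1 smoothing factor behaves, the following sketch replays the same exponential moving average outside the renderer. The names smoothFps and framesToSettle are illustrative and not part of RenderContext; the scenario (a steady 60 FPS dropping to 30 FPS) is made up for the example.

```typescript
// Illustrative helpers; these names are not part of the Taos source.

/** One EMA step, matching `fps += (instFps - fps) * alpha` from update(). */
function smoothFps(previous: number, deltaTime: number, alpha = 0.1): number {
  const instFps = deltaTime > 0 ? 1 / deltaTime : 0;
  return previous + (instFps - previous) * alpha;
}

/**
 * Frames needed after a step change until the remaining error drops below
 * `tolerance`, expressed as a fraction of the step.
 */
function framesToSettle(alpha: number, tolerance: number): number {
  // After n frames the remaining error is (1 - alpha)^n of the original step.
  return Math.ceil(Math.log(tolerance) / Math.log(1 - alpha));
}

// Replay: the HUD has been showing 60 FPS, then every frame takes 1/30 s.
let fps = 60;
const settleFrames = framesToSettle(0.1, 0.01);
for (let i = 0; i < settleFrames; i++) {
  fps = smoothFps(fps, 1 / 30);
}
```

With a factor of 0.1 the readout needs a few dozen frames to close 99% of a step change, so a single slow frame barely moves the displayed number while a sustained drop still shows up within about a second.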

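For per-pass GPU timing, WebGPU exposes a 'timestamp-query' device feature: a GPUQuerySet of type 'timestamp' is attached to each pass through the timestampWrites field of its pass descriptor (beginningOfPassWriteIndex and endOfPassWriteIndex), then resolved into a buffer with resolveQuerySet and mapped back on the CPU. Earlier drafts of the API used encoder.writeTimestamp(querySet, i) for the same purpose; that method has since been removed from the specification. Taos does not currently consume these, because overall frame time has been sufficient signal, but the render graph's pass-ordered execution would make it straightforward to wire in.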
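
Once resolved timestamps have been mapped back, turning them into per-pass milliseconds is a small amount of arithmetic. The sketch below is not Taos code: timestampsToMs and PassTiming are hypothetical names, and it assumes each pass wrote two consecutive 64-bit nanosecond values (begin, then end) in the same order as the pass names.

```typescript
// Hypothetical helper, not part of Taos. Assumes two consecutive u64
// nanosecond values per pass (begin, end), ordered like `passNames`.
interface PassTiming {
  name: string;
  ms: number;
}

function timestampsToMs(raw: BigUint64Array, passNames: readonly string[]): PassTiming[] {
  if (raw.length < passNames.length * 2) {
    throw new RangeError("timestamp buffer holds fewer than 2 entries per pass");
  }
  const timings: PassTiming[] = [];
  for (let i = 0; i < passNames.length; i++) {
    const begin = raw[i * 2];
    const end = raw[i * 2 + 1];
    // Timestamps are not guaranteed to be monotonic, so report an unusable
    // pair as NaN instead of a negative duration.
    const ms = end >= begin ? Number(end - begin) / 1e6 : NaN;
    timings.push({ name: passNames[i], ms });
  }
  return timings;
}
```

In a real frame the input would come from wrapping the mapped readback range, for example new BigUint64Array(readbackBuffer.getMappedRange()), after mapAsync(GPUMapMode.READ) resolves; readbackBuffer is a placeholder name here.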

27.2 Lazy Pipeline Compilation

Pipeline compilation is expensive. Each pass that needs material-specific pipelines (e.g. the geometry pass) keeps a per-instance pipeline cache and compiles lazily on first use, keyed by shader + variant mask:

// ── from src/renderer/render_graph/passes/geometry_pass.ts ──
private _getPipeline(material: Material, variantMask: number): GPURenderPipeline {
  const key = `${material.shaderId}:${variantMask}`;
  let pipeline = this._pipelineCache.get(key);
  if (pipeline) {
    return pipeline;
  }

  const defines: Record<string, string> = {};
  if (variantMask & 1) {
    defines['HAS_ALBEDO_MAP'] = '1';
  }

  const shaderModule = this._ctx.createShaderModule(
    material.getShaderCode(MaterialPassType.Geometry, variantMask),
    `GeometryShader[${key}]`,
    defines,
  );
  pipeline = this._device.createRenderPipeline({ /* layout, vertex, fragment, depth, primitive */ });
  this._pipelineCache.set(key, pipeline);
  return pipeline;
}

For materials that are always visible, pipelines can be compiled eagerly during the loading screen by walking the material list and triggering the cache miss once.
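
The eager warm-up can be sketched independently of WebGPU. In the example below, LazyCache, MaterialLike, and warmUp are hypothetical names, and the compile callback stands in for the shader-module and pipeline creation shown above; the point is only that walking the material list once moves every cache miss into the loading screen.

```typescript
// Illustrative sketch; none of these names come from the Taos source.
class LazyCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly compile: (key: string) => V;

  constructor(compile: (key: string) => V) {
    this.compile = compile;
  }

  /** Returns the cached value, compiling it on the first request only. */
  get(key: string): V {
    let value = this.entries.get(key);
    if (value === undefined) {
      value = this.compile(key);
      this.entries.set(key, value);
    }
    return value;
  }
}

interface MaterialLike {
  shaderId: string;
  variantMask: number;
}

/** Walks the material list once so later frames only ever hit the cache. */
function warmUp<V>(cache: LazyCache<V>, materials: readonly MaterialLike[]): void {
  for (const m of materials) {
    // Same key shape as the geometry pass: shader id plus variant mask.
    cache.get(`${m.shaderId}:${m.variantMask}`);
  }
}
```

WebGPU also provides createRenderPipelineAsync, which could let a warm-up like this avoid blocking while the driver compiles; whether Taos uses it is not stated in this chapter.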

27.3 Frustum Culling

Figure: Top-down view of the camera frustum sweeping across a grid of chunks. Chunks fully outside any of the six view planes are culled and chunks intersecting are kept, with the AABB positive-vertex test highlighted.

Every chunk and mesh is tested against the camera frustum before rendering. The test uses the six planes of the view-projection frustum:

// ── from src/renderer/render_graph/passes/block_geometry_pass.ts ──
private _extractFrustumPlanes(m: Float32Array | number[]): void {
  const p = this._frustumPlanes;
  p[ 0]=m[3]+m[0]; p[ 1]=m[7]+m[4]; p[ 2]=m[11]+m[ 8]; p[ 3]=m[15]+m[12];
  p[ 4]=m[3]-m[0]; p[ 5]=m[7]-m[4]; p[ 6]=m[11]-m[ 8]; p[ 7]=m[15]-m[12];
  p[ 8]=m[3]+m[1]; p[ 9]=m[7]+m[5]; p[10]=m[11]+m[ 9]; p[11]=m[15]+m[13];
  p[12]=m[3]-m[1]; p[13]=m[7]-m[5]; p[14]=m[11]-m[ 9]; p[15]=m[15]-m[13];
  p[16]=m[2];      p[17]=m[6];      p[18]=m[10];        p[19]=m[14];
  p[20]=m[3]-m[2]; p[21]=m[7]-m[6]; p[22]=m[11]-m[10]; p[23]=m[15]-m[14];
}

private _isVisible(ox: number, oy: number, oz: number): boolean {
  const p = this._frustumPlanes;
  const mx = ox + CHUNK_SIZE, my = oy + CHUNK_SIZE, mz = oz + CHUNK_SIZE;
  for (let i = 0; i < 6; i++) {
    const a = p[i*4], b = p[i*4+1], c = p[i*4+2], d = p[i*4+3];
    // AABB's positive-vertex test against this plane
    if (a*(a>=0?mx:ox) + b*(b>=0?my:oy) + c*(c>=0?mz:oz) + d < 0) {
      return false;
    }
  }
  return true;
}

The six frustum planes are extracted directly from the un-jittered view-projection matrix (Gribb-Hartmann), and the AABB test picks each box corner adaptively based on the plane normal's sign so we never iterate all eight vertices.
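
As a quick sanity check, the sketch below restates the extraction and the positive-vertex test as standalone functions and runs them against an identity view-projection matrix, for which the frustum is simply x and y in [-1, 1] and z in [0, 1]. The function names and the size parameter are illustrative; like the excerpt above, the sketch assumes a column-major matrix and WebGPU's [0, 1] clip-space depth.

```typescript
// Standalone restatement for experimentation; names are not Taos's.
function extractFrustumPlanes(m: ArrayLike<number>): Float32Array {
  const p = new Float32Array(24);
  for (let c = 0; c < 4; c++) {
    // Column c contributes one coefficient to each clip-space row.
    const x = m[c * 4], y = m[c * 4 + 1], z = m[c * 4 + 2], w = m[c * 4 + 3];
    p[c] = w + x;      // left
    p[4 + c] = w - x;  // right
    p[8 + c] = w + y;  // bottom
    p[12 + c] = w - y; // top
    p[16 + c] = z;     // near (clip z >= 0)
    p[20 + c] = w - z; // far
  }
  return p;
}

function isBoxVisible(
  p: Float32Array,
  min: readonly [number, number, number],
  size: number,
): boolean {
  const [ox, oy, oz] = min;
  const mx = ox + size, my = oy + size, mz = oz + size;
  for (let i = 0; i < 6; i++) {
    const a = p[i * 4], b = p[i * 4 + 1], c = p[i * 4 + 2], d = p[i * 4 + 3];
    // Pick the corner farthest along the plane normal; if even that corner
    // is behind the plane, the whole box is outside.
    if (a * (a >= 0 ? mx : ox) + b * (b >= 0 ? my : oy) + c * (c >= 0 ? mz : oz) + d < 0) {
      return false;
    }
  }
  return true;
}

const identity = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1];
const planes = extractFrustumPlanes(identity);
const inside = isBoxVisible(planes, [-0.25, -0.25, 0.25], 0.5);
const straddling = isBoxVisible(planes, [0.9, 0, 0.2], 0.5);
const outside = isBoxVisible(planes, [2, 0, 0], 0.5);
```

The straddling box is kept because one of its corners is still on the inner side of the right plane; only the box that lies entirely beyond a plane is rejected.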

Because this runs on the CPU for every mesh — and, for streamed geo tiles, alongside a horizon-occlusion test against the curve of the globe — the frustum question is already answered cheaply before the draw list is even built. What the CPU can't answer cheaply is whether an in-frustum object is hidden behind something else. That's the job of the next section.

27.4 Occlusion Culling

Frustum culling removes what's outside the view; occlusion culling removes what's inside it but hidden behind nearer geometry — a ridge, a wall, the near limb of the globe. The CPU can't answer "is this behind something?" cheaply, so Taos answers it on the GPU as an opt-in pass built around a hierarchical depth buffer (Hi-Z). Three pieces wire together through the render graph:

HiZPass (src/renderer/render_graph/passes/hiz_pass.ts) builds an r32float depth pyramid from the scene depth. Each coarser mip reduces a 2×2 block to the farthest occluder — max for standard depth, min for reversed-Z — so a coarse texel can never claim an occluder nearer than reality. That one-sided rounding is what makes the whole scheme conservative: it may fail to cull something that is in fact hidden, but it never culls something that is actually visible.

GpuCullPass (src/renderer/render_graph/passes/gpu_cull_pass.ts, src/shaders/gpu_cull.wgsl) runs one compute thread per instance. It tests the instance's world bounding sphere against the frustum, then projects the sphere's screen-space AABB, picks the mip whose texels span that footprint, samples the pyramid, and compares the box's nearest depth against the farthest occluder there:

// ── from src/shaders/gpu_cull.wgsl ── (standard depth; reversed-Z flips both)
// Occluded when the box's nearest point is *behind* the farthest occluder
// across its whole screen footprint — then nothing the box could draw is visible.
if (reversed) {
  return nearDepth < occ;
}
return nearDepth > occ;

If the instance is occluded, the thread writes instanceCount = 0 into that instance's drawIndexedIndirect args; otherwise it leaves the 1 the CPU wrote. The geometry pass then issues one indirect draw per instance and the GPU skips the zeroed ones with no vertex work (§27.5).

GeometryPass.setGpuOcclusionCull(...) is the opt-in switch. Given a config it packs per-instance bounds and args, adds the cull compute pass, and swaps its draw loop from drawIndexed to drawIndexedIndirect. With culling off, the pass is byte-for-byte the old path — nothing in the buffers or the draw loop changes.
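
The Hi-Z reduction and the per-instance decision are easier to reason about as a CPU reference. The sketch below is illustrative only, since Taos runs both steps on the GPU. It assumes power-of-two depth dimensions and an on-screen integer pixel footprint, and its mip-selection rule (the smallest mip whose texel is at least as large as the footprint's longer side) is a common choice assumed here, not taken from gpu_cull.wgsl. All names are hypothetical.

```typescript
// CPU reference for the Hi-Z test; an assumption-laden sketch, not Taos code.
interface DepthMip {
  width: number;
  height: number;
  data: Float32Array;
}

/** Halves a mip, keeping the farthest depth of each 2x2 block. */
function downsampleFarthest(src: DepthMip, reversedZ: boolean): DepthMip {
  const width = Math.max(1, src.width >> 1);
  const height = Math.max(1, src.height >> 1);
  const data = new Float32Array(width * height);
  const farthest = reversedZ ? Math.min : Math.max;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      // Clamp so a 1-texel-wide source never reads out of bounds.
      const x0 = x * 2, x1 = Math.min(x0 + 1, src.width - 1);
      const y0 = y * 2, y1 = Math.min(y0 + 1, src.height - 1);
      data[y * width + x] = farthest(
        src.data[y0 * src.width + x0], src.data[y0 * src.width + x1],
        src.data[y1 * src.width + x0], src.data[y1 * src.width + x1],
      );
    }
  }
  return { width, height, data };
}

function buildPyramid(base: DepthMip, reversedZ: boolean): DepthMip[] {
  const mips = [base];
  while (mips[mips.length - 1].width > 1 || mips[mips.length - 1].height > 1) {
    mips.push(downsampleFarthest(mips[mips.length - 1], reversedZ));
  }
  return mips;
}

/** Smallest mip whose texel (2^mip pixels) covers the footprint's longer side. */
function pickMip(widthPx: number, heightPx: number, mipCount: number): number {
  const extent = Math.max(widthPx, heightPx);
  let mip = 0;
  while ((1 << mip) < extent && mip < mipCount - 1) mip++;
  return mip;
}

interface Footprint { x: number; y: number; width: number; height: number }

function isBoxOccluded(
  pyramid: DepthMip[], box: Footprint, nearDepth: number, reversedZ: boolean,
): boolean {
  const mip = pickMip(box.width, box.height, pyramid.length);
  const level = pyramid[mip];
  const x0 = Math.min(box.x >> mip, level.width - 1);
  const x1 = Math.min((box.x + box.width - 1) >> mip, level.width - 1);
  const y0 = Math.min(box.y >> mip, level.height - 1);
  const y1 = Math.min((box.y + box.height - 1) >> mip, level.height - 1);
  // At the chosen mip the footprint touches at most 2 texels per axis.
  let occ = reversedZ ? Infinity : -Infinity;
  for (let y = y0; y <= y1; y++) {
    for (let x = x0; x <= x1; x++) {
      const d = level.data[y * level.width + x];
      occ = reversedZ ? Math.min(occ, d) : Math.max(occ, d);
    }
  }
  // Same comparison as the shader excerpt above.
  return reversedZ ? nearDepth < occ : nearDepth > occ;
}
```

A footprint that straddles the edge of an occluder picks up the far depth next to it and is therefore kept, which is the conservative behavior described above.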

Previous-frame Hi-Z. There is a chicken-and-egg problem: the occluders are the geometry we are about to draw, so the pyramid for this frame doesn't exist yet. Taos resolves it by culling against last frame's pyramid, then rebuilding the pyramid from this frame's depth once the geometry is drawn. The cull → geometry → rebuild Hi-Z ordering keeps the read ahead of the overwrite within the frame, and the pyramid is a persistent texture that survives into the next frame. The cost is one frame of latency: a hard camera cut can leave something un-culled for a single frame, but because the test is conservative it never wrongly hides anything. The first frame and any window resize skip the cull (no valid prior pyramid) and fall back to plain drawIndexed.
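
The "is there a usable prior pyramid?" decision amounts to a small piece of bookkeeping. The sketch below shows one way it could be tracked; HiZHistory and its methods are hypothetical names, and the real pass may store this state differently.

```typescript
// Hypothetical bookkeeping for the previous-frame pyramid; not Taos code.
class HiZHistory {
  private width = 0;
  private height = 0;
  private valid = false;

  /** Call before the cull pass: true only if a pyramid of matching size exists. */
  canCull(width: number, height: number): boolean {
    return this.valid && this.width === width && this.height === height;
  }

  /** Call after the pyramid has been rebuilt from this frame's depth. */
  markBuilt(width: number, height: number): void {
    this.width = width;
    this.height = height;
    this.valid = true;
  }
}

/** Describes one frame's pass order for the given render size. */
function planFrame(history: HiZHistory, width: number, height: number): string[] {
  const passes = history.canCull(width, height)
    ? ["cull", "geometry (drawIndexedIndirect)", "rebuild Hi-Z"]
    : ["geometry (drawIndexed)", "rebuild Hi-Z"];
  history.markBuilt(width, height);
  return passes;
}
```

Calling planFrame on the first frame, or on the first frame after a resize, yields the plain drawIndexed path; every other frame gets the cull, geometry, rebuild ordering.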

This lives in the shared deferred GeometryFeature, so deferredPreset({ occlusionCull: true }) turns it on for any deferred scene; planet_explorer and crafty expose it as a runtime toggle, and the gpu_cull_test sample is a focused demo — a wall hiding hundreds of high-poly spheres — with a live "drawn / total" readout fed by reading the post-cull args buffer back.
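
The counting step behind a "drawn / total" readout is simple once the args buffer has been copied to a mappable buffer and exposed as a Uint32Array. The sketch below assumes the five-word drawIndexedIndirect layout (indexCount, instanceCount, firstIndex, baseVertex, firstInstance); countDrawn is a hypothetical name, not the sample's actual function.

```typescript
// Hypothetical readout helper; assumes 5 u32 words per indirect draw.
const ARGS_STRIDE = 5;

function countDrawn(args: Uint32Array): { drawn: number; total: number } {
  const total = Math.floor(args.length / ARGS_STRIDE);
  let drawn = 0;
  for (let i = 0; i < total; i++) {
    // Word 1 of each slot is instanceCount; the cull pass zeroes it when occluded.
    if (args[i * ARGS_STRIDE + 1] > 0) drawn++;
  }
  return { drawn, total };
}
```

Because the readback is asynchronous, a readout like this would typically lag the frame it describes.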

Why occlusion and not a GPU frustum cull? As §27.3 noted, the CPU already frustum-culls (and horizon-culls geo tiles) before the draw list exists, so a GPU frustum pass would mostly repeat work already done. Occlusion is the part the CPU can't afford — so that is where the GPU pass earns its keep, and it is most worthwhile when many in-frustum objects sit behind a near occluder (looking across a canyon rather than down at open ground).

27.5 GPU-Driven Indirect Draw

Terrain LOD uses indirect draw so the CPU never sees the visible patch count. A compute pass walks a CDLOD quadtree per max-depth cell, atomically appending the patches it decides to render into a patch buffer and incrementing instance_count on a shared drawIndexedIndirect args struct:

// ── from src/shaders/terrain/terrain_lod.wgsl ──
struct IndirectArgs {
  index_count: u32,
  instance_count: atomic<u32>,
  first_index: u32,
  base_vertex: u32,
  first_instance: u32,
}

The subsequent render pass calls drawIndexedIndirect directly against that buffer — no readback, no CPU iteration, no per-patch draw call assembly.

The occlusion cull of §27.4 rides the same drawIndexedIndirect mechanism, but fills the args buffer the other way around. Terrain appends: a compute pass grows the list and atomically bumps a single instance_count, so the buffer ends up holding exactly the survivors. The mesh GeometryPass instead masks: the CPU writes one arg slot per draw with instanceCount = 1, and the cull compute zeroes the slots whose instances turn out to be hidden.

// ── from src/renderer/render_graph/cull_util.ts ──
// One drawIndexedIndirect arg set: [indexCount, instanceCount, firstIndex, baseVertex, firstInstance]
export function packIndirectArgs(out: Uint32Array, off: number, indexCount: number, instanceCount = 1): void {
  out[off]     = indexCount;
  out[off + 1] = instanceCount; // the cull compute later overwrites this with 0 when occluded
  out[off + 2] = 0;
  out[off + 3] = 0;
  out[off + 4] = 0;
}

Either way the render pass calls drawIndexedIndirect straight against the buffer, and a slot whose instanceCount is 0 costs no vertex work. The trade-off between the two is about what stays stable: appending shrinks the draw list to just the survivors (ideal when the CPU has nothing per-item to do, as with terrain patches), while masking keeps a fixed slot per draw — so the CPU-side draw loop, material bind groups, and per-draw uniforms don't have to be reshuffled — and simply switches entries off.
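
The difference between the two strategies can be seen in a CPU simulation. The sketch below is illustrative only: in Taos both run in compute shaders, and appendSurvivors and maskSlots are hypothetical names. The visible array stands in for whatever the GPU test decided.

```typescript
// CPU simulation of the two fill strategies; not Taos code.

/** Append: survivors are compacted and a single instanceCount is bumped. */
function appendSurvivors(visible: readonly boolean[]): { patches: number[]; instanceCount: number } {
  const patches: number[] = [];
  visible.forEach((isVisible, index) => {
    if (isVisible) patches.push(index);
  });
  return { patches, instanceCount: patches.length };
}

/** Mask: one 5-word slot per draw; hidden draws keep their slot with instanceCount = 0. */
function maskSlots(indexCounts: readonly number[], visible: readonly boolean[]): Uint32Array {
  const args = new Uint32Array(indexCounts.length * 5);
  for (let i = 0; i < indexCounts.length; i++) {
    args[i * 5] = indexCounts[i];           // indexCount
    args[i * 5 + 1] = visible[i] ? 1 : 0;   // instanceCount
    // firstIndex, baseVertex, firstInstance stay 0, as in packIndirectArgs.
  }
  return args;
}
```

On the GPU the append order depends on which threads win the atomic increment, so the compacted list is not guaranteed to preserve input order the way this sequential simulation does; the masked layout has no such concern because slot i always belongs to draw i.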

27.6 Memory Management

Figure: Side-by-side comparison. The bad pattern allocates and frees per frame (sawtooth GC pressure); the good pattern keeps a pre-allocated scratch array plus a grow-only buffer pool that is reused frame-to-frame and never freed during gameplay.

Pre-Allocated Staging Arrays

Each pass owns the Float32Array / Uint32Array scratch buffers it needs for uniform uploads, allocated once at construction so the per-frame upload path never hits the GC:

// ── from src/renderer/render_graph/passes/geometry_pass.ts ──
private readonly _modelData = new Float32Array(32);
private readonly _cameraScratch = new Float32Array(CAMERA_UNIFORM_SIZE / 4);

Per-Draw Buffer Pooling

Per-draw model uniform buffers are grown on demand and never shrunk during gameplay, so a frame that drops back to fewer draws keeps the buffers around for the next spike:

// ── from src/renderer/render_graph/passes/geometry_pass.ts ──
private _ensurePerDrawBuffers(count: number): void {
  while (this._modelBuffers.length < count) {
    const mb = this._device.createBuffer({
      size: MODEL_UNIFORM_SIZE,
      usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
    });
    this._modelBuffers.push(mb);
    this._modelBindGroups.push(this._device.createBindGroup({
      layout: this._modelBgl,
      entries: [{ binding: 0, resource: { buffer: mb } }],
    }));
  }
}

This avoids allocation and deallocation churn when the draw count varies between frames; the matching bind groups are pooled alongside the buffers, so binding doesn't allocate either.

Transient Resource Pooling

Inside the render graph, transient textures and buffers are pulled from a descriptor-keyed free-list (PhysicalResourceCache) instead of being created and destroyed each frame. Persistent resources — shadow maps, TAA history, IBL data — are registered by stable string key and live across graph rebuilds:

// ── from src/renderer/render_graph/physical_resource_cache.ts ──
acquireTexture(desc: TextureDesc, usage: GPUTextureUsageFlags): GPUTexture {
  const key = textureKey(desc, usage);
  const pool = this._texturePool.get(key);
  let tex = pool?.pop();
  if (!tex) {
    tex = this._device.createTexture({ /* size, format, mips, usage */ });
  }
  this._liveTransients.push({ kind: 'tex', key, resource: tex });
  return tex;
}
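
The excerpt shows only the acquire side. A free list also needs a step that hands the live transients back once the frame is done. The sketch below shows one plausible shape for the full cycle: FreeListPool, releaseAll, and the create callback (standing in for device.createTexture) are assumptions for illustration, not the actual PhysicalResourceCache API.

```typescript
// Hypothetical descriptor-keyed free list; not the PhysicalResourceCache source.
class FreeListPool<T> {
  private readonly free = new Map<string, T[]>();
  private readonly live: { key: string; resource: T }[] = [];
  private readonly create: (key: string) => T;

  constructor(create: (key: string) => T) {
    this.create = create;
  }

  /** Reuses a pooled resource with the same key, or creates one on a miss. */
  acquire(key: string): T {
    const pooled = this.free.get(key)?.pop();
    const resource = pooled !== undefined ? pooled : this.create(key);
    this.live.push({ key, resource });
    return resource;
  }

  /** Returns everything acquired since the last call to its free list. */
  releaseAll(): void {
    for (const { key, resource } of this.live) {
      let list = this.free.get(key);
      if (!list) {
        list = [];
        this.free.set(key, list);
      }
      list.push(resource);
    }
    this.live.length = 0;
  }
}
```

After the first few frames the pool holds the peak number of resources per descriptor, and steady-state frames create nothing.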

The cache also memoizes bind groups and texture views, keyed by the underlying GPU object ids — so the same (layout, resource-set) doesn't allocate a fresh GPUBindGroup every frame.
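
Keying a memo by object identity can be done by assigning each GPU object a stable numeric id on first sight. The sketch below illustrates the idea with a WeakMap; ObjectIds and BindGroupMemo are hypothetical names, the create callback stands in for device.createBindGroup, and the real cache may derive its ids differently.

```typescript
// Hypothetical identity-keyed memo; not the Taos implementation.
class ObjectIds {
  private readonly ids = new WeakMap<object, number>();
  private next = 1;

  /** Returns a stable id for the object, assigning one on first use. */
  idOf(obj: object): number {
    let id = this.ids.get(obj);
    if (id === undefined) {
      id = this.next++;
      this.ids.set(obj, id);
    }
    return id;
  }
}

class BindGroupMemo<G> {
  private readonly ids = new ObjectIds();
  private readonly groups = new Map<string, G>();

  /** Returns the memoized group for (layout, resources), creating it once. */
  get(layout: object, resources: readonly object[], create: () => G): G {
    const key = [layout, ...resources].map((o) => this.ids.idOf(o)).join(":");
    let group = this.groups.get(key);
    if (group === undefined) {
      group = create();
      this.groups.set(key, group);
    }
    return group;
  }
}
```

A memo like this has to drop entries when an underlying resource is destroyed or recreated (for example on resize); that invalidation is omitted from the sketch.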

27.7 Summary

Performance optimization techniques used throughout Taos:

Profiling: CPU frame timing with a smoothed FPS readout; GPU timestamp queries are available for per-pass timing but are not yet wired in
Shader compilation: Lazy pipeline creation with caching to avoid stalls
Culling: CPU frustum culling (6-plane AABB), plus opt-in GPU Hi-Z occlusion culling (previous-frame depth pyramid → per-instance indirect draw)
Batching: GPU indirect draw with compute-shader culling — append (terrain) or mask (meshes) — for reduced CPU overhead
Memory management: Pre-allocated staging arrays, grow-only per-draw buffer pooling, and a descriptor-keyed transient resource cache that also memoizes bind groups and texture views

Further reading:

src/renderer/render_graph/passes/ — Per-pass buffer pre-allocation patterns
src/block/chunk.ts — Chunk culling
crafty/main.ts — Frame loop and performance tracking