WebGPU Renderer

Oct 2025

WebGPUWGSLTypeScript

Overview

Rendering hundreds of dynamic lights in real time is expensive. The naive approach checks every light for every fragment on every draw call, and frame time blows up linearly with light count. I wanted to solve that problem entirely in the browser, with no native GPU plugins, using the new WebGPU API. The result is two complete rendering pipelines written in WebGPU and WGSL: Forward Plus and Clustered Deferred.

Both pipelines share a compute-shader clustering stage that partitions the view frustum (the pyramid-shaped region the camera can see) into a 3D grid and assigns point lights to each cell. Fragment shaders then query only the lights for their cell rather than iterating the full scene light list, cutting per-fragment work from O(lights) to O(lights per cluster).

At 1,000 lights, the clustered pipelines cut frame time by ~90% compared to the naive forward baseline. Clustered Deferred pulls further ahead under heavy overdraw since lighting only touches the final visible surface stored in the G-buffer (textures holding surface properties like position and normals).

Technical Details

Light Clustering (Compute)

The view frustum is partitioned into a 16x10x32 grid of clusters along screen X, screen Y, and depth Z. Each cluster is a frustum-shaped volume aligned with the camera, not an axis-aligned box. Screen XY extents of each tile come from pixel coordinates unprojected through the inverse projection matrix into view space. Z boundaries follow an exponential distribution so near slices are thin (high precision) and far slices are thick, matching how perspective compression spreads geometry across depth.

Exponential (log) depth slicing: thin near, thick far

Uniform depth slicing: equal-width slabs

A compute shader tests each light against every cluster. Each light's view-space sphere is tested against the tile's bounding volume; lights that intersect are appended to a per-cluster index list in a flat storage buffer. This buffer is written before any rasterization begins each frame.

Frustum-aligned cluster volumes in Sponza

16x10x32 cluster grid colored by light count

16x10x32: each tile colored by assigned light count

Forward Plus Pipeline

After the clustering pass, the scene draws in a single forward pass. Each fragment computes its cluster index from its screen-space pixel coordinate and hardware depth value: XY indices come from dividing pixel position by tile size, and the Z index is derived from the same log-depth formula used during clustering. The fragment shader then reads only the lights in that cluster's list and accumulates Lambertian contributions and distance attenuation for each one. Lights outside the cluster are never touched.

I built Forward Plus first because it maps closely to standard forward shading, which made it a good baseline for understanding where time was actually spent. It also has a real advantage: materials can have arbitrary per-fragment complexity since the full surface context is available at shade time, not reconstructed from a G-buffer. Comparing it against Clustered Deferred later showed me exactly where that flexibility costs frame time under heavy overdraw.

ComplexityO(P × L_cluster), where L_cluster « L — cost scales with lights per cluster, not total scene lights

Clustered Deferred Pipeline

The deferred variant splits geometry and lighting across two passes. The geometry pass renders the entire scene and writes three render targets into the G-buffer (textures storing surface properties like position and normals): albedo (RGBA8), view-space normals encoded to [0,1] range (RGBA8), and world-space position (RGBA32F). No lighting is computed here. After the clustering compute pass runs, a fullscreen triangle pass reads those G-buffer textures by pixel coordinate, recovers albedo, normal, and world position, resolves the cluster index with the same log-depth formula, and accumulates lighting from the cluster's light list.

Normals (RGBA8)

Albedo (RGBA8)

Depth (RGBA32F)

The payoff over Forward Plus is eliminating overdraw, where fragments are drawn and then overwritten before reaching the framebuffer. In a complex scene, the same screen pixel may be rasterized several times as geometry overlaps in depth. Standard forward shading runs the full lighting loop for every one of those discarded fragments. Deferring all lighting to a single fullscreen pass over the G-buffer means each pixel is lit exactly once regardless of scene depth complexity. Because lighting cost scales with screen pixels rather than draw calls, adding more geometry or increasing light counts hurts Clustered Deferred far less than it hurts Forward Plus.

ComplexityO(P × L_cluster) — same asymptotic cost as Forward Plus, but each pixel is shaded exactly once regardless of overdraw, improving GPU cache coherence under heavy geometry

Performance

Clustered vs. Naive Forward

I measured frame time with Chrome's built-in WebGPU timestamp queries, averaged over several seconds at each light count on the Sponza scene. The naive baseline checks every light for every fragment on every draw call, so frame time grows as O(P × L), linearly with both pixel count and total light count.

At 1,000 point lights the clustered pipelines reduce frame time by ~90% compared to the naive forward baseline. Clustered Deferred continues to outperform Forward Plus at high light counts because its lighting pass touches each screen pixel exactly once in a single fullscreen draw. Forward Plus still shades each fragment on every draw call, so overdraw multiplies the lighting cost. The G-buffer approach breaks that coupling entirely.