CUDA Boids

Sept 2025

CUDAOpenGL

500,000 boids simulated in real time on GPU

Overview

A GPU-accelerated boid flocking simulation built in CUDA and rendered with OpenGL, completed for CIS 5650 (GPU Programming and Architecture) at Penn. The simulation implements Craig Reynolds' three flocking rules: cohesion (move toward the center of mass of neighbors), alignment (match neighbor heading), and separation (avoid crowding). Three distinct approaches are implemented and benchmarked, ranging from a brute-force O(N²) pass to a memory-coherent uniform grid that sustains real-time frame rates at 500,000 boids.

Technical Details

Naive Approach (O(N²))

The baseline kernel (kernUpdateVelocityBruteForce) assigns one thread per boid. Each thread iterates over every other boid, accumulates the three rule contributions, and writes the updated velocity. A second kernel then advances positions. This is straightforward to implement but the per-frame cost grows quadratically: at 500,000 boids the frame time reaches ~1,469 ms, making interactive rates impossible at large scales.

Uniform Scattered Grid

Because boids only care about neighbors within a fixed search radius, the simulation space is partitioned into a uniform grid. Each cell is sized to 2× the rule radius, which guarantees that only the 8 immediately surrounding cells need to be searched per boid.

Each frame a kernel maps every boid to its grid cell index. thrust::sort_by_key sorts boids by cell index, and a second kernel records the start and end indices for each occupied cell. During the velocity update, each thread identifies its boid's cell, looks up the 8 neighbors, and checks only the boids stored in those cells. The search space drops from N to a small constant per boid, cutting frame time from 1,469 ms to ~28 ms at 500,000 boids.

Coherent Grid (Memory Layout Optimization)

The scattered grid still requires indirect memory accesses: each thread fetches a neighbor boid index from the sorted index array, then jumps to that boid's position and velocity in unsorted buffers. Those random reads thrash the cache.

The coherent variant adds a preprocessing step each frame that physically reorders the position and velocity buffers to match the sorted grid order. After reordering, neighboring boids in grid space are also adjacent in memory, so the velocity update kernel reads positions and velocities in a linear, cache-friendly pattern. At 500,000 boids this brings frame time from ~28 ms down to ~4.9 ms, a 5.8× additional speedup over the scattered grid with no algorithmic change.

Performance

Benchmarked on an RTX 4070 Laptop (8 GB) with a fixed block size of 128 threads. Frame times in milliseconds (lower is better). Bold entries mark the fastest approach at each count.

Boid Count	Naive (ms)	Scattered (ms)	Coherent (ms)
1,000	0.105	0.141	0.241
5,000	0.550	0.233	0.279
50,000	14.604	0.963	0.941
100,000	58.832	2.260	1.529
250,000	366.550	7.963	2.431
500,000	1,469.482	28.458	4.900

At very small counts (1,000 boids) the naive method wins because the cost of computing grid indices and running thrust::sort_by_key exceeds the savings from reduced neighbor search. That crossover disappears by 5,000 boids, where the grid pays for itself. Block sizes below 32 show measurable degradation because sub-warp blocks leave GPU lanes idle; sizes from 32 upward perform equivalently.