My last day at ControlZee was September 30th, and my first day at CCP is tomorrow morning. It’ll be a welcome change to actually work on-site and be around people again.
During my two weeks between jobs, I dusted off my old geometry project and fixed a bunch of things that had been left to rot and were still relying on some deprecated LLVM features. Those turned out to be fairly easy to hammer back into place.
Back in 2019 I was experimenting with various caching mechanisms for the integer-based orient logic, specifically caching the sub-determinants of the 4×4 determinant so that an orient call could be reduced to four multiplications and three additions. Even though those few operations involved very large integers, they were still much cheaper than computing the full determinant (28 multiplications, 17 additions). The main issue, however, was the sheer size of each cached structure: the memory footprint could easily cancel out the arithmetic savings if memory management wasn’t handled gracefully.
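To make that reduction concrete, here’s a minimal sketch in C++. The names and layout are mine, not the project’s, and `__int128` (a GCC/Clang extension) stands in for the wider fixed-integer types a real kernel needs for the worst-case products. The idea: expand the 4×4 determinant along the query plane’s row once, cache the four cofactors, and every later orient call against that cached triple is just four multiplications and three additions.

```cpp
#include <cstdint>

using i128 = __int128; // stand-in; the real kernel needs wider types

// Each plane is a*x + b*y + c*z + d = 0 with exact integer coefficients.
struct Plane { int64_t a, b, c, d; };

// 3x3 determinant, written out: 9 multiplications, 5 additions/subtractions.
static i128 det3(i128 a, i128 b, i128 c,
                 i128 d, i128 e, i128 f,
                 i128 g, i128 h, i128 i) {
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g);
}

// The four cofactors of the 4x4 determinant with respect to the query row,
// computed once per cached triple of planes.
struct OrientCache { i128 c0, c1, c2, c3; };

OrientCache buildCache(const Plane& p, const Plane& q, const Plane& r) {
    // Delete the query row and each column in turn; fold in the
    // alternating cofactor signs.
    return {
        +det3(p.b, p.c, p.d, q.b, q.c, q.d, r.b, r.c, r.d),
        -det3(p.a, p.c, p.d, q.a, q.c, q.d, r.a, r.c, r.d),
        +det3(p.a, p.b, p.d, q.a, q.b, q.d, r.a, r.b, r.d),
        -det3(p.a, p.b, p.c, q.a, q.b, q.c, r.a, r.b, r.c),
    };
}

// With the cache in hand, orienting a fourth plane s against the point
// defined by (p, q, r) is 4 multiplications and 3 additions.
int orient(const OrientCache& k, const Plane& s) {
    i128 det = s.a * k.c0 + s.b * k.c1 + s.c * k.c2 + s.d * k.c3;
    return (det > 0) - (det < 0); // sign only; the magnitude is irrelevant
}
```

The catch, as noted above, is that each cached cofactor is itself a very wide integer, which is exactly where the memory-footprint problem comes from.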
Since then, two papers have explored similar ideas: Cherchi et al. (2020) and Nehring-Wirxel et al., “Fast Exact Booleans” (2021). It’s not surprising: anyone who’s ever implemented a 4×4 determinant quickly realizes that its intermediate values are ideal candidates for caching. And yes, these approaches can dramatically reduce computation, but I feel those papers gloss over the less flattering aspects, particularly the memory overhead and data-management complexity that come with storing large integer intermediates.
My solution takes a different route. Instead of caching the full 4D determinant values, I use fixed-integer arithmetic to compute the exact integer cell that bounds the intersection of three integer planes. This produces a unit-cube AABB (axis-aligned bounding box) around the point where the three planes intersect. From that, I can construct tightly fitting and computationally cheap AABB trees for all plane-based polygons in memory. These cell-based AABBs are then used directly in the tree-building process, where each new plane cuts through existing AABB trees, all of which already carry accurate integer-bounded polygon parts.
Because my input grid is limited to 30-bit integer coordinates, supporting the full range of the plane integer width, the derived plane coefficients (a, b, c, and d) fit comfortably within 64- and 128-bit integers. When these integer planes intersect, the true intersection often lies at a fractional location, but the corresponding 4×4 determinant still yields intermediate values wide enough to preserve about 53 bits of meaningful precision. In practice, that means the absolute positional error is less than one unit in integer space, giving a tight and reliable window in which the true intersection must lie. This makes it straightforward to construct a small axis-aligned bounding box around the intersection point with exact integer bounds, keeping everything precise and efficient without resorting to massive determinant caches.
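One way to derive such a unit cell, sketched below with my own simplified names (not necessarily the exact scheme in my kernel, and again with `__int128` as a stand-in for wider types): solve the three-plane system with Cramer’s rule in exact integer arithmetic, then take the floor of each coordinate’s numerator over the common denominator. The intersection point then provably lies inside the resulting unit-width integer box.

```cpp
#include <cstdint>

using i128 = __int128;

// a, b, c fit in ~64 bits for 30-bit input coordinates;
// d can be wider in the real kernel.
struct Plane { int64_t a, b, c, d; }; // a*x + b*y + c*z + d = 0

struct Cell { int64_t lo[3], hi[3]; }; // unit-width integer AABB

static i128 det3(i128 a, i128 b, i128 c,
                 i128 d, i128 e, i128 f,
                 i128 g, i128 h, i128 i) {
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g);
}

// Exact floor division (C++ integer division truncates toward zero).
static i128 floorDiv(i128 n, i128 d) {
    i128 q = n / d, r = n % d;
    return (r != 0 && ((r < 0) != (d < 0))) ? q - 1 : q;
}

// Unit integer cell containing the intersection point of three planes.
// Assumes the planes meet in a single point (denominator != 0).
Cell intersectionCell(const Plane& p, const Plane& q, const Plane& r) {
    i128 den = det3(p.a, p.b, p.c, q.a, q.b, q.c, r.a, r.b, r.c);
    // Cramer's rule: replace one column at a time with the constants (-d).
    i128 num[3] = {
        det3(-(i128)p.d, p.b, p.c, -(i128)q.d, q.b, q.c, -(i128)r.d, r.b, r.c),
        det3(p.a, -(i128)p.d, p.c, q.a, -(i128)q.d, q.c, r.a, -(i128)r.d, r.c),
        det3(p.a, p.b, -(i128)p.d, q.a, q.b, -(i128)q.d, r.a, r.b, -(i128)r.d),
    };
    Cell cell;
    for (int i = 0; i < 3; ++i) {
        // floor(num/den) and floor(num/den)+1 bracket the true coordinate.
        cell.lo[i] = (int64_t)floorDiv(num[i], den);
        cell.hi[i] = cell.lo[i] + 1;
    }
    return cell;
}
```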
Using these unit-width cell bounds, we can short-circuit most of the orient calls by simply testing each plane against its corresponding AABB. In my tests, this fast path handles over 90% of all orientation queries. The remaining ~10% are the tricky cases, where the cell straddles the plane and a full determinant evaluation is unavoidable. For those, I cache the sub-determinants so they can be reused for much cheaper evaluation, while keeping the amount of cached data, and thus the memory footprint, well under control. The cache has a fixed memory cap, which keeps things lightweight compared to the “full-banana” approach of storing every intermediate in those enormous numeric containers.
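The fast path itself can be the classic plane-versus-box classification: rather than evaluating all eight corners, pick the two corners that minimize and maximize the plane expression (chosen per axis by the sign of the normal component) and check only their signs. A sketch, under the same assumed `Plane`/`Cell` layouts as above:

```cpp
#include <cstdint>

using i128 = __int128;

struct Plane { int64_t a, b, c, d; };
struct Cell { int64_t lo[3], hi[3]; };

// Classify a cell against a plane: +1 if it lies entirely on the positive
// side, -1 if entirely on the negative side, 0 if it straddles the plane
// (the ~10% of cases that fall through to the exact determinant path).
int classifyCell(const Plane& p, const Cell& c) {
    const int64_t n[3] = { p.a, p.b, p.c };
    i128 minVal = p.d, maxVal = p.d;
    for (int i = 0; i < 3; ++i) {
        // The sign of each normal component decides which corner
        // minimizes and which maximizes a*x + b*y + c*z + d.
        if (n[i] >= 0) {
            minVal += (i128)n[i] * c.lo[i];
            maxVal += (i128)n[i] * c.hi[i];
        } else {
            minVal += (i128)n[i] * c.hi[i];
            maxVal += (i128)n[i] * c.lo[i];
        }
    }
    if (minVal > 0) return +1; // whole cell strictly above the plane
    if (maxVal < 0) return -1; // whole cell strictly below the plane
    return 0;                  // straddles: fall back to the exact orient
}
```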
It’s worth noting that the 2021 paper also made an effort to manage memory usage by generating these cached 4D points only in batches. Their system would create temporary points on demand, for example, when clipping an entire mesh or processing a large batch of intersections, and then recycle those buffers once the operation completed. This avoided the need to store every point persistently in a spatial structure, and it kept memory growth under control. Still, it added a fair amount of boilerplate to coordinate the batching, whereas my approach sidesteps that entirely by only caching the few points that actually fall into those “both” cases, where a cell lies on both sides of a plane.
So, with the cell-fitting approach, the orient overhead dropped to roughly one-tenth of what it used to be. Before I added the cell fitting and sub-determinant caching, the orient operations alone consumed around 30% of the runtime. The rest was spent managing and updating the spatial trees: building them, traversing them, and applying every shortcut I could find to reduce unnecessary classification work.
Of course, that also means that even if I were to cut the computational overhead down to less than 1% of what it once was, I’d only be shaving off about 29% of the total runtime. In other words, the math was never the whole problem. The real cost now lies in the bookkeeping: managing spatial trees, merging structures, and all the orchestration around those supposedly “cheap” orientation tests.
That’s why I’m planning to extend this work to full mesh arrangements, using the same integer kernel together with the hybrid sub-determinant caching and cell-bound approach, to see how much more performance I can squeeze out of that remaining 70% of the time slice. Sounds like a fair challenge, right? With, of course, some SDF-generation magic for creating dynamic input meshes from simple expressions.
I still have my rather crude test case of 10,366 dodecahedra clipping the Stanford Bunny, which I plan to revisit so it more closely mirrors the setups in the published papers. Even so, that entire operation completes in under a second on my four-year-old 32-core AMD workstation. The other benchmark scenes, the old Gilbert Bernstein examples like the heatsink and random boxes, run even faster, typically in the 15–35 ms range, including tree construction and the subsequent merging phase.
There are plenty of other optimizations I’ve implemented, such as fully lock-free threading and dynamic kernel scaling, but those will have to wait for another day. It’s getting late, and a new job awaits me in the morning.
Wish me luck,
tata!