Summer/Fall 2021

For the times I have been able to work, I have managed to make a few good things. Such as improving the performance of incremental sculpting items from a few hundred operations a second, to around 10000 per second on a modern and beefy machine. (An operation being, carving one shape from another).

Memory pools had the most impact on those numbers, and the better alignment and smaller memory allocations in general. Allocating most things on the stack and completely removing all recursive declarations and implementations. Even tree destructions utilise Morris traversal to clean up the tree data on exit. Typeless memory pools were also built so that you allocate from a pool by sizeof(T) instead of a particular type, so all types of shared size can share the pool. And each thread in the thread-pool has dedicated memory pools to remove any synchro contention from allocations, since it is quite ok to move data between pools that are all destroyed in unison.

Bounding Volume Hierarchies to assist with large tree construction, so that when partitioning a tree, you can very quickly pass a plane through a tree, and the BVH broad phase split will return you two new trees to use for the next iteration. A lot faster to work with data in such a way, and a lot easier to create custom splitting heuristic to find a more performant setup for the data. A derivation of Naylors Good-Tree heuristic is employed in a particular fashion using faces of the hypercube as projection vectors to isolate and converge on convex groups.

Internal faces of the trees are merged and simplified during build and merge, which reduces the cost of iterative operations by a large amount.

Trees are incrementally reduced, so that empty areas in the tree are clipped away and redundant nodes are removed. And basically any operation that can reduce the size of the tree in an inexpensive way is the most effective way to reduce operational cost.

The polygons in the trees are kept non-overlapping so that to not just keep the internal polys to able to identify dead branches in the tree, but actually keep them as the final export data and to reuse that distributed workload when you need to rebuild the trees to get rid of all the garbage collected in it.

Threading manages to distribute a large array of operations across different threads so that they are split geometrically and only focus on subspace of the problem, then just re-attached at the end of the process, where each thread piece ends up as a node in a parent tree. How to evenly distribute those actions, is to do a very minor work-distribution analysis and try to find a sum of work of each thread that is as close to the median as possible. So, to try and keep all the threads as busy for the whole run.

The threading setup I have for a large array of operations, works just as well for few operations on large models. Since the trees at the start are build with awareness of the the accumulated contributions of polygons/shapes of all the input models. Having the same base partitioning layout so sub-trees can easily be assigned to threads spatially. This works out quite well, if you are careful to ensure any shared partitioner that separates that geometry for each thread is not coincident with any input polygon. So, there exist no ambiguities between threads. And it simplifies any threading heuristics and moves those pathways to higher ground so any lower level operations have no special branches anymore.

All operations are contained and generic, so what can be for each thread, can be for each machine, since there are no hard dependencies between threads. In fact, there are no locks except for the ones used internally in the threading pool itself. I have workers, services and a server. Using that good old NNG(nanomsg) library to communicate between threads, processes and machines. And every component in the communication setup has proc utilities to spawn and test each part separately and run any part as headless item. The server can compute the spatial devision to assign to each thread, process or machine. Then each thread, process or machine, will return the sub-part of the problems solved, and the server attaches the parts into a final result.

The data returned is all in flatbuffers, so a result from the building process hands you a buffer and a meta-data buffer, which is a flexbuffer of json like description of the data schema and buffer layout, so you can very easily move the buffer into render buffers without much or any processing at all. Using the new Unity MeshAPI, I was able to reduce mesh creation for models with 300k polys to a few ms, completely bypassing the flatbuffers C# API and just using the unsafe pointer utilities and native arrays to convert a native buffer to a mesh. And also, now the point data for the mesh items are the same as expected for import by the building system, the system can easily re-ingest its creations for more extreme testing practices.

I altered the way I construct the planes so that it tries to match the input accuracy as closely as possible. The average distance error from input cartesian points to linearly solved with planar equations are within e-9. This can be configured to exchange accuracy for performance. I also created a fixed point utility to compute a 128bit reciprocal value for the inverse determinant, so that I can be sure to extract the cartesian coordinates from the planar equations while computing the determinant values in extended value forms with the clang ExtInt numerics. Like with the old ways of doing division, you create a very large scaled reciprocal value, which you then just multiply with instead of dividing, then scale back down. I encode the large exponent data into a double precision format by truncating it down to the most significant 54 bits. Using proper rounding rules and all to make sure that numbers have the least amount of error possible. And, at least with the extended large integers, I ensure that all signs are correctly represented, and any error introduced does not affect that. But these numbers are only computed at the very end, when producing the final mesh. I also enhanced my plane encoding that is used to encode planes created by individual threads or to encode simple planes that lie on one of the hypercube normals. Minor metadata and a truncated double precision format for the distance value, all into a 64bit index value.

The most expensive part of the operation now, is the reconstruction bit, but that is being solved as I am typing. Instead of using the old and heavy voting based, linear solving mechanics created by Murali and Funkhauser, I am going to use the Fast Winding Numbers tricks for assigning solid values per solid cell, according to each cells mass centre. So, re-using the BVH structure used for building the trees, with dipole information for the leaf triangles in internal trees to approximate the winding numbers, so that I can more cheaply assign solid classifications to the tree after it has been built, or transformed. I believe the performance of that should allow for very fast corrections for the tree structures.

With the whole pipeline becoming blazingly fast, I am moving back into editor land where I can create huge instance based worlds with my nested tiles, add robust geometric routines and large manifold geometric pieces till my stomach starts to turn. In addition I am tempted to use the new OurMachinery engine for the front end, so I can have full source access to the renderer.

Ta ta

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s