Ouroboros: virtualized queues for dynamic memory management on GPUs. Martin Winter, Daniel Mlakar, Mathias Parger, and Markus Steinberger. 2020.
Dynamic memory allocation on a single-instruction, multiple-threads (SIMT) architecture, such as the Graphics Processing Unit (GPU), is challenging, and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU, but they do not scale well to different allocation sizes, as each size requires its own queue.
In this work, we propose Ouroboros, a virtualized queueing structure that manages dynamically allocatable data chunks whilst being built on top of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable user memory is managed in one of two ways: either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically: only currently needed queue chunks are held in memory, and freed-up queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA TITAN V against the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar or improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting.
Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.
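
To make the per-thread allocation model concrete, the following CUDA sketch uses the native device-side malloc and free, which is exactly the interface the abstract describes Ouroboros as replacing drop-in; the kernel and all names are illustrative and not taken from the paper.

```cuda
#include <cstdio>

// Per-thread allocation: every thread allocates its own buffer of varying
// size, uses it, and frees it. This sketch uses CUDA's native device-side
// malloc/free; a drop-in replacement such as Ouroboros would serve the same
// calls from its virtualized queues instead.
__global__ void perThreadAlloc(int* results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n = (tid % 7) + 1;                                   // varying sizes
    int* buf = static_cast<int*>(malloc(n * sizeof(int)));
    if (buf == nullptr) { results[tid] = -1; return; }       // heap exhausted
    for (int i = 0; i < n; ++i) buf[i] = tid + i;            // touch the memory
    results[tid] = buf[n - 1];
    free(buf);                                               // return memory for reuse
}

int main()
{
    const int N = 1024;
    int* d_results;
    cudaMalloc(&d_results, N * sizeof(int));
    perThreadAlloc<<<N / 256, 256>>>(d_results);
    cudaDeviceSynchronize();
    int h0 = 0;
    cudaMemcpy(&h0, d_results, sizeof(int), cudaMemcpyDeviceToHost);
    printf("thread 0 wrote %d\n", h0);
    cudaFree(d_results);
    return 0;
}
```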

Interactive Modeling of Cellular Structures on Surfaces with Application to Additive Manufacturing. Pascal Stadlbauer, Daniel Mlakar, Hans-Peter Seidel, Markus Steinberger, and Rhaleb Zayer. 2020.
The rich and evocative patterns of natural tessellations endow them with an unmistakable artistic appeal and structural properties which are echoed across design, production, and manufacturing. Unfortunately, interactive control of such patterns, as modeled by Voronoi diagrams, is limited to the simple two-dimensional case and does not extend well to freeform surfaces. We present an approach for direct modeling and editing of such cellular structures on surface meshes. The overall modeling experience is driven by a set of editing primitives which are efficiently implemented on graphics hardware. We feature a novel application for 3D printing on modern support-free additive manufacturing platforms. Our method decomposes the input surface into a cellular skeletal structure which hosts a set of overlay shells. In this way, material saving can be channeled to the shells while structural stability is channeled to the skeleton. To accommodate the available printer build volume, the cellular structure can be further split into moderately sized parts. Together with the shells, they can be conveniently packed to save on production time. The assembly of the printed parts is streamlined by a part numbering scheme which respects the geometric layout of the input model.
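
For context, the classical Voronoi model the abstract contrasts against reduces, in its simplest discrete form, to labeling every mesh vertex with its nearest seed. The CUDA kernel below is a hypothetical baseline sketch (Euclidean distance stands in for a proper surface metric); it is not one of the paper's editing primitives.

```cuda
// Baseline discrete Voronoi partition on a surface mesh: assign each vertex
// to its nearest seed. Squared Euclidean distance is used as a stand-in for
// the surface metric, which is one of the issues the paper sidesteps.
__global__ void nearestSeed(const float3* verts, int numVerts,
                            const float3* seeds, int numSeeds, int* label)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVerts) return;
    float best = 1e30f;
    int bestSeed = -1;
    for (int s = 0; s < numSeeds; ++s) {
        float dx = verts[v].x - seeds[s].x;
        float dy = verts[v].y - seeds[s].y;
        float dz = verts[v].z - seeds[s].z;
        float d = dx * dx + dy * dy + dz * dz;   // squared distance suffices
        if (d < best) { best = d; bestSeed = s; }
    }
    label[v] = bestSeed;                          // cell assignment per vertex
}
```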

Subdivision-Specialized Linear Algebra Kernels for Static and Dynamic Mesh Connectivity on the GPU. Daniel Mlakar, Martin Winter, Pascal Stadlbauer, Hans-Peter Seidel, Markus Steinberger, and Rhaleb Zayer. 2020.
Awarded the Eurographics Best Paper Award.
Subdivision surfaces have become an invaluable asset in production environments. While progress over recent years has allowed the use of graphics hardware to meet performance demands during animation and rendering, high performance is limited to immutable mesh connectivity scenarios. Motivated by recent progress in mesh data structures, we show how the complete Catmull-Clark subdivision scheme can be abstracted in the language of linear algebra. While this high-level formulation allows for a fully parallel implementation with significant performance gains, the underlying algebraic operations require further specialization for modern parallel hardware. Integrating domain knowledge about the mesh matrix data structure, we replace costly general linear algebra operations like matrix-matrix multiplication by specialized kernels. By further considering innate properties of Catmull-Clark subdivision, like the quad-only structure after refinement, we achieve an additional order of magnitude in performance and significantly reduce memory footprints. Our approach can be adapted seamlessly for different use cases, such as regular subdivision of dynamic meshes, fast evaluation for immutable topology, and feature-adaptive subdivision for efficient rendering of animated models. In this way, patchwork solutions are avoided in favor of a streamlined solution with consistent performance gains throughout the production pipeline. The versatility of the sparse matrix linear algebra abstraction underlying our work is further demonstrated by extension to other schemes such as √3 and Loop subdivision.
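
To illustrate the linear-algebra abstraction, one refinement step can be written as V' = S V, where S is the sparse subdivision matrix and V stores one control-point position per row. The generic CSR kernel below is a hedged sketch of the kind of baseline operation the paper replaces with subdivision-specialized kernels; all names are ours, not the paper's.

```cuda
// One refinement step V' = S * V with S in CSR format and V holding xyz
// positions. Each thread computes one refined vertex as a weighted average
// of coarse control points, i.e., one sparse row-times-dense-matrix product.
__global__ void csrSpMM3(int rows, const int* rowPtr, const int* colIdx,
                         const float* vals, const float3* V, float3* Vout)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float3 acc = make_float3(0.f, 0.f, 0.f);
    for (int e = rowPtr[r]; e < rowPtr[r + 1]; ++e) {
        float w = vals[e];                        // stencil weight
        float3 p = V[colIdx[e]];                  // coarse control point
        acc.x += w * p.x;
        acc.y += w * p.y;
        acc.z += w * p.z;
    }
    Vout[r] = acc;                                // refined vertex position
}
```

Exploiting the quad-only structure after the first refinement, as the paper does, lets such a kernel drop the generic CSR indirection almost entirely, which is where the reported order-of-magnitude gains come from.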

spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis. Mathias Parger, Martin Winter, Daniel Mlakar, and Markus Steinberger. 2019.
Sparse general matrix-matrix multiplication on GPUs is challenging due to the varying sparsity patterns of sparse matrices. Existing solutions achieve good performance for certain types of matrices, but fail to accelerate all kinds of matrices in the same manner. Our approach combines multiple strategies with dynamic parameter selection to dynamically choose and tune the best-fitting algorithm for each row of the matrix. This choice is supported by a lightweight, multi-level matrix analysis, which carefully balances analysis cost against expected performance gains. Our evaluation on thousands of matrices with various characteristics shows that we outperform all currently available solutions on 79% of all matrices with more than 15k intermediate products, and that we achieve the second-best performance on another 15%. For these matrices, our solution is on average 83% faster than the second-best approach and up to 25X faster than other state-of-the-art GPU implementations. Using our approach, applications can expect consistently high performance independent of the matrices they work on.
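
A common ingredient of such row-wise analysis is a cheap upper bound on the work per output row: for C = A * B, row i of C touches exactly the rows of B selected by the non-zeros of A's row i. The CUDA sketch below computes that bound; it only illustrates the binning idea, and spECK's actual multi-level analysis is more involved.

```cuda
// Upper bound on intermediate products per output row of C = A * B:
// sum the sizes of all rows of B referenced by row r of A. Rows can then
// be binned by this estimate and each bin handled by a fitting strategy.
__global__ void rowProductUpperBound(int rowsA,
                                     const int* aRowPtr, const int* aColIdx,
                                     const int* bRowPtr, int* workPerRow)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rowsA) return;
    int products = 0;
    for (int e = aRowPtr[r]; e < aRowPtr[r + 1]; ++e) {
        int k = aColIdx[e];                       // column in A = row in B
        products += bRowPtr[k + 1] - bRowPtr[k];  // that row's non-zeros
    }
    workPerRow[r] = products;                     // basis for row binning
}
```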

Adaptive sparse matrix-matrix multiplication on the GPU. Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, and Markus Steinberger. 2019.
In the ongoing efforts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM) has received considerably less attention than sparse matrix-vector multiplication (SpMV). While both are equally important, this disparity can be attributed mainly to the additional formidable challenges raised by SpGEMM. In this paper, we present a dynamic approach for addressing SpGEMM on the GPU. Our approach works directly on the standard compressed sparse rows (CSR) data format. In comparison to previous SpGEMM implementations, our approach guarantees a homogeneous, load-balanced access pattern to the first input matrix and improves memory access to the second input matrix. It adaptively re-purposes GPU threads during execution and maximizes the time for which the efficient on-chip scratchpad memory can be used. Adhering to a completely deterministic scheduling pattern guarantees bit-stable results across repeated executions, a property missing from other approaches. Evaluation on an extensive sparse matrix benchmark shows that our approach is the fastest SpGEMM implementation for highly sparse matrices (80% of the test set). When bit-stable results are sought, our approach is the fastest across the entire test set.
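
The row-wise formulation underlying CSR-based SpGEMM is Gustavson's: C(i,:) is the sum over k of A(i,k) * B(k,:). The sketch below shows that formulation in its plainest form, accumulating each output row densely with one thread per row; it is only a conceptual baseline, whereas the paper keeps the output sparse and re-purposes threads adaptively.

```cuda
// Gustavson-style SpGEMM baseline: each thread owns one output row and
// scatters scaled rows of B into it. The dense accumulator and the fixed
// one-thread-per-row schedule are deliberate simplifications; note that the
// fixed traversal order makes the floating-point result deterministic.
__global__ void rowSpGEMMDense(int rowsA, int colsB,
                               const int* aRowPtr, const int* aColIdx,
                               const float* aVals,
                               const int* bRowPtr, const int* bColIdx,
                               const float* bVals,
                               float* Cdense)     // rowsA x colsB, zeroed
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rowsA) return;
    float* cRow = Cdense + (size_t)r * colsB;
    for (int e = aRowPtr[r]; e < aRowPtr[r + 1]; ++e) {
        int k = aColIdx[e];
        float a = aVals[e];
        for (int f = bRowPtr[k]; f < bRowPtr[k + 1]; ++f)
            cRow[bColIdx[f]] += a * bVals[f];     // scatter into own row
    }
}
```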

Layered fields for natural tessellations on surfaces. Rhaleb Zayer, Daniel Mlakar, Markus Steinberger, and Hans-Peter Seidel. 2018.
Mimicking natural tessellation patterns is a fascinating multi-disciplinary problem. Geometric methods aiming at reproducing such partitions on surface meshes are commonly based on the Voronoi model and its variants, and are often faced with challenging issues such as metric estimation, geometric and topological complications, and, most critically, parallelization. In this paper, we introduce an alternate model which may be of value for resolving these issues. We drop the assumption that regions need to be separated by lines. Instead, we regard region boundaries as narrow bands and we model the partition as a set of smooth functions layered over the surface. Given an initial set of seeds or regions, the partition emerges as the solution of a time-dependent set of partial differential equations describing concurrently evolving fronts on the surface. Our solution does not require geodesic estimation, elaborate numerical solvers, or complicated bookkeeping data structures. The cost per time-iteration is dominated by the multiplication and addition of two sparse matrices. Extension of our approach in a Lloyd's algorithm fashion can be easily achieved, and the extraction of the dual mesh can be conveniently performed in parallel through matrix algebra. As our approach relies mainly on basic linear algebra kernels, it lends itself to efficient implementation on modern graphics hardware.
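
A single time-iteration can be sketched as a sparse product that diffuses every region's layer over the mesh, followed by a per-vertex normalization that keeps the layers competing. The CUDA kernel below is an illustrative stand-in under assumed names and an assumed update rule; the paper's actual PDE discretization and operator differ.

```cuda
// One illustrative time-iteration of concurrently evolving fronts: apply a
// sparse mesh operator (CSR) to each of the numRegions layers stored
// row-major in U (numVerts x numRegions), then normalize per vertex so the
// non-negative layers sum to one. The sparse product dominates the cost,
// matching the per-iteration cost profile described in the abstract.
__global__ void evolveFields(int numVerts, int numRegions,
                             const int* rowPtr, const int* colIdx,
                             const float* vals,
                             const float* U, float* Unext)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVerts) return;
    float sum = 0.f;
    for (int f = 0; f < numRegions; ++f) {
        float acc = 0.f;
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
            acc += vals[e] * U[(size_t)colIdx[e] * numRegions + f];
        Unext[(size_t)v * numRegions + f] = acc;
        sum += acc;                               // assumes non-negative operator
    }
    if (sum > 0.f)
        for (int f = 0; f < numRegions; ++f)
            Unext[(size_t)v * numRegions + f] /= sum;  // layers keep competing
}
```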

faimGraph: high performance management of fully-dynamic graphs under tight memory constraints on the GPU. Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, and Markus Steinberger. 2018.
In this paper, we present a fully-dynamic graph data structure for the Graphics Processing Unit (GPU). It delivers high update rates while keeping a low memory footprint using autonomous memory management directly on the GPU. The data structure is fully-dynamic, allowing not only for edge but also vertex updates. Performing the memory management on the GPU allows for fast initialization times and efficient update procedures without additional intervention or reallocation procedures from the host. Our optimized approach performs initialization completely in parallel, up to 300X faster than previous work. It achieves up to 200 million edge updates per second for sorted and unsorted update batches, up to 30X faster than previous work. Furthermore, it can perform more than 300 million adjacency queries and millions of vertex updates per second. Thanks to efficient memory management techniques such as a queuing approach, currently unused memory is reused later by the framework, permitting the storage of tens of millions of vertices and hundreds of millions of edges in GPU memory. We evaluate algorithmic performance using a PageRank and a Static Triangle Counting (STC) implementation, demonstrating the suitability of the framework even for memory-access-intensive algorithms.
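
The queuing approach to memory reuse can be sketched as a free list of page indices guarded by an atomic counter: allocation pops an index, deallocation pushes one back. This is a deliberately simplified sketch (it assumes pops and pushes never race on the same slot and that the list never underflows while a push is in flight), not faimGraph's actual, more careful queue.

```cuda
// Simplified page free list: freeList holds indices of unused pages and
// *freeCount is the current stack top. Slot ownership is claimed via
// atomics, so many threads can allocate and release pages concurrently.
__device__ int allocPage(int* freeList, int* freeCount)
{
    int slot = atomicSub(freeCount, 1) - 1;       // claim the topmost slot
    if (slot < 0) {
        atomicAdd(freeCount, 1);                  // undo: no page available
        return -1;
    }
    return freeList[slot];                        // reuse a previously freed page
}

__device__ void releasePage(int* freeList, int* freeCount, int page)
{
    int slot = atomicAdd(freeCount, 1);           // reserve the next slot
    freeList[slot] = page;                        // hand the page back
}
```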