Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization

¹Roblox    ²Carnegie Mellon University    ³Stanford University

We propose Octree-based Adaptive Tokenization (OAT), which dynamically allocates tokens based on shape complexity. Our approach achieves better reconstruction quality with fewer tokens on average (439 compared to 512 on the full test set) by allocating more tokens to complex shapes and fewer to simple ones.

Abstract

Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode every shape into a fixed-size set of tokens, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization (OAT), a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. At a similar token length, our method produces significantly higher-quality shapes. When combined with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.

Generation Results

Input text prompts (each was paired with a generated shape image):

"A pirate ship with cannons"

"A pair of noise-cancelling headphones"

"A 3D model of a wizard hat"

"A fluffy cute dog"

"A dog warrior"

"A 3D model of a pigeon"

"A wooden ship with mast"

"A 3D model of a space rocket"

"A 3D model of a locomotive"

"A 3D model of a car"

"The imperial state crown of England"

"An intricate ring with a gem"

Adaptive Octree

Traditional octree construction subdivides each octant that contains any mesh element. This construction always subdivides to the maximum depth (set to 6 in this example), producing a similar number of nodes for simple (top) and complex (middle) shapes. In contrast, our approach terminates subdivision when the local geometry is simple (e.g., a plane), yielding an adaptive octree that better reflects shape complexity. We show that our adaptive octree construction can reduce the number of nodes by 50% compared to the traditional octree construction.
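The difference between the two constructions can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Node`, `quadric_error`, `build`, and `count_nodes`, the tolerance `tol`, and the unit-cube domain are all assumptions made for illustration. As a simplified stand-in for the quadric-error criterion, the sketch evaluates the mean squared distance from the centroid of a cell's samples to each sample's tangent plane, rather than solving for the error-minimizing point.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    lo: tuple          # minimum corner of the cubic cell
    size: float        # edge length of the cell
    depth: int
    children: list = field(default_factory=list)

def quadric_error(points, normals):
    """Mean squared distance from the sample centroid to each sample's tangent plane.
    Simplified stand-in (assumption): the error is evaluated at the centroid only."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    err = 0.0
    for p, nrm in zip(points, normals):
        d = sum(nrm[k] * (c[k] - p[k]) for k in range(3))
        err += d * d
    return err / n

def build(points, normals, lo=(0.0, 0.0, 0.0), size=1.0, depth=0,
          max_depth=6, tol=1e-6, adaptive=True):
    """Build an octree over surface samples inside the cube [lo, lo + size]^3."""
    node = Node(lo, size, depth)
    if depth == max_depth:
        return node
    # Adaptive variant: stop early when the local geometry is nearly planar.
    if adaptive and quadric_error(points, normals) <= tol:
        return node
    half = size / 2
    buckets = {}
    for p, nrm in zip(points, normals):
        # Octant index per axis, clamped so boundary samples stay inside the cell.
        key = tuple(max(0, min(1, int((p[k] - lo[k]) / half))) for k in range(3))
        pts, nrms = buckets.setdefault(key, ([], []))
        pts.append(p)
        nrms.append(nrm)
    # Only occupied octants become children, in both variants.
    for key in sorted(buckets):
        pts, nrms = buckets[key]
        child_lo = tuple(lo[k] + key[k] * half for k in range(3))
        node.children.append(
            build(pts, nrms, child_lo, half, depth + 1, max_depth, tol, adaptive))
    return node

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)

# A flat square patch sampled on a 16 x 16 grid, all normals pointing along +z.
grid = [(i + 0.5) / 16 for i in range(16)]
plane_pts = [(x, y, 0.5) for x in grid for y in grid]
plane_nrm = [(0.0, 0.0, 1.0)] * len(plane_pts)
adaptive = build(plane_pts, plane_nrm, max_depth=3)
traditional = build(plane_pts, plane_nrm, max_depth=3, adaptive=False)
print(count_nodes(adaptive), count_nodes(traditional))  # prints "1 85"
```

For the flat patch, the adaptive variant keeps a single root cell because the planar error is zero, while the occupancy-only variant subdivides every occupied octant down to the maximum depth.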

Method Pipeline

(a) Adaptive Shape Tokenization. Given an input mesh with surface point samples, we partition 3D space into a sparse octree that adapts to the local geometric complexity of the surface. We then use a Perceiver-based transformer to encode the shape into a tree of latent codes, where each child node needs to encode only the (quantized) residual latent relative to its parent. The latents can then be decoded into an occupancy field from which a mesh can be extracted.

(b) Autoregressive Shape Generation. We define an autoregressive model that generates a tree of quantized shape tokens from a text prompt, following a coarse-to-fine breadth-first traversal. Just as end-of-sentence tokens enable variable-length text generation, structural tokens enable the generation of variable-size tree structures.
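To make the role of structural tokens concrete, the following sketch flattens a tree of quantized codes into a breadth-first sequence and rebuilds it. The encoding is an assumption chosen for illustration, not the scheme used in the paper: each node emits its code followed by an 8-bit child mask as its structural token, and a mask of 0 marks a leaf. The names `TokNode`, `serialize`, and `deserialize` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TokNode:
    code: int                                      # quantized latent index (hypothetical codebook id)
    children: dict = field(default_factory=dict)   # octant index 0..7 -> TokNode

def serialize(root):
    """Flatten the tree coarse-to-fine (breadth-first) into (kind, value) tokens."""
    seq = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        mask = sum(1 << octant for octant in node.children)
        seq.append(("code", node.code))
        seq.append(("struct", mask))   # structural token: which octants have children
        for octant in sorted(node.children):
            queue.append(node.children[octant])
    return seq

def deserialize(seq):
    """Rebuild the tree from a sequence produced by serialize."""
    tokens = iter(seq)

    def read_node():
        kind, code = next(tokens)
        assert kind == "code"
        kind, mask = next(tokens)
        assert kind == "struct"
        return TokNode(code), mask

    root, mask = read_node()
    # Pending child slots, in the same order serialize visits the children.
    slots = deque((root, o) for o in range(8) if mask & (1 << o))
    while slots:
        parent, octant = slots.popleft()
        child, child_mask = read_node()
        parent.children[octant] = child
        slots.extend((child, o) for o in range(8) if child_mask & (1 << o))
    return root

# Root with a leaf in octant 0 and an internal node in octant 5.
tree = TokNode(7, {0: TokNode(3), 5: TokNode(9, {2: TokNode(4)})})
sequence = serialize(tree)
print(sequence)
print(deserialize(sequence) == tree)  # prints "True"
```

Because the structural tokens fully determine how many nodes follow at the next level, the sequence length varies with the tree, and a generator that emits leaf masks everywhere simply stops, much like an end-of-sentence token ends a text sequence.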

Reconstruction Results

We plot reconstruction quality (IoU) against latent size in both the discrete (left) and continuous (right) settings. For a fair comparison, we measure the size of continuous latent representations in kilobytes (KB). Our method consistently outperforms baseline approaches at equivalent latent sizes and achieves comparable reconstruction quality with much smaller latent representations.
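The two quantities on the plot axes can be sketched as follows. This is an illustrative calculation under stated assumptions, not the evaluation code: IoU is computed over boolean occupancy values at a shared set of query points, and the latent size assumes 4 bytes per stored value and 1 KB = 1024 bytes. The channel count of 8 in the example is made up, and the function names `iou` and `continuous_latent_kb` are hypothetical.

```python
def iou(pred, target):
    """Intersection over union of two occupancy samplings at the same query points."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = sum(1 for a, b in zip(pred, target) if a and b)
    union = sum(1 for a, b in zip(pred, target) if a or b)
    # Two empty shapes are treated as a perfect match (assumption).
    return inter / union if union else 1.0

def continuous_latent_kb(num_tokens, channels, bytes_per_value=4):
    """Storage size of a continuous latent in KB, assuming 1 KB = 1024 bytes."""
    return num_tokens * channels * bytes_per_value / 1024

print(iou([1, 1, 0, 0], [1, 0, 1, 0]))   # 1 shared occupied point out of 3 occupied overall
print(continuous_latent_kb(512, 8))      # 512 tokens x 8 channels x 4 bytes = 16.0 KB
```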

BibTeX

@article{deng2024oat,
    title={Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization},
    author={Deng, Kangle and Liu, Hsueh-Ti Derek and Zhu, Yiheng and Sun, Xiaoxia and Shang, Chong and Bhat, S. Kiran and Ramanan, Deva and Zhu, Jun-Yan and Agrawala, Maneesh and Zhou, Tinghui},
    journal={arXiv preprint arXiv:2504.02817},
    year={2025},
}

Acknowledgements

We thank Akash Garg, Daiqing Li, Alexander Weiss, Alejandro Peláez, Sheng-Yu Wang, Gaurav Parmer, Ruihan Gao, Nupur Kumari, and Maxwell Jones for their discussions and help. This work was done while Kangle Deng was an intern at Roblox. The project is partly supported by Roblox. Jun-Yan Zhu is partly supported by the Packard Fellowship. Kangle Deng is supported by the Microsoft Research PhD Fellowship.