
OpenAI Point·E: Create 3D point clouds from complex prompts in minutes on a single GPU

The impressive performance of today’s state-of-the-art image generation models has stimulated research in text-to-3D object generation. However, unlike 2D models, which can generate output in minutes or even seconds, 3D object generative models typically require several GPU-hours to produce a single sample.
In the new paper Point-E: A System for Generating 3D Point Clouds from Complex Prompts, the OpenAI research team presents Point·E, a text-conditional synthesis system for 3D point clouds. The approach uses diffusion models to create diverse and complex 3D shapes from complex text prompts in just a minute or two on a single GPU.
The team focuses on the text-to-3D challenge, which is critical to democratizing 3D content creation for real-world applications ranging from virtual reality and gaming to industrial design. Existing text-to-3D methods fall into two categories, each with drawbacks: 1) generative models trained directly on paired text-3D data can produce samples efficiently, but do not scale well to diverse and complex text prompts; 2) methods that optimize a 3D representation against a pretrained text-image model can handle complex and varied prompts, but are computationally expensive and can easily get stuck in local minima that do not correspond to meaningful or coherent 3D objects.
The team therefore explored an alternative approach that combines the strengths of both: a text-to-image diffusion model trained on a large corpus of text-image pairs (allowing it to handle diverse and complex prompts), paired with an image-to-3D diffusion model trained on a smaller image-3D pair dataset. Given a text prompt, the text-to-image model first samples a single synthetic rendered view, and the image-to-3D model then produces a 3D point cloud conditioned on that sampled image.
The team’s generative stack builds on recently proposed diffusion frameworks for text-conditional image generation (Sohl-Dickstein et al., 2015; Song & Ermon, 2020b; Ho et al., 2020). They use a 3-billion-parameter GLIDE model (Nichol et al., 2021), fine-tuned on rendered 3D models, as their text-to-image model, and a family of diffusion models that generate RGB point clouds as their image-to-3D models.
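The cascade described above can be summarized in a few lines of code. The sketch below is purely illustrative: the stage functions are hypothetical placeholders standing in for the real models, not the Point·E API, and the released system additionally includes an upsampler model that densifies the coarse point cloud (reflected as a third stage here).

```python
# Illustrative sketch of the Point·E cascade; the stage functions below are
# hypothetical placeholders, not the actual Point·E implementation.
import numpy as np

def sample_glide_image(prompt: str) -> np.ndarray:
    """Placeholder for the fine-tuned GLIDE text-to-image diffusion model."""
    return np.random.rand(64, 64, 3)            # a single synthetic RGB view

def sample_point_cloud(image: np.ndarray, n_points: int = 1024) -> np.ndarray:
    """Placeholder for the image-conditional point cloud diffusion model."""
    return np.random.rand(n_points, 6)           # XYZ coordinates plus RGB color

def upsample_point_cloud(image: np.ndarray, coarse: np.ndarray,
                         n_points: int = 4096) -> np.ndarray:
    """Placeholder for the upsampler diffusion model (coarse -> dense cloud)."""
    return np.random.rand(n_points, 6)

def text_to_point_cloud(prompt: str) -> np.ndarray:
    image = sample_glide_image(prompt)           # stage 1: text -> synthetic view
    coarse = sample_point_cloud(image)           # stage 2: view -> coarse RGB point cloud
    return upsample_point_cloud(image, coarse)   # stage 3: densify the point cloud
```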
While previous work used dedicated 3D architectures to process point clouds, the researchers use a simple transformer-based model (Vaswani et al., 2017) to improve efficiency. In their diffusion model architecture, the conditioning image is first fed into a pretrained ViT-L/14 CLIP model, and the resulting grid of output features is fed into the transformer as tokens alongside the noisy point cloud.
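A minimal PyTorch sketch of this conditioning scheme is shown below. It illustrates the idea (CLIP feature tokens concatenated with noisy point tokens in a plain transformer denoiser) rather than the official Point·E implementation; all layer sizes, names, and the single noise-prediction output head are assumptions.

```python
# Minimal sketch (not the official Point·E code) of a transformer denoiser that
# conditions on a grid of CLIP image features, as described above.
import torch
import torch.nn as nn

class PointDiffusionTransformer(nn.Module):
    def __init__(self, point_dim=6, width=512, layers=12, heads=8,
                 clip_dim=1024, n_clip_tokens=256):
        super().__init__()
        self.point_proj = nn.Linear(point_dim, width)   # XYZ + RGB -> model width
        self.clip_proj = nn.Linear(clip_dim, width)     # CLIP feature grid -> model width
        self.time_embed = nn.Sequential(
            nn.Linear(1, width), nn.SiLU(), nn.Linear(width, width))
        layer = nn.TransformerEncoderLayer(d_model=width, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.out_proj = nn.Linear(width, point_dim)      # predict per-point noise

    def forward(self, noisy_points, clip_tokens, t):
        # noisy_points: (B, N, 6); clip_tokens: (B, 256, 1024); t: (B, 1)
        point_tok = self.point_proj(noisy_points)
        cond_tok = self.clip_proj(clip_tokens)
        time_tok = self.time_embed(t).unsqueeze(1)
        tokens = torch.cat([time_tok, cond_tok, point_tok], dim=1)
        hidden = self.backbone(tokens)
        # Only the point tokens produce denoising outputs.
        return self.out_proj(hidden[:, -noisy_points.shape[1]:])

# Example shapes for a forward pass:
model = PointDiffusionTransformer()
eps = model(torch.randn(2, 1024, 6),      # noisy XYZ+RGB points
            torch.randn(2, 256, 1024),    # CLIP ViT-L/14 feature grid
            torch.rand(2, 1))             # diffusion timestep
```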
In their empirical study, the team compared the proposed Point·E method with other generative 3D models on evaluation prompts from the COCO object detection, segmentation, and captioning dataset. The results confirm that Point·E can generate diverse and complex 3D shapes from complex text prompts while reducing inference time by one to two orders of magnitude. The team hopes their work will inspire further research into text-to-3D synthesis.
Pretrained point cloud diffusion models and evaluation code are available on the project’s GitHub. The paper Point-E: A System for Generating 3D Point Clouds from Complex Prompts is on arXiv.
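For readers who want to try the released models, the snippet below follows the text-to-point-cloud example shipped with the repository. It assumes the point_e package is installed and that the published checkpoint names (base40M-textvec, upsample) are unchanged; details may differ in the current codebase.

```python
# Based on the text-to-3D example in the Point·E repository; checkpoint and
# config names are assumed to match the published release.
import torch
from tqdm.auto import tqdm

from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Base text-conditional model (coarse point cloud) and upsampler.
base_name = 'base40M-textvec'
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

upsampler_model = model_from_config(MODEL_CONFIGS['upsample'], device)
upsampler_model.eval()
upsampler_model.load_state_dict(load_checkpoint('upsample', device))
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=['R', 'G', 'B'],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=('texts', ''),  # only the base model sees the prompt
)

# Generate an RGB point cloud from a text prompt.
samples = None
for x in tqdm(sampler.sample_batch_progressive(
        batch_size=1, model_kwargs=dict(texts=['a red motorcycle']))):
    samples = x
pc = sampler.output_to_point_clouds(samples)[0]
```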
We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to receive weekly AI updates.


Post time: Dec-28-2022