Sketch2Scene: Automatic Generation of Interactive 3D Game Scenes
from User's Casual Sketches


Abstract


3D content generation is at the heart of many computer graphics applications, including video gaming, filmmaking, virtual reality, and augmented reality. This paper proposes a novel approach for automatically generating interactive (i.e., playable) 3D game scenes from users' casual prompts, including hand-drawn sketches and text descriptions. Sketch-based input offers a natural, convenient, and precise way to capture the user's design intention. To circumvent the prohibitive challenge of the lack of large-scale training data for 3D scenes, our method leverages a pre-trained 2D diffusion model to generate images as conceptual guidance. We advocate the use of isometric projection images to factor out unknown camera poses while simultaneously obtaining the scene layout. Given a generated isometric image, we utilize pre-trained image understanding models to identify scene components and extract the scene layout. Finally, we leverage a procedural generation engine to render the obtained 3D scenes, resulting in a playable 3D game scene that can be seamlessly integrated into a game environment such as Unity or Unreal Engine. Extensive experimental results demonstrate that our method can efficiently generate high-quality 3D game scenes with layouts that closely follow users' intentions.

Overview Pipeline of Sketch2Scene

Our Sketch2Scene consists of three processing modules: (1) Isometric 2D Image Generation, (2) Visual Scene Understanding, and (3) Procedural 3D Scene Generation. First, the Isometric 2D Image Generation module uses a pre-trained ControlNet to create a 2D isometric reference image from the user's sketch and text prompt. Next, the Visual Scene Understanding module extracts foreground object masks, computes the heightmap and texture splatmap, and determines object instance poses from the isometric image. Finally, the Procedural 3D Scene Generation module uses the heightmap, splatmap, and object poses to generate and render the final 3D game scene.
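The data flow between the three modules can be summarized in a short skeleton. The following Python sketch is purely illustrative; every function name is a hypothetical placeholder for the corresponding module described below, not a released API.

```python
# Hypothetical end-to-end skeleton of the Sketch2Scene pipeline; the
# stage functions are placeholders, not a released API.

def generate_isometric_image(sketch, prompt):
    """Module (1): ControlNet conditioned on the user's sketch + text."""
    raise NotImplementedError

def understand_scene(isometric_image):
    """Module (2): returns foreground masks, heightmap, splatmap, poses."""
    raise NotImplementedError

def build_game_scene(heightmap, splatmap, object_poses):
    """Module (3): procedural synthesis inside a game engine (e.g., Unity)."""
    raise NotImplementedError

def sketch2scene(sketch, prompt):
    image = generate_isometric_image(sketch, prompt)
    masks, heightmap, splatmap, poses = understand_scene(image)
    return build_game_scene(heightmap, splatmap, poses)
```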

Sketch-Conditioned Isometric Image Generation

For sketch-conditioned isometric image generation, we introduce an adapted ControlNet that generates isometric images from text prompts and sketches. To tackle the challenge of training for such controllability with limited data, we propose a novel approach called the Sketch-Aware Loss (SAL). SAL allows ControlNet to be trained with a single ground-truth image paired with diverse sketches generated through random category filtering, thereby improving its robustness to flexible, casual sketches.
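To make the idea concrete, here is a minimal PyTorch sketch of one SAL-style training step, assuming diffusers-style ControlNet/UNet call signatures. The keep probability and the edge-map compositing scheme are our own assumptions, not the paper's exact recipe.

```python
import random
import torch
import torch.nn.functional as F

def random_category_filter(category_sketches, keep_prob=0.5):
    """category_sketches: dict {category: (3, H, W) edge-map tensor}.
    Randomly drops whole categories to simulate a casual, partial sketch."""
    kept = [s for s in category_sketches.values() if random.random() < keep_prob]
    if not kept:                      # always keep at least one category
        kept = [random.choice(list(category_sketches.values()))]
    return torch.clamp(torch.stack(kept).sum(dim=0), 0.0, 1.0)

def sal_training_step(unet, controlnet, scheduler, latents, text_emb,
                      category_sketches):
    # One ground-truth latent is paired with a freshly sampled sketch variant.
    sketch = random_category_filter(category_sketches).unsqueeze(0)
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    down_res, mid_res = controlnet(
        noisy, t, encoder_hidden_states=text_emb,
        controlnet_cond=sketch, return_dict=False)
    pred = unet(
        noisy, t, encoder_hidden_states=text_emb,
        down_block_additional_residuals=down_res,
        mid_block_additional_residual=mid_res).sample
    return F.mse_loss(pred, noise)    # standard epsilon-prediction loss
```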

Isometric Basemap Inpainting

To generate a 2D empty basemap of the terrain, we train a LoRA on SDXL-Inpaint to learn the distribution of basemaps and foreground masks. A customized training and inference mechanism provides well-controlled and accurate generation of the terrain, thereby enhancing the overall quality of the 3D game scene. To overcome the absence of isometric basemap ground truth, we compile an inpainting dataset from three sources: 5,000 full isometric images, 4,000 manually filtered perspective images of empty terrains inpainted using LaMa, and 1,000 pure texture images.
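As an illustration, the following diffusers snippet shows how such a LoRA could be applied on top of SDXL-Inpaint at inference time. The LoRA path, input file names, and prompt are placeholders; the paper's customized training and inference mechanism is not reproduced here.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionXLInpaintPipeline

# Load the public SDXL inpainting pipeline and a (hypothetical) basemap LoRA.
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("path/to/basemap_lora")  # placeholder checkpoint

isometric_image = Image.open("isometric_scene.png")  # generated scene
foreground_mask = Image.open("foreground_mask.png")  # objects to remove

# Inpaint the masked foreground so only empty terrain remains.
basemap = pipe(
    prompt="empty isometric terrain, grass and dirt, no objects",
    image=isometric_image,
    mask_image=foreground_mask,
    strength=0.99,
).images[0]
basemap.save("basemap.png")
```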

Compositional Visual Scene Understanding

Given an isometric 2D image, we apply a state-of-the-art semantic segmentation algorithm, such as Grounded Segment Anything, to separate the per-category foreground objects. After basemap inpainting, we reconstruct a coarse but watertight 3D terrain mesh using Depth-Anything and Poisson reconstruction. We can then rotate the view to obtain a bird's-eye view of the terrain and retrieve the heightmap and corresponding texture splatmap using a combination of SAM and Osprey. For objects such as buildings and other landmarks, instance segmentation is applied to obtain a 2D reference image for each object and estimate its pose within the 3D scene, which is used for object retrieval or reconstruction at a later stage.
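A minimal sketch of the terrain reconstruction step is shown below, assuming a pinhole camera with placeholder intrinsics: the monocular depth map (e.g., from Depth-Anything) is back-projected into a point cloud and meshed with Open3D's Poisson surface reconstruction.

```python
import numpy as np
import open3d as o3d

def depth_to_mesh(depth_map, fx=500.0, fy=500.0):
    """depth_map: (H, W) float array of metric or relative depth."""
    h, w = depth_map.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project pixels under an assumed pinhole model centered at (w/2, h/2).
    z = depth_map
    x = (u - w / 2) * z / fx
    y = (v - h / 2) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(pts)
    pcd.estimate_normals()  # Poisson reconstruction needs oriented normals
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=9)       # octree depth 9 is an arbitrary default
    return mesh
```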

Procedural 3D Scene Synthesis

In this work, we use the Unity game engine to build our 3D interactive environment, as Unity provides built-in optimizations for terrain, vegetation, and animation that ensure good runtime performance. For larger objects, we use each segmented foreground instance as a reference for object retrieval or 3D object generation.
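For example, the extracted heightmap can be handed to Unity as a 16-bit little-endian RAW file, a format the Unity terrain importer accepts (with a side length of a power of two plus one, e.g., 513). The resolution and normalization below are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def export_unity_heightmap(heightmap, path="terrain.raw", size=513):
    """Write a heightmap as a 16-bit LE RAW file for Unity terrain import."""
    h = np.asarray(heightmap, dtype=np.float32)
    h = (h - h.min()) / max(h.max() - h.min(), 1e-8)  # normalize to [0, 1]
    # Naive nearest-neighbor resample to the Unity terrain resolution.
    ys = np.linspace(0, h.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, h.shape[1] - 1, size).astype(int)
    h = h[np.ix_(ys, xs)]
    (h * 65535).astype("<u2").tofile(path)            # 16-bit little-endian
```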

Results

We demonstrate that our Sketch2Scene system can function as a game scene generator for video games, allowing players to freely explore a high-quality world with layouts that closely follow users' intentions.

Citation


This website is based on mip-NeRF.