Figure 1: Illustration of our main idea. Taking a point cloud as input, we first encode the geometry information of each point. We then sample a projection view and rearrange the point-wise features into an image-style layout, obtaining pixel-wise features via Geometry-preserved Projection. A learnable Coloring Module then enriches this colorless projection with color information to produce a colorful image I. With a task-specific head, our P2P framework can be easily transferred to several downstream tasks with the help of the transferable visual knowledge from the pre-trained image model. We take the classical Vision Transformer as the pre-trained image model for illustration in this pipeline.
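The projection-then-coloring pipeline described above can be sketched in simplified form. The snippet below is an illustrative NumPy sketch, not the paper's exact operator: it uses a fixed orthographic view along the z-axis for the Geometry-preserved Projection, and a hypothetical fixed linear map `W` stands in for the learnable Coloring Module.

```python
import numpy as np

def geometry_preserved_projection(points, feats, img_size=16):
    """Scatter N point-wise features (N x C) onto an img_size x img_size
    pixel grid by orthographically projecting the points along the z-axis;
    features of points landing in the same pixel are averaged.
    Simplified sketch of the paper's Geometry-preserved Projection."""
    xy = points[:, :2]
    # normalize x, y into the grid range, then quantize to pixel indices
    mn, mx = xy.min(axis=0), xy.max(axis=0)
    uv = ((xy - mn) / (mx - mn + 1e-8) * (img_size - 1)).astype(int)
    C = feats.shape[1]
    grid = np.zeros((img_size, img_size, C))
    count = np.zeros((img_size, img_size, 1))
    for (u, v), f in zip(uv, feats):
        grid[v, u] += f
        count[v, u] += 1
    return grid / np.maximum(count, 1)  # average features per pixel

def coloring_module(pixel_feats, W):
    """Hypothetical stand-in for the learnable Coloring Module:
    a linear map from C-dim pixel features to RGB."""
    return pixel_feats @ W  # (H, W, C) @ (C, 3) -> (H, W, 3) image

# usage: project a random point cloud and color it
rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))   # N x 3 point cloud
feats = rng.normal(size=(128, 8))    # N x C point-wise features
pixel_feats = geometry_preserved_projection(points, feats, img_size=16)
image = coloring_module(pixel_feats, rng.normal(size=(8, 3)))
```

In the full framework, `image` would then be fed to the frozen pre-trained image model (e.g., a Vision Transformer), with only the point encoder, Coloring Module, and task-specific head being trained.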
Nowadays, pre-training big models on large-scale datasets has become a crucial topic in deep learning. Pre-trained models with high representation ability and transferability are tuned on downstream tasks and have achieved great success in natural language processing and computer vision. However, it is non-trivial to extend such a pretraining-then-tuning paradigm to 3D vision, given the limited training data, which are relatively inconvenient to collect. In this paper, we propose a new perspective of leveraging pre-trained 2D knowledge in the 3D domain to tackle this problem, tuning pre-trained image models with the novel Point-to-Pixel prompting for point cloud analysis at a minor parameter cost. Following the principle of prompt engineering, we transform point clouds into colorful images with geometry-preserved projection and geometry-aware coloring to adapt them to pre-trained image models, whose weights are kept frozen during the end-to-end optimization of point cloud analysis tasks. We conduct extensive experiments demonstrating that, in cooperation with our proposed Point-to-Pixel Prompting, a better pre-trained image model consistently leads to better performance in 3D vision. Therefore, by leveraging the prosperous development of the image pre-training field, our framework achieves results competitive with state-of-the-art methods on point cloud classification and part segmentation.
We show that our proposed P2P prompting method benefits from 2D pre-trained models: within one model family, larger pre-trained image models yield better 3D classification performance.
We show that our proposed P2P outperforms other carefully designed 3D models with far fewer trainable parameters.
Visualization of the images produced by our Point-to-Pixel Prompting.
Table 1: Classification results on ModelNet40 and ScanObjectNN. For each image model, we report its image classification accuracy on the ImageNet dataset (IN Acc.). After migrating the models to point cloud analysis with Point-to-Pixel Prompting, we report the number of trainable parameters (Tr. Param.) and the accuracy on the ModelNet40 (MN Acc.) and ScanObjectNN (SN Acc.) datasets.
Figure 2: Illustration of our ablation design. We conduct ablation studies on (a) replacing P2P prompting with vanilla fine-tuning or visual prompt tuning, (b) the design of Point-to-Pixel Prompting, and (c) different tuning strategies for the pre-trained image model.
Table 4: Ablation studies on ModelNet40 classification. We select a ViT-B pre-trained with supervision on ImageNet-1k as our image model and report results under different ablation settings.
Figure 3: Images produced by our Point-to-Pixel Prompting. We show the original point clouds (top row) and the colorful projected images produced by our P2P, for synthetic objects from ModelNet40 (left five columns) and real-world objects from ScanObjectNN (right three columns), each rendered from two different projection views. Best viewed in color.