ORL: Unsupervised Object-Level Representation Learning from Scene Images
NeurIPS 2021
Abstract
Contrastive self-supervised learning has largely narrowed the gap to supervised pre-training on ImageNet. However, its success highly relies on the object-centric priors of ImageNet, i.e., different augmented views of the same image correspond to the same object. Such a heavily curated constraint becomes immediately infeasible when pre-training on more complex scene images with many objects. To overcome this limitation, we introduce Object-level Representation Learning (ORL), a new self-supervised learning framework for scene images. Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence, thus realizing object-level representation learning from scene images. Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. Furthermore, ORL improves the downstream performance when more unlabeled scene images are available, demonstrating its great potential for harnessing unlabeled data in the wild. We hope our approach can motivate future research on more general-purpose unsupervised representation learning from scene data.
(a) Current image-level contrastive learning methods heavily rely on the object-centric bias of ImageNet, i.e., different crops correspond to the same object. Prior works form positive pairs either from different views of the same image (intra-image) or from similar images (inter-image). (b) Directly applying image-level contrastive learning to scene images can produce inconsistent learning signals, since different crops may correspond to different objects. (c) Object-level contrastive learning overcomes this limitation by enforcing object-level consistency. (d) We find that image-level contrastive learning encodes priors for region correspondence discovery across images, and the high-response regions are usually objects or object parts, which is useful for object-level representation learning.
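To make the distinction between (a)–(c) concrete, here is a minimal sketch of the three positive-pair constructions, assuming PIL images and torchvision transforms; the helper names and the (left, top, right, bottom) box convention are ours, not the paper's API:

```python
# Illustrative sketch of the positive-pair constructions in (a)-(c);
# helper names and box convention are ours, not the authors' code.
import torchvision.transforms as T
from PIL import Image

aug = T.Compose([T.RandomResizedCrop(224), T.RandomHorizontalFlip()])

def intra_image_pair(img: Image.Image):
    """(a, intra-image): two augmented views of the same image."""
    return aug(img), aug(img)

def inter_image_pair(img: Image.Image, similar_img: Image.Image):
    """(a, inter-image): a view of the image and a view of a retrieved similar image."""
    return aug(img), aug(similar_img)

def object_level_pair(img: Image.Image, similar_img: Image.Image, box_a, box_b):
    """(c): crops of two corresponding object RoIs; boxes are (l, t, r, b) pixels."""
    return aug(img.crop(box_a)), aug(similar_img.crop(box_b))
```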
Multi-Stage ORL Framework
ORL extends the existing image-level self-supervised learning framework to the object level. We leverage image-level contrastive pre-training (e.g., BYOL) as the prior for object-level semantic correspondence discovery, and use the obtained correspondences to construct positive object-instance pairs for object-level representation learning.
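A minimal PyTorch sketch of how the three stages fit together, based on our reading of the description above rather than the authors' released code; `online`, `target`, and `predictor` are the standard BYOL modules, and all names are illustrative:

```python
# Minimal sketch of the three ORL stages (our reading of the paper,
# not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def byol_loss(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """BYOL regression loss: 2 - 2 * cosine(online prediction, stop-grad target)."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z.detach(), dim=-1)  # target branch receives no gradient
    return 2 - 2 * (p * z).sum(dim=-1).mean()

# Stage 1: image-level BYOL pre-training on full scene images; its features
# already encode rough cross-image region correspondence (see (d) above).

# Stage 2: with the frozen Stage-1 encoder, mine corresponding RoI pairs
# across each image and its retrieved nearest-neighbour image.

# Stage 3: object-level pre-training -- the same BYOL objective, applied to
# crops of the mined RoI pairs instead of two global views.
def stage3_step(online: nn.Module, target: nn.Module, predictor: nn.Module,
                roi_a: torch.Tensor, roi_b: torch.Tensor) -> torch.Tensor:
    """One object-level update on a batch of mined RoI crop pairs [B, 3, H, W]."""
    return byol_loss(predictor(online(roi_a)), target(roi_b))
```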
Representation Learning
Image classification with frozen backbone features.

| Method | Pre-train data | VOC07 mAP | ImageNet Top-1 | Places205 Top-1 | iNat. Top-1 |
|---|---|---|---|---|---|
| Random | - | 9.6 | 13.7 | 16.6 | 4.8 |
| Supervised | ImageNet | 87.5 | 75.9 | 51.5 | 45.4 |
| SimCLR | COCO | 78.1 | 50.9 | 48.0 | 22.7 |
| MoCo v2 | COCO | 82.2 | 55.1 | 48.8 | 27.8 |
| BYOL | COCO | 84.5 | 57.8 | 50.5 | 29.5 |
| ORL (ours) | COCO | 86.7 | 59.0 | 52.7 | 31.8 |
| BYOL | COCO+ | 87.0 | 59.6 | 52.7 | 30.9 |
| ORL (ours) | COCO+ | 88.6 | 60.7 | 54.1 | 32.0 |
VOC07 low-shot classification (mAP) with k labeled samples per class.

| Method | Pre-train data | k=1 | k=2 | k=4 | k=8 | k=16 | k=32 | k=64 | k=96 |
|---|---|---|---|---|---|---|---|---|---|
| Random | - | 9.2 | 9.4 | 11.1 | 12.3 | 14.3 | 17.4 | 21.3 | 23.8 |
| Supervised | ImageNet | 53.0 | 63.6 | 73.7 | 78.8 | 81.8 | 83.8 | 85.2 | 86.0 |
| SimCLR | COCO | 33.3 | 43.5 | 52.5 | 61.1 | 66.7 | 70.5 | 73.7 | 75.0 |
| MoCo v2 | COCO | 39.5 | 49.3 | 60.4 | 69.3 | 74.1 | 76.8 | 79.1 | 80.1 |
| BYOL | COCO | 39.4 | 50.9 | 62.2 | 71.7 | 76.6 | 79.2 | 81.3 | 82.2 |
| ORL (ours) | COCO | 39.6 | 51.2 | 63.4 | 72.6 | 78.2 | 81.3 | 83.6 | 84.7 |
| BYOL | COCO+ | 41.1 | 54.3 | 66.6 | 75.2 | 80.1 | 82.6 | 84.6 | 85.4 |
| ORL (ours) | COCO+ | 42.1 | 54.9 | 67.4 | 75.7 | 81.3 | 83.7 | 85.8 | 86.7 |
Semi-supervised classification on ImageNet with 1% and 10% labels.

| Method | Pre-train data | 1% labels Top-1 | 1% labels Top-5 | 10% labels Top-1 | 10% labels Top-5 |
|---|---|---|---|---|---|
| Random | - | 1.6 | 5.0 | 21.8 | 44.2 |
| Supervised | ImageNet | 25.4 | 48.4 | 56.4 | 80.4 |
| SimCLR | COCO | 23.4 | 46.4 | 52.2 | 77.4 |
| MoCo v2 | COCO | 28.2 | 54.7 | 57.1 | 81.7 |
| BYOL | COCO | 28.4 | 55.9 | 58.4 | 82.7 |
| ORL (ours) | COCO | 31.0 | 58.9 | 60.5 | 84.2 |
| BYOL | COCO+ | 28.3 | 56.0 | 59.4 | 83.6 |
| ORL (ours) | COCO+ | 31.8 | 60.1 | 60.9 | 84.4 |
COCO object detection and instance segmentation.

| Method | Pre-train data | AP^bb | AP^bb_50 | AP^bb_75 | AP^mk | AP^mk_50 | AP^mk_75 |
|---|---|---|---|---|---|---|---|
| Random | - | 32.8 | 50.9 | 35.3 | 29.9 | 47.9 | 32.0 |
| Supervised | ImageNet | 39.7 | 59.5 | 43.3 | 35.9 | 56.6 | 38.6 |
| SimCLR | COCO | 37.0 | 56.8 | 40.3 | 33.7 | 53.8 | 36.1 |
| MoCo v2 | COCO | 38.5 | 58.1 | 42.1 | 34.8 | 55.3 | 37.3 |
| Self-EMD | COCO | 39.3 | 60.1 | 42.8 | - | - | - |
| DenseCL | COCO | 39.6 | 59.3 | 43.3 | 35.7 | 56.5 | 38.4 |
| BYOL | COCO | 39.5 | 59.3 | 43.2 | 35.6 | 56.5 | 38.2 |
| ORL (ours) | COCO | 40.3 | 60.2 | 44.4 | 36.3 | 57.3 | 38.9 |
| BYOL | COCO+ | 40.0 | 60.1 | 44.0 | 36.2 | 57.1 | 39.0 |
| ORL (ours) | COCO+ | 40.6 | 60.8 | 44.5 | 36.7 | 57.9 | 39.3 |
Correspondence Pairs
In contrast to typical contrastive learning methods, which rely on aggressive intra-image augmentations to simulate intra-class variation, our discovered inter-RoI pairs provide substantially more natural and diverse intra-class variation at the object-instance level.
Top-ranked region correspondence discovered in Stage 2 of ORL on COCO. The high-response region pairs are usually objects or object parts.
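A minimal sketch of how such top-ranked pairs can be mined from Stage-1 features; the pooling of each RoI into a single feature vector and the cutoff `k` are our assumptions:

```python
# Illustrative Stage-2 mining: rank cross-image RoI pairs by cosine
# similarity of their pooled Stage-1 features (pooling step assumed).
import torch
import torch.nn.functional as F

def top_roi_pairs(feat_a: torch.Tensor, feat_b: torch.Tensor, k: int = 10):
    """feat_a: [Na, D], feat_b: [Nb, D] pooled RoI features from two images.
    Returns ([k, 2] index pairs, [k] similarity scores), highest first."""
    sim = F.normalize(feat_a, dim=1) @ F.normalize(feat_b, dim=1).T  # [Na, Nb]
    scores, flat = sim.flatten().topk(k)
    rows = torch.div(flat, sim.size(1), rounding_mode="floor")
    cols = flat % sim.size(1)
    return torch.stack((rows, cols), dim=1), scores
```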
Attention Maps
Both BYOL and ORL produce relatively high-quality attention maps that focus on the foreground objects in scene images. Nevertheless, ORL 1) activates more object regions, and 2) produces more accurate object boundaries in the heatmaps than BYOL, owing to the introduced object-level similarity learning.
Attention maps generated by BYOL and ORL on COCO. ORL activates more object regions and produces more accurate object boundaries than BYOL.
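The exact visualization procedure is not spelled out here; a common recipe, which we sketch below purely as an assumption, is to reduce the backbone's last feature map over channels, upsample it to the image size, and min-max normalize:

```python
# Hedged sketch of a standard attention-map visualization (not necessarily
# the authors' procedure): channel-wise energy of the last conv features.
import torch
import torch.nn.functional as F

def attention_map(feats: torch.Tensor, out_hw) -> torch.Tensor:
    """feats: [B, C, h, w] backbone activations -> [B, H, W] heatmaps in [0, 1]."""
    amap = feats.pow(2).mean(dim=1, keepdim=True)           # [B, 1, h, w] energy
    amap = F.interpolate(amap, size=out_hw, mode="bilinear",
                         align_corners=False).squeeze(1)    # [B, H, W]
    lo = amap.amin(dim=(1, 2), keepdim=True)
    hi = amap.amax(dim=(1, 2), keepdim=True)
    return (amap - lo) / (hi - lo + 1e-6)                   # per-image min-max norm
```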
Citation
@inproceedings{xie2021unsupervised,
title = {Unsupervised Object-Level Representation Learning from Scene Images},
author = {Xie, Jiahao and Zhan, Xiaohang and Liu, Ziwei and Ong, Yew Soon and Loy, Chen Change},
booktitle = {NeurIPS},
year = {2021}
}
Related Projects
- Delving into Inter-Image Invariance for Unsupervised Visual Representations
  J. Xie, X. Zhan, Z. Liu, Y. S. Ong, C. C. Loy
  Technical report, arXiv:2008.11702, 2020
  [arXiv] [Project Page]
- Online Deep Clustering for Unsupervised Representation Learning
  X. Zhan*, J. Xie*, Z. Liu, Y. S. Ong, C. C. Loy
  In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020
  [PDF] [arXiv] [Project Page]