-
Notifications
You must be signed in to change notification settings - Fork 4
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
robintrmbtt
committed
Dec 20, 2023
1 parent
7e38bfb
commit aa741a1
Showing
9 changed files
with
121 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
--- | ||
layout: review | ||
title: "Image as Set of Points" | ||
tags: visual-representation clustering deep-learning | ||
author: "Robin Trombetta" | ||
cite: | ||
authors: "Xu Ma, Yuqian Zhou, Huan Wang, Can Qin, Bin Sun, Chang Liu, Yun Fu" | ||
title: "Image as Set of Points" | ||
venue: "International Conference on Learning Representations (ICLR) 2023 Oral" | ||
pdf: "https://arxiv.org/pdf/2303.01494.pdf" | ||
--- | ||
|
||
# Highlights | ||
|
||
* The authors propose Context Clusters (CoCs) which is a new paradigm for visual representations where images are seen as a set of unorganized points. | ||
* Even though they don't achieve state-of-the-art performances, they are still on par with recent convolutional and transformer-based visual backbone. | ||
* CoCs have some good properties such as generalization to different data domains (point clouds, RGBD images, etc.) and good interpretability. | ||
* Code is available on this [GitHub repo](https://github.com/ma-xu/Context-Cluster). | ||
|
||
| ||
|
||
# Context Clusters | ||
|
||
To translate an image to a set of points, an input image $$ \textbf{I} \in \mathbb{R}^{3 \times w \times h}$$ is changed to an augmented collection of points (pixels) $$ \textbf{P} \in \mathbb{R}^{5 \times n}$$ with $$n = w \times h$$ and where each pixel $$\textbf{I}_{i,j}$$ has its RGB channel values enhanced with its 2D coordinates $$[\frac{i}{w} - 0.5 ; \frac{j}{h} - 0.5]$$ to give a vector of 5 numbers. | ||
|
||
Then, the features are extracted from this set of points with a succession of Points Reducers and Context Clusters blocks (Figure 1). As one of the goals of the model is to be plugged as a backbone feature extractor into more complex architecture, it copies the global architecture and intermediate/final output sizes of standard CNN and transformer-based backbones such as ResNet and ViTs. In particular, at each stage, the number of points is set so that it can be turned back into a square 2D image. | ||
|
||
<div style="text-align:center"> | ||
<img src="/collections/images/CoCs/cocs_architecture.jpg" width=800></div> | ||
<p style="text-align: center;font-style:italic">Figure 1. High-level overview of Context Cluster architecture with four stages.</p> | ||
|
||
**Point reduction.**   The point reduction operation is done by placing evenly spaced anchors in the image. The feature of the resulting point is obtained after a linear projection of the concatenated features of the *k* nearest points (Figure 2). While *k* can be any number, if it is properly set (for instance 4, 9 or 16), this operation can be done with a standard convolution, which is what is done by the authors. This mechanisms is the same as the initial patch embedding in ViTs. | ||
|
||
<div style="text-align:center"> | ||
<img src="/collections/images/CoCs/anchors_point_reduction.jpg" width=300></div> | ||
<p style="text-align: center;font-style:italic">Figure 2. Illustration of anchors for point reduction. The blue points are the anchors and the red square shows the points that are aggregated for the top-left anchor.</p> | ||
|
||
## Context Cluster Operation | ||
|
||
The main proposal of the authors is the context cluster block, which will be described in details in this section. In each block, three main operations are done : *context clustering*, *feature aggregation* and *feature dispatching* | ||
|
||
**Context clustering.**   Given a set of $$n$$ feature points of size $$d$$, $$ \textbf{P} \in \mathbb{R}^{n \times d}$$, they are linearly projected into a smaller space $$\textbf{P}_S$$ for computational purposes. Then, similarly to point reduction, $$c$$ centers are evenly distributed in space and the center feature is computed as the average of the $$k$$ nearest points of each center. | ||
|
||
Then, the pair-wise cosine similarity between the center and the all the features points $$\textbf{P}_S$$ is computed and give a matrix $$\textbf{S} \in \mathbb{R}^{c \times n}$$. Note that it is implicitely both a similarity between points' dictances and features. Finally, each point is assigned to its most similar center (unique association), resulting in $$c$$ clusters (Figure 3), that have a different number of points (and may also be empty). | ||
|
||
|
||
<div style="text-align:center"> | ||
<img src="/collections/images/CoCs/centers_features.jpg" width=200></div> | ||
<p style="text-align: center;font-style:italic">Figure 3. Illustration of the computation of the center features. The red squares are the centers, the blue circle show the feature that are averaged for the top center, and the color palette shows the clusters of points associated with each centers.</p> | ||
|
||
| ||
|
||
**Feature aggregating.**   Given a cluster $$c$$ with $$m$$ points, the points are mapped to a high-dimension *value space* $$\textbf{P}_v \in \mathbb{R}^{m \times d'}$$. A center $$v_c$$ is also computed in the value space like it is done at the clustering step. | ||
|
||
> In practice, the dimension of the value space is set to be equal to d' so this step can be skipped. | ||
The aggregated feature $$g \in \mathbb{R}^{d'}$$ of the cluster $$c$$ is computed as : | ||
|
||
$$ g = \frac{1}{C} \left( v_c + \sum_{i=1}^{m} \text{sig}(\alpha s_i + \beta) \ast v_i \right)$$ | ||
|
||
with $$C = 1 + \sum_{i=1}^m \text{sig} (\alpha s_i \ast \beta)$$. | ||
|
||
$$v_i$$ is th $$i$$-th point in $$\textbf{P}_v$$, $$\alpha$$ and $$\beta$$ are learnt scalars and $$\text{sig}(\cdot)$$ is the sigmoïd function. $$v_c$$ and $$C$$ are added for numerical stability. | ||
|
||
**Feature dispatching.**   Once the aggregated feature $$g$$ has been computed inside a cluster, it is adaptively dispatched to each point in a cluster based on the similarity. This allows pixels inside a cluster to interact. Each point $$p_i$$ is updated as follows : | ||
|
||
$$ p_i ' = p_i + \text{FC}(\text{sig} (\alpha s_i + \beta) \ast g )$$ | ||
|
||
where $$\text{FC}$$ is a fully-connected layer to get back to the original dimension $$d$$ from the value-space dimension $$d'$$ | ||
|
||
A schematic view of a context cluster block is given is Figure 4. | ||
|
||
<div style="text-align:center"> | ||
<img src="/collections/images/CoCs/cocs_block.jpg" width=700></div> | ||
<p style="text-align: center;font-style:italic">Figure 4. Overview of a Context Cluster block.</p> | ||
|
||
| ||
|
||
## Choices for computation time improving | ||
|
||
**Multi-head design.**   Like in ViTs, the authos use a multi-head design in the context cluster. The output of all the heads are concatenated and fused by a fully-connected layer. | ||
|
||
**Region splitting.**   If no restriction is set when associating the points with the centers, the time complexity to calculate the feature similarity is $$O(ncd)$$ (with $$n = h \times w$$), which is quadratic relatively to the number of points in the image. As it is too computationally demanding, the same technique as in SwinTransformer is done, that is to say **splitting the points into several local regions.**. If the number of regions is set to $$r$$, the time complexity is lowered to $$O(r\frac{n}{r}\frac{c}{r}d)$$. This obviously results in a trade-off between time complexity gain and limitation of the receptive field. In practice, the parameter $$r$$ is set so that the clusters are computed in a $$7 \times 7$$ window. | ||
|
||
| ||
|
||
# Results | ||
|
||
The architecture is evaluated on several datasets for image classification, point cloud classification, object detection, instance segmentation and semantic segmentation tasks. Even though the performances do not reach that of the best state-of-the-art models such as ConvNeXt or DaViT, they are in par with other recent CNN and transformer-based backbones. | ||
|
||
|
||
<div style="text-align:center"> | ||
<img src="/collections/images/CoCs/results_classification.jpg" width=800></div> | ||
<p style="text-align: center;font-style:italic">Table 1. Results on several backbones on ImageNet-1k classification benchmark.</p> | ||
|
||
An important result shown in Table 1 is the two first lines of the cluster-based methods. $$ \text{Context-Cluster-Ti } \ddagger $$ has the same architecture as the other models execpt that the image is splitted into much fewer local regions for clustering operations. The low difference of performance between this version of the model and $$\text{Context-Cluster-Ti }$$ is an argument in favor of the region-splitting strategy. | ||
|
||
| ||
|
||
Context clusters can be adapted and plugged into point cloud classification networks such as PointMLP or used with Mask-RCNN for object detection and instance segmentation. | ||
|
||
<div style="text-align:center"> | ||
<img src="/collections/images/CoCs/results_point_cloud_classification.jpg" width=500></div> | ||
<p style="text-align: center;font-style:italic">Table 2. Classification results on ScanObjectNN.</p> | ||
|
||
<div style="text-align:center"> | ||
<img src="/collections/images/CoCs/results_object_detection.jpg" width=700></div> | ||
<p style="text-align: center;font-style:italic">Table 3. COCO object detection and instance segmentation results using Mask-RCNN .</p> | ||
|
||
| ||
|
||
A nice property of treating images as sets of points and clustering them is that it is much easier to see which pixels interact compared to CNN or ViTs for which complex activation maps has to be computed to get analogous visualizations. | ||
|
||
<div style="text-align:center"> | ||
<img src="/collections/images/CoCs/activation_maps.jpg" width=800></div> | ||
<p style="text-align: center;font-style:italic">Figure 5. Visualization of activation map, class activation map and clustering map for various models and at different stages.</p> | ||
|
||
# Conclusion | ||
|
||
See highlights, you know how it works. | ||
|
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.