CVPR 2023 Papers and Open-Source Projects Collection (Papers with Code)

25.78% = 2360 / 9155

CVPR 2023 decisions are now available on OpenReview! This year, we received a record 9155 submissions (a 12% increase over CVPR 2022) and accepted 2360 papers, for a 25.78% acceptance rate.
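The acceptance rate quoted above can be checked with a short calculation. This is only an illustrative sketch; the variable names are made up for the example, and the two counts are the ones stated in the announcement.

```python
# Figures quoted in the announcement above.
submissions = 9155
accepted = 2360

# Acceptance rate as a percentage, rounded to two decimal places.
acceptance_rate = round(100 * accepted / submissions, 2)

print(f"{acceptance_rate}% = {accepted} / {submissions}")
```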

Note 1: Everyone is welcome to open an issue to share CVPR 2023 papers and open-source projects!

Note 2: For papers from previous top CV conferences, other high-quality CV papers, and roundups, see: https://github.com/amusi/daily-paper-computer-vision

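If you want to keep up with the latest and best CV papers, open-source projects, and learning materials, scan the QR code to join the CVer academic exchange group! Let's learn from each other and improve together.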

[CVPR 2023 Papers with Code: Table of Contents]

Backbone

Integrally Pre-Trained Transformer Pyramid Networks

Stitchable Neural Networks

Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks

BiFormer: Vision Transformer with Bi-Level Routing Attention

DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network

Vision Transformer with Super Token Sampling

Hard Patches Mining for Masked Image Modeling

  • Paper: None
  • Code: None

SMPConv: Self-moving Point Representations for Continuous Convolution

Making Vision Transformers Efficient from A Token Sparsification View

CLIP

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

DeltaEdit: Exploring Text-free Training for Text-driven Image Manipulation

MAE

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Generic-to-Specific Distillation of Masked Autoencoders

GAN

DeltaEdit: Exploring Text-free Training for Text-driven Image Manipulation

NeRF

NoPe-NeRF: Optimising Neural Radiance Field with No Pose Prior

Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures

NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis

Panoptic Lifting for 3D Scene Understanding with Neural Fields

NeRFLiX: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-viewpoint MiXer

HNeRV: A Hybrid Neural Representation for Videos

DETR

DETRs with Hybrid Matching

Prompt

Diversity-Aware Meta Visual Prompting

NAS

PA&DA: Jointly Sampling PAth and DAta for Consistent NAS

Avatars

Structured 3D Features for Reconstructing Relightable and Animatable Avatars

Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos

ReID (Re-Identification)

Clothing-Change Feature Augmentation for Person Re-Identification

  • Paper: None
  • Code: None

MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID

Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification

Large-scale Training Data Search for Object Re-identification

Diffusion Models

Video Probabilistic Diffusion Models in Projected Latent Space

Solving 3D Inverse Problems using Pre-trained 2D Diffusion Models

Imagic: Text-Based Real Image Editing with Diffusion Models

Parallel Diffusion Models of Operator and Image for Blind Inverse Problems

DiffRF: Rendering-guided 3D Radiance Field Diffusion

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising

TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets

Back to the Source: Diffusion-Driven Adaptation to Test-Time Corruption

DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration

Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion

Generative Diffusion Prior for Unified Image Restoration and Enhancement

Conditional Image-to-Video Generation with Latent Flow Diffusion Models

Long-Tail

Long-Tailed Visual Recognition via Self-Heterogeneous Integration with Knowledge Excavation

Vision Transformer

Integrally Pre-Trained Transformer Pyramid Networks

Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors

Learning Trajectory-Aware Transformer for Video Super-Resolution

Vision Transformers are Parameter-Efficient Audio-Visual Learners

Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

BiFormer: Vision Transformer with Bi-Level Routing Attention

Vision Transformer with Super Token Sampling

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

BAEFormer: Bi-directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation

  • Paper: None
  • Code: None

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

Making Vision Transformers Efficient from A Token Sparsification View

Vision-Language

GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods

Teaching Structured Vision&Language Concepts to Vision&Language Models

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding

All in One: Exploring Unified Video-Language Pre-training

Position-guided Text Prompt for Vision Language Pre-training

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

Multi-Modal Representation Learning with Text-Driven Soft Masks

Learning to Name Classes for Vision and Language Models

Object Detection

YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

DETRs with Hybrid Matching

Enhanced Training of Query-Based Object Detection via Selective Query Recollection

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection

Object Tracking

Simple Cues Lead to a Strong Multi-Object Tracker

Joint Visual Grounding and Tracking with Natural Language Specification

Semantic Segmentation

Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding

Medical Image Segmentation

Label-Free Liver Tumor Segmentation

Directional Connectivity-based Segmentation of Medical Images

Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation

Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization

Fair Federated Medical Image Segmentation via Client Contribution Estimation

Ambiguous Medical Image Segmentation using Diffusion Models

Orthogonal Annotation Benefits Barely-supervised Medical Image Segmentation

MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery

MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation

Rethinking Few-Shot Medical Segmentation: A Vector Quantization View

Pseudo-label Guided Contrastive Learning for Semi-supervised Medical Image Segmentation

SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation

DoNet: Deep De-overlapping Network for Cytology Instance Segmentation

Video Object Segmentation

Two-shot Video Object Segmentation

Video Instance Segmentation

Mask-Free Video Instance Segmentation

Referring Image Segmentation

PolyFormer: Referring Image Segmentation as Sequential Polygon Generation

3D Point Cloud

Physical-World Optical Adversarial Attacks on 3D Face Recognition

IterativePFN: True Iterative Point Cloud Filtering

Attention-based Point Cloud Edge Sampling

3D Object Detection

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection

3D Video Object Detection with Learnable Object-Centric Global Optimization

  • Paper: None
  • Code: None

Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection

3D Semantic Segmentation

Less is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation

3D Semantic Scene Completion

3D Registration

Robust Outlier Rejection for 3D Registration with Variational Bayes

3D Human Pose Estimation

3D Human Mesh Estimation

3D Human Mesh Estimation from Virtual Markers

Low-level Vision

Causal-IR: Learning Distortion Invariant Representation for Image Restoration from A Causality Perspective

Burstormer: Burst Image Restoration and Enhancement Transformer

Super-Resolution

Super-Resolution Neural Operator

Video Super-Resolution

Learning Trajectory-Aware Transformer for Video Super-Resolution

Denoising

Image Denoising

Masked Image Training for Generalizable Deep Image Denoising

Image Generation

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis

Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation

Few-shot Semantic Image Synthesis with Class Affinity Transfer

TopNet: Transformer-based Object Placement Network for Image Compositing

Video Generation

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Conditional Image-to-Video Generation with Latent Flow Diffusion Models

Video Understanding

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Frame Flexible Network

Masked Motion Encoding for Self-Supervised Video Representation Learning

MARLIN: Masked Autoencoder for facial video Representation LearnING

Action Detection

TriDet: Temporal Action Detection with Relative Boundary Modeling

Text Detection

DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting

Knowledge Distillation

Learning to Retain while Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation

Generic-to-Specific Distillation of Masked Autoencoders

Model Pruning

DepGraph: Towards Any Structural Pruning

Image Compression

Context-Based Trit-Plane Coding for Progressive Image Compression

Anomaly Detection

Deep Feature In-painting for Unsupervised Anomaly Detection in X-ray Images

3D Reconstruction

OReX: Object Reconstruction from Planar Cross-sections Using Neural Fields

SparsePose: Sparse-View Camera Pose Regression and Refinement

NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction

Vid2Avatar: 3D Avatar Reconstruction from Videos in the Wild via Self-supervised Scene Decomposition

To fit or not to fit: Model-based Face Reconstruction and Occlusion Segmentation from Weak Supervision

Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction

3D Cinemagraphy from a Single Image

Revisiting Rotation Averaging: Uncertainties and Robust Losses

FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction

A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-The-Wild Images

Depth Estimation

Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation

Trajectory Prediction

IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction

EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning

Lane Detection

Anchor3DLane: Learning to Regress 3D Anchors for Monocular 3D Lane Detection

BEV-LaneDet: An Efficient 3D Lane Detection Based on Virtual Camera via Key-Points

Image Captioning

ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing

Cross-Domain Image Captioning with Discriminative Finetuning

Model-Agnostic Gender Debiased Image Captioning

Visual Question Answering

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering

Sign Language Recognition

Continuous Sign Language Recognition with Correlation Network

  • Paper: https://arxiv.org/abs/2303.03202
  • Code: https://github.com/hulianyuyy/CorrNet

Video Prediction

MOSO: Decomposing MOtion, Scene and Object for Video Prediction

Novel View Synthesis

3D Video Loops from Asynchronous Input

Zero-Shot Learning

Bi-directional Distribution Alignment for Transductive Zero-Shot Learning

Semantic Prompt for Few-Shot Learning

  • Paper: None
  • Code: None

Stereo Matching

Iterative Geometry Encoding Volume for Stereo Matching

Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation

Feature Matching

Adaptive Spot-Guided Transformer for Consistent Local Feature Matching

Scene Graph Generation

Prototype-based Embedding Network for Scene Graph Generation

Implicit Neural Representations

Polynomial Implicit Neural Representations For Large Diverse Datasets

Image Quality Assessment

Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild

Datasets

Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes

Align and Attend: Multimodal Summarization with Dual Contrastive Losses

GeoNet: Benchmarking Unsupervised Adaptation across Geographies

CelebV-Text: A Large-Scale Facial Text-Video Dataset

Others

Interactive Segmentation as Gaussian Process Classification

Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger

SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries

SCOTCH and SODA: A Transformer Video Shadow Detection Framework

DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization

RelightableHands: Efficient Neural Relighting of Articulated Hand Models

Token Turing Machines

Single Image Backdoor Inversion via Robust Smoothed Classifiers

To fit or not to fit: Model-based Face Reconstruction and Occlusion Segmentation from Weak Supervision

HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics

A Whac-A-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others

Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation

Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression

UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy

Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation with Cross-Scale Distortion Awareness

Learning Neural Parametric Head Models

A Meta-Learning Approach to Predicting Performance and Data Requirements

MACARONS: Mapping And Coverage Anticipation with RGB Online Self-Supervision

Masked Images Are Counterfactual Samples for Robust Fine-tuning

HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling

Decompose, Adjust, Compose: Effective Normalization by Playing with Frequency for Domain Generalization

Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization

Unlearnable Clusters: Towards Label-agnostic Unlearnable Examples

Where We Are and What We're Looking At: Query Based Worldwide Image Geo-localization Using Hierarchies and Scenes

UniHCP: A Unified Model for Human-Centric Perceptions

CUDA: Convolution-based Unlearnable Datasets

AdaptiveMix: Robust Feature Representation via Shrinking Feature Space

Physical-World Optical Adversarial Attacks on 3D Face Recognition

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Intrinsic Physical Concepts Discovery with Object-Centric Predictive Models

  • Paper: None
  • Code: None

Sharpness-Aware Gradient Matching for Domain Generalization

Mind the Label-shift for Augmentation-based Graph Out-of-distribution Generalization

  • Paper: None
  • Code: None

Blind Video Deflickering by Neural Filtering with a Flawed Atlas

RiDDLE: Reversible and Diversified De-identification with Latent Encryptor

PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation

Upcycling Models under Domain and Category Shift

Modality-Agnostic Debiasing for Single Domain Generalization

Progressive Open Space Expansion for Open-Set Model Attribution

Dynamic Neural Network for Multi-Task Learning Searching across Diverse Network Topologies

GFPose: Learning 3D Human Pose Prior with Gradient Fields

PRISE: Demystifying Deep Lucas-Kanade with Strongly Star-Convex Constraints for Multimodel Image Alignment

Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings

Boundary Unlearning

ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing

Zero-shot Model Diagnosis

GeoNet: Benchmarking Unsupervised Adaptation across Geographies

Quantum Multi-Model Fitting

DivClust: Controlling Diversity in Deep Clustering

Neural Volumetric Memory for Visual Locomotion Control

MonoHuman: Animatable Human Neural Field from Monocular Video

Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion

Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification

HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering

On the Stability-Plasticity Dilemma of Class-Incremental Learning

Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning

VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution

Detecting and Grounding Multi-Modal Media Manipulation

Meta-causal Learning for Single Domain Generalization

Disentangling Writer and Character Styles for Handwriting Generation

DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects

Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision

Marching-Primitives: Shape Abstraction from Signed Distance Function

Towards Trustable Skin Cancer Diagnosis via Rewriting Model's Decision