GitHub - ikaijua/Awesome-AIResearch: Collection of AI-related research. Issues and pull requests are welcome.

This repo collects AI-related research.

Spatial Intelligence

Name	Description	Links	Publish Time
Behavior Vision Suite	BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation	Project website	2024

AI Agent

Name	Description	Links	Publish Time
TEN-Agent	TEN Agent is a realtime conversational AI agent powered by TEN. It seamlessly integrates the OpenAI Realtime API, RTC capabilities, and advanced features like weather updates, web search, computer vision, and Retrieval-Augmented Generation (RAG).	TEN-Agent	2024

Distributed Training Framework

Name	Description	Links	Publish Time
DeepSpeed	DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.	Github	-
Megatron-LM	Ongoing research training transformer models at scale.	Github	-

Robot

Name	Description	Links	Publish Time
huggingface/lerobot	State-of-the-art Machine Learning for Real-World Robotics in Pytorch	Github	2024
TidyBot	A household cleanup robot done by StanfordAILab.	GitHub	2023
Eureka	Human-Level Reward Design via Coding Large Language Models. Uses LLMs such as GPT-4 to perform in-context evolutionary optimization over reward code, which is then used to learn complex low-level manipulation tasks such as dexterous pen spinning.	Github	2023
NOIR	Neural Signal Operated Intelligent Robots for Everyday Activities. Stanford University	Project website	2023
robotics-survey/Awesome-Robotics-Foundation-Models	This repository is largely based on the following paper: Foundation Models in Robotics: Applications, Challenges, and the Future By Stanford University, Princeton University, UT Austin, NVIDIA, Scaled Foundations, Google DeepMind, TU Berlin, Shanghai Jiao Tong University	Github	2023
JeffreyYH/robotics-fm-survey	Survey paper of foundation models for robotics. Paper: Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. By CMU, Bosch Center for AI, SAIR Lab, Georgia Tech, FAIR at Meta, UC San Diego, Google DeepMind	Github	2023

Multi-modal LLM

Name	Description	Links	Publish Time
mPLUG-DocOwl	Modularized Multimodal Large Language Model for Document Understanding. By Alibaba Group	Github	2024

Vision-Language (VL) Model

Name	Description	Links	Publish Time
DeepSeek-VL	An open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios.	Github	2024
An Introduction to Vision-Language Modeling	An Introduction to Vision-Language Modeling. By Meta.	URL	2024
Insight-V	Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models. By S-Lab, NTU.	Github	2024

NeRF

Name	Description	Links	Publish Time
NeRF	Code release for NeRF (Neural Radiance Fields). Paper: https://arxiv.org/abs/2003.08934	Github	2020

3D Gaussian Splatting

Name	Description	Links	Publish Time
gaussian-splatting	Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering".	Github	2023

Brain Computer Interface

Name	Description	Links	Publish Time
TBC-TJU/MetaBCI	China’s first open-source platform for non-invasive brain computer interface. The project of MetaBCI is led by Prof. Minpeng Xu from Tianjin University, China.	Github	2022

LLM Datasets

Name	Description	Links	Publish Time
Awesome-LLMs-Datasets	Summarize existing representative LLMs text datasets.	Github	2024

LMMs Benchmark

Name	Description	Links	Publish Time
mathvista	A benchmark designed to combine challenges from diverse mathematical and visual tasks. By UCLA and Microsoft Research	Project website	2023
hallucination-leaderboard	Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents.	Github	2023
GAIA	A benchmark for General AI Assistants. By Meta-FAIR, Meta-GenAI, HuggingFace and AutoGPT	Project website	2023
microsoft/promptbench	A Unified Library for Evaluating and Understanding Large Language Models.	Github	2023

Summarization

Name	Description	Links	Publish Time
Summarization is (Almost) Dead	Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.	https://arxiv.org/pdf/2309.09558.pdf	2023

TTS

Name	Description	Links	Publish Time
F5-TTS	A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. By Shanghai Jiao Tong University.	Github	2024
fishaudio/fish-speech	Brand new TTS solution. Demo: https://fish.audio/	Github	2024
VoiceCraft	VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts. To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.	Github	2024
Mega-TTS 2	Given input text and reference audio, clones the timbre of the reference audio to generate speech corresponding to the text. By Zhejiang University and ByteDance. Paper: https://arxiv.org/abs/2307.07218	URL	2024
NaturalSpeech 3	Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. By Microsoft Research Asia. Paper: https://arxiv.org/abs/2403.03100	URL	2024
BASE TTS	BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. By Amazon. Paper: https://arxiv.org/abs/2402.08093	URL	2024
metavoice-src	Foundational model for human-like, expressive TTS. Zero-shot cloning for American & British voices, with 30s reference audio.	Github	2024
Bark	Multilingual Demo: https://huggingface.co/spaces/suno/bark Paper: https://arxiv.org/abs/2209.03143	Github	2023
XTTS	Multilingual Demo: https://huggingface.co/spaces/coqui/xtts	Github	2021
OpenVoice	ZH + EN Demo: https://huggingface.co/spaces/myshell-ai/OpenVoice Paper: https://arxiv.org/abs/2312.01479	Github	2023
TorToiSe TTS	English Demo: https://huggingface.co/spaces/Manmay/tortoise-tts Paper: https://arxiv.org/abs/2305.07243	Github	2022
GPT-SoVITS	Multilingual	Github
EmotiVoice	ZH + EN	Github	2023
MeloTTS	high-quality multi-lingual text-to-speech library by MyShell.ai. Support English, Spanish, French, Chinese, Japanese and Korean.	Github	2024
Tacotron 2	English Paper: https://arxiv.org/abs/1712.05884	Unofficial Repo:Github	GDrive
Silero	EN + DE + ES + EA	Github
StyleTTS 2	English Demo: https://huggingface.co/spaces/styletts2/styletts2 Paper: https://arxiv.org/abs/2306.07691	Github	2023
Amphion	Demo: https://huggingface.co/amphion Paper: https://arxiv.org/abs/2312.09911	Github	2023
VALL-E	Paper: https://arxiv.org/abs/2301.02111	Unofficial Repo:Github	2023
Piper	Multilingual	Github
WhisperSpeech	English, Polish Demo	Github	2023
HierSpeech++	KR + EN Demo: https://huggingface.co/spaces/LeeSangHoon/HierSpeech_TTS Paper: https://arxiv.org/abs/2311.12454	Github	2023
Glow-TTS	English Demo: https://jaywalnut310.github.io/glow-tts-demo/index.html Paper: https://arxiv.org/abs/2005.11129	Github	2020
xVASynth	Multilingual Demo: https://store.steampowered.com/app/1765720/xVASynth/ Paper: https://arxiv.org/abs/2009.14153	Github	2023
IMS-Toucan	Multilingual Demo: https://huggingface.co/spaces/Flux9665/IMS-Toucan Paper: https://arxiv.org/abs/2206.12229	Github	2023
Matcha-TTS	English Demo: https://huggingface.co/spaces/shivammehta25/Matcha-TTS Paper: https://arxiv.org/abs/2309.03199	Repo	2023
RAD-TTS	English Paper: https://openreview.net/pdf?id=0NQwnnwAORi	Github	2022
MahaTTS	English + Indic Demo: Colab	Github	2023
Neural-HMM TTS	English Demo: https://shivammehta25.github.io/Neural-HMM/ Paper: https://arxiv.org/abs/2108.13320	Repo	2021
pflowTTS	English Paper: https://openreview.net/pdf?id=zNA7u7wtIN	Unofficial Repo	2023
Pheme	English Demo: https://huggingface.co/spaces/PolyAI/pheme Paper: https://arxiv.org/abs/2401.02839	Github	2024
TTTS	ZH Demo: https://colab.research.google.com/github/adelacvg/ttts/blob/master/demo.ipynb	Github
VITS/MMS-TTS	English Demo: https://huggingface.co/spaces/kakao-enterprise/vits Paper: https://arxiv.org/abs/2106.06103	Github	2021
OverFlow TTS	English Demo: https://shivammehta25.github.io/OverFlow/ Paper: https://arxiv.org/abs/2211.06892	Github	2022

Image Generation

Name	Description	Links	Publish Time
AnyText	Multilingual Visual Text Generation And Editing. By Alibaba Group	Github	2023
InstantID	InstantID is a new state-of-the-art tuning-free method to achieve ID-Preserving generation with only single image, supporting various downstream tasks.	Github	2023
apple/ml-mgie	Guiding Instruction-based Image Editing via Multimodal Large Language Models. By Apple.	Github	2024
lllyasviel/IC-Light	IC-Light is a project to manipulate the illumination of images. Demo: https://huggingface.co/spaces/lllyasviel/IC-Light	Github	2024
Tencent/HunyuanDiT	A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding	Github	2024

Sign Language

Name	Description	Links	Publish Time
SignLLM/Prompt2Sign	Prompt2Sign is the first comprehensive multilingual sign language dataset. It uses tools to automate the acquisition and processing of sign language videos from the web, and is an evolving dataset that is efficient and lightweight, addressing earlier shortcomings.	Github	2024

Video Generation

Name	Description	Links	Publish Time
Lightricks/LTX-Video	LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time.	Github	2024
AILab-CVC/VideoGen-Eval	By Tencent. The Dawn of Video Generation: Preliminary Explorations with SORA-like Models	Github	2024
THUDM/CogVideo	CogVideoX is an open-source version of the video generation model	Github	2024
MusePose	MusePose is a diffusion-based and pose-guided virtual human video generation framework. By Tencent.	Github	2024
ProPainter	Improving Propagation and Transformer for Video Inpainting. S-Lab, Nanyang Technological University	Github	2023
Emu Edit/Emu video	Emu Edit is an AI generated image model that supports modifying local content of images through text; Emu Video is an AI generated video model that also supports text modification of local content in videos.	Project website	2023
PixelDance	A novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. By ByteDance Research	Project website	2023
MagicDance	Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer. By University of Southern California	Github	2023
TencentARC/MotionCtrl	A Unified and Flexible Motion Controller for Video Generation	Github	2023
DreaMoving	A Human Video Generation Framework based on Diffusion Models. By Alibaba Group	Github	2023
magicvideov2	Multi-Stage High-Aesthetic Video Generation by ByteDance	URL	2024
Boximator	Generating Rich and Controllable Motions for Video Synthesis. By ByteDance	URL	2024
fudan-generative-vision/champ	Controllable and Consistent Human Image Animation with 3D Parametric Guidance	Github	2024
TaoHuUMD/SurMo	Surface-based 4D Motion Modeling for Dynamic Human	Github	2024
ToonCrafter	A research paper for generative cartoon interpolation	Github	2024

Talking Face Synthesis

Name	Description	Links	Publish Time
PersonaTalk	PersonaTalk creates lip-sync visual dubbing while preserving individuals' talking style and facial details. Paper: https://arxiv.org/pdf/2409.05379	Project website	2024
Loopy	Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency. By Bytedance and Zhejiang University	Project website	2024
V-Express	V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images. By Tencent.	Github	2024
InstructAvatar	InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation. By Peking University	Project website	2024
X-LANCE/AniTalker	Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding	Github	2024
VASA-1	Lifelike Audio-Driven Talking Faces Generated in Real Time. By Microsoft. Paper: https://arxiv.org/abs/2404.10667	Project Website	2024
GeneFace	Generalized and High-Fidelity 3D Talking Face Synthesis. Zhejiang University, ByteDance	Github	2023
GAIA	Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image. GAIA (Generative AI for Avatar) eliminates the domain priors in talking avatar generation. By Microsoft	Project Website	2023

Video Comprehension

Name	Description	Links	Publish Time
VLMEvalKit	Open-source evaluation toolkit for large vision-language models (LVLMs); supports ~100 VLMs and 40+ benchmarks.	Github	2024
NVlabs/VILA	a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)	Github	2024
PKU-YuanGroup/Video-LLaVA	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	Github	2023
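evolvinglmms-lab/longva	Long Context Transfer from Language to Vision. Introductory article from Jiqizhixin (机器之心): "The strongest 7B long-video model! LongVA understands videos of over a thousand frames and tops multiple leaderboards"	Github	2024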
Vision-CAIR/MiniGPT4-video	Goldfish model for long video understanding and MiniGPT4-video for short video understanding. Goldfish_website	Github	2024

Data Cleaning

Name	Description	Links	Publish Time
cleanlab	The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.	Github

3D Generation

Name	Description	Links	Publish Time
DMV3D	Denoising Multi-View Diffusion using 3D Large Reconstruction Model. A single-stage approach for high-quality text-to-3D generation and single-image reconstruction in 30s. By Adobe, Stanford, etc	Project website	2023
Make-A-Character	High Quality Text-to-3D Character Generation within Minutes. By Alibaba	Github	2023

Object Detection

Name	Description	Links	Publish Time
facebookresearch/segment-anything-2	Demo: https://sam2.metademolab.com/demo, blog: https://ai.meta.com/blog/segment-anything-2-video/	Github	2024
open-mmlab/mmdetection	MMDetection is an open source object detection toolbox based on PyTorch.	Github
AILab-CVC/YOLO-World	Real-Time Open-Vocabulary Object Detection. By Tencent.	Github	2024
LiheYoung/Depth-Anything	Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation. By The University of Hong Kong, TikTok, Zhejiang Lab, and Zhejiang University	Github	2024
t-rex	Towards Generic Object Detection via Text-Visual Prompt Synergy.	Github	2024

Image/Video Enhancements

Name	Description	Links	Publish Time
CodeFormer	Towards Robust Blind Face Restoration with Codebook Lookup Transformer (NeurIPS 2022). By S-Lab, Nanyang Technological University	Github	2023

Super-Resolution

Name	Description	Links	Publish Time
Upscale-A-Video	Upscale-A-Video is a diffusion-based model that upscales videos by taking the low-resolution video and text prompts as inputs. S-Lab, Nanyang Technological University	Github	2023
ComfyUI-SUPIR	SUPIR upscaling wrapper for ComfyUI	Github	2024
APISR	APISR: Anime Production Inspired Real-World Anime Super-Resolution (CVPR 2024). APISR aims at restoring and enhancing low-quality low-resolution anime images and video sources with various degradations from real-world scenarios.	Github	2024
EvTexture	Event-driven Texture Enhancement for Video Super-Resolution. By University of Science and Technology of China	Github	2024
jnjaby/KEEP	Kalman-Inspired Feature Propagation for Video Face Super-Resolution (ECCV 2024). By S-Lab, Nanyang Technological University.	Github

Virtual Try-On

Name	Description	Links	Publish Time
OutfitAnyone	Outfit Anyone: Ultra-high quality virtual try-on for Any Clothing and Any Person. Institute for Intelligent Computing, Alibaba Group	Github	2023
OOTDiffusion	Official implementation of OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on	Github Demo:https://ootd.ibot.cn/	2024
ViViD	ViViD: Video Virtual Try-on using Diffusion Models. By Alibaba	Github	2024

AI Music Generation

Name	Description	Links	Publish Time
StemGen	StemGen: A music generation model that listens, ByteDance Inc	Project Website	2023

RAG (Retrieval-Augmented Generation)

Name	Description	Links	Publish Time
microsoft/graphrag	A modular graph-based Retrieval-Augmented Generation (RAG) system	Github	2024
Retrieval-Augmented Generation for Large Language Models: A Survey	Shanghai Research Institute for Intelligent Autonomous Systems	URL	2023

OCR

Name	Description	Links	Publish Time
surya	Surya is a multilingual document OCR toolkit. It can do: Accurate line-level text detection	Github	2024
Nutlope/llama-ocr	Document to Markdown OCR library with Llama 3.2 vision	Github	2024

Visual Speech Processing

Name	Description	Links	Publish Time
sally-sh/vsp-llm	Visual Speech Processing incorporated with LLMs. Paper: https://arxiv.org/abs/2402.15151v1	Github	2024

3D Human Pose Estimation

Name	Description	Links	Publish Time
NationalGAILab/HoT	Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation	Github	2024

Computer Vision

Name	Description	Links	Publish Time
GeneOH-Diffusion	Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion	Github	2024
Efficient-Large-Model/VILA	VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)	Github	2024
roboflow/supervision	We write your reusable computer vision tools.	Github	2023

Star History

If you like this project, you can support us with a donation. Thank you for your support!
