
Collection of AI-related research. Issues and pull requests are welcome.


ikaijua/Awesome-AIResearch


This repo collects AI-related research.

Spatial Intelligence

Name Description Links Publish Time
Behavior Vision Suite BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation Project website 2024

AI Agent

Name Description Links Publish Time
TEN-Agent TEN Agent is a realtime conversational AI agent powered by TEN. It seamlessly integrates the OpenAI Realtime API, RTC capabilities, and advanced features like weather updates, web search, computer vision, and Retrieval-Augmented Generation (RAG). TEN-Agent GitHub Repo stars 2024

Distributed Training Framework

Name Description Links Publish Time
DeepSpeed DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. Github GitHub Repo stars -
Megatron-LM Ongoing research training transformer models at scale. Github GitHub Repo stars -

Robot

Name Description Links Publish Time
huggingface/lerobot State-of-the-art Machine Learning for Real-World Robotics in Pytorch Github GitHub Repo stars 2024
TidyBot A household cleanup robot done by StanfordAILab. GitHub GitHub Repo stars 2023
Eureka Human-Level Reward Design via Coding Large Language Models. Uses LLMs such as GPT-4 to perform in-context evolutionary optimization over reward code, and harnesses the resulting rewards to learn complex low-level manipulation tasks such as dexterous pen spinning. Github GitHub Repo stars 2023
NOIR Neural Signal Operated Intelligent Robots for Everyday Activities. Stanford University Project website 2023
robotics-survey/Awesome-Robotics-Foundation-Models This repository is largely based on the following paper: Foundation Models in Robotics: Applications, Challenges, and the Future By Stanford University, Princeton University, UT Austin, NVIDIA, Scaled Foundations, Google DeepMind, TU Berlin, Shanghai Jiao Tong University Github GitHub Repo stars 2023
JeffreyYH/robotics-fm-survey Survey paper of foundation models for robotics. Paper: Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. By CMU, Bosch Center for AI, SAIR Lab, Georgia Tech, FAIR at Meta, UC San Diego, Google DeepMind Github GitHub Repo stars 2023

Multi-modal LLM

Name Description Links Publish Time
mPLUG-DocOwl Modularized Multimodal Large Language Model for Document Understanding. By Alibaba Group Github GitHub Repo stars 2024

Vision-Language (VL) Model

Name Description Links Publish Time
DeepSeek-VL An open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. Github GitHub Repo stars 2024
An Introduction to Vision-Language Modeling An Introduction to Vision-Language Modeling. By Meta. URL 2024
Insight-V Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models. By S-Lab, NTU. Github GitHub Repo stars 2024

NeRF

Name Description Links Publish Time
NeRF Code release for NeRF (Neural Radiance Fields). Paper: https://arxiv.org/abs/2003.08934 Github GitHub Repo stars 2020

3D Gaussian Splatting

Name Description Links Publish Time
gaussian-splatting Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering". Github GitHub Repo stars 2023

Brain Computer Interface

Name Description Links Publish Time
TBC-TJU/MetaBCI China’s first open-source platform for non-invasive brain computer interface. The project of MetaBCI is led by Prof. Minpeng Xu from Tianjin University, China. Github GitHub Repo stars 2022

LLM Datasets

Name Description Links Publish Time
Awesome-LLMs-Datasets Summarize existing representative LLMs text datasets. Github GitHub Repo stars 2024

LMMs Benchmark

Name Description Links Publish Time
mathvista A benchmark designed to combine challenges from diverse mathematical and visual tasks. By UCLA and Microsoft Research Project website 2023
hallucination-leaderboard Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents. Github GitHub Repo stars 2023
GAIA A benchmark for General AI Assistants. By Meta-FAIR, Meta-GenAI, HuggingFace and AutoGPT Project website 2023
microsoft/promptbench A Unified Library for Evaluating and Understanding Large Language Models. Github GitHub Repo stars 2023

Summarization

Name Description Links Publish Time
Summarization is (Almost) Dead Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. https://arxiv.org/pdf/2309.09558.pdf 2023

TTS

Name Description Links Publish Time
F5-TTS A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. By Shanghai Jiao Tong University. Github GitHub Repo stars 2024
fishaudio/fish-speech Brand new TTS solution. Demo: https://fish.audio/ Github GitHub Repo stars 2024
VoiceCraft VoiceCraft is a token infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts. To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference. Github GitHub Repo stars 2024
Mega-TTS 2 Input text and reference audio, clone the timbre of the reference audio to generate speech corresponding to the text. By Zhejiang University and ByteDance. Paper:https://arxiv.org/abs/2307.07218 URL 2024
NaturalSpeech 3 Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. By Microsoft Research Asia
paper: https://arxiv.org/abs/2403.03100
URL 2024
BASE TTS BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. By Amazon.
paper:https://arxiv.org/abs/2402.08093
URL 2024
metavoice-src Foundational model for human-like, expressive TTS. Zero-shot cloning for American & British voices, with 30s reference audio. Github GitHub Repo stars 2024
Bark Multilingual
Demo: https://huggingface.co/spaces/suno/bark
Paper: https://arxiv.org/abs/2209.03143
Github GitHub Repo stars 2023
XTTS Multilingual
Demo: https://huggingface.co/spaces/coqui/xtts
Github GitHub Repo stars 2021
OpenVoice ZH + EN
Demo: https://huggingface.co/spaces/myshell-ai/OpenVoice
Paper: https://arxiv.org/abs/2312.01479
Github GitHub Repo stars 2023
TorToiSe TTS English
Demo: https://huggingface.co/spaces/Manmay/tortoise-tts
Paper:https://arxiv.org/abs/2305.07243
Github GitHub Repo stars 2022
GPT-SoVITS Multilingual Github GitHub Repo stars
EmotiVoice ZH + EN Github GitHub Repo stars 2023
MeloTTS High-quality multilingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese and Korean. Github GitHub Repo stars 2024
Tacotron 2 English
Paper: https://arxiv.org/abs/1712.05884
Unofficial Repo:Github GitHub Repo stars GDrive
Silero EM + DE + ES + EA Github GitHub Repo stars
StyleTTS 2 English
Demo: https://huggingface.co/spaces/styletts2/styletts2
Paper:https://arxiv.org/abs/2306.07691
Github GitHub Repo stars 2023
Amphion Demo: https://huggingface.co/amphion
Paper: https://arxiv.org/abs/2312.09911
Github GitHub Repo stars 2023
VALL-E
Paper: https://arxiv.org/abs/2301.02111
Unofficial Repo:Github GitHub Repo stars 2023
Piper Multilingual Github GitHub Repo stars
WhisperSpeech English, Polish
Demo
Github GitHub Repo stars 2023
HierSpeech++ KR + EN
Demo:https://huggingface.co/spaces/LeeSangHoon/HierSpeech_TTS
Paper:https://arxiv.org/abs/2311.12454
Github GitHub Repo stars 2023
Glow-TTS English
Demo:https://jaywalnut310.github.io/glow-tts-demo/index.html
Paper:https://arxiv.org/abs/2005.11129
Github GitHub Repo stars 2020
xVASynth Multilingual
Demo:https://store.steampowered.com/app/1765720/xVASynth/
Paper:https://arxiv.org/abs/2009.14153
Github GitHub Repo stars 2023
IMS-Toucan Multilingual
Demo: https://huggingface.co/spaces/Flux9665/IMS-Toucan
Paper: https://arxiv.org/abs/2206.12229
Github GitHub Repo stars 2023
Matcha-TTS English
Demo:https://huggingface.co/spaces/shivammehta25/Matcha-TTS
Paper:https://arxiv.org/abs/2309.03199
Repo GitHub Repo stars 2023
RAD-TTS English
Paper:https://openreview.net/pdf?id=0NQwnnwAORi
Github GitHub Repo stars 2022
MahaTTS English + Indic
Demo: Colab
Github GitHub Repo stars 2023
Neural-HMM TTS English
Demo:https://shivammehta25.github.io/Neural-HMM/
Paper:https://arxiv.org/abs/2108.13320
Repo GitHub Repo stars 2021
pflowTTS English
Paper:https://openreview.net/pdf?id=zNA7u7wtIN
Unofficial Repo GitHub Repo stars 2023
Pheme English
Demo:https://huggingface.co/spaces/PolyAI/pheme
Paper:https://arxiv.org/abs/2401.02839
Github GitHub Repo stars 2024
TTTS ZH
Demo:https://colab.research.google.com/github/adelacvg/ttts/blob/master/demo.ipynb
Github GitHub Repo stars
VITS/ MMS-TTS English
Demo:https://huggingface.co/spaces/kakao-enterprise/vits
Paper:https://arxiv.org/abs/2106.06103
Github 2021
OverFlow TTS English
Demo:https://shivammehta25.github.io/OverFlow/
Paper: https://arxiv.org/abs/2211.06892
Github GitHub Repo stars 2022

Image Generation

Name Description Links Publish Time
AnyText Multilingual Visual Text Generation And Editing. By Alibaba Group Github GitHub Repo stars 2023
InstantID InstantID is a new state-of-the-art tuning-free method to achieve ID-Preserving generation with only single image, supporting various downstream tasks. Github GitHub Repo stars 2023
apple/ml-mgie Guiding Instruction-based Image Editing via Multimodal Large Language Models. By Apple. Github GitHub Repo stars 2024
lllyasviel/IC-Light IC-Light is a project to manipulate the illumination of images. Demo:https://huggingface.co/spaces/lllyasviel/IC-Light Github GitHub Repo stars 2024
Tencent/HunyuanDiT A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding Github GitHub Repo stars 2024

Sign Language

Name Description Links Publish Time
SignLLM/Prompt2Sign Prompt2Sign is the first comprehensive multilingual sign language dataset. It uses tools to automate the acquisition and processing of sign language videos from the web, and is an evolving dataset that is efficient and lightweight, addressing shortcomings of previous datasets. Github GitHub Repo stars 2024

Video Generation

Name Description Links Publish Time
Lightricks/LTX-Video LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time. Github GitHub Repo stars 2024
AILab-CVC/VideoGen-Eval By Tencent. The Dawn of Video Generation: Preliminary Explorations with SORA-like Models Github GitHub Repo stars 2024
THUDM/CogVideo CogVideoX is an open-source version of the video generation model Github GitHub Repo stars 2024
MusePose MusePose is a diffusion-based and pose-guided virtual human video generation framework. By Tencent. Github GitHub Repo stars 2024
ProPainter Improving Propagation and Transformer for Video Inpainting. S-Lab, Nanyang Technological University Github GitHub Repo stars 2023
Emu Edit/Emu video Emu Edit is an AI generated image model that supports modifying local content of images through text; Emu Video is an AI generated video model that also supports text modification of local content in videos. Project website 2023
PixelDance A novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. By ByteDance Research Project website 2023
MagicDance Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer. By University of Southern California Github GitHub Repo stars 2023
TencentARC/ MotionCtrl A Unified and Flexible Motion Controller for Video Generation Github GitHub Repo stars 2023
DreaMoving A Human Video Generation Framework based on Diffusion Models. By Alibaba Group Github GitHub Repo stars 2023
magicvideov2 Multi-Stage High-Aesthetic Video Generation by ByteDance URL 2024
Boximator Generating Rich and Controllable Motions for Video Synthesis. By ByteDance URL 2024
fudan-generative-vision/champ Controllable and Consistent Human Image Animation with 3D Parametric Guidance Github GitHub Repo stars 2024
TaoHuUMD/SurMo Surface-based 4D Motion Modeling for Dynamic Human Github GitHub Repo stars 2024
ToonCrafter A research paper for generative cartoon interpolation Github GitHub Repo stars 2024

Talking Face Synthesis

Name Description Links Publish Time
PersonaTalk PersonaTalk creates lip-sync visual dubbing while preserving individuals' talking style and facial details. Paper: https://arxiv.org/pdf/2409.05379 Project website 2024
Loopy Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency. By Bytedance and Zhejiang University Project website 2024
V-Express V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images. By Tencent. Github GitHub Repo stars 2024
InstructAvatar InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation. By Peking University Project website 2024
X-LANCE/AniTalker Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding Github GitHub Repo stars 2024
VASA-1 Lifelike Audio-Driven Talking Faces Generated in Real Time. By Microsoft. paper:https://arxiv.org/abs/2404.10667 Project Website 2024
GeneFace Generalized and High-Fidelity 3D Talking Face Synthesis. Zhejiang University, ByteDance Github GitHub Repo stars 2023
GAIA Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image. GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation. By Microsoft Project Website 2023

Video Comprehension

Name Description Links Publish Time
VLMEvalKit Open-source evaluation toolkit of large vision-language models (LVLMs), support ~100 VLMs, 40+ benchmarks. Github GitHub Repo stars 2024
NVlabs/VILA a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops) Github GitHub Repo stars 2024
PKU-YuanGroup/Video-LLaVA Video-LLaVA: Learning United Visual Representation by Alignment Before Projection Github GitHub Repo stars 2023
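evolvinglmms-lab/longva Long Context Transfer from Language to Vision.
Introductory article:
Synced (机器之心): "The strongest 7B long-video model! LongVA understands videos of over a thousand frames and tops multiple leaderboards"
Github GitHub Repo stars 2024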
Vision-CAIR/MiniGPT4-video Goldfish model for long video understanding and MiniGPT4-video for short video understanding. Goldfish_website Github GitHub Repo stars 2024

Data Cleaning

Name Description Links Publish Time
cleanlab The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels. Github GitHub Repo stars

3D Generation

Name Description Links Publish Time
DMV3D Denoising Multi-View Diffusion using 3D Large Reconstruction Model. A single-stage approach for high-quality text-to-3D generation and single-image reconstruction in 30s. By Adobe, Stanford, etc Project website 2023
Make-A-Character High Quality Text-to-3D Character Generation within Minutes. By Alibaba Github GitHub Repo stars 2023

Object Detection

Name Description Links Publish Time
facebookresearch/segment-anything-2 Demo: https://sam2.metademolab.com/demo, blog: https://ai.meta.com/blog/segment-anything-2-video/ Github GitHub Repo stars 2024
open-mmlab/mmdetection MMDetection is an open source object detection toolbox based on PyTorch. Github GitHub Repo stars
AILab-CVC/YOLO-World Real-Time Open-Vocabulary Object Detection. By Tencent. Github GitHub Repo stars 2024
LiheYoung/Depth-Anything Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation. By The University of Hong Kong, TikTok, Zhejiang Lab, and Zhejiang University Github GitHub Repo stars 2024
t-rex Towards Generic Object Detection via Text-Visual Prompt Synergy. Github GitHub Repo stars 2024

Image/Video Enhancements

Name Description Links Publish Time
CodeFormer Towards Robust Blind Face Restoration with Codebook Lookup Transformer (NeurIPS 2022). By S-Lab, Nanyang Technological University Github GitHub Repo stars 2023

Super-Resolution

Name Description Links Publish Time
Upscale-A-Video Upscale-A-Video is a diffusion-based model that upscales videos by taking the low-resolution video and text prompts as inputs. S-Lab, Nanyang Technological University Github GitHub Repo stars 2023
ComfyUI-SUPIR SUPIR upscaling wrapper for ComfyUI Github GitHub Repo stars 2024
APISR APISR: Anime Production Inspired Real-World Anime Super-Resolution (CVPR 2024). APISR aims at restoring and enhancing low-quality low-resolution anime images and video sources with various degradations from real-world scenarios. Github GitHub Repo stars 2024
EvTexture Event-driven Texture Enhancement for Video Super-Resolution. By University of Science and Technology of China Github GitHub Repo stars 2024
jnjaby/KEEP Kalman-Inspired Feature Propagation for Video Face Super-Resolution. By S-Lab, Nanyang Technological University. ECCV 2024 Github GitHub Repo stars

Virtual Try-On

Name Description Links Publish Time
OutfitAnyone Outfit Anyone: Ultra-high quality virtual try-on for Any Clothing and Any Person. Institute for Intelligent Computing, Alibaba Group Github GitHub Repo stars 2023
OOTDiffusion Official implementation of OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on Github GitHub Repo stars Demo:https://ootd.ibot.cn/ 2024
ViViD ViViD: Video Virtual Try-on using Diffusion Models. By Alibaba Github GitHub Repo stars 2024

AI Music Generation

Name Description Links Publish Time
StemGen StemGen: A music generation model that listens. By ByteDance Inc Project Website 2023

RAG (Retrieval-Augmented Generation)

Name Description Links Publish Time
microsoft/graphrag A modular graph-based Retrieval-Augmented Generation (RAG) system Github GitHub Repo stars 2024
Retrieval-Augmented Generation for Large Language Models: A Survey Shanghai Research Institute for Intelligent Autonomous Systems URL 2023

OCR

Name Description Links Publish Time
surya Surya is a multilingual document OCR toolkit. Its features include accurate line-level text detection. Github GitHub Repo stars 2024
Nutlope/llama-ocr Document to Markdown OCR library with Llama 3.2 vision Github GitHub Repo stars 2024

Visual Speech Processing

Name Description Links Publish Time
sally-sh/vsp-llm Visual Speech Processing incorporated with LLMs
paper:https://arxiv.org/abs/2402.15151v1
Github GitHub Repo stars 2024

3D Human Pose Estimation

Name Description Links Publish Time
NationalGAILab/HoT Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation Github GitHub Repo stars 2024

Computer Vision

Name Description Links Publish Time
GeneOH-Diffusion Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion Github GitHub Repo stars 2024
Efficient-Large-Model/VILA VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops) Github GitHub Repo stars 2024
roboflow/supervision We write your reusable computer vision tools. Github GitHub Repo stars 2023


Buy Me A Coffee

If you like this project, you can tip us to show your support. Thank you!
