This repo collects AI-related research.
Name | Description | Links | Publish Time |
---|---|---|---|
Behavior Vision Suite | BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation | Project website | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
TEN-Agent | TEN Agent is a realtime conversational AI agent powered by TEN. It seamlessly integrates the OpenAI Realtime API, RTC capabilities, and advanced features like weather updates, web search, computer vision, and Retrieval-Augmented Generation (RAG). | TEN-Agent | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
DeepSpeed | DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. | Github | - |
Megatron-LM | Ongoing research training transformer models at scale. | Github | - |
Name | Description | Links | Publish Time |
---|---|---|---|
huggingface/lerobot | State-of-the-art Machine Learning for Real-World Robotics in Pytorch | Github | 2024 |
TidyBot | A household cleanup robot done by StanfordAILab. | GitHub | 2023 |
Eureka | Human-Level Reward Design via Coding Large Language Models. Uses LLMs such as GPT-4 to perform in-context evolutionary optimization over reward code, and applies the resulting rewards to learn complex low-level manipulation tasks such as dexterous pen spinning. | Github | 2023 |
NOIR | Neural Signal Operated Intelligent Robots for Everyday Activities. Stanford University | Project website | 2023 |
robotics-survey/Awesome-Robotics-Foundation-Models | This repository is largely based on the following paper: Foundation Models in Robotics: Applications, Challenges, and the Future By Stanford University, Princeton University, UT Austin, NVIDIA, Scaled Foundations, Google DeepMind, TU Berlin, Shanghai Jiao Tong University | Github | 2023 |
JeffreyYH/robotics-fm-survey | Survey paper on foundation models for robotics. Paper: Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. By CMU, Bosch Center for AI, SAIR Lab, Georgia Tech, FAIR at Meta, UC San Diego, Google DeepMind | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
mPLUG-DocOwl | Modularized Multimodal Large Language Model for Document Understanding. By Alibaba Group | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
DeepSeek-VL | An open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. DeepSeek-VL possesses general multimodal understanding capabilities, capable of processing logical diagrams, web pages, formula recognition, scientific literature, natural images, and embodied intelligence in complex scenarios. | Github | 2024 |
An Introduction to Vision-Language Modeling | An Introduction to Vision-Language Modeling. By Meta. | URL | 2024 |
Insight-V | Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models. By S-Lab, NTU. | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
NeRF | Code release for NeRF (Neural Radiance Fields). Paper: https://arxiv.org/abs/2003.08934 | Github | 2020 |
Name | Description | Links | Publish Time |
---|---|---|---|
gaussian-splatting | Original reference implementation of "3D Gaussian Splatting for Real-Time Radiance Field Rendering". | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
TBC-TJU/MetaBCI | China’s first open-source platform for non-invasive brain computer interface. The project of MetaBCI is led by Prof. Minpeng Xu from Tianjin University, China. | Github | 2022 |
Name | Description | Links | Publish Time |
---|---|---|---|
Awesome-LLMs-Datasets | Summarize existing representative LLMs text datasets. | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
mathvista | A benchmark designed to combine challenges from diverse mathematical and visual tasks. By UCLA and Microsoft Research | Project website | 2023 |
hallucination-leaderboard | Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents. | Github | 2023 |
GAIA | A benchmark for General AI Assistants. By Meta-FAIR, Meta-GenAI, HuggingFace and AutoGPT | Project website | 2023 |
microsoft/promptbench | A Unified Library for Evaluating and Understanding Large Language Models. | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
Summarization is (Almost) Dead | Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models. | https://arxiv.org/pdf/2309.09558.pdf | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
F5-TTS | A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. By Shanghai Jiao Tong University. | Github | 2024 |
fishaudio/fish-speech | Brand new TTS solution. Demo: https://fish.audio/ | Github | 2024 |
VoiceCraft | VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data including audiobooks, internet videos, and podcasts. To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference. | Github | 2024 |
Mega-TTS 2 | Input text and reference audio, clone the timbre of the reference audio to generate speech corresponding to the text. By Zhejiang University and ByteDance. Paper:https://arxiv.org/abs/2307.07218 | URL | 2024 |
NaturalSpeech 3 | Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models. By Microsoft Research Asia. Paper: https://arxiv.org/abs/2403.03100 | URL | 2024 |
BASE TTS | BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. By Amazon. Paper: https://arxiv.org/abs/2402.08093 | URL | 2024 |
metavoice-src | Foundational model for human-like, expressive TTS. Zero-shot cloning for American & British voices, with 30s reference audio. | Github | 2024 |
Bark | Multilingual Demo: https://huggingface.co/spaces/suno/bark Paper: https://arxiv.org/abs/2209.03143 | Github | 2023 |
XTTS | Multilingual Demo: https://huggingface.co/spaces/coqui/xtts | Github | 2021 |
OpenVoice | ZH + EN Demo: https://huggingface.co/spaces/myshell-ai/OpenVoice Paper: https://arxiv.org/abs/2312.01479 | Github | 2023 |
TorToiSe TTS | English Demo: https://huggingface.co/spaces/Manmay/tortoise-tts Paper: https://arxiv.org/abs/2305.07243 | Github | 2022 |
GPT-SoVITS | Multilingual | Github | |
EmotiVoice | ZH + EN | Github | 2023 |
MeloTTS | High-quality multilingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese and Korean. | Github | 2024 |
Tacotron 2 | English Paper: https://arxiv.org/abs/1712.05884 | Unofficial Repo: Github | GDrive |
Silero | EM + DE + ES + EA | Github | |
StyleTTS 2 | English Demo: https://huggingface.co/spaces/styletts2/styletts2 Paper: https://arxiv.org/abs/2306.07691 | Github | 2023 |
Amphion | Demo: https://huggingface.co/amphion Paper: https://arxiv.org/abs/2312.09911 | Github | 2023 |
VALL-E | Paper: https://arxiv.org/abs/2301.02111 | Unofficial Repo: Github | 2023 |
Piper | Multilingual | Github | |
WhisperSpeech | English, Polish Demo | Github | 2023 |
HierSpeech++ | KR + EN Demo: https://huggingface.co/spaces/LeeSangHoon/HierSpeech_TTS Paper: https://arxiv.org/abs/2311.12454 | Github | 2023 |
Glow-TTS | English Demo: https://jaywalnut310.github.io/glow-tts-demo/index.html Paper: https://arxiv.org/abs/2005.11129 | Github | 2020 |
xVASynth | Multilingual Demo: https://store.steampowered.com/app/1765720/xVASynth/ Paper: https://arxiv.org/abs/2009.14153 | Github | 2023 |
IMS-Toucan | Multilingual Demo: https://huggingface.co/spaces/Flux9665/IMS-Toucan Paper: https://arxiv.org/abs/2206.12229 | Github | 2023 |
Matcha-TTS | English Demo: https://huggingface.co/spaces/shivammehta25/Matcha-TTS Paper: https://arxiv.org/abs/2309.03199 | Repo | 2023 |
RAD-TTS | English Paper: https://openreview.net/pdf?id=0NQwnnwAORi | Github | 2022 |
MahaTTS | English + Indic Demo: Colab | Github | 2023 |
Neural-HMM TTS | English Demo: https://shivammehta25.github.io/Neural-HMM/ Paper: https://arxiv.org/abs/2108.13320 | Repo | 2021 |
pflowTTS | English Paper: https://openreview.net/pdf?id=zNA7u7wtIN | Unofficial Repo | 2023 |
Pheme | English Demo: https://huggingface.co/spaces/PolyAI/pheme Paper: https://arxiv.org/abs/2401.02839 | Github | 2024 |
TTTS | ZH Demo: https://colab.research.google.com/github/adelacvg/ttts/blob/master/demo.ipynb | Github | |
VITS/ MMS-TTS | English Demo: https://huggingface.co/spaces/kakao-enterprise/vits Paper: https://arxiv.org/abs/2106.06103 | Github | 2021 |
OverFlow TTS | English Demo: https://shivammehta25.github.io/OverFlow/ Paper: https://arxiv.org/abs/2211.06892 | Github | 2022 |
Name | Description | Links | Publish Time |
---|---|---|---|
AnyText | Multilingual Visual Text Generation And Editing. By Alibaba Group | Github | 2023 |
InstantID | InstantID is a new state-of-the-art tuning-free method to achieve ID-Preserving generation with only single image, supporting various downstream tasks. | Github | 2023 |
apple/ml-mgie | Guiding Instruction-based Image Editing via Multimodal Large Language Models. By Apple. | Github | 2024 |
lllyasviel/IC-Light | IC-Light is a project to manipulate the illumination of images. Demo:https://huggingface.co/spaces/lllyasviel/IC-Light | Github | 2024 |
Tencent/HunyuanDiT | A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
SignLLM/Prompt2Sign | Prompt2Sign is the first comprehensive multilingual sign language dataset. It uses tools to automate the acquisition and processing of sign language videos from the web, and is an evolving dataset that is efficient and lightweight, addressing shortcomings of earlier datasets. | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
Lightricks/LTX-Video | LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time. | Github | 2024 |
AILab-CVC/VideoGen-Eval | By Tencent. The Dawn of Video Generation: Preliminary Explorations with SORA-like Models | Github | 2024 |
THUDM/CogVideo | CogVideoX is an open-source version of the video generation model | Github | 2024 |
MusePose | MusePose is a diffusion-based and pose-guided virtual human video generation framework. By Tencent. | Github | 2024 |
ProPainter | Improving Propagation and Transformer for Video Inpainting. S-Lab, Nanyang Technological University | Github | 2023 |
Emu Edit/Emu video | Emu Edit is an AI generated image model that supports modifying local content of images through text; Emu Video is an AI generated video model that also supports text modification of local content in videos. | Project website | 2023 |
PixelDance | A novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. By ByteDance Research | Project website | 2023 |
MagicDance | Realistic Human Dance Video Generation with Motions & Facial Expressions Transfer. By University of Southern California | Github | 2023 |
TencentARC/ MotionCtrl | A Unified and Flexible Motion Controller for Video Generation | Github | 2023 |
DreaMoving | A Human Video Generation Framework based on Diffusion Models. By Alibaba Group | Github | 2023 |
magicvideov2 | Multi-Stage High-Aesthetic Video Generation by ByteDance | URL | 2024 |
Boximator | Generating Rich and Controllable Motions for Video Synthesis. By ByteDance | URL | 2024 |
fudan-generative-vision/champ | Controllable and Consistent Human Image Animation with 3D Parametric Guidance | Github | 2024 |
TaoHuUMD/SurMo | Surface-based 4D Motion Modeling for Dynamic Human | Github | 2024 |
ToonCrafter | A research paper for generative cartoon interpolation | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
PersonaTalk | PersonaTalk creates lip-sync visual dubbing while preserving individuals' talking style and facial details. Paper: https://arxiv.org/pdf/2409.05379 | Project website | 2024 |
Loopy | Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency. By Bytedance and Zhejiang University | Project website | 2024 |
V-Express | V-Express aims to generate a talking head video under the control of a reference image, an audio, and a sequence of V-Kps images. By Tencent. | Github | 2024 |
InstructAvatar | InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation. By Peking University | Project website | 2024 |
X-LANCE/AniTalker | Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding | Github | 2024 |
VASA-1 | Lifelike Audio-Driven Talking Faces Generated in Real Time. By Microsoft. Paper: https://arxiv.org/abs/2404.10667 | Project Website | 2024 |
GeneFace | Generalized and High-Fidelity 3D Talking Face Synthesis. Zhejiang University, ByteDance | Github | 2023 |
GAIA | GAIA (Generative AI for Avatar) performs zero-shot talking avatar generation, synthesizing natural talking videos from speech and a single portrait image while eliminating the domain priors in talking avatar generation. By Microsoft | Project Website | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
VLMEvalKit | Open-source evaluation toolkit of large vision-language models (LVLMs), support ~100 VLMs, 40+ benchmarks. | Github | 2024 |
NVlabs/VILA | a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops) | Github | 2024 |
PKU-YuanGroup/Video-LLaVA | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Github | 2023 |
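evolvinglmms-lab/longva | Long Context Transfer from Language to Vision. Introductory article from Synced (机器之心): "The strongest 7B long-video model! LongVA understands videos of over a thousand frames and tops multiple leaderboards" | Github | 2024 |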
Vision-CAIR/MiniGPT4-video | Goldfish model for long video understanding and MiniGPT4-video for short video understanding. Goldfish_website | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
cleanlab | The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels. | Github | |
Name | Description | Links | Publish Time |
---|---|---|---|
DMV3D | Denoising Multi-View Diffusion using 3D Large Reconstruction Model. A single-stage approach for high-quality text-to-3D generation and single-image reconstruction in 30s. By Adobe, Stanford, etc | Project website | 2023 |
Make-A-Character | High Quality Text-to-3D Character Generation within Minutes. By Alibaba | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
facebookresearch/segment-anything-2 | Demo: https://sam2.metademolab.com/demo, blog: https://ai.meta.com/blog/segment-anything-2-video/ | Github | 2024 |
open-mmlab/mmdetection | MMDetection is an open source object detection toolbox based on PyTorch. | Github | |
AILab-CVC/YOLO-World | Real-Time Open-Vocabulary Object Detection. By Tencent. | Github | 2024 |
LiheYoung/Depth-Anything | Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation. By The University of Hong Kong, TikTok, Zhejiang Lab, and Zhejiang University | Github | 2024 |
t-rex | Towards Generic Object Detection via Text-Visual Prompt Synergy. | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
CodeFormer | Towards Robust Blind Face Restoration with Codebook Lookup Transformer (NeurIPS 2022). By S-Lab, Nanyang Technological University | Github | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
Upscale-A-Video | Upscale-A-Video is a diffusion-based model that upscales videos by taking the low-resolution video and text prompts as inputs. S-Lab, Nanyang Technological University | Github | 2023 |
ComfyUI-SUPIR | SUPIR upscaling wrapper for ComfyUI | Github | 2024 |
APISR | APISR: Anime Production Inspired Real-World Anime Super-Resolution (CVPR 2024). APISR aims at restoring and enhancing low-quality low-resolution anime images and video sources with various degradations from real-world scenarios. | Github | 2024 |
EvTexture | Event-driven Texture Enhancement for Video Super-Resolution. By University of Science and Technology of China | Github | 2024 |
jnjaby/KEEP | Kalman-Inspired Feature Propagation for Video Face Super-Resolution (ECCV 2024). By S-Lab, Nanyang Technological University. | Github | |
Name | Description | Links | Publish Time |
---|---|---|---|
OutfitAnyone | Outfit Anyone: Ultra-high quality virtual try-on for Any Clothing and Any Person. Institute for Intelligent Computing, Alibaba Group | Github | 2023 |
OOTDiffusion | Official implementation of OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on | Github, Demo: https://ootd.ibot.cn/ | 2024 |
ViViD | ViViD: Video Virtual Try-on using Diffusion Models. By Alibaba | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
StemGen | StemGen: A music generation model that listens, ByteDance Inc | Project Website | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
microsoft/graphrag | A modular graph-based Retrieval-Augmented Generation (RAG) system | Github | 2024 |
Retrieval-Augmented Generation for Large Language Models: A Survey | Shanghai Research Institute for Intelligent Autonomous Systems | URL | 2023 |
Name | Description | Links | Publish Time |
---|---|---|---|
surya | Surya is a multilingual document OCR toolkit. It can do: Accurate line-level text detection | Github | 2024 |
Nutlope/llama-ocr | Document to Markdown OCR library with Llama 3.2 vision | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
sally-sh/vsp-llm | Visual Speech Processing incorporated with LLMs. Paper: https://arxiv.org/abs/2402.15151v1 | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
NationalGAILab/HoT | Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation | Github | 2024 |
Name | Description | Links | Publish Time |
---|---|---|---|
GeneOH-Diffusion | Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion | Github | 2024 |
Efficient-Large-Model/VILA | VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops) | Github | 2024 |
roboflow/supervision | We write your reusable computer vision tools. | Github | 2023 |
If you like this project, you can support us with a donation. Thank you for your support!