Stars
Your own personal AI assistant. Any OS. Any Platform. The lobster way. 🦞
[Survey] A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
Official Repo of paper "KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction". In the paper, we propose KnowCoder, the most powerful large language model so far for…
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
NVIDIA Isaac GR00T N1.6 - A Foundation Model for Generalist Robots.
[ECCV2024] 🐙Octopus, an embodied vision-language model trained with RLEF, emerging superior in embodied visual planning and programming.
AutoCoA (Automatic generation of Chain-of-Action) is an agent model framework that enhances the multi-turn tool usage capability of reasoning models.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Implementation code of the paper MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing
Official repo for GPTFUZZER : Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Repo of ACL 2025 Paper "Quantification of Large Language Model Distillation"
Towards Large Multimodal Models as Visual Foundation Agents
A generative world for general-purpose robotics & embodied AI learning.
✨✨Latest Advances on Multimodal Large Language Models
[ICLR 2023] ReAct: Synergizing Reasoning and Acting in Language Models
Official repo with the MM-PlanLLM code, from the paper Show and Guide: Instructional-Plan Grounded Vision and Language Model.
Pytorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, along with strong general video understanding capabilities.
【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
openvla / openvla
Forked from TRI-ML/prismatic-vlms
OpenVLA: An open-source vision-language-action model for robotic manipulation.
A flexible and efficient codebase for training visually-conditioned language models (VLMs)
PyTorch implementation for Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021, Oral)
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.
JunnYu / PaddleNLP
Forked from PaddlePaddle/PaddleNLP
Easy-to-use and powerful NLP library with Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including Neural Search, Question Answering, Information Ex…