Papers
-
Pretrained Vision-Language-Action Models are Surprisingly Resistant to Forgetting in Continual Learning
-
Phi-4-reasoning-vision-15B Technical Report
-
UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?
-
Beyond Pixel Histories: World Models with Persistent 3D State
-
Modular Memory is the Key to Continual Learning Agents
-
Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
-
WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks
-
Experiential Reinforcement Learning
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
-
Florence: A New Foundation Model for Computer Vision
-
LLM-in-Sandbox Elicits General Agentic Intelligence
-
On-Policy Context Distillation for Language Models
-
CineScene: Implicit 3D as Effective Scene Representation for Cinematic Video Generation
-
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
-
LIVE: Long-horizon Interactive Video World Modeling
-
Closing the Loop: Universal Repository Representation with RPG-Encoder
-
CUA-Skill: Develop Skills for Computer Using Agent
-
AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
-
X-Coder: Advancing Competitive Programming with Fully Synthetic Tasks, Solutions, and Tests
-
LLM-42: Enabling Determinism in LLM Inference with Verified Speculation
-
Lost in Transmission: When and Why LLMs Fail to Reason Globally
-
Efficient Autoregressive Video Diffusion with Dummy Head
-
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs
-
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
-
Controlled LLM Training on Spectral Sphere
-
InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training
-
Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
-
From Word to World: Can Large Language Models be Implicit Text-based World Models?
-
Sigma-MoE-Tiny Technical Report
-
FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
-
Spatia: Video Generation with Updatable Spatial Memory
-
Native and Compact Structured Latents for 3D Generation
-
Wait, Wait, Wait... Why Do Reasoning Models Loop?
-
Glance: Accelerating Diffusion Models with 1 Sample
-
Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
-
VIGS-SLAM: Visual Inertial Gaussian Splatting SLAM
-
The Art of Scaling Test-Time Compute for Large Language Models
-
ThetaEvolve: Test-time Learning on Open Problems
-
LatBot: Distilling Universal Latent Actions for Vision-Language-Action Models
-
SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling
-
Shifting Work Patterns with Generative AI
-
LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
-
Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning
-
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
-
Autoregressive Speech Synthesis without Vector Quantization
-
LLMs Get Lost In Multi-Turn Conversation
-
AI-Instruments: Embodying Prompts as Instruments to Abstract & Reflect Graphical Interface Commands as General-Purpose Tools
-
AI at Work Is Here. Now Comes the Hard Part
