Papers
-
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
-
Splat and Replace: 3D Reconstruction with Repetitive Elements
Adobe / Massachusetts Institute of Technology, National Institute for Research in Digital Science and Technology, Université Côte d’Azur
-
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
-
HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases
-
SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
-
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
-
M+: Extending MemoryLLM with Scalable Long-Term Memory
-
Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
-
BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
-
Skywork Open Reasoner 1 Technical Report
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
-
More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives
-
Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach
-
Autoregressive Speech Synthesis without Vector Quantization
-
Vision as LoRA
-
syftr: Pareto-Optimal Generative AI
-
SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
-
Gemini Robotics: Bringing AI into the Physical World
-
OmniGenBench: A Benchmark for Omnipotent Multimodal Generation across 50+ Tasks
-
A Minimalist Method for Fine-tuning Text-to-Image Diffusion Models
-
One RL to See Them All: Visual Triple Unified Reinforcement Learning
-
GiGL: Large-Scale Graph Neural Networks at Snapchat
-
Incremental Sequence Classification with Temporal Consistency
-
From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition
-
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
-
M-RewardBench: Evaluating Reward Models in Multilingual Settings
-
Lessons from Defending Gemini Against Indirect Prompt Injections
-
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning
-
Progressive Autoregressive Video Diffusion Models
-
FastVLM: Efficient Vision Encoding for Vision Language Models
-
VGGT: Visual Geometry Grounded Transformer
-
Qwen3 Technical Report
-
The Leaderboard Illusion
-
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
-
LLMs Get Lost In Multi-Turn Conversation
-
Reasoning Models Don't Always Say What They Think
-
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well
-
Command A: An Enterprise-Ready Large Language Model
-
InteractRank: Personalized Web-Scale Search Pre-Ranking with Cross Interaction Features
-
Investigating the Overlooked Hessian Structure: From CNNs to LLMs
ByteDance / Beijing Institute of Mathematical Sciences and Applications, Hong Kong Baptist University, Hong Kong University of Science and Technology, Rutgers University
-
The Leaderboard Illusion
Cohere / Allen Institute for Artificial Intelligence, Massachusetts Institute of Technology, Princeton University, Stanford University, University of Washington, University of Waterloo
-
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
-
Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use
-
Perception Encoder: The best visual embeddings are not at the output of the network
-
Kimi-Audio Technical Report
-
I-Con: A Unifying Framework for Representation Learning
-
Describe Anything: Detailed Localized Image and Video Captioning
-
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
-
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
-
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
