Papers
- STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks
- SenTSR-Bench: Thinking with Injected Knowledge for Time-Series Reasoning
- Perceptive Humanoid Parkour: Chaining Dynamic Human Skills via Motion Matching
- SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
- Visual Foresight for Robotic Stow: A Diffusion-Based World Model from Sparse Snapshots
- Iterative Reranking as a Compute-Scaling Method for LLM-based Rankers
- KG-CRAFT: Knowledge graph-based contrastive reasoning with LLMs for enhancing automated fact-checking
- Pattern Discovery with Wide-Lens Analysis and Sharp-Focus Validation
- Autoregressive Image Generation with Masked Bit Modeling
- AgentArk: Distilling Multi-Agent Intelligence into a Single LLM Agent
- Interpretable Tabular Foundation Models via In-Context Kernel Regression
- RFS: Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation
- Differentiable Semantic ID for Generative Recommendation
- AnyView: Synthesizing Any Novel View in Dynamic Scenes
- MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
- TerraFormer: Automated Infrastructure-as-Code with LLMs Fine-Tuned via Policy-Guided Verifier Feedback
- Internal Representations as Indicators of Hallucinations in Agent Tool Selection
- ELLA: Efficient Lifelong Learning for Adapters
- Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
- Journey Before Destination: On the importance of Visual Faithfulness in Slow Thinking
- Diffusion Language Model Inference with Monte Carlo Tree Search
- s3: You Don't Need That Much Data to Train a Search Agent via RL
- A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications
- Chronos-2: From Univariate to Universal Forecasting
- WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning
- TabArena: A Living Benchmark for Machine Learning on Tabular Data
- Amazon Ads Multi-Touch Attribution
- Establishing Best Practices for Building Rigorous Agentic Benchmarks
- Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
- MAPoRL: Multi-Agent Post-Co-Training for Collaborative Large Language Models with Reinforcement Learning
- Evaluating the Critical Risks of Amazon’s Nova Premier under the Frontier Model Safety Framework
- Steering Your Diffusion Policy with Latent Space Reinforcement Learning
- HybGRAG: Hybrid Retrieval-Augmented Generation on Textual and Relational Knowledge Bases
- M+: Extending MemoryLLM with Scalable Long-Term Memory
- How Does Critical Batch Size Scale in Pre-training?
- A Systematic Survey of Automatic Prompt Optimization Techniques
- The Amazon Nova Family of Models: Technical Report and Model Card
- Evaluating Nova 2.0 Lite model under Amazon’s Frontier Model Safety Framework
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making
- Retrieval-Augmented Generation with Graphs (GraphRAG)
- Chronos: Learning the Language of Time Series
- Multimodal Chain-of-Thought Reasoning in Language Models
- Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
- AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2seq Model
- Towards Total Recall in Industrial Anomaly Detection
- DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks
- Deep Sets
