Research Papers
Position Paper
vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models
Venue: arXiv Technical Report
We introduce vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality deployments that composes heterogeneous signals into deployment-specific routing policies across cost, privacy, latency, and safety constraints.
2026 📄 Paper
Vision Paper
The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
Venue: arXiv Technical Report
We synthesize the project’s recent routing, fleet, multimodal, and governance results into the Workload-Router-Pool (WRP) architecture, connecting signal-driven routing to a full-stack inference optimization framework and outlining future research directions across workload, router, and pool design.
2026 📄 Paper
Research Paper
Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
Venue: arXiv Technical Report
We formalize the visual confused deputy as a security failure mode in computer-using agents and introduce a dual-channel guardrail that independently checks click targets and action reasoning before execution.
2026 📄 Paper
Research Paper
Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference
Venue: arXiv Technical Report
We introduce Outcome-Aware Tool Selection (OATS), an offline embedding refinement method that improves semantic-router tool ranking under single-digit millisecond CPU budgets without adding serving-time model inference.
2026 📄 Paper
Research Paper
Adaptive Vision-Language Model Routing for Computer Use Agents
Venue: arXiv Technical Report
We propose Adaptive VLM Routing (AVR), which estimates action difficulty and routes computer-use agent steps to the cheapest model that still satisfies a target reliability threshold.
2026 📄 Paper
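The routing rule the AVR abstract describes (send each agent step to the cheapest model that still meets a target reliability) can be sketched in a few lines. This is an illustrative stand-in, not the paper's method: the tier names, the linear difficulty discount, and the 0.95 target are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost: float            # relative cost per step
    reliability: float     # estimated per-step success rate on easy inputs

def route_step(difficulty: float, tiers, target: float = 0.95) -> str:
    """Pick the cheapest tier whose reliability, discounted by the
    estimated step difficulty, still meets the target.

    The linear discount `reliability * (1 - difficulty)` is a hypothetical
    stand-in for a learned difficulty estimator.
    """
    for tier in sorted(tiers, key=lambda t: t.cost):
        if tier.reliability * (1.0 - difficulty) >= target:
            return tier.name
    # No tier meets the target: fall back to the most reliable one.
    return max(tiers, key=lambda t: t.reliability).name

TIERS = [ModelTier("small-vlm", 1.0, 0.97), ModelTier("large-vlm", 8.0, 0.995)]
```

Easy steps (difficulty near 0) stay on the cheap model; harder steps escalate, which is where the cost savings relative to always using the large model come from.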
Research Paper
98× Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router
Venue: arXiv Technical Report
We combine Flash Attention, prompt compression, and near-streaming body processing to cut routing latency from seconds to tens of milliseconds while keeping the router lightweight enough to share hardware with serving.
2026 📄 Paper
Research Paper
inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference
Venue: arXiv Technical Report
We present a queueing-theory-grounded fleet planner and discrete-event simulator for sizing multi-pool LLM GPU fleets against P99 TTFT targets, without requiring hardware profiling runs up front.
2026 📄 Paper
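A classic building block for the kind of queueing-grounded sizing the abstract describes is the Erlang-C formula for an M/M/c queue. The sketch below is illustrative only, it is not the paper's planner or simulator, and the arrival/service rates and wait-probability target are assumptions:

```python
import math

def erlang_c(servers: int, offered_load: float) -> float:
    """Erlang-C probability that an arriving request must queue (M/M/c).

    offered_load = arrival_rate / service_rate, in Erlangs.
    """
    a, c = offered_load, servers
    if a >= c:
        return 1.0  # unstable regime: the queue grows without bound
    s = sum(a**k / math.factorial(k) for k in range(c))
    top = a**c / math.factorial(c) * (c / (c - a))
    return top / (s + top)

def min_servers_for_target(arrival_rate: float, service_rate: float,
                           max_queue_prob: float) -> int:
    # Smallest pool size whose queueing probability meets the target.
    c = max(1, math.ceil(arrival_rate / service_rate))
    while erlang_c(c, arrival_rate / service_rate) > max_queue_prob:
        c += 1
    return c
```

A real planner targeting P99 TTFT would layer latency distributions and multi-pool routing on top of this kind of primitive; the point here is only that pool size falls out of arrival rate, service rate, and a tail target rather than hardware profiling.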
Research Paper
FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism
Venue: arXiv Technical Report
We derive the minimum-cost two-pool LLM fleet directly from the workload CDF and P99 TTFT target, then use Compress-and-Route to make the optimal boundary deployable in practice.
2026 📄 Paper
Research Paper
The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency
Venue: arXiv Technical Report
We derive the 1/W law showing that tokens per watt roughly halve whenever the serving context window doubles, making context-length routing topology a larger energy-efficiency lever than a pure GPU generation upgrade.
2026 📄 Paper
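The scaling stated in the abstract can be written compactly (an illustrative formulation of the claim, not the paper's derivation): if tokens-per-watt efficiency scales inversely with the serving context window $W$, then doubling the window halves efficiency,

$$\mathrm{TPW}(W) \propto \frac{1}{W} \quad\Longrightarrow\quad \frac{\mathrm{TPW}(2W)}{\mathrm{TPW}(W)} = \frac{1}{2},$$

which is why routing short-context traffic to short-window pools can outweigh a single-generation GPU upgrade.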
Research Paper
Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL
Venue: arXiv Technical Report
We show how probabilistic ML predicates in policy languages can silently co-fire on the same query, and implement conflict detection plus a softmax-based prevention mechanism in the Semantic Router DSL.
2026 📄 Paper
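The failure mode and the softmax-based fix described above can be sketched as follows. This is a minimal illustration, not the Semantic Router DSL's implementation: the predicate names, scores, and the 0.1 conflict margin are all assumptions.

```python
import math

def softmax(scores: dict) -> dict:
    # Numerically stable softmax over raw predicate scores.
    m = max(scores.values())
    exp = {name: math.exp(s - m) for name, s in scores.items()}
    z = sum(exp.values())
    return {name: v / z for name, v in exp.items()}

def resolve(scores: dict, margin: float = 0.1):
    """Pick a single winning predicate instead of letting several co-fire.

    Also flags a conflict when the top two softmax probabilities are
    within `margin` of each other, i.e. the policies would have silently
    co-fired on the same query.
    """
    probs = softmax(scores)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    winner, p1 = ranked[0]
    conflict = len(ranked) > 1 and (p1 - ranked[1][1]) < margin
    return winner, conflict

# Two ML predicates scoring nearly identically on one query:
winner, conflict = resolve({"pii_filter": 0.81, "jailbreak": 0.79, "math": 0.10})
```

Normalizing scores into a single distribution forces exactly one policy branch to win, while the margin check surfaces the near-tie to an operator instead of hiding it.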
Research Paper
Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents
Venue: arXiv Technical Report
We show that conversational memory and retrieval-grounded routing let a lightweight 8B model recover most of a 235B model’s performance on persistent user-specific queries while cutting effective inference cost by 96%.
2026 📄 Paper
RAG Verification
Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems
Venue: arXiv Technical Report
We present a real-time verification component for long-document RAG that processes contexts up to 32K tokens, balancing latency and grounding coverage so interactive systems can detect unsupported answers without falling back to truncated checks.
2026 📄 Paper
Research Paper
When to Reason: Semantic Router for vLLM
Venue: NeurIPS - MLForSys
We present a semantic router that classifies queries based on their reasoning requirements and selectively applies reasoning only when beneficial.
2025 📄 Paper
Research Paper
Category-Aware Semantic Caching for Heterogeneous LLM Workloads
We present a category-aware semantic cache in which similarity thresholds, TTLs, and quotas vary by query category, backed by a hybrid architecture that separates in-memory HNSW search from external document storage.
2025 📄 Paper
Research Paper
Semantic Inference Routing Protocol (SIRP)
Venue: Internet Engineering Task Force (IETF)
This document specifies the Semantic Inference Routing Protocol (SIRP), a framework for content-level classification and semantic routing in AI inference systems.
2025 📄 Paper
Research Paper
Multi-Provider Extensions for Agentic AI Inference APIs
Venue: Internet Engineering Task Force (IETF) - Network Management Research Group
This document specifies multi-provider extensions for agentic AI inference APIs. Published: 20 October 2025. Intended Status: Informational. Expires: 23 April 2026.
2025 📄 Paper
Conference Talks
Conference Talk
Intelligent LLM Routing: A New Paradigm for Multi-Model AI Orchestration in Kubernetes
Venue: KubeCon NA 2025
This research-driven talk introduces a novel architectural paradigm that complements recent advances in intelligent inference routing for large language models.
2025 🎤 Event Page
Conference Talk
vLLM Semantic Router: Unlock the Power of Intelligent Routing
Venue: vLLM Meetup Beijing
A deep dive into vLLM Semantic Router capabilities, demonstrating how intelligent routing can unlock new possibilities for efficient LLM inference.
2025 🎤 Watch Recording
Conference Talk
AI-Powered vLLM Semantic Router
Venue: vLLM Office Hours
An overview of AI-powered features in vLLM Semantic Router, showcasing the latest developments and community contributions.
2025 📹 Watch Recording