ACTIVE Research

EdgeRoutingICML

Guardian-Analyst two-tier edge intelligence with sub-2ms HMM routing on Jetson Orin Nano.

status ACTIVE
type Research
stack Python HMM TinyLlama

// DESCRIPTION

Guardian-Analyst Two-Tier Edge Intelligence

EdgeRoutingICML introduces a two-tier architecture for intelligent query routing at the network edge, where a lightweight Guardian model handles simple queries locally while routing complex ones to more capable Analyst models in the cloud or edge cluster. This design minimizes latency for routine requests while maintaining quality for challenging inputs.

The routing mechanism uses a Hidden Markov Model (HMM) that treats query complexity as a latent state and observable features (query length, vocabulary complexity, entity density) as emissions. The HMM makes routing decisions in under 2ms on NVIDIA Jetson Orin Nano hardware, making it practical for real-time edge deployment where even the overhead of a neural classifier is prohibitive.
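The routing decision itself reduces to a forward pass over the HMM. A minimal sketch, assuming a 2-state model (simple vs. complex) and features discretized into three buckets; the matrices below are illustrative placeholders, not the trained parameters:

```python
# Hypothetical 2-state HMM: state 0 = "simple", state 1 = "complex".
# Emissions are discretized feature buckets: 0 = low, 1 = medium, 2 = high.
TRANS = [[0.8, 0.2],        # P(next state | current state)
         [0.3, 0.7]]
EMIT  = [[0.7, 0.2, 0.1],   # P(feature bucket | state)
         [0.1, 0.3, 0.6]]
INIT  = [0.6, 0.4]          # initial state distribution

def forward_posterior(obs):
    """Normalized forward filter: P(state | feature buckets seen so far)."""
    alpha = [INIT[s] * EMIT[s][obs[0]] for s in range(2)]
    z = sum(alpha)
    alpha = [a / z for a in alpha]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * TRANS[p][s] for p in range(2)) * EMIT[s][o]
                 for s in range(2)]
        z = sum(alpha)
        alpha = [a / z for a in alpha]
    return alpha

def route(obs, threshold=0.5):
    """Send the query to the Analyst tier if P(complex) exceeds the threshold."""
    return "analyst" if forward_posterior(obs)[1] > threshold else "guardian"
```

The per-query cost is a handful of multiply-adds over a tiny state space, which is why the decision fits comfortably under the 2ms budget on embedded hardware.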

The Guardian tier runs TinyLlama locally on the edge device, handling approximately 60% of incoming queries without any cloud communication. For the remaining 40%, the HMM router selects among a pool of Analyst models based on the predicted query category and complexity, optimizing for the quality-latency-cost Pareto frontier.
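Analyst selection can be framed as picking the cheapest model that still meets a quality bar derived from the predicted complexity. A hedged sketch; the model names, quality scores, and costs are hypothetical, not from the project:

```python
# Illustrative Analyst pool; fields and values are assumptions for the sketch.
ANALYSTS = [
    {"name": "edge-cluster-7b", "quality": 0.82, "latency_ms": 120, "cost": 0.2},
    {"name": "cloud-70b",       "quality": 0.95, "latency_ms": 450, "cost": 1.0},
]

def select_analyst(required_quality):
    """Pick the cheapest (then fastest) Analyst whose quality meets the bar,
    falling back to the highest-quality model when none qualifies."""
    candidates = [m for m in ANALYSTS if m["quality"] >= required_quality]
    if candidates:
        return min(candidates, key=lambda m: (m["cost"], m["latency_ms"]))["name"]
    return max(ANALYSTS, key=lambda m: m["quality"])["name"]
```

Mapping complexity to a required quality level is one simple way to trace the quality-latency-cost Pareto frontier: easy escalations stay on cheap edge-cluster models, and only genuinely hard queries pay for the largest cloud model.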

Evaluation on edge computing benchmarks shows that the two-tier system achieves 94% of cloud-only quality while reducing average latency by 65% and cloud API costs by 58%. The HMM router is trained online and adapts to shifting query distributions without requiring redeployment.
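One way online adaptation can work is to nudge the emission probabilities toward each new observation, weighted by the forward filter's soft state posterior. A minimal sketch assuming an exponential-moving-average update; the learning rate and update rule are illustrative, not the paper's training procedure:

```python
def update_emissions(emit, posterior, obs_bucket, lr=0.01):
    """Online EMA update of P(bucket | state): move each state's emission
    row toward a one-hot vector at the observed bucket, scaled by how
    strongly the forward filter believes in that state."""
    new = [row[:] for row in emit]
    for s, p in enumerate(posterior):
        step = lr * p  # states the filter doubts are barely updated
        for b in range(len(new[s])):
            target = 1.0 if b == obs_bucket else 0.0
            new[s][b] = (1 - step) * new[s][b] + step * target
    return new
```

Because each row is a convex combination of two probability distributions, the rows remain valid distributions after every update, so the router can adapt in place without retraining or redeployment.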

// HIGHLIGHTS

  • Sub-2ms HMM routing decisions on Jetson Orin Nano edge hardware
  • Two-tier Guardian-Analyst architecture handling 60% of queries locally
  • 94% of cloud-only quality with 65% latency reduction and 58% cost savings
  • Online HMM adaptation to shifting query distributions without redeployment
  • TinyLlama as Guardian model for resource-constrained edge deployment