ACTIVE AI/ML

RecCache

ML-aware caching for recommendation systems: 45% cache hit rate, 85% latency reduction on hits, and under 3% NDCG degradation on MovieLens 100K.

status ACTIVE
type AI/ML
stack Python · PyTorch · Redis · online K-Means · NDCG evaluation

// DESCRIPTION

The Problem: Recommendation Systems Are Caught in a Latency-Quality Trap

Modern recommendation systems face a fundamental tension. Fresh inference — running a full neural model for every request — delivers the best personalization quality but is computationally expensive and introduces latency that degrades user experience at scale. Naive caching — returning pre-computed recommendations without regard to user context — reduces latency but can serve stale results that miss recent behavioral signals, hurting recommendation quality. Standard cache invalidation strategies (TTL-based, LRU) treat all cached items identically and have no understanding of how much a particular user's preferences have drifted since their recommendations were cached — they are ML-oblivious.

RecCache proposes a different contract: cache aggressively, but let an ML quality predictor decide, for each individual request, whether the cached result is still good enough to serve or whether fresh inference is worth the cost.
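The contract above reduces to a small per-request decision. The sketch below shows only that control flow; every helper in it (`embed_user`, `cluster_of`, `quality_gate`, `rerank`, `run_model`) is a toy stand-in assumption, not RecCache's actual API:

```python
# Hypothetical sketch of the per-request contract. All helpers are toy
# stand-ins; only the control flow reflects the design described above.
cache = {}  # cluster_id -> cached recommendation list

def embed_user(user): return sum(map(ord, user)) % 100   # toy behavioral embedding
def cluster_of(emb): return emb // 10                    # toy shared cache key
def quality_gate(user, cached): return len(cached) >= 3  # toy "still good enough" test
def rerank(recs, emb): return sorted(recs)               # toy per-user personalization
def run_model(user): return ["a", "b", "c"]              # stands in for full GPU inference

def recommend(user):
    cid = cluster_of(embed_user(user))
    cached = cache.get(cid)
    if cached is not None and quality_gate(user, cached):
        return rerank(cached, embed_user(user))          # cache-hit path
    fresh = run_model(user)                              # fresh-inference path
    cache[cid] = fresh
    return fresh
```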


Approach: Online K-Means Clustering + Two-Level Cache + Quality Gate

Online K-Means User Clustering: Users are embedded by their behavioral history (click sequences, dwell time, category affinity) into a dense representation. Online K-Means runs continuously, updating cluster centroids as new behavior arrives. Users in the same cluster have similar preference profiles and can share cached recommendation results — the key insight that makes ML-aware caching tractable.
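A minimal sketch of online clustering, using scikit-learn's `MiniBatchKMeans.partial_fit` as one way to stream centroid updates; the embedding dimension and cluster count here are illustrative assumptions, not RecCache's configuration:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Illustrative sizes, not RecCache's actual settings.
N_CLUSTERS, DIM = 64, 32

kmeans = MiniBatchKMeans(n_clusters=N_CLUSTERS, random_state=0)

def update_clusters(new_embeddings: np.ndarray) -> None:
    """Fold a fresh batch of user behavior embeddings into the centroids."""
    kmeans.partial_fit(new_embeddings)

def cluster_of(user_embedding: np.ndarray) -> int:
    """Cache key: users in the same cluster share cached recommendations."""
    return int(kmeans.predict(user_embedding.reshape(1, -1))[0])

# Warm-start with an initial batch, then keep streaming updates.
rng = np.random.default_rng(0)
update_clusters(rng.normal(size=(256, DIM)))
cid = cluster_of(rng.normal(size=DIM))
```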

Quality Predictor: A lightweight classifier trained to predict, given a user's current context and the age of their cluster's cached recommendations, whether the cached result is likely to meet a minimum NDCG threshold. If predicted quality is sufficient, the cached result is served. Otherwise, fresh inference is triggered. The predictor itself is cheap — inference takes under 1ms.
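One way to realize such a gate is a logistic-regression classifier over cheap features; the feature choice (cache age, user drift), the decision threshold, and the toy training labels below are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed decision threshold on the predicted probability that the
# cached result still clears the NDCG floor.
QUALITY_THRESHOLD = 0.5

gate = LogisticRegression()

def train_gate(features: np.ndarray, met_ndcg_floor: np.ndarray) -> None:
    """features: [cache_age, user_drift]; label 1 = cached NDCG was acceptable."""
    gate.fit(features, met_ndcg_floor)

def serve_from_cache(cache_age: float, user_drift: float) -> bool:
    """True -> serve the cached result; False -> trigger fresh inference."""
    p_ok = gate.predict_proba([[cache_age, user_drift]])[0, 1]
    return bool(p_ok >= QUALITY_THRESHOLD)

# Toy training data: older caches and heavily drifted users tend to miss
# the quality floor. Real labels would come from offline NDCG evaluation.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = (X.sum(axis=1) < 1.0).astype(int)
train_gate(X, y)
```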

Two-Level Cache Architecture: An L1 in-process LRU cache delivers sub-millisecond retrieval for recently accessed clusters. An L2 Redis cache provides shared, durable storage accessible across multiple serving replicas. A lightweight reranking step personalizes cached results at the individual-user level.
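The lookup order (L1, then L2, then miss) can be sketched as follows. A plain dict stands in for the L2 Redis client so the example is self-contained; production code would use redis-py get/set with a TTL instead:

```python
from collections import OrderedDict

class TwoLevelCache:
    """Hypothetical sketch: in-process LRU in front of a shared L2 store."""

    def __init__(self, l2_store, l1_capacity: int = 128):
        self.l1 = OrderedDict()      # L1: in-process LRU, sub-millisecond
        self.l1_capacity = l1_capacity
        self.l2 = l2_store           # L2: shared across serving replicas

    def get(self, cluster_id):
        if cluster_id in self.l1:            # L1 hit
            self.l1.move_to_end(cluster_id)
            return self.l1[cluster_id]
        recs = self.l2.get(cluster_id)       # L2 hit: promote into L1
        if recs is not None:
            self._promote(cluster_id, recs)
        return recs                          # None -> miss, run fresh inference

    def put(self, cluster_id, recs):
        self.l2[cluster_id] = recs           # dict stand-in; Redis would SET with TTL
        self._promote(cluster_id, recs)

    def _promote(self, cluster_id, recs):
        self.l1[cluster_id] = recs
        self.l1.move_to_end(cluster_id)
        if len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)      # evict least recently used

cache = TwoLevelCache(l2_store={})
cache.put(7, ["item_42", "item_17"])
```

The per-user reranking step would then run on whatever list `get` returns before it is served.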


Results: 45% Hit Rate, 85% Latency Reduction, Under 3% Quality Loss

Evaluated on MovieLens 100K, RecCache achieves:

Cache hit rate: ~45% — nearly half of all requests served from cache.

Latency reduction on hits: ~85% — L1 hits arrive in under 1ms.

Quality degradation: <3% NDCG — quality gate prevents stale results.

Compute savings: ~40% — GPU inference avoided for nearly half of requests.


// HIGHLIGHTS

  • ML-aware caching for recommendation: quality predictor gates each cache hit individually
  • ~45% cache hit rate on MovieLens 100K with online K-Means user clustering
  • ~85% latency reduction on cache hits; L1 in-process cache delivers sub-1ms retrieval
  • <3% NDCG degradation — quality gate prevents stale results from hurting ranking quality
  • ~40% compute savings — GPU inference avoided for nearly half of all requests
  • Two-level cache: L1 in-process LRU (<1ms) + L2 Redis (shared across replicas)
  • Lightweight post-retrieval reranking recovers per-user personalization within cluster
  • GitHub: GeoffreyWang1117/RecCache