COMPLETED AI/ML

MemoryForge

Three-layer cognitive memory for LLM agents -- working, episodic, and semantic -- with 473 passing tests, hybrid retrieval, and a FastAPI WebSocket API.

status COMPLETED
type AI/ML
started 2024-09-01
completed 2024-12-01
stack Python FastAPI Qdrant Neo4j SQLite WebSocket OpenAI Anthropic Ollama

// DESCRIPTION

The Problem: LLMs Forget Everything Beyond Their Context Window

Large language models have a fixed context window — typically 8K to 128K tokens. For short interactions this is sufficient. But for long-running multi-agent systems — research assistants that accumulate information over days, coding agents that maintain project context across sessions, customer service bots that should remember past interactions — the fixed window creates a hard ceiling. Once a conversation grows beyond the window size, earlier messages are dropped entirely. There is no graceful degradation: the model simply loses access to potentially critical context with no ability to retrieve it on demand.

Naive approaches fail for different reasons. Unlimited context stuffing hits model limits and is computationally prohibitive. A single vector database can retrieve semantically similar memories but cannot reason about relationships between entities or represent the temporal structure of episodic memories. What is needed is a layered memory architecture that mirrors how human cognitive memory is organized.

Approach: Three-Layer Cognitive Memory Architecture

Layer 1 — Working Memory (In-Process LRU Cache): The most recent and actively relevant context is kept in a fast in-process LRU cache. Zero-latency retrieval with no database round-trips. Items age out based on recency and an importance score that considers interaction frequency and user-marked salience.
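The recency-plus-importance aging policy can be sketched in a few lines. This is a minimal illustration, not MemoryForge's actual API: the class name, field layout, and score weighting below are all assumptions.

```python
from collections import OrderedDict
import time

class WorkingMemory:
    """Sketch of an LRU cache whose eviction weighs recency against an
    importance score built from interaction frequency (hit count) and
    user-marked salience. Weights here are illustrative placeholders."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        # key -> (value, hit_count, salience, last_access_time)
        self.items = OrderedDict()

    def put(self, key, value, salience=0.0):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = (value, 0, salience, time.perf_counter())
        if len(self.items) > self.capacity:
            self._evict()

    def get(self, key):
        if key not in self.items:
            return None
        value, hits, salience, _ = self.items[key]
        # Refresh recency and bump the interaction count on every hit.
        self.items[key] = (value, hits + 1, salience, time.perf_counter())
        self.items.move_to_end(key)
        return value

    def _evict(self):
        # Evict the entry with the lowest combined recency/importance score,
        # so frequently used or user-marked items outlive merely recent ones.
        now = time.perf_counter()
        def score(entry):
            _, hits, salience, last = entry
            recency = 1.0 / (1.0 + (now - last))
            return recency + 0.1 * hits + salience
        victim = min(self.items, key=lambda k: score(self.items[k]))
        del self.items[victim]
```

Unlike a plain LRU, an item that has been touched often (or marked salient) survives eviction even when a newer but unimportant item arrives.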

Layer 2 — Episodic Memory (Qdrant Vector Database): Past interactions and observations are embedded and stored in Qdrant. Retrieval uses approximate nearest-neighbor search over dense embeddings, allowing the agent to recall semantically similar past experiences regardless of when they occurred. Episodic memories carry temporal metadata (timestamps, session IDs) for time-filtered retrieval.
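In Qdrant this retrieval is an approximate nearest-neighbor search with a payload filter on the temporal metadata. A brute-force stand-in over toy embeddings illustrates the shape of the query; the function name and record schema below are hypothetical, not the project's real interface.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_episodes(query_vec, store, top_k=3, after_ts=None):
    """Brute-force stand-in for an ANN search with a temporal filter.
    `store` is a list of dicts with 'vector', 'ts', 'session_id', and
    'text' fields (an illustrative episodic-memory schema)."""
    # Apply the time filter first, mirroring a vector-DB payload filter.
    candidates = [m for m in store if after_ts is None or m["ts"] >= after_ts]
    # Rank the survivors by semantic similarity to the query embedding.
    candidates.sort(key=lambda m: cosine(query_vec, m["vector"]), reverse=True)
    return candidates[:top_k]
```

A real deployment would replace the linear scan with Qdrant's HNSW index, but the filter-then-rank semantics are the same.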

Layer 3 — Semantic Memory (Neo4j Knowledge Graph): Structured knowledge about entities, relationships, and facts is stored in a Neo4j graph database. This layer supports relational queries that vector search cannot answer: 'what are all the projects entity X is involved in?' or 'show me the dependency chain.' The graph representation naturally encodes the web of relationships that accumulates over long agent sessions.
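The relational queries this layer answers reduce to pattern matching and path traversal over the graph. A plain-Python adjacency map stands in for Neo4j below to show the hop-distance computation that vector search cannot express; the Cypher in the comment and the entity names are illustrative, not taken from the project.

```python
from collections import deque

# Illustrative Cypher for the 'projects entity X is involved in' query:
#   MATCH (e:Entity {name: $name})-[:INVOLVED_IN]->(p:Project) RETURN p.name
# (relationship type and labels are assumptions, not MemoryForge's schema)

def relationship_distance(graph, source, target):
    """BFS hop count between two entities, or None if unreachable.
    `graph` maps an entity name to the entities it relates to."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Dependency-chain questions ('what does X transitively touch?') are the same traversal with the target test replaced by an accumulator.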

Hybrid Retrieval: A query router combines results from all three layers, reranking by a composite relevance score that weights recency (working), semantic similarity (episodic), and structural relationship distance (semantic). WebSocket real-time updates stream memory state to connected clients. A FastAPI REST + WebSocket API provides the primary interface, and a Rich CLI supports interactive use and debugging. Multiple LLM backends are supported: OpenAI, Anthropic (Claude), and local models via Ollama.
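The composite reranking step can be sketched as a weighted sum over per-layer signals. The default weights and the candidate field names below are assumptions for illustration, not the project's actual configuration.

```python
def hybrid_rerank(candidates, weights=(0.3, 0.5, 0.2)):
    """Rerank memory candidates by a composite relevance score.
    Each candidate carries per-layer signals normalised to [0, 1]:
    'recency' (working layer), 'similarity' (episodic layer), and
    'hops' (semantic layer, converted to proximity = 1 / (1 + hops)).
    The (recency, similarity, graph) weights are illustrative defaults."""
    w_rec, w_sim, w_graph = weights

    def score(c):
        # Missing signals contribute zero, so a memory found by only
        # one layer still competes on that layer's evidence.
        proximity = 1.0 / (1.0 + c.get("hops", float("inf")))
        return (w_rec * c.get("recency", 0.0)
                + w_sim * c.get("similarity", 0.0)
                + w_graph * proximity)

    return sorted(candidates, key=score, reverse=True)
```

Exposing the weight tuple per query is what makes layer-weighted retrieval configurable: a time-sensitive query can up-weight recency, a factual one can up-weight the graph term.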

Results: 473 Passing Tests, Complete Implementation

MemoryForge is a completed project with production-quality engineering standards. The test suite comprises 473 passing tests covering all three memory layers, the hybrid retrieval logic, the WebSocket event system, and each LLM backend integration — unit tests, integration tests, and end-to-end tests for complete agent interaction sessions.

The FastAPI REST + WebSocket API supports memory storage, targeted retrieval by layer, hybrid retrieval with configurable layer weights, and real-time subscription to memory change events.

Multiple LLM backends — OpenAI, Anthropic/Claude, and Ollama for local models — are supported through a clean backend abstraction, making the system independent of any particular LLM provider and suitable for both cloud-hosted and air-gapped deployments.
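The provider-independence described above comes down to callers depending on an interface rather than a concrete client. A minimal sketch of such an abstraction follows; the protocol shape, method name, and stand-in backend are assumptions, not MemoryForge's actual classes.

```python
from typing import Protocol

class LLMBackend(Protocol):
    """Minimal provider-agnostic backend interface (illustrative):
    OpenAI, Anthropic, and Ollama adapters would each wrap their own
    client behind the same method signature."""
    def complete(self, prompt: str) -> str: ...

class EchoBackend:
    """Trivial stand-in backend, useful for tests and air-gapped runs
    where no real model is available."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"

def summarise_memory(backend: LLMBackend, memory_text: str) -> str:
    # Application code depends only on the protocol, never on a
    # particular provider's SDK, so backends are swappable per deployment.
    return backend.complete(f"Summarise: {memory_text}")
```

Because `Protocol` uses structural typing, any adapter exposing a matching `complete` method satisfies the interface without inheriting from a shared base class.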

// HIGHLIGHTS

  • Complete project with 473 passing tests across all components and integrations
  • Three-layer cognitive memory: Working (LRU) → Episodic (Qdrant) → Semantic (Neo4j)
  • Hybrid retrieval combines recency, semantic similarity, and graph relationship distance
  • Importance scoring with interaction frequency + user-marked salience for graceful memory aging
  • FastAPI REST + WebSocket real-time API for live memory state streaming
  • Supports OpenAI, Anthropic/Claude, and Ollama via unified backend abstraction
  • Rich CLI for interactive memory inspection and agent debugging
  • GitHub: GeoffreyWang1117/MemoryForge