🧠the-brain

Benchmarks & Metrics

Performance metrics and evaluation results for the-brain's memory pipeline

SPM Curator — WildChat Evaluation

The SPM (Surprise-Gated Prediction Error) curator was evaluated on a 22-action WildChat benchmark to measure its ability to discriminate signal from noise.

Results

| Configuration | Score | Notes |
| --- | --- | --- |
| Hybrid TF-IDF 512D + SBERT MiniLM 384D | 33.0% | Best configuration, no consensus, k=20 |
| Baseline (no filtering) | ~15% | All interactions stored equally |
| TF-IDF only (512D) | ~25% | Good spread, worse semantic matching |
| SBERT only (384D) | 28% | Better semantics, worse novelty detection |
| EMA-Gaussian (fallback) | ~18% | Used before TF-IDF vocabulary is locked |

Key insight: The hybrid approach combines the wide statistical spread of TF-IDF (93% better discrimination than raw features) with the semantic understanding of sentence embeddings, achieving the highest composite surprise accuracy of the configurations tested (33.0%).

Component Weights

The composite surprise score uses three weighted components:

composite = 0.35 × scalarScore + 0.40 × embScore + 0.25 × noveltyScore

| Component | Weight | What It Measures |
| --- | --- | --- |
| Scalar | 0.35 | Prompt/response length, lexical diversity, time patterns |
| Embedding | 0.40 | Semantic distance from learned centroid |
| Novelty | 0.25 | N-gram character patterns not seen before |
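
The weighted sum above can be sketched in a few lines. This is a minimal illustration rather than the project's actual code: the function name `composite_surprise` is hypothetical, and it assumes each component score has already been normalized to the range 0 to 1, which the source does not state.

```python
# Weights taken from the Component Weights table.
SCALAR_WEIGHT = 0.35
EMBEDDING_WEIGHT = 0.40
NOVELTY_WEIGHT = 0.25


def composite_surprise(scalar_score: float, emb_score: float, novelty_score: float) -> float:
    """Combine the three component scores into one composite surprise score.

    Assumption: each input is already normalized to [0, 1], so the result
    also falls in [0, 1] because the weights sum to 1.0.
    """
    return (
        SCALAR_WEIGHT * scalar_score
        + EMBEDDING_WEIGHT * emb_score
        + NOVELTY_WEIGHT * novelty_score
    )


# Example: a semantically unusual, fairly novel interaction of ordinary length.
score = composite_surprise(scalar_score=0.2, emb_score=0.5, novelty_score=0.8)
```

Because the weights sum to 1.0, the embedding component can move the composite by at most 0.40, so no single component can dominate the score on its own.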

Dual-Mode Architecture

| Mode | When Used | Discrimination |
| --- | --- | --- |
| TF-IDF (default) | After vocabulary is locked (daemon startup) | +93% better vs raw features |
| EMA-Gaussian (fallback) | Before vocabulary is initialized | Running mean/variance of 6 scalar features |
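
To make the fallback mode concrete, here is a rough sketch of an exponential-moving-average estimator that tracks a running mean and variance per scalar feature and scores a new observation by its average absolute z-score. The class name `EmaGaussian`, the smoothing factor `alpha`, and the z-score aggregation are all assumptions for illustration; the source only states that the fallback keeps a running mean/variance of 6 scalar features.

```python
import math


class EmaGaussian:
    """Running per-feature mean/variance via exponential moving averages."""

    def __init__(self, n_features: int = 6, alpha: float = 0.1, eps: float = 1e-6):
        self.alpha = alpha          # smoothing factor (assumed value)
        self.eps = eps              # avoids division by zero
        self.mean = [0.0] * n_features
        self.var = [1.0] * n_features
        self.initialized = False

    def surprise(self, x: list[float]) -> float:
        """Mean absolute z-score of x under the current estimates."""
        z = [
            abs(xi - m) / math.sqrt(v + self.eps)
            for xi, m, v in zip(x, self.mean, self.var)
        ]
        return sum(z) / len(z)

    def update(self, x: list[float]) -> None:
        """Fold one observation into the running mean and variance."""
        if not self.initialized:
            # Seed the mean with the first observation.
            self.mean = list(x)
            self.initialized = True
            return
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += self.alpha * delta
            self.var[i] = (1 - self.alpha) * (self.var[i] + self.alpha * delta * delta)


estimator = EmaGaussian()
for _ in range(20):
    estimator.update([1.0] * 6)   # a steady stream of identical feature vectors
typical = estimator.surprise([1.0] * 6)
outlier = estimator.surprise([5.0] * 6)
```

An estimator like this needs no vocabulary, which fits the table's description of a mode used before TF-IDF is initialized.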

Graph Memory — Detection Accuracy

The 6-stage detection pipeline was evaluated on production interactions:

| Detection Stage | Precision | Recall | Notes |
| --- | --- | --- | --- |
| Correction detection | ~85% | ~70% | Short prompt + long response ratio heuristic |
| Preference detection | ~75% | ~65% | Cross-interaction cluster tracking |
| Pattern detection | ~90% | ~80% | Keyword frequency ≥3 in recent window |
| Concept nodes | ~95% | ~95% | New keyword = new node (low false positive risk) |
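
The pattern detection rule (keyword frequency ≥3 in a recent window) might look roughly like the sketch below. The `PatternDetector` class, the window size of 10 interactions, and the choice to count each keyword at most once per interaction are assumptions; the source specifies only the threshold of 3.

```python
from collections import Counter, deque


class PatternDetector:
    """Flags keywords that recur across a sliding window of interactions."""

    def __init__(self, window_size: int = 10, min_count: int = 3):
        # Oldest interactions fall out automatically once the window is full.
        self.window: deque[set[str]] = deque(maxlen=window_size)
        self.min_count = min_count

    def observe(self, keywords: list[str]) -> set[str]:
        """Record one interaction's keywords and return the current patterns."""
        self.window.append(set(keywords))  # count each keyword once per interaction
        counts = Counter(k for kws in self.window for k in kws)
        return {k for k, c in counts.items() if c >= self.min_count}


detector = PatternDetector()
first = detector.observe(["rust", "async"])
second = detector.observe(["rust"])
third = detector.observe(["rust", "tokio"])  # "rust" has now appeared 3 times
```

Here `third` contains only `"rust"`, since `"async"` and `"tokio"` have each appeared once.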

Weight Dynamics

| Rule | Value | Effect |
| --- | --- | --- |
| Initial correction weight | 0.5–0.85 | Based on structural heuristic confidence |
| Preference weight | 0.7 | Fixed |
| Concept weight | 0.4 | Fixed |
| Weight boost on match | +0.05 | Per interaction |
| Weight decay | 2% per ~10 interactions | For nodes >24h without match |
| Floor | 0.05 | Nodes never fully disappear |
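
As an illustration of how these rules could interact, the sketch below applies the boost, decay, and floor to a single node. It is not the project's implementation: the `Node` type and function names are hypothetical, the upper cap of 1.0 is an assumption, decay is assumed to be multiplicative, and "~10 interactions" is treated as exactly every 10th interaction.

```python
from dataclasses import dataclass

BOOST = 0.05                 # added on each matching interaction
DECAY_RATE = 0.02            # 2% per decay step
DECAY_EVERY = 10             # "~10 interactions", treated as exactly 10 here
STALE_AFTER_S = 24 * 3600    # decay only applies after 24h without a match
FLOOR = 0.05                 # nodes never fully disappear
CAP = 1.0                    # assumed upper bound, not stated in the source


@dataclass
class Node:
    weight: float
    last_match_ts: float     # seconds since epoch


def on_match(node: Node, now: float) -> None:
    """Boost a node's weight when an interaction matches it."""
    node.weight = min(CAP, node.weight + BOOST)
    node.last_match_ts = now


def maybe_decay(node: Node, now: float, interaction_count: int) -> None:
    """Decay a stale node's weight on every DECAY_EVERY-th interaction."""
    stale = (now - node.last_match_ts) > STALE_AFTER_S
    if stale and interaction_count % DECAY_EVERY == 0:
        node.weight = max(FLOOR, node.weight * (1 - DECAY_RATE))


preference = Node(weight=0.7, last_match_ts=0.0)
on_match(preference, now=100.0)                                      # weight rises to about 0.75
maybe_decay(preference, now=100.0 + 25 * 3600, interaction_count=10)  # weight falls to about 0.735
```

Under these assumptions a node sitting at the floor stays at 0.05 no matter how long it goes unmatched, which matches the table's note that nodes never fully disappear.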

LoRA Training — Convergence

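| Model | Fragments | Iterations | Final Loss | Adapter Size |
| --- | --- | --- | --- | --- |
| Llama 3.2 1B (4-bit) | 5 | 200 | 0.064 | ~5 MB |
| SmolLM2-360M-Instruct | 50 | 50 | 0.023 | ~2 MB |
| SmolLM2-135M-Instruct | 50 | 50 | 0.031 | ~1 MB |
| Llama-3.1-8B-4bit | 50 | 200 | 0.018 | ~5 MB |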

Training runs on Apple Silicon (M1–M4) via MLX. Typical training time: 30s–5min depending on model size and fragment count.
