Harvesters
Adding support for new AI tools and data sources — formats, deduplication, state management
Harvesters read AI tool data and emit interactions. Each follows a standard pattern with format-specific parsing.
Cursor Harvester
Reads from ~/.cursor/ and ~/Library/Application Support/Cursor/.
Supported Source Formats
| Source | Format | Source ID | Details |
|---|---|---|---|
| state.vscdb | SQLite ItemTable | cursor | Keys: aiChat.%, chat.%, composer.% |
| state.vscdb | SQLite cursorDiskKV | cursor | Keys: chat::%, composer::%, conversation::% |
| logs/ | JSONL / JSON / .log | cursor | Incremental read via file offsets |
| agent-transcripts/ | JSONL | cursor (id: cursor-ag-...) | Cursor v3+ |
| ai-tracking/ai-code-tracking.db | SQLite | cursor (id: cursor-tr-...) | conversation_summaries + ai_code_hashes |
Deduplication
- SHA-256 of messages + request + response + sessionId + timestamp, truncated to 16 hex characters
- Two-level: file offsets + processedIds Set (capped at 10,000)
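The ID hashing and the capped set can be sketched as follows. This is a minimal sketch rather than the harvester's actual code: the CursorRecord shape, the way the five fields are concatenated, and the oldest-first eviction policy are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape; real Cursor rows carry more fields.
interface CursorRecord {
  messages?: unknown[];
  request?: string;
  response?: string;
  sessionId?: string;
  timestamp?: number;
}

const MAX_PROCESSED_IDS = 10_000; // cap stated in the list above

// Content ID: SHA-256 over the five fields, truncated to 16 hex characters.
// Assumption: fields are concatenated as strings, with messages as JSON.
function interactionId(r: CursorRecord): string {
  const material =
    JSON.stringify(r.messages ?? []) +
    (r.request ?? "") +
    (r.response ?? "") +
    (r.sessionId ?? "") +
    String(r.timestamp ?? "");
  return createHash("sha256").update(material).digest("hex").slice(0, 16);
}

// Second dedup level: remember IDs and evict the oldest once the cap is hit.
// A Set iterates in insertion order, so its first value is the oldest.
function markProcessed(processedIds: Set<string>, id: string): boolean {
  if (processedIds.has(id)) return false; // already harvested
  processedIds.add(id);
  if (processedIds.size > MAX_PROCESSED_IDS) {
    const oldest = processedIds.values().next().value as string;
    processedIds.delete(oldest);
  }
  return true;
}
```

File offsets remain the first level: they stop the harvester from re-reading bytes it has already seen, while the set catches records that reappear in a different source.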
State File
~/.the-brain/cursor-harvester-state.json
{
  "lastPollTimestamp": 1714800000000,
  "processedIds": ["cursora1b2c3d4e5f6g7h8"],
  "fileOffsets": { "/path/to/log.jsonl": 12345 }
}
Claude Harvester
Reads from ~/.claude/projects/ and ~/.claude/history.jsonl.
Supported Source Formats
| Source | Format | Source ID | Details |
|---|---|---|---|
| projects/<slug>/ | JSONL sessions + .json sub-dirs | claude-code | Full user→assistant pairs |
| history.jsonl | JSONL | claude-code-history | Prompt-only, supplementary |
Filters: Excludes isMeta and isSidechain messages.
Deduplication
- SHA-256 of prompt + "\n" + response, truncated to 16 hex characters
- Three-level: file offsets + processedIds Set + in-batch seen Set
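The second and third levels can be sketched as below. This is an illustrative sketch under assumptions: the Pair shape is hypothetical, and the isMeta and isSidechain flags are shown on the pair for brevity, although the real harvester filters individual messages.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, simplified pair shape; only the flag names come from the docs.
interface Pair {
  prompt: string;
  response: string;
  isMeta?: boolean;
  isSidechain?: boolean;
}

// SHA-256 of prompt + "\n" + response, truncated to 16 hex characters.
function pairId(p: Pair): string {
  return createHash("sha256")
    .update(p.prompt + "\n" + p.response)
    .digest("hex")
    .slice(0, 16);
}

// Skip IDs persisted by earlier polls (processedIds) and IDs already emitted
// in this batch (seen). File offsets, the first level, apply earlier when
// deciding which bytes to read.
function dedupeBatch(batch: Pair[], processedIds: Set<string>): Pair[] {
  const seen = new Set<string>();
  const fresh: Pair[] = [];
  for (const p of batch) {
    if (p.isMeta || p.isSidechain) continue; // filtered message kinds
    const id = pairId(p);
    if (processedIds.has(id) || seen.has(id)) continue;
    seen.add(id);
    fresh.push(p);
  }
  for (const id of seen) processedIds.add(id); // persist for later polls
  return fresh;
}
```

The in-batch set matters because the same exchange can appear in both a session file and history.jsonl within a single poll, before processedIds has been updated.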
State File
~/.the-brain/claude-harvester-state.json
Hermes Harvester
Reads from ~/.hermes/state.db (Hermes Agent's SQLite database).
Supported Source Format
| Source | Format | Source ID | Details |
|---|---|---|---|
| state.db | SQLite (read-only) | hermes-agent | sessions + messages tables, user↔assistant pairs |
Filters: Excludes session_meta and tool role messages.
Deduplication
- SHA-256 of prompt + "\\n" + response, truncated to 16 hex characters
- Incremental via lastId offset (messages table AUTOINCREMENT id)
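The lastId cursor and the pairing step can be sketched over plain arrays, leaving the SQLite read itself out. This is a sketch under assumptions: the MessageRow columns other than id and role are hypothetical, and the pairing rule (each user message with the next assistant message in the same session) is one plausible reading of "user↔assistant pairs".

```typescript
// Hypothetical row shape for the messages table.
interface MessageRow {
  id: number;        // AUTOINCREMENT primary key
  sessionId: string;
  role: string;      // "user", "assistant", "tool", "session_meta", ...
  content: string;
}

interface HermesPair {
  sessionId: string;
  prompt: string;
  response: string;
}

// Pair each user message with the next assistant message in the same session,
// looking only at rows with an id greater than lastId.
function harvestSince(
  rows: MessageRow[],
  lastId: number,
): { pairs: HermesPair[]; lastId: number } {
  const pairs: HermesPair[] = [];
  const pending = new Map<string, string>(); // sessionId -> unanswered prompt
  let maxId = lastId;
  for (const row of [...rows].sort((a, b) => a.id - b.id)) {
    if (row.id <= lastId) continue; // already harvested
    maxId = row.id;
    if (row.role === "session_meta" || row.role === "tool") continue;
    if (row.role === "user") {
      pending.set(row.sessionId, row.content);
    } else if (row.role === "assistant" && pending.has(row.sessionId)) {
      pairs.push({
        sessionId: row.sessionId,
        prompt: pending.get(row.sessionId)!,
        response: row.content,
      });
      pending.delete(row.sessionId);
    }
  }
  return { pairs, lastId: maxId };
}
```

One caveat of this simple version: if a poll ends on an unanswered user message, the cursor moves past it and the pair is lost. A production harvester would need to hold the cursor back in that case and rely on the hash-based dedup to absorb the re-read.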
State File
~/.the-brain/hermes-state.json
{
  "lastId": 42,
  "lastAt": 1714800000000,
  "sessions": ["session-1"],
  "totalIx": 15,
  "totalSes": 3
}
Gemini CLI Harvester
Reads from ~/.gemini/tmp/ (Gemini CLI's local conversation logs).
Supported Source Formats
| Source | Format | Source ID | Details |
|---|---|---|---|
| ~/.gemini/tmp/<project>/logs.json | JSON array | gemini-cli | Flat message array: {sessionId, messageId, type, message, timestamp} |
| ~/.gemini/tmp/<project>/chats/session-*.json | JSON | gemini-cli-chat | Full chat sessions with block-based content |
| ~/.gemini/projects.json | JSON | (none) | Maps project paths to slugs (discovery only) |
Filters: Excludes "info" type messages. Pairs consecutive user → gemini messages.
Content blocks: In chat sessions, content is an array of blocks (text, tool_use, thinking). Text and thinking blocks are joined; tool_use blocks are stripped with a fallback [tool use] marker.
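The block handling described above might look like the following. This is a sketch under assumptions: the per-block field names (text, thinking), the newline separator, and the choice to emit the marker only when nothing readable remains are not confirmed by the source.

```typescript
// Block shapes are assumptions based on the three block types named above.
type Block =
  | { type: "text"; text: string }
  | { type: "thinking"; thinking: string }
  | { type: "tool_use"; name?: string; input?: unknown };

const TOOL_USE_MARKER = "[tool use]";

// Join text and thinking blocks and drop tool_use blocks. If a tool call was
// present but no readable text remains, fall back to the marker.
function flattenContent(blocks: Block[]): string {
  const parts: string[] = [];
  let sawToolUse = false;
  for (const block of blocks) {
    if (block.type === "text") parts.push(block.text);
    else if (block.type === "thinking") parts.push(block.thinking);
    else sawToolUse = true;
  }
  const joined = parts.filter((p) => p.trim() !== "").join("\n");
  if (joined !== "") return joined;
  return sawToolUse ? TOOL_USE_MARKER : "";
}
```

The fallback keeps a tool-only turn from producing an empty response, which would otherwise break user → gemini pairing for that exchange.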
Deduplication
- SHA-256 of prompt + "\x00" + response, truncated to 16 hex characters
- Per-session message ID dedup via processedMessageIds Set
State File
~/.the-brain/gemini-harvester-state.json
{
  "lastPollTimestamp": 1714800000000,
  "processedIds": ["gemini-a1b2c3d4e5f6g7h8"],
  "projectSlugs": ["my-project"],
  "fileOffsets": { "/path/to/logs.json": 12345 }
}
lm-eval Harvester
Reads lm-evaluation-harness JSON result files for benchmark tracking and regression fingerprinting. This lets the-brain serve as a cognitive layer for meta-harness systems.
Supported Source Format
| Source | Format | Source ID | Details |
|---|---|---|---|
| ~/.the-brain/eval-results/*.json | JSON | lm-eval | Standard lm-eval output with results, config, task_hashes |
Watch directory: Configurable via LM_EVAL_WATCH_DIR env var. Auto-created on first daemon start.
Deduplication
- Run-level: FNV-1a hash of model + task_hashes
- Processed hashes persisted in state file (capped at 1,000)
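A run-level hash along these lines could be computed as below. This is a sketch under assumptions: the 32-bit variant is inferred from the 8-character example hash in the state file, and the serialization of model and task_hashes (sorted keys, "|" and "," separators) is hypothetical.

```typescript
// 32-bit FNV-1a over the UTF-8 bytes of the input, as 8 hex characters.
function fnv1a32(input: string): string {
  const bytes = new TextEncoder().encode(input);
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Run key: model name plus task hashes in a stable (sorted) order, so the
// same run always maps to the same hash regardless of JSON key order.
// Assumption: the real serialization is not specified in the docs.
function runHash(model: string, taskHashes: Record<string, string>): string {
  const tasks = Object.keys(taskHashes)
    .sort()
    .map((task) => `${task}=${taskHashes[task]}`)
    .join(",");
  return fnv1a32(`${model}|${tasks}`);
}
```

FNV-1a is not collision resistant, which is acceptable here because the hash only has to tell a re-scanned result file apart from a new run.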
Fingerprinting
- Per-model per-benchmark per-metric running statistics via Welford's online algorithm
- Anomaly detection: >2σ deviation from baseline (>3 samples required)
- Confidence: grows with sample count, caps at 0.95
- Drift detection: sliding window Z-score against historical baseline
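The running statistics and the 2σ check can be sketched as follows. This is a minimal sketch, not the harvester's implementation: the RunningStats shape and function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption. The confidence and drift calculations are left out because their formulas are not given here.

```typescript
interface RunningStats {
  n: number;    // number of values seen
  mean: number; // running mean
  m2: number;   // running sum of squared deviations from the mean
}

// Welford's update: fold one value into the statistics without storing
// the earlier values.
function updateStats(stats: RunningStats, value: number): RunningStats {
  const n = stats.n + 1;
  const delta = value - stats.mean;
  const mean = stats.mean + delta / n;
  const m2 = stats.m2 + delta * (value - mean);
  return { n, mean, m2 };
}

// Sample standard deviation; 0 until there are at least two values.
function stdDev(stats: RunningStats): number {
  return stats.n > 1 ? Math.sqrt(stats.m2 / (stats.n - 1)) : 0;
}

// Flag a new score when it deviates more than 2 standard deviations from
// the baseline, provided the baseline has more than 3 samples.
function isAnomaly(stats: RunningStats, value: number): boolean {
  const sigma = stdDev(stats);
  if (stats.n <= 3 || sigma === 0) return false;
  return Math.abs(value - stats.mean) / sigma > 2;
}
```

A caller would check isAnomaly against the current baseline first and then fold the new score in with updateStats, so that an outlier does not dilute the baseline it is being compared to.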
State File
~/.the-brain/lm-eval-harvester-state.json
{
  "lastPollTimestamp": 1714800000000,
  "processedHashes": ["a1b2c3d4"],
  "fingerprints": {
    "claude-sonnet-4::mmlu::acc": {
      "modelName": "claude-sonnet-4",
      "benchmark": "mmlu",
      "metric": "acc",
      "mean": 0.892,
      "std": 0.012,
      "n": 15,
      "values": [0.88, 0.89, 0.90, ...]
    }
  }
}
Integration with Identity Anchor
The lm-eval harvester feeds into the identity anchor's HarnessFingerprintStore. On each ON_INTERACTION event, fingerprints are auto-updated. Custom hooks (identity-anchor:predictRegression, identity-anchor:assessSurprise) expose predictions to meta-harness consumers via MCP.
Windsurf Harvester
Reads Windsurf IDE's Cascade conversation history from the state.vscdb SQLite database.
Supported Source Format
| Source | Format | Source ID | Details |
|---|---|---|---|
| state.vscdb → codeium.windsurf → cachedActiveTrajectory:* | Base64 protobuf (wire-format) | windsurf | Trajectory steps: f19=user, f20=AI (f3=thinking, f7=tool_calls, f8=visible, f12=provider) |
Data location (User/globalStorage/state.vscdb):
- macOS: ~/Library/Application Support/Windsurf/ (also checks Windsurf - Next first)
- Linux: ~/.config/Windsurf/
- Windows: %APPDATA%/Windsurf/
Extracted data: user prompts, AI responses, thinking content (Cascade thinking mode), tool calls with parameters, provider info.
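Because the trajectory is stored as schema-less protobuf, extraction starts with a generic wire-format reader. The sketch below decodes only the top-level fields of one message; it is illustrative, all function names are hypothetical, and the nesting of the real trajectory is not described here. A length-delimited value that holds a nested message can be passed back into readFields.

```typescript
interface Field {
  no: number;                 // field number, e.g. 19 or 20
  wireType: number;           // 0 varint, 1 fixed64, 2 length-delimited, 5 fixed32
  value: number | Uint8Array; // varints as numbers, everything else as raw bytes
}

// Read a base-128 varint at pos; returns [value, next position].
// Values above 2^53 lose precision, which is fine for tags and lengths.
function readVarint(buf: Uint8Array, pos: number): [number, number] {
  let result = 0;
  let scale = 1;
  for (;;) {
    if (pos >= buf.length) throw new Error("truncated varint");
    const byte = buf[pos++];
    result += (byte & 0x7f) * scale;
    if ((byte & 0x80) === 0) return [result, pos];
    scale *= 128;
  }
}

// Decode the top-level fields of one protobuf message without a schema.
function readFields(buf: Uint8Array): Field[] {
  const fields: Field[] = [];
  let pos = 0;
  while (pos < buf.length) {
    let tag: number;
    [tag, pos] = readVarint(buf, pos);
    const no = Math.floor(tag / 8);
    const wireType = tag % 8;
    if (wireType === 0) {
      let value: number;
      [value, pos] = readVarint(buf, pos);
      fields.push({ no, wireType, value });
      continue;
    }
    let length: number;
    if (wireType === 2) [length, pos] = readVarint(buf, pos);
    else if (wireType === 1) length = 8;
    else if (wireType === 5) length = 4;
    else throw new Error(`unsupported wire type ${wireType}`);
    const end = pos + length;
    if (end > buf.length) throw new Error("truncated field");
    fields.push({ no, wireType, value: buf.subarray(pos, end) });
    pos = end;
  }
  return fields;
}

// UTF-8 text of every length-delimited field with the given number.
function textFields(fields: Field[], no: number): string[] {
  const decoder = new TextDecoder();
  return fields
    .filter((f) => f.no === no && f.value instanceof Uint8Array)
    .map((f) => decoder.decode(f.value as Uint8Array));
}

// The stored value is base64, so decode it before reading fields.
function decodeBase64Message(b64: string): Field[] {
  return readFields(Buffer.from(b64, "base64"));
}
```

Reading by field number rather than by generated schema is what makes the harvester sensitive to Windsurf updates: if a field is renumbered, the reader still succeeds but returns the wrong content.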
Deduplication
- SHA-256 of prompt + "\x00" + response, truncated to 16 hex characters
- Processed IDs persisted in state file (capped at 10,000)
- Trajectory size change detection (skip unchanged trajectories)
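The size check in the last item is cheap to express. This is a sketch under assumptions: the function name is hypothetical, and "size" is taken to be the length of the stored base64 value, keyed the same way as trajectorySizes in the state file.

```typescript
// Returns true when the trajectory should be decoded, and records its size
// so the next poll can skip it if nothing changed.
function shouldProcess(
  trajectorySizes: Record<string, number>,
  workspaceKey: string,
  size: number,
): boolean {
  if (trajectorySizes[workspaceKey] === size) return false; // unchanged
  trajectorySizes[workspaceKey] = size;
  return true;
}
```

A size comparison avoids base64 and protobuf decoding on every poll. An edit that leaves the size identical would be missed, so the hash-based dedup stays in place as the authoritative check.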
State File
~/.the-brain/windsurf-harvester-state.json
{
  "lastPollTimestamp": 1714800000000,
  "processedIds": ["a1b2c3d4e5f6g7h8"],
  "trajectorySizes": {
    "workspace_hash": 12345
  }
}
Project Detection
Reads workspaceStorage/<id>/workspace.json to resolve workspace paths for project context matching.
Limitations
- Only reads the active (most recently selected) trajectory per workspace
- To harvest a different conversation, first select it in Windsurf's Cascade sidebar
- Protobuf format may change with Windsurf updates
Creating a Custom Harvester
Required Behaviors
- Deduplication: SHA-256 hash of prompt + response
- State persistence: Save lastOffset / processedIds to ~/.the-brain/<name>-state.json
- Project detection: Match workDir against registered contexts
- Incremental reading: Track file offsets; never re-read
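Project detection can be as simple as a longest-prefix match on paths. This is a sketch under assumptions: the ProjectContext shape and the matchProject name are hypothetical, and how contexts are registered is outside its scope.

```typescript
import { resolve, sep } from "node:path";

// Hypothetical shape for a registered context.
interface ProjectContext {
  name: string;
  workDir: string;
}

// Pick the context whose workDir is the longest path prefix of the given
// directory, so nested projects win over their parents.
function matchProject(
  workDir: string,
  contexts: ProjectContext[],
): ProjectContext | undefined {
  const target = resolve(workDir);
  let best: ProjectContext | undefined;
  let bestLength = -1;
  for (const ctx of contexts) {
    const root = resolve(ctx.workDir);
    // Compare on a separator boundary so /app does not match /application.
    const prefix = root.endsWith(sep) ? root : root + sep;
    const inside = target === root || target.startsWith(prefix);
    if (inside && root.length > bestLength) {
      best = ctx;
      bestLength = root.length;
    }
  }
  return best;
}
```

Interactions that match no context can still be emitted; they simply carry no project association.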
Template
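import { definePlugin, HookEvent } from "@the-brain-dev/core";
import { createHash } from "node:crypto";
import { readFile, writeFile } from "node:fs/promises";
import { homedir } from "node:os";
import { join } from "node:path";

const STATE_PATH = join(homedir(), ".the-brain", "my-harvester-state.json");
// Placeholder log location: point this at your tool's JSONL log.
const LOG_PATH = join(homedir(), ".my-ide", "interactions.jsonl");

type LogLine = { prompt: string; response: string; timestamp: number };

// Return the complete JSONL lines appended after byte `offset`, plus the new offset.
async function readNewLines(offset: number) {
  const buf = (await readFile(LOG_PATH).catch(() => Buffer.alloc(0))).subarray(offset);
  const end = buf.lastIndexOf(0x0a) + 1; // skip a trailing partial line
  const lines = buf.subarray(0, end).toString("utf8").split("\n")
    .filter(Boolean).map((l) => JSON.parse(l) as LogLine);
  return { lines, nextOffset: offset + end };
}

export default definePlugin({
  name: "harvester-my-ide",
  async setup(hooks) {
    // Restore persisted state so a restart does not re-emit old interactions.
    let state = { lastOffset: 0, processedIds: [] as string[] };
    try {
      state = JSON.parse(await readFile(STATE_PATH, "utf8"));
    } catch {
      // First run: keep the empty default state.
    }

    hooks.hook(HookEvent.HARVESTER_POLL, async () => {
      const { lines, nextOffset } = await readNewLines(state.lastOffset);
      const seen = new Set(state.processedIds);
      for (const line of lines) {
        const hash = createHash("sha256")
          .update(line.prompt + "\x00" + line.response)
          .digest("hex");
        if (seen.has(hash)) continue;
        seen.add(hash);
        state.processedIds.push(hash);

        await hooks.callHook(HookEvent.HARVESTER_NEW_DATA, {
          interaction: {
            id: hash.slice(0, 16),
            timestamp: line.timestamp,
            prompt: line.prompt,
            response: line.response,
            source: "my-ide",
          },
          fragments: [],
          promoteToDeep() {},
        });
      }
      // Cap the dedup list, then persist the offset and IDs together.
      if (state.processedIds.length > 10000)
        state.processedIds = state.processedIds.slice(-5000);
      state.lastOffset = nextOffset;
      await writeFile(STATE_PATH, JSON.stringify(state));
    });
  },
});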