PatrickSys · Feb 10, 2026
diff --git a/.gitignore b/.gitignore
@@ -15,3 +15,4 @@ nul
 *~
 .claude
 .codebase-intelligence.json
+.cursor/
diff --git a/AGENTS.md b/AGENTS.md
@@ -12,6 +12,62 @@ These are non-negotiable. Every PR, feature, and design decision must respect th
 - **No overclaiming in public docs**: README and CHANGELOG must be evidence-backed. Don't claim capabilities that aren't shipped and tested.
 - **internal-docs is private**: Never commit `internal-docs/` pointer changes unless explicitly intended. The submodule is always dirty locally; ignore it.
 
+## Evaluation Integrity (NON-NEGOTIABLE)
+
+These rules prevent metric gaming, overfitting, and false quality claims. Violation of these rules means the feature CANNOT ship.
+
+### Rule 1: Eval Sets are Frozen Before Implementation
+
+- **Define test queries and expected results BEFORE writing any code**
+- Commit the eval fixture (e.g., `tests/fixtures/eval-queries.json`) BEFORE starting implementation
+- **NEVER adjust expected results to match system output** - If the system returns different results, that's a failure, not a fixture bug
+- Exception: If the original expected result was factually wrong (file doesn't exist, query is ambiguous), document the correction with justification
+
+### Rule 2: Eval Sets Must Be General
+
+- **Minimum 20 queries** across diverse patterns (exact names, conceptual, multi-concept, edge cases)
+- Test on **multiple codebases** (minimum 2: one you control, one public/real-world)
+- Include queries that are HARD and likely to fail - don't cherry-pick easy wins
+- Eval set must represent real user queries, not synthetic examples designed to pass
+
+### Rule 3: Public Eval Methodology
+
+- Full eval harness code must be in `tests/` (public repository)
+- Eval fixtures must be public (or provide reproducible public examples)
+- Document how to run eval: `npm run eval -- /path/to/codebase`
+- Results must be reproducible by external users
+
+### Rule 4: No Score Manipulation
+
+- **NEVER add heuristics specifically to game eval metrics** (e.g., "if query contains X, boost Y")
+- **NEVER adjust scoring to break ties just to improve top-1 accuracy**
+- If you add ranking heuristics, they must be general-purpose and justified by search theory, not by "it makes test #7 pass"
+- Document all ranking heuristics with research citations or principled justification
+
+### Rule 5: Report Honestly
+
+- Report **both improvements AND failures** (e.g., "9/20 pass, 11/20 fail")
+- If top-3 recall is 80% but top-1 is 45%, say so - don't hide behind a single cherry-picked metric
+- Acknowledge when improvements are **workarounds** (filtering, heuristics) vs **fundamental** (better embeddings, ML models)
+- Include failure analysis in CHANGELOG: "Known limitations: struggles with multi-concept queries"
+
+### Rule 6: Cross-Check with Real Usage
+
+- Before claiming "X% improvement", test on a real codebase you didn't develop against
+- Ask: "Would this improvement generalize to a Python codebase? A Go codebase?"
+- If the improvement is framework-specific (e.g., Angular-only), say so explicitly
+
+### Violation Response
+
+If any agent violates these rules:
+1. **STOP immediately** - do not proceed with the release
+2. **Revert** any fixture adjustments made to game metrics
+3. **Re-run eval** with frozen fixtures
+4. **Document the violation** in internal-docs for learning
+5. **Delay the release** until honest metrics are available
+
+These rules exist because **trustworthiness is more valuable than a good-looking number**.
+
 ## Codebase Context
 
 **At start of each task:** Call `get_memory` to load team conventions.
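
Rule 5 above asks for top-1 and top-3 results to be reported side by side, together with the failures. A minimal sketch of such a scorer follows. It is not the harness shipped in `tests/`: the `EvalQuery` shape, `scoreEval`, `formatReport`, and the `search` callback are hypothetical names chosen for illustration, and the real `eval-queries.json` schema may differ.

```typescript
// Hypothetical fixture entry; the real eval-queries.json schema may differ.
interface EvalQuery {
  query: string;
  expectedFile: string; // frozen before implementation (Rule 1)
}

interface EvalReport {
  total: number;
  top1Hits: number;
  top3Hits: number;
  failures: string[]; // queries whose expected file missed the top 3
}

// `search` stands in for whatever ranked retrieval is under test.
// It returns file paths ordered from best to worst match.
function scoreEval(
  fixture: readonly EvalQuery[],
  search: (query: string) => string[],
): EvalReport {
  const report: EvalReport = {
    total: fixture.length,
    top1Hits: 0,
    top3Hits: 0,
    failures: [],
  };
  for (const { query, expectedFile } of fixture) {
    const ranked = search(query);
    if (ranked[0] === expectedFile) {
      report.top1Hits++;
    }
    if (ranked.slice(0, 3).indexOf(expectedFile) !== -1) {
      report.top3Hits++;
    } else {
      report.failures.push(query); // kept for the failure analysis
    }
  }
  return report;
}

// Reports both metrics and the miss count together, never a single number.
function formatReport(r: EvalReport): string {
  return (
    `top-1: ${r.top1Hits}/${r.total}, ` +
    `top-3: ${r.top3Hits}/${r.total}, ` +
    `missed top-3: ${r.failures.length}/${r.total}`
  );
}
```

The fixture is only ever read here; nothing in the scorer writes expected results back, which keeps it compatible with Rule 1.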

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,37 @@
 # Changelog
 
+## [1.6.0](https://github.com/PatrickSys/codebase-context/compare/v1.5.1...v1.6.0) (2026-02-10)
+
+### Added
+
+- **Search Quality Improvements** — Weighted hybrid search with intent-aware classification
+  - Intent-aware query classification (EXACT_NAME, CONCEPTUAL, FLOW, CONFIG, WIRING)
+  - Reciprocal Rank Fusion (RRF, k=60) for robust rank-based score combination
+  - Hard test-file filtering (eliminates spec contamination in non-test queries)
+  - Import-graph proximity reranking (structural centrality boosting)
+  - File-level deduplication (one best chunk per file)
+- **Evaluation Harness** — Frozen fixture set with reproducible methodology
+- **Embedding Upgrade** — Granite model support (47M params, 8192 context)
+- **Chunk Optimization** — 100→50 lines, overlap 10→0, merge small chunks
+
+### Changed
+
+- **Dependencies**: `@xenova/transformers` v2 → `@huggingface/transformers` v3
+- **Indexing**: Tighter chunks (50 lines) with zero overlap
+- **Search**: Rank-based fusion (RRF) that does not depend on each retriever's score distribution
+
+### Fixed
+
+- Intent-blind search (conceptual queries now classified and routed correctly)
+- Spec file contamination (test files hard-filtered from non-test query results)
+- Embedding truncation (Granite's 8192-token context replaces the previous 512-token limit)
+
+### BREAKING CHANGES
+
+**Re-index required** after upgrade due to model and chunking changes:
+- Existing `.codebase-context/` indices from v1.5.x are incompatible
+- Run `refresh_index(incrementalOnly: false)` or delete `.codebase-context/` folder
+
 ## [1.5.1](https://github.com/PatrickSys/codebase-context/compare/v1.5.0...v1.5.1) (2026-02-08)
 
 

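The 1.6.0 entry above names Reciprocal Rank Fusion (k=60) and file-level deduplication. The sketch below shows the general technique only, assuming the standard RRF formula where each ranked list contributes `1 / (k + rank)` with ranks starting at 1. The names `reciprocalRankFusion`, `bestChunkPerFile`, and `ScoredChunk` are hypothetical and are not taken from this repository's source.

```typescript
// Hypothetical chunk shape used only for this illustration.
interface ScoredChunk {
  chunkId: string;
  file: string;
  score: number;
}

// Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank),
// where rank is the 1-based position of the id in each ranked list.
// Only positions are used, so the raw scores of each retriever never mix.
function reciprocalRankFusion(
  rankings: readonly string[][],
  k = 60,
): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// File-level deduplication: keep the highest-scoring chunk of each file,
// then return the survivors sorted from best to worst.
function bestChunkPerFile(chunks: readonly ScoredChunk[]): ScoredChunk[] {
  const best = new Map<string, ScoredChunk>();
  for (const chunk of chunks) {
    const current = best.get(chunk.file);
    if (current === undefined || chunk.score > current.score) {
      best.set(chunk.file, chunk);
    }
  }
  const result: ScoredChunk[] = [];
  best.forEach((chunk) => result.push(chunk));
  return result.sort((a, b) => b.score - a.score);
}
```

A chunk that appears in both a keyword ranking and a semantic ranking accumulates two terms, so agreement between retrievers is rewarded without normalizing their raw scores.
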
diff --git a/package.json b/package.json
@@ -94,10 +94,10 @@
     "type-check": "tsc --noEmit"
   },
   "dependencies": {
+    "@huggingface/transformers": "^3.8.1",
     "@lancedb/lancedb": "^0.4.0",
     "@modelcontextprotocol/sdk": "^1.25.2",
     "@typescript-eslint/typescript-estree": "^7.0.0",
-    "@xenova/transformers": "^2.17.0",
     "fuse.js": "^7.0.0",
     "glob": "^10.3.10",
     "hono": "4.11.7",
@@ -125,6 +125,7 @@
   "pnpm": {
     "onlyBuiltDependencies": [
       "esbuild",
+      "onnxruntime-node",
       "protobufjs",
       "sharp"
     ]