Corpus Analytics¶
Use Corpus Analytics when you want offline clustering of historical change-control intents, for example to compare agent workflow cohorts, inspect outliers, or export HTML/JSON summaries for maintainer review.
Prerequisites¶
- A repository with audit enabled and historical intent.declared events.
- Engineering Memory trajectory projection (optional but improves selection).
- Install optional dependencies:
Quick start¶
Build snapshot, embeddings, and a recommended clustering run in one step:
--use-recommended requires --sweep. It renders the heuristic winner for
inspection; it does not set selected_by_maintainer.
Use a versioned profile lens when the review question is more specific:
codeclone analytics build \
--root . \
--profile intent-small-balanced-v1 \
--use-recommended \
--html-out /tmp/profile-report.html \
--json-out /tmp/profile-report.json
--profile implies a finite sweep. --profile auto requires
default_profile_id in pyproject.toml; omitting --profile preserves the
ordinary single-run or sweep behavior.
Write a detailed single-run report to explicit paths:
codeclone analytics build \
--root . \
--representation description \
--html-out /tmp/corpus-clusters.html \
--json-out /tmp/corpus-clusters.json
Write a sweep comparison without choosing a primary detail view:
codeclone analytics build \
--root . \
--sweep \
--html-out /tmp/corpus-sweep.html \
--json-out /tmp/corpus-sweep.json
Reading the reports¶
Corpus Analytics separates formal technical validity from human interpretation:
flowchart LR
R["Persisted clustering run"] --> V{"V1-V10 pass?"}
V -->|"yes"| F["Full interpretation<br/>metrics, previews, provenance"]
V -->|"no"| L["Limited diagnostic<br/>codes, status, safe counts"]
F --> P{"Profile lens?"}
P -->|"yes"| S["Suitability + profile ranking"]
P -->|"no"| O["Global heuristic comparison"]
S --> O2["JSON 1.3 / HTML"]
O --> O2
L --> O
A valid run can still be only a candidate. The banner distinguishes maintainer-selected, profile-recommended, valid-but-profile-rejected, heuristically recommended, candidate-only, and technically invalid runs; none of those labels claims a semantic taxonomy.
Full reports show dominant-cluster ratios against both the whole corpus and
assigned non-noise items, bounded representative/boundary previews, numeric
summaries, categorical correlations, provenance completeness for small
clusters, and observable noise flags. Sweep comparison includes failed and
invalid runs as limited rows with unavailable metrics rather than silently
dropping them.
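The two dominant-cluster ratios can differ sharply when a run has a lot of noise. The following is a minimal sketch of that arithmetic, not CodeClone's implementation: the function name `dominant_cluster_ratios` is hypothetical, and the noise label `-1` is an assumption borrowed from the usual HDBSCAN convention.

```python
from collections import Counter

NOISE_LABEL = -1  # assumption: noise items carry the HDBSCAN-style label -1


def dominant_cluster_ratios(labels):
    """Return (ratio vs. whole corpus, ratio vs. assigned non-noise items)."""
    assigned = [label for label in labels if label != NOISE_LABEL]
    if not assigned:
        # Every item is noise (or the corpus is empty): no dominant cluster.
        return 0.0, 0.0
    dominant_size = Counter(assigned).most_common(1)[0][1]
    return dominant_size / len(labels), dominant_size / len(assigned)


# Ten items: three in cluster 0, two in cluster 1, five noise.
labels = [0, 0, 0, 1, 1, -1, -1, -1, -1, -1]
corpus_ratio, assigned_ratio = dominant_cluster_ratios(labels)
print(corpus_ratio, assigned_ratio)
```

In this example the largest cluster covers 30% of the corpus but 60% of the assigned items, which is why reading both figures together is useful.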
Normalized text previews are capped at 240 Unicode code points. JSON keeps raw
strings; HTML escapes them. The export content_disclosure block reports
whether previews were actually emitted and in which scopes. See Report Interpretability for the invariants and safe-output rules, and JSON export schema for the wire shape.
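The preview rules above can be illustrated with a short sketch. It assumes the cap is a plain truncation at 240 code points, because the text does not say whether a truncation marker is added. The helper names `cap_preview`, `to_json`, and `to_html` are hypothetical and are not part of CodeClone.

```python
import html
import json

PREVIEW_LIMIT = 240  # Unicode code points, as stated in the text above


def cap_preview(text, limit=PREVIEW_LIMIT):
    # Python str slicing counts code points, not bytes, so this enforces the cap.
    return text[:limit]


def to_json(preview):
    # JSON keeps the raw string; only JSON's own string quoting is applied.
    return json.dumps({"preview": preview}, ensure_ascii=False)


def to_html(preview):
    # HTML output escapes markup-significant characters.
    return "<pre>" + html.escape(preview) + "</pre>"


sample = cap_preview('fix <script> handling & "quotes"')
print(to_json(sample))
print(to_html(sample))
```

The same capped string is handed to both writers; only the HTML path changes it.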
Step-by-step¶
# 1. Immutable snapshot from audit + trajectory (+ optional registry overlay)
codeclone analytics snapshot --root .
# 2. Analytics embeddings (separate LanceDB sidecar)
codeclone analytics embed --root . --snapshot-id SNAPSHOT_ID
# 3. Cluster (add --sweep or --profile for a finite parameter search)
codeclone analytics cluster \
--root . \
--snapshot-id SNAPSHOT_ID \
--embedding-generation-id GENERATION_ID
# Optional profile registry and profile-scoped sweep
codeclone analytics profiles list --root .
codeclone analytics cluster \
--root . \
--snapshot-id SNAPSHOT_ID \
--embedding-generation-id GENERATION_ID \
--profile intent-small-discovery-v1
# 4. Inspect runs
codeclone analytics clusters --root . --snapshot-id SNAPSHOT_ID
codeclone analytics cluster-show \
--root . --snapshot-id SNAPSHOT_ID --run-id RUN_ID
# 5. Record an explicit maintainer choice
codeclone analytics cluster --root . --select-run RUN_ID \
--selected-by "$USER" \
--selection-rationale "Chosen for maintainer review"
For a profile-scoped decision, add
--selection-profile PROFILE_ID_OR_PROFILE_BATCH_ID. Use none for global
scope.
Configuration¶
Defaults live in [tool.codeclone.analytics] inside pyproject.toml. See
Corpus Analytics contract for the full table.
The historical audit source follows top-level [tool.codeclone].audit_path.
[tool.codeclone.analytics]
default_profile_id = "intent-small-balanced-v1"
profile_paths = ["analytics/profiles/team-review.json"]
sweep_pca_dimensions = [32, 64, 128]
sweep_min_cluster_sizes = [5, 8, 12, 15]
sweep_min_samples = [1, 3, 5]
sweep_selection_methods = ["eom", "leaf"]
Repository-local manifests use the same schema as bundled profiles. Paths must
resolve to files inside the repository. The default profile is consulted only
for explicit --profile auto.
Reproducibility¶
Exports persist snapshot and embedding manifests, vector digests, requested and effective parameters, fixed PCA/HDBSCAN settings, package versions, and the random seed. Unless the model revision and artifact fingerprint are known, CodeClone explicitly reports that full vector reproducibility is not guaranteed from the model id alone.
Existing embedding generations created under an incompatible embedding contract
are rejected. Run embed again for the same snapshot to create a compatible
generation.
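To show why a vector digest helps when comparing exports, here is a minimal sketch of one possible digest. It is an assumption for illustration only: the text does not specify the algorithm or byte layout CodeClone uses, and `vector_digest` is a hypothetical name. The sketch hashes each vector as little-endian float32 values with SHA-256.

```python
import hashlib
import struct


def vector_digest(vectors):
    """Hash a sequence of float vectors in a fixed, order-sensitive byte layout."""
    hasher = hashlib.sha256()
    for vector in vectors:
        # Prefix each vector with its length so differently shaped inputs
        # with the same flattened values do not collide.
        hasher.update(struct.pack("<I", len(vector)))
        hasher.update(struct.pack(f"<{len(vector)}f", *vector))
    return hasher.hexdigest()


first = vector_digest([[0.1, 0.2], [0.3, 0.4]])
second = vector_digest([[0.1, 0.2], [0.3, 0.4]])
changed = vector_digest([[0.1, 0.2], [0.3, 0.5]])
print(first == second, first == changed)
```

Matching digests show that two runs saw byte-identical vectors. They say nothing about whether the same model id would regenerate those vectors, which is the caveat the export reports.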
Failure behavior¶
- Expected input, capability, schema, and artifact-integrity errors exit with code 2 and no traceback.
- A clustering run is persisted as running, then becomes completed or failed; failed runs contain no committed assignments or summaries.
- Resolved invalid or failed runs remain exportable in limited diagnostic mode; they never receive partition metrics, previews, score, or rank.
- A missing embedding-generation record is rendered explicitly as unavailable metadata rather than fabricated from the run.
- JSON and HTML outputs are written atomically.
- Snapshot, embed, cluster, and report spans are recorded only when
CODECLONE_OBSERVABILITY_ENABLED=1.
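Atomic output means a reader never observes a half-written report. A common way to get that guarantee, sketched here as an assumption and not as CodeClone's actual code, is to write to a temporary file in the target directory and then rename it over the destination with `os.replace`. The function name `write_json_atomically` is hypothetical.

```python
import json
import os
import tempfile


def write_json_atomically(path, payload):
    """Write payload as JSON so that path holds either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the destination directory, so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # replaces any existing file in one step
    except BaseException:
        # Clean up the partial temporary file and leave the old report as-is.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If serialization fails partway through, the previous report at `path` is left untouched and the temporary file is removed.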
What this is not¶
- Not a second analyzer: it does not replace codeclone structural reports.
- Not Engineering Memory semantic search: vectors are stored separately.
- Not MCP-visible in Slice 1: CLI only.
Contract reference: 27-corpus-analytics.md.