Corpus Analytics¶
Corpus Analytics is an optional, offline analytics lane for clustering historical change-control intents. It reconstructs an intent corpus from retained controller evidence, creates immutable-by-contract snapshots, writes separate analytics embeddings, and runs deterministic PCA + HDBSCAN clustering. Slice 1.1 adds an interpretation plane over those persisted facts. Slice 1.2 adds a control plane: versioned profile lenses, finite profile-scoped sweeps, separate suitability and ranking, immutable batch receipts, and append-only maintainer selection events.
It is derived evidence, not authority. Corpus Analytics never changes the canonical structural report, reports/gates/baselines, cache compatibility, Engineering Memory governance, or edit authorization.
For a command-oriented walkthrough, see the Corpus Analytics guide. Configuration is indexed in Config and Defaults, CLI behavior in CLI, and storage layout in Schema Layouts.
Trust Boundary¶
flowchart LR
A["Audit DB<br/>intent.declared"] --> S["Corpus snapshot<br/>SQLite metadata"]
T["Trajectory projection<br/>outcome and quality"] --> S
P["Patch Trail<br/>scope and verification facts"] --> S
R["Live registry overlay<br/>inspection only"] -.-> S
S --> E["Embedding generation<br/>separate LanceDB sidecar"]
E --> C["L2 normalize<br/>PCA(full)<br/>HDBSCAN(euclidean)"]
C --> D["Persisted assignments<br/>summaries and diagnostics"]
D --> V{"V1-V10<br/>technically valid?"}
V -->|" yes "| Q["Profile suitability<br/>optional lens, soft gates"]
Q --> F["Full interpretation<br/>metrics, previews, provenance"]
V -->|" no "| L["Limited diagnostic<br/>status, validity, safe raw counts"]
F --> J["JSON export 1.3"]
F --> H["Self-contained HTML"]
L --> J
L --> H
S -. " never authorizes or gates " .-> X["Structural report<br/>baseline, gates, memory governance"]
Source ownership is explicit:
| Fact | Owner |
|---|---|
| Original description and declaration order | Earliest audit intent.declared by audit_sequence |
| Declared/changed files and verification facts | Patch Trail |
| Outcome, quality tier, labels, anomalies | Selected current-version trajectory |
| Lease/status and other live coordination state | Optional registry overlay |
The registry overlay is exported for inspection when present, but it never
changes normalized text, representation identity, or source_digest.
End-to-End Lifecycle¶
sequenceDiagram
participant U as Maintainer
participant CLI as codeclone analytics
participant SQL as Analytics SQLite
participant V as Analytics LanceDB
participant O as Platform Observability
U ->> CLI: snapshot --root .
CLI ->> O: analytics.snapshot span
CLI ->> SQL: snapshot + corpus items
U ->> CLI: embed --snapshot-id SNAPSHOT
CLI ->> O: analytics.embed span
CLI ->> V: float32 vectors
CLI ->> SQL: generation + row keys + digests
U ->> CLI: cluster --snapshot-id ... --embedding-generation-id ...
CLI ->> O: analytics.cluster span
CLI ->> SQL: running run
CLI ->> V: validated vectors
CLI ->> SQL: completed assignments and summaries
opt --profile PROFILE
CLI ->> SQL: manifest snapshot + immutable profile batch
CLI ->> SQL: suitability assessments + profile recommendation
end
opt --select-run RUN
CLI ->> SQL: append-only selection event
end
U ->> CLI: cluster-show or build outputs
CLI ->> O: analytics.report span
CLI -->> U: atomic JSON / HTML
Every clustering run is inserted as running. A successful run atomically
commits assignments, summaries, and completed; a processing error rolls those
artifacts back and persists failed with an error message.
Installation¶
Capability tiers:
| Tier | Packages | Commands |
|---|---|---|
base |
core only | snapshot, clusters, cluster-show, outliers, cluster --select-run |
embed |
FastEmbed + LanceDB | embed |
cluster |
scikit-learn + external hdbscan |
clustering and sweep |
full |
all of the above | build |
Missing optional dependencies are contract errors (exit 2) with an install
hint. Inspection/export commands do not import FastEmbed.
umap-learn remains an optional dependency on supported Python versions, but
Slice 1 does not emit a UMAP visualization. Any later UMAP view must be labeled
visualization-only and must never feed clustering.
Configuration¶
[tool.codeclone.analytics] overrides repository-local defaults. Relative paths
resolve from the repository root; absolute paths are allowed but are represented
as <external> in snapshot manifests so user-specific paths do not enter
portable identity.
| Key | Default | Contract |
|---|---|---|
db_path |
.codeclone/analytics/corpus_clustering.sqlite3 |
Analytics metadata store |
vectors_path |
.codeclone/analytics/corpus_vectors |
Dedicated LanceDB vectors |
embedding_model |
BAAI/bge-small-en-v1.5 |
FastEmbed model id |
embedding_dimension |
384 |
Vector width |
embedding_provider |
fastembed |
Only supported provider in Slice 1 |
embedding_cache_dir |
memory semantic cache | Shared model artifact cache, not shared vectors |
min_correlation_sample_size |
5 |
Correlation denominator guard |
cluster_random_seed |
42 |
PCA deterministic seed |
default_pca_dimensions |
64 |
Requested PCA width |
default_min_cluster_size |
8 |
HDBSCAN default |
default_min_samples |
3 |
HDBSCAN default |
default_cluster_selection_method |
eom |
eom or leaf |
default_profile_id |
unset | Used only by explicit --profile auto |
profile_paths |
[] |
Additional repo-contained manifest files |
sweep_pca_dimensions |
[32, 64, 128] |
Non-profile sweep PCA axis |
sweep_min_cluster_sizes |
[5, 8, 12, 15] |
Non-profile sweep size axis |
sweep_min_samples |
[1, 3, 5] |
Non-profile sweep sample axis |
sweep_selection_methods |
["eom", "leaf"] |
Non-profile sweep method axis |
allow_model_download |
memory semantic setting | Whether FastEmbed may download |
The historical audit database follows top-level
[tool.codeclone].audit_path. This prevents Analytics from silently reading a
different audit source than the controller.
Identity And Digests¶
Corpus identity has three layers:
source_record_key = sha256(project_id + "\n" + intent_id)
representation_key = sha256(lane + kind + version + source_record_key)
snapshot_item_id = sha256(snapshot_id + "\n" + representation_key)
source_digest hashes source schema versions, lane, representation contract,
normalizer version, and sorted source/provenance digests. It excludes:
- snapshot ids and timestamps;
- absolute source paths;
- live registry overlay state.
Representation contract 3 retains the contract-2 raw-input hashing rules and
materializes explicit provenance presence facts for new snapshots:
provenance.trajectory.selected;provenance.patch_trail.present;provenance.registry_overlay.present.
For description_with_frame, representation identity includes description,
intent kind, declared path families, and typed declared constraints before
normalization. Registry-overlay content and presence remain outside
source_digest; existing contract-2 snapshots are immutable and are not
backfilled.
Cluster membership identity is:
HDBSCAN numeric labels are not stable identity. Display ids are assigned after canonical ordering by size descending, actual PCA-space medoid item id, then membership digest. Noise remains an explicit non-display bucket.
Storage And Integrity¶
Current analytics store schema is 1.2.
- Writable open migrates supported
1.0stores through1.1to1.2. - Read-only open never migrates and rejects a stale schema.
- SQLite relationship triggers reject orphan-producing inserts/updates/deletes.
- Vector row keys and non-null display cluster ids are unique.
- Reporting and inspection open the metadata store read-only.
SQLite and LanceDB cannot participate in one physical transaction. The embedding workflow therefore:
- computes a new generation;
- stages SQLite metadata;
- writes LanceDB rows;
- commits SQLite only after the sidecar write succeeds;
- rolls back metadata and removes the generation on ordinary failures.
Before clustering, CodeClone validates the generation contract, exact snapshot item set, dimensions, row keys, and vector digests. Cross-snapshot runs, missing sidecar rows, stale embedding contracts, or corrupted float32 payloads are rejected rather than accepted as completed analytics.
Embedding Reproducibility¶
Embedding contract 2 stores:
- provider and provider package version;
- model id, optional revision, optional artifact fingerprint;
- dimensions and embedding contract version;
- cosine similarity manifest and L2 preprocessing contract;
- vector row key and SHA-256 digest over canonical little-endian float32 bytes.
When model revision/artifact fingerprint is unavailable,
exact_model_artifact_reproducibility=false. JSON and HTML then state:
Full vector reproducibility is not guaranteed from model id alone.
Exact reproduction additionally depends on the model artifact, provider and numeric-library versions, hardware/runtime behavior, and identical normalized inputs. Old embedding contract generations must be regenerated.
Clustering Contract¶
The fixed path is:
float32 embeddings
-> L2 normalization
-> PCA(svd_solver="full", whiten=false, random_state=42)
-> external hdbscan.HDBSCAN(metric="euclidean", core_dist_n_jobs=1)
-> canonical partitions
-> diagnostics
The run manifest records Python, NumPy, SciPy, scikit-learn, and HDBSCAN
versions plus all fixed algorithm choices. run_digest covers snapshot,
embedding generation, effective sample/feature dimensions, effective
parameters, random seed, and the algorithm manifest.
A sweep discards invalid small-corpus candidates and deduplicates requested settings that collapse to the same effective parameters. A corpus with no valid candidate fails explicitly instead of producing an empty successful sweep.
Sweep ranking sets exactly one generation-wide
recommended_by_heuristic=true. Maintainer selection is an append-only event:
codeclone analytics cluster --root . --select-run RUN_ID \
--selected-by "$USER" \
--selection-rationale "Best inspectable partition"
The legacy selected_by_maintainer run field is synchronized only for global
selection and is not authoritative. Recommendation is evidence, not a human
decision.
Profile Control Plane (Slice 1.2)¶
A profile is a versioned lens over completed clustering facts, not a property of a run and not a semantic taxonomy. The bundled registry contains stable, balanced, discovery, and outlier-oriented lenses. Inspect it with:
codeclone analytics profiles list --root .
codeclone analytics profiles show --root . \
--profile-id intent-small-balanced-v1
codeclone analytics profiles validate --root .
flowchart LR
M["Validated manifest<br/>schema 1"] --> G["Finite deduplicated grid"]
G --> B["Immutable profile batch<br/>manifest + candidate-space digests"]
B --> R["Completed run memberships"]
R --> V["Technical validity<br/>V1-V10"]
V --> S["Profile suitability<br/>soft gates"]
S --> K["Profile-aware ranking<br/>suitable runs only"]
K --> P["Profile recommendation"]
P --> E["JSON 1.3 / HTML"]
H["Maintainer selection event"] --> E
Run a profile sweep explicitly:
codeclone analytics cluster \
--root . \
--snapshot-id SNAPSHOT_ID \
--embedding-generation-id GENERATION_ID \
--profile intent-small-discovery-v1
--profile implies --sweep. --profile auto uses
default_profile_id; when --profile is absent, the default profile is never
applied. Profile grids are authoritative for profile sweeps. For ordinary
sweeps, --sweep-pca, --sweep-min-cluster-size,
--sweep-min-samples, and --sweep-selection-method replace the matching
configured axes. Single-run flags (--pca-dimensions,
--min-cluster-size, --min-samples, --cluster-selection-method) are
mutually exclusive with sweep mode.
Every profile execution creates a new immutable batch receipt. Candidate
failures are retained as failed runs but do not abort remaining candidates;
the batch becomes completed_partial when at least one candidate succeeds.
Technical validity, profile suitability, and maintainer acceptance remain
three separate verdict levels.
Profile-scoped selection names a batch directly, or resolves the latest batch for a profile:
codeclone analytics cluster --root . --select-run RUN_ID \
--selection-profile pbatch-0123456789abcdef
See Schema Layouts for the immutable tables and CLI for the complete flag matrix.
Diagnostics¶
Each cluster summary includes:
- size and corpus percentage;
- average membership strength;
- PCA-space medoid;
- representatives and low-strength/far-boundary items;
- nearest cluster ids by PCA centroid distance;
- metadata distributions with numerator and denominator;
- explicit
insufficient_samplewhen the denominator is below the configured guard.
The noise explorer emits only observable text/membership flags:
short_text, long_text, multiple_paragraphs,
high_conjunction_count, template_match, and
low_membership_strength. It does not invent semantic classes.
Report Interpretability (Slice 1.1)¶
The report layer does not decide whether a cluster is semantically meaningful. It first evaluates formal persisted-data invariants, then projects only the facts that those invariants permit.
flowchart TD
P["Persisted snapshot, run, assignments, summaries"] --> A["assess_partition_validity"]
A -->|" V1-V10 pass "| OK["full_interpretation"]
A -->|" one or more fail "| BAD["limited_diagnostic"]
OK --> M["Partition metrics<br/>dominant ratios and size histogram"]
OK --> I["Cluster interpretation<br/>previews, correlations, numeric summaries"]
OK --> R["Small-cluster provenance completeness"]
BAD --> D["Validity codes + presentation banner<br/>safe diagnostic_facts only"]
M --> X["Shared JSON / HTML projection"]
I --> X
R --> X
D --> X
Technical validity covers:
| Invariant | Formal check |
|---|---|
V1 |
assignments exactly cover snapshot items with no duplicate item ids |
V2 |
assignment labels and unique summaries are fully linked, including noise |
V3 |
every summary size and membership digest matches its members |
V4 |
every assignment carries its summary membership digest |
V5 |
non-noise clusters satisfy effective min_cluster_size |
V6a |
persisted numeric values used by interpretation are finite or null |
V7 |
run is completed and carries the canonical algorithm manifest |
V8 |
embedding generation metadata covers the snapshot item set |
V9 |
representative and boundary ids exist and belong to their cluster |
V10 |
every decoded persisted JSON field has the expected object shape |
An invalid run is still inspectable. JSON and HTML expose its invariant codes,
status, presentation banner, and only the raw counts allowed by the safe-output
matrix. They omit partition_metrics, cluster interpretation, item previews,
and heuristic score. A missing embedding-generation record is represented as
embedding_generation: null with an empty embedding_items array.
Presentation is separate from validity:
maintainer_selectedis explicit persisted provenance, not taxonomy truth;heuristic_recommendedis sweep evidence, not a semantic verdict;candidate_onlyis a valid run selected by neither mechanism;technically_invalidalways forceslimited_diagnostic.
Slice 1.2 adds profile_recommended, valid_but_profile_rejected, and the
comparison-level no_profile_suitable_candidate banner. Profile rejection
never removes partition metrics or changes a technically valid run to limited
diagnostic mode. Labels and descriptions come from the persisted manifest
snapshot linked to the batch, not from the current working tree.
Full interpretation includes the largest-cluster ratio against the whole corpus and against assigned non-noise items, a fixed cluster-size histogram, representative and boundary previews, categorical correlations, numeric summaries for file counts and description length, and observable machine-inspectability signals. Small clusters (up to 15 items) also show provenance completeness.
Previews are normalized corpus text, truncated to 240 Unicode code points.
They appear only for representatives, boundary items, and noise exploration.
JSON keeps raw strings with ordinary JSON escaping; HTML escapes text at render
time. content_disclosure is computed from the previews actually emitted and
lists their scopes. Default exports never attach text previews to every
items[] entry.
Export schema 1.2 introduced the interpretation fields:
interpretation_contract_version = "1.0"andcontent_disclosure;clustering_run.validityandclustering_run.presentation;partition_metricsin full mode ordiagnostic_factsin limited mode;- per-cluster
interpretationblocks in full mode; - candidate-local nullable
comparisonfacts and top-levelcomparison_summary.
Export schema 1.3 preserves those keys and adds:
interpretation_contract_version = "1.1";control_plane_contract_version = "1.0";- optional run-level
profile_contextand activeselection; - optional sweep-level
profile_summary; - candidate-local
profile_suitableandis_profile_recommended.
Sweep comparison includes every persisted run for the requested snapshot and
embedding generation, including failed or otherwise invalid runs. Only valid
runs receive a score and rank. Invalid dominant ratios and largest-cluster size
are null in JSON and unavailable in HTML.
For the wire layout, see Schema Layouts. For compatibility rules, see Compatibility and Versioning.
CLI And Reports¶
The approved direct namespace is codeclone analytics:
| Command | Purpose |
|---|---|
snapshot |
Build an intent corpus snapshot |
embed |
Generate a separate analytics embedding generation |
cluster |
Run one configuration or a bounded sweep |
build |
Run snapshot → embed → cluster |
clusters |
List runs for a snapshot |
cluster-show |
Export a resolved run as full interpretation or limited diagnostics |
outliers |
Emit noise assignment ids |
profiles |
List, show, or validate profile manifests |
build --sweep --use-recommended renders the global heuristic winner.
build --profile PROFILE --use-recommended renders the profile-batch winner,
or the comparison view when no candidate satisfies the lens. Neither action
records a maintainer selection. --use-recommended without explicit or
profile-implied sweep is rejected before dependency checks or artifact
creation.
Output behavior:
- single-run JSON contains snapshot/generation manifests, validity, presentation, and either full interpretation or limited diagnostic facts;
- sweep JSON contains every persisted candidate, nullable comparison fields, and aggregate valid/invalid/recommendation/selection counts;
- sweep HTML without
--use-recommendedis comparison-only; - detailed full-mode HTML includes dominant ratios, cluster index, escaped representative/boundary previews, split categorical/numeric metadata, provenance completeness, and the noise explorer;
- detailed limited-mode HTML includes the technical-invalid banner and safe diagnostic overview, without cluster interpretation panels;
- JSON and HTML are self-contained and written atomically to explicit output paths.
Expected user/config/capability/schema/integrity errors exit 2 on stderr
without a traceback.
Observability¶
With CODECLONE_OBSERVABILITY_ENABLED=1, the CLI creates one operation named
cli.analytics.<command> with nested spans:
analytics.snapshotanalytics.embedanalytics.clusteranalytics.buildanalytics.reportwhen an export is rendered
Observability is bootstrapped before analytics stores open, so instrumented SQLite queries are attributed to the active stage. These measurements are development telemetry only; see Platform Observability.
Cross-Links¶
- Historical trajectory evidence: Trajectory Quality and Passport
- Runtime configuration: Config and Defaults
- Exit semantics and terminal surfaces: CLI
- Version bump rules: Compatibility and Versioning
- SQLite/LanceDB layout: Schema Layouts
Locked By Tests¶
tests/test_analytics_foundation.pytests/test_analytics_trajectory_selection.pytests/test_analytics_integration.pytests/test_analytics_integrity.pytests/test_analytics_reporting.pytests/test_analytics_cli.pytests/test_config_analytics.pytests/test_sqlite_readonly_openers.pytests/test_architecture.py::test_analytics_package_does_not_import_forbidden_surfaces