12. Determinism
Purpose
Document deterministic behavior and canonicalization controls.
Public surface
- Sorting and traversal: `codeclone/scanner.py`, `codeclone/report/serialize.py`, `codeclone/cache.py`
- Canonical hashing: `codeclone/baseline.py`, `codeclone/cache.py`
- Golden detector snapshot policy: `tests/test_detector_golden.py`
Data model
Deterministic outputs depend on:
- fixed Python tag
- fixed baseline/cache/report schemas
- sorted file traversal
- sorted group keys and item records
- canonical JSON serialization for hashes
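The "sorted file traversal" input can be illustrated with a minimal sketch. The helper name `iter_python_files` is hypothetical and is not taken from `codeclone/scanner.py`; it only shows one way to make a directory walk independent of filesystem enumeration order.

```python
import os
from pathlib import Path
from typing import Iterator


def iter_python_files(root: Path) -> Iterator[Path]:
    """Yield .py files under root in a stable order (hypothetical helper)."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place fixes the order in which os.walk descends.
        dirnames.sort()
        # Sorting filenames removes any dependence on directory listing order.
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield Path(dirpath) / name
```

With this approach, files in a directory are yielded before its subdirectories, and two runs over the same tree produce the same sequence on any filesystem.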
Contracts
- JSON report uses deterministic ordering for files/groups/items.
- TXT report uses deterministic metadata key order and group/item ordering.
- Baseline hash is canonical and independent of non-payload metadata fields.
- Cache signature is canonical and independent of JSON whitespace.
Refs:
- `codeclone/report/json_contract.py:build_report_document`
- `codeclone/report/serialize.py:render_text_report_document`
- `codeclone/baseline.py:_compute_payload_sha256`
- `codeclone/cache.py:_sign_data`
Invariants (MUST)
- `inventory.file_registry.items` is lexicographically sorted.
- Finding groups/items and derived hotlists are deterministically ordered.
- Baseline clone lists are sorted and unique.
- Golden detector test runs only on the canonical Python tag from fixture metadata.
Refs:
- `codeclone/report/json_contract.py:_build_inventory_payload`
- `codeclone/baseline.py:_require_sorted_unique_ids`
- `tests/test_detector_golden.py::test_detector_output_matches_golden_fixture`
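The "sorted and unique" invariant for baseline clone lists can be checked in a single pass. The sketch below is an illustration only: `require_sorted_unique` and its `field` parameter are hypothetical and may differ from the actual `_require_sorted_unique_ids` in `codeclone/baseline.py`.

```python
from typing import Sequence


def require_sorted_unique(ids: Sequence[str], *, field: str) -> None:
    """Raise ValueError unless ids is strictly ascending (hypothetical helper)."""
    for previous, current in zip(ids, ids[1:]):
        # Strictly ascending implies both "sorted" and "no duplicates".
        if previous >= current:
            raise ValueError(
                f"{field} must be sorted and unique; "
                f"{current!r} follows {previous!r}"
            )
```

Comparing each adjacent pair with `>=` covers both conditions at once: an out-of-order pair and a repeated ID fail the same test.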
Failure modes
| Condition | Determinism impact |
|---|---|
| Different Python tag | Clone IDs may differ; baseline considered incompatible |
| Unsorted/non-canonical baseline IDs | Baseline rejected as invalid |
| Cache signature mismatch | Cache ignored and recomputed |
| Different cache provenance state | meta.cache_* differs by design |
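The first row of the table can be pictured as a simple equality check between the tag recorded in the baseline and the tag of the running interpreter. This is a sketch under assumptions: the tag format (`cp312`-style) and both function names are hypothetical, and only the `python_tag` field name comes from this document.

```python
import sys
from typing import Mapping, Optional


def current_python_tag() -> str:
    """Build a tag such as 'cp312' for the running interpreter (assumed format)."""
    prefix = {"cpython": "cp", "pypy": "pp"}.get(
        sys.implementation.name, sys.implementation.name
    )
    return f"{prefix}{sys.version_info.major}{sys.version_info.minor}"


def baseline_is_compatible(
    baseline_meta: Mapping[str, object], *, python_tag: Optional[str] = None
) -> bool:
    """Return True only when the baseline was produced under the same tag."""
    tag = python_tag if python_tag is not None else current_python_tag()
    # A missing tag is treated as incompatible rather than guessed.
    return baseline_meta.get("python_tag") == tag
```

Treating a missing or different tag as incompatible keeps clone IDs from two interpreter versions from being compared as if they were the same.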
Determinism / canonicalization
Primary canonicalization points:
- `json.dumps(..., sort_keys=True, separators=(",", ":"), ensure_ascii=False)` for baseline/cache payload hash/signature.
- Tuple-based sort keys for report record arrays.
Refs:
- `codeclone/baseline.py:_compute_payload_sha256`
- `codeclone/cache.py:_canonical_json`
- `codeclone/report/json_contract.py:_build_integrity_payload`
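Both canonicalization points can be shown in a few lines. This is a minimal sketch, not the project's implementation: the `json.dumps` arguments are the ones listed above, while the function names, the choice of SHA-256 over UTF-8 bytes, and the record fields `path`, `start_line`, and `name` are assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List


def canonical_json(payload: Any) -> str:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )


def payload_sha256(payload: Any) -> str:
    """Hash the canonical form, so key order and source whitespace do not matter."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()


def sort_records(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order records by a tuple key (hypothetical fields)."""
    # Tuples compare element by element; integer line numbers compare
    # numerically, so line 2 sorts before line 10.
    return sorted(records, key=lambda r: (r["path"], r["start_line"], r["name"]))
```

Because the hash is computed from the re-serialized payload rather than from the bytes on disk, two files that differ only in indentation or key order produce the same digest.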
Locked by tests
- `tests/test_report.py::test_report_json_deterministic_group_order`
- `tests/test_report.py::test_report_json_deterministic_with_shuffled_units`
- `tests/test_report.py::test_text_report_deterministic_group_order`
- `tests/test_baseline.py::test_baseline_hash_canonical_determinism`
- `tests/test_cache.py::test_cache_signature_validation_ignores_json_whitespace`
Non-guarantees
- Determinism is not guaranteed across different `python_tag` values.
- Byte-identical reports are not guaranteed across different cache provenance states (`cache_status`, `cache_used`, `cache_schema_version`).