12. Determinism

Purpose

Document deterministic behavior and canonicalization controls.

Public surface

  • Sorting and traversal: codeclone/scanner.py, codeclone/report/serialize.py, codeclone/cache.py
  • Canonical hashing: codeclone/baseline.py, codeclone/cache.py
  • Golden detector snapshot policy: tests/test_detector_golden.py

Data model

Deterministic outputs depend on:

  • fixed Python tag
  • fixed baseline/cache/report schemas
  • sorted file traversal
  • sorted group keys and item records
  • canonical JSON serialization for hashes
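
As a rough illustration of the "sorted file traversal" dependency, the sketch below sorts discovered files by their POSIX-style relative path instead of relying on filesystem enumeration order. The helper name `iter_python_files` is hypothetical and is not taken from codeclone/scanner.py.

```python
from pathlib import Path


def iter_python_files(root: Path) -> list[Path]:
    """Return .py files under root in a stable, lexicographic order.

    Hypothetical helper: Path.rglob yields entries in an order that depends
    on the filesystem, so the result is sorted on the relative POSIX path.
    """
    return sorted(
        root.rglob("*.py"),
        key=lambda p: p.relative_to(root).as_posix(),
    )
```

Sorting on the relative path rather than the absolute one keeps the order the same when the project is checked out in a different directory.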

Contracts

  • JSON report uses deterministic ordering for files/groups/items.
  • TXT report uses deterministic metadata key order and group/item ordering.
  • Baseline hash is canonical and independent of non-payload metadata fields.
  • Cache signature is canonical and independent of JSON whitespace.

Refs:

  • codeclone/report/json_contract.py:build_report_document
  • codeclone/report/serialize.py:render_text_report_document
  • codeclone/baseline.py:_compute_payload_sha256
  • codeclone/cache.py:_sign_data
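
The baseline contract above can be sketched as a hash computed over payload fields only. This is a minimal sketch under stated assumptions: the field names `clone_ids` and `generated_at`, the `NON_PAYLOAD_FIELDS` set, and the function `sketch_payload_sha256` are hypothetical and do not describe the actual layout handled by `_compute_payload_sha256`.

```python
import hashlib
import json

# Assumption: hypothetical metadata field names, chosen only for illustration.
NON_PAYLOAD_FIELDS = {"generated_at", "generator_version"}


def sketch_payload_sha256(baseline: dict) -> str:
    """Hash only the payload fields of a baseline, in canonical JSON form."""
    payload = {k: v for k, v in baseline.items() if k not in NON_PAYLOAD_FIELDS}
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same payload, different metadata and key order: the hashes match.
first = {"clone_ids": ["id1", "id2"], "generated_at": "2024-01-01"}
second = {"generated_at": "2025-06-30", "clone_ids": ["id1", "id2"]}
assert sketch_payload_sha256(first) == sketch_payload_sha256(second)
```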

Invariants (MUST)

  • inventory.file_registry.items is lexicographically sorted.
  • Finding groups/items and derived hotlists are deterministically ordered.
  • Baseline clone lists are sorted and unique.
  • Golden detector test runs only on the canonical Python tag recorded in fixture metadata.

Refs:

  • codeclone/report/json_contract.py:_build_inventory_payload
  • codeclone/baseline.py:_require_sorted_unique_ids
  • tests/test_detector_golden.py::test_detector_output_matches_golden_fixture
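
The "sorted and unique" invariant can be checked in a single pass, because a list is sorted without duplicates exactly when every element is strictly greater than its predecessor. The function below is a hypothetical sketch of such a check, not the body of `_require_sorted_unique_ids`.

```python
def require_sorted_unique(ids: list[str]) -> None:
    """Raise ValueError unless ids is strictly ascending.

    Strictly ascending means lexicographically sorted with no duplicates.
    Hypothetical sketch; the error type and message are assumptions.
    """
    for prev, cur in zip(ids, ids[1:]):
        if prev >= cur:
            raise ValueError(f"IDs are not sorted and unique: {prev!r} before {cur!r}")
```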

Failure modes

| Condition                           | Determinism impact                                    |
|-------------------------------------|-------------------------------------------------------|
| Different Python tag                | Clone IDs may differ; baseline considered incompatible |
| Unsorted/non-canonical baseline IDs | Baseline rejected as invalid                          |
| Cache signature mismatch            | Cache ignored and recomputed                          |
| Different cache provenance state    | meta.cache_* differs by design                        |
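
The "cache ignored and recomputed" behavior might look roughly like the sketch below. It assumes a cache file shaped as `{"sig": ..., "data": ...}` and a plain SHA-256 signature over canonical JSON; both are assumptions for illustration, and the real scheme in `_sign_data` may differ. The names `sign` and `load_cache` are hypothetical.

```python
import hashlib
import json
from typing import Optional


def sign(data) -> str:
    """Sign parsed data via its canonical JSON form (assumed scheme)."""
    canonical = json.dumps(
        data, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def load_cache(text: str) -> Optional[dict]:
    """Return cached data, or None when the cache must be ignored."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict) or not isinstance(doc.get("data"), dict):
        return None
    # The signature is recomputed from the parsed data, so the whitespace
    # and key order of the file on disk do not affect validation.
    if doc.get("sig") != sign(doc["data"]):
        return None
    return doc["data"]


data = {"b": 1, "a": 2}
pretty = json.dumps({"sig": sign(data), "data": data}, indent=4)
assert load_cache(pretty) == data
```

A caller would treat a `None` result as a cache miss and recompute from source.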

Determinism / canonicalization

Primary canonicalization points:

  • json.dumps(..., sort_keys=True, separators=(",", ":"), ensure_ascii=False) for baseline/cache payload hash/signature.
  • tuple-based sort keys for report record arrays.

Refs:

  • codeclone/baseline.py:_compute_payload_sha256
  • codeclone/cache.py:_canonical_json
  • codeclone/report/json_contract.py:_build_integrity_payload
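
Both canonicalization points can be illustrated briefly. The `json.dumps` arguments are the ones listed above; the helper name `canonical_json`, the function `record_sort_key`, and the record fields `file`, `start_line`, and `end_line` are hypothetical stand-ins, not the project's actual names.

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def record_sort_key(record: dict) -> tuple:
    # Tuples compare element by element, so ties on the file path
    # fall through to the line numbers (hypothetical fields).
    return (record["file"], record["start_line"], record["end_line"])


# Key order in the input does not affect the serialized form.
assert canonical_json({"b": 1, "a": [1, 2]}) == '{"a":[1,2],"b":1}'

records = [
    {"file": "b.py", "start_line": 3, "end_line": 9},
    {"file": "a.py", "start_line": 40, "end_line": 52},
    {"file": "a.py", "start_line": 7, "end_line": 12},
]
ordered = sorted(records, key=record_sort_key)
assert [(r["file"], r["start_line"]) for r in ordered] == [
    ("a.py", 7),
    ("a.py", 40),
    ("b.py", 3),
]
```

Comparing integers inside the tuple, rather than a formatted string, avoids orderings such as "40" sorting before "7".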

Locked by tests

  • tests/test_report.py::test_report_json_deterministic_group_order
  • tests/test_report.py::test_report_json_deterministic_with_shuffled_units
  • tests/test_report.py::test_text_report_deterministic_group_order
  • tests/test_baseline.py::test_baseline_hash_canonical_determinism
  • tests/test_cache.py::test_cache_signature_validation_ignores_json_whitespace

Non-guarantees

  • Determinism is not guaranteed across different python_tag values.
  • Byte-identical reports are not guaranteed across different cache provenance states (cache_status, cache_used, cache_schema_version).
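
When two reports need to be compared across different cache provenance states, one option is to drop the `cache_*` keys before comparing. The sketch below assumes these keys sit directly under a top-level `meta` object, as the `meta.cache_*` notation suggests; the function name `strip_cache_provenance` and the `groups` field are hypothetical.

```python
import copy


def strip_cache_provenance(report: dict) -> dict:
    """Return a copy of report without meta.cache_* keys (assumed layout)."""
    cleaned = copy.deepcopy(report)
    meta = cleaned.get("meta", {})
    for key in [k for k in meta if k.startswith("cache_")]:
        del meta[key]
    return cleaned


cold = {"meta": {"python_tag": "cp313", "cache_used": False}, "groups": []}
warm = {"meta": {"python_tag": "cp313", "cache_used": True}, "groups": []}
assert cold != warm
assert strip_cache_provenance(cold) == strip_cache_provenance(warm)
```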