05. Core Pipeline
Purpose
Describe the detection pipeline from file discovery to grouped clones.
Public surface
Pipeline entrypoints:
- Discovery stage: `codeclone/pipeline.py:discover`
- Per-file processing: `codeclone/pipeline.py:process_file`
- Extraction: `codeclone/extractor.py:extract_units_and_stats_from_source`
- Grouping: `codeclone/grouping.py`
Data model
Stages:
- Discover Python files (`iter_py_files`, sorted traversal)
- Load from cache if the `stat` signature matches
- Process changed files:
  - read source
  - AST parse with limits
  - extract units/blocks/segments
- Build groups:
  - function groups by `fingerprint|loc_bucket`
  - block groups by `block_hash`
  - segment groups by `segment_sig`, then `segment_hash|qualname`
- Report-layer post-processing:
  - merge block windows to maximal regions
  - merge/suppress segment report groups
- Structural report findings:
  - duplicated branch families from per-function AST structure facts
  - clone cohort drift families built from existing function groups (no rescan)
Refs:
- `codeclone/pipeline.py`
- `codeclone/extractor.py:extract_units_and_stats_from_source`
- `codeclone/report/blocks.py:prepare_block_report_groups`
- `codeclone/report/segments.py:prepare_segment_report_groups`
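The "Build groups" stage can be pictured with a minimal sketch. It is not the code in `codeclone/grouping.py`: the `Unit` dataclass, its field names, the sample values, and `group_by_key` are hypothetical, and keeping only buckets with two or more members is an assumption. Only the key shape `fingerprint|loc_bucket` comes from the stage list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """Hypothetical extracted function unit (field names are assumptions)."""
    qualname: str
    fingerprint: str
    loc_bucket: str


def function_group_key(unit: Unit) -> str:
    # Key shape taken from the stage list: fingerprint|loc_bucket.
    return f"{unit.fingerprint}|{unit.loc_bucket}"


def group_by_key(units, key_fn):
    """Bucket units by key; keep buckets with 2+ members (assumed clone criterion)."""
    buckets = defaultdict(list)
    for unit in units:
        buckets[key_fn(unit)].append(unit)
    # Iterate keys in sorted order so the resulting mapping is deterministic.
    return {key: buckets[key] for key in sorted(buckets) if len(buckets[key]) > 1}


units = [
    Unit("pkg.a:f", "abc123", "20-49"),
    Unit("pkg.b:g", "abc123", "20-49"),
    Unit("pkg.c:h", "def456", "0-19"),
]
groups = group_by_key(units, function_group_key)
```

Block groups would follow the same pattern with a key function that returns `block_hash`.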
Contracts
- Detection core (`extractor`, `normalize`, `cfg`, `blocks`) computes clone candidates.
- Report-layer transformations do not change function/block grouping keys used for baseline diff.
- Segment groups are report-only and do not participate in baseline diff/gating.
- Structural findings are report-only and do not participate in baseline diff/gating.
- Dead-code liveness references from test paths are excluded at extraction/cache-load boundaries for both local-name references and canonical qualname references.
Refs:
- `codeclone/cli.py:_main_impl` (diff uses only function/block groups)
- `codeclone/baseline.py:Baseline.diff`
- `codeclone/extractor.py:extract_units_and_stats_from_source`
- `codeclone/pipeline.py:_load_cached_metrics`
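To illustrate the contract that only function and block groups feed the baseline diff, here is a hedged sketch. It does not reproduce `Baseline.diff`: the function `new_clone_keys`, the dictionary layout, the sample keys, and the "current minus baseline" set difference are all assumptions made for illustration.

```python
def new_clone_keys(current: dict, baseline: dict) -> dict:
    """Return, per gated kind, group keys present now but absent from the baseline."""
    # Segment groups and structural findings are report-only, so they are never consulted.
    gated_kinds = ("functions", "blocks")
    return {
        kind: sorted(set(current.get(kind, ())) - set(baseline.get(kind, ())))
        for kind in gated_kinds
    }


current = {
    "functions": ["fp1|20-49", "fp2|0-19"],
    "blocks": ["bh1"],
    "segments": ["sig1"],  # present in the report, ignored by the diff
}
baseline = {"functions": ["fp1|20-49"], "blocks": ["bh1"]}
new = new_clone_keys(current, baseline)
```

Because the diff reads only the gated kinds, changes to report-layer segment merging cannot alter the gating result in this model.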
Invariants (MUST)
- `Files found = Files analyzed + Cache hits + Files skipped` (warning if broken).
- In gating mode, unreadable source IO (`source_read_error`) is a contract failure.
- Parser time/resource protections are applied in POSIX mode via `_parse_limits`.
Refs:
- `codeclone/_cli_summary.py:_print_summary`
- `codeclone/cli.py:_main_impl`
- `codeclone/extractor.py:_parse_limits`
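The file-accounting invariant is simple arithmetic and can be checked as in the sketch below. The function name `check_file_accounting`, its parameters, and the warning text are hypothetical; the real check lives in `_print_summary` and may differ.

```python
import warnings


def check_file_accounting(found: int, analyzed: int, cache_hits: int, skipped: int) -> bool:
    """Return True if found == analyzed + cache_hits + skipped; warn otherwise."""
    accounted = analyzed + cache_hits + skipped
    if found != accounted:
        # The invariant is reported as a warning, not raised as an error.
        warnings.warn(
            f"file accounting mismatch: found={found}, accounted={accounted}",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return True


ok = check_file_accounting(found=10, analyzed=6, cache_hits=3, skipped=1)
```

For example, 10 files found with 6 analyzed, 3 cache hits, and 1 skipped satisfies the invariant, while the same counts with 0 skipped leaves one file unaccounted for and triggers the warning.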
Failure modes
| Condition | Behavior |
|---|---|
| File stat/read/encoding error | File skipped; tracked as failed file; source-read subset tracked separately |
| Source read error in gating mode | Contract error exit 2 |
| Parser timeout | ParseError returned through processing failure path |
| Unexpected per-file exception | Captured as ProcessingResult(error_kind="unexpected_error") |
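The table above amounts to "never let one file abort the run". A minimal sketch of that pattern follows. It is an illustration only: this `ProcessingResult` dataclass, its fields other than `error_kind`, the function `process_file_sketch`, and the `"parse_error"` label are assumptions, and parse limits are omitted.

```python
import ast
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessingResult:
    """Hypothetical result record; fields other than error_kind are assumptions."""
    path: str
    tree: Optional[ast.AST] = None
    error_kind: Optional[str] = None
    error: Optional[str] = None


def process_file_sketch(path: str) -> ProcessingResult:
    # Read failures (missing file, permissions, bad encoding) are tracked separately.
    try:
        with open(path, encoding="utf-8") as fh:
            source = fh.read()
    except (OSError, UnicodeDecodeError) as exc:
        return ProcessingResult(path, error_kind="source_read_error", error=str(exc))
    try:
        tree = ast.parse(source, filename=path)
    except SyntaxError as exc:
        # "parse_error" is an assumed label, not taken from the source.
        return ProcessingResult(path, error_kind="parse_error", error=str(exc))
    except Exception as exc:
        # Anything else is captured instead of propagating to the caller.
        return ProcessingResult(path, error_kind="unexpected_error", error=str(exc))
    return ProcessingResult(path, tree=tree)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.py")
    bad = os.path.join(tmp, "bad.py")
    missing = os.path.join(tmp, "missing.py")
    with open(good, "w", encoding="utf-8") as fh:
        fh.write("x = 1\n")
    with open(bad, "w", encoding="utf-8") as fh:
        fh.write("def broken(:\n")
    results = [process_file_sketch(p) for p in (good, bad, missing)]
```

In gating mode, a caller would treat any `source_read_error` result as a contract failure rather than a skipped file.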
Determinism / canonicalization
- File list is sorted.
- Group sorting in reports is deterministic by key and stable item sort.
Refs:
- `codeclone/scanner.py:iter_py_files`
- `codeclone/report/json_contract.py:_build_clone_groups`
- `codeclone/report/json_contract.py:_build_structural_groups`
- `codeclone/report/json_contract.py:_build_integrity_payload`
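A sorted file list can be produced as in the sketch below. This is not the real `iter_py_files`: the name `iter_py_files_sketch`, the use of `pathlib`, and returning root-relative POSIX paths are assumptions chosen to keep the example short.

```python
import tempfile
from pathlib import Path


def iter_py_files_sketch(root):
    """Return root-relative paths of all .py files under root, in sorted order."""
    root_path = Path(root)
    relative = (
        p.relative_to(root_path).as_posix()
        for p in root_path.rglob("*.py")
        if p.is_file()
    )
    # Filesystem enumeration order is not guaranteed, so sort before returning.
    return sorted(relative)


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    # Files are created out of order on purpose.
    for name in ("b.py", "sub/c.py", "a.py", "notes.txt"):
        (base / name).write_text("", encoding="utf-8")
    found = iter_py_files_sketch(tmp)
```

Sorting once at discovery time means every later stage sees files in the same order regardless of how the filesystem enumerates them.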
Locked by tests
- `tests/test_scanner_extra.py::test_iter_py_files_deterministic_sorted_order`
- `tests/test_cli_inprocess.py::test_cli_summary_cache_miss_metrics`
- `tests/test_cli_inprocess.py::test_cli_unreadable_source_fails_in_ci_with_contract_error`
- `tests/test_extractor.py::test_parse_limits_triggers_timeout`
- `tests/test_extractor.py::test_dead_code_marks_symbol_dead_when_referenced_only_by_tests`
- `tests/test_extractor.py::test_extract_collects_referenced_qualnames_for_import_aliases`
- `tests/test_pipeline_metrics.py::test_load_cached_metrics_ignores_referenced_names_from_test_files`
Non-guarantees
- Parallel scheduling order is not guaranteed; only final grouped output determinism is guaranteed.
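The distinction between scheduling order and output order can be shown with a small sketch. It assumes a thread pool and a trivial stand-in `analyze` function; the source does not say which executor the pipeline uses, and `run_parallel` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyze(path: str) -> tuple[str, int]:
    """Stand-in for per-file work (hypothetical): returns the path and its length."""
    return path, len(path)


def run_parallel(paths):
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(analyze, p) for p in paths]
        # Completion order can vary from run to run.
        for future in as_completed(futures):
            results.append(future.result())
    # Canonicalize by sorting so the final output does not depend on scheduling.
    return sorted(results)


paths = ["b.py", "a.py", "pkg/c.py"]
output = run_parallel(paths)
```

Whatever order the workers finish in, `output` is the same list, which is the only property the pipeline promises.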