05. Core Pipeline

Purpose

Describe the detection pipeline from file discovery to grouped clones.

Pipeline entrypoints:

Stages:

1. Discover Python files (iter_py_files, sorted traversal)
2. Load from cache if stat signature matches
3. Process changed files:
   - read source
   - AST parse with limits
   - extract units/blocks/segments
4. Build groups:
   - function groups by fingerprint|loc_bucket
   - block groups by block_hash
   - segment groups by segment_sig then segment_hash|qualname
5. Report-layer post-processing:
   - merge block windows to maximal regions
   - merge/suppress segment report groups
6. Structural report findings:
   - duplicated branch families from per-function AST structure facts
   - clone cohort drift families built from existing function groups (no rescan)
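
The first stages and the function grouping step can be sketched as follows. This is a minimal illustration, not the project's implementation: `iter_py_files_sketch`, `stat_signature`, and `group_functions` are hypothetical names, the stat signature is assumed to be `(mtime_ns, size)`, and the `loc_bucket` values are assumed to be plain strings.

```python
import os
from collections import defaultdict
from pathlib import Path


def iter_py_files_sketch(root):
    """Yield .py files under root in a deterministic, sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls the order os.walk descends in
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield Path(dirpath) / name


def stat_signature(path):
    """Assumed cache key: modification time (ns) and size in bytes."""
    st = Path(path).stat()
    return (st.st_mtime_ns, st.st_size)


def group_functions(units):
    """Group (qualname, fingerprint, loc_bucket) triples by 'fingerprint|loc_bucket'.

    Only keys with at least two members are kept, since a single unit
    is not a clone of anything.
    """
    groups = defaultdict(list)
    for qualname, fingerprint, loc_bucket in units:
        groups[f"{fingerprint}|{loc_bucket}"].append(qualname)
    return {key: sorted(members) for key, members in sorted(groups.items()) if len(members) > 1}
```

A file whose current `stat_signature` equals the cached one would be served from cache; any other file goes through the read, parse, and extract steps before grouping.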

Refs:

Detection core (extractor, normalize, cfg, blocks) computes clone candidates.
Report-layer transformations do not change function/block grouping keys used for baseline diff.
Segment groups are report-only and do not participate in baseline diff/gating.
Structural findings are report-only and do not participate in baseline diff/gating.
Dead-code liveness references from test paths are excluded at extraction/cache-load boundaries for both local-name references and canonical qualname references.
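
The test-path exclusion can be pictured with a small sketch. Everything here is an assumption made for illustration: the `is_test_path_sketch` heuristic, the per-file record shape, and the qualname format are hypothetical and may differ from what the extractor and cache loader actually use.

```python
from pathlib import PurePosixPath


def is_test_path_sketch(path):
    """Hypothetical heuristic: a file is a test file if it lives under a
    'tests' directory or follows common pytest naming."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    name = parts[-1]
    return (
        "tests" in parts[:-1]
        or name.startswith("test_")
        or name.endswith("_test.py")
        or name == "conftest.py"
    )


def live_references_sketch(per_file_refs):
    """Union local-name and qualname references, skipping test files.

    per_file_refs maps a path to {"names": set, "qualnames": set}.
    """
    names, qualnames = set(), set()
    for path, refs in per_file_refs.items():
        if is_test_path_sketch(path):
            continue  # references from tests do not keep a symbol alive
        names |= refs["names"]
        qualnames |= refs["qualnames"]
    return names, qualnames
```

With this filter, a symbol referenced only from test files contributes nothing to either set and would therefore still be reported as dead.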

Refs:

Files found = Files analyzed + Cache hits + Files skipped; a warning is emitted if this accounting identity does not hold.
In gating mode, unreadable source IO (source_read_error) is a contract failure.
Parser time/resource protections are applied in POSIX mode via _parse_limits.
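
A time limit of this kind is commonly built on a POSIX interval timer. The sketch below is an assumption about the general technique, not a copy of `_parse_limits`: `parse_limits_sketch`, `ParseTimeout`, and `parse_with_limits` are hypothetical names, and only the time limit is shown (no resource limits).

```python
import ast
import contextlib
import signal
import threading


class ParseTimeout(Exception):
    """Raised when parsing exceeds the allowed wall-clock time."""


@contextlib.contextmanager
def parse_limits_sketch(seconds):
    """Apply a wall-clock limit using SIGALRM where available.

    On platforms without SIGALRM, or outside the main thread, the body
    runs without protection.
    """
    usable = (
        hasattr(signal, "SIGALRM")
        and threading.current_thread() is threading.main_thread()
    )
    if not usable:
        yield
        return

    def _on_alarm(signum, frame):
        raise ParseTimeout(f"parse exceeded {seconds} seconds")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel the pending timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def parse_with_limits(source, seconds=5.0):
    """Parse source under the time limit and return the AST module."""
    with parse_limits_sketch(seconds):
        return ast.parse(source)
```

Signal handlers can only be installed from the main thread, which is why the sketch falls back to an unprotected parse elsewhere.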

Refs:

| Condition | Behavior |
| --- | --- |
| File stat/read/encoding error | File skipped; tracked as failed file; source-read subset tracked separately |
| Source read error in gating mode | Contract error exit 2 |
| Parser timeout | `ParseError` returned through processing failure path |
| Unexpected per-file exception | Captured as `ProcessingResult(error_kind="unexpected_error")` |
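
The failure handling above can be approximated by a per-file wrapper that never raises. This is a sketch under stated assumptions: `ProcessingResultSketch`, `process_file_sketch`, and `exit_code_sketch` are hypothetical, the `"parse_error"` label is invented for illustration, and the non-gating exit code of 0 is assumed. Only `source_read_error`, `unexpected_error`, and exit code 2 come from the text.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProcessingResultSketch:
    path: str
    error_kind: Optional[str] = None  # None means the file was processed
    detail: str = ""


def process_file_sketch(path):
    """Read and parse one file, converting every failure into a result."""
    try:
        with open(path, encoding="utf-8") as handle:
            source = handle.read()
    except (OSError, UnicodeDecodeError) as exc:
        return ProcessingResultSketch(path, "source_read_error", str(exc))
    try:
        ast.parse(source)
    except SyntaxError as exc:
        # Hypothetical label for parser failures.
        return ProcessingResultSketch(path, "parse_error", str(exc))
    except Exception as exc:  # anything else is captured, not propagated
        return ProcessingResultSketch(path, "unexpected_error", str(exc))
    return ProcessingResultSketch(path)


def exit_code_sketch(results, gating):
    """In gating mode an unreadable source is a contract error (exit 2)."""
    if gating and any(r.error_kind == "source_read_error" for r in results):
        return 2
    return 0  # assumed exit code when no contract error applies
```

Returning a result object instead of raising keeps one bad file from aborting the whole run, while still letting gating mode escalate the specific failures it cares about.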

Refs:

tests/test_scanner_extra.py::test_iter_py_files_deterministic_sorted_order
tests/test_cli_inprocess.py::test_cli_summary_cache_miss_metrics
tests/test_cli_inprocess.py::test_cli_unreadable_source_fails_in_ci_with_contract_error
tests/test_extractor.py::test_parse_limits_triggers_timeout
tests/test_extractor.py::test_dead_code_marks_symbol_dead_when_referenced_only_by_tests
tests/test_extractor.py::test_extract_collects_referenced_qualnames_for_import_aliases
tests/test_pipeline_metrics.py::test_load_cached_metrics_ignores_referenced_names_from_test_files

Parallel scheduling order is not guaranteed; only final grouped output determinism is guaranteed.
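
One common way to get deterministic output from nondeterministic scheduling is to collect results in completion order and sort them by a stable key afterwards. The sketch below assumes a thread pool and sorts by path; `analyze_all_sketch` is a hypothetical helper, and the real pipeline may use a different executor or ordering key.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyze_all_sketch(paths, analyze, max_workers=4):
    """Run analyze(path) in parallel and return (path, result) pairs
    sorted by path, regardless of which task finished first."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyze, path): path for path in paths}
        for future in as_completed(futures):  # completion order is arbitrary
            results.append((futures[future], future.result()))
    results.sort(key=lambda item: item[0])  # restore a deterministic order
    return results
```

For example, `analyze_all_sketch(["b", "a", "c"], str.upper)` returns the pairs ordered by `"a"`, `"b"`, `"c"` on every run, even though the tasks may complete in any order.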