05. Core Pipeline

Purpose

Describe the runtime pipeline from file discovery to grouped clones, metrics, report assembly, and gating.

Discovery: codeclone/core/discovery.py:discover
Per-file processing: codeclone/core/worker.py:process_file
Extraction: codeclone/analysis/units.py:extract_units_and_stats_from_source
Clone grouping: codeclone/findings/clones/grouping.py
Project metrics and suggestions: codeclone/core/pipeline.py
Report/gating integration: codeclone/core/reporting.py:report, codeclone/core/reporting.py:gate

Stages:

1. Bootstrap runtime paths and config.
2. Discover Python files with deterministic traversal.
3. Load usable cache entries by stat signature and compatible analysis profile.
4. Process changed/missed files:
   - read source
   - parse AST with limits
   - extract function, block, and segment units
   - collect referenced names/qualnames and dead-code candidates
5. Build groups:
   - function groups by fingerprint|loc_bucket
   - block groups by block_hash
   - segment groups by segment_sig then segment_hash|qualname
6. Compute project metrics in full mode:
   - complexity, coupling, cohesion
   - dead code
   - dependency graph and cycles
   - health score
   - adoption, API surface, optional coverage join
7. Build canonical report document and deterministic projections.
8. Evaluate clone diff and metric gates.
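
As a minimal sketch of the group-building stage, the snippet below buckets units by a composite key and keeps only buckets with at least two members. The `Unit` dataclass, its fields, and the `group_by_key` and `function_key` helpers are hypothetical names for illustration, not codeclone's actual API. Only the `fingerprint|loc_bucket` key shape comes from the stage list; the two-member threshold and the sort order are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    # Hypothetical unit record; field names are assumptions for this sketch.
    qualname: str
    fingerprint: str
    loc_bucket: str


def group_by_key(units, key_fn):
    """Bucket units by key and keep buckets with at least two members."""
    buckets = defaultdict(list)
    for unit in units:
        buckets[key_fn(unit)].append(unit)
    # Sort keys and members so the result does not depend on input order.
    return {
        key: sorted(members, key=lambda u: u.qualname)
        for key, members in sorted(buckets.items())
        if len(members) >= 2
    }


def function_key(unit):
    # Composite key in the fingerprint|loc_bucket shape.
    return f"{unit.fingerprint}|{unit.loc_bucket}"


units = [
    Unit("pkg.b:g", "abc", "10-19"),
    Unit("pkg.a:f", "abc", "10-19"),
    Unit("pkg.c:h", "abc", "20-29"),  # same fingerprint, different bucket
]
groups = group_by_key(units, function_key)
```

Here `pkg.c:h` shares a fingerprint with the other two units but falls into a different size bucket, so it does not join their group. Block groups could reuse the same helper with a key function that returns the block hash.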

Contracts:

Detection core computes facts; report layer materializes canonical findings from those facts.
Report-layer transformations do not change function/block grouping keys used for baseline diff.
Segment groups are report-only and do not participate in baseline diff/gating.
Structural findings are report-only and do not participate in baseline diff/gating.
golden_fixture_paths is a clone-policy exclusion layer: excluded groups remain visible as suppressed canonical report facts, but do not affect health, gates, or suggestions.
Test-path liveness references are filtered both on fresh extraction and on cache decode.
Default discovery skips generated/dependency directories such as .git, virtualenvs, site-packages, node_modules, migrations, dist, and build; users can still pass explicit scanner excludes for project-specific layouts.
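
The discovery behavior described above (stable traversal order plus default directory skips) can be sketched roughly as follows. `walk_python_files`, `DEFAULT_SKIP_DIRS`, and the `extra_excludes` parameter are hypothetical names, and the skip set is only an approximation built from the directories named in the text; the real defaults and matching rules may differ.

```python
import os
from pathlib import Path

# Illustrative skip set; ".venv" and "venv" stand in for "virtualenvs".
DEFAULT_SKIP_DIRS = frozenset(
    {".git", ".venv", "venv", "site-packages", "node_modules",
     "migrations", "dist", "build"}
)


def walk_python_files(root, extra_excludes=()):
    """Return .py files under root in sorted order, skipping excluded dirs."""
    skip = DEFAULT_SKIP_DIRS | set(extra_excludes)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories.
        dirnames[:] = sorted(d for d in dirnames if d not in skip)
        for name in sorted(filenames):
            if name.endswith(".py"):
                found.append(Path(dirpath) / name)
    # A final sort makes the order independent of filesystem listing order.
    return sorted(found)
```

Pruning `dirnames` in place avoids walking large dependency trees at all, which is cheaper than filtering paths after the fact.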

Invariants:

files_found = files_analyzed + cache_hits + files_skipped; if the counts do not add up, the CLI warns explicitly.
In gating mode, unreadable source IO is a contract failure.
Parser time/resource protections are applied before AST extraction.
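
The file-accounting invariant can be expressed as a small check. This is a sketch only: `FileCounts` and `accounting_warning` are hypothetical names, and the warning text is invented for illustration. The counter names mirror those in the invariant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileCounts:
    # Hypothetical container; counter names mirror the invariant.
    files_found: int
    files_analyzed: int
    cache_hits: int
    files_skipped: int


def accounting_warning(counts):
    """Return a warning message if the counters do not add up, else None."""
    accounted = counts.files_analyzed + counts.cache_hits + counts.files_skipped
    if accounted == counts.files_found:
        return None
    return (
        f"file accounting mismatch: found={counts.files_found}, "
        f"accounted={accounted}"
    )
```

For example, 10 files found with 6 analyzed, 3 cache hits, and 1 skipped yields no warning, while dropping the skipped file from the counts would produce one.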

Failure modes:

Condition	Behavior
File stat/read/encoding error	File skipped; tracked as failed file
Source read error in gating mode	Contract error, exit `2`
Parser timeout	`ParseError` through processing failure path
Unexpected per-file exception	Captured as `unexpected_error` processing result
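
One way to picture these failure paths is a per-file wrapper that turns every outcome into a result record instead of letting exceptions escape. Everything here is an assumption made for the sketch: `FileResult`, `process_one`, `exit_code`, the local `ParseError` stand-in, and all status strings except `unexpected_error`, which is the name used in the table.

```python
from dataclasses import dataclass
from pathlib import Path


class ParseError(Exception):
    """Local stand-in for the parser failure; not the project's real class."""


@dataclass(frozen=True)
class FileResult:
    path: str
    status: str  # "ok", "source_read_error", "parse_error", "unexpected_error"
    detail: str = ""


def process_one(path, analyze):
    """Read one file and run analyze(source), capturing failures as results."""
    try:
        source = Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        # Stat/read/encoding problems: the file is skipped and tracked as failed.
        return FileResult(path, "source_read_error", str(exc))
    try:
        analyze(source)
    except ParseError as exc:
        return FileResult(path, "parse_error", str(exc))
    except Exception as exc:
        # Anything else is recorded rather than crashing the whole run.
        return FileResult(path, "unexpected_error", repr(exc))
    return FileResult(path, "ok")


def exit_code(results, gating):
    """Map results to an exit code; only the gating read-error rule is modeled."""
    if gating and any(r.status == "source_read_error" for r in results):
        return 2
    return 0
```

Returning records keeps one bad file from aborting the run, and lets the caller decide later whether a read error is merely a skipped file or, in gating mode, a contract error.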

Refs:

tests/test_scanner_extra.py::test_iter_py_files_deterministic_sorted_order
tests/test_scanner_extra.py::test_iter_py_files_excludes_node_modules
tests/test_cli_inprocess.py::test_cli_summary_cache_miss_metrics
tests/test_cli_inprocess.py::test_cli_unreadable_source_fails_in_ci_with_contract_error
tests/test_extractor.py::test_parse_limits_triggers_timeout
tests/test_extractor.py::test_dead_code_marks_symbol_dead_when_referenced_only_by_tests
tests/test_pipeline_metrics.py::test_load_cached_metrics_ignores_referenced_names_from_test_files

Parallel worker scheduling order is not guaranteed; only final output determinism is guaranteed.
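
A rough illustration of this distinction: results may be collected in whatever order workers finish, as long as they are put into a stable order before anything is emitted. The sketch uses a thread pool for brevity; `analyze_all` is a hypothetical helper, and the real pipeline's executor type and sort key are not specified here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyze_all(paths, analyze, max_workers=4):
    """Run analyze(path) in parallel and return (path, result) pairs sorted by path."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyze, path): path for path in paths}
        # Completion order varies from run to run.
        for future in as_completed(futures):
            results.append((futures[future], future.result()))
    # Sorting by path removes any dependence on scheduling order.
    return sorted(results, key=lambda item: item[0])
```

Calling `analyze_all(["b.py", "a.py"], len)` returns the pairs ordered by path regardless of which worker finished first.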