05. Core Pipeline
Purpose
Describe the detection pipeline from file discovery to grouped clones.
Public surface
Pipeline entrypoints:
- Discovery stage: `codeclone/pipeline.py:discover`
- Per-file processing: `codeclone/pipeline.py:process_file`
- Extraction: `codeclone/extractor.py:extract_units_and_stats_from_source`
- Grouping: `codeclone/grouping.py`
Data model
Stages:
- Discover Python files (`iter_py_files`, sorted traversal)
- Load from cache if `stat` signature matches
- Process changed files:
  - read source
  - AST parse with limits
  - extract units/blocks/segments
- Build groups:
  - function groups by `fingerprint|loc_bucket`
  - block groups by `block_hash`
  - segment groups by `segment_sig`, then `segment_hash|qualname`
- Report-layer post-processing:
  - merge block windows to maximal regions
  - merge/suppress segment report groups
  - optionally split out clone groups fully contained in configured `golden_fixture_paths`
- Structural report findings:
  - duplicated branch families from per-function AST structure facts
  - clone cohort drift families built from existing function groups (no rescan)
- Metrics computation (full mode only):
  - per-function cyclomatic complexity
  - per-class coupling (CBO) and cohesion (LCOM4)
  - dead-code analysis: declaration-only, qualname-based liveness
  - dependency graph and cycle detection
- Health scoring:
  - seven dimension scores: clones, complexity, coupling, cohesion, dead code, dependencies, coverage
  - weighted blend → composite score (0–100) and grade (A–F)
- Suggestion generation:
  - advisory cards from clone groups, structural findings, metric violations
  - deterministic priority sort, never gates CI
- Current-run coverage join (optional):
  - when `--coverage` is present, join external Cobertura XML to discovered function spans
  - invalid XML becomes `coverage_join.status="invalid"` for that run rather than mutating baseline state
- Design finding extraction:
  - threshold-aware findings for complexity, coupling, cohesion
  - coverage `coverage_hotspot` / `coverage_scope_gap` findings from valid coverage-join rows only
  - thresholds recorded in `meta.analysis_thresholds.design_findings`
- Derived overview and hotlists:
  - overview families, top risks, source breakdown, health snapshot
  - directory hotspots by category (`derived.overview.directory_hotspots`)
  - hotlists: most actionable, highest spread, production/test-fixture hotspots
- Gate evaluation:
  - clone-baseline diff (NEW vs KNOWN)
  - metric threshold gates (`--fail-complexity`, `--fail-coupling`, etc.)
  - metric regression gates (`--fail-on-new-metrics`)
  - coverage hotspot gate (`--fail-on-untested-hotspots`)
  - gate reasons emitted in deterministic order
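The "Build groups" stage can be pictured with a small sketch. This is only an illustration of grouping by a composite `fingerprint|loc_bucket` key, not the code in `codeclone/grouping.py`: the `Unit` class, the bucket format, and the rule that a group needs at least two members are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    # Hypothetical stand-in for an extracted function unit.
    qualname: str
    fingerprint: str
    loc_bucket: str  # bucket label format is an assumption


def group_functions(units):
    """Group units under a 'fingerprint|loc_bucket' key.

    Assumption: only keys shared by two or more units count as clone groups.
    """
    groups = defaultdict(list)
    for unit in units:
        groups[f"{unit.fingerprint}|{unit.loc_bucket}"].append(unit)
    return {key: members for key, members in groups.items() if len(members) > 1}


units = [
    Unit("pkg.a:load", "f1", "10-19"),
    Unit("pkg.b:load", "f1", "10-19"),
    Unit("pkg.c:save", "f2", "10-19"),
]
clone_groups = group_functions(units)
```

Block groups would follow the same pattern with `block_hash` as the key.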
Refs:
- `codeclone/pipeline.py`
- `codeclone/extractor.py:extract_units_and_stats_from_source`
- `codeclone/report/blocks.py:prepare_block_report_groups`
- `codeclone/report/segments.py:prepare_segment_report_groups`
- `codeclone/metrics/health.py:compute_health`
- `codeclone/metrics/coverage_join.py:build_coverage_join`
- `codeclone/report/json_contract.py:_build_design_groups`
- `codeclone/report/suggestions.py:generate_suggestions`
- `codeclone/report/overview.py:build_directory_hotspots`
- `codeclone/pipeline.py:metric_gate_reasons`
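The cache-hit decision in the stage list ("load from cache if `stat` signature matches") might look roughly like the sketch below. The composition of the signature (modification time in nanoseconds plus size) and the helper names `stat_signature` and `needs_processing` are assumptions for illustration; the actual cache format is not described in this chapter.

```python
import os
import tempfile


def stat_signature(path):
    # Assumption: the signature is (mtime in ns, size in bytes).
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def needs_processing(path, cache):
    """Return True when the cached signature is missing or differs."""
    return cache.get(path) != stat_signature(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mod.py")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("x = 1\n")
    cache = {path: stat_signature(path)}
    unchanged = needs_processing(path, cache)

    # Rewriting the file with more content changes its size,
    # so the signature no longer matches the cached one.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("x = 1\ny = 2\n")
    changed = needs_processing(path, cache)
```

Only files for which the check reports a mismatch would go through read, parse, and extraction again.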
Contracts
- Detection core (`extractor`, `normalize`, `cfg`, `blocks`) computes clone candidates.
- Report-layer transformations do not change function/block grouping keys used for baseline diff.
- Segment groups are report-only and do not participate in baseline diff/gating.
- Structural findings are report-only and do not participate in baseline diff/gating.
- `golden_fixture_paths` is a project-level clone exclusion policy, not a fingerprint/baseline rule:
  - it applies only to clone groups fully contained in matching `tests/` / `tests/fixtures/` paths
  - excluded groups do not affect health, clone gates, or suggestions
  - excluded groups remain observable as suppressed canonical report facts
- Dead-code liveness references from test paths are excluded at extraction/cache-load boundaries for both local-name references and canonical qualname references.
Refs:
- `codeclone/cli.py:_main_impl` (diff uses only function/block groups)
- `codeclone/baseline.py:Baseline.diff`
- `codeclone/extractor.py:extract_units_and_stats_from_source`
- `codeclone/pipeline.py:_load_cached_metrics`
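As a rough illustration of the contract that only function and block groups take part in the baseline diff, the NEW vs KNOWN split can be thought of as a set comparison over group keys. The function `diff_against_baseline` and the flat key lists are hypothetical simplifications; they are not the signature of `Baseline.diff`.

```python
def diff_against_baseline(current_keys, baseline_keys):
    """Split current group keys into NEW and KNOWN relative to a baseline."""
    current = set(current_keys)
    baseline = set(baseline_keys)
    return {
        "new": sorted(current - baseline),
        "known": sorted(current & baseline),
    }


# Only function and block group keys feed the diff; segment groups and
# structural findings are report-only and are left out on purpose.
function_keys = ["f1|10-19", "f9|20-29"]
block_keys = ["b7"]
baseline_keys = ["f1|10-19", "b7"]

result = diff_against_baseline(function_keys + block_keys, baseline_keys)
```

Because report-layer transformations never change these keys, the same source tree yields the same diff regardless of how groups are merged or suppressed for display.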
Invariants (MUST)
- `Files found = Files analyzed + Cache hits + Files skipped`; a warning is emitted if this equality is broken.
- In gating mode, unreadable source IO (`source_read_error`) is a contract failure.
- Parser time/resource protections are applied in POSIX mode via `_parse_limits`.
Refs:
- `codeclone/_cli_summary.py:_print_summary`
- `codeclone/cli.py:_main_impl`
- `codeclone/extractor.py:_parse_limits`
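The file-accounting invariant can be checked with a few lines. The sketch below assumes a hypothetical helper `check_file_accounting` that warns rather than raises, mirroring the "warning if broken" wording; it is not the code in `_print_summary`.

```python
import warnings


def check_file_accounting(found, analyzed, cache_hits, skipped):
    """Warn (do not raise) when the file accounting does not add up."""
    accounted = analyzed + cache_hits + skipped
    if found != accounted:
        warnings.warn(
            f"file accounting mismatch: found={found}, accounted={accounted}",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    return True


# 6 analyzed + 3 cache hits + 1 skipped accounts for all 10 files.
ok = check_file_accounting(found=10, analyzed=6, cache_hits=3, skipped=1)

# One file is unaccounted for, so a warning is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    broken = check_file_accounting(found=10, analyzed=6, cache_hits=3, skipped=0)
```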
Failure modes
| Condition | Behavior |
|---|---|
| File stat/read/encoding error | File skipped; tracked as failed file; source-read subset tracked separately |
| Source read error in gating mode | Contract error (exit code 2) |
| Parser timeout | ParseError returned through processing failure path |
| Unexpected per-file exception | Captured as ProcessingResult(error_kind="unexpected_error") |
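The table above suggests that per-file failures are captured as data rather than propagated. A minimal sketch of that pattern follows; `FileResult` and `process_one` are hypothetical stand-ins (the real `ProcessingResult` likely carries more fields), and only the `source_read_error` and `unexpected_error` kinds named in this chapter are modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileResult:
    # Hypothetical stand-in for ProcessingResult.
    path: str
    error_kind: Optional[str] = None
    error: Optional[str] = None


def process_one(path, read_source, analyze):
    """Run one file through read + analyze, turning failures into results."""
    try:
        source = read_source(path)
    except (OSError, UnicodeDecodeError) as exc:
        # Read/encoding problems: the file is skipped and tracked.
        return FileResult(path, "source_read_error", str(exc))
    try:
        analyze(source)
    except Exception as exc:  # last-resort guard for anything unexpected
        return FileResult(path, "unexpected_error", str(exc))
    return FileResult(path)


def failing_read(path):
    raise OSError("permission denied")


def broken_analyze(source):
    raise ValueError("boom")


r1 = process_one("a.py", failing_read, lambda source: None)
r2 = process_one("b.py", lambda path: "x = 1\n", broken_analyze)
r3 = process_one("c.py", lambda path: "x = 1\n", lambda source: None)
```

Returning a result object for every file keeps one bad input from aborting the run and leaves the caller free to decide, for example in gating mode, whether a `source_read_error` should become a contract failure.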
Determinism / canonicalization
- File list is sorted.
- Group sorting in reports is deterministic: groups are ordered by key, and items within each group use a stable sort.
Refs:
- `codeclone/scanner.py:iter_py_files`
- `codeclone/report/json_contract.py:_build_clone_groups`
- `codeclone/report/json_contract.py:_build_structural_groups`
- `codeclone/report/json_contract.py:_build_integrity_payload`
Locked by tests
- `tests/test_scanner_extra.py::test_iter_py_files_deterministic_sorted_order`
- `tests/test_cli_inprocess.py::test_cli_summary_cache_miss_metrics`
- `tests/test_cli_inprocess.py::test_cli_unreadable_source_fails_in_ci_with_contract_error`
- `tests/test_extractor.py::test_parse_limits_triggers_timeout`
- `tests/test_extractor.py::test_dead_code_marks_symbol_dead_when_referenced_only_by_tests`
- `tests/test_extractor.py::test_extract_collects_referenced_qualnames_for_import_aliases`
- `tests/test_pipeline_metrics.py::test_load_cached_metrics_ignores_referenced_names_from_test_files`
Non-guarantees
- Parallel scheduling order is not guaranteed; only final grouped output determinism is guaranteed.
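One way to reconcile unordered parallel work with deterministic output is to canonicalize after collection. The sketch below is an assumption-laden illustration, not the pipeline's scheduler: `analyze` is a hypothetical per-file function, and a thread pool is used only to make completion order variable.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyze(path):
    # Hypothetical per-file work; returns (group_key, path).
    return (f"len{len(path)}", path)


def run(paths):
    results = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(analyze, p) for p in sorted(paths)]
        for fut in as_completed(futures):  # completion order may vary
            results.append(fut.result())

    groups = {}
    for key, path in results:
        groups.setdefault(key, []).append(path)

    # Canonicalize: sort group keys and sort items within each group,
    # so the output does not depend on which worker finished first.
    return [(key, sorted(groups[key])) for key in sorted(groups)]


first = run(["b.py", "a.py", "long_name.py"])
second = run(["long_name.py", "a.py", "b.py"])
```

Both calls produce the same grouped output even though input order and worker completion order differ.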