SARIF for IDEs and Code Scanning
Purpose
Explain how CodeClone projects canonical findings into SARIF and what IDEs or code-scanning tools can rely on.
SARIF is a machine-readable projection layer. The canonical source of report truth remains the JSON report document.
Source files
- codeclone/report/sarif.py
- codeclone/report/json_contract.py
- codeclone/report/findings.py
Design model
CodeClone builds SARIF from the already materialized canonical report document. It does not recompute analysis in the SARIF layer.
That means:
- finding identities come from canonical finding IDs
- severity/confidence/category data comes from canonical report payloads
- SARIF ordering remains deterministic
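The projection idea above can be sketched as follows. This is a minimal illustration, not CodeClone's implementation: the `findings` key, the per-finding field names, and the severity-to-level mapping are all assumptions made for the example.

```python
from typing import Any

# Assumed mapping from canonical severity to SARIF level (illustrative only).
_LEVELS = {"high": "error", "medium": "warning"}


def project_results(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Project already-computed findings into SARIF results without re-analysis."""
    results = []
    for finding in report["findings"]:  # hypothetical key in the report document
        results.append(
            {
                "ruleId": finding["rule_id"],
                "level": _LEVELS.get(finding["severity"], "note"),
                "message": {"text": finding["message"]},
                # Identity is copied from the canonical finding, never regenerated.
                "properties": {"findingId": finding["id"]},
            }
        )
    # Sort on canonical identity so output order does not depend on input order.
    return sorted(results, key=lambda r: (r["ruleId"], r["properties"]["findingId"]))
```

Because the function only reads the report and sorts on canonical identity, feeding it the same findings in a different order yields the same list.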
Path model
To improve IDE and code-scanning integration, SARIF uses repo-relative paths
anchored through %SRCROOT%.
Current behavior:
- run.originalUriBaseIds["%SRCROOT%"] points at the scan root when an absolute scan root is known
- run.artifacts[*] enumerates referenced files
- artifactLocation.uri uses repository-relative paths
- artifactLocation.index aligns locations with artifacts for stable linking
- run.invocations[*].workingDirectory mirrors the scan root URI when available
- run.invocations[*].startTimeUtc is emitted when analysis start time is available in canonical runtime meta
- run.automationDetails.id is unique per run so code-scanning systems can correlate uploads reliably
This helps consumers resolve results back to workspace files consistently.
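A rough sketch of how these path fields fit together is shown below. The helper names `build_run_paths` and `artifact_location` are hypothetical, and the example assumes a POSIX-style absolute scan root; it is not taken from codeclone/report/sarif.py.

```python
from typing import Any
from urllib.parse import quote


def build_run_paths(scan_root: str, files: list[str]) -> dict[str, Any]:
    """Build the path-related parts of a SARIF run (hypothetical helper)."""
    # Base URIs end with a slash so relative URIs resolve underneath them.
    root_uri = "file://" + quote(scan_root.rstrip("/")) + "/"
    ordered = sorted(set(files))  # deterministic, de-duplicated artifact order
    return {
        "originalUriBaseIds": {"%SRCROOT%": {"uri": root_uri}},
        "artifacts": [
            {"location": {"uri": uri, "uriBaseId": "%SRCROOT%"}} for uri in ordered
        ],
        "invocations": [
            {"executionSuccessful": True, "workingDirectory": {"uri": root_uri}}
        ],
    }


def artifact_location(run: dict[str, Any], uri: str) -> dict[str, Any]:
    """Return an artifactLocation whose index points at the matching artifact."""
    uris = [a["location"]["uri"] for a in run["artifacts"]]
    return {"uri": uri, "uriBaseId": "%SRCROOT%", "index": uris.index(uri)}
```

A consumer can then resolve any location either by joining the repo-relative uri onto the %SRCROOT% base or by following index into run.artifacts; both routes lead to the same file.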
Result model
Current SARIF output includes:
- tool.driver.rules[*] with stable rule IDs and help links
- results[*] for clone groups, dead code, design findings, and structural findings
- locations[*] with primary file/line mapping
- locations[*].message and relatedLocations[*].message with human-readable role labels such as Representative occurrence
- relatedLocations[*] when the result has multiple relevant locations
- partialFingerprints.primaryLocationLineHash for stable per-location identity without encoding line numbers into the hash digest
- result properties with stable identity/context fields such as primary path, qualname, and region
- explicit kind: "fail" on results
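To make the fingerprint point concrete, here is one way a line hash could avoid depending on line numbers. The hash inputs (rule ID, path, whitespace-normalized line text), the truncation length, and the rule ID `EXAMPLE001` are assumptions for illustration; the actual digest recipe used by CodeClone is not described here.

```python
import hashlib
from typing import Any


def line_hash(rule_id: str, rel_path: str, line_text: str) -> str:
    """Hash a location by what the line says, not by where it sits (assumed recipe)."""
    normalized = " ".join(line_text.split())  # collapse whitespace differences
    payload = "\0".join((rule_id, rel_path, normalized))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def make_result(
    rule_id: str, rel_path: str, line_no: int, line_text: str, message: str
) -> dict[str, Any]:
    """Assemble a single SARIF result with one primary location."""
    return {
        "ruleId": rule_id,
        "kind": "fail",
        "message": {"text": message},
        "locations": [
            {
                "physicalLocation": {
                    "artifactLocation": {"uri": rel_path, "uriBaseId": "%SRCROOT%"},
                    "region": {"startLine": line_no},
                },
                "message": {"text": "Representative occurrence"},
            }
        ],
        # The line number appears in region only; it is not part of the digest.
        "partialFingerprints": {
            "primaryLocationLineHash": line_hash(rule_id, rel_path, line_text)
        },
    }
```

With this recipe, moving the same line from line 10 to line 42 changes region.startLine but leaves the fingerprint untouched, which is what lets a consumer treat it as the same finding.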
For clone results, CodeClone also carries novelty-aware metadata when known:
- baselineState
This improves usefulness in IDE/code-scanning flows that distinguish new vs known findings.
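On the consumer side, filtering for new findings might look like the sketch below. It assumes baselineState is emitted in the standard SARIF position on each result and uses the standard SARIF value "new"; the function name is hypothetical.

```python
from typing import Any


def new_results(sarif: dict[str, Any]) -> list[dict[str, Any]]:
    """Return results explicitly marked as new relative to the baseline."""
    found = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # baselineState is optional; when absent, novelty is simply unknown.
            if result.get("baselineState") == "new":
                found.append(result)
    return found
```

Results without baselineState are deliberately left out rather than guessed at, which matches the "when known" wording above.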
Coverage join can materialize coverage / coverage_hotspot and
coverage_scope_gap design findings when the canonical report already
contains valid metrics.families.coverage_join facts. SARIF projects those
findings like other design findings; it does not parse Cobertura XML or create
coverage-specific analysis truth.
Rule metadata
Rule records are intentionally richer than a minimal SARIF export.
They include:
- stable rule IDs
- stable rule names derived from ruleId
- display name
- help text / markdown
- tags
- docs-facing help URI
The goal is not only schema compliance, but a better consumer experience in IDEs and code-scanning platforms.
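A rule record along these lines could be assembled as in the sketch below. The field names are standard SARIF reportingDescriptor properties, but the name derivation (PascalCase from the rule ID), the placement of the display name in shortDescription, and the example rule ID and URL are assumptions, not CodeClone's actual scheme.

```python
import re
from typing import Any


def name_from_rule_id(rule_id: str) -> str:
    """Derive a stable PascalCase name from a rule ID (hypothetical derivation)."""
    parts = re.split(r"[^0-9A-Za-z]+", rule_id)
    return "".join(p[:1].upper() + p[1:] for p in parts if p)


def make_rule(
    rule_id: str,
    display_name: str,
    help_text: str,
    help_markdown: str,
    help_uri: str,
    tags: list[str],
) -> dict[str, Any]:
    """Build a rule record richer than the SARIF minimum (just an id)."""
    return {
        "id": rule_id,
        "name": name_from_rule_id(rule_id),
        # Assumption: the display name is surfaced as the short description.
        "shortDescription": {"text": display_name},
        "help": {"text": help_text, "markdown": help_markdown},
        "helpUri": help_uri,
        "properties": {"tags": sorted(set(tags))},  # sorted for stable output
    }
```

Deriving the name purely from the ID means the name cannot drift independently of the ID between releases.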
What SARIF is good for here
SARIF is useful as:
- an IDE-facing findings stream
- a code-scanning upload format
- another deterministic machine-readable projection over canonical report data
It is not the source of truth for:
- report integrity digest
- gating semantics
- baseline compatibility
Those remain owned by the canonical report and baseline contracts.
Limitations
- Consumer UX depends on the IDE/platform; not every SARIF field is shown by every tool.
- HTML-only presentation details are not carried into SARIF.
- SARIF wording may evolve as long as IDs, semantics, and deterministic structure remain stable.
Validation and tests
Relevant tests:
- tests/test_report.py
- tests/test_report_contract_coverage.py
- tests/test_report_branch_invariants.py
Contract-adjacent coverage includes:
- reuse of canonical report document
- stable SARIF branch invariants
- deterministic artifacts/rules/results ordering
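For readers who want to spot-check a generated SARIF file themselves, a small structural checker in the spirit of these invariants might look like the following. It is an illustrative sketch and does not reproduce the listed test modules; the function name and the chosen checks are assumptions.

```python
from typing import Any


def check_invariants(run: dict[str, Any]) -> list[str]:
    """Return a list of structural problems found in one SARIF run."""
    problems = []
    rule_ids = [r["id"] for r in run["tool"]["driver"].get("rules", [])]
    if len(rule_ids) != len(set(rule_ids)):
        problems.append("duplicate rule ids")
    artifacts = run.get("artifacts", [])
    for i, result in enumerate(run.get("results", [])):
        if result.get("ruleId") not in rule_ids:
            problems.append(f"results[{i}]: unknown ruleId")
        locations = result.get("locations", []) + result.get("relatedLocations", [])
        for loc in locations:
            art = loc.get("physicalLocation", {}).get("artifactLocation", {})
            idx = art.get("index")
            if idx is None:
                continue  # index is optional; nothing to cross-check
            in_range = 0 <= idx < len(artifacts)
            if not in_range or artifacts[idx]["location"]["uri"] != art.get("uri"):
                problems.append(f"results[{i}]: artifact index mismatch")
    return problems
```

An empty return value means every result resolves to a declared rule and every indexed location points at the artifact with the same uri.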