SARIF for IDEs and Code Scanning
Purpose
Explain how CodeClone projects canonical findings into SARIF and what IDEs or code-scanning tools can rely on.
SARIF is a machine-readable projection layer. The canonical source of report truth remains the JSON report document.
Source files
- codeclone/report/sarif.py
- codeclone/report/json_contract.py
- codeclone/report/findings.py
Design model
CodeClone builds SARIF from the already materialized canonical report document. It does not recompute analysis in the SARIF layer.
That means:
- finding identities come from canonical finding IDs
- severity/confidence/category data comes from canonical report payloads
- SARIF ordering remains deterministic
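The projection idea above can be sketched as follows. This is a minimal illustration, not CodeClone's implementation: the report shape (a findings list with id, severity, category, and message keys), the severity mapping, and the function name project_results are all hypothetical.

```python
from typing import Any

# Hypothetical severity-to-level mapping; the real mapping may differ.
# The target values are the standard SARIF result levels.
_LEVELS = {"critical": "error", "warning": "warning", "info": "note"}


def project_results(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Project canonical findings into SARIF-style results without re-analysis."""
    results = []
    # Sorting by the canonical finding ID keeps the output order deterministic.
    for finding in sorted(report["findings"], key=lambda f: f["id"]):
        results.append(
            {
                "ruleId": finding["category"],
                "level": _LEVELS.get(finding["severity"], "warning"),
                "message": {"text": finding["message"]},
                # Identity is copied from the canonical report, never recomputed.
                "properties": {"findingId": finding["id"]},
            }
        )
    return results
```

Because the function only reads the report document, running it twice on the same report yields the same results in the same order.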
Path model
To improve IDE and code-scanning integration, SARIF uses repo-relative paths
anchored through %SRCROOT%.
Current behavior:
- run.originalUriBaseIds["%SRCROOT%"] points at the scan root when an absolute scan root is known
- run.artifacts[*] enumerates referenced files
- artifactLocation.uri uses repository-relative paths
- artifactLocation.index aligns locations with artifacts for stable linking
- run.invocations[*].workingDirectory mirrors the scan root URI when available
- run.columnKind is fixed to utf16CodeUnits
This helps consumers resolve results back to workspace files consistently.
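A rough sketch of how these run-level path fields fit together, assuming the caller already has repo-relative POSIX paths and, optionally, a file:// URI for the scan root. The helper names build_run_paths and artifact_location are hypothetical and do not come from codeclone/report/sarif.py.

```python
from typing import Any, Optional


def build_run_paths(
    files: list[str], scan_root_uri: Optional[str] = None
) -> tuple[dict[str, Any], dict[str, int]]:
    """Build run-level path fields and a uri -> artifact index lookup."""
    # Sort and de-duplicate so artifact indexes are stable across runs.
    uris = sorted(set(files))
    index_of = {uri: i for i, uri in enumerate(uris)}
    run: dict[str, Any] = {
        "columnKind": "utf16CodeUnits",
        "artifacts": [
            {"location": {"uri": uri, "uriBaseId": "%SRCROOT%"}} for uri in uris
        ],
    }
    # The base URI is only emitted when the scan root is known.
    # A base URI should end with "/" so relative paths resolve beneath it.
    if scan_root_uri is not None:
        run["originalUriBaseIds"] = {"%SRCROOT%": {"uri": scan_root_uri}}
        run["invocations"] = [
            {"executionSuccessful": True, "workingDirectory": {"uri": scan_root_uri}}
        ]
    return run, index_of


def artifact_location(uri: str, index_of: dict[str, int]) -> dict[str, Any]:
    """Build an artifactLocation whose index points into run.artifacts."""
    return {"uri": uri, "uriBaseId": "%SRCROOT%", "index": index_of[uri]}
```

A consumer can then resolve a result either by joining artifactLocation.uri onto the %SRCROOT% base or by following artifactLocation.index into run.artifacts.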
Result model
Current SARIF output includes:
- tool.driver.rules[*] with stable rule IDs and help links
- results[*] for clone groups, dead code, design findings, and structural findings
- locations[*] with primary file/line mapping
- locations[*].message and relatedLocations[*].message with human-readable role labels such as Representative occurrence
- relatedLocations[*] when the result has multiple relevant locations
- partialFingerprints.primaryLocationLineHash for stable per-location identity
For clone results, CodeClone also carries novelty-aware metadata when known:
- baselineState
This helps IDE and code-scanning flows that distinguish new findings from known ones.
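On the consumer side, filtering on this field might look like the sketch below. It assumes baselineState appears as the standard SARIF result-level property and uses the standard value "new"; whether CodeClone emits exactly that placement and those values is an assumption here, and new_results is a hypothetical helper.

```python
from typing import Any


def new_results(sarif: dict[str, Any]) -> list[dict[str, Any]]:
    """Return results explicitly marked as new relative to the baseline."""
    found = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Results without baselineState are skipped: novelty is unknown.
            if result.get("baselineState") == "new":
                found.append(result)
    return found
```

Results that carry no baselineState are left out rather than treated as new, which matches the "when known" wording above.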
Rule metadata
Rule records are intentionally richer than a minimal SARIF export.
They include:
- stable rule IDs
- display name
- help text / markdown
- tags
- docs-facing help URI
The goal is not only schema compliance, but a better consumer experience in IDEs and code-scanning platforms.
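As an illustration, a rule record with the fields listed above could be assembled like this. The property names follow the SARIF 2.1.0 reportingDescriptor object; the helper make_rule, the rule ID, and the URL in the usage note are placeholders, not CodeClone's actual values.

```python
from typing import Any, Iterable


def make_rule(
    rule_id: str,
    name: str,
    help_text: str,
    help_markdown: str,
    help_uri: str,
    tags: Iterable[str],
) -> dict[str, Any]:
    """Build a SARIF reportingDescriptor with the richer optional fields."""
    return {
        "id": rule_id,
        "name": name,
        "shortDescription": {"text": name},
        # Plain text and markdown are both supplied; consumers pick one.
        "help": {"text": help_text, "markdown": help_markdown},
        "helpUri": help_uri,
        # Sorted tags keep the serialized rule deterministic.
        "properties": {"tags": sorted(tags)},
    }
```

For example, make_rule("EXAMPLE-001", "Example rule", "Plain help.", "**Markdown** help.", "https://example.invalid/rules/example-001", ["maintainability", "clone"]) yields a record whose tags are ordered clone, maintainability.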
What SARIF is good for here
SARIF is useful as:
- an IDE-facing findings stream
- a code-scanning upload format
- another deterministic machine-readable projection over canonical report data
It is not the source of truth for:
- report integrity digest
- gating semantics
- baseline compatibility
Those remain owned by the canonical report and baseline contracts.
Limitations
- Consumer UX depends on the IDE/platform; not every SARIF field is shown by every tool.
- HTML-only presentation details are not carried into SARIF.
- SARIF wording may evolve as long as IDs, semantics, and deterministic structure remain stable.
Validation and tests
Relevant tests:
- tests/test_report.py
- tests/test_report_contract_coverage.py
- tests/test_report_branch_invariants.py
Contract-adjacent coverage includes:
- reuse of canonical report document
- stable SARIF branch invariants
- deterministic artifacts/rules/results ordering
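To give a feel for what a structural invariant check over a SARIF run can look like, here is a small sketch. It is not taken from the test files listed above; check_run_invariants is a hypothetical helper that verifies two properties described on this page: every result references a declared rule, and every artifactLocation.index points at the artifact with the same URI.

```python
from typing import Any


def check_run_invariants(run: dict[str, Any]) -> list[str]:
    """Return a list of invariant violations found in a SARIF run."""
    problems = []
    artifacts = run.get("artifacts", [])
    rule_ids = {rule["id"] for rule in run["tool"]["driver"].get("rules", [])}
    for n, result in enumerate(run.get("results", [])):
        # Every result must reference a rule declared in tool.driver.rules.
        if result.get("ruleId") not in rule_ids:
            problems.append(f"result {n}: unknown ruleId {result.get('ruleId')!r}")
        for loc in result.get("locations", []):
            art = loc["physicalLocation"]["artifactLocation"]
            idx = art.get("index")
            # The index must land inside run.artifacts and match the same URI.
            if idx is None or not 0 <= idx < len(artifacts):
                problems.append(f"result {n}: artifact index {idx!r} out of range")
            elif artifacts[idx]["location"]["uri"] != art["uri"]:
                problems.append(f"result {n}: index {idx} does not match {art['uri']}")
    return problems
```

An empty list means the run passed both checks; a test can assert exactly that on the generated SARIF document.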