SARIF for IDEs and Code Scanning

Purpose

Explain how CodeClone projects canonical findings into SARIF and what IDEs or code-scanning tools can rely on.

SARIF is a machine-readable projection layer. The canonical source of report truth remains the JSON report document.

Source files

codeclone/report/sarif.py
codeclone/report/json_contract.py
codeclone/report/findings.py

Design model

CodeClone builds SARIF from the already-materialized canonical report document. The SARIF layer does not recompute any analysis.

That means:

finding identities come from canonical finding IDs
severity/confidence/category data comes from canonical report payloads
SARIF ordering remains deterministic
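
The projection described above can be sketched as a pure function over the canonical document. This is a minimal illustration, not the code in codeclone/report/sarif.py: the report shape (`findings`, `id`, `rule`, `severity`, `message`), the severity-to-level table, and the `findingId` property name are all assumptions made for the example.

```python
from typing import Any

# Hypothetical mapping from canonical severities to SARIF result levels.
LEVEL_BY_SEVERITY = {"critical": "error", "warning": "warning", "info": "note"}


def project_results(report: dict[str, Any]) -> list[dict[str, Any]]:
    """Project canonical findings into SARIF-style results without re-analysis."""
    # Sort by canonical finding ID so output order does not depend on input order.
    findings = sorted(report["findings"], key=lambda f: f["id"])
    return [
        {
            "ruleId": f["rule"],
            "level": LEVEL_BY_SEVERITY.get(f["severity"], "warning"),
            "message": {"text": f["message"]},
            # Identity is copied from the canonical finding, never recomputed here.
            "properties": {"findingId": f["id"]},
        }
        for f in findings
    ]
```

Because the function only reads the report and sorts by finding ID, two reports with the same findings in a different order produce identical results.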

Path model

To improve IDE and code-scanning integration, SARIF output uses repository-relative paths anchored through the %SRCROOT% URI base ID.

Current behavior:

run.originalUriBaseIds["%SRCROOT%"] points at the scan root when an absolute scan root is known
run.artifacts[*] enumerates referenced files
artifactLocation.uri uses repository-relative paths
artifactLocation.index aligns locations with artifacts for stable linking
run.invocations[*].workingDirectory mirrors the scan root URI when available
run.columnKind is fixed to utf16CodeUnits

This helps consumers resolve results back to workspace files consistently.
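
As a rough sketch of how these fields fit together, the helper below assembles the path-related parts of a run. The function names are hypothetical, and it assumes the scan root is a POSIX-style absolute path; only the SARIF property names come from the list above.

```python
from typing import Any, Optional
from urllib.parse import quote


def build_run_paths(scan_root: Optional[str], rel_paths: list[str]) -> dict[str, Any]:
    """Build the path-related parts of a SARIF run (illustrative helper)."""
    # Sort and de-duplicate so artifact indexes are stable across runs.
    ordered = sorted(set(rel_paths))
    run: dict[str, Any] = {
        "columnKind": "utf16CodeUnits",
        "artifacts": [
            {"location": {"uri": p, "uriBaseId": "%SRCROOT%"}} for p in ordered
        ],
    }
    if scan_root is not None:
        # Assumption: POSIX absolute path; the base URI ends with a slash.
        root_uri = "file://" + quote(scan_root.rstrip("/")) + "/"
        run["originalUriBaseIds"] = {"%SRCROOT%": {"uri": root_uri}}
        run["invocations"] = [
            {"executionSuccessful": True, "workingDirectory": {"uri": root_uri}}
        ]
    return run


def artifact_location(run: dict[str, Any], rel_path: str) -> dict[str, Any]:
    """Return an artifactLocation whose index points into run['artifacts']."""
    uris = [a["location"]["uri"] for a in run["artifacts"]]
    return {"uri": rel_path, "uriBaseId": "%SRCROOT%", "index": uris.index(rel_path)}
```

When no absolute scan root is known, the sketch omits originalUriBaseIds and invocations and leaves the relative URIs for the consumer to resolve.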

Result model

Current SARIF output includes:

tool.driver.rules[*] with stable rule IDs and help links
results[*] for clone groups, dead code, design findings, and structural findings
locations[*] with primary file/line mapping
locations[*].message and relatedLocations[*].message with human-readable role labels such as Representative occurrence
relatedLocations[*] when the result has multiple relevant locations
partialFingerprints.primaryLocationLineHash for stable per-location identity

For clone results, CodeClone also carries novelty-aware metadata when known:

baselineState

This lets IDE and code-scanning flows distinguish new findings from known ones.
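
A clone result with the fields listed above might be assembled roughly as follows. The helper names, the `Related occurrence` label, the message text, and the fingerprint recipe (a truncated SHA-256 over finding ID and primary location) are assumptions for illustration; the source does not describe how CodeClone computes primaryLocationLineHash.

```python
import hashlib
from typing import Any, Optional


def make_location(occ: dict[str, Any], role: str) -> dict[str, Any]:
    """Build a SARIF location with a human-readable role label."""
    return {
        "physicalLocation": {
            "artifactLocation": {
                "uri": occ["uri"],
                "uriBaseId": "%SRCROOT%",
                "index": occ["index"],
            },
            "region": {"startLine": occ["line"]},
        },
        "message": {"text": role},
    }


def make_clone_result(
    rule_id: str,
    finding_id: str,
    occurrences: list[dict[str, Any]],
    baseline_state: Optional[str] = None,
) -> dict[str, Any]:
    """Build one result: primary location first, the rest as relatedLocations."""
    primary, *related = occurrences
    # Hypothetical fingerprint recipe, stable for a given finding and location.
    seed = f"{finding_id}:{primary['uri']}:{primary['line']}"
    digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:16]
    result: dict[str, Any] = {
        "ruleId": rule_id,
        "message": {"text": f"Clone group with {len(occurrences)} occurrence(s)"},
        "locations": [make_location(primary, "Representative occurrence")],
        "partialFingerprints": {"primaryLocationLineHash": digest},
    }
    if related:
        result["relatedLocations"] = [
            {"id": i, **make_location(occ, "Related occurrence")}
            for i, occ in enumerate(related, start=1)
        ]
    if baseline_state is not None:
        # SARIF 2.1.0 allows "new", "unchanged", "updated", or "absent".
        result["baselineState"] = baseline_state
    return result
```

Omitting baselineState when novelty is unknown, rather than guessing a value, matches the "when known" wording above.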

Rule metadata

Rule records are intentionally richer than a minimal SARIF export.

They include:

stable rule IDs
display name
help text / markdown
tags
docs-facing help URI

The goal is not only schema compliance, but a better consumer experience in IDEs and code-scanning platforms.
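
For orientation, a rule record carrying these fields could look like the sketch below. The property names follow the SARIF 2.1.0 reportingDescriptor object; the `make_rule` helper and the example values in the usage note are hypothetical and do not reflect CodeClone's actual rule catalog.

```python
from typing import Any


def make_rule(
    rule_id: str,
    name: str,
    help_text: str,
    help_markdown: str,
    tags: list[str],
    help_uri: str,
) -> dict[str, Any]:
    """Build a SARIF reportingDescriptor with richer-than-minimal metadata."""
    return {
        "id": rule_id,
        "name": name,
        "shortDescription": {"text": name},
        # Consumers that render markdown use it; others fall back to text.
        "help": {"text": help_text, "markdown": help_markdown},
        "helpUri": help_uri,
        # Sorted so the serialized rule is byte-stable across runs.
        "properties": {"tags": sorted(tags)},
    }
```

A call such as `make_rule("example-rule", "ExampleRule", "Plain help.", "**Markdown** help.", ["maintainability", "clones"], "https://example.invalid/rules/example-rule")` yields a record that a consumer can display even if it ignores everything except `id`.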

What SARIF is good for here

SARIF is useful as:

an IDE-facing findings stream
a code-scanning upload format
another deterministic machine-readable projection over canonical report data

It is not the source of truth for:

report integrity digest
gating semantics
baseline compatibility

Those remain owned by the canonical report and baseline contracts.

Limitations

Consumer UX depends on the IDE/platform; not every SARIF field is shown by every tool.
HTML-only presentation details are not carried into SARIF.
SARIF wording may evolve as long as IDs, semantics, and deterministic structure remain stable.

Validation and tests

Relevant tests:

tests/test_report.py
tests/test_report_contract_coverage.py
tests/test_report_branch_invariants.py

Contract-adjacent coverage includes:

reuse of canonical report document
stable SARIF branch invariants
deterministic artifacts/rules/results ordering
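
One invariant a consumer or test might check is that every location's index points at the artifact with the same URI. The function below is a hypothetical sketch of such a check over a SARIF run dictionary; it is not taken from the test files listed above.

```python
from typing import Any


def misaligned_locations(run: dict[str, Any]) -> list[str]:
    """Describe locations whose artifact index does not match run['artifacts']."""
    artifact_uris = [a["location"]["uri"] for a in run.get("artifacts", [])]
    problems: list[str] = []
    for r_i, result in enumerate(run.get("results", [])):
        locations = result.get("locations", []) + result.get("relatedLocations", [])
        for loc in locations:
            art = loc["physicalLocation"]["artifactLocation"]
            idx = art.get("index", -1)
            in_range = 0 <= idx < len(artifact_uris)
            if not in_range or artifact_uris[idx] != art["uri"]:
                problems.append(f"results[{r_i}]: {art['uri']} has index {idx}")
    return problems
```

An empty return value means every location links to the artifact it names.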