11. Security Model
Purpose
Describe implemented protections and explicit security boundaries.
Public surface
- Scanner path validation: `codeclone/scanner.py:iter_py_files`
- File read limits and parser limits: `codeclone/cli.py:process_file`, `codeclone/extractor.py:_parse_limits`
- Baseline/cache validation: `codeclone/baseline.py`, `codeclone/cache.py`
- HTML escaping: `codeclone/_html_escape.py`, `codeclone/html_report.py`
- MCP read-only enforcement: `codeclone/mcp_service.py`, `codeclone/mcp_server.py`
Data model
Security-relevant input classes:
- filesystem paths (root/source/baseline/cache/report)
- untrusted JSON files (baseline/cache)
- untrusted source snippets and metadata rendered into HTML
Contracts
- CodeClone parses source text; it does not execute repository Python code.
- Sensitive root directories are blocked by scanner policy.
- Symlink traversal outside root is skipped.
- HTML report escapes text and attribute contexts before embedding.
- MCP server is read-only by design: no tool mutates source files, baselines, cache, or report artifacts.
- `--allow-remote` guard must be passed explicitly for non-local transports; default is local-only (`stdio`).
- `cache_policy=refresh` is rejected: MCP cannot trigger cache invalidation.
- Review markers (`mark_finding_reviewed`) are session-local in-memory state; they are never persisted to disk or leaked into baselines/reports.
- `git_diff_ref` is validated as a safe single revision expression before any `git diff` subprocess call. Leading option-like prefixes, whitespace/control characters, and unsupported punctuation are rejected.
- Run history is bounded by `--history-limit` (default 10) to prevent unbounded memory growth.
Refs:
- `codeclone/extractor.py:_parse_with_limits`
- `codeclone/scanner.py:SENSITIVE_DIRS`
- `codeclone/scanner.py:iter_py_files`
- `codeclone/_html_escape.py:_escape_html`
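The symlink contract above can be illustrated with a minimal sketch. It is not the code in `codeclone/scanner.py`; the helper names `is_within_root` and `iter_contained_py_files` are hypothetical, and the sketch assumes that "outside root" is decided by comparing fully resolved paths.

```python
from pathlib import Path
from typing import Iterator


def is_within_root(candidate: Path, root: Path) -> bool:
    """Return True if candidate, after resolving symlinks, stays under root."""
    resolved_root = root.resolve()
    resolved = candidate.resolve()
    return resolved == resolved_root or resolved_root in resolved.parents


def iter_contained_py_files(root: Path) -> Iterator[Path]:
    """Yield .py files under root, skipping any that resolve outside it."""
    for path in sorted(root.rglob("*.py")):
        if is_within_root(path, root):
            yield path
        # Hypothetical policy: entries that escape root are skipped silently.
```

Resolving both sides before comparing means that a link named `inside.py` pointing at a file elsewhere on disk is treated by its real location, not by where the link sits.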
Invariants (MUST)
- Baseline and cache integrity checks use constant-time comparison.
- Size guards are enforced before parsing baseline/cache JSON.
- Cache failures degrade safely (a warning is emitted and the cache is ignored); baseline trust failures follow the trust model.
Refs:
- `codeclone/baseline.py:Baseline.verify_integrity`
- `codeclone/cache.py:Cache.load`
- `codeclone/cli.py:_main_impl`
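As a rough illustration of the first two invariants, the sketch below caps how many bytes are read before handing anything to the JSON parser, and compares digests with `hmac.compare_digest`. The names `load_guarded_json`, `digests_match`, and `MAX_JSON_BYTES` are assumptions made for this example, and the limit value is arbitrary; none of them are taken from CodeClone.

```python
import hmac
import json
from pathlib import Path
from typing import Any

MAX_JSON_BYTES = 5 * 1024 * 1024  # hypothetical limit, not CodeClone's actual value


def load_guarded_json(path: Path, max_bytes: int = MAX_JSON_BYTES) -> Any:
    """Read at most max_bytes + 1 bytes and reject oversized input before parsing."""
    with path.open("rb") as fh:
        raw = fh.read(max_bytes + 1)
    if len(raw) > max_bytes:
        raise ValueError(f"{path.name} exceeds the {max_bytes}-byte limit")
    return json.loads(raw)


def digests_match(expected_hex: str, actual_hex: str) -> bool:
    """Compare two hex digests without short-circuiting on the first mismatch."""
    return hmac.compare_digest(expected_hex.encode("utf-8"), actual_hex.encode("utf-8"))
```

Reading a bounded number of bytes, instead of checking the size and then reading the whole file, avoids depending on the file staying the same size between the two steps.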
Failure modes
| Condition | Security behavior |
|---|---|
| Symlink points outside root | File skipped |
| Root under sensitive dirs | Validation error |
| Oversized baseline | Baseline rejected |
| Oversized cache | Cache ignored |
| HTML-injected payload in metadata/source | Escaped output |
| `--allow-remote` not passed for HTTP | Transport rejected |
| `cache_policy=refresh` requested | Policy rejected |
| `git_diff_ref` fails validation | Parameter rejected |
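The `git_diff_ref` row can be pictured with a small allow-list validator. This is a sketch under stated assumptions, not the check in `codeclone/mcp_service.py`: the accepted character set, the rejection of `..` ranges, and the names `validate_git_diff_ref` and `build_git_diff_argv` are all hypothetical.

```python
import re
from typing import List

# Hypothetical allow-list: must start with a letter or digit (so never with "-"),
# then only characters commonly seen in single revision expressions.
_SAFE_REF = re.compile(r"[A-Za-z0-9][A-Za-z0-9._/~^@{}-]*")


def validate_git_diff_ref(ref: str) -> str:
    """Return ref unchanged if it looks like one safe revision, else raise ValueError."""
    if not _SAFE_REF.fullmatch(ref):
        raise ValueError(f"unsafe git_diff_ref: {ref!r}")
    if ".." in ref:
        # Assumption: ranges such as a..b are not a *single* revision expression.
        raise ValueError(f"git_diff_ref must be a single revision: {ref!r}")
    return ref


def build_git_diff_argv(ref: str) -> List[str]:
    """Build an argument list (no shell) with the validated ref before the -- separator."""
    return ["git", "diff", "--name-only", validate_git_diff_ref(ref), "--"]
```

With this sketch, `HEAD~1` and `origin/main` pass, while `--output=x`, `main; rm -rf .`, and `HEAD\n` are rejected before any subprocess is started.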
Determinism / canonicalization
- Canonical JSON hashing for baseline/cache prevents formatting-only drift.
- Security failures map to explicit statuses (baseline/cache enums).
Refs:
- `codeclone/baseline.py:_compute_payload_sha256`
- `codeclone/cache.py:_canonical_json`
- `codeclone/baseline.py:BaselineStatus`
- `codeclone/cache.py:CacheStatus`
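One common way to get formatting-independent hashes is to serialize with sorted keys and fixed separators before hashing. The sketch below shows that idea only; the exact serialization options used by `_canonical_json` and `_compute_payload_sha256` are not stated here, so the function names and settings are assumptions.

```python
import hashlib
import json
from typing import Any


def canonical_json(payload: Any) -> str:
    """Serialize with sorted keys and no optional whitespace (assumed canonical form)."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def payload_sha256(payload: Any) -> str:
    """Hash the canonical serialization, so key order and spacing do not matter."""
    return hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()
```

Two payloads that differ only in key order or in the whitespace of the file they were loaded from produce the same digest, while any change to a key or value produces a different one.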
Locked by tests
- `tests/test_security.py::test_scanner_path_traversal`
- `tests/test_scanner_extra.py::test_iter_py_files_symlink_loop_does_not_traverse`
- `tests/test_security.py::test_html_report_escapes_user_content`
- `tests/test_html_report.py::test_html_report_escapes_script_breakout_payload`
- `tests/test_cache.py::test_cache_too_large_warns`
- `tests/test_mcp_service.py::test_cache_policy_refresh_rejected`
- `tests/test_mcp_server.py::test_allow_remote_guard`
Non-guarantees
- Baseline/cache integrity is tamper-evident at file-content level; it is not cryptographic attestation against a privileged attacker.