Configure optional text scanners — scanner

scanner_options() enables optional checks that sit beside deterministic policy rules. These scanners are intentionally lightweight and local. They are useful for catching common wrappers around risky text, such as invisible Unicode format characters, encoded payloads, disallowed URLs, simple token budget violations, language allowlists, and topic bans.

Usage

scanner_options(
  invisible_text = TRUE,
  encoded_payloads = TRUE,
  urls = FALSE,
  malicious_urls = TRUE,
  max_tokens = NULL,
  allowed_languages = NULL,
  language_fn = NULL,
  blocked_topics = NULL,
  blocked_url_hosts = NULL,
  allowed_url_hosts = NULL
)

Arguments

invisible_text: Whether to flag Unicode format characters such as zero-width spaces. Normalization removes these characters before rule matching, but a finding records that evasive formatting was present.
encoded_payloads: Whether to inspect URL-encoded and base64-like payloads by decoding candidates and scanning the decoded text.
urls: Whether to create low-severity inventory findings for URLs.
malicious_urls: Whether to flag URLs whose hosts are explicitly blocked or fall outside allowed_url_hosts.
max_tokens: Optional maximum estimated tokens for a single scanned text. Exceeding the limit creates an OWASP LLM10 block finding.
allowed_languages: Optional language allowlist. Uses language_fn when supplied, otherwise a minimal ASCII/non-Latin heuristic.
language_fn: Optional function that receives text and returns a single language label.
blocked_topics: Optional character vector of regular expressions, or a named character vector. Matches create topic-ban findings.
blocked_url_hosts: Optional character vector of blocked URL hosts.
allowed_url_hosts: Optional character vector of allowed URL hosts. When supplied, URL hosts outside the allowlist are flagged.

Value

A shieldr_scanner_options object.

Details

Scanner findings use the same finding schema as rule findings and therefore contribute to risk_score, action, audit logs, and explanations.

The encoded-payload scanner tries URL decoding and base64 decoding on candidate substrings, then runs the active policy rules over decoded text. It does not execute decoded content. The language scanner is deliberately basic unless language_fn is supplied; a custom function should accept a single string and return a language label such as "en", "es", or "non_latin".

Examples

scanners <- scanner_options(
  max_tokens = 500,
  blocked_topics = c("internal layoffs", "unreleased earnings")
)

scan_prompt("Summarize this public note.", scanners = scanners)
#> llmshieldr report
#> action: allow
#> risk_score: 0.000
#> findings: 0