Architecture
YAMLRocks is a Rust extension (via PyO3, built by maturin) with a thin Python
package on top. It ships its own YAML scanner and parser rather than depending on
an external one, which is what makes first-class comments, native includes, and
source tracking possible. This page is a code map: it follows a document through
the pipeline and points at where each stage lives in src/.
The pipeline
    bytes -> scanner -> tokens -> parser -> events ┬-> resolver -> Python objects (fast path)
                                                   └-> composer -> AST -> YAMLRocksDocument (round-trip)

The split after events is the central design choice: one front end (scanner +
parser) feeds two back ends. The fast path is tuned for raw throughput; the
round-trip path is tuned for fidelity. Which one runs is decided by the option
flags on the call, chiefly OPT_ROUND_TRIP.
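The split can be pictured with a toy model. This is a minimal sketch, not YAMLRocks code: the `Event` type, the three helper functions, the `option` parameter name, and the numeric value given to `OPT_ROUND_TRIP` are all assumptions made for illustration, and the "parser" only understands a flat flow sequence.

```python
from typing import Iterator, NamedTuple

OPT_ROUND_TRIP = 1 << 0  # hypothetical bit value, chosen only for this sketch


class Event(NamedTuple):
    kind: str        # "seq_start", "scalar", or "seq_end"
    value: str = ""


def front_end(text: str) -> Iterator[Event]:
    """Toy scanner + parser: understands only a flat flow sequence like '[a, b]'."""
    yield Event("seq_start")
    for item in text.strip()[1:-1].split(","):
        if item.strip():
            yield Event("scalar", item.strip())
    yield Event("seq_end")


def fast_path(events: Iterator[Event]) -> list:
    """Back end 1: plain Python objects, nothing else retained."""
    return [e.value for e in events if e.kind == "scalar"]


def round_trip_path(events: Iterator[Event]) -> dict:
    """Back end 2: a node tree that keeps the source text of every scalar."""
    children = [{"type": "scalar", "source": e.value}
                for e in events if e.kind == "scalar"]
    return {"type": "sequence", "children": children}


def loads(text: str, option: int = 0):
    events = front_end(text)            # one front end, shared by both back ends
    if option & OPT_ROUND_TRIP:
        return round_trip_path(events)  # fidelity
    return fast_path(events)            # throughput
```

Both back ends consume the same event stream, so the choice between them is a single branch on the option flags rather than two separate parsers.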
Front end
Scanner (src/scanner/)
A state machine over a UTF-8 reader (reader.rs) that tracks indentation and
block/flow context and recognizes every scalar style: plain, single- and
double-quoted, literal (|), and folded (>). Scalars are scanned in
scalar.rs and tokens are defined in token.rs.
Comments are extracted as first-class items with source spans (comment.rs), but
only on the round-trip path. The fast path skips comment retention entirely
so it does no work it will throw away.
Parser (src/parser/)
Turns the token stream into a flat sequence of events (stream / document /
mapping / sequence / scalar / alias start and end), each carrying a span. The
event type lives in event.rs. Events are deliberately low-level and shared by
both back ends, so neither has to re-walk tokens.
Resolver (src/resolver/)
The only component that differs between YAML 1.1 and 1.2. A Resolver trait has
two implementations, yaml12.rs (the default core schema) and yaml11.rs
(yes/no booleans, 0777 octals, sexagesimal numbers). The resolver decides
how a plain scalar gets typed; everything upstream is schema-agnostic. This is
ADR-004:
a single parser with a dual resolver.
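To make the dual-resolver idea concrete, here is a minimal sketch in Python of two interchangeable resolvers behind one `resolve` method. The class names and the reduced rule sets are assumptions for illustration; the real implementations live in the Rust files named above and cover far more cases (floats, nulls, hex, the full YAML 1.1 boolean spellings).

```python
import re


class Yaml12Resolver:
    """Core-schema style typing: only true/false are booleans, octal is 0o17."""
    _OCT = re.compile(r"0o[0-7]+")
    _INT = re.compile(r"[-+]?[0-9]+")

    def resolve(self, text: str):
        if text in ("true", "True", "TRUE"):
            return True
        if text in ("false", "False", "FALSE"):
            return False
        if self._OCT.fullmatch(text):
            return int(text[2:], 8)
        if self._INT.fullmatch(text):
            return int(text)          # "0777" is the decimal number 777 here
        return text                   # everything else stays a string


class Yaml11Resolver:
    """YAML 1.1 style typing (subset): yes/no booleans, 0777 octals, sexagesimal."""
    _TRUE = {"yes", "Yes", "YES", "on", "On", "ON", "true", "True", "TRUE"}
    _FALSE = {"no", "No", "NO", "off", "Off", "OFF", "false", "False", "FALSE"}
    _OCT = re.compile(r"0[0-7]+")
    _SEXA = re.compile(r"[1-9][0-9]*(:[0-5]?[0-9])+")
    _INT = re.compile(r"[-+]?(0|[1-9][0-9]*)")

    def resolve(self, text: str):
        if text in self._TRUE:
            return True
        if text in self._FALSE:
            return False
        if self._OCT.fullmatch(text):
            return int(text, 8)       # "0777" is octal, i.e. 511
        if self._SEXA.fullmatch(text):
            total = 0
            for part in text.split(":"):
                total = total * 60 + int(part)   # base-60 digits
            return total
        if self._INT.fullmatch(text):
            return int(text)
        return text
```

The same plain scalar text goes in; only the typing decision differs, which is why nothing upstream of the resolver needs to know the YAML version.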
The two decode paths, and why
Fast path (src/decode/, src/encode/)
Events become a compact Value tree (an internal enum), which is then
materialized into Python objects. Merge keys (<<) are resolved here. This path
is allocation-light and never touches comments, which is why it carries the bulk
of the throughput advantage over PyYAML and ruamel.yaml.
src/encode/ is the reverse: native Python objects to YAML bytes for dumps.
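As a reminder of what merge-key resolution involves, the following is a minimal sketch of the standard `<<` semantics on plain Python data: keys written explicitly in the mapping win over merged keys, and when `<<` points at a list of mappings, earlier entries win over later ones. The function name and the pair-list input shape are assumptions for illustration, not the decoder's actual interface.

```python
def resolve_merge(pairs):
    """Resolve YAML merge keys for one mapping.

    `pairs` is the mapping as a list of (key, value) tuples in source order,
    where the value of a "<<" key is a dict or a list of dicts.
    """
    result = {}
    merged_sources = []
    for key, value in pairs:
        if key == "<<":
            # A single mapping or a sequence of mappings may be merged.
            merged_sources.extend(value if isinstance(value, list) else [value])
        else:
            result[key] = value
    for source in merged_sources:
        for key, value in source.items():
            # setdefault never overwrites: explicit keys and earlier
            # merge sources take precedence.
            result.setdefault(key, value)
    return result
```

For example, merging `{"a": 1, "b": 2}` into a mapping that also sets `b: 3` keeps `b` at 3 and adds `a`.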
Round-trip path (src/roundtrip/)
Events become a rich YamlNode AST (ast.rs) built by the composer
(composer.rs), carrying comments, scalar styles, anchors, and include markers.
A post-pass reattaches comments to nodes by source position (ADR-011): a comment
above a node becomes its head comment, a comment trailing a value on the same
line becomes its inline comment. The emitter (emit.rs) reproduces the document,
and document.rs backs the Python YAMLRocksDocument/YAMLRocksDocumentView types. An unmodified
document returns its original source verbatim; only changed nodes are
re-rendered. upgrade.rs implements yamlrocks.upgrade() on top of this AST.
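The position-based attachment rule can be sketched in a few lines. This is an illustrative model only: the `Node` fields, the `(line, text, own_line)` comment tuple, and the decision to drop comments that follow the last node are assumptions, not the behavior of the real post-pass.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    line: int                              # line on which the node's value sits
    head: list = field(default_factory=list)
    inline: str = ""


def attach_comments(nodes, comments):
    """Attach comments to nodes purely by source position.

    `nodes` must be sorted by line. Each comment is (line, text, own_line),
    where own_line is True when nothing but the comment is on that line.
    """
    by_line = {node.line: node for node in nodes}
    for line, text, own_line in comments:
        if not own_line and line in by_line:
            # Trailing a value on the same line: inline comment.
            by_line[line].inline = text
            continue
        # Otherwise it belongs to the next node below it: head comment.
        target = next((node for node in nodes if node.line > line), None)
        if target is not None:
            target.head.append(text)
    return nodes
```

Because the rule only looks at line numbers, it needs no cooperation from the parser beyond accurate spans on nodes and comments.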
Zero-copy scalar borrowing
Scalars are borrowed from the input buffer wherever possible. The scanner returns
Cow<'input, str> (see src/scanner/scalar.rs): a plain scalar with no escapes
borrows directly from the input bytes (the Borrowed variant, no allocation),
and only a scalar that needs unescaping or unfolding allocates an Owned string.
The lifetime 'input threads through events and the Value tree
(src/decode/mod.rs), so a typical document is parsed with very few string
allocations. Strings are only copied into owned Python objects at the final
materialization step.
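The borrow-or-own decision can be modeled outside Rust as well. In this Python analogy (a sketch, not the scanner's code) a `memoryview` slice stands in for the `Borrowed` variant because slicing a `memoryview` does not copy, and a fresh `bytes` object stands in for `Owned`. The function name, the tagged-tuple return value, and the small escape table are assumptions for illustration.

```python
# Only a handful of escapes are modeled; unknown escapes keep the escaped byte.
_ESCAPES = {ord("n"): b"\n", ord("t"): b"\t", ord("\\"): b"\\", ord('"'): b'"'}


def scan_quoted(buf: bytes, start: int, end: int):
    """Scan the body of a double-quoted scalar located at buf[start:end].

    Returns ("borrowed", memoryview) when the body contains no escapes,
    otherwise ("owned", bytes) holding the unescaped text.
    """
    if buf.find(b"\\", start, end) == -1:
        # No escapes: hand out a view into the input, no allocation of the text.
        return "borrowed", memoryview(buf)[start:end]
    out = bytearray()
    i = start
    while i < end:
        byte = buf[i]
        if byte == 0x5C and i + 1 < end:          # backslash starts an escape
            out += _ESCAPES.get(buf[i + 1], bytes([buf[i + 1]]))
            i += 2
        else:
            out.append(byte)
            i += 1
    return "owned", bytes(out)
```

As in the Rust design, the common case (no escapes) costs a bounds pair, and a copy happens only when the text actually has to change.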
PyO3 FFI materialization (src/ffi/)
The PyO3 module lives here:
- mod.rs: the #[pyfunction] entry points registered on the module: loads,
  loads_all, dumps, to_json, schema_ref, yaml_version, dump_includes,
  dump_includes_map, plus the internal round-trip helpers (loads_roundtrip,
  loads_via_ast). The public load, load_all, dump, and upgrade are thin Python
  wrappers in pysrc/yamlrocks/__init__.py that call these.
- convert/: the materialization layer that turns the internal Value tree into
  Python objects. The hot path (value_to_python_with in convert/decode.rs)
  builds containers with raw CPython calls (PyList_New + PyList_SET_ITEM,
  PyDict_New + PyDict_SetItem) to avoid per-element overhead, then hands back
  owned Py handles. convert/encode.rs is the reverse direction for dumps, and
  convert/annotate.rs produces
  YAMLRocksAnnotatedDict/YAMLRocksAnnotatedList/YAMLRocksAnnotatedStr.
- types.rs: the Python-facing #[pyclass] types defined here are YAMLRocksTag
  and the annotated containers YAMLRocksAnnotatedDict/YAMLRocksAnnotatedList
  (YAMLRocksAnnotatedStr is a pure-Python subclass;
  YAMLRocksDocument/YAMLRocksDocumentView/YAMLRocksNode live in
  roundtrip/document.rs).
The decode and encode hot paths drop to raw pyo3-ffi: list and dict
materialization, exact-type dispatch, and direct iteration all go through the
CPython API, and single-line plain scalars are parsed zero-copy by borrowing
straight from the input (ADR-012
and ADR-013).
High-level PyO3 is kept only
where it makes the structure-preserving YAMLRocksDocument proxies and the dict/list
subclasses memory-safe and tractable (ADR-009,
which supersedes the original
low-level pyo3-ffi plan in ADR-002). The remaining headroom is a full arena
rewrite of the scanner, deliberately left as a separate effort.
Includes (src/include/)
Resolves the !include and !include_dir_* family (gated by OPT_INCLUDES),
plus !secret (gated by OPT_SECRETS) and !env_var (gated by OPT_ENV_VAR).
A ResolveTags struct threads which tags are enabled through the resolver, so
each tag is inert unless its flag is set. Each resolved node’s span records the
file it came from, which is what lets round-trip save() and dump_includes()
write an edit back to the correct source file. Include cycles
(a.yaml -> b.yaml -> a.yaml) are detected and rejected before recursing.
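Cycle rejection of this kind is usually done by carrying the chain of files currently being resolved. The sketch below assumes that approach; the function name, the exception type, and the `files` dict (which stands in for the file system by mapping each path to the paths it includes) are hypothetical.

```python
class IncludeCycleError(Exception):
    """Raised when a file includes itself, directly or transitively."""


def resolve_includes(path, files, stack=()):
    """Return the files visited, depth first, starting at `path`.

    `stack` holds the chain of files currently being resolved. A child is
    checked against that chain before the recursive call is made.
    """
    stack = stack + (path,)
    visited = [path]
    for child in files.get(path, []):
        if child in stack:
            # Reject before recursing, and report the whole chain.
            raise IncludeCycleError(" -> ".join(stack + (child,)))
        visited.extend(resolve_includes(child, files, stack))
    return visited
```

Only the active chain is tracked, not every file seen so far, so a "diamond" layout where two files include the same third file is still accepted.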
Schema (src/schema/)
A focused JSON Schema validator (mod.rs) that runs against the AST, so a
validation error carries the precise line and column of the offending node rather
than a path-only message.
Security and limits
Untrusted input is bounded at several points (see security):
- Nesting depth is capped at MAX_DEPTH = 1000 in both the fast-path decoder
  (src/decode/mod.rs) and the round-trip composer
  (src/roundtrip/composer.rs), preventing stack exhaustion from deeply nested
  input.
- Alias expansion is bounded by a node budget, MAX_NODES = 10_000_000 in
  src/decode/mod.rs. Expansion is measured before an alias is cloned, so a
  “billion laughs” document is rejected instead of exhausting memory.
- Include cycles are rejected in src/include/mod.rs.
- No arbitrary object construction: tags never instantiate Python objects.
  Unknown tags keep their underlying scalar unless a tag_handler or
  OPT_PASSTHROUGH_TAG opts in.
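The first two limits can be illustrated with a small model. This is a sketch under stated assumptions: the constants mirror the values quoted above, but `LimitError`, `NodeBudget`, `node_count`, `expand_alias`, and `enter_level` are hypothetical names, and the real decoder works on its internal Value tree rather than Python containers.

```python
import copy

MAX_DEPTH = 1000
MAX_NODES = 10_000_000


class LimitError(ValueError):
    """Raised when input exceeds a safety limit."""


def node_count(value) -> int:
    """Count nodes iteratively, so counting itself cannot overflow the stack."""
    total, todo = 0, [value]
    while todo:
        item = todo.pop()
        total += 1
        if isinstance(item, dict):
            todo.extend(item.values())
        elif isinstance(item, list):
            todo.extend(item)
    return total


class NodeBudget:
    def __init__(self, max_nodes: int = MAX_NODES):
        self.remaining = max_nodes

    def charge(self, nodes: int) -> None:
        if nodes > self.remaining:
            raise LimitError("alias expansion exceeds the node budget")
        self.remaining -= nodes


def expand_alias(anchors: dict, name: str, budget: NodeBudget):
    target = anchors[name]
    budget.charge(node_count(target))   # measure first ...
    return copy.deepcopy(target)        # ... clone only if the charge succeeded


def enter_level(depth: int, max_depth: int = MAX_DEPTH) -> None:
    """Call when descending into a nested mapping or sequence."""
    if depth > max_depth:
        raise LimitError("nesting depth exceeds the limit")
```

Charging the budget before the clone is the important ordering: an oversized expansion is refused while it is still only a number, not after the memory has been allocated.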
Design records
The reasoning behind the major choices (custom parser over saphyr-parser, bytes
output from dumps, dual resolver, annotated dict/list subclasses,
position-based comment attachment, the PyO3 strategy) is recorded as ADRs in
the adr/ folder.
Start there before proposing a structural change.
See also
- Security: the limits above, from a user’s view.
- Round-trip editing: the feature the round-trip path exists to serve.
- Includes: the include and write-back model.