Structured Chinese document library

The Structured Chinese document library is a local-first architecture designed to manage large collections of documents alongside structured bibliographic metadata. It prioritises performance, offline access, and seamless integration with terminal-based workflows. It currently targets the Helix editor, but can be used from other text editors such as Vim or Neovim. The design separates raw document storage, plain-text access, and structured metadata into distinct but linked components, all unified through a shared UUID-based identifier scheme.

Core design principles

Local-first: all data is stored and accessed without requiring network connectivity
Separation of concerns: binary storage, plain-text access, and metadata are independent layers
Zero-copy access: memory mapping is used wherever possible for performance
Stable identifiers: UUIDs provide consistent linkage across all components
Editor-agnostic terminal workflow: optimised for terminal usage and easily integrated with any editor that can run terminal commands, such as Vim, Neovim, Helix, VSCode, Emacs, Sublime, Zed, nano, or micro

The system consists of four main components: document storage (blob), document storage (plain text), the metadata database, and citation format with editor integration. Each component is linked through a shared UUID.

Document storage

engine: mmap
purpose: store plain-text documents for direct access
key type: UUID

This component stores extracted or authored plain text versions of documents (e.g., document.txt, metadata.txt). Files are memory-mapped to allow fast, zero-copy access from terminal tools and editors.

Memory-mapped access keeps large text corpora efficient to handle: the operating system pages in only the regions that are actually read, so files are immediately usable from the terminal and the text editor, including offline.

This layer serves as the primary interface for reading and editing documents, enabling fast searching, viewing, and manipulation without the need to decode binary formats.
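As a rough illustration of the memory-mapped access described above, the sketch below searches a plain-text document without reading it into a Python string. The one-directory-per-UUID layout and the `find_all` helper are assumptions made for the example, not part of the library.

```python
import mmap
import tempfile
import uuid
from pathlib import Path


def find_all(path: Path, needle: bytes) -> list[int]:
    """Return the byte offset of every occurrence of needle in the file."""
    offsets = []
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        # Note: mmap raises ValueError for an empty file.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(needle, pos + 1)
    return offsets


# Hypothetical layout: one directory per document, named by its UUID.
root = Path(tempfile.mkdtemp())
doc_id = uuid.uuid4()
doc_dir = root / str(doc_id)
doc_dir.mkdir()
doc_path = doc_dir / "document.txt"

# Made-up sample text, written as bytes so offsets are platform-independent.
sample = "国务院令 第1号\n公告：本令自公布之日起施行。\n"
doc_path.write_bytes(sample.encode("utf-8"))

# Offsets are byte positions in the UTF-8 file, not character indices.
hits = find_all(doc_path, "令".encode("utf-8"))
print(doc_id, hits)
```

Because the search runs over the mapped bytes, the needle must be encoded the same way as the file (UTF-8 here).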

Metadata storage

engine: SQLite
purpose: store structured bibliographic metadata
key type: UUID (linked to LMDB key)

The metadata layer maintains structured information about each document. It uses SQLite for portability and performance, with optional full-text search via FTS5, complemented by semantic search. The single-file relational design provides fast reads and integrates easily with scripts and command-line tools.

This layer acts as the primary entry point for all queries, enabling structured filtering that narrows the search space before more expensive operations are performed. All document discovery flows through the metadata database, which reduces the candidate set of documents prior to semantic vector search and full-text search. This significantly improves both performance and accuracy when retrieving relevant documents and specific passages. In addition, it provides citation metadata for export and links all storage layers via UUID.

Database schema

Field	Description
UUID	Primary key, link to documents, metadata, database
Author	Individual / organisation
Title	Document title
Document classification	Document typology: 令 (order), 公告 (public announcement), etc.
Year	Publication year
Issuing body	Organisation that issued or released the document
Identifier	External identifier: DOI, government document number, etc.
Keywords	Searchable keyword tags
Attributes	Extended set of metadata not suitable for core fields (JSON)
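One possible rendering of this schema in SQLite is sketched below, together with the kind of structured filter that narrows the candidate set before any text search runs. The column names, the comma-separated keyword encoding, and the sample rows are all assumptions made for illustration.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE documents (
        uuid           TEXT PRIMARY KEY,
        author         TEXT,
        title          TEXT NOT NULL,
        classification TEXT,
        year           INTEGER,
        issuing_body   TEXT,
        identifier     TEXT,
        keywords       TEXT,  -- assumption: comma-separated tags
        attributes     TEXT   -- assumption: JSON serialised as text
    )
    """
)

# Made-up sample rows, only to exercise the query below.
rows = [
    (str(uuid.uuid4()), "Example ministry", "Example order", "令", 2014,
     "Example ministry", None, "tax,reform", json.dumps({"pages": 12})),
    (str(uuid.uuid4()), "Example bureau", "Example announcement", "公告", 2014,
     "Example bureau", None, "water", json.dumps({})),
    (str(uuid.uuid4()), "Example ministry", "Older order", "令", 1998,
     "Example ministry", None, "tax", json.dumps({})),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Structured filter: only these UUIDs would be passed on to
# full-text or semantic search.
candidates = conn.execute(
    """
    SELECT uuid, title FROM documents
    WHERE classification = ? AND year BETWEEN ? AND ?
    ORDER BY year
    """,
    ("令", 2010, 2015),
).fetchall()
print(candidates)
```

Storing the UUID as text keeps it directly comparable with file names and keys in the other layers; a 16-byte BLOB would be more compact at the cost of readability.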

Binary blob storage (optional)

engine: Lightning Memory-Mapped Database (LMDB, via liblmdb)
purpose: store binary documents where plain text is unavailable, incomplete, or unsuitable
key type: UUID

This layer stores documents in their original binary formats when a plain text representation is not available, not yet processed, or inherently unsuitable (e.g., audio, video). It also serves as a preservation and fallback layer, ensuring that the original source material is retained.

LMDB is used as a high-performance key-value store that indexes each object by UUID. Memory-mapped access enables efficient reads, and ACID transactions protect data integrity, which makes it reliable for storing large binary assets. In practice, this layer retains original source material for preservation and reprocessing, acts as a fallback when plain-text representations are missing or insufficient, holds inherently binary media (e.g., mp3, mp4, images), and provides UUID-addressable access to raw documents.

Citation format

standard: BibLaTeX | CSL JSON
purpose: ensure compatibility with document processing tools

The system supports widely used citation formats to integrate with tools such as Pandoc. Metadata from SQLite can be exported into these formats.

Human-readable structured formats enable interoperability with external tools and support easy import and export workflows. This component bridges internal metadata with publishing pipelines and enables reproducible academic and technical writing.
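The export step might look roughly like the sketch below, which maps one metadata record to a CSL JSON item and to a BibLaTeX entry. The field mapping, the choice of the `report` and `@misc` types, and the sample record are assumptions; a real exporter would pick types per document classification.

```python
import json


def to_csl(record: dict) -> dict:
    """Map one metadata record to a CSL JSON item (mapping is an assumption)."""
    return {
        # Fall back to the UUID when no citekey alias exists.
        "id": record.get("citekey") or record["uuid"],
        "type": "report",
        "title": record["title"],
        # "literal" keeps organisation names from being split into given/family.
        "author": [{"literal": record["author"]}],
        "issued": {"date-parts": [[record["year"]]]},
        "publisher": record["issuing_body"],
    }


def to_biblatex(record: dict) -> str:
    """Render the same record as a BibLaTeX @misc entry."""
    key = record.get("citekey") or record["uuid"]
    fields = {
        # The extra braces stop BibLaTeX from parsing the author as a personal name.
        "author": "{" + record["author"] + "}",
        "title": record["title"],
        "date": str(record["year"]),
        "organization": record["issuing_body"],
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{key},\n{body}\n}}"


# Made-up record for illustration.
record = {
    "uuid": "3f2b8c1e-0000-4000-8000-000000000001",
    "citekey": "zhang_2014_keyword",
    "author": "Zhang",
    "title": "Example notice",
    "year": 2014,
    "issuing_body": "Example ministry",
}

csl_text = json.dumps([to_csl(record)], ensure_ascii=False, indent=2)
bib_text = to_biblatex(record)
print(csl_text)
print(bib_text)
```

The CSL JSON output is a list of items, which is the shape Pandoc expects when the file is passed as a bibliography.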

Editor integration

editor: Helix; any terminal-accessible editor
purpose: provide seamless citation workflows

Integration with the text editor is achieved through external command hooks and CLI tools, allowing users to search, insert, and navigate citations directly within the editor.

Custom command-line tools query the metadata, while the editor invokes them through external command hooks and keybindings for @citekey lookup and insertion. This makes the database directly usable during writing and eliminates context switching between tools.
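A lookup tool of this kind could be as small as the sketch below. The tool name `cite-lookup`, the separate `citekeys` alias table, and the sample rows are hypothetical; the only behaviour taken from the design is that a document without a citekey is cited by its UUID.

```python
import argparse
import sqlite3


def lookup(conn: sqlite3.Connection, term: str) -> list[str]:
    """Return '@key<TAB>title' lines for titles or keywords containing term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        """
        SELECT COALESCE(c.citekey, d.uuid), d.title
        FROM documents AS d
        LEFT JOIN citekeys AS c ON c.uuid = d.uuid
        WHERE d.title LIKE ? OR d.keywords LIKE ?
        ORDER BY d.title
        """,
        (pattern, pattern),
    ).fetchall()
    return [f"@{key}\t{title}" for key, title in rows]


def main(argv: list[str], conn: sqlite3.Connection) -> list[str]:
    parser = argparse.ArgumentParser(prog="cite-lookup")  # hypothetical tool name
    parser.add_argument("term", help="substring to match in title or keywords")
    args = parser.parse_args(argv)
    lines = lookup(conn, args.term)
    print("\n".join(lines))
    return lines


# Demo database with made-up rows; a real tool would open the library's SQLite file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (uuid TEXT PRIMARY KEY, title TEXT, keywords TEXT);
    CREATE TABLE citekeys (citekey TEXT PRIMARY KEY, uuid TEXT REFERENCES documents(uuid));
    INSERT INTO documents VALUES
        ('11111111-1111-4111-8111-111111111111', 'Tax reform notice', 'tax,reform'),
        ('22222222-2222-4222-8222-222222222222', 'Income tax order', 'tax'),
        ('33333333-3333-4333-8333-333333333333', 'Water quality order', 'water');
    INSERT INTO citekeys VALUES ('zhang_2014_tax', '11111111-1111-4111-8111-111111111111');
    """
)
result = main(["tax"], conn)
```

An editor can then run the tool as an external command and insert its output, for example with `:insert-output` in Helix or `:read !` in Vim; the exact keybinding is left to the user's configuration.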

Citekey

Format: human-readable identifier (optional)
Example: @zhang_2014_keyword

The UUID is the canonical identifier used across the entire system and may be cited directly when desired. Citekeys are optional, human-readable aliases created locally on a per-document basis, and are introduced only when the author wants one for convenience during writing. Citekeys follow a structured pattern (e.g., author_year_keyword) and are generated with awareness of the existing bibliography. When a naming conflict is detected, the system substitutes the next keyword in the document's keyword list to ensure uniqueness within the local collection. This approach preserves global uniqueness and stability via UUIDs, while allowing flexible, readable references where beneficial.
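The conflict rule can be sketched as follows. The helper `make_citekey` is hypothetical, as are two of its choices: treating the first token of the author field as the family name, and raising an error once every keyword is taken (the design does not say what happens in that case).

```python
import re


def make_citekey(author: str, year: int, keywords: list[str], existing: set[str]) -> str:
    """Build author_year_keyword, moving to the next keyword on a clash."""
    # Assumption: the first token of the author field is the family name.
    surname = re.sub(r"\W+", "", author.split()[0].lower())
    for keyword in keywords:
        slug = re.sub(r"\W+", "", keyword.lower())
        candidate = f"{surname}_{year}_{slug}"
        if candidate not in existing:
            return candidate
    # Assumption: with no free keyword left, the caller cites by UUID instead.
    raise ValueError("no unused keyword left; cite the document by UUID")


existing = {"zhang_2014_tax"}
first = make_citekey("Zhang Wei", 2014, ["tax", "reform"], existing)
print(first)

existing.add(first)
second = make_citekey("Zhang Wei", 2015, ["tax", "reform"], existing)
print(second)
```

Since the UUID stays canonical, a citekey only has to be unique within the local collection, so a simple set lookup is enough here.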