Structured Chinese document library
The Structured Chinese document library is a local-first architecture for managing large collections of documents alongside structured bibliographic metadata. It prioritises performance, offline access, and seamless integration with terminal-based workflows. It is currently designed to integrate with the Helix editor, but it can be accessed just as easily from other text editors such as Vim or Neovim. The design separates raw document storage, plain-text access, and structured metadata into distinct but linked components, all unified through a shared UUID-based identifier scheme.
Core design principles
- Local-first: all data is stored and accessed without requiring network connectivity
- Separation of concerns: binary document storage, plain-text access, and metadata are independent layers
- Zero-copy access: memory mapping is used wherever possible for performance
- Stable identifiers: UUIDs provide consistent linkage across all components
- Editor-agnostic terminal workflow: optimised for terminal usage; easily integrated with any editor that can access the terminal. Use Vim, Neovim, Helix, VSCode, Emacs, Sublime, Zed, nano, micro, etc.
- Optional human-readable identifiers: a citekey in the form author_year_keyword, for example @zhang_2014_keyword
The system consists of four main components: document storage (blob), document storage (plain text), the metadata database, and citation format with editor integration. Each component is linked through a shared UUID.
Document storage
engine: mmap
purpose: store plain-text documents for direct access
key type: UUID
This component stores extracted or authored plain-text versions of documents (e.g., document.txt, metadata.txt). Files are memory-mapped to allow fast, zero-copy access from terminal tools and editors.
Memory-mapped file access enables efficient handling of large text corpora, with immediate usability in the terminal and text editor, and is optimised for offline workflows.
This layer serves as the primary interface for reading and editing documents, enabling fast searching, viewing, and manipulation without the need to decode binary formats.
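The memory-mapped access described above can be sketched with Python's standard mmap module. This is a minimal illustration, not the library's actual implementation: the directory layout (one folder per UUID containing document.txt) and the helper names text_path and find_in_document are assumptions made for the example.

```python
import mmap
import tempfile
import uuid
from pathlib import Path


def text_path(root: Path, doc_id: uuid.UUID) -> Path:
    # Hypothetical layout: <root>/<uuid>/document.txt
    return root / str(doc_id) / "document.txt"


def find_in_document(path: Path, needle: str) -> int:
    """Return the byte offset of the first match, or -1 if there is none."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        # Note: mmap raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The search runs over the mapped pages without first reading
            # the whole file into a Python bytes object.
            return mm.find(needle.encode("utf-8"))


# Demo with a throwaway document in a temporary directory.
root = Path(tempfile.mkdtemp())
doc_id = uuid.uuid4()
path = text_path(root, doc_id)
path.parent.mkdir(parents=True)
path.write_bytes("国务院令 第1号\n公告 内容".encode("utf-8"))

offset = find_in_document(path, "公告")
print(offset)
```

Offsets are byte offsets into the UTF-8 file rather than character indices, which matters for Chinese text where most characters occupy three bytes.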
Metadata storage
engine: SQLite
purpose: store structured bibliographic metadata
key type: UUID (linked to LMDB key)
The metadata layer maintains structured information about each document. It uses SQLite for portability and performance, with optional full-text search via FTS5 and optional semantic search. The single-file relational design provides fast reads and integrates easily with scripts and command-line tools.
This layer acts as the primary entry point for all queries, enabling structured filtering that narrows the search space before more expensive operations are performed. All document discovery flows through the metadata database, which reduces the candidate set of documents prior to semantic vector search and full-text search. This significantly improves both performance and accuracy when retrieving relevant documents and specific passages. In addition, it provides citation metadata for export and links all storage layers via UUID.
Database schema
| Field | Description |
|---|---|
| UUID | Primary key; links the plain-text files, binary blob, and metadata record of a document |
| Author | Individual / organisation |
| Title | Document title |
| Document classification | Document typology, e.g. 令 (order), 公告 (public announcement) |
| Year | Publication year |
| Issuing body | Organisation that issued or released the document |
| Identifier | External identifier: DOI, government document number, etc |
| Keywords | Searchable keyword tags |
| Attributes | Extended metadata not suited to the core fields (JSON) |
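One way the schema above could be expressed in SQLite, together with the metadata-first filtering step, is sketched below. The table name documents, the column names, the index, and the helpers open_library, add_document, and candidates are all assumptions for illustration; storing keywords and attributes as JSON text is likewise only one possible choice.

```python
import json
import sqlite3
import uuid

# Hypothetical DDL mirroring the fields in the table above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    uuid           TEXT PRIMARY KEY,
    author         TEXT,
    title          TEXT NOT NULL,
    classification TEXT,
    year           INTEGER,
    issuing_body   TEXT,
    identifier     TEXT,
    keywords       TEXT,  -- JSON array of tags
    attributes     TEXT   -- JSON object for extended metadata
);
CREATE INDEX IF NOT EXISTS idx_documents_class_year
    ON documents (classification, year);
"""


def open_library(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_document(conn, title, **fields):
    """Insert one record and return its newly generated UUID string."""
    doc_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO documents (uuid, author, title, classification, year,"
        " issuing_body, identifier, keywords, attributes)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            doc_id,
            fields.get("author"),
            title,
            fields.get("classification"),
            fields.get("year"),
            fields.get("issuing_body"),
            fields.get("identifier"),
            json.dumps(fields.get("keywords", []), ensure_ascii=False),
            json.dumps(fields.get("attributes", {}), ensure_ascii=False),
        ),
    )
    conn.commit()
    return doc_id


def candidates(conn, classification, year_from, year_to):
    """Narrow the search space to UUIDs matching structured filters."""
    rows = conn.execute(
        "SELECT uuid FROM documents"
        " WHERE classification = ? AND year BETWEEN ? AND ?"
        " ORDER BY year, uuid",
        (classification, year_from, year_to),
    ).fetchall()
    return [row[0] for row in rows]


# Demo: two records, only one of which passes the filter.
conn = open_library()
order_id = add_document(conn, "示例令", classification="令", year=2014)
notice_id = add_document(conn, "示例公告", classification="公告", year=2014)
hits = candidates(conn, "令", 2010, 2015)
```

The UUIDs returned by candidates would then be handed to the more expensive full-text or semantic search stages, so those stages only touch documents that already passed the structured filter.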
Binary blob storage (optional)
engine: LMDB (Lightning Memory-Mapped Database, liblmdb)
purpose: store binary documents where plain text is unavailable, incomplete, or unsuitable
key type: UUID
This layer stores documents in their original binary formats when a plain text representation is not available, not yet processed, or inherently unsuitable (e.g., audio, video). It also serves as a preservation and fallback layer, ensuring that the original source material is retained.
LMDB serves as a high-performance key-value store that indexes each object by UUID. Memory-mapped access keeps reads efficient, and ACID transactions protect data integrity when storing large binary assets. Beyond its fallback role, the layer holds inherently binary media (e.g., mp3, mp4, images) and keeps the original sources available for reprocessing.
Citation format
standard: BibLaTeX | CSL JSON
purpose: ensure compatibility with document processing tools
The system supports widely used citation formats to integrate with tools such as Pandoc. Metadata from SQLite can be exported into these formats.
Human-readable structured formats enable interoperability with external tools and support easy import and export workflows. This component bridges internal metadata with publishing pipelines and enables reproducible academic and technical writing.
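The export step might look roughly like the following sketch, which converts one metadata record into a CSL JSON item. The input keys (uuid, citekey, issuing_body, and so on) and the function name to_csl_json are assumptions, as is the mapping itself: using the CSL type "report" and placing the issuing body in "publisher" are illustrative choices, not something the design prescribes.

```python
import json


def to_csl_json(row: dict) -> dict:
    """Map one metadata record to a CSL JSON item (illustrative mapping)."""
    item = {
        # Prefer the human-readable citekey; fall back to the canonical UUID.
        "id": row.get("citekey") or row["uuid"],
        "type": "report",  # assumed default; other CSL types may fit better
        "title": row["title"],
    }
    if row.get("author"):
        # "literal" keeps organisational and unsplit names intact.
        item["author"] = [{"literal": row["author"]}]
    if row.get("year") is not None:
        item["issued"] = {"date-parts": [[int(row["year"])]]}
    if row.get("issuing_body"):
        item["publisher"] = row["issuing_body"]
    return item


# Demo record with hypothetical values.
row = {
    "uuid": "123e4567-e89b-12d3-a456-426614174000",
    "citekey": "zhang_2014_keyword",
    "author": "Zhang",
    "title": "示例文件",
    "year": 2014,
    "issuing_body": "示例机构",
}
item = to_csl_json(row)

# A CSL JSON bibliography file is a JSON array of such items.
bibliography = json.dumps([item], ensure_ascii=False, indent=2)
print(bibliography)
```

Writing the result to a .json file should allow it to be passed to Pandoc as a bibliography; ensure_ascii=False keeps Chinese titles readable in the exported file.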
Editor integration
editor: Helix; any terminal-accessible editor
purpose: provide seamless citation workflows
Integration with the text editor is achieved through external command hooks and CLI tools, allowing users to search, insert, and navigate citations directly within the editor.
Custom command-line tools query the metadata, while the editor invokes them through external command hooks and keybindings for @citekey lookup and insertion. This makes the database directly usable during writing and eliminates context switching between tools.
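As a rough sketch of what such a lookup tool could do internally, the function below returns one "@citekey, tab, title" line per match, a shape that an editor command which inserts shell output (for example Helix's :insert-output or Vim's :read !) could consume. The citekeys table, its columns, and the function name lookup are hypothetical; the design does not specify how aliases are stored.

```python
import sqlite3


def lookup(conn, term):
    """Return '@citekey<TAB>title' lines whose citekey or title contains term."""
    like = f"%{term}%"
    rows = conn.execute(
        "SELECT c.citekey, d.title"
        " FROM citekeys AS c JOIN documents AS d ON d.uuid = c.uuid"
        " WHERE c.citekey LIKE ? OR d.title LIKE ?"
        " ORDER BY c.citekey",
        (like, like),
    ).fetchall()
    return [f"@{citekey}\t{title}" for citekey, title in rows]


# Demo against an in-memory database with minimal, assumed tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (uuid TEXT PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE citekeys (
        citekey TEXT PRIMARY KEY,
        uuid    TEXT NOT NULL REFERENCES documents (uuid)
    );
    """
)
conn.execute("INSERT INTO documents VALUES ('u-1', '土地改革公告')")
conn.execute("INSERT INTO citekeys VALUES ('zhang_2014_reform', 'u-1')")
conn.commit()

lines = lookup(conn, "zhang")
for line in lines:
    print(line)
```

A thin command-line wrapper around lookup would print these lines to standard output, leaving selection and insertion to the editor's own keybindings.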
Citekey
The UUID is the canonical identifier used across the entire system and may be cited directly when desired. Citekeys are optional, human-readable aliases created locally on a per-document basis. They are only introduced when needed by the author for convenience during writing. Citekeys follow a structured pattern (e.g., author_year_keyword) and are generated with awareness of the existing bibliography. When a potential naming conflict is detected, the system adjusts the keyword (next keyword in list) to ensure uniqueness within the local collection. This approach preserves global uniqueness and stability via UUIDs, while allowing flexible, readable references where beneficial.
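The conflict-resolution rule described above (try the next keyword in the list until the key is unique) can be sketched as follows. The function names, the slug normalisation, and the decision to raise an error once every keyword is exhausted are assumptions; the design does not say what happens in that case.

```python
import re


def slug(text: str) -> str:
    # Lowercase and drop anything that is not a letter, digit, or underscore.
    return re.sub(r"\W+", "", text.lower())


def make_citekey(author: str, year: int, keywords: list, taken: set) -> str:
    """Build author_year_keyword, moving to the next keyword on a clash."""
    prefix = f"{slug(author)}_{year}"
    for keyword in keywords:
        candidate = f"{prefix}_{slug(keyword)}"
        if candidate not in taken:
            return candidate
    # Assumed behaviour: the source does not define this case.
    raise ValueError(f"no unused keyword left for {prefix}")


# Demo: the first keyword is already in use locally, so the second is chosen.
existing = {"zhang_2014_reform"}
first = make_citekey("Zhang", 2014, ["reform", "land"], set())
second = make_citekey("Zhang", 2014, ["reform", "land"], existing)
print(first, second)
```

Because the UUID remains the canonical identifier, a citekey generated this way only needs to be unique within the local collection.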