Chinese government document library
This software tool began with a single goal: to bring all the Chinese government documents needed for research into one place that is offline, fully searchable, analysable, and integrated directly into a terminal-based workflow. It should be lightning-fast, support paragraph-level search and text analysis, and be easy to use.
While writing in a text editor, a simple command such as “2007 environmental law” should instantly return a ranked list of relevant documents. Selecting one should open it immediately in the terminal, allowing navigation to a specific paragraph or sentence. From there, a precise BibLaTeX citation for that passage is generated automatically and written into the working directory. The tool should also support deep textual analysis, whether through qualitative, 'hand-made' methods or computational methods with NLP. This means each document must be a plain text file, but it cannot be flat; it must preserve the same document, chapter, section, paragraph, list, and table structure as the original document. This is absolutely vital: without a fully structured document, computational analysis is not really possible, whereas with one it becomes possible to analyse language at the chapter, section, or paragraph level. A document is a semantic object; each paragraph and section is a node in a document- or corpus-level semantic graph. Its meaning can only be understood within that larger graph context; otherwise it is simply a seemingly random list of words and characters.
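To make the citation step concrete, the entry written out for a passage might look something like the following BibLaTeX sketch. The entry key, issuing institution, and paragraph locator are purely illustrative; only the title and date happen to come from the example document excerpted further below.

@report{guidance-fund-2008-example,
  title       = {关于创业投资引导基金规范设立与运作的指导意见},
  institution = {国务院办公厅},
  date        = {2008-10-18},
  note        = {cited passage: paragraph 3},
}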
To avoid becoming a software dead end, the system is designed to be fully open source, with no reliance on proprietary dependencies. I follow a Unix-style philosophy: small, composable components that are flexible, interoperable, and able to integrate cleanly into a wider research workflow rather than existing as a closed, monolithic tool. While I have used binary files occasionally, this is only where a plain-text alternative would be inefficient (the SQLite metadata index) or both inefficient and opaque (ast.cbor). The core documents (document.txt and metadata.txt) are themselves plain text files, easily viewable in any text editor.
Designing the tool
Before any data collection began, a significant amount of time was spent designing the overall architecture of the system. The central requirement was that content must exist as plain text, but not as unstructured text. In practical terms, this meant defining a system that could simultaneously satisfy several constraints: the text must be human-readable, machine-processable, structurally precise, and portable across tools and formats. These constraints are often in tension with one another, and resolving them required treating the document not as a file, but as a structured object with clearly defined internal relationships.
The design process was guided by some core principles:
- All content must be stored as plain text wherever possible, ensuring transparency, longevity, and ease of inspection.
- The full logical structure of the document must be preserved, including chapters, sections, paragraphs, lists, and tables.
- Text, metadata, and structure must remain separable, so that each layer can be processed independently without redundancy.
- The complete text must be reproducible at will in multiple popular formats without any loss of data.
These principles led to a three-part design for document representation. The first part is the plain text content (document.txt), which contains the full text in a clean and readable form. The second is the metadata (metadata.txt), capturing information such as issuing body, author, publication date, and unique document identifiers, also in plain text. The third is the structural representation, which encodes how the text is organised into a hierarchy of sections, paragraphs, and inline elements. Separating these parts ensures that the system remains both flexible and precise: the text can be read directly, while the structure enables advanced computational linguistic processing and analysis. Representing structure in a reliable way required an abstract syntax tree (AST). Rather than embedding formatting directly into the text, the AST models the document as a hierarchy of nodes, each corresponding to a meaningful unit such as a paragraph, heading, or span. This makes it possible to operate on the document programmatically at different levels of granularity, from entire sections down to individual inline elements, without ambiguity or loss of information.
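In practice, then, each document ends up as a small bundle of files. One possible on-disk layout looks like this (the directory name is a placeholder; the structural file is the ast.cbor mentioned earlier):

<one directory per document>/
    document.txt     full plain text content, readable in any editor
    metadata.txt     issuing body, author, publication date, identifiers, in plain text
    ast.cbor         structural representation: byte offsets into the two files above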
After evaluating different options, Pandoc was chosen as the foundation for this representation. It is a widely adopted, open-source system with a well-defined and extensively tested AST, capable of accurately representing complex documents. Using Pandoc as the structural standard avoids the need to invent a custom format, while ensuring compatibility with a large ecosystem of tools. It also provides a stable reference point: the AST defines what the document is, independent of how it is displayed or exported. Pandoc's AST is normally serialised as JSON:
{"t":"Str","c":"二○○八年十月十八日"}]},
{"t":"Para","c":[{"t":"LineBreak"},
{"t":"LineBreak"}]},
{"t":"Para","c":[{"t":"Str","c":"关于创业投资引导基金"},
{"t":"LineBreak"},
{"t":"Str","c":"规范设立与运作的指导意见"}
In this representation, each paragraph is a node (Para), and the text inside it is further broken down into inline elements such as strings (Str) and spaces. This structure is obviously verbose, but it provides a precise and lossless way to describe the document. Crucially, it allows computational linguistic analysis to operate at different levels of granularity: sections, subsections and paragraphs can be selected and analysed precisely. This is absolutely essential for good-quality analysis, but it results in large plain text files that can be double the size of the original content, since they contain all of the verbose structural wrapping ({"t":"Str", "c": }) as well as the actual content of the document itself. That is fine for a few smaller documents, but it is wasteful across 10,000+ documents, especially for longer texts. This is where plain text becomes less efficient than a well-designed binary alternative. The solution I came up with is to keep the plain text content and metadata as first-class citizens, while making the structured AST as small and sparse as possible: it is just a binary file of byte offsets that point into the document.txt and metadata.txt files. Each element simply points to a position and length within the text. The AST is serialised in CBOR for efficiency, but can be represented conceptually in JSON as follows:
[
0, [[0,0,0,24]]
],
[
33, [[18],[18]]
],
[
33, [[4, [[0,0,24,18]]],
[18],
[4, [[0,0,42,30]]]]
]
Here, the structure is just an array of integers, perfect for efficient computation:
[0, 0, 0, 24]
 │  └──┬───┘
 │     │
 │     span: [label=0 (doc), start=0, len=24]
 │
 tag 0 → "Str"
This represents byte ranges within the text files. In effect it says: "start at the first byte of document.txt and read the next 24 bytes; this is a text node." The text itself is never duplicated inside the structure; it exists only once, in either document.txt or metadata.txt.
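To make the mechanics concrete, here is a minimal Python sketch of span resolution under this scheme. The span shape [tag, label, start, len] and the meanings of tag 0 and label 0 follow the diagram above; treating label 1 as metadata.txt is an assumption made only for this example.

# Minimal sketch: resolve a span [tag, label, start, len] against the text files.
# Tag 0 = Str and label 0 = document.txt follow the diagram above;
# label 1 = metadata.txt is an assumption for illustration.
with open("document.txt", "rb") as f:
    doc_bytes = f.read()
with open("metadata.txt", "rb") as f:
    meta_bytes = f.read()

SOURCES = {0: doc_bytes, 1: meta_bytes}    # label → backing byte buffer

def resolve_span(span):
    """Return the text a span points at; the AST itself never stores it."""
    tag, label, start, length = span
    return SOURCES[label][start:start + length].decode("utf-8")

print(resolve_span([0, 0, 0, 24]))         # the first 24 bytes of document.txt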
An important advantage of using Pandoc as a dependency is that it decouples internal representation from output format. Once a document is captured in a Pandoc-compatible AST, it can be transformed into a wide range of formats with minimal additional effort, using the Pandoc ecosystem. This includes Markdown for lightweight editing, HTML for web display, XML (such as JATS) for archival and publishing workflows, docx for compatibility with common office tools, and PDF for final output.
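Export can be as simple as piping the stored JSON AST back through the pandoc command-line tool. A rough sketch, with placeholder file names:

import subprocess

# Convert a Pandoc JSON AST into several output formats via the pandoc CLI.
# "document.json" is a placeholder for the AST rehydrated from the stored files.
# (PDF output additionally requires a LaTeX engine to be installed.)
for writer, ext in [("markdown", "md"), ("html", "html"), ("jats", "xml"), ("docx", "docx")]:
    subprocess.run(
        ["pandoc", "document.json", "--from", "json",
         "--to", writer, "--output", f"document.{ext}"],
        check=True,
    )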
Collecting documents
Chinese government website
With the architecture in place, the next stage was document ingestion and normalisation. Raw HTML documents were collected and converted into this internal representation, with careful separation of content and metadata. During this process, each document was assigned a stable UUID, allowing it to be referenced consistently across different components of the system. This step established the foundation of a coherent corpus, where every document is both uniquely identifiable and structurally comparable.
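A stripped-down sketch of such an ingestion step, using the pandoc CLI for the HTML conversion and Python's uuid module for the identifier; the scraping and clean-up around it are left out:

import json
import subprocess
import uuid

def ingest_html(html_path):
    """Convert a raw HTML file into a Pandoc JSON AST and assign a stable UUID."""
    result = subprocess.run(
        ["pandoc", html_path, "--from", "html", "--to", "json"],
        capture_output=True, text=True, check=True,
    )
    ast = json.loads(result.stdout)
    doc_id = str(uuid.uuid4())     # referenced consistently across the whole system
    return doc_id, ast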
Once documents were standardised, attention shifted to metadata management. A dedicated SQLite database was introduced to store structured metadata linked to each document’s UUID. This database does not contain the full text itself, but instead provides a fast and flexible index over the corpus. By separating metadata from content, it becomes possible to construct project-specific datasets, filter documents by various criteria, and perform structured queries without duplicating or modifying the underlying files.
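The shape of that index is roughly as follows; the table and column names here are illustrative rather than the actual schema:

import sqlite3

conn = sqlite3.connect("corpus.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        uuid         TEXT PRIMARY KEY,   -- links back to the document's files
        title        TEXT,
        issuing_body TEXT,
        pub_date     TEXT,               -- ISO 8601 date string
        doc_number   TEXT                -- official document identifier
    )
""")

# Example of a project-specific filter: documents from 2007 with 环境 in the title.
rows = conn.execute(
    "SELECT uuid, title FROM documents"
    " WHERE pub_date LIKE '2007%' AND title LIKE '%环境%'"
).fetchall()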
With storage and indexing in place, the project moved into text analysis. An initial TF-IDF pipeline was implemented to extract keywords from each document, providing a simple but effective way to identify salient terms. However, this quickly revealed a limitation specific to Chinese policy language: many high-frequency terms are not noise, but carry essential administrative meaning. Words such as “贯彻” (implement), “加强” (strengthen), and “严格” (strictly) were consistently downweighted or removed, despite encoding instructions, emphasis, and enforcement.
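A pipeline of this kind can be sketched as follows. Here jieba and scikit-learn stand in for whichever segmenter and TF-IDF implementation is actually used, and the two toy documents are built from terms quoted in this section:

import jieba
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: one string per document (normally read from each document.txt).
corpus_texts = [
    "各部门要加强管理，严格贯彻落实有关规定。",
    "要严格管理创业投资引导基金的设立与运作。",
]

def tokenize(text):
    return [tok for tok in jieba.lcut(text) if tok.strip()]   # drop whitespace tokens

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False)
tfidf = vectorizer.fit_transform(corpus_texts)

# Print the highest-weighted terms for the first document.
terms = np.array(vectorizer.get_feature_names_out())
weights = tfidf[0].toarray().ravel()
print(terms[weights.argsort()[::-1][:10]])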
This led to a rethinking of the concept of stopwords. Instead of treating frequent terms as disposable, the vocabulary was reorganised into functional categories, including topic terms, action verbs, and modality markers. This allows the system to distinguish between what a document is about and how it expresses authority and obligation. In effect, the analysis moves from surface-level keyword extraction toward a more structured interpretation of bureaucratic language.
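A drastically reduced illustration of that reorganisation; the category names and the handful of assignments below are examples only, not the working vocabulary:

# Frequent terms are grouped by function instead of being discarded as stopwords.
FUNCTIONAL_VOCAB = {
    "action":   {"贯彻", "落实", "加强", "推进"},   # what actors are told to do
    "modality": {"严格", "必须", "应当", "切实"},   # how forcefully they are told to do it
}

def classify_token(token):
    for category, terms in FUNCTIONAL_VOCAB.items():
        if token in terms:
            return category
    return "topic"   # everything else is treated as potentially topical content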
A further challenge arose from the nature of Chinese text itself. Because there are no explicit word boundaries, standard tokenisation often splits meaningful phrases into smaller units, losing important semantic relationships. Expressions such as “严格管理” (strict management) or “贯彻落实” (thoroughly implement) are not arbitrary combinations of words, but stable and highly characteristic constructions within policy documents. Recovering these required moving below the level of tokenisation to character-level analysis.
To address this, the system computes frequent n-grams across the corpus, identifying sequences of characters that consistently appear together. These sequences can then be filtered using statistical measures such as frequency and association strength, allowing genuine phrases to be distinguished from accidental co-occurrences. The result is a corpus-derived vocabulary of domain-specific expressions, discovered directly from the data rather than imposed externally.
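Condensed to its core, the idea is to count character n-grams over the corpus and keep the frequent ones whose characters co-occur far more often than chance would predict. Pointwise mutual information stands in here for the association measures actually used, and the bigram case shown extends directly to three- and four-character sequences such as 贯彻落实:

import math
from collections import Counter

def char_ngrams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

unigrams, bigrams = Counter(), Counter()
for text in corpus_texts:                  # corpus_texts as in the TF-IDF sketch above
    unigrams.update(char_ngrams(text, 1))
    bigrams.update(char_ngrams(text, 2))

total_uni, total_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(bigram):
    """Pointwise mutual information of a two-character sequence."""
    p_xy = bigrams[bigram] / total_bi
    p_x = unigrams[bigram[0]] / total_uni
    p_y = unigrams[bigram[1]] / total_uni
    return math.log2(p_xy / (p_x * p_y))

# Candidate phrases: bigrams that are both frequent and strongly associated.
candidates = [b for b, count in bigrams.items() if count >= 5 and pmi(b) > 3.0]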
These extracted phrases are then reintegrated into the pipeline as a custom segmentation dictionary. This creates a feedback loop: the corpus informs the segmentation process, which in turn improves the quality of downstream analysis. Over time, the system becomes increasingly adapted to the specific language patterns of Chinese government documents, rather than relying on generic linguistic models.
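With jieba again standing in as the segmenter, feeding discovered phrases back in is a single call per entry (a user dictionary file loaded via jieba.load_userdict works just as well):

import jieba

# Phrases discovered by the n-gram step are registered with the segmenter,
# so later tokenisation keeps them intact as single units.
for phrase in ["严格管理", "贯彻落实"]:
    jieba.add_word(phrase, freq=1000)

print(jieba.lcut("各部门要贯彻落实严格管理的要求"))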
The system now consists of several interconnected layers: structured document representation, metadata indexing, keyword extraction, phrase discovery, and semantic classification. Each layer builds on the previous one, gradually enriching the representation of the corpus. Rather than treating documents as flat text, the system models them as structured, interpretable objects embedded within a larger analytical framework.
At this stage, the project is moving beyond basic information retrieval toward a deeper analysis of bureaucratic language. By combining structural representation with semantic classification, it becomes possible to identify not only topics, but also patterns of instruction, enforcement, and governance. This opens the way for more advanced forms of search and analysis, where documents can be compared and queried not just by content, but by how they express policy and authority.