Machine-Readable Text
Structured Text in the Digital Age
In the digital age, research must be readable not only by humans but also by machines. This means that plain text is king: binary file formats such as Microsoft Word and PDF, Excel sheets for columnar data, and image formats such as .jpg and .png are unsuitable for this purpose. But not all plain text is equally suitable. Text files must be 'structured': hierarchically nested, so that the relationship between sections and sub-sections of the document is explicit, and tables and citations are explicitly defined and machine-readable.
Unstructured Text
A plain .txt file is a linear stream of characters. It contains no explicit hierarchy, no formally defined sections, and no machine-recognisable objects. Headings are just ordinary text. Tables are just aligned characters. Citations are simply a stream of characters, not defined objects. Footnotes are not anchored to the text they explain.
Because structure is implicit rather than encoded, machines cannot reliably interpret meaning. Search engines can index words, but they cannot understand document architecture. There is no reliable way to extract the citations from a document, to compare the methodology sections of a dozen articles, or to pull out just the findings. This is not merely a 'nice to have': parsing text is a necessity in an age when the number of scholarly articles has exploded. The only way to know what others have written is to parse and search those texts reliably, on an industrial scale.
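To illustrate why extraction from unstructured text is unreliable, here is a minimal sketch of the kind of guesswork it requires. The pattern, the function name `guess_citations`, and the sample sentences are all invented for this illustration; real citation styles vary far more than this.

```python
import re

# Heuristic (an assumption for this sketch): a citation looks like
# "(Surname Year)" or "(Surname, Year)".
AUTHOR_YEAR = re.compile(r"\(([A-Z][A-Za-z]+),? (\d{4})\)")


def guess_citations(text: str) -> list[tuple[str, str]]:
    """Return (surname, year) pairs that merely *look* like citations."""
    return AUTHOR_YEAR.findall(text)


# Hypothetical sample text with one real parenthetical citation,
# one date that only resembles a citation, and one narrative citation.
sample = (
    "Earlier work found the opposite (Smith 2019). "
    "Data were collected on site (Spring 2020). "
    "Smith and Jones (2021) later revised the estimate."
)

for surname, year in guess_citations(sample):
    print(surname, year)
```

On this sample the heuristic picks up "Smith 2019", wrongly treats "Spring 2020" as a citation, and misses "Smith and Jones (2021)" altogether. Tightening the pattern fixes one case and breaks another, which is the practical meaning of "structure is implicit".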
Structured Text
Structured formats such as Markdown, HTML, XML, TeX, and Typst, or canonical representations like the Pandoc JSON AST, encode hierarchy and semantic meaning explicitly.
- Headings are formally defined
- Sections are nested hierarchically
- Tables are structured data objects
- Citations are machine-resolvable identifiers
- Footnotes and cross-references are explicitly encoded
- Images and graphs are defined explicitly
Structure transforms text from a visual artifact into a computational object.
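As a rough sketch of what "computational object" can mean in practice, the following turns Markdown-style `#` headings into a nested tree of sections. The `Section` class and `parse_sections` function are hypothetical names chosen for this example. It deliberately ignores complications such as fenced code blocks and underlined (setext) headings, which a real Markdown parser would handle.

```python
import re
from dataclasses import dataclass, field

# One to six '#' characters, whitespace, then the heading title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    level: int
    title: str
    body: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def parse_sections(text: str) -> Section:
    """Build a section tree from '#'-style headings (simplified sketch)."""
    root = Section(level=0, title="(document)")
    stack = [root]  # chain of currently open sections, outermost first
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close every open section at the same or a deeper level,
            # so the top of the stack is the new section's parent.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(level=level, title=match.group(2))
            stack[-1].children.append(section)
            stack.append(section)
        elif line.strip():
            # Non-empty, non-heading lines belong to the innermost section.
            stack[-1].body.append(line.strip())
    return root
```

Once the document is a tree, questions such as "what are the sub-sections of Methods?" become a simple traversal rather than a guess based on layout.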
Why Structure Enables Search
When structure is explicit, machines can:
- Index content at the section level
- Extract tables as data
- Resolve citations automatically
- Compare versions structurally
- Generate knowledge graphs
- Provide precise semantic search
Without structure, search is limited to keywords. With structure, search becomes semantic.
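A small sketch of section-level indexing, assuming the source uses Markdown-style `#` headings: each word is recorded together with the chain of headings it appears under, so a query can be restricted to one part of a document. The function names `build_section_index` and `search` are invented for this illustration, and the index covers body text only, not the headings themselves.

```python
import re
from collections import defaultdict

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
WORD = re.compile(r"[a-z0-9]+")


def build_section_index(text: str) -> dict[str, set[tuple[str, ...]]]:
    """Map each lowercase word to the heading paths whose body contains it."""
    index: dict[str, set[tuple[str, ...]]] = defaultdict(set)
    path: list[tuple[int, str]] = []  # (level, title) of enclosing headings
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Drop headings at the same or a deeper level before descending.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
            continue
        location = tuple(title for _, title in path)
        for word in WORD.findall(line.lower()):
            index[word].add(location)
    return index


def search(index, word, within=None):
    """Return heading paths containing `word`, optionally under one heading."""
    hits = index.get(word.lower(), set())
    if within is not None:
        hits = {hit for hit in hits if within in hit}
    return sorted(hits)


# Hypothetical two-section document.
sample = "# Paper\n## Methods\nWe sampled soil cores.\n## Findings\nSoil carbon increased.\n"
index = build_section_index(sample)
print(search(index, "soil"))
print(search(index, "soil", within="Methods"))
```

A keyword search over the flat text could only report that "soil" occurs somewhere in the file. Here the same query can be narrowed to the Methods section alone.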
Why Structure Enables Sharing
Structured text can be transformed deterministically into:
- Accessible HTML
- EPUB for e-readers
- A suitable starting point for textual analysis
- Data extracts for computational analysis
A single structured source can generate many outputs without loss of meaning. This makes research portable, durable, and interoperable.
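The single-source idea can be sketched with a toy document held as nested data and two renderers that read it. The dictionary layout (`title`, `paragraphs`, `sections`), the function names, and the sample content are assumptions made for this example. In practice a converter such as Pandoc plays this role over a much richer document model.

```python
from html import escape

# Hypothetical structured source: a title, paragraphs, nested sections.
document = {
    "title": "Soil Carbon Study",
    "paragraphs": ["A short overview."],
    "sections": [
        {"title": "Methods", "paragraphs": ["We sampled twelve plots."]},
        {"title": "Findings", "paragraphs": ["Carbon levels rose."]},
    ],
}


def to_html(node: dict, level: int = 1) -> str:
    """Render a section and its sub-sections as HTML headings and paragraphs."""
    tag = f"h{min(level, 6)}"  # HTML defines heading levels h1 to h6 only
    parts = [f"<{tag}>{escape(node['title'])}</{tag}>"]
    parts += [f"<p>{escape(p)}</p>" for p in node.get("paragraphs", [])]
    parts += [to_html(child, level + 1) for child in node.get("sections", [])]
    return "\n".join(parts)


def to_outline(node: dict, depth: int = 0) -> str:
    """Render only the heading hierarchy as an indented plain-text outline."""
    lines = ["  " * depth + node["title"]]
    lines += [to_outline(child, depth + 1) for child in node.get("sections", [])]
    return "\n".join(lines)


print(to_html(document))
print(to_outline(document))
```

Both outputs are derived from the same source by pure functions, so the same input always yields the same result, and adding another target format means adding another renderer rather than editing the document.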
Machine-Readable Research
In an environment of large-scale digital scholarship, text that cannot be parsed cannot fully participate. Structured plain text preserves the durability of text files while enabling computational interpretation.
Only structured text can meet the demands of modern search, sharing, accessibility, reproducibility, and long-term preservation.
Machine-readable structure is not an enhancement — it is infrastructure.