Machine-Readable Text

Structured Text in the Digital Age

In the digital age, research must be readable not only by humans but also by machines. This means that plain text is king: binary file formats such as Microsoft Word and PDF, Excel sheets for columnar data, and image formats like .jpg and .png are unsuitable for human readers and machines alike. But not all plain text is equally suitable. Text files must be 'structured': hierarchically nested, so that the relationship between sections and sub-sections of the document is explicit, and tables and citations are explicitly defined and readable.

Unstructured Text

A plain .txt file is a linear stream of characters. It contains no explicit hierarchy, no formally defined sections, and no machine-recognisable objects. Headings are just ordinary text. Tables are just aligned characters. Citations are simply a stream of characters, not defined objects. Footnotes are not anchored to the text they explain.

Because structure is implicit rather than encoded, machines cannot reliably interpret meaning. Search engines can index words, but they cannot understand document architecture. There is no reliable way to extract the citations from a document, to compare the methodology sections of a dozen articles, or to extract only the findings. And this is not just a 'nice to have': parsing text is an absolute necessity in an age when the number of scholarly articles has exploded. The only way to know what others have written is to be able to parse and search those texts reliably on an industrial scale.
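
To illustrate how fragile implicit structure is, here is a minimal Python sketch of the guesswork a machine is reduced to when handed a plain .txt file. The rule used (a short line without closing punctuation is probably a heading), the function name guess_headings, and the sample text are all assumptions made for this illustration, not an established method.

```python
def guess_headings(text, max_length=40):
    """Guess headings in unstructured text.

    Heuristic (an assumption for this sketch): a non-empty line that is
    short and does not end in punctuation is treated as a heading.
    """
    guesses = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and len(stripped) <= max_length and stripped[-1] not in ".:;,?!":
            guesses.append(stripped)
    return guesses


# Invented sample: one real heading, a hard-wrapped sentence, and a
# table laid out with aligned characters.
PLAIN = """Methods

We recruited forty adults
from two clinics.

Table 1
Group   N
A       20
"""

if __name__ == "__main__":
    for guess in guess_headings(PLAIN):
        print(repr(guess))
```

Only the first of the five guesses is a real heading. The others are the first half of a wrapped sentence, a table caption, and two table rows. The heuristic can be tuned, but any tuning is still a guess about what the author meant.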

Structured Text

Structured formats, such as Markdown, HTML, XML, TeX, Typst, or canonical representations like the Pandoc JSON AST, encode hierarchy and semantic meaning explicitly:

Headings are formally defined
Sections are nested hierarchically
Tables are structured data objects
Citations are machine-resolvable identifiers
Footnotes and cross-references are explicitly encoded
Images and graphs are explicitly defined objects

Structure transforms text from a visual artifact into a computational object.
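
As a rough sketch of what "computational object" means in practice, the following Python reads Markdown-style headings (lines beginning with one to six # characters) and builds a nested tree of sections. The Section class, the function names, and the sample document are hypothetical, and the parser is deliberately simplified: it ignores code fences and underlined heading styles that a real Markdown parser would handle.

```python
import re
from dataclasses import dataclass, field

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Section:
    title: str
    level: int
    body: list = field(default_factory=list)      # paragraph lines
    children: list = field(default_factory=list)  # nested Section objects


def parse_sections(markdown):
    """Build a section tree from heading levels (simplified sketch)."""
    root = Section(title="(document)", level=0)
    stack = [root]  # path from the root to the section currently open
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            section = Section(title=match.group(2), level=level)
            # Close every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(section)
            stack.append(section)
        elif line.strip():
            stack[-1].body.append(line.strip())
    return root


def outline(section, depth=0):
    """Return the section titles as an indented outline."""
    lines = []
    for child in section.children:
        lines.append("  " * depth + child.title)
        lines.extend(outline(child, depth + 1))
    return lines


# Invented sample document.
SAMPLE = """# Study
Intro text.
## Methods
We sampled two clinics.
### Participants
Forty adults.
## Findings
An effect was observed.
"""

if __name__ == "__main__":
    print("\n".join(outline(parse_sections(SAMPLE))))
```

Because each heading declares its own level, nesting is derived rather than guessed: "Participants" is known to belong to "Methods", and "Methods" to "Study".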

Why Structure Enables Search

When structure is explicit, machines can:

Index content at the section level
Extract tables as data
Resolve citations automatically
Compare versions structurally
Generate knowledge graphs
Provide precise semantic search

Without structure, search is limited to keywords. With structure, search becomes semantic.
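
Two of the capabilities listed above, extracting tables as data and resolving citations, can be sketched in a few lines of Python. The sketch assumes Pandoc-style bracketed citations such as [@key] and a single pipe-delimited table. The regular expression is a simplification of Pandoc's citation-key rules, and the sample text, the table values, and the BIBLIOGRAPHY entries are placeholders invented for the illustration.

```python
import re

# Simplified pattern for citation keys; it would also match the domain
# part of an e-mail address, which a real parser rules out by context.
CITATION_KEY = re.compile(r"@([A-Za-z0-9_-]+)")

# Placeholder bibliography: the keys and entries are invented.
BIBLIOGRAPHY = {
    "smith2020": "placeholder entry for smith2020",
    "lee2019": "placeholder entry for lee2019",
}


def extract_citation_keys(text):
    """Return citation keys in order of appearance."""
    return CITATION_KEY.findall(text)


def resolve(keys, bibliography):
    """Map each key to its bibliography entry, or None if unknown."""
    return {key: bibliography.get(key) for key in keys}


def extract_table(text):
    """Turn the pipe table in `text` into a list of row dictionaries.

    Assumes at most one table: a header row, a separator row, data rows.
    """
    lines = [line.strip() for line in text.splitlines()
             if line.strip().startswith("|")]
    if len(lines) < 2:
        return []
    rows = [[cell.strip() for cell in line.strip("|").split("|")]
            for line in lines]
    header, _separator, *data = rows
    return [dict(zip(header, row)) for row in data]


# Invented sample document.
DOC = """## Findings

Reaction times fell in both groups [@smith2020; @lee2019].

| Group | N  | Mean RT (ms) |
|-------|----|--------------|
| A     | 20 | 412          |
| B     | 20 | 389          |
"""

if __name__ == "__main__":
    print(resolve(extract_citation_keys(DOC), BIBLIOGRAPHY))
    print(extract_table(DOC))
```

The cell values come back as strings. Converting them to numbers is a separate, explicit step, which is still far easier than recovering columns from aligned characters.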

Why Structure Enables Sharing

Structured text can be transformed deterministically into:

Accessible HTML
EPUB for e-readers
Clean input for textual analysis
Data extracts for computational analysis

A single structured source can generate many outputs without loss of meaning. This makes research portable, durable, and interoperable.
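
The idea of one source and many outputs can be sketched as follows. The document model here, a list of dictionaries with "type", "level" and "text" keys, is a toy invented for the illustration and is not the Pandoc JSON AST schema. The function names are likewise hypothetical.

```python
from html import escape

# Toy structured source (invented for this sketch).
DOCUMENT = [
    {"type": "heading", "level": 1, "text": "Findings"},
    {"type": "paragraph", "text": "Response times fell in groups A & B."},
]


def to_html(blocks):
    """Render the blocks as HTML, escaping special characters."""
    parts = []
    for block in blocks:
        text = escape(block["text"])
        if block["type"] == "heading":
            level = block["level"]
            parts.append(f"<h{level}>{text}</h{level}>")
        elif block["type"] == "paragraph":
            parts.append(f"<p>{text}</p>")
        else:
            raise ValueError(f"unknown block type: {block['type']}")
    return "\n".join(parts)


def to_markdown(blocks):
    """Render the same blocks as Markdown."""
    parts = []
    for block in blocks:
        if block["type"] == "heading":
            parts.append("#" * block["level"] + " " + block["text"])
        elif block["type"] == "paragraph":
            parts.append(block["text"])
        else:
            raise ValueError(f"unknown block type: {block['type']}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(to_html(DOCUMENT))
    print()
    print(to_markdown(DOCUMENT))
```

Each renderer is a pure function of the source, so the same input always yields the same output. Adding a further format means adding a renderer, and the source itself stays untouched.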

Machine-Readable Research

In an environment of large-scale digital scholarship, text that cannot be parsed cannot fully participate. Structured plain text preserves the durability of text files while enabling computational interpretation.

Only structured text can meet the demands of modern search, sharing, accessibility, reproducibility, and long-term preservation.

Machine-readable structure is not an enhancement — it is infrastructure.