There is no standard, universally accepted practice for the digitisation of historical sources. As in all historical scholarship, there are no rules, only guidelines that we struggle and strain against, bending them to our individual wills. Even within the subdomain of textual sources, methods vary widely, constrained by financial and technological realities and by the ways the digitiser envisioned the resource being used. The creation of machine-readable texts, for example, has developed along several different pathways, including simple transcriptions (TXT, RTF, DOCX), tabular representations (CSV, TSV, XLSX), and structured (XML) or linked (RDF) datasets. From one perspective, these represent a continuum of detail, from the bare textual content of plain text to the richly documented Resource Description Framework. Yet the choice of a particular format reflects not only the digitiser's technical skill but also the way in which they conceive of the data, specifically the often implicit hierarchies to which the data belongs.
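To make the contrast concrete, consider how a structured encoding renders hierarchy explicit where a plain-text transcription leaves it implied. The sketch below is a hypothetical illustration using common TEI elements; the letter, its names, and its date are invented for the example, and a real project would choose its own structure.

```xml
<!-- A plain-text transcription would record only the words:
     "London, 12 March 1765. Dear Mr. Grenville, I write to you..."
     A TEI-style encoding, by contrast, makes the document hierarchy
     (letter > opener > dateline) and the entities within it explicit. -->
<text>
  <body>
    <div type="letter">
      <opener>
        <dateline>
          <placeName>London</placeName>,
          <date when="1765-03-12">12 March 1765</date>
        </dateline>
        <salute>Dear <persName>Mr. Grenville</persName>,</salute>
      </opener>
      <p>I write to you concerning the late Act...</p>
    </div>
  </body>
</opener>
```

Even in this small fragment, the encoder has asserted that the dateline belongs to the opener, the opener to the letter, and so on: the hierarchy is a claim about the source, not a neutral container.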
This paper will explore these hierarchies, at the document and corpus levels, and their repercussions for the digitisation process. In particular, it will examine the underlying ontological assumptions made by encoding models such as the Text Encoding Initiative (TEI) and the Dublin Core Metadata Initiative (DCMI), and how these correlate with the abstract and practical hierarchies historians use in their archival work. Through this exploration, it will identify the core shared practices employed by historians in the development of machine-readable transcriptions and discuss the extent to which existing frameworks meet the general analytical needs of historians as researchers and teachers. It will conclude with recommendations for a transferable set of practices and vocabularies for the encoding of historical sources, one that allows for both widespread comprehension and reuse as well as flexibility and specificity when working with varied genres.