Session 6 — Text Analytics 2: Identifying complex meanings in historical texts
Friday 11:30 - 13:00
Chair: Michael Pidd
University of Sheffield
This paper critically interrogates the theoretical frameworks and goals of computational distributional semantics in relation to the study of historical concepts in the humanities. Distributional semantics involves the automatic analysis of word meaning in texts, and can in turn identify various types of concepts related to those words (cf. Manning and Schütze 1999, Turney and Pantel 2010). The method involves statistical analysis of lexical co-occurrence patterns in texts. The tradition of distributional semantic analysis emerged in the humanities in the mid-20th century (Geeraerts 2010, cf. Firth 1962, Austin 1962), and has been adapted by computational linguists since the 1990s (cf. Turney and Pantel 2010). Today, distributional semantic studies can be divided between those that aim to learn about language and concepts (cf. Heylen et al. 2008) and those that aim to complete an engineering task (cf. Tahmasebi et al. 2013). Even as these categories of research diverge, their methods remain remarkably similar (cf. Peirsman et al. 2007), a similarity that raises important questions.
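The co-occurrence statistics at the heart of the method can be illustrated with a minimal sketch. The toy sentences and fixed context window below are invented for illustration; a real study would use a large historical corpus:

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus; a real study would use a large historical corpus.
corpus = [
    "the king ruled the realm".split(),
    "the queen ruled the realm".split(),
    "the farmer tilled the field".split(),
]

WINDOW = 2  # symmetric context window

# Count co-occurrences of each target word with its context words.
cooc = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - WINDOW), min(len(sent), i + WINDOW + 1)):
            if j != i:
                cooc[w][sent[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# 'king' and 'queen' occur in the same contexts, so their vectors
# are more similar to each other than either is to 'farmer'.
print(cosine(cooc["king"], cooc["queen"]))
print(cosine(cooc["king"], cooc["farmer"]))
```

Words that recur in similar contexts end up with similar vectors, which is the distributional hypothesis in its simplest operational form.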
The paper argues that a careful consideration of both semantics and concepts is necessary to link the two research categories meaningfully, and we propose ‘discursive concepts’ as a useful bridge.
Finally, we present findings from a major research project that uses distributional semantics alongside close reading to map conceptual change in Early Modern English. We present outputs of distributional semantic methods as the basis for discerning patterns of meaning in historical discourse cultures, and link those findings to the theoretical perspective we have proposed.
Austin, J. L. 1962. How to do things with words. Cambridge, Massachusetts: Harvard University Press.
Firth, J. R. 1962. A synopsis of linguistic theory. In Studies in linguistic analysis. Oxford: Basil Blackwell. 1-31.
Geeraerts, Dirk. 2010. Theories of lexical semantics. Oxford: Oxford University Press.
Heylen, Kris, Yves Peirsman, Dirk Geeraerts, and Dirk Speelman. 2008. Modelling word similarity: An evaluation of automatic synonymy extraction algorithms. In Proceedings of the Sixth International Language Resources and Evaluation, 3243-49.
Manning, Christopher and Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
Peirsman, Yves, Kris Heylen, and Dirk Speelman. 2007. Finding semantically related words in Dutch: Co-occurrences versus syntactic contexts. In Proceedings of the CoSMO workshop, Roskilde, Denmark, 9-16.
Tahmasebi, Nina, Kai Niklas, Gideon Zenz, and Thomas Risse. 2013. On the applicability of word sense discrimination on 201 years of modern English. In International Journal on Digital Libraries 13, 135-53.
University of Cambridge
Discussing the history of collaboration between “humanities computing” and the computational linguistics community, Susan Hockey noted that despite efforts to bring the fields closer together in the late 1980s, “there was limited communication between these communities, and humanities computing did not benefit as much as it could have done from computational linguistics techniques” (Hockey 2004: 13). Today, it would appear that this has changed, at least with respect to models of distributional semantics—computational models that attend to statistical patterns of association between words in large text corpora. For example, an entire issue of the Journal of Digital Humanities was devoted to topic models in 2012, and another class of distributional models known as word embeddings has recently been making waves (Schmidt 2015; Bjerva & Praet 2015; Heuser 2016). Such methods have particular promise for humanists interested in identifying groups of words that appear in similar contexts, identifying the semantic fields relevant to corpora that may be too large to read in their entirety (Newman & Block 2006), or identifying changes in concept use across time (Goldstone & Underwood 2012; Wevers, Kenter, & Huijnen 2015).
However, topic models and word embeddings often yield relatively opaque mathematical representations, creating difficulties for researchers attempting to use them to draw conclusions about the use of particular words. This paper will argue for the utility of count-based distributional models, which have largely remained outside the purview of the humanities despite their long history in computational linguistics and recent articles arguing in their favour (Lebret & Collobert 2015; Levy, Goldberg, & Dagan 2015). Because every component of a vector in a count-based model has a much clearer interpretation than in an embedding model, questions about why specific words appear as highly associated according to the model can be investigated more clearly and rigorously. By applying a count-based model to tasks that have recently been highlighted as use cases for word embeddings in the digital humanities, the present paper will illustrate that novel insights can be uncovered that would not have been possible with word embeddings or topic models alone.
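The interpretability claim can be made concrete with a small sketch of one standard count-based weighting, positive pointwise mutual information (PPMI). The counts below are invented for illustration; the point is that every vector component is indexed by a named context word, so a high association can be traced back to specific shared contexts:

```python
import math
from collections import Counter

# Toy co-occurrence counts (target -> context word -> count); in practice
# these would be extracted from a large corpus with a fixed context window.
cooc = {
    "ship":  Counter({"sea": 8, "sail": 6, "port": 4, "king": 1}),
    "boat":  Counter({"sea": 7, "sail": 5, "river": 3}),
    "crown": Counter({"king": 9, "queen": 6, "sea": 1}),
}

total = sum(sum(c.values()) for c in cooc.values())
word_tot = {w: sum(c.values()) for w, c in cooc.items()}
ctx_tot = Counter()
for c in cooc.values():
    ctx_tot.update(c)

def ppmi(w, ctx):
    """Positive pointwise mutual information for one (word, context) cell."""
    p_wc = cooc[w][ctx] / total
    if p_wc == 0:
        return 0.0
    p_w = word_tot[w] / total
    p_c = ctx_tot[ctx] / total
    return max(0.0, math.log2(p_wc / (p_w * p_c)))

# Every vector component is a named, inspectable context word:
for ctx in sorted(ctx_tot):
    print(f"ship ~ {ctx}: {ppmi('ship', ctx):.2f}")
```

Unlike an embedding dimension, each cell here answers a direct question ("how strongly is 'ship' associated with 'sea'?"), which is what makes model-driven claims about particular words easier to audit.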
Bjerva, Johannes, and Raf Praet. "Word Embeddings Pointing the Way for Late Antiquity." LaTeCH 2015 (2015): 53.
Goldstone, Andrew, and Ted Underwood. "What Can Topic Models of PMLA Teach Us About the History of Literary Scholarship?" Journal of Digital Humanities 2.1 (2012): 40-49.
Heuser, Ryan. "Word Vectors in the Eighteenth Century, Episode 1: Concepts." Adventures of the Virtual. 14 Apr. 2016. Web. 14 May 2016.
Hockey, Susan. “The History of Humanities Computing.” A Companion to Digital Humanities. Eds. Susan Schreibman, Ray Siemens, & John Unsworth. Oxford: Blackwell, 2004. 3-19.
Lebret, Rémi, and Ronan Collobert. "Rehabilitation of Count-based Models for Word Vector Representations." Computational Linguistics and Intelligent Text Processing. Springer International Publishing, 2015. 417-429.
Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.
Newman, David J., and Sharon Block. "Probabilistic Topic Decomposition of an Eighteenth‐Century American Newspaper." Journal of the American Society for Information Science and Technology 57.6 (2006): 753-767.
Schmidt, Ben. "Word Embeddings for the Digital Humanities." Ben's Bookworm Blog. 25 Oct. 2015. Web. 14 May 2016.
Wevers, Melvin, Tom Kenter, and Pim Huijnen. "Concepts Through Time: Tracing Concepts in Dutch Newspaper Discourse (1890-1990) Using Word Embeddings." DH2015, Sydney, Australia, 2015.
Sociolinguists know that competition between linguistic variables is neither unorganized nor random, but guided by both language-internal and language-external factors. Variation also reveals language ideologies that can lie behind any linguistic choice. By connecting linguistic data with language-external factors such as place, time, social status, level of education, or ideological and political environment, we can observe how social meanings arise in context. Interfacing structured and unstructured data also enables new kinds of questions about linguistic variation and change, and offers opportunities to test and experiment with novel methods for revealing the logic behind ostensibly random variation.
This poses a challenge from a computer science perspective, as current tools are not able to fluidly cross-question linguistic phenomena and contextual information. The new open source tools to be discussed allow the user to define data subsets based on both linguistic features and the various extralinguistic criteria included in the corpus metadata. This constrained subset can be subjected to further linguistic analysis and visualization, as well as projected again through structured metadata into interactive visualizations combining the two, such as plotting the spread of linguistic phenomena through time and space. The exploration of visualization parameters allows us to detect interesting variations in the material, and to extract the relevant subset of the data for linguistic analysis.
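The subsetting workflow can be sketched in miniature. The records, field names, and feature test below are invented for illustration and do not reflect the interface of the tools under discussion:

```python
from collections import Counter

# Hypothetical records pairing tokens with corpus metadata; the field
# names are illustrative, not those of any particular tool.
records = [
    {"token": "wiser",      "year": 1620, "region": "North", "rank": "gentry"},
    {"token": "more wise",  "year": 1625, "region": "South", "rank": "clergy"},
    {"token": "quicker",    "year": 1680, "region": "North", "rank": "gentry"},
    {"token": "more quick", "year": 1640, "region": "South", "rank": "gentry"},
]

def subset(data, feature=None, **meta):
    """Constrain the data by a linguistic feature and metadata criteria."""
    out = data
    if feature:
        out = [r for r in out if feature(r["token"])]
    for key, value in meta.items():
        out = [r for r in out if r[key] == value]
    return out

# Linguistic feature: single-word inflectional comparative in -er.
is_er_comparative = lambda tok: tok.endswith("er") and " " not in tok

# Cross-question: -er comparatives used by the gentry, binned by half-century,
# ready to be plotted through time and space.
hits = subset(records, feature=is_er_comparative, rank="gentry")
by_period = Counter(r["year"] // 50 * 50 for r in hits)
print(by_period)
```

The same constrained subset could then feed further linguistic analysis or an interactive visualization, as described above.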
As a case study of defining data subsets based on structured data, we explore the computer-assisted filtering of -er derivatives in a historical corpus by cross-referencing types with gold-standard present-day data as well as the Oxford English Dictionary. If successful, the same procedure will later be applied to the study of the inflectional comparative -er. Our aim is to analyse sociolinguistic variation in the productivity of both derivational and inflectional -er, operating on the hypothesis that similar variation and change may be observed in both derivational and inflectional processes.
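The filtering step might look like the following sketch, in which an invented gold-standard wordlist and stoplist stand in for the present-day data and the OED checks described above:

```python
# Hypothetical gold-standard present-day -er nominals, and a stoplist of
# forms that merely end in -er; both are stand-ins for the real resources.
gold_er_nouns = {"baker", "singer", "writer", "keeper"}
non_derivatives = {"other", "never", "under", "water"}

historical_types = ["baker", "bakere", "water", "kepere", "singer", "other"]

def normalize(t):
    # Crude spelling normalization for final -e (illustrative only).
    return t[:-1] if t.endswith("ere") else t

candidates = []
for t in historical_types:
    norm = normalize(t)
    if norm.endswith("er") and norm not in non_derivatives:
        # Auto-accept types attested in present-day data; the remainder
        # would be checked against the OED or by hand.
        status = "accepted" if norm in gold_er_nouns else "check OED"
        candidates.append((t, status))

print(candidates)
```

Cross-referencing in this way narrows the manual workload to the residue of types that present-day data cannot confirm, which is where dictionary evidence earns its keep.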