Armadillo is a system for producing automatic domain-specific annotation on large repositories in a largely unsupervised way. It annotates by extracting information from different sources and integrating the retrieved knowledge into a repository.
The Armadillo project has previously been involved in Semantic Web research, mostly working towards producing structured information from less structured resources and redundant data. An early application of Armadillo was within the winning entry for the Semantic Web Challenge 2003: CSAKTiveSpace (http://www.informatik.uni-bremen.de/swc/submissions.html). Armadillo has since been tested and developed in a number of other domains, including artworks, academic publications and more conventional tasks, such as identifying and listing restaurants by location.
The basic idea behind Armadillo is to retrieve information according to a pre-agreed ontology, and to populate that ontology with instances. In the case of the Historical Data Mining project, the initial ontology will focus upon dates, names and places centred around 18th Century London. Armadillo combines data using an evidence-based approach. It uses the statistical likelihood of deviations in spelling, typographic formatting and contextual information to decide which mentions most likely refer to the same place, name or date. Of course, this is a task which a human can perform with ease, but on a large scale and across multiple resources it is hugely laborious and time-consuming. It is hoped that Armadillo's ontology-based approach and ability to discard redundant data will alleviate the problems associated with conducting research across distributed repositories.
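To illustrate the idea of merging spelling variants, here is a minimal sketch. It is not Armadillo's actual method: it swaps in a plain string-similarity ratio from Python's standard `difflib` module for the statistical model described above, and it ignores typographic and contextual evidence. The function names, the threshold of 0.8 and the sample place names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_variants(names, threshold=0.8):
    """Greedily group names that are similar to a group's first member.

    The threshold is an illustrative assumption, not a tuned value.
    """
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([name])
    return groups


# Hypothetical spelling variants of two London place names.
places = ["Cheapside", "Cheap Side", "Chepeside", "Southwark", "Southwarke"]
groups = group_variants(places)
for group in groups:
    print(group)
```

A real system would weigh several kinds of evidence together; the single ratio here only shows the grouping step.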
Figure: A diagram of the Armadillo Architecture
The Armadillo Architecture is based around Semantic Web Services (SWS). This means that the system's underlying functions are distributed in an environment (normally the web) and work in an independent way. They are also semantically enabled. By this, we mean that each function must have a semantically typed input and output; what the function uses and produces must have a defined meaning (a concrete object, an abstract concept, etc.).
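The notion of semantically typed inputs and outputs can be sketched as follows. This is an illustrative model only, not Armadillo's service interface: the `Service` class, the `chain` helper and the concept names ("Document", "TokenList", "DateList") are hypothetical. The point is that two services may be connected only when the output concept of the first matches the input concept of the second.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Service:
    """A function labelled with the ontology concepts it consumes and produces."""
    name: str
    input_concept: str
    output_concept: str
    run: Callable[[object], object]


def chain(first: Service, second: Service) -> Service:
    """Compose two services, refusing combinations whose concepts do not match."""
    if first.output_concept != second.input_concept:
        raise TypeError(
            f"{first.name} produces {first.output_concept}, "
            f"but {second.name} expects {second.input_concept}"
        )
    return Service(
        name=f"{first.name}>{second.name}",
        input_concept=first.input_concept,
        output_concept=second.output_concept,
        run=lambda value: second.run(first.run(value)),
    )


def _four_digit_tokens(tokens):
    """Keep tokens that are four-digit numbers once punctuation is stripped."""
    cleaned = (t.strip(".,;") for t in tokens)
    return [t for t in cleaned if len(t) == 4 and t.isdigit()]


tokenise = Service("tokenise", "Document", "TokenList", lambda text: text.split())
find_years = Service("find_years", "TokenList", "DateList", _four_digit_tokens)

pipeline = chain(tokenise, find_years)
years = pipeline.run("Tried at the Old Bailey in 1745, released 1747.")
print(years)
```

Attempting `chain(find_years, tokenise)` raises a `TypeError`, because a "DateList" is not a "Document".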
Input Documents:
To begin, Armadillo needs an ontology, an initial lexicon and a repository of documents. It is designed to be generic enough to cater for most domains, and should have little trouble adapting to our collection of 18th Century historical sources.
Internal Modules and External Web Services:
Because the Armadillo Architecture is based on the concept of Semantic Web Services, Armadillo's internal modules can employ external web services for performing sub-tasks. For example, a module designed to recognize researchers' names in a University Web Site could use a Named Entity Recognition system as a sub-service, in order to recognize generic Christian names. The main service could then combine this functionality with its own internal strategies to identify real researchers' names, as opposed to students' or secretaries' names.
Wrapper Induction:
Material on the web is usually formatted for use by people, and rendering it machine readable is a non-trivial task. Conventionally, customized information extraction procedures, known as wrappers, are used for this task. Wrapper Induction is a technique for automatically constructing wrappers using a selection of easily learnable but also moderately flexible wrapper models.
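As a rough illustration of wrapper induction, the sketch below learns a very simple wrapper: a pair of left and right delimiter strings inferred from a few labelled pages. This is one of the simplest possible wrapper models and is not claimed to be the one Armadillo uses. The function names, the HTML fragments and the restaurant names are invented for the example.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, value) pairs.

    Each value is assumed to occur in its page; the first occurrence is used.
    """
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    left, right = common_suffix(lefts), os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("examples share no usable delimiters")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract every string found between the learned delimiters."""
    left, right = wrapper
    results, pos = [], 0
    while True:
        start = page.find(left, pos)
        if start == -1:
            break
        start += len(left)
        end = page.find(right, start)
        if end == -1:
            break
        results.append(page[start:end])
        pos = end + len(right)
    return results


# Invented training pages with the target value labelled by hand.
examples = [
    ("<li><b>Red Lion</b> Holborn</li>", "Red Lion"),
    ("<li><b>Blue Anchor</b> Wapping</li>", "Blue Anchor"),
]
wrapper = induce_wrapper(examples)

new_page = "<ul><li><b>Green Man</b> Soho</li><li><b>White Hart</b> Bow</li></ul>"
names = apply_wrapper(wrapper, new_page)
print(wrapper, names)
```

The learned delimiters generalise only to pages that share the same markup pattern, which is why more flexible wrapper models are needed in practice.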
Information Extraction (IE):
This part of the Armadillo Architecture handles Adaptive Information Extraction from texts. Essentially, it spots information which fits the forms of annotation the system already recognises, and uses these to learn each new ontology.
Information Integration:
This part of Armadillo is charged with confirming newly extracted information using multiple pieces of evidence from different sources. For example, a new piece of information is considered confirmed if it is found in several different linguistic or semantic contexts.
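The confirmation rule can be sketched as a count of distinct contexts per candidate fact. This is a simplified reading of the description above, not Armadillo's integration algorithm: the function name, the `min_contexts` threshold of 2, the context labels and the sample observations are assumptions for illustration.

```python
from collections import defaultdict


def confirmed_facts(observations, min_contexts=2):
    """Return facts seen in at least `min_contexts` distinct contexts.

    `observations` is an iterable of (fact, context) pairs. Repeated
    sightings in the same context count only once.
    """
    contexts_by_fact = defaultdict(set)
    for fact, context in observations:
        contexts_by_fact[fact].add(context)
    return {
        fact
        for fact, contexts in contexts_by_fact.items()
        if len(contexts) >= min_contexts
    }


# Invented observations: (candidate fact, context in which it was found).
observations = [
    (("John Smith", "livedIn", "Cheapside"), "trial record"),
    (("John Smith", "livedIn", "Cheapside"), "parish register"),
    (("John Smith", "livedIn", "Southwark"), "trial record"),
    (("John Smith", "livedIn", "Southwark"), "trial record"),
]
confirmed = confirmed_facts(observations)
print(confirmed)
```

Here the Southwark fact stays unconfirmed because both sightings come from the same context.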
RDF Repository:
The Resource Description Framework (RDF) is a general-purpose language for representing information on the web (http://www.w3.org/TR/rdf-syntax-grammar/). Armadillo stores the ontological instances it extracts in a repository of RDF-conformant XML. This repository is the end result of its data mining activities and should hold the answers to the research questions posed.
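To give a sense of what such a repository entry might look like, the sketch below serialises a few extracted instances as RDF/XML using only Python's standard library. It is an illustration, not Armadillo's storage code: the `http://example.org/armadillo#` namespace, the property names and the instance data are hypothetical, and a real system would more likely use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/armadillo#"  # hypothetical ontology namespace

# Use readable prefixes instead of auto-generated ones in the output.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", EX_NS)


def to_rdf_xml(instances):
    """Serialise {subject URI: {property name: literal value}} as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for uri, properties in instances.items():
        description = ET.SubElement(
            root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": uri}
        )
        for name, value in properties.items():
            ET.SubElement(description, f"{{{EX_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Invented instance data for illustration.
instances = {
    "http://example.org/person/john-smith": {
        "name": "John Smith",
        "livedIn": "Cheapside",
    }
}
rdf_xml = to_rdf_xml(instances)
print(rdf_xml)
```

Each subject becomes an `rdf:Description` element identified by `rdf:about`, with one child element per property.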