A guide to project methodologies and participants
Project partners and funders
Connected Histories was created by a partnership between the University of Hertfordshire, the Institute of Historical Research, University of London, and the University of Sheffield. Natural language processing, indexing and the development of the search engine were carried out by the Humanities Research Institute (University of Sheffield). The website front end was implemented by the Institute of Historical Research, using designs provided by Mickey and Mallory. Evaluation was carried out by the Centre for Computing in the Humanities at King's College London. See below for Project staff.
The project was made possible by a generous grant from the JISC e-Content Capital Programme. We are also grateful for assistance from the Universities of Hertfordshire, London and Sheffield.
Connected Histories has not created any new digital content. Instead, it provides integrated access to electronic content already available on distributed websites. Our search engine does not search these resources directly. Instead, it searches indexes we have created from the full content of each resource. Our approach to indexing depends on the nature of the electronic resource available:
- databases, such as the Clergy of the Church of England Database, and semi-structured sources where the text is marked up with XML tags, such as the Old Bailey Proceedings Online, were processed by extracting identified information on names, places and dates into indexes;
- text which is largely unstructured, such as the British Newspapers 1600-1900, was processed using natural language processing in order to identify names, places and dates in the original texts. ANNIE, an open source information extraction system, was used, in conjunction with custom-scripted pattern-action rules, to apply named entity recognition to the texts. Gazetteers were constructed from a range of sources, including the already marked-up text and the Digimap gazetteer, and they evolved as text was marked up. This methodology is subject to a degree of error, the extent of which was measured by the evaluation process.
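The two indexing routes described above can be illustrated with a minimal sketch. It is not the project's code and does not use ANNIE: the tag names (`persName`, `placeName`, `date`), the function names and the sample documents are all hypothetical, and a simple whole-word gazetteer lookup stands in for full named entity recognition.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tag names; each real resource uses its own markup scheme.
ENTITY_TAGS = {"persName": "name", "placeName": "place", "date": "date"}


def index_marked_up(doc_id, xml_text, index):
    """Route 1: copy entities that are already tagged in the XML into the index."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        kind = ENTITY_TAGS.get(elem.tag)
        if kind:
            # Join the element's text and normalise whitespace.
            value = " ".join("".join(elem.itertext()).split())
            index[(kind, value.lower())].add(doc_id)


def index_unstructured(doc_id, text, gazetteer, index):
    """Route 2: find gazetteer entries in plain text by whole-word matching.

    This is a much simpler stand-in for named entity recognition.
    """
    for kind, terms in gazetteer.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                index[(kind, term.lower())].add(doc_id)


# Illustrative data only, not taken from any of the resources.
index = defaultdict(set)
index_marked_up(
    "trial-1",
    "<trial><persName>John Smith</persName> of "
    "<placeName>Holborn</placeName></trial>",
    index,
)
gazetteer = {"place": ["Holborn", "York"]}
index_unstructured(
    "news-1", "A fire broke out in Holborn last night.", gazetteer, index
)
```

In this sketch a search for the place "holborn" is answered from `index` alone and returns both document identifiers, which mirrors the point that searches run against the indexes rather than against the source websites themselves.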
The search engine uses the Apache Lucene text search engine, within a Java environment. It is made available to the Connected Histories website via a JavaServer Pages (JSP) application programming interface (API), which returns results in an XML format to the interface hosted at the Institute of Historical Research.
Evaluation of the natural language processing and search engine was carried out by the Centre for Computing in the Humanities at King's College London.
For the natural language processing (NLP), text samples from resources were manually marked up with names, places and dates, and the results compared with the markup produced by the natural language processing. The numbers of true positives, false positives and false negatives were counted in order to generate measures of precision (the number of entities correctly classified divided by the number of entities identified by the NLP) and recall (the number of correctly classified entities divided by the total number of entities that are actually of that type). These two measures were combined into a single measure, the F-measure, which can vary between 0 (totally inaccurate) and 1 (completely accurate).
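The measures defined above can be computed as in the following sketch. The text does not say how precision and recall were weighted when they were combined, so the balanced F-measure (the harmonic mean, often called F1) is assumed here. The function name and the counts are illustrative and are not figures from the evaluation.

```python
def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-measure) from entity counts.

    Assumes the balanced F-measure (harmonic mean of precision and recall).
    """
    identified = true_pos + false_pos   # entities the NLP marked up
    actual = true_pos + false_neg       # entities in the manual markup
    precision = true_pos / identified if identified else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Illustrative counts: 60 correct, 20 spurious, 40 missed.
p, r, f = precision_recall_f(60, 20, 40)
print(f"precision={p:.2f} recall={r:.2f} F={f:.3f}")
# prints "precision=0.75 recall=0.60 F=0.667"
```

Because the harmonic mean is pulled towards the lower of the two values, a high F-measure requires both precision and recall to be high.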
The results of this process indicate that the success of the natural language processing varied significantly, depending on the structure of the original text (the extent to which it follows expected language patterns) and, more importantly, the quality of the transcription. Text generated by optical character recognition (OCR) produces less accurate results from the natural language processing than rekeyed text, because errors in the OCR make the text, both the words to be marked up and their surrounding context, less recognisable to a machine processor. The best results were for British History Online (F-measures between 0.64 and 0.74) and the Parliamentary Papers (0.625 to 0.775). Owing to OCR errors, the worst results were for the 17th- and 18th-century British Newspapers (0.22 to 0.52). In general, the best results were found for locations and the worst for persons and dates, though persons and dates in the Parliamentary Papers and dates in British History Online also achieved good results.
The search engine was evaluated by 1) ensuring that it was not possible to break searches; 2) checking whether the searches produced relevant results; and 3) testing the links to the distributed websites.
Access to subscription sources
Some of the resources searched by Connected Histories are only accessible via subscription. While Connected Histories allows users to search these resources and examine snippet results free of charge, we do not and cannot provide non-subscribers full access to these resources. To arrange such access, it is necessary to contact the proprietors of the relevant resource directly.
If you do have subscription access to a resource and encounter a login page you cannot get through, you should first log in to that resource using your normal access procedure before clicking on links in Connected Histories.
Connected Histories is a not-for-profit project whose sole objective is to provide more efficient access to electronic resources for those engaged in researching and teaching British history. Access to this website is free to all users. Since it costs money to maintain the site, and the grant which funded its creation has ended, it is necessary to obtain separate funding to ensure its continuation. For this reason, the site includes advertising. All profits derived from advertising will be devoted to maintaining and upgrading the site.
Since Connected Histories is a free resource with no recurrent funding, and advertising revenue can only fund part of the running costs of this website, we welcome donations to help us keep this website free and available to all users. All donations will be devoted exclusively to maintaining and upgrading the site. To make a donation click here. Payments may be made using Gift Aid: for every £1 donated to Connected Histories, Gift Aid adds an additional 28p. Donations are made through Google Checkout, and details of its secure payments service can be found here.
Future sources to be included
We welcome proposals for the inclusion of additional resources. If you are responsible for an electronic resource which you believe is appropriate for Connected Histories, please consult our New content information page.
The Directors of this project are Professor Tim Hitchcock (University of Sussex), Professor Robert Shoemaker (University of Sheffield), and Dr Jane Winters (Institute of Historical Research).
The Project Manager was Dr Sharon Howard.
Dr Matthew Davies (Centre for Metropolitan History, Institute of Historical Research) was an academic adviser.
- Michael Pidd manages the technical work at the Humanities Research Institute, University of Sheffield.
- Katherine Rogers (Humanities Research Institute) is the principal Technical Officer, in charge of data processing and development of the search engine.
- Jamie McLaughlin (Humanities Research Institute) is a Technical Officer.
- Bruce Tate (Institute of Historical Research) is responsible for the construction of the website front end.
- Jonathan Blaney and Dr Peter Webster (Institute of Historical Research) contributed to the development of the background pages.
- Jamie Norrish and Miguel Vieira (Centre for Computing in the Humanities, King's College London) performed the evaluation.
We are grateful to the following for their help in bringing this project to completion:
- Alastair Dunning, Programme Manager, Digitisation at JISC, for helpful advice at every stage of the project.
- Our advisory panel, for providing helpful feedback throughout the project. The panel included Arthur Burns, Richard Deswarte, Alastair Dunning, Ian Galbraith, Mark Greengrass, Ed King, Rob Newman, Sarah Richardson, Simon Tanner, Miles Taylor and David Thomas.
- Participants in focus groups at the Universities of Hertfordshire, London and Sheffield, who provided useful feedback on our plans.
- Sarah Charlton and the design staff of Mickey and Mallory, who not only designed the templates for this website but also provided valuable advice on many issues.
- And above all, the creators of the resources included in Connected Histories for agreeing to participate in this project, and for providing us with copies of their data so that we could create the indexes which are searched by this website.