About

A guide to project methodologies and participants

Connected Histories provides an integrated search facility for interrogating major electronic resources in early modern and 19th-century British history. This page provides information about how we carried out this project and explains our future plans. You may also find it useful to consult our Privacy policy and Terms of use.

Contents of this article

Project partners and funders
Technical methods
Evaluation
Access to subscription sources
Advertising policy
Future sources to be included
Project staff
Acknowledgements

Project partners and funders

Connected Histories was created by a partnership between the University of Hertfordshire, the Institute of Historical Research, University of London, and the University of Sheffield. Natural language processing, indexing and the development of the search engine were carried out by the Digital Humanities Institute (University of Sheffield). The website front end was implemented by the Institute of Historical Research, using designs provided by Mickey and Mallory. Evaluation was carried out by the Centre for Computing in the Humanities at King's College London. See below for Project staff.

The project was made possible by a generous grant from the JISC e-Content Capital Programme. We are also grateful for assistance from the Universities of Hertfordshire, London and Sheffield.

In 2019 the website front end was transferred to the Digital Humanities Institute, University of Sheffield, which now has full responsibility for hosting this website.

Technical methods

Connected Histories has not created any new digital content. Instead, it provides integrated access to electronic content already available on distributed websites. Our search engine does not search these resources directly. Instead, it searches indexes we have created from the full content of each resource. Our approach to indexing depends on the nature of the electronic resource available:

databases, such as the Clergy of the Church of England Database, and semi-structured sources where the text is marked up with XML tags, such as the Old Bailey Proceedings Online, were processed by extracting identified information on names, places and dates into indexes;
text which is largely unstructured, such as the British Newspapers 1600-1900, was processed using natural language processing in order to identify names, places and dates in the original texts. ANNIE, an open source information extraction system, was used, in conjunction with custom-scripted pattern action rules, to apply named entity recognition to the texts. Gazetteers were constructed from a range of sources, including the already marked-up text and the Digimap gazetteer, and they evolved as text was marked up. This methodology is subject to a degree of error, the extent of which was measured by the evaluation process.
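
The gazetteer-plus-rules approach described above can be illustrated with a minimal Python sketch. This is not ANNIE and does not use its API; the gazetteer entries, the two pattern rules and the function names tag_entities and build_index are all hypothetical, chosen only to show how recognised names, places and dates might feed an index.

```python
import re
from collections import defaultdict

# Hypothetical gazetteers; real ones would be far larger and would
# grow as more text is marked up.
PLACES = {"London", "Sheffield", "York"}
TITLES = {"Mr", "Mrs", "Dr", "Sir"}

# Pattern rule (assumed): a title followed by one or two capitalised
# words is treated as a person.
PERSON = re.compile(
    r"\b(?:%s)\.? [A-Z][a-z]+(?: [A-Z][a-z]+)?" % "|".join(sorted(TITLES))
)
# Pattern rule (assumed): a four-digit year from 1500 to 1899 is a date.
YEAR = re.compile(r"\b1[5-8]\d{2}\b")
TOKEN = re.compile(r"[A-Za-z]+")


def tag_entities(text):
    """Return (type, surface form) pairs found in the text."""
    entities = []
    for match in PERSON.finditer(text):
        entities.append(("person", match.group()))
    for match in TOKEN.finditer(text):
        if match.group() in PLACES:  # gazetteer lookup
            entities.append(("place", match.group()))
    for match in YEAR.finditer(text):
        entities.append(("date", match.group()))
    return entities


def build_index(documents):
    """Map each recognised entity to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for entity in tag_entities(text):
            index[entity].add(doc_id)
    return index


documents = {
    "d1": "Mr John Smith of Sheffield was tried in London in 1785.",
    "d2": "A fire in York, 1801.",
}
index = build_index(documents)
```

A search for a place or date then becomes a lookup in the index rather than a scan of the original texts. Rules this simple also show where errors arise: a surname that happens to match a place name, or a mis-transcribed title, would be tagged wrongly.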

The search engine is built on the Apache Lucene text search library and runs in a Java environment. It is made available to the Connected Histories website through a JavaServer Pages application programming interface (API), which returns results in XML format to the website interface. The interface is hosted by the Digital Humanities Institute.
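
As a rough illustration of how an interface might consume XML results of this kind, the Python sketch below parses a small response. The element and attribute names (results, hit, resource, url, snippet) and the example.org addresses are invented for the example; the actual format returned by the API is not documented on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; the real API's XML schema may differ entirely.
SAMPLE_RESPONSE = """<results total="2">
  <hit resource="Old Bailey Proceedings Online" url="https://example.org/record/1">
    <snippet>John Smith was indicted for theft</snippet>
  </hit>
  <hit resource="British History Online" url="https://example.org/record/2">
    <snippet>the parish of St Giles</snippet>
  </hit>
</results>"""


def parse_results(xml_text):
    """Turn the XML response into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("hit"):
        hits.append(
            {
                "resource": hit.get("resource"),
                "url": hit.get("url"),
                "snippet": (hit.findtext("snippet") or "").strip(),
            }
        )
    return hits


results = parse_results(SAMPLE_RESPONSE)
```

Each dictionary carries what a results page needs: the name of the source, a snippet to display, and a link back to the distributed website that holds the full record.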

Evaluation

Evaluation of the natural language processing and search engine was carried out by the Centre for Computing in the Humanities at King's College London.

For the natural language processing (NLP), text samples from resources were manually marked up with names, places and dates, and the results were compared with the markup produced by the natural language processing. Statistics were compiled of the numbers of true positives, false positives and false negatives, in order to generate measures of precision (the number of entities correctly classified divided by the number of entities identified by the NLP) and recall (the number of correctly classified entities divided by the total number of entities that are actually of that type). These two measures were combined into a single measure, the F-measure, which can vary between 0 (totally inaccurate) and 1 (completely accurate).
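
The arithmetic behind these measures can be sketched in a few lines of Python. The sketch assumes the balanced F-measure (the harmonic mean of precision and recall, often called F1); this page does not state which weighting the evaluators used. The function names and the representation of an entity as a (type, start, end) tuple are illustrative only.

```python
def count_outcomes(manual, automatic):
    """Compare manually marked entities with those found by the NLP.

    Both arguments are sets of (type, start, end) tuples.
    """
    true_pos = len(manual & automatic)   # marked by both
    false_pos = len(automatic - manual)  # found by the NLP only
    false_neg = len(manual - automatic)  # missed by the NLP
    return true_pos, false_pos, false_neg


def precision_recall_f(true_pos, false_pos, false_neg):
    """Return (precision, recall, F-measure), assuming balanced F1."""
    identified = true_pos + false_pos  # everything the NLP marked up
    actual = true_pos + false_neg      # everything really of that type
    precision = true_pos / identified if identified else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


precision, recall, f_measure = precision_recall_f(60, 20, 40)
```

With 60 true positives, 20 false positives and 40 false negatives, precision is 60/80 = 0.75, recall is 60/100 = 0.6, and the balanced F-measure is about 0.67.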

The results of this process indicate that the success of the natural language processing varied significantly, depending on the structure of the original text (the extent to which it follows expected language patterns) and, more importantly, the quality of the transcription. Text generated by optical character recognition (OCR) produces less accurate results from the natural language processing than rekeyed text, because errors in the OCR make both the words to be marked up and their surrounding context less recognisable to a machine processor. The best results were for British History Online (F-measures between 0.64 and 0.74) and the Parliamentary Papers (0.625 to 0.775). Owing to the OCR, the worst results were for the 17th- and 18th-century British Newspapers (0.22 to 0.52). In general, the best results were found for locations and the worst for persons and dates, though persons and dates in the Parliamentary Papers and dates in British History Online also achieved good results.

The search engine was evaluated by 1) ensuring that it was not possible to break searches; 2) checking whether the searches produced relevant results; and 3) testing the links to the distributed websites.

Access to subscription sources

Some of the resources searched by Connected Histories are only accessible via subscription. While Connected Histories allows users to search these resources and examine snippet results free of charge, we do not and cannot provide non-subscribers full access to these resources. To arrange such access, it is necessary to contact the proprietors of the relevant resource directly.

If you do have subscription access to a resource and encounter a login page you cannot get through, you should first log in to that resource using your normal access procedure before clicking on links in Connected Histories.

Advertising policy

Connected Histories is a not-for-profit project whose sole objective is to provide more efficient access to electronic resources for those engaged in researching and teaching British history. Access to this website is free to all users. Since it costs money to maintain the site, and the grant which funded its creation has ended, it is necessary to obtain separate funding to ensure its continuation. For this reason, the site includes advertising. All profits derived from advertising will be devoted to maintaining and upgrading the site.

Future sources to be included

We welcome proposals for the inclusion of additional resources. If you are responsible for an electronic resource which you believe is appropriate for Connected Histories, please consult our New content information page.

Project staff

The Directors of this project are Professor Tim Hitchcock (University of Sussex), Michael Pidd (University of Sheffield) and Professor Robert Shoemaker (University of Sheffield).
Professor Jane Winters (University of London) served as a director of the project from 2011 to 2019.
The Project Manager was Dr Sharon Howard.
Dr Matthew Davies (Centre for Metropolitan History, Institute of Historical Research) was an academic adviser.
Katherine Rogers (The Digital Humanities Institute) is the principal Research Software Engineer, in charge of data processing and development of the search engine.
Jamie McLaughlin (The Digital Humanities Institute) is a Research Software Engineer and assisted with the development of the search engine.
Matthew Groves (The Digital Humanities Institute) is a Research Software Engineer and is responsible for the construction of the revised website front end.
Bruce Tate (Institute of Historical Research) was responsible for the construction of the original website front end.
Jonathan Blaney and Dr Peter Webster (Institute of Historical Research) contributed to the development of the background pages.
Jamie Norrish and Miguel Vieira (Centre for Computing in the Humanities, King's College London) performed the evaluation.

Acknowledgements

We are grateful to the following for their help in bringing this project to completion:

Alastair Dunning, Programme Manager, Digitisation at JISC, for helpful advice at every stage of the project.
Our advisory panel, for providing helpful feedback throughout the project. The panel included Arthur Burns, Richard Deswarte, Alastair Dunning, Ian Galbraith, Mark Greengrass, Ed King, Rob Newman, Sarah Richardson, Simon Tanner, Miles Taylor and David Thomas.
Participants in focus groups at the Universities of Hertfordshire, London and Sheffield, who provided useful feedback on our plans.
Sarah Charlton and the design staff of Mickey and Mallory, who not only designed the templates for this website but also provided valuable advice on many issues.
And above all, the creators of the resources included in Connected Histories for agreeing to participate in this project, and for providing us with copies of their data so that we could create the indexes which are searched by this website.