LibGuides: AM Impact: Digital Scholarship

What data is available?

Employing artificial intelligence, machine learning and neural networks, Handwritten Text Recognition (HTR) allows keyword searching across handwritten manuscript material. Across all primary source collections published on the AM Quartex platform, HTR transcript technology also allows the downloading of uncorrected transcripts in .txt format.

Collections such as East India Company and Colonial Caribbean have millions of handwritten pages that could be used for text and data mining; other collections, such as Literary Print Culture and Life at Sea, have hundreds of thousands of pages. Mexico in History is a landmark collection: it is the first AM database almost entirely in Spanish to have HTR applied to it.
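As an illustration of the kind of text mining an uncorrected .txt transcript supports, here is a minimal Python sketch that counts word frequencies. The sample string is invented and stands in for the contents of a downloaded transcript; the function name and the minimum word length are assumptions made for this example, not part of any AM tool.

```python
import re
from collections import Counter


def word_frequencies(text: str, min_length: int = 3) -> Counter:
    """Count lower-cased word tokens that have at least min_length letters."""
    # [^\W\d_]+ matches runs of letters, including accented ones,
    # so Spanish-language transcripts are tokenised sensibly too.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if len(t) >= min_length)


# Invented sample text; in practice this would be read from a transcript file,
# e.g. open(path, encoding="utf-8").read()
sample = "The ship left the harbour. The harbour was calm, the crew was not."
counts = word_frequencies(sample)
print(counts.most_common(3))
```

Because HTR transcripts are uncorrected, counts like these are best treated as approximate: misread words will be split across several spellings.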

A handwritten page next to an uncleaned text file transcript

A11's response to 1983 Summer directive. Digitised from University of Sussex Special Collections in Mass Observation Project.

Mass Observation Project, 1981-2009

Optical character recognition (OCR) is a technology that converts printed and typed documents into machine-readable files. Collections like those in AM Archives Direct, a unique digitisation of key Foreign Office and other British government file classes from The National Archives in London, contain millions of pages of typed and printed dispatches, memoranda, faxes, telegrams and newspaper articles from around the world. Other collections, such as Interwar Culture and Indigenous Newspapers in North America, contain newspapers and periodicals that also benefit from OCR technology and transcripts that can be used for text and data mining projects.
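Uncleaned OCR output usually needs some light tidying before analysis. The following Python sketch shows one possible approach, assuming the transcript is plain text in which blank lines separate paragraphs; the function name and the sample text are invented for illustration and do not describe AM's own processing.

```python
import re


def clean_transcript(raw: str) -> str:
    """Apply light, lossy clean-up to an uncorrected transcript."""
    # Rejoin words hyphenated across a line break, e.g. "tele-\ngram" -> "telegram".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Treat blank lines as paragraph breaks and unwrap line breaks within paragraphs.
    paragraphs = [" ".join(p.split()) for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)


# Invented sample standing in for a raw OCR transcript.
raw = "CONFIDENTIAL   tele-\ngram from\nPretoria\n\n\nReceived  14 March"
print(clean_transcript(raw))
```

The hyphen rule is deliberately simple: it also joins genuine compounds that happen to break at a line end ("well-\nknown" becomes "wellknown"), so it is worth keeping the original file alongside any cleaned version.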

A typed telegram next to an uncleaned text file transcript

Case detailing alleged crimes by Nelson Mandela, African National Congress (ANC) leader, imprisoned in South Africa, Winnie Mandela, his wife, and Zindzi Mandela, his daughter, 1990. Digitized from The National Archives, UK, in Apartheid South Africa, 1948-1994.

Apartheid South Africa, 1948-1994

The first audio descriptions published by AM were for silent film collections like Victorians on Film and British Newsreels, but since the publication of Hindi Cinema in 2024, audio descriptions have become possible for non-silent films too, with descriptive audio playing in the gaps of the original audio track.

Audio descriptions aim to provide a clear summary of on-screen activity, so visually impaired users can interrogate the videos fully. However, these descriptions also provide additional transcripts and textual data that can be requested for use for text and data mining projects.
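Once a description transcript has been obtained as plain text, a simple keyword-in-context search is one way to explore it. The Python sketch below is illustrative only: the sample description is invented rather than taken from any AM collection, and the function name and context width are assumptions.

```python
import re


def keyword_in_context(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return each whole-word, case-insensitive match with `width` characters of context on each side."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        hits.append(text[start:end].replace("\n", " "))
    return hits


# Invented sample text standing in for an audio description transcript.
description = (
    "A magician lifts a cone from the table. "
    "A kitten sits underneath. The kitten jumps down."
)
hits = keyword_in_context(description, "kitten", width=15)
for line in hits:
    print(line)
```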

A still from an old film showing a young magician in a suit about to lift a cone to reveal something underneath it, next to a transcript in which the word "kitten" is highlighted

The Magic Extinguisher, 1901. Digitized from The British Film Institute in Victorians on Film.

Victorians on Film: Entertainment, Innovation and Everyday Life

Each AM collection contains metadata: the data that describes the source in the catalogue from the archive where the physical copy of the source is held. In addition to the archive's own metadata, the Editorial team at AM also enriches the metadata in consultation with an expert academic board to aid discovery of material.

Common metadata categories include author, title, date, document type, place, language and archive collection and sub-collection names. Often individuals and organisations are also captured in the metadata.
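Metadata of this kind lends itself to simple counting and grouping. The Python sketch below tallies records by one field; the records and the field names (such as document_type) are hypothetical, chosen to mirror the categories listed above, and the format of a real metadata export may differ.

```python
from collections import Counter

# Hypothetical records; a real export may use different field names and structure.
records = [
    {"title": "Letter A", "date": "1901", "document_type": "Correspondence", "language": "English"},
    {"title": "Photograph B", "date": "", "document_type": "Photograph", "language": ""},
    {"title": "Letter C", "date": "1903", "document_type": "Correspondence", "language": "French"},
]


def tally(rows: list[dict], field: str) -> Counter:
    """Count records by the value of one metadata field, grouping blanks as 'Unknown'."""
    return Counter(row.get(field) or "Unknown" for row in rows)


print(tally(records, "document_type"))
print(tally(records, "language"))
```

Grouping empty values under "Unknown" makes gaps in the catalogue visible instead of silently dropping those records.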

An image of a crowd gathered around a tree in Africa, with an image of the metadata associated with the image next to it.

[A crowd listening], Photographs, n.d. Digitised from Bodleian Libraries in Africa and the New Imperialism: European Borders on the African Continent, 1870-1914.

Africa and the New Imperialism: European Borders on the African Continent, 1870-1914

All primary source collections published by AM include a number of research and teaching tools, from guides to archival collections to academic essays and digital exhibitions. In addition, a number of AM collections contain visualisations of archival data, from interactive maps of sea journeys and commodity trade to a visualisation of more than 5,000 manuscript items from The Florence Nightingale Papers, revealing developing themes in her correspondence over time in Medical Services and Warfare.

The Eighteenth Century Drama database includes an open-access feature called The London Stage Database, which extracts textual data from playbills, newspapers, theatrical diaries and more. It serves as a master directory of actors, plays and theatres in London between 1660 and 1800. The database is an analysis tool that illustrates trends via data associations and visualisations, and it is cross-searchable, providing researchers with new pathways into digital materials.

All the data used to create these secondary features is also available to institutions for their own teaching and research projects.

Images of Florence Nightingale's letters and manuscripts and a still from a coxcomb data visualisation.

The Nightingale Papers - Interactive Browsing Tool in Medical Services and Warfare.

Medical Services and Warfare

AM Research Skills contains example case studies for working with data in history, from the presentation of a data set to understanding how to interrogate, interpret and use the data within it.

Here are some Datasets in AM Research Skills:

Rebecca Crites: The "criminal type": How to analyse historical statistics and challenge the neutrality of data
Terry Reimer: Surviving the Civil War: Historic Medical Data in Hospital Registers from Frederick, Maryland
Jack Newman: Mining the Medieval: Applying Text Analysis to a Fourteenth Century Court Roll
Joanna E. Taylor and Ian N. Gregory: Creating a Literary GIS of the English Lake District
Hannah Knox Tucker: Using Data in Early-Modern Historical Research: A Case Study of the 1733 Virginia Shipping Returns
Barry Godfrey: Assessing and Presenting Data Drawn from a Nineteenth-Century Report on the Transportation of Convicts to Australia
Frances Richardson: Welsh Adult Male Occupations c.1817: Evidence from Parish and Nonconformist Baptism Records
Giovanni Colavizza: How to use Automated Entity Recognition to find marginalised voices in colonial documents
Wouter Raaijmakers: Quantifying Colonial Newspapers: A case study of the internal slave trade at the Cape, 1830-34
Olly Ayers: Switching the Lens: Constructing Personal Narratives through Colonial History Datasets

Text and Data Mining: policy, restrictions and licence agreement

AM recognises the benefits that Data Mining has for new research in the Humanities and Social Sciences and we are committed to enabling these research methods on the following principles:

We allow Data Mining/Text Analysis by "Authorised Users" for fair use/academic research
Secure transfer of data to a university server can be made via FTP on submission of the information form.
Data can be extracted from the main collection website by automated software, provided we are informed in advance so that we can monitor server performance. We reserve the right to restrict this operation if it impacts standard online usage for our customers generally.
We are committed, where possible, to apply text analysis and data visualisation functionality within our latest products.
Data mining as an activity is no different from all other usage of our products. It has to conform to all the standard requirements in our licence agreement e.g. it is carried out by Authorised Users under Fair Use academic purposes.

Extract of Standard User Licence Agreement:

Subject to all other provisions of our User Licence Agreement and save for the circumstances (as set out in section III of this Agreement) in which the Licensor’s prior written consent is required, the Licensee and the Authorised Users may use the Licensed Materials to perform and engage in text mining /data mining activities in relation to the Licensed Materials for legitimate academic research and other non-commercial educational purposes, without obtaining the Licensor’s prior written consent.

Electronic analysis of data from our products is permitted as outlined above; however there are two key elements that mean we have to have additional processes in place to ensure the following:

Performance of live product websites for standard usage is not damaged by automated data mining software crawling online websites.
Large volumes of extracted data, or full data sets provided from the products, are stored securely, so that the data is not exposed to unauthorised or open usage and the User Licence Agreement is not breached.

As a result, any significant automated data extraction or provision of large volumes of data is unauthorised unless a written request has been received and, in the case of offline data supply, permission has been granted in writing. Provided that suitable assurances as to the purpose and security of the research are given on completion of a form, this permission will not be unreasonably withheld.

Extract of relevant section of standard user licence agreement:

Section III

In order to protect the integrity of server performance for the Licensee’s customers, automated extraction of data directly from the Licensed Materials online (for example only, by the use of data mining software) is only permitted after notification to the Licensor for performance monitoring purposes, and if such automatic extraction of data does not affect the performance of the Licensor’s servers. In the event that the Licensor’s servers are negatively impacted, the Licensor reserves the right to decline and prevent access to the Licensed Materials to stop any disruption to the Licensor’s business.

As standard with no further permissions:

Secure transfer of data to a university server can be made via FTP on submission of the information form.
An offline copy of data can be provided on a hard drive for secure local storage and analysis. Under current agreements this is limited to a 3-year storage period, after which a renewal can be requested or, if the project is complete, the original data (not any research material) must be deleted.

Extract of relevant part of licence:

On submission to the Licensor of completed form outlined in Appendix A, an offline copy of data from the Licensed Materials for Data/Text Mining purposes can be made available to be securely hosted locally and accessed by Authorized Users. Local hosting for each Data/ Text Mining purpose must not exceed five years unless further written consent is provided by Licensor; after which agreed period the data must be returned or confirmed as destroyed within 15 days.

Licensor and copyright holder of Licensed Materials must be acknowledged in published text analysis research results derived from the Licensed Materials.

Request data

Make a TDM request

At AM, we recognise the benefits that Data Mining has for new research in the Humanities and Social Sciences. Our aim has been to make the process of requesting data as simple as possible, and to make that data available for free.

In order to make a text and data mining request, simply follow the link above to fill in a form. The form will go to our Customer Support team, who will then process the request and send the data to you via FTP.

Webinar: Digital Humanities with AM

Digital Humanities with AM

Watch a recording of a webinar with Dr Ben Lacey, Head of Engagement at AM, for an overview of how you can use your AM collections for Digital Humanities scholarship. In the recording, Ben explains how to request data from the collections for use in text and data mining. He also presents a case study of a project that used a full-text data set, as well as examples of how instructors have applied the content of these collections in a more introductory way, focusing on digital presentation of student work.