Tuesday, 7 February 2012

Assessing legacy historical research data: not just a technical matter


Overview

As part of gathering a knowledge base about research data training needs, I have been looking at historical research datasets created by our SHARD partner IHR, and in particular by the Centre for Metropolitan History (CMH), which is based within IHR at UoL. CMH, established by the IHR in 1988, is one of the world’s leading centres for the study of the history of London and other metropolises. It specialises in innovative research projects, covering a wide range of periods, themes and problems in metropolitan history, publishing the results and data online and in print. ULCC carried out a survey of the legacy data held at CMH a few years ago, so we knew we had some good material for our study. The data covers metropolitan history, looking at health and social issues in London and at comparative studies with other cities.

To start off the assessment I had to get my hands on the data. We agreed to share the data on a shared drive at the University of London. In practice, however, it wasn’t so simple: the size of the data was an issue, among other things. We then decided to use Dropbox, removing the data once it had been assessed. The experience clarified the need for procedures and a client-focussed approach since, to state the obvious, different people and different departments have different needs. It also reinforced that old adage ‘assume nothing’.

Assessing the data

To assess the data we developed a set of questions based loosely on the AIDA and CARDIO approaches to data assessment. We looked to identify gaps in data management in relation to preservation.

Storage and identification: This consisted of basic information about where the data had been stored, its name and its unique identifier. Where had the data been stored? Was it generally accessible?

Description: What metadata had been provided, both descriptive and technical? Did the researcher/data creator describe how the data was created and processed, and using which software? What, if any, documentation had been provided? Were data models or schemas provided?

Structure and organisation: How well was the structure of the data described, and to what level within the dataset? How well described were the fields? Was any information provided about how to process the data or how the data had been processed? If the data had been encoded, had the codes been made clearly and easily accessible?

Access and preservation: Can the files be opened? If not, what are the issues?

Legal: Here we looked at any information describing copyright and any IPR which would affect the preservation and sharing of the data. Is anything missing?
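As a rough illustration of how the questions above could be recorded consistently from one dataset to the next, here is a minimal Python sketch. It is not the form we actually used: the section keys, the shortened question wording and the `DatasetAssessment` class are all hypothetical, paraphrased from the list above.

```python
from dataclasses import dataclass, field

# Hypothetical checklist, paraphrased from the assessment questions above.
CHECKLIST = {
    "storage_and_identification": [
        "location recorded",
        "unique identifier assigned",
        "generally accessible",
    ],
    "description": [
        "descriptive metadata provided",
        "technical metadata provided",
        "documentation provided",
        "data model or schema provided",
    ],
    "structure_and_organisation": [
        "structure described",
        "fields described",
        "processing described",
        "codes explained",
    ],
    "access_and_preservation": ["files can be opened"],
    "legal": ["copyright and IPR stated"],
}


@dataclass
class DatasetAssessment:
    """Answers for one dataset, keyed by (section, question)."""

    name: str
    answers: dict = field(default_factory=dict)

    def record(self, section: str, question: str, satisfied: bool) -> None:
        # Reject questions that are not on the checklist, to keep records consistent.
        if question not in CHECKLIST.get(section, []):
            raise KeyError(f"unknown question: {section} / {question}")
        self.answers[(section, question)] = satisfied

    def gaps(self) -> dict:
        """Return, per section, the questions answered 'no' or not answered at all."""
        found = {}
        for section, questions in CHECKLIST.items():
            missing = [
                q for q in questions
                if not self.answers.get((section, q), False)
            ]
            if missing:
                found[section] = missing
        return found


# Made-up example: the files open, but nothing is said about copyright.
example = DatasetAssessment("hypothetical dataset")
example.record("access_and_preservation", "files can be opened", True)
example.record("legal", "copyright and IPR stated", False)
```

Calling `example.gaps()` then lists every section that still has unanswered or negative questions, which is the kind of gap in data management the assessment was looking for.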

Some results of the assessment:

The data were safe, as they were being backed up every night by the central computer service at the University of London, but they were stored locally on a secure network drive and were not very accessible: no one beyond CMH would find them unless they already knew about them. They do not appear in any catalogue or database of holdings. Access was thus extremely limited from the outset. Is this what we want to happen to our research data? Are researchers really bothered about what happens to their research data once their results are published? Are they encouraged to care? Why should they bother?

There were some very good examples of projects providing guides and introductions to the data: about half of the datasets came with some sort of documentation. However, the approach lacked consistency, and the remaining datasets had no documentation at all, which left one pretty much in the dark about their purpose and context as well as the nitty-gritty of the technology and methodologies used.

The data itself consisted of text-based files, some structured numerical data, some audio and some mapping data. The applications used included Oracle, dBase, WordPerfect, Notepad and Adobe files, GIS software such as ArcView, and Microsoft in its various forms: Excel, Access and Word. All of these were used in their various incarnations over time, with the result that some files were too old and too proprietary to be opened. OpenOffice proved very helpful at opening some files which were too old to be opened by the current versions of their own software. The applications used were not explicitly indicated anywhere, with two exceptions where this was noted in the guide to the data. Of all the datasets I looked at in the study, two had documents describing aspects of the dataset itself, from general descriptive information to technical information about operating systems; others had guides introducing the dataset generally; and the rest had nothing at all. In this study I found that access to the data was hampered more by insufficient information about its organisation and structure than by technical problems. I could open most of the data I accessed, but without any overview of the data it took time for it to make sense. Often this was time I didn't have!

At a more detailed level there were many instances of fields which were encoded but with no explanation of the codes available, thus rendering the data unusable. Half of the datasets assessed did not include anything explaining where to find these codes or how to use them. In addition, many seemed to use authority files for place names and personal names, but few referred to the standard used. There was little if any evidence that IPR or other legal issues had been considered. No statements about use or sharing were in evidence, with the exception of one dataset. An obvious exception to this was an oral history project, which was very explicit about copyright and permissions and was generally very well documented. Oral history has traditionally been strong on good data collection and management procedures.
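To make the point about unexplained codes concrete, here is a small Python sketch of the kind of check that a codebook shipped with the data would allow. Everything in it is invented for illustration: the field names, the codes, their meanings and the `unexplained_codes` function do not come from any of the CMH datasets.

```python
def unexplained_codes(rows, codebook):
    """Return, per coded field, the values that the codebook does not explain.

    rows     -- list of dicts, one per record
    codebook -- dict mapping a field name to a dict of {code: meaning}
    """
    missing = {}
    for row in rows:
        for field_name, explained in codebook.items():
            value = row.get(field_name)
            # Only flag values that are present but absent from the codebook.
            if value is not None and value not in explained:
                missing.setdefault(field_name, set()).add(value)
    return {name: sorted(codes) for name, codes in missing.items()}


# Invented sample records and codebook, for illustration only.
rows = [
    {"occupation": "A1", "parish": "07"},
    {"occupation": "B2", "parish": "99"},
]
codebook = {
    "occupation": {"A1": "baker", "B2": "weaver"},
    "parish": {"07": "hypothetical parish"},
}

problems = unexplained_codes(rows, codebook)
```

Here `problems` flags the parish code `99`, which no one reusing the data could interpret. Where a dataset has no codebook at all, as in half of those assessed, every coded value is in that position.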

Conclusions

Each collection of research data is unique and individual. It is frequently multi-format, using various software applications. All of this needs to be described somewhere as part of the dataset if it is to have some hope of surviving beyond the fleeting lifespan of the software and hardware used to create it. However, beyond the technical problems, which are complex, the issue which also looms large is the lack of any introductory and descriptive information about sets of research data. How to address this?

Institutions should ideally have practical data management plans which are easy to implement, with some quick wins. Such systematic and consistent approaches to data management should be introduced at an early stage in a research project. Researchers are busy doing research, so some time should be explicitly dedicated in projects to writing supporting documentation for the data. Otherwise sharing and reuse are going to become less of an option as time goes by. Research data supports a narrative of research and is an invaluable resource to the researcher while the research is active. Once the research is completed, the data is dropped, deleted or abandoned to fight for its survival. Some makes it, to enhance the narrative of further research projects or to validate the published research, while other data is not so lucky.

Next: SHARD interviews people managing and/or creating research data!