Wednesday, 18 July 2012
The leaflet has been jointly developed by the DICE, SHARD and PrePARe projects, and designed and printed at LSE. It is intended for printing A4 double-sided and folding into 3.
What material and data should I preserve?
To enable the use and reuse of research data over time by others it is important to ensure that you provide documentation which describes the research data as well as the context of its creation as part of the research project. Technical information about the research data should also be kept to enable its reuse. If the data is encoded then code details must be kept. So in addition to the core research material you should provide a clear introduction to the entirety of the research data to enable future understanding and use.
Documentation such as emails and other material accompanying the core research data may seem irrelevant but they will all provide important contextualisation of the research project and can be appraised for relevance. Cambridge University uses terms such as embedded, supported and catalogue data to describe data which should accompany the research data itself.
Will I lose control over the material if I preserve it?
A significant number of research funders require that data produced in the course of the research they fund should be made available for other researchers to discover, examine and build upon to allow for new knowledge to be discovered through use, reuse, comparing data and so on. However you are responsible for deciding what data is legally obliged to be open or closed according to various pieces of legislation such as FOI and data protection. This should be stated at time of deposit.
Why shouldn't I just keep my data/material on my hard drive?
Keeping all your research data in one place is not a good idea in general. It is essential not to keep your research data only on your hard drive, as hard drives inevitably fail and you will lose your data. You should always back up your data to at least two more devices or systems (ideally including a repository) external to your hard drive.
I have all my data on an external hard drive - do I need to do anything else?
Ensure that your data is well documented and is held on at least two external devices/systems, ideally including an institutional digital repository.
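Holding copies in two or more places only helps if the copies are still identical to the original. As a minimal sketch (not part of the leaflet's advice, and assuming plain files on locations you can read directly), the following compares checksums of a file and its backup copies. The file names and paths are hypothetical placeholders.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(original: Path, copies: list[Path]) -> dict[str, bool]:
    """Report, for each backup copy, whether it matches the original.

    A copy that is missing is reported as not matching.
    """
    expected = sha256_of(original)
    return {
        str(copy): copy.is_file() and sha256_of(copy) == expected
        for copy in copies
    }


if __name__ == "__main__":
    # Hypothetical locations: replace with your own file and backup paths.
    original = Path("interviews.csv")
    backups = [
        Path("/media/usb/interviews.csv"),
        Path("/mnt/repository/interviews.csv"),
    ]
    if original.is_file():
        for location, ok in compare_copies(original, backups).items():
            print(f"{location}: {'matches' if ok else 'MISSING OR DIFFERENT'}")
```

Running a check like this from time to time, and after every move between devices, is one way to notice a failed or corrupted copy while a good one still exists.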
Why should I preserve research material?
Researchers from all disciplines accumulate material in the course of their research. Considerable time, effort and money are spent in this endeavour. The preservation of research data is essential in order to further research through sharing of the data, to enable validation of results, and to demonstrate the process behind the conclusions and results of research.
What is a digital repository?
A digital repository is a system which provides a convenient infrastructure through which to store, manage, re-use and preserve digital materials. Repositories are used by a variety of communities, may carry out many different functions, and can take many forms, but essentially they are a secure way to keep data safe and accessible.
What archives/repositories are there for preserving my data?
There is no single UK repository for research data. Instead many are being developed within universities. The OpenDOAR initiative provides a comprehensive list of open repositories worldwide and in the UK. Here are some UK-wide repositories for specific types of data:
- The Archaeology Data Service supports research, learning and teaching with freely available, high quality and dependable digital resources. It does this by preserving digital data in the long term, and by promoting and disseminating a broad range of data in archaeology. The ADS promotes good practice in the use of digital data in archaeology, it provides technical advice to the research community, and supports the deployment of digital technologies.
- The University of Oxford Text Archive develops, collects, catalogues and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. It also gives advice on the creation and use of these resources, and is involved in the development of standards and infrastructure for electronic language resources.
- The History Data Service (HDS) collects, preserves, and promotes the use of digital resources, which result from or support historical research, learning and teaching. The History Data Service is a successor service to AHDS History which from 1996 to March 2008 was one of the five centres of the Arts and Humanities Data Service.
Can I use my institutional repository for data preservation?
Yes, you should be able to do this, if your institution has an institutional repository which collects research material. You should enquire of your institution if this is the case.
Can/should I deposit in more than one repository/archive?
No, it should be more than adequate to deposit in one repository but it depends on the service offered by the specific repository, e.g. does it guarantee that it will maintain access to the data over time?
Note: This page was developed by LSE/Cambridge/University of London and is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License
Thursday, 5 July 2012
Tuesday, 27 March 2012
"It's a question of discipline," the little prince told me later on. "When you've finished washing and dressing each morning, you must tend your planet." ~Antoine de Saint-Exupéry, The Little Prince, 1943.
We had a JISC Research data preservation projects meeting on Thursday 22 March. We had attendees from Cambridge, LSE, Bristol, and of course ULCC. Each project updated on its findings and progress including briefing on past and current training provision. Each project summarised the findings of their user survey. Although we had approached this in different ways (structured interviews, workshop, on-line questionnaire) our findings were remarkably similar. Even Bristol/DataSafe, which is concentrating on support staff and records data preservation, found resonances.
An important point which we see emerging again and again (and Cambridge, LSE and ULCC had all found this in our research) is the fact that the phrase “research data” was not recognised by most researchers, especially in the arts and humanities. They simply could not relate to the term. “Data” implies science, and structured data in large amounts, whereas we are using it to mean any and all information in any form used for research purposes. LSE has already adopted “research material and data” as a catch-all phrase, which is a more accessible term. ULCC have not used the term metadata in relation to training as we consider it an alienating term.
We looked at comparing the identified needs of the trainee community. There was much discussion about the attitudinal aspects of managing research data. People often despair of the lack of appreciation of the value of research data and why it should be kept. I personally think that we, the information managers, must share responsibility for this. Terms such as 'data' and 'metadata' for example are meaningless and alienating for most people not involved in the management of information. In a way we need to address our attitude to working with the research community. We have to develop ways of tailoring our approach/language to the non information management community. In a sense what I feel we need is to tend to ourselves first in order to make sure that we communicate effectively with the outside world.
Otherwise I noted that people are simply not getting the advice or assistance from the institutions which are fostering their research, hence the value of our projects if we pitch our advice correctly. Very often there are no guidelines on the management of research data available, so researchers are very much left to their own devices. This can be demonstrated by their storage solutions, as almost everyone we interviewed uses the cloud in one way or another. The exceptions were the few who knew the risks of the cloud (or who had actually read the terms and conditions).
Issues in preservation skills included: choosing and using appropriate file formats; incorporating data preservation into their project; working with repository criteria for research data deposit.
We agreed that no one method of delivery or approach would suit all our target audiences, but having material that could be re-purposed for several modes (e.g. group training and on-line learning) would be the best tactic. Furthermore, all projects are constrained in what they will produce by their project scopes and institution-specific requirements. We did, though, identify several areas where collaboration would be mutually beneficial, so we agreed the following joint action:
Cambridge will set up a wiki to enable us to develop firstly a structure and set of questions for a FAQ, then secondly to develop where possible generic answers to these questions, accepting that some will need to be tailored for each institution;
LSE will develop and design a top-level brochure about research data preservation containing the core points and links to further information. This will be adapted from the similar-but-independent 4-point structures proposed by ULCC and Cambridge, namely: Explain it – Store it Safely – Share it – Start Early. And as the little prince said it is all a question of discipline but communicate the 'why bother' effectively and it will be a less bitter pill to swallow with remarkably beneficial results.
So, a good get together. More later.
Tuesday, 20 March 2012
Hobbes considered that the state of nature, with competing desires amongst essentially equal human beings for limited supplies, generates conflict and, in his most famous phrase, the life of man is 'solitary, poor, nasty, brutish and short'.
Let's say we apply this to data management and why not? I am not a philosopher but if his idea is true then we would think that people are not interested in sharing resources or thinking beyond their immediate desires and needs. This research data is mine, hands off!
I am pessimistic at the best of times but after running our training on the preservation of research data entitled 'What's in it for me?', I felt less so by the end of the day. It seemed that people do want to share their research data after publication, as they want to enhance existing research and contribute to the body of work which is essential to the understanding of the thought processes involved in research output. And yes there is the unaltruistic side to us all, a bit of appealing to the immediate desires and needs as ultimately sharing your research data will enhance your standing in the community of expertise if it is well and often cited.
The premise of our training on March 14th was to lure folk in to speak about their experiences of preserving research data in the course of their research while we learnt a whole lot from them and what they need so we can best plan and design an online course on this for the great History Spot site at IHR. Our cohort of people attending our training day came from a variety of research backgrounds and made it a rich day for information gathering about their needs and 'desires'.
'I lost my data in a USB key which fell into a cup of coffee' - Anonymous.
So why did people come to our workshop? People spoke about various drivers which brought them to us. Experiencing the loss of data seems to sharpen the mind somewhat when it comes to preservation of data. People also spoke about being 'swamped with data and the information overload', wanting to take care of the material they had gathered over the years and worrying they might lose it. Language struck a chord with many around the table. A lot of people don't use the word 'data' to describe their research material. The term 'data' is regarded as scientific and as a result people in the Humanities often feel alienated. We also reflected on the project so far, and the knowledge base which we are gathering based on legacy data assessment and interviews.
The good, the bad and the ugly of research data preservation
We thought that it would be good to show them examples of what I had found in the assessments and what I had heard in the interviews. Feedback was without exception good for the whole day and people seemed to take to this particular session! It demonstrated a variety of practical examples of documentation for research data, from well documented examples to inadequate ones to nothing at all. Lack of documentation about research data is a severe inhibitor to allowing access to it in the future. If the researcher does not write down information, both descriptive and technical, about the data we will lose the capability to access it both intellectually and literally. Lack of safe storage was another point: people often didn't back up and relied heavily on the cloud for storage, not really knowing what they were agreeing to when they signed the terms and conditions of cloud services. However some were well advanced in good storage solutions and backups, used good formats for preservation and had considered how to future proof their material.
Intellectual Property Rights (IPR) rears its inevitable head and as Kit Good has rightly pointed out Data Protection and Freedom of Information can affect research data. Some people had data on living individuals and this would have implications in relation to data protection. Many people interviewed simply did not remember what permissions they had regarding use of the primary material they had copied or recorded. They had signed a piece of paper in the library or archive but didn't remember what it said. As a result they would not be able to share this data in the future as copyright and usage was not clear.
Four good ideas
We gave an overview of four things which they could all do to enhance the preservation of their research data. Here are the main ideas, for each of which we gave practical solutions.
1. Write everything down.
2. Store your data safely.
3. Interventions are needed, the earlier the better!
4. Consider sharing, the why and how.
We also put a set of questions to the group. These questions included:
1. Why bother keeping research data?
2. What are the risks of not keeping your research data?
3. Give us your examples of good and bad practice
4. What are your storage needs?
5. If you could have a single magic tool to do this, what would it be?
6. Are you comfortable with sharing your data at any time? If yes, why and if no, why?
We got tremendous answers which will guide us while developing our online course.
Thanks to everyone for such a good afternoon and now to work honing our Moodle skills!
Thursday, 8 March 2012
What do I mean by ‘access to information’? Let’s get the acronyms established early on for three pieces of legislation: The first is the Data Protection Act 1998 (DPA), which is concerned with the ‘personal data’ of living identifiable individuals. The other two – the Freedom of Information Act 2000 (FOIA) and the Environmental Information Regulations 2004 (EIR) – are concerned with ‘public’ information held by ‘public authorities’. Research data can be covered by all three.
Many researchers are looking at these issues already. Data management plans are a routine requirement for many research funding bodies. If there is no data management plan available for your research use one of the available templates provided by your institution or organisations such as JISC and the Digital Curation Centre.
Research data that contains reference to living individuals – interview scripts, contact details, even statistical information relating to small numbers of individuals etc. – should be managed according to the eight principles of the DPA. I won’t go into too much detail about this here, as there is so much guidance already available, suffice to say that the following should be considered:
- Do the individuals identified in your research data know how and for what purpose their data is being held? Have they given their consent?
- Is there provision to store the personal data safely and securely?
- How long are you planning to hold the personal data for? If the answer is ‘forever’, can you anonymise it and still retain its value?
Freedom of Information
Since 2005, all organisations defined as ‘public authorities’ in England, Wales and Northern Ireland are subject to the Freedom of Information Act 2000 (FOIA). In Scotland they follow the similar (with at least one important difference for research data, as we will see) Freedom of Information (Scotland) Act 2002 (FOISA). The crux of the Act is that the public has a right of access to information ‘held’ by public authorities. If asked for information, the authority has to confirm that it is held and provide it, unless a legal exemption applies. The Environmental Information Regulations 2004 provide a right of access to ‘environmental information’ under similar timescales, with some slight differences in detail from FOIA, but, for the purposes of this blog, my statements should generally cover both.
Universities are defined as public authorities by the Act and therefore obliged to respond to FOIA requests. This is not always as simple as it sounds, in that unlike many other public authorities, Universities operate in a competitive, increasingly international environment with an ever-decreasing proportion of public funding. More nuanced still is the relationship of the individual academic with ‘their’ research data, produced in everything from solitary sabbatical study to global partnerships of research institutions. At the same time, there is a significant ‘open access’ movement in academia which is arguing for the pro-active publication of research data through online journals and repositories.
FOIA and EIR requests have been made for research data and in some cases have required the Information Commissioner’s Office (ICO) to issue a ‘Decision Notice’ in order to ensure disclosure. Queen’s University Belfast were ordered by the ICO to release over 40 years of research data on tree rings, used for climate research (see the news item) under the EIR legislation.
There are, however, several exemptions in the FOI Act that can apply to research data requests: Section 22 ‘Intended for future publication’ allows a University to exempt information that will be later published. Section 43 ‘Commercial Interests’ exempts the disclosure of information which could prejudice the commercial interests of the University or another party, such as a partner institution or research funding body. If your research data contains personal data, then parts of it are likely to be exempt under Section 40 ‘Personal Information’. FOISA includes a specific research data exemption - Section 27(2) – but even so this derives from the general principle of ‘intended for future publication’ and is unlikely to prevent disclosure of research data held in the manner of the ‘tree ring’ dataset.
It is definitely worth reading the ICO’s guidance for the Higher Education sector around FOIA.
Once again, if you are unsure, do ask your institution’s Freedom of Information Officer or similar information compliance contact. Try and envisage in your data management plan how you would deal with a request for your research data. It may be that public disclosure of research data is a desired outcome of the project; it may require some serious consideration and discussion amongst the research team.
Access to information legislation in the UK can apply to research data. This can have important implications for a research project and therefore acts as another driver for ensuring that your data is managed and preserved effectively. Ensure that you create a data management plan when starting on a new project and discuss any issues with the FOI/DPA compliance officers at your institution.
Monday, 5 March 2012
Preservation and research data: what's in it for me?
Registration for this workshop is now CLOSED as we have reached capacity
All researchers create research data, from the database you've set up for your PhD, to your extensive collection of references, to more complex and extensive projects.
As researchers you have two choices:
A. Do you want to ensure that your research data gets the attention and validation it deserves over time? Do you want to make sure that it is safe, accessible and shareable as a valued resource for as long as possible?
B. Or would you rather it disappeared, became corrupted and was never used again with little or no opportunity for reuse and academic recognition of your research?
If you choose the former then we strongly recommend that you attend this FREE workshop.
We will cover issues such as why researchers need to bother about best practice in relation to managing their research data to ensure access over time.
- Case studies: The good, the bad and the ugly of the research data in relation to ensuring that it is accessible over time.
- Quick wins: Some simple things you can do to make sure your data lasts
- Tools: Things that can do these things.
The workshop will be held on 14 March 2012, from 2.00 to 4.30.
Tuesday, 7 February 2012
As part of gathering a knowledge base about research data training needs I have been looking at historical research datasets created by our SHARD partner IHR and in particular the Centre for Metropolitan History (CMH) which is based within IHR at UoL. CMH, established by the IHR in 1988, is one of the world’s leading centres for the study of the history of London and other metropolises. It specialises in innovative research projects, covering a wide range of periods, themes and problems in metropolitan history, publishing the results and data online and in print. A survey of the legacy data held at CMH had been done by ULCC a few years ago, so we had an idea that we had some good material for our study. The data covers metropolitan history, looking at health and social issues of London and at comparative studies with other cities.
To start off the assessment I had to get my hands on the data. We agreed to share the data on a shared drive at the University of London. However in practice it wasn’t so simple. The size of the data was an issue among other things. We then decided to use Dropbox, removing the data once it had been assessed. The experience clarified the need for procedures and a client focussed approach since, to state the obvious, different people and different departments have different needs. It also reinforced that old adage ‘assume nothing’.
Assessing the data
To assess the data we developed a set of questions based loosely on the AIDA and CARDIO approach to data assessment. We looked to identify gaps in data management in relation to preservation.
Storage and identification: This consisted of basic information about where it had been stored, name and unique identifier. Where has the data been stored? Was it generally accessible?
Description: What metadata had been provided, both descriptive and technical? Did the researcher/data creator describe how the data was created and processed, and using which software? What, if any, documentation has been provided? Were data models or schemas provided?
Structure and organisation: How well was the structure of the data described and to what level within the dataset? How well described were fields? Any information provided about how to process the data or how data was processed? If the data had been encoded, have the codes been made clearly and easily accessible?
Access and preservation: Can the files be opened? If not, what are the issues?
Legal: Here we looked at any information which described copyright and any IPR which would have implications for the preservation and sharing of the data. Is anything missing?
Some results of the assessment:
The data, while safe as they were being backed up every night by the central computer service at the University of London, were stored locally on a secure network drive and were not very accessible, as no one beyond CMH would find them unless they knew about them. They do not appear on any catalogue or database of holdings. Thus access was immediately extremely limited. Is this what we want to happen to our research data? Or are researchers really bothered about what happens to their research data once their results are published? Are they encouraged to be? Why should they bother?
There were some very good examples of projects providing guides and introductions to the data; about half the datasets provided some sort of documentation. However the documentation lacked consistency in approach, and the remainder of the data assessed lacked any documentation at all, which left one pretty much in the dark about its purpose and context as well as the nitty gritty of the technical details and methodologies used.
The data itself consisted of text based files, some structured numerical data, some audio and some mapping data. The various applications used included Oracle, dBase and Adobe files. GIS software such as ArcView was used. Other applications included Microsoft in its various forms (Excel, Access, Word), WordPerfect and Notepad. All of these were used in their various incarnations over time, with the result that some files were too old and too proprietary to be opened. Open Office proved very helpful at opening some files which were too old to be opened by the current versions of their own software. With two exceptions, who noted this in the guide to their data, the applications used were not explicitly indicated anywhere. Out of all the datasets I looked at in the study, two had documents describing aspects of the dataset itself, from general descriptive information to technical information about operating systems; others had guides introducing the dataset generally; and the rest had nothing at all. In this study I found that access to the data was hampered more by insufficient information about its organisation and structure than by technical problems. I could open most of the data I accessed but without any overview of the data it took time for it to make sense. Often this was time I didn't have!
At a more detailed level there were many instances of fields which were encoded but with no explanations of the codes available, thus rendering the data unusable. Half of the datasets assessed did not include anything explaining where to find these codes or how to use them. In addition many seemed to use place name and personal name authority files but few referred to the standard used. There was little if any evidence that IPR or any other issues had been considered. No statements for use or sharing were in evidence, with the exception of one dataset: an oral history project which was very explicit about copyright and permissions and generally was very well documented, oral history traditionally being strong on good data collection and management procedures.
Each set of research data is a unique and individual collection. It is frequently multiformat, using various software applications. This all needs to be described somewhere as part of the dataset in order for it to have some hope of surviving beyond the fleeting lifespan of the software/hardware used to create it. However, beyond the technical problems, which are complex, the issue which also looms large is the lack of any introductory and descriptive information about sets of research data. How to address this?
Institutions should ideally have practical data management plans which are easy to implement with some quick wins. Such systematic and consistent approaches to data management should be introduced at an early stage in a research project. Researchers are busy doing research, so some time should be explicitly dedicated in projects to writing supporting documentation for the data. Otherwise sharing and reuse is going to become less of an option as time goes by. Research data supports a narrative of research and is an invaluable resource to the researcher while the research is active. Once the research is completed, the data is dropped, deleted or abandoned to fight for its survival; some make it, to enhance the narrative of further research projects or validate the research published, while other data is not so lucky.
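Since the applications used were rarely recorded, even a rough inventory of file formats kept with a dataset would have helped. The following is a minimal sketch, not the method used in this assessment: it counts the files in a dataset folder by extension, as a starting point for a hand-written note on the software and versions involved. The folder name `cmh_dataset` is a hypothetical placeholder.

```python
from collections import Counter
from pathlib import Path


def format_inventory(dataset_dir: Path) -> Counter:
    """Count the files under dataset_dir by lower-cased file extension."""
    counts = Counter()
    for path in dataset_dir.rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical folder name: point this at your own dataset.
    folder = Path("cmh_dataset")
    if folder.is_dir():
        for extension, count in sorted(format_inventory(folder).items()):
            print(f"{extension}: {count} file(s)")
```

An extension says nothing about which version of an application wrote the file, so the software name, version and operating system still need to be written down by the person who knows them.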
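To illustrate why an undocumented code renders a field unusable, here is a small sketch of a codebook kept as a plain CSV file alongside the data. Everything in it (the field names, codes and meanings) is invented for illustration and does not come from the CMH datasets.

```python
import csv
import io

# Hypothetical codebook: one row per code, kept as plain CSV next to the data.
CODEBOOK_CSV = """field,code,meaning
occupation,1,Merchant
occupation,2,Craftsman
occupation,9,Not recorded
"""

# Hypothetical data file using those codes (code 7 is deliberately undocumented).
DATA_CSV = """person_id,occupation
101,2
102,9
103,7
"""


def load_codebook(text: str) -> dict[tuple[str, str], str]:
    """Map (field, code) pairs to their documented meaning."""
    reader = csv.DictReader(io.StringIO(text))
    return {(row["field"], row["code"]): row["meaning"] for row in reader}


def decode_rows(data_text: str, codebook: dict) -> list[dict]:
    """Replace documented codes with their meanings; leave other values as-is."""
    rows = []
    for row in csv.DictReader(io.StringIO(data_text)):
        rows.append(
            {field: codebook.get((field, value), value) for field, value in row.items()}
        )
    return rows


def undocumented_codes(data_text: str, codebook: dict) -> set:
    """List codes used in coded fields that the codebook does not explain."""
    coded_fields = {field for field, _ in codebook}
    missing = set()
    for row in csv.DictReader(io.StringIO(data_text)):
        for field in coded_fields:
            value = row.get(field)
            if value is not None and (field, value) not in codebook:
                missing.add((field, value))
    return missing


if __name__ == "__main__":
    codebook = load_codebook(CODEBOOK_CSV)
    for row in decode_rows(DATA_CSV, codebook):
        print(row)
    print("Undocumented:", undocumented_codes(DATA_CSV, codebook))
```

With the codebook present, a stranger to the project can read the occupation column; without it, every value in that column is as opaque as the undocumented code 7.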
Monday, 30 January 2012
Melissa Terras, 'Number Crunching Historians' http://melissaterras.blogspot.com/2012/01/number-crunching-historians.html
I knew the legacy data being assessed for SHARD was going to be interesting. I knew how unique and valuable these once-off accumulations would be but this data is more fascinating than I expected. It is also deeply frustrating when one is deprived of it, whether due to technical or intellectual problems, especially when you realise that all it might have taken would have been a brief document detailing some codes/technical specifications and whatever else might be useful for the non expert, which is probably everyone else in the world as the primary investigator will usually be the authority on the data.
It is important to remember that I am looking at the data not with a view to its content (however distracting) but rather at how well it has been managed to enable access to it and sharing of it over time. How easy would it be to open up this data and use it again if I was not the data owner or creator? How much time and money would it cost to recreate these rich data experiences? I think we can easily forget how things used to be when it came to research, before databases and spreadsheets and PCs. I am old enough to remember. Many researchers are not. How powerful electronic data is, allowing us to use and reuse data in myriad ways to further research with a speed which would have been unimaginable 20 years ago.
With this 'power' so to speak comes responsibilities. To preserve this data we must look after it. We have a responsibility to mind it so that these rich and diverse accumulations of data are kept for future researchers. One day you may well be in the position where you embark on a research project and find some research data which complements/informs your topic. You find you are unable to open it due to software and hardware issues and even if you can open it you find that you can't comprehend the data as perhaps the codes used have not been written down and maybe there is no guide to the data.
Just bear this in mind, that some day that researcher might well be you.
Thursday, 26 January 2012
A simple procedure such as moving a dataset internally from one drive to another may not always be as simple as that sounds.
This made me think up some rules of internal data sharing.
1. Context matters. Different people have differing needs, it is not a one size fits all approach. All data is not equal, some is more special/unique than others.
2. Procedures matter. There should be procedures and these should be agreed on. The oral tradition of remembering belongs in a folk not a data archive.
3. Metadata matters. Information about data matters. Almost as much as data, so ensure that this is shared as well as the data. Otherwise the data can be meaningless and without context.
4. Trust: hard to win, easy to lose and very hard to regain. Personal connections often gain trust so be nice to each other.
That's all for now. More very soon!
and Jane to develop appropriate accessible preservation training for researchers who create data in the course of their research and avoid the loss of this valuable unique resource which often gets lost over time after funding ends. SHARD is an opportunity to do this and to hopefully get people thinking about how access can be maintained over time to this valuable resource. The very experienced Ed Pinsent (first on left) will also be helping along the way.
My name’s Jane and I’m based at the Institute of Historical Research. I’m responsible both for our traditional publications activity and also for managing our digital projects. Recently, we’ve been putting a lot of time and effort into developing online research training materials for historians, building on our longstanding face-to-face courses. The SHARD project is a wonderful opportunity to raise awareness of data preservation among historians, and to present this specialist training in the more general framework of History SPOT, thereby helping to ‘demystify’ it. It’s great to have the opportunity to work with Patricia and her colleagues at ULCC, and to help embed their work within historical research practice.