I have just read an excellent blog article by Nick Poole about the Smithsonian Digitisation Fair in Washington. I gave a talk last December about the cost of mass digitisation at the Annual General Meeting of the Geological Curators' Group at Leeds Museum and feel inspired to jot down the thoughts of a curator in the middle of a mass digitisation project. Here are my 10 steps to mass digitisation, covering some of the pitfalls and how we have managed to overcome them, a timeline and finally an estimate of the cost of the project.
- Data entry templates
I have been asked so many times if I can provide a template for easy data capture. In my experience, each dataset is different and considerable initial thought is required to design a good data capture structure. I was given 100,000 micropalaeontological records back in 2009 that were created using MS Access on a data entry sheet designed to mirror fields in our KE Software collections management system, KE Emu. You can never spend too much time at the start of the process testing how it works so that the data you capture is usable. It could save weeks if not months of re-formatting at a later stage. This is especially critical if you will later rely on someone else to deliver your data to the web.
The old paper microfossil registers transcribed into an MS Access database at the start of the project
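A minimal sketch of what such a data capture structure might look like, with a validation pass to catch problems before import. The field names here are invented for illustration; the real KE Emu fields will differ.

```python
# Hypothetical data-entry template: column names chosen to mirror
# fields in a collections management system (illustrative only --
# the real KE Emu field names will differ).
TEMPLATE_FIELDS = [
    "register_number",    # unique register entry
    "taxon_verbatim",     # name exactly as written in the register
    "locality_verbatim",  # locality exactly as written
    "collector",          # collector/donor name as written
    "year",               # year of registration, if legible
]

def validate_row(row):
    """Return a list of problems found in one data-entry row."""
    problems = []
    for f in TEMPLATE_FIELDS:
        if not row.get(f, "").strip():
            problems.append(f"missing value for {f}")
    year = row.get("year", "").strip()
    if year and not (year.isdigit() and 1800 <= int(year) <= 2100):
        problems.append(f"suspicious year: {year!r}")
    return problems
```

Running every transcribed row through a check like this before import is one way to spend that "too much time at the start" productively.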
- Getting help with entering data
Two contract data entry clerks were responsible for initial data entry of our old micropalaeontology specimen registers. There has been a lot of debate about whether non-specialists can work as accurately as specialists. I would say that they did an excellent job in transcribing exactly what was written in the registers, except where the handwriting was poor. I often had trouble interpreting what had been written in these cases! They did it in a fraction of the time it would have taken me. I haven't tried crowdsourcing but I am certainly considering it to help clear some of the backlog of electronic registration that has accumulated since we stopped recording everything in pen and ink.
The data entry clerks were told not to do any interpretation and to transcribe exactly what had been written in the registers. This suited us because we wanted to maintain a good balance between recording the original register data and making informed interpretations. No original data has been removed during the migration as we were able to record details in verbatim fields. Considerable cleansing of the data has been necessary, mainly because the data in our registers is not sufficiently detailed or needs updating to reflect changes in political boundaries. Various other key areas required cleansing and these are dealt with below.
- Maintaining data standards
There are many ways of writing people's names (Miller, C. G., Mr C. G. Miller, Dr C. Giles Miller ... etc) and the hand-written registers reflect the fact that there was never a standard followed. Matching records in the MS Access database with those already in KE Emu was therefore difficult to impossible without creating many duplicate entries. To avoid this, we compiled a list of all the names associated with the collection and distilled them down to a list of about 2,000. We then checked these against all current museum records and found that many had already been created by other members of Museum staff. We then linked these records directly back to our data using an internal record number or "irn" so that we could be sure that the correct record in the correct format was being linked to. New records were created if necessary from the dataset of names we compiled.
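The distil-then-link step might be sketched like this. The variant spellings and the irn value are made up for illustration, but the approach mirrors the one described: reduce each variant to one canonical form, then attach the irn of the existing museum record.

```python
import re

# Illustrative lookup tables only: the variants and irn are invented,
# but this mirrors the process of distilling name variants to a
# canonical form and linking it to an existing record's irn.
CANONICAL = {
    "miller, c. g.": "Miller, C. G.",
    "mr c. g. miller": "Miller, C. G.",
    "dr c. giles miller": "Miller, C. G.",
}

IRN = {  # canonical name -> irn of the existing museum record
    "Miller, C. G.": 100123,  # hypothetical irn
}

def normalise(name):
    """Collapse whitespace and case so variant spellings compare equal."""
    return re.sub(r"\s+", " ", name.strip()).lower()

def link_name(raw_name):
    """Return (canonical_name, irn); irn is None if no record exists yet."""
    canonical = CANONICAL.get(normalise(raw_name))
    if canonical is None:
        return raw_name, None  # unseen variant: flag for review / new record
    return canonical, IRN.get(canonical)
```

Anything that falls through with `None` is exactly the case described above: a name needing a new record created from the compiled dataset.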
Some relatively complete examples of bibliographic citations
- Breaking tasks down into manageable blocks
In some ways we did this with the process we used for people's names. I was interested to see in Nick Poole's blog that the Smithsonian are using similar strategies of breaking the tasks down into smaller blocks to achieve larger digitisation goals. Bibliographic citations like those above have not been complete enough to create records directly from the registers as many use abbreviations, lack vital data or need further research to make them meaningful. I wrote a short subproject proposal for internal funds to hire an assistant for 6 months who created full reference details for all the published specimens in the collection. In reality this took a much shorter time than expected and she was able to help with many other tasks associated with preparing the data for migration into KE Emu.
- Using pre-existing datasets
Again the registers were not complete enough to be able to create identification records from scratch because generic names were often abbreviated or the original describing author details were missing. There are many biodiversity resources on the internet including the Ellis and Messina Catalogue of microfossil species published by the Micropalaeontology Press. I asked them if I could use their list of microfossil names to help populate our database and for a small fee they provided an MS Excel file of all the species in their database. I imported about 50,000 complete microfossil names into KE Emu and used a simple VLOOKUP function in MS Excel to match these with electronic records created from the paper registers. When no match was achieved I checked why, corrected the data if necessary or used the data to create new species records in KE Emu.
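The VLOOKUP step is essentially a dictionary lookup, and could equally be sketched in Python. The two catalogue entries below are real, widely known foraminifera names used purely as examples; the real Ellis and Messina export would of course be far larger.

```python
# A VLOOKUP-style match in Python: register names are looked up
# against a purchased list of full species names with authorities.
# Two well-known foraminifera serve as example entries here.
catalogue = {
    "Globigerina bulloides": ("d'Orbigny", 1826),
    "Elphidium crispum": ("Linnaeus", 1758),
}

def match_species(register_name):
    """Return the (author, year) catalogue entry, or None.

    No match means the register entry needs checking: either the
    spelling is wrong, the name is abbreviated, or a new species
    record must be created.
    """
    return catalogue.get(register_name)
```

As in the spreadsheet version, the interesting cases are the failures: every `None` is a record that needs a human decision.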
- Thinking positively
Shortly after arriving at the Museum in the 1990s I remember being told by a senior member of staff that it would take us 250 years to database the entire collection. Sometimes it's difficult to get started when you feel that your efforts are only just touching the surface or will go off into some black hole of a database that won't ever be useful because hardly any of your hundreds of thousands of objects are registered in it. I have to admit that there have been some times in my career when I have felt like this. My mentor encouraged me to see the bigger picture and the benefits of the project that I was involved in. Bringing data checking up to the top of my list of collections management priorities has paid immediate dividends.
- The bigger picture
There are so many advantages to having the majority of your collection on an electronic database that is searchable via the web. Even though I am only halfway through, I have seen real benefits in answering enquiries quickly and easily. Once everything is migrated I will be spotting areas for development of the collection, looking for potential areas for de-accession while gathering hard data on the collection's strengths. It is much easier to raise the profile of the collection and encourage visitors to the collections through schemes such as SYNTHESYS when you can send out messages to list-servers advertising a web link to your collection. Another major advantage is that I now have somewhere to associate the many electronic images and documents that relate to my collections, and these can be delivered to the web should I choose to.
- Estimating timescales
The initial data entry from the registers took our two clerks 4 months each to input a total of 100,000 records. In 6 months my assistant created full bibliographic records for the whole dataset and added "irn" references for all of the people associated as either collectors, donors or publishers. The process that has taken longest is my data checking, particularly for the scientific accuracy of the fossil names. I would estimate that I have spent between 5 and 10 per cent of my time checking data and preparing import sheets since the project started. I am therefore the log jam! At the current rate we are looking at sometime in 2015 for completion of the entire 100,000 record dataset.
Lyndsey Douglas researching full bibliographic microfossil reference details in the Heron-Allen Library
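The transcription throughput implied by those figures is worth making explicit for anyone planning a similar project. The 20-working-days-per-month figure below is my assumption, not a number from the project records.

```python
# Back-of-envelope throughput from the figures above:
# two clerks, four months each, 100,000 records transcribed.
records = 100_000
person_months = 2 * 4
per_person_month = records / person_months  # records per person-month
# Assuming roughly 20 working days per month (an assumption):
per_person_day = per_person_month / 20
```

That works out at 12,500 records per person-month, or in the region of 600 records per person-day, which gives a rough rule of thumb for scoping the data entry phase of a comparable register transcription.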
- Estimating costs

Obviously it would be imprudent to show a breakdown of salary costs here so I will just say that at Christmas last year, when 36,000 KE Emu records had been created, the cost came to roughly one pound per record. This includes the Micropalaeontology Press fee, salary costs for initial data entry, an assistant for 6 months and 10 per cent of my time. I have not included other expenses like building and IT overheads. I expect that the final cost per record at the end of the project will be slightly less than a pound per record as the major expenditure of salary for the data entry people and the 6 month post is now accounted for. The final cost will depend on how long it takes me to finish checking and migrating the data.
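The arithmetic behind that expectation can be sketched as follows. The £36,000 figure is only implied by "36,000 records at roughly one pound per record"; treat the numbers as illustrative rather than actual project accounts.

```python
# Rough cost model following the paragraph above: most of the spend
# (data entry, the six-month post, the catalogue fee) is already
# incurred, so the per-record cost falls as the record count grows.
spend_so_far = 36_000          # pounds, implied by ~1 pound x 36,000 records
records_so_far = 36_000
records_at_completion = 100_000

cost_per_record_now = spend_so_far / records_so_far
# If no further spend were needed at all, the floor on the final
# figure would be the sunk cost spread over the full dataset:
floor_at_completion = spend_so_far / records_at_completion
```

The true final figure will sit somewhere between that floor and one pound, because the remaining spend is largely the curator's checking time rather than new salaried posts.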
I may be only halfway through importing the 100,000 records, but I would like to think that this project can provide some valuable benchmark data for those planning future projects, suggest some ways of making the process quicker and help with forecasting costs and timeframes.