
Day 1 and 2 from Edinburgh


Like Sandy, I was impressed by the plant-inspired ending of Friday night’s ceremony! Whilst watching the inspiring event, I was looking through our database and seeing how big my part of the task ahead is really going to be…


As Sandy explained, simple differences in typing collectors’ names can result in two names being allocated to a single person – like the example of A. Fernandez and A. Fernández. The accent makes all the difference to the computer! The implications of such typos or spelling differences are what I’ll be focusing on this week.


Let me give an example of my job as the “duplicate hunter”, as I have named myself. If, for example, specimen “Fernandez 212” is being entered into our database, the database performs an automated check to see whether other specimens (also known as duplicates) of the collection event have already been entered. If another duplicate of the collection event has been entered as “Fernández 212” with the accent on the e, whilst the one being entered is missing it, these two specimens will become part of two separate collection events… Again we can’t blame the computer, as the names are not exactly identical!
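For the technically curious, here is a minimal sketch of how an accent-insensitive comparison could work. It is only an illustration written with Python's standard library, not how our database actually performs its check, and the function names `fold_accents` and `same_collection_key` are made up for this example.

```python
import unicodedata


def fold_accents(name: str) -> str:
    """Return the name with diacritical marks removed and case folded."""
    # NFKD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def same_collection_key(a: str, b: str) -> bool:
    """Compare two 'collector number' strings while ignoring accents and case."""
    return fold_accents(a) == fold_accents(b)


# An exact comparison treats these as different; the folded comparison does not.
exact_match = "Fernandez 212" == "Fernández 212"
folded_match = same_collection_key("Fernandez 212", "Fernández 212")
```

A match found this way is still only a suspect: two different collectors can share a surname, so a person has to confirm each pair before merging.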


So I went into our database and checked how many collection events are identical based on collection number (for example 212) and collection date (day, month and year). As collection numbers and dates are numerical, typos caused by alternative spellings do not generally cause issues (although see below), meaning that identical entries can be identified easily.


Using the above ever-so-clever but simple technique, I identified 1839 records that are potentially duplicated. Of course there is a large list of collections that are not true duplicates although they appear on our suspected list. These are collections that have, just by chance, the same number and collection date. No fewer than 1549 of the 1839 suspected are collections that lack a number, which are all labelled with the number “s.n.” according to old tradition, as “s.n.” means “without number” in Latin. What the letters s.n. truly stand for escapes me now – s. = sin, but n. = numero or numerus? Latin speakers will be able to help me out here…
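The matching on number and date can be sketched in a few lines of Python. This is a rough illustration only: the record layout (the `id`, `number` and `date` fields) and the sample records are invented for the example and are not our real database structure.

```python
from collections import defaultdict


def suspected_duplicates(records, skip_unnumbered=True):
    """Group records by (collection number, collection date).

    Returns only the groups holding more than one record id.
    """
    groups = defaultdict(list)
    for rec in records:
        number = rec["number"].strip()
        # Unnumbered collections all share "s.n." and would swamp the list.
        if skip_unnumbered and number.lower() == "s.n.":
            continue
        groups[(number, rec["date"])].append(rec["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Invented sample records, purely for illustration.
sample = [
    {"id": 1, "number": "212", "date": "1998-03-14"},
    {"id": 2, "number": "212", "date": "1998-03-14"},
    {"id": 3, "number": "s.n.", "date": "1998-03-14"},
    {"id": 4, "number": "s.n.", "date": "1998-03-14"},
    {"id": 5, "number": "77", "date": "2001-07-02"},
]

suspects = suspected_duplicates(sample)
suspects_with_sn = suspected_duplicates(sample, skip_unnumbered=False)
```

Leaving out the “s.n.” records keeps the chance matches among unnumbered collections from dominating the list, although some of those may of course be real duplicates and would need a different kind of check.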


Prior to our Plant Challenge, I did a spurt of duplicate spotting in our database over one quiet day. I found out that there are several errors leading to duplications. Spelling mistakes or alternative spellings of collectors’ names are one reason, but alternative spellings of numbers are another, although a small one I grant you. There is a set of numbers which have been entered with an unnecessary 0, such as “012” which appears simply as “12” in another duplicate entry. I plan to tackle these duplicates by filtering all collection numbers with a leading “0” and then sorting in numerical order. There seem to be an additional 100 or so records to check there.
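The leading-zero problem is easy to picture in code. The sketch below is a hypothetical helper (the name `normalise_number` is my own) and it only touches collection numbers made up entirely of digits, leaving values such as “s.n.” or “212a” alone.

```python
def normalise_number(number: str) -> str:
    """Strip unnecessary leading zeros from a purely numeric collection number."""
    stripped = number.strip()
    if stripped.isdigit():
        # Keep a single "0" if the number consists of zeros only.
        return stripped.lstrip("0") or "0"
    return stripped


# Invented sample values, sorted numerically after normalising.
numbers = ["012", "12", "7", "s.n.", "0100"]
numeric = sorted(
    (normalise_number(n) for n in numbers if n.strip().isdigit()),
    key=int,
)
```

Comparing the normalised values rather than the values as typed would let “012” and “12” land in the same collection event, while the original entry stays untouched until someone has checked it.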


And lastly there are ones where duplicates appear simply as identical duplicates. These are ones where the collector’s name and collection number appear perfectly identical, and truly are. Although we try to avoid entering duplicate records, it always happens, somehow.


Quite impressively, I have now tagged myself a list of 11 388 collection events to check and go through!!! By no means will all of these records represent true duplicate entries – our data set is relatively clean we believe – but one never knows …


Plant Challenge Day 1 from North London


After the wowser event on Friday night (that even had a botanical motif at the end), data cleaning began early on Saturday morning...


My task is to check names of people in the database, eliminate duplicates and correct spellings, and to fill in fields like first name and full name. Sounds easy.......


I made it through the letter A, well almost. Checking the identity of collectors of plants involves seeing where they were when - for example two Stephen Allens, one from the 1890s and one from the 1970s, could exist, so just assuming they are the same is dangerous. Once I had determined people were indeed different, I cross-checked with numerous external web databases (like that at Harvard, or the one in JSTOR Plant Science) to double check first names, initials and dates.


Along the way I correct diacritical marks (accents) in non-English surnames - it is easy when entering data to leave off the accent in a name like Fernández - the computer thinks A. Fernandez and A. Fernández are different people - it is only a machine after all - when in fact they are one and the same. So the collections attributed to each need to be merged.


All this takes time, but in the end is worth it. I also managed to find a few plant name mysteries while dealing with people - all tiny little things that once solved put another puzzle piece in place for our eventual documentation of the diversity of Solanum. Even though my eyes go squiffy from staring at the screen it is great to feel like things are getting cleaned up - and it just reminds me how much I really do like these plants - they are great!


We are also getting ready to go to a meeting in France with some eggplant (aubergine for us Brits) breeders - so I have also been thinking about how to present our results on the taxonomy of African Solanum (done by Maria Vorontsova - see her previous blogs on collecting in Africa) in the most user-friendly way - a different sort of challenge!


Plant Challenge - Let's begin!

Posted by Tiina Jul 23, 2012

A certain major sporting event will get under way this Friday and we'll be having our own celebration by launching our own Plant Challenge!


Here at the Solanaceae team we will be writing daily blogs about our activities. We have set ourselves a goal – a challenging goal we hope to achieve, but in order to do so we might need a bit of luck and lots of hard work! The great big goal is to clean and update our ever-growing BRAHMS database, which holds the data needed for running the great Solanaceae Source website, soon to be updated to Scratchpads 2. This is not a small task by any means: the database currently includes 60,005 collection events, 72,301 individual specimen entries, 16,759 collectors’ names, 13,565 species names, 19,318 gazetteer entries, and 71,345 species determination records!


Between Friday 27 July and Friday 10 August you can follow up on our progress and hear how our efforts are going. Our Team consists of three people: Mamen (Maria Peña Chocarro), Sandy, and Tiina. Mamen will be in charge of geography, Sandy is focusing on cleaning collectors, nomenclature, and literature, and Tiina is taking on data entry and unifying data records. Despite months of hard and strenuous training, the contestants are feeling nervous yet incredibly excited! One thing is for sure - the journey will be full of surprises, as you never know what one finds inside the big matrix!!!


The team will use a “divide and conquer” strategy to tackle the mammoth task. Whilst Mamen and Sandy will stay at the project headquarters in London, Tiina will be sent to the Royal Botanic Garden Edinburgh to establish a remote base for the operations. The equipment for the task will include three laptops, three internet connections, and three desks. Coordination of research will be done through email and phones.


Whether you are a scientist or a keen natural historian, join us in our efforts in the Plant Challenge! Send your comments to our blog, with links to your own planty challenge feat!