82.   Taxon names in multiscript languages

Nozomi “James” Ytow¹, David R. Morse², Akira Sato³ & David McL. Roberts⁴

1 Graduate School of Life and Environmental Sciences/Gene Research Center, University of Tsukuba, Tsukuba, Ibaraki 305-8572, Japan

2 Computing Dept, Faculty of Mathematics and Computing, The Open University, Walton Hall, Milton Keynes MK7 6AA, UK

3 Academic Computing and Communications Center, University of Tsukuba, Tsukuba, Ibaraki 305-8577, Japan

4 Department of Zoology, The Natural History Museum, Cromwell Road, London SW7 5BD, UK

A multiscript language offers more than one way to write some of its words. For example, the Japanese word for cherry can be encoded in Unicode in six ways: Hiragana, full-width Katakana, half-width Katakana, traditional Kanji, simplified Kanji and the romanised form “Sakura”. Because all of these forms are used to label resources on the Web, and any of them may be used in a database at the content provider's preference, a mechanism for searching across them would benefit non-scientific users of initiatives such as GBIF. There is no algorithmic way, however, to convert between the encodings except between Hiragana and full-width Katakana. TCS would be the best place to express and exchange the equivalence of those encoded character strings, although the equivalence does not necessarily have any relationship to a scientific name. We explored the capability of the proposed TCS to express equivalence between encoded strings rather than equivalence between scientific-name-oriented “concepts”.
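As an illustration only, the following minimal Python sketch contrasts the two situations described above. The Hiragana and full-width Katakana conversion relies on the fact that the two Unicode blocks are laid out in parallel, 0x60 code points apart. The remaining forms are listed by hand in an equivalence set, because they cannot be derived from one another in this way. The function names, the `EQUIVALENCE_SETS` structure and the lookup are assumptions made for this example; they are not part of TCS or of any GBIF service.

```python
# Illustrative sketch; all names here are hypothetical.

HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # Hiragana letters in Unicode
KANA_OFFSET = 0x60  # the Katakana block sits 0x60 code points above Hiragana


def hiragana_to_katakana(text: str) -> str:
    """Shift each Hiragana letter to its full-width Katakana counterpart."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


def katakana_to_hiragana(text: str) -> str:
    """Shift each full-width Katakana letter back to Hiragana."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if HIRAGANA_START + KANA_OFFSET <= ord(ch) <= HIRAGANA_END + KANA_OFFSET
        else ch
        for ch in text
    )


# The six encodings of "cherry" named in the text, recorded by hand.
CHERRY_VARIANTS = frozenset({"さくら", "サクラ", "ｻｸﾗ", "櫻", "桜", "Sakura"})

# Assumed structure: one set per group of equivalent encoded strings.
EQUIVALENCE_SETS = [CHERRY_VARIANTS]


def equivalent_labels(label: str) -> frozenset:
    """Return every recorded string equivalent to `label` (including itself)."""
    for variants in EQUIVALENCE_SETS:
        if label in variants:
            return variants
    return frozenset({label})


if __name__ == "__main__":
    print(hiragana_to_katakana("さくら"))
    print(sorted(equivalent_labels("桜")))
```

A cross-search would expand a query such as “桜” into all members of its set before matching against labels. The set itself has to be curated and exchanged as data, which is the role the abstract proposes for TCS.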

This work was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research (B), 17300071, 2005.