TAXAMATCH - fuzzy matching algorithm for genus and species scientific names
Have you ever entered "Caelorinchus" into a species search page, only to discover that the data you want is held under the name "Coelorinchus" (or even "Coelorynchus", "Coelorhynchis", etc.), or typed "Panulirus" when you really meant "Palinurus" (or vice versa)? Or can you simply not remember the correct spelling of "Syzygotettix boettcheri"? If so, you probably need a fuzzy matching algorithm tailored for taxonomic data, namely TAXAMATCH.
TAXAMATCH is an algorithm for fuzzy (approximate string) matching of taxon scientific names. With normal exact matching, it is trivial to retrieve desired information from a taxonomic database, to match content across two or more systems, or to determine that the same taxon name appears multiple times on a single list (deduplication). However, if an error affects one or more characters in one or both names so that they no longer match exactly, some form of fuzzy matching is required; TAXAMATCH is intended to fulfil that function, with particular tuning to the types of errors found in real world taxonomic data. Using TAXAMATCH, the intention is to always return candidate "true" near matches where these exist (as close as possible to 100% recall); to suppress as many "false" near matches as practicable (high precision); and to do all this in as short a time as possible (high efficiency), even against large reference datasets (>1 million names).
To achieve the above, TAXAMATCH employs both phonetic and non-phonetic matching (to detect errors of either type, or both) along with a set of heuristic rules that are incorporated into pre- and post-filters at both genus and species epithet level. In the main, the pre-filters maximise algorithm efficiency by ensuring that only a subset of the available names has to be tested, while the post-filters apply heuristic reasoning to distinguish likely "true" from "false" near matches, even when these have the same calculated similarity. A final result shaping stage is also normally applied; it further filters the result set passed from the species post-filter with the aim of increasing precision (rejection of false hits). In some circumstances the odd true hit may also be lost at this stage, so the option to disable it on request is also supported.
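The combination described above (a cheap pre-filter, then phonetic and non-phonetic tests, then a heuristic post-filter) can be illustrated with a minimal Python sketch. This is not the TAXAMATCH reference implementation: `phonetic_key` is a toy normalisation invented for this example, the distance measure is plain Levenshtein edit distance, and the length pre-filter, the first-letter post-filter rule and the `max_dist` threshold are all illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def phonetic_key(name: str) -> str:
    """Toy phonetic key (an assumption, NOT the TAXAMATCH phonetic algorithm)."""
    s = name.lower()
    for old, new in (("ae", "e"), ("oe", "e"), ("rh", "r"), ("ph", "f"), ("y", "i")):
        s = s.replace(old, new)
    collapsed = []
    for ch in s:  # collapse doubled letters, e.g. "ll" -> "l"
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def match_genus(query: str, reference: list[str], max_dist: int = 2):
    """Return (name, edit distance, is_phonetic_match) for plausible near matches."""
    q = query.lower()
    q_key = phonetic_key(q)
    hits = []
    for cand in reference:
        c = cand.lower()
        # Pre-filter (illustrative): skip names whose length differs too much,
        # so that most of the reference list is never fully compared.
        if abs(len(c) - len(q)) > max_dist:
            continue
        is_phonetic = phonetic_key(c) == q_key
        dist = edit_distance(q, c)
        # Post-filter (illustrative): accept phonetic matches outright, but
        # require non-phonetic matches to be close and share a first letter.
        if is_phonetic or (dist <= max_dist and c[0] == q[0]):
            hits.append((cand, dist, is_phonetic))
    # Phonetic matches first, then by increasing edit distance.
    return sorted(hits, key=lambda h: (not h[2], h[1]))


hits = match_genus("Caelorinchus", ["Coelorinchus", "Palinurus", "Coelorynchus"])
print([name for name, _, _ in hits])
```

In this sketch "Palinurus" is discarded by the length pre-filter before any distance is computed, while both spelling variants of "Coelorinchus" are returned because they share the toy phonetic key of the query.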
One characteristic of TAXAMATCH is that currently it may take up to about 1 second to process a single input name against a large reference database (e.g. 1m+ target names). This is probably fine for user web input or for checking a few thousand names as a batch run, but may be expensive for full scale internal deduplication purposes (e.g. comparing 1.4m names with each other in turn = 1.4m tests = 1.4m sec = 388 hours or 16 days approx.). Accordingly, a "rapid" mode is also supported that improves efficiency by perhaps 100 fold with almost no impact on recall of true near matches at species level (although the impact at genus level is severe). In this mode it is presumed that EITHER the genus or the species epithet is a phonetic match, a condition that is probably true in 99.9% of cases at species level, although obviously it will fail for non-phonetic genus errors where no species epithet is available. This mode should therefore be used with caution, but clearly has a place, especially for the potential deduplication of large scale species lists.
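The "rapid" mode assumption, and the timing arithmetic above, can be sketched as follows. This is a hedged illustration only: `toy_key` is an invented stand-in for a phonetic key, and `build_index`, `rapid_candidates` and `batch_hours` are hypothetical names, not part of any TAXAMATCH implementation. The idea shown is that names are indexed by the phonetic key of their genus and of their species epithet, so that a query only needs to be compared against names sharing at least one of the two keys.

```python
from collections import defaultdict


def toy_key(word: str) -> str:
    """Toy phonetic key (an assumption, NOT the TAXAMATCH phonetic algorithm)."""
    s = word.lower()
    for old, new in (("ae", "e"), ("oe", "e"), ("rh", "r"), ("ph", "f"), ("y", "i")):
        s = s.replace(old, new)
    collapsed = []
    for ch in s:  # collapse doubled letters
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


def build_index(names):
    """Index full names by the phonetic key of the genus and of the epithet."""
    by_genus, by_epithet = defaultdict(set), defaultdict(set)
    for name in names:
        genus, _, epithet = name.partition(" ")
        by_genus[toy_key(genus)].add(name)
        if epithet:
            by_epithet[toy_key(epithet)].add(name)
    return by_genus, by_epithet


def rapid_candidates(query, by_genus, by_epithet):
    """Names sharing a phonetic key with EITHER the query genus or its epithet."""
    genus, _, epithet = query.partition(" ")
    candidates = set(by_genus.get(toy_key(genus), ()))
    if epithet:
        candidates |= by_epithet.get(toy_key(epithet), set())
    return candidates


def batch_hours(n_names: int, seconds_per_name: float = 1.0) -> float:
    """Estimated run time in hours for n_names tests at a fixed cost each."""
    return n_names * seconds_per_name / 3600


reference = ["Coelorinchus australis", "Palinurus elephas", "Panulirus argus"]
by_genus, by_epithet = build_index(reference)

# Non-phonetic genus error, rescued because the epithet is a phonetic match.
print(rapid_candidates("Panulirus elephas", by_genus, by_epithet))
# Genus-only query: the non-phonetic variant "Palinurus" is not retrieved.
print(rapid_candidates("Panulirus", by_genus, by_epithet))
# 1.4m tests at 1 second each, in hours and in days.
print(batch_hours(1_400_000), batch_hours(1_400_000) / 24)
```

The second query shows why this mode needs caution at genus level: with no epithet to fall back on, a non-phonetic genus error yields no candidate for the intended name. The final line reproduces the estimate above (about 388 hours, or roughly 16 days); if the suggested 100 fold speed-up held, the same run would take just under 4 hours.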
This new approach is currently available via the IRMNG search interface (see below), and will be introduced into CAAB, OBIS, and other databases as soon as development has bedded down. Further information relevant to TAXAMATCH can be found here:
- TAXAMATCH Reference Implementation, as accessed via the "IRMNG Data Access" page
- TAXAMATCH project description on the TDWG Biodiversity Information Projects of the World database
- Rees, T., 2009. Fuzzy matching of taxon names for biodiversity informatics applications. Poster presented at e-Biosphere 2009 Conference, U.K., June 2009 (.pdf file, 2 MB)
- TAXAMATCH information on the GNA taxon-name-processing site
- Michael Giddens' TAXAMATCH page at www.silverbiology.com, aimed at developing an API JSON service based on a PHP/MySQL implementation of TAXAMATCH.
- Dmitry Mozzherin's TAXAMATCH port as an open source Ruby module at http://github.com/dimus/taxamatch_rb/tree/master
- Paul Murray and Sam Lee's Java TAXAMATCH port for the Atlas of Living Australia (2011) is available here: http://code.google.com/p/ala-nsl/wiki/Taxamatch
- The USA iPlant PHP TAXAMATCH implementation for the iPlant Taxonomic Name Resolution Service (TNRS) (2011) is available here: https://github.com/iPlantCollaborativeOpenSource/TNRS/tree/master/taxamatch
- TAXAMATCH ancillary functions test page at CMAR
- TAXAMATCH presentation at the TDWG 2008 Conference - Perth, Australia (abstract)
- TAXAMATCH presentation at the TDWG 2008 Conference - Perth, Australia (slides only - ~ 1.6 MB)
- TAXAMATCH presentation at the TDWG 2008 Conference - Perth, Australia (slides plus audio commentary [.swf file] - ~ 12.7 MB)
- Previous conference presentation: Rees, T., 2008. Applications of fuzzy (approximate string) matching in taxonomic database searches, with an example multi-tiered approach. [Extended abstract]. Pp. 12-14 in Worcester, T., Bajona, L. & Branton, B. (eds): Proceedings of a Conference on Ocean Biodiversity Informatics, Bedford Institute of Oceanography, Dartmouth, Nova Scotia, 2-4 October 2007. Bedford Institute of Oceanography, 2008 (CSAS/SCCS Proceedings Series 2008/024). Available at http://www.dfo-mpo.gc.ca/CSAS/Csas/Publications/Pro-CR/2008/2008_024_e.pdf.
- OBI 2007 Conference - Halifax, Canada (original conference abstract + link to presentation slides)
There is also a TAXAMATCH Developer's wiki available (username and password required to log in; to request one, please contact Tony.Rees@csiro.au).