CSIRO Marine and Atmospheric Research

TAXAMATCH - fuzzy matching algorithm for genus and species scientific names

Have you ever entered "Caelorinchus" into a species search page, only to discover that the data you want is held under the name "Coelorinchus" (or even "Coelorynchus", "Coelorhynchis", etc.), or typed "Panulirus" when you really meant "Palinurus" (or vice versa)? Or can you never quite remember the correct spelling of "Syzygotettix boettcheri"? If so, you probably need a fuzzy matching algorithm tailored for taxonomic data, namely TAXAMATCH.

TAXAMATCH is an algorithm for fuzzy (approximate string) matching of taxon scientific names. With normal exact matching, it is trivial to retrieve desired information from a taxonomic database, to match content across two or more systems, or to determine that the same taxon name appears multiple times on a single list (deduplication). However, if an error affects one or more characters in one or both names such that they no longer match exactly, some sort of fuzzy match is required. TAXAMATCH is intended to fulfil that function, with particular tuning to the types of errors found in real world taxonomic data. Using TAXAMATCH, the intention is to always return candidate "true" near matches where these exist (as close as possible to 100% recall); to suppress as many "false" near matches as practicable (high precision); and to do all this in as short a time as possible (high efficiency), even against large reference datasets (>1 million names).
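As a rough illustration of the difference between exact and fuzzy matching, the sketch below uses plain Levenshtein edit distance as a stand-in similarity measure. It is not the TAXAMATCH algorithm itself; the function names and the threshold of 2 are assumptions made for this example only.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions).

    A generic stand-in measure, not the tuned TAXAMATCH comparison.
    """
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def is_near_match(a, b, max_distance=2):
    """Hypothetical helper: True if the names are within max_distance edits."""
    return edit_distance(a.lower(), b.lower()) <= max_distance


# An exact comparison fails, but the fuzzy comparison succeeds.
print("Caelorinchus" == "Coelorinchus")               # exact match
print(is_near_match("Caelorinchus", "Coelorinchus"))  # fuzzy match
```

Here "Caelorinchus" and "Coelorinchus" differ by a single substitution, so an exact test rejects the pair while the edit-distance test accepts it.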

To achieve the above, TAXAMATCH employs both phonetic and non-phonetic matching (to detect errors of either type, or both) along with a set of heuristic rules that are incorporated into pre- and post-filters at both genus and species epithet level. In the main, the pre-filters maximise algorithm efficiency by ensuring that only a subset of available names has to be tested, while the post-filters apply heuristic reasoning to distinguish likely "true" from "false" near matches, even where these have the same calculated similarity. A final result shaping stage is also normally applied, which further filters the result set passed from the species post-filter with the aim of increasing precision (rejection of false hits). In some circumstances the odd true hit may also be lost at this stage, so the step can be disabled on request.
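The general shape of such a pipeline (pre-filter, phonetic and non-phonetic comparison, post-filter) can be sketched as follows. Everything here is hypothetical: the phonetic substitutions, the length and distance thresholds, and the first-letter rule are simple placeholders chosen for illustration, not the actual TAXAMATCH rules.

```python
import re


def phonetic_key(word):
    """Crude, hypothetical phonetic key (NOT the actual TAXAMATCH rules)."""
    key = word.lower()
    for old, new in (("ae", "e"), ("oe", "e"), ("rh", "r"), ("ph", "f"), ("y", "i")):
        key = key.replace(old, new)
    return re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters


def edit_distance(a, b):
    """Plain Levenshtein distance, used as the non-phonetic measure."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def near_matches(query, reference, max_len_diff=2, max_dist=2):
    """Return (name, distance, is_phonetic_match) tuples, closest first."""
    q = query.lower()
    q_key = phonetic_key(query)
    hits = []
    for name in reference:
        # Pre-filter (assumed rule): skip names whose length differs too
        # much, so the costlier comparisons run on a subset only.
        if abs(len(name) - len(query)) > max_len_diff:
            continue
        phonetic = phonetic_key(name) == q_key
        dist = edit_distance(q, name.lower())
        # Post-filter (assumed rule): accept phonetic matches outright;
        # otherwise require a small distance and the same first letter.
        if phonetic or (dist <= max_dist and name[:1].lower() == q[:1]):
            hits.append((name, dist, phonetic))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


reference = ["Coelorinchus", "Coelorynchus", "Palinurus", "Homo"]
for hit in near_matches("Caelorinchus", reference):
    print(hit)
```

In this toy run, "Palinurus" and "Homo" are discarded by the length pre-filter before any comparison is made, while the two remaining names are accepted because their phonetic keys equal that of the query.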

One characteristic of TAXAMATCH is that it may currently take up to about 1 second to process a single input name against a large reference database (e.g. 1m+ target names). This is probably fine for user web input or for checking a few thousand names as a batch run, but may be expensive for full scale internal deduplication (e.g. comparing each of 1.4m names against all the others = 1.4m tests = 1.4m sec = approx. 389 hours or 16 days). Accordingly, a "rapid" mode is also supported that improves efficiency by perhaps 100 fold with almost no impact on recall of true near matches at species level (although the impact at genus level is severe). In this mode it is presumed that EITHER the genus or the species epithet is a phonetic match, a condition that probably holds in 99.9% of cases at species level, although obviously it will fail for non-phonetic genus errors where no species epithet is available. This mode should therefore be used with caution, but clearly has a place, especially for the potential deduplication of large scale species lists.
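One way to picture the "rapid" mode assumption is as a lookup on phonetic keys: only names that share a phonetic key with the input, on either the genus or the species epithet, are considered at all. The sketch below is an illustration under that reading, not the TAXAMATCH implementation; the phonetic key is a crude placeholder and all function names are hypothetical. A small helper also reproduces the run time arithmetic given above.

```python
import re
from collections import defaultdict


def phonetic_key(word):
    """Crude, hypothetical phonetic key (NOT the actual TAXAMATCH rules)."""
    key = word.lower()
    for old, new in (("ae", "e"), ("oe", "e"), ("rh", "r"), ("ph", "f"), ("y", "i")):
        key = key.replace(old, new)
    return re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters


def build_index(names):
    """Index reference names by the phonetic key of genus and of epithet."""
    by_genus, by_epithet = defaultdict(set), defaultdict(set)
    for name in names:
        genus, _, epithet = name.partition(" ")
        by_genus[phonetic_key(genus)].add(name)
        if epithet:
            by_epithet[phonetic_key(epithet)].add(name)
    return by_genus, by_epithet


def rapid_candidates(query, by_genus, by_epithet):
    """Names whose genus OR epithet is a phonetic match to the query."""
    genus, _, epithet = query.partition(" ")
    found = set(by_genus.get(phonetic_key(genus), ()))
    if epithet:
        found |= by_epithet.get(phonetic_key(epithet), set())
    return found


def estimated_hours(n_names, seconds_per_name=1.0, speedup=1.0):
    """Total run time in hours for n_names sequential lookups."""
    return n_names * seconds_per_name / speedup / 3600


names = ["Coelorinchus australis", "Palinurus elephas", "Panulirus argus"]
by_genus, by_epithet = build_index(names)

print(rapid_candidates("Caelorinchus australis", by_genus, by_epithet))
print(rapid_candidates("Palunirus", by_genus, by_epithet))  # genus only, non-phonetic error
print(round(estimated_hours(1_400_000), 1))                 # normal mode
print(round(estimated_hours(1_400_000, speedup=100), 1))    # assumed 100 fold speedup
```

The second lookup returns an empty set: with no species epithet to fall back on, a non-phonetic genus error finds no candidates, which is the genus level weakness described above.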

This new approach is currently available via the IRMNG search interface (see below), and will be introduced into CAAB, OBIS, and other databases as soon as development has bedded down. Further information relevant to TAXAMATCH can be found here:

There is also a TAXAMATCH Developer's wiki available (username and password required to log in; to request one, please contact Tony.Rees@csiro.au).

Updated: 12/09/2011

This page maintained by Tony Rees (Tony.Rees@csiro.au)

© Copyright CSIRO Australia, 2011