README file for installation of TAXAMATCH algorithm (demo version) on an ORACLE system. Accompanies: TAXAMATCH v. 1.0 Tony Rees - 24 November 2008 (Tony.Rees@csiro.au) INTRODUCTION TAXAMATCH is an algorithm for fuzzy (approximate string) matching of taxon scientific names. With normal exact matching, it is trivial to retrieve desired information from a taxonomic database, or match content across 2 or more systems, or determine that the same taxon name appears multiple times on a single list (deduplication). However if there is a character error affecting one or multiple characters in one or both names such that they do not match exactly, use of some sort of fuzzy match is required; TAXAMATCH is intended to fulfil that function, with particular tuning to the types of errors found in real world taxonomic data. Using TAXAMATCH, the intention is to always return candidate "true" near matches where these exist (as close as possible to 100% recall); suppress as many as practicable "false" near matches (high precision); and do all this in as short a time as possible (high efficiency), even against large reference datasets (>1 million names). To achieve the above, TAXAMATCH employs both phonetic and non-phonetic matching (to detect errors of either type, or both) along with a set of heuristic rules that are incorporated into pre- and post- filters at both genus and species epithet level. In the main, the pre-filters maximimise algorithm efficiency by ensuring that only a subset of available names have to be tested, while the post-filters apply heuristic reasoning to distinguish likely "true" from "false" near matches, although they may have the same calculated similarity. A final result shaping stage is also normally applied that further filters the result set passed from the species post-filter with the aim of further increasing precision (rejection of false hits), although in some circumstances the odd true hit may also be lost here, so the option to disable this step on request is also supported. One characteristic of TAXAMATCH is that currently it may take up to e.g. 1 second to process a single input name against a large reference database (e.g. 1m+ target names), which is probably fine for user web input or for checking a few thousand names as a batch run, but may be expensive for full scale internal deduplication purposes (e.g. comparing 1.4m names with each other in turn = 1.4m tests = 1.4m sec = 388 hours or 16 days approx.). Accordingly, a "rapid" mode is also supported that improves the efficiency by an order of perhaps 100 fold with almost no impact on recall of true near matches at species level (although the impact at genus level is severe); basically in this mode it is presemed that EITHER the genus or the species epithet is a phonetic match, a condition that is probably 99.9% true at species level although obviously it will fail for non-phonetic genus errors where no species epithet is available. This mode should therefore be used with caution, but clearly has a place especially for the potential deduplication of large scale species lists. Internally, in addition to the internal settings (rules and thresholds) employed in the pre- and post- filters etc., TAXAMATCH makes use of two custom fuzzy tests, a phonetic test called "near match" and a non-phonetic (actually: phonetic independent) test called "Modified Damerau-Levenshtein Distance" (MDLD) that returns a value for Edit Distance (ED) according to this criterion (which basically allows for transposed syllables in addition to the more "traditional" insertions, deletions, and single character transpositions). Both these accessory functions are supplied in the package along with the TAXAMATCH algorithm itself, along with other custom functions that are invoked, namely "normalize", "normalize_auth", and "ngram". A separate function, "compare_auth", provides a numeric assessment of authority similarity, where these are available for comparison; this is not employed as part of the near match function of TAXAMATCH (i.e., candidate matches will be returned irrespective of whether or not there is a good match on authority), however provides additional information that may be of value to the human user regarding (for example) which of several supplied candidate near matches may in fact be the intended one. Conceptually, one could consider adding this into the ranking procedure within TAXAMATCH but in practice, since there are well known cases of the same name and publication event being cited quite differently, probably it is best not to do so. For a full description of the functionality and reasoning employed in the TAXAMATCH internal logic, please refer to the TAXAMATCH developers' wiki which is accessible via the brief information page http://www.cmar.csiro.au/datacentre/biodiversity.htm#taxamatch , and/or an upcoming publication in the Biodiversity Informatics literature. A username and password is required for the wiki, but these can be allocated on request to Tony.Rees@csiro.au if you do not have one. REQUIREMENTS This demo package is currently configured to operate with Oracle version 10 onwards and the Oracle PL/SQL procedural language; other versions may be available in the future as contributed by interested parties. To install the package and 3 associated data tables supplied for test purposes, you will need an appropriate user account and tablespace in Oracle, that can probably be created for you by your local system administrator if you do not have these privileges, and within which you can create and edit tables, and compile the supplied PL/SQL package. It is also advantageous if you have a means to view (and even edit) actual table content in a pseudo-spreadsheet format; I use a link from the "live" Oracle tables to MS Access for this purpose (if you are unfamiliar with this process, see http://www.databasejournal.com/features/oracle/article.php/10893_3358411_2/Connecting-with-Oracle-Accessing-Oracle-via-Access-and-Excel.htm), which is perfectly adequate, and other utilities exist for the same purpose which I have seen in passing but not used myself. If you need or want to edit the PL/SQL package itself, any text editor is possible but my own preference is TextPad (http://www.textpad.com/) with its extension Syntax definitions for PL/SQL available from http://www.textpad.com/add-ons/synn2t.html . TextPad should be registered for (I think) US$10 but will work indefinitely in its unregistered version as well, just with the occasional reminder to register it. This demo version of TAXAMATCH is configured to receive input directly at procedural level (e.g. from commands in a SQL+ window) and return its result back to the same window, for confirmation that it is working properly and to illustrate its functionality against different queries and/or base data. In a production system, the procedure would most likely be called from elsewhere (e.g. another PL/SQL procedure or web input) and do something different with the results - e.g. write to a file or database table, call additional functions, and/or generate HTML content for display within a web page (e.g. search results). This level of reporting of the results from a TAXAMATCH operation is a matter for you as the database / application developer from that point, as is determining what tables to actually point the algorithm at, once you have understood its basic functionality and verified it is working satisfactorily in your environment. To see the web-accessible version of the author's reference implementation of TAXAMATCH against an extensive real world dataset, please visit the IRMNG (Interim Register of Marine and Nonmarine Genera) search entry point at http://www.cmar.csiro.au/datacentre/irmng/ . For present purposes the package is configured to address the following data tables (supplied, or can be created manually as per commands given below): - "genlist_test1" (genus info); primary key (PK) is genus_id - "splist_test1" (species info); PK is species_id, has genus_id as foreign key (FK)(= link to relevant row in genus table) - "auth_abbrev_test1" (authority abbreviations) - required by subsidiary function "normalize_auth" If authority comparisons are not required in practice, calls to "normalize_auth" can be disabled and the relevant function commented out, removing the need for third table in this case. The genus and species tables each have three additional columns containing pre-computed values, to speed up the TAXAMATCH algorithm operation (in practice it would be very slow without them, against any sizeable base dataset). In due course, to run the algorithm against your own base data, it will be necessary to add equivalent columns and computed content to your own tables as well. It is recognised that in some production systems, genus and species information may combined into a single table (in other words there is no separate table of genera, and the genus is simply held as the first part of the species name in every case). TAXAMATCH can be configured to work in this instance with minor alteration as desired, however for demo purposes we stick with the (conceptually cleaner) 2-table model, which allows (for example) supplementary information to be held at genus level, catering for genera with no currently associated species content, etc. etc. SUPPORTED INPUT In the version supplied, TAXAMATCH is configured to operate on: - genus alone (e.g. "Homo") - genus + species (e.g. "Homo sapiens") - genus + species + authority (e.g. "Homo sapiens L.", "Homo sapiens Linnaeus", "Homo sapiens Linnaeus, 1758", etc.). In other words, the following are not supported in this version (but could be, with re-coding as required): - abbreviated genera (e.g. "H. sapiens") - genus + authority (with no species) (e.g. "Homo Linnaeus, 1758) - infraspecies (e.g. "Homo sapiens sapiens"), other varieties, subspecies, formae etc. (trinomials and quadrinomials), and hybrids - "starts with" searches e.g. "Homo sap" or "Homo sap*" If a subgenus is supplied in brackets (e.g. "Barbatia (Mesocibota) bistrigata Dunker, 1866"), it will be stripped out, presuming that it is the second word in the string (later instances of brackets, e.g. in the authority portion, are unaffected), such that (in this example) "Barbatia (Mesocibota) bistrigata" will match against "Barbatia bistrigata". In this version as supplied, the author string is presumed to be anything following the first two "good" terms supplied after stripping of subgenera and various other content type in between the presumed genus and species such as "cf.", "aff.", stray HTML markup, non A-Z characters, etc. The latter operations are carried out within a subsidiary function "normalize" which in practice, can also be used as a precursor to exact matching if desired. TAXAMATCH OUTPUT In the demo version, the algorithm is configured to return the following by default: - genus exact matches (with authority as available) - genus phonetic matches (other than the above) - genus non-phonetic matches - species (i.e., genus + species epithet) exact matches (with authority as available) - species phonetic matches (other than the above) - species non-phonetic matches In the species case an ED (edit distance) value is returned for genus and species separately which is an indication of calculated edit distance where 0 = exact match (if necessary, after normalization) and values of 1-4 (characters) indicate increasing dissimilarity. Also in the species case, where both an input and a target authority is available, the calculated authority similarity is returned on a 0-1 scale, where 1 = identical (after expansion of known author abbreviations and minor normalization as required) and 0 = no similarity. In practice, author similarities of <0.3 may be considered unlikely matches, values in the range 0.3-0.5 have some similarity, and values over 0.5 tend to be candidate good matches. A "debug" mode is also available which, if invoked, will print the numbers of genera and species actually tested to the screen as well as the values of the externally supplied, and selected internal parameters used for the match process. This can be very useful for tracing problems, etc. but would not normally be made visible to external users. The reason for separating the phonetic from the non-phonetic matches is that the former are more consistently reliable than the latter (which may contain up to maybe 20% false hits), for ease of bulk review purposes, however in single use situations eg. user web interfaces, this distinction may be less important. For simplicity in this version I do not report genus ED separately (or genus authority similarity, since the latter is not suported as input here), however obviously the procedure can be modified to do this if desired. CONTENT OF THIS ZIP FILE - README file (this document) - Oracle dump (output) file taxamatch.dmp (will create the three tables genlist_test1, splist_test1 and auth_abbrev_test1 with all indexes and pre-loaded content, if you can accept that format) - Three MS Excel spreadsheets of sample data identical to that held in the Oracle dump version, for loading into tables genlist_test1, splist_test1 and auth_abbrev_test1 if you need to create these manually: - genlist_test1.xls - splist_test1.xls - auth_abbrev_test1.xls Note: The author abbreviations file is fairly complete (although can be extended as needed), while the genus and species files only contain a small sample of names for initial testing (around 40-80 names each), which would be anticipated to be replaced with your own data in due course. - The PL/SQL package ("taxamatch1.sql") which contains the TAXAMATCH algorithm itself plus associated functions, two of which ("near_match" and "normalize") are also required for generating value in the pre-computed fields following loading of any new data into the genus and species tables. SETTING UP (USING SAMPLE DATA SUPPLIED) This process should take no more than 10-15 minutes (or even less) if you are moderately experienced with Oracle. The TAXAMATCH package can either be installed into an existing or a new Oracle account on your server, with a password of your choice. The actual PL/SQL file is simply stored locally in some directory accessible to the Oracle account for compilation, e.g. on my server here it is in the directory \\strait-hf\obis\plsql\ . You should now log into the TAXAMATCH or other account on Oracle where, for the purpose of this demo it is necessary (for this implementation) to either import the supplied Oracle dump (which will create the first three tables ready populated, and all relevant indexes), or manually create the following 3 tables to hold the supplied content, via these commands: ----------- create table genlist_test1 ( GENUS_ID VARCHAR2(15) not null, GENUS VARCHAR2(50), AUTHORITY VARCHAR2(150), GEN_LENGTH NUMBER(2), NEAR_MATCH_GENUS VARCHAR2(50), SEARCH_GENUS_NAME VARCHAR2(50) ); create table splist_test1 ( SPECIES_ID VARCHAR2(20) not null, GENUS_ORIG VARCHAR2(50), SPECIES VARCHAR2(60), GENUS_ID VARCHAR2(20), AUTHORITY VARCHAR2(150), SP_LENGTH NUMBER(2), NEAR_MATCH_SPECIES VARCHAR2(60), SEARCH_SPECIES_NAME VARCHAR2(60) ); create table auth_abbrev_test1 ( AUTH_ABBR VARCHAR2(200) not null, AUTH_FULL VARCHAR2(200) ); REMARKS: (1) The above field lengths are indicative only, if they do not match your own future content they can be varied as required (2) If you want to use these tables as the basis for a production system, they can be extended with as many additional fields as may be required. For example I have some 10-20 additional fields in each of the equivalent tables in my "live" system, also a foreign key "family_id" in the genus table which links to a families table, etc. etc. (3) If you have no need for the authority matching, you can ignore creation of this table, however you will have to disable relevant references to it in the PL/SQL procedure in order to avoid compilation errors. For demo purposes, my recommendation would be to include it anyway, as the functionality is worth a look and may also be useful in the future if not directly (4) Of course in a production system, you may have different table names and/or column names for equivalent content; this is no problem, so long as you modify the references in the PL/SQL procedure so that it knows where to look for the relevant values. ---------------- The following global temporary tables are also required by the TAXAMATCH application and should be created manually at this point; they are initially empty and are used to hold intermediate resultsets only (in other environments they could perhaps be replaced with arrays, but only if the results can be then selected in any desired order): create global temporary table genus_id_matches ( GENUS_ID VARCHAR2(15) not null, GENUS VARCHAR2(100), GENUS_ED NUMBER(2), PHONETIC_FLAG VARCHAR2(1) ) on commit delete rows; create global temporary table species_id_matches ( SPECIES_ID VARCHAR2(20) not null, GENUS_SPECIES VARCHAR2(100), GENUS_ED NUMBER(2), SPECIES_ED NUMBER(2), GEN_SP_ED NUMBER(2), PHONETIC_FLAG VARCHAR2(1) ) on commit delete rows; For serious use (as soon as there are significant amounts of content) it is advisable to create indexes on selected table columns, to speed up the matching process (these indexes will already exist if you have imported the tables as an Oracle dump). If the table size is anticipated to become large (e.g. over say 50,000 species names) it is also recommended to create a separate tablespace for the indexes (e.g. called "taxamatch_idx"). In this case you would add an extra clause to the index creation statements below (example: "create unique index genlist_genid on genlist_test1(genus_id) tablespace taxamatch_idx"), othewise the indexes will simply share the same tablespace as the tables. Here are the suggested indexes to create, if you need to do so manually: create unique index genlist_genid on genlist_test1(genus_id); -- (this will also prevent the inadvertent creation of duplicate rows) create index genlist_genus on genlist_test1(genus); create index genlist_genlength on genlist_test1(gen_length); create index genlist_nearmatch on genlist_test1(near_match_genus); create index genlist_searchname on genlist_test1(search_genus_name); create unique index splist_spid on splist_test1(species_id); -- (this will also prevent the inadvertent creation of duplicate rows) create index splist_species on splist_test1(species); create index splist_genid on splist_test1(genus_id); create index splist_splength on splist_test1(sp_length); create index splist_nearmatch on splist_test1(near_match_species); create index splist_searchname on splist_test1(search_species_name); create unique index authors_abbr on auth_abbrev_test1(auth_abbr); -- (this will also prevent the inadvertent creation of duplicate rows) As stated above, if you are able to import the supplied Oracle dump directly, it should already include the creation of these indexes - however they will share the same tablespace as the data (which is fine for the demo, but may need to be changed for production use as mentioned above). ------------------ Once all the above is done, if you have created the tables manually you must next load in the test content supplied as excel spreadsheets (e.g. using a MS Access or similar view of the tables) - only if you have not imported the tables ready populated. You must next compile the PL/SQL package. To compile the package (e.g. on my system) from within the relevant Oracle account, I use a command such as "@\\strait-hf\obis\plsql\taxamatch1", where \\strait-hf\obis\plsql\ is the path to the directory where I have installed the actual (editable version of the) PL/SQL file. Probably there will be something equivalent for your system. ------------------ Now you can proceed to testing the package. The easiest way to do this is via an SQL+ window, and directly type (or copy) the input commands there. An alternative is to attach the package to a webserver and address it either as a HTTP GET or POST request, however that is outside the scope of this demo, and would would also require minor alteration of the reporting section to generate the required web output (e.g. via "HTP.PRINT" commands). ** IMPORTANT NOTE ** In order to test the package e.g. via an SQL+ window, it is necessary to type the following command so that the results are returned to that window: set serveroutput on; In addition, I normally set the timer as well, so that I can see how long processes take: set timing on; Basically the input string (as genus, genus+species, or genus+species+authority) can now be supplied to the package in the following form (e.g.): execute taxamatch1.taxamatch('Homo'); -- example for exact match execute taxamatch1.taxamatch('Hommo'); -- example for phonetic match execute taxamatch1.taxamatch('Hombo'); -- example for non-phonetic match execute taxamatch1.taxamatch('Hombo sapiens'); -- NB, 'Hombo+sapiens' is also an aceptable format execute taxamatch1.taxamatch('Hombo sapiens L.'); execute taxamatch1.taxamatch('Aluteres scriptus (Osbeck)'); The only slight non-obvious complication is that any ampersands (e.g. as supplied as part of an authority string) need to be quarantined from the remainder of the input string with delimiters, e.g. as follows: execute taxamatch1.taxamatch('Homo sapiens Smith '||'&'||' Jones'); To exercise the "rapid" or "no shaping" modes of TAXAMATCH (in addition to the default (=normal) mode, the procedure is addressed e.g. as follows (or can include a space following the comma, does not matter): * "RAPID" mode: execute taxamatch1.taxamatch('Homo sapient L.','rapid'); -- will work OK, as non-phonetic error confined to one word only execute taxamatch1.taxamatch('Hombo sapiens L.','rapid'); -- will work OK, as non-phonetic error confined to one word only execute taxamatch1.taxamatch('Hombo sapient L.','rapid'); -- will fail, as non-phonetic error in both words (?? rare) execute taxamatch1.taxamatch('Hommo','rapid'); -- will work, as phonetic error in genus, no species supplied (?? common) execute taxamatch1.taxamatch('Hombo','rapid'); -- will fail, as non-phonetic error in genus, no species supplied (?? common) * "NO SHAPING" mode: execute taxamatch1.taxamatch('Aluteres scriptus (Osbeck)','no_shaping'); -- returns 1 additional match cf. "normal" example given above * "DEBUG" mode (additional to the above): At any time, to inspect the internal parameters being generated during the procedure operation, a third parameter will trigger "debug" operation, as follows (e.g. "Y" for yes, but the actual character does not matter): execute taxamatch1.taxamatch('Hombo sapient L.',null,'Y') -- must supply either null (="normal"), or desired mode as second parameter execute taxamatch1.taxamatch('Hombo sapient L.','rapid','Y') If all is functioning correctly, with the sample data as supplied, you should then obtain results such as the following (if you do not see something similar to the example results, but the procedure says that it has successfully completed, check that you have entered "set serveroutput on" for this session). *********************************** - Example 1: correctly spelled name: ---------------------------------- SQL> execute taxamatch1.taxamatch('Homo sapiens L.'); --------- ** Input name: Homo sapiens L. ** --------- Genus exact matches: * Homo Linnaeus, 1758 (ID: mam1001982) --------- Genus phonetic matches: --------- Other genus near matches: --------- Species exact matches: * Homo sapiens Linnaeus, 1758 (ID: mam10001481) ED 0,0 auth. similarity=.7617 --------- Species phonetic matches: --------- Other species near matches: PL/SQL procedure successfully completed. Elapsed: 00:00:00.03 *********************************** - Example 2: incorrectly spelled name: ---------------------------------- SQL> execute taxamatch1.taxamatch('Hombo sapient L.'); --------- ** Input name: Hombo sapient L. ** --------- Genus exact matches: --------- Genus phonetic matches: * Hamba Distant, 1907 (ID: hex1000141) --------- Other genus near matches: * Homo Linnaeus, 1758 (ID: mam1001982) * Hombak Adanson, 1763 (ID: pln1042576) * Homea Fleming, 1822 (ID: pis1007493) --------- Species exact matches: --------- Species phonetic matches: --------- Other species near matches: * Homo sapiens Linnaeus, 1758 (ID: mam10001481) ED 1,1 auth. similarity=.7617 PL/SQL procedure successfully completed. Elapsed: 00:00:00.04 *********************************** - Example 3: incorrectly spelled name (multiple near matches in sample data supplied): ---------------------------------- SQL> execute taxamatch1.taxamatch('Aluteres scriptus (Osbeck)'); --------- ** Input name: Aluteres scriptus (Osbeck) ** --------- Genus exact matches: * Aluteres Lesson, 1831 (ID: pis1007158) --------- Genus phonetic matches: * Aluterus Cloquet, 1816 (ID: pis1007186) --------- Other genus near matches: * Alutera Oken (ex Cuvier), 1817 (ID: pis1007185) * Aluetera (ID: pis1007175) * Alurites Ruzhentsev & Bogoslovskaya, 1975 (ID: mol1008842) * Alutarius Agassiz, 1846 (ID: pis1007184) * Alutella (ID: cru1010772) --------- Species exact matches: --------- Species phonetic matches: * Aluterus scriptus (Osbeck, 1765) (ID: pis10050987) ED 1,0 auth. similarity=.6771 * Aluteres scripta (Osbeck, 1765) (ID: pis10050761) ED 0,2 auth. similarity=.6771 * Aluterus scripta (Osbeck, 1765) (ID: pis10050995) ED 1,2 auth. similarity=.6771 --------- Other species near matches: --------- (Additional ED 4 near matches are present, currently hidden by result shaping) PL/SQL procedure successfully completed. Elapsed: 00:00:00.07 *********************************** - Example 4: incorrectly spelled name (multiple near matches in sample data supplied), result shaping switched off: ---------------------------------- SQL> execute taxamatch1.taxamatch('Aluteres scriptus (Osbeck)','no_shaping'); --------- SQL> execute taxamatch1.taxamatch('Aluteres scriptus (Osbeck)','no_shaping'); --------- ** Input name: Aluteres scriptus (Osbeck) ** --------- Genus exact matches: * Aluteres Lesson, 1831 (ID: pis1007158) --------- Genus phonetic matches: * Aluterus Cloquet, 1816 (ID: pis1007186) --------- Other genus near matches: * Alutera Oken (ex Cuvier), 1817 (ID: pis1007185) * Aluetera (ID: pis1007175) * Alurites Ruzhentsev & Bogoslovskaya, 1975 (ID: mol1008842) * Alutarius Agassiz, 1846 (ID: pis1007184) * Alutella (ID: cru1010772) --------- Species exact matches: --------- Species phonetic matches: * Aluterus scriptus (Osbeck, 1765) (ID: pis10050987) ED 1,0 auth. similarity=.6771 * Aluteres scripta (Osbeck, 1765) (ID: pis10050761) ED 0,2 auth. similarity=.6771 * Aluterus scripta (Osbeck, 1765) (ID: pis10050995) ED 1,2 auth. similarity=.6771 --------- Other species near matches: * Alutera scripta (Osbeck, 1765) (ID: pis10050975) ED 2,2 auth. similarity=.6771 PL/SQL procedure successfully completed. Elapsed: 00:00:00.06 *********************************** - Example 5: TAXAMATCH "rapid" mode (successful in this instance, non-phonetic error in either genus or species epithet but not both) ---------------------------------- SQL> execute taxamatch1.taxamatch('Hombo sapiens','rapid'); --------- ** Input name: Hombo sapiens ** --------- Genus exact matches: --------- Genus phonetic matches: * Hamba Distant, 1907 (ID: hex1000141) --------- Other genus near matches: * Homo Linnaeus, 1758 (ID: mam1001982) --------- Species exact matches: --------- Species phonetic matches: --------- Other species near matches: * Homo sapiens Linnaeus, 1758 (ID: mam10001481) ED 1,0 PL/SQL procedure successfully completed. Elapsed: 00:00:00.04 *********************************** - Example 6: TAXAMATCH "rapid" mode (unsuccessful in this instance, non-phonetic error in both genus or species epithet) ---------------------------------- SQL> execute taxamatch1.taxamatch('Hombo sapient','rapid'); --------- ** Input name: Hombo sapient ** --------- Genus exact matches: --------- Genus phonetic matches: * Hamba Distant, 1907 (ID: hex1000141) --------- Other genus near matches: --------- Species exact matches: --------- Species phonetic matches: --------- Other species near matches: PL/SQL procedure successfully completed. Elapsed: 00:00:00.03 *********************************** - Example 7: debug mode: ---------------------------------- SQL> execute taxamatch1.taxamatch('Aluteres scriptus (Osbeck)',null,'Y'); --------- ** Input name: Aluteres scriptus (Osbeck) ** --------- Genus exact matches: * Aluteres Lesson, 1831 (ID: pis1007158) --------- Genus phonetic matches: * Aluterus Cloquet, 1816 (ID: pis1007186) --------- Other genus near matches: * Alutera Oken (ex Cuvier), 1817 (ID: pis1007185) * Aluetera (ID: pis1007175) * Alurites Ruzhentsev & Bogoslovskaya, 1975 (ID: mol1008842) * Alutarius Agassiz, 1846 (ID: pis1007184) * Alutella (ID: cru1010772) --------- Species exact matches: --------- Species phonetic matches: * Aluterus scriptus (Osbeck, 1765) (ID: pis10050987) ED 1,0 auth. similarity=.6771 * Aluteres scripta (Osbeck, 1765) (ID: pis10050761) ED 0,2 auth. similarity=.6771 * Aluterus scripta (Osbeck, 1765) (ID: pis10050995) ED 1,2 auth. similarity=.6771 --------- Other species near matches: --------- ########## --------- DEBUG INFO searchtxt: Aluteres scriptus (Osbeck) search_mode: debug: Y this_search_genus: ALUTERES this_search_species: SCRIPTUS this_authority: (Osbeck) this_near_match_genus: ALITIRIS this_genus_start: ALU this_genus_end: RES this_genus_length: 8 this_near_match_species: SCRIPTA this_species_length: 8 No. of genera tested: 15 "GENUS ID MATCHES" TABLE CONTENT: genus_id: pis1007158, genus: Aluteres, genus_ed: 0, phonetic_flag: Y genus_id: pis1007186, genus: Aluterus, genus_ed: 1, phonetic_flag: Y genus_id: pis1007185, genus: Alutera, genus_ed: 2, phonetic_flag: genus_id: pis1007175, genus: Aluetera, genus_ed: 3, phonetic_flag: genus_id: mol1008842, genus: Alurites, genus_ed: 3, phonetic_flag: genus_id: pis1007184, genus: Alutarius, genus_ed: 3, phonetic_flag: genus_id: cru1010772, genus: Alutella, genus_ed: 3, phonetic_flag: No. of species tested: 40 "SPECIES ID MATCHES" TABLE CONTENT: species_id: pis10050987, genus_species: Aluterus scriptus, genus_ed: 1, species_ed: 0, gen_sp_ed: 1, phonetic_flag: Y species_id: pis10050761, genus_species: Aluteres scripta, genus_ed: 0, species_ed: 2, gen_sp_ed: 2, phonetic_flag: Y species_id: pis10050995, genus_species: Aluterus scripta, genus_ed: 1, species_ed: 2, gen_sp_ed: 3, phonetic_flag: Y species_id: pis10050975, genus_species: Alutera scripta, genus_ed: 2, species_ed: 2, gen_sp_ed: 4, phonetic_flag: PL/SQL procedure successfully completed. Elapsed: 00:00:00.10 *********************************** If you inspect the actual species names held in the sample data, it should be simple to devise your own tests by misspelling portions of the name elements and thus put the TAXAMATCH algorithm through its paces. UPLOADING FURTHER CONTENT This should be simple to do following the principles shown by the sample content. It is important to note, however, that the instances of genus+species name (in the species table) should be correctly linked to the "parent" genus name instances via the correct genus_id identifiers and that there are no orphan names in the species table (if there are, then they will simply not be retrieved). There can be multiple entries of the same name in either table, so long as each has a unique identifier (genus_id or species_id as appropriate), this presents no problem (and is a real requirement in the case e.g. of genus homonyms, which might however have different authorities or other fields so as to be distinguishable in use). If any new content is loaded (or existing content altered), the only requirement is to keep the derived fields up-to-date. * For genera these are three: GEN_LENGTH, NEAR_MATCH_GENUS, and SEARCH_GENUS_NAME. You can refresh these as follows (either manually, or with a trigger), e.g.: update genlist_test1 set gen_length = length(genus) where gen_length != length(genus) or gen_length is null; update genlist_test1 set near_match_genus = taxamatch1.near_match(genus) where near_match_genus != taxamatch1.near_match(genus) or near_match_genus is null; update genlist_test1 set search_genus_name = taxamatch1.normalize(genus) where search_genus_name != taxamatch1.normalize(genus) or search_genus_name is null; * For species there are three equivalent fields: SP_LENGTH, NEAR_MATCH_SPECIES, and SEARCH_SPECIES_NAME. You can refresh these as follows (either manually, or with a trigger), e.g.: update splist_test1 set sp_length = length(species) where sp_length != length(species) or sp_length is null; update splist_test1 set near_match_species = taxamatch1.near_match(species, 'epithet_only') where near_match_species != taxamatch1.near_match(species, 'epithet_only') or near_match_species is null; update splist_test1 set search_species_name = taxamatch1.normalize(species) where search_species_name != taxamatch1.normalize(species) or search_species_name is null; MODIFYING THE SUPPLIED PACKAGE TO SUIT YOUR PARTICULAR NEEDS There will be 4+ (or possibly 5+) aspects to modify to suit accessing a different, pre-existing database schema: (1) Where these do not currently exist, create additional columns for GEN_LENGTH, NEAR_MATCH_GENUS, and SEARCH_GENUS_NAME (in the genus table), SP_LENGTH, NEAR_MATCH_SPECIES, and SEARCH_SPECIES_NAME (in the species table) and populate with correct values as described above (or these could all be within one table if there is no separate genus table) (2) As required, alter the names of tables and columns in the TAXAMATCH cursor select statements to match your actual names employed (3) As required, alter (e.g. extend) the range of ancillary information extracted from the table/s in the reporting section (4) Alter the procedure reporting section output to do whatever it is that you require (e.g. build a web page, write to a table or file, generate XML output, etc. etc.) (4+) As desired, disable the information comments regarding results hidden when result shaping is active. (5) If you are running off a single table for all genus+species information rather than 2 separate tables as per this demo package, modify the "select" statements in the cursors accordingly: e.g. select distinct genera for testing in the first pass, then test just species of these genera using the genus name as the cross-match, not genus_id. ... If you want TAXAMATCH to work on genus+author alone, or subgenera, infraspecies, etc., you will have to introduce further modifications and experimentation, on an as-needs basis... USING COMPONENT FUNCTIONS AS STANDALONE FUNCTIONS This is perfectly feasible, e.g. for test/demo/production purposes: SQL> select taxamatch1.normalize('Anabaena cf. flos-aquae Ralfs ex Born. et Flah.') from dual; TAXAMATCH1.NORMALIZE('ANABAENACF.FLOS-AQUAERALFSEXBORN.ETFLAH.') -------------------------------------------------------------------------------- ANABAENA FLOSAQUAE Ralfs ex Born. et Flah. SQL> select taxamatch1.normalize_auth('Ralfs ex Born. et Flah.') from dual; TAXAMATCH1.NORMALIZE_AUTH('RALFSEXBORN.ETFLAH.') ---------------------------------------------------------------------------- RALFS EX BORNET & FLAHAULT SQL> select taxamatch1.mdld('TONY','TOBY',4) from dual; TAXAMATCH1.MDLD('TONY','TOBY',4) -------------------------------- 1 SQL> select taxamatch1.mdld('DELICULATUS','DELICATULUS',4) from dual; TAXAMATCH1.MDLD('DELICULATUS','DELICATULUS',4) ---------------------------------------------- 2 SQL> select taxamatch1.ngram('TONY','TOBY',2) from dual; TAXAMATCH1.NGRAM('TONY','TOBY',2) --------------------------------- .6 SQL> select taxamatch1.ngram('TONY','TOBY',3) from dual; TAXAMATCH1.NGRAM('TONY','TOBY',3) --------------------------------- .5 SQL> select taxamatch1.compare_auth('Smith et Jones','Jones et Smith') from dual; TAXAMATCH1.COMPARE_AUTH('SMITHETJONES','JONESETSMITH') ------------------------------------------------------ .9111 SQL> select taxamatch1.compare_auth('Taylor','F.J.R. Taylor') from dual; TAXAMATCH1.COMPARE_AUTH('TAYLOR','F.J.R.TAYLOR') ------------------------------------------------ .5925 CONCLUDING REMARKS This demo package as been constructed with a few days of testing (but not exhaustively), based on a stripped-down version of the author's reference installation available at CMAR (web accessible at http://www.cmar.csiro.au/datacentre/irmng/index.html ), over a fairly extensive reference database of 1.4m+ species names and 238,000+ genus names. I will endeavour to fix any errors or bugs as reported, also welcome suggestions for improving any aspects of the code. Please feel free to contact the author on Tony.Rees@csiro.au or if needed, by phone (in Australia) 0362 325318, normal working hours. Thanks for your interest in TAXAMATCH and good luck! I also look forward to any contributions of the same code logic ported to other languages, etc., from other developers, preferably in conjunction with relevant posts on the TAXAMATCH developers' wiki.