MarLIN – a metadatabase for research data holdings at CSIRO Marine Research

- paper presented at the first Australasian Marine and Coastal Data Management Conference, Hobart, November 1998

Anthony J. J. Rees (1, 2) and Miroslaw M. Ryba (1, 3)

 

(1) Divisional Data Centre, CSIRO Marine Research, G.P.O. Box 1538, Hobart, Tasmania 7001

(2) e-mail: Tony.Rees@marine.csiro.au

(3) e-mail: Miroslaw.Ryba@marine.csiro.au

 

Abstract

MarLIN, the Marine Laboratories Information Network, has been constructed as the Divisional metadatabase for CSIRO Marine Research (CMR) over the period 1997-8. MarLIN comprises an Oracle relational database with a world wide web interface for searching (from any computer) and for metadata submission (by Divisional staff). The starting point for developing the software has been the "Green Pages" Environmental Data Directory at Environment Australia, which is used by permission. MarLIN includes a number of additional fields in order to incorporate metadata elements compatible with the national "Blue Pages" metadata directory for marine and coastal datasets, as well as others useful for searching or for in-house data management purposes. Particular examples of the latter include subject categories which are compatible with those used by ASFA (Aquatic Sciences and Fisheries Abstracts: FAO, Rome, 1983), and taxonomic codes from CAAB (Codes for Australian Aquatic Biota: CSIRO Marine Research, Hobart). Relational aspects of the database (MarLIN supporting tables) include keywords, defined regions, CAAB species codes, contact persons, organisations, and project and research voyage details. MarLIN is currently in version 1.0 and is in the process of being populated with metadata entries for datasets held at CMR.

 

Background

In the past few years, the use of metadata to describe data holdings by research organisations has gradually been accepted as a necessary tool for locating and describing datasets. Within Australia, initiatives such as the Ocean Rescue 2000 "Blue Pages" theme directory for marine and coastal datasets in Australia (AODC, 1996) as well as the Australia New Zealand Land Information Council (ANZLIC)’s developing regional standard for geospatial metadata (ANZLIC, 1996-8) have stimulated organisations such as CSIRO Marine Research and its predecessors (CSIRO Divisions of Fisheries and Oceanography) to start to collect metadata according to a local standard, in our case using metadata elements and terminology compliant with the ANZLIC metadata standard and the "Blue Pages". Formation of the new CSIRO Division of Marine Research (CMR) in 1997, and with it a Divisional Data Centre responsible for developing Divisional data management policies, highlighted the need in our Division for an in-house metadatabase which would be capable of exporting metadata to external systems such as the "Blue Pages", while serving the additional internal data management needs of the Division.

Accordingly, in 1997 we started to develop a specification for an in-house metadatabase, to be termed the Marine Laboratories Information Network or "MarLIN". After deciding that the system should be based on our existing capability to run an Oracle relational database, we were fortunate in obtaining permission from Environment Australia to use their pre-existing "Environmental Data Directory" (EDD or "Green Pages") software as the starting point for MarLIN. Among other aspects, this software included routines to construct "on the fly" HTML metadata pages using Oracle Web Agent, reflecting the latest information in the database at any time.

 

Metadata elements in MarLIN

The "Green Pages" application, used as the basis for developing MarLIN, contains metadata fields which are based on the core metadata elements proposed by ANZLIC, plus certain other fields considered useful to its own organisation (Environment Australia) (see References - Online links). We have reviewed these pre-existing elements and retained many of them in either modified or unmodified form, and then added additional elements for three purposes: first, for compatibility with version 1.0 of the national "Blue Pages" data directory; second, to hold additional information relevant to our Division’s activities; and third, to facilitate searching, metadata entry, and database administration. New or pre-existing elements of the "Green Pages" have also been structured so that supporting tables are used wherever it is efficient to do so, for example to store information about contact persons, organisations, vessel (platform) names, and details of research projects and voyages of Divisional research vessels, defined regions and their bounding coordinates, and species codes and names.

As at November 1998, version 1.0 of MarLIN contains the following metadata elements. Elements marked * are included to facilitate compliance with "Blue Pages", version 1.0, and the present ANZLIC metadata standard. The actual number of database fields exceeds the number of metadata elements, since some elements are constructed from more than one database field: for example, the element "voyage start date" comprises three separate database fields (voyage start day, voyage start month, voyage start year), so that these can be populated at separate times as the relevant information comes to hand. In addition, there are a number of internal database fields (for example, those connected with database function or administration) which do not always return visible metadata to the user.
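As an illustration of an element built from several fields, the following minimal sketch assembles a display date from separately stored day, month and year values, any of which may still be empty. The function name and the year-first output format are assumptions made for this example and are not taken from the MarLIN source code.

```python
from typing import Optional


def format_partial_date(day: Optional[int], month: Optional[int], year: Optional[int]) -> str:
    """Build a display string from whichever date parts have been filled in.

    Hypothetical helper: parts are shown only down to the finest level
    that is populated without gaps (year, then month, then day).
    """
    if year is None:
        return ""  # without a year there is nothing meaningful to show
    if month is None:
        return f"{year:04d}"
    if day is None:
        return f"{year:04d}-{month:02d}"
    return f"{year:04d}-{month:02d}-{day:02d}"
```

Storing the parts separately means a record can carry "1998" or "1998-11" until the exact day is known, which a single date column would not allow.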

 

MarLIN metadata element | Comments
MarLIN record ("txtsession") number… | Unique identifier for metadata record - automatically assigned on record creation. Internal field only (does not appear in "Search" or "Edit" interfaces).
*Title |
Short Title/Identifier | Used for quick searching, or to hold pre-existing identifiers such as CD-ROM serial numbers
*Data Type | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
*Originator Organisation | Selected from pre-defined list (IDs stored in record, descriptions in separate table). Corresponds to ANZLIC/Blue Pages element "Custodian"
Local Custodian | Selected from pre-defined list (IDs stored in record, descriptions in separate table). Used when the holder of the local copy is not responsible for maintaining and distributing the master copy of the dataset
*Project Name | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
*Project Leader 1 | Automatically loaded from "Projects" table
Project Organisation 1 | Automatically loaded from "Projects" table
Project Leader 2 (if applicable) | Automatically loaded from "Projects" table
Project Organisation 2 (if applicable) | Automatically loaded from "Projects" table
Project Leader 3 (if applicable) | Automatically loaded from "Projects" table
Project Organisation 3 (if applicable) | Automatically loaded from "Projects" table
Project Start Year | Automatically loaded from "Projects" table
Project End Year | Automatically loaded from "Projects" table
Project Description | Automatically loaded from "Projects" table
Project Extended Description | Not uploaded to metadata record, but available for separate query if needed
*Platform Name | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
Platform Description | Not uploaded to metadata record, but available for separate query if needed
Platform Type | Not uploaded to metadata record, but used in "Edit" interface
Voyage Identifier | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
Voyage Region | Automatically loaded from "Voyages" table
Voyage Leader Name | Automatically loaded from "Voyages" table
Voyage Start Date | Automatically loaded from "Voyages" table
Voyage End Date | Automatically loaded from "Voyages" table
Voyage Brief Description | Automatically loaded from "Voyages" table
Voyage Extended Description | Not uploaded to metadata record, but available for separate query if needed
Online Link to Voyage Track (GIF image) | Automatically loaded from "Voyages" table (as available)
Additional Platform/Voyage Information | Text field for platform or voyage not in pre-defined lists
Publication Date | 3 separate fields (day, month, year)
Contributors |
Acknowledgements |
*References |
*Contact Person | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
*Contact Organisation | Automatically loaded from "Contact Persons" table
*Contact Address, Phone, Fax, E-mail etc. | Automatically loaded from "Contact Persons" table
*Abstract |
Metadata Author's Comments | Does not appear except in "Edit" interface
MarLIN Subject Categories | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
Additional Subject Information | Text field for subjects not in pre-defined list
*Theme Keywords (Blue Pages Themes) | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
ANZLIC Search Words | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
*Habitat Keywords | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
*Taxonomic Keywords | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
CAAB Species Codes | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
CAAB Species Scientific Names | Automatically loaded from "CAAB" table
CAAB Species Common Names | Automatically loaded from "CAAB" table
Additional Taxonomy | Text field for species not in CAAB species codes table
*Parameters | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
Additional Parameters | Text field for items not in pre-defined "parameters" list
*Equipment | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
Additional Equipment | Text field for items not in pre-defined "equipment" list
*Location Keywords (Defined Regions) | Selected from pre-defined list (IDs stored in record, descriptions in separate table); also links to coordinates for defined regions on data entry and searching
*North Bounding Coordinate | Loaded automatically from "Defined regions" table, then user-editable
*South Bounding Coordinate | As above
*West Bounding Coordinate | As above
*East Bounding Coordinate | As above
(*)Documentation Link/s | 3 fields (Link URL, Link Description, Access Flag). Would be included in "On-line Links" on export to "Blue Pages".
(*)Data Link/s | 3 fields (Link URL, Link Description, Access Flag). Would be included in "On-line Links" on export to "Blue Pages".
(*)Graphic Link/s | 3 fields (Link URL, Link Description, Access Flag). Would be included in "On-line Links" on export to "Blue Pages".
*Beginning Date | 3 separate fields (day, month, year)
*Ending Date | 3 separate fields (day, month, year)
*Progress | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
*Maintenance | Selected from pre-defined list (IDs stored in record, descriptions in separate table)
*Stored Data Format | Selected from pre-defined list (IDs stored in record, descriptions in separate table), plus additional field for free-text extension
Additional Stored Format Information | Text field for additional stored format details
Stored Data Volume |
Stored Data Location (general) |
Stored Data Location (specific) |
Stored Data Documentation |
Software/Hardware Requirements |
*Available Data Format | Selected from pre-defined list (IDs stored in record, descriptions in separate table), plus additional field for free-text extension
Additional Available Format Information | Text field for additional available format details
*Access Constraints |
GIS Datum | Select from pre-defined datum list (IDs stored in record, descriptions in separate table), plus additional field for free-text extension
GIS Scale | 3 fields (Scale denominator, or cell size plus cell units)
*Data Source, Processing and Quality Control | Corresponds to ANZLIC/Blue Pages field "Lineage"
*Logical Consistency |
*Positional Accuracy |
*Parameter Accuracy |
*Completeness |
*Additional Metadata |
Related MarLIN Datasets |
*Metadata Created By...(Person/Group) | Automatically generated on record creation
*Metadata Creation Date | Automatically generated on record creation
*Metadata Availability | 3 user-allocated states (external/internal/restricted)
Metadata Editable By...(Person/Group) | Automatically generated on record creation, may be reassigned to another user by database administrator as needed
Metadata Last Updated By...(Person/Group) | Automatically generated on record update
Metadata Last Update Date | Automatically generated on record update
 

MarLIN keywords and subject categories are hierarchically defined so that each keyword may be a "parent" or a "child" of other keywords. Searching on a "parent" is configured to automatically search on any "children" that keyword may possess.
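The parent/child behaviour described above can be sketched as a simple traversal of the keyword hierarchy. This is a minimal illustration only: the keyword names, the `CHILDREN` mapping and the function name are hypothetical, and the real system holds its hierarchy in Oracle tables rather than in a Python dictionary.

```python
from collections import deque

# Hypothetical parent -> children links; the keyword names are illustrative only.
CHILDREN = {
    "Fish": ["Sharks", "Bony fishes"],
    "Bony fishes": ["Tunas"],
}


def expand_keyword(keyword, children=CHILDREN):
    """Return the keyword followed by all of its descendants (breadth-first)."""
    found = [keyword]
    queue = deque([keyword])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in found:  # a keyword may be reachable by more than one route
                found.append(child)
                queue.append(child)
    return found
```

A search on "Fish" would then be run against every term in the expanded list, so that a record tagged only with "Tunas" is still returned.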

One particular area which exercised us in developing MarLIN concerned the use of subject categories to which datasets could be assigned. This was considered an essential aspect of the database, only partly catered for by the pre-existing ANZLIC "Search Words" and Blue Pages "Themes" (which have been retained separately to ensure compliance with these systems). In essence, the ANZLIC search words and the Blue Pages themes represent two approaches to assigning subject categories to datasets, but in the case of the ANZLIC "search words" this is done at a fairly high level (e.g. "Oceanography – Physical" or "Oceanography – Chemical"), while a number of the Blue Pages "Themes" are closer to keywords and differ little from keywords now used separately (e.g. those covering habitat, taxonomy, equipment, etc.). In addition, the current list of Blue Pages themes is not particularly rigorous or structured in its coverage of marine science topics.

Therefore, for MarLIN we decided to introduce a comprehensive set of subject categories to encompass the general area of interest of the Division’s operation, which would be descriptive rather than attempt to be a list of keywords as such – the latter being already covered by the various "keywords" options. In addition, it was felt valuable to base our subject categories on an accepted, pre-existing scheme rather than develop a new scheme which would be incompatible with any other. After consideration of various possible contenders, for example the Library of Congress subject headings from the USA (Library of Congress, 1997), and the "Topics" and "Terms" used in NASA’s "Global Change Master Directory" (NASA, 1997), it was decided to incorporate subject categories based on those developed by ASFIS (Aquatic Sciences and Fisheries Information System) for their ASFA-1, ASFA-2, ASFA-3 and ASFA-4 bibliographic databases (CSA, 1998). These correspond well with most areas of interest of the Division, and, in addition, have been created in a well structured manner with detailed accompanying scope notes (FAO, 1983). The ASFA subject categories would already be familiar to many searchers for information on marine science through the presence of ASFA bibliographic databases (in print, CD-ROM or on-line form) in most marine science libraries.

Accordingly, MarLIN subject categories have been defined so as to parallel those set up by ASFIS, while making minor adjustments to terminology for the sake of conciseness and/or clarity. In some instances, "duplicate" ASFA categories from the various ASFA products have been combined into a single category for MarLIN, as is the case with the MarLIN subject category "2141/1141. Descriptive oceanography and limnology", which is equivalent to both "Descriptive Oceanography and Limnology: General-2141" from ASFA-2 and "The Physical Environment: General-1141" from ASFA-1, since these two categories are effectively the same. In a few cases, subsets of previous categories have also been set up where necessary: for example, the MarLIN category "2243a. Radiation and temperature" has been created as a subset of "2243. Atmospheric physical properties – general", which corresponds to the ASFA-2 category "Marine Meteorology and Climatology: Structure, mechanics and thermodynamics-2243", even though the subject "radiation and temperature" is not treated separately under the ASFA subject classification scheme.

Prefixing the MarLIN subject category name with the equivalent ASFA category number(s) is intended to make it easy to transfer a search on any defined subject area between the two systems as required. It also leaves open the possibility of constructing, at a future point, an interface capable of automated linked searches of the two databases.
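To show how the numeric prefix could support transfer of a search between the two systems, the sketch below extracts the ASFA category number(s) from a MarLIN category name. The parsing rules are assumptions inferred from the two examples quoted above (numbers separated by "/", an optional lower-case letter marking a MarLIN-only subset, and ". " before the label); the function name is hypothetical.

```python
import re


def asfa_codes(category_name):
    """Return the ASFA category numbers encoded in a MarLIN category name.

    Assumed convention: "<number>[letter][/<number>[letter]...]. <label>".
    A trailing letter (as in "2243a") is dropped, so a MarLIN subset maps
    back to the ASFA category it was derived from.
    """
    prefix, separator, _label = category_name.partition(". ")
    if not separator:
        raise ValueError(f"no '. ' separator in {category_name!r}")
    codes = []
    for part in prefix.split("/"):
        match = re.fullmatch(r"(\d+)[a-z]?", part.strip())
        if match is None:
            raise ValueError(f"unrecognised category prefix: {part!r}")
        codes.append(match.group(1))
    return codes
```

For the combined category "2141/1141. Descriptive oceanography and limnology" this yields both ASFA numbers, which could then be used as search terms in an ASFA bibliographic database.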

Individual MarLIN subject categories have also been assigned to one of eight broad "subject areas" for more rapid searching at a high level. These are presently as follows:

Once again, these broad "subject areas" parallel most of the major groupings which will already be familiar to users acquainted with the ASFA subject categories from bibliographic searching.

Species-level taxonomic information in MarLIN is handled by incorporating CAAB ("Codes for Australian Aquatic Biota") numeric species codes (Yearsley et al. 1997) into the metadata record at time of data entry, and then using the numeric codes to retrieve associated species names when the metadata record is generated for the user. Up-to-date versions of the codes and the most recent taxon names are held in MarLIN supporting tables, copied from the relevant entries in the CAAB database (which resides on a separate system at CMR).
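The store-the-code, look-up-the-name approach can be sketched as follows. The codes and names in `CAAB_TABLE` are placeholders invented for this example and are not real CAAB entries; the dictionary stands in for the MarLIN supporting table, and the function name is hypothetical.

```python
# Placeholder supporting table: these codes and names are invented for
# illustration and do not correspond to real CAAB entries.
CAAB_TABLE = {
    "00000001": {"scientific": "Examplea prima", "common": "first example fish"},
    "00000002": {"scientific": "Examplea secunda", "common": "second example fish"},
}


def resolve_species(record_codes, caab_table=CAAB_TABLE):
    """Look up current names for the species codes stored in a metadata record.

    Returns (code, scientific name, common name) tuples; codes missing from
    the supporting table are returned with None for both names.
    """
    resolved = []
    for code in record_codes:
        entry = caab_table.get(code)
        if entry is None:
            resolved.append((code, None, None))
        else:
            resolved.append((code, entry["scientific"], entry["common"]))
    return resolved
```

Because the record holds only the numeric code, a taxonomic name revision needs to be made once in the supporting table and is picked up the next time any record citing that code is displayed.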

 

MarLIN user interfaces

MarLIN presently features a "Search" mode in which the user can select a full "user-defined search", a variety of "list" functions, or a JAVA search. The user-defined search is carried out in three stages: first a geographic module where it is possible to set criteria for spatial searching, then selected high-level criteria such as setting the search based on title text, organisations, projects, or broad subject area; and finally, an (optional) page of more specific criteria such as keywords, individual subject categories, CAAB species codes, and other variables. The information presented in this final part of the "user-defined search" is custom-generated according to the subject areas selected in the previous screen.
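For the geographic stage of the search, one common way to compare a search region with the bounding coordinates stored in each record is a rectangle overlap test. The sketch below is an assumption about how such a test might look, not a description of the actual MarLIN query; the class and function names are hypothetical, and boxes crossing the 180° meridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """North/south/west/east limits in decimal degrees (south and west negative)."""
    north: float
    south: float
    west: float
    east: float


def boxes_overlap(a: BoundingBox, b: BoundingBox) -> bool:
    """Return True when the two boxes share any area or touch along an edge.

    Assumes neither box crosses the 180-degree meridian.
    """
    return (
        a.south <= b.north
        and b.south <= a.north
        and a.west <= b.east
        and b.west <= a.east
    )
```

A record would pass the first search stage when its box overlaps the user's search box; the later stages would then narrow the result by title, organisation, subject area and keywords.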

The various "list" functions allow the user to list the contents of the database in batches of 40 entries, either sorted alphabetically by title, or sorted by date of creation (latest to earliest, to see new additions to the database) or by date of update.
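The batching behaviour of the "list" functions can be illustrated with a short sketch. The record layout (dictionaries with "title", "created" and "updated" keys, dates held as ISO-format strings) and the function name are assumptions for this example; listing by update date is assumed to run latest-first, like the creation-date listing.

```python
BATCH_SIZE = 40  # batch size quoted in the text


def list_batch(records, sort_by="title", batch_number=0, batch_size=BATCH_SIZE):
    """Return one batch of records in the requested order.

    sort_by="title" sorts alphabetically (case-insensitive);
    sort_by="created" or "updated" sorts latest-first. Dates are assumed
    to be ISO-format strings (YYYY-MM-DD), which sort chronologically.
    """
    if sort_by == "title":
        ordered = sorted(records, key=lambda r: r["title"].lower())
    elif sort_by in ("created", "updated"):
        ordered = sorted(records, key=lambda r: r[sort_by], reverse=True)
    else:
        raise ValueError(f"unsupported sort key: {sort_by!r}")
    start = batch_number * batch_size
    return ordered[start:start + batch_size]
```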

The JAVA search employs a JAVA applet which, once loaded to the user’s web browser, allows the searcher to specify a region on an interactive map (if desired), and also interrogate the various lists of keywords within the JAVA environment without waiting for HTML screen re-draws. Lists of titles in the database are also accessible directly from this interface together with the ability to search interactively for particular text strings in the Title field, all within the JAVA applet environment prior to searching the database. Much of the functionality of this applet was already present in the "Green Pages" version, but it has been customised and extended for "MarLIN" operation.

The MarLIN "Edit" interface is an HTML forms interface generated via Oracle Web Agent from a mixture of pre-defined text and current values in the various supporting tables such as keywords, project names, contact person names, CAAB species codes, etc. Alternatives are presented to the user to either include or exclude "Biological Options" such as Biological subject categories, parameters, plus the species codes, taxonomy and habitat keywords, since if they are not relevant, exclusion of these elements of the interface results in faster screen redraws of the HTML page in response to various user actions. These include saving information to the database, or using one of the available buttons to visit additional options such as adding extra links, submitting new project details, inserting defined region coordinates or CAAB codes to the record, etc.

More details of these interfaces can be found by visiting either the MarLIN database or the MarLIN demonstration pages at the locations given under "References - Online links".

 

MarLIN in operation

MarLIN is presently set up to allow access via the world wide web (WWW) to any user, internal or external, for on-line searching, with access for metadata submission or editing restricted to Divisional (CMR) users. Metadata access is determined by a user-defined flag, set to "external", "internal" or "restricted", which determines which dataset descriptions are visible either to any searcher, to users on the CMR computer domain only, or to only those users entering a specific user ID and password, respectively. In accordance with the original design for the "Green Pages" database, new entries created by Divisional users (or user modifications to existing entries) are stored in a set of "temporary" tables which mirror the "permanent" tables where the approved records reside, until their content is checked, amended if necessary, and approved by a database administrator. This method ensures that data in the permanent tables cannot be accidentally or purposely modified or deleted by users, only by specific actions of the database administrator, thus ensuring a degree of quality control and security independent of that which individual users may apply.
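The two controls described above, the availability flag and the temporary/permanent table split, can be sketched together. This is a simplified model under stated assumptions: dictionaries stand in for the Oracle tables, and the names `visible_to`, `MetadataStore`, `submit` and `approve` are hypothetical.

```python
def visible_to(flag, on_cmr_domain=False, authenticated=False):
    """Decide whether a record with the given availability flag may be shown."""
    if flag == "external":
        return True  # visible to any searcher
    if flag == "internal":
        return on_cmr_domain  # visible only from the CMR computer domain
    if flag == "restricted":
        return authenticated  # visible only after user ID and password entry
    raise ValueError(f"unknown availability flag: {flag!r}")


class MetadataStore:
    """Toy model of the temporary and permanent table sets."""

    def __init__(self):
        self.permanent = {}  # approved records, changed only by the administrator
        self.temporary = {}  # user submissions and edits awaiting approval

    def submit(self, record_id, record):
        """User action: writes to the temporary area only."""
        self.temporary[record_id] = dict(record)

    def approve(self, record_id, amendments=None):
        """Administrator action: optionally amend, then move to the permanent area."""
        record = self.temporary.pop(record_id)
        if amendments:
            record.update(amendments)
        self.permanent[record_id] = record
        return record
```

Because `submit` never touches `permanent`, a user edit cannot alter an approved record until the administrator calls `approve`.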

 

Database population and ongoing development

MarLIN is a relatively new system, and as such is still in the process of being populated with entries regarding Divisional data. Currently it contains some 500 entries, compared with the several thousand which it may reasonably be expected to contain in the future. Completion of these initial entries has been a valuable "testing exercise" during which aspects of the database design and the user interfaces have been modified as necessary, reflecting the metadata requirements of particular data types held within the Division. At time of writing (November 1998), MarLIN is considered to be reasonably stable as version 1.0, and minor changes to the database structure will be made only if they represent an enhancement necessary or desirable for continued operation of the database, in response to user or administration needs. However, a full review of the software and database content will be carried out after approximately 12 months of operation, at which time changes may be made to the source code, methods of administration, or user interfaces in response either to new external factors (such as revised metadata standards with which we should like MarLIN to remain compliant), or to any other desired features not currently incorporated in version 1.0.

 

Discussion

MarLIN comprises a relational database structure capable of generating HTML pages "on the fly" which reflect the latest information in the database. This contrasts with the static, text-based metadata pages used in the "Blue Pages" and some other metadata systems, whereby the HTML pages and/or an underlying system of SGML (ASCII) files reside continuously on one or more servers. The latter concept has certain advantages – for example, the HTML pages exist as (semi-) permanent documents which can be passed to a central collation point or other systems, and can also be indexed by WWW search engines – but has a higher overhead for maintenance than a relational system, since a single change to one supporting table in a relational database, which is then propagated automatically to all linked records, may require a large number of text changes in SGML- or HTML-based records to produce the same result. Another disadvantage of the text-based scenario is that where copies of records are exported to a number of sites or systems, there is no guarantee that all copies of the record will be maintained or updated simultaneously, with the possible result that copies of the same metadata circulate in different versions, only one of which is current. From this point of view alone, a search tool which interrogates the actual data residing in an organisation’s database is likely to be of greater value than one which relies on text-based entries (where these are a secondary product of a database), unless the "static" records are updated very frequently in some automated way and distribution of copies is also tightly controlled.
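The maintenance argument can be made concrete with a small sketch using an in-memory SQLite database in place of Oracle. The table layout, column names and sample values are invented for this example; only the principle (one update to a supporting table, reflected in every page generated afterwards) is taken from the text.

```python
import html
import sqlite3


def build_demo_db():
    """Create a two-table demo: records refer to contacts by ID only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, phone TEXT);
        CREATE TABLE records (
            id INTEGER PRIMARY KEY,
            title TEXT,
            contact_id INTEGER REFERENCES contacts(id)
        );
        INSERT INTO contacts VALUES (1, 'Example Contact', '000 0000');
        INSERT INTO records VALUES (1, 'Dataset A', 1);
        INSERT INTO records VALUES (2, 'Dataset B', 1);
        """
    )
    return conn


def render_pages(conn):
    """Generate one HTML fragment per record from the current table contents."""
    rows = conn.execute(
        """
        SELECT r.title, c.name, c.phone
        FROM records r JOIN contacts c ON c.id = r.contact_id
        ORDER BY r.id
        """
    )
    return [
        f"<h1>{html.escape(title)}</h1>"
        f"<p>Contact: {html.escape(name)}, {html.escape(phone)}</p>"
        for title, name, phone in rows
    ]
```

Running a single `UPDATE contacts SET phone = ... WHERE id = 1` and calling `render_pages` again changes the contact details shown for both datasets, whereas static pages would each need to be edited or regenerated.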

Most of the metadata elements in MarLIN fit fairly closely within the ANZLIC concept of metadata "pages", "Page 0" elements representing the core nationally-compliant metadata, "Page 1" elements those specified by theme- or jurisdictional-based directories (in our case, the relevant example being the "Blue Pages"), and "Page 3" elements being additional fields relevant to a particular organisation’s needs (ANZLIC, 1996-8). The one principal area in which we have found it necessary to depart from standards supported by the ANZLIC guidelines and the "Blue Pages" specification is in the use of our own subject categories, since both the ANZLIC "Search Words" and the Blue Pages "Themes" were felt to be inadequate for a structured and comprehensive coverage of marine and aquatic science. However, the system adopted (subject categories based on those used by ASFIS for the ASFA bibliographic databases) is not without its own limitations, since (being designed for aquatic and fisheries science) it does not cover non-aquatic areas such as land-based surveys (e.g. of topography, groundwater, vegetation or meteorology). For these areas, the ANZLIC search words (which have been retained for compliance with ANZLIC metadata guidelines) are likely to be adequate for the time being, although ideally it would be advisable to extend MarLIN subject categories to include selected non-aquatic subject matter also. This aspect will be revisited during the design process for version 2 of MarLIN. Nevertheless, we feel that at this time, the ASFA categories are still the best model to use in terms of their structure and well-documented basis for application.

The broader metadata environment within which MarLIN, Blue Pages and the ANZLIC metadata guidelines all reside is not static, as each of the latter systems will be subject to periodic revisions and, in addition, there may be an international (ISO) standard for metadata within the next few years. Thus, it may be desirable to adopt an alternative scheme of subject categories – and other fields also – if and when such an international standard becomes available. Alternatively, an international standard may permit easy mapping between one subject scheme and another, of which the ASFA scheme could conceivably be a part. In any event, amending the MarLIN source code to permit more easily maintainable database fields would be a definite advantage for facilitating ongoing development of the database, and this will be one of the aspects to be considered in detail for constructing MarLIN version 2 in some 12 months’ time.

 

Acknowledgements

MarLIN has been developed by Kim Finney, Tony Rees and Miroslaw Ryba at CSIRO Marine Research in Hobart. Paul Smith and Pamela Brodie provided additional programming assistance with the JAVA search applet. The contribution of Angela Way (CMR, Marmion) is appreciated for creation of many of the MarLIN metadata records to date. The MarLIN application has been developed from "Green Pages" software supplied courtesy of Environment Australia in Canberra, which was written by Computer Power Pty Ltd. for the Environmental Resources Information Network.

 

References - Documentation

Australia New Zealand Land Information Council (ANZLIC) (1996-8). Core Metadata Elements for Land and Geographic Directories in Australia and New Zealand. Digital document, available online at http://www.anzlic.org.au/download.htm (as at November 1998)

Australian Oceanographic Data Centre (AODC) (1996). The marine and coastal data directory of Australia – the Blue Pages, Version 1.0 [Documentation supplied to users].

Cambridge Scientific Abstracts (CSA) (c.1998). Database Classification Codes: ASFA and Oceanic Abstracts. WWW document, available on-line at http://www.csa.com/helpV3/classcod.html (as at November 1998)

Food and Agriculture Organisation (FAO): Fishery Information, Data and Statistics Service (1983). ASFIS-2: ASFIS subject categories and scope descriptions. Rome: FAO, 59p.

Library of Congress (1997). Outline of the Library of Congress Classification. WWW document, available on-line at http://lcweb.loc.gov/catdir/cpso/lcco/lcco.html (as at November 1998)

NASA (c.1997). GCMD Parameter Valids. WWW document, available on-line at http://gcmd.gsfc.nasa.gov/cgi-bin/mduser_dir/earth_sci_keywords (as at November 1998)

Yearsley, G. K., Last, P. R. and Morris, G. B. (1997). Codes for Australian aquatic biota (CAAB): an upgraded and expanded species coding system for Australian fisheries databases. Report, Marine Laboratories, CSIRO Australia (224), 120pp. approx.

 

References - Online links (as at November 1998)

The MarLIN database is accessible at: http://www.marine.csiro.au/dmr/database/marlin/

MarLIN description/demonstration pages are at: http://www.marine.csiro.au/datacentre/examples/marlin_demo/

The Environment Australia "Green Pages" Environmental Data Directory is accessible at: http://www.erin.gov.au/database/edd/

The "Blue Pages" Marine and Coastal Data Directory of Australia is accessible at: http://www.erin.gov.au/marine/mcdd/ (Environment Australia, Canberra) and http://www.marine.csiro.au/marine/mcdd/ (CSIRO Marine Research, Hobart)