Why is the ALA updating our infrastructure?

After 10 years of active development, the ALA’s core data processing infrastructure is approaching ‘end of life’. The Infrastructure Upgrade Project will enable the ALA to adapt to changing data and user requirements heading into our next decade of operation.

Phase 1 of the project has focussed on how we store, process and provide access to occurrence records. We have replaced many of our aged components with software developed by the Global Biodiversity Information Facility (GBIF), but with some adaptations for the ALA’s specific requirements. This will make our records more consistent with GBIF processing and reduce errors found in aggregating records from multiple sources. In addition, the new infrastructure will be fit-for-purpose, more maintainable and able to adapt to our future needs as Australia’s national biodiversity data infrastructure. These improvements will enable the ALA to manage increases in data volume and different types of data.  

What systems are affected?

The current phase of the project replaces our occurrence data store, processing pipeline and index.

The way people give us data, how we process and aggregate it, and how people search for it have changed. However, we have tried to make those changes relatively seamless for our users.

To begin with, we have changed how we load data into the processing pipeline. We now use Darwin Core Archives (DwCA) to perform this step. This allows us to better monitor the number of records processed in a dataset and ensure validity. It also allows us to recreate what data was available at a given time and to deal more transparently with any errors or problems that arise in loads. We have been in contact with our regular data contributors about these changes to ensure they can still supply their data to us, and will continue to work with smaller, ad hoc contributors to bring their data into the Atlas.

Some of the fields used for processing our data have changed to be more consistent with the Darwin Core Standard and GBIF. Where a field name has changed but the field still holds the same data, we have mapped the old name to the new one, and queries using the old field names will continue to work. We identified some processing steps that were unreliable or used old datasets. Most of these have been updated, but where it was impractical to update a process, some fields have been removed.

When processing data we also try to identify any issues that may impact the use of the data. Previously these were called assertions. We have extensively reviewed the issues we flag: some have been renamed, some are calculated differently to make them as reliable as possible, and some have been discontinued. Old assertion names have been mapped so that users can still query the APIs with them, but results will be returned using the new issue names. External systems that read issues via the APIs will likely need to be updated to accommodate the new names.

Lists of changed and discontinued fields and assertions are available in the links below.

For the majority of users, there may be some small changes in the number of results returned by existing queries due to more accurate processing of occurrence records. However, there should not be large differences in results.

How long before I need to update my systems?

We will maintain backwards compatibility for mapped fields and issues (assertions) until phase 3 of the infrastructure project. Phase 3 is not expected to be in production until at least 2022. Watch the project page for more information.

What is changing?

Primarily, changes are to do with how we process and display data; how users search and access data remains mostly the same. In particular, most changes are to do with fields and issues, also called assertions. We have split this section into general changes, followed by changes specific to fields and, finally, issues (assertions).

General

Not all data is loaded

A small number of datasets have not yet been migrated to the new production environment as we sort through some issues with the data. These datasets will be added as we work with the data providers to resolve issues.

Further information: 

If you are missing a dataset, please contact us at data_management@ala.org.au for an update on particular datasets.

Changes to data loads

Records will be loaded as Darwin Core Archives (DwCA). Our regular data contributors are assisting us to make this change. We will work with ad hoc data contributors to support them in the creation of DwCA.

Records will be loaded even if they don't match the specified kingdom for a dataset. When registering a dataset, you can specify what kingdom records are expected to come from. This is useful when trying to resolve ambiguous names. Previously we didn't load records falling outside of the nominated kingdom; this is no longer the case.

Homonym records will load. A homonym occurs when a scientific name for an organism is also used by one or more other organisms. Previously, where a homonym was detected the record processing would stop. These records will now load with the best match provided by any other information in the record. We will monitor this once live to see what proportion of homonym records are incorrectly assigned.

Dates supplied with words or alpha characters, e.g. 10 March 2012 or 03-Dec-98, are no longer supported. Dates in these formats are not part of the DwCA standard. We expect that only old datasets will use these formats and these will be cleaned up in the migration of data to the new environment.
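
Data contributors who still hold dates in these older formats may want to convert them before building a DwCA. The following is a minimal sketch, assuming ISO 8601 (YYYY-MM-DD) as the target format and covering only the two example formats mentioned above. The function name `to_iso_date` is hypothetical, and month names are assumed to be in English.

```python
from datetime import datetime

# Formats taken from the two examples in the text; extend as needed.
LEGACY_FORMATS = ("%d %B %Y", "%d-%b-%y")


def to_iso_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = value.strip()
    # Try ISO first so values that are already compliant pass through.
    for fmt in ("%Y-%m-%d",) + LEGACY_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


for raw in ("10 March 2012", "03-Dec-98", "2012-03-10"):
    print(raw, "->", to_iso_date(raw))
```

Two-digit years are ambiguous. Python maps 69-99 to 1969-1999 and 00-68 to 2000-2068, so values such as 03-Dec-98 should be checked against the source records rather than converted blindly.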

Further information: 

Darwin Core Archives - explanation and "How to..." guide on GitHub

Homonyms - the formal International Commission on Zoological Nomenclature (ICZN) definition.

Informal definition: where two or more taxa have the same name even though they're different organisms. The rule set by the ICZN is that the name published first is retained for that organism, and the name of the organism published later has to change.

Fields

Name changes

The fields with changed names have been mapped for both querying via API and for receiving results via API - users should not see a change.

Users will see some fields have names changed in the user interface.

Further information:

See Attachments for a spreadsheet of changed and removed fields (both .xlsx and .csv files are available)

Query to get CSV file of all fields, includes changed and removed: https://biocache-test.ala.org.au/ws/index/fields.csv?deprecated=true
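
For anyone who wants to work with this list programmatically, here is a minimal sketch of downloading and parsing the CSV using only the Python standard library. The URL is the one given above. The column names in the offline sample (`name`, `deprecated`, `newFieldName`) are made up for illustration, so check the header row of the real file before relying on them.

```python
import csv
import io
import urllib.request

# URL taken from the article (this is the test environment).
FIELDS_CSV_URL = "https://biocache-test.ala.org.au/ws/index/fields.csv?deprecated=true"


def parse_fields_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def fetch_fields(url=FIELDS_CSV_URL, timeout=30):
    """Download the field list and parse it (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_fields_csv(response.read().decode("utf-8"))


# Offline demonstration with a made-up sample; real column names may differ.
sample = "name,deprecated,newFieldName\nold_field,true,newField\nkept_field,false,\n"
rows = parse_fields_csv(sample)
deprecated = [row["name"] for row in rows if row["deprecated"] == "true"]
print(deprecated)
```

Calling `fetch_fields()` would return the live list in the same shape, one dict per field.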

Removed fields

The majority of fields removed were either unpopulated or held values that were not useful. We don't anticipate the removal of these fields affecting many of our users.

Attributes held in these fields will no longer be available for searches, visible in the user interface or available for download.

Fields of special interest are dealt with below.

Further information:

See Attachments for a spreadsheet of changed and removed fields (both .xlsx and .csv files are available)

Query to get CSV file of all fields, includes changed and removed: https://biocache-test.ala.org.au/ws/index/fields.csv?deprecated=true

Species_habitat field removed

This field was removed as the dataset we had populated it with was over 9 years old, was not comprehensive and gave misleading results for coastal and riverine habitats. At this time we don't have an adequate substitute. Some users did filter by the habitat information to identify occurrence data that was recorded outside the expected habitat for a species. We suggest sourcing species marine habitat information from the World Register of Marine Species (WoRMS) instead.

Further information:

WoRMS website

Biome field processing changed

Previously, the biome field was determined by checking the coordinates given for a record against the IBRA and IMCRA environmental layers.

The biome field will now be populated by checking the record coordinates against the GADM Spatial layers. These layers are more current, have greater area coverage for those records outside of Australia and will be more consistent with GBIF processing.

Further information:

GADM Spatial layers website

State values fixed

Note: this only affects the processed values, raw values are still available on the record.

We have removed values used in the State field that can't be mapped to known states. Examples are:

  • Where the value given was a hyphen
  • (SA)
  • (South Island, Otago)

These values should no longer show up in the State/Territory facet in the user interface and cannot be used in queries.

Change to record type and DNA records

The values EnvironmentalDNA and GenomicDNA are not part of the Darwin Core standard vocabulary for the basis of record field (seen in the ALA search interface as "record type"). We have removed EnvironmentalDNA and GenomicDNA as options for this field. To retain this information for these types of records, we have added these values to the records' dataset metadata. 

See this article for information on searching and filtering EnvironmentalDNA and GenomicDNA records.

Changes to the prefix "raw_"

Previously, the "raw_" prefix was given to all fields supplied by a data provider. This set up a false expectation that all fields would be processed in some way. Now only fields which have a processed equivalent will get the prefix.

Old field names have been mapped where the prefix is no longer being used.

Life Stage facet

We've implemented the GBIF life stage vocabulary. This picks up significantly more life stage terms for indexing, and there are now 26 options in the facet. This means that many of the existing values in the data can be picked up in our processing and indexed.

The facet can be seen in the Occurrence group of facets in the search interface. It can also be used in the spatial portal.

Further information:

Filtering search results by facet

Using facets in the spatial portal

GBIF life stage vocabulary https://tools.gbif.org/dwca-validator/vocabulary.do?id=http://rs.gbif.org/vocabulary/gbif/life_stage

Generalised and Already Generalised numbers

People searching on species that are generalised will see different numbers (higher numbers) in the new environment than they previously saw in production.

This is because we have fixed the processing of spatial uncertainty.

Previously, we added obscurity regardless of whether the occurrence already had high uncertainty. This meant many of these records would fail the spatial validity test. Now, where a record already has enough spatial uncertainty to match what we would normally apply to generalise it, it will display as "Already generalised".
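
The rule described above can be sketched as a simple comparison. This is an illustration only, not the ALA's processing code. The function name, the parameter names and the use of metres are assumptions.

```python
def generalisation_status(uncertainty_m, generalisation_m):
    """Label a sensitive record (hypothetical helper, not ALA code).

    uncertainty_m: spatial uncertainty supplied with the record, in metres
                   (None when not supplied).
    generalisation_m: the uncertainty that would normally be applied to
                      generalise the record, in metres.
    """
    # A record that is already at least as uncertain as the generalisation
    # we would apply does not need further obscuring.
    if uncertainty_m is not None and uncertainty_m >= generalisation_m:
        return "Already generalised"
    return "Generalised"


print(generalisation_status(25000, 10000))
print(generalisation_status(100, 10000))
```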

Querying for field properties

If you are querying the API to get field properties, you need to use the new field names. 

Further information:

See Attachments for a spreadsheet of changed and removed fields (both .xlsx and .csv files are available)

Query to get CSV file of all fields, includes changed and removed: https://biocache-test.ala.org.au/ws/index/fields.csv?deprecated=true

Issues (assertions)

Name changes

In the user interface, you will see new names for issues (assertions).

Using the API, you can query using the old names for issues (assertions). However, the response will come back using the new names. If your systems use those attributes in subsequent processing, you will need to update to use the new issue (assertion) names or maintain a mapping.
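
If you choose to maintain a mapping rather than update downstream code straight away, one approach is to translate the names in each API response back to the names your system expects. The sketch below assumes a plain list of issue names in the response. The name pairs shown are placeholders, not real ALA assertion names. Take the real pairs from the attached spreadsheet or the assertion codes query.

```python
# Placeholder old-to-new pairs for illustration only.
OLD_TO_NEW = {
    "oldIssueName": "NEW_ISSUE_NAME",
    "anotherOldIssue": "ANOTHER_NEW_ISSUE",
}
# Reverse lookup, used to translate API responses.
NEW_TO_OLD = {new: old for old, new in OLD_TO_NEW.items()}


def to_legacy_names(issues):
    """Translate issue names from an API response back to the old names.

    Names without a mapping are passed through unchanged, so new issues
    remain visible to downstream code.
    """
    return [NEW_TO_OLD.get(name, name) for name in issues]


print(to_legacy_names(["NEW_ISSUE_NAME", "SOME_UNMAPPED_ISSUE"]))
```

Because backwards compatibility is only maintained for a limited period, a mapping like this is best treated as a stopgap while systems are updated to the new names.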

In downloads, you will see the new names for issues (assertions).

Further information:

See Attachments for a spreadsheet of changed, new and removed assertions (both .xlsx and .csv files are available)

Query for changed and new assertions (this does not include removed assertions): https://biocache-test.ala.org.au/ws/assertions/codes?deprecated=true


Issues (assertions) removed

In the user interface, you will no longer see issues (assertions) that have been removed.

Using the API, if you query using assertions that have been removed, you will get no results.

In downloads, you will no longer see issues (assertions) that have been removed.

If you are looking for records with particular information missing, you will need to search or query for records where those fields are blank.

Issues (assertions) of special interest are dealt with below.

Further information:

See Attachments for a spreadsheet of changed, new and removed assertions (both .xlsx and .csv files are available)

Query for changed or new assertions (this does not include removed assertions): https://biocache-test.ala.org.au/ws/assertions/codes?deprecated=true

An example of a query to find all records with IdentifiedBy unpopulated: https://biocache-test.ala.org.au/occurrences/search?q=*%3A*&disableAllQualityFilters=true&qualityProfile=ALA&fq=-identified_by%3A*
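
Decoded, the example URL searches for all records (q=*:*) and then applies the filter -identified_by:*, which excludes every record that has any value in that field. The sketch below rebuilds the same URL with the Python standard library. The function name `missing_field_query` is hypothetical, and reusing the pattern for other field names is an assumption that should be checked against the field list.

```python
from urllib.parse import urlencode

# Base URL taken from the example above (test environment).
SEARCH_URL = "https://biocache-test.ala.org.au/occurrences/search"


def missing_field_query(field):
    """Build a search URL for records where `field` is unpopulated."""
    params = {
        "q": "*:*",                          # match all records
        "disableAllQualityFilters": "true",
        "qualityProfile": "ALA",
        "fq": f"-{field}:*",                 # exclude records with any value
    }
    # safe="*" leaves the wildcard unescaped, as in the article's example.
    return f"{SEARCH_URL}?{urlencode(params, safe='*')}"


url = missing_field_query("identified_by")
print(url)
```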

Habitat_mismatch issue (assertion) removed

This assertion was based on checking the species habitat value against the IBRA and IMCRA environmental layers. The species habitat field has been removed as there were issues with the data used to populate this field, resulting in species incorrectly being identified as exclusively marine or terrestrial. The assertion would then be incorrectly raised. As we haven't yet negotiated the use of a suitable dataset to populate the species habitat field, this issue (assertion) is no longer supported.

Geospatial_kosher issue (assertion) replaced by spatially valid flag

While queries using the geospatial_kosher field will be mapped to spatially valid, the calculation of true or false on this field will be done differently to be in line with GBIF processing.

Further information:

This article contains information on the checks used for spatially valid and a comparison of the old and new checks.

More spatially invalid records 

There are many more records being flagged as spatially invalid. There are 2 main reasons for this:

  1. The new GADM spatial layers give an approximation of the actual coastline, so occurrence records within 10km of the coastline may be incorrectly flagged as not within Australia. This will be addressed in an upcoming release.
  2. Previously the ALA had 3 values for spatial validity: valid, suspect and unknown. Anything that had text location information rather than coordinates was given the value "unknown". In line with GBIF processing, these records now have the "suspect" value.

If you are looking for records specific to coastal habitats, we suggest turning the spatially valid data quality filter off and checking whether records you are interested in are being incorrectly excluded.

Missing coordinate precision issue (assertion) removed

This value is rarely given. It is missing from 99.5% of occurrence records in the Atlas. As a result, we will no longer support this issue.

Technical lowdown

We have adapted a mirror of GBIF's technical stack (Apache Beam, Spark and Avro) for the ALA.

We’ve also either rewritten, borrowed from GBIF, or collaborated with GBIF on these pieces of functionality:  

  • Generating globally unique identifiers for each occurrence record
  • Ingesting only a full refresh DwCA for every dataset, ensuring we can better adhere to existing DwC standards
  • Upgrading our image service: we’re transferring to AWS S3 storage for greater volumes and more efficient asynchronous loading
  • Updating our ingestion code for the following functions:
    • generate globally unique identifiers for occurrence records  
    • process temporal, spatial and taxon fields  
    • detect duplicates  
    • set data quality assertions  
    • obscure sensitive species records  
    • detect environmental outliers.

Code for this system can be found on GitHub.


Attachments

These attachments were correct at time of publication. We will endeavour to keep them up to date; however, please use the linked queries for the most up-to-date information.