Sharing a dataset with the ALA

Jump to section:

Datasets
Why share data?
Sharing a dataset with ALA is a five-step process
How should a dataset be formatted?
Metadata
Sharing a dataset with us
Frequently asked questions
More information

Looking for how to submit just an individual sighting or a handful of sightings?

Datasets

If you already have a dataset of species sightings, particularly ones that have been collected using a systematic method, it can be more convenient to submit them as a dataset, rather than individually.

Datasets may be one-off and complete, or ongoing with regular updates.

See below for more information about how you can prepare and submit datasets to the ALA using our accepted format.

Why share data?

The ALA itself does not generate or own any biodiversity data. We support biodiversity and conservation analysis by aggregating contributions from data providers, then processing and indexing them so that the data are openly searchable and reusable.

By sharing biodiversity data, data providers increase the pool of accessible information for research, monitoring, and reporting on the state of Australian biodiversity. Easily accessible, high-quality data better support research and evidence-based decision-making by conservation and management authorities.

When data providers share their data with the ALA, they:

Gain increased visibility of their activities, resources, and expertise
Gain an opportunity to collaborate. As their datasets are made discoverable via the ALA, other individuals and organisations that are interested in the same target species and locations can find these groups and potentially collaborate on future projects
Are able to receive feedback and advice. Other ALA users can comment on the data (and vice versa), helping improve the quality of data available
Benefit from the ALA’s data processing pipeline, which augments data with valuable standardised taxonomic, spatial, and data quality assessment values
Visualise and analyse data against other variables using ALA’s spatial tools.

Sharing a dataset with ALA is a five-step process:

Download our template
Format the dataset
Include metadata
Package files
Send to ALA

How should a dataset be formatted?

Darwin Core format

Darwin Core format describes how to format biodiversity data in a table using accepted standard column names in plain text.

Darwin Core fields include information about the what, where, and how of a species sighting. For example, the species name (what species was observed), location information (where the species was observed), and the basis of record (how the species was observed).

Files in Darwin Core format are delivered by zipping up the tabular data files into a Darwin Core Archive, along with some metadata files (additional valuable information - or “data about the base data”).

ALA’s preferred data format is the Darwin Core Archive. We don't accept occurrence data in non-tabular formats like PDF or Word documents.

Why Darwin Core?

The Darwin Core standard provides very clear guidance for how to specify taxonomy, dates, and locations in biodiversity records. The benefit of having data from multiple providers structured the same way, or “standardised”, is that they can be combined, processed, and analysed at scale.

To demonstrate, an occurrence record might contain the following fields (see below) which are defined by the Darwin Core standard.

occurrenceID = An identifier for the occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, one is constructed from a combination of identifiers in the record that will most closely make the occurrenceID globally unique.

basisOfRecord = The specific nature of the data record. Recommended best practice is to use the standard label of one of the Darwin Core classes, e.g. PreservedSpecimen, LivingSpecimen, HumanObservation, MachineObservation.

decimalLatitude = The geographic latitude (in decimal degrees, using the spatial reference system given in geodeticDatum) of the geographic centre of a Location. Positive values are north of the equator, negative values are south of it. Legal values lie between -90 and 90, inclusive.

scientificName = The full scientific name (with authorship and date information if known). When forming part of an identification, this should be the name at the lowest taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the identificationQualifier term.
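
As a rough illustration, the four terms above might sit side by side as columns in a plain-text table. The Python sketch below writes one such row as CSV. The identifier and latitude values are invented for illustration and do not come from a real dataset.

```python
import csv
import io

# Hypothetical record: every value here is invented for illustration only.
record = {
    "occurrenceID": "example-org:survey-2021:0001",
    "basisOfRecord": "HumanObservation",
    "decimalLatitude": "-35.2809",
    "scientificName": "Gymnorhina tibicen",
}


def to_csv(rows):
    """Serialise a list of dicts to CSV text, using the first dict's keys as the header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv([record])
print(csv_text)
```

Each Darwin Core term becomes a column heading, and each occurrence becomes one row beneath it.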

The Darwin Core standard is maintained by the international body Biodiversity Information Standards (TDWG), formerly known as the Taxonomic Databases Working Group. This means that the terms and definitions are arrived at via broad community consensus, and are published, versioned and freely available. For more information about how Darwin Core standards are implemented by the ALA, see our ALA Data Standards article.

The ALA requires a handful of Darwin Core fields to be completed for an occurrence record to be accepted. Other fields are optional but strongly recommended, because they improve the quality of the ingested data.

What are the minimum required fields needed by the ALA for species occurrences?

The information below is essential for a species occurrence record to be included in the ALA. Our template includes all of these fields.

ALA requirement: Unique identifier
Darwin Core field: occurrenceID OR catalogNumber, recordNumber
Property description: A unique record identifier must be provided in the source data set. Without this the data ingestion will fail. It’s important that these are unique within the dataset so that records can be ingested, indexed, searched and even corrected if necessary.

ALA requirement: Basis of record
Darwin Core field: basisOfRecord
Property description: Identifies the nature of the record. For occurrence records these values are used: PreservedSpecimen, FossilSpecimen, LivingSpecimen, MaterialSample, HumanObservation, MachineObservation

ALA requirement: Recognisable species
Darwin Core field: scientificName
Property description: The scientific name of the organism, e.g. Gymnorhina tibicen, and the lowest applicable taxon rank, e.g. species, genus.

The ALA will attempt to match the given taxon fields to an authoritative taxon name. See our taxonomy article for more information about how we do this.

ALA requirement: Location information
Darwin Core field: decimalLatitude AND decimalLongitude AND geodeticDatum AND coordinateUncertaintyInMeters AND/OR other DwC location fields
Property description: We prefer that coordinates are specified in decimal degrees using these fields, along with the datum (usually WGS84 or EPSG:4326). If your coordinates are from a GPS system, your data's already using the WGS84 datum. It’s important to specify uncertainty and give as much detail as possible about the geospatial determination.

There are many other ways to specify location using free text fields, e.g., “100km south of Darwin” in the locality field. The ALA will not be able to translate this to appear on maps.

For more information on georeferencing, see Chapman & Wieczorek’s Georeferencing Best Practices (2020).

See also the full list of Darwin Core Location fields.

ALA requirement: Date
Darwin Core field: eventDate
Property description: Recommended best practice is to use a date that conforms to ISO 8601-1:2019, which primarily states that dates should be formatted as YYYY-MM-DD but also provides standards for specifying date and time ranges and time zones.
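
The minimum requirements above can be pre-checked before a file is sent. The following Python sketch is one possible check and is not the ALA's ingestion logic. It assumes that any one of the three identifier fields is sufficient, that the coordinate fields are always supplied (the requirements also allow other location fields), and that all values arrive as strings read from a .csv file. The name `check_record` is hypothetical.

```python
from datetime import date

# Assumption: any ONE of these identifier fields is treated as sufficient.
IDENTIFIER_FIELDS = ("occurrenceID", "catalogNumber", "recordNumber")

# Assumption: coordinates are always supplied, although the requirements
# also allow other Darwin Core location fields.
REQUIRED_FIELDS = (
    "basisOfRecord",
    "scientificName",
    "decimalLatitude",
    "decimalLongitude",
    "geodeticDatum",
    "coordinateUncertaintyInMeters",
    "eventDate",
)

BASIS_OF_RECORD_VALUES = {
    "PreservedSpecimen", "FossilSpecimen", "LivingSpecimen",
    "MaterialSample", "HumanObservation", "MachineObservation",
}


def check_record(record):
    """Return a list of human-readable problems found in one occurrence record."""
    problems = []
    if not any(record.get(name, "").strip() for name in IDENTIFIER_FIELDS):
        problems.append("missing unique identifier")
    for name in REQUIRED_FIELDS:
        if not record.get(name, "").strip():
            problems.append(f"missing {name}")
    basis = record.get("basisOfRecord", "").strip()
    if basis and basis not in BASIS_OF_RECORD_VALUES:
        problems.append(f"unexpected basisOfRecord: {basis}")
    event_date = record.get("eventDate", "").strip()
    if event_date:
        try:
            # Simplification: only single calendar dates are checked here;
            # ISO 8601 ranges and date-times would be reported as problems.
            date.fromisoformat(event_date)
        except ValueError:
            problems.append(f"eventDate is not YYYY-MM-DD: {event_date}")
    return problems


# Hypothetical records for illustration.
good = {
    "occurrenceID": "example-org:survey-2021:0001",
    "basisOfRecord": "HumanObservation",
    "scientificName": "Gymnorhina tibicen",
    "decimalLatitude": "-35.2809",
    "decimalLongitude": "149.1300",
    "geodeticDatum": "WGS84",
    "coordinateUncertaintyInMeters": "30",
    "eventDate": "2021-03-14",
}
bad = dict(good, occurrenceID="", eventDate="14/03/2021")

print(check_record(good))
print(check_record(bad))
```

Returning a list of problems rather than raising on the first one lets a data provider see every issue in a record in a single pass.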

What other fields should be supplied? (Recommended fields and additional information)

Additional information produces a higher quality record, whereby users can more easily determine its fitness for use, for example, whether it can be used for analysis, modelling, or an environmental impact assessment.

The automated ALA ingestion process reviews provided data and produces a series of quality assertions that help users decide whether to include records in their analysis. These assertions are used to filter out records from a user’s default view if, for example, the ALA’s processing couldn’t determine a spatial location, find a good taxonomic match, or establish a date from the provided data.

Location

It’s important to provide as much information as possible about the spatial location of the occurrence. This includes information about the accuracy and precision of the location and how the location was determined. For more information, see these GBIF publications: Georeferencing Quick Reference Guide, Georeferencing Best Practices.

The ALA will assume a geodeticDatum value of WGS84 and attempt to re-project coordinates given in any other datum. If the provided data have been generalised or obfuscated for sensitivity purposes, it’s useful if that is documented with a short explanation in the dataGeneralizations field and/or useful detail in the informationWithheld field.

Strongly recommended: geodeticDatum, coordinatePrecision, coordinateUncertaintyInMeters, stateProvince, country, locality

Share if available: dataGeneralizations, informationWithheld

Useful: verbatimLatitude, verbatimLongitude, verbatimCoordinateSystem, georeferencedBy, georeferenceProtocol, georeferenceSources, georeferenceVerificationStatus

Taxonomy

The ALA expects the scientificName field to be provided and will attempt to match it to names provided by Australian taxonomic authorities. The strongly recommended fields taxonRank and kingdom will help the matching service differentiate between homonyms across taxon groups. See our taxonomy article for how the ALA matches taxonomic information.

Record quality is vastly improved by demonstrating how and when a species identification was made.

Strongly recommended: taxonRank, kingdom

Share if available: vernacularName, phylum, class, order, family, genus, identifiedBy, identifiedByID, identificationVerificationStatus, identificationReferences, identificationQualifier

Sampling or Survey Event

To make species observations useful in future analyses, it is very important to describe how a record was collected. An eventID can be used to group occurrences that were recorded together at a particular time using a particular methodology.

Strongly recommended: eventID, recordedBy, samplingProtocol, occurrenceStatus, individualCount

Share if available: recordedByID, fieldNotes, eventRemarks, samplingEffort, sampleSizeValue, sampleSizeUnit, eventTime
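
To show how the Darwin Core eventID term ties occurrences to a shared survey event, here is a small Python sketch that groups records by that value. The event identifiers and records are hypothetical, and `group_by_event` is an illustrative helper, not part of any ALA tool.

```python
from collections import defaultdict


def group_by_event(records):
    """Map each eventID to the list of scientific names recorded under it."""
    events = defaultdict(list)
    for record in records:
        events[record["eventID"]].append(record["scientificName"])
    return dict(events)


# Hypothetical records: two transect surveys on different days.
records = [
    {"eventID": "transect-A:2021-03-14", "scientificName": "Gymnorhina tibicen"},
    {"eventID": "transect-A:2021-03-14", "scientificName": "Tachyglossus aculeatus"},
    {"eventID": "transect-B:2021-03-15", "scientificName": "Gymnorhina tibicen"},
]

grouped = group_by_event(records)
for event_id, names in grouped.items():
    print(event_id, names)
```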

Metadata

Metadata is data about the base Darwin Core data, not the actual content of the data. Examples of metadata include information about who authored the data, when, and what sort of keywords best describe the data and their collection process.

Good quality metadata raises the quality of a dataset, because metadata helps make data FAIR – findable, accessible, interoperable, and reusable. This also helps end-users evaluate if the dataset, and even particular records, are fit for their purpose or require manipulation or data cleaning. For more information about this, see our Data Cleaning article.

For open data (like that on the ALA), good-quality metadata becomes even more critical. For example, since data arrive from multiple sources, good metadata can indicate how data were collected, such as whether a survey was conducted systematically or opportunistically.

To describe where a dataset originated from and what sort of data it contains, the following metadata fields are mandatory for ALA dataset submissions:

Title: a brief but descriptive name of the dataset, as displayed on ALA. The name should be unique and distinguishable, e.g. “Mammals surveyed in Namadgi National Park from 2012 to 2014” is a more suitable title than simply “Mammals.”
Description: a longer description that ideally documents the geographic, temporal, and taxonomic scope of the collection, as well as information about the methodology and purpose of the project or collection. Think of this as the who, what, where, when, why and how information about the dataset. This may include information such as who was the principal investigator, what institutions were involved, how was it funded, when were the data collected, what were the research aims. Any relevant literature references, or any other relevant information may be included here. A second version of the description in a language other than English may be added underneath.
License: licensing determines how a piece of work (in this case, data or an image) may be used and shared. All data ingested into the ALA must be assigned a license. A dataset author may choose to allocate the same license for the entire dataset, or may wish to specify it for each record or image. ALA’s preferred licenses are Creative Commons CC0, CC-BY and CC-BY-NC. This means the data can be reused in derivative works, either freely (CC0), with attribution (CC-BY), or with attribution and only for non-commercial purposes (CC-BY-NC).
Contact: at least one name and email address are required for an administrative contact for the dataset. ALA requires this to ensure we can contact the organisation or individual about the dataset. The email address may be for an individual or a functional role (e.g., collectionname@museum.org.au).
Creator: The full name of the person, organisation, or position who created the resource.
Citation: Text specifying how the dataset should be cited in publications that make use of the data. For example: “Echidna CSI - Observation Photos of Echidnas: University of Adelaide (2021) Echidna CSI - Observation Photos of Echidnas dataset.”
URL: The website for the institution of the data provider.
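
Before filling in the template, it can help to confirm that none of the mandatory items listed above has been left blank. The Python sketch below assumes the metadata has been gathered into a dictionary. The lowercase key names and the `missing_metadata` helper are choices made for this example, not ALA field names.

```python
# Assumption: lowercase keys mirroring the mandatory items listed above.
MANDATORY_METADATA = ("title", "description", "license", "contact", "creator", "citation", "url")


def missing_metadata(metadata):
    """Return the mandatory metadata keys that are absent or blank."""
    return [key for key in MANDATORY_METADATA if not str(metadata.get(key, "")).strip()]


# Hypothetical, partly completed metadata for illustration.
metadata = {
    "title": "Mammals surveyed in Namadgi National Park from 2012 to 2014",
    "description": "Systematic camera-trap survey of mammals (hypothetical description).",
    "license": "CC-BY",
    "contact": "collectionname@museum.org.au",
    "creator": "Example Museum",
    "citation": "",
}

print(missing_metadata(metadata))
```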

ALA metadata template [.docx 110 KB]

Note: we have provided our metadata template in .docx format. If you are unfamiliar with .xml format, we recommend you write your metadata in the .docx format. However, if you are comfortable working with .xml files, take a look at the example (eml.xml) as a template. Darwin Core Archives require your metadata in .xml format, but the ALA is happy to help you with this if it’s unfamiliar to you.

Sharing a dataset with us

There are a few different ways to do each step depending on the level of technical expertise and type of data. If you get stuck at a specific point, just email us at data_management@ala.org.au and the staff at the ALA can give you a hand.

[Figure: decision tree flowchart illustrating the steps required to submit a dataset to ALA]

1. Download a Darwin Core file template

ALA occurrence data template [.xlsx 113 KB]

ALA multimedia template [.xlsx 88 KB]

2. Format the dataset using the template

The first stage in submitting a dataset is to produce a file suitable for loading into the ALA.

We’ve provided examples in the templates, and have also indicated which fields are required and which are recommended. The data upload templates provide definitions for each header.

Note: if the data are managed in an existing spreadsheet, the author may edit the headings to align with Darwin Core column headings.

Please export to a .csv file before continuing to the following steps. Alternatively, the data may be copied into one of the templates provided.

If the data are managed in a database, it is necessary to export to a .csv file before moving on.

If there is additional information that doesn’t fit the standard fields provided, then an author may add additional columns to the end of the records sheet. The information will stay with the records, and we’ll include them as much as possible in the ALA display. When the customised template is set, copy the data across.

Note: At the end of this step the data should be in a .csv file with Darwin Core column headings.
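
For data already held in a spreadsheet export, renaming the headings can be scripted. The Python sketch below is one way to do it with the standard csv module. The source headings in `COLUMN_MAP` are hypothetical, and columns with no Darwin Core equivalent are kept under their original names.

```python
import csv
import io

# Hypothetical mapping from an existing spreadsheet's headings to Darwin Core terms.
COLUMN_MAP = {
    "Record ID": "occurrenceID",
    "Species": "scientificName",
    "Lat": "decimalLatitude",
    "Long": "decimalLongitude",
    "Date": "eventDate",
}


def rename_columns(source_csv, column_map):
    """Return CSV text with headings renamed; unmapped columns keep their original names."""
    reader = csv.DictReader(io.StringIO(source_csv))
    old_names = list(reader.fieldnames)
    new_names = [column_map.get(name, name) for name in old_names]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(new_names)
    for row in reader:
        writer.writerow([row[name] for name in old_names])
    return out.getvalue()


# Hypothetical one-row export for illustration.
source = (
    "Record ID,Species,Lat,Long,Date,Observer notes\n"
    "R1,Gymnorhina tibicen,-35.28,149.13,2021-03-14,calling\n"
)
converted = rename_columns(source, COLUMN_MAP)
print(converted)
```

Only the header row changes, so the data values are carried across untouched.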

3. Include metadata

Good metadata is crucial, so please don’t forget to include it.

Option 1. Provide metadata in docx format

If you’re unfamiliar with .xml, we have provided a template in .docx format where you can provide the relevant information for your dataset.

ALA metadata template [.docx 110 KB]

Option 2. Provide metadata in xml (for those with experience in software development)

You can write your metadata directly in eml.xml format; see this example for how the file will look. The .docx template above outlines the required fields and might help you gather the necessary information.

Example eml.xml [.xml 3 KB]

4. Create a Darwin Core Archive

A Darwin Core Archive (DwC-A) is a data standard that is commonly used to package species occurrence data into a single, self-contained, machine-readable dataset. This ensures the most efficient transfer of the whole dataset.

A DwC-A includes four files:

A core dataset file
An ‘extension’ file related to the main dataset, i.e. multimedia (optional)
A metadata file (eml.xml)
A descriptor metafile (meta.xml). This file (see an example) describes how the files in the archive are organised, and maps each data column to a corresponding standard Darwin Core or Extension term.
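
To make the role of meta.xml more concrete, the Python sketch below generates a minimal descriptor for a single core file. It follows the general layout of the Darwin Core text guidelines as understood here, so compare the output with the ALA example file before relying on it. It assumes every column is a term in the main Darwin Core namespace (custom columns would need their own term URIs), that the first column holds the record identifier, and that the data file is called dataset.csv. `build_meta_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"
DWC_TEXT_NS = "http://rs.tdwg.org/dwc/text/"


def build_meta_xml(columns, data_file="dataset.csv"):
    """Build a meta.xml string mapping each CSV column (by position) to a Darwin Core term."""
    archive = ET.Element("archive", {"xmlns": DWC_TEXT_NS, "metadata": "eml.xml"})
    core = ET.SubElement(archive, "core", {
        "encoding": "UTF-8",
        "fieldsTerminatedBy": ",",
        "linesTerminatedBy": "\\n",   # the literal two characters backslash and n
        "fieldsEnclosedBy": '"',
        "ignoreHeaderLines": "1",     # skip the header row of the CSV
        "rowType": DWC_TERMS + "Occurrence",
    })
    files = ET.SubElement(core, "files")
    ET.SubElement(files, "location").text = data_file
    # Assumption: column 0 holds the record identifier.
    ET.SubElement(core, "id", {"index": "0"})
    for index, name in enumerate(columns):
        ET.SubElement(core, "field", {"index": str(index), "term": DWC_TERMS + name})
    return ET.tostring(archive, encoding="unicode")


meta_xml = build_meta_xml(["occurrenceID", "basisOfRecord", "scientificName", "eventDate"])
print(meta_xml)
```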

[Figure: the four key files that make up a Darwin Core Archive]

Option 1: Create Darwin Core Archive

The best way to submit a Darwin Core Archive is to submit the complete Archive yourself (i.e. a zip file with dataset.csv, meta.xml, eml.xml and optional extension files). This method helps us immensely by speeding up the data processing stage and allows you to specify all the details and structure of your dataset. See TDWG’s instructions for how to prepare a Darwin Core Archive.
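
Once the three files exist, zipping them is a short step. The Python sketch below uses the standard zipfile module and assumes the files sit together in one folder under the names given above. The function name `package_archive` is hypothetical, and an extension file could be added by passing a longer `names` tuple.

```python
import zipfile
from pathlib import Path


def package_archive(folder, archive_path, names=("dataset.csv", "meta.xml", "eml.xml")):
    """Zip the named files from `folder` into a flat archive; fail early if one is missing."""
    folder = Path(folder)
    missing = [name for name in names if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {', '.join(missing)}")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            # arcname keeps each file at the root of the archive, without folder paths.
            archive.write(folder / name, arcname=name)
    return Path(archive_path)
```

A call such as `package_archive("my_dataset", "my_dataset_dwca.zip")` would then produce a single zip file, where both names are placeholders.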

Option 2: Collate data and send files to ALA

If you are new to creating Darwin Core Archives, you can send through raw files (occurrence [.csv], multimedia [.csv] and metadata [.docx], prepared as in Steps 1-3 above) and we can help to create your first Darwin Core Archive for you. Raw files can be sent to data_management@ala.org.au.

5. Send data

You’re almost done!

How the ALA gets the files depends on the size of the dataset and whether it is a one-off, complete set of data or part of an ongoing dataset that will be updated.

One-off complete datasets under 10MB (compressed) can simply be emailed to the ALA at data_management@ala.org.au.

For a large dataset (over 10MB) or where there will be regular updates, please contact the team at the same address to arrange machine-to-machine delivery.
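
The rule of thumb above can be expressed as a small helper. This Python sketch makes two assumptions that the text does not settle: "10MB" is read as 10 × 1024 × 1024 bytes, and a file of exactly that size is treated as large. `delivery_method` is a hypothetical name.

```python
# Assumption: "10MB" is interpreted as 10 * 1024 * 1024 bytes.
EMAIL_LIMIT_BYTES = 10 * 1024 * 1024


def delivery_method(compressed_size_bytes, recurring_updates=False):
    """Suggest how to send a dataset, based on its compressed size and update pattern."""
    # Assumption: a file of exactly the limit is treated as large.
    if recurring_updates or compressed_size_bytes >= EMAIL_LIMIT_BYTES:
        return "contact data_management@ala.org.au to arrange machine-to-machine delivery"
    return "email the archive to data_management@ala.org.au"


print(delivery_method(2 * 1024 * 1024))
print(delivery_method(2 * 1024 * 1024, recurring_updates=True))
```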

Data are complex, so we’re here to help. If you need any assistance working with existing data in another format, please reach out.

Frequently asked questions (FAQ)

What types of data can I share?

The ALA currently gathers the following types of data:

Occurrence data – this is the occurrence of a particular species in a location at a specific time.
- This also includes images, sounds and other multimedia that accompany occurrence records.
Species lists – conservation lists, sensitive lists, area checklists, species trait lists etc. (via our “Lists” tool). Creating species lists can make it easier to query occurrences for those species rather than creating complex queries.

Are there duplicate records?

If you have shared your information with another organisation such as a government department or conservation agency, then they may have already shared it with the ALA.

We have data quality checks to identify duplicate records and flag them.

How can I see how my dataset is being used?

When data are loaded into the ALA, a dataset metadata page is created which displays any metadata that has been provided (see an example for ANIC).

This serves as a useful landing page for the dataset, with several features:

A logo can be displayed
Any institutional affiliations
The Usage Stats tab provides information on the number of times records from the dataset have been downloaded from the ALA (also see the ‘Download Usage Stats’ link on the right)
A record count
‘View Records’ – a link to the records themselves
The ability to set up Alerts on the dataset if anyone flags an issue for the data
Citations – the number of citations that have been tracked on this dataset

When a dataset is loaded into the ALA it is also loaded into GBIF soon after, where the dataset and the interpreted data are given a DOI. The ALA links to the DOI landing page on GBIF, where you can also see any citations that have been tracked on that DOI.

GBIF manages a citation tracking service that populates this information. Both the ALA and GBIF mint DOIs for data downloads, making it easier for researchers to cite downloads from multiple datasets in publications. The tracking service registers citations and links them back to the original dataset.

What happens if there is a problem with my data?

The ALA runs a series of automated data quality checks across all submitted records. These tests will flag records for completeness and consistency across spatial and taxonomic dimensions, adherence with the standard, expected values (e.g., coordinates between -180 and 180), and environmental outliers.
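
As an example of an expected-value test, the Python sketch below flags coordinates outside the legal ranges (latitude from -90 to 90, longitude from -180 to 180). It only illustrates the idea and is not the ALA's actual check. `coordinate_flags` is a hypothetical name.

```python
def coordinate_flags(record):
    """Return a list of flags for coordinate values that are missing, non-numeric, or out of range."""
    flags = []
    for field, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        try:
            value = float(record.get(field, ""))
        except (TypeError, ValueError):
            flags.append(f"{field} is not a number")
            continue
        if not -limit <= value <= limit:
            flags.append(f"{field} out of range")
    return flags


# Hypothetical records: the second has latitude and longitude swapped.
print(coordinate_flags({"decimalLatitude": "-35.28", "decimalLongitude": "149.13"}))
print(coordinate_flags({"decimalLatitude": "149.13", "decimalLongitude": "-35.28"}))
```

A swapped latitude and longitude is a common way for an Australian record to fail this kind of test, because many Australian longitudes exceed 90.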

When users query the ALA, a filter called ‘ALA General’ in our Data Profiles removes potentially problematic records from the results, although these records are still available. Data providers can use this tool to review their datasets, correct source data and work with the data team to refresh the dataset.

Users of the ALA can annotate records by flagging the record if they believe there is an issue.

What if I’ve spelled a species name incorrectly, or used an old name?

The ALA maintains a list of alternative names for each species. Our automated processes will try to match the species name to this list as best as possible (for more information, see our Taxonomy article).

Please note the ALA is not resourced to check actual species identifications.

If you are unsure of the species, our ‘how do I identify a species?’ article has some ideas for obtaining a correct identification.

Do you really want observations of dead organisms, like roadkill?

Yes. Dead organisms are an indication of the presence of the species, but please indicate in the occurrence remarks that the specimen was dead/deceased.

More information

If you have any other questions or need assistance, email us at data_management@ala.org.au.
