Data cleaning

Modified on: Mon, 27 Feb, 2023 at 6:48 PM

Jump to section:

Data variables
Anomalies
What is an outlier? Is it an environmental outlier or an issue with the data?
References

The Atlas of Living Australia (ALA) is a data aggregator: we collate data from our providers and make them available to our users. The ALA does not own the data we display. The data we receive come in different forms and vary in quality, which brings many challenges. Errors can be introduced at many points, from data collection through to ingestion. Understanding that not all errors are the same can help users work with the data they receive from us.

Data variables

Open-source biodiversity data have pros and cons, and data aggregators like the ALA play a key role in improving access to these data. ALA occurrence records are collected by many different people across time and space, so their quality varies.

The first six points below will help you a) familiarise yourself with the data you’re working with, and b) tidy up your data so they’re fit for purpose. If you don’t have a dataset of your own but want to work through the steps below, try downloading occurrence records of your favourite species (see our article on how to download occurrence records).

[Image: checklist for tidying data]

Figure 1. Part 1, familiarising and tidying open-source data for data cleaning

The size of the data you’re handling will likely dictate how you go about the above steps. Small datasets might allow you to do this work manually, but large datasets will rely on a more automated approach. However, regardless of the size of the dataset, these steps are important.
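For larger datasets, tidying steps like these can be scripted. The sketch below is a minimal illustration, not an ALA tool: it drops records without usable coordinates and standardises the capitalisation of species names. The sample data and the column names (scientificName, decimalLatitude, decimalLongitude) are assumptions for illustration; check them against the columns in your own download.

```python
import csv
import io

# Hypothetical sample in the style of an occurrence download.
# The column names are assumptions; adjust them to match your file.
SAMPLE = """scientificName,decimalLatitude,decimalLongitude
litoria PERONII,-33.87,151.21
Litoria peronii,,151.00
 Litoria Peronii ,-34.10,150.90
"""


def tidy_name(name):
    """Format a name as 'Genus species': capitalised genus, lower-case rest.

    Note: this is only suitable for plain binomials. It would also
    lower-case an authority such as '(Tschudi, 1838)' if one were present.
    """
    parts = name.split()
    if not parts:
        return ""
    return " ".join([parts[0].capitalize()] + [p.lower() for p in parts[1:]])


def has_coordinates(row):
    """Return True only if both coordinate fields parse as numbers."""
    try:
        float(row["decimalLatitude"])
        float(row["decimalLongitude"])
    except (KeyError, TypeError, ValueError):
        return False
    return True


def tidy_records(rows):
    """Split rows into (kept, dropped); every row gets a tidied name."""
    kept, dropped = [], []
    for row in rows:
        row = dict(row, scientificName=tidy_name(row.get("scientificName") or ""))
        (kept if has_coordinates(row) else dropped).append(row)
    return kept, dropped


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept, dropped = tidy_records(rows)
print(len(kept), "kept;", len(dropped), "dropped")
```

Keeping the dropped rows in their own list, rather than discarding them, lets you review what was removed and why before you move on.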

Note: If you don’t have much experience working with data, we recommend starting small and getting comfortable with some of the elements below. Starting with a large dataset can be overwhelming and impractical.

Anomalies

Hopefully, you’ve now familiarised yourself with your dataset and are working with something that’s a bit cleaner. Maybe you removed some records because they were missing latitude values. Are all your species names correctly formatted, with consistent capitalisation? Now that you’ve dealt with the general issues in your variables, it’s time to work with anomalies. Approach these next steps with a clear goal in mind, because how clean your data need to be depends on the question you’re trying to answer.

[Image: checklist for dealing with anomalies, alongside two maps of eastern Australia with species occurrences shown as dots. The second map shows dots further south than the first.]

Figure 2. Part 2, dealing with anomalies in open-source data

As you’ll have noticed above, if a record is missing coordinates you’ll likely remove it, depending on what you’re trying to do, because key information is missing. Similarly, if a record’s coordinates place it inside a museum or large city far away from the rest of the records (Perth, in Figure 1), you can be reasonably confident that this is not where the species was recorded in the wild, and the record can be removed. However, records can also be outliers with no clear explanation in the data. Before you start removing these records from your dataset, let’s explore what an outlier is, and why outliers are more challenging when you’re working with open-source data.
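One way to find records like the Perth example is to measure how far each record lies from a place you already suspect, and flag anything very close to it. The sketch below is a minimal illustration using the haversine great-circle distance. The record structure, the approximate coordinates for central Perth, and the 25 km radius are all assumptions for illustration; choose a radius that suits your species and question.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate location of central Perth (assumed value for illustration).
PERTH = (-31.95, 115.86)


def flag_near(records, place, radius_km=25.0):
    """Return the records that lie within radius_km of the given place."""
    return [
        r for r in records
        if haversine_km(r["lat"], r["lon"], place[0], place[1]) <= radius_km
    ]


# Hypothetical records: two on the east coast, one in central Perth.
records = [
    {"id": "a", "lat": -33.87, "lon": 151.21},
    {"id": "b", "lat": -31.96, "lon": 115.85},
    {"id": "c", "lat": -34.10, "lon": 150.90},
]

suspects = flag_near(records, PERTH)
print([r["id"] for r in suspects])  # prints ['b']
```

The function returns the suspect records rather than deleting them, so you can check each one (for example, whether it is a museum specimen) before deciding what to do with it.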

What is an outlier? Is it an environmental outlier or an issue with the data?

When you’re dealing with data, especially data collected by an external party, it’s important to think critically about what you are looking at. A common statistical definition of an outlier is an observation that lies an abnormal distance from other values in a random sample from the population. Being abnormally far from the rest is not, on its own, enough grounds to remove an observation, because distance alone does not show that the record is an error. Let’s use an example to help us understand this: climate change is causing range shifts. As overall temperatures warm, species are moving (where they can) to cooler areas so that they continue to meet their climatic requirements.

Note: A range shift is a change in the boundaries of a species’ distribution relative to its previously known boundaries (at any or all developmental stages and/or seasons).

Figure 2 displays an example of what this range shift might look like from the perspective of open-source data. The species observed in the 2022 survey were likely similarly distributed in 2021; however, the data submitted between 1950 and 2021 are sparse. If we looked only at the data available in 2021, we might want to remove the isolated data point. We get a much better understanding after the 2022 survey, which shows a range shift further south, to higher latitudes. The record we may have been tempted to remove in 2021 now fits within our understanding in 2022.

This is not to say that all outliers will indicate trends in species data. However, because open-source data means you’re dealing with data collected by many different people, it’s even more important to carefully inspect outliers.
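In practice, careful inspection usually means flagging outliers for review rather than deleting them automatically. The sketch below is a minimal illustration of that idea using the interquartile range (Tukey's fences) on latitude. This is a generic statistical rule of thumb, not an ALA method, and the latitude values and the multiplier k=1.5 are assumptions for illustration.

```python
import statistics


def flag_latitude_outliers(latitudes, k=1.5):
    """Return one True/False flag per latitude, in input order.

    A value is flagged when it lies more than k interquartile ranges
    below the first quartile or above the third quartile.
    """
    q1, _, q3 = statistics.quantiles(latitudes, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [not (low <= lat <= high) for lat in latitudes]


# Hypothetical latitudes: seven clustered records and one far to the south.
latitudes = [-27.5, -27.9, -28.1, -28.4, -28.6, -29.0, -29.3, -36.8]

flags = flag_latitude_outliers(latitudes)
for lat, is_outlier in zip(latitudes, flags):
    if is_outlier:
        print("inspect before removing:", lat)
```

A flagged record is a prompt to look more closely (at its date, source and any later surveys), not a decision to remove it. As the range shift example shows, an isolated record can turn out to be an early sign of a real change.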