TY - JOUR
T1 - Large biodiversity datasets conform to Benford's law
T2 - Implications for assessing sampling heterogeneity
AU - Szabo, Judit K.
AU - Forti, Lucas Rodriguez
AU - Callaghan, Corey T.
N1 - Funding Information:
LRF received a fellowship from the Coordination for the Improvement of Higher Education Personnel (CAPES - Finance Code 001). CTC was supported by a Marie Skłodowska-Curie Individual Fellowship (No. 891052). The authors are indebted to the two anonymous reviewers and the associate editor for their insightful comments that have greatly improved the article.
Publisher Copyright:
© 2023 Elsevier Ltd
PY - 2023/4
Y1 - 2023/4
N2 - Inadequate sampling can cause biased estimates of species diversity, as species occurrence generally follows a log-normal distribution with a long tail. Understanding this sampling bias is fundamental to informing biodiversity conservation actions. However, currently available tests to assess data quality, such as fitting species abundance distribution (SAD) models and rarefaction curves, are computationally costly and can still lead to erroneous conclusions. We evaluated Benford's law (first digit distribution) as a complementary method to assess data heterogeneity and survey coverage in large biodiversity datasets, including eBird data for 157 countries and three non-avian GBIF datasets. We also tested conformity to Benford's law of four simulated communities with different SAD models and four corrupted datasets with log-normal SAD. Finally, we evaluated the effect of including rare species in three datasets on the conformity to Benford's law and also compared Benford fit to the results of traditional methods to estimate survey completeness in seven datasets. Species-rich datasets with a large number of observations tended to obtain a good fit. Benford conformity can be a simple and sensitive measure of sampling evenness, complementing traditional methods to assess data quality in large-scale studies. Benford's test can reflect species abundance heterogeneity, especially in log-normally distributed data, but was not ideal for evaluating survey completeness, as its results diverged from those of traditional methods. As the contribution of citizen science continues to increase in biodiversity monitoring, this fast and efficient method can play a critical role in assessing the quality of datasets.
KW - Biodiversity data
KW - Citizen science
KW - Community science
KW - First-digit frequency
KW - Numeric data
KW - Reliability
KW - Species occurrences
UR - http://www.scopus.com/inward/record.url?scp=85148771020&partnerID=8YFLogxK
U2 - 10.1016/j.biocon.2023.109982
DO - 10.1016/j.biocon.2023.109982
M3 - Article
AN - SCOPUS:85148771020
SN - 0006-3207
VL - 280
SP - 1
EP - 13
JO - Biological Conservation
JF - Biological Conservation
M1 - 109982
ER -