Mise en forme et contrôle qualité de données, l’informatique au service de l’écologie


Naturae 2022 (2) - Pages 17-30 (article in French)

Published on 26 January 2022

Data formatting and quality control, computing for ecology.

In many scientific disciplines, data are collected through experimental studies or field monitoring. The data are stored in raw files whose format is intuitive and easy for the experimenter to fill in. However, this raw format is rarely directly compatible with the analysis of the collected data, and the data may contain errors; formatting and quality control are therefore necessary. Faced with a growing number of ever larger raw data sets, the discipline of digital technology for the life sciences has developed. Computer programming is a valuable aid to modellers because it allows data formatting and data cleaning to be automated. Data formatting produces a format that can be used directly in the analyses. Automation also avoids the errors introduced by manual formatting, such as typing errors and omissions. Data cleaning, a term used in computer science, corresponds to quality control of the data according to criteria provided by the modeller, who knows the expected range of values and the errors likely to occur. In this article, we present a collaboration between a computer scientist and a modeller in the context of animal abundance monitoring. The data, collected on several sheets of a spreadsheet, had to be gathered on a single sheet and their quality had to be checked. The functionalities of the program that performs this verification were implemented using the “agile” method, a software development method organised in sprints. After each version was delivered, a new sprint defined the next functionality to be implemented by the computer scientist in a new version of the program. The first version, with its formatting functionality, allowed the computer scientist to become familiar with the dataset. A more advanced version handles missing data; later versions check the quality of the collected data and report, in a text file, how detected anomalies (missing, erroneous, or out-of-range data) were processed. The program is explained so that it can be adopted and reused; the full version is deposited on GitHub, and the link is given in the conclusion.
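
The workflow summarised above (gathering several sheets into one table, then flagging missing, erroneous, or out-of-range values in a text report) can be illustrated with a minimal Python sketch. This is not the program described in the article: the column names (`site`, `count`), the sheet names, the accepted range, the report file name, and all function names are hypothetical choices made for illustration only.

```python
from typing import Dict, List, Tuple

# One list of rows per spreadsheet sheet; every value is read as text.
# The "site" and "count" columns are illustrative assumptions.
Row = Dict[str, str]


def merge_sheets(sheets: Dict[str, List[Row]]) -> List[Row]:
    """Gather the rows of every sheet into one table, keeping the source sheet."""
    merged: List[Row] = []
    for sheet_name, rows in sheets.items():
        for row in rows:
            merged.append({"sheet": sheet_name, **row})
    return merged


def check_counts(
    rows: List[Row], low: int, high: int
) -> Tuple[List[Dict[str, object]], List[str]]:
    """Split rows into clean records and human-readable anomaly messages."""
    clean: List[Dict[str, object]] = []
    anomalies: List[str] = []
    for index, row in enumerate(rows, start=1):
        where = f"row {index} (sheet {row['sheet']})"
        raw = (row.get("count") or "").strip()
        if raw == "":
            anomalies.append(f"{where}: missing count")
            continue
        try:
            value = int(raw)
        except ValueError:
            anomalies.append(f"{where}: non-numeric count {raw!r}")
            continue
        if not low <= value <= high:
            anomalies.append(f"{where}: count {value} outside [{low}, {high}]")
            continue
        clean.append({**row, "count": value})
    return clean, anomalies


def write_report(anomalies: List[str], path: str) -> None:
    """Write one anomaly per line to a plain-text report."""
    with open(path, "w", encoding="utf-8") as handle:
        for message in anomalies:
            handle.write(message + "\n")


if __name__ == "__main__":
    # Hypothetical sample data standing in for two spreadsheet sheets.
    sample = {
        "2020": [{"site": "A", "count": "12"}, {"site": "B", "count": ""}],
        "2021": [{"site": "A", "count": "abc"}, {"site": "C", "count": "900"}],
    }
    clean_rows, found = check_counts(merge_sheets(sample), low=0, high=500)
    write_report(found, "anomalies.txt")
    print(f"{len(clean_rows)} clean row(s), {len(found)} anomaly(ies) reported")
```

In this sketch the accepted range is passed in as a parameter, reflecting the idea that the quality criteria come from the modeller rather than being fixed in the code.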

Abundance, data management, agile software development, data cleaning, computer programming.