
Geometric morphometric methods constitute a powerful and precise tool for quantifying morphological differences. In palaeontology, however, their use is very often limited by missing data. Landmark-based shape analysis methods are very sensitive but have not, until now, been adapted to incomplete datasets. To assess the prospective utility of these methods for fossil taxa, we propose a model based on prosimian cranial morphology in which we test two methods of missing data reconstruction. The model consists of generating missing data in a complete dataset (in increments of 5%) and estimating the missing values using two multivariate methods. The estimates were found to be a useful tool for the analysis of partial datasets, up to a certain proportion of missing data. These results are promising for future studies of morphological variation in fossil taxa.

Geometric morphometric methods constitute a powerful and precise tool for quantifying morphological differences. However, applying geometric morphometric methods in palaeontology raises the problem of missing data. Because the material is often fragmentary, shape analysis methods, and landmark-based methods in particular, are ill-suited to this type of data. With a view to application to fossils, a model for testing methods of missing data reconstruction is proposed using a sample of prosimian primates. The model consists of generating missing data from a complete dataset (in increments of 5%) and reconstructing these missing data. The relevance of the reconstructions is tested. The results indicate that, within certain limits, reconstruction methods allow partially preserved specimens to be included in the analysis. These conclusions are promising for the analysis of morphological variation in fossil taxa.

Because of fossilization and preservation processes, palaeontologists usually work with datasets that are, by nature, incomplete. Multivariate morphometrics generally requires a relatively complete dataset; using incomplete palaeontological collections can therefore present a real methodological dilemma. This is especially true of geometric morphometrics, in which the variables do not merely describe the dimensions or shapes of specimens but are parts of the morphology itself (e.g. coordinates of landmarks). With this complication in mind, palaeontologists commonly choose to work on extant data, to select a subset of the data that is present on all available specimens, or to work only on complete specimens, thereby excluding specimens with missing data from the sample. In each scenario, the palaeontologist must either exclude part of the morphology from the analysis or reduce the sample size. While both drawbacks may be offset by working with large samples, in the case of small samples these solutions may prove too restrictive or unworkable.

Recently, a great deal of literature has been devoted to the issue of estimating missing-data in an incomplete dataset (e.g

In the analysis of fossil material, reconstruction methods commonly rely on one or several reference specimens. Under this approach, the shape of an incomplete specimen is fitted onto the known morphology of an undamaged specimen, and the missing coordinates are estimated (

The aim of the current study is not to study morphological variation in fossil primates; rather, we anticipate that our test of the power of reconstruction methods can eventually be applied to a sample of extinct primates. Using a set of extant prosimian primates with a complete set of landmarks to test the power of missing data estimation, we ask the following questions: (1) is there a maximum amount of missing data that can be estimated; (2) can we rely on methods of reconstructing missing data to study morphological differences using geometric morphometrics; and (3) is it ultimately of benefit to reconstruct missing landmarks? We anticipate that addressing these questions will give new insight into the use of missing data reconstruction in future analyses of fossil primates.

Data were collected on two genera (five species, total) in the collections of Laboratoire mammifères et oiseaux of the Muséum national d’histoire naturelle, Paris, France (

A total of 80 landmarks were defined on the crania and digitized using a Microscribe G2X digitizer (Immersion Corporation, San Jose, California). Landmarks were located on the sagittal plane of each side of the cranium (

The procedure we followed to explore the effect of missing data on geometric morphometric analysis comprised several steps. First, the entire dataset was subjected to random deletion of landmarks in 5% increments using the
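The random deletion step can be illustrated with a minimal Python sketch. This is not the software used in the study; the function name `delete_landmarks`, the toy coordinates, the fixed seed, and the choice to draw each deletion level independently from the complete dataset are all assumptions made for illustration.

```python
import random

def delete_landmarks(specimens, fraction, seed=0):
    """Return a copy of `specimens` with a fraction of all landmarks set to None.

    specimens: list of specimens, each a list of (x, y, z) tuples.
    fraction:  proportion of landmarks in the whole dataset to delete (e.g. 0.05).
    """
    rng = random.Random(seed)
    # Enumerate every (specimen, landmark) slot in the dataset.
    slots = [(i, j) for i, spec in enumerate(specimens) for j in range(len(spec))]
    n_delete = round(fraction * len(slots))
    to_delete = set(rng.sample(slots, n_delete))
    return [
        [None if (i, j) in to_delete else point for j, point in enumerate(spec)]
        for i, spec in enumerate(specimens)
    ]

# Hypothetical toy dataset: 10 specimens x 80 landmarks (placeholder coordinates).
toy = [[(float(i), float(j), 0.0) for j in range(80)] for i in range(10)]

# One incomplete dataset per deletion level (5, 10, 15 and 20%).
incomplete = {pct: delete_landmarks(toy, pct / 100) for pct in (5, 10, 15, 20)}
```

Deleting across the pooled set of landmark slots, rather than per specimen, means that individual specimens can lose more or less than the nominal percentage; a per-specimen scheme would be an equally plausible reading.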

For each step of deletion and estimation, an estimation error was computed using the R software. The error was quantified by calculating the deviation between the original coordinates of specimens and the estimated
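One possible way to express such a deviation as a percentage is sketched below in Python (the study itself used R, and its exact formula is not given here). The sketch assumes that the error is the mean Euclidean distance between original and estimated positions of the deleted landmarks, scaled by the centroid size of the original specimen; both the scaling choice and the function names are hypothetical.

```python
import math

def centroid_size(landmarks):
    """Square root of the summed squared distances of landmarks to their centroid."""
    n = len(landmarks)
    centroid = [sum(p[k] for p in landmarks) / n for k in range(3)]
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in landmarks))

def estimation_error_percent(original, estimated, missing_idx):
    """Mean original-to-estimated distance over deleted landmarks, as % of centroid size.

    Assumption: scaling by centroid size is an illustrative choice, not the
    documented procedure of the study.
    """
    if not missing_idx:
        return 0.0
    mean_dev = sum(
        math.dist(original[j], estimated[j]) for j in missing_idx
    ) / len(missing_idx)
    return 100.0 * mean_dev / centroid_size(original)

# Toy specimen with four landmarks; landmark 0 was deleted and then re-estimated.
original = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
estimated = [(1.0, 0.2, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
error = estimation_error_percent(original, estimated, [0])  # approximately 10.0
```

In this toy case the centroid size is 2 and the re-estimated landmark is displaced by 0.2, so the error sits right at a 10% level.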

We analyzed nine datasets: the original dataset; the datasets obtained after estimating 5, 10, 15 and 20% missing data using the EM method; and the datasets obtained after estimating 5, 10, 15 and 20% missing data using the MR method. To each dataset we applied a generalized Procrustes analysis, using a Generalized Least-Squares (GLS) algorithm, to perform translation, rotation, and scaling (to unit centroid size). With this procedure, differences in shape are reported as residuals from each transformed landmark or as uniform changes in the overall shape (
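The three operations of a generalized Procrustes analysis can be sketched as follows. This is a minimal illustration assuming NumPy is available, not the implementation used in the study; the function names, the iteration limit, the tolerance, and the toy data are assumptions.

```python
import numpy as np

def center_and_scale(shape):
    """Translate the centroid to the origin and scale to unit centroid size."""
    centered = shape - shape.mean(axis=0)
    return centered / np.linalg.norm(centered)

def rotate_onto(shape, reference):
    """Least-squares rotation of `shape` onto `reference` (Kabsch algorithm)."""
    u, _, vt = np.linalg.svd(shape.T @ reference)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    u[:, -1] *= np.sign(np.linalg.det(u @ vt))
    return shape @ (u @ vt)

def gpa(shapes, n_iter=10, tol=1e-10):
    """Align an array of shape (n_specimens, n_landmarks, 3); return shapes and mean."""
    aligned = np.array([center_and_scale(s) for s in shapes])
    mean = aligned[0]  # first specimen serves as the initial reference
    for _ in range(n_iter):
        aligned = np.array([rotate_onto(s, mean) for s in aligned])
        new_mean = center_and_scale(aligned.mean(axis=0))
        shift = np.linalg.norm(new_mean - mean)
        mean = new_mean
        if shift < tol:
            break
    return aligned, mean

# Toy check: five copies of one configuration, each rotated, rescaled and shifted.
rng = np.random.default_rng(0)
base = rng.normal(size=(80, 3))
shapes = []
for _ in range(5):
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1  # make q a proper rotation
    shapes.append(2.5 * base @ q + rng.normal(size=3))
aligned, mean = gpa(np.array(shapes))
```

Because the toy configurations differ only in position, orientation and size, they coincide after alignment; with real specimens the remaining differences are the Procrustes residuals that carry the shape information.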

Discriminant Function Analysis (DFA) was applied to the different sets of landmarks for each of the two estimation methods to determine whether genera, species and subspecies could be statistically distinguished from one another on the basis of shape. DFA was used in this context as it emphasizes relationships among group covariance matrices to discriminate between groups (see, among others,

The impact of missing-data on the initial sample is presented in

In general, the mirror reflection method proved an efficient solution for estimating paired landmarks (where a landmark was present on one side but missing on the other). By applying mirror reflection first, we were able to reduce the amount of missing data considerably. The remaining missing landmarks were then estimated using the multivariate methods. The percentage of estimation error as a function of percentage of missing data is shown in
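The principle of mirror reflection for paired landmarks can be sketched in Python as follows. The sketch assumes that each specimen has already been oriented so that its sagittal plane coincides with the plane x = 0, and that the left/right pairing is supplied as index pairs; in practice the plane would have to be estimated from the midline landmarks. The function name and the toy values are hypothetical.

```python
def mirror_fill(landmarks, pairs):
    """Fill a missing landmark by reflecting its bilateral counterpart across x = 0.

    landmarks: list of (x, y, z) tuples, with None marking a missing landmark.
    pairs:     list of (left_index, right_index) tuples for bilateral landmarks.
    Assumption: the sagittal plane of the specimen is the plane x = 0.
    """
    filled = list(landmarks)
    for left, right in pairs:
        a, b = filled[left], filled[right]
        if a is None and b is not None:
            filled[left] = (-b[0], b[1], b[2])
        elif b is None and a is not None:
            filled[right] = (-a[0], a[1], a[2])
        # If both sides are missing, the pair is left for multivariate estimation.
    return filled

# Toy specimen: landmark 1 is missing but its counterpart (landmark 0) is present;
# landmarks 3 and 4 are both missing and therefore cannot be mirrored.
specimen = [(-1.0, 2.0, 3.0), None, (0.0, 1.0, 1.0), None, None]
result = mirror_fill(specimen, pairs=[(0, 1), (3, 4)])
```

Only pairs with one surviving side are recovered this way, which is why a second estimation step is still needed for midline landmarks and for pairs missing on both sides.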

A DFA and a CVA on the discriminant functions were performed on the initial full dataset (with no missing-landmarks), with the first two canonical axes illustrated in ^{2} test with

Using both the EM and MR methods of data estimation, cranial morphologies are differentiated by the canonical functions at the 5 and 10% data deletion level (

Following simulated data deletion and landmark estimation as described here, we conclude that the impact of missing data on 3D morphometric analysis is high and grows as the amount of missing data increases. The investigator must therefore be cautious in the estimation process, as the choice of sampled specimens, deleted landmarks, or estimation methods may prove too restrictive. We concede, however, that palaeontologists rarely work under the conditions simulated here. Missing data are rarely randomly distributed; instead, missing landmarks are commonly located on the more fragile parts of a fossil. In primate crania, for example, the zygomatic arches or the bones of the neurocranium are, in our experience, damaged more often than portions of the face. Our results must therefore be weighed against the distinct possibility that missing data occur more often on particular portions of the skull rather than in a random pattern. Our results concur with those discussed by

Our results suggest that missing landmarks cannot be reliably estimated beyond the 20% data deletion level. Once this level of data deletion has been reached, the estimation error exceeds the 10% threshold.

While the mirror reflection method appears to be a relatively powerful and accurate solution for estimating missing data, it does reduce the asymmetric variation between the right and left sides of a specimen. This variation remains of high interest for studying the influence of environmental adaptation on the development of fluctuating asymmetry (random variations between right and left sides) as a measurable expression of developmental instability (

In 2009, Neeser and colleagues investigated a mean substitution method for estimating missing-landmarks. The mean substitution method is one based on substitution using Thin Plate Spline and multivariate regression techniques. These authors utilized three large sample units (

With their results in mind,

With reference to our initial research questions, our results suggest that a data deletion level of 15% (possibly extending to 20%) is the upper limit to the utility of data estimation. In this sample, data deletion levels greater than 15% produced relatively unreliable results. Below this limit, however, methods of missing data estimation have the potential to be very useful for studying morphological differences using geometric morphometric techniques. We feel this is particularly true if the comparisons are between taxonomic groups. We conclude, therefore, that the estimation of missing data constitutes an appropriate solution for palaeontological studies that include damaged specimens or comparisons based on small sample sizes. Once again, however, we stress that any estimation process must be chosen with the purpose of the comparison in mind and with consideration of the amount of missing data.

The authors wish to thank Jacques Cuisin and Julie Villemain for access to the specimens, and Rémi Laffont and Eloïse Zoukouba for their methodological advice. They also thank Gaël Clément and Didier Geffard-Kuriyama for inviting them to submit the article for this special volume. Finally, they would like to thank the GDR 2474 CNRS “


Position of the landmarks on a cranium of

Position des points homologues sur un crâne d’

Estimation of the missing-data impact on the sample.

Estimation de l’impact des données manquantes sur l’échantillon. L’axe des

Percentage of estimation error as a function of percentage of missing-data in the sample for EM (black curve) and Multiple regression (gray curve) methods. Dotted line represents the 10% threshold.

Pourcentage d’erreur d’estimation, en fonction du pourcentage de données manquantes dans l’échantillon pour les méthodes EM (courbe noire) et régression multiple (courbe grise). La ligne pointillée indique le seuil d’erreur de 10 %.

Morphospace occupation of specimens, from the full dataset to the estimation of 20% of missing data, using both the EM and multiple regression methods. The percentage on each scatter plot indicates the amount of estimated missing data. For each case, a visualization of an “extreme” morphology (specimen indicated with a star) is given to check the biological meaning of the estimation.

Occupation de l’espace morphologique, depuis le jeu de données complet jusqu’à l’estimation de 20 % des données manquantes, pour les méthodes EM et régression multivariée. Le pourcentage dans chaque graphique indique la proportion de données manquantes. Une visualisation de la morphologie crânienne d’un individu « extrême » (figuré par une étoile) est donnée, afin de contrôler le sens biologique de l’estimation.

Canonical variate analysis performed on discriminant functions, grouping by taxon, of the full dataset (no missing-data).

Analyse canonique réalisée sur les fonctions discriminantes de l’échantillon complet (sans données manquantes) en regroupant par taxon.

Canonical variate analyses performed on discriminant functions, grouping by taxon. Missing data, from 5 to 20%, were estimated using both the EM and multivariate regression methods. See

Analyses canoniques réalisées sur les fonctions discriminantes, en regroupant par taxon. Les données manquantes, de 5 à 20 %, ont été estimées par les méthodes EM et de régression multivariée. Se référer à la

Definition and position of the landmarks.

Définition et position des points homologues.