Functional data geometric morphometrics with machine learning for craniodental shape classification in shrews

Morphometrics is a fundamental discipline in biological research that focuses on quantitatively describing and analysing shape and its variation across organisms. Initially centred on basic descriptive measurements, the field has progressed significantly and now employs advanced statistical and computational techniques to study shape and size variation¹. Conceptually, morphometrics can be broadly categorised into two approaches: landmark-based morphometrics, which relies on the precise positioning of anatomical landmarks, and outline-based morphometrics, which captures the contour of forms through a sequence of pseudo-landmarks^2,3. As morphometric techniques continue to advance, the selection of appropriate methods becomes crucial for meaningful applications in biological research.

Geometric morphometrics (GM) is a widely used approach for studying morphological variation in biological organisms. It is based on the idea that the shape of an organism can be described by the coordinates of a set of landmarks on its surface. Landmarks are points that can be located consistently at the same anatomical position on every specimen, regardless of the size or orientation of the organism. Generalised Procrustes analysis (GPA) can be applied to the raw landmarks to superimpose the landmark configurations using least-squares estimates of the translation and rotation parameters¹. The resulting shape variables can be used to compare the shapes of different organisms, to visualise the results graphically, to track changes in shape over time, and to identify the underlying causes of shape variation.
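
The superimposition step described above can be illustrated with a minimal NumPy sketch. It assumes 2D or 3D landmark arrays with the same landmarks in the same order for every specimen; the function names `align_to_reference` and `generalised_procrustes` and the toy data are hypothetical, and this is not the implementation used in the study.

```python
import numpy as np

def align_to_reference(shape, reference):
    """Rotate a centred, unit-size shape onto a reference by least squares."""
    # Optimal rotation from the SVD of the cross-covariance matrix
    u, _, vt = np.linalg.svd(shape.T @ reference)
    rotation = u @ vt
    if np.linalg.det(rotation) < 0:  # exclude reflections
        u[:, -1] *= -1
        rotation = u @ vt
    return shape @ rotation

def generalised_procrustes(shapes, n_iter=10, tol=1e-10):
    """shapes: array-like of shape (n_specimens, n_landmarks, n_dims)."""
    shapes = np.asarray(shapes, dtype=float)
    # Translation: move every configuration's centroid to the origin
    shapes = shapes - shapes.mean(axis=1, keepdims=True)
    # Scaling: rescale every configuration to unit centroid size
    shapes = shapes / np.linalg.norm(shapes, axis=(1, 2), keepdims=True)
    mean_shape = shapes[0]
    for _ in range(n_iter):
        # Rotation: align every configuration to the current mean shape
        shapes = np.array([align_to_reference(s, mean_shape) for s in shapes])
        new_mean = shapes.mean(axis=0)
        new_mean /= np.linalg.norm(new_mean)
        converged = np.linalg.norm(new_mean - mean_shape) < tol
        mean_shape = new_mean
        if converged:
            break
    return shapes, mean_shape

# Toy data (synthetic): one base shape under random similarity transforms
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
raw = []
for _ in range(5):
    theta = rng.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta), np.cos(theta)]])
    raw.append(rng.uniform(0.5, 2.0) * base @ rot + rng.normal(size=2))
aligned, mean_shape = generalised_procrustes(raw)
```

Because the toy configurations differ only in position, size, and orientation, they coincide after superimposition; with real specimens the residual differences are the shape variation of interest.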

The utility of GM has been demonstrated in numerous studies involving both macro- and microfauna. For instance, Moneva et al. highlighted the taxonomic confusion surrounding Pomacea canaliculata, a significant rice pest in Asia, by revealing notable sexual differences in shell size and shape⁴. Importantly, the study underscored the utility of GM methods in detecting subtle morphological differences between sexes, thus offering a more nuanced understanding of shape variation in P. canaliculata⁴. Similarly, Theska et al. presented a standardised protocol for conducting GM analyses on 2D landmark data sets focusing on stomatal shapes in the model nematodes Caenorhabditis and Pristionchus, showcasing its adaptability in quantifying shape disparities within and across species⁵. In another study, Phung et al. investigated sexual dimorphism in Leptopoma perlucidum land snails using GM, revealing significant differences in shell size between the sexes. These findings underscored the importance of considering sexual dimorphism in taxonomic studies within the genus Leptopoma⁶.

The efficiency and versatility of GM are further exemplified in studies such as Maderbacher et al. (2008), which successfully discriminated three populations of the cichlid fish Tropheus moorii using the GM method⁷. Dudzik also used GM to examine the cranial morphology of Asian and Hispanic populations by performing discriminant and canonical variate analyses⁸. The output of the GM analysis revealed significant differences in cranial shapes between the two groups, and both studies concur that GM serves as a valuable tool for identifying morphological similarities among populations based on cranial morphology. Within morphometrics, craniodental morphology holds particular significance, offering insights into taxonomic discrimination, evolutionary studies, and biomedical implications. Adams and Rohlf highlighted the importance of craniodental morphometrics in elucidating ecological character displacement in Plethodon salamanders through landmark-based GM analysis⁹. Slice explored the application of morphometrics in physical anthropology with a focus on craniodental morphology. This work highlighted the use of landmark-based morphometrics in studying human evolution and its practical application in anthropology¹⁰. These studies not only shed light on the functional adaptations of craniodental structures but also motivate extending the GM technique to the craniodental morphology examined in this paper. Although the advantages of GM are widely known, an important limitation of the technique is that a sufficient number of landmarks may not be available to capture the geometry of biological organisms, so important shape differences may occur between landmarks and go unrecorded¹.

The study of craniodental morphology in shrews is an invaluable avenue for gaining insights into their evolutionary trajectory, taxonomic classification, and ecological adaptations. Shrews, belonging to the order Eulipotyphla, are characterised by their small size, insectivorous diet, and rapid metabolism. Despite their small stature, shrews exhibit remarkable diversity in craniodental morphology, reflecting adaptations to different ecological niches and evolutionary pressures. This is evident in the study conducted by Vasil’ev et al., which revealed geographical variability in the shape of the mandible in three shrew species of the genus Sorex using GM. Notably, discriminant analysis of Procrustes coordinates derived using GM reported a high percentage of correct assignment of individual shrews to distinct local taxocenes, further validating the efficiency of this methodology in taxonomic studies¹¹. Moreover, Vilchis-Conde et al. reinforced the significance of GM in supporting the taxonomic classification of semifossorial shrews. Their research also revealed that the shapes of the skull, particularly the dentary, are associated with diet specialisation, highlighting the profound impact of morphological variations on functional aspects such as bite force among shrews¹².

Our study focuses on the craniodental variation among three shrew species: Crocidura malayana (Robinson & Kloss, 1911), Crocidura monticola (Peters, 1870), and Suncus murinus (Linnaeus, 1766). Each species occupies a distinct ecological niche. C. malayana, a medium-sized shrew, occurs in Thailand, Malaysia, and several offshore islands¹³. This terrestrial species has been documented in both hill and lowland forests^14,15. Meanwhile, C. monticola, the smallest shrew in the genus Crocidura, is restricted to forest areas in Malaysia and Indonesia¹⁶. On the other hand, S. murinus, the largest of the three species, is predominantly found in urban areas and the outskirts of forests, with a wide distribution spanning human settlements in the Indian subcontinent and Southeast Asia¹⁷.

Functional data analysis (FDA) is a statistical methodology used to analyse data that are represented in the form of functions, such as entire curves, surfaces, or other continuous functions, rather than discrete sampling points. It is particularly useful when dealing with data that vary continuously over a domain, such as time, space, or wavelength. In the context of our work, the basic idea behind FDA is to represent discrete sampling points, such as landmark coordinates, as a function. This involves creating functional data that encapsulate all the coordinates to represent the entire measured function. Models are then generated to predict information based on a collection of functional data by applying statistical concepts from multivariate data analysis¹⁸. Ramsay and Silverman provided a comprehensive introduction to FDA, covering theoretical foundations and practical applications, including methods for clustering and classification of functional data, which are particularly relevant for grouping similar surfaces or curves in morphometrics¹⁹. The FDA framework allows better accuracy in parameter estimation in the analysis phase, effective data noise reduction through curve smoothing, and applicability to data with irregular time sampling schedules¹⁸.
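
The idea of turning discrete landmark coordinates into a function can be sketched as a least-squares basis expansion. The example below is only an illustration under stated assumptions: the landmarks are ordered points on a closed outline, they are parameterised by equally spaced values of t in [0, 1), and a Fourier basis is used. The basis choice, the function names `fourier_basis` and `landmarks_to_function`, and the toy ellipse are assumptions made for this sketch, not details taken from the study.

```python
import numpy as np

def fourier_basis(t, n_harmonics):
    """Design matrix with columns 1, sin(2*pi*k*t), cos(2*pi*k*t)."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)

def landmarks_to_function(landmarks, n_harmonics=3):
    """Fit x(t) and y(t) to ordered landmarks of shape (n_landmarks, 2)."""
    landmarks = np.asarray(landmarks, dtype=float)
    # Assumed parameterisation: landmarks equally spaced in t on [0, 1)
    t = np.arange(len(landmarks)) / len(landmarks)
    design = fourier_basis(t, n_harmonics)
    # Least-squares basis coefficients, one column per coordinate
    coefs, *_ = np.linalg.lstsq(design, landmarks, rcond=None)

    def curve(t_new):
        # Evaluate the fitted functions at arbitrary parameter values
        t_new = np.asarray(t_new, dtype=float)
        return fourier_basis(t_new, n_harmonics) @ coefs

    return curve, coefs

# Toy data (synthetic): 12 pseudo-landmarks on an ellipse
t_obs = np.arange(12) / 12
landmarks = np.column_stack([2 * np.cos(2 * np.pi * t_obs),
                             np.sin(2 * np.pi * t_obs)])
curve, coefs = landmarks_to_function(landmarks)
dense = curve(np.linspace(0, 1, 200, endpoint=False))  # 200 points on the curve
```

Once each specimen is stored as a coefficient matrix, the fitted curve can be evaluated at any resolution, which is what allows shape to be analysed between the original landmarks.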

Bookstein introduced landmark methods for analysing shape differences in outlines, which can be considered a precursor to some FDA techniques. Landmark and outline analyses were combined in that study to provide a richer description of the overall shape of the human brain using MRI images²⁰. Dryden and Mardia focused primarily on statistical shape analysis and also discussed the foundations of landmark shape analysis, including geometrical concepts and statistical techniques for the analysis of curves, surfaces, images, and other types of object data²¹. Functional data analysis considers shapes as continuous functions or curves, allowing for the analysis of shape changes over a continuum such as time or developmental stages.

In our work, the landmark coordinates used in the GM method are represented as functions. Each sample element is considered as a function under the FDA framework, which often defines time, spatial location, or wavelength as the physical continuum. This study proposes functional data geometric morphometrics (FDGM), which involves performing statistical analysis on signals, curves, or even more complex objects while remaining invariant to certain shape-preserving transformations²². The proposed method combines FDA with GM. Unlike multivariate data analysis, FDA accounts for the continuity of curves and models the data within the functional space, rather than treating them as a set of vectors. In this study, we utilise curvature information from shrew skulls to construct a model comprising a set of landmarks serving as endpoints. By employing interpolation techniques across these landmarks, we can create a more refined shape representation. Although our study adopts a discrete point-based format for convenience, these points fundamentally represent a continuous surface. This approach naturally aligns with the FDA perspective, as elucidated by Ramsay and Silverman¹⁹. To ensure that the functions are well aligned with respect to geometric features such as peaks and valleys, curve registration^23,24 or functional alignment²⁵ is applied to warp the temporal domain of the functions²². The FDA framework has been reported to outperform its counterparts, including both the landmark-based approach and the set theory approach with principal component analysis (PCA), when applied to a well-known database of bone outlines²⁶. The set theory approach is adopted from a methodology outlined in Horgan²⁷, which treats shapes as sets²⁶. Each position within the image corresponds to a binary variable indicating whether or not it belongs to the shape. Consequently, that study performed PCA specifically tailored for binary data²⁶.

Building on Tian’s study of FDA in brain imaging analysis²⁸, our research aims to explore FDGM’s capacity to enhance sensitivity to subtle shape variations through the analysis of continuous function-based shape changes. This is particularly significant for studying species with minor morphological distinctions or monitoring subtle changes in response to environmental factors.

In our study, we transform landmark data into functional data following generalised Procrustes analysis (GPA). Generalised Procrustes analysis employs rigid transformations, including translation, rotation, and scaling, to align landmark configurations, standardising them for comparison²⁹. However, this method may not fully address non-rigid deformations or shape changes independent of position, orientation, or size. Consequently, GPA might not capture all aspects of shape variation, particularly those involving local deformations or complex shapes. To address this limitation, we employ FDA to model non-rigid deformations and intricate shape changes undetected by GPA. By analysing shape changes as continuous functions, FDA can identify and quantify subtle variations and local deformations, offering a more comprehensive understanding of shape variation. Moreover, GPA mandates a one-to-one correspondence between landmarks across specimens, simplifying analysis but potentially overlooking true anatomical correspondence, especially when dealing with ambiguous landmarks²⁹. In contrast, FDA relaxes this requirement, aligning shapes based on overall shape curves or surfaces rather than exact landmark correspondences. This allows for more flexible matching of shapes, particularly when landmarks are ambiguous or difficult to identify consistently.

We utilise the functional data to perform multivariate functional principal component analysis (MFPCA) to observe variation among the three shrew species, comparing the results with principal component analysis (PCA) in GM. Multivariate functional principal component analysis generates principal component scores (MFPC scores), capturing major sources of shape variation among the species. Landmark data sampled from curves are succinctly represented by continuous curves based on the Karhunen-Loève theorem. Our study demonstrates that FDGM can identify shape differences using classification methods, offering insights into underlying factors such as ecology or behaviour. While GM standardises landmark configurations effectively, the integration of FDA enhances morphometric analysis by capturing shape variation more comprehensively and sensitively, especially in complex structures like skulls.
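
To make the role of the MFPC scores concrete, the sketch below approximates a multivariate functional PCA by evaluating each specimen's coordinate functions on a shared, equally spaced grid and applying an SVD to the centred values. This discretisation is a simplification assumed for illustration (quadrature weights and basis-coefficient formulations are omitted), the function name `discretised_mfpca` and the toy ellipses are hypothetical, and it should not be read as the procedure used in the study.

```python
import numpy as np

def discretised_mfpca(curves, n_components=2):
    """curves: (n_specimens, n_grid, n_dims) function values on a shared grid."""
    curves = np.asarray(curves, dtype=float)
    n, m, d = curves.shape
    # Flatten each specimen's (grid, coordinate) values into one long vector
    flat = curves.reshape(n, m * d)
    mean = flat.mean(axis=0)
    u, s, vt = np.linalg.svd(flat - mean, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # one score row per specimen
    components = vt[:n_components].reshape(n_components, m, d)
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return mean.reshape(m, d), components, scores, explained

# Toy data (synthetic): ellipses whose two semi-axes vary between specimens
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50, endpoint=False)
a = 2.0 + 0.3 * rng.normal(size=20)
b = 1.0 + 0.1 * rng.normal(size=20)
curves = np.stack([np.column_stack([ai * np.cos(2 * np.pi * t),
                                    bi * np.sin(2 * np.pi * t)])
                   for ai, bi in zip(a, b)])
mean, components, scores, explained = discretised_mfpca(curves)

# Truncated Karhunen-Loève style reconstruction: mean plus score-weighted components
recon = mean + np.einsum("nk,kmd->nmd", scores, components)
```

In this toy set only two quantities vary between specimens, so two components reproduce every curve; for real data the leading scores summarise the dominant modes of shape variation and can be passed to a classifier.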

We aim to implement the FDGM framework to test for significant differences in the craniodental shapes of three species of shrews. We organise our study around the null hypothesis that there are no significant differences in craniodental shape among the three species under study. Under this hypothesis, any observed variations would be attributable to random fluctuations or measurement errors, rather than indicating genuine distinctions related to ecological niches or evolutionary processes. The hypothesis is framed within the context of traditional morphological analyses, which have long been instrumental in understanding evolutionary relationships and ecological adaptations among shrew species. Shrews, being small mammals with diverse habitats and diets, provide an intriguing subject for morphological investigation. Thus, any craniodental differences detected can be related to the different ecological niches that these three species occupy³⁰.

For our analysis, we collected 89 adult shrew specimens: 29 from S. murinus, 30 from C. monticola, and 30 from C. malayana. The habitats of C. malayana span diverse locations, including Lata Belatan, Terengganu; Ulu Gombak; Aur Island, Johor; Pangkor Island, Perak; Bukit Rengit, Pahang; Cheras Road, Kuala Lumpur; Port Dickson, Negeri Sembilan; and Dusun Tua, Selangor. Conversely, C. monticola exhibits a broader habitat range, inhabiting environments such as Ulu Gombak; Wang Kelian, dominated by secondary lowland forest, and Maxwell Hill, an upper dipterocarp forest, among others. Suncus murinus, on the other hand, is observed in locations like Wang Kelian, Perlis; Alor Setar, Kedah; Air Hitam, Pulau Pinang; Lumut, Perak; Ulu Gombak, Selangor; and Bukit Katil, Melaka. These varied habitats likely contribute to the divergence in craniodental morphology between species. Notably, C. malayana and C. monticola coexist in sympatry in Ulu Gombak, sharing the same habitat or niche. This study aims to elucidate the relationships between these species, offering valuable insights into the evolutionary processes shaping their craniodental morphology.

Morphometric studies for classification and identification tasks are enhanced by a wide range of machine learning methods³¹. The naive Bayes (NB), support vector machine (SVM), random forest (RF), and generalised linear model (GLM) classifiers³² are frequently applied because they have been used successfully in many previous studies. In Rodrigues et al., NB was the best classifier for detecting landmarks in automatic wing geometric morphometrics classification of honeybee (Apis mellifera) subspecies³³. Thomas et al. also applied the NB classifier in a novel GM approach that automates morphological phenotyping in ways that capture comprehensive representations of morphological variation with minimal observer bias³⁴, which indicates that NB can be a valuable tool for classification and pattern recognition tasks based on shape data. Bellin et al. successfully combined geometric morphometrics with different machine learning algorithms, including SVM with a radial basis function (RBF) kernel. Their study demonstrated the effectiveness of SVM in correctly classifying two Anopheles sibling species of the Maculipennis complex based on shape data³⁵. Hence, this study aims to incorporate supervised learning, particularly SVM, for the classification of three shrew species based on their morphological features. SVM can be used to classify shapes into different categories based on their landmark coordinates or shape descriptors. In this approach, each binary classifier separates the points of two different species, and combining all one-vs-one classifiers yields a multiclass classifier.
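
The one-vs-one construction described above can be sketched in a few lines of NumPy. To keep the example self-contained, each pairwise classifier is a simple linear SVM trained by subgradient descent on the hinge loss, which is a stand-in and not the RBF-kernel SVM mentioned in the cited work. The function names, hyperparameters, and the three synthetic classes are all assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=300):
    """Binary linear SVM (labels +1/-1) via subgradient descent on hinge loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1  # points inside or beyond the margin
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / len(y)
        grad_b = -y[active].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def fit_one_vs_one(X, labels):
    """Train one binary classifier for every pair of classes."""
    models = {}
    for a, b in combinations(np.unique(labels), 2):
        mask = (labels == a) | (labels == b)
        y = np.where(labels[mask] == a, 1.0, -1.0)
        models[(a, b)] = train_linear_svm(X[mask], y)
    return models

def predict_one_vs_one(models, X):
    """Each pairwise classifier casts one vote; the most voted class wins."""
    classes = sorted({c for pair in models for c in pair})
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for (a, b), (w, bias) in models.items():
        winner_is_a = X @ w + bias > 0
        votes[winner_is_a, classes.index(a)] += 1
        votes[~winner_is_a, classes.index(b)] += 1
    return np.array(classes)[votes.argmax(axis=1)]

# Toy data (synthetic): three well-separated classes in a 2D score space
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in centres])
labels = np.repeat([0, 1, 2], 30)

models = fit_one_vs_one(X, labels)
predicted = predict_one_vs_one(models, X)
accuracy = np.mean(predicted == labels)
```

With three species this scheme trains three pairwise classifiers. In practice the inputs would be shape descriptors such as PC or MFPC scores, and accuracy would be assessed on held-out specimens rather than on the training data as in this toy example.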

Arai et al. applied RF in the context of morphological identification in skulls, specifically between spotted seals and harbour seals, using geometric morphometrics. The study achieved an identification accuracy rate of 100% using RF by narrowing down to a subset of eight key landmarks out of a total of 75 landmarks³⁶. The ensemble nature of RF allows it to capture both linear and non-linear relationships in the data, making it robust and accurate for shape classification tasks. The success of RF in morphological identification^35,37,38 has encouraged this study to compare the effectiveness of this classifier in the classification of the shrew species based on the FDGM framework. Generalised linear models (GLM) serve as extensions of linear models, enabling the accommodation of nonlinearity and non-constant variance within the data. Consequently, GLMs are equipped to handle various data distributions, making them well-suited for analysing species-habitat relationships which often deviate from normal distributions³⁹.
