How Neighbor Joining Works in Machine Learning Part 1 | By Monodeep Mukherjee | June 2024

1. Neighbor Joining and Leaf Status

Author: Matthias Weller

Abstract: The Neighbor-Joining algorithm is one of the most fundamental algorithmic results in computational biology. However, its definition and proof of correctness are not straightforward. In particular, the question “What does the NJ method try to do?” remained somewhat elusive until recently [Gascuel & Steel, 2006]. Although a rigorous mathematical analysis is now available, it is still considered hard to follow, and its proof tedious at best. In this work, we present an alternative interpretation of the objective of the Neighbor-Joining algorithm by proving that it chooses to merge the two taxa u and v that maximize the “leaf status”, i.e., the sum of the distances from all leaves to the unique u-v path.
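For context, the selection step that the abstract reinterprets can be sketched in a few lines. This is a minimal illustration of the textbook Q-criterion, not the paper's leaf-status formulation; the function name `nj_pick_pair` and the five-taxon distance matrix `D` are assumptions chosen for the example.

```python
from itertools import combinations


def nj_pick_pair(d):
    """Return (q_value, (i, j)) for the pair minimizing the textbook NJ Q-criterion.

    d is a symmetric distance matrix given as a list of lists with a zero diagonal.
    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k]
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best = None
    for i, j in combinations(range(n), 2):
        q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
        # Keep the first pair that attains the smallest Q value.
        if best is None or q < best[0]:
            best = (q, (i, j))
    return best


# Hypothetical five-taxon distance matrix (taxa indexed 0..4).
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

q_value, pair = nj_pick_pair(D)
print(pair, q_value)
```

A full Neighbor-Joining run would merge the selected pair into a new node, update the distances, and repeat until the tree is resolved; only the first selection is shown here.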

2. Combinatorial and computational studies of neighbor-joining bias

Authors: Ruth Davidson and Abraham Martin del Campo

Abstract: The neighbor-joining algorithm is a popular distance-based phylogenetic method that computes a tree metric from a dissimilarity map arising from biological data. By realizing dissimilarity maps as points in Euclidean space, the algorithm partitions the input space into polyhedral regions indexed by the combinatorial type of the returned tree. A complete combinatorial description of these regions has not yet been found. Different sequences of neighbor-joining agglomeration events may produce the same combinatorial tree, so that multiple geometric regions are associated with the same algorithmic output. To resolve this confusion, we define agglomeration orders on trees, leading to a bijection between distinct regions of the output space and weighted Motzkin paths. As a result, we present a formula for the number of polyhedral regions that depends only on the number of taxa. Finally, we perform a computational comparison between these polyhedral regions to reveal biases introduced in any implementation of the algorithm.
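To make the geometric picture concrete, the sketch below treats a dissimilarity map on n taxa as a point with n(n-1)/2 coordinates and writes the textbook Q-criterion for a pair as a linear function of that point. Because every Q value is linear in the input, the inputs for which a given pair wins the first agglomeration step are cut out by linear inequalities. This illustrates only that first step, under the standard Q-criterion; the helper names `flatten` and `q_coefficients` and the matrix `D` are assumptions for the example, not taken from the paper.

```python
from itertools import combinations


def flatten(d):
    """Read the upper triangle of a symmetric matrix as a point in R^(n choose 2)."""
    n = len(d)
    return [d[i][j] for i, j in combinations(range(n), 2)]


def q_coefficients(n, u, v):
    """Coefficient vector w such that Q(u, v) equals the dot product of w and flatten(d)."""
    coeffs = []
    for i, j in combinations(range(n), 2):
        c = 0
        if {i, j} == {u, v}:
            c += n - 2  # the (n - 2) * d(u, v) term
        if u in (i, j):
            c -= 1      # d(i, j) appears in the row sum of u
        if v in (i, j):
            c -= 1      # d(i, j) appears in the row sum of v
        coeffs.append(c)
    return coeffs


# Hypothetical five-taxon dissimilarity map (taxa indexed 0..4).
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

x = flatten(D)                     # point with 10 coordinates
w = q_coefficients(len(D), 0, 1)   # linear form for the pair (0, 1)
q01 = sum(wk * xk for wk, xk in zip(w, x))
print(len(x), q01)
```

Comparing two such linear forms, for example requiring Q(0, 1) to be no larger than Q(3, 4), gives one linear inequality in the coordinates of `x`; intersecting these inequalities over all competing pairs describes where the pair (0, 1) is selected first.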


