This repo contains the code and some simple test cases for the paper Gaussian mixture modeling by exploiting the Mahalanobis distance. The paper's method splits a dataset drawn from a mixture of Gaussians into separate groups, each containing samples from a single Gaussian distribution. The number of groups is discovered by the algorithm automatically.
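For concreteness, here is a minimal sketch (my own, not taken from the repo) of the kind of unlabeled mixed-Gaussian dataset the test cases operate on; the component means, covariances, and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative 2-D Gaussian components with different means and covariances.
X = np.vstack([
    rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.3], [0.3, 1.0]], size=500),
    rng.multivariate_normal(mean=[5.0, 5.0], cov=[[0.5, 0.0], [0.0, 2.0]], size=500),
])
# The algorithm receives only the unlabeled samples X and must recover
# the components (and their count) on its own.
```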
Here are some differences between my implementation and the original paper:
- The original algorithm can split a mixture in two ways: into two groups with different means, or into two groups sharing the same mean but having different covariance structures. Currently my implementation only performs the first kind of split
- The paper originally performs the first kind of split by choosing one dimension of the variable and a threshold; instead, I use KMeans with the number of clusters set to 2 (see the sketch after this list)
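As a rough illustration of that replacement, here is a minimal sketch of a KMeans-based split step, assuming a group is stored as an `(n_samples, n_features)` NumPy array; the helper name `split_group` is hypothetical, not the repo's actual API:

```python
import numpy as np
from sklearn.cluster import KMeans

def split_group(X: np.ndarray, random_state: int = 0):
    """Split one group into two candidate subgroups using KMeans with k=2."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit_predict(X)
    return X[labels == 0], X[labels == 1]
```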
From the test cases in the notebook we can see the following:
- The algorithm can correctly split the mixed dataset
- The algorithm tends to oversplit, i.e. it may split a single group into multiple groups
- When given a maximum number of splits, the algorithm works properly
My understanding of the oversplit problem is that the paper uses a hard split: every iteration assigns each sample to exactly one group. Even if the probability of misassignment is small, when the number of samples is large some samples will be assigned to the wrong group, and the group receiving them gets 'polluted' and is split again. This happens especially to groups with a small variance. I will look into this and try to solve it.
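As a rough numerical illustration of this intuition (my own construction, not the paper's assignment rule), the sketch below hard-assigns samples drawn from a wide 1-D Gaussian to the closer of two known components by Mahalanobis distance; even though each individual misassignment is unlikely, a noticeable number of samples ends up polluting the narrow component:

```python
import numpy as np

rng = np.random.default_rng(0)

mu_wide, sd_wide = 0.0, 3.0      # wide component
mu_narrow, sd_narrow = 9.0, 0.5  # narrow (small-variance) component

# Samples truly belonging to the wide component.
x = rng.normal(mu_wide, sd_wide, size=100_000)

# In 1-D the Mahalanobis distance reduces to |x - mu| / sd.
d_wide = np.abs(x - mu_wide) / sd_wide
d_narrow = np.abs(x - mu_narrow) / sd_narrow

# Wide-component samples that a hard split would hand to the narrow group.
polluting = np.sum(d_narrow < d_wide)
print(f"{polluting} of {x.size} wide-group samples pollute the narrow group")
```

With these illustrative parameters, a few hundred of the 100,000 samples land in the narrow group, which would then look non-Gaussian and trigger a further split.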