A classification task may be described as the task of assigning a class label to an observed feature vector.
The general approach lies in constructing discriminant functions, one per class, and assigning a sample to the class whose discriminant function scores highest.
A popular and simple implementation of a discriminant function is Bayes decision rule, which has a solid foundation in statistics.
It states that we should choose the class with the highest posterior probability.
Thus, our decision rule is to find the maximum of the posterior probabilities over all classes $$ C^* = \arg\max_C P(C|x) $$ with the posterior computed according to Bayes rule in the following manner: $$ P(C|x) = \frac{p(x|C) \times P(C)}{ p(x) } $$
in which:

- $x$ is a feature vector
- $C$ is a class label
- $P(C|x)$ is the (posterior) probability for class $C$ given feature vector $x$
- $p(x|C)$ is the feature distribution of feature vector $x$ for class $C$ (Note: in discrete density models $p(x|C)$ is a probability and not a density)
Note that the posterior probabilities in Bayes rule not only allow for classification but also give a measure of the confidence we can have in our decision.
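As a small worked example with made-up numbers: suppose there are two classes with equal priors $P(C_1)=P(C_2)=0.5$, and that for some sample $x$ the likelihoods are $p(x|C_1)=0.50$ and $p(x|C_2)=0.10$. Then $$ P(C_1|x) = \frac{0.50 \times 0.5}{0.50 \times 0.5 + 0.10 \times 0.5} \approx 0.83 $$ and $P(C_2|x) \approx 0.17$, so $x$ is assigned to $C_1$ and the posterior tells us this decision is made with reasonable confidence.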
The Bayesian approach requires estimating the feature distributions per class. Often we will use a parametric distribution, in which case the density estimation translates into estimating the parameters of that distribution.
TRAINING PHASE
- Collect the training data
- Choose the model of the distributions
- For each class: Estimate model parameters from the training data
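A minimal sketch of this recipe in Python, assuming 1-D features and a single Gaussian per class (the class names and toy data below are made-up placeholders):

```python
import numpy as np

# Toy training data: a few 1-D feature values per class (made-up numbers)
train_data = {
    "A": np.array([1.0, 1.2, 0.8, 1.1]),
    "B": np.array([3.0, 2.7, 3.3, 2.9]),
}

# Chosen model: one Gaussian per class.
# Estimate the model parameters (mean, variance) for each class independently.
params = {c: {"mu": x.mean(), "var": x.var(ddof=1)} for c, x in train_data.items()}
print(params)
```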
RECOGNITION PHASE
- Get the class priors $P(C)$, or neglect them if no prior knowledge is available
- Compute the class likelihoods $p(x|C)$ for all classes
- Compute the weighted likelihoods $p(x|C) \times P(C)$
- Compute the total likelihood of the feature vector: $p(x) = \sum_C p(x|C) \times P(C)$
- Compute the posteriors $P(C|x)$
- Take the maximum over the posteriors
Note: The total likelihood of the sample is just a normalization factor guaranteeing that the posteriors sum up to 1.0. This step is not necessary for classification as such, but inspecting the posteriors is often a good sanity check on your result.
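A matching sketch of the recognition phase, with hypothetical per-class Gaussian parameters (e.g. as estimated in the training sketch above) and assumed equal priors:

```python
import numpy as np
from scipy.stats import norm

# Per-class Gaussian parameters, e.g. as estimated in the training sketch above
params = {"A": {"mu": 1.0, "var": 0.03}, "B": {"mu": 3.0, "var": 0.06}}
priors = {"A": 0.5, "B": 0.5}  # assumed equal priors

def classify(x, params, priors):
    """Return (best_class, posteriors) for a 1-D sample x."""
    # Class likelihoods p(x|C) under the per-class Gaussian models
    likelihoods = {c: norm.pdf(x, p["mu"], np.sqrt(p["var"])) for c, p in params.items()}
    # Weighted likelihoods p(x|C) * P(C)
    weighted = {c: likelihoods[c] * priors[c] for c in params}
    # Total likelihood p(x): only a normalization factor
    total = sum(weighted.values())
    # Posteriors P(C|x); decide by taking the maximum posterior
    posteriors = {c: w / total for c, w in weighted.items()}
    return max(posteriors, key=posteriors.get), posteriors

print(classify(1.4, params, priors))  # should favour class "A"
```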
Bayes decision rule will be optimal (i.e. no better decision rule can be constructed!) under the following conditions:
- You need enough (correctly) labeled training data such that you can estimate the class distributions $p(x|C)$ using maximum likelihood estimation
- You need to know the prior class probabilities $P(C)$
- You should use Bayes rule (about conditional probabilities) to compute the posterior probabilities
These conditions look mild at first sight. However, experience has taught us that the Bayesian approach has significant limitations for complex problems.
The central issue is the estimation of the class densities $p(x|C)$: from a finite amount of training data we only obtain approximate estimates.
This is not a shocking observation by itself. The question is how good those approximate estimates are and whether we should be concerned. But even more: how much better do our estimates, and consequently the decision rule, get with increasing amounts of data? The truth is that improvement is inherently slow with increasing amounts of data.
Looking at the formulas in detail gives us insight into the inherent underlying problem.
For a sample that scores high on one class and not so high on the others, there will be no problem; small errors in the estimated probabilities will not influence the classification outcome. But for outliers, i.e. samples that aren't modeled well by any of the classes, there is a fundamental problem. All class densities $p(x|C)$ will be small, and small absolute errors in these estimates can then dominate the posteriors and change the decision.
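To make this concrete with made-up numbers: suppose the true densities at an outlier are $p(x|C_1) = 1 \times 10^{-6}$ and $p(x|C_2) = 2 \times 10^{-6}$, and our estimates are off by an absolute error of only $10^{-6}$. The estimated densities may then easily come out in the reverse order, flipping the decision, whereas the same absolute error would be completely harmless for a well-modeled sample with densities around $0.5$.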
In the case of the Bayesian approach we use a Generative model: we estimate the class distributions, i.e. a model that can generate artificial data. The discriminant functions are computed indirectly from the generative model.
Modern Deep Neural Nets are Discriminative models, as they estimate the discriminant functions directly by minimizing the classification error on a given training set (or by optimizing another criterion that is directly related to classification). Often the discriminant functions are normalized so that they can be interpreted as posteriors.
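As a minimal illustration of such a normalization (a generic softmax sketch with arbitrary scores, not tied to any particular network):

```python
import numpy as np

def softmax(scores):
    """Map raw discriminant scores to a posterior-like distribution that sums to 1."""
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 0.5, -1.0])))  # approximately [0.79, 0.18, 0.04]
```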
Generative and discriminative models each have a number of advantages and disadvantages:
- In the generative model, the feature distributions can be estimated for each class independently. This is convenient because you can add / split classes at will. It is also less demanding because a global problem (classification) is split into $N$ subproblems (density estimations) that can be solved independently of each other.
- In a discriminative model, the optimization is global. The discriminant functions are learned jointly using a single overall objective function. This optimization will inherently be an order of magnitude more complex than in the case of a generative model.
- Given the higher complexity, discriminative models both need and benefit from a large training corpus, which pushes the computational requirements up further.
In summary, Generative Models can be trained quickly with small amounts of data and may be the preferred solution in the case of limited resources, inherently sparse data problems, prototyping, ... Discriminative Models are superior when large representative corpora are available for your problem. They may also be the methodology of choice if adaptation or fine-tuning is an option, in which case some large background model can serve as a reference.
A popular parametric model for the class-conditional densities is the Gaussian Mixture Model (GMM): $$ p(x|C_k) = \sum_{j=1}^{M} w_{kj} \, \mathcal{N}(x;\mu_{kj},\Sigma_{kj}) $$ in which $\mathcal{N}(x;\mu_{kj},\Sigma_{kj})$ is the $j$-th mixture component of class $C_k$, parameterized by $w_{kj}, \mu_{kj}, \Sigma_{kj}$: respectively the weight, mean and covariance matrix. Without any constraint on the parameters, these functions are also known as Radial Basis Functions. In the probabilistic literature GMMs are used as probability density functions. This merely requires that the weights sum up to 1.
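A minimal sketch of evaluating such a mixture density, with arbitrary placeholder parameters for one class in a 2-D feature space:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component GMM for a single class C_k
weights = np.array([0.6, 0.4])                        # sum to 1 for a proper density
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]

def gmm_density(x):
    """p(x|C_k) = sum_j w_kj * N(x; mu_kj, Sigma_kj)"""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

print(gmm_density(np.array([0.5, 0.2])))
```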
The parameters of a single Gaussian are easily estimated from example data using the maximum likelihood principle. For the 1-D case this yields (with the common $\frac{1}{N-1}$ bias correction for the variance; the strict maximum likelihood estimate uses $\frac{1}{N}$): $$ \hat{\mu} = \frac{1}{N} \sum_i x_i $$ $$ \hat{\sigma}^2 = \frac{1}{N-1} \sum_i (x_i - \hat{\mu})^2 $$
Estimating the parameters of a Gaussian Mixture Model from data is more involved and only approximate. The EM (Expectation-Maximization) algorithm finds a local optimum in an iterative way.
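A minimal sketch of EM for a 1-D, 2-component GMM; the initialization and the fixed iteration count are simplistic choices just to show the iterative structure (in practice one would rather use a library routine such as scikit-learn's GaussianMixture):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    """Tiny EM loop for a 2-component 1-D GMM; converges to a local optimum only."""
    # Crude initialization (an arbitrary choice, not a recommendation)
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j] = P(component j | x_i)
        dens = np.stack([w[j] * norm.pdf(x, mu[j], np.sqrt(var[j])) for j in range(2)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        n_j = gamma.sum(axis=0)
        w = n_j / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / n_j
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n_j
    return w, mu, var

x = np.concatenate([np.random.normal(0.0, 1.0, 200), np.random.normal(5.0, 1.0, 200)])
print(em_gmm_1d(x))
```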