Regression vs. Classification: Distance and Divergence

In machine learning, supervised problems are commonly categorized into regression or classification problems. The categorization is quite intuitive, as the names indicate: if the output, or target value, is continuous, the model tries to regress the value; if it is discrete, we want to predict a discrete value as well. A well-known example of a classification problem is binary classification, such as spam vs. non-spam. Stock price prediction or temperature prediction would be good examples of regression.

To solve such problems, we have to use different methods. For regression problems, the most widely used approach is to minimize the L1 or L2 distance between our prediction and the ground truth target. For classification problems, one-vs-all SVMs, multinomial logistic regression, decision forests, and minimizing the cross entropy are popular choices.

Due to this drastically different treatment, it is sometimes easy to treat them as completely separate problems. However, we can think of the classification problem as a regression problem in a non-Euclidean space, and extend this concept to the Wasserstein and Cramer distances.

Designing an Objective (Risk) Function for Machine Learning Models

To make a system behave as we expect, we have to design a loss (risk) function that captures the behavior we would like to see and quantifies the risk associated with failures.

For example, let’s look at a typical image classification problem where we classify an image into a semantic class such as car, person etc. Most datasets use a mapping from a string (“Car”) to a numeric value so that we can handle the dataset in a computer easily. For instance, we can assign 0 to “Bird”; 1 to “Person”; 2 to “Car” etc.

However, the numbers do not have intrinsic meaning. The annotators use such a numbering because it is easy to process on a computer, not because “Person” + “Person” gives you “Car”, nor because a person is “greater” (>) than a bird. So, in this case, making a machine learning model regress values that have no intrinsic meaning would not make much sense.

On the other hand, if the number that we are trying to predict has actual physical meaning and ordering makes sense (e.g., price, weight, the intensity of light (pixel value) etc.), it would be reasonable to use the numbers directly for prediction.

To state this notion clearly, let $y$ be the target value (label, supervision) associated with an input $x$, and let $f(\cdot)$ be a (parametric) function, i.e., a machine learning model. When we feed $x$ to the function, $\hat{y}=f(x)$, we want the output $\hat{y}$ to approximate the target value $y$. So we need a measure of how different the generated values are from the supervision. Naturally, we use a distance function to measure how close the prediction is to the target, and we use this distance as the loss function (also called the objective or risk function).

$$ \begin{align} L(x, y) & = D(\hat{y}, y) \end{align} $$

where $D(\cdot, \cdot)$ denotes a distance function.

Regression vs. Classification

Now let’s look at the simplest regression problem: linear regression using least-squares fitting. In this setting, we have noisy observations around a ground truth line, and our task is to estimate the line. In this case, $f(x) = Ax + b$ and $D(a, b) = \|a - b\|_2^2$, the square of the L2 norm. This can also be interpreted as maximum likelihood estimation under Gaussian noise. However, the L2 norm is not the only distance measure used in regression problems. The L1 norm is sometimes used to enforce sparsity, and the Huber loss is used for regression problems where the outliers do not follow a Gaussian distribution.

$$ D(\hat{y}, y) = \|\hat{y} - y\|_2^2 $$
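To make the trade-offs concrete, here is a minimal numpy sketch (not from the original post) comparing the squared-L2, L1, and Huber losses on a prediction with one large outlier residual:

```python
import numpy as np

def l2_loss(y_hat, y):
    # Squared L2 distance: maximum likelihood under Gaussian noise.
    return np.sum((y_hat - y) ** 2)

def l1_loss(y_hat, y):
    # L1 distance: grows linearly, so outliers dominate less.
    return np.sum(np.abs(y_hat - y))

def huber_loss(y_hat, y, delta=1.0):
    # Quadratic for small residuals, linear for large ones.
    r = np.abs(y_hat - y)
    return np.sum(np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)))

y_hat = np.array([1.0, 2.0, 10.0])  # last prediction is an outlier
y = np.array([1.5, 2.0, 2.0])

print(l2_loss(y_hat, y))     # 64.25 -> dominated by the outlier
print(l1_loss(y_hat, y))     # 8.5
print(huber_loss(y_hat, y))  # 7.625
```

The single residual of 8 contributes 64 to the squared-L2 loss but only 7.5 to the Huber loss, which is why Huber is preferred under heavy-tailed noise.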

Let’s go back to the previous classification problem. Regressing the arbitrary numeric values (labels) is clearly not the best way to train a machine learning model, as the numeric values have no intrinsic meaning. Instead, we can use a probability distribution rather than the arbitrary numeric values. For example, for an image of a bird, which was class 0, we assign $P_0 = 1$ and 0 to the others: $P_{bird} = [1, 0, 0]$, where the elements are the probabilities of the input being a bird, a person, and a car respectively. Using this representation, we can train multinomial logistic regression or a multi-class SVM.
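This representation is the usual one-hot encoding; a tiny sketch (class names are illustrative, matching the running bird/person/car example):

```python
import numpy as np

classes = ["bird", "person", "car"]

def one_hot(label, num_classes):
    # Turn an integer class index into a probability distribution
    # that puts all of its mass on the true class.
    p = np.zeros(num_classes)
    p[label] = 1.0
    return p

p_bird = one_hot(classes.index("bird"), len(classes))
print(p_bird)  # [1. 0. 0.]
```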

Cross Entropy and f-Divergence

However, how should we measure the “distance” between the ground truth label distribution and the predicted distribution? Is there even a concept of distance between two distributions? One family of functions that measures this difference is known as the Ali-Silvey distances, more widely known as the f-divergences. One member of the f-divergence family is used far more than the others: the Kullback-Leibler (KL) divergence. Formally, given the ground truth distribution $P_y$ and the prediction $P_\hat{y}$, the KL divergence is defined as

$$ \begin{align} D(P_y || P_\hat{y}) & = \sum_{i \in \mathcal{Y}} P(y = i) \log \frac{P(y = i)}{P(\hat{y} = i)} \\ & = - H(P_y) + H(P_y, P_\hat{y}) \end{align} $$

where $H(\cdot)$ is the entropy and $H(\cdot, \cdot)$ is the cross entropy. In classification problems, where $P_y$ is the fixed ground truth and $P_\hat{y}$ is the prediction, the first term $-H(P_y)$ is a constant, so we drop the entropy and train our model with the cross entropy alone. That’s where you get your cross entropy loss.
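A small numpy sketch verifying this decomposition (with a one-hot target, the entropy term is exactly zero, so KL and cross entropy coincide):

```python
import numpy as np

def entropy(p):
    # H(P) = -sum p log p, treating 0 log 0 as 0.
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    # H(P, Q) = -sum p log q; the usual classification loss.
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    # KL(P || Q) = -H(P) + H(P, Q)
    return cross_entropy(p, q) - entropy(p)

p_y = np.array([1.0, 0.0, 0.0])    # one-hot ground truth (bird)
p_hat = np.array([0.7, 0.2, 0.1])  # model prediction

# H(P_y) = 0 for a one-hot target, so KL reduces to cross entropy.
print(np.isclose(kl(p_y, p_hat), cross_entropy(p_y, p_hat)))  # True
print(cross_entropy(p_y, p_hat))  # -log(0.7) ~ 0.357
```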

However, the KL divergence is not the only divergence. In fact, any convex function $f: (0, \infty) \rightarrow \mathbb{R}$ such that $f(1) = 0$ can define a divergence function.

$$ D_f(P || Q) = \mathbb{E}_Q \left[ f \left( \frac{dP}{dQ} \right) \right] $$

For example, if we use $f(x) = \frac{1}{2}|x - 1|$, we have the Total Variation divergence.

$$ \begin{align} D_{TV}(P || Q) & = \frac{1}{2} \mathbb{E}_Q \left[ \left| \frac{dP}{dQ} - 1 \right| \right] \\ & = \frac{1}{2} \int |P - Q| = \frac{1}{2} ||P - Q||_1 \end{align} $$
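For discrete distributions, we can check numerically that the generic f-divergence formula with $f(x) = \frac{1}{2}|x - 1|$ agrees with half the L1 distance (a sketch, assuming both distributions have full support so $p/q$ is well defined):

```python
import numpy as np

def f_divergence(p, q, f):
    # D_f(P || Q) = E_Q[ f(dP/dQ) ] for discrete distributions.
    return np.sum(q * f(p / q))

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

tv_f = f_divergence(p, q, lambda x: 0.5 * np.abs(x - 1))
tv_l1 = 0.5 * np.sum(np.abs(p - q))
print(np.isclose(tv_f, tv_l1))  # True
print(tv_l1)                    # 0.2
```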

One thing to note is that the KL divergence is not a proper metric as it is asymmetric and violates the triangle inequality.
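The asymmetry is easy to demonstrate on a two-point distribution (assuming full support so the log-ratio is finite):

```python
import numpy as np

def kl(p, q):
    # KL(P || Q) = sum p log(p/q), assuming p, q > 0 everywhere.
    return np.sum(p * np.log(p / q))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl(p, q))  # ~ 0.368
print(kl(q, p))  # ~ 0.511 -> KL(P||Q) != KL(Q||P)
```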

Wasserstein Distance, Cramer Distance

However, the f-divergences are not the only way to measure the difference between two distributions. In [1], the authors argue that f-divergences do not capture our regular notion of distance accurately and propose using a different distance, which led to an interesting discussion in adversarial training. Let’s first look at other “distance” functions that do not belong to the f-divergence family.

First, the Wasserstein distance, also known as the probabilistic Earth Mover’s Distance, computes the minimum amount of probability mass that we need to move to match one distribution to another.

$$ \begin{align} W_1(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma} [|x - y|] \end{align} $$

The infimum is over all joint distributions $\gamma \in \Pi(P, Q)$ whose marginals are $P$ and $Q$; $x$ and $y$ range over the space on which $P$ and $Q$ have non-zero support. A notable follow-up work [2] proposed yet another distance function, the Cramer distance, to remove the sampling bias in the Wasserstein gradients. In one dimension, the Wasserstein distance can be written in terms of the cumulative distribution functions as $W_1(P, Q) = \int |F_P(x) - F_Q(x)| \, dx$, and the Cramer distance is simply the squared version of the integrand:

$$ \begin{align} l_2^2(P, Q) = \int \left( F_P(x) - F_Q(x) \right)^2 dx \end{align} $$
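In one dimension, both quantities have closed forms in terms of the CDFs: $W_1$ integrates $|F_P - F_Q|$, while the Cramer distance integrates $(F_P - F_Q)^2$. A minimal numpy sketch for discrete distributions on a unit-spaced support:

```python
import numpy as np

# Two distributions on the integer support {0, 1, 2}.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.0, 1.0])

cdf_p, cdf_q = np.cumsum(p), np.cumsum(q)

# Unit spacing between support points, so the integrals become sums.
w1 = np.sum(np.abs(cdf_p - cdf_q))       # integral of |F_P - F_Q|
cramer = np.sum((cdf_p - cdf_q) ** 2)    # integral of (F_P - F_Q)^2

print(w1)      # 1.5
print(cramer)  # 1.25
```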


Categorizing supervised problems into classification or regression can help us understand the problem clearly, but sometimes it limits our imagination, and with it the set of distance functions we can use. In this post, we instead discussed how classification and regression can be understood through how we measure differences: classification measures the difference with an f-divergence or even a probabilistic distance, while regression uses a Euclidean distance. They are merely distances that measure the difference between a target and a prediction. Some distance functions are more popular than others, but the set of distance functions is not set in stone. Sometimes, by defining the distance function in a clever way, we can improve our ML model!


  1. Arjovsky et al., Wasserstein Generative Adversarial Networks, 2017 

  2. Bellemare et al., The Cramer Distance as a Solution to Biased Wasserstein Gradients, 2017 
