Active learning (called query learning or optimal experimental design in statistics). The key hypothesis is, if the learning algorithm is allowed to choose the data from what it learns, it will perform better with less training.

Comparison with passive learning: typically gathering a large amount of data randomly sampled from the underlying distribution and using this large dataset to train a model.

Why active learning?

Give a simple example shown as below.

Figure (a) illustrates a toy data set of 400 samples, evenly sampled from two class Gaussians. (b) A logistic regression (LR) model trained with 30 sampled instances randomly sampled from overall distribution (70% accuracy). (c) A LR model trained with 30 actively queried instances using uncertainty sampling (90% accuracy).

Fig. (b) illustrates the traditional supervised learning approach after random selecting 30 instances for labelling, drawn i.i.d. from the unlabeled pool $\mu$ . The classifier boundary is far away from $x=0$ .

Figure (c) shows the active learning approach. The active learner uses uncertainty sampling to focus on instances closest to its decision boundary, assuming it can adequately explain those in other parts of the input sapce characterized by $\mu$ .

Scenarios

Membership Query Synthesis

The learner generates the instance from some underlying natural distribution. E.g. mnist dataset, rotation or excluding some pieces of the digit.

Limitation: labelling generated arbitrary instances an be awkward if the oracle is a human annotator. Unexpected problems: for image data, generated images are not recognisable; for NLP and speech tasks, synthesised text or speech is gibberish.

Stream-based selective sampling

Alternative to synthesizing queries” selective sampling. The key assumption: obtaining an unlabeled instance is free.

Stream-based selective sampling (a.k.a. stream-based or sequential active learning): each unlabeled instance is typically drawn one at a time from the data source, and the learner must decide whether to query or discard.

Decision criteria:

informativeness measure (a.k.a. query strategy) See following section
Compute an explicit region of uncertainty, i.e. the part of the instance space that is ambiguous to the learner, and only query instances that fall within it.
- Naive way: set a threshold for informativeness measure and regard those below the threshold as query region;
- Hypothesis consistency. Define the region that is unknown to the overall model class, i.e. whether hypothesis between any two models are consistent. That is, any two models (completely expensive; requires keep models after each new query) of the same model class agree on all the labeled data, but disagree on some unlabelled instance, such instance lies within the region of uncertainty.

Pool-based sampling

Pool-based evaluates and ranks the entire unlabelled data collection before selecting the best query, whilst stream-based sampling queries and make decisions for one instance at a time. Pool-based sampling is more practical for many real-world problems.

Query-strategy

Active learning involves evaluating the informativeness of unlabelled instances by either generating de novo or sampling from the given distribution.

Let $x_A^*$ denote the most informative instance (i.e. the best query)

Uncertainty sampling

Least confident strategy

The active learner queries instances about which is least certain how to label. E.g. for binary classification models, simply query the instance whose posterior probability of being positive is nearest 0.5 .

For three or more class labels, a more general uncertainty sampling method is to query the instance whose prediction is the least confident:

$x_{LC}^* = \mathop{\arg\max}_x 1-P_{\theta}(\hat{y}|x)$

where

$\hat{y} = \mathop{\arg \max}_y P_{\theta} (y|x)$

, the class label with the highest posterior probability under the model $\theta$

One way to interpret this uncertainty measure is the expected 0/1-loss, i.e. the model’s belief that it will mislabel $x$ .

Limitations: Least confident strategy only considers information about the most probable label. It “throws away” information about the remaining label distribution. Margin sampling incorporates the posterior of the second most likely label to correct this.

Margin sampling

$x_M^* = \mathop{\arg\min}_x P_{\theta}(\hat{y}_1 |x) - P_{\theta}(\hat{y}_2|x)$

where $\hat{y}_1$ and $\hat{y}_2$ are the first and second most probable class labels, respectively.

Intuitions: Instances with large margins are easy, since classifiers have little doubt in differentiating between the two most likely class labels. Instances with small margins are more ambiguous, thus knowing the true label would help discriminate.

Limitations Problems with very large label sets, the margin method still ignores much of the output distribution for the remaining classes. solution -> entropy.

Entropy sampling

$x_H^* =\mathop{\arg\max}_x - \sum_i P_{\theta}(y_i|x) log P_{\theta}(y_i|x)$

where $y_i$ ranges over all possible labelings. Entropy represents the amount of information needed to “encode” a distribution. (Entropy is often thought of as a measure of uncertainty or impurity in machine learning).

Above three uncertainty sampling methods are application-dependent.

“Intuitively, though, entropy seems appropriate if the objective function is to minimize log-loss, while the other two (particularly margin) are more appropriate if we aim to reduce classification error, since they prefer instances that would help the model better discriminate among specific classes.”

Query-By-Committee

Query-by-committee (QBC) algorithm: maintain a committee $C = {\theta^{(1)}, ...,\theta^{C}}$ of models which all trained on the current labeled set $L$ , but represents competing hypotheses. Each committee member is then allowed to vote on the labelings of query candidates. The most informative query is considered to be the instance about which they most disagree.

Measure the level of disagreement:

vote entropy:

$x_{V E}^{*} = \mathop{\arg\max}_x - \sum_i \frac{V(y_i)}{C} \log \frac{V(y_i)}{C}$

where $y_i$ again ranges over all possible labelings, and $V(y_i)$ is the number of “votes” that a label receives from among the committee members’ predictions, and $C$ is the committee size. (It can be thought as a QBC generalisation of entropy-based uncertainty sampling)

Kullback-Leibler (KL) divergence

$x_{KL}^* = \mathop{\arg\max}_{x} \frac{1}{C} \sum_{c=1}^C D(P_{\theta^{c}} || P_c)$

where

$D(P_{\theta}||P_C) = \sum_i P_{\theta}^{c} (y_i|x) \log \frac{P_{\theta^{c}} (y_i | x)}{P_C (y_i | x)}$

Here $\theta^{(c)}$ represents a particular model in the committee, and $C$ represents the committee as a whole, thus

$P_C(y_i|x) = \frac{1}{C} \sum_{c=1}^C P_{\theta}(y_i|x)$

is the “consensus” probability that $y_i$ is the correct label.

Expected model change

A decision-theoretic approach: selecting the instance that would impart the greatest change to the current model if we knew its label.

Expected gradient length (EGL): apply to gradient-based training. The “change” imparted to the model can be measured by the length of the training gradient (i.e. the vector used to re-estimate parameter values). That is, the learner query the new instances $x$ which if labeled and added to $L$ , would result in the new training gradient of the largest magnitude. Let $\nabla l_{\theta} (L \cup <x,y>)$ be the new gradient that would be obtained by adding the training tuple $<x,y>$ to $L$ . Since the query algorithm does not know the label $y$ in advance, we must instead calculate the length as an expectation over the possible labelings:

$x_{EGL}^* = \mathop{\arg \max}_x \sum_i P_{\theta}(y_i | x) || \nabla l_{\theta} (L \cup <x, y_i>)$

where ||.|| is the Euclidean norm of each resulting gradient vector. Note that, $\nabla l_{\theta} (L)$ should be nearly zero since $l$ converged at the previous round of training. Thus, we can approximate $\nabla l_{\theta} (L \cup <x,y_i>) \approx \nabla l_{\theta} (<x, y_i>)$ for computational efficiency.

Intuition: prefer the instance that is likely to most influence the model (i.e. have the greatest impact on its parameters), regardless of the resulting query label.

Limitations: EGL can lead astray if features are not properly scaled, i.e. the informativeness of a given instance maybe over-estimated simply because one of its future values is usually large, or corresponding parameter estimate is larger. Solution -> parameter regularisation.

Expected error reduction

Idea: estimate the expected future error of a model trained using $L \cup <x,y>$ on the training unlabeled instances in $U$ and query the instance with minimal expected future error (a.k.a. risk).

One approach is to minimize the expected 0/1-loss:

$x_{0/1}^* = \mathop{\arg\min}_x \sum_i P_{theta} (y_i | x) (\sum_{u=1}^U 1-P_{theta^{+<x,y>}}(\hat{y} | x^{(u)} ) )$

where $theta^{+<x,y>}$ is the new model re-trained with the training tuple $<x, y_i>$ added to $L$ .

One interpretation is,

maximizing the expected information gain of the query x, or (equivalently) the mutual information of the output variables over x and U

Variance reduction

Reduce generalization error indirectly by minimizing output variance.

Regression problem

$E_T [(\hat{y} - y)^2 | x] = E[(y-E[y|x])^2] + (E_L [\hat{y}] - E[y|x])^2 + E_L [(\hat{y} - E_L[\hat{y}])^2]$

where $E_L[.]$ is an expectation over the labeled set $L$ , $E[.]$ is an expectation over the conditional density $P(y|x)$ and $E_T$ is an expectation over both.

Density-weighted methods

Challenges: A central idea of the estimated error and variance reduction frameworks is that they focus on the entire input space rather than individual instances. Thus, they are less prone to querying outliers than simpler query strategies like uncertainty sampling, QBC, and EGL. The least certain instance lies on the classification boundary, but is not “representative” of other instances in the distribution, so knowing its label is unlikely to improve accuracy on the data as a whole.

informative instances should not only be those which are uncertain, but also those which are “representative” of the underlying distribution (i.e., inhabit dense regions of the input space).

$x_{ID}^* = \mathop{\arg \max}_x \phi_A(x) \times (\frac{1}{U} \sum_{u=1}^U sim(x, x^{(u)}))^{\beta}$

Here, $\phi_A(x)$ denotes the informativeness of $x$ according to some “base” query strategy A, such as an uncertainty sampling or QBC method. The second term weights the informativeness of $x$ by its average similarity to all other instances in the input distribution (as approximated by U), s.t. a parameter $\beta$ that controls the relative importance of the density term.

References

[1] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. 2009.
[2] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.

Yekun's Note

Active Learning Overview