**Active learning** (called *query learning* or *optimal experimental design* in statistics). The key **hypothesis** is, if the learning algorithm is allowed to choose the data from what it learns, it will perform better with less training.

Comparison with **passive learning**: typically gathering a large amount of data *randomly sampled from the underlying distribution* and using this large dataset to train a model.

# Why active learning?

Give a simple example shown as below.

Figure (a) illustrates a toy data set of 400 samples, evenly sampled from two class Gaussians. (b) A logistic regression (LR) model trained with 30 sampled instances randomly sampled from overall distribution (70% accuracy). (c) A LR model trained with 30 actively queried instances using uncertainty sampling (90% accuracy).

Fig. (b) illustrates the **traditional supervised learning** approach after random selecting 30 instances for labelling, drawn i.i.d. from the unlabeled pool . The classifier boundary is far away from .

Figure (c) shows the **active learning** approach. The active learner uses *uncertainty sampling* to focus on instances *closest to its decision boundary*, assuming it can adequately explain those in other parts of the input sapce characterized by .

# Scenarios

## Membership Query Synthesis

The learner generates the instance from some underlying natural distribution. E.g. mnist dataset, rotation or excluding some pieces of the digit.

**Limitation**: labelling generated arbitrary instances an be awkward if the oracle is a human annotator. Unexpected problems: for image data, generated images are not recognisable; for NLP and speech tasks, synthesised text or speech is gibberish.

## Stream-based selective sampling

Alternative to synthesizing queries” **selective sampling**. The key assumption: obtaining an unlabeled instance is free.

**Stream-based selective sampling** (a.k.a. stream-based or sequential active learning): each unlabeled instance is typically drawn **one at a time** from the data source, and the learner must decide whether to query or discard.

### Decision criteria:

- informativeness measure (a.k.a. query strategy) See following section
- Compute an explicit region of uncertainty, i.e. the part of the instance space that is ambiguous to the learner, and only query instances that fall within it.
- Naive way: set a threshold for informativeness measure and regard those below the threshold as query region;
- Hypothesis consistency. Define the region that is unknown to the overall model class, i.e. whether hypothesis between any two models are consistent. That is, any two models (completely expensive; requires keep models after each new query) of the same model class agree on all the labeled data, but disagree on some unlabelled instance, such instance lies within the region of uncertainty.

## Pool-based sampling

Pool-based evaluates and ranks the **entire unlabelled data** collection before selecting the best query, whilst stream-based sampling queries and make decisions for one instance at a time. Pool-based sampling is **more practical** for many real-world problems.

# Query-strategy

Active learning involves evaluating the informativeness of unlabelled instances by either **generating** de novo or **sampling** from the given distribution.

Let denote the most informative instance (i.e. the best query)

## Uncertainty sampling

**Least confident strategy**

The active learner queries **instances about which is least certain how to label**. E.g. for binary classification models, simply query the instance whose posterior probability of being positive is nearest 0.5 .

For three or more class labels, a more general uncertainty sampling method is to query the instance whose prediction is the *least confident*:

where

, the class label with the highest posterior probability under the model

One way to interpret this uncertainty measure is the expected 0/1-loss, i.e. the model’s belief that it will mislabel .

**Limitations**: Least confident strategy only considers information about the most probable label. It “throws away” information about the remaining label distribution. Margin sampling incorporates the posterior of the second most likely label to correct this.

### Margin sampling

where and are the first and second most probable class labels, respectively.

**Intuitions**: Instances with **large margins** are easy, since classifiers have little doubt in differentiating between the two most likely class labels. Instances with **small margins** are more **ambiguous**, thus knowing the true label would help discriminate.

**Limitations** Problems with very large label sets, the margin method still ignores much of the output distribution for the remaining classes. solution -> entropy.

### Entropy sampling

where ranges over all possible labelings. Entropy represents the amount of information needed to “encode” a distribution. (Entropy is often thought of as a measure of uncertainty or impurity in machine learning).

Above three uncertainty sampling methods are application-dependent.

“Intuitively, though, entropy seems appropriate if the objective function is to minimize log-loss, while the other two (particularly margin) are more appropriate if we aim to reduce classification error, since they prefer instances that would help the model better discriminate among specific classes.”

## Query-By-Committee

Query-by-committee (QBC) algorithm: maintain a committee of models which all trained on the current labeled set , but represents competing hypotheses. Each committee member is then allowed to vote on the labelings of query candidates. The **most informative query** is considered to be the instance about which they **most disagree**.

Measure the level of disagreement:

### vote entropy:

where again ranges over all possible labelings, and is the number of “votes” that a label receives from among the committee members’ predictions, and is the committee size. (It can be thought as a QBC generalisation of entropy-based uncertainty sampling)

#### Kullback-Leibler (KL) divergence

where

Here represents a particular model in the committee, and represents the committee as a whole, thus

is the “consensus” probability that is the correct label.

## Expected model change

A **decision-theoretic** approach: selecting the instance that would impart the greatest change to the current model *if we knew its label*.

**Expected gradient length** (EGL): apply to gradient-based training. The “change” imparted to the model can be measured by the **length of the training gradient** (i.e. the vector used to re-estimate parameter values). That is, the learner query the new instances which if labeled and added to , would result in the new training gradient of the largest magnitude. Let be the new gradient that would be obtained by adding the training tuple to . Since the query algorithm does not know the label in advance, we must instead calculate the length as an expectation over the possible labelings:

where ||.|| is the Euclidean norm of each resulting gradient vector. Note that, should be nearly zero since converged at the previous round of training. Thus, we can approximate for computational efficiency.

**Intuition**: prefer the instance that is likely to most influence the model (i.e. have the greatest impact on its parameters), regardless of the resulting query label.

**Limitations**: EGL can lead astray if features are not properly scaled, i.e. the informativeness of a given instance maybe over-estimated simply because one of its future values is usually large, or corresponding parameter estimate is larger. Solution -> parameter regularisation.

## Expected error reduction

**Idea**: estimate the expected future error of a model trained using on the training unlabeled instances in and query the instance with minimal expected future error (a.k.a. risk).

One approach is to minimize the expected 0/1-loss:

where is the new model re-trained with the training tuple added to .

One interpretation is,

maximizing the expected information gain of the query x, or (equivalently) the mutual information of the output variables over x and U

## Variance reduction

Reduce generalization error indirectly by minimizing output variance.

**Regression problem**

where is an expectation over the labeled set , is an expectation over the conditional density and is an expectation over both.

## Density-weighted methods

**Challenges**: A central idea of the *estimated error* and *variance reduction* frameworks is that they focus on the entire input space rather than individual instances. Thus, they are less prone to querying outliers than simpler query strategies like uncertainty sampling, QBC, and EGL. The least certain instance lies on the classification boundary, but is not “representative” of other instances in the distribution, so knowing its label is unlikely to improve accuracy on the data as a whole.

informative instances should not only be those which are uncertain, but also those which are “representative” of the underlying distribution (i.e., inhabit dense regions of the input space).

Here, denotes the informativeness of according to some “base” query strategy A, such as an uncertainty sampling or QBC method. The second term weights the informativeness of by its average similarity to all other instances in the input distribution (as approximated by U), s.t. a parameter that controls the relative importance of the density term.

# References

- [1] Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. 2009.
- [2] D. Angluin. Queries and concept learning. Machine Learning, 2:319–342, 1988.