Proceedings of Machine Learning Research vol 132:1–21, 2021 32nd International Conference on Algorithmic Learning Theory
Non-uniform Consistency of Online Learning with Random Sampling
Changlong Wu WUCHANGL@HAWAII.EDU
University of Hawaii at Manoa
Narayana Santhanam NSANTHAN@HAWAII.EDU
University of Hawaii at Manoa
Editors: Vitaly Feldman, Katrina Ligett and Sivan Sabato
Abstract
We study the problem of online learning a hypothesis class under a given binary 0-1 loss function,
using instances generated i.i.d. from a given distribution. The goal of the online learner is to make
only finitely many errors (loss 1) with probability 1 in the infinite horizon. In the binary label case,
we show that hypothesis classes are online learnable in the above sense if and only if the class is
effectively countable. We extend the results to hypothesis classes where labels can be non-binary.
Characterization of non-binary online learnable classes is more involved for general loss functions
and is not captured fully by the countability condition even for the ternary label case.
In the computationally bounded setup, we compare our results with well-known results in recursive
function learning, showing that the class of all total computable functions is indeed learnable
with computable online learners under randomized sampling. Finally, we also show that the finite
error guarantee is not affected even when independent noise is added to the labels.
Keywords: Online Learning, Finite Error, Almost Surely, Consistency
1. Introduction
Let H be a set of functions from X → {0, 1}, where X is an instance space. The online learning
setup is a game between two parties, Nature and the learner, both of whom know the class H. Nature
chooses a function h ∈ H at the beginning of the game. The game proceeds in steps. At each step
n, the learner is provided with an instance x_n ∈ X. The learner has to guess the label h(x_n),
potentially using the history built up till that step. Subsequently, the true label h(x_n) is revealed to
the learner, and the learner incurs a binary loss: 1 if her guess differs from h(x_n), indicating an
error, and 0 otherwise. The goal of the learner is a strategy that minimizes the number of errors made.
In his seminal work, Littlestone (1988) showed that the number of errors is bounded by what is
known as the Littlestone dimension Ldim(H) of H. This dimension is independent of the number
of steps the game goes on for. When the class H is finite, Littlestone (1988) proved that the number
of errors is bounded by log |H| using a halving argument.
However, the Littlestone dimension can be too restrictive a measure to handle really rich classes
that may not have finite Littlestone dimension, such as the class P of all polynomial time decid-
able problems. One common approach that circumvents this difficulty considers the non-realizable
setup, where the goal of the learner is competitive optimality with respect to a reference class H. Ben-David et al. (2009) showed that one can achieve an expected O(√(T log(T) Ldim(H))) regret in a
horizon-T game using a weighted-majority algorithm. See (Shalev-Shwartz and Ben-David, 2014,
Chapter 21) and the references therein for more results in this line. However, the regret bound
does not directly tell how many actual errors the learner could make.
© 2021 C. Wu & N. Santhanam.
One common approach to dealing with rich classes in the realizable case is to relax the requirement
of a uniform bound to point-wise bounds that may depend on the individual hypothesis in the
class. Perhaps the most well-known setup of this flavor in the learning theory literature is non-uniform
PAC learnability and the concept of Structural Risk Minimization by Vapnik and Chervonenkis
(1974); Vapnik (2013). A more closely related scenario in the online learning setup was studied in the
theory of learning recursive functions. In this setup, one assumes the class H consists of functions from
N → {0, 1}, and the instances are presented sequentially as 1, 2, · · · . A class H is said to be
recursively learnable if there exists a computable learner such that the number of errors is finite, no
matter what model is selected from H. See Zeugmann and Zilles (2008) for a survey on this topic.
This paper generalizes the deterministic finite error setting in the recursive function learning
literature with a more realistic randomized setup. Instead of the learner observing instances in a
predetermined order, we assume that they are sampled from a distribution µ over X independently
at each time step. A class H of functions from X → {0, 1} is eventually almost surely (or eas for
abbreviation) online learnable with respect to µ if there exists an online learner that makes only
finitely many errors with probability 1, no matter what function is chosen from H.
The conceptual basis of our problem setup lies in the almost sure hypothesis testing framework
studied in the statistics and information theory communities. See Cover (1973); Dembo and Peres
(1994); Kulkarni and Tse (1994); Santhanam and Anantharam (2015); Wu and Santhanam (2019)
for a sample of results along this line in hypothesis testing, classification and prediction. There
has been work in this line along the computability angle as well, see for example Leshem (2006).
We make a note of other related work on an infinite horizon machine learning framework Mitchell
et al. (2018), the universal prediction framework in Merhav and Feder (1998), and randomized
online learning scenario in Haussler et al. (1994) as well. Novel modifications to the Littlestone
dimension have also been considered, see for example, Sayedi et al. (2010); Li et al. (2011); Zhang
and Chaudhuri (2016).
Characterization Our first result fully characterizes eas-online learnability in the binary label
case. We show that for all instance spaces X satisfying a regularity condition (satisfied by most
common spaces, including e.g. the Borel space over R^d) and distributions µ with potentially infinite
support, a class H of measurable functions from X → {0, 1} is eas-online learnable if and only
if H is effectively countable. Informally, a class H is effectively countable if it is countable once we
do not distinguish between hypotheses that agree with probability 1 under µ (see Definition 3 for the
complete definition).
Applying our characterization to the class T of all linear threshold functions from [0, 1] → {0, 1}, i.e. functions of the form

h_a(x) = 1 if x ≥ a, and h_a(x) = 0 otherwise,
shows that T above is not eas-online learnable for any probability measure over [0, 1] that admits
a continuous density. However, the class T has VC dimension 1. For a uniform distribution over
[0, 1], an empirical risk minimization approach will result in an online learning strategy for the
linear threshold functions. The probability of making an error with this approach at time step n is at
most O(log n / n). Because T is not effectively countable, and hence not eas-online learnable, we can
conclude immediately that the empirical risk minimization bound cannot be improved from log n / n to,
e.g., 1/(n log^{1+ε} n) for any ε > 0. If it could, the Borel-Cantelli lemma would imply that the class T
with the uniform distribution would be eas-online learnable, since 1/(n log^{1+ε} n) is summable over n.
This lower bound differs only by a poly log(n) factor from the optimal 1/n lower bound shown
in Schuurmans (1997), and we only needed to observe that the set of all linear threshold
functions is not effectively countable to infer it!
Countable label spaces Departing from the basic result, we show that the characterization of
online learning still holds for classification loss even if the label space is countable. However, when
more general binary losses are allowed, there exist distributions with infinite support and effectively
uncountable classes that are eas-online learnable, even when the label size is 3.
Computable learners Suppose the learner is required to be computable in addition. We show
that there exist distributions over N with infinite support such that the class of all total computable
functions from N → {0, 1} is eas-online learnable with a computable learner. This is in contrast to
the sequential sampling scenario of recursive learning, see Zeugmann and Zilles (2008).
Noisy labels We finally consider online learning with noisy labels, and show that effectively
countable classes with binary labels are still online learnable when labels are corrupted with in-
dependent Bernoulli noise. Surprisingly, we show that one could also learn effectively countable
classes with binary labels when the instances are presented sequentially with noisy labels and in
deterministic order. Here we see each instance only once, with a noisy label, yet we can
leverage the noisy labels of different instances to learn the underlying hypothesis.
2. Problem setup
Let X be the instance space, endowed with some fixed separable σ-algebra F. We say F is
separable if for every probability distribution µ on F there is a countable set G of measurable sets in
F such that for any measurable set A ∈ F and ε > 0 there exists G_ε ∈ G with

µ(A △ G_ε) ≤ ε,

where △ denotes the symmetric difference. Clearly, the Borel σ-algebra over R^d is separable (e.g. we can take
G to be the algebra generated by the set of all cuboids in R^d with vertices at rational coordinates).
We will also assume that every singleton subset of X is measurable.
Let Y be a label space, often assumed to be finite or countable. A binary loss is a function
ℓ : Y × Y → {0, 1} that is symmetric and reflexive, i.e. ℓ(y_1, y_2) = ℓ(y_2, y_1) and ℓ(y_1, y_1) = 0 for
all y_1, y_2 ∈ Y. A learning strategy is a function

Φ : (X × Y)^* × X → Y

that is measurable with respect to the cylinder σ-algebra on (X × Y)^* × X.
We consider the following learning game between Nature and a learner, proceeding in time
steps. Let H be a class of measurable functions from X → Y and µ be a probability measure
over X, both known to all parties. At the beginning, Nature chooses a function h ∈ H.
At each time step n, Nature independently samples X_n ∼ µ and provides it to the learner.
The learner then outputs an estimate Y_n of the true label h(X_n), potentially using the history
{(X_1, h(X_1)), · · · , (X_{n−1}, h(X_{n−1}))} observed thus far. Nature then provides the true label h(X_n) to the
learner, and the learner incurs the binary loss ℓ(Y_n, h(X_n)). The learner makes an error at time step
n if ℓ(Y_n, h(X_n)) = 1.
Denote by Z_i = (X_i, h(X_i)) the instance-label pair at time step i, and by Z_1^i = (Z_1, · · · , Z_i)
the history observed by the learner up to time step i.
Definition 1 A class H is said to be eas-online learnable w.r.t. distribution µ, if there exists a
learning strategy Φ such that

Pr( ∑_{n=1}^∞ ℓ(Φ(Z_1^{n−1}, X_n), h(X_n)) < ∞ ) = 1,

for all h ∈ H, where X_1, X_2, · · · are i.i.d. ∼ µ.
We also introduce the following stronger version of eas-learnability when the distribution is
unknown to the learner.
Definition 2 A class H is said to be strongly eas-online learnable, if there exists a learning strategy
Φ such that

Pr( ∑_{n=1}^∞ ℓ(Φ(Z_1^{n−1}, X_n), h(X_n)) < ∞ ) = 1,

for all h ∈ H and all distributions µ, where X_1, X_2, · · · are i.i.d. ∼ µ.
In both cases above, we will say that the strategy Φ eas-online learns (or strongly eas-online
learns) H w.r.t. µ. For notational convenience, we drop the reference to µ where doing so leads to
no ambiguity.
The following notion will be used frequently in this paper.
Definition 3 A class H′ effectively covers a class H w.r.t. distribution µ and loss ℓ, if for all h ∈ H
there exists h′ ∈ H′ such that

µ{x : ℓ(h(x), h′(x)) = 1} = 0.

We say H is effectively countable (respectively, of effective size n) w.r.t. µ and ℓ if there exists an H′ that is
countable (respectively, of size n) such that H′ effectively covers H w.r.t. µ and ℓ.
We begin with the following lemma, an online learning version of Structural Risk Minimization.
Lemma 4 Let H_1, H_2, · · · be countably many function classes that share the same instance space and
loss function. If H_n is eas-online learnable w.r.t. distribution µ for all n ∈ N, then

H = ⋃_{n∈N} H_n

is also eas-online learnable w.r.t. µ.
Proof Let Φ_k be the strategy that eas-online learns the class H_k. We define a strategy for H as
follows. At each time step n, denote by e(n, k) the number of errors that Φ_k would have made had it been
employed in the first n − 1 time steps. Let

J_n = argmin_{k∈N} {e(n, k) + k}. (1)

We then use Φ_{J_n} to make the prediction at time step n.
We now show that this strategy indeed makes finitely many errors with probability 1 no matter
which h is chosen from H. Assume h ∈ H_t for some t ∈ N. Let B ⊆ X^∞ be the set of realizations
on which Φ_t makes only finitely many errors. We have µ(B) = 1 because Φ_t eas-online learns
H_t. Fix any x ∈ B and denote by s the number of errors Φ_t makes on x. We have

e(n, t) + t ≤ s + t

for all n, so from (1) the index J_n chosen at any step n satisfies e(n, J_n) + J_n ≤ s + t; in particular J_n ≤ s + t.
Furthermore, for any Φ_i with i ∈ {1, · · · , s + t}, once Φ_i has made more than s + t errors it will
no longer be chosen in (1). Therefore, our strategy makes at most (s + t)^2 errors, and thus finitely many
errors, on x. Hence the strategy makes finitely many errors with probability 1 whenever h ∈ H is selected.
The lemma follows.
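To make the construction above concrete, here is a minimal sketch (ours, not part of the paper) of the aggregation rule: each sub-learner is assumed to expose predict/update methods, an interface we introduce purely for illustration, and at each step the rule selects the sub-learner minimizing past errors plus index, as in equation (1).

```python
# Sketch of the aggregation strategy from the proof of Lemma 4.
# Assumed interface (ours): each sub-learner k has predict(x) and update(x, y).
# Indices are counted from 1, matching equation (1).

class EasAggregator:
    def __init__(self, sub_learners):
        self.subs = list(sub_learners)       # Phi_1, Phi_2, ... (a finite prefix)
        self.errors = [0] * len(self.subs)   # e(n, k): errors Phi_k would have made so far

    def predict(self, x):
        # J_n = argmin_k { e(n, k) + k }
        j = min(range(len(self.subs)), key=lambda k: self.errors[k] + (k + 1))
        return self.subs[j].predict(x)

    def update(self, x, y):
        # Once the true label y is revealed, every sub-learner is simulated on (x, y)
        # so that e(n, k) keeps counting the errors Phi_k *would* have made.
        for k, sub in enumerate(self.subs):
            if sub.predict(x) != y:
                self.errors[k] += 1
            sub.update(x, y)
```

In the lemma the family is countably infinite; the argmin in (1) is still well defined because e(n, k) + k ≥ k, so only finitely many indices can ever attain the minimum, and the sketch simply restricts attention to a finite prefix.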
Remark 5 Note that Lemma 4 holds for strong eas-online learning as well, since the construction
of the learning strategy does not depend on knowledge of the underlying distribution.
We prove the following technical lemma that relates deterministic sampling with randomized
sampling.
Lemma 6 Let a_1, a_2, · · · be an arbitrary ordering of N, and let s_1, s_2, · · · ∈ N^+ be an arbitrary
sequence. Then there exists a distribution p over N such that

p( ∃N : ∀n ≥ N, T_n − T_{n−1} ≥ s_{n−1} ) = 1,

where T_n is the first time a_n appears in an i.i.d. sample from p.
Proof Let p_n = p(X = a_n). Since T_n − T_{n−1} < s_{n−1} implies that a_n appears earlier than the
s_{n−1}'th appearance of a_{n−1}, we have

p(T_n − T_{n−1} < s_{n−1}) ≤ 1 − ( p_{n−1} / (p_n + p_{n−1}) )^{s_{n−1}}. (2)

Taking p_n proportional to

1 / ( (n!)^2 ∏_{i=1}^{n−1} s_i ),

the right-hand side of (2) is further upper bounded by 1/n^2. Since ∑_n 1/n^2 is summable, an application of the
Borel-Cantelli lemma completes the proof.
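For concreteness, the following sketch (ours) instantiates the construction of Lemma 6 on a finite prefix of N with a_n = n: it normalizes weights proportional to 1/((n!)² ∏_{i<n} s_i) and empirically inspects the gaps between first appearances; the names, horizon and gap sequence are illustrative choices only.

```python
# Sketch: the distribution of Lemma 6 on a finite prefix {1, ..., N} of N, with a_n = n.
# Weights are taken proportional to 1 / ((n!)^2 * prod_{i <= n-1} s_i) and then normalized.
import math, random

def lemma6_weights(s, N):
    weights, prod_s = [], 1
    for n in range(1, N + 1):
        weights.append(1.0 / (math.factorial(n) ** 2 * prod_s))
        prod_s *= s[n - 1]                     # accumulate the product of s_i for the next weight
    total = sum(weights)
    return [w / total for w in weights]

def first_appearance_times(p, horizon, rng):
    samples = rng.choices(range(1, len(p) + 1), weights=p, k=horizon)
    first = {}
    for t, x in enumerate(samples, start=1):
        first.setdefault(x, t)                 # T_n: first time the instance n appears
    return first

if __name__ == "__main__":
    rng = random.Random(0)
    s = [2, 2, 2, 2, 2]                        # required gaps s_1, ..., s_5
    p = lemma6_weights(s, N=5)
    T = first_appearance_times(p, horizon=500_000, rng=rng)
    for n in range(2, 6):
        if n in T and n - 1 in T:
            print(f"T_{n} - T_{n-1} = {T[n] - T[n - 1]}, required gap s_{n-1} = {s[n - 2]}")
```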
3. Binary label
In this section, we consider the case where the label is binary and the loss is ℓ(y_1, y_2) = 1{y_1 ≠ y_2}.
We refer to this loss as the classification loss in the sequel. Note that since our loss
function is binary valued on binary labels, the only possible losses are the trivial loss and the classification loss.
We say a distribution over X is non-degenerate if it has infinite support.
The following theorem fully characterizes the eas-online learnability in the binary label case.
Theorem 7 A class H with binary labels is eas-online learnable w.r.t. distribution µ iff H is
effectively countable w.r.t. µ.
Proof To see that H is eas-online learnable if it is effectively countable, note that any class
that effectively contains a single hypothesis is trivially eas-online learnable. Therefore, any class H with
effectively countably many hypotheses is eas-online learnable by appealing to Lemma 4.
For the necessary part of the proof, we assume w.l.o.g. that for all h_1 ≠ h_2 ∈ H we have
Pr_{X∼µ}(h_1(X) ≠ h_2(X)) > 0. Note that any hypothesis class can be reduced to one satisfying
the above property by choosing a representative function in each equivalence class defined by the
relation h_1 ∼ h_2 ⟺ Pr_{x∼µ}(h_1(x) ≠ h_2(x)) = 0.
We now prove that if H admits a function Φ : (X × {0, 1})^* × X → {0, 1} such that for all h ∈ H

Pr( ∑_{n=1}^∞ 1{Φ(Z_1^{n−1}, X_n) ≠ h(X_n)} < ∞ ) = 1, (3)

where Z_1^{n−1} = (Z_1, · · · , Z_{n−1}) and Z_i = (X_i, h(X_i)) is the instance-label pair at step i generated
by µ, then H is countable.
Define the event

A_n^h = { X_1^∞ : ∑_{k=n}^∞ 1{Φ(Z_1^k, X_{k+1}) ≠ h(X_{k+1})} > 0 },

and let

H_n = { h ∈ H : Pr(A_n^h) ≤ 1/4 }. (4)

From equation (3), we must have H = ⋃_{n=1}^∞ H_n. Indeed, for any h ∈ H we have Pr(A_n^h) → 0 as
n → ∞, so there exists some k such that Pr(A_k^h) ≤ 1/4, i.e. h ∈ H_k. We show that H_n is countable for
all n ∈ N, which will prove the theorem.
Central to our proof is the claim that for all n ∈ N we have

inf{ Pr_{X∼µ}(h_1(X) ≠ h_2(X)) : h_1 ≠ h_2 ∈ H_n } > 0. (5)

Equation (5) implies that H_n is countable. To see this, let

d(h_1, h_2) = Pr_{X∼µ}(h_1(X) ≠ h_2(X)).

The σ-algebra over X is separable, so by definition there is a countable subset of the
σ-algebra that is dense with respect to the distance d. Since each set in the σ-algebra can be represented
by a measurable function from X → {0, 1}, we conclude that there is a countable collection
G = {g_1, g_2, · · · } of measurable functions from X → {0, 1} such that G is dense in the set of
all measurable functions from X → {0, 1} under the distance d defined above.

Suppose the infimum in (5) equals ε > 0. Since G is dense in the set of all binary valued
measurable functions on X, for each h ∈ H_n there is a function g ∈ G such that d(h, g) < ε/2. By
the triangle inequality, for every other h′ ∈ H_n we then have d(h′, g) > ε/2. Therefore every h ∈ H_n
can be associated with a distinct g ∈ G, which implies that H_n is at most countable.
We now establish equation (5). The intuition behind this claim is that if the infimum in (5) were
0, then we could choose two hypotheses in H_n that are arbitrarily close in the distance d, and
hence close enough that they are indistinguishable with sample size n on a large enough subset
A. However, these hypotheses will eventually differ in the suffix of the strings in A, so no
predictor can agree with both of them in the suffix past step n. But if the probability of A is large
enough, this contradicts the definition of H_n in (4).
Formally, suppose (5) did not hold. Then there exist h_1, h_2 ∈ H_n such that 0 < d(h_1, h_2) < δ_n,
where δ_n is chosen so that (1 − δ_n)^n > 1/2.

Let A ⊆ X^∞ be the event that h_1 and h_2 cannot be distinguished within the first n samples. By construction,
we have Pr(A) > 1/2. For j ∈ {1, 2}, let

p_j = Pr(Φ makes an error after step n on h_j).
We show that max{p_1, p_2} > 1/4, which contradicts equation (4) since h_1, h_2 ∈ H_n, thus
establishing equation (5). To see that max{p_1, p_2} > 1/4, we use a probabilistic argument. Let h be
a random hypothesis chosen uniformly from {h_1, h_2}. It suffices to show

E_h E_{X∼µ}[ 1{Φ makes an error after step n on h} | X ∈ A ] ≥ 1/2, (6)

since (6) together with Pr(A) > 1/2 gives (p_1 + p_2)/2 ≥ (1/2) Pr(A) > 1/4.

Let B ⊆ X^∞ be the event that there exists an instance Y after step n that first reveals h_1(Y) ≠ h_2(Y).
We have Pr(B) = 1, since d(h_1, h_2) > 0. Note that

E_{X∼µ}[ E_h[ 1{Φ makes an error after step n on h} ] | X ∈ A ∩ B ] ≥ 1/2, (7)

since, conditioned on any X ∈ A ∩ B, the event C = {Φ makes an error at the sample Y on h} implies the event
in the equation above, and E_h[1{C}] = 1/2. We then have

E_{X∼µ}[ E_h[ 1{Φ makes an error after step n on h} ] | X ∈ A ] ≥ 1/2, (8)

since Pr(B) = 1. Finally, (8) implies (6) by exchanging the order of expectation, which is justified by
Fubini's theorem.
Example 1 Let X = [0, 1] and let µ be the uniform distribution over [0, 1]. We consider the linear
threshold functions

h_a(x) = 0 if x < a, and h_a(x) = 1 otherwise.

Denote H = {h_a : a ∈ [0, 1]}. By Theorem 7 we know that H is not eas-online learnable.
A couple of observations should be emphasized here. Note that the VC dimension of H is 1.
Therefore the VC theorem (Shalev-Shwartz and Ben-David, 2014, Thm 6.7) posits that there is a
learning rule ĥ_n such that for any h_a ∈ H and a sample S_n of size n, with high probability, say
1 − 1/n^2 over S_n, we have

Pr_{x∼µ}[ ĥ_n(S_n, x) ≠ h_a(x) ] ≤ O( log n / n ).

Theorem 7 implies that we cannot improve this upper bound from O(log n / n) to, say, O(1/(n log^{1+ε} n))
with ε > 0. If we could, then since ∑_{n=1}^∞ 1/(n log^{1+ε} n) < ∞, an application of the Borel-Cantelli
lemma would imply that ĥ_n makes only a finite number of errors no matter which hypothesis is in force,
thus violating Theorem 7.
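As a small numerical illustration of this discussion (ours, not from the paper), the sketch below runs an online ERM rule for the threshold class under the uniform distribution and reports cumulative errors; as predicted by Theorem 7, the count keeps growing with the horizon instead of stabilizing at a finite value.

```python
# Sketch: online ERM for thresholds h_a(x) = 1{x >= a} on [0, 1], uniform samples.
# The estimate is the midpoint of the interval of thresholds consistent with the data.
import random

def run_erm_thresholds(a_true, horizon, seed=0):
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0              # thresholds still consistent with the observed labels
    errors, checkpoints = 0, {}
    for n in range(1, horizon + 1):
        x = rng.random()
        a_hat = (lo + hi) / 2.0
        if (x >= a_hat) != (x >= a_true):      # classification loss at step n
            errors += 1
        if x >= a_true:                        # label 1 means the threshold is at most x
            hi = min(hi, x)
        else:                                  # label 0 means the threshold exceeds x
            lo = max(lo, x)
        if n in (10, 100, 1_000, 10_000, 100_000):
            checkpoints[n] = errors
    return checkpoints

if __name__ == "__main__":
    for n, e in run_erm_thresholds(a_true=0.37, horizon=100_000).items():
        print(f"errors after {n:>6} steps: {e}")
```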
We observe the following simple corollary.
Corollary 8 If a binary measurable function class H has finite Littlestone dimension, then H is
effectively countable w.r.t. any distribution. In particular, if the domain is countable, then H is
countable.
Proof Finite Littlestone dimension implies that H is online learnable with a bounded number of errors under
an adversarial sampling process (Shalev-Shwartz and Ben-David, 2014, Chapter 21). The first claim
follows from Theorem 7, noting that this learning strategy is independent of the distribution. The
second claim follows from the fact that, for any hypothesis class H over a countable domain X, if H is
effectively countable w.r.t. a distribution whose support is all of X, then H is countable.
The proof of the necessary condition of Theorem 7 also yields that if a class H is strongly eas-online
learnable, then H is effectively countable w.r.t. any distribution. By Corollary 8 above, this
implies that the restriction of H to any countable subset of X must be countable. However, this
does not mean that the class H itself has to be countable. To see this, consider the class T of all functions
t_a : [0, 1] → {0, 1} with a ∈ [0, 1] of the following form:

t_a(x) = 1 if x = a, and t_a(x) = 0 otherwise.

It is easy to see that T is strongly eas-online learnable (in fact it has Littlestone dimension 1), but
the class is uncountable. We leave it as an open problem to determine whether a class that is eas-online
learnable w.r.t. every distribution is also strongly eas-online learnable. (Note that in the former
setting the learning strategy may differ across distributions, while strong eas-online learnability
requires a single universal learning rule.)
4. Multi-Class label
We now consider the case where the output set Y has more than 2 elements, and we allow the loss ℓ to be
an arbitrary binary loss. Before analyzing the randomized setting, we consider a simpler deterministic
setup. A deterministic sampling process, in the most general setting, is a function S : (X × Y)^* → X
which selects the next sample based on the previous sample-label pairs (but not on the learner's answers).
For any class H and process S, one can construct an infinite |Y|-ary tree T = (V, E) with a label
function L (for both vertices and edges) such that

1. Each edge has a label in Y, and the labels of edges sharing the same parent are distinct.

2. Each vertex in T has a label in X. The root v_0 has label L(v_0) = S(∅), where ∅ is the empty
sequence. For each vertex v ∈ T, denote by v_0 → v_1 → · · · → v_n = v the unique path from the
root to v. Letting z_i = (L(v_{i−1}), L(v_{i−1} → v_i)), the label of v is given by L(v) = S(z_1, · · · , z_n).

3. A vertex v in T has a child u with L(v → u) = y if and only if there exists h ∈ H such that
h(L(v_i)) = L(v_i → v_{i+1}) for all 0 ≤ i ≤ n − 1 and h(L(v)) = y, where the v_i are as defined in item 2.
Clearly, every function h ∈ H is associated with a unique infinite path v_0 → v_1^h → v_2^h → · · ·
in T such that h(L(v_i^h)) = L(v_i^h → v_{i+1}^h). However, not every infinite path has a function
in H associated with it. The learning process can be viewed as traversing an infinite path in T.
Such a tree is also known as a mistake tree in the binary online learning literature, see Zhang and
Chaudhuri (2016). A valuation of T is a function v : V → N such that for each v ∈ V we have

1. v(v) ≥ max_{u∈C(v)} {v(u)}, where C(v) is the set of children of v;

2. v(v) ≥ 1 + max_{u∈C(v)} {v(u)} if L_v^M = {L(v → u) : u ∈ C^M(v)} cannot be covered by Y,
where C^M(v) is the set of children of v of maximal value. We say a set A ⊆ Y
can be covered by Y if there exists y ∈ Y such that ℓ(y, x) = 0 for all x ∈ A.
Note that a valuation need not always exist. The following theorem relates the existence of a valuation
to a bounded error guarantee.

Theorem 9 A class H is online learnable with at most B errors under a deterministic sampling process
S if and only if there exists a valuation v on T such that v(v_0) ≤ B, where v_0 is the root of T.

Proof If T has a valuation with v(v_0) ≤ B, we simply predict an element that covers L_v^M at the
vertex v reached in the traversal (if no such element exists, we predict arbitrarily). Note that the learner makes
an error only if the value of the next vertex is smaller than the current one by at least 1. Therefore, there will be at most
B errors.

Conversely, if H is online learnable w.r.t. S with at most B errors, we define the value at vertex v to be the
maximum number of errors that the learning rule could still make on the subtree rooted at v.

To prove that this is a valid valuation, note that condition 1 is easily verified. To see that condition
2 holds as well, we argue by contradiction. Suppose condition 2 fails at some vertex
v, i.e. v(v) = max_{u∈C(v)} {v(u)} but L_v^M cannot be covered by Y. Then we can choose a path
to a child in C^M(v) whose edge label incurs loss 1 against the prediction given by the learner, thus incurring
max_{u∈C(v)} v(u) + 1 errors starting from v, contradicting our premise. The theorem follows.
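For intuition, the valuation of Theorem 9 is easy to compute bottom-up on a finite mistake tree. The sketch below (our illustration; the tree encoding and helper names are assumptions of ours) returns a value satisfying both conditions of the definition; by the first direction of Theorem 9, this value at the root bounds the number of errors of the covering predictor.

```python
# Sketch: computing a valuation bottom-up on a finite mistake tree.
# A node is a dict {"children": {edge_label: child_node, ...}}; leaves have no children.

def coverable(labels, Y, loss):
    # A set of labels is covered by Y if some y in Y has zero loss against all of them.
    return any(all(loss(y, a) == 0 for a in labels) for y in Y)

def valuation(node, Y, loss):
    """Return a value satisfying conditions 1 and 2 of the valuation definition."""
    children = node.get("children", {})
    if not children:
        return 0
    child_values = {edge: valuation(child, Y, loss) for edge, child in children.items()}
    best = max(child_values.values())
    maximal_edges = [edge for edge, val in child_values.items() if val == best]
    return best if coverable(maximal_edges, Y, loss) else best + 1

if __name__ == "__main__":
    # Ternary labels with a loss where l(0, 1) = 0 and l(y, 2) = 1 for y in {0, 1}.
    Y = [0, 1, 2]
    loss = lambda a, b: 0 if a == b or {a, b} == {0, 1} else 1
    tree = {"children": {0: {"children": {}}, 2: {"children": {}}}}
    print("valuation at the root:", valuation(tree, Y, loss))   # {0, 2} is not coverable -> 1
```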
To capture the finite error guarantee, we need a notion of ranking on the hypotheses in H.
A ranking of H is a function r : H → N. For any vertex v ∈ V, denote by H_v the set of hypotheses
in H that share the path from v_0 to v in T, where v_0 is the root of T. For a given ranking r,
denote by H_v^m the set of hypotheses in H_v of minimal rank, i.e.

H_v^m = { h ∈ H_v : r(h) = min r(H_v) },

where r(H_v) = {r(h) : h ∈ H_v}. Now let L_v^m = {h(L(v)) : h ∈ H_v^m}.

We say a class H is rankable w.r.t. a deterministic sampling process S if there exists a
ranking r such that for each h ∈ H there is a number N_h such that for every vertex v ∈ V on
the infinite path of h at depth larger than N_h, we have h ∈ H_v^m and L_v^m is coverable.

Intuitively, the smaller the rank of a function in H, the higher the priority we assign to it in the
learning process. Formally, we prove the following theorem:
Theorem 10 A class H is online learnable with finitely many errors w.r.t. a deterministic sampling
process S if and only if H is rankable w.r.t. S.
Proof If H is rankable, we show that H is online learnable with finitely many errors. The prediction
rule works as follows: at each vertex v reached in the traversal, we predict an element that covers L_v^m (if
no such element exists, we predict arbitrarily). By the definition of rankability, the learner makes no errors
after step N_h if the underlying hypothesis is h.

For the converse, suppose Φ is an online learning rule that makes only finitely many errors
no matter which hypothesis h ∈ H is in force. Define the rank of each hypothesis h to be the number of errors Φ
would make against h.

To conclude the proof, we show that this is a valid ranking. Fix any h and let N_h be the
time step of the last error Φ makes on h. First, for any vertex v on the path of h at depth more than N_h,
every h′ ∈ H_v shares its labels with h up to v, so Φ makes at least r(h) errors against h′ by that depth;
hence r(h′) ≥ r(h) and h ∈ H_v^m. Next, suppose some vertex v on the path of h at depth more than N_h
has L_v^m not coverable. Then Φ makes more errors on at least one hypothesis
h′ ∈ H_v^m than it made on h. This is because h and h′ share the same prefix up to vertex v, and hence
produce the same error pattern until then, but h′ differs from Φ's prediction on its label in L_v^m. Thus this hypothesis h′
has a higher rank than h, and therefore could not have been in H_v^m, a contradiction.
Note that both Theorem 9 and Theorem 10 apply only to deterministic sampling processes, where
there is a single mistake tree. In the randomized setup, however, the mistake tree differs across
realizations of the sampling. A sufficient condition is to have a universal ranking that works
for almost all realized mistake trees. The following theorem shows that this happens when the
class is effectively countable. We defer the proof to Appendix A.
Theorem 11 Let H be a set of measurable functions from X → Y. Then H is eas-online learnable
w.r.t. distribution µ and loss ℓ if H is effectively countable w.r.t. µ and ℓ.

Moreover, effective countability of H is necessary for eas-online learnability (w.r.t. any
distribution µ) if Y is countable and ℓ is the classification loss, i.e. ℓ(y_1, y_2) = 1{y_1 ≠ y_2}.
Corollary 12 Let H be the class of all continuous functions from [0, 1] → [0, 1], and let ℓ = 1{|y_1 − y_2| >
B} for some B > 0. Then H is eas-online learnable w.r.t. the loss ℓ and any distribution.

Proof By the Stone-Weierstrass theorem, every function in H is within supremum distance B of some polynomial
with rational coefficients. Since the polynomials with rational coefficients are countable, H is
effectively coverable by a countable class w.r.t. any distribution. Theorem 11 then implies that H is eas-online learnable
w.r.t. any distribution.

Moreover, the covering set in Corollary 12 is independent of the distribution. Therefore, the
class of all continuous functions in Corollary 12 is in fact strongly eas-online learnable.
We provide the following example, which shows that if |Y| ≥ 3 and X = N, then there exist a
class H, a distribution p and a loss ℓ such that H is eas-online learnable w.r.t. p and ℓ, but H is
not effectively countable. Thus, the condition in Theorem 11 cannot be necessary for arbitrary losses
and distributions.
Example 2 Let Y = {0, 1, 2}. We define a loss ℓ as follows: ℓ is symmetric in its two arguments
and

ℓ(0, 1) = 0 and ℓ(0, 2) = ℓ(1, 2) = 1.

We now construct the class H. Since the domain is N, we denote each function in
H as an infinite sequence in {0, 1, 2}^∞. Let B = {0, 1}^∞ be the class of all binary sequences. We
define the transformation T : {0, 1}^∞ → {0, 1, 2}^∞ that, for any b ∈ B, inserts the number 2 after each
appearance of 0 in b. For example, if b = 00100101011 . . . then T(b) reads 02021020210210211 . . .. Now, we define

H = {T(b) : b ∈ B}.

We next define the distribution. By Lemma 6, there is a distribution p over N such that i.i.d.
samples from p appear in increasing order of N eventually almost surely, namely, for all m
large enough, the instance m − 1 appears before instance m with probability 1. Therefore, one
may assume, w.l.o.g., that the samples are generated sequentially as 1, 2, 3, · · · .

To see that the class H is eas-online learnable w.r.t. p and the loss ℓ, for any instance m ∈ N
we simply predict 2 for the label of instance m if the label of instance m − 1 is 0, and predict 0 if
the label of m − 1 is 1 or 2. It is easy to check that this rule makes only finitely many errors with
probability 1.

Clearly, the class H is uncountable, and for any two different elements b_1, b_2 ∈ B we can find
l ∈ N such that ℓ(T(b_1)(l), T(b_2)(l)) = 1. Since p assigns positive probability to every instance and no single
label incurs zero loss simultaneously against 2 and against a label in {0, 1}, no function can effectively cover
two distinct elements of H. Therefore, H is not effectively coverable by any countable set H′.
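The transformation T and the prediction rule of Example 2 are short enough to state as code; the sketch below (ours) checks the worked example 00100101011 → 02021020210210211 and verifies that the rule incurs zero loss once the label of the previous instance is available.

```python
# Sketch of Example 2: T inserts a 2 after every 0; the predictor outputs 2 after a 0
# and outputs 0 after a 1 or a 2 (recall l(0, 1) = 0, so predicting 0 is always safe there).

def loss(y1, y2):
    return 0 if y1 == y2 or {y1, y2} == {0, 1} else 1

def T(bits):
    out = []
    for b in bits:
        out.append(b)
        if b == 0:
            out.append(2)
    return out

def predict_next(previous_label):
    return 2 if previous_label == 0 else 0

if __name__ == "__main__":
    labels = T([0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1])
    print("T(b) =", "".join(map(str, labels)))              # 02021020210210211
    total = sum(loss(predict_next(labels[m - 1]), labels[m]) for m in range(1, len(labels)))
    print("loss incurred from the second instance on:", total)   # 0
```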
Even though the class in Example 2 is not effectively countable, it has a trivial universal ranking
that assigns rank 0 to every function in H. We conclude this section with the following conjecture.

Conjecture 13 A class H is eas-online learnable w.r.t. µ iff there exists a universal ranking r of H
such that H is rankable with ranking r for almost every mistake tree generated by µ.
5. Computable learner
As mentioned in the introduction, our setup has a close connection with the learning of recursive
functions, where one takes the domain to be N and presents the instances sequentially,
but in addition requires the learning rule to be computable.

Definition 14 A function h : N → {0, 1} is computable if there exists a Turing machine TM such
that for all n ∈ N, TM(n) halts and outputs h(n), where the number n is given in its binary representation.
Clearly the class of all computable functions from N → {0, 1} is countable, since there are
only countably many Turing machines (Arora and Barak, 2009, Chapter 1). We have the following simple
corollary of Lemma 4.

Corollary 15 Let H be the set of all computable functions from N → {0, 1}, and let µ be an
arbitrary distribution over N. Then H is eas-online learnable using i.i.d. samples generated from µ.
Unfortunately, the learning rule derived from Lemma 4 need not be computable for arbitrary
distributions, as shown by the following theorem of Barzdiņš and Freivald (1972) (restated here in our
notation).

Theorem 16 ((Zeugmann and Zilles, 2008, Thm 5)) Let H be a class of computable functions
from N → {0, 1}. Then there exists a computable learner that eas-online learns H with sequential
sampling if and only if there exists a computable function g : N → N such that the time complexity
of each function in H is eventually dominated by g(n).
It is easy to show, by a simple diagonalization argument, that if H is the class of all computable functions
then the function g in Theorem 16 cannot exist. Therefore, by Lemma 6 there exists a distribution p
such that H is not computationally eas-online learnable even with i.i.d. sampling from p. However,
the following theorem shows that there are also non-degenerate distributions over N under which
the class of all computable functions is computationally eas-online learnable with i.i.d. sampling.
Theorem 17 Let H be the class of all computable functions from N → {0, 1}. Then there exists a
non-degenerate distribution q (i.e. one with infinite support) such that H is computationally eas-online
learnable w.r.t. q.
Proof Let TM_1, TM_2, · · · be a fixed enumeration of all Turing machines. The main idea is
to construct a distribution q such that, with probability 1, for all but finitely many numbers n, the instance n appears much later than any of
{1, ..., n − 1}. We choose the gap period between the first appearances of n − 1 and
n to be max{C(h_i(n)) : i ≤ n}, where C(h_i(n)) is the computation time needed for the i-th computable
function on input n, when computed with the smallest-index Turing machine that computes it.
Such a distribution q exists by Lemma 6.

We now construct a computable predictor using a back-and-forth trick. The predictor works as
follows:

1. Initialize the indices I = J = 1.

2. In the idle period between the first appearances of n − 1 and n, the predictor simulates the
computation of TM_I on input n, one computational step per time step. (For the other
samples encountered in this period, it simply predicts their memorized labels.)

3. At the time step when n is first observed, output the result of the simulation (an arbitrary label if the
simulation has not yet halted). If the result matches the true label, keep I and J. Otherwise:
a. If I < J, set I = I + 1 and keep J;
b. Else, set I = 1 and J = J + 1.

Since the underlying function is computable, there exists a Turing machine TM_t that computes it.
By construction, the index I changes if and only if the predictor makes an error.

We show by contradiction that the predictor makes only finitely many errors. If the predictor made
infinitely many errors, then I would hit t again after the samples have started arriving sequentially with
the prescribed gaps. By the construction of the idle time, TM_t would then finish its computation
and make the correct prediction at all subsequent time steps, so I would never change again, a contradiction.
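The index bookkeeping in this proof can be separated from the Turing-machine simulation. In the sketch below (ours), the enumeration TM_1, TM_2, ... is replaced by a list of Python callables that either return a label within a given step budget or None, and the budget available before instance n stands in for the idle period; only the back-and-forth update of I and J is meant to mirror the proof.

```python
# Sketch of the back-and-forth index scheduling from the proof of Theorem 17.
# A candidate is a callable f(n, budget) -> 0/1 if it "halts" within the budget, else None.
# The candidates, the budgets and the target below are illustrative stand-ins.

def back_and_forth(candidates, target, horizon, budget_for):
    I, J = 0, 0                      # zero-based counterparts of the indices I = J = 1
    errors = 0
    for n in range(1, horizon + 1):
        budget = budget_for(n)       # length of the idle period before n first appears
        guess = candidates[I](n, budget)
        prediction = guess if guess is not None else 0   # arbitrary output if unfinished
        if prediction != target(n):
            errors += 1
            if I < J:                # rule a: advance I within the current sweep
                I += 1
            else:                    # rule b: restart the sweep and enlarge its range
                I, J = 0, J + 1
    return errors

if __name__ == "__main__":
    target = lambda n: n % 2
    candidates = [
        lambda n, b: 1,                              # constant guesser
        lambda n, b: None,                           # never halts within any budget
        lambda n, b: (n % 2) if b >= n else None,    # computes the target in about n steps
    ]
    print("errors:", back_and_forth(candidates, target, horizon=200,
                                    budget_for=lambda n: 2 * n))
```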
Remark 18 Note that Theorem 17 does not contradict the impossibility result implied by Theorem 16
because, even though the gap periods constructed in the proof of Theorem 17 are deterministic in an
information-theoretic sense, they are not (computationally) known to the learner, i.e. the
learner can never (computationally) figure out when the sample n will arrive.
We now consider a different scenario where the learner is required not only to be computable,
but also to have limited computational resources. We need the following notion. An online learning
scheme is a triplet (D, R, Φ), where

1. D ∈ {0, 1}^* is an unlimited database which the learner can use to store anything learned,
initially the empty string;

2. Φ is the predictor, which maps N → {0, 1} using the database D as an oracle;

3. R is the recorder, which maps {0, 1}^* × N × {0, 1} → {0, 1}^* and updates the database D
every time a new example and its label are revealed. Where needed, we use D_n to denote the
state of the database after n instances and their labels have been revealed.
Definition 19 (Exact online learnability) Let H be a class of functions from N → {0, 1}. H is said
to be exactly online learnable if there is an online learning scheme (D, R, Φ) such that for all h ∈ H

∑_{n=1}^∞ 1{Φ_{D_n}(n) ≠ h(n)} < ∞,

where D is updated after every step n via

D_{n+1} = R(D_n, n, h(n)).

For a computable online learning scheme (D, R, Φ), we specify R and Φ as algorithms.
The time complexity of R, Φ and h is expressed in terms of log n, i.e. the size of the binary representation
of n. However, R and Φ may use D as an oracle at no computational cost.
We focus on the worst-case time complexity of R and Φ, i.e. over the most inconvenient
database and target function. We say that an online learning scheme runs in uniformly
exponential time if, no matter what the database D is and which function h is in force, there exist
c, N ∈ N such that both R and Φ run in exp(c log n) = n^c time for all n ≥ N. We define uniformly
polynomial time similarly.
Theorem 20 Let H be the class of all exponential time computable functions from N → {0, 1}.
Then there is no computable online learning scheme (D, R, Φ) that exactly learns H such that both
R and Φ run in uniformly exponential time.
We defer the proof of Theorem 20 to Appendix A. We make the following conjecture for
polynomial time computable functions.

Conjecture 21 Let H be the class of all polynomial time computable functions from N → {0, 1}.
Then there is no computable online learning scheme (D, R, Φ) that exactly learns H such that both
R and Φ run in uniformly polynomial time.
While we are unable to prove the conjecture, we can prove the following specialized version
(see Appendix A for a proof). We say that the recorder R is a naive recorder if it simply appends
h(n) to the database D at the update of step n.

Theorem 22 Let H be the class of all functions from N → {0, 1} whose time complexity is eventually
bounded by log^{k+1} n for some k ≥ 1. Then there is no computable online learning scheme
(D, R, Φ) that exactly learns H such that R is a naive recorder and Φ runs uniformly in log^k n time.
We note the following interesting connection with the time hierarchy theorem. Denote

H_k = { h : ∃ TM s.t. ∀n ∈ N, TM(n) = h(n) and C(TM, n) ≤ log^k n },

where C(TM, n) is the running time of TM on input n.

Corollary 23 Theorem 22 implies that H_{k+2} \ H_k is not empty.

Remark 24 The time hierarchy theorem (Arora and Barak, 2009, Theorem 3.1) does not necessarily
imply Theorem 22, since the predictor uses the database D as a zero-computational-cost oracle
when computing on input n.
6. Noisy labeling
In the previous sections we considered various setups in the eas-online learning paradigm in which the
labels are presented accurately. However, the labels may be corrupted by noise.
Perhaps surprisingly, we show in this section that our eas-online learning paradigm is resistant
to independent noise even if the samples are presented sequentially.

For simplicity, we assume the domain is X = N and the labels are binary. We consider
the following noise process. At each time step n, denote by Z̃_n = (X_n, h(X_n) ⊕ R_n) the noisy
sample-label pair, where ⊕ is addition modulo 2 and R_n is a binary random variable with Pr(R_n = 1) ≤ η;
the variables R_n are independent of the instances and labels, and independent across n.
We prove the following result.
Theorem 25 Let H be a function class from N → {0, 1}, and let p be a distribution over N. If H is
effectively countable w.r.t. p and η < 1/2, then there exists a learning strategy Φ such that for all
h ∈ H we have

Pr( ∑_{n=1}^∞ ℓ(Φ(Z̃_1^{n−1}, X_n), h(X_n)) < ∞ ) = 1,

where the randomness comes from both the sampling and the noise.
Proof Let H′ be a countable class that effectively covers H and in which any two distinct functions
differ on a subset of N of positive measure. Let H′_1 ⊆ H′_2 ⊆ · · · ⊆ H′ be a nesting such that |H′_k| = k and
⋃_{k∈N} H′_k = H′.

The prediction proceeds in stages. At stage 0, we initialize by choosing an arbitrary function h_0
from H′. Let h_{k−1} be the function held after stage k − 1; we use h_{k−1} to make predictions
throughout stage k. At stage k we try to identify the underlying function as if it were in H′_k. To do
so, we note that each function in H′_k differs from every other function in H′_k on a subset of N of positive measure. We
observe samples long enough that there are m_k samples (instances may repeat) in
the difference set of every pair of functions. We then define h_k ∈ H′_k to be a function whose
Hamming distance to the noisy labels of the samples at the difference positions is smaller than that of all other
functions in H′_k. If no such function exists, we choose an arbitrary function in H′_k.

We now show that this strategy works. Choose m_k = 3 log k / (2(1/2 − η)^2). By
the Hoeffding bound and a union bound, with probability at least 1 − 1/k^2 we identify the correct
underlying function at stage k, provided it lies in H′_k. Since the underlying function must lie in some
H′_t, the predictor identifies it after finitely many stages; and by the Borel-Cantelli lemma, we miss it at
only finitely many stages with probability 1, since ∑_{k∈N} 1/k^2 < ∞. After that, the strategy makes no further errors. The
theorem follows.
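The staged procedure in this proof can be sketched as follows. The sketch (ours) uses a simplified selection rule, minimizing total disagreement with the noisy labels collected during the stage rather than pairwise disagreement on difference sets, and all concrete hypotheses, distributions and constants are illustrative assumptions.

```python
# Sketch of the staged, noise-tolerant identification from the proof of Theorem 25
# (simplified: the stage winner minimizes total disagreement with the stage's noisy labels).
import math, random

def noisy_stage_learner(hypotheses, true_index, eta, sample, stages, rng):
    h_hat = hypotheses[0]                       # stage-0 initialization
    truth = hypotheses[true_index]
    mistakes = 0
    for k in range(1, stages + 1):
        m_k = max(1, math.ceil(3 * math.log(k + 1) / (2 * (0.5 - eta) ** 2)))
        candidates = hypotheses[:min(k, len(hypotheses))]    # the prefix class H'_k
        batch = []
        for _ in range(m_k):
            x = sample(rng)
            mistakes += int(h_hat(x) != truth(x))            # prediction made with h_{k-1}
            noisy = truth(x) ^ int(rng.random() < eta)       # label flipped with prob. <= eta
            batch.append((x, noisy))
        h_hat = min(candidates, key=lambda h: sum(h(x) != y for x, y in batch))
    return mistakes

if __name__ == "__main__":
    rng = random.Random(1)
    hypotheses = [lambda x, r=r: int(x % (r + 2) == 0) for r in range(10)]
    sample = lambda rng: rng.randrange(1, 1000)
    print("errors against the true labels:",
          noisy_stage_learner(hypotheses, true_index=7, eta=0.2,
                              sample=sample, stages=60, rng=rng))
```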
In the proof of Theorem 25, the exact probability Pr(R_n = 1) is not required to be known; only the
upper bound η needs to be known. However, even the requirement of knowing η can be eliminated
by a simple application of the doubling trick. Note that Theorem 25 also holds if the instances
are presented sequentially. In that case, one observes exactly one noisy label for each instance.
Therefore, there is no way to estimate any single label (which would have been possible in the i.i.d.
case). However, the proof of Theorem 25 shows that one can still identify the correct
function with arbitrarily high confidence by leveraging the noisy labels of different instances.
7. Discussion
In this paper, we introduced an online learning paradigm in which the learner is required to make only
finitely many errors with probability one over an infinite horizon. The setup generalizes the well-known
online learning setup of Littlestone (1988) to a more relaxed, non-uniform notion of consistency. We
also related our setup to the classic learning-of-recursive-functions literature surveyed in
Zeugmann and Zilles (2008), and showed that randomness and resource restrictions yield a richer picture
once computability is taken into account. We also investigated the case where noisy labels are presented,
and showed that our setup is resistant to independent Bernoulli noise.

While our setup does not provide bounds (or rates) on how many errors the learner will make,
we emphasize that we are dealing with very rich classes that do not admit uniform consistency results.
Our work focuses on the conceptual foundations of learning and is meant to address large-scale problems
about which we have little prior knowledge. Indeed, our approach also provides a different angle
from which a theorist may rethink "no free lunch" theorems. Philosophically, this is a question that
underlies much of our scientific endeavor: can a physicist make only finitely many errors before finding
a universal theory, or will she perpetually move between models without ever converging on one? See
Cover (1973) and Appendix C for more discussion of this type of problem.
Acknowledgments
This work was supported by NSF grant CCF-1619452 and by the Center for Science of Information
(CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.
References
Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cambridge Uni-
versity Press, 2009.
Jānis Martynovich Barzdiņš and R. V. Freivald. On the prediction of general recursive functions. In
Doklady Akademii Nauk, volume 206, pages 521–524. Russian Academy of Sciences, 1972.
Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009.
Olivier Bousquet, Steve Hanneke, Shay Moran, Ramon van Handel, and Amir Yehudayoff. A theory
of universal learning. arXiv:2011.04483, 2020.
Thomas M. Cover. On determining the irrationality of the mean of a random variable. The Annals of
Statistics, 1(5):862–871, 1973.
Amir Dembo and Yuval Peres. A topological criterion for hypothesis testing. The Annals of Statis-
tics, pages 106–117, 1994.
David H. Haussler, Nick Littlestone, and Manfred K. Warmuth. Predicting {0, 1}-functions on
randomly drawn points. Information and Computation, 115(2):248–292, 1994.
Sanjeev R Kulkarni and David N. C. Tse. A paradigm for class identification problems. IEEE
Transactions on Information Theory, 40(3):696–705, 1994.
Amir Leshem. Cover’s test of rationality revisited: Computability aspects of hypothesis testing.
In 2006 IEEE 24th Convention of Electrical & Electronics Engineers in Israel, pages 213–216.
IEEE, 2006.
Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it knows: a
framework for self-aware learning. Machine learning, 82(3):399–443, 2011.
Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algo-
rithm. Machine learning, 2(4):285–318, 1988.
Neri Merhav and Meir Feder. Universal prediction. IEEE Transactions on Information Theory, 44
(6):2124–2147, 1998.
Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge,
Andrew Carlson, Bhanava Dalvi, Matt Gardner, Bryan Kisiel, et al. Never-ending learning. Com-
munications of the ACM, 61(5):103–115, 2018.
Narayana Santhanam and Venkat Anantharam. Agnostic insurability of model classes. Journal
of Machine Learning Research, 16:2329–2355, 2015. URL http://jmlr.org/papers/v16/santhanam15a.html.
Amin Sayedi, Morteza Zadimoghaddam, and Avrim Blum. Trading off mistakes and don’t-know
predictions. In Advances in Neural Information Processing Systems, pages 2092–2100, 2010.
Dale Schuurmans. Characterizing rational versus exponential learning curves. Journal of Computer
and System Sciences, 55(1):140–160, 1997.
Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms.
Cambridge University Press, 2014.
Vladimir N. Vapnik. The nature of statistical learning theory. Springer Science & Business Media,
2013.
Vladimir N. Vapnik and Alexey Y. Chervonenkis. Theory of pattern recognition. 1974.
Changlong Wu and Narayana Santhanam. Being correct eventually almost surely. In Information
Theory (ISIT), 2019 IEEE International Symposium on, pages 1989–1993. IEEE, 2019.
Thomas Zeugmann and Sandra Zilles. Learning recursive functions: A survey. Theoretical Com-
puter Science, 397(1-3):4–56, 2008.
Chicheng Zhang and Kamalika Chaudhuri. The extended Littlestone's dimension for learning with
mistakes and abstentions. In Conference on Learning Theory, pages 1584–1616, 2016.
Appendix A. Omitted proofs
Proof [Proof of Theorem 11] The sufficiency follows directly from Lemma 4. To prove the necessity,
by the argument in the proof of Theorem 7 it suffices to show that there exists a countable set of
measurable functions from X → Y that is dense under the metric

d(f, g) = Pr_{x∼µ}(f(x) ≠ g(x)).

To do so, we use a truncation argument. W.l.o.g., assume Y = N. Since the σ-algebra on X is separable,
for every m ∈ N there exists a countable class P_m of measurable functions that is dense in the measurable functions from X → [m].
We claim that ⋃_{m=1}^∞ P_m is dense in the
measurable functions from X → N. Let h : X → N be an arbitrary measurable function and let ε > 0.
There exists m ∈ N such that

Pr_{x∼µ}( h(x) > m ) ≤ ε/2.

Let h_m be the function with h_m(x) = h(x) if h(x) ≤ m and h_m(x) = 1 otherwise. We have
d(h_m, h) ≤ ε/2. Choosing h′ ∈ P_m with d(h_m, h′) ≤ ε/2 then gives d(h, h′) ≤ ε. This
completes the proof.
Proof [Proof of Theorem 20] We prove a stronger version of the theorem. Let H be the
set of functions that can be computed within time n^{k+1} (hence exponential in log n); we show that
there is no online learning scheme (D, R, Φ) that exactly learns H with both R and Φ uniformly running
in n^k/2 time.

The proof uses a diagonalization argument. Assume, to the contrary, that such a scheme
(D, R, Φ) exists. We construct the following algorithm ExpDiag(n):

1. Input: n.
2. If n = 1, output 1.
3. Run the scheme (D, R, Φ) on the samples (k, ExpDiag(k)) for k ≤ n − 1.
4. Output 1 − Φ_D(n).

To analyze the running time of ExpDiag(n), let f(n) be the time needed to compute ExpDiag on
input n. We have

f(n) ≤ n^k + f(n − 1),

since R and Φ run in n^k/2 time by assumption and one may reuse the database across the recursion steps.
Hence

f(n) ≤ ∑_{i=1}^{n} i^k ≤ n^{k+1}.

Therefore the function computed by ExpDiag is in H. However, by construction ExpDiag(n) ≠ Φ_D(n)
for all n ≥ 2, which yields a contradiction.
Remark 26 Note that a similar argument does not directly generalize to functions computable in
time log^{k+1} n against predictors running in log^k n time, because establishing the database in step 3
requires Ω(n) steps if done naively. We nevertheless believe the statement to be true, as expressed in Conjecture 21.
Theorem 22 establishes the conjecture when the database simply appends the true labels without
processing.
Proof [Proof of Theorem 22] We use a standard approach that reduces the polynomial case to the exponential
case. Let H be the class of all functions computable in time log^{k+1} n. Suppose, to the
contrary, that (D, R, Φ) is a scheme that exactly learns H such that R is a naive recorder and Φ runs in
log^k n time. For any function h′ computable in n^{k+1} time, we construct a function h computable
in log^{k+1} n time as follows:

h(n) = h′(t) if n = 2^t for some t ∈ N, and h(n) = 1 otherwise.

Consider the following predictor Φ′:

1. Input: n and the naive recording D of h′ up to n − 1.
2. Simulate Φ on input 2^n as follows: if Φ queries the database at a position T with T = 2^t, answer with
the t-th position of D; at all other positions, we know h takes the value 1.
3. Output Φ_D(2^n).

By assumption Φ makes finitely many errors on h, so Φ′ makes finitely many errors on h′ as well.
Clearly, Φ′ runs in (log 2^n)^k = n^k time, and it exactly learns the class of functions computable
in n^{k+1} time. This contradicts Theorem 20.
Proof [Proof of Corollary 23] Suppose not. We construct an online learning scheme (D, R, Φ)
that exactly learns H_{k+2} such that R is a naive recorder and Φ runs in log^{k+1} n time, contradicting
Theorem 22.

To do so, enumerate all Turing machines TM_1, TM_2, · · · . The predictor Φ maintains two
indices t and i. At each time step n, both t and i are initialized to 1. The predictor simulates log^k i
instructions of TM_t on input i. If TM_t halts within those log^k i instructions and its output matches h(i)
(read from the database), keep t and increment i = i + 1. Otherwise, move to the next machine, t = t + 1, and increment i = i + 1.
This process continues for a net time of (1/2) log^{k+1} n (this includes the time taken to simulate steps on
all Turing machines so far and all relevant overheads, including those for incrementing indices).
Then set i = n and simulate TM_t(n) until the total running time reaches log^{k+1} n. If TM_t halts by then, the
predictor outputs TM_t(n); otherwise it outputs 1.

We show that this scheme is an exact online learning scheme for H_{k+2}. For any h ∈ H_{k+2}, there
is a Turing machine TM_j that outputs h(n) within log^k n time for every n, since H_{k+2} = H_k by
assumption. Note that t advances past a machine only when that machine fails to reproduce a recorded label in time.
Since in addition (1/2) log^{k+1} n → ∞, we know that t will hit j eventually. Once t hits j, it is never incremented again, since TM_j(i)
outputs h(i) within log^k i time for all i. If the time runs out before the simulation of
TM_j completes, t is again not incremented. Finally, since the overhead of a universal Turing machine
is a log log n factor in time complexity (Arora and Barak, 2009, Theorem 1.13), we know that for
large enough n the algorithm above actually completes the simulation of TM_j(n).
Appendix B. Non-uniform time complexity
If we relax the requirement of worst-case time complexity, we can show that polynomial time computable
functions can be learned exactly in (point-wise) polynomial time.

Theorem 27 Let H be the class of all polynomial time computable functions from N → {0, 1}.
Then there exists a computable online learning scheme (D, R, Φ) that exactly learns H such
that both R and Φ run in point-wise polynomial time.
Proof The database D is a triplet D = (D_1, D_2, k), where D_1, D_2 are encodings of Turing
machines and k ∈ N. The predictor Φ works as follows:

1. Input: n, with database D.
2. If D_1 is not a valid encoding of a Turing machine, output 1.
3. Simulate the Turing machine encoded by D_1 on input n for log^k n steps.
4. If D_1 halts within log^k n steps, output D_1(n). Else, output 1.

The recorder R works as follows:

1. Input: the old database D, the prediction Φ_D(n), and n, h(n).
2. Output: the new database D.
3. If n = 1, set D = (φ, φ, 1), where φ is the empty string.
4. If Φ_D(n) = h(n), leave D unchanged and return.
5. Else:
a. If D_1 ≠ D_2, set D_1 = NEXT(D_1), where NEXT maps a string in {0, 1}^* to the next
string in alphabetical order, and set k = k + 1.
b. Else, set D_1 = φ and D_2 = NEXT(D_2).

Suppose that the underlying function h ∈ H can be computed in O(log^t n) time by some Turing
machine with encoding D_h. By the construction of R, the database D changes iff the prediction
made by Φ is wrong.

We show that the database changes only finitely many times, thus proving that the total number
of errors is finite. If the database changed infinitely many times, then D_1 would hit D_h infinitely
many times. D_h is simulated for log^k n steps in Step 3 of the predictor, where k increases every
time there is an error while D_1 ≠ D_2. Therefore, k will eventually be large enough that D_h is simulated for enough
steps to halt and output the correct value of h(n). From then on, the database cannot change,
contradicting the assumption that the database changes infinitely many times.
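The scheme in this proof translates almost directly into code once the enumeration of Turing machines is replaced by an enumeration of programs with an explicit step budget. In the sketch below (ours), programs are indexed by integers, so NEXT is simply an increment; the program behavior, the target function and the step accounting are illustrative stand-ins.

```python
# Sketch of the scheme (D, R, Phi) from the proof of Theorem 27, with Turing machines
# replaced by callables prog(n, steps) -> 0/1 if they halt within `steps`, else None.
import math

def phi(database, n, programs):
    d1, d2, k = database
    if d1 >= len(programs):                   # not a valid "encoding": default output
        return 1
    budget = int(math.log(max(n, 2)) ** k)    # simulate for log^k n steps
    result = programs[d1](n, budget)
    return result if result is not None else 1

def recorder(database, n, prediction, true_label):
    d1, d2, k = database
    if prediction == true_label:              # step 4: correct prediction, keep D
        return database
    if d1 != d2:                              # step 5a: advance D_1 and grow the exponent k
        return (d1 + 1, d2, k + 1)
    return (0, d2 + 1, k)                     # step 5b: reset D_1 and advance D_2

if __name__ == "__main__":
    programs = [
        lambda n, steps: 0,                                              # wrong on odd n
        lambda n, steps: None,                                           # never halts in time
        lambda n, steps: (n % 2) if steps >= math.log(max(n, 2)) ** 2 else None,
    ]
    target, database, errors = (lambda n: n % 2), (0, 0, 1), 0
    for n in range(1, 5000):
        pred = phi(database, n, programs)
        errors += int(pred != target(n))
        database = recorder(database, n, pred, target(n))
    print("total errors:", errors)            # finite: the third program is eventually reached
```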
Appendix C. More variations
In contrast to the finite error guarantee, perhaps a more natural non-uniform consistency requirement
for online learning is that the average loss per time step converge to zero. More formally,
one may wish to find a predictor such that

Pr( lim_{N→∞} (1/N) ∑_{n=1}^N ℓ(Φ(Z_1^{n−1}, X_n), h(X_n)) = 0 ) = 1.
However, such a guarantee may not be attractive if one considers consistency alone, since we
can show that the class of all measurable functions from R → {0, 1} is learnable in this sense.

Theorem 28 Let H be the class of all measurable functions from R → {0, 1}, let µ be an arbitrary
distribution over R that is unknown to the learner, and let ℓ be the classification loss. Then there exists a
learning strategy Φ such that for all h ∈ H we have

Pr( lim_{N→∞} (1/N) ∑_{n=1}^N ℓ(Φ(Z_1^{n−1}, X_n), h(X_n)) = 0 ) = 1,

where Z_i and X_i are defined as in Section 2.
Proof [Sketch of Proof] Let G be a countable class of functions from R → {0, 1} that is dense in
H under the distance d(f, g) = Pr_{x∼µ}(f(x) ≠ g(x)). Since the Borel σ-algebra is separable, such a class exists and can be chosen
independently of the underlying distribution. The prediction is partitioned into stages. At stage k, the learner tries
to identify a function h_k in G that is ε_k-close to the underlying function with confidence at least
1 − 1/k²; this can be achieved by a structural risk minimization argument. We use
h_{k−1} to make the predictions during stage k, and use the samples observed during stage k to find
h_k for the next stage. We now choose ε_k small enough that the probability that the average
error ever jumps above ε_k over the infinite horizon is at most 1/k²; this can be done by a Chernoff bound
together with a union bound, observing that ∑_{n=1}^∞ exp(−2nε_k²) converges. The theorem then follows from the
Borel-Cantelli lemma, since 1/k² is summable.
One may also consider the scenario where the rate of convergence to zero is controlled. It can be
shown that if the class has finite VC dimension, then a rate of O(log n / n) is achievable in expectation,
see Haussler et al. (1994). However, this does not automatically give an almost sure upper bound
on the cumulative errors, since the errors at different time steps are correlated. We now show that one
can indeed achieve an O(log² n / n) rate almost surely. To do so, we use a doubling trick. We partition the
prediction into stages. At stage k, we see 2^k sample-label pairs and generate a hypothesis
h_k from the observed pairs. We then use h_k to make predictions for another 2^k steps, at the end of
which we move to stage k + 1. To see why this approach works, denote by I_n the indicator
that an error occurred at step n. By construction, the indicators within each phase are
independent. We can therefore employ a multiplicative Chernoff bound to show that, with probability
at least 1 − O(1/2^k), the number of errors at stage k is upper bounded by O(k) (using empirical risk
minimization to obtain h_k suffices for this). Therefore, by the Borel-Cantelli lemma, the
partial sums of the indicators {I_n} up to stage k are eventually upper bounded by O(k²) with
probability 1. Since the first n steps cover at most log(n) stages, the result follows. Note that
the O(log² n / n) rate is not meant to be optimal; it is not hard to see that a similar argument
establishes an O(log n log log n / n) rate almost surely by replacing h_k with an optimal predictor
whose error rate at stage k is O(log k / 2^k) with probability 1 − O(1/k²). We leave it as an open problem to obtain
the optimal almost sure rate for classes of finite VC dimension.
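The doubling-trick schedule just described is easy to implement generically; in the sketch below (our illustration, under one natural reading of the schedule) the hypothesis fitted on the 2^k pairs of stage k is frozen and used to predict throughout stage k + 1, which keeps the error indicators within a stage conditionally independent. The toy learner, sampler and labels are our own choices.

```python
# Sketch of the doubling trick: the predictor fitted on stage k's 2^k pairs is used,
# unchanged, to predict throughout stage k+1. `fit`, `sampler` and `label` are toy choices.
import random

def doubling_trick(fit, sampler, label, total_steps, rng):
    predict = lambda x: 0            # arbitrary initial hypothesis
    step, k, errors = 0, 0, 0
    while step < total_steps:
        stage = []
        for _ in range(2 ** k):      # stage k lasts 2^k steps
            if step >= total_steps:
                break
            x = sampler(rng)
            errors += int(predict(x) != label(x))
            stage.append((x, label(x)))
            step += 1
        predict = fit(stage)         # h_k, frozen for the whole of stage k+1
        k += 1
    return errors

if __name__ == "__main__":
    rng = random.Random(0)
    label = lambda x: int(x >= 0.37)                 # a fixed threshold hypothesis
    sampler = lambda rng: rng.random()               # uniform distribution on [0, 1]
    def fit(data):                                   # ERM: smallest positively-labelled point
        ones = [x for x, y in data if y == 1]
        threshold = min(ones) if ones else 1.0
        return lambda x: int(x >= threshold)
    print("errors over 100000 steps:", doubling_trick(fit, sampler, label, 100_000, rng))
```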
A binary hypothesis class H is said to be closed if, for every measurable function f and distribution µ satisfying

inf_{h∈H} Pr_{x∼µ}[ h(x) ≠ f(x) ] = 0,

we have f ∈ H. Recently, Bousquet et al. (2020) showed that for closed hypothesis classes the expected
rate (i.e. E[I_n]) can only be exponential, linear, or arbitrarily slow (here we use the closedness
notion in place of the adversarial setting of Bousquet et al. (2020)). Note that closedness of the
class is a crucial ingredient in establishing their lower bounds. For example, consider the class R of all linear
threshold functions over [0, 1] with rational parameter; this class can be learned with
finitely many errors almost surely. However, R is not closed and it has an infinite Littlestone tree, so
the result of Bousquet et al. (2020) only implies that the closure of R has E[I_n] = Ω(1/n). It is
therefore an interesting problem to investigate whether a similar phenomenon occurs for general
(i.e. not necessarily closed) classes and in the almost sure setting.