Proceedings of Machine Learning Research vol 132:1–21, 2021 32nd International Conference on Algorithmic Learning Theory
Non-uniform Consistency of Online Learning with Random Sampling
Changlong Wu WUCHANGL@HAWAII.EDU
University of Hawaii at Manoa
Narayana Santhanam NSANTHAN@HAWAII.EDU
University of Hawaii at Manoa
Editors: Vitaly Feldman, Katrina Ligett and Sivan Sabato
Abstract
We study the problem of online learning a hypothesis class under a given binary 0-1 loss function,
using instances generated i.i.d. from a given distribution. The goal of the online learner is to make
only finitely many errors (loss 1) with probability 1 in the infinite horizon. In the binary label case,
we show that hypothesis classes are online learnable in the above sense if and only if the class is
effectively countable. We extend the results to hypothesis classes where labels can be non-binary.
Characterization of non-binary online learnable classes is more involved for general loss functions
and is not captured fully by the countability condition even for the ternary label case.
In the computationally bounded setup, we compare our results with well-known results in recursive
function learning, showing that the class of all total computable functions is indeed learnable
with computable online learners under randomized sampling. Finally, we also show that the finite
error guarantee is not affected even when independent noise is added to the labels.
Keywords: Online Learning, Finite Error, Almost Surely, Consistency
1. Introduction
Let H be a set of functions from X → {0, 1}, where X is an instance space. The online learning
setup is a game between two parties, Nature and the learner, both of whom know the class H. Nature
chooses a function h ∈ H at the beginning of the game. The game proceeds in steps. At each step
n, the learner is provided with an instance x_n ∈ X. The learner has to guess the label h(x_n),
potentially using the history built up till that step. Subsequently, the true label h(x_n) is revealed to
the learner, and the learner incurs a binary loss: 1 if her guess differs from h(x_n), indicating an
error, and 0 otherwise. The goal of the learner is a strategy that minimizes the number of errors made.
In his seminal work, Littlestone (1988) showed that the number of errors is bounded by what is
known as the Littlestone dimension Ldim(H) of H. This dimension is independent of the number
of steps the game goes on for. When the class H is finite, Littlestone (1988) proved that the number
of errors is bounded by log |H| using a halving argument.
However, the Littlestone dimension can be too restrictive a measure to handle really rich classes
that may not have finite Littlestone dimension, such as the class P of all polynomial time decid-
able problems. One common approach that circumvents this difficulty considers the non-realizable
setup, where the goal of the learner is competitive optimality with respect to a reference class H. Ben-David et al. (2009) showed that one can achieve an expected O(√(T log(T) Ldim(H))) regret in a
horizon-T game using a weighted-majority algorithm. See (Shalev-Shwartz and Ben-David, 2014,
Chapter 21) and the references therein for more results in this line. However, the regret bound
does not directly tell how many actual errors the learner could make.
© 2021 C. Wu & N. Santhanam.
One common approach to dealing with rich classes in the realizable case is to relax the requirement
of a uniform bound to point-wise bounds that may depend on the individual hypothesis in the
class. Perhaps the most well-known setup of this flavor in the learning theory literature is non-uniform
PAC learnability and the concept of Structural Risk Minimization by Vapnik and Chervonenkis
(1974); Vapnik (2013). A more closely related scenario in the online learning setup was studied in the
theory of learning recursive functions. In this setup, one assumes the class H consists of functions from
N → {0, 1}, and the instances are presented sequentially as 1, 2, · · · . A class H is said to be
recursively learnable if there exists a computable learner such that the number of errors is finite, no
matter what model is selected from H. See Zeugmann and Zilles (2008) for a survey on this topic.
This paper generalizes the deterministic finite error setting in the recursive function learning
literature with a more realistic randomized setup. Instead of the learner observing instances in a
predetermined order, we assume that they are sampled from a distribution µ over X independently
at each time step. A class H of functions from X → {0, 1} is eventually almost surely (or eas for
abbreviation) online learnable with respect to µ if there exists an online learner that makes only
finitely many errors with probability 1, no matter what function is chosen from H.
The conceptual basis of our problem setup lies in the almost sure hypothesis testing framework
studied in the statistics and information theory communities. See Cover (1973); Dembo and Peres
(1994); Kulkarni and Tse (1994); Santhanam and Anantharam (2015); Wu and Santhanam (2019)
for a sample of results along this line in hypothesis testing, classification and prediction. There
has been work in this line along the computability angle as well, see for example Leshem (2006).
We make a note of other related work on an infinite horizon machine learning framework Mitchell
et al. (2018), the universal prediction framework in Merhav and Feder (1998), and randomized
online learning scenario in Haussler et al. (1994) as well. Novel modifications to the Littlestone
dimension have also been considered, see for example, Sayedi et al. (2010); Li et al. (2011); Zhang
and Chaudhuri (2016).
Characterization Our first result fully characterizes eas-online learnability in the binary label
case. We show that for all instance spaces X satisfying a regularity condition (satisfied by most
common spaces, including e.g. the Borel space over R^d) and distributions µ with potentially infinite
support, a class H of measurable functions from X → {0, 1} is eas-online learnable if and only
if H is effectively countable. Informally, a class H is effectively countable if it is countable once we
do not distinguish between hypotheses that agree with probability 1 under µ (see Definition 3 for the
complete definition).
Applying our characterization to the class T of all linear threshold functions from [0, 1] → {0, 1}, i.e. functions of the form

h_a(x) = 1 if x ≥ a, and h_a(x) = 0 otherwise,
shows that T above is not eas-online learnable for any probability measure over [0, 1] that admits
a continuous density. However, the class T has VC dimension 1. For a uniform distribution over
[0, 1], an empirical risk minimization approach will result in an online learning strategy for the
linear threshold functions. The probability of making an error with this approach at time step n is at
most O(log n / n). Because T is not effectively countable, and hence not eas-online learnable, we can
conclude immediately that the empirical risk minimization bound cannot be improved from log n / n to,
e.g., 1/(n log^{1+ε} n) for any ε > 0. If it could, the Borel-Cantelli lemma would imply that the class T
with the uniform distribution would be eas-online learnable, since 1/(n log^{1+ε} n) is summable over n.
This lower bound differs only by a poly log(n) factor from the optimal 1/n lower bound shown
in Schuurmans (1997), and we only needed to observe that the set of all linear threshold
functions is not effectively countable to infer it!
Countable label spaces Departing from the basic result, we show that the characterization of
online learning still holds for classification loss even if the label space is countable. However, when
more general binary losses are allowed, there exist distributions with infinite support and effectively
uncountable classes that are eas-online learnable, even when the label size is 3.
Computable learners Suppose the learner is required to be computable in addition. We show
that there exist distributions over N with infinite support such that the class of all total computable
functions from N → {0, 1} is eas-online learnable with a computable learner. This is in contrast to
the sequential sampling scenario of recursive learning, see Zeugmann and Zilles (2008).
Noisy labels We finally consider online learning with noisy labels, and show that effectively
countable classes with binary labels are still online learnable when labels are corrupted with in-
dependent Bernoulli noise. Surprisingly, we show that one could also learn effectively countable
classes with binary labels when the instances are presented sequentially with noisy labels and in
deterministic order. Here we see each instance only once, with a noisy label, yet we can
leverage the noisy labels of different instances to learn the underlying hypothesis.
2. Problem setup
Let X be the instance space, endowed with some fixed separable σ-algebra F. We say F is
separable if for every probability distribution µ on F there is a countable set G of measurable sets in
F such that for any measurable set A ∈ F and ε > 0 there exists G_ε ∈ G with

µ(A △ G_ε) ≤ ε,

where △ denotes the symmetric difference. Clearly, the Borel σ-algebra over R^d is separable (e.g. we can take
G to be the algebra generated by the set of all cuboids in R^d with vertices at rational coordinates).
We will also assume that every singleton subset of X is measurable.
Let Y be a label space, often assumed to be finite or countable. A binary loss is a function
ℓ : Y × Y → {0, 1} that is symmetric and reflexive, i.e. ℓ(y_1, y_2) = ℓ(y_2, y_1) and ℓ(y_1, y_1) = 0 for
all y_1, y_2 ∈ Y. A learning strategy is a function

Φ : (X × Y)^* × X → Y

that is measurable with respect to the cylinder σ-algebra on (X × Y)^* × X.
We consider the following learning game between Nature and a learner, proceeding in time
steps. Let H be a class of measurable functions from X → Y and µ be a probability measure
over X, both known to all parties. At the beginning, Nature chooses a function h ∈ H.
At each time step n, Nature independently samples X_n ∼ µ and provides it to the learner.
The learner then outputs an estimate Y_n of the true label h(X_n), potentially using the history
{(X_1, h(X_1)), · · · , (X_{n−1}, h(X_{n−1}))} observed thus far. Nature then provides the true label h(X_n) to the
learner, and the learner incurs the binary loss ℓ(Y_n, h(X_n)). The learner makes an error at time step
n if ℓ(Y_n, h(X_n)) = 1.
Denote by Z_i = (X_i, h(X_i)) the instance-label pair at time step i, and by Z_1^i = (Z_1, · · · , Z_i)
the history observed by the learner up to time step i.
Definition 1 A class H is said to be eas-online learnable w.r.t. distribution µ, if there exists a
learning strategy Φ such that

Pr( ∑_{n=1}^∞ ℓ(Φ(Z_1^{n−1}, X_n), h(X_n)) < ∞ ) = 1,

for all h ∈ H, where X_1, X_2, · · · are i.i.d. ∼ µ.
We also introduce the following stronger version of eas-learnability when the distribution is
unknown to the learner.
Definition 2 A class H is said to be strongly eas-online learnable, if there exists a learning strategy
Φ such that

Pr( ∑_{n=1}^∞ ℓ(Φ(Z_1^{n−1}, X_n), h(X_n)) < ∞ ) = 1,

for all h ∈ H and all distributions µ, where X_1, X_2, · · · are i.i.d. ∼ µ.
In both cases above, we will say that the strategy Φ eas-online learns (or strongly eas-online
learns) H w.r.t. µ. For notational convenience, we drop the reference to µ where doing so leads to
no ambiguity.
The following notion will be used frequently in this paper.
Definition 3 A class H′ effectively covers a class H w.r.t. distribution µ and loss ℓ, if for all h ∈ H
there exists h′ ∈ H′ such that

µ{x : ℓ(h(x), h′(x)) = 1} = 0.

We say H is effectively countable (respectively, of effective size n) w.r.t. µ and ℓ if there exists an H′ that is
countable (respectively, of size n) such that H′ effectively covers H w.r.t. µ and ℓ.
We begin with the following lemma, an online learning version of Structural Risk Minimization.
Lemma 4 Let H_1, H_2, · · · be countably many function classes that share the same instance space and
loss function. If H_n is eas-online learnable w.r.t. distribution µ for all n ∈ N, then

H = ⋃_{n∈N} H_n

is also eas-online learnable w.r.t. µ.
Proof Let Φ_k be the strategy that eas-online learns the class H_k. We define a strategy for H as
follows. At each time step n, denote by e(n, k) the number of errors that Φ_k would have made had it been
employed in the first n − 1 time steps. Let

J_n = argmin_{k∈N} {e(n, k) + k}. (1)

We then use Φ_{J_n} to make the prediction at time step n.
We now show that this strategy indeed makes finitely many errors with probability 1 no matter
which h is chosen from H. Assume h ∈ H_t for some t ∈ N. Let B ⊆ X^∞ be the set of realizations
on which Φ_t makes only finitely many errors. We have µ(B) = 1 because Φ_t eas-online learns
H_t. Fix any x ∈ B and denote by s the number of errors Φ_t makes on x. We have

e(n, t) + t ≤ s + t

for all n, so from (1) the index J_n chosen at any step n satisfies e(n, J_n) + J_n ≤ s + t; in particular J_n ≤ s + t.
Furthermore, for any Φ_i with i ∈ {1, · · · , s + t}, once Φ_i has made more than s + t errors it will
no longer be chosen in (1). Therefore, our strategy makes at most (s + t)^2 errors, and thus finitely many
errors, on x. Hence the strategy makes finitely many errors with probability 1 whenever h ∈ H is selected.
The lemma follows.
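To make the construction above concrete, here is a minimal sketch (ours, not part of the paper) of the aggregation rule: each sub-learner is assumed to expose predict/update methods, an interface we introduce purely for illustration, and at each step the rule selects the sub-learner minimizing past errors plus index, as in equation (1).

```python
# Sketch of the aggregation strategy from the proof of Lemma 4.
# Assumed interface (ours): each sub-learner k has predict(x) and update(x, y).
# Indices are counted from 1, matching equation (1).

class EasAggregator:
    def __init__(self, sub_learners):
        self.subs = list(sub_learners)       # Phi_1, Phi_2, ... (a finite prefix)
        self.errors = [0] * len(self.subs)   # e(n, k): errors Phi_k would have made so far

    def predict(self, x):
        # J_n = argmin_k { e(n, k) + k }
        j = min(range(len(self.subs)), key=lambda k: self.errors[k] + (k + 1))
        return self.subs[j].predict(x)

    def update(self, x, y):
        # Once the true label y is revealed, every sub-learner is simulated on (x, y)
        # so that e(n, k) keeps counting the errors Phi_k *would* have made.
        for k, sub in enumerate(self.subs):
            if sub.predict(x) != y:
                self.errors[k] += 1
            sub.update(x, y)
```

In the lemma the family is countably infinite; the argmin in (1) is still well defined because e(n, k) + k ≥ k, so only finitely many indices can ever attain the minimum, and the sketch simply restricts attention to a finite prefix.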
Remark 5 Note that Lemma 4 holds for strong eas-online learning as well, since the construction
of the learning strategy does not depend on knowledge of the underlying distribution.
We prove the following technical lemma that relates deterministic sampling with randomized
sampling.
Lemma 6 Let a_1, a_2, · · · be an arbitrary ordering of N, and let s_1, s_2, · · · ∈ N^+ be an arbitrary
sequence. Then there exists a distribution p over N such that

p( ∃N : ∀n ≥ N, T_n − T_{n−1} ≥ s_{n−1} ) = 1,

where T_n is the first time a_n appears in an i.i.d. sample from p.
Proof Let p_n = p(X = a_n). Since T_n − T_{n−1} < s_{n−1} implies that a_n appears earlier than the
s_{n−1}'th appearance of a_{n−1}, we have

p(T_n − T_{n−1} < s_{n−1}) ≤ 1 − ( p_{n−1} / (p_n + p_{n−1}) )^{s_{n−1}}. (2)

Taking p_n proportional to

1 / ( (n!)^2 ∏_{i=1}^{n−1} s_i ),

the right-hand side of (2) is further upper bounded by 1/n^2. Since ∑_n 1/n^2 is summable, an application of the
Borel-Cantelli lemma completes the proof.
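For concreteness, the following sketch (ours) instantiates the construction of Lemma 6 on a finite prefix of N with a_n = n: it normalizes weights proportional to 1/((n!)² ∏_{i<n} s_i) and empirically inspects the gaps between first appearances; the names, horizon and gap sequence are illustrative choices only.

```python
# Sketch: the distribution of Lemma 6 on a finite prefix {1, ..., N} of N, with a_n = n.
# Weights are taken proportional to 1 / ((n!)^2 * prod_{i <= n-1} s_i) and then normalized.
import math, random

def lemma6_weights(s, N):
    weights, prod_s = [], 1
    for n in range(1, N + 1):
        weights.append(1.0 / (math.factorial(n) ** 2 * prod_s))
        prod_s *= s[n - 1]                     # accumulate the product of s_i for the next weight
    total = sum(weights)
    return [w / total for w in weights]

def first_appearance_times(p, horizon, rng):
    samples = rng.choices(range(1, len(p) + 1), weights=p, k=horizon)
    first = {}
    for t, x in enumerate(samples, start=1):
        first.setdefault(x, t)                 # T_n: first time the instance n appears
    return first

if __name__ == "__main__":
    rng = random.Random(0)
    s = [2, 2, 2, 2, 2]                        # required gaps s_1, ..., s_5
    p = lemma6_weights(s, N=5)
    T = first_appearance_times(p, horizon=500_000, rng=rng)
    for n in range(2, 6):
        if n in T and n - 1 in T:
            print(f"T_{n} - T_{n-1} = {T[n] - T[n - 1]}, required gap s_{n-1} = {s[n - 2]}")
```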
3. Binary label
In this section, we consider the case where the label is binary and the loss is ℓ(y_1, y_2) = 1{y_1 ≠ y_2}.
We refer to this loss as the classification loss in the sequel. Note that since our loss
function is binary valued on binary labels, the only possible losses are the trivial loss and the classification loss.
We say a distribution over X is non-degenerate if it has infinite support.
The following theorem fully characterizes the eas-online learnability in the binary label case.
Theorem 7 A class H with binary labels is eas-online learnable w.r.t. distribution µ iff H is
effectively countable w.r.t. µ.
Proof To see that H is eas-online learnable if it is effectively countable, note that any class
that effectively contains a single hypothesis is trivially eas-online learnable. Therefore, any class H with
effectively countably many hypotheses is eas-online learnable by appealing to Lemma 4.
For the necessary part of the proof, we assume w.l.o.g. that for all h_1 ≠ h_2 ∈ H we have
Pr_{X∼µ}(h_1(X) ≠ h_2(X)) > 0. Note that any hypothesis class can be reduced to one satisfying
the above property by choosing a representative function in each equivalence class defined by the
relation h_1 ∼ h_2 ⟺ Pr_{x∼µ}(h_1(x) ≠ h_2(x)) = 0.
We now prove that if H admits a function Φ : (X × {0, 1})^* × X → {0, 1} such that for all h ∈ H

Pr( ∑_{n=1}^∞ 1{Φ(Z_1^{n−1}, X_n) ≠ h(X_n)} < ∞ ) = 1, (3)

where Z_1^{n−1} = (Z_1, · · · , Z_{n−1}) and Z_i = (X_i, h(X_i)) is the instance-label pair at step i generated
by µ, then H is countable.
Define the event

A_n^h = { X_1^∞ : ∑_{k=n}^∞ 1{Φ(Z_1^k, X_{k+1}) ≠ h(X_{k+1})} > 0 },

and let

H_n = { h ∈ H : Pr(A_n^h) ≤ 1/4 }. (4)

From equation (3), we must have H = ⋃_{n=1}^∞ H_n. Indeed, for any h ∈ H we have Pr(A_n^h) → 0 as
n → ∞, so there exists some k such that Pr(A_k^h) ≤ 1/4, i.e. h ∈ H_k. We show that H_n is countable for
all n ∈ N, which will prove the theorem.
Central to our proof is the claim that for all n ∈ N we have

inf{ Pr_{X∼µ}(h_1(X) ≠ h_2(X)) : h_1 ≠ h_2 ∈ H_n } > 0. (5)

Equation (5) implies that H_n is countable. To see this, let

d(h_1, h_2) = Pr_{X∼µ}(h_1(X) ≠ h_2(X)).

The σ-algebra over X is separable, so by definition there is a countable subset of the
σ-algebra that is dense with respect to the distance d. Since each set in the σ-algebra can be represented
by a measurable function from X → {0, 1}, we conclude that there is a countable collection
G = {g_1, g_2, · · · } of measurable functions from X → {0, 1} such that G is dense in the set of
all measurable functions from X → {0, 1} under the distance d defined above.

Suppose the infimum in (5) equals ε > 0. Since G is dense in the set of all binary valued
measurable functions on X, for each h ∈ H_n there is a function g ∈ G such that d(h, g) < ε/2. By
the triangle inequality, for every other h′ ∈ H_n we then have d(h′, g) > ε/2. Therefore every h ∈ H_n
can be associated with a distinct g ∈ G, which implies that H_n is at most countable.
We now establish equation (5). The intuition behind this claim is that if the infimum in (5) were
0, then we could choose two hypotheses in H_n that are arbitrarily close in the distance d, and
hence close enough that they are indistinguishable with sample size n on a large enough subset
A. However, these hypotheses will eventually differ in the suffix of the strings in A, so no
predictor can agree with both of them in the suffix past step n. But if the probability of A is large
enough, this contradicts the definition of H_n in (4).
Formally, suppose (5) did not hold. Then there exist h_1, h_2 ∈ H_n such that 0 < d(h_1, h_2) < δ_n,
where δ_n is chosen so that (1 − δ_n)^n > 1/2.

Let A ⊆ X^∞ be the event that h_1 and h_2 cannot be distinguished within the first n samples. By construction,
we have Pr(A) > 1/2. For j ∈ {1, 2}, let

p_j = Pr(Φ makes an error after step n on h_j).
We show that max{p_1, p_2} > 1/4, which contradicts equation (4) since h_1, h_2 ∈ H_n, thus
establishing equation (5). To see that max{p_1, p_2} > 1/4, we use a probabilistic argument. Let h be
a random hypothesis chosen uniformly from {h_1, h_2}. It suffices to show

E_h E_{X∼µ}[ 1{Φ makes an error after step n on h} | X ∈ A ] ≥ 1/2, (6)

since (6) together with Pr(A) > 1/2 gives (p_1 + p_2)/2 ≥ (1/2) Pr(A) > 1/4.

Let B ⊆ X^∞ be the event that there exists an instance Y after step n that first reveals h_1(Y) ≠ h_2(Y).
We have Pr(B) = 1, since d(h_1, h_2) > 0. Note that

E_{X∼µ}[ E_h[ 1{Φ makes an error after step n on h} ] | X ∈ A ∩ B ] ≥ 1/2, (7)

since, conditioned on any X ∈ A ∩ B, the event C = {Φ makes an error at the sample Y on h} implies the event
in the equation above, and E_h[1{C}] = 1/2. We then have

E_{X∼µ}[ E_h[ 1{Φ makes an error after step n on h} ] | X ∈ A ] ≥ 1/2, (8)

since Pr(B) = 1. Finally, (8) implies (6) by exchanging the order of expectation, which is justified by
Fubini's theorem.
Example 1 Let X = [0, 1] and let µ be the uniform distribution over [0, 1]. We consider the linear
threshold functions

h_a(x) = 0 if x < a, and h_a(x) = 1 otherwise.

Denote H = {h_a : a ∈ [0, 1]}. By Theorem 7 we know that H is not eas-online learnable.
A couple of observations should be emphasized here. Note that the VC dimension of H is 1.
Therefore the VC theorem (Shalev-Shwartz and Ben-David, 2014, Thm 6.7) posits that there is a
learning rule ĥ_n such that for any h_a ∈ H and a sample S_n of size n, with high probability, say
1 − 1/n^2 over S_n, we have

Pr_{x∼µ}[ ĥ_n(S_n, x) ≠ h_a(x) ] ≤ O( log n / n ).

Theorem 7 implies that we cannot improve this upper bound from O(log n / n) to, say, O(1/(n log^{1+ε} n))
with ε > 0. If we could, then since ∑_{n=1}^∞ 1/(n log^{1+ε} n) < ∞, an application of the Borel-Cantelli
lemma would imply that ĥ_n makes only a finite number of errors no matter which hypothesis is in force,
thus violating Theorem 7.
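As a small numerical illustration of this discussion (ours, not from the paper), the sketch below runs an online ERM rule for the threshold class under the uniform distribution and reports cumulative errors; as predicted by Theorem 7, the count keeps growing with the horizon instead of stabilizing at a finite value.

```python
# Sketch: online ERM for thresholds h_a(x) = 1{x >= a} on [0, 1], uniform samples.
# The estimate is the midpoint of the interval of thresholds consistent with the data.
import random

def run_erm_thresholds(a_true, horizon, seed=0):
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0              # thresholds still consistent with the observed labels
    errors, checkpoints = 0, {}
    for n in range(1, horizon + 1):
        x = rng.random()
        a_hat = (lo + hi) / 2.0
        if (x >= a_hat) != (x >= a_true):      # classification loss at step n
            errors += 1
        if x >= a_true:                        # label 1 means the threshold is at most x
            hi = min(hi, x)
        else:                                  # label 0 means the threshold exceeds x
            lo = max(lo, x)
        if n in (10, 100, 1_000, 10_000, 100_000):
            checkpoints[n] = errors
    return checkpoints

if __name__ == "__main__":
    for n, e in run_erm_thresholds(a_true=0.37, horizon=100_000).items():
        print(f"errors after {n:>6} steps: {e}")
```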
We observe the following simple corollary.
Corollary 8 If a binary measurable function class H has finite Littlestone dimension, then H is
effectively countable w.r.t. any distribution. In particular, if the domain is countable, then H is
countable.
Proof Finite Littlestone dimension implies that H is online learnable with a bounded number of errors under
an adversarial sampling process (Shalev-Shwartz and Ben-David, 2014, Chapter 21). The first claim
follows from Theorem 7, noting that this learning strategy is independent of the distribution. The
second claim follows from the fact that, for any hypothesis class H over a countable domain X, if H is
effectively countable w.r.t. a distribution whose support is all of X, then H is countable.
The proof of the necessary condition of Theorem 7 also yields that if a class H is strongly eas-online
learnable, then H is effectively countable w.r.t. any distribution. By Corollary 8 above, this
implies that the restriction of H to any countable subset of X must be countable. However, this
does not mean that the class H itself has to be countable. To see this, consider the class T of all functions
t_a : [0, 1] → {0, 1} with a ∈ [0, 1] of the following form:

t_a(x) = 1 if x = a, and t_a(x) = 0 otherwise.

It is easy to see that T is strongly eas-online learnable (in fact it has Littlestone dimension 1), but
the class is uncountable. We leave it as an open problem to determine whether a class that is eas-online
learnable w.r.t. every distribution is also strongly eas-online learnable. (Note that in the former
setting the learning strategy may differ across distributions, while strong eas-online learnability
requires a single universal learning rule.)
4. Multi-Class label
We now consider the case where the output set Y has more than 2 elements, and we allow the loss ℓ to be
an arbitrary binary loss. Before analyzing the randomized setting, we consider a simpler deterministic
setup. A deterministic sampling process, in the most general setting, is a function S : (X × Y)^* → X
which selects the next sample based on the previous sample-label pairs (but not on the learner's answers).
For any class H and process S, one can construct an infinite |Y|-ary tree T = (V, E) with a label
function L (for both vertices and edges) such that

1. Each edge has a label in Y, and the labels of edges sharing the same parent are distinct.

2. Each vertex in T has a label in X. The root v_0 has label L(v_0) = S(∅), where ∅ is the empty
sequence. For each vertex v ∈ T, denote by v_0 → v_1 → · · · → v_n = v the unique path from the
root to v. Letting z_i = (L(v_{i−1}), L(v_{i−1} → v_i)), the label of v is given by L(v) = S(z_1, · · · , z_n).

3. A vertex v in T has a child u with L(v → u) = y if and only if there exists h ∈ H such that
h(L(v_i)) = L(v_i → v_{i+1}) for all 0 ≤ i ≤ n − 1 and h(L(v)) = y, where the v_i are as defined in item 2.
Clearly, every function h ∈ H is associated with a unique infinite path v_0 → v_1^h → v_2^h → · · ·
in T such that h(L(v_i^h)) = L(v_i^h → v_{i+1}^h). However, not every infinite path has a function
in H associated with it. The learning process can be viewed as traversing an infinite path in T.
Such a tree is also known as a mistake tree in the binary online learning literature, see Zhang and
Chaudhuri (2016). A valuation of T is a function v : V → N such that for each v ∈ V we have

1. v(v) ≥ max_{u∈C(v)} {v(u)}, where C(v) is the set of children of v;

2. v(v) ≥ 1 + max_{u∈C(v)} {v(u)} if L_v^M = {L(v → u) : u ∈ C^M(v)} cannot be covered by Y,
where C^M(v) is the set of children of v of maximal value. We say a set A ⊆ Y
can be covered by Y if there exists y ∈ Y such that ℓ(y, x) = 0 for all x ∈ A.
Note that a valuation need not always exist. The following theorem relates the existence of a valuation
to a bounded error guarantee.

Theorem 9 A class H is online learnable with at most B errors under a deterministic sampling process
S if and only if there exists a valuation v on T such that v(v_0) ≤ B, where v_0 is the root of T.

Proof If T has a valuation with v(v_0) ≤ B, we simply predict an element that covers L_v^M at the
vertex v reached in the traversal (if no such element exists, we predict arbitrarily). Note that the learner makes
an error only if the value of the next vertex is smaller than the current one by at least 1. Therefore, there will be at most
B errors.

Conversely, if H is online learnable w.r.t. S with at most B errors, we define the value at vertex v to be the
maximum number of errors that the learning rule could still make on the subtree rooted at v.

To prove that this is a valid valuation, note that condition 1 is easily verified. To see that condition
2 holds as well, we argue by contradiction. Suppose condition 2 fails at some vertex
v, i.e. v(v) = max_{u∈C(v)} {v(u)} but L_v^M cannot be covered by Y. Then we can choose a path
to a child in C^M(v) whose edge label incurs loss 1 against the prediction given by the learner, thus incurring
max_{u∈C(v)} v(u) + 1 errors starting from v, contradicting our premise. The theorem follows.
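For intuition, the valuation of Theorem 9 is easy to compute bottom-up on a finite mistake tree. The sketch below (our illustration; the tree encoding and helper names are assumptions of ours) returns a value satisfying both conditions of the definition; by the first direction of Theorem 9, this value at the root bounds the number of errors of the covering predictor.

```python
# Sketch: computing a valuation bottom-up on a finite mistake tree.
# A node is a dict {"children": {edge_label: child_node, ...}}; leaves have no children.

def coverable(labels, Y, loss):
    # A set of labels is covered by Y if some y in Y has zero loss against all of them.
    return any(all(loss(y, a) == 0 for a in labels) for y in Y)

def valuation(node, Y, loss):
    """Return a value satisfying conditions 1 and 2 of the valuation definition."""
    children = node.get("children", {})
    if not children:
        return 0
    child_values = {edge: valuation(child, Y, loss) for edge, child in children.items()}
    best = max(child_values.values())
    maximal_edges = [edge for edge, val in child_values.items() if val == best]
    return best if coverable(maximal_edges, Y, loss) else best + 1

if __name__ == "__main__":
    # Ternary labels with a loss where l(0, 1) = 0 and l(y, 2) = 1 for y in {0, 1}.
    Y = [0, 1, 2]
    loss = lambda a, b: 0 if a == b or {a, b} == {0, 1} else 1
    tree = {"children": {0: {"children": {}}, 2: {"children": {}}}}
    print("valuation at the root:", valuation(tree, Y, loss))   # {0, 2} is not coverable -> 1
```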
To capture the finite error guarantee, we need a notion of ranking on the hypotheses in H.
A ranking of H is a function r : H → N. For any vertex v ∈ V, denote by H_v the set of hypotheses
in H that share the path from v_0 to v in T, where v_0 is the root of T. For a given ranking r,
denote by H_v^m the set of hypotheses in H_v of minimal rank, i.e.

H_v^m = { h ∈ H_v : r(h) = min r(H_v) },

where r(H_v) = {r(h) : h ∈ H_v}. Now let L_v^m = {h(L(v)) : h ∈ H_v^m}.

We say a class H is rankable w.r.t. a deterministic sampling process S if there exists a
ranking r such that for each h ∈ H there is a number N_h such that for every vertex v ∈ V on
the infinite path of h at depth larger than N_h, we have h ∈ H_v^m and L_v^m is coverable.

Intuitively, the smaller the rank of a function in H, the higher the priority we assign to it in the
learning process. Formally, we prove the following theorem:
Theorem 10 A class H is online learnable with finitely many errors w.r.t. a deterministic sampling
process S if and only if H is rankable w.r.t. S.
Proof If H is rankable, we show that H is online learnable with finitely many errors. The prediction
rule works as follows: at each vertex v reached in the traversal, we predict an element that covers L_v^m (if
no such element exists, we predict arbitrarily). By the definition of rankability, the learner makes no errors
after step N_h if the underlying hypothesis is h.

For the converse, suppose Φ is an online learning rule that makes only finitely many errors
no matter which hypothesis h ∈ H is in force. Define the rank of each hypothesis h to be the number of errors Φ
would make against h.

To conclude the proof, we show that this is a valid ranking. Fix any h and let N_h be the
time step of the last error Φ makes on h. First, for any vertex v on the path of h at depth more than N_h,
every h′ ∈ H_v shares its labels with h up to v, so Φ makes at least r(h) errors against h′ by that depth;
hence r(h′) ≥ r(h) and h ∈ H_v^m. Next, suppose some vertex v on the path of h at depth more than N_h
has L_v^m not coverable. Then Φ makes more errors on at least one hypothesis
h′ ∈ H_v^m than it made on h. This is because h and h′ share the same prefix up to vertex v, and hence
produce the same error pattern until then, but h′ differs from Φ's prediction on its label in L_v^m. Thus this hypothesis h′
has a higher rank than h, and therefore could not have been in H_v^m, a contradiction.
Note that both Theorem 9 and Theorem 10 apply only to deterministic sampling processes, where
there is a single mistake tree. In the randomized setup, however, the mistake tree differs across
realizations of the sampling. A sufficient condition is to have a universal ranking that works
for almost all realized mistake trees. The following theorem shows that this happens when the
class is effectively countable. We defer the proof to Appendix A.
Theorem 11 Let H be a set of measurable functions from X → Y. Then H is eas-online learnable
w.r.t. distribution µ and loss ℓ if H is effectively countable w.r.t. µ and ℓ.

Moreover, effective countability of H is necessary for eas-online learnability (w.r.t. any
distribution µ) if Y is countable and ℓ is the classification loss, i.e. ℓ(y_1, y_2) = 1{y_1 ≠ y_2}.
Corollary 12 Let H be the class of all continuous functions from [0, 1] → [0, 1], and let ℓ = 1{|y_1 − y_2| >
B} for some B > 0. Then H is eas-online learnable w.r.t. the loss ℓ and any distribution.

Proof By the Stone-Weierstrass theorem, every function in H is within supremum distance B of some polynomial
with rational coefficients. Since the polynomials with rational coefficients are countable, H is
effectively coverable by a countable class w.r.t. any distribution. Theorem 11 then implies that H is eas-online learnable
w.r.t. any distribution.

Moreover, the covering set in Corollary 12 is independent of the distribution. Therefore, the
class of all continuous functions in Corollary 12 is in fact strongly eas-online learnable.
We provide the following example, which shows that if |Y| ≥ 3 and X = N, then there exist a
class H, a distribution p and a loss ℓ such that H is eas-online learnable w.r.t. p and ℓ, but H is
not effectively countable. Thus, the condition in Theorem 11 cannot be necessary for arbitrary losses
and distributions.
Example 2 Let Y = {0, 1, 2}. We define a loss ℓ as follows: ℓ is symmetric in its two arguments
and

ℓ(0, 1) = 0 and ℓ(0, 2) = ℓ(1, 2) = 1.

We now construct the class H. Since the domain is N, we denote each function in
H as an infinite sequence in {0, 1, 2}^∞. Let B = {0, 1}^∞ be the class of all binary sequences. We
define the transformation T : {0, 1}^∞ → {0, 1, 2}^∞ that, for any b ∈ B, inserts the number 2 after each
appearance of 0 in b. For example, if b = 00100101011 . . . then T(b) reads 02021020210210211 . . .. Now, we define

H = {T(b) : b ∈ B}.

We next define the distribution. By Lemma 6, there is a distribution p over N such that i.i.d.
samples from p appear in increasing order of N eventually almost surely, namely, for all m
large enough, the instance m − 1 appears before instance m with probability 1. Therefore, one
may assume, w.l.o.g., that the samples are generated sequentially as 1, 2, 3, · · · .

To see that the class H is eas-online learnable w.r.t. p and the loss ℓ, for any instance m ∈ N
we simply predict 2 for the label of instance m if the label of instance m − 1 is 0, and predict 0 if
the label of m − 1 is 1 or 2. It is easy to check that this rule makes only finitely many errors with
probability 1.

Clearly, the class H is uncountable, and for any two different elements b_1, b_2 ∈ B we can find
l ∈ N such that ℓ(T(b_1)(l), T(b_2)(l)) = 1. Since p assigns positive probability to every instance and no single
label incurs zero loss simultaneously against 2 and against a label in {0, 1}, no function can effectively cover
two distinct elements of H. Therefore, H is not effectively coverable by any countable set H′.
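The transformation T and the prediction rule of Example 2 are short enough to state as code; the sketch below (ours) checks the worked example 00100101011 → 02021020210210211 and verifies that the rule incurs zero loss once the label of the previous instance is available.

```python
# Sketch of Example 2: T inserts a 2 after every 0; the predictor outputs 2 after a 0
# and outputs 0 after a 1 or a 2 (recall l(0, 1) = 0, so predicting 0 is always safe there).

def loss(y1, y2):
    return 0 if y1 == y2 or {y1, y2} == {0, 1} else 1

def T(bits):
    out = []
    for b in bits:
        out.append(b)
        if b == 0:
            out.append(2)
    return out

def predict_next(previous_label):
    return 2 if previous_label == 0 else 0

if __name__ == "__main__":
    labels = T([0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1])
    print("T(b) =", "".join(map(str, labels)))              # 02021020210210211
    total = sum(loss(predict_next(labels[m - 1]), labels[m]) for m in range(1, len(labels)))
    print("loss incurred from the second instance on:", total)   # 0
```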
Even though the class in Example 2 is not effectively countable, it has a trivial universal ranking
that assigns rank 0 to every function in H. We conclude this section with the following conjecture.

Conjecture 13 A class H is eas-online learnable w.r.t. µ iff there exists a universal ranking r of H
such that H is rankable with ranking r for almost every mistake tree generated by µ.
5. Computable learner
As mentioned in the introduction, our setup has a close connection with the learning of recursive
functions, where one takes the domain to be N and presents the instances sequentially,
but in addition requires the learning rule to be computable.

Definition 14 A function h : N → {0, 1} is computable if there exists a Turing machine TM such
that for all n ∈ N, TM(n) halts and outputs h(n), where the number n is given in its binary representation.
Clearly the class of all computable functions from N → {0, 1} is countable, since there are
only countably many Turing machines (Arora and Barak, 2009, Chapter 1). We have the following simple
corollary of Lemma 4.

Corollary 15 Let H be the set of all computable functions from N → {0, 1}, and let µ be an
arbitrary distribution over N. Then H is eas-online learnable using i.i.d. samples generated from µ.
Unfortunately, the learning rule derived from Lemma 4 need not be computable for arbitrary
distributions, as shown by the following theorem of Barzdiņš and Freivald (1972) (restated here in our
notation).

Theorem 16 ((Zeugmann and Zilles, 2008, Thm 5)) Let H be a class of computable functions
from N → {0, 1}. Then there exists a computable learner that eas-online learns H with sequential
sampling if and only if there exists a computable function g : N → N such that the time complexity
of each function in H is eventually dominated by g(n).
It is easy to show, by a simple diagonalization argument, that if H is the class of all computable functions
then the function g in Theorem 16 cannot exist. Therefore, by Lemma 6 there exists a distribution p
such that H is not computationally eas-online learnable even with i.i.d. sampling from p. However,
the following theorem shows that there are also non-degenerate distributions over N under which
the class of all computable functions is computationally eas-online learnable with i.i.d. sampling.
Theorem 17 Let H be the class of all computable functions from N → {0, 1}. Then there exists a
non-degenerate distribution q (i.e. one with infinite support) such that H is computationally eas-online
learnable w.r.t. q.
Proof Let TM_1, TM_2, · · · be a fixed enumeration of all Turing machines. The main idea is
to construct a distribution q such that, with probability 1, for all but finitely many numbers n, the instance n appears much later than any of
{1, ..., n − 1}. We choose the gap period between the first appearances of n − 1 and
n to be max{C(h_i(n)) : i ≤ n}, where C(h_i(n)) is the computation time needed for the i-th computable
function on input n, when computed with the smallest-index Turing machine that computes it.
Such a distribution q exists by Lemma 6.

We now construct a computable predictor using a back-and-forth trick. The predictor works as
follows:

1. Initialize the indices I = J = 1.

2. In the idle period between the first appearances of n − 1 and n, the predictor simulates the
computation of TM_I on input n, one computational step per time step. (For the other
samples encountered in this period, it simply predicts their memorized labels.)

3. At the time step when n is first observed, output the result of the simulation (an arbitrary label if the
simulation has not yet halted). If the result matches the true label, keep I and J. Otherwise:
a. If I < J, set I = I + 1 and keep J;
b. Else, set I = 1 and J = J + 1.

Since the underlying function is computable, there exists a Turing machine TM_t that computes it.
By construction, the index I changes if and only if the predictor makes an error.

We show by contradiction that the predictor makes only finitely many errors. If the predictor made
infinitely many errors, then I would hit t again after the samples have started arriving sequentially with
the prescribed gaps. By the construction of the idle time, TM_t would then finish its computation
and make the correct prediction at all subsequent time steps, so I would never change again, a contradiction.
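The index bookkeeping in this proof can be separated from the Turing-machine simulation. In the sketch below (ours), the enumeration TM_1, TM_2, ... is replaced by a list of Python callables that either return a label within a given step budget or None, and the budget available before instance n stands in for the idle period; only the back-and-forth update of I and J is meant to mirror the proof.

```python
# Sketch of the back-and-forth index scheduling from the proof of Theorem 17.
# A candidate is a callable f(n, budget) -> 0/1 if it "halts" within the budget, else None.
# The candidates, the budgets and the target below are illustrative stand-ins.

def back_and_forth(candidates, target, horizon, budget_for):
    I, J = 0, 0                      # zero-based counterparts of the indices I = J = 1
    errors = 0
    for n in range(1, horizon + 1):
        budget = budget_for(n)       # length of the idle period before n first appears
        guess = candidates[I](n, budget)
        prediction = guess if guess is not None else 0   # arbitrary output if unfinished
        if prediction != target(n):
            errors += 1
            if I < J:                # rule a: advance I within the current sweep
                I += 1
            else:                    # rule b: restart the sweep and enlarge its range
                I, J = 0, J + 1
    return errors

if __name__ == "__main__":
    target = lambda n: n % 2
    candidates = [
        lambda n, b: 1,                              # constant guesser
        lambda n, b: None,                           # never halts within any budget
        lambda n, b: (n % 2) if b >= n else None,    # computes the target in about n steps
    ]
    print("errors:", back_and_forth(candidates, target, horizon=200,
                                    budget_for=lambda n: 2 * n))
```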
Remark 18 Note that Theorem 17 does not contradict the impossibility result implied by Theorem 16
because, even though the gap periods constructed in the proof of Theorem 17 are deterministic in an
information-theoretic sense, they are not (computationally) known to the learner, i.e. the
learner can never (computationally) figure out when the sample n will arrive.
We now consider a different scenario where the learner is required not only to be computable,
but also to have limited computational resources. We need the following notion. An online learning
scheme is a triplet (D, R, Φ), where

1. D ∈ {0, 1}^* is an unlimited database which the learner can use to store anything learned,
initially the empty string;

2. Φ is the predictor, which maps N → {0, 1} using the database D as an oracle;

3. R is the recorder, which maps {0, 1}^* × N × {0, 1} → {0, 1}^* and updates the database D
every time a new example and its label are revealed. Where needed, we use D_n to denote the
state of the database after n instances and their labels have been revealed.
Definition 19 (Exact online learnability) Let H be a class of functions from N → {0, 1}. H is said
to be exactly online learnable if there is an online learning scheme (D, R, Φ) such that for all h ∈ H

∑_{n=1}^∞ 1{Φ_{D_n}(n) ≠ h(n)} < ∞,

where D is updated after every step n via

D_{n+1} = R(D_n, n, h(n)).

For a computable online learning scheme (D, R, Φ), we specify R and Φ as algorithms.
The time complexity of R, Φ and h is expressed in terms of log n, i.e. the size of the binary representation
of n. However, R and Φ may use D as an oracle at no computational cost.
We focus on the worst-case time complexity of R and Φ, i.e. over the most inconvenient
database and target function. We say that an online learning scheme runs in uniformly
exponential time if, no matter what the database D is and which function h is in force, there exist
c, N ∈ N such that both R and Φ run in exp(c log n) = n^c time for all n ≥ N. We define uniformly
polynomial time similarly.
Theorem 20 Let H be the class of all exponential time computable functions from N → {0, 1}.
Then there is no computable online learning scheme (D, R, Φ) that exactly learns H such that both
R and Φ run in uniformly exponential time.
We defer the proof of Theorem 20 to Appendix A. We make the following conjecture for
polynomial time computable functions.

Conjecture 21 Let H be the class of all polynomial time computable functions from N → {0, 1}.
Then there is no computable online learning scheme (D, R, Φ) that exactly learns H such that both
R and Φ run in uniformly polynomial time.
While we are unable to prove the conjecture, we can prove the following specialized version
(see Appendix A for a proof). We say that the recorder R is a naive recorder if it simply appends
h(n) to the database D at the update of step n.

Theorem 22 Let H be the class of all functions from N → {0, 1} whose time complexity is eventually
bounded by log^{k+1} n for some k ≥ 1. Then there is no computable online learning scheme
(D, R, Φ) that exactly learns H such that R is a naive recorder and Φ runs uniformly in log^k n time.
We note the following interesting connection with the time hierarchy theorem. Denote

H_k = { h : ∃ TM s.t. ∀n ∈ N, TM(n) = h(n) and C(TM, n) ≤ log^k n },

where C(TM, n) is the running time of TM on input n.

Corollary 23 Theorem 22 implies that H_{k+2} \ H_k is not empty.

Remark 24 The time hierarchy theorem (Arora and Barak, 2009, Theorem 3.1) does not necessarily
imply Theorem 22, since the predictor uses the database D as a zero-computational-cost oracle
when computing on input n.
6. Noisy labeling
In the previous sections we considered various setups in the eas-online learning paradigm in which the
labels are presented accurately. However, the labels may be corrupted by noise.
Perhaps surprisingly, we show in this section that our eas-online learning paradigm is resistant
to independent noise even if the samples are presented sequentially.

For simplicity, we assume the domain is X = N and the labels are binary. We consider
the following noise process. At each time step n, denote by Z̃_n = (X_n, h(X_n) ⊕ R_n) the noisy
sample-label pair, where ⊕ is addition modulo 2 and R_n is a binary random variable with Pr(R_n = 1) ≤ η;
the variables R_n are independent of the instances and labels, and independent across n.
We prove the following result.
Theorem 25 Let H be a function class from N → {0, 1}, and let p be a distribution over N. If H is
effectively countable w.r.t. p and η < 1/2, then there exists a learning strategy Φ such that for all
h ∈ H we have

Pr( ∑_{n=1}^∞ ℓ(Φ(Z̃_1^{n−1}, X_n), h(X_n)) < ∞ ) = 1,

where the randomness comes from both the sampling and the noise.
Proof Let H′ be a countable class that effectively covers H and in which any two distinct functions
differ on a subset of N of positive measure. Let H′_1 ⊆ H′_2 ⊆ · · · ⊆ H′ be a nesting such that |H′_k| = k and
⋃_{k∈N} H′_k = H′.

The prediction proceeds in stages. At stage 0, we initialize by choosing an arbitrary function h_0
from H′. Let h_{k−1} be the function held after stage k − 1; we use h_{k−1} to make predictions
throughout stage k. At stage k we try to identify the underlying function as if it were in H′_k. To do
so, we note that each function in H′_k differs from every other function in H′_k on a subset of N of positive measure. We
observe samples long enough that there are m_k samples (instances may repeat) in
the difference set of every pair of functions. We then define h_k ∈ H′_k to be a function whose
Hamming distance to the noisy labels of the samples at the difference positions is smaller than that of all other
functions in H′_k. If no such function exists, we choose an arbitrary function in H′_k.

We now show that this strategy works. Choose m_k = 3 log k / (2(1/2 − η)^2). By
the Hoeffding bound and a union bound, with probability at least 1 − 1/k^2 we identify the correct
underlying function at stage k, provided it lies in H′_k. Since the underlying function must lie in some
H′_t, the predictor identifies it after finitely many stages; and by the Borel-Cantelli lemma, we miss it at
only finitely many stages with probability 1, since ∑_{k∈N} 1/k^2 < ∞. After that, the strategy makes no further errors. The
theorem follows.
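The staged procedure in this proof can be sketched as follows. The sketch (ours) uses a simplified selection rule, minimizing total disagreement with the noisy labels collected during the stage rather than pairwise disagreement on difference sets, and all concrete hypotheses, distributions and constants are illustrative assumptions.

```python
# Sketch of the staged, noise-tolerant identification from the proof of Theorem 25
# (simplified: the stage winner minimizes total disagreement with the stage's noisy labels).
import math, random

def noisy_stage_learner(hypotheses, true_index, eta, sample, stages, rng):
    h_hat = hypotheses[0]                       # stage-0 initialization
    truth = hypotheses[true_index]
    mistakes = 0
    for k in range(1, stages + 1):
        m_k = max(1, math.ceil(3 * math.log(k + 1) / (2 * (0.5 - eta) ** 2)))
        candidates = hypotheses[:min(k, len(hypotheses))]    # the prefix class H'_k
        batch = []
        for _ in range(m_k):
            x = sample(rng)
            mistakes += int(h_hat(x) != truth(x))            # prediction made with h_{k-1}
            noisy = truth(x) ^ int(rng.random() < eta)       # label flipped with prob. <= eta
            batch.append((x, noisy))
        h_hat = min(candidates, key=lambda h: sum(h(x) != y for x, y in batch))
    return mistakes

if __name__ == "__main__":
    rng = random.Random(1)
    hypotheses = [lambda x, r=r: int(x % (r + 2) == 0) for r in range(10)]
    sample = lambda rng: rng.randrange(1, 1000)
    print("errors against the true labels:",
          noisy_stage_learner(hypotheses, true_index=7, eta=0.2,
                              sample=sample, stages=60, rng=rng))
```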
In the proof of Theorem 25, the exact probability Pr(R_n = 1) is not required to be known; only the
upper bound η needs to be known. However, even the requirement of knowing η can be eliminated
by a simple application of the doubling trick. Note that Theorem 25 also holds if the instances
are presented sequentially. In that case, one observes exactly one noisy label for each instance.
Therefore, there is no way to estimate any single label (which would have been possible in the i.i.d.
case). However, the proof of Theorem 25 shows that one can still identify the correct
function with arbitrarily high confidence by leveraging the noisy labels of different instances.
7. Discussion
In this paper, we introduced an online learning paradigm in which the learner is required to make only
finitely many errors with probability one over an infinite horizon. The setup generalizes the well-known
online learning setup of Littlestone (1988) to a more relaxed, non-uniform notion of consistency. We
also related our setup to the classic learning-of-recursive-functions literature surveyed in
Zeugmann and Zilles (2008), and showed that randomness and resource restrictions yield a richer picture
once computability is taken into account. We also investigated the case where noisy labels are presented,
and showed that our setup is resistant to independent Bernoulli noise.

While our setup does not provide bounds (or rates) on how many errors the learner will make,
we emphasize that we are dealing with very rich classes that do not admit uniform consistency results.
Our work focuses on the conceptual foundations of learning and is meant to address large-scale problems
about which we have little prior knowledge. Indeed, our approach also provides a different angle
from which a theorist may rethink "no free lunch" theorems. Philosophically, this is a question that
underlies much of our scientific endeavor: can a physicist make only finitely many errors before finding
a universal theory, or will she perpetually move between models without ever converging on one? See
Cover (1973) and Appendix C for more discussion of this type of problem.
Acknowledgments
This work was supported by NSF grant CCF-1619452 and by the Center for Science of Information
(CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.
References
Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cambridge Uni-
versity Press, 2009.
Jānis Martynovich Barzdiņš and R. V. Freivald. On the prediction of general recursive functions. In
Doklady Akademii Nauk, volume 206, pages 521–524. Russian Academy of Sciences, 1972.
Shai Ben-David, Dávid Pál, and Shai Shalev-Shwartz. Agnostic online learning. In COLT, 2009.
Olivier Bousquet, Steve Hanneke, Shay Moran, Ramon van Handel, and Amir Yehudayoff. A theory
of universal learning. arXiv:2011.04483, 2020.
Thomas M. Cover. On determining the irrationality of the mean of a random variable. The Annals of
Statistics, 1(5):862–871, 1973.
Amir Dembo and Yuval Peres. A topological criterion for hypothesis testing. The Annals of Statis-
tics, pages 106–117, 1994.
David H. Haussler, Nick Littlestone, and Manfred K. Warmuth. Predicting {0, 1}-functions on
randomly drawn points. Information and Computation, 115(2):248–292, 1994.
Sanjeev R Kulkarni and David N. C. Tse. A paradigm for class identification problems. IEEE
Transactions on Information Theory, 40(3):696–705, 1994.
Amir Leshem. Cover’s test of rationality revisited: Computability aspects of hypothesis testing.
In 2006 IEEE 24th Convention of Electrical & Electronics Engineers in Israel, pages 213–216.
IEEE, 2006.
Lihong Li, Michael L Littman, Thomas J Walsh, and Alexander L Strehl. Knows what it knows: a
framework for self-aware learning. Machine learning, 82(3):399–443, 2011.
Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algo-
rithm. Machine learning, 2(4):285–318, 1988.
Neri Merhav and Meir Feder. Universal prediction. IEEE Transactions on Information Theory, 44
(6):2124–2147, 1998.
Tom Mitchell, William Cohen, Estevam Hruschka, Partha Talukdar, Bishan Yang, Justin Betteridge,
Andrew Carlson, Bhanava Dalvi, Matt Gardner, Bryan Kisiel, et al. Never-ending learning. Com-
munications of the ACM, 61(5):103–115, 2018.
Narayana Santhanam and Venkat Anantharam. Agnostic insurability of model classes. Journal
of Machine Learning Research, 16:2329–2355, 2015. URL http://jmlr.org/papers/v16/santhanam15a.html.
Amin Sayedi, Morteza Zadimoghaddam, and Avrim Blum. Trading off mistakes and don’t-know
predictions. In Advances in Neural Information Processing Systems, pages 2092–2100, 2010.
Dale Schuurmans. Characterizing rational versus exponential learning curves. Journal of Computer
and System Sciences, 55(1):140–160, 1997.
Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms.
Cambridge University Press, 2014.
Vladimir N. Vapnik. The nature of statistical learning theory. Springer Science & Business Media,
2013.
Vladimir N. Vapnik and Alexey Y. Chervonenkis. Theory of pattern recognition. 1974.
Changlong Wu and Narayana Santhanam. Being correct eventually almost surely. In Information
Theory (ISIT), 2019 IEEE International Symposium on, pages 1989–1993. IEEE, 2019.
Thomas Zeugmann and Sandra Zilles. Learning recursive functions: A survey. Theoretical Com-
puter Science, 397(1-3):4–56, 2008.
Chicheng Zhang and Kamalika Chaudhuri. The extended Littlestone's dimension for learning with
mistakes and abstentions. In Conference on Learning Theory, pages 1584–1616, 2016.
Appendix A. Omitted proofs
Proof [Proof of Theorem 11] The sufficiency follows directly from Lemma 4. To prove the necessity,
by the argument in the proof of Theorem 7 it suffices to show that there exists a countable set of
measurable functions from X → Y that is dense under the metric

d(f, g) = Pr_{x∼µ}(f(x) ≠ g(x)).

To do so, we use a truncation argument. W.l.o.g., assume Y = N. Since the σ-algebra on X is separable,
for every m ∈ N there exists a countable class P_m of measurable functions that is dense in the measurable functions from X → [m].
We claim that ⋃_{m=1}^∞ P_m is dense in the
measurable functions from X → N. Let h : X → N be an arbitrary measurable function and let ε > 0.
There exists m ∈ N such that

Pr_{x∼µ}( h(x) > m ) ≤ ε/2.

Let h_m be the function with h_m(x) = h(x) if h(x) ≤ m and h_m(x) = 1 otherwise. We have
d(h_m, h) ≤ ε/2. Choosing h′ ∈ P_m with d(h_m, h′) ≤ ε/2 then gives d(h, h′) ≤ ε. This
completes the proof.
Proof [Proof of Theorem 20] We prove a stronger version of the theorem. Let H be the
set of functions that can be computed within time n^{k+1} (hence exponential in log n); we show that
there is no online learning scheme (D, R, Φ) that exactly learns H with both R and Φ uniformly running
in n^k/2 time.

The proof uses a diagonalization argument. Assume, to the contrary, that such a scheme
(D, R, Φ) exists. We construct the following algorithm ExpDiag(n):

1. Input: n.
2. If n = 1, output 1.
3. Run the scheme (D, R, Φ) on the samples (k, ExpDiag(k)) for k ≤ n − 1.
4. Output 1 − Φ_D(n).

To analyze the running time of ExpDiag(n), let f(n) be the time needed to compute ExpDiag on
input n. We have

f(n) ≤ n^k + f(n − 1),

since R and Φ run in n^k/2 time by assumption and one may reuse the database across the recursion steps.
Hence

f(n) ≤ ∑_{i=1}^{n} i^k ≤ n^{k+1}.

Therefore the function computed by ExpDiag is in H. However, by construction ExpDiag(n) ≠ Φ_D(n)
for all n ≥ 2, which yields a contradiction.
Remark 26 Note that a similar argument does not directly generalize to functions computable in
time log^{k+1} n against predictors running in log^k n time, because establishing the database in step 3
requires Ω(n) steps if done naively. We nevertheless believe the statement to be true, as expressed in Conjecture 21.
Theorem 22 establishes the conjecture when the database simply appends the true labels without
processing.
Proof [Proof of Theorem 22] We use a standard approach that reduces the polynomial case to the exponential
case. Let H be the class of all functions computable in time log^{k+1} n. Suppose, to the
contrary, that (D, R, Φ) is a scheme that exactly learns H such that R is a naive recorder and Φ runs in
log^k n time. For any function h′ computable in n^{k+1} time, we construct a function h computable
in log^{k+1} n time as follows:

h(n) = h′(t) if n = 2^t for some t ∈ N, and h(n) = 1 otherwise.

Consider the following predictor Φ′:

1. Input: n and the naive recording D of h′ up to n − 1.
2. Simulate Φ on input 2^n as follows: if Φ queries the database at a position T with T = 2^t, answer with
the t-th position of D; at all other positions, we know h takes the value 1.
3. Output Φ_D(2^n).

By assumption Φ makes finitely many errors on h, so Φ′ makes finitely many errors on h′ as well.
Clearly, Φ′ runs in (log 2^n)^k = n^k time, and it exactly learns the class of functions computable
in n^{k+1} time. This contradicts Theorem 20.
Proof [Proof of Corollary 23] Suppose not. We construct an online learning scheme (D, R, Φ)
that exactly learns H_{k+2} such that R is a naive recorder and Φ runs in log^{k+1} n time, contradicting
Theorem 22.

To do so, enumerate all Turing machines TM_1, TM_2, · · · . The predictor Φ maintains two
indices t and i. At each time step n, both t and i are initialized to 1. The predictor simulates log^k i
instructions of TM_t on input i. If TM_t halts within those log^k i instructions and its output matches h(i)
(read from the database), keep t and increment i = i + 1. Otherwise, move to the next machine, t = t + 1, and increment i = i + 1.
This process continues for a net time of (1/2) log^{k+1} n (this includes the time taken to simulate steps on
all Turing machines so far and all relevant overheads, including those for incrementing indices).
Then set i = n and simulate TM_t(n) until the total running time reaches log^{k+1} n. If TM_t halts by then, the
predictor outputs TM_t(n); otherwise it outputs 1.

We show that this scheme is an exact online learning scheme for H_{k+2}. For any h ∈ H_{k+2}, there
is a Turing machine TM_j that outputs h(n) within log^k n time for every n, since H_{k+2} = H_k by
assumption. Note that t advances past a machine only when that machine fails to reproduce a recorded label in time.
Since in addition (1/2) log^{k+1} n → ∞, we know that t will hit j eventually. Once t hits j, it is never incremented again, since TM_j(i)
outputs h(i) within log^k i time for all i. If the time runs out before the simulation of
TM_j completes, t is again not incremented. Finally, since the overhead of a universal Turing machine
is a log log n factor in time complexity (Arora and Barak, 2009, Theorem 1.13), we know that for
large enough n the algorithm above actually completes the simulation of TM_j(n).
Appendix B. Non-uniform time complexity
If we relax the requirement of worst-case time complexity, we can show that polynomial time computable
functions can be learned exactly in (point-wise) polynomial time.

Theorem 27 Let H be the class of all polynomial time computable functions from N → {0, 1}.
Then there exists a computable online learning scheme (D, R, Φ) that exactly learns H such
that both R and Φ run in point-wise polynomial time.
Proof The database D is a triplet D = (D_1, D_2, k), where D_1, D_2 are encodings of Turing
machines and k ∈ N. The predictor Φ works as follows:

1. Input: n, with database D.
2. If D_1 is not a valid encoding of a Turing machine, output 1.
3. Simulate the Turing machine encoded by D_1 on input n for log^k n steps.
4. If D_1 halts within log^k n steps, output D_1(n). Else, output 1.

The recorder R works as follows:

1. Input: the old database D, the prediction Φ_D(n), and n, h(n).
2. Output: the new database D.
3. If n = 1, set D = (φ, φ, 1), where φ is the empty string.
4. If Φ_D(n) = h(n), leave D unchanged and return.
5. Else:
a. If D_1 ≠ D_2, set D_1 = NEXT(D_1), where NEXT maps a string in {0, 1}^* to the next
string in alphabetical order, and set k = k + 1.
b. Else, set D_1 = φ and D_2 = NEXT(D_2).

Suppose that the underlying function h ∈ H can be computed in O(log^t n) time by some Turing
machine with encoding D_h. By the construction of R, the database D changes iff the prediction
made by Φ is wrong.

We show that the database changes only finitely many times, thus proving that the total number
of errors is finite. If the database changed infinitely many times, then D_1 would hit D_h infinitely
many times. D_h is simulated for log^k n steps in Step 3 of the predictor, where k increases every
time there is an error while D_1 ≠ D_2. Therefore, k will eventually be large enough that D_h is simulated for enough
steps to halt and output the correct value of h(n). From then on, the database cannot change,
contradicting the assumption that the database changes infinitely many times.
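The scheme in this proof translates almost directly into code once the enumeration of Turing machines is replaced by an enumeration of programs with an explicit step budget. In the sketch below (ours), programs are indexed by integers, so NEXT is simply an increment; the program behavior, the target function and the step accounting are illustrative stand-ins.

```python
# Sketch of the scheme (D, R, Phi) from the proof of Theorem 27, with Turing machines
# replaced by callables prog(n, steps) -> 0/1 if they halt within `steps`, else None.
import math

def phi(database, n, programs):
    d1, d2, k = database
    if d1 >= len(programs):                   # not a valid "encoding": default output
        return 1
    budget = int(math.log(max(n, 2)) ** k)    # simulate for log^k n steps
    result = programs[d1](n, budget)
    return result if result is not None else 1

def recorder(database, n, prediction, true_label):
    d1, d2, k = database
    if prediction == true_label:              # step 4: correct prediction, keep D
        return database
    if d1 != d2:                              # step 5a: advance D_1 and grow the exponent k
        return (d1 + 1, d2, k + 1)
    return (0, d2 + 1, k)                     # step 5b: reset D_1 and advance D_2

if __name__ == "__main__":
    programs = [
        lambda n, steps: 0,                                              # wrong on odd n
        lambda n, steps: None,                                           # never halts in time
        lambda n, steps: (n % 2) if steps >= math.log(max(n, 2)) ** 2 else None,
    ]
    target, database, errors = (lambda n: n % 2), (0, 0, 1), 0
    for n in range(1, 5000):
        pred = phi(database, n, programs)
        errors += int(pred != target(n))
        database = recorder(database, n, pred, target(n))
    print("total errors:", errors)            # finite: the third program is eventually reached
```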
Appendix C. More variations
In contrast to the finite error guarantee, perhaps a more natural non-uniform consistency requirement
for online learning is that the average loss per time step converge to zero. More formally,
one may wish to find a predictor such that

Pr( lim_{N→∞} (1/N) ∑_{n=1}^N ℓ(Φ(Z_1^{n−1}, X_n), h(X_n)) = 0 ) = 1.
However, such a guarantee may not be attractive if one considers consistency alone, since we
can show that the class of all measurable functions from R → {0, 1} is learnable in this sense.

Theorem 28 Let H be the class of all measurable functions from R → {0, 1}, let µ be an arbitrary
distribution over R that is unknown to the learner, and let ℓ be the classification loss. Then there exists a
learning strategy Φ such that for all h ∈ H we have

Pr( lim_{N→∞} (1/N) ∑_{n=1}^N ℓ(Φ(Z_1^{n−1}, X_n), h(X_n)) = 0 ) = 1,

where Z_i and X_i are defined as in Section 2.
Proof [Sketch of Proof] Let G be a countable class of functions from R → {0, 1} that is dense in
H under the distance d(f, g) = Pr_{x∼µ}(f(x) ≠ g(x)). Since the Borel σ-algebra is separable, such a class exists and can be chosen
independently of the underlying distribution. The prediction is partitioned into stages. At stage k, the learner tries
to identify a function h_k in G that is ε_k-close to the underlying function with confidence at least
1 − 1/k²; this can be achieved by a structural risk minimization argument. We use
h_{k−1} to make the predictions during stage k, and use the samples observed during stage k to find
h_k for the next stage. We now choose ε_k small enough that the probability that the average
error ever jumps above ε_k over the infinite horizon is at most 1/k²; this can be done by a Chernoff bound
together with a union bound, observing that ∑_{n=1}^∞ exp(−2nε_k²) converges. The theorem then follows from the
Borel-Cantelli lemma, since 1/k² is summable.
One may also consider the scenario where the rate of convergence to zero is controlled. It can be
shown that if the class has finite VC dimension, then a rate of O(log n / n) is achievable in expectation,
see Haussler et al. (1994). However, this does not automatically give an almost sure upper bound
on the cumulative errors, since the errors at different time steps are correlated. We now show that one
can indeed achieve an O(log² n / n) rate almost surely. To do so, we use a doubling trick. We partition the
prediction into stages. At stage k, we see 2^k sample-label pairs and generate a hypothesis
h_k from the observed pairs. We then use h_k to make predictions for another 2^k steps, at the end of
which we move to stage k + 1. To see why this approach works, denote by I_n the indicator
that an error occurred at step n. By construction, the indicators within each phase are
independent. We can therefore employ a multiplicative Chernoff bound to show that, with probability
at least 1 − O(1/2^k), the number of errors at stage k is upper bounded by O(k) (using empirical risk
minimization to obtain h_k suffices for this). Therefore, by the Borel-Cantelli lemma, the
partial sums of the indicators {I_n} up to stage k are eventually upper bounded by O(k²) with
probability 1. Since the first n steps cover at most log(n) stages, the result follows. Note that
the O(log² n / n) rate is not meant to be optimal; it is not hard to see that a similar argument
establishes an O(log n log log n / n) rate almost surely by replacing h_k with an optimal predictor
whose error rate at stage k is O(log k / 2^k) with probability 1 − O(1/k²). We leave it as an open problem to obtain
the optimal almost sure rate for classes of finite VC dimension.
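The doubling-trick schedule just described is easy to implement generically; in the sketch below (our illustration, under one natural reading of the schedule) the hypothesis fitted on the 2^k pairs of stage k is frozen and used to predict throughout stage k + 1, which keeps the error indicators within a stage conditionally independent. The toy learner, sampler and labels are our own choices.

```python
# Sketch of the doubling trick: the predictor fitted on stage k's 2^k pairs is used,
# unchanged, to predict throughout stage k+1. `fit`, `sampler` and `label` are toy choices.
import random

def doubling_trick(fit, sampler, label, total_steps, rng):
    predict = lambda x: 0            # arbitrary initial hypothesis
    step, k, errors = 0, 0, 0
    while step < total_steps:
        stage = []
        for _ in range(2 ** k):      # stage k lasts 2^k steps
            if step >= total_steps:
                break
            x = sampler(rng)
            errors += int(predict(x) != label(x))
            stage.append((x, label(x)))
            step += 1
        predict = fit(stage)         # h_k, frozen for the whole of stage k+1
        k += 1
    return errors

if __name__ == "__main__":
    rng = random.Random(0)
    label = lambda x: int(x >= 0.37)                 # a fixed threshold hypothesis
    sampler = lambda rng: rng.random()               # uniform distribution on [0, 1]
    def fit(data):                                   # ERM: smallest positively-labelled point
        ones = [x for x, y in data if y == 1]
        threshold = min(ones) if ones else 1.0
        return lambda x: int(x >= threshold)
    print("errors over 100000 steps:", doubling_trick(fit, sampler, label, 100_000, rng))
```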
A binary hypothesis class H is said to be closed if, for every measurable function f and distribution µ satisfying

inf_{h∈H} Pr_{x∼µ}[ h(x) ≠ f(x) ] = 0,

we have f ∈ H. Recently, Bousquet et al. (2020) showed that for closed hypothesis classes the expected
rate (i.e. E[I_n]) can only be exponential, linear, or arbitrarily slow (here we use the closedness
notion in place of the adversarial setting of Bousquet et al. (2020)). Note that closedness of the
class is a crucial ingredient in establishing their lower bounds. For example, consider the class R of all linear
threshold functions over [0, 1] with rational parameter; this class can be learned with
finitely many errors almost surely. However, R is not closed and it has an infinite Littlestone tree, so
the result of Bousquet et al. (2020) only implies that the closure of R has E[I_n] = Ω(1/n). It is
therefore an interesting problem to investigate whether a similar phenomenon occurs for general
(i.e. not necessarily closed) classes and in the almost sure setting.