View
3
Download
0
Category
Preview:
Citation preview
Representation Power of Feedforward NeuralNetworks
Based on work by Barron (1993), Cybenko (1989),Kolmogorov (1957)
Matus Telgarsky
Feedforward Neural Networks
I Two node types:I Linear combinations:
x 7→∑i
wixi + w0.
I Sigmoid thresholded linear combinations:
x 7→ σ (〈w, x〉+ w0) .
I What can a network of these nodes represent?n∑i=1
wixi one layer,
n∑i=1
wiσ
ni∑j=1
wjixj + wj0
two layers,...
...
Feedforward Neural Networks
I Two node types:I Linear combinations:
x 7→∑i
wixi + w0.
I Sigmoid thresholded linear combinations:
x 7→ σ (〈w, x〉+ w0) .
I What can a network of these nodes represent?n∑i=1
wixi one layer,
n∑i=1
wiσ
ni∑j=1
wjixj + wj0
two layers,...
...
Forget about 1 layer.
I Target set [0, 3]; target function 1[x ∈ [1, 2]].
I Standard sigmoid σs(x) := 1/(1 + e−x).
I Consider sigmoid output at x = 2.
I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.
0 3
1
0.5
I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.
0 3
1
0.5
I Can symmetrize (w < 0); no matter what, error ≥ 1/2.
Forget about 1 layer.
I Target set [0, 3]; target function 1[x ∈ [1, 2]].I Standard sigmoid σs(x) := 1/(1 + e
−x).
I Consider sigmoid output at x = 2.
I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.
0 3
1
0.5
I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.
0 3
1
0.5
I Can symmetrize (w < 0); no matter what, error ≥ 1/2.
Forget about 1 layer.
I Target set [0, 3]; target function 1[x ∈ [1, 2]].I Standard sigmoid σs(x) := 1/(1 + e
−x).
I Consider sigmoid output at x = 2.
I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.
0 3
1
0.5
I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.
0 3
1
0.5
I Can symmetrize (w < 0); no matter what, error ≥ 1/2.
Forget about 1 layer.
I Target set [0, 3]; target function 1[x ∈ [1, 2]].I Standard sigmoid σs(x) := 1/(1 + e
−x).
I Consider sigmoid output at x = 2.I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.
0 3
1
0.5
I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.
0 3
1
0.5
I Can symmetrize (w < 0); no matter what, error ≥ 1/2.
Forget about 1 layer.
I Target set [0, 3]; target function 1[x ∈ [1, 2]].I Standard sigmoid σs(x) := 1/(1 + e
−x).
I Consider sigmoid output at x = 2.I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.
0 3
1
0.5
I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.
0 3
1
0.5
I Can symmetrize (w < 0); no matter what, error ≥ 1/2.
Forget about 1 layer.
I Target set [0, 3]; target function 1[x ∈ [1, 2]].I Standard sigmoid σs(x) := 1/(1 + e
−x).
I Consider sigmoid output at x = 2.I w ≥ 0 and σ(2w + w0) ≥ 1/2: mess up on right side.
0 3
1
0.5
I w ≥ 0 and σ(2w + w0) < 1/2: mess up on middle bump.
0 3
1
0.5
I Can symmetrize (w < 0); no matter what, error ≥ 1/2.
Meaning of “Universal Approximation”
Target set [0, 1]n; target function f ∈ C([0, 1]n).
I For any � > 0, exists NN f̂ ,
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I This gives NNs f̂i → f pointwise.
I For any � > 0, exists NN f̂ and S ⊆ [0, 1]n, m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
I If (for instance) bounded on Sc, gives NNs f̂i → f m-a.e..
Goal: 2-NNs approximate continuous functions over [0, 1]n.
Meaning of “Universal Approximation”
Target set [0, 1]n; target function f ∈ C([0, 1]n).I For any � > 0, exists NN f̂ ,
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I This gives NNs f̂i → f pointwise.I For any � > 0, exists NN f̂ and S ⊆ [0, 1]n, m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
I If (for instance) bounded on Sc, gives NNs f̂i → f m-a.e..
Goal: 2-NNs approximate continuous functions over [0, 1]n.
Meaning of “Universal Approximation”
Target set [0, 1]n; target function f ∈ C([0, 1]n).I For any � > 0, exists NN f̂ ,
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I This gives NNs f̂i → f pointwise.
I For any � > 0, exists NN f̂ and S ⊆ [0, 1]n, m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
I If (for instance) bounded on Sc, gives NNs f̂i → f m-a.e..
Goal: 2-NNs approximate continuous functions over [0, 1]n.
Meaning of “Universal Approximation”
Target set [0, 1]n; target function f ∈ C([0, 1]n).I For any � > 0, exists NN f̂ ,
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I This gives NNs f̂i → f pointwise.I For any � > 0, exists NN f̂ and S ⊆ [0, 1]n, m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
I If (for instance) bounded on Sc, gives NNs f̂i → f m-a.e..Goal: 2-NNs approximate continuous functions over [0, 1]n.
Meaning of “Universal Approximation”
Target set [0, 1]n; target function f ∈ C([0, 1]n).I For any � > 0, exists NN f̂ ,
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I This gives NNs f̂i → f pointwise.I For any � > 0, exists NN f̂ and S ⊆ [0, 1]n, m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
I If (for instance) bounded on Sc, gives NNs f̂i → f m-a.e..
Goal: 2-NNs approximate continuous functions over [0, 1]n.
Meaning of “Universal Approximation”
Target set [0, 1]n; target function f ∈ C([0, 1]n).I For any � > 0, exists NN f̂ ,
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I This gives NNs f̂i → f pointwise.I For any � > 0, exists NN f̂ and S ⊆ [0, 1]n, m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
I If (for instance) bounded on Sc, gives NNs f̂i → f m-a.e..Goal: 2-NNs approximate continuous functions over [0, 1]n.
Outline
I 2-nn via functional analysis (Cybenko, 1989).
I 2-nn via greedy approx (Barron, 1993).
I 3-nn via histograms (Folklore).
I 3-nn via wizardry (Kolmogorov, 1957).
Overview of Functional Analysis proof (Cybenko, 1989)
I Hidden layer as a basis:
B := {σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R} .
I Want to show cl(span(B)) = C([0, 1]n).I Work via contradiction: if f ∈ C([0, 1]n) far from
cl(span(B)), can bridge the gap with a sigmoid.
Overview of Functional Analysis proof (Cybenko, 1989)
I Hidden layer as a basis:
B := {σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R} .
I Want to show cl(span(B)) = C([0, 1]n).
I Work via contradiction: if f ∈ C([0, 1]n) far fromcl(span(B)), can bridge the gap with a sigmoid.
Overview of Functional Analysis proof (Cybenko, 1989)
I Hidden layer as a basis:
B := {σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R} .
I Want to show cl(span(B)) = C([0, 1]n).I Work via contradiction: if f ∈ C([0, 1]n) far from
cl(span(B)), can bridge the gap with a sigmoid.
Abstracting σ
I Cybenko needs σ discriminates:
µ = 0 ⇐⇒ ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I Satisfied for the standard choices
σs(x) =1
1 + e−x,
1
2(tanh(x) + 1) =
1
2
(ex − e−x
ex + e−x+ 1
)= σs(2x).
I Most results today only need σ approximates 1[x ≥ 0]:
σ(x)→
{1 as x→ +∞,0 as x→ −∞.
Combined with σ bounded&measurable givesdiscriminatory (Cybenko, 1989, Lemma 1).
Abstracting σ
I Cybenko needs σ discriminates:
µ = 0 ⇐⇒ ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I Satisfied for the standard choices
σs(x) =1
1 + e−x,
1
2(tanh(x) + 1) =
1
2
(ex − e−x
ex + e−x+ 1
)= σs(2x).
I Most results today only need σ approximates 1[x ≥ 0]:
σ(x)→
{1 as x→ +∞,0 as x→ −∞.
Combined with σ bounded&measurable givesdiscriminatory (Cybenko, 1989, Lemma 1).
Abstracting σ
I Cybenko needs σ discriminates:
µ = 0 ⇐⇒ ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I Satisfied for the standard choices
σs(x) =1
1 + e−x,
1
2(tanh(x) + 1) =
1
2
(ex − e−x
ex + e−x+ 1
)= σs(2x).
I Most results today only need σ approximates 1[x ≥ 0]:
σ(x)→
{1 as x→ +∞,0 as x→ −∞.
Combined with σ bounded&measurable givesdiscriminatory (Cybenko, 1989, Lemma 1).
Proof of Cybenko (1989)
I Consider the
closed
subspace
S :=
cl
(
span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
)
.
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.
I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.
I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.
I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.
I Exists µ 6= 0 so that L(h) =∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.
I Can think of projecting f onto S⊥.
I Don’t need inner products: Hahn-Banach Theorem.
I Exists µ 6= 0 so that L(h) =∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.
I Can think of projecting f onto S⊥.I Don’t need inner products: Hahn-Banach Theorem.
I Exists µ 6= 0 so that L(h) =∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.
I Exists µ 6= 0 so that L(h) =∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I In Rn, linear functions have form 〈·, y〉 for some y ∈ Rn.
I With infinite dimensions, integrals give inner products.I A form of the Riesz Representation Theorem.
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I In Rn, linear functions have form 〈·, y〉 for some y ∈ Rn.I With infinite dimensions, integrals give inner products.
I A form of the Riesz Representation Theorem.
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I In Rn, linear functions have form 〈·, y〉 for some y ∈ Rn.I With infinite dimensions, integrals give inner products.I A form of the Riesz Representation Theorem.
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I Contradiction: σ is discriminatory.
I L|S = 0 implies ∀w,w0 �∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I Contradiction: σ is discriminatory.I L|S = 0 implies ∀w,w0 �
∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
Proof of Cybenko (1989)
I Consider the closed subspace
S := cl
(span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})
).
I Suppose (contradictorily) exists f ∈ C([0, 1]n) \ S.I Exists bounded linear L 6= 0 with L|S = 0.I Exists µ 6= 0 so that L(h) =
∫h(x)dµ(x).
I Contradiction: σ is discriminatory.I L|S = 0 implies ∀w,w0 �
∫σ(〈w, x〉+ w0)dµ(x) = 0.
I But “discriminatory” means µ = 0 ⇐⇒ ∀w,w0 �∫. . ..
What was Cybenko’s proof about?
I Set S := cl (span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})).
I Another way to look at it.
I Given f ∈ C([0, 1]n), current approximant f̂ ∈ S.I The residue f − f̂ must have nonzero projection onto S.I Add residue projection into f̂ ; does this now match f?
I Can this be turned into an algorithm?
What was Cybenko’s proof about?
I Set S := cl (span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})).I Another way to look at it.
I Given f ∈ C([0, 1]n), current approximant f̂ ∈ S.I The residue f − f̂ must have nonzero projection onto S.I Add residue projection into f̂ ; does this now match f?
I Can this be turned into an algorithm?
What was Cybenko’s proof about?
I Set S := cl (span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})).I Another way to look at it.
I Given f ∈ C([0, 1]n), current approximant f̂ ∈ S.
I The residue f − f̂ must have nonzero projection onto S.I Add residue projection into f̂ ; does this now match f?
I Can this be turned into an algorithm?
What was Cybenko’s proof about?
I Set S := cl (span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})).I Another way to look at it.
I Given f ∈ C([0, 1]n), current approximant f̂ ∈ S.I The residue f − f̂ must have nonzero projection onto S.
I Add residue projection into f̂ ; does this now match f?
I Can this be turned into an algorithm?
What was Cybenko’s proof about?
I Set S := cl (span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})).I Another way to look at it.
I Given f ∈ C([0, 1]n), current approximant f̂ ∈ S.I The residue f − f̂ must have nonzero projection onto S.I Add residue projection into f̂ ; does this now match f?
I Can this be turned into an algorithm?
What was Cybenko’s proof about?
I Set S := cl (span ({σ(〈w, x〉+ w0) : w ∈ Rn, w0 ∈ R})).I Another way to look at it.
I Given f ∈ C([0, 1]n), current approximant f̂ ∈ S.I The residue f − f̂ must have nonzero projection onto S.I Add residue projection into f̂ ; does this now match f?
I Can this be turned into an algorithm?
Outline
I 2-nn via functional analysis (Cybenko, 1989).
I 2-nn via greedy approx (Barron, 1993).
I 3-nn via histograms (Folklore).
I 3-nn via wizardry (Kolmogorov, 1957).
The method of Barron (1993, Section 8)
Input: basis set B,
1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)I Is this familiar?
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).
1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)
I Is this familiar?
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.
2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)
I Is this familiar?
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)
I Is this familiar?
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)
I Is this familiar?
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)
I Is this familiar?
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)
I Is this familiar?
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)I Is this familiar?
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)I Is this familiar? Oh let’s see..
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)I Is this familiar? Oh let’s see.. gradient descent, coordinate
descent, Frank-Wolfe, projection pursuit, basis pursuit,boosting.. as usual, cf. Zhang (2003).
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)I Is this familiar? YES.
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
The method of Barron (1993, Section 8)
Input: basis set B, target f ∈ cl(conv(B)).1. Choose arbitrary f̂0 ∈ B.2. For t = 1, 2, . . .:
2.1 Choose (αt, gt) apx minimizing (tolerance O(1/t2))
infα∈[0,1],g∈B
‖f − (αf̂t−1 + (1− α)g)‖22.
2.2 Update f̂t := αtf̂t−1 + (1− αt)gt.
I Can rescale B so f ∈ cl(conv(B)). (or tweak alg.)I Is this familiar? YES.
I Barron carefully shows rate ‖f − f̂t‖22 = O(1/t).
Greedy approx as coordinate descent
I Intuition: linear operator A with basis B as “columns”:
f̂t = Awt, g = Aej .
I Split the update into two steps:
infg∈B‖f − (f̂t−1 + g)‖22 = inf
i‖f − (Awt−1 +Aei)‖22,
infα∈[0,1]
‖f − (αf̂t−1 + (1− α)gt)‖22.
I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to
supg∈B
〈g, f̂t−1 − f
〉= sup
i(Aei)
>(Awt−1 − f).
I Second version is coordinate descent update!
I (Standard rate proofs use curvature of ‖ · ‖22.)
Greedy approx as coordinate descent
I Intuition: linear operator A with basis B as “columns”:
f̂t = Awt, g = Aej .
I Split the update into two steps:
infg∈B‖f − (f̂t−1 + g)‖22 = inf
i‖f − (Awt−1 +Aei)‖22,
infα∈[0,1]
‖f − (αf̂t−1 + (1− α)gt)‖22.
I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to
supg∈B
〈g, f̂t−1 − f
〉= sup
i(Aei)
>(Awt−1 − f).
I Second version is coordinate descent update!
I (Standard rate proofs use curvature of ‖ · ‖22.)
Greedy approx as coordinate descent
I Intuition: linear operator A with basis B as “columns”:
f̂t = Awt, g = Aej .
I Split the update into two steps:
infg∈B‖f − (f̂t−1 + g)‖22 = inf
i‖f − (Awt−1 +Aei)‖22,
infα∈[0,1]
‖f − (αf̂t−1 + (1− α)gt)‖22.
I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to
supg∈B
〈g, f̂t−1 − f
〉= sup
i(Aei)
>(Awt−1 − f).
I Second version is coordinate descent update!
I (Standard rate proofs use curvature of ‖ · ‖22.)
Greedy approx as coordinate descent
I Intuition: linear operator A with basis B as “columns”:
f̂t = Awt, g = Aej .
I Split the update into two steps:
infg∈B‖f − (f̂t−1 + g)‖22 = inf
i‖f − (Awt−1 +Aei)‖22,
infα∈[0,1]
‖f − (αf̂t−1 + (1− α)gt)‖22.
I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to
supg∈B
〈g, f̂t−1 − f
〉= sup
i(Aei)
>(Awt−1 − f).
I Second version is coordinate descent update!
I (Standard rate proofs use curvature of ‖ · ‖22.)
Greedy approx as coordinate descent
I Intuition: linear operator A with basis B as “columns”:
f̂t = Awt, g = Aej .
I Split the update into two steps:
infg∈B‖f − (f̂t−1 + g)‖22 = inf
i‖f − (Awt−1 +Aei)‖22,
infα∈[0,1]
‖f − (αf̂t−1 + (1− α)gt)‖22.
I Suppose ‖g‖2 = 1 for all g ∈ B; first step equiv to
supg∈B
〈g, f̂t−1 − f
〉= sup
i(Aei)
>(Awt−1 − f).
I Second version is coordinate descent update!
I (Standard rate proofs use curvature of ‖ · ‖22.)
Story for 2-layer NNs so far
I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..
I ..so why aren’t we using this algorithm for neural nets?
I This is not an easy optimization problem:
infα∈[0,1],w∈Rn,w0∈R
‖f − (αf̂t−1 + (1− α)σ(〈w, ·〉+ w0)‖22.
I For a general basis B, must check each basis element..
I From an algorithmic perspective, much remains to be done.
Story for 2-layer NNs so far
I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..
I ..so why aren’t we using this algorithm for neural nets?
I This is not an easy optimization problem:
infα∈[0,1],w∈Rn,w0∈R
‖f − (αf̂t−1 + (1− α)σ(〈w, ·〉+ w0)‖22.
I For a general basis B, must check each basis element..
I From an algorithmic perspective, much remains to be done.
Story for 2-layer NNs so far
I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..
I ..so why aren’t we using this algorithm for neural nets?I This is not an easy optimization problem:
infα∈[0,1],w∈Rn,w0∈R
‖f − (αf̂t−1 + (1− α)σ(〈w, ·〉+ w0)‖22.
I For a general basis B, must check each basis element..
I From an algorithmic perspective, much remains to be done.
Story for 2-layer NNs so far
I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..
I ..so why aren’t we using this algorithm for neural nets?I This is not an easy optimization problem:
infα∈[0,1],w∈Rn,w0∈R
‖f − (αf̂t−1 + (1− α)σ(〈w, ·〉+ w0)‖22.
I For a general basis B, must check each basis element..
I From an algorithmic perspective, much remains to be done.
Story for 2-layer NNs so far
I Can L2 apx any f ∈ C([0, 1]n) at rate O(1/√t)..
I ..so why aren’t we using this algorithm for neural nets?I This is not an easy optimization problem:
infα∈[0,1],w∈Rn,w0∈R
‖f − (αf̂t−1 + (1− α)σ(〈w, ·〉+ w0)‖22.
I For a general basis B, must check each basis element..
I From an algorithmic perspective, much remains to be done.
Lower bound from Barron (1993, Section 10)
I The greedy algorithm gets to search whole basis.
I If an n-element basis Bn is fixed, then
supf∈cl(spanC(B))
inff̂∈span(Bn)
‖f − f̂‖2 = Ω(
1
n1/d
),
the curse of dimension!
I Can defeat this with more layers (cf. Kolmogorov (1957)).
Lower bound from Barron (1993, Section 10)
I The greedy algorithm gets to search whole basis.
I If an n-element basis Bn is fixed, then
supf∈cl(spanC(B))
inff̂∈span(Bn)
‖f − f̂‖2 = Ω(
1
n1/d
),
the curse of dimension!
I Can defeat this with more layers (cf. Kolmogorov (1957)).
Lower bound from Barron (1993, Section 10)
I The greedy algorithm gets to search whole basis.
I If an n-element basis Bn is fixed, then
supf∈cl(spanC(B))
inff̂∈span(Bn)
‖f − f̂‖2 = Ω(
1
n1/d
),
the curse of dimension!
I Can defeat this with more layers (cf. Kolmogorov (1957)).
Outline
I 2-nn via functional analysis (Cybenko, 1989).
I 2-nn via greedy approx (Barron, 1993).
I 3-nn via histograms (Folklore).
I 3-nn via wizardry (Kolmogorov, 1957).
Folklore proof of NN power
I What’s an easy way to write function as sum of others?
I Project onto favorite basis (Fourier, Chebyshev, ...)!
I Let’s use this to build NNs.
I Easiest basis: histograms!
Folklore proof of NN power
I What’s an easy way to write function as sum of others?I Project onto favorite basis (Fourier, Chebyshev, ...)!
I Let’s use this to build NNs.
I Easiest basis: histograms!
Folklore proof of NN power
I What’s an easy way to write function as sum of others?I Project onto favorite basis (Fourier, Chebyshev, ...)!
I Let’s use this to build NNs.
I Easiest basis: histograms!
Folklore proof of NN power
I What’s an easy way to write function as sum of others?I Project onto favorite basis (Fourier, Chebyshev, ...)!
I Let’s use this to build NNs.I Easiest basis: histograms!
Histogram basis for C([0, 1]n)
I Grid [0, 1]n into {Ri}mi=1; consider
f̂(x) =
m∑i=1
ai1[x ∈ Ri],
where target f(x) = ai for some x ∈ Ri.
I Fact: for any � > 0, exists histogram f̂ with
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I Pf. Continuous function over compact set is uniformlycontinuous.
I Now just write individual histogram bars as a NNs.
Histogram basis for C([0, 1]n)
I Grid [0, 1]n into {Ri}mi=1; consider
f̂(x) =
m∑i=1
ai1[x ∈ Ri],
where target f(x) = ai for some x ∈ Ri.I Fact: for any � > 0, exists histogram f̂ with
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I Pf. Continuous function over compact set is uniformlycontinuous.
I Now just write individual histogram bars as a NNs.
Histogram basis for C([0, 1]n)
I Grid [0, 1]n into {Ri}mi=1; consider
f̂(x) =
m∑i=1
ai1[x ∈ Ri],
where target f(x) = ai for some x ∈ Ri.I Fact: for any � > 0, exists histogram f̂ with
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I Pf. Continuous function over compact set is uniformlycontinuous.
I Now just write individual histogram bars as a NNs.
Histogram basis for C([0, 1]n)
I Grid [0, 1]n into {Ri}mi=1; consider
f̂(x) =
m∑i=1
ai1[x ∈ Ri],
where target f(x) = ai for some x ∈ Ri.I Fact: for any � > 0, exists histogram f̂ with
x ∈ [0, 1]n =⇒ |f(x)− f̂(x)| < �.
I Pf. Continuous function over compact set is uniformlycontinuous.
I Now just write individual histogram bars as a NNs.
Histograms via 0/1 NNs
I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].
I 2-D example. First carve out axes.
2
1
1 2
3
2
3
2
2
3
4
3
I Now pass through a second layer 1[input ≥ 3.5].
1
0
Histograms via 0/1 NNs
I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].I 2-D example. First carve out axes.
2
1
1 2
3
2
3
2
2
3
4
3
I Now pass through a second layer 1[input ≥ 3.5].
1
0
Histograms via 0/1 NNs
I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].I 2-D example. First carve out axes.
2
1
1 2
3
2
3
2
2
3
4
3
I Now pass through a second layer 1[input ≥ 3.5].
1
0
Histograms via 0/1 NNs
I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].I 2-D example. First carve out axes.
2
1
1
2
3
2
3
2
2
3
4
3
I Now pass through a second layer 1[input ≥ 3.5].
1
0
Histograms via 0/1 NNs
I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].I 2-D example. First carve out axes.
2
1
1 2
3
2
3
2
2
3
4
3
I Now pass through a second layer 1[input ≥ 3.5].
1
0
Histograms via 0/1 NNs
I Write 1[x ∈ R] as sum/composition of 1[〈w, x〉 ≥ w0].I 2-D example. First carve out axes.
2
1
1 2
3
2
3
2
2
3
4
3
I Now pass through a second layer 1[input ≥ 3.5].
1
0
Histograms via sigmoidal NNs
I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).
I 2-D example. First carve out axes, with fuzz region.
>2-2ε
Histograms via sigmoidal NNs
I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).I 2-D example. First carve out axes, with fuzz region.
>2-2ε
Histograms via sigmoidal NNs
I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).I 2-D example. First carve out axes, with fuzz region.
>2-2ε
Histograms via sigmoidal NNs
I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).I 2-D example. First carve out axes, with fuzz region.
>2-2ε
Histograms via sigmoidal NNs
I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).I 2-D example. First carve out axes, with fuzz region.
>2-2ε
Histograms via sigmoidal NNs
I Write 1[x ∈ R] as sum/composition of σ(〈w, x〉+ w0).I 2-D example. First carve out axes, with fuzz region.
>2-2ε
Histograms apx via sigmoidal NNs, some math
I Histogram region is clearly 2-layer 0/1 NN:
1[∀i � xi ≥ li ∧ xi ≤ ui]
= 1
[n∑i=1
(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n
].
I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).I �/2-apx f with histogram {ai, Ri}mi=1;�/(2m)-apx each bar with sigmoids.
I Final NN f̂(x) :=∑m
i=1 ai(sigmoidal apx to 1[x ∈ Ri]).I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
Histograms apx via sigmoidal NNs, some math
I Histogram region is clearly 2-layer 0/1 NN:
1[∀i � xi ≥ li ∧ xi ≤ ui]
= 1
[n∑i=1
(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n
].
I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).
I �/2-apx f with histogram {ai, Ri}mi=1;�/(2m)-apx each bar with sigmoids.
I Final NN f̂(x) :=∑m
i=1 ai(sigmoidal apx to 1[x ∈ Ri]).I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
Histograms apx via sigmoidal NNs, some math
I Histogram region is clearly 2-layer 0/1 NN:
1[∀i � xi ≥ li ∧ xi ≤ ui]
= 1
[n∑i=1
(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n
].
I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).I �/2-apx f with histogram {ai, Ri}mi=1;�/(2m)-apx each bar with sigmoids.
I Final NN f̂(x) :=∑m
i=1 ai(sigmoidal apx to 1[x ∈ Ri]).I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
Histograms apx via sigmoidal NNs, some math
I Histogram region is clearly 2-layer 0/1 NN:
1[∀i � xi ≥ li ∧ xi ≤ ui]
= 1
[n∑i=1
(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n
].
I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).I �/2-apx f with histogram {ai, Ri}mi=1;�/(2m)-apx each bar with sigmoids.
I Final NN f̂(x) :=∑m
i=1 ai(sigmoidal apx to 1[x ∈ Ri]).
I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
Histograms apx via sigmoidal NNs, some math
I Histogram region is clearly 2-layer 0/1 NN:
1[∀i � xi ≥ li ∧ xi ≤ ui]
= 1
[n∑i=1
(1[xi ≥ li] + 1[xi ≤ ui]) ≥ (2n− 1)/2n
].
I Choose B huge; apx 1[〈w, x〉 ≥ w0] with σ(B(〈w, x〉 −w0)).I �/2-apx f with histogram {ai, Ri}mi=1;�/(2m)-apx each bar with sigmoids.
I Final NN f̂(x) :=∑m
i=1 ai(sigmoidal apx to 1[x ∈ Ri]).I Have fuzz S ⊆ [0, 1]n with m(S) ≥ 1− �,
x ∈ S =⇒ |f(x)− f̂(x)| < �.
Folklore proof discussion
I Curse of dimension!
I Still, high level features useful.
Outline
I 2-nn via functional analysis (Cybenko, 1989).
I 2-nn via greedy approx (Barron, 1993).
I 3-nn via histograms (Folklore).
I 3-nn via wizardry (Kolmogorov, 1957).
Preface to the Kolmogorov Superposition Theorem
I In 1957, at 19, Kolmogorov’s student Vladimir Arnoldsolved Hilbert’s 13th problem:
Can the solution to
x7 + ax3 + bx2 + cx+ 1 = 0
be written as a finite number of 2-variablefunctions?
I Kolmogorov’s generalization is a NN apx theorem.
I The “transfer functions” (i.e., sigmoids), are not fixedacross nodes.
Preface to the Kolmogorov Superposition Theorem
I In 1957, at 19, Kolmogorov’s student Vladimir Arnoldsolved Hilbert’s 13th problem:
Can the solution to
x7 + ax3 + bx2 + cx+ 1 = 0
be written as a finite number of 2-variablefunctions?
I Kolmogorov’s generalization is a NN apx theorem.
I The “transfer functions” (i.e., sigmoids), are not fixedacross nodes.
Preface to the Kolmogorov Superposition Theorem
I In 1957, at 19, Kolmogorov’s student Vladimir Arnoldsolved Hilbert’s 13th problem:
Can the solution to
x7 + ax3 + bx2 + cx+ 1 = 0
be written as a finite number of 2-variablefunctions?
I Kolmogorov’s generalization is a NN apx theorem.
I The “transfer functions” (i.e., sigmoids), are not fixedacross nodes.
Preface to the Kolmogorov Superposition Theorem
I In 1957, at 19, Kolmogorov’s student Vladimir Arnoldsolved Hilbert’s 13th problem:
Can the solution to
x7 + ax3 + bx2 + cx+ 1 = 0
be written as a finite number of 2-variablefunctions?
I Kolmogorov’s generalization is a NN apx theorem.
I The “transfer functions” (i.e., sigmoids), are not fixedacross nodes.
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that
forany f ∈ C([0, 1]n),
there exist continuous {χq}
2n+1
q=1 ,
f(x) =
2n+1
∑q=1
χq
n
∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that
forany f ∈ C([0, 1]n), there exist continuous {χq}
2n+1
q=1 ,
f(x) =
2n+1
∑q=1
χq
n
∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that
forany f ∈ C([0, 1]n), there exist continuous {χq}
2n+1
q=1 ,
f(x) =
2n+1
∑q=1
χq
n∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that
forany f ∈ C([0, 1]n), there exist continuous {χq}2n+1q=1 ,
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that forany f ∈ C([0, 1]n), there exist continuous {χq}2n+1q=1 ,
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that forany f ∈ C([0, 1]n), there exist continuous {χq}2n+1q=1 ,
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that forany f ∈ C([0, 1]n), there exist continuous {χq}2n+1q=1 ,
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.
I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that forany f ∈ C([0, 1]n), there exist continuous {χq}2n+1q=1 ,
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
The Kolmogorov Superposition Theorem
Theorem. (Kolmogorov, 1957)
There exist continuous {ψp,q : p ∈ [n], q ∈ [2n+ 1]}, so that forany f ∈ C([0, 1]n), there exist continuous {χq}2n+1q=1 ,
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I The proof is also histogram based.
I It must remove the “fuzz” gaps.I It must remove dependence on a particular � > 0.
I The magic is within ψp,q!!
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.
I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing ψp,q
I Given resolution k ∈ Z++, build a staggered gridding.
q
I Let Sqk,i1,...,im range over the cells.
I Construct ψp,q so that for each k and q,∑n
p=1 ψp,q maps
each Sqk,i1,...,im to its own specific interval of R.I Holds for all q: gracefully handles the fuzz regions.
I Holds for all k, simultaneously handles all resolutions.
I The functions ψp,q are fractals.
Constructing χq
I Recall goal:
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I Start χq0 := 0; eventually χq := limr→∞ χ
qr.
I To build χqr+1 from χqr:
I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation
in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.
I (I’m leaving a lot out =( )
Constructing χq
I Recall goal:
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I Start χq0 := 0; eventually χ
q := limr→∞ χqr.
I To build χqr+1 from χqr:
I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation
in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.
I (I’m leaving a lot out =( )
Constructing χq
I Recall goal:
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I Start χq0 := 0; eventually χ
q := limr→∞ χqr.
I To build χqr+1 from χqr:
I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation
in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.
I (I’m leaving a lot out =( )
Constructing χq
I Recall goal:
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I Start χq0 := 0; eventually χ
q := limr→∞ χqr.
I To build χqr+1 from χqr:
I Choose kr+1 so gridding fluctuations sufficiently small.
I Make χq some value of f in each cell, smooth interpolationin fuzz.
I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.
I (I’m leaving a lot out =( )
Constructing χq
I Recall goal:
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I Start χq0 := 0; eventually χ
q := limr→∞ χqr.
I To build χqr+1 from χqr:
I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation
in fuzz.
I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.
I (I’m leaving a lot out =( )
Constructing χq
I Recall goal:
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I Start χq0 := 0; eventually χ
q := limr→∞ χqr.
I To build χqr+1 from χqr:
I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation
in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.
I (I’m leaving a lot out =( )
Constructing χq
I Recall goal:
f(x) =
2n+1∑q=1
χq
n∑p=1
ψp,q(xp)
I Start χq0 := 0; eventually χ
q := limr→∞ χqr.
I To build χqr+1 from χqr:
I Choose kr+1 so gridding fluctuations sufficiently small.I Make χq some value of f in each cell, smooth interpolation
in fuzz.I Every x ∈ [0, 1]n hits at least n+ 1 cells Sqk,i1,...,in ; i.e., sinceq ∈ [2n+ 1], more than half the approximations are good.
I (I’m leaving a lot out =( )
Discussion 1
Discussion 2
I It needs different transfer functions..
I ..but these can be approximated by multiple layers!
I It shows the value of powerful nodes, whereas the standardapx results just suggest a very wide hidden layer.
Discussion 2
I It needs different transfer functions..
I ..but these can be approximated by multiple layers!
I It shows the value of powerful nodes, whereas the standardapx results just suggest a very wide hidden layer.
Discussion 2
I It needs different transfer functions..
I ..but these can be approximated by multiple layers!
I It shows the value of powerful nodes, whereas the standardapx results just suggest a very wide hidden layer.
Conclusion
I Some fancy mechanisms give 2-NNs f̂i → f ∈ C([0, 1]n).I Histogram constructions hint to the power of deeper
networks.
Thanks!
Andrew R. Barron. Universal approximation bounds forsuperpositions of a sigmoidal function. IEEE Transactions onInformation theory, 39(3):930–945, 1993.
George Cybenko. Approximation by superpositions of asigmoidal function. Mathematics of Control, Signals, andSystems, 2:303–314, 1989.
Andrei N. Kolmogorov. On the representation of continuousfunctions of several veriables as superpositions of continuousfunctions of one variable and addition. Dokl. Acad. NaukSSSR, 114(5):953–956, 1957. Translation to English: V. M.Volosov.
Tong Zhang. Sequential greedy approximation for certainconvex optimization problems. IEEE Transactions onInformation Theory, 49(3):682–691, 2003.
Recommended