Computing the Non-computable via Sparsity - On ...spars2017.lx.it.pt/index_files/PlenarySlides/Hansen...Computing the Non-computable via Sparsity-On Foundational Computational Barriers

Computing the Non-computable via Sparsity-

On Foundational Computational Barriers inl1 and Total Variation Regularisation

Anders C. Hansen (Cambridge)

Joint work with:

A. Bastounis (Cambridge)V. Vlacic (Cambridge)

SPARS 2017

1 / 60

l1 and TV regularisation

Key optimisation problems in regularisation:

z ∈ argminxJ (x) such that ‖Ax − y‖ ≤ δ, δ ≥ 0, (BP)

z ∈ argminx‖Ax − y‖2 such that J (x) ≤ τ, τ > 0, (CL)

z ∈ argminx‖Ax − y‖2

2 + λJ (x), λ > 0, (UL)

(1)where A ∈ Cm×N , y ∈ Cm and

J (x) = ‖x‖1 or J (x) = ‖x‖TV.

2 / 60

Foundations of Computational Mathematics

I Hilbert’s question on the existence of algorithms for decisionproblems, solved by Turing in

On Computable Numbers, with an Application to theEntscheidungsproblem. Proc. London Math. Soc. (1936)

I Smale’s question on the existence of purely iterative generallyconvergent algorithms for polynomial root-finding, solved byMcMullen and Doyle & McMullen in

Families of rational maps and iterative root-finding algorithmsAnnals of Math. (1987) and Solving the quintic by iteration,Acta Math. (1989).

3 / 60

Warm up question

Question 1Given any of the problems in (1), does there exist an algorithmsuch that for a given ε > 0 the algorithm produces and output thatis no further than ε away from a minimiser?

4 / 60

The basic question

Question 2 (Existence of algorithms)

Given any of the problems in (1), where the input may be givenwith some inaccuracy controlled by ε > 0, does there exists analgorithm that can compute an approximate solution, such that,for an arbitrary ε > 0, the output will be no further than ε awayfrom a true solution? The algorithm can choose ε to be as small asdesired (as a function of ε and the input) to produce the output.

5 / 60

What if the answer is: No...

I For fixed dimensions and any small accuracy parameter ε > 0,one can choose an arbitrary large time T , say T = 50 billionyears, and find an input such that the algorithm will still after50 billion years not have reached ε accuracy.

I Moreover, it is impossible to determine when the algorithmshould halt to achieve an ε accurate solution, and hence thealgorithm will never be able to produce an output where oneknows that the output is at least ε accurate.

I The largest ε for which this failure happens is called theBreakdown-epsilon, εB.

6 / 60

What if the answer is: No...

... then we have the following questions.

Question 3 (Existence of algorithms for subclasses)

(i) Given any of the problems in (1), which subclasses Ω of inputsA and y will provide a positive answer to Question 2?

(ii) Given an algorithm for solving any of the problems in (1), forwhich subclasses Ω of input will the algorithm be accurate inthe sense of Question 2?

(iii) Which subclasses Ω of inputs give negative answers toQuestion 2, and what is the Breakdown-epsilon?

7 / 60

Condition

I Condition of a matrix Cond(A) = ‖A‖‖A−1‖.I Condition of a mapping Ξ : Ω ⊂ Cn → Cm, linear or

non-linear, is often given by

Cond(Ξ) = supx∈Ω

limε→0+

supx+z∈Ω

0<‖z‖≤ε

dist(Ξ(x + z),Ξ(x))

‖z‖,

where we allow for multivalued functions by definingdist(Ξ(x),Ξ(z)) = minx∈Ξ(x),z∈Ξ(z) ‖x − z‖.

8 / 60

Condition

I If Ξ denotes the solution map to any of the problems in (1)with domain Ω, we define

ρ(A, y) = supδ | ‖A‖, ‖y‖ ≤ δ ⇒ (A + A, y + y) ∈ Ω are feasible,

and this yields the Feasibility Primal (FP) condition number

CFP(A, y) :=max(‖A‖, ‖y‖)

ρ(A, y).

9 / 60

The answer to Question 2

Theorem 1Fix α ≥ 2 and dimensions m,N ∈ N where N ≥ 4. Consider any ofthe problems listed in (1). Then there exists a domain Ω of inputs(A, y) ∈ Rm×N × Rm such that the condition numbers CFP(A, y),Cond(AA∗),Cond(Ξ) ≤ α and moreover ‖I‖∞ ≤ 1 for I ∈ Ω.However,

the answer to Question 2 is no.

Also, the Breakdown-epsilon εB ≥ 1/3.

This statement is valid for any model of computation.

10 / 60

Basis Pursuit and the Halting problem

Consider the Basis Pursuit problem of computing

z ∈ argminxJ (x) such that ‖Ax − y‖ ≤ δ, δ ≥ 0.

Suppose the answer to Question 2 was yes for BP, even if werestricted ourself to problems with a unique minimiser.

Then that would imply decidability of the famous undecidableHalting problem.

11 / 60

The Halting problem

Wikipedia:

In computability theory, the halting problem is the problem ofdetermining, from a description of an arbitrary computer programand an input, whether the program will finish running or continueto run forever.

12 / 60

Linear Programming (LP)

Consider the problem of computing

z ∈ argminx〈x , c〉 such that Ax = y , x ≥ 0,

where A ∈ Rm×N , c ∈ RN and y ∈ Rm.

Suppose the answer to Question 2 was yes for LP, even if werestricted ourself to problems with a unique minimiser.

Then that would imply decidability of the famous undecidableHalting problem.

13 / 60

Solving linear systems

Consider the problem of computing x where

Ax = y

where A ∈ CN×N is invertible.

Then the answer to Question 2 is yes!

14 / 60

Basic Questions

Given the paradoxes established above there are three mainquestions:

I How can one trust the output of an algorithm?

I Why do many modern algorithms work so well on manyproblems in applications?

I Do practitioners have to change their approach?

15 / 60

MATLAB tests (Basis Pursuit)

>> y = 1

for k = 1:10

x = 10^(-k);

A = [1-x,1];

[z,info] = spgl1( A, y, [],[], [], opts);

iterations(k) = info.iter;

error(k) = norm(z-true);

end

16 / 60


>> y = 1

for k = 1:10

x = 10^(-k);

A = [1-x,1];

[z,info] = spgl1( A, y, [],[], [], opts);

iterations(k) = info.iter;

error(k) = norm(z-true);

end

iterations =

4 4 1593 133005 7834764 3 4 4 6 4

error =

0 0 1.4e-09 4.3e-08 4.4e-05 0.7 0.7 0.7 0.7 0.7

17 / 60


>> A = [1,1]; y = 1

cvx_begin

variable x(n);

minimize(norm(x,1));

subject to

norm(A*x-y) <= 1/32

cvx_end

x =

0.608182913565274

0.360567086434726

norm(x - true) = 0

18 / 60


>> A = [1-10^(-15),1]; y = 1

cvx_begin

variable x2(n);

minimize(norm(x2,1));

subject to

norm(A*x2-y) <= 1/32

cvx_end

x2 =

0.157256056011886

0.811493943988114

norm(x - x2) = 0.637706877590281

norm(x2 - true2) = 0.222393647177312

19 / 60

MATLAB tests (Linear Programming)

>> y = 1, c = [1,1]

for k = 1:10

x = 10^(-k);

A = [1-x,1];

z = linprog(c,[],[],A,b,[0,0],[100,100])

error(k) = norm(z-true)

end

20 / 60

MATLAB tests (Linear Programming)

>> y = 1, c = [1,1]

for k = 1:10

x = 10^(-k);

A = [1-x,1];

z = linprog(c,[],[],A,b,[0,0],[100,100])

error(k) = norm(z-true)

end

error =

8.7e-10 1.1e-10 3.0e-06 8.7e-11 6.6e-5

7.0e-6 7.0e-6 0.2 0.6 0.7

21 / 60

MATLAB tests (Solving Linear Systems)

>> y = [1,1]’ ; A = [1,0;0,2*10^(-16)];

>> A\y

Warning: Matrix is close to singular or badly scaled.

Results may be inaccurate. RCOND = 2.000000e-16.

ans =

1.0e+15 *

0.000000000000001

5.000000000000000

22 / 60

Basic Questions



- In general one cannot, unless one has extra information aboutthe problem.



23 / 60

Why do many modern algorithms work so well?

Experiment from ”Undersampling improves fidelity of physical imaging and thebenefits grow with resolution”, B. Roman, R. Calderbank, B. Adcock D.Nietlispach, M. Bostock, I. Calvo-Almazn, M. Graves A. Hansen, PNAS (inrevision).

Figure : Standard 3D MRI headscan. Scanning time = 15 min (ourexperiment is done at Cambridge University Hospital).

24 / 60


Experiment from ”Undersampling improves fidelity of physical imaging and thebenefits grow with resolution”, B. Roman, R. Calderbank, B. Adcock D.Nietlispach, M. Bostock, I. Calvo-Almazn, M. Graves A. Hansen, PNAS (inrevision).

Figure : Left: Standard full sampling. Right: Resolution enhancing withcompressed sensing solving BP with SPGL1 and δ = 10−5. Scanningtime for both = 15 min

25 / 60

How do you halt

How do you set the halting parameters in your code?

26 / 60

Basic Questions



- In general one cannot, unless one has extra information aboutthe problem.


- Sparsity and the Breakdown-epsilon


- Yes, if one wants to make sure that the computation is reliable.

27 / 60

One vs several minimisers

Question: Do problems become easier if they have uniqueminimisers?

Answer: Well, .... it’s complicated.

28 / 60

Distance to solutions with several minimisers

Consider the distance to the solution with several minimisers bythe following formula

%(A, y) = supδ : ‖A‖, ‖y‖ ≤ δ⇒ (A + A, y + y) ∈ Ω, |Ξ(A + A, y + y)| = 1,

and this yields the RCC condition number

CRCC(A, y) :=max(‖A‖, ‖y‖)

%(A, y).

29 / 60

Condition

I Condition of a matrix Cond(A) = ‖A‖‖A−1‖.I Condition of a mapping Ξ : Ω ⊂ Cn → Cm, linear or

non-linear, is often given by

Cond(Ξ) = supx∈Ω

limε→0+

supx+z∈Ω

0<‖z‖≤ε

dist(Ξ(x + z),Ξ(x))

‖z‖.

I Distance to infeasibility

CFP(A, y) :=max(‖A‖, ‖y‖)

ρ(A, y).

I Distance to solutions with several minimisers

CRCC(A, y) :=max(‖A‖, ‖y‖)

%(A, y).

30 / 60

Condition

31 / 60

Sparsity

Definition 2 (Robust Nullspace Property)

A matrix U ∈ Cm×n satisfies the `2 robust nullspace property oforder s if there is a ρ ∈ (0, 1) and a τ > 0 such that

‖vS‖2 ≤ρ√s‖vSc‖1 + τ‖Uv‖2 (2)

for all s-sparse sets S and vectors v ∈ Cn.

32 / 60

Sparsity

Theorem 3 (Robust Nullspace Property)

Suppose that a matrix A ∈ Cm×N satisfies the robust null spaceproperty of order s with constants 0 < ρ < 1 and τ > 0. Then, forany x ∈ CN and y = Ax, a solution

x ∈ arg min ||z ||1 subject to ||Az − y ||2 ≤ δ

satisfies

‖x − x‖1 ≤2(1 + ρ)

(1− ρ)σs(x)1 +

4τ

1− ρδ.

33 / 60

The Paradoxes

Conditions on Ω 3 (A, y) andΞ

Problems Q 2 Breakd.-eps.

Cond(AA∗), CFP(A, y),Cond(Ξ) ≤ 2, ‖A‖, ‖y‖ ≤ 1

All in (1) l1

and TVno εB ≥ 1/3

y = Ax , x is s-sparse,A is Bernoulli or subsampledHadamard with RNP of order sand CRCC(A, y) =∞

BP(δ = 0)with l1

yes

CRCC(A, y), ‖A‖, ‖y‖ ≤ 1 BP with l1 no εB ≥ 1/3

y = Ax , x is s-sparse, A hasthe RNP of order s

BP(δ = 0)with l1

yes

34 / 60

The Paradoxes




BP(δ = 0)with l1

yes


BP(δ > 0),UL with l1

no∗ εB ∈ [a,Cδ]

y = Ax , x is s-sparse, A hasthe RNP of order s and ρ > 1

BP(δ ≥ 0)with l1

no εB > 0

∃ (A, y) ∈ Ω s.t. |Ξ(A, y)| > 1,ε > 0 s.t. ∀ semidefinite diag-onal D with ‖D‖ < ε, (y ,U +UD) ∈ Ω

BP, ULwith l1

no εB > 0

∃ Ω ⊂ Ω above, yet CL with l1 yes35 / 60

The Paradoxes




BP(δ = 0)with l1

yes


BP(δ > 0),UL with l1

no∗ εB ∈ [a,Cδ]

‖A‖ ≤ M, A is invertible Lin. sys-tem

yes

‖A‖, ‖y‖, ‖c‖ ≤ 1,|Ξ(A, y , c)| = 1

Lin. Prog. no εB ≥ 1/3

36 / 60

Why do modern algorithms work?

Sparsity and the Breakdown-epsilon

(1) Basis Pursuit (δ = 0): If one restricts to the set Ω of problemsof the form y = Ax , where x is s-sparse and A satisfies theRobust Nullspace Property (RNP) of order s, then the answerto Question 2 is yes.

(2) Basis Pursuit (δ > 0): With exactly the same assumptions asin (1) the answer to Question 2 is no.

However the Breakdown-epsilon εB satisfies

εB ≤ Cδ,

where C depends on ρ and τ in the RNP.

37 / 60

The Breakdown-epsilon

The assumptions on the algorithm to get the positive answers toQuestion 2 and the bounds on the Breakdown-epsilon:

We assume existence of an algorithm Γn such that for given ε > 0,A ∈ Cm×N and y ∈ Cm and xn = Γn(A, y , ε) then

lim supn→∞

‖Axn − y‖2 ≤ ε, limn→∞

‖xn‖1 = BPε(A, y),

where BPε(A, y) is the real value given by

min ‖x ′‖1 such that ‖Ax ′ − y‖ ≤ ε.

38 / 60


Figure : Left: Computable problem. Right: Non-computable problemwith Breakdown-epsilon 0 < εB ≤ 10−4.

39 / 60

Consequences and open problems

I There is a rich classification theory describing the boundariesof what computers can achieve in regularisation and in generalin modern mathematics of information. This classificationtheory is not understood.

I Which problems will give positive answers to Question 2, andwhich problems have reasonable Breakdown-epsilons?

I Practitioners must know these things to assure accurateoutput from their code.

I Are there other conditions than sparsity that are sufficient toget positive answers to Question 2.

40 / 60

Computational mathematics before the computer

Hilbert (1928): Can a computer decide things? For example,given some finite data input, can a computer alway verify if astatement (depending on the data) is true or false.

41 / 60

Computational mathematics before the computer

I Turing (1936): To answer this one needs a definition of whata computer is.

I Turing (1936): The answer to Hilbert’s question is: NO!

42 / 60

The Turing machine

43 / 60

Turing machines and modern computers

This computer

is not a Turing machine, it uses floating point arithmetic

44 / 60

The Smale program

Smale (1980s): If modern computers are not Turing machines,and the analysis of algorithms in scientific computing, for exampleon convergence of Newton’s method, is done without a Turingmachine, should we reconsider the model?

45 / 60

The Blum-Schub-Smale machine

Blum, Shub, Smale (late 80s): Blum, Shub and Smale createdthe BSS machine that can work with real numbers.

46 / 60

The SCI Hierarchy and Smale’s program

I The Solvability Complexity Index Hierarchy is a classificationhierarchy for all kinds of computational problems.

I The Solvability Complexity Index is the smallest number oflimits needed to compute a problem.

47 / 60

The Setup

(i) Ω is some set, called the primary set,

(ii) Λ is a set of complex valued functions on Ω, called theevaluation set,

(iii) (M, d) is a metric space,

(iv) Ξ : Ω→M, called the problem function.

48 / 60

Computational Problem

Definition 4 (Computational problem)

Given a primary set Ω, an evaluation set Λ, a metric space M anda problem function Ξ : Ω→M we call the collection Ξ,Ω,M,Λa computational problem.

49 / 60

General Algorithm

Definition 5 (General Algorithm)

Given a computational problem Ξ,Ω,M,Λ, a general algorithmis a mapping Γ : Ω→M such that for each A ∈ Ω

(i) there exists a finite subset of evaluations ΛΓ(A) ⊂ Λ ,

(ii) the action of Γ on A only depends on Af f ∈ΛΓ(A) whereAf := f (A),

(iii) for every B ∈ Ω such that Bf = Af for every f ∈ ΛΓ(A), itholds that ΛΓ(B) = ΛΓ(A).

50 / 60

Tower of Algorithms

Definition 6 (Tower of algorithms)Given a computational problem Ξ,Ω,M,Λ, a tower of algorithms of height kfor Ξ,Ω,M,Λ is a family of sequences of functions

Γnk : Ω → M,

Γnk ,nk−1 : Ω → M,

...

Γnk ,...,n1 : Ω → M,

where nk , . . . , n1 ∈ N and the functions Γnk ,...,n1 at the lowest level in the towerare general algorithms in the sense of Definiton 5. Moreover, for every A ∈ Ω,

Ξ(A) = limnk→∞

Γnk (A),

Γnk (A) = limnk−1→∞

Γnk ,nk−1 (A),

...

Γnk ,...,n2 (A) = limn1→∞

Γnk ,...,n1 (A),

(3)

where S = limn→∞ Sn means convergence Sn → S in the metric space M.51 / 60

Solvability Complexity Index

Definition 7 (Solvability complexity index)

I Given a computational problem Ξ,Ω,M,Λ, it is said tohave Solvability Complexity Index SCI(Ξ,Ω,M,Λ)α = k withrespect to a tower of algorithms of type α if k is the smallestinteger for which there exists a tower of algorithms of type αof height k.

I If no such tower exists then SCI(Ξ,Ω,M,Λ)α =∞.I If there exists a tower Γnn∈N of type α and height one such

that for each A ∈ Ω there is an n1 ∈ N such thatΞ(A) = Γn1(A), then we define SCI(Ξ,Ω,M,Λ)α = 0.

52 / 60

Arithmetic Tower

Definition 8 (Arithmetic towers)

Given a computational problem Ξ,Ω,M,Λ we define thefollowing:

(i) An Arithmetic tower of algorithms of height k forΞ,Ω,M,Λ is a tower of algorithms where the lowestfunctions Γ = Γnk ,...,n1 : Ω→M satisfy the following: Foreach A ∈ Ω the action of Γ on A consists of only performingfinitely many arithmetic operations on Af f ∈ΛΓ(A).

53 / 60

Computing spectra

Theorem 9Let Ω = B(`2(N)) and Ξ : A 7→ Sp(A). Then

SCI(Ξ)G = SCI(Ξ)A = 3.

54 / 60

The SCI Hierarchy

Definition 10 (The Solvability Complexity Index Hierarchy)

Consider a collection C of computational problems and let αindicate the type of algorithm allowed. Define

∆α0 := Ξ,Ω ∈ C | SCI(Ξ,Ω)α = 0

∆αm+1 := Ξ,Ω ∈ C | SCI(Ξ,Ω)α ≤ m, m ∈ N,

as well as

∆α1 := Ξ,Ω ∈ C | SCI(Ξ,Ω)α ≤ 1,

and the approximate solution can be computed with error control.

55 / 60

Where in the SCI Hierarchy are the problems?





2 + λJ (x), λ > 0, (UL)

where A ∈ Cm×N , y ∈ Cm and

J (x) = ‖x‖1 or J (x) = ‖x‖TV.

I Question 1: Are the above problems in ∆A1 ?

56 / 60

SCI Hierarchy with inexact input

Definition 11 (∆m-information)

For m ∈ N we say that Λ has ∆m+1-information if each fj ∈ Λ isnot available , however, there is a mapping fj ,nm,...,n1 : Ω→ Q suchthat

limnm→∞

. . . limn1→∞

fj ,nm,...,n1(A) = fj(A). (4)

Similarly, for m = 0 we have that

|fj ,n(A)− fj(A)| ≤ 2−n. (5)

57 / 60

SCI Hierarchy with inexact input

We will use the notation

Ξ,Ω,M,Λ∆m ∈ ∆αk

to denote that the computational problem is in ∆αk when Λ has

∆m-information.

58 / 60

Where in the SCI Hierarchy is the BP problem?





2 + λJ (x), λ > 0, (UL)

where A ∈ Cm×N , y ∈ Cm and

J (x) = ‖x‖1 or J (x) = ‖x‖TV.

I Question 1: Are the above problems in ∆A1 ?

I Question 2: IsΞ,Ω,M,Λ∆1 ∈ ∆A

1

59 / 60

The answer to Question 2

Theorem 12Fix α ≥ 2 and dimensions m,N ∈ N where N ≥ 4. Let Ξ,Ω beone of the computational problems defined above and listed in (1).Then there exists a domain Ω of inputs (A, y) ∈ Rm×N × Rm suchthat the condition numbers CFP(A, y), Cond(AA∗),Cond(Ξ) ≤ αand moreover ‖I‖∞ ≤ 1 for I ∈ Ω. However,

Ξ,Ω∆1 /∈ ∆G1 .

Also, the Breakdown-epsilon εB ≥ 1/3.

60 / 60

Documents

Computing the Non-computable via Sparsity - On ...spars2017.lx.it.pt/index_files/PlenarySlides/Hansen...Computing the Non-computable via Sparsity-On Foundational Computational Barriers