
No Free Lunch versus Occam's Razor in Supervised Learning

Tor Lattimore^1 and Marcus Hutter^{1,2,3}

^1 Research School of Computer Science, Australian National University; ^2 ETH Zurich; ^3 NICTA

    {tor.lattimore,marcus.hutter}@anu.edu.au

    15 November 2011

    Abstract

The No Free Lunch theorems are often used to argue that domain specific knowledge is required to design successful algorithms. We use algorithmic information theory to argue the case for a universal bias allowing an algorithm to succeed in all interesting problem domains. Additionally, we give a new algorithm for off-line classification, inspired by Solomonoff induction, with good performance on all structured (compressible) problems under reasonable assumptions. This includes a proof of the efficacy of the well-known heuristic of randomly selecting training data in the hope of reducing the misclassification rate.

Contents

1 Introduction
2 Preliminaries
3 No Free Lunch Theorem
4 Complexity-based classification
5 Discussion
A Technical proofs

    Keywords

Supervised learning; Kolmogorov complexity; no free lunch; Occam's razor.


    1 Introduction

The No Free Lunch (NFL) theorems, stated and proven in various settings and domains [Sch94, Wol01, WM97], show that no algorithm performs better than any other when their performance is averaged uniformly over all possible problems of a particular type.^1 These are often cited to argue that algorithms must be designed for a particular domain or style of problem, and that there is no such thing as a general purpose algorithm.

On the other hand, Solomonoff induction [Sol64a, Sol64b] and the more general AIXI model [Hut04] appear to universally solve the sequence prediction and reinforcement learning problems respectively. The key to the apparent contradiction is that Solomonoff induction and AIXI do not assume that each problem is equally likely. Instead they apply a bias towards more structured problems. This bias is universal in the sense that no class of structured problems is favored over another. This approach is philosophically well justified by Occam's razor.

The two classic domains for NFL theorems are optimisation and classification. In this paper we will examine classification and only remark that the case for optimisation is more complex. This difference is due to the active nature of optimisation, where actions affect future observations.

Previously, some authors have argued that the NFL theorems do not disprove the existence of universal algorithms, for two reasons.

1. Taking a uniform average is not philosophically the right thing to do, as argued informally in [GCP05].

2. Carroll and Seppi in [CS07] note that the NFL theorem measures performance as misclassification rate, whereas in practice the cost of a misclassification in one direction may be greater than in the other.

We restrict our consideration to the task of minimising the misclassification rate while arguing more formally for a non-uniform prior inspired by Occam's razor and formalised by Kolmogorov complexity. We also show that there exist algorithms (unfortunately only computable in the limit) with very good properties on all structured classification problems.

The paper is structured as follows. First, the required notation is introduced (Section 2). We then state the original NFL theorem, give a brief introduction to Kolmogorov complexity, and show that if a non-uniform prior inspired by Occam's razor is used, then there exists a free lunch (Section 3). Finally, we give a new algorithm inspired by Solomonoff induction with very attractive properties in the classification problem (Section 4).

^1 Such results had been discussed, less formally, long before by Watanabe in 1969 [WD69].


    2 Preliminaries

Here we introduce the required notation and define the problem setup for the No Free Lunch theorems.

Strings. A finite string $x$ over alphabet $\mathcal{X}$ is a finite sequence $x_1 x_2 x_3 \cdots x_{n-1} x_n$ with $x_i \in \mathcal{X}$. An infinite string $x$ over alphabet $\mathcal{X}$ is an infinite sequence $x_1 x_2 x_3 \cdots$. Alphabets are usually countable or finite, while in this paper they will almost always be binary. For finite strings we have a length function defined by $\ell(x) := n$ for $x = x_1 x_2 \cdots x_n$. The empty string of length 0 is denoted by $\epsilon$. The set $\mathcal{X}^n$ is the set of all strings of length $n$. The set $\mathcal{X}^*$ is the set of all finite strings. The set $\mathcal{X}^\infty$ is the set of all infinite strings. Let $x$ be a string (finite or infinite); then substrings are denoted $x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_t$ where $s \le t$. A useful shorthand is $x_{<t} := x_{1:t-1}$.


Definition 3 (Loss function). The loss of algorithm $A$, when applied to classification problem $f$, with training data $X_m$ is measured by counting the proportion of misclassifications in the testing data, $X_u$.

$$L_A(f, X_m) := \frac{1}{|X_u|} \sum_{x \in X_u} [[A(f|_{X_m}, x) \neq f(x)]]$$

where $[[\cdot]]$ is the indicator function defined by $[[\text{expr}]] = 1$ if expr is true and 0 otherwise.

We are interested in the expected loss of an algorithm on the set of all problems, where the expectation is taken with respect to some distribution $P$.

Definition 4 (Expected loss). Let $\mathcal{M}$ be the set of all functions from $X$ to $Y$ and $P$ be a probability distribution on $\mathcal{M}$. If $X_m$ is the training data then the expected loss of algorithm $A$ is

$$L_A(P, X_m) := \sum_{f \in \mathcal{M}} P(f) \, L_A(f, X_m).$$
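As a purely illustrative aside (not part of the original analysis), the following Python sketch computes the loss of Definition 3 and the expected loss of Definition 4 on a toy binary problem. The function names, the dictionary encoding of $f$, and the particular prior are our own choices.

```python
def loss(A, f, X_m, X_u):
    """L_A(f, X_m): fraction of test points in X_u misclassified by A.
    f maps each point to its label; A is called as A(training, x), where
    training is the restriction of f to X_m, given as a dict."""
    training = {x: f[x] for x in X_m}
    return sum(A(training, x) != f[x] for x in X_u) / len(X_u)

def expected_loss(A, P, X_m, X_u):
    """L_A(P, X_m) = sum_f P(f) * L_A(f, X_m); P is a list of (f, prob) pairs."""
    return sum(p * loss(A, f, X_m, X_u) for f, p in P)

# Toy instance: X = {0,1,2,3}, Y = {0,1}, train on {0,1}, test on {2,3}.
constant_one = lambda training, x: 1
f1 = {0: 1, 1: 1, 2: 1, 3: 0}
f2 = {0: 1, 1: 1, 2: 0, 3: 0}
print(loss(constant_one, f1, (0, 1), (2, 3)))                               # 0.5
print(expected_loss(constant_one, [(f1, 0.5), (f2, 0.5)], (0, 1), (2, 3)))  # 0.75
```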

    3 No Free Lunch Theorem

We now use the above notation to give a version of the No Free Lunch Theorem of which Wolpert's is a generalization.

Theorem 5 (No Free Lunch). Let $P$ be the uniform distribution on $\mathcal{M}$. Then the following holds for any algorithm $A$ and training data $X_m \subset X$:

$$L_A(P, X_m) = \frac{|Y| - 1}{|Y|} \qquad (1)$$

The key to the proof is the following observation. Let $x \in X_u$; then for all $y \in Y$, $P(f(x) = y \mid f|_{X_m}) = P(f(x) = y) = 1/|Y|$. This means no information can be inferred from the training data, which suggests no algorithm can be better than random.
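Theorem 5 can be verified by brute force on a tiny instance. The sketch below (our illustration; the particular classifiers are arbitrary) averages the loss uniformly over all 16 labelings of a four-point domain and recovers $(|Y|-1)/|Y| = 1/2$ for every algorithm.

```python
from itertools import product

def loss(A, f, X_m, X_u):
    training = {x: f[x] for x in X_m}
    return sum(A(training, x) != f[x] for x in X_u) / len(X_u)

X, Y = (0, 1, 2, 3), (0, 1)
X_m, X_u = (0, 1), (2, 3)
M = [dict(zip(X, labels)) for labels in product(Y, repeat=len(X))]  # all 16 f

algorithms = {
    "always 0":  lambda tr, x: 0,
    "always 1":  lambda tr, x: 1,
    "copy f(0)": lambda tr, x: tr[0],
}
for name, A in algorithms.items():
    avg = sum(loss(A, f, X_m, X_u) for f in M) / len(M)
    print(name, avg)   # each prints 0.5, independent of the algorithm
```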

Occam's razor/Kolmogorov complexity. The theorem above is often used to argue that no general purpose algorithm exists and that focus should be placed on learning in specific domains.

The problem with the result is the underlying assumption that $P$ is uniform, which implies that training data provides no evidence about the true class labels of the test data. For example, if we have classified the sky as blue for the last 1,000 years then a uniform assumption on the possible sky colours over time would indicate that it is just as likely to be green tomorrow as blue, a result that goes against all our intuition.

How then do we choose a more reasonable prior? Fortunately, this question has already been answered heuristically by experimental scientists who must endlessly choose between one of a number of competing hypotheses. Given any experiment, it is easy to construct a hypothesis that fits the data by using a lookup table. However such hypotheses tend to have poor predictive power compared to a simple alternative that also matches the data. This is known as the principle of parsimony, or Occam's razor, and suggests that simple hypotheses should be given a greater weight than more complex ones.

Until recently, Occam's razor was only an informal heuristic. This changed when Solomonoff, Kolmogorov and Chaitin independently developed the field of algorithmic information theory that allows for a formal definition of Occam's razor. We give a brief overview here, while a more detailed introduction can be found in [LV08]. An in-depth study of the philosophy behind Occam's razor and its formalisation by Kolmogorov complexity can be found in [KLV97, RH11]. While we believe Kolmogorov complexity is the most foundational formalisation of Occam's razor, there have been other approaches such as MML [WB68] and MDL [Gru07]. These other techniques have the advantage of being computable (given a computable prior) and so lend themselves to good practical applications.

The idea of Kolmogorov complexity is to assign to each binary string an integer-valued complexity that represents the length of its shortest description. Those strings with short descriptions are considered simple, while strings with long descriptions are complex. For example, the string consisting of 1,000,000 1s can easily be described as "one million ones". On the other hand, to describe a string generated by tossing a coin 1,000,000 times would likely require a description about 1,000,000 bits long. The key to formalising this intuition is to choose a universal Turing machine as the language of descriptions.

Definition 6 (Kolmogorov Complexity). Let $U$ be a universal Turing machine and $x \in \mathbb{B}^*$ be a finite binary string. Then define the plain Kolmogorov complexity $C(x)$ to be the length of the shortest program (description) $p$ such that $U(p) = x$:

$$C(x) := \min_{p \in \mathbb{B}^*} \{ \ell(p) : U(p) = x \}$$

It is easy to show that $C$ depends on the choice of universal Turing machine $U$ only up to a constant independent of $x$, and so it is standard to choose an arbitrary reference universal Turing machine.
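$C$ is incomputable, but its flavour can be illustrated with an off-the-shelf compressor: the compressed length of a string is a crude, machine-dependent upper bound on its description length. The sketch below (an approximation only, not Kolmogorov complexity itself) compares the million-ones string from above with a string of random coin flips using Python's zlib.

```python
import random
import zlib

def compressed_len(x: bytes) -> int:
    """Length in bytes of a zlib encoding of x: a rough, computable
    upper bound on a description length for x (not C(x) itself)."""
    return len(zlib.compress(x, 9))

n = 1_000_000
regular = b"1" * n                                           # 'one million ones'
coin_flips = bytes(random.choice(b"01") for _ in range(n))   # random bits as text

print(compressed_len(regular))      # on the order of a kilobyte
print(compressed_len(coin_flips))   # close to n/8 bytes: little structure to exploit
```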

For technical reasons it is difficult to use $C$ as a prior, so Solomonoff introduced monotone machines to construct the Solomonoff prior, $M$. A monotone Turing machine has one read-only input tape which may only be read from left to right and one write-only output tape that may only be written to from left to right. It has any number of working tapes. Let $T$ be such a machine and write $T(p) = x$ to mean that after reading $p$, $x$ is on the output tape. The machines are called monotone because if $p$ is a prefix of $q$ then $T(p)$ is a prefix of $T(q)$. It is possible to show there exists a universal monotone Turing machine $U$, and this is used to define monotone complexity $Km$ and Solomonoff's prior, $M$.


Definition 7 (Monotone Complexity). Let $U$ be the reference universal monotone Turing machine; then define $Km$, $M$ and $KM$ as follows:

$$Km(x) := \min \{ \ell(p) : U(p) = x \}$$

$$M(x) := \sum_{U(p) = x} 2^{-\ell(p)}$$

$$KM(x) := -\log M(x)$$

where $U(p) = x$ means that when given input $p$, $U$ outputs $x$ possibly followed by more bits.

    Some facts/notes follow.

1. For any $n$, $\sum_{x \in \mathbb{B}^n} M(x) \le 1$.

2. $Km$, $M$ and $KM$ are incomputable.

3. $0 < KM(x) \le Km(x) \le C(x) < \ell(x) + O(1)$.^2

To illustrate why $M$ gives greater weight to simple $x$: suppose $x$ is simple, then there exists a relatively short monotone Turing machine $p$ computing it. Therefore $Km(x)$ is small and so $2^{-Km(x)} \le M(x)$ is relatively large.

Since $M$ is a semi-measure rather than a proper measure, it is not appropriate to use it in place of $P$ when computing expected loss. However it can be normalized to a proper measure, $M_{norm}$, defined inductively by

$$M_{norm}(\epsilon) := 1, \qquad M_{norm}(xb) := M_{norm}(x) \, \frac{M(xb)}{M(x0) + M(x1)}.$$

Note that this normalisation is not unique, but it is philosophically and technically the most attractive and was used and defended by Solomonoff. For a discussion of normalisation, see [LV08, p. 303]. The normalised version satisfies $\sum_{x \in \mathbb{B}^n} M_{norm}(x) = 1$.
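The inductive normalisation is straightforward to express in code. The sketch below is illustrative only: it assumes some computable stand-in for the semimeasure (here a placeholder based on compressed length, which is not the true $M$) and applies the recursion above; the function names are ours.

```python
import zlib

def M_approx(x: str) -> float:
    """Placeholder 'semimeasure': 2^(-compressed length in bits).
    Purely illustrative; the true Solomonoff prior M is incomputable."""
    return 2.0 ** (-8 * len(zlib.compress(x.encode(), 9)))

def M_norm(x: str, M=M_approx) -> float:
    """Mnorm(empty) = 1;  Mnorm(xb) = Mnorm(x) * M(xb) / (M(x0) + M(x1))."""
    if x == "":
        return 1.0
    prefix = x[:-1]
    return M_norm(prefix, M) * M(x) / (M(prefix + "0") + M(prefix + "1"))

# Mnorm is a proper measure: one-bit extensions split the mass of their prefix.
print(M_norm("0") + M_norm("1"))       # 1.0
print(M_norm("010") + M_norm("011"))   # equal to M_norm("01")
```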

We will also need to define $M$/$KM$ with side information: $M(y; x) := M(y)$ where $x$ is provided on a spare tape of the universal Turing machine. Now define $KM(y; x) := -\log M(y; x)$. This allows us to define the complexity of a function in terms of its output relative to its input.

Definition 8 (Complexity of a function). Let $X = \{x_1, \dots, x_n\} \subseteq \mathbb{B}^k$ and $f : X \to \mathbb{B}$; then define the complexity of $f$, $KM(f; X)$, by

$$KM(f; X) := KM(f(x_1) f(x_2) \cdots f(x_n);\; x_1, x_2, \dots, x_n).$$

An example is useful to illustrate why this is a good measure of the complexity of $f$.

^2 The approximation $C(x) \approx Km(x)$ is only accurate to $\log \ell(x)$, while $KM \approx Km$ is almost always very close [Gac83, Gac08]. This is a little surprising since the sum in the definition of $M$ contains $2^{-Km}$. It shows that there are only comparatively few short programs for any $x$.


Example 9. Let $X \subseteq \mathbb{B}^n$ for some $n$, $Y = \mathbb{B}$, and $f : X \to Y$ be defined by $f(x) = [[x_n = 1]]$. Now for a complex $X$, the string $f(x_1) f(x_2) \cdots$ might be difficult to describe, but there is a very short program that can output $f(x_1) f(x_2) \cdots$ when given $x_1 x_2 \cdots$ as input. This gives the expected result that $KM(f; X)$ is very small.
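The point of Example 9 is that the label string $f(x_1) f(x_2) \cdots f(x_n)$ may look arbitrary on its own yet be trivial to produce from the inputs. The sketch below (our illustration; the domain and helper are invented for the demo) builds that label string for $f(x) = [[x_k = 1]]$ over an irregular set of $k$-bit inputs; the rule that regenerates the labels has constant size, independent of how complex $X$ is, which is what keeps $KM(f; X)$ small.

```python
import random

k = 16
random.seed(0)
# An arbitrary, fairly irregular domain X of 200 distinct k-bit strings.
X = random.sample([format(i, f"0{k}b") for i in range(2 ** k)], 200)

def last_bit(x: str) -> str:
    """The entire 'description' of f given the inputs: f(x) = [[x_k = 1]]."""
    return x[-1]

label_string = "".join(last_bit(x) for x in X)   # f(x1) f(x2) ... f(xn)
print(label_string[:32], "...")
# However complex X is, this few-line rule maps inputs to labels, so the
# conditional complexity KM(f; X) stays bounded as |X| grows.
```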

Free lunch using Solomonoff prior. We are now ready to use $M_{norm}$ as a prior on a problem family. The following proposition shows that when problems are chosen according to the Solomonoff prior there is a (possibly small) free lunch.

Before the proposition, we remark on problems with maximal complexity, $KM(f; X) = O(|X|)$. In this case $f$ exhibits no structure allowing it to be compressed, which turns out to be equivalent to being random in every intuitive sense [ML66]. We do not believe such problems are any more interesting than trying to predict random coin flips. Further, the NFL theorems can be used to show that no algorithm can learn the class of random problems, by noting that almost all problems are random. Thus a bias towards random problems is not much of a bias (from uniform) at all, and so at most leads to a decreasingly small free lunch as the number of problems increases.

Proposition 1 (Free lunch under Solomonoff prior). Let $Y = \mathbb{B}$ and fix a $k \in \mathbb{N}$. Now let $X = \mathbb{B}^n$ and $X_m \subset X$ such that $|X_m| = 2^n - k$. For sufficiently large $n$ there exists an algorithm $A$ such that

$$L_A(M_{norm}, X_m) < 1/2.$$

    Before the proof of Proposition 1, we require an easy lemma.

Lemma 10. Let $N \subseteq \mathcal{M}$; then there exists an algorithm $A_N$ such that

$$\sum_{f \in N} P(f) \, L_{A_N}(f, X_m) \le \frac{1}{2} \sum_{f \in N} P(f).$$

Proof. Let $A_i$ with $i \in \{0, 1\}$ be the algorithm always choosing $i$. Note that

$$\sum_{f \in N} P(f) \, L_{A_0}(f, X_m) = \sum_{f \in N} P(f) \, (1 - L_{A_1}(f, X_m)).$$

The result follows easily.

Proof of Proposition 1. Let $\mathcal{M}_1$ be the set of all $f \in \mathcal{M}$ with $f(y) = 1\ \forall y \in X_m$ and $\mathcal{M}_0 = \mathcal{M} \setminus \mathcal{M}_1$. Now construct an $A$ by

$$A(f|_{X_m}, x) = \begin{cases} 1 & \text{if } f \in \mathcal{M}_1 \\ A_{\mathcal{M}_0}(f|_{X_m}, x) & \text{otherwise} \end{cases}$$


Let $f_1 \in \mathcal{M}_1$ be the constant-valued function such that $f_1(x) = 1\ \forall x$; then

$$\begin{aligned}
L_A(M_{norm}, X_m) &= \sum_{f \in \mathcal{M}} M_{norm}(f) \, L_A(f, X_m) && (2) \\
&= \sum_{f \in \mathcal{M}_0} M_{norm}(f) \, L_A(f, X_m) + \sum_{f \in \mathcal{M}_1} M_{norm}(f) \, L_A(f, X_m) && (3) \\
&\le \frac{1}{2} \sum_{f \in \mathcal{M}_0} M_{norm}(f) + \sum_{f \in \mathcal{M}_1} M_{norm}(f) \, L_A(f, X_m) && (4) \\
&\le \frac{1}{2} \sum_{f \in \mathcal{M}_0} M_{norm}(f) + \sum_{f \in \mathcal{M}_1 \setminus \{f_1\}} M_{norm}(f) && (5)
\end{aligned}$$

which is less than $1/2$ for sufficiently large $n$, with the constant involved independent of $n$. The second point holds because $\max_{f \in \mathcal{M}_1 \setminus \{f_1\}} M_{norm}(f) \to 0$ as $n \to \infty$ and $|\mathcal{M}_1|$ is independent of $n$.

The proposition is unfortunately extremely weak. It is more interesting to know exactly what conditions are required to do much better than random. In the next section we present an algorithm with good performance on all well-structured problems when given good training data. Without good training data, even assuming a Solomonoff prior, we believe it is unlikely that the best algorithm will perform well.

Note that while it appears intuitively likely that any non-uniform distribution such as $M_{norm}$ might offer a free lunch, this is in fact not true. It is shown in [SVW01] that there exist non-uniform distributions where the loss over a problem family is independent of the algorithm. These distributions satisfy certain symmetry conditions not satisfied by $M_{norm}$, which allows Proposition 1 to hold.

    4 Complexity-based classification

Solomonoff induction is well known to solve the online prediction problem, where the true value of each classification is known after each guess. In our setup, the true classification is only known for the training data, after which the algorithm no longer receives feedback. While Solomonoff induction can be used to bound the number of total errors while predicting deterministic sequences, it gives no indication of when these errors may occur. For this reason we present a complexity-inspired algorithm with better properties for the offline classification problem.


Note, the above is not precisely true since for small strings the dependence of $KM$ on the universal monotone Turing machine can be fairly large. However, if we increase the size of the example so that $|X| > 1000$ then these quirks disappear for natural reference universal Turing machines.

Definition 13 (Entropy). Let $\alpha \in [0, 1]$:

$$H(\alpha) := \begin{cases} -[\alpha \log \alpha + (1 - \alpha) \log(1 - \alpha)] & \text{if } \alpha \neq 0 \text{ and } \alpha \neq 1 \\ 0 & \text{otherwise} \end{cases}$$
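For reference, Definition 13 transcribed directly into code (we take the logarithm base 2, as is conventional alongside Kolmogorov complexity, so $H(1/2) = 1$):

```python
import math

def H(alpha: float) -> float:
    """Binary entropy: -[a*log2(a) + (1-a)*log2(1-a)] for 0 < a < 1, else 0."""
    if alpha <= 0.0 or alpha >= 1.0:
        return 0.0
    return -(alpha * math.log2(alpha) + (1 - alpha) * math.log2(1 - alpha))

print(H(0.5))    # 1.0
print(H(0.11))   # about 0.5
print(H(0.0))    # 0.0
```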

Theorem 14. Let the proportion of data to be given for training lie in $(0, 1)$; then:

    1. There exists a B (training set) such that for all n N, n c1 0 independent of, ,x and y. Finally use the techniques in

    the proof of Theorem 14 to complete the proof.

    5 Discussion

Summary. Proposition 1 shows that if problems are distributed according to their complexity, as Occam's razor suggests they should be, then a (possibly small) free lunch exists. While the assumption of simplicity still represents a bias towards certain problems, it is a universal one in the sense that no style of structured problem is more favoured than another.

In Section 4 we gave a complexity-based classification algorithm and proved the following properties:

1. It performs well on problems that exhibit some compressible structure, $KM(f; X) = o(n)$.

2. Increasing the amount of training data decreases the error.

3. It performs better when given a good (broad/randomized) selection of training data.


Theorem 14 is reminiscent of the transductive learning bounds of Vapnik and others [DEyM04, Vap82, Vap00], but holds for all Martin-Löf random training data, rather than with high probability. This is different to the predictive result in Solomonoff induction where results hold with probability 1 rather than for all Martin-Löf random sequences [HM07]. If we assume the training set is sampled randomly, then our bounds are comparable to those in [DEyM04].

Unfortunately, the algorithm of Section 4 is incomputable. However, Kolmogorov complexity can be approximated via standard compression algorithms, which may allow for a computable approximation of the classifier of Section 4. Such approximations have had some success in other areas of AI, including general reinforcement learning [VNH+11] and unsupervised clustering [CV05].
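As a crude illustration of this idea (not the algorithm of Section 4 itself, and with all names and parameters invented for the demo), one can replace $KM$ by a compressor and assign a test item the label whose training corpus makes it cheapest to describe, in the spirit of compression-based clustering [CV05].

```python
import random
import zlib

def code_len(s: str) -> int:
    """Compressed length in bits: a crude, computable stand-in for KM."""
    return 8 * len(zlib.compress(s.encode(), 9))

def classify(corpora, x):
    """Label x by the class whose corpus makes x cheapest to describe:
    argmin_y  code_len(corpus_y + x) - code_len(corpus_y)."""
    return min(corpora, key=lambda y: code_len(corpora[y] + x) - code_len(corpora[y]))

random.seed(0)
motif = {"0": "the quick brown fox ", "1": "0123456789abcdef "}
sample = lambda y, n: "".join(random.choice(motif[y].split()) + " " for _ in range(n))

corpora = {y: sample(y, 400) for y in motif}    # training data for each class
print(classify(corpora, sample("0", 20)))       # expected to print '0'
print(classify(corpora, sample("1", 20)))       # expected to print '1'
```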

Occam's razor is often thought of as the principle of choosing the simplest hypothesis matching your data. Our definition of simplest is the hypothesis that minimises $KM(f; X)$ (maximises $M(f; X)$). This is perhaps not entirely natural from the informal statement of Occam's razor, since $M(x)$ contains contributions from all programs computing $x$, not just the shortest. We justify this by combining Occam's razor with Epicurus' principle of multiple explanations, which argues for all consistent hypotheses to be considered. In some ways this is the most natural interpretation, as no scientist would entirely rule out a hypothesis just because it is slightly more complex than the simplest. A more general discussion of this issue can be found in [Dow11, sec. 4]. Additionally, we can argue mathematically that since $KM \approx Km$, the simplest hypothesis is very close to the mixture.^3 Therefore the debate is more philosophical than practical in this setting.

An alternative approach to formalising Occam's razor has been considered in MML [WB68]. However, in the deterministic setting the probability of the data given the hypothesis satisfies $P(D|H) = 1$. This means the two-part code reduces to the code-length of the prior, $\log(1/P(H))$: the hypothesis with minimum message length depends only on the choice of prior, not on the complexity of coding the data. The question then is how to choose the prior, on which MML gives no general guidance. Some discussion of Occam's razor from a Kolmogorov complexity viewpoint can be found in [Hut10, KLV97, RH11], while the relation between MML and Kolmogorov complexity is explored in [WD99].
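In symbols (our paraphrase of the standard two-part MML code, specialised to the deterministic case just described):

$$\text{len}(H, D) = \log\frac{1}{P(H)} + \log\frac{1}{P(D \mid H)} = \log\frac{1}{P(H)} \quad \text{when } P(D \mid H) = 1,$$

so minimising the message length amounts to maximising the prior probability $P(H)$.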

Assumptions. We assumed finite $X$, $Y$, and deterministic $f$, which is the standard transductive learning setting. Generalisations to countable spaces may still be possible using complexity approaches, but non-computable real numbers prove more difficult. One can either argue by the strong Church-Turing thesis that non-computable reals do not exist, or approximate them arbitrarily well. Stochastic $f$ are interesting and we believe a complexity-based approach will still be effective, although the theorems and proofs may turn out to be somewhat different.

Acknowledgements. We thank Wen Shao and reviewers for valuable feedback on earlier drafts and the Australian Research Council for support under grant DP0988049.

^3 The bounds of Section 4 would depend on the choice of complexity at most logarithmically in $|X|$, with $KM$ providing the uniformly better bound.

A Technical proofs

Lemma 16 (Proof of Entropy inequality).

    0 (1 )log 1

    1 (10)

    H() [+ (1 )(1 )] H

    + (1 )(1 )

    (11)

    With equality only if {0, 1} or= 0

    Proof. First, (10) is trivial. To prove (11), note that for = 0 or {0, 1}, equalityis obvious. Now, fixing (0, 1) and computing.

    H() [+ (1 )(1 )] H

    + (1 )(1 )

    = (1 )log 1 (1 )

    (1 )(1 )

    (1 )log(1 )1

    Therefore integrating both sides over gives,

    (1 ) log(1 )1 H() [+ (1 )(1 )] H

    + (1 )(1 )

    as required.

    References

[CS07] J. Carroll and K. Seppi. No-free-lunch and Bayesian optimality. In IJCNN Workshop on Meta-Learning, 2007.

[CV05] R. Cilibrasi and P. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.

[DEyM04] P. Derbeko, R. El-Yaniv, and R. Meir. Error bounds for transductive learning via compression and clustering. NIPS, 16, 2004.

[Dow11] D. Dowe. MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness. In Handbook of Philosophy of Statistics, volume 7, pages 901–982. Elsevier, 2011.

[Gac83] P. Gács. On the relation between descriptional complexity and algorithmic probability. Theoretical Computer Science, 22(1–2):71–93, 1983.

[Gac08] P. Gács. Expanded and improved proof of the relation between description complexity and algorithmic probability. Unpublished, 2008.

[GCP05] C. Giraud-Carrier and F. Provost. Toward a justification of meta-learning: Is the no free lunch theorem a show-stopper? In ICML Workshop on Meta-Learning, pages 9–16, 2005.

[Gru07] P. Grünwald. The Minimum Description Length Principle, volume 1 of MIT Press Books. The MIT Press, 2007.

[HM07] M. Hutter and A. Muchnik. On semimeasures predicting Martin-Löf random sequences. Theoretical Computer Science, 382(3):247–261, 2007.

[Hut04] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004.

[Hut10] M. Hutter. A complete theory of everything (will be subjective). Algorithms, 3(4):329–350, 2010.

[KLV97] W. Kirchherr, M. Li, and P. Vitányi. The miraculous universal distribution. The Mathematical Intelligencer, 19(4):7–15, 1997.

[LV08] M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Springer Verlag, 3rd edition, 2008.

[ML66] P. Martin-Löf. The definition of random sequences. Information and Control, 9(6):602–619, 1966.

[RH11] S. Rathmanner and M. Hutter. A philosophical treatise of universal induction. Entropy, 13(6):1076–1136, 2011.

[Sch94] C. Schaffer. A conservation law for generalization performance. In Proceedings of the Eleventh International Conference on Machine Learning, pages 259–265. Morgan Kaufmann, 1994.

[Sol64a] R. Solomonoff. A formal theory of inductive inference, Part I. Information and Control, 7(1):1–22, 1964.

[Sol64b] R. Solomonoff. A formal theory of inductive inference, Part II. Information and Control, 7(2):224–254, 1964.

[SVW01] C. Schumacher, M. Vose, and L. Whitley. The no free lunch and problem description length. In Lee Spector and Eric D. Goodman, editors, GECCO 2001: Proc. of the Genetic and Evolutionary Computation Conf., pages 565–570, San Francisco, 2001. Morgan Kaufmann.

[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.

[Vap00] V. Vapnik. The Nature of Statistical Learning Theory. Springer, Berlin, 2nd edition, 2000.

[VNH+11] J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.

[WB68] C. Wallace and D. Boulton. An information measure for classification. The Computer Journal, 11(2):185–194, 1968.

[WD69] S. Watanabe and S. Donovan. Knowing and Guessing: A Quantitative Study of Inference and Information. Wiley, New York, 1969.

[WD99] C. Wallace and D. Dowe. Minimum message length and Kolmogorov complexity. The Computer Journal, 42(4):270–283, 1999.

[WM97] D. Wolpert and W. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, April 1997.

[Wol01] D. Wolpert. The supervised learning no-free-lunch theorems. In Proc. 6th Online World Conference on Soft Computing in Industrial Applications, pages 25–42, 2001.
