SS2016 Modern Neural Computation Lecture 5: Neural Networks and Neuroscience Hirokazu Tanaka School of Information Science Japan Advanced Institute of Science and Technology

JAIST Summer School 2016 "Theory for Understanding the Brain", Lecture 04: Neural Networks and Neuroscience



  • SS2016 Modern Neural Computation

    Lecture 5: Neural Networks and Neuroscience

    Hirokazu Tanaka, School of Information Science

    Japan Advanced Institute of Science and Technology

  • Supervised learning as functional approximation.

    In this lecture we will learn:

    Single-layer neural networks: the perceptron and the perceptron convergence theorem; the cerebellum as a perceptron.

    Multi-layer feedforward neural networks: universal function approximation and the back-propagation algorithm.

    Recurrent neural networks: the back-propagation-through-time (BPTT) algorithm.

    Tempotron: a spike-based perceptron.

  • Gradient-descent learning for optimization.

    Weights are updated in the direction of the negative gradient of the cost function:

    \Delta \mathbf{w} = -\eta \, \frac{\partial E}{\partial \mathbf{w}}

  • Cost function: classification and regression.

    Classification problem: to output discrete labels. For a binary classification (i.e., 0 or 1), the cross-entropy is often used.

    Regression problem: to output continuous values. The sum of squared errors is often used.

    y_i: output of the network, \bar{y}_i: desired output.

    Cross-entropy (summed over samples i):

    E = -\sum_i \left[ \bar{y}_i \log y_i + (1 - \bar{y}_i) \log (1 - y_i) \right]

    Sum of squared errors (summed over samples i):

    E = \sum_i \left( y_i - \bar{y}_i \right)^2
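    As a minimal illustration (not from the original slides), the following MATLAB sketch evaluates both cost functions and takes one gradient-descent step for a single logistic output unit; the toy data, the learning rate eta, and the zero initialization are assumptions.

    % Toy data: P samples, N features, binary desired outputs (assumed values).
    P = 100; N = 5; eta = 0.1;
    X = randn(P, N);
    ybar = double(rand(P, 1) > 0.5);        % desired outputs (0 or 1)
    w = zeros(N, 1);

    y = 1 ./ (1 + exp(-X * w));             % logistic outputs of the network
    Ece = -sum(ybar .* log(y) + (1 - ybar) .* log(1 - y));   % cross-entropy
    Esq = sum((y - ybar).^2);                                 % sum of squared errors

    % One gradient-descent step on the cross-entropy cost:
    % for a logistic unit, dEce/dw = X' * (y - ybar).
    w = w - eta * (X' * (y - ybar));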

  • Perceptron: single-layer neural network.

    Assume a single-layer neural network with an input layer composed of N units and an output layer composed of one unit.

    The input units are specified by

    \mathbf{x} = (x_1, \ldots, x_N)^{\mathrm{T}}

    and the output unit is determined by

    y = f\!\left( \sum_{i=1}^{N} w_i x_i + w_0 \right) = f\!\left( \mathbf{w}^{\mathrm{T}} \mathbf{x} + w_0 \right)

    with the step activation function

    f(u) = \begin{cases} 1 & \text{if } u \geq 0 \\ 0 & \text{if } u < 0 \end{cases}

  • Perceptron: single-layer neural network.

    [Figure: two classes of data points plotted in a two-dimensional feature space (feature 1 vs. feature 2).]

  • Perceptron: single-layer neural network.

    [Remark] Instead of using

    \mathbf{x} = (x_1, \ldots, x_N)^{\mathrm{T}},

    an augmented input vector and weight vector

    \tilde{\mathbf{x}} = (1, x_1, \ldots, x_N)^{\mathrm{T}}, \qquad \tilde{\mathbf{w}} = (w_0, w_1, \ldots, w_N)^{\mathrm{T}}

    are often used. Then,

    y = f\!\left( w_0 + \mathbf{w}^{\mathrm{T}} \mathbf{x} \right) = f\!\left( \tilde{\mathbf{w}}^{\mathrm{T}} \tilde{\mathbf{x}} \right).

  • Perceptron Learning Algorithm.

    Given a training set of input-label pairs:

    \{ (\mathbf{x}^1, d^1), (\mathbf{x}^2, d^2), \ldots, (\mathbf{x}^P, d^P) \}

    Perceptron learning rule (a MATLAB sketch follows below):

    \Delta \mathbf{w} = \eta \left( d^i - y^i \right) \mathbf{x}^i

    while err > 1e-4 && count < maxIter   % maxIter: assumed iteration cap
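    A minimal MATLAB sketch of the perceptron learning loop above; the toy data, the learning rate eta, and the iteration cap maxIter are assumptions made for illustration.

    % Perceptron learning on a toy linearly separable set (assumed data).
    P = 200; N = 2; eta = 1; maxIter = 1000;
    X = [ones(P, 1), randn(P, N)];          % augmented inputs (bias column of ones)
    wTrue = randn(N + 1, 1);
    d = double(X * wTrue >= 0);             % labels from a random separating plane
    w = zeros(N + 1, 1);

    err = inf; count = 0;
    while err > 1e-4 && count < maxIter
        for p = 1:P
            y = double(X(p, :) * w >= 0);               % step activation
            w = w + eta * (d(p) - y) * X(p, :)';        % perceptron rule, one sample at a time
        end
        err = mean(abs(double(X * w >= 0) - d));        % fraction of misclassified samples
        count = count + 1;
    end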

  • Perceptron Learning Algorithm.

    Case 1: Linearly separable case

  • Perceptron Learning Algorithm.

    Case 2: Linearly non-separable case

  • Perceptron's capacity: Cover's Counting Theorem.

    Question: Suppose that there are P vectors in N-dimensional Euclidean space.

    There are 2^P possible assignments of the P vectors into two classes. How many of them are linearly separable?

    [Remark] The vectors are assumed to be in general position.

    Answer (Cover's Counting Theorem): for P vectors \{ \mathbf{x}^1, \ldots, \mathbf{x}^P \}, \ \mathbf{x}^i \in \mathbb{R}^N, in general position, the number of linearly separable dichotomies is

    C(P, N) = 2 \sum_{k=0}^{N-1} \binom{P-1}{k}

  • Perceptron's capacity: Cover's Counting Theorem.

    Cover's Counting Theorem:

    C(P, N) = 2 \sum_{k=0}^{N-1} \binom{P-1}{k}

    Case P \leq N: \quad C(P, N) = 2^P \quad (every dichotomy is linearly separable)

    Case P = 2N: \quad C(P, N) = 2^{P-1} \quad (half of all dichotomies are separable)

    Case P \gg N: \quad C(P, N) grows only polynomially in P \ (\approx 2 \binom{P-1}{N-1}), while the total number of dichotomies 2^P grows exponentially.

    Cover (1965) IEEE Trans. Electronic Computers; Sompolinsky (2013) MIT lecture note

  • Perceptron's capacity: Cover's Counting Theorem.

    Case of large P (and large N): the fraction of linearly separable dichotomies is approximately

    \frac{C(P, N)}{2^P} \approx \frac{1}{2} \left[ 1 + \mathrm{erf}\!\left( \frac{2N - P}{\sqrt{2P}} \right) \right]

    so the fraction drops sharply from 1 to 0 around the capacity P = 2N.

    Orhan (2014) Cover's Function Counting Theorem
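    A small MATLAB sketch (not from the slides) that evaluates C(P, N)/2^P from the counting formula and shows the sharp transition around P = 2N; the choice N = 20 is an assumption.

    % Fraction of linearly separable dichotomies as a function of P, for fixed N.
    N = 20;                                  % input dimension (assumed)
    Pvals = 1:100;
    frac = zeros(size(Pvals));
    for idx = 1:numel(Pvals)
        P = Pvals(idx);
        k = 0:min(N, P) - 1;                 % sum runs over k = 0 .. N-1 (capped at P-1)
        C = 2 * sum(arrayfun(@(kk) nchoosek(P - 1, kk), k));
        frac(idx) = C / 2^P;                 % (nchoosek may warn about precision for large P)
    end
    plot(Pvals / N, frac);                   % the transition occurs near P/N = 2
    xlabel('P / N'); ylabel('C(P, N) / 2^P');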

  • Cerebellum as a Perceptron.

    Llinás (1974) Scientific American

  • Cerebellum as a Perceptron.

    Cerebellar cortex has a feedforward structure: mossy fibers -> granule cells -> parallel fibers -> Purkinje cells

    Ito (1984) Cerebellum and Neural Control

  • Cerebellum as a Perceptron (or its extensions)

    Perceptron model. Marr (1969): long-term potentiation (LTP) learning. Albus (1971): long-term depression (LTD) learning.

    Adaptive filter theory. Fujita (1982): reverberation among granule and Golgi cells for generating temporal templates.

    Liquid-state machine model. Yamazaki and Tanaka (2007).

  • Perceptron: a new perspective.

    Evaluation of memory capacity of a Purkinje cell using perceptron methods (the Gardner limit).

    Brunel, N., Hakim, V., Isope, P., Nadal, J. P., & Barbour, B. (2004). Optimal information storage and the distribution of synaptic weights: perceptron versus Purkinje cell. Neuron, 43(5), 745-757.

    Estimation of the dimensionality of neural representations during a visual memory task in the prefrontal cortex using perceptron methods (Cover's counting theorem).

    Rigotti, M., Barak, O., Warden, M. R., Wang, X. J., Daw, N. D., Miller, E. K., & Fusi, S. (2013). The importance of mixed selectivity in complex cognitive tasks. Nature, 497(7451), 585-590.

  • Limitation of Perceptron.

    Only linearly separable input-output sets can be learned.

    Non-linearly separable sets, even a simple one like XOR, CANNOT be learned.

  • Multilayer neural network: feedforward design

    Notation: x_i^{(n)} denotes the activity of unit i in layer n, and w_{ij}^{(n-1)} the weight from unit j in layer n-1 to unit i in layer n (layers 1, \ldots, n-1, n, \ldots, N).

    Feedforward network: a unit in layer n receives inputs from layer n-1 and projects to layer n+1.


  • Multilayer neural network: forward propagation.

    A feedforward multilayer neural network propagates its activities from one layer to the next in one direction.

    The input to neuron i in layer n is a weighted sum of the activities of neurons in layer n-1:

    u_i^{(n)} = \sum_j w_{ij}^{(n-1)} x_j^{(n-1)}

    and its activity is

    x_i^{(n)} = f\!\left( u_i^{(n)} \right) = f\!\left( \sum_j w_{ij}^{(n-1)} x_j^{(n-1)} \right).

    The function f is called an activation function; for the sigmoid

    f(u) = \frac{1}{1 + e^{-u}}

    the derivative is easy to compute:

    f'(u) = \frac{e^{-u}}{\left( 1 + e^{-u} \right)^2} = f(u) \left( 1 - f(u) \right).
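    A minimal MATLAB sketch of this forward propagation for a network with assumed layer sizes; the random weight matrices and inputs are placeholders.

    % Forward propagation through a multilayer sigmoid network (assumed sizes).
    sizes = [4, 6, 3];                        % units in layers 1..N (assumed)
    W = cell(numel(sizes) - 1, 1);
    for n = 1:numel(sizes) - 1
        W{n} = 0.1 * randn(sizes(n + 1), sizes(n));   % weights from layer n to layer n+1
    end

    f = @(u) 1 ./ (1 + exp(-u));              % sigmoid activation function
    x = rand(sizes(1), 1);                    % activities of the input layer
    for n = 1:numel(W)
        u = W{n} * x;                         % u^(n+1) = W^(n) x^(n)
        x = f(u);                             % x^(n+1) = f(u^(n+1))
    end
    % x now holds the activities of the output layer.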

  • Multilayer neural network: error backpropagation

    Define the cost function as the squared sum of errors in the output units (z_i: desired outputs):

    E = \frac{1}{2} \sum_i \left( x_i^{(N)} - z_i \right)^2

    The gradient of the cost function with respect to the weights is computed by propagating error signals \delta backwards; for a unit in layer n-1,

    \delta_j^{(n-1)} = \sum_i \delta_i^{(n)} \, x_i^{(n)} \left( 1 - x_i^{(n)} \right) w_{ij}^{(n-1)}.

    The neurons in the output layer have an explicit supervised error (the difference between the network outputs and the desired outputs). How, then, to compute the supervising signals for neurons in intermediate layers? The recursion above back-propagates the output errors layer by layer.

  • Multilayer neural network: error backpropagation

    1. Compute the activations of units in all layers, \{ x_i^{(1)} \}, \ldots, \{ x_i^{(n)} \}, \ldots, \{ x_i^{(N)} \}.

    2. Compute the errors in the output units, \delta_i^{(N)} = x_i^{(N)} - z_i.

    3. Back-propagate the errors to lower layers using

    \delta_j^{(n-1)} = \sum_i \delta_i^{(n)} \, x_i^{(n)} \left( 1 - x_i^{(n)} \right) w_{ij}^{(n-1)}.

    4. Update the weights:

    \Delta w_{ij}^{(n)} = -\eta \, \delta_i^{(n+1)} \, x_i^{(n+1)} \left( 1 - x_i^{(n+1)} \right) x_j^{(n)}.
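    The following MATLAB sketch (an illustration, not the lecture's code) trains a small sigmoid network on XOR with the four steps above; the layer sizes, learning rate, and number of epochs are assumptions, and gradient descent can occasionally get stuck in a local minimum.

    % Backpropagation on XOR with one hidden layer of sigmoid units.
    X = [0 0; 0 1; 1 0; 1 1]';               % inputs, one column per sample
    Z = [0 1 1 0];                            % desired outputs
    f = @(u) 1 ./ (1 + exp(-u));
    eta = 0.5;
    W1 = randn(3, 3);                         % hidden weights (last column: bias)
    W2 = randn(1, 4);                         % output weights (last entry: bias)

    for epoch = 1:20000
        for p = 1:4
            % 1. Forward pass (a constant 1 appended as the bias unit).
            x1 = [X(:, p); 1];
            x2 = [f(W1 * x1); 1];
            x3 = f(W2 * x2);
            % 2. Output error.
            d3 = x3 - Z(p);
            % 3. Back-propagate the error to the hidden layer.
            d2 = W2' * (d3 .* x3 .* (1 - x3));
            d2 = d2(1:end-1);                 % drop the bias unit's error
            % 4. Update the weights.
            W2 = W2 - eta * (d3 .* x3 .* (1 - x3)) * x2';
            W1 = W1 - eta * (d2 .* x2(1:end-1) .* (1 - x2(1:end-1))) * x1';
        end
    end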

  • Multilayer neural network as universal machine for functional approximation.

    A multilayer neural network is in principle able to approximate any continuous functional relationship between inputs and outputs to any desired accuracy (Funahashi, 1989).

    Intuition: a sum or a difference of two sigmoid functions is a bump-like function, and a sufficiently large number of bump functions can approximate any function (see the sketch below).
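    A few lines of MATLAB illustrating the intuition: the difference of two shifted sigmoids forms a localized bump; the shift and slope values are arbitrary.

    % The difference of two shifted sigmoids is a bump-like function.
    f = @(u) 1 ./ (1 + exp(-u));
    u = linspace(-10, 10, 500);
    bump = f(5 * (u + 1)) - f(5 * (u - 1));   % bump centered at u = 0
    plot(u, bump);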

  • NETtalk: A parallel network that learns to read aloud.

    Sejnowski & Rosenberg (1987) Complex Systems

    A feedforward three-layer neural network with delay lines.

  • NETtalk: A parallel network that learns to read aloud.

    Sejnowski & Rosenberg (1987) Complex Systems; https://www.youtube.com/watch?v=gakJlr3GecE

    A feedforward three-layer neural network with delay lines.

  • NETtalk: A parallel network that learns to read aloud.

    Sejnowski & Rosenberg (1987) Complex Systems

    Activations of hidden units for the same sound but different inputs

  • Hinton diagrams: characterizing and visualizing connections to and from hidden units.

    Hinton (1992) Sci Am


  • Autonomous driving learning by backpropagation.

    Pomerleau (1991) Neural Comput


  • Autonomous driving learning by backpropagation.

    Pomerleau (1991) Neural Comput; https://www.youtube.com/watch?v=ilP4aPDTBPE

  • Gradient vanishing problem: why is training a multi-layer neural network so difficult?

    Hochreiter et al. (1991)

    In practice, the back-propagation algorithm works only for neural networks of three or four layers.

    Training neural networks with many hidden layers (so-called deep neural networks) is notoriously difficult.

    Back-propagating the error through successive layers multiplies it by a derivative factor at every layer:

    \delta_j^{(N-1)} = \sum_i \delta_i^{(N)} \, x_i^{(N)} \left( 1 - x_i^{(N)} \right) w_{ij}^{(N-1)}

    \delta_k^{(N-2)} = \sum_j \delta_j^{(N-1)} \, x_j^{(N-1)} \left( 1 - x_j^{(N-1)} \right) w_{jk}^{(N-2)} = \sum_{i,j} \delta_i^{(N)} \, x_i^{(N)} \left( 1 - x_i^{(N)} \right) w_{ij}^{(N-1)} \, x_j^{(N-1)} \left( 1 - x_j^{(N-1)} \right) w_{jk}^{(N-2)}

    so that, schematically,

    \delta^{(n)} \sim x^{(n+1)} \left( 1 - x^{(n+1)} \right) \cdots x^{(N)} \left( 1 - x^{(N)} \right) \delta^{(N)}.

    Since each factor x(1 - x) is at most 1/4, the back-propagated error shrinks roughly exponentially with the number of layers: the gradient vanishes.
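    A tiny MATLAB illustration (not from the slides) of how the derivative factors shrink the error signal with depth; the unit activities are random placeholders.

    % Product of sigmoid-derivative factors across layers shrinks toward zero.
    nLayers = 20;
    x = rand(nLayers, 1);                    % activities of one unit per layer (placeholder)
    factors = x .* (1 - x);                  % each factor is at most 1/4
    semilogy(cumprod(factors));              % roughly exponential decay with depth
    xlabel('layer depth'); ylabel('accumulated derivative factor');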

  • Multilayer neural network: recurrent connections

    A feedforward neural network can represent only an instantaneous relationship between inputs and outputs; it is memoryless, depending on the current inputs but not on previous inputs.

    In order to describe a history, a neural network should have its own dynamics.

    One way to incorporate dynamics into a neural network is to introduce recurrent connections between units.

  • Working memory in the parietal cortex.

    A feedforward neural network can represent only an instantaneous relationship between inputs and outputs; it is memoryless, depending on the current inputs x(t) but not on previous inputs x(t-1), x(t-2), ...

    In order to describe a history, a neural network should have its own dynamics.

    One way to incorporate dynamics into a neural network is to introduce recurrent connections between units.

  • Multilayer neural network: recurrent connections

    Recurrent dynamics of the network (a: inputs, x: recurrent units, z: outputs; U: input weights, W: recurrent weights, V: readout weights):

    x_i(t+1) = f\!\left( u_i(t+1) \right) = f\!\left( \left[ \mathbf{W} \mathbf{x}(t) + \mathbf{U} \mathbf{a}(t) \right]_i \right)

    Output readout:

    z_i(t) = g\!\left( \left[ \mathbf{V} \mathbf{x}(t) \right]_i \right)
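    A minimal MATLAB sketch simulating these recurrent dynamics forward in time; the network sizes, the tanh/linear choices for f and g, and the random weights are assumptions.

    % Forward simulation of x(t+1) = f(W x(t) + U a(t)), z(t) = g(V x(t)).
    Nx = 50; Na = 3; Nz = 2; T = 200;
    W = 1.2 * randn(Nx, Nx) / sqrt(Nx);      % recurrent weights
    U = randn(Nx, Na);                       % input weights
    V = randn(Nz, Nx) / sqrt(Nx);            % readout weights
    f = @tanh; g = @(u) u;                   % assumed activation and readout functions

    a = randn(Na, T);                        % input time series (placeholder)
    x = zeros(Nx, T + 1);
    for t = 1:T
        x(:, t + 1) = f(W * x(:, t) + U * a(:, t));
    end
    z = g(V * x(:, 2:end));                  % readout z(t) = g(V x(t))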

  • Temporal unfolding: backpropagation through time (BPTT)

    Training set for a recurrent network:

    Input series: \{ \mathbf{a}_0, \mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_t, \ldots, \mathbf{a}_{T-1} \}

    Output series: \{ \mathbf{z}_1, \mathbf{z}_2, \mathbf{z}_3, \ldots, \mathbf{z}_t, \ldots, \mathbf{z}_T \}

    Optimize the weight matrices U, W, V so as to approximate the training set.

  • Temporal unfolding: backpropagation through time (BPTT)

    [Figure: the recurrent network unfolded in time. Starting from x_0, the same weight matrices U and W map the inputs a_0, a_1, ..., a_{t-1} into the states x_1, x_2, ..., x_t, and V reads out z_1, z_2, ..., z_t at every step. The unfolded network is a deep feedforward network in which U, W, V are shared across time steps, so the back-propagation algorithm can be applied to it.]
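    A compact MATLAB sketch of BPTT for this architecture, written as an illustration under assumptions: f = tanh, a linear readout g, a sum-of-squared-errors cost, and random placeholder data. It accumulates the gradients for U, W, V by propagating the errors backwards through the unfolded network.

    % One BPTT gradient step for x(t+1) = tanh(W x(t) + U a(t)), z(t) = V x(t).
    Nx = 20; Na = 3; Nz = 2; T = 50;
    W = randn(Nx) / sqrt(Nx); U = randn(Nx, Na); V = randn(Nz, Nx) / sqrt(Nx);
    a = randn(Na, T);                        % inputs a_0 .. a_{T-1} (placeholder)
    d = randn(Nz, T);                        % desired outputs z_1 .. z_T (placeholder)

    % Forward pass: store all states of the unfolded network.
    x = zeros(Nx, T + 1);                    % x(:, t+1) holds x_t
    for t = 1:T
        x(:, t + 1) = tanh(W * x(:, t) + U * a(:, t));
    end
    e = V * x(:, 2:end) - d;                 % output errors e_t = z_t - d_t

    % Backward pass: accumulate gradients with weights shared across time.
    dV = e * x(:, 2:end)'; dW = zeros(size(W)); dU = zeros(size(U));
    gx = zeros(Nx, 1);                       % dE/dx_t, propagated backwards
    for t = T:-1:1
        gx = V' * e(:, t) + gx;              % error entering x_t (readout + future steps)
        gu = gx .* (1 - x(:, t + 1).^2);     % through the tanh nonlinearity
        dW = dW + gu * x(:, t)';
        dU = dU + gu * a(:, t)';
        gx = W' * gu;                        % pass the error on to x_{t-1}
    end
    eta = 0.01;                              % learning rate (assumed)
    W = W - eta * dW; U = U - eta * dU; V = V - eta * dV;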

  • Working-memory related activity in parietal cortex.

    Gnadt & Andersen (1988) Exp Brain Res

  • Temporal unfolding: backpropagation through time (BPTT)

    Zipser (1991) Neural Comput

  • Temporal unfolding: backpropagation through time (BPTT)

    Zipser (1991) Neural Comput

    [Figure: comparison of memory-related unit activities in the model and in the experiment.]

  • Spike pattern discrimination in humans.

    Johansson & Birznieks (2004); Johansson & Flanagan (2009)

  • Spike pattern discrimination in dendrites.

    Branco et al. (2010) Science

  • Tempotron: Spike-based perceptron.

    Consider five neurons, each emitting one spike but at different timings.

    Rate coding: information is coded in the number of spikes in a given period.

    (r_1, r_2, r_3, r_4, r_5) = (1, 1, 1, 1, 1)

    Temporal coding: information is coded in the temporal pattern of spiking.

  • Tempotron: Spike-based perceptron.

    Consider five neurons, each emitting one spike but at different timings.

  • Tempotron: Spike-based perceptron.

    Basic idea: expand the spike pattern in time. With N input neurons and T time bins, a spike pattern becomes an NT-dimensional vector, and a perceptron can now be applied to this expanded representation.

  • Tempotron: Spike-based perceptron.

    Each input spike at time t_i contributes a postsynaptic potential weighted by the synaptic efficacy w_i of that input; with an exponential kernel, the membrane potential is

    V(t) = \sum_i w_i \sum_{t_i < t} e^{-(t - t_i)/\tau}

    and a spike pattern is classified as positive if V(t) crosses the firing threshold at some time, and as negative otherwise.

    [The slide compares the weighted sums of exponential kernels evoked by two different spike patterns of two input neurons, e.g. w_1 (e^{-t_1/\tau} + e^{-t_3/\tau}) + w_2 e^{-t_2/\tau}, against the threshold.]
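    A small MATLAB sketch of this decision rule (an illustration, not the lecture's code): it builds V(t) from exponential kernels for an assumed set of spike times and weights and checks whether the threshold is crossed.

    % Tempotron-style decision: does the summed postsynaptic potential cross threshold?
    tau = 10; dt = 0.1; tgrid = 0:dt:100;    % time constant and grid (assumed, ms)
    spikes = {[10 40], 25, [15 60]};         % spike times of 3 input neurons (assumed)
    w = [0.6; -0.3; 0.5];                    % synaptic weights (assumed)
    theta = 0.8;                             % firing threshold (assumed)

    V = zeros(size(tgrid));
    for i = 1:numel(spikes)
        for ts = spikes{i}
            V = V + w(i) * exp(-(tgrid - ts) / tau) .* (tgrid >= ts);
        end
    end
    isPositive = any(V > theta);             % classify the pattern by threshold crossing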


  • Learning a tempotron: intuition.

    [The slide shows the same two-pattern comparison as above: the weighted sums of exponential kernels evoked by the two spike patterns, compared against the threshold.]

    What went wrong if the second pattern was misclassified?

    The last spike of neuron #1 (the red one) is most responsible for the error, so the synaptic strength of this neuron should be reduced: \Delta w_1 < 0.


  • Exercise: Capacity of perceptron.

    Generate a set of random vectors.

    Write code for the perceptron learning algorithm.

    By randomly relabeling the vectors, count how many labelings are linearly separable (a MATLAB sketch follows below).

    Rigotti, M., Barak, O., Warden, M. R., Wang, X. J., Daw, N. D., Miller, E. K., & Fusi, S. (2013). The importance of mixed selectivity in complex cognitive tasks. Nature, 497(7451), 585-590.
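    A possible MATLAB sketch for this exercise (all parameters are assumptions): it draws random vectors, assigns random labels, runs a perceptron learning loop, and estimates the fraction of separable labelings; since the loop is capped at maxIter passes, hard-to-separate cases may be slightly undercounted.

    % Estimate the fraction of random dichotomies that a perceptron (no bias) can learn.
    N = 10; P = 20; nTrials = 200; maxIter = 500;
    nSeparable = 0;
    for trial = 1:nTrials
        X = randn(P, N);                     % P random vectors in N dimensions
        d = double(rand(P, 1) > 0.5);        % random relabeling into two classes
        w = zeros(N, 1);
        for it = 1:maxIter                   % perceptron learning (eta = 1)
            y = double(X * w >= 0);
            if all(y == d), break; end
            for p = 1:P
                yp = double(X(p, :) * w >= 0);
                w = w + (d(p) - yp) * X(p, :)';
            end
        end
        nSeparable = nSeparable + all(double(X * w >= 0) == d);
    end
    % Compare with Cover's prediction C(P, N) / 2^P (here P = 2N, so it equals 1/2).
    coverFrac = 2 * sum(arrayfun(@(k) nchoosek(P - 1, k), 0:N - 1)) / 2^P;
    fprintf('Estimated: %.2f, Cover: %.2f\n', nSeparable / nTrials, coverFrac);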

  • Exercise: Training of recurrent neural networks.

    Goal: investigate the effects of chaos and feedback in a recurrent network.

    Recurrent dynamics without feedback:

    \mathbf{x}_{n+1} = \mathbf{x}_n + \Delta t \left( -\mathbf{x}_n + \mathbf{M} \mathbf{r}_n \right), \qquad \mathbf{r}_n = \tanh(\mathbf{x}_n), \qquad z_n = \mathbf{w}_n^{\mathrm{T}} \mathbf{r}_n

    Update of the (inverse) covariance matrix:

    \mathbf{P}_0 = \frac{1}{\alpha} \mathbf{I}, \qquad \mathbf{P}_{n+1} = \mathbf{P}_n - \frac{\mathbf{P}_n \mathbf{r}_n \mathbf{r}_n^{\mathrm{T}} \mathbf{P}_n}{1 + \mathbf{r}_n^{\mathrm{T}} \mathbf{P}_n \mathbf{r}_n}

    Update of the readout weights (e_n: error, f_n: target output):

    e_n = z_n - f_n, \qquad \mathbf{w}_{n+1} = \mathbf{w}_n - e_n \mathbf{P}_{n+1} \mathbf{r}_n

    force_internal_all2all.m

  • Exercise: Training of recurrent neural networks.

    Goal: investigate the effects of chaos and feedback in a recurrent network.

    Recurrent dynamics with feedback of the readout z_n:

    \mathbf{x}_{n+1} = \mathbf{x}_n + \Delta t \left( -\mathbf{x}_n + \mathbf{M} \mathbf{r}_n + \mathbf{w}_{\mathrm{f}} z_n \right), \qquad \mathbf{r}_n = \tanh(\mathbf{x}_n), \qquad z_n = \mathbf{w}_n^{\mathrm{T}} \mathbf{r}_n

    Update of the (inverse) covariance matrix:

    \mathbf{P}_0 = \frac{1}{\alpha} \mathbf{I}, \qquad \mathbf{P}_{n+1} = \mathbf{P}_n - \frac{\mathbf{P}_n \mathbf{r}_n \mathbf{r}_n^{\mathrm{T}} \mathbf{P}_n}{1 + \mathbf{r}_n^{\mathrm{T}} \mathbf{P}_n \mathbf{r}_n}

    Update of the readout weights:

    e_n = z_n - f_n, \qquad \mathbf{w}_{n+1} = \mathbf{w}_n - e_n \mathbf{P}_{n+1} \mathbf{r}_n

    force_external_feedback_loop.m
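    A condensed MATLAB sketch of this recursive-least-squares (FORCE-style) training loop with output feedback, following the update equations above; the network size, gain g, time step, regularization alpha, and the sine target are assumptions (see Sussillo & Abbott, 2009, and the scripts named on the slides for the full version).

    % FORCE-style training of a readout from a chaotic recurrent network (sketch).
    N = 500; g = 1.5; dt = 0.1; alpha = 1; T = 5000;
    M = g * randn(N) / sqrt(N);              % recurrent weight matrix
    wf = 2 * (rand(N, 1) - 0.5);             % feedback weights
    w = zeros(N, 1);                         % readout weights to be learned
    P = eye(N) / alpha;                      % inverse correlation matrix estimate
    x = 0.5 * randn(N, 1);
    f = @(n) sin(2 * pi * n * dt / 50);      % target time series (assumed)

    for n = 1:T
        r = tanh(x);
        z = w' * r;
        x = x + dt * (-x + M * r + wf * z);  % leaky dynamics with output feedback
        % Recursive least-squares updates of P and w:
        Pr = P * r;
        P = P - (Pr * Pr') / (1 + r' * Pr);
        e = z - f(n);
        w = w - e * (P * r);
    end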

  • Exercise: Training of recurrent neural networks.

    Goal: Investigate the effects of chaos and feedback in a recurrent network.

    Investigate the effect of output feedback. Are there any differences in the activities of the recurrent units?

    Investigate the effect of the gain parameter g. What happens if the gain parameter is smaller than 1?

    Try to approximate some other time series, such as chaotic ones. Use the Lorenz model, for example.

  • References

    Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.

    Sejnowski, T. J., & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex systems, 1(1), 145-168.

    Funahashi, K. I. (1989). On the approximate realization of continuous mappings by neural networks. Neural networks, 2(3), 183-192.

    Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Kolen, J. F., & Kremer, S. C. (Eds.), A Field Guide to Dynamical Recurrent Networks. IEEE Press.

    Zipser, D. (1991). Recurrent network model of the neural mechanism of short-term active memory. Neural Computation, 3(2), 179-193.

    Johansson, R. S., & Birznieks, I. (2004). First spikes in ensembles of human tactile afferents code complex spatial fingertip events. Nature neuroscience, 7(2), 170-177.

    Branco, T., Clark, B. A., & Häusser, M. (2010). Dendritic discrimination of temporal input sequences in cortical neurons. Science, 329(5999), 1671-1675.

    Gütig, R., & Sompolinsky, H. (2006). The tempotron: a neuron that learns spike timing-based decisions. Nature Neuroscience, 9(3), 420-428.

    Sussillo, D., & Abbott, L. F. (2009). Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4), 544-557.
