Bioinformatics Paper Presentation


DESCRIPTION

Paper presentation on a birdsong-to-human-speech recognition model. Neural networks, a Bayesian model, and different neural algorithms have been applied.


  • BIOINFORMATICS PRESENTATION

    By Pranav Bhat 11CO66

    Pruthvi P 11CO69

  • From Birdsong to Human Speech Recognition: Bayesian Inference on a Hierarchy of Nonlinear Dynamical Systems

    By: Izzet B. Yildiz, Katharina von Kriegstein, Stefan J. Kiebel

  • Main AIM of the paper.

    To translate a BIRDSONG model to HUMAN SPEECH recognition.

  • BIRDSONG MODEL

    The birdsong model performs a Bayesian version of dynamical, predictive coding based on an internal generative model of how birdsong is produced.

    The core of this generative model consists of a two-level hierarchy of nonlinear dynamical systems and is the proposed mechanistic basis of how songbirds extract online information from an ongoing song.

    We translated this birdsong model to human sound recognition by replacing songbird related parts with human-specific parts.

    This included processing the input with a human cochlea model, which maps sound waves to neuronal activity. The resulting model is able to learn and recognize any sequence of sounds, such as speech or music, even under adverse noise conditions, and hence gives insight into the development of automated speech recognition.

  • Overall Structure of the paper

    First, inspired by songbird circuitry, it proposes a mechanistic hypothesis about how humans recognize speech using nonlinear dynamical systems.

    Secondly, if the resulting speech recognition system shows good performance, even under adverse conditions, it may be used to optimize automatic speech recognition.

    Thirdly, the neurobiological plausibility of the model would allow it to be used to derive predictions for neurobiological experiments.

  • BRIEF OVERVIEW OF THE PAPER

    We translated these two levels to the human speech model in the present study. The second, higher level encodes a recurrent neural network producing a sequential activation of neurons in a winner-less competition setting (stable heteroclinic channels). These dynamic sequences control the dynamics at a first, lower level, where we model amplitude variations in specific frequency bands. In comparison to the birdsong model, the generative model here does not explicitly model the vocal tract dynamics but rather the dynamics at the cochlea which would be elicited by the stimulus.

  • Prerequisites: Structure of the ear

    Importance of the cochlea

    1. It is a spiral-shaped peripheral organ in the inner ear.

    2. It is an important part of the auditory system, converting acoustic sound waves into neural signals.

    3. Sound coming from the ear canal vibrates the cochlea and is thus converted into frequency-specific neural signals.

    4. Frequency specificity comes from the differential stiffness of the basilar membrane, which extends along the cochlea.

    5. Its base is thick and responds to higher frequencies, while its apex is thin and responds to lower frequencies.

  • Prerequisites: Cochleogram

    Cochleogram representing the firing rate of the auditory nerve at each time point and frequency (frequency × time).

    [Lyon passive ear model with 86 channels]
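
    The cochleogram used here comes from Lyon's passive ear model with 86 channels. Purely as an illustration of the idea (per-channel firing rates over time), a minimal stand-in can be sketched with a log-spaced band-pass filter bank and smoothed envelopes; the filter design, band edges and frame length below are assumptions, not the Lyon model.

```python
# Crude cochleogram stand-in: a log-spaced band-pass filter bank with
# envelope extraction. This is NOT the Lyon passive ear model cited above;
# it only illustrates the "firing rate per frequency channel over time" idea.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def rough_cochleogram(wave, sr, n_channels=86, f_lo=100.0, f_hi=6000.0,
                      frame_len=0.01):
    """Return an (n_channels, n_frames) array of band-limited envelopes."""
    edges = np.geomspace(f_lo, f_hi, n_channels + 1)   # log-spaced band edges
    hop = int(frame_len * sr)
    n_frames = len(wave) // hop
    coch = np.zeros((n_channels, n_frames))
    for ch in range(n_channels):
        sos = butter(2, [edges[ch], edges[ch + 1]], btype="bandpass",
                     fs=sr, output="sos")
        band = sosfiltfilt(sos, wave)
        envelope = np.abs(band)                        # full-wave rectification
        # average the envelope in short frames -> "firing rate" per time bin
        coch[ch] = envelope[:n_frames * hop].reshape(n_frames, hop).mean(axis=1)
    return coch

# Example: a synthetic two-harmonic tone, 1 s at 16 kHz
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 660 * t)
print(rough_cochleogram(wave, sr).shape)               # (86, 100)
```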

  • Model: Conceptual overview

    1. The Bayesian approach builds a generative model, which is then converted into a recognition and learning model.

    2. Compared to other models, it is hierarchically structured (two levels for this model), nonlinear and dynamic, and can be tailored to one's specific needs.

    3. It is more flexible than other models such as Markov models, deep belief networks, liquid state machines, TRACE and Shortlist.

    4. For this model, the firing patterns in the premotor area are considered.

  • Key terms

    Modules: A module is a mechanism based on Bayesian inference which can learn and recognize a single word. It is like a sophisticated template matcher where the template is learned and stored in a hierarchically structured recurrent neural network and compared against a stimulus in an online fashion. Each module contains the two-level model described shortly.

    Prediction message: a prediction passed from a higher level down to a lower level via backward connections.

    Prediction error message: a prediction error passed from a lower level up to a higher level via forward connections.

  • Agent: a group of individual modules which together achieve a common classification task, such as a word recognition task. Here we show how precision settings in agents are crucial for learning new stimuli or recognizing sounds in a noisy environment.

  • Mathematical details of the model

    Level 2: Sequential dynamics (winnerless competition setting). This consists of a group of N equilibrium points (saddle points), each having one unstable direction pointing to the next equilibrium point while all other directions are stable; together they form a stable heteroclinic channel. The sequential firing can be pictured as a game of musical chairs: at any moment one neural ensemble is the winner, and activity is then handed on to the next.

    These dynamics are captured by the mathematical model for the second level described next.

  • Mathematical model for Level 2

    Significance of the terms: S(x) = 1/(1 + e^(-x)) is a sigmoid function applied component-wise to the hidden-state vectors x and y used at the second level, which describe the stable heteroclinic channel; y acts as a normalizing function for x, restricting the range to [0, 1]. y uses exponential functions for sharpness, to avoid overlap of signals and hence keep the neuronal responses selective.

    v is the set of causal states v(i) used to transmit the output from Level 2 to Level 1, and x, y and v all include normally distributed noise terms to keep the model realistic.

  • The connectivity matrix encodes the strength of inhibition from neuron j to neuron i (ρij). The strengths are chosen so that there is high inhibition from the currently active neuron onto the previously active neuron and low inhibition from the currently active neuron onto the next neuron in the sequence, so activity is handed on from one neuron to the next.

    Each second-level wave, called an ensemble, sends a signal Ik to the first level, and the total signal to the first level is the sum of these contributions over all ensembles.
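
    For intuition, winnerless-competition dynamics of this kind are often illustrated with a generalized Lotka-Volterra system whose asymmetric inhibition matrix hands activity from one ensemble to the next. The sketch below is only such an illustration; the inhibition values, noise level and integration scheme are assumptions, not the paper's second-level equations (which additionally involve the sigmoid S(x) and the causal states v described above).

```python
# Illustrative winnerless competition: a generalized Lotka-Volterra system
# with an asymmetric inhibition matrix rho. Weak inhibition from the
# currently active ensemble onto its successor lets activity travel along
# a stable heteroclinic channel. Not the paper's exact second-level model.
import numpy as np

def simulate_shc(n=6, steps=5000, dt=0.01, noise=0.01, seed=0):
    rng = np.random.default_rng(seed)
    rho = np.full((n, n), 1.5)           # strong mutual inhibition by default
    np.fill_diagonal(rho, 1.0)           # unit self-inhibition
    for i in range(n):
        rho[(i + 1) % n, i] = 0.2        # weak inhibition onto the next ensemble
    y = np.full(n, 0.05)
    y[0] = 1.0                           # start with ensemble 0 as the "winner"
    trace = np.empty((steps, n))
    for t in range(steps):
        dy = y * (1.0 - rho @ y) + noise * rng.standard_normal(n)
        y = np.clip(y + dt * dy, 0.0, None)   # Euler step, keep activities >= 0
        trace[t] = y
    return trace

trace = simulate_shc()
print(trace[::500].argmax(axis=1))       # index of the "winner" advances in sequence
```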

  • Mathematical model for Level 1 (spectro-temporal dynamics)

    Here firing rates are encoded by first level activity.

    A specific input I from the second level attracts the activity of the module network to a global attractor encoding a specific spectral pattern in the cochleogram; as the input changes continuously over time, so does the encoded pattern.

    The dynamics are modeled as a Hopfield network, similar to an associative memory.

    In the equations used here, the sigmoid function φ is the tanh function and K1 = 2.

    The dimension is n = 6, since there are six samples.
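
    As a toy illustration of such Hopfield-style attractor dynamics, the sketch below uses the slide's tanh nonlinearity, K1 = 2 and n = 6, but the stored pattern, weight matrix and Euler integration are hypothetical choices rather than the paper's first-level equations.

```python
# Minimal sketch of first-level "Hopfield-like" dynamics: activity x is
# pulled toward a stored spectral pattern, nudged by the input I coming
# from the second level. Weights and integration are illustrative only.
import numpy as np

n, K1 = 6, 2.0                                   # slide values: n = 6, K1 = 2
pattern = np.array([1, -1, 1, 1, -1, -1.0])      # hypothetical spectral template
W = np.outer(pattern, pattern) / n               # Hebbian outer-product storage
np.fill_diagonal(W, 0.0)

def run_level1(I, dt=0.05, steps=400):
    """Integrate dx/dt = -x + K1 * W @ tanh(x) + I with forward Euler."""
    x = np.zeros(n)
    for _ in range(steps):
        x = x + dt * (-x + K1 * (W @ np.tanh(x)) + I)
    return x

# A second-level input that weakly points toward the template is enough to
# make the network settle into (a scaled version of) the stored pattern.
I = 0.2 * pattern
print(np.sign(run_level1(I)))                    # matches the template's sign pattern
```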

  • Mathematical model for learning and recognition

    For a given speech stimulus z and model m, the following terms are used:

    1. The model evidence / marginal likelihood p(z | m): the probability of observing the stimulus z under model m.

    2. The posterior density p(u | z, m): the distribution of the variables x, v and Ik, denoted together as u = {x, v, Ik}, given the stimulus.

    3. The free energy F(q, z): a lower bound on the log model evidence, defined for an approximate density q(u).

  • It can be seen that maximizing F(q, z) minimizes D(q || p), thus making q(u) an approximation of p(u | z, m). Here q(u) is assumed to follow the Laplace approximation, i.e. it is taken to be a Gaussian centred on its mode.
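
    The relationship behind this statement is the standard variational free-energy identity, written here in the slides' notation with stimulus z and states u = {x, v, Ik}; this is a textbook identity rather than a formula quoted from the paper.

```latex
% Log evidence splits into free energy plus a KL divergence.
\ln p(z \mid m) \;=\; F(q, z) \;+\; D_{\mathrm{KL}}\!\bigl(q(u)\,\|\,p(u \mid z, m)\bigr),
\qquad
F(q, z) \;=\; \int q(u)\,\ln\frac{p(z, u \mid m)}{q(u)}\,du .
```

    Because the KL term is non-negative, F is a lower bound on ln p(z | m); maximizing F over q therefore both scores the model and drives q(u) toward the posterior p(u | z, m).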

  • Here we use the concept of precision: high precision for a variable means that its prediction errors carry greater weight and only small deviations from expectations are tolerated, while low precision allows larger deviations.

    The above maximization of F can be written for the hierarchical setting as follows.

  • A message passing scheme can be used to find the optimal mode and variance of the states, where the optimization problem turns into a gradient descent on precision-weighted prediction errors governed by the following equations.

    High precision for a variable means an amplified prediction error and hence only small errors are tolerated; low precision involves more approximation and a larger tolerance.
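
    As an illustration of the idea, a generic two-level, precision-weighted prediction-error scheme (omitting the dynamical, generalised-coordinate terms used in the full model) can be written as below; the symbols g, μ and Π follow generic predictive-coding notation and are not the paper's exact equations. Descending the precision-weighted squared errors is equivalent to ascending F.

```latex
% Precision-weighted prediction errors and gradient updates (generic form).
\varepsilon^{(1)} = z - g^{(1)}\!\bigl(\mu^{(1)}\bigr), \qquad
\varepsilon^{(2)} = \mu^{(1)} - g^{(2)}\!\bigl(\mu^{(2)}\bigr)

F \approx -\tfrac{1}{2}\Bigl(
  \varepsilon^{(1)\top}\Pi^{(1)}\varepsilon^{(1)}
  + \varepsilon^{(2)\top}\Pi^{(2)}\varepsilon^{(2)} \Bigr) + \text{const}

\dot{\mu}^{(1)} \propto
  \Bigl(\partial g^{(1)}/\partial \mu^{(1)}\Bigr)^{\!\top}\Pi^{(1)}\varepsilon^{(1)}
  - \Pi^{(2)}\varepsilon^{(2)},
\qquad
\dot{\mu}^{(2)} \propto
  \Bigl(\partial g^{(2)}/\partial \mu^{(2)}\Bigr)^{\!\top}\Pi^{(2)}\varepsilon^{(2)}
```

    In this form it is easy to see that a large precision Π amplifies its error term in the updates, which is exactly the point made above.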

  • Implementation

    The above Bayesian inference can be implemented neurobiologically using two types of neuronal ensembles. The modes of the expected causal and hidden states can be represented by the neural activity of state ensembles.

    Prediction errors are encoded by the activity of error ensembles, with a one-to-one correspondence to the state ensembles.

    These messages can be passed via forward and lateral connections, and error units can be identified with superficial pyramidal cells. This message passing scheme efficiently minimizes prediction errors and optimizes predictions at all levels; the simulations use academic freeware as the software backbone.

  • Results

    A Bayesian model for learning and online recognition of human speech. The two main modes of this model are:

    Learning: the feedback parameters from the second level to the first level are allowed to change (slower, not online).

    Recognition: the parameters are fixed and the model only reconstructs the hidden dynamics (online).

  • How learning and recognition are done

    It starts with sensation: the speech sound wave passes through the cochlea model and becomes a dynamic input to the model.

    The speech signal is preprocessed by the cochlea model into z(t), which then reaches the first level of the module. Each module infers the states of the first and second levels (recognition) and learns the connection weights from the second to the first level (learning).

  • Both levels of the module contain neuronal populations which encode expectations about the sensory input from the cochlea.

    These expectations predict the neuronal activity at the level below.

  • Error minimization

    z(t) from the cochlea model is compared to the prediction of the first level.

    The prediction errors are then computed and propagated to the second level.

    Both levels adjust their internal predictions accordingly, weighted by their assigned precisions.

    Similarly, the second level forms predictions which are sent down to the first level (only possible if the backward connections are appropriate).
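
    The error-minimization loop just described can be summarized in a schematic sketch. The function and variable names (recognize, g1, g2, the learning rate and the precisions pi1, pi2) and the reduction of both levels to static vectors are hypothetical simplifications, not the paper's implementation.

```python
# Schematic recognition loop: at each time step predictions flow down,
# precision-weighted prediction errors flow up, and both levels nudge
# their expectations to reduce the errors. Illustrative only.
import numpy as np

def recognize(z_seq, g1, g2, pi1=10.0, pi2=1.0, lr=0.05):
    """z_seq: (T, d) cochleogram frames; g1 maps the level-1 state to a
    predicted frame, g2 maps the level-2 state to a predicted level-1 state."""
    mu1 = np.zeros(z_seq.shape[1])          # level-1 expectation (spectral state)
    mu2 = np.zeros(z_seq.shape[1])          # level-2 expectation (sequence state)
    total_error = 0.0
    for z in z_seq:
        e1 = z - g1(mu1)                    # sensory prediction error
        e2 = mu1 - g2(mu2)                  # level-1 vs level-2 prediction error
        mu1 += lr * (pi1 * e1 - pi2 * e2)   # precision-weighted state updates
        mu2 += lr * pi2 * e2
        total_error += pi1 * e1 @ e1 + pi2 * e2 @ e2
    return total_error                      # low total error = good explanation

# Toy usage with identity prediction functions, purely for illustration
frames = np.random.default_rng(0).random((50, 6))
print(recognize(frames, g1=lambda x: x, g2=lambda x: x))
```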

  • Learning

    Compared to recognition, learning is not online.

    It does not happen over the course of the complete stimulus.

    Rather, for learning, the prediction errors are summed over the whole stimulus duration and used after the stimulus presentation to update the parameters.
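
    Schematically, this "sum the errors first, update afterwards" step could look like the sketch below, assuming a hypothetical linear backward mapping W21 from second-level to first-level states; it illustrates the batch idea, not the paper's learning rule.

```python
# Schematic batch learning step: prediction errors are accumulated over the
# whole stimulus, and only afterwards are the backward weights W21
# (level 2 -> level 1) updated. Hypothetical linear mapping.
import numpy as np

def learn_backward_weights(mu1_seq, mu2_seq, W21, eta=0.01):
    """mu1_seq, mu2_seq: per-frame inferred states; W21: (d1, d2) weights."""
    grad = np.zeros_like(W21)
    for mu1, mu2 in zip(mu1_seq, mu2_seq):
        e2 = mu1 - W21 @ mu2            # level-1 vs level-2 prediction error
        grad += np.outer(e2, mu2)       # accumulate over the whole stimulus
    return W21 + eta * grad             # single update after the stimulus ends

# Toy usage with random "inferred states"
rng = np.random.default_rng(1)
mu1s, mu2s = rng.random((50, 6)), rng.random((50, 6))
W21 = learn_backward_weights(mu1s, mu2s, np.zeros((6, 6)))
```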

  • Testing

    Learning speech

  • The relatively high precision forces each module to closely match the external stimulus, i.e., minimize the prediction error about the sensory input, and allow for more prediction error on the internal dynamics.

    To reduce these prediction errors, each module is forced to adapt the backward connections from the second level to the first level, which are free parameters in the model .

    This automatic optimization process iterates until the prediction error can no longer be reduced, and is typically complete after five to six repetitions of a word.

  • Word recognition task

    Samples: ten samples of each of the ten digit words (zero to nine), spoken by five female speakers, adding up to a total of 500 speech samples.

    For classification, we used a winner-take-all process where the winner was the module with the lowest prediction error, i.e. the module which can best explain the sensory input using its internal model.

    Average word error rate (WER) = 1.6
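
    A winner-take-all decision of this kind is simple to express; the sketch below reuses the hypothetical recognize function from the recognition-loop sketch above and is not the paper's implementation.

```python
# Winner-take-all classification: the module whose internal model yields
# the lowest accumulated prediction error explains the stimulus best.
def classify(z_seq, modules):
    """modules: dict mapping a word label to that module's (g1, g2) pair;
    `recognize` is the schematic recognition loop sketched earlier."""
    errors = {word: recognize(z_seq, g1, g2)
              for word, (g1, g2) in modules.items()}
    return min(errors, key=errors.get)
```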

  • Robustness of the system against noise

    Signal-to-noise ratio    WER
    30 dB                    3.6
    20 dB                    5.0
    10 dB                    11.2

  • Variation in speech rate

    A sample compressed by 25% was given to a module trained on the digit "eight".

    Result: the sample was still recognized.

    Inference: the module is inherently robust against variations in speech rate (because recognition works by reducing prediction errors online).

  • Recognition in a noisy environment

  • The target sentence "She argues with her sister." was presented to a module without a background speaker, with one background speaker, and with three background speakers.

  • Accent adaptation

    By adaptation we mean that the learning of the parameters in a module proceeds from a previously learned parameter set (the base accent), as opposed to learning from scratch as in the "Learning speech" simulation. Therefore, adaptation can be understood as slight changes to the backward connections instead of learning a completely new word.
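
    In code terms, adaptation amounts to a warm start: learning continues from the base-accent parameters instead of from scratch. The sketch below reuses the hypothetical learn_backward_weights function from the Learning section and toy data standing in for the inferred states.

```python
# Adaptation as warm-started learning: fine-tune the previously learned
# base-accent weights on the new accent instead of starting from zero.
import numpy as np

rng = np.random.default_rng(2)
mu1_new, mu2_new = rng.random((50, 6)), rng.random((50, 6))  # toy states for the new accent
W21_base = rng.random((6, 6))                # stands in for learned base-accent weights
W21_adapted = learn_backward_weights(mu1_new, mu2_new, W21_base, eta=0.005)
```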