Kaldi-voice: Your personal speech recognition server using open source code
Xavier Anguera, CTO & CSO, ELSA Corp. [email protected]


Outline
•  Intro
•  What is speech recognition
   –  Applications
•  Approaches to ASR
   –  Pattern matching approaches
   –  Statistical-based approaches
•  Available speech recognition engines
   –  "open" source
   –  Online commercial systems
•  Building your own online system
   –  Live demo

Automatic Speech Recognition

•  Automatic Speech Recognition (ASR) is the process of converting an unknown speech waveform into the corresponding orthographic transcription.

Image: http://blogs.msdn.com/b/devschool/archive/2012/02/06/speech-recognition-using-visual-studio-determining-the-bna.aspx

Content: search, summary, transcripts, meaning

Personal context: age, gender, height, spoken language, spoken dialect, spoken accent, literacy level, speaker ID, personality traits (OCEAN), speech likability, speech intelligibility, sleepiness/tiredness, intoxication level, emotion, state of interest

Image: Telefonica I+D

Applications of Speech Recognition/Understanding (ASR/ASU)

•  Dictation
•  Telephone-based information
   –  Directions, air travel, banking, etc.
   –  Polls, online shopping
   –  Call routing
•  Hands-free
   –  In car, computer, home (domotics), controlling tools
•  Second language (accent reduction)
•  Audio archive searching
•  Help for disabled people

How do humans do it?

The articulation system of one person produces sound waves, which the ear of another person conveys to the brain for processing.

How can computers do it?

•  Digitization
•  Acoustic analysis of the speech signal
•  Linguistic interpretation

Acoustic waveform, acoustic signal

Speech recognition
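
The digitization and acoustic analysis steps listed above can be sketched in plain Python. This is a minimal illustration, not the front end used by any particular engine: the 16 kHz sampling rate, 16-bit quantization, 25 ms frames with a 10 ms hop, and the two simple features (log energy and zero-crossing rate) are assumptions chosen for the example. Real systems typically compute richer features such as MFCCs.

```python
import math

def digitize(num_samples, sample_rate, freq_hz, bits=16):
    """Sample a pure tone and quantize it to signed integers (toy digitization)."""
    max_amp = 2 ** (bits - 1) - 1
    return [round(max_amp * math.sin(2 * math.pi * freq_hz * t / sample_rate))
            for t in range(num_samples)]

def frames(samples, frame_len, hop):
    """Split the signal into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame):
    """Log of the frame energy; the small constant avoids log(0) on silence."""
    return math.log(sum(x * x for x in frame) + 1e-10)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

# 0.1 s of a 440 Hz tone at 16 kHz (assumed values for the example).
signal = digitize(num_samples=1600, sample_rate=16000, freq_hz=440)

# 400 samples = 25 ms frames, 160 samples = 10 ms hop at 16 kHz.
feats = [(log_energy(f), zero_crossing_rate(f)) for f in frames(signal, 400, 160)]
```

Each entry of `feats` is one feature vector per frame; a recognizer works on this sequence of vectors rather than on the raw samples.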

Challenges in ASR processing

•  Speaker variability
   –  Inter-speaker: vocal tract, gender, dialects
   –  Intra-speaker: stress, age, mood, changes of articulation due to environment influence, …
•  Language variability
   –  From isolated words to continuous speech
   –  Out-of-vocabulary words
•  Vocabulary size and domain
   –  From just a few words (e.g. isolated numbers) to large vocabulary speech recognition
   –  Domain that is being recognized (medical, social, engineering, …)
•  Noise
   –  Convolutive: recording/transmission conditions, reverberation
   –  Additive: recording environment, transmission SNR
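
The two kinds of noise in the last item can be simulated in a few lines. The sketch below is an illustration under assumed toy signals: additive noise is scaled to a chosen signal-to-noise ratio (SNR, in dB) and summed with the speech, while convolutive distortion is modeled as convolution with a short, made-up impulse response. All function names and values are hypothetical.

```python
import math
import random

def power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels."""
    return 10 * math.log10(power(signal) / power(noise))

def add_noise(signal, noise, target_snr_db):
    """Scale the noise so the mixture has the requested SNR, then add it."""
    gain = math.sqrt(power(signal) / (power(noise) * 10 ** (target_snr_db / 10)))
    scaled = [gain * n for n in noise]
    return [s + n for s, n in zip(signal, scaled)], scaled

def convolve(signal, impulse_response):
    """Direct convolution: models a channel or room acting on the signal."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

rng = random.Random(0)
clean = [math.sin(0.1 * i) for i in range(1000)]      # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]    # additive noise source

noisy, scaled_noise = add_noise(clean, noise, target_snr_db=10.0)
reverberant = convolve(clean, [1.0, 0.0, 0.5, 0.0, 0.25])  # toy echo pattern
```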

Approaches to ASR

•  Pattern-based approaches
•  Statistics-based approaches

Pattern-based speech recognition

•  Feature measurement: filter bank, MFCC, LPC, DFT, ...
•  Pattern training: creation of a reference pattern derived from an averaging technique
•  Pattern classification: compare speech patterns with a local distance measure and a global time alignment procedure (DTW)
•  Decision logic: similarity scores are used to decide which is the best reference pattern

Template Matching Mechanism


Alignment Example
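
The pattern classification and decision logic steps can be sketched with a small dynamic time warping (DTW) routine. This is a minimal version under assumptions: features are single numbers instead of vectors, the local distance is the absolute difference, and the allowed moves are the three standard ones (diagonal, vertical, horizontal). The templates and the test utterance are invented for the example.

```python
def dtw(ref, test, dist=lambda a, b: abs(a - b)):
    """Return (total cost, alignment path) between two feature sequences."""
    n, m = len(ref), len(test)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning ref[:i] with test[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end to recover which frames were aligned.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    path.reverse()
    return cost[n][m], path

def recognize(templates, utterance):
    """Decision logic: pick the reference pattern with the lowest DTW cost."""
    return min(templates, key=lambda word: dtw(templates[word], utterance)[0])

# Hypothetical one-dimensional templates for two words.
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 1, 1]}
# A slower rendition of "yes": some frames are repeated.
utterance = [1, 1, 3, 4, 4, 3, 1]

best_word = recognize(templates, utterance)
yes_cost, yes_path = dtw(templates["yes"], utterance)
```

Because DTW may repeat frames on either side, the stretched utterance aligns with the "yes" template at zero cost even though the two sequences have different lengths.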

Statistics-based approaches

•  Can be seen as an extension of the template-based approach, using more powerful mathematical and statistical tools
•  Sometimes seen as an "anti-linguistic" approach
   –  Fred Jelinek (IBM, 1988): "Every time I fire a linguist my system improves"
•  Process:
   1.  Collect a large corpus of transcribed speech recordings
   2.  Train the computer to learn the correspondences ("machine learning")
   3.  At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one
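
Step 3 is commonly written as choosing the word sequence W that maximizes P(W | X), which by Bayes' rule is proportional to P(X | W) · P(W) for the observed audio X. The sketch below shows that decision on a tiny, invented hypothesis list; the log-probabilities are made-up numbers standing in for the scores a trained acoustic model and language model would produce.

```python
import math

def most_likely(hypotheses, acoustic_logp, prior_logp):
    """Pick the hypothesis W maximizing log P(X|W) + log P(W)."""
    return max(hypotheses, key=lambda w: acoustic_logp[w] + prior_logp[w])

hypotheses = ["recognize speech", "wreck a nice beach"]

# Hypothetical acoustic scores log P(X|W): both sound almost the same.
acoustic_logp = {"recognize speech": -10.0, "wreck a nice beach": -9.5}

# Hypothetical language model scores log P(W): one phrase is far more common.
prior_logp = {"recognize speech": math.log(0.01),
              "wreck a nice beach": math.log(0.0001)}

best = most_likely(hypotheses, acoustic_logp, prior_logp)
```

Here the second hypothesis has a slightly better acoustic score, but the prior tips the decision toward the first one.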

Statistics-based approaches

•  Hidden Markov Models (HMM)
•  Gaussian Mixture Models (GMM)
•  Deep Neural Networks (DNN)

Markov model

Output = sequence of states

Image: http://madhukaudantha.blogspot.pt/2014/05/markov-models-and-hidden-markov-models.html
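
A Markov model whose output is the state sequence itself can be sketched as follows. The two weather states and all probabilities are invented for illustration; the code computes the probability of a given state sequence and samples a new one.

```python
import random

# Hypothetical two-state model.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {"sunny": {"sunny": 0.8, "rainy": 0.2},
              "rainy": {"sunny": 0.3, "rainy": 0.7}}

def sequence_probability(states):
    """P(s1) * P(s2|s1) * ... for an observed sequence of states."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

def sample_sequence(length, rng):
    """Generate a state sequence by following the transition probabilities."""
    def draw(dist):
        return rng.choices(list(dist), weights=list(dist.values()))[0]
    states = [draw(initial)]
    while len(states) < length:
        states.append(draw(transition[states[-1]]))
    return states

prob = sequence_probability(["sunny", "sunny", "rainy"])  # 0.6 * 0.8 * 0.2
generated = sample_sequence(5, random.Random(0))
```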

Hidden Markov Models (HMM)

Output = observations linked to the states through a predefined probability distribution, modeled using GMM or DNN models

Image: http://izanami.tl.fukuoka-u.ac.jp/SLPL/HMM/HTKBook/node5.html
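
In an HMM the states are not observed directly, so the likelihood of an observation sequence has to sum over all possible state paths. The forward algorithm does this efficiently. The sketch below is a simplification: it uses discrete observation symbols with a lookup table per state, whereas in speech recognition the per-state distribution would be a GMM or DNN over acoustic feature vectors. States, symbols and probabilities are invented.

```python
def forward(observations, initial, transition, emission):
    """Likelihood of the observations, summed over all hidden state paths."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: initial[s] * emission[s][observations[0]] for s in initial}
    for obs in observations[1:]:
        alpha = {s: emission[s][obs] * sum(alpha[r] * transition[r][s] for r in alpha)
                 for s in initial}
    return sum(alpha.values())

# Hypothetical model: hidden weather, observed behaviour.
initial = {"sunny": 0.6, "rainy": 0.4}
transition = {"sunny": {"sunny": 0.8, "rainy": 0.2},
              "rainy": {"sunny": 0.3, "rainy": 0.7}}
emission = {"sunny": {"walk": 0.9, "umbrella": 0.1},
            "rainy": {"walk": 0.2, "umbrella": 0.8}}

likelihood = forward(["walk", "umbrella"], initial, transition, emission)
```

A word recognizer built this way keeps one HMM per word (or per phone) and picks the model that gives the highest likelihood for the incoming features.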


HMMs for some words

Gaussian Mixture Models (GMM)

1D GMM, 2D GMM
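
A one-dimensional GMM is a weighted sum of Gaussian densities whose weights add up to one. The sketch below evaluates such a density; the two components and their parameters are arbitrary values chosen for the example. The 2D case follows the same idea with mean vectors and covariance matrices, which is not shown here.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Weighted sum of Gaussian components; the weights must sum to 1."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical two-component mixture.
weights = [0.3, 0.7]
means = [0.0, 4.0]
variances = [1.0, 2.0]

density_at_mean = gmm_pdf(4.0, weights, means, variances)
```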

Deep neural networks

Image: http://www.amax.com/blog/

A neuron in our brain

Image: http://www.medicalsciencenavigator.com/how-to-study-for-anatomy-and-physiology/why-sleep-improves-memory

Classical representation of a neuron
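
The classical artificial neuron computes a weighted sum of its inputs plus a bias and passes the result through a nonlinearity. The sketch below assumes a sigmoid activation, which is one common choice and not necessarily the one drawn on the slide; the weights are arbitrary. Stacking several such layers gives a multilayer network.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, followed by the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all see the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical network: 2 inputs, one hidden layer of 2 neurons, 1 output.
hidden = layer([1.0, 2.0], [[0.5, -0.25], [1.0, 1.0]], [0.0, -3.0])
output = neuron(hidden, [1.0, -1.0], 0.0)
```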

Long short-term memory cells

DNN evolution

•  We started to use multilayer perceptrons (MLPs) about 25 years ago [1]
   –  Neural networks with one or a few hidden layers
•  Around 2010 G. Hinton and S. Bengio (separately) proposed methods to effectively train many hidden layers
   –  Machines have become much more powerful
   –  Lots of audio data with transcriptions are available

[1] "Merging Multilayer Perceptrons and Hidden Markov Models: Some Experiments in Continuous Speech Recognition", Herve Bourlard and Nelson Morgan, Technical Report, ICSI, 1989

Image: http://whatsnext.nuance.com/category/in-the-labs/

Processing power evolution

Image: http://whatsnext.nuance.com/category/in-the-labs/

ASR performance evolution

Speech recognition engines

•  HTK (http://htk.eng.cam.ac.uk/), non-commercial license

•  Sphinx (http://cmusphinx.sourceforge.net/), BSD-style license

•  Julius (http://julius.osdn.jp/en_index.php), open

•  Kaldi (http://www.kaldi-asr.org/), Apache license

Online ASR-STT services

•  Google voice (https://console.developers.google.com/project)

•  ATT voice recognition (http://developer.att.com/apis/speech)

•  Wit.ai (https://wit.ai/)

Building an ASR with open source tools

•  We need:
   –  Speech recognition engine
   –  Speech databases / models
   –  Online speech server
   –  Frontend interfaces

Kōnele app

Dictate.js

My toolchain

•  Kaldi ASR: http://www.kaldi-asr.org/

•  Kaldi gstreamer server: https://github.com/alumae/kaldi-gstreamer-server

•  Dictate.js: http://kaljurand.github.io/dictate.js/

•  Kōnele app: https://kaljurand.github.io/K6nele/
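
Once the server in this toolchain is running, a client can send it an audio file over HTTP. The sketch below is written from memory of the kaldi-gstreamer-server README and should be treated as an assumption to check against that project: the host and port (`localhost:8888`), the `/client/dynamic/recognize` path, the use of `PUT`, and the JSON reply with `status` and `hypotheses[0].utterance` fields may differ in your installation. The function names are made up for this example.

```python
import json
import urllib.request

# Assumed endpoint; adjust host, port and path to your own server setup.
DEFAULT_URL = "http://localhost:8888/client/dynamic/recognize"

def build_request(audio_bytes, url=DEFAULT_URL):
    """Prepare an HTTP PUT request carrying the raw audio file contents."""
    return urllib.request.Request(url, data=audio_bytes, method="PUT")

def best_transcript(response_text):
    """Extract the top hypothesis from the (assumed) JSON reply format."""
    reply = json.loads(response_text)
    if reply.get("status") != 0 or not reply.get("hypotheses"):
        return None
    return reply["hypotheses"][0]["utterance"]

def recognize_file(path, url=DEFAULT_URL):
    """Send one audio file to the server and return the recognized text."""
    with open(path, "rb") as f:
        request = build_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return best_transcript(response.read().decode("utf-8"))
```

With a server listening at the assumed address, `recognize_file("recording.wav")` would return the transcript as a string, or `None` if the reply reports an error. Dictate.js and the Kōnele app play the same client role from the browser and from Android.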

Demo