Pfa 10.0 Beta (Ang)


  • 8/2/2019 Pfa 10.0 Beta (Ang)


    Acknowledgment

We would like to thank our supervisor, Mr. Kamel KHENISSI, for his valuable support throughout this work, and Ms. Wiem FRADI for the linguistic revision of our report.

    Ahmed BAHRI

    Moemen MANSOURI


    Abstract

"The voice is the technology of tomorrow" is the affirmation of specialists at giants such as Microsoft and IBM. In North America, which is decades ahead of the rest of the world in this field, speech technology is becoming the most natural mode of interaction with the machine: Windows 7, Microsoft's flagship product, is an excellent example. The maturity of speech synthesis and voice recognition technology has led researchers to the realization of an old dream: "the understanding of spontaneous speech by the machine".

The work developed during this project consists in creating an application to manipulate the MySQL database utility vocally.

The project focused on finding vocabularies that allow the manipulation of the MySQL database, as well as the vocal manipulation of Skype and Mozilla Firefox. For automatic speech recognition, we used a specific configuration of the framework chosen for this project, with the aim of obtaining the best result.

Table of contents


List of figures


    List of tables


    Glossary


    GENERAL INTRODUCTION

Speech recognition is a technology that allows computer software to interpret natural human language in order to control a well-defined system.


Early research in automatic speech recognition began about 40 years ago in the U.S., during the Cold War, with early attempts to create a machine capable of understanding human speech in order to interpret intercepted Russian messages. Since then, the development of speech recognition has continued to evolve, gaining great importance as it became widely used by:

- Large firms, for some of their internal applications, or in applications based on speech recognition (Dragon NaturallySpeaking, etc.). These applications generally use their own speech engine, and there are also companies that specialize in creating and selling such engines, for example Voxalead (still experimental).

- People with disabilities, by allowing them greater autonomy.

Speech recognition is also linked to many fields of science (natural language processing, linguistics, formal language theory, information theory, signal processing, neural networks, artificial intelligence, etc.).

In fact, this technology today represents a potential market in the software industry, since the combination of speech recognition and the PC has become an indispensable means of intellectual and social development.

As part of our End of Year Project, we decided to create a speech recognition system for controlling the MySQL database management system, which will facilitate its manipulation in further projects.

To design an automatic speech recognition (ASR) system that is as accurate as possible, we should:

Firstly, understand how complex the speech signal really is, i.e. know the object or observation given as input.

Secondly, properly define the task of the system, i.e. its constraints and expected performance.

We briefly present the project, then we expose the problems through a study of existing systems, then we present the different needs and the improvements required for the current system


and define the various specifications (platform, tools, ...). Finally, we propose a system that we deem appropriate.

    CHAPTER I: PRESENTATION OF THE PROJECT

    I- Context of the project


Nowadays, voice technology is spreading across different operating systems, and the need for this technology grows continually.

Within the context of our project, we applied this technology to the MySQL database utility with the aim of understanding how vocal applications work.

This project was carried out using the local resources of the Private Higher School of Engineering and Technology "ESPRIT".

    II- The choice of methodology

To carry out the project successfully, it is essential to establish a process that helps formalize the preliminary stages of developing a system, so as to make the development more faithful to the client's needs.

Given the number of available methods (2TUP, RUP, agile methods, ...), the choice is difficult; a project manager starting a project must ask:

How will I organize the development teams?

Which tasks are assigned to whom?

How long will it take to deliver the product?

How do we involve the client in development to capture his needs?

The following table shows the advantages and disadvantages of each methodology.


Table 1: Comparative table of design methodologies

Justification of our choice:

Given that our project is based on a well-defined development process, which determines the functional needs expected of the system up to the final design and coding, 2TUP appeared the most appropriate to lead and plan the sequence of stages of this project. The Two Tracks Unified Process responds to the constraints imposed by the continual change of companies' information systems.

III- Introduction to the 2TUP methodology:

2TUP is the abbreviation of "Two Track Unified Process". It is a process that follows the principles of the Unified Process. The 2TUP process responds to the constraints imposed by the continual change of companies' information systems; in this sense, it strengthens control over the evolution and correction capacities of such systems. "Two Track" literally means that the process follows two paths, or branches. These are the "functional" and "technical architecture" branches, which correspond to the two axes of change imposed on the information system.


Figure 1: Two types of constraints imposed on the information system

1- The functional branch:

This branch capitalizes the knowledge of the company's business. It generally constitutes a medium- and long-term investment. The functions of the information system are in fact independent of the technologies used. This branch includes the following steps:

1- The capture of functional needs, producing a model focused on the needs of the business users.

2- Functional analysis.

2- The technical branch:

This branch capitalizes the know-how. It is an investment for the short and medium term. The techniques developed for the system can indeed be independent of the functions to be performed. This branch includes the following steps:

1- Capture of technical needs.

2- Generic design.

3- The middle branch:

Following the developments of the functional model and the technical architecture, the implementation of the system merges the results of the two branches. This merger gives the development process its Y shape.

This branch includes the following steps:

1. Preliminary design.

2. Detailed design.

3. Coding.

4. Integration.


Figure 2: The Y development process


CHAPTER II: THE FUNCTIONAL PART

    I-Preliminary study

Figure 3: Preliminary study schema

As the diagram above shows, the preliminary study is the first step of 2TUP. It consists in performing an initial identification of the functional and operational needs, mainly using text. It prepares the more formal activities of functional and technical needs capture.

For our project, this study was achieved through the development of a specification. It examined the various systems already on the market and tried to identify their positive and negative sides in a critical part, in order to fix our main objectives, articulate the needs and identify the modules to maintain or improve thereafter.

The last stage of this study is the modeling of a context diagram.


    Figure 4: Functional Schema

    Description of the schema

1. The speaker emits a sentence; the sound is captured by a microphone.

2. The voice signal is then digitized using an analog-to-digital converter. The parameterization of the signal provides a fingerprint.

3. The decoding describes the acoustic signal in terms of linguistic units. It aims to segment the signal; the identification of the different segments is based on phonetic and linguistic constraints.

Once the analysis process is completed, the recognition phase begins. In fact, all the spoken words are separated by silences lasting more than a few tenths of a second. The recognition phase consists mainly of two phases:

1) The learning phase: the speaker pronounces the whole vocabulary, often several times, to create a reference dictionary.

2) The recognition phase: the speaker pronounces a word. To recognize the words emitted by the speaker there are three parts:

- First, the sensor, to capture the physical phenomenon; in our case it is the microphone. A signal is transmitted by the microphone when the speaker speaks.

- Second, the parameterization of forms, which gives us a fingerprint, that is to say the characteristics of the sound (time / frequency / intensity). And finally, the identification of


forms. A second schema is needed to better understand the different use cases that should be treated.

    Figure 5: Operating principle of speech recognition

This diagram shows the operating principle of recognition: a speaker pronounces a word of the vocabulary. Word recognition is then a typical problem of pattern recognition. Any pattern recognition system always involves the following three parts:

- A sensor for capturing the physical phenomenon under consideration (in our case a microphone).

- A parameterization stage for the shapes (e.g. a spectrum analyzer).

- A decision stage in charge of classifying an unknown form into one of the possible categories.
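As a toy illustration of these three parts (the code and the feature are our own invention, not the project's actual implementation), the sketch below replaces the spectrum analyzer with a deliberately crude characteristic, the mean absolute amplitude, and classifies an unknown form against a reference dictionary built during the learning phase:

```python
# Minimal sketch of the three-part pattern recognition pipeline:
# sensor (here, a canned sample), parameterization, and decision stage.
# The feature and the reference values are invented for illustration.

def parameterize(signal):
    """Reduce a raw signal to a single characteristic value."""
    return sum(abs(x) for x in signal) / len(signal)

def classify(signal, references):
    """Assign the signal to the reference category with the closest feature."""
    f = parameterize(signal)
    return min(references, key=lambda name: abs(references[name] - f))

# Reference "dictionary" built during the learning phase (invented values).
references = {"yes": 0.8, "no": 0.2}

print(classify([0.9, -0.7, 0.8, -0.75], references))  # prints "yes"
```

A real system would of course use spectral features and probabilistic matching rather than a single amplitude value; the point here is only the separation into capture, parameterization and decision.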

    II- Capture of the Functional needs

Once the preliminary study is done, we move to the next step, in which we determine the functional needs on the left branch and, in parallel, the technical needs on the right branch.


    1-Functional Needs:

This project consists in designing and implementing a tool for vocal manipulation of the database. First we will define the actors who will interact with the system. Considering the needs of our application, the main actors are reduced to an administrator and a user. The administrator is responsible for database creation and for maintaining the accounts of the individual users of the database; all these tasks must be completed by voice only. The user can access the database through his natural voice; after authentication he can issue DDL (data definition language) requests.

    Figure 6: Capture of the functional needs


The use case diagram reflects the principle of the overall functioning of our application and the various actions of the actors. The study of the needs of the actors who interact with our system requires the development of the following use cases:


    Figure 7: Use case diagram

We consider that two users are possible in our system: the administrator and the normal user. The administrator accesses all the existing use cases, including the manipulation of the database, whereas the user can only manipulate the database once it has been created.

a) Voice authentication:

The user pronounces his login and password to authenticate and access the main interface of MySQL.

The following table details the authentication process.

Title: Authentication

Intention: Authentication of the users.

Actors: Users.

Preconditions: MySQL available.

Starts when: The application is launched.

Definition of transitions: Pronounce the login and password.

Finishes when: The administrator or the user validates the session and connects.

Exception(s): Invalid user name. Invalid password. MySQL not found.


Postconditions: MySQL menu.

Table 2: nominal scenario of the use case "Voice Authentication"

b) Manipulate the database vocally:

Title: Manipulating the database

Summary: The user can create, modify or delete one or more databases, and create and execute DML (data manipulation language) queries vocally.

Actors: the application user.

The following table details the process of vocal manipulation of the base.

Title: Vocal manipulation of the database

Intention: Creating and manipulating a database in MySQL

Actors: User of the application

Preconditions: Authentication succeeded.

Starts when: The main window of MySQL opens.

Definition of transitions:

CASE 1: The user wants to create a database:

- Pronounce "create new schema".

- Say the name of the database to create.

- Confirm the selection.

CASE 2: The user wants to delete a database:


- Say the name of the database to drop.

- Pronounce "drop database".

- Confirm the selection.

CASE 3: The user wants to create a new table in the database:

- Select the database.

- Say "create new table".

- Say the name of the table to create.

- Confirm the selection.

CASE 4: The user wants to create a DML query:

- Pronounce the name of the table.

- Give the query to create.

- Confirm the selection.

CASE 5: The user wants to execute a DML query:

- Say the name of the table.

- Say the word "execute".

NB: for this case the user must write the DML query.

Finishes when: The user confirms his choice.

Exception(s): The name of the database or table already exists. Syntax error in SQL.

Table 3: nominal scenario of the use case "Vocal Manipulation"
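To illustrate how such recognized commands could end up as SQL, here is a minimal hypothetical sketch; the dispatch table and the helper name `to_sql` are our own, not part of the project's code, and only the command phrases come from the use cases above:

```python
# Hypothetical mapping from recognized voice commands to SQL statements.
# The command phrases mirror the grammar above ("create new schema",
# "drop database", ...); the templates are illustrative only.

def to_sql(command, name):
    """Translate a recognized command plus a spoken name into SQL."""
    templates = {
        "create new schema": "CREATE DATABASE {0}",
        "drop database":     "DROP DATABASE {0}",
        "create new table":  "CREATE TABLE {0} (id INT PRIMARY KEY)",
    }
    if command not in templates:
        raise ValueError("unknown voice command: " + command)
    return templates[command].format(name)

print(to_sql("create new schema", "inventory"))  # prints "CREATE DATABASE inventory"
```

In the actual application the recognized name would also need validation before being spliced into a statement, exactly as the "syntax error in SQL" exception above suggests.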

c) Manage users vocally:

Title: Managing users

Summary: The administrator can create, modify or delete a user account.

Actors: Administrator of the application.

The following table details the process of managing users.

Title: Manage the user

Intention: Creation, modification or deletion of the user account

Actors: Administrator

Preconditions: Authentication as administrator succeeded.

Starts when: The MySQL Administrator interface opens.

Definition of transitions:

CASE 1: The user wants to create a new user:

- Say "user administration".

- Say "add new user".

- Say the name and password.

- Say "apply changes".

CASE 2: The user wants to delete a user:


- Say "user administration".

- Pronounce the name of the user.

- Pronounce "drop user".

- Pronounce "ok".

CASE 3: The user wants to clone a user:

- Say "user administration".

- Pronounce the name of the user.

- Pronounce "clone user".

- Give the name and password of the new user.

- Say "ok".

Finishes when: The administrator completes and the session disconnects.

Exception(s): Invalid user name. Invalid password.

Table 4: nominal scenario of the use case "Manage user"

    1.2- The Activity diagram

1. The user pronounces a sentence through a microphone.

2. The voice signal is then analyzed using the model to obtain an acoustic signal.

3. The decoding describes the acoustic signal in terms of linguistic units; it aims to segment the signal.

4. The segmented signal is compared with the database (dictionary) thanks to the search graph.

5. The action is displayed on screen.


    Figure 8: Activity Diagram


    2 - Non-Functional Needs:

Besides the functional needs developed above, we must consider the following constraints:

- The service quality of the application.

- The ergonomics of the application: the interfaces of our application must be clear.


- The response time of the application should be minimal.

    III-Functional Analysis

    Figure 9: The functional analysis

1 - Cutting into categories

It consists of:

1) Dividing the classes into candidate categories.

2) Elaborating preliminary class diagrams by category.

3) Deciding the dependencies between categories.


    Figure 10: Cutting into categories

1.1- The package diagram:

The package diagram is a graphical representation of the relationships between the packages of the speech recognition system.


Figure 11: Package diagram

The most general package is "general media treatment", decomposed into two packages: "sound treatment" and "speech treatment".

- Sound is the wave that is audible to the human ear.

- Speech is the process of stretching and relaxing the vocal cords to produce sound.

The package we are interested in is "speech treatment", which in turn is divided into two sub-packages, "speech synthesis" and "speech recognition". Our development is based on "speech recognition".


    2 - Development of the static model

    Figure 12: nominal scenario of the use case "Voice Authentication"

2.1- Diagram of classes:

The different classes are:

- Caller

- Instruction

- Language Instruction

- Recording

- Speech Recognizer

- Feature

- Feature Extraction


- Feature Classification

- Feature Matching

- Code book

- Action

The various relationships are:

- Listen

- Record

- Send Speech Signal

- Perform

- Search And Match

- Contain

    Figure 13: Model participating Class Diagram


2.2- Description of the class diagram

The class "Caller" has a "Listen" relationship with the class "Instruction". The caller can listen to a type of instruction, which is "Language Instruction".

The class "Caller" is then associated through "Record" with the class "Recording". This class is in turn associated through "Send Speech Signal" with the class "Speech Recognizer".

The class "Speech Recognizer" is associated with the class "Feature" in a "Perform" relationship, which means that the class "Speech Recognizer" uses the class "Feature" for feature extraction, feature classification and feature matching. The class "Feature Matching" is associated through the "Search and Match" relationship with the class "Code book" to match the input speech, and is associated to the class "Action" with the "Contain" relation.

Note: To ensure clarity of the diagram we preferred not to show the attributes and methods of the classes.

We defined the physical model to consist of 3 classes that facilitate the implementation of our application.
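The classes and relations described above can be sketched in code as follows; this is our own illustrative rendering, with invented placeholder attributes, since the diagram deliberately omits attributes and methods:

```python
# Sketch of the participating classes: Caller -Record-> Recording
# -Send Speech Signal-> Speech Recognizer -Perform-> Feature, with
# Search And Match against the Code book. Attributes are placeholders.

from dataclasses import dataclass

@dataclass
class Recording:
    samples: list  # digitized speech signal

@dataclass
class Feature:
    values: list   # e.g. time / frequency / intensity characteristics

@dataclass
class CodeBook:
    entries: dict  # reference dictionary: word -> reference feature value

@dataclass
class SpeechRecognizer:
    codebook: CodeBook

    def perform(self, recording: Recording) -> str:
        # Feature extraction (placeholder: mean amplitude) ...
        feature = Feature([sum(recording.samples) / len(recording.samples)])
        # ... then Search And Match against the code book.
        return min(self.codebook.entries,
                   key=lambda w: abs(self.codebook.entries[w] - feature.values[0]))
```

For example, a recognizer built on the code book `{"ok": 0.5, "no": -0.5}` maps a recording whose mean amplitude is 0.5 to the word "ok".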

    3 - Development of dynamic model:

The development of the dynamic model is the third activity of the analysis stage. It is situated on the left branch of the Y cycle. This is an iterative activity, strongly coupled with the static modeling activity described above. The development of the dynamic model precedes the preliminary design.


Figure 13: The dynamic model

    3.1- Sequence diagram:

The sequence diagram is mainly used to show the interactions between the classes / objects listed in the previous section, in the sequential order in which the interactions take place. The figure shows the sequence diagram of the system, including the classes / objects, lifelines, processes and interactions. The interactions between the seven classes / objects are numbered 1 through 11 sequentially, which indicates which process must be completed before the following one can start.


Figure 14: Object sequence diagram of speech recognition

    Note:

The nominal scenario for voice recognition has been represented in detail in the sequence diagram above (see Figure 14). In the following sequence diagrams we chose to group the class instances related to recognition into a single participant called "Recognition System", according to the package diagram described earlier (see Figure 11).


[Figure content: the administrator pronounces the login and password; the recognition system verifies the grammar containing them. If the grammar is valid, the login and password are inserted into the text fields and verified against the MySQL database; valid connection parameters open the MySQL Administrator interface, while invalid parameters trigger a new demand for the connection parameters. An invalid grammar also triggers a demand for the connection parameters.]

    Figure 15: sequence diagram representing the connection


[Figure content: the administrator pronounces the user-privilege control command; the recognition system verifies the grammar. If the grammar is valid, the mouse is manipulated to insert the privilege, which is added to the database; otherwise the system demands that the command be pronounced again.]

Figure 16: sequence diagram representing the assignment of a privilege to a user


[Figure content: the administrator pronounces the user-creation command; the recognition system verifies the grammar. If the grammar is valid, the mouse is manipulated to insert the new user, the user is added and the new-user information interface opens; the administrator then pronounces the login information, which, if its grammar is valid, is inserted into the text fields and attributed to the user. An invalid grammar triggers a demand to pronounce the command again.]

Figure 17: sequence diagram for the addition of a user


3.2- Diagram of state transitions

Now that the scenarios have been formalized, the knowledge of all the interactions between objects allows us to represent the business rules of the system dynamics. However, we should focus precisely on the classes with the richest behavior in order to develop some of these dynamic rules. We use the concept of the finite state machine, which involves tracking the life cycle of a generic object of a particular class through its interactions with the rest of the world, in all possible cases. The local view of an object, describing how it reacts to events according to its current state and moves into a new state, is plotted as a state diagram.

Figure 18: diagram of state transitions
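The finite state machine concept described above can be sketched as follows; the states and events are illustrative, loosely based on the recognizer's life cycle, and are not taken from the project's actual diagram:

```python
# A minimal finite state machine tracking an object's life cycle:
# it reacts to an event according to its current state and moves
# into a new state, exactly as the state diagram describes.

class StateMachine:
    def __init__(self, start, transitions):
        # transitions: (state, event) -> next state
        self.state = start
        self.transitions = transitions

    def fire(self, event):
        """React to an event according to the current state."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        return self.state

sm = StateMachine("idle", {
    ("idle", "speech detected"):     "decoding",
    ("decoding", "grammar valid"):   "executing",
    ("decoding", "grammar invalid"): "idle",
    ("executing", "done"):           "idle",
})
sm.fire("speech detected")  # -> "decoding"
```

Disallowed events raise an error rather than being silently ignored, which makes the "all possible cases" requirement of the life cycle explicit.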


    4-Confrontation between the static and the dynamic models:

Various relationships exist between the main concepts of the static model (object, class, association, attribute and operation) and the main dynamic concepts (message, event, state and activity).

The correspondences are far from trivial, because these are complementary, not redundant, points of view. Let us try to synthesize the most important ones, without being exhaustive:

- A message can be the invocation of an operation on an object (the receiver) by another object (the sender);

- An event, or the effect on a transition, may correspond to the call of an operation;

- An activity in a state may correspond to the performance of a complex operation or a series of operations;

- An interaction diagram involves objects (or roles);

- An operation can be described by an interaction or activity diagram;

- A guard condition and a change event can reference attributes or static links;

- An effect on a transition can handle attributes or static links;

- The parameter of a message can be an attribute or an entire object.


CHAPTER III: THE TECHNICAL PART

    I-Capture of the technical requirements:

The capture of technical requirements identifies all the constraints on the choice and dimensioning of the system design: the tools and equipment selected, as well as the constraints of integration with the existing environment (the pre-required technical architecture).

Figure 19: Capture of the technical requirements

Part of the work consisted in studying the functioning of speech recognition systems, and then in developing an acoustic model allowing the recognition of words.

This is why we propose, in a first step, to introduce the Hidden Markov Models, the mathematical concept that will allow us to discuss the layout of automatic speech recognition (ASR) systems. In a second step, we will apply this model to our project.

1- The Hidden Markov Models:


Definition:

A Markov process is a discrete-time system which is always in one state taken from N distinct states. Transitions between states occur between two consecutive discrete instants, according to some probability law. The probability of each state depends only on the state that immediately precedes it.

A hidden Markov model (HMM) represents, in the same way as a Markov chain, a whole sequence of observations; the state behind each observation is not observed, but is associated with a probability density function. It is therefore a stochastic process in which the observations are a random function of the state, and in which the state changes at every instant according to the probabilities of transition from the previous state.

    Figure 20: The Markov model

More formally, a hidden Markov machine is characterized by the quadruplet of sets described below:

- Si, the i-th state;

- πi, the probability of Si being the initial state;

- aij, the probability of the transition Si → Sj;

- bi(k), the probability of emitting the symbol k while in the state Si.

On condition that:

- the sum of the probabilities of the initial states is equal to 1: Σi πi = 1;

- the sum of the probabilities of the transitions from a state is equal to 1: Σj aij = 1 for every i;

- the sum of the probabilities of the outputs from a state is equal to 1: Σk bi(k) = 1 for every i.


We can describe a hidden Markov model as the parameter set

λ = (π, A, B)

with:

- π the set of the initial probabilities;

- A the set of transition probabilities between states;

- B the set of laws (or densities) of probabilities associated with each state.
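As an illustration of the λ = (π, A, B) parameterization, the following sketch implements the standard forward algorithm for computing the probability of an observation sequence under such a model; the toy numbers are invented, and each row respects the sum-to-1 constraints stated above:

```python
# A minimal discrete HMM and the forward algorithm.

def forward(pi, A, B, obs):
    """Compute P(obs | model) with the forward algorithm.

    pi[i]   : probability that state i is the initial state
    A[i][j] : probability of the transition from state i to state j
    B[i][k] : probability of emitting symbol k while in state i
    obs     : sequence of observed symbol indices
    """
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: P(obs) = sum_i alpha_T(i)
    return sum(alpha)

# Two states, two symbols; each row sums to 1 as required.
pi = [0.6, 0.4]
A  = [[0.7, 0.3],
      [0.4, 0.6]]
B  = [[0.9, 0.1],
      [0.2, 0.8]]

p = forward(pi, A, B, [0, 1, 0])
```

Because π, the rows of A and the rows of B each sum to 1, the probabilities of all possible observation sequences of a given length also sum to 1, which is a convenient sanity check on an implementation.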

    2-The Voice recognition theory:

Whether we use speech recognition to dial a phone number, browse through the windows on our computer, enter data into software, or dictate a letter in a word processor, the basic problem remains the same: identify the meaning of a flow of words, often uttered over more or less significant background noise.

This task is made difficult not only by the deformations induced by the use of a microphone, but also by a number of factors inherent to human language:

- Homonyms, where the same sequence of sounds can correspond to several words (like the French sound "an" in "cent", "sans" and "sang", the last meaning blood).

- Local accents.

- Patterns of language, such as elisions that make it difficult to separate words ("j'vais l'chercher" for "je vais le chercher", "I am going to fetch it").

- Speed differences between users.

- Imperfections of the microphone ...

For human ears, these factors do not usually represent difficulties. The brain copes with these deformations of speech by taking into consideration, almost unconsciously, nonverbal and contextual elements that allow it to eliminate ambiguities.

It is only by taking into account these elements surrounding the sound itself that voice recognition software can achieve high degrees of reliability. Today, the software that gives the best results is all based on a probabilistic approach.

The aim of speech recognition is to reconstruct a sequence of words M from a recorded acoustic signal A.


In the statistical approach, we consider all the sequences of words M which could match the signal A.

In this set of possible sequences we then choose the one which is most likely, that is to say, the one that maximizes the probability P(M|A) that M is the correct interpretation of A, which we note M* = argmax_M P(M|A).

    Figure 21: reconstruction of a sequence of words M from a recorded acoustic signal A.

Note that P(A|B) represents the probability of event A given that event B has occurred.

Bayes' rule computes the probability of the co-occurrence of two events A and B through the following equalities:

P(A and B) = P(A|B) P(B) = P(B|A) P(A)

where P(A) is the probability that event A occurs.

Thus, Bayes' rule lets us rewrite the expression:

P(M|A) = P(A|M) P(M) / P(A)

And as P(A) is a constant in the search for the best M, we finally have the equation:

M* = argmax_M P(A|M) P(M)

    This last equation is the key to the probabilistic approach to speech recognition. In fact, the first term, P(A|M), represents the probability of observing the acoustic signal A if the sequence of words M was pronounced: it is a purely acoustic problem.

    The second term, P(M), represents the probability that the sequence of words M was pronounced: it is a linguistic problem.

    The above equation thus tells us that we can divide the problem of speech recognition into two independent parts: the acoustic aspects and the linguistic aspects are modeled separately.
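This separation can be sketched in a few lines of Java: given log-probabilities from an acoustic model and a language model, the recognizer keeps the hypothesis maximizing their sum (products become sums in log space). The transcriptions and all scores below are invented illustration values; a real recognizer scores the many hypotheses produced by the search.

```java
import java.util.HashMap;
import java.util.Map;

public class BayesDecision {
    // Pick the hypothesis M maximizing log P(A|M) + log P(M).
    public static String best(Map<String, Double> acousticLog, Map<String, Double> lmLog) {
        String argmax = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String m : acousticLog.keySet()) {
            double score = acousticLog.get(m) + lmLog.getOrDefault(m, Double.NEGATIVE_INFINITY);
            if (score > bestScore) { bestScore = score; argmax = m; }
        }
        return argmax;
    }

    public static void main(String[] args) {
        // Hypothetical scores for two competing transcriptions.
        Map<String, Double> ac = new HashMap<>();
        ac.put("open the file", -120.0);
        ac.put("open the fall", -118.0); // acoustically slightly better
        Map<String, Double> lm = new HashMap<>();
        lm.put("open the file", -4.0);   // but linguistically far more likely
        lm.put("open the fall", -9.0);
        System.out.println(best(ac, lm)); // prints "open the file"
    }
}
```

Note how the language model can overturn the acoustic model's preference, which is exactly why the two terms are combined.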


    Thus, the transcription is divided into several modules:

    - feature extraction, which produces A;

    - the acoustic model, which computes P(A|M) and searches for the hypotheses M most likely associated with A;

    - the language model, which computes P(M) to select one or more hypotheses for M depending on linguistic knowledge.

    The following schema illustrates the components of a transcription system.

    Figure 22: The transcription system

    3- Feature extraction:

    The sound signal to be analyzed takes the form of a wave whose intensity varies over time. The first stage of the transcription process is to extract a series of numerical values that are sufficiently informative at the acoustic level to decode the signal thereafter.


    The signal may contain areas of silence, noise or music. These areas are first removed in order to keep only the portions of the signal useful to the transcription, that is to say, those corresponding to speech.

    The sound signal is then segmented into what are called breath groups, using sufficiently long silent pauses (about 0.3 s) as delimiters. The advantage of this segmentation is to obtain continuous stretches of sound of a reasonable size compared to the computational capabilities of the ASR system. Later in the transcription process, the analysis is done separately for every breath group.

    To track the changes in the signal, which generally varies rapidly over time, the breath group is itself divided into analysis windows of a few milliseconds (usually 20 or 30 ms). In order to avoid losing important information at the beginning or end of the windows, they are made to overlap, which leads to extracting features every 10 ms.

    From the signal contained in each analysis window, numerical values characterizing the human voice are computed. After this step, the signal becomes a sequence of so-called acoustic vectors, whose dimension is often greater than or equal to 39.
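As a concrete sketch of the windowing step, the Java fragment below splits a signal into overlapping analysis windows. The 20 ms window and 10 ms step match the usual values cited above (at a 16 kHz sampling rate they give 320 and 160 samples); the class and method names are ours.

```java
public class Framer {
    // Split a signal into overlapping analysis windows.
    // windowSize and hop are in samples: 20 ms and 10 ms at 16 kHz
    // give windowSize = 320 and hop = 160.
    public static double[][] frames(double[] signal, int windowSize, int hop) {
        int n = signal.length < windowSize ? 0 : (signal.length - windowSize) / hop + 1;
        double[][] out = new double[n][windowSize];
        for (int i = 0; i < n; i++) {
            System.arraycopy(signal, i * hop, out[i], 0, windowSize);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] oneSecond = new double[16000]; // 1 s of signal at 16 kHz
        double[][] f = frames(oneSecond, 320, 160);
        System.out.println(f.length); // prints 99
    }
}
```

Each of these windows would then be turned into one acoustic vector (e.g. of dimension 39) by the feature computation proper.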

    4- The Acoustic Model:

    The next step is to associate with the acoustic vectors, which are, as we have seen, numeric vectors, a set of word hypotheses (symbols). Referring to equation 1 of the statistical modeling, this amounts to estimating P(A|M). The techniques for calculating this value form what is called the acoustic model.

    The most widely used tool for acoustic modeling is the Hidden Markov Model (HMM) presented above. HMMs have indeed shown their effectiveness in practice for recognizing speech. Even if they have some limitations in modeling signal characteristics, such as the duration or length of successive acoustic observations, HMMs offer a well-defined mathematical framework for computing the probabilities P(A|M).

    Acoustic models involve three levels of HMM, shown in the figure below.


    Figure 23: The acoustic model

    At the first level, they seek to recognize the types of sound, in other words to identify the phones (the sounds pronounced by speakers, defined by specific characteristics). To do this, a phone is modeled by an HMM, usually with 3 states representing its beginning, middle and end. The hidden variable is then the sub-phone, and the observations are the acoustic vectors.

    To calculate the observation probabilities in each state, two approaches are often considered: one based on representing the probability densities by Gaussians, the other based on neural networks. These methods establish hypotheses about the likelihood of the phones uttered. However, the aim of acoustic models is to determine a sequence of words. For this purpose, acoustic models use a pronunciation dictionary, which maps each word to its pronunciations. As a word may be pronounced in different ways, according to its predecessor and its successor, or simply to the habits of the speaker, there may be multiple entries in the lexicon for the same word. Pronunciations are specified as sequences of phonemes.


    The second level of HMM models the words from the HMMs representing the phones and from the pronunciation lexicon. It comes in the form of a lexical tree, initially containing all the words of the vocabulary, which is gradually pruned as the phones are accepted. Since the first-level HMMs model phones rather than phonemes, the phonemes found in the pronunciation dictionary are converted into phones in order to recognize words. Transformation rules depending on the context of the phoneme are then used.

    The third level finally models the sequence of words M in a breath group, and can then incorporate the knowledge provided by the language model on M. To establish the HMM equivalent to a word graph, the HMM corresponding to the lexical tree is duplicated each time the acoustic model makes the hypothesis that a new word has been recognized.

    The functioning of the acoustic model just described faces a major problem: the search space of the higher-level HMM is often considerable, especially if the vocabulary is large and if the breath group to be analyzed contains multiple words. Algorithms based on dynamic programming can compute the probabilities efficiently. These are mainly the Viterbi algorithm and stack decoding, also called A* decoding. In addition, very regular pruning is used to keep only the hypotheses that look most promising.
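To illustrate the role of the Viterbi algorithm, here is a minimal Java sketch on a toy HMM. The initial, transition and emission probabilities are invented illustration values, not those of a real acoustic model; log-probabilities are used, as in practice, to avoid numerical underflow.

```java
public class Viterbi {
    // Find the most likely hidden-state path for an observation sequence.
    // init[s]: initial probability of state s
    // trans[p][s]: probability of moving from state p to state s
    // emit[s][o]: probability of observing symbol o in state s
    public static int[] decode(double[] init, double[][] trans, double[][] emit, int[] obs) {
        int n = init.length, T = obs.length;
        double[][] v = new double[T][n];   // best log-score ending in each state
        int[][] back = new int[T][n];      // backpointers for path recovery
        for (int s = 0; s < n; s++)
            v[0][s] = Math.log(init[s]) + Math.log(emit[s][obs[0]]);
        for (int t = 1; t < T; t++)
            for (int s = 0; s < n; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < n; p++) {
                    double score = v[t - 1][p] + Math.log(trans[p][s]);
                    if (score > best) { best = score; back[t][s] = p; }
                }
                v[t][s] = best + Math.log(emit[s][obs[t]]);
            }
        // backtrack from the best final state
        int[] path = new int[T];
        double best = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < n; s++)
            if (v[T - 1][s] > best) { best = v[T - 1][s]; path[T - 1] = s; }
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // Toy 2-state HMM with 3 observation symbols (invented numbers).
        double[] init = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
        int[] path = decode(init, trans, emit, new int[]{0, 1, 2});
        System.out.println(java.util.Arrays.toString(path)); // prints [0, 0, 1]
    }
}
```

A real decoder works on far larger graphs, which is why the pruning mentioned above is indispensable.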

    The role of the acoustic model is thus to align the sound signal with word hypotheses using only acoustic cues. In its last level, it incorporates the information about word sequences introduced by the language model.

    5- The language model:

    The language model is intended to find the most likely sequences of words, in other words those that maximize the value P(M) of equation 1. If one refers to the highest level of HMM in the acoustic model (see previous figure), the values P(M) are the probabilities of successive words.

    a) Functioning of a language model

    By writing M = m1 ... mn, where mi is the word of rank i in the sequence M, the probability P(M) decomposes as follows:

    P(M) = P(m1) P(m2 | m1) ... P(mn | m1 ... mn-1)


    The evaluation of P(M) then reduces to calculating the values P(mi) and P(mi | m1 ... mi-1), which are obtained using the equalities:

    P(mi) = C(mi) / Σ_{m ∈ V} C(m)

    P(mi | m1 ... mi-1) = C(m1 ... mi) / C(m1 ... mi-1)

    where V is the vocabulary used by the ASR system, and C(mi) and C(m1 ... mi) represent the respective numbers of occurrences of the word mi and of the word sequence m1 ... mi in the training corpus. Unfortunately, the number of parameters P(mi) and P(mi | m1 ... mi-1) of the language model to estimate increases exponentially with n. In order to reduce this number, P(mi | m1 ... mi-1) is modeled by an N-gram, that is to say a Markov chain of order N-1 (with N > 1), using the approximation:

    P(mi | m1 ... mi-1) ≈ P(mi | mi-N+1 ... mi-1)

    This equation indicates that each word mi may be predicted from the N-1 preceding words. For N = 2, 3 or 4 we speak respectively of a bigram, trigram or quadrigram model. For N = 1, the model is called a unigram model and reduces to estimating P(mi).

    Generally, it is bigram, trigram and quadrigram models that are used in the language models of ASR systems.
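The maximum-likelihood estimation of a bigram model can be sketched as follows. The tiny training "corpus" in the example is invented, and for simplicity the history count C(w1) is approximated by the unigram count; a real language model would also apply smoothing to unseen bigrams.

```java
import java.util.HashMap;
import java.util.Map;

public class BigramModel {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    // Count unigrams C(m) and bigrams C(m1 m2) in a training corpus.
    public void train(String[] corpus) {
        for (int i = 0; i < corpus.length; i++) {
            unigramCounts.merge(corpus[i], 1, Integer::sum);
            if (i > 0) bigramCounts.merge(corpus[i - 1] + " " + corpus[i], 1, Integer::sum);
        }
    }

    // Maximum-likelihood estimate P(w2 | w1) = C(w1 w2) / C(w1).
    public double prob(String w1, String w2) {
        int c = unigramCounts.getOrDefault(w1, 0);
        return c == 0 ? 0.0 : bigramCounts.getOrDefault(w1 + " " + w2, 0) / (double) c;
    }

    public static void main(String[] args) {
        BigramModel lm = new BigramModel();
        lm.train(new String[]{"open", "the", "file", "close", "the", "window"});
        System.out.println(lm.prob("the", "file")); // C(the file)/C(the) = 1/2, prints 0.5
    }
}
```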

    6- The choice of the Sphinx API:

    Sphinx-4 is a speech recognizer written entirely in Java. Its goals are a highly flexible speech recognizer able to equal commercial products, developed through collaborative research by various universities, the laboratories of Sun and HP, and MIT. While being highly configurable, Sphinx-4 recognition supports, among other things, isolated words and phrases (through the use of a grammar). Its architecture is scalable, in order to enable new research and the testing of new algorithms.

    The recognition quality depends directly on the quality of the voice data, that is, the information describing the voices themselves: the different phonemes, the individual words (vocabulary), the different ways of pronouncing them. The more of this information is known to the system, the better its reaction and the better the choices it makes.

    As shown in the following figure, which represents its architecture, Sphinx-4 is based on 3 modules.


    Figure 24: General architecture of Sphinx-4

    6.1- The architecture of Sphinx-4:

    Figure 25: Detailed Architecture of Sphinx-4

    The main blocks are the FrontEnd, the Decoder and the Linguist. The support blocks include the configuration manager and the tool blocks.

    The FrontEnd takes one or more input signals and parameterizes them into a sequence of features. The Linguist translates any kind of standard language model, together with pronunciation information from the dictionary and structural information from one or more sets of acoustic models, into a search graph. The SearchManager in the Decoder uses the features from the FrontEnd and the search graph from the Linguist to do the actual decoding, generating results. At any time before or during the recognition process, the application can issue controls to each module, becoming a partner in the recognition process.

    a) The FrontEnd

    The FrontEnd cuts the recorded voice into different parts and prepares them for the decoder. Its aim is to transform an input signal (for example, audio) into a sequence of output features.

    As illustrated in Figure 26, the FrontEnd comprises one or more parallel chains of replaceable, communicating signal-processing modules called "DataProcessors". Support for multiple chains allows the simultaneous computation of different types of parameters from identical or different input signals. This allows the creation of systems that can simultaneously decode using parameters derived from speech and non-speech signals.

    Figure 26: Parallel chains of communicating DataProcessors

    b) The Linguist:

    The Linguist generates the SearchGraph that is used by the decoder during the search, while hiding the complexity of the generation of this graph. As elsewhere in Sphinx-4, the Linguist is a pluggable module, which allows people to dynamically configure the system with different Linguist implementations.

    A typical implementation constructs the SearchGraph using the structure of the language represented by a given LanguageModel and the topological structure of the AcousticModel (HMMs for the basic sound units used by the system). During the generation of the SearchGraph, the Linguist may also incorporate sub-word units with contexts of arbitrary length.

    By allowing different implementations of the Linguist to be plugged in at run time, Sphinx-4 allows individuals to provide different configurations for different recognition tasks. For example, a simple digit-recognition application may use a simple Linguist that keeps the search space entirely in memory. The Linguist is built around three components, which are described in the following sections:


    The language model

    The dictionary

    The acoustic model

    b.1) The language model

    Role:

    Describes what can be said in a very specific context.

    Helps narrow the search space.

    There are three kinds of language model: the simplest is used for isolated words, the second for command-and-control applications, and the last for everyday language.

    The language model implementation supports several types of grammars; we opted for the JSGF grammar, which supports the Java™ Speech API Grammar Format (JSGF) [20], a BNF-style, platform-independent and vendor-independent Unicode representation of grammars.

    b.2) The Dictionary

    The dictionary gives the pronunciations of the words found in the LanguageModel. The pronunciations break the words into sequences of the sub-word units found in the AcousticModel. The Dictionary interface also supports the classification of words, allowing a single word to belong to several classes.

    b.3) The AcousticModel

    The AcousticModel module provides a correspondence between a unit of speech and an HMM that can be scored against the incoming features provided by the FrontEnd.

    b.4) The SearchGraph

    The SearchGraph is the main data structure used during the decoding process.

    It is a directed graph in which each node, called a SearchState, represents either an emitting or a non-emitting state. Emitting states can be scored against the incoming acoustic features, while non-emitting states are generally used to represent higher-level linguistic constructs, such as words and phonemes, that are not directly scored against the incoming features. The arcs between states represent possible state transitions, each with a probability representing the likelihood of transitioning along the arc.


    How the SearchGraph is built affects the memory footprint, the speed and the accuracy of recognition. The modular design of Sphinx-4, however, allows different SearchGraph compilation strategies to be used without changing other aspects of the system.

    The choice between static and dynamic construction of the language HMMs depends mainly on the size of the vocabulary, the complexity of the language model and the desired memory footprint of the system, and can be made by the application.

    c) The Decoder

    The Decoder is the heart of Sphinx-4. It processes the information received from the FrontEnd, analyzes it and compares it with the knowledge base in order to give a result to the application.

    The main role of the Sphinx-4 Decoder block is to use the features from the FrontEnd, in collaboration with the SearchGraph from the Linguist, to generate result hypotheses. The Decoder block includes a pluggable SearchManager and other supporting code that simplifies the decoding process for an application. As such, the most interesting element of the Decoder block is the SearchManager. The Decoder simply tells the SearchManager to recognize a set of feature frames. At each step of the process, the SearchManager creates a Result object that contains all the paths that have not yet reached a final non-emitting state.
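Putting the three blocks together, a typical Sphinx-4 recognition loop looks like the sketch below, adapted from the pattern used by the demos shipped with Sphinx-4. The configuration file name is ours, and the Sphinx-4 jars listed earlier must be on the classpath for this to compile and run.

```java
import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class VoiceCommandLoop {
    public static void main(String[] args) throws Exception {
        // Load the XML configuration ("commands.config.xml" is a hypothetical name).
        ConfigurationManager cm =
                new ConfigurationManager(VoiceCommandLoop.class.getResource("commands.config.xml"));

        // Look up and allocate the recognizer defined in the configuration.
        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // Start capturing audio from the microphone component.
        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (microphone.startRecording()) {
            while (true) {
                // Blocks until an utterance has been decoded.
                Result result = recognizer.recognize();
                if (result != null) {
                    System.out.println("You said: " + result.getBestFinalResultNoFiller());
                }
            }
        }
    }
}
```

The recognized text returned by the Result can then be mapped to an action, such as launching SKYPE or executing a MySQL command.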

    7- Technical use case diagram

    This results in a diagram of the overall operation of the system and of the various actions of the actors.

    The study of the needs of the actors who interact with our system requires the development of the following use case diagram:


    Figure 27: Technical use case diagram

    The different use cases are:

    Listen for instructions: capture the signal from the microphone

    Save the speech signal: save the signals coming from the microphone

    Analyze the speech signal: segment the signal into phonemes

    Match the speech signal: match the signal against the database

    Match the feature vector: match the characteristics of the analyzed signal against the database

    Extract the feature vector: analyze the signal by extracting its significant characteristics

    Classify the feature vector: classify the analyzed signal entities by category

    The different actors are:

    User

    The dictionary (code book)

    II- The Generic Design

    The generic design defines the components needed to build the technical architecture. This design is completely independent of the functional aspects. It aims to standardize and reuse the same mechanisms for all systems. The technical architecture forms the backbone of the system; its importance is such that it is advisable to build a prototype of it.


    Figure 28: The generic design

    Software layers

    Sphinx-4 has been compiled and tested on Solaris, Mac OS X, Linux and

    Windows. The execution, compilation and testing of Sphinx-4 require additional

    software. The following software must be installed on the machine:

    - Java SDK 5.1 (http://java.sun.com)

    - The various libraries that make up Sphinx-4

    Software setup and configuration:

    a) Implementation of the library with Eclipse


    The implementation of Sphinx-4 in an arbitrary application is relatively easy. The first step is to create a new project (menu File - New - Project). The figure below shows how to create a new project in Eclipse.

    Figure 29: Creation of a new project

    The second step is to add the Sphinx-4 libraries to the project. To do this, right-click on the project and open the project properties, then choose the "Java Build Path" menu. Finally, click on "Add External JARs" to add the various libraries provided by Sphinx. The libraries to add are the following:

    Figure 30: Inserting the Sphinx-4 libraries into the project

    - js.jar

    - jsapi.jar (this must be created by launching the application jsapi.exe located in the lib directory of the downloaded archive; this library is used by Java, among other things, to record sound)

    - sphinx4.jar

    - TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar (only for the recognition of numbers)

    - WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz.jar

    - WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar

    b) Writing the grammar

    To perform recognition, we must write a grammar, that is to say a file describing the terms that must be recognized by the program. The grammars used by Sphinx follow the JSGF format (Java Speech Grammar Format); we must therefore create a file with the extension ''.gram''. This file contains the grammar used by the application, that is to say the words or phrases that can potentially be pronounced.
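For illustration, a minimal JSGF file might look as follows; the grammar name and the command words here are invented, and our actual grammar is the one shown in Figure 31.

```jsgf
#JSGF V1.0;
grammar commands;
// each utterance is one action verb followed by one target application
public <command> = (open | close | start) (skype | firefox | mysql);
```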

    b.1) Example of grammar

    Figure 31: Grammar file

    The grammar file above allows the recognition of all of the following sentences:


    Figure 32: List of sentences that can be pronounced

    The following figure represents the above grammar graphically.

    Figure 33: Graphical grammar structure

    c) Writing the configuration file for Sphinx

    After writing the grammar file, we must create the configuration file Filename.config.xml. The easiest way is to adapt a configuration file from one of the demonstrations provided in the downloaded archive. This file specifies, among other things, the dictionary and the grammar used.
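As an illustration of the format, the excerpt below shows how a grammar might be declared in such a configuration file. The component and property names follow the JSGFGrammar component of Sphinx-4; the grammar location and name values are our hypothetical ones.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- declare the JSGF grammar used by the linguist -->
    <component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
        <property name="dictionary" value="dictionary"/>
        <property name="grammarLocation" value="resource:/grammar/"/>
        <property name="grammarName" value="commands"/>
        <property name="logMath" value="logMath"/>
    </component>
</config>
```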


    Figure 34: XML configuration file


    Chapter VI THE MIDDLE PART

    I- The design part:

    Figure 35: The design part

    The design model organizes the system into components delivering technical and functional services. This model combines the information from the right branch and the left branch. It can be considered as the transformation of the analysis model obtained by projecting the analysis classes onto the software layers.

    The preliminary design is a delicate step because it integrates the functional analysis model into the technical architecture in order to draw the map of the system components to be developed.

    The detailed design then examines how to realize each component.

    The coding step produces the components and tests the code units as and when they are completed.


    The acceptance step finally validates the functionality of the developed system.

    1- Detailed Design:


    Figure 36: The detailed class diagram

    The detailed class diagram follows from the general class diagram (described in the part "2.1- The class diagram").

    NB: Note that some classes are transformed as follows:

    Class Codebook: becomes our dictionary (database).

    Class Instruction: becomes the grammar file.

    Class LanguageInstruction: becomes the grammar file.

    II- Realization part

    1- Description of the application interfaces:

    In this part of the project, we show a first example of our application.

    This interface is the home interface for all users.

    Figure 36: The home interface

    The following interface shows the process of adding a new application.


    Figure 37: The home interface

    This interface shows how to edit an existing application.


    Conclusion


    This project has led to the creation of an application for vocally manipulating other applications, namely MySQL and SKYPE.

    Thus, research on the internet and a careful study of the working tools were carried out in order to choose the most appropriate architecture for the system.

    Throughout this project we did our best to improve our application, but we faced a major problem: the development of an acoustic model customized to each user of our application.

    Concretely, what distinguishes the applications present on the market (Dragon Naturally Speaking, Speak Q, etc.) is the degree of perfection of the acoustic model; building such a model may be considered the most important task, as it requires additional time beyond the deadline of our project.