Pfa 10.0 Beta (Ang)


  • 8/2/2019 Pfa 10.0 Beta (Ang)


    Acknowledgment

We would like to thank our supervisor, Mr. Kamel KHENISSI, for his valuable support throughout this work, and Ms. Wiem FRADI for the linguistic revision of our report.

    Ahmed BAHRI

    Moemen MANSOURI


    Abstract

"The voice is the technology of tomorrow" is the affirmation of specialists at giants such as Microsoft and IBM. In North America, which is decades ahead of the rest of the world in this field, speech technology is becoming the most natural mode of interaction with the machine: Windows 7, Microsoft's flagship product, is an excellent example. The maturity of speech synthesis and voice recognition technology has led researchers to the realization of an old dream: "the understanding of spontaneous speech by the machine".

The work developed during this project consists in creating an application to manipulate the MySQL database utility vocally.

The project focused on finding vocabularies that allow the manipulation of the MySQL database, as well as the vocal manipulation of Skype and Mozilla Firefox. For automatic speech recognition, we used a specific configuration of the framework chosen for this project, with the aim of obtaining the best result.

Table of contents


List of figures


    List of tables


    Glossary


    GENERAL INTRODUCTION

Speech recognition is a technology that allows computer software to interpret natural human language in order to control a well-defined system.


Early research in automatic speech recognition began about 40 years ago in the U.S., during the Cold War, with early attempts to create a machine capable of understanding human speech in order to interpret intercepted Russian messages. Since then, the development of speech recognition has continued to evolve, gaining great importance as it became widely used by:

- Large firms, for some of their internal applications, or in applications based on speech recognition (Dragon NaturallySpeaking, etc.). These applications generally use their own speech engine, and there are also companies that specialize in creating and selling such engines, for example Voxalead (still experimental).

- People with disabilities, by allowing them greater autonomy.

Speech recognition is also linked to many fields of science (natural language processing, linguistics, formal language theory, information theory, signal processing, neural networks, artificial intelligence, etc.).

In fact, this technology today represents a potential market in the software industry, since the combination of speech recognition and the PC has become an indispensable means of intellectual and social development.

As part of our End of Year Project, we decided to create a speech recognition system for controlling the MySQL database management system, which will facilitate its manipulation in further projects.

To design an automatic speech recognition (ASR) system that is as accurate as possible, we should:

Firstly, understand how complex the speech signal really is, i.e. know the object or observation given as input.

Secondly, properly define the task of the system, i.e. its constraints and expected performance.

We briefly present the project, then we expose the problems through a study of existing systems, then we present the different needs and the improvements required for the current system


and define the various specifications (platform, tools, ...). Finally, we propose a system that we deem appropriate.

    CHAPTER I: PRESENTATION OF THE PROJECT

    I- Context of the project


Nowadays, voice technology is spreading across different operating systems, and the need for this technology grows continually.

Within the context of our project, we applied this technology to the MySQL database utility with the aim of understanding how vocal applications work.

This project was carried out using the local resources of the Private Higher School of Engineering and Technology "ESPRIT".

    II- The choice of methodology

To carry out the project successfully, it is essential to establish a process that helps formalize the preliminary stages of developing a system, so as to make the development more faithful to the client's needs.

Given the number of available methods (2TUP, RUP, agile methods, ...), the choice is difficult; a project manager starting a project must ask:

How will I organize the development teams?

Which tasks are assigned to whom?

How long will it take to deliver the product?

How do we involve the client in development to capture his needs?

The following table shows the advantages and disadvantages of each methodology.


Table 1: Comparative table of design methodologies

Justification of our choice:

Given that our project is based on a well-defined development process, which determines the functional needs expected of the system up to the final design and coding, 2TUP appeared the most appropriate to lead and plan the sequence of stages of this project. The Two Tracks Unified Process responds to the constraints imposed by the continual change of companies' information systems.

III- Introduction to the 2TUP methodology:

2TUP is the abbreviation of "Two Track Unified Process". It is a process that follows the principles of the Unified Process. The 2TUP process responds to the constraints imposed by the continual change of companies' information systems; in this sense, it strengthens control over the evolution and correction capacities of such systems. "Two Track" literally means that the process follows two paths, or branches. These are the "functional" and "technical architecture" branches, which correspond to the two axes of change imposed on the information system.


Figure 1: Two types of constraints imposed on the information system

1- The functional branch:

This branch capitalizes the knowledge of the company's business. It generally constitutes a medium- and long-term investment. The functions of the information system are in fact independent of the technologies used. This branch includes the following steps:

1- The capture of functional needs, producing a model focused on the needs of the business users.

2- Functional analysis.

2- The technical branch:

This branch capitalizes the know-how. It is an investment for the short and medium term. The techniques developed for the system can indeed be independent of the functions to be performed. This branch includes the following steps:

1- Capture of technical needs.

2- Generic design.

3- The middle branch:

Following the developments of the functional model and the technical architecture, the implementation of the system merges the results of the two branches. This merger gives the development process its Y shape.

This branch includes the following steps:

1. Preliminary design.

2. Detailed design.

3. Coding.

4. Integration.


Figure 2: The Y development process


CHAPTER II: THE FUNCTIONAL PART

    I-Preliminary study

Figure 3: Preliminary study schema

As the diagram above shows, the preliminary study is the first step of 2TUP. It consists in performing an initial identification of the functional and operational needs, mainly using text. It prepares the more formal activities of functional and technical needs capture.

For our project, this study was achieved through the development of a specification. It examined the various systems already on the market and tried to identify their positive and negative sides in a critical part, in order to fix our main objectives, articulate the needs and identify the modules to maintain or improve thereafter.

The last stage of this study is the modeling of a context diagram.


    Figure 4: Functional Schema

    Description of the schema

1. The speaker emits a sentence; the sound is captured by a microphone.

2. The voice signal is then digitized using an analog-to-digital converter. The parameterization of the signal provides a fingerprint.

3. The decoding describes the acoustic signal in terms of linguistic units. It aims to segment the signal; the identification of the different segments is based on phonetic and linguistic constraints.

Once the analysis process is completed, the recognition phase begins. In fact, all the spoken words are separated by silences lasting more than a few tenths of a second. The recognition phase consists mainly of two phases:

1) The learning phase: the speaker pronounces the whole vocabulary, often several times, to create a reference dictionary.

2) The recognition phase: the speaker pronounces a word. To recognize the words emitted by the speaker there are three parts:

- First, the sensor, to capture the physical phenomenon; in our case it is the microphone. A signal is transmitted by the microphone when the speaker speaks.

- Second, the parameterization of forms, which gives us a fingerprint, that is to say the characteristics of the sound (time / frequency / intensity). And finally, the identification of


forms. A second schema is needed to better understand the different use cases that should be treated.

    Figure 5: Operating principle of speech recognition

This diagram shows the operating principle of recognition: a speaker pronounces a word of the vocabulary. Word recognition is then a typical problem of pattern recognition. Any pattern recognition system always involves the following three parts:

- A sensor for capturing the physical phenomenon under consideration (in our case a microphone).

- A parameterization stage for the shapes (e.g. a spectrum analyzer).

- A decision stage in charge of classifying an unknown form into one of the possible categories.
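As a toy illustration of these three parts (the code and the feature are our own invention, not the project's actual implementation), the sketch below replaces the spectrum analyzer with a deliberately crude characteristic, the mean absolute amplitude, and classifies an unknown form against a reference dictionary built during the learning phase:

```python
# Minimal sketch of the three-part pattern recognition pipeline:
# sensor (here, a canned sample), parameterization, and decision stage.
# The feature and the reference values are invented for illustration.

def parameterize(signal):
    """Reduce a raw signal to a single characteristic value."""
    return sum(abs(x) for x in signal) / len(signal)

def classify(signal, references):
    """Assign the signal to the reference category with the closest feature."""
    f = parameterize(signal)
    return min(references, key=lambda name: abs(references[name] - f))

# Reference "dictionary" built during the learning phase (invented values).
references = {"yes": 0.8, "no": 0.2}

print(classify([0.9, -0.7, 0.8, -0.75], references))  # prints "yes"
```

A real system would of course use spectral features and probabilistic matching rather than a single amplitude value; the point here is only the separation into capture, parameterization and decision.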

    II- Capture of the Functional needs

Once the preliminary study is done, we move to the next step, in which we determine the functional needs on the left branch and, in parallel, the technical needs on the right branch.


    1-Functional Needs:

This project consists in designing and implementing a tool for vocal manipulation of the database. First we will define the actors who will interact with the system. Considering the needs of our application, the main actors are reduced to an administrator and a user. The administrator is responsible for database creation and for maintaining the accounts of the individual users of the database; all these tasks must be completed by voice only. The user can access the database through his natural voice; after authentication he can issue DDL (data definition language) requests.

    Figure 6: Capture of the functional needs


The use case diagram reflects the principle of the overall functioning of our application and the various actions of the actors. The study of the needs of the actors who interact with our system requires the development of the following use cases:


    Figure 7: Use case diagram

We consider that two users are possible in our system: the administrator and the normal user. The administrator accesses all the existing use cases, including the manipulation of the database, whereas the user can only manipulate the database once it has been created.

a) Voice authentication:

The user pronounces his login and password to authenticate and access the main interface of MySQL.

The following table details the authentication process.

Title: Authentication

Intention: Authentication of the users.

Actors: Users.

Preconditions: MySQL available.

Starts when: The application is launched.

Definition of transitions: Pronounce the login and password.

Finishes when: The administrator or the user validates the session and connects.

Exception(s): Invalid user name. Invalid password. MySQL not found.


Postconditions: MySQL menu.

Table 2: nominal scenario of the use case "Voice Authentication"

b) Manipulate the database vocally:

Title: Manipulating the database

Summary: The user can create, modify or delete one or more databases, and create and execute DML (data manipulation language) queries vocally.

Actors: the application user.

The following table details the process of vocal manipulation of the base.

Title: Vocal manipulation of the database

Intention: Creating and manipulating a database in MySQL

Actors: User of the application

Preconditions: Authentication succeeded.

Starts when: The main window of MySQL opens.

Definition of transitions:

CASE 1: The user wants to create a database:

- Pronounce "create new schema".

- Say the name of the database to create.

- Confirm the selection.

CASE 2: The user wants to delete a database:


- Say the name of the database to drop.

- Pronounce "drop database".

- Confirm the selection.

CASE 3: The user wants to create a new table in the database:

- Select the database.

- Say "create new table".

- Say the name of the table to create.

- Confirm the selection.

CASE 4: The user wants to create a DML query:

- Pronounce the name of the table.

- Give the query to create.

- Confirm the selection.

CASE 5: The user wants to execute a DML query:

- Say the name of the table.

- Say the word "execute".

NB: for this case the user must write the DML query.

Finishes when: The user confirms his choice.

Exception(s): The name of the database or table already exists. Syntax error in SQL.

Table 3: nominal scenario of the use case "Vocal Manipulation"
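To illustrate how such recognized commands could end up as SQL, here is a minimal hypothetical sketch; the dispatch table and the helper name `to_sql` are our own, not part of the project's code, and only the command phrases come from the use cases above:

```python
# Hypothetical mapping from recognized voice commands to SQL statements.
# The command phrases mirror the grammar above ("create new schema",
# "drop database", ...); the templates are illustrative only.

def to_sql(command, name):
    """Translate a recognized command plus a spoken name into SQL."""
    templates = {
        "create new schema": "CREATE DATABASE {0}",
        "drop database":     "DROP DATABASE {0}",
        "create new table":  "CREATE TABLE {0} (id INT PRIMARY KEY)",
    }
    if command not in templates:
        raise ValueError("unknown voice command: " + command)
    return templates[command].format(name)

print(to_sql("create new schema", "inventory"))  # prints "CREATE DATABASE inventory"
```

In the actual application the recognized name would also need validation before being spliced into a statement, exactly as the "syntax error in SQL" exception above suggests.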

c) Manage users vocally:

Title: Managing users

Summary: The administrator can create, modify or delete a user account.

Actors: Administrator of the application.

The following table details the process of managing users.

Title: Manage the user

Intention: Creation, modification or deletion of the user account

Actors: Administrator

Preconditions: Authentication as administrator succeeded.

Starts when: The MySQL Administrator interface opens.

Definition of transitions:

CASE 1: The user wants to create a new user:

- Say "user administration".

- Say "add new user".

- Say the name and password.

- Say "apply changes".

CASE 2: The user wants to delete a user:


- Say "user administration".

- Pronounce the name of the user.

- Pronounce "drop user".

- Pronounce "ok".

CASE 3: The user wants to clone a user:

- Say "user administration".

- Pronounce the name of the user.

- Pronounce "clone user".

- Give the name and password of the new user.

- Say "ok".

Finishes when: The administrator completes and the session disconnects.

Exception(s): Invalid user name. Invalid password.

Table 4: nominal scenario of the use case "Manage user"

    1.2- The Activity diagram

1. The user pronounces a sentence through a microphone.

2. The voice signal is then analyzed using the model to obtain an acoustic signal.

3. The decoding describes the acoustic signal in terms of linguistic units; it aims to segment the signal.

4. The segmented signal is compared with the database (dictionary) thanks to the search graph.

5. The action is displayed on screen.


    Figure 8: Activity Diagram


    2 - Non-Functional Needs:

Besides the functional needs developed above, we must consider the following constraints:

- The service quality of the application.

- The ergonomics of the application: the interfaces of our application must be clear.


- The response time of the application should be minimal.

    III-Functional Analysis

    Figure 9: The functional analysis

1 - Cutting into categories

It consists of:

1) Dividing the classes into candidate categories.

2) Elaborating preliminary class diagrams by category.

3) Deciding the dependencies between categories.


    Figure 10: Cutting into categories

1.1- The package diagram:

The package diagram is a graphical representation of the relationships between the packages of the speech recognition system.


Figure 11: Package diagram

The most general package is "general media treatment", decomposed into two packages: "sound treatment" and "speech treatment".

- Sound is the wave that is audible to the human ear.

- Speech is the process of stretching and relaxing the vocal cords to produce sound.

The package we are interested in is "speech treatment", which in turn is divided into two sub-packages, "speech synthesis" and "speech recognition". Our development is based on "speech recognition".


    2 - Development of the static model

    Figure 12: nominal scenario of the use case "Voice Authentication"

2.1- Diagram of classes:

The different classes are:

- Caller

- Instruction

- Language Instruction

- Recording

- Speech Recognizer

- Feature

- Feature Extraction


- Feature Classification

- Feature Matching

- Code book

- Action

The various relationships are:

- Listen

- Record

- Send Speech Signal

- Perform

- Search And Match

- Contain

    Figure 13: Model participating Class Diagram


2.2- Description of the class diagram

The class "Caller" has a "Listen" relationship with the class "Instruction". The caller can listen to a type of instruction, which is "Language Instruction".

The class "Caller" is then associated through "Record" with the class "Recording". This class is in turn associated through "Send Speech Signal" with the class "Speech Recognizer".

The class "Speech Recognizer" is associated with the class "Feature" in a "Perform" relationship, which means that the class "Speech Recognizer" uses the class "Feature" for feature extraction, feature classification and feature matching. The class "Feature Matching" is associated through the "Search and Match" relationship with the class "Code book" to match the input speech, and is associated to the class "Action" with the "Contain" relation.

Note: To ensure clarity of the diagram we preferred not to show the attributes and methods of the classes.

We defined the physical model to consist of 3 classes that facilitate the implementation of our application.
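The classes and relations described above can be sketched in code as follows; this is our own illustrative rendering, with invented placeholder attributes, since the diagram deliberately omits attributes and methods:

```python
# Sketch of the participating classes: Caller -Record-> Recording
# -Send Speech Signal-> Speech Recognizer -Perform-> Feature, with
# Search And Match against the Code book. Attributes are placeholders.

from dataclasses import dataclass

@dataclass
class Recording:
    samples: list  # digitized speech signal

@dataclass
class Feature:
    values: list   # e.g. time / frequency / intensity characteristics

@dataclass
class CodeBook:
    entries: dict  # reference dictionary: word -> reference feature value

@dataclass
class SpeechRecognizer:
    codebook: CodeBook

    def perform(self, recording: Recording) -> str:
        # Feature extraction (placeholder: mean amplitude) ...
        feature = Feature([sum(recording.samples) / len(recording.samples)])
        # ... then Search And Match against the code book.
        return min(self.codebook.entries,
                   key=lambda w: abs(self.codebook.entries[w] - feature.values[0]))
```

For example, a recognizer built on the code book `{"ok": 0.5, "no": -0.5}` maps a recording whose mean amplitude is 0.5 to the word "ok".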

    3 - Development of dynamic model:

The development of the dynamic model is the third activity of the analysis stage. It is situated on the left branch of the Y cycle. This is an iterative activity, strongly coupled with the static modeling activity described above. The development of the dynamic model precedes the preliminary design.


Figure 13: The dynamic model

    3.1- Sequence diagram:

The sequence diagram is mainly used to show the interactions between the classes / objects listed in the previous section, in the sequential order in which the interactions take place. The figure shows the sequence diagram of the system, including the classes / objects, lifelines, processes and interactions. The interactions between the seven classes / objects are numbered 1 through 11 sequentially, which indicates which process must be completed before the following one can start.


Figure 14: Object sequence diagram of speech recognition

    Note:

The nominal scenario for voice recognition has been represented in detail in the sequence diagram above (see Figure 14). In the following sequence diagrams we chose to group the class instances related to recognition into a single participant called "Recognition System", according to the package diagram described earlier (see Figure 11).


[Figure content: the administrator pronounces the login and password; the recognition system verifies the grammar containing them. If the grammar is valid, the login and password are inserted into the text fields and verified against the MySQL database; valid connection parameters open the MySQL Administrator interface, while invalid parameters trigger a new demand for the connection parameters. An invalid grammar also triggers a demand for the connection parameters.]

    Figure 15: sequence diagram representing the connection


[Figure content: the administrator pronounces the user-privilege control command; the recognition system verifies the grammar. If the grammar is valid, the mouse is manipulated to insert the privilege, which is added to the database; otherwise the system demands that the command be pronounced again.]

Figure 16: sequence diagram representing the assignment of a privilege to a user


[Figure content: the administrator pronounces the user-creation command; the recognition system verifies the grammar. If the grammar is valid, the mouse is manipulated to insert the new user, the user is added and the new-user information interface opens; the administrator then pronounces the login information, which, if its grammar is valid, is inserted into the text fields and attributed to the user. An invalid grammar triggers a demand to pronounce the command again.]

Figure 17: sequence diagram for the addition of a user


3.2- Diagram of state transitions

Now that the scenarios have been formalized, the knowledge of all the interactions between objects allows us to represent the business rules of the system dynamics. However, we should focus precisely on the classes with the richest behavior in order to develop some of these dynamic rules. We use the concept of the finite state machine, which involves tracking the life cycle of a generic object of a particular class through its interactions with the rest of the world, in all possible cases. The local view of an object, describing how it reacts to events according to its current state and moves into a new state, is plotted as a state diagram.

Figure 18: diagram of state transitions
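The finite state machine concept described above can be sketched as follows; the states and events are illustrative, loosely based on the recognizer's life cycle, and are not taken from the project's actual diagram:

```python
# A minimal finite state machine tracking an object's life cycle:
# it reacts to an event according to its current state and moves
# into a new state, exactly as the state diagram describes.

class StateMachine:
    def __init__(self, start, transitions):
        # transitions: (state, event) -> next state
        self.state = start
        self.transitions = transitions

    def fire(self, event):
        """React to an event according to the current state."""
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"event {event!r} not allowed in state {self.state!r}")
        self.state = self.transitions[key]
        return self.state

sm = StateMachine("idle", {
    ("idle", "speech detected"):     "decoding",
    ("decoding", "grammar valid"):   "executing",
    ("decoding", "grammar invalid"): "idle",
    ("executing", "done"):           "idle",
})
sm.fire("speech detected")  # -> "decoding"
```

Disallowed events raise an error rather than being silently ignored, which makes the "all possible cases" requirement of the life cycle explicit.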


    4-Confrontation between the static and the dynamic models:

Various relationships exist between the main concepts of the static model (object, class, association, attribute and operation) and the main dynamic concepts (message, event, state and activity).

The correspondences are far from trivial, because these are complementary, not redundant, points of view. Let us try to synthesize the most important ones, without being exhaustive:

- A message can be the invocation of an operation on an object (the receiver) by another object (the sender);

- An event, or the effect on a transition, may correspond to the call of an operation;

- An activity in a state may correspond to the performance of a complex operation or a series of operations;

- An interaction diagram involves objects (or roles);

- An operation can be described by an interaction or activity diagram;

- A guard condition and a change event can reference attributes or static links;

- An effect on a transition can handle attributes or static links;

- The parameter of a message can be an attribute or an entire object.


CHAPTER III: THE TECHNICAL PART

    I-Capture of the technical requirements:

The capture of technical requirements identifies all the constraints on the choice and dimensioning of the system design: the tools and equipment selected, as well as the constraints of integration with the existing environment (the pre-required technical architecture).

Figure 19: Capture of the technical requirements

Part of the work consisted in studying the functioning of speech recognition systems, and then in developing an acoustic model allowing the recognition of words.

This is why we propose, in a first step, to introduce the Hidden Markov Models, the mathematical concept that will allow us to discuss the layout of automatic speech recognition (ASR) systems. In a second step, we will apply this model to our project.

1- The Hidden Markov Models:


Definition:

A Markov process is a discrete-time system which is always in one state taken from N distinct states. Transitions between states occur between two consecutive discrete instants, according to some probability law. The probability of each state depends only on the state that immediately precedes it.

A hidden Markov model (HMM) represents, in the same way as a Markov chain, a whole sequence of observations; the state behind each observation is not observed, but is associated with a probability density function. It is therefore a stochastic process in which the observations are a random function of the state, and in which the state changes at every instant according to the probabilities of transition from the previous state.

    Figure 20: The Markov model

More formally, a hidden Markov machine is characterized by the quadruplet of sets described below:

- Si, the i-th state;

- πi, the probability of Si being the initial state;

- aij, the probability of the transition Si → Sj;

- bi(k), the probability of emitting the symbol k while in the state Si.

On condition that:

- the sum of the probabilities of the initial states is equal to 1: Σi πi = 1;

- the sum of the probabilities of the transitions from a state is equal to 1: Σj aij = 1 for every i;

- the sum of the probabilities of the outputs from a state is equal to 1: Σk bi(k) = 1 for every i.


We can describe a hidden Markov model as the parameter set

λ = (π, A, B)

with:

- π the set of the initial probabilities;

- A the set of transition probabilities between states;

- B the set of laws (or densities) of probabilities associated with each state.
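As an illustration of the λ = (π, A, B) parameterization, the following sketch implements the standard forward algorithm for computing the probability of an observation sequence under such a model; the toy numbers are invented, and each row respects the sum-to-1 constraints stated above:

```python
# A minimal discrete HMM and the forward algorithm.

def forward(pi, A, B, obs):
    """Compute P(obs | model) with the forward algorithm.

    pi[i]   : probability that state i is the initial state
    A[i][j] : probability of the transition from state i to state j
    B[i][k] : probability of emitting symbol k while in state i
    obs     : sequence of observed symbol indices
    """
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: P(obs) = sum_i alpha_T(i)
    return sum(alpha)

# Two states, two symbols; each row sums to 1 as required.
pi = [0.6, 0.4]
A  = [[0.7, 0.3],
      [0.4, 0.6]]
B  = [[0.9, 0.1],
      [0.2, 0.8]]

p = forward(pi, A, B, [0, 1, 0])
```

Because π, the rows of A and the rows of B each sum to 1, the probabilities of all possible observation sequences of a given length also sum to 1, which is a convenient sanity check on an implementation.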

    2-The Voice recognition theory:

Whether we use speech recognition to dial a phone number, browse through the windows on our computer, enter data into software, or dictate a letter in a word processor, the basic problem remains the same: identify the meaning of a flow of words, often uttered over more or less significant background noise.

This task is made difficult not only by the deformations induced by the use of a microphone, but also by a number of factors inherent to human language:

- Homonyms, where the same sequence of sounds can correspond to several words (like the French sound "an" in "cent", "sans" and "sang", the last meaning blood).

- Local accents.

- Patterns of language, such as elisions that make it difficult to separate words ("j'vais l'chercher" for "je vais le chercher", "I am going to fetch it").

- Speed differences between users.

- Imperfections of the microphone ...

For human ears, these factors do not usually represent difficulties. The brain copes with these deformations of speech by taking into consideration, almost unconsciously, nonverbal and contextual elements that allow it to eliminate ambiguities.

It is only by taking into account these elements surrounding the sound itself that voice recognition software can achieve high degrees of reliability. Today, the software that gives the best results is all based on a probabilistic approach.

The aim of speech recognition is to reconstruct a sequence of words M from a recorded acoustic signal A.


In the statistical approach, we consider all the sequences of words M which could match the signal A.

In this set of possible sequences we then choose the one which is most likely, that is to say, the one that maximizes the probability P(M|A) that M is the correct interpretation of A, which we note M* = argmax_M P(M|A).

    Figure 21: reconstruction of a sequence of words M from a recorded acoustic signal A.

Note that P(A|B) represents the probability of event A given that event B has occurred.

Bayes' rule computes the probability of the co-occurrence of two events A and B through the following equalities:

P(A and B) = P(A|B) P(B) = P(B|A) P(A)

where P(A) is the probability that event A occurs.

Thus, Bayes' rule lets us rewrite the expression:

P(M|A) = P(A|M) P(M) / P(A)

And as P(A) is a constant in the search for the best M, we finally have the equation:

M* = argmax_M P(A|M) P(M)

    This last equation is the key to the probabilistic approach to speech recognition. In fact, the first term, P(A|M), represents the probability of observing the acoustic signal A if the sequence of words M was pronounced: it is a purely acoustic problem.

    The second term, P(M), represents the probability that the sequence of words M was pronounced: it is a linguistic problem.

    The above equation thus tells us that we can divide the problem of speech recognition into two independent parts: the acoustic aspects and the linguistic aspects are modeled separately.
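This separation can be sketched in a few lines of Java: given log-probabilities from an acoustic model and a language model, the recognizer keeps the hypothesis maximizing their sum (products become sums in log space). The transcriptions and all scores below are invented illustration values; a real recognizer scores the many hypotheses produced by the search.

```java
import java.util.HashMap;
import java.util.Map;

public class BayesDecision {
    // Pick the hypothesis M maximizing log P(A|M) + log P(M).
    public static String best(Map<String, Double> acousticLog, Map<String, Double> lmLog) {
        String argmax = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String m : acousticLog.keySet()) {
            double score = acousticLog.get(m) + lmLog.getOrDefault(m, Double.NEGATIVE_INFINITY);
            if (score > bestScore) { bestScore = score; argmax = m; }
        }
        return argmax;
    }

    public static void main(String[] args) {
        // Hypothetical scores for two competing transcriptions.
        Map<String, Double> ac = new HashMap<>();
        ac.put("open the file", -120.0);
        ac.put("open the fall", -118.0); // acoustically slightly better
        Map<String, Double> lm = new HashMap<>();
        lm.put("open the file", -4.0);   // but linguistically far more likely
        lm.put("open the fall", -9.0);
        System.out.println(best(ac, lm)); // prints "open the file"
    }
}
```

Note how the language model can overturn the acoustic model's preference, which is exactly why the two terms are combined.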


    Thus, the transcription is divided into several modules:

    - feature extraction, which produces A;

    - the acoustic model, which computes P(A|M) and searches for the hypotheses M most likely associated with A;

    - the language model, which computes P(M) to select one or more hypotheses for M depending on linguistic knowledge.

    The following schema illustrates the components of a transcription system.

    Figure 22: The transcription system

    3- Feature extraction:

    The sound signal to be analyzed takes the form of a wave whose intensity varies over time. The first stage of the transcription process is to extract a series of numerical values that are sufficiently informative at the acoustic level to decode the signal thereafter.


    The signal may contain areas of silence, noise or music. These areas are first removed in order to keep only the portions of the signal useful to the transcription, that is to say, those corresponding to speech.

    The sound signal is then segmented into what are called breath groups, using sufficiently long silent pauses (about 0.3 s) as delimiters. The advantage of this segmentation is to obtain continuous stretches of sound of a reasonable size compared to the computational capabilities of the ASR system. Later in the transcription process, the analysis is done separately for every breath group.

    To track the changes in the signal, which generally varies rapidly over time, the breath group is itself divided into analysis windows of a few milliseconds (usually 20 or 30 ms). In order to avoid losing important information at the beginning or end of the windows, they are made to overlap, which leads to extracting features every 10 ms.

    From the signal contained in each analysis window, numerical values characterizing the human voice are computed. After this step, the signal becomes a sequence of so-called acoustic vectors, whose dimension is often greater than or equal to 39.
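As a concrete sketch of the windowing step, the Java fragment below splits a signal into overlapping analysis windows. The 20 ms window and 10 ms step match the usual values cited above (at a 16 kHz sampling rate they give 320 and 160 samples); the class and method names are ours.

```java
public class Framer {
    // Split a signal into overlapping analysis windows.
    // windowSize and hop are in samples: 20 ms and 10 ms at 16 kHz
    // give windowSize = 320 and hop = 160.
    public static double[][] frames(double[] signal, int windowSize, int hop) {
        int n = signal.length < windowSize ? 0 : (signal.length - windowSize) / hop + 1;
        double[][] out = new double[n][windowSize];
        for (int i = 0; i < n; i++) {
            System.arraycopy(signal, i * hop, out[i], 0, windowSize);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] oneSecond = new double[16000]; // 1 s of signal at 16 kHz
        double[][] f = frames(oneSecond, 320, 160);
        System.out.println(f.length); // prints 99
    }
}
```

Each of these windows would then be turned into one acoustic vector (e.g. of dimension 39) by the feature computation proper.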

    4- The Acoustic Model:

    The next step is to associate with the acoustic vectors, which are, as we have seen, numeric vectors, a set of word hypotheses (symbols). Referring to equation 1 of the statistical modeling, this amounts to estimating P(A|M). The techniques for calculating this value form what is called the acoustic model.

    The most widely used tool for acoustic modeling is the Hidden Markov Model (HMM) presented above. HMMs have indeed shown their effectiveness in practice for recognizing speech. Even if they have some limitations in modeling signal characteristics, such as the duration or length of successive acoustic observations, HMMs offer a well-defined mathematical framework for computing the probabilities P(A|M).

    Acoustic models involve three levels of HMM, shown in the figure below.


    Figure 23: The acoustic model

    At the first level, they seek to recognize the types of sound, in other words to identify the phones (the sounds pronounced by speakers, defined by specific characteristics). To do this, a phone is modeled by an HMM, usually with 3 states representing its beginning, middle and end. The hidden variable is then the sub-phone, and the observations are the acoustic vectors.

    To calculate the observation probabilities in each state, two approaches are often considered: one based on representing the probability densities by Gaussians, the other based on neural networks. These methods establish hypotheses about the likelihood of the phones uttered. However, the aim of acoustic models is to determine a sequence of words. For this purpose, acoustic models use a pronunciation dictionary, which maps each word to its pronunciations. As a word may be pronounced in different ways, according to its predecessor and its successor, or simply to the habits of the speaker, there may be multiple entries in the lexicon for the same word. Pronunciations are specified as sequences of phonemes.


    The second level of HMM models the words from the HMMs representing the phones and from the pronunciation lexicon. It comes in the form of a lexical tree, initially containing all the words of the vocabulary, which is gradually pruned as the phones are accepted. Since the first-level HMMs model phones rather than phonemes, the phonemes found in the pronunciation dictionary are converted into phones in order to recognize words. Transformation rules depending on the context of the phoneme are then used.

    The third level finally models the sequence of words M in a breath group, and can then incorporate the knowledge provided by the language model on M. To establish the HMM equivalent to a word graph, the HMM corresponding to the lexical tree is duplicated each time the acoustic model makes the hypothesis that a new word has been recognized.

    The functioning of the acoustic model just described faces a major problem: the search space of the higher-level HMM is often considerable, especially if the vocabulary is large and if the breath group to be analyzed contains multiple words. Algorithms based on dynamic programming can compute the probabilities efficiently. These are mainly the Viterbi algorithm and stack decoding, also called A* decoding. In addition, very regular pruning is used to keep only the hypotheses that look most promising.
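To illustrate the role of the Viterbi algorithm, here is a minimal Java sketch on a toy HMM. The initial, transition and emission probabilities are invented illustration values, not those of a real acoustic model; log-probabilities are used, as in practice, to avoid numerical underflow.

```java
public class Viterbi {
    // Find the most likely hidden-state path for an observation sequence.
    // init[s]: initial probability of state s
    // trans[p][s]: probability of moving from state p to state s
    // emit[s][o]: probability of observing symbol o in state s
    public static int[] decode(double[] init, double[][] trans, double[][] emit, int[] obs) {
        int n = init.length, T = obs.length;
        double[][] v = new double[T][n];   // best log-score ending in each state
        int[][] back = new int[T][n];      // backpointers for path recovery
        for (int s = 0; s < n; s++)
            v[0][s] = Math.log(init[s]) + Math.log(emit[s][obs[0]]);
        for (int t = 1; t < T; t++)
            for (int s = 0; s < n; s++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < n; p++) {
                    double score = v[t - 1][p] + Math.log(trans[p][s]);
                    if (score > best) { best = score; back[t][s] = p; }
                }
                v[t][s] = best + Math.log(emit[s][obs[t]]);
            }
        // backtrack from the best final state
        int[] path = new int[T];
        double best = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < n; s++)
            if (v[T - 1][s] > best) { best = v[T - 1][s]; path[T - 1] = s; }
        for (int t = T - 1; t > 0; t--) path[t - 1] = back[t][path[t]];
        return path;
    }

    public static void main(String[] args) {
        // Toy 2-state HMM with 3 observation symbols (invented numbers).
        double[] init = {0.6, 0.4};
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}};
        int[] path = decode(init, trans, emit, new int[]{0, 1, 2});
        System.out.println(java.util.Arrays.toString(path)); // prints [0, 0, 1]
    }
}
```

A real decoder works on far larger graphs, which is why the pruning mentioned above is indispensable.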

    The role of the acoustic model is thus to align the sound signal with word hypotheses using only acoustic cues. In its last level, it incorporates the information about word sequences introduced by the language model.

    5- The language model:

    The language model is intended to find the most likely sequences of words, in other words those that maximize the value P(M) of equation 1. If one refers to the highest level of HMM in the acoustic model (see previous figure), the values P(M) are the probabilities of successive words.

    a) Functioning of a language model

    By writing M = m1 ... mn, where mi is the word of rank i in the sequence M, the probability P(M) decomposes as follows:

    P(M) = P(m1) P(m2 | m1) ... P(mn | m1 ... mn-1)


    The evaluation of P(M) then reduces to calculating the values P(mi) and P(mi | m1 ... mi-1), which are obtained using the equalities:

    P(mi) = C(mi) / Σ_{m ∈ V} C(m)

    P(mi | m1 ... mi-1) = C(m1 ... mi) / C(m1 ... mi-1)

    where V is the vocabulary used by the ASR system, and C(mi) and C(m1 ... mi) represent the respective numbers of occurrences of the word mi and of the word sequence m1 ... mi in the training corpus. Unfortunately, the number of parameters P(mi) and P(mi | m1 ... mi-1) of the language model to estimate increases exponentially with n. In order to reduce this number, P(mi | m1 ... mi-1) is modeled by an N-gram, that is to say a Markov chain of order N-1 (with N > 1), using the approximation:

    P(mi | m1 ... mi-1) ≈ P(mi | mi-N+1 ... mi-1)

    This equation indicates that each word mi may be predicted from the N-1 preceding words. For N = 2, 3 or 4 we speak respectively of a bigram, trigram or quadrigram model. For N = 1, the model is called a unigram model and reduces to estimating P(mi).

    Generally, it is bigram, trigram and quadrigram models that are used in the language models of ASR systems.
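The maximum-likelihood estimation of a bigram model can be sketched as follows. The tiny training "corpus" in the example is invented, and for simplicity the history count C(w1) is approximated by the unigram count; a real language model would also apply smoothing to unseen bigrams.

```java
import java.util.HashMap;
import java.util.Map;

public class BigramModel {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    // Count unigrams C(m) and bigrams C(m1 m2) in a training corpus.
    public void train(String[] corpus) {
        for (int i = 0; i < corpus.length; i++) {
            unigramCounts.merge(corpus[i], 1, Integer::sum);
            if (i > 0) bigramCounts.merge(corpus[i - 1] + " " + corpus[i], 1, Integer::sum);
        }
    }

    // Maximum-likelihood estimate P(w2 | w1) = C(w1 w2) / C(w1).
    public double prob(String w1, String w2) {
        int c = unigramCounts.getOrDefault(w1, 0);
        return c == 0 ? 0.0 : bigramCounts.getOrDefault(w1 + " " + w2, 0) / (double) c;
    }

    public static void main(String[] args) {
        BigramModel lm = new BigramModel();
        lm.train(new String[]{"open", "the", "file", "close", "the", "window"});
        System.out.println(lm.prob("the", "file")); // C(the file)/C(the) = 1/2, prints 0.5
    }
}
```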

    6- The choice of the Sphinx API:

    Sphinx-4 is a speech recognizer written entirely in Java. Its goals are a highly flexible speech recognizer able to equal commercial products, developed through collaborative research by various universities, the laboratories of Sun and HP, and MIT. While being highly configurable, Sphinx-4 recognition supports, among other things, isolated words and phrases (through the use of a grammar). Its architecture is scalable, in order to enable new research and the testing of new algorithms.

    The recognition quality depends directly on the quality of the voice data, that is, the information describing the voices themselves: the different phonemes, the individual words (vocabulary), the different ways of pronouncing them. The more of this information is known to the system, the better its reaction and the better the choices it makes.

    As shown in the following figure, which represents its architecture, Sphinx-4 is based on 3 modules.


    Figure 24: General architecture of Sphinx-4

    6.1- The architecture of Sphinx-4:

    Figure 25: Detailed Architecture of Sphinx-4

    The main blocks are the FrontEnd, the Decoder and the Linguist. The support blocks include the configuration manager and the tool blocks.

    The FrontEnd takes one or more input signals and parameterizes them into a sequence of features. The Linguist translates any kind of standard language model, together with pronunciation information from the dictionary and structural information from one or more sets of acoustic models, into a search graph. The SearchManager in the Decoder uses the features from the FrontEnd and the search graph from the Linguist to do the actual decoding, generating results. At any time before or during the recognition process, the application can issue controls to each module, becoming a partner in the recognition process.

    a) The FrontEnd

    The FrontEnd cuts the recorded voice into different parts and prepares them for the decoder. Its aim is to transform an input signal (for example, audio) into a sequence of output features.

    As illustrated in Figure 26, the FrontEnd comprises one or more parallel chains of replaceable, communicating signal-processing modules called "DataProcessors". Support for multiple chains allows the simultaneous computation of different types of parameters from identical or different input signals. This allows the creation of systems that can simultaneously decode using parameters derived from speech and non-speech signals.

    Figure 26: Parallel chains of communicating DataProcessors

    b) The Linguist:

    The Linguist generates the SearchGraph that is used by the decoder during the search, while hiding the complexity of the generation of this graph. As elsewhere in Sphinx-4, the Linguist is a pluggable module, which allows people to dynamically configure the system with different Linguist implementations.

    A typical implementation constructs the SearchGraph using the structure of the language represented by a given LanguageModel and the topological structure of the AcousticModel (HMMs for the basic sound units used by the system). During the generation of the SearchGraph, the Linguist may also incorporate sub-word units with contexts of arbitrary length.

    By allowing different implementations of the Linguist to be plugged in at run time, Sphinx-4 allows individuals to provide different configurations for different recognition tasks. For example, a simple digit-recognition application may use a simple Linguist that keeps the search space entirely in memory. The Linguist is built around three components, which are described in the following sections:


    The language model

    The dictionary

    The acoustic model

    b.1) The language model

    Role:

    Describes what can be said in a very specific context.

    Helps narrow the search space.

    There are three kinds of language model: the simplest is used for isolated words, the second for command-and-control applications, and the last for everyday language.

    The language model implementation supports several types of grammars; we opted for the JSGF grammar, which supports the Java™ Speech API Grammar Format (JSGF) [20], a BNF-style, platform-independent and vendor-independent Unicode representation of grammars.

    b.2) The Dictionary

    The dictionary gives the pronunciations of the words found in the LanguageModel. The pronunciations break the words into sequences of the sub-word units found in the AcousticModel. The Dictionary interface also supports the classification of words, allowing a single word to belong to several classes.

    b.3) The AcousticModel

    The AcousticModel module provides a correspondence between a unit of speech and an HMM that can be scored against the incoming features provided by the FrontEnd.

    b.4) The SearchGraph

    The SearchGraph is the main data structure used during the decoding process.

    It is a directed graph in which each node, called a SearchState, represents either an emitting or a non-emitting state. Emitting states can be scored against the incoming acoustic features, while non-emitting states are generally used to represent higher-level linguistic constructs, such as words and phonemes, that are not directly scored against the incoming features. The arcs between states represent possible state transitions, each with a probability representing the likelihood of transitioning along the arc.


    How the SearchGraph is built affects the memory footprint, the speed and the accuracy of recognition. The modular design of Sphinx-4, however, allows different SearchGraph compilation strategies to be used without changing other aspects of the system.

    The choice between static and dynamic construction of the language HMMs depends mainly on the size of the vocabulary, the complexity of the language model and the desired memory footprint of the system, and can be made by the application.

    c) The Decoder

    The Decoder is the heart of Sphinx-4. It processes the information received from the FrontEnd, analyzes it and compares it with the knowledge base in order to give a result to the application.

    The main role of the Sphinx-4 Decoder block is to use the features from the FrontEnd, in collaboration with the SearchGraph from the Linguist, to generate result hypotheses. The Decoder block includes a pluggable SearchManager and other supporting code that simplifies the decoding process for an application. As such, the most interesting element of the Decoder block is the SearchManager. The Decoder simply tells the SearchManager to recognize a set of feature frames. At each step of the process, the SearchManager creates a Result object that contains all the paths that have not yet reached a final non-emitting state.
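Putting the three blocks together, a typical Sphinx-4 recognition loop looks like the sketch below, adapted from the pattern used by the demos shipped with Sphinx-4. The configuration file name is ours, and the Sphinx-4 jars listed earlier must be on the classpath for this to compile and run.

```java
import edu.cmu.sphinx.frontend.util.Microphone;
import edu.cmu.sphinx.recognizer.Recognizer;
import edu.cmu.sphinx.result.Result;
import edu.cmu.sphinx.util.props.ConfigurationManager;

public class VoiceCommandLoop {
    public static void main(String[] args) throws Exception {
        // Load the XML configuration ("commands.config.xml" is a hypothetical name).
        ConfigurationManager cm =
                new ConfigurationManager(VoiceCommandLoop.class.getResource("commands.config.xml"));

        // Look up and allocate the recognizer defined in the configuration.
        Recognizer recognizer = (Recognizer) cm.lookup("recognizer");
        recognizer.allocate();

        // Start capturing audio from the microphone component.
        Microphone microphone = (Microphone) cm.lookup("microphone");
        if (microphone.startRecording()) {
            while (true) {
                // Blocks until an utterance has been decoded.
                Result result = recognizer.recognize();
                if (result != null) {
                    System.out.println("You said: " + result.getBestFinalResultNoFiller());
                }
            }
        }
    }
}
```

The recognized text returned by the Result can then be mapped to an action, such as launching SKYPE or executing a MySQL command.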

    7- Technical use case diagram

    This results in a diagram of the overall operation of the system and of the various actions of the actors.

    The study of the needs of the actors who interact with our system requires the development of the following use case diagram:


    Figure 27: Technical use case diagram

    The different use cases are:

    Listen for instructions: capture the signal from the microphone

    Save the speech signal: save the signals coming from the microphone

    Analyze the speech signal: segment the signal into phonemes

    Match the speech signal: match the signal against the database

    Match the feature vector: match the characteristics of the analyzed signal against the database

    Extract the feature vector: analyze the signal by extracting its significant characteristics

    Classify the feature vector: classify the analyzed signal entities by category

    The different actors are:

    User

    The dictionary (code book)

    II- The Generic Design

    The generic design defines the components needed to build the technical architecture. This design is completely independent of the functional aspects. It aims to standardize and reuse the same mechanisms for all systems. The technical architecture forms the backbone of the system; its importance is such that it is advisable to build a prototype of it.


    Figure 28: The generic design

    Software layers

    Sphinx-4 has been compiled and tested on Solaris, Mac OS X, Linux and

    Windows. The execution, compilation and testing of Sphinx-4 require additional

    software. The following software must be installed on the machine:

    - Java SDK 5.1 (http://java.sun.com)

    - The various libraries that make up Sphinx-4

    Software setup and configuration:

    a) Implementation of the library with Eclipse


    The implementation of Sphinx-4 in an arbitrary application is relatively easy. The first step is to create a new project (menu File - New - Project). The figure below shows how to create a new project in Eclipse.

    Figure 29: Creation of a new project

    The second step is to add the Sphinx-4 libraries to the project. To do this, right-click on the project and open the project properties, then choose the "Java Build Path" menu. Finally, click on "Add External JARs" to add the various libraries provided by Sphinx. The libraries to add are the following:

    Figure 30: Inserting the Sphinx-4 libraries into the project

    - js.jar

    - jsapi.jar (this must be created by launching the application jsapi.exe located in the lib directory of the downloaded archive; this library is used by Java, among other things, to record sound)

    - sphinx4.jar

    - TIDIGITS_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar (only for the recognition of numbers)

    - WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz.jar

    - WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar

    b) Writing the grammar

    To perform recognition, we must write a grammar, that is to say a file describing the terms that must be recognized by the program. The grammars used by Sphinx follow the JSGF format (Java Speech Grammar Format); we must therefore create a file with the extension ''.gram''. This file contains the grammar used by the application, that is to say the words or phrases that can potentially be pronounced.
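For illustration, a minimal JSGF file might look as follows; the grammar name and the command words here are invented, and our actual grammar is the one shown in Figure 31.

```jsgf
#JSGF V1.0;
grammar commands;
// each utterance is one action verb followed by one target application
public <command> = (open | close | start) (skype | firefox | mysql);
```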

    b.1) Example of grammar

    Figure 31: Grammar file

    The grammar file above allows the recognition of all of the following sentences:


    Figure 32: List of sentences that can be pronounced

    The following figure represents the above grammar graphically.

    Figure 33: Graphical grammar structure

    c) Writing the configuration file for Sphinx

    After writing the grammar file, we must create the configuration file Filename.config.xml. The easiest way is to adapt a configuration file from one of the demonstrations provided in the downloaded archive. This file specifies, among other things, the dictionary and the grammar used.
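As an illustration of the format, the excerpt below shows how a grammar might be declared in such a configuration file. The component and property names follow the JSGFGrammar component of Sphinx-4; the grammar location and name values are our hypothetical ones.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- declare the JSGF grammar used by the linguist -->
    <component name="jsgfGrammar" type="edu.cmu.sphinx.jsapi.JSGFGrammar">
        <property name="dictionary" value="dictionary"/>
        <property name="grammarLocation" value="resource:/grammar/"/>
        <property name="grammarName" value="commands"/>
        <property name="logMath" value="logMath"/>
    </component>
</config>
```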


    Figure 34: XML configuration file


    Chapter VI THE MIDDLE PART

    I- The design part:

    Figure 35: The design part

    The design model organizes the system into components delivering technical and functional services. This model combines the information from the right branch and the left branch. It can be considered as the transformation of the analysis model obtained by projecting the analysis classes onto the software layers.

    The preliminary design is a delicate step because it integrates the functional analysis model into the technical architecture in order to draw the map of the system components to be developed.

    The detailed design then examines how to realize each component.

    The coding step produces the components and tests the code units as and when they are completed.


    The acceptance step finally validates the functionality of the developed system.

    1- Detailed Design:


    Figure 36: The detailed class diagram

    The detailed class diagram follows from the general class diagram (described in the part "2.1- The class diagram").

    NB: Note that some classes are transformed as follows:

    Class Codebook: becomes our dictionary (database).

    Class Instruction: becomes the grammar file.

    Class LanguageInstruction: becomes the grammar file.

    II- Realization part

    1- Description of the application interfaces:

    In this part of the project, we show a first example of our application.

    This interface is the home interface for all users.

    Figure 36: The home interface

    The following interface shows the process of adding a new application.


    Figure 37: The home interface

    This interface shows how to edit an existing application.


    Conclusion


    This project has led to the creation of an application for vocally manipulating other applications, namely MySQL and SKYPE.

    Thus, research on the internet and a careful study of the working tools were carried out in order to choose the most appropriate architecture for the system.

    Throughout this project we did our best to improve our application, but we faced a major problem: the development of an acoustic model customized to each user of our application.

    Concretely, what distinguishes the applications present on the market (Dragon Naturally Speaking, Speak Q, etc.) is the degree of perfection of the acoustic model; building such a model may be considered the most important task, as it requires additional time beyond the deadline of our project.