Indian Language to Indian Language Machine Translation

  • Indian Language to Indian Language Machine Translation

    System

    (ILMT) System

    Software Requirement Specifications

    version 1.03

    October 2008

    IIIT Hyderabad

  • SRS V1.03 Issue Date: 06/10/2008

    Table of Contents

    1 Introduction
      1.1 Purpose
      1.2 Scope
      1.3 Definitions, acronyms, and abbreviations
      1.4 References
      1.5 Overview

    2 Overall description
      2.1 Product Perspective
      2.2 Product functions
      2.3 User Characteristics
      2.4 Constraints: System Structure/Architecture
      2.5 Assumptions and dependencies
      2.6 Software Engineering Approaches (Reference: s/w engg process for ILMT)
      2.7 ILMT System: Accuracy/User satisfaction
        2.7.1 ILMT System: Accuracy/User
      2.8 Methodology

    3 Specific Requirements
      3.1 System Structure
        3.1.1 Common Representation SSF
      3.2 ILMT System Architecture
      3.3 Information Flow
      3.4 Process Description
      3.5 Specification of Individual Modules
        3.5.1 Preprocessor
        3.5.2 Tokenizer
        3.5.3 Sandhi Splitter
        3.5.4 Morph Analyzer
        3.5.5 POS Tagger
        3.5.6 Chunker
          3.5.6.1 Chunking
          3.5.6.2 Pruning

    Confidential ILMT, 1.03 Page i of 2

            3.5.6.2.1 Morph Pruning
            3.5.6.2.2 Guess Morph
            3.5.6.2.3 Pick one Morph
          3.5.6.3 Head computation
          3.5.6.4 Inherit head features
        3.5.7 Local Word Grouper/Splitter
          3.5.7.1 Local Word Grouper/Vibhakti Computation
        3.5.8 Named Entity Recognizer (NER)
        3.5.9 Simple Parser
        3.5.10 Lexical Sense Disambiguation
        3.5.11 SL to TL Transfer
          3.5.11.1 Transfer Engine Module
          3.5.11.2 Lexical transfer Engine
          3.5.11.3 Transliteration
        3.5.12 Put target language features Agreement Feature
        3.5.13 Interchunk Agreement
        3.5.14 Intrachunk Agreement
        3.5.15 TAM Vibhakti Splitter
        3.5.16 Agreement Distribution in split vibhakti
        3.5.17 Assign Default Features
        3.5.18 Word Generation
      3.6 Evaluation

    4 System Integration and Testing
      4.1 Dashboard
      4.2 Graphical User Interface (GUI)

    References

    Appendix
      Appendix A: XML File Structure for Corpora Creation (.cml files)
      Appendix B: Shakti Standard Format: BNF with Brief Description

    Revision History

    1 Introduction

    1.1 Purpose

    The Indian Language to Indian Language Machine Translation System (henceforth referred to as the ILMT System) will be a bidirectional machine translation system, to be developed for nine Indian language pairs; hence, there will be nine ILMT Systems, one for each language pair. To distinguish these nine products, we will use a three-character code for each language as a suffix to the product name, i.e.,

    - ILMT_TamHin for Tamil-Hindi
    - ILMT_TelHin for Telugu-Hindi
    - ILMT_MarHin for Marathi-Hindi
    - ILMT_BenHin for Bangla-Hindi
    - ILMT_TamTel for Tamil-Telugu
    - ILMT_UrdHin for Urdu-Hindi
    - ILMT_PanHin for Punjabi-Hindi
    - ILMT_MalTam for Malayalam-Tamil
    - ILMT_KanHin for Kannada-Hindi

    The ILMT System is being developed by a consortium of academic & research institutions working in the field of Natural Language Processing (NLP) technology. All these institutions have been active for many years in technology development for NLP in general & Indian Language Machine Translation in particular. They have developed and accumulated almost all components/modules that can be adapted, modified, or enhanced (as the need may be) for the ILMT System being developed.

    The ILMT System being conceived will be a very large and complex system, and it would be inappropriate to design it afresh. The consortium has decided that the available versions of all modules will be taken as the initial versions, and that each module will then be engineered stepwise to make it a component of a maintainable ILMT software product.

    This SRS is being written in the light of the constraints described in the preceding paragraph. The primary purposes of writing this SRS are:

    - It will provide a baseline for design & development of the ILMT System w.r.t. functionality.
    - It will provide a baseline for validation & verification of the ILMT System.
    - It will help in the field deployment & maintenance of the ILMT System.
    - It will provide a basis for the transfer of technology & enhancement of the ILMT System (as the need arises in future).

    1.2 Scope

    The ILMT System will provide a web interface for translation. It will work on web pages or text material from books, magazines, newspapers, etc., written in standard language. The ILMT System is to be developed for general-purpose use and for two distinct user domains, i.e., tourism & health. So each of the nine products mentioned above will have three distinct packages, viz.,

    - ILMT System for General Purpose
    - ILMT System for Tourism
    - ILMT System for Health

    1.3 Definitions, acronyms, and abbreviations

    - NLP: Natural Language Processing
    - ILMT System: Indian Language to Indian Language Machine Translation System
    - CML: Corpora Markup Language
    - SSF: Shakti Standard Format
    - DSF: Dictionary Standard Format
    - API: Application Program Interface
    - nounp(CAT_): returns true if the value of CAT_ is a lexical category of type noun
    - {other_}s: more than one feature structure

    Other abbreviations:

    - The major list of abbreviations used in SSF to define the format is available in Appendix B.
    - The other major list of abbreviations, used to define the attributes and their values for POS and Morph, is available in the reference notes on standards for POS and Morph.

    1.4 References

    - IEEE 830-1998
    - SSF, Dr. Rajeev Sangal, IIIT Hyderabad
    - Dashboard, Dr. Rajeev Sangal, IIIT Hyderabad
    - Notes of SRS, Expert Software Consultants Ltd., New Delhi

    1.5 Overview

    This SRS is organized into three main sections: Introduction, Overall Description, and Specific Requirements. The SRS ends with a list of appendices to make it complete in itself. The Specific Requirements section describes the functional as well as non-functional requirements.

    2 Overall Description

    2.1 Product perspective

    As a product, the ILMT System is to be developed for three distinct usage scenarios, i.e., general purpose, tourism domain, and health domain. The product will be used by users on the web through a browser, so the system must be able to handle web content appropriately.

    2.2 Product Functions

    The ILMT System will be based on the analyze-transfer-generate paradigm. The input text is first preprocessed (collected, cleaned, and formatted). Then the analysis of the source language text is carried out. After source language analysis, the transfer of vocabulary and analyzed structure is carried out. Finally, the target language output is generated. The major product functions (or subfunctions) of the ILMT System will be:

    - Preprocessor
      - Collector
      - Cleaner, and
      - Formatter

    - Source Language Analyzer
      - Tokenizer
      - Morph analyzer
      - Sandhi splitter (optional)
      - POS tagger
      - Chunker
      - Pruning
      - Head Computation
      - Vibhakti Computation
      - Named Entity Recognizer
      - Simple parser

    - Source to Target language Transfer
      - Transfer Grammar
      - Lexical substitution
      - Transliteration

    - Target language generator
      - Agreement Feature
      - Intrachunk Agreement
      - Local Word (/Vibhakti) Splitter
      - Agreement Distribution
      - Default Features
      - Word generator
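The four top-level functions listed above can be pictured as stages applied in a fixed order to one shared document. The following is only a minimal sketch under stated assumptions: the function names, the dictionary keys, and the toy word-for-word lexicon are hypothetical placeholders, and none of the real sub-modules (morph analyzer, POS tagger, chunker, and so on) is modelled.

```python
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def preprocess(doc: Document) -> Document:
    # Stand-in for collect/clean/format: collapse runs of whitespace.
    doc["text"] = " ".join(doc["raw"].split())
    return doc


def analyze(doc: Document) -> Document:
    # Stand-in for source language analysis: whitespace tokenization only.
    doc["tokens"] = doc["text"].split()
    return doc


def transfer(doc: Document) -> Document:
    # Stand-in for lexical substitution: look each token up in a toy lexicon
    # and keep unknown tokens unchanged.
    lexicon = doc["lexicon"]
    doc["target_tokens"] = [lexicon.get(tok, tok) for tok in doc["tokens"]]
    return doc


def generate(doc: Document) -> Document:
    # Stand-in for target language generation: join the tokens.
    doc["output"] = " ".join(doc["target_tokens"])
    return doc


# Analyze-transfer-generate, preceded by preprocessing.
PIPELINE: List[Stage] = [preprocess, analyze, transfer, generate]


def translate(raw: str, lexicon: Dict[str, str]) -> str:
    doc: Document = {"raw": raw, "lexicon": lexicon}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc["output"]
```

Each stage reads what earlier stages left in the document and adds its own result, which is why the order of the list matters.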

    2.3 User Characteristics

    As a product, the ILMT System is to be developed for three distinct usage scenarios, i.e., general purpose, tourism domain, and health domain. The aim is to develop a translation system for which the following hold with respect to the users:

    - the user does not know the source language,
    - the user is a native speaker of the target language,
    - the translated output is comprehensible, i.e., the human user can make meaning out of the translated output,
    - there is no major ambiguity, and lastly
    - it is a usable system.

    Moreover, the ILMT System presumes that the user can read the target language script.

    2.4 Constraints: System Structure/Architecture

    The proposed ILMT System is being developed in a consortium mode (11 participating organizations). Over a period of time, the researchers have developed many Natural Language Processing (NLP) modules. These modules are all functional in a specific environment and in a limited scope. Since these NLP modules are complex, they cannot be rewritten or modified overnight. They have been developed using various programming languages (C, Perl, lex, Java, C++, Python, etc.) and different paradigms or formalisms. Due to the complexity of any NLP system and the heterogeneity of the available modules, it has been decided that the ILMT System will be developed using a Blackboard Architecture to provide interoperability between heterogeneous modules. Hence all the modules will operate on a common data representation called Shakti Standard Format (SSF), either in memory or as a stream. All the modules that are being developed (or re-engineered) will need to comply with (or adapt to) the specifications of the blackboard. With the blackboard architecture it will be easy to configure & set up the ILMT System with the heterogeneous modules.

    In view of the above constraints, a separate application (called Dashboard) is being developed in parallel at IIIT Hyderabad, which will provide a framework for setting up and configuring the ILMT System using the blackboard architecture.

    ILMT System validation can be done only against test suites; hence, for each of the three versions of each of the nine products, test suites have to be developed in advance. Further, each product is bidirectional, so there should be a distinct test suite for each direction of machine translation testing. As there are nine language pairs in all, each with three versions and two directions, we need to develop 54 test suites in all for this project (9 x 3 x 2 = 54). Following the same scheme as proposed for the products, the test suites will be named as

    - TS_HIN_TEL
    - TS_HIN_TAM
    - TS_HIN_PAN
    - TS_HIN_BEN
    - TS_HIN_MAR
    - TS_HIN_KAN
    - TS_TAM_MAL
    - TS_TAM_TEL
    - TS_HIN_URD
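The figure of 54 test suites stated above is the product of nine language pairs, three usage scenarios, and two translation directions. The sketch below enumerates that product. The pair codes are taken from the list above, but the scenario suffixes (GEN, TOU, HEA) and the way a direction is encoded in the name are assumptions made only for illustration; the SRS does not specify them.

```python
from itertools import product
from typing import List

# Pair codes as they appear in the test suite names listed above.
PAIRS = [
    "HIN_TEL", "HIN_TAM", "HIN_PAN", "HIN_BEN", "HIN_MAR",
    "HIN_KAN", "TAM_MAL", "TAM_TEL", "HIN_URD",
]

# Hypothetical suffixes for general purpose, tourism, and health.
SCENARIOS = ["GEN", "TOU", "HEA"]


def suite_names() -> List[str]:
    """Return one name per (pair, scenario, direction) combination."""
    names = []
    for pair, scenario in product(PAIRS, SCENARIOS):
        first, second = pair.split("_")
        # Assumed encoding: the source language code comes first.
        for src, tgt in ((first, second), (second, first)):
            names.append(f"TS_{src}_{tgt}_{scenario}")
    return names


if __name__ == "__main__":
    print(len(suite_names()))
```

Every generated name is distinct, so the length of the returned list equals the number of suites that would have to be prepared under this reading.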

    Each of the above nine suites will have six distinct versions, one for each combination of the three usage scenarios and the two translation directions.
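To illustrate the blackboard arrangement described earlier in this section, the sketch below lets independent modules read from and write to one shared store, with a small scheduler running whichever module already has its inputs available. This is an assumption-laden toy: a plain dictionary stands in for the blackboard, no real SSF structure is modelled, and the `Module` class, the `run_on_blackboard` function, and the two sample modules are hypothetical names, not part of the Dashboard or of any ILMT component.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Module:
    name: str
    requires: Tuple[str, ...]   # keys that must already be on the blackboard
    provides: str               # key this module writes
    run: Callable[[Dict[str, Any]], Any]


def run_on_blackboard(text: str, modules: List[Module]):
    """Run every module once, in whatever order their inputs allow."""
    board: Dict[str, Any] = {"text": text}
    order: List[str] = []
    pending = list(modules)
    while pending:
        ready = [m for m in pending if all(k in board for k in m.requires)]
        if not ready:
            raise RuntimeError("no module can run: inputs missing from the blackboard")
        module = ready[0]
        board[module.provides] = module.run(board)
        order.append(module.name)
        pending.remove(module)
    return board, order


# Two toy modules; real modules would be far richer.
tokenizer = Module("tokenizer", ("text",), "tokens", lambda b: b["text"].split())
tagger = Module("tagger", ("tokens",), "tags", lambda b: ["UNK"] * len(b["tokens"]))

if __name__ == "__main__":
    # Registration order does not matter; data dependencies decide.
    board, order = run_on_blackboard("w1 w2 w3", [tagger, tokenizer])
    print(order)
```

The design point is that modules never call each other directly. They only agree on the shared representation, which is what allows modules written in different languages to be combined.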

    2.5 Assumptions and dependencies

    The proposed ILMT System will be based on the analyze-transfer-generate paradigm. First, analysis of the source language text would be done, then a transfer of vocabulary and analyzed structure to the target language would be carried out, and finally the target language would be generated. Because Indian languages are similar and share grammatical structures, shallow parsing would be done. The transfer grammar component would be kept simple, requiring only a simple parser. Domain-specific aspects would be handled by building suitable named entity recognizers, suita
