Linguistics 101: The Conceptual Base of Natural Language Processing (NLP) Zina Saadi Computational Linguist Middle Eastern Languages Specialist


Page 1: Linguistics 101

Linguistics 101: The Conceptual Base of Natural Language Processing (NLP)

Zina Saadi

Computational Linguist

Middle Eastern Languages Specialist

Page 2: Linguistics 101

What is Linguistics?

Linguistics is the scientific study of human languages.

The goal of linguists is to explore the complexity of languages by creating language models that describe their minimal units and their connections.

Linguists often search for universal rules of a language.

Page 3: Linguistics 101

Presentation Outline

Overview of Linguistics:

Phonology

Morphology

Semantics

Syntax

Summary: A taste of Basis Technology Linguistic Solutions

[mostly Modern Standard Arabic (MSA), the linguistic analysis at the word level]

Page 4: Linguistics 101

Linguistics: The Big Picture

Articulation: Phones, Phonemes, Prosody

Linguistic Signs: Morphemes, Affixes, Stems

Word Meaning: Roots, Lemmas, Homonyms

Word Order: Part of Speech

Fields: Phonology, Morphology, Syntax, Semantics

Structures: Arbitrary Structure, Lexical Structure, Grammatical Structure

Page 5: Linguistics 101

Why is Phonology Important?

For better communication: each human language is composed of a series of sounds with functional or distinctive characteristics that can sometimes affect the meaning of the sentence.

Emphatic /T/ sound vs. plain /t/:

تاب قلب محمد (The heart of Mohamed repented)

طاب قلب محمد (The heart of Mohamed became kind)

Voiceless /s/ vs. voiced /z/:

سال الماء (The water leaked)

زال الماء (The water disappeared)
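A minimal sketch of how such meaning-changing contrasts can be spotted in text. It assumes that undiacritized spelling stands in for pronunciation (one letter per sound), which is a simplification; the function name `is_minimal_pair` is illustrative and not taken from any library.

```python
def is_minimal_pair(a: str, b: str) -> bool:
    """Return True when two words have the same length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    # Count the positions where the two words disagree.
    return sum(x != y for x, y in zip(a, b)) == 1


# Word pairs taken from the slide examples.
for first, second in [("تاب", "طاب"), ("سال", "زال"), ("تاب", "زال")]:
    print(first, second, is_minimal_pair(first, second))
```

The first two pairs differ only in their initial consonant, so they count as minimal pairs; the third differs in two positions and does not.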

Page 6: Linguistics 101

What is Phonology?

The study of the sound system of a given language, including the analysis and classification of its phones and phonemes.

Page 7: Linguistics 101

Phonetics and Phonology

• What is a phone? A sound that can be individually produced and recognized by speakers of a language.

b: bilabial, voiced, stop, oral

f: labiodental, voiceless, fricative, oral
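The feature descriptions of [b] and [f] can be written as small records and compared feature by feature. This is only a sketch: the `Phone` class, its field names, and `differing_features` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phone:
    symbol: str
    place: str     # where the sound is articulated
    voicing: str   # voiced or voiceless
    manner: str    # how the airflow is obstructed
    airflow: str   # oral or nasal


B = Phone("b", "bilabial", "voiced", "stop", "oral")
F = Phone("f", "labiodental", "voiceless", "fricative", "oral")


def differing_features(p: Phone, q: Phone) -> list[str]:
    """List the feature names on which two phones disagree."""
    names = ("place", "voicing", "manner", "airflow")
    return sorted(n for n in names if getattr(p, n) != getattr(q, n))


print(differing_features(B, F))
```

With the values from the slide, [b] and [f] share only the oral airflow and differ in place, voicing, and manner.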

Page 8: Linguistics 101

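Phonetics and Phonology

Phones: What about Arabic?

b: bilabial, voiced + voiceless, stop, oral

f: labiodental, voiceless + voiced, fricative, oral

(Modern Standard Arabic has no separate /p/ or /v/ phoneme, so voicing does not distinguish meaning for these two sounds.)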

Page 9: Linguistics 101

Phonetics and Phonology

• What is a phoneme? The smallest unit of sound that distinguishes meaning. A phoneme may correspond to a specific phone, or a single phoneme may correspond to any one of a set of allophones.

Example: Emphatic letters distinction

Group 1 (plain): تاب taaba (repented), سار saara (moved), ذل dhalla (humiliated)

Group 2 (emphatic): طاب Taaba (was good), صار Saara (became), ظل Zalla (remained)

NLP Uses: Text-To-Speech (TTS) applications
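The plain and emphatic groups above pair up letter by letter, which can be sketched as a lookup table. The mapping covers only the three pairs shown on the slide, and the names `EMPHATIC_OF` and `to_emphatic` are hypothetical.

```python
# Plain consonant -> emphatic counterpart (only the pairs from the slide).
EMPHATIC_OF = {"ت": "ط", "س": "ص", "ذ": "ظ"}


def to_emphatic(word: str) -> str:
    """Replace each plain consonant that has an emphatic counterpart."""
    return "".join(EMPHATIC_OF.get(ch, ch) for ch in word)


for plain in ["تاب", "سار", "ذل"]:
    print(plain, "->", to_emphatic(plain))
```

A TTS front end has to keep these letters apart because each substitution yields a different word, as the two groups show.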

Page 10: Linguistics 101

Phonetics and Phonology

What are allophones? Sound variations that do not alter the meaning of words.

Example: the Arabic letter ج

[dʒ] as in jet

• Modern Standard Arabic

• Most Arab countries

[g] as in game

• Egypt

• Some regions in Morocco

Page 11: Linguistics 101

Phonetics and Phonology

Prosody

Duration of sound

كتب (kataba) => to write

كاتب (kaataba) => to correspond with

Stress (emphasis of a syllable)

Present (stress on the first syllable) => gift or opposite of absent

Present (stress on the second syllable) => to offer

NLP Uses: to correct pronunciation in TTS

Page 12: Linguistics 101

Phonetics and Phonology

Pronunciation Rules

S (Finnish): Lisa [li:za] or Lisa [li:sa]

/s/ → [voiced] unpredictably

/s/ and /z/ are in free variation

S (French): souvent /s/, maison /z/

/s/ → [voiced] / V_V

/s/ and /z/ are different phonemes
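The rule /s/ → [voiced] / V_V (an s becomes voiced between two vowels) can be sketched as a regular-expression rewrite. This is a toy that works on spelling rather than on a real phonemic transcription; the vowel set and the function name `voice_intervocalic_s` are assumptions.

```python
import re

VOWELS = "aeiouy"  # assumed vowel letters for this toy example


def voice_intervocalic_s(word: str) -> str:
    """Rewrite s as z when it sits directly between two vowel letters."""
    pattern = rf"(?<=[{VOWELS}])s(?=[{VOWELS}])"
    return re.sub(pattern, "z", word)


for word in ["maison", "souvent"]:
    print(word, "->", voice_intervocalic_s(word))
```

The lookbehind and lookahead check the context without consuming it, so only the s itself is rewritten: `maison` changes while the word-initial s of `souvent` stays voiceless.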

Page 13: Linguistics 101

What is Morphology?

Morphology is the study of the systematic changes that occur at the word level to create a relationship between a given word and other words.

Page 14: Linguistics 101

The Linguistic Sign: Form and Meaning

عصفور (Arabic), bird (English), oiseau (French), Vogel (German)

• Invariant form (pronunciation)

• Invariant meaning

Page 15: Linguistics 101

Why is the Linguistic Sign Arbitrary?

The linguistic sign is arbitrary because there is no direct connection between the invariant form and the meaning.

There are no transformational rules that connect the phonological level of a word to its morphological level for a specific meaning, whereas there are syntactic rules that bind words into sentences.

Example with the letters ل (L) and م (M):

لم lam (didn't)

لم lima (why)

مل malla (was bored)

لم lamma (collected)

Compare: لن lan (won't)

Page 16: Linguistics 101

Morphology: Linguistic Signs


Example: He writes various books

6 linguistic signs [he, write, -s, various, book, -s]

The first -s marks the third-person singular present tense

The second –s signifies plurality

The linguistic sign is the minimal meaningful unit that contributes to the meaning of a word
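Counting the signs in "He writes various books" can be sketched with a naive suffix splitter. The stem list `STEMS` is a hand-made assumption covering only this sentence; a real analyzer would rely on a full lexicon and many more affixes.

```python
# Hypothetical mini-lexicon of stems for the example sentence.
STEMS = {"he", "write", "various", "book"}


def segment(sentence: str) -> list[str]:
    """Split each token into stem + "-s" when the stripped form is a known stem."""
    signs = []
    for token in sentence.lower().split():
        if token not in STEMS and token.endswith("s") and token[:-1] in STEMS:
            signs.extend([token[:-1], "-s"])
        else:
            signs.append(token)
    return signs


print(segment("He writes various books"))
```

The lexicon check is what keeps `various` in one piece: it ends in s, but that s is not a separate sign.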

Page 17: Linguistics 101

Morphemes and Words

A word can have:

One morpheme: (فوق, above)

Many linear morphemes: (الكتابان, the two books)

Many non-linear morphemes: (طيور, birds), (مكاتب, desks)

Are morphemes always attached to words?

Page 18: Linguistics 101

Morphology: Detecting Morphemes

Morphemes are not always linear:

MSA: لا تخرج

Algerian Dialect: ما تخرجش

French: Ne sors pas

English: do not leave

Discontinuous morpheme: the negation wraps around the verb in Algerian Arabic (ما ... ش) and in French (ne ... pas).

Page 19: Linguistics 101

Morphology: Various Morpheme Forms

Allomorph example:

مدد madada (extended): مددت with أنا (I), مددتما with أنتما (you masc. dual), مددتم with أنتم (you masc. plural)

مد madda (extended): مد with هو (he masc.), مدا with هما (they masc. dual), مدوا with هم (they masc. plural)

Page 20: Linguistics 101

Morphology and Languages

Analytic Relationship (each word is a morpheme)

Example: Chinese & Japanese

Agglutinative Relationship (words with linear morphemes)

Example: Turkish and Korean

Morphological Relationship (words with non-linear morphemes)

Example: Arabic and Hebrew

Page 21: Linguistics 101

Morphology and Languages

Agglutinative vs. Morphological languages

French: maison, ma maison, maisons, mes maisons

English: house, my house, houses, my houses

Turkish: ev, evim, evler, evlerim

Arabic: منزل, منزلي, منازل, منازلي

Page 22: Linguistics 101

Why is Arabic Morphological?

Arabic words are derived from a root adhering to a list of patterns.

Root: د ر س

درس (lesson)

مدرسة (school)

دارس (student)

مدرس (teacher)

تدريس (teaching)
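Root-and-pattern derivation can be sketched by treating the letters ف ع ل in a pattern as slots for the three root consonants, following the traditional naming of Arabic patterns. The function `apply_pattern` is a hypothetical helper; it ignores diacritics, and it cannot express a pattern in which one of these three letters is meant literally.

```python
PLACEHOLDERS = "فعل"  # slot letters standing for root consonants 1, 2, 3


def apply_pattern(root: str, pattern: str) -> str:
    """Fill the three slot letters of a pattern with the consonants of a root."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    slots = dict(zip(PLACEHOLDERS, root))
    # Map character by character so a root letter is never substituted twice.
    return "".join(slots.get(ch, ch) for ch in pattern)


for pattern in ["فعل", "مفعلة", "فاعل", "تفعيل"]:
    print(pattern, "->", apply_pattern("درس", pattern))
```

The root consonants keep their order but are interleaved with pattern material, which is why these morphemes are called non-linear.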

Page 23: Linguistics 101

Why is Arabic Morphological?

Arabic words with the same pattern and different roots share part of their meaning (Semantics => word classes).

Pattern: مفعلة (mif'alah)

مطرقة mitraqah (hammer)

مكنسة miknasah (broom)

مسطرة mistarah (ruler)

ملعقة mil'aqah (spoon)

مسبحة misbahah (rosary)

All these words refer to instruments (إسم آلة).
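Going the other way, a word can be matched against the مفعلة pattern to recover its root. This is a surface-level sketch on undiacritized text; `extract_root` is a hypothetical helper, with ف ع ل again standing for the three root slots.

```python
from typing import Optional


def extract_root(word: str, pattern: str = "مفعلة") -> Optional[str]:
    """Return the root consonants if the word fits the pattern, else None."""
    if len(word) != len(pattern):
        return None
    root = []
    for w, p in zip(word, pattern):
        if p in "فعل":
            root.append(w)   # slot position: collect the root consonant
        elif w != p:
            return None      # fixed pattern letter does not match
    return "".join(root)


for word in ["مطرقة", "مكنسة", "مسطرة", "ملعقة", "مسبحة"]:
    print(word, "->", extract_root(word))
```

Because short vowels are not written, this check over-matches: مدرسة (school) has the same written shape although it is a noun of place, so a real analyzer needs vocalization or a lexicon to decide the word class.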

Page 24: Linguistics 101

Lexical Structure in Arabic

Root: خرج (to exit, to leave)

Lemmas as verbs:

خرّج (to remove)

أخرج (to take out, to emit, to extricate)

إستخرج (to extract, to derive, to conclude)

تخرّج (to graduate)

Lemmas as nouns, with their stems:

مخرج (an exit), مخارج (exits)

تخرّج (graduation)

خريج (graduate), خريجون (graduates, masc. pl.), خريجات (graduates, fem. pl.)

Page 25: Linguistics 101

What is Semantics?

The study of meaning

Word-to-word relationships

Word Relationships

Homonyms: same pronunciation, same spelling, different meaning. Example: bow /boʊ/ (the rod used to play a violin) vs. bow /boʊ/ (a tied ribbon). NLP use: Translation.

Heteronyms: different pronunciation, same spelling, different meaning. Example: bow /baʊ/ (to bend) vs. bow /boʊ/ (a tied ribbon). NLP use: TTS.

Homophones: same pronunciation, different spelling, different meaning. Examples: (lessen, lesson), (maid, made). NLP use: Spell Correction.

Synonyms: different pronunciation, different spelling, same meaning. Example: (to begin, to start). NLP use: Information Retrieval.
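The four word relationships reduce to three yes/no questions, which can be sketched as a lookup. The function name `classify_relation` and the fallback label are illustrative assumptions.

```python
def classify_relation(same_pronunciation: bool, same_spelling: bool, same_meaning: bool) -> str:
    """Name the relationship between two words from three comparisons."""
    table = {
        (True, True, False): "homonyms",
        (False, True, False): "heteronyms",
        (True, False, False): "homophones",
        (False, False, True): "synonyms",
    }
    key = (same_pronunciation, same_spelling, same_meaning)
    return table.get(key, "unclassified")


# lessen / lesson: same sound, different spelling, different meaning
print(classify_relation(True, False, False))
```

Combinations the table does not list (for example, identical words) fall through to the fallback label instead of being forced into a category.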

Page 26: Linguistics 101

What is Syntax?

The study of the principles and rules for constructing sentences in natural languages.

Word order

English: SVO (subject-verb-object)

Persian: SOV (subject-object-verb)

Arabic (most often): VSO (verb-subject-object)

NLP Uses: Spell Checking
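The three word orders can be sketched as templates that arrange the same subject, verb, and object. English words are used as placeholders for all three languages, so this shows constituent order only and is not a translation; `WORD_ORDER` and `linearize` are hypothetical names.

```python
# Dominant constituent order per language, as listed on the slide.
WORD_ORDER = {"English": "SVO", "Persian": "SOV", "Arabic": "VSO"}


def linearize(language: str, subject: str, verb: str, obj: str) -> list[str]:
    """Arrange subject, verb, and object in the language's dominant order."""
    roles = {"S": subject, "V": verb, "O": obj}
    return [roles[role] for role in WORD_ORDER[language]]


for language in WORD_ORDER:
    print(language, linearize(language, "Taha Hussein", "wrote", "stories"))
```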

Page 27: Linguistics 101

Summary: A Taste of Basis Technology Linguistic Solutions

Page 28: Linguistics 101

Linguistics: The Big Picture

Articulation: Rosette Name Translation (RNT), Rosette Name Indexer (RNI)

Linguistic Signs: Rosette Base Linguistics (RBL), Rosette Entity Extractor (REX)

Word Meaning: Rosette Base Linguistics (RBL), Rosette Entity Extractor (REX)

Word Order: Rosette Base Linguistics (RBL), Rosette Entity Extractor (REX)

Fields: Phonology, Morphology, Syntax, Semantics

Structures: Arbitrary Structure, Lexical Structure, Grammatical Structure

Page 29: Linguistics 101

Basis Technology

Defines arbitrariness

Ayad Allawi

إياد علاوي

Eyad Allawi

اياد علاوى

Page 30: Linguistics 101

Basis Technology

Enriches lexicology

كانت حفلة تخرج جامعة هارفرد في العام الحالي من أهم الحفلات.

حيث كان عدد الخريجـون و الخريجات يتجاوز ال 10 آلاف خريج.

The graduation ceremony at Harvard University this year was among the most important ceremonies, where the number of female and male graduates exceeded ten thousand.

Affixes (Morphology)

حفلة

الحفلات

الخريجـون

والخريجـات

Pronunciation & Prosody (Phonology)

The final letter in حفلة is silent (instead of /t/); the medial alef in والخريجات has a longer duration than the initial alef in the same word.

Page 31: Linguistics 101

Basis Technology

Triage grammatical complexity

كتب طه حسين قصص مؤثرة طبعت على هيئة كتب.

Taha Hussein wrote impressive stories published in the form of books.

Homonyms (Semantics): same spelling and pronunciation but different meaning

Word Order (Syntax): Arabic VSO

Part-of-Speech (Syntax): the first كتب in the sentence is a verb; the last كتب is a noun

Manner of articulation (Phonology): كتب (kataba) is different from قطب (qaTaba), even though many English speakers can't differentiate between them

Page 32: Linguistics 101

Thank you

[email protected]