Language Analysis Using Corpora and Statistics: What Is Collostructional Analysis? TwiFULL Kansai @ Graduate School of Humanities, Kobe University, 27 March 2013. Presenter: Yuzo Morishita (@pathos95606)


Page 1: TwiFULL 20130327

Language Analysis Using Corpora and Statistics: What Is Collostructional Analysis?

TwiFULL Kansai @ Graduate School of Humanities, Kobe University

27 March 2013. Presenter: Yuzo Morishita (@pathos95606)

Page 2: TwiFULL 20130327

Outline

1. Introduction

2. Linguistic analysis and quantitative research

3. Fisher's exact test

4. Collostructional Analysis

5. Bybee's (2010) critique

6. Gries's (2012) reply to Bybee (2010)

7. A concrete analysis using a large corpus

Page 3: TwiFULL 20130327

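About the presenter

 - Corpus-based quantitative and qualitative research on English grammar
 - A little statistics as well

I work on constructions such as the following:

(1) a. James came running up the stairs. (BNC-FRS)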

b. We went shopping in Brighton. (BNC-FB9)

c. He sent the clerk hurrying into the back room to get a dark grey suit. (BNC-CDN)

d. She ran after him, calling his name. (BNC-A0N)

e. Dan walked from the room, his head reeling. (BNC-FAB) 


Page 4: TwiFULL 20130327

Introduction

Page 5: TwiFULL 20130327

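Introduction

In cognitive linguistics...

 - Studies based on experiments and corpus frequencies are increasing, and so are studies that require statistical analysis
 - Influence of the usage-based model (e.g., Langacker 1990, 2000)

A statistical method that has recently become prominent in Japan as well: Collostructional Analysis (Stefanowitsch and Gries 2003)

 - The validity of the method has gone almost unquestioned (cf. Bybee 2010)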

Page 6: TwiFULL 20130327

Linguistic Analysis and Quantitative Research

Page 7: TwiFULL 20130327

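Linguistic Analysis and Quantitative Research: the development of corpus research

 - Diversification of corpus research
   - "Corpus-driven research" (Sinclair 1991)
   - "Corpus-based (hypothesis-testing) research" (Kennedy 1998)
 - The era of WaC (Web as Corpus)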

Page 8: TwiFULL 20130327

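Linguistic Analysis and Quantitative Research: a history of quantitative work on language

 - The beginnings of full-scale quantitative linguistic research (e.g., Chao 1950, a review of Zipf 1935; Hockett 1953, a review of Shannon and Weaver 1949)
 - Typological research (e.g., Haspelmath 2008)
 - Cognitive and functional research (e.g., Bybee 1985, Baayen 1993, Bybee and Hopper 2001)
 - Work that points out problems (e.g., Johnson 1999, Kilgarriff 2005)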

Page 9: TwiFULL 20130327

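Linguistic Analysis and Quantitative Research: the 2000s, when linguistic analysis and corpus research merge

 - Construction Grammar and quantitative research
   - Frames and quantitative research on the resultative construction (Boas 2001)
   - Statistical analysis of words and constructions in the ditransitive construction (Stefanowitsch and Gries 2003)
   - Research on Spanish constructions (Bybee and Eddington 2006)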

Page 10: TwiFULL 20130327

Linguistic Analysis and Quantitative Research: Construction Grammar

(e.g., Fillmore 1988, Fillmore et al. 1988, Goldberg 1995)

 - Does not assume a clear-cut difference in kind between words and constructions (cf. Grimshaw 1990: Argument Structure; Levin and Rappaport Hovav 1995: Lexical Conceptual Structure)

Page 11: TwiFULL 20130327

Fisher's Exact Test

Page 12: TwiFULL 20130327

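Fisher's Exact Test: the chi-square test

Table 1: Chi-square test (expected frequencies in parentheses)

            Incorrect     Correct        Σ
Group A     21 (28.4)     433 (425.6)    454
Group B     44 (36.6)     540 (547.4)    584
Σ           65            973            1,038

Computing the expected frequencies (row total × column total / grand total):

$$E_{A,\text{incorrect}} = \frac{454 \times 65}{1{,}038} \approx 28.4, \qquad E_{A,\text{correct}} = \frac{454 \times 973}{1{,}038} \approx 425.6$$

$$E_{B,\text{incorrect}} = \frac{584 \times 65}{1{,}038} \approx 36.6, \qquad E_{B,\text{correct}} = \frac{584 \times 973}{1{,}038} \approx 547.4$$

Computing the χ² value:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

χ² = 3.68; df = 1; p = 0.055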
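The arithmetic behind Table 1 can be retraced in a few lines of R. This is only a sketch: the variable names are chosen here for illustration, and the row and column totals are derived from the four cell counts rather than typed in.

```r
# Observed counts from Table 1 (rows: Group A, Group B; columns: incorrect, correct)
observed <- matrix(c(21, 433,
                     44, 540), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Group A", "Group B"),
                                   c("incorrect", "correct")))

# Expected count of each cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson's chi-square statistic and its p-value (df = 1 for a 2 x 2 table)
chi_sq  <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

round(expected, 1)
round(c(chi_sq = chi_sq, p = p_value), 3)

# The built-in test gives the same statistic once Yates' continuity
# correction, which R applies to 2 x 2 tables by default, is switched off.
chisq.test(observed, correct = FALSE)
```

Leaving `correct = TRUE` (the default) would give a somewhat larger p-value, so the setting matters when comparing against hand calculations like the one on the slide.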

Page 13: TwiFULL 20130327

Fisher's Exact Test: the chi-square test

For Table 1 (chi-square test): χ² = 3.68; df = 1; p = 0.055

[Figure: density curve of the chi-squared(1) distribution; x-axis: chi-sq (0 to 10), y-axis: density (0.0 to 2.5)]

Page 14: TwiFULL 20130327

Basics of Statistics: Fisher's exact test

Sir Ronald Aylmer Fisher (1890-1962)

[Figure: density curve of the chi-squared(1) distribution; x-axis: chi-sq, y-axis: density]

Rationale: a test that presupposes a particular distribution (as the chi-square test does) is inexact

Page 15: TwiFULL 20130327

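Basics of Statistics: Fisher's exact test

Sir Ronald Aylmer Fisher (1890-1962)

 - Consider every possible way the data could be arranged
 - Among all these arrangements, compute the probability of those that match the observed result
 - Because this means computing combinations, the amount of computation becomes enormous...

e.g. $_6C_2 = \frac{6!}{2!\,(6-2)!} = 15$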
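To make "consider every possible arrangement" concrete, the following R sketch enumerates all 2 × 2 tables that share the margins of Table 1 and sums the probabilities of those that are no more likely than the observed one. The variable names are illustrative, and the two-sided rule used here (sum every table whose probability does not exceed the observed one) is stated as an assumption about how the test is usually defined; the result can be checked against `fisher.test()`.

```r
# Table 1 from the slides: rows = Group A / Group B, columns = incorrect / correct
tab <- matrix(c(21, 433,
                44, 540), nrow = 2, byrow = TRUE)

row_a   <- sum(tab[1, ])   # size of Group A (454)
row_b   <- sum(tab[2, ])   # size of Group B (584)
col_inc <- sum(tab[, 1])   # total number of incorrect answers (65)

# With all margins fixed, the top-left cell determines the whole table.
# Its possible values and their hypergeometric probabilities:
support <- max(0, col_inc - row_b):min(col_inc, row_a)
probs   <- dhyper(support, m = row_a, n = row_b, k = col_inc)

# Two-sided p-value: add up every table that is no more probable than the
# observed one (the small tolerance guards against floating-point ties).
p_obs    <- dhyper(tab[1, 1], m = row_a, n = row_b, k = col_inc)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])

p_manual                   # compare with fisher.test(tab)$p.value
choose(6, 2)               # the 6C2 example from the slide
```

Here only 66 tables have to be enumerated; the cost the slide warns about comes from the binomial coefficients inside each probability, which `dhyper()` evaluates on the log scale internally.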

Page 16: TwiFULL 20130327

Basics of Statistics: Fisher's exact test

> fisher.test(matrix(c(21, 433, 44, 540), nrow = 2))

        Fisher's Exact Test for Count Data

data:  matrix(c(21, 433, 44, 540), nrow = 2)
p-value = 0.06992
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.330999 1.041104
sample estimates:
odds ratio
 0.5955003

> fisher.test(matrix(c(21, 433, 44, 540), nrow = 2))$p.value
[1] 0.06992

cf. The R Project for Statistical Computing: http://www.r-project.org/

Page 17: TwiFULL 20130327

Collostructional Approach

Page 18: TwiFULL 20130327

Collostructional Approach

Table 2: Collostructional Analysis

                 Construction c    Other constructions    Row totals
Verb v           w                 x                      w+x
Other verbs      y                 z                      y+z
Column totals    w+y               x+z                    w+x+y+z

Table 3: Observed frequencies of give and the ditransitive in the ICE-GB, with expected frequencies in parentheses (Gries 2012: 480)

                 ditransitive construction    ¬ditransitive construction    Row totals
give             461 (9)                      699 (1,151)                   1,160
¬give            574 (1,026)                  136,930 (136,478)             137,504
Column totals    1,035                        137,629                       138,664

A statistical method for measuring the strength of association between a construction and a word (Stefanowitsch and Gries 2003)

collostruction < collocation + construction
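As a rough illustration of how the four margins of such a table turn into an association score, here is a small R sketch. The helper `collostruction_strength()` and its argument names are hypothetical (they do not come from the slides or from any package), and it uses a one-sided hypergeometric tail on the log scale, which is an assumption made here so that extremely small p-values do not underflow.

```r
# Hypothetical helper (not from the slides): association between a verb and a
# construction, computed from the counts of a Table 2-style layout.
#   verb_in_cxn  : occurrences of verb v in construction c
#   verb_total   : all occurrences of verb v
#   cxn_total    : all occurrences of construction c
#   corpus_total : all constructions counted in the corpus
collostruction_strength <- function(verb_in_cxn, verb_total, cxn_total, corpus_total) {
  expected <- verb_total * cxn_total / corpus_total
  # log P(X >= observed) under the hypergeometric distribution, kept on the
  # log scale so that extremely small p-values do not underflow to 0
  log_p <- phyper(verb_in_cxn - 1, m = verb_total, n = corpus_total - verb_total,
                  k = cxn_total, lower.tail = FALSE, log.p = TRUE)
  list(expected  = expected,
       attracted = verb_in_cxn > expected,
       strength  = -log_p / log(10))   # -log10(p)
}

# Table 3: give and the ditransitive in the ICE-GB
give <- collostruction_strength(461, 1160, 1035, 138664)
give$expected    # the expected frequency shown rounded to 9 in Table 3
give$attracted   # TRUE: give occurs far more often than expected
give$strength    # a very large -log10(p)
```

Working with the logarithm is a design choice for readability of the scores: a verb that occurs 461 times against an expectation of about 9 has a p-value too small to be printed as an ordinary number.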

Page 19: TwiFULL 20130327

Collostructional Approach

Results of the analysis of the ditransitive construction and their significance

Table 4: Collexemes most strongly attracted to the ditransitive construction (Stefanowitsch and Gries 2003: 229). The collostruction strength reported here is a p-value, so smaller values indicate stronger attraction.

Collexeme (n)    Collostruction strength    Collexeme (n)    Collostruction strength
give (461)       0                          allocate (4)     2.91E-06
tell (128)       1.6E-127                   wish (9)         3.11E-06
send (64)        7.26E-68                   accord (3)       8.15E-06
offer (43)       3.31E-49                   pay (13)         2.34E-05
show (49)        2.23E-33                   hand (5)         3.01E-05
cost (20)        1.12E-22                   guarantee (4)    4.72E-05
teach (15)       4.32E-16                   buy (9)          6.35E-05
award (18)       1.36E-11                   assign (3)       2.61E-04
allow (18)       1.12E-10                   charge (4)       3.02E-04
lend (7)         2.85E-09                   cause (8)        5.56E-04
deny (8)         4.5E-09                    ask (12)         6.28E-04
owe (6)          2.67E-08                   afford (4)       1.08E-03
promise (7)      3.23E-08                   cook (3)         3.34E-03
earn (7)         2.13E-07                   spare (2)        3.5E-03
grant (5)        1.33E-06                   drop (3)         2.16E-02

Page 20: TwiFULL 20130327

Collostructional Approach

Goldberg (1995: 38)

A: Central Sense: Agent successfully causes recipient to receive patient
   e.g. "give", "pass", "throw", "toss", "bring", "take"...
   (2) John gave Mary a book.

B: Conditions of Satisfaction imply that agent causes recipient to receive patient
   e.g. "guarantee", "promise", "owe"...
   (3) Chris promised Pat a car.

C: Agent causes recipient not to receive patient
   e.g. "refuse", "deny"...
   (4) Mary denied her sister a cake.

D: ...

E: ...

F: ...


Page 21: TwiFULL 20130327

Collostructional Approach

Results of the analysis of the ditransitive construction and their significance: Table 4, Collexemes most strongly attracted to the ditransitive construction (Stefanowitsch and Gries 2003: 229)

Page 22: TwiFULL 20130327

Bybee's (2010) Critique

Page 23: TwiFULL 20130327

Bybee's (2010) Critique: the points at issue

 - Aren't raw frequencies sufficient?
 - Is the number in the bottom-right cell of the contingency table necessary? How is it to be quantified?

Table 3: Observed frequencies of give and the ditransitive in the ICE-GB

                 ditransitive construction    ¬ditransitive construction    Row totals
give             461 (9)                      699 (1,151)                   1,160
¬give            574 (1,026)                  136,930 (136,478)             137,504
Column totals    1,035                        137,629                       138,664

Page 24: TwiFULL 20130327

Bybee's (2010) Critique

 - Collostruction strength is computed on the basis of Bybee and Eddington (2006)
 - The results of introspection (acceptability judgments) are compared with those of the Collostructional Approach

Table 5: Adjectives with quedarse 'become' (Bybee 2010: 100)

                                 High             Collostructional    Frequency in     Corpus
                                 acceptability    strength            construction     frequency
High frequency
  dormido 'asleep'               42               79.34               28               161
  sorprendido 'surprised'        42               17.57               7                92
  quieto 'still/calm'            39               85.76               29               129
Low frequency, related
  perplejo 'perplexed'           40               2.62                1                20
  paralizado 'paralyzed'         35               2.49                1                1
  pasmado 'amazed'               30               2.72                1                16
Low frequency, unrelated
  desnutrido 'undernourished'    17               3.23                1                5
  orgullosísimo 'proud'          6                3.92                1                1

Page 25: TwiFULL 20130327

Gries's (2012) Reply to Bybee (2010)

Page 26: TwiFULL 20130327

Gries's (2012) Reply to Bybee (2010)

Figure 1: The comparison of a frequency- vs. an AM-based approach (Gries 2012: 502)

Page 27: TwiFULL 20130327

Verification through Analysis of a Large Corpus

Page 28: TwiFULL 20130327

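Verification through Analysis of a Large Corpus

The studies discussed by Gries (2012) and Bybee (2010) rely on small corpora
 - Stefanowitsch and Gries (2003): ICE-GB (1 million words)
 - Bybee (2010): a spoken corpus (1.1 million words) and a written corpus (1 million words)

Prospects for using large corpora (Gries and Stefanowitsch 2004: 235)

Problems with research that uses large corpora
 - Research may be restricted to phenomena that can be retrieved automatically
 - Effects on the statistical analysis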

Page 29: TwiFULL 20130327

Verification through Analysis of a Large Corpus

 - Uses the British National Corpus (BNC)
 - About 100 million words: roughly 100 times the size of ICE-GB and roughly 50 times the size of the corpora in Bybee (2010)
 - Note: BYU-BNC (http://corpus.byu.edu/bnc) looks similar but is not the same resource

The construction examined is the following:

(5) a. He sent the clerk hurrying into the back room to get a dark grey suit. (BNC-CDN)

b. A series of small explosions one morning brought Alec running out of the top of the steps. (BNC-B1X)

c. A day's yacht charter took us threading through the islands. (BNC-BPJ)

d. My boyfriend Fisher Stevens will have to drag me kicking and screaming out of the house. (BNC-CH2)

Page 30: TwiFULL 20130327

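Verification through Analysis of a Large Corpus: target and method

 - Collostruction strength between the finite verb and the non-finite (-ing) verb
 - Verbs of Sending and Carrying (Levin 1993: 132-137), e.g. "airmail", "FedEx", "pass", "send", "roll", "bring", "take", "drive"
 - All instances of the form NP V NP V-ing P(P)
 - Result: 303 instances
 - Verb frequencies: searched by lemma tag, e.g. hw="send" pos="VERB" (sent, sends, sending, etc.)
 - Total number of constructions: counted by sentence tag, e.g. <s n="777">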
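The two counts described above (lemma-tagged verb tokens and sentence tags) can be sketched with plain regular expressions in R. Everything in this example is an assumption made for illustration: the three lines of markup are invented in the style of the BNC XML edition rather than taken from the corpus, and `count_matches()` is a hypothetical helper, not the tool used for the study.

```r
# Toy fragment in the style of the BNC XML edition (made up for illustration;
# not real corpus data).
bnc_lines <- c(
  '<s n="1"><w c5="PNP" hw="he" pos="PRON">He </w><w c5="VVD" hw="send" pos="VERB">sent </w><w c5="PNP" hw="it" pos="PRON">it</w></s>',
  '<s n="2"><w c5="PNP" hw="she" pos="PRON">She </w><w c5="VVZ" hw="send" pos="VERB">sends </w><w c5="NN2" hw="letter" pos="SUBST">letters</w></s>',
  '<s n="3"><w c5="PNP" hw="they" pos="PRON">They </w><w c5="VVD" hw="bring" pos="VERB">brought </w><w c5="PNP" hw="it" pos="PRON">it</w></s>'
)

# Count non-overlapping matches of a regular expression across all lines
count_matches <- function(pattern, text) {
  sum(lengths(regmatches(text, gregexpr(pattern, text, perl = TRUE))))
}

# Verb frequency: every word form whose lemma is "send" and whose class is VERB
send_tokens <- count_matches('hw="send" pos="VERB"', bnc_lines)

# Total number of "constructions": one per sentence tag
sentences <- count_matches('<s n="[^"]*">', bnc_lines)

c(send = send_tokens, sentences = sentences)
```

Searching on the lemma attribute rather than on word forms is what lets a single query cover sent, sends, sending and so on.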

Page 31: TwiFULL 20130327

Verification through Analysis of a Large Corpus

Table 6: Collostructional Approach to the construction

                 The Construction    Other constructions    Row totals
crashing         38                  2,139                  2,177
Other verbs      265                 6,023,842              6,024,107
Column totals    303                 6,025,981              6,026,284

$$p = 3.469382 \times 10^{-83}, \qquad \log_{10}\left(3.469382 \times 10^{-83}\right) = -82.460$$
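The figures for Table 6 can be retraced with R's built-in `fisher.test()`. This is a sketch under the assumption that the p-value on the slide is the two-sided Fisher's exact test p-value and that the collostruction strength is its negative base-10 logarithm; the object names are illustrative.

```r
# Table 6: "crashing" in the construction vs. everything else in the BNC
tab6 <- matrix(c( 38,    2139,
                 265, 6023842), nrow = 2, byrow = TRUE,
               dimnames = list(c("crashing", "other verbs"),
                               c("the construction", "other constructions")))

p        <- fisher.test(tab6)$p.value   # two-sided Fisher's exact test
strength <- -log10(p)                   # larger values = stronger attraction

p
strength
```

Taking the negative logarithm turns a p-value with 83 leading decimal zeros into a score of about 82, which is easier to rank and to plot.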

Page 32: TwiFULL 20130327

Collostructional Approach

Table 7: Collexemes strongly and weakly attracted to the construction

Collexeme (n)                Collostruction strength    Collexeme (n)     Collostruction strength
crashing (38)                82.460                     banging (1)       1.195
scurrying (19)               52.117                     raining (1)       1.186
sprawling (13)               33.354                     smashing (1)      1.153
flying (21)                  27.672                     scattering (1)    1.131
tumbling (10)                20.423                     marching (1)      1.075
rushing (12)                 18.552                     floating (1)      1.046
kicking and screaming (5)    15.748                     sailing (1)       0.908
arcing (3)                   15.182                     sliding (1)       0.878
reeling (6)                  13.043                     sinking (1)       0.850
hurrying (8)                 12.184                     swinging (1)      0.828
screaming (8)                11.684                     pouring (1)       0.797
hurtling (5)                 11.593                     backing (1)       0.695
scuttling (5)                11.576                     escaping (1)      0.634
spinning (5)                 7.429                      shooting (1)      0.506
billowing (5)                6.959                      falling (1)       0.000

Page 33: TwiFULL 20130327

Verification through Analysis of a Large Corpus

Figure 2: Correlation between Collostruction Strength and Raw Frequency

[Figure: scatter plot; x-axis: Raw_Frequency (0 to 30), y-axis: Collostruction_Strength (0 to 80)]
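The relationship shown in Figure 2 can be approximated from the values printed in Table 7. Note the limitation: this sketch uses only the 30 collexemes listed there (the 15 strongest and 15 weakest), not the full data set behind the figure, so the coefficients it produces are illustrative rather than a reproduction of the plot.

```r
# Raw frequencies and collostruction strengths as printed in Table 7
raw_frequency <- c(38, 19, 13, 21, 10, 12, 5, 3, 6, 8, 8, 5, 5, 5, 5,
                   rep(1, 15))
collostruction_strength <- c(82.460, 52.117, 33.354, 27.672, 20.423,
                             18.552, 15.748, 15.182, 13.043, 12.184,
                             11.684, 11.593, 11.576,  7.429,  6.959,
                              1.195,  1.186,  1.153,  1.131,  1.075,
                              1.046,  0.908,  0.878,  0.850,  0.828,
                              0.797,  0.695,  0.634,  0.506,  0.000)

# Linear (Pearson) and rank-based (Spearman) correlation
r_pearson  <- cor(raw_frequency, collostruction_strength)
r_spearman <- cor(raw_frequency, collostruction_strength, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)

# A quick look at the same relationship as in Figure 2
plot(raw_frequency, collostruction_strength,
     xlab = "Raw_Frequency", ylab = "Collostruction_Strength")
```

Pairs such as arcing (3 tokens, 15.182) and spinning (5 tokens, 7.429) show where the two measures come apart: a rarer collexeme can outrank a more frequent one.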

Page 34: TwiFULL 20130327

REFERENCES

Baayen, Harald. 1993. On frequency, transparency, and productivity. In G. E. Booij and J. van Marle (eds.) Yearbook of morphology. 181-208. Dordrecht: Kluwer Academic.

Boas, Hans C. 2001. A constructional approach to resultatives. Stanford, Calif.: CSLI Publications.

Bybee, Joan L. 1985. Morphology: A study of the relation between meaning and form. Amsterdam; Philadelphia: John Benjamins.

---. 2010. Language, usage and cognition. Cambridge; New York: Cambridge University Press.

--- and David Eddington. 2006. A usage-based approach to Spanish verbs of 'becoming'. Language 82:2. 323-355.

--- and Paul Hopper. (eds.) 2001. Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins Publishing Company.

Chao, Yuen Ren. 1950. Review of Human behavior and the principle of least effort: An introduction to human ecology by George Kingsley Zipf. Language 26:3. 394-401.

Page 35: TwiFULL 20130327

REFERENCES

Fillmore, Charles J. 1988. The mechanisms of "Construction Grammar". Proceedings of the Fourteenth Annual Meeting of the Berkeley Linguistics Society. 35-55.

---, Paul Kay, and Mary Catherine O'Connor. 1988. Regularity and idiomaticity in grammatical constructions: The case of let alone. Language 64:3. 501-538.

Goldberg, Adele E. 1995. Constructions: A construction grammar approach to argument structure. Chicago: University of Chicago Press.

Gries, Stefan Th. 2012. Frequencies, probabilities, and association measures in usage-/exemplar-based linguistics: Some necessary clarifications. Studies in Language 36:3. 477-510.

Gries, Stefan Th., Beate Hampe, and Doris Schönefeld. 2005. Converging evidence: Bringing together experimental and corpus data on the association of verbs and constructions. Cognitive Linguistics 16:4. 635-676.

---. 2010. Converging evidence II: More on the association of verbs and constructions. In Sally Rice and John Newman (eds.) Empirical and Experimental Methods in Cognitive/Functional Research. Stanford, Calif.: CSLI Publications, Center for the Study of Language and Information.

Page 36: TwiFULL 20130327

REFERENCES

Grimshaw, Jane. 1990. Argument structure. Cambridge, Mass.: MIT Press.

Haspelmath, Martin. 2008. Frequency vs. iconicity in explaining grammatical asymmetries. Cognitive Linguistics 19:1. 1-33.

Hockett, Charles F. 1953. Review of The mathematical theory of communication by Claude E. Shannon and Warren Weaver. Language 29:1. 69-93.

Johnson, Douglas H. 1999. The insignificance of statistical significance testing. The Journal of Wildlife Management 63:3. 763-772.

Kennedy, Graeme. 1998. An introduction to corpus linguistics. London; New York: Longman.

Kilgarriff, Adam. 2005. Language is never, ever, ever, random. Corpus Linguistics and Linguistic Theory 1:2. 263-276.

Langacker, Ronald. 1990. A usage-based model. In Ronald Langacker. Concept, image and symbol: The cognitive basis of grammar. Berlin: Mouton de Gruyter. 261-288.

---. 2000. A dynamic usage-based model. In Michael Barlow and Suzanne Kemmer (eds.) Usage-based models of language. Stanford, Calif.: CSLI Publications, Center for the Study of Language and Information. 1-63.

Page 37: TwiFULL 20130327

REFERENCES

Levin, Beth and Malka Rappaport Hovav. 1995. Unaccusativity: At the syntax-lexical semantics interface. Cambridge, Mass.: MIT Press.

Shannon, Claude E. and Warren Weaver. 1949. The mathematical theory of communication. Urbana: University of Illinois Press.

Sinclair, John. 1991. Corpus, concordance, collocation. Oxford; Tokyo: Oxford University Press.

Stefanowitsch, Anatol and Stefan Th. Gries. 2003. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8:2. 209-243.

Zipf, George Kingsley. 1935[1949]. Human behavior and the principle of least effort: An introduction to human ecology. Cambridge, Mass.: Addison-Wesley Press.