Globalisation & Computer systems

Week 7 Text processes and globalisation

Part 1:
  Sorting strings: collation
  Searching strings and regular expressions
  Practical: regular expressions in UNIX

Text processes

Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language”

Text processes operate over text elements

Text processes

Text elements
  The objects of a text
  Depends on perspective
  Different text processes operate over different objects

Sorting

Sorting (collation): “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)

Sorting

Language-specific sort order
phonetically based sort
graphically based sort
sort element

Sorting

Levels of comparison
  Level 1 (primary difference)
  Levels 2 and 3 (similar)
  Level 4 (exact match)

Sorting

Levels of comparison
  Level 4: exact match
    match in code value
    character equivalence
    resumes : resumes

Sorting

Levels of comparison
  Level 1 (primary difference: alphabetic)

Sorting

Levels of comparison
  Level 1 (primary difference)
    resume < resumes
  Level 2 (similar: no accent < accent)
    resume < résumé
    resumes < résumés
  Level 3 (similar: lower case < upper case)
    résumé < Résumé
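
The levels above can be sketched as a function that reports the first level at which two strings differ. This is only an illustration: the helper names `strip_accents` and `first_difference_level` are hypothetical, and treating "accent" as "any Unicode combining mark" is an assumption, not something the slides specify.

```python
import unicodedata

def strip_accents(text):
    # Decompose each character (NFD) and drop the combining accent marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def first_difference_level(a, b):
    """Return the level (1, 2 or 3) at which a and b first differ, or 4 for an exact match."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if strip_accents(a).lower() != strip_accents(b).lower():
        return 1  # different base letters (alphabetic difference)
    if a.lower() != b.lower():
        return 2  # same letters, different accents
    if a != b:
        return 3  # same letters and accents, different case
    return 4      # identical strings

pairs = [("resume", "resumes"), ("resume", "résumé"),
         ("résumé", "Résumé"), ("resumes", "resumes")]
for a, b in pairs:
    print(a, b, first_difference_level(a, b))
```

The four pairs are the ones used on the slides and should report levels 1, 2, 3 and 4 in that order.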

Sorting

Forward and backward sequence sort
  Forward sequence: start comparison from beginning of string
  Backward sequence: start comparison from end of string
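
A possible illustration of the difference, assuming the backward sequence is applied to the accent level only (as in traditional French dictionary ordering). The slides do not give an example, so the word list and the helper names `split_accents` and `sort_key` are assumptions made for this sketch.

```python
import unicodedata

def split_accents(word):
    """Return (base letters, per-letter accent flags) for a word."""
    base, flags = [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        base.append(decomposed[0])                    # the unaccented letter
        flags.append(1 if len(decomposed) > 1 else 0)  # 1 if an accent was present
    return "".join(base), tuple(flags)

def sort_key(word, backward=False):
    base, flags = split_accents(word)
    if backward:
        flags = flags[::-1]  # compare accents starting from the end of the string
    return (base, flags)

words = ["côté", "coté", "côte", "cote"]
forward_order = sorted(words, key=sort_key)
backward_order = sorted(words, key=lambda w: sort_key(w, backward=True))
print(forward_order)   # ['cote', 'coté', 'côte', 'côté']
print(backward_order)  # ['cote', 'côte', 'coté', 'côté']
```

All four words share the same base letters, so only the direction in which the accents are compared decides the order.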

Sorting

Implementation: sort keys
  assign a set of weights to each character in the string
  compare substrings according to weighting
  switch weightings on / off
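
A minimal sketch of the sort-key idea, under simplifying assumptions: each character gets three weights (base letter, accent present or not, upper case or not), and the hypothetical `levels` parameter switches weightings on or off. Real collation tables are far richer than this.

```python
import unicodedata

def sort_key(text, levels=(1, 2, 3)):
    """Build a sort key holding one tuple of weights per enabled level."""
    letters, accents, cases = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        letters.append(decomposed[0].lower())            # level 1: base letter
        accents.append(1 if len(decomposed) > 1 else 0)  # level 2: no accent < accent
        cases.append(1 if ch.isupper() else 0)           # level 3: lower case < upper case
    weights = {1: tuple(letters), 2: tuple(accents), 3: tuple(cases)}
    return tuple(weights[level] for level in levels)

words = ["Résumé", "resumes", "résumé", "resume"]
ordered = sorted(words, key=sort_key)
print(ordered)  # ['resume', 'résumé', 'Résumé', 'resumes']

# With only level 1 switched on, accent and case differences are ignored.
print(sort_key("resume", levels=(1,)) == sort_key("Résumé", levels=(1,)))  # True
```

Because the level 1 weights come first in the key, an alphabetic difference always outranks accent and case differences, which is why "Résumé" sorts before "resumes".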

Searching

Text elements
  The objects of a text
  Depends on perspective
  Different text processes operate over different objects

Regular Expressions
  Basis of all web-based and word-processor-based searches
  Definition 1. An algebraic notation for describing a string
  Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

Regular Expressions
  regular expression, text corpus
  regular expression algebra has variants: Perl, Unix tools
  Unix tools: egrep, sed, awk

Regular Expressions
  Find occurrences of /Nokia/ in the text
  egrep -n 'Nokia' nokia_corpus.txt

Regular Expressions
egrep -n 'Nokia' nokia_corpus.txt

1:.Nokia shares slide after warning
4:HELSINKI (Reuters) - Nokia has cut its sales growth forecast for
7:markets sharply down.Nokia warned group sales would grow only
13:better than expected first-quarter profits from Nokia,
15:Finland's Nokia and rivals have been hit by debt-laden telecoms
19:Nokia said in a statement. "The speed of this transition has been
20:slower than was anticipated earlier this year." Nokia saw its market
26:"The problem with Nokia is that it looks like its going ex-growth,"
29:with a raft of new functions, was hurting. "Nokia had been perceived
36:Nokia cast another shadow over the sector by slashing its forecast for
41:be sold this year. "Nokia now believes that general weakness in all key
43:Nokia said. The market was caught by surprise, especially as Nokia had
46:said Nokia had been "a bit optimistic overall" in its forecasts. "We
49:adjust to weaker demand, Nokia followed the path of rivals in announcing
51:thousands of jobs in the group last year. Despite the bleak outlook, Nokia
57:Nokia also warned second quarter sales would grow only between two and
61:operating efficiencies, strong brand and leading product portfolio," Nokia
62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
67:protecting the margins -- but Nokia has to be a top-line growth story as well,
69:analyst Susan Anthony.But Nokia, known for its strength in forecasting the
79:Nokia's own forecast. Nokia's January-March net sales came in worse than the
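
For readers without the corpus file, the behaviour of `egrep -n` can be approximated in Python. This is a sketch, not a reimplementation of egrep: `grep_n` is a hypothetical helper, and `sample` is a three-line stand-in adapted from the output above, not the real nokia_corpus.txt.

```python
import re

def grep_n(pattern, lines):
    """Mimic `egrep -n`: return 'number:line' for every line containing a match."""
    regex = re.compile(pattern)
    return [f"{number}:{line}"
            for number, line in enumerate(lines, start=1)
            if regex.search(line)]

# Stand-in corpus (assumption): three short lines, numbered from 1.
sample = [
    "Nokia shares slide after warning",
    "weak demand, sending its shares 12 percent lower and European",
    "Nokia said in a statement.",
]

hits = grep_n("Nokia", sample)
for hit in hits:
    print(hit)
```

As with egrep, a line is reported once however many times the pattern occurs in it, and the number before the colon is the line's position in the input.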

Regular Expressions
  set operator
  egrep -n '[Nn]okia' nokia_corpus.txt

Regular Expressions
  optional operator
  egrep -n 'shares?' nokia_corpus.txt

Regular Expressions

egrep -n 'shares?' nokia_corpus.txt

1:.Nokia shares slide after warning
6:weak demand, sending its shares 12 percent lower and European
62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
85:lion share of the company's sales and earnings, saw sales fall seven percent
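
The set and optional operators can be tried out on single words. The sketch below uses Python's `re` module, whose syntax for these two operators is the same as egrep's; the word list is invented for illustration.

```python
import re

words = ["Nokia", "nokia", "NOKIA", "share", "shares", "shared"]

# fullmatch requires the whole word to match the pattern.
set_matches = [w for w in words if re.fullmatch(r"[Nn]okia", w)]
optional_matches = [w for w in words if re.fullmatch(r"shares?", w)]
print(set_matches)       # ['Nokia', 'nokia']
print(optional_matches)  # ['share', 'shares']

# egrep looks for the pattern anywhere in a line, like re.search,
# so a line containing "shared" is also selected by 'shares?'.
print(bool(re.search(r"shares?", "shared")))  # True
```

The last line shows why egrep output can contain words the pattern was not written for: the pattern only has to match some part of the line.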

Regular Expressions
  Kleene operators:
    /string*/ “zero or more occurrences of the previous character”
    /string+/ “one or more occurrences of the previous character”
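
A short sketch of the two Kleene operators using Python's `re` module, which treats `*` and `+` the same way as egrep. The pattern and word list are invented for illustration; in both patterns the operator applies only to the `u` that precedes it.

```python
import re

candidates = ["color", "colour", "colouur"]

# u* : zero or more occurrences of "u"
star_matches = [w for w in candidates if re.fullmatch(r"colou*r", w)]
# u+ : one or more occurrences of "u"
plus_matches = [w for w in candidates if re.fullmatch(r"colou+r", w)]

print(star_matches)  # ['color', 'colour', 'colouur']
print(plus_matches)  # ['colour', 'colouur']
```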

Regular Expressions
  Wildcard operator:
    /string./ “any character after the previous character”
  Combine wildcard and Kleene:
    /string.*/ “zero or more instances of any character after the previous character”
    /string.+/ “one or more instances of any character after the previous character”

Regular Expressions

egrep -n 'profit.*' nokia_corpus.txt

13:better than expected first-quarter profits from Nokia,
52:remains the only profitable handset maker among the "big three" suppliers
60:company's profitability outlook remains strong, driven by increasing
81:Pre-tax profit was 1.31 billion euros.The company's struggling networks unit
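
egrep prints whole lines, so it does not show how much text the wildcard actually consumed. The sketch below uses Python's `re.search` to display the matched span; the two input strings are shortened from the corpus lines and are otherwise assumptions.

```python
import re

line = "remains the only profitable handset maker"
short = "Pre-tax profit"

# .* is greedy: it runs from "profit" to the end of the line.
greedy_span = re.search(r"profit.*", line).group()
print(greedy_span)  # profitable handset maker

# .* may match nothing at all, so "profit" at the end of a line still matches.
empty_tail = re.search(r"profit.*", short).group()
print(empty_tail)   # profit

# . and .+ both need at least one character after "profit".
print(re.search(r"profit.", short))   # None
print(re.search(r"profit.+", short))  # None
```

Since `.*` can match the empty string, 'profit.*' selects exactly the same lines as 'profit' on its own; the difference only shows in the matched span.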

Regular Expressions
  Anchors
    Beginning of line operator: ^
      egrep '^said' nokia_corpus.txt
    End of line operator: $
      egrep 'said$' nokia_corpus.txt
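
A small check of the two anchors, using Python's `re` module on three invented lines (each string stands for one line of a file). Note where each anchor sits: `^` goes before the pattern and `$` goes after it.

```python
import re

lines = [
    "said Nokia had been optimistic",
    "Nokia said",
    "Nokia said it expected growth",
]

starts_with_said = [l for l in lines if re.search(r"^said", l)]
ends_with_said = [l for l in lines if re.search(r"said$", l)]
# With $ placed in front, nothing can follow the end of the line, so nothing matches.
misplaced_anchor = [l for l in lines if re.search(r"$said", l)]

print(starts_with_said)  # ['said Nokia had been optimistic']
print(ends_with_said)    # ['Nokia said']
print(misplaced_anchor)  # []
```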

Regular Expressions
  Disjunction:
    set operator: /[Ss]tring/ “a string which begins with either S or s”
    range: /[A-Z]tring/ “a string beginning with a capital letter”
    pipe |: /string1|string2/ “either string 1 or string 2”

Regular Expressions
  Disjunction
    egrep -n 'weak|warning|drop' nokia_corpus.txt
    egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
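
The three kinds of disjunction can be compared side by side. This sketch uses Python's `re` module, which shares this syntax with egrep; the candidate words and the sentence are invented for illustration.

```python
import re

candidates = ["String", "string", "Xtring"]

set_matches = [w for w in candidates if re.fullmatch(r"[Ss]tring", w)]
range_matches = [w for w in candidates if re.fullmatch(r"[A-Z]tring", w)]
print(set_matches)    # ['String', 'string']
print(range_matches)  # ['String', 'Xtring']

# The pipe chooses between whole alternatives, not single characters.
text = "Nokia shares slide after warning of weak demand and weaker sales"
pipe_matches = re.findall(r"weak|warning|drop", text)
print(pipe_matches)   # ['warning', 'weak', 'weak']
```

The second "weak" comes from "weaker": as with egrep, the pattern matches inside longer words unless it is anchored or delimited.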

Regular Expressions

Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
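
A quick check of the negated set, again with Python's `re` module and invented candidates. The negated set still has to consume one character, so a bare "tring" does not match.

```python
import re

candidates = ["String", "string", "9tring", "tring"]

negated_matches = [w for w in candidates if re.fullmatch(r"[^a-z]tring", w)]
print(negated_matches)  # ['String', '9tring']
```

Note that `[^a-z]` means "any single character except a lowercase letter", so digits and punctuation qualify as well as capitals.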

Regular Expressions
  Precedence
    1. Parentheses
    2. Kleene and optional operators: * + ?
    3. Anchors and sequences
    4. Disjunction operator: |
  (a) /supply|iers/ matches /supply/ or /iers/
  (b) /suppl(y|iers)/ matches /supply/ or /suppliers/
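
The effect of precedence in (a) and (b) can be verified directly. The sketch uses Python's `re` module, which follows the same precedence rules for these operators; the candidate list is the three strings discussed above.

```python
import re

candidates = ["supply", "iers", "suppliers"]

# (a) Sequence binds tighter than |, so the alternatives are "supply" and "iers".
without_parens = [w for w in candidates if re.fullmatch(r"supply|iers", w)]
# (b) Parentheses limit the disjunction to the ending: "y" or "iers".
with_parens = [w for w in candidates if re.fullmatch(r"suppl(y|iers)", w)]

print(without_parens)  # ['supply', 'iers']
print(with_parens)     # ['supply', 'suppliers']
```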