Globalisation & Computer systems
Week 7 Text processes and globalisation
part 1:
Sorting strings: collation
Searching strings and regular expressions
Practical: regular expressions in UNIX
Text processes
Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language”
Text processes operate over text elements
Text processes
Text elements
The objects of a text
Depends on perspective
Different text processes operate over different objects
Sorting
Sorting (collation): “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)
Sorting
Levels of comparison
Level 1 (primary difference)
Levels 2 and 3 (similar)
Level 4 (exact match)
Sorting
Levels of comparison
Level 4: exact match
match in code value
character equivalence
resumes : resumes
Sorting
Levels of comparison
Level 1 (primary difference)
resume < resumes
Level 2 (similar: no accent < accent)
resume < résumé
resumes < résumés
Level 3 (similar: lower case < upper case)
résumé < Résumé
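The levels above can be sketched as a three-part comparison key. This is a minimal illustration, not a full collation algorithm: it assumes that "accent" means any Unicode combining mark and that comparing the key parts in order (base letters, then accents, then case) is an acceptable stand-in for levels 1 to 3. The function name `level_keys` is hypothetical.

```python
import unicodedata

def level_keys(word):
    """Return (primary, secondary, tertiary) comparison keys for a word."""
    decomposed = unicodedata.normalize("NFD", word)
    base = [c for c in decomposed if not unicodedata.combining(c)]
    # Level 1: base letters only, ignoring accents and case.
    primary = "".join(base).lower()
    # Level 2: one flag per base letter, 1 if a combining mark follows it.
    secondary = []
    for c in decomposed:
        if unicodedata.combining(c):
            if secondary:
                secondary[-1] = 1
        else:
            secondary.append(0)
    # Level 3: one flag per base letter, 0 for lower case, 1 for upper case.
    tertiary = [1 if c.isupper() else 0 for c in base]
    return (primary, tuple(secondary), tuple(tertiary))

words = ["Résumé", "résumés", "resume", "résumé", "resumes"]
ordered = sorted(words, key=level_keys)
print(ordered)
```

Because tuples compare part by part, an accent or case difference only matters when the base letters are identical, which is the behaviour the slide describes.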
Sorting
Forward and backward sequence sort
Forward sequence
Start comparison from beginning of string
Backward sequence
Start comparison from end of string
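One place where the direction matters is accent comparison: traditional French ordering compares accents starting from the end of the word. The sketch below is an illustration under that assumption. The French words are not from the slides, the helper names are hypothetical, and every accent is treated as the same weight.

```python
import unicodedata

def base_letters(word):
    """Lower-cased letters with combining marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def accent_flags(word):
    """One flag per base letter: 1 if it carries an accent, else 0."""
    flags = []
    for c in unicodedata.normalize("NFD", word):
        if unicodedata.combining(c):
            if flags:
                flags[-1] = 1
        else:
            flags.append(0)
    return flags

def forward_key(word):
    # Accent flags are compared from the beginning of the string.
    return (base_letters(word), accent_flags(word))

def backward_key(word):
    # Accent flags are compared from the end of the string.
    return (base_letters(word), accent_flags(word)[::-1])

words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=forward_key)
backward = sorted(words, key=backward_key)
print(forward)
print(backward)
```

The two orders differ only in the middle pair: the forward sort places "coté" before "côte", the backward sort places "côte" before "coté".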
Sorting
Implementation
Sort keys
assign set of weights to each character in the string
compare substrings according to weighting
switch weightings on / off
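A rough sketch of the sort-key idea follows. The weights are invented for illustration (the code point of the base letter, an accent flag, a case flag) and are not taken from any real collation table. The `levels` parameter is a hypothetical way of switching weightings on and off.

```python
import unicodedata

def char_weights(ch):
    """Return illustrative (primary, secondary, tertiary) weights for one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    primary = ord(base.lower())                   # which letter it is
    secondary = 1 if len(decomposed) > 1 else 0   # accented or not
    tertiary = 1 if base.isupper() else 0         # upper or lower case
    return (primary, secondary, tertiary)

def sort_key(text, levels=(True, True, True)):
    """Build a sort key, keeping only the weight levels that are switched on."""
    weights = [char_weights(ch) for ch in unicodedata.normalize("NFC", text)]
    return tuple(
        tuple(w[level] for w in weights)
        for level, enabled in enumerate(levels)
        if enabled
    )

words = ["Résumé", "résumé", "resume"]
all_levels = sorted(words, key=sort_key)

# With only the primary weights switched on, accents and case are ignored.
primary_only_equal = (
    sort_key("résumé", levels=(True, False, False))
    == sort_key("Resume", levels=(True, False, False))
)
print(all_levels, primary_only_equal)
```

Switching a level off simply drops that row of weights from the key, so two strings that differ only at that level compare as equal.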
Searching
Text elements
The objects of a text
Depends on perspective
Different text processes operate over different objects
Regular Expressions
Basis of all web-based and word-processor-based searches
Definition 1. An algebraic notation for describing a string
Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)
Regular Expressions
regular expression, text corpus
regular expression algebra has variants: Perl, Unix tools
Unix tools: egrep, sed, awk
Regular Expressions
egrep -n 'Nokia' nokia_corpus.txt
1:.Nokia shares slide after warning
4:HELSINKI (Reuters) - Nokia has cut its sales growth forecast for
7:markets sharply down.Nokia warned group sales would grow only
13:better than expected first-quarter profits from Nokia,
15:Finland's Nokia and rivals have been hit by debt-laden telecoms
19:Nokia said in a statement. "The speed of this transition has been
20:slower than was anticipated earlier this year." Nokia saw its market
26:"The problem with Nokia is that it looks like its going ex-growth,"
29:with a raft of new functions, was hurting. "Nokia had been perceived
36:Nokia cast another shadow over the sector by slashing its forecast for
41:be sold this year. "Nokia now believes that general weakness in all key
43:Nokia said. The market was caught by surprise, especially as Nokia had
46:said Nokia had been "a bit optimistic overall" in its forecasts. "We
49:adjust to weaker demand, Nokia followed the path of rivals in announcing
51:thousands of jobs in the group last year. Despite the bleak outlook, Nokia
57:Nokia also warned second quarter sales would grow only between two and
61:operating efficiencies, strong brand and leading product portfolio," Nokia
62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
67:protecting the margins -- but Nokia has to be a top-line growth story as well,
69:analyst Susan Anthony.But Nokia, known for its strength in forecasting the
79:Nokia's own forecast. Nokia's January-March net sales came in worse than the
Regular Expressions
egrep -n 'shares?' nokia_corpus.txt
1:.Nokia shares slide after warning
6:weak demand, sending its shares 12 percent lower and European
62:said. Nokia said it expected pro forma earnings per share (EPS) of 0.18-0.20
85:lion share of the company's sales and earnings, saw sales fall seven percent
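For readers without the corpus to hand, the behaviour of `egrep -n` can be approximated in Python. This is a sketch only: Python's `re` module stands in for egrep, the helper `grep_n` is hypothetical, and the four sample lines are a stand-in corpus, so the line numbers differ from those on the slide.

```python
import re

def grep_n(pattern, lines):
    """Mimic `egrep -n`: return 'number:line' for each line containing a match."""
    regex = re.compile(pattern)
    return [
        f"{number}:{line}"
        for number, line in enumerate(lines, start=1)
        if regex.search(line)
    ]

# A four-line stand-in corpus built from lines shown on the slides.
sample = [
    ".Nokia shares slide after warning",
    "HELSINKI (Reuters) - Nokia has cut its sales growth forecast for",
    "weak demand, sending its shares 12 percent lower and European",
    "lion share of the company's sales and earnings, saw sales fall seven percent",
]

# The optional operator ? makes the final "s" optional: "share" or "shares".
hits = grep_n(r"shares?", sample)
for hit in hits:
    print(hit)
```

The second sample line is skipped because "sales" contains neither "share" nor "shares".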
Regular Expressions
Kleene operators:
/string*/ “zero or more occurrences of previous character”
/string+/ “1 or more occurrences of previous character”
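A short check of the two Kleene operators, again using Python's `re` module as a stand-in for the Unix tools. The candidate strings are made up for illustration.

```python
import re

star = re.compile(r"string*")   # "strin" followed by zero or more "g"
plus = re.compile(r"string+")   # "strin" followed by one or more "g"

candidates = ["strin", "string", "stringgg"]
star_matches = [c for c in candidates if star.fullmatch(c)]
plus_matches = [c for c in candidates if plus.fullmatch(c)]

print(star_matches)
print(plus_matches)
```

Note that the operator applies only to the character immediately before it (the final "g"), not to the whole word.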
Regular Expressions
Wildcard operator:
/string./ “any character after the previous character”
Combine wildcard and Kleene:
/string.*/ “zero or more instances of any character after the previous character”
/string.+/ “one or more instances of any character after the previous character”
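The wildcard and its combinations with the Kleene operators can be tried out as follows. This is an illustrative sketch using Python's `re` module; the test strings are assumptions, apart from the "profitable" line, which is taken from the corpus output on the slides.

```python
import re

# /string./ needs exactly one more character after "string".
one_more = bool(re.fullmatch(r"string.", "strings"))
none_more = bool(re.fullmatch(r"string.", "string"))

# /string.*/ accepts nothing at all after "string"; /string.+/ does not.
star_empty = bool(re.fullmatch(r"string.*", "string"))
plus_empty = bool(re.fullmatch(r"string.+", "string"))

# .* is greedy: it runs to the end of the line.
line = "remains the only profitable handset maker"
matched = re.search(r"profit.*", line).group()

print(one_more, none_more, star_empty, plus_empty)
print(matched)
```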
Regular Expressions
egrep -n 'profit.*' nokia_corpus.txt
13:better than expected first-quarter profits from Nokia,
52:remains the only profitable handset maker among the "big three" suppliers
60:company's profitability outlook remains strong, driven by increasing
81:Pre-tax profit was 1.31 billion euros.The company's struggling networks unit
Regular Expressions
Anchors
Beginning of line operator: ^
egrep '^said' nokia_corpus.txt
End of line operator: $
egrep 'said$' nokia_corpus.txt
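The anchors can be checked on a few lines with Python's `re` module, used here as a stand-in for egrep. The first two lines are adapted from the corpus output above and the third is a hypothetical line added so that something ends in "said". The sketch also checks a pattern with `$` placed first, which cannot match because nothing can follow the end of a line.

```python
import re

lines = [
    'said Nokia had been "a bit optimistic overall" in its forecasts. "We',
    "Nokia said in a statement.",
    "as the company said",  # hypothetical line, not from the corpus
]

starts_with_said = [l for l in lines if re.search(r"^said", l)]
ends_with_said = [l for l in lines if re.search(r"said$", l)]
dollar_first = [l for l in lines if re.search(r"$said", l)]

print(starts_with_said)
print(ends_with_said)
print(dollar_first)
```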
Regular Expressions
Disjunction:
set operator
/[Ss]tring/ “a string which begins with either S or s”
Range
/[A-Z]tring/ “a string beginning with a capital letter”
pipe |
/string1|string2/ “either string 1 or string 2”
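The three kinds of disjunction can be compared side by side. This is a small sketch with Python's `re` module standing in for the Unix tools. The word lists are invented for illustration.

```python
import re

words = ["String", "string", "ztring", "Ztring"]

# Set: the first character must be "S" or "s".
set_hits = [w for w in words if re.fullmatch(r"[Ss]tring", w)]

# Range: the first character must be any capital letter A to Z.
range_hits = [w for w in words if re.fullmatch(r"[A-Z]tring", w)]

# Pipe: either whole alternative may match.
options = ["string1", "string2", "string3"]
pipe_hits = [w for w in options if re.fullmatch(r"string1|string2", w)]

print(set_hits)
print(range_hits)
print(pipe_hits)
```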
Regular Expressions Disjunction
egrep -n 'weak|warning|drop' nokia_corpus.txt
egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
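The difference between the two commands can be seen on a handful of corpus lines. This is a sketch using Python's `re` module in place of egrep, and the line numbers refer to the four-line sample rather than to the full corpus. It suggests that the second pattern selects more lines because it uses the stem "warn" instead of "warning". For line selection the trailing `.*` adds nothing.

```python
import re

lines = [
    ".Nokia shares slide after warning",
    "markets sharply down.Nokia warned group sales would grow only",
    "adjust to weaker demand, Nokia followed the path of rivals in announcing",
    "Pre-tax profit was 1.31 billion euros.The company's struggling networks unit",
]

def matching_line_numbers(pattern):
    """Return the 1-based numbers of the sample lines that contain a match."""
    return [n for n, l in enumerate(lines, start=1) if re.search(pattern, l)]

exact = matching_line_numbers(r"weak|warning|drop")
stems = matching_line_numbers(r"weak.*|warn.*|drop.*")
stems_without_star = matching_line_numbers(r"weak|warn|drop")

print(exact)
print(stems)
```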
Regular Expressions
Precedence
1. Parentheses
2. Kleene and optional operators * + ?
3. Anchors and sequences
4. Disjunction operator |
(a) /supply | iers/
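Because disjunction has the lowest precedence, the pattern in (a) reads as "supply" or "iers", not as "supply" or "suppliers". The sketch below contrasts it with a parenthesised version. The grouped pattern is an illustration and does not appear on the slides, and the spaces around the pipe are left out so that they are not treated as literal characters.

```python
import re

ungrouped = re.compile(r"supply|iers")    # "supply" OR "iers"
grouped = re.compile(r"suppl(y|iers)")    # "suppl" then "y" OR "iers"

words = ["supply", "iers", "suppliers"]
ungrouped_hits = [w for w in words if ungrouped.fullmatch(w)]
grouped_hits = [w for w in words if grouped.fullmatch(w)]

print(ungrouped_hits)
print(grouped_hits)
```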