38

[email protected]. char 不可再分 (在当前讨论范围内) 比如: A,b “,”, “@” “ ” 汉字 大多可以视觉识别 也包括 whitespace ,如空格,换行,

Embed Size (px)

Citation preview

char不可再分

(在当前讨论范围内)比如:

A,b“,”, “@”“ ”汉字

大多可以视觉识别也包括 whitespace ,如空格,换行, Tab

Set<char>有穷的字符集成为字母表( alphabet )

一般非空比如:英文字母,数字,中文和标点

String把字符按顺序连起来,称为 string.

一般是有限长度可以是 1 个,两个,…,或者 0 个( ε, “” ,空字符

串)比如:

“abcdaaab”

在 ES 中,字符串用单引号或双引号括起来。

Algebra of string+ 字符串连接

空字符串 + 其它字符串 = 其它字符串 + 空串 = 其它字符串

Alphabet vs string集合 列表

无序不重复势

有序可重复长度

Set<string>字符串的集合

一般为无穷 长度不受限制

也可以是有穷的 比如:空集 , {“a”,”ab”}

Algebra of Set<string>|

相当于集合的并集,结果仍是 Set<string>Ф=The Union of Zero Set<string>

{“a”,”bc”,”e”} | {“a”,”1”}={“a”,”bc”,”e”,”1”}

• Many use ∪, +, or ∨ for alternation

类似于 Cartesian Product 笛卡尔乘积 , 结果仍是Set<string>比如:{“a”,”bc”,”e”} {“a”,”1”}=

{“aa”,”a1”,”bca”,”bc1”,”ea”,”e1”}

乘方{“a”,”1”} 2 ={“a”,”1”} {“a”,”1”}

={“aa”,”a1”,”1a”,”11”}{“a”,”1”}3={“a”,”1”} {“a”,”1”} {“a”,”1”}

={“aaa”,”aa1”,”a1a”,”a11”,”1aa”,”1a1”,”11a”,”111”}

定义{”a”,”1”}1={“a”,”1”}

这样 {“a”,”1”} 2= {”a”,”1”}1+1= {”a”,”1”} {”a”,”1”}{”a”,”1”}0={ε}

这样, {”a”,”1”}1= {”a”,”1”}0+1={ε} {”a”,”1”}

乘和或 复合S{m,n} =Sm | Sm+1 | Sm+2… | Sn

S{m,} =Sm | Sm+1 | Sm+2… S ? =S0 |SS+ = S1 | S2 | S3…

S* =S0 | S1 | S2 | S3…* is called Kleene Star

Priority of ops* highestConcatenationalternation. parentheses may be omitted. For example,

(ab)c can be written as abc, and a|(b(c*)) can be written as a|bc*.

Regular ExpressionSome set<string> is called regular

expression, or RE, RegExp, RegexThe following are RegExp

{“a”} is regular expression, for any char in alphabet

RS is RegExp, if R and S are both RegexR* is RegExp, if R is Regex

So {ε} is RegExpIf a Set<string> cannot be represented by

above process, it’s not RegExp

Note Ф is often included in RegExp

See Standard

Empty Empty allowed[]()|

Assertion^$\b

World boundary Not _, [0-9], [A-z]

\BNot \b

(?=expression)(?!expression)

quantifier?+*{m,}{m,n}

The following will be lazy if appended by another ?

capture()(?: expression)

Atom Escape

\c followed by lower or upper letter \a =a

For a is not designated special meaningsSo are some other letters

\u002F \0

Character Class[][abd]={“a”,”b”,”c”}[a-c] = {“a”,”b”,”c”}

[-ca] where – is literal[ac-] where – is literal

[^a-c] = alphabet / [a-c]

Escape Class[\b]={“backspace”}[\]] = {“]”}[\B] error[\1] error

\1 will be the captured group

. Any char except newline\d digit\D not digit\w word char\W not word char\s whitespace\S not whitespace

Back reference\1\0

<NUL>\1000000000

Error if no such many matches.

/ … /gimWhere g for globali for case insensitivem for multiline

Note:// will be taken as comments

Use /(?:)/

RegExp.RegExp is a function

Can construct Regular Expressions RegExp(pattern, flags) new RegExp(pattern, flags)

RegExp.prototype

RegExp.prototype.constructorexec

Return matches, an array Ordered by the appearance of ( There is one implicit () around the whole pattern

testReturn bool

toStringReturn string

Members of RegExp instancesourceglobalignoreCasemultilinelastIndex integer

{ [[Writable]]: true, [[Enumerable]]: false, [[Configurable]]: false }.

Thank You!

The End

<ZWNJ> and <ZWJ> are format-control characters that are used to make necessary distinctions when forming words or phrases in certain languages.

The Unicode format-control characters (i.e., the characters in category ―Cf‖ in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this (such as mark-up languages).

All format control characters may be used within comments, and within string literals and regular expression literals.

In ECMAScript source text, <ZWNJ> and <ZWJ> may also be used in an identifier after the first character.

<BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files. <BOM> characters are treated as white space characters (see 7.2).

The special treatment of certain format-control characters outside of comments, string literals, and regular expression literals is summarised in Table 1.

Table 1 — Format-Control Character Usage Code

Unit Value Name Formal Name Usage

\u200C Zero width non-joiner <ZWNJ> IdentifierPart

\u200D Zero width joiner <ZWJ> IdentifierPart

\uFEFF Byte Order Mark <BOM> Whitespace