Text processing
Team 3
김승곤 박지웅 엄태건 최지헌 임유빈
Outline
1. Introduction
2. Text processing
3. Index techniques in database
4. Index techniques in wireless network
5. Apache Lucene
6. Apache Solr
7. Demo
5/18
5/20
What’s Text Processing?
● Mechanism for the manipulation of text
o Language processing
o Data structure
o Visualization
o Human factor
● Converting text to index terms
Necessity of Index
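Criteria for measuring database performance
1) Number of SQL statements that access the table
2) Business category
3) Real access patterns (RAP), i.e. the access patterns left after removing duplicates
4) Classification by table type
5) SQL performance information
6) Number of indexes per table
Example of index usage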
Text processing steps
1. Lexical Analysis
2. Elimination of stopwords
3. Stemming
4. Selection of index terms
5. Building a thesaurus
Lexical Analysis
● Converting byte stream to tokens
o Numbers and digits
  “Deaths from car accidents in 1989” -> {Deaths, car, accidents, from, 1989}
o Hyphens
  “Work-out”
  Index as phrase, allow partial match
  Proximity information
o Punctuation
  “BS”, “B.S.”, “M.S.”, URLs
o Lexer
  By hand: easy and fast, but not flexible
  DFA (deterministic finite automaton) generator: uses a state machine
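The tokenization choices above can be sketched in a few lines of Python. This is only an illustration: the function name `tokenize`, the `split_hyphens` flag, and the decision to lowercase everything are assumptions made for the example, not part of the slides.

```python
import re

def tokenize(text, split_hyphens=False):
    """Turn raw text into lowercase tokens (hypothetical helper).

    Digits are kept as tokens. Hyphenated words are kept whole unless
    split_hyphens is True. All other punctuation separates tokens.
    """
    if split_hyphens:
        pattern = r"[a-z0-9]+"
    else:
        # A run of letters/digits, optionally joined by single hyphens.
        pattern = r"[a-z0-9]+(?:-[a-z0-9]+)*"
    return re.findall(pattern, text.lower())

print(tokenize("Deaths from car accidents in 1989"))
print(tokenize("Work-out"))
print(tokenize("Work-out", split_hyphens=True))
print(tokenize("B.S."))
```

Note that this simple rule breaks “B.S.” into two one-letter tokens, which is exactly why punctuation needs a deliberate policy.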
Elimination of Stopwords
● High frequency, but not useless
o For example,
  the, of, and, a, in, to, is, for, with, are
o Removing stopwords
  “to be or not to be” -> {be}
o Statistical approach & Lookup table
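A lookup-table filter is a minimal way to sketch stopword removal. The names `STOPWORDS`, `EXTENDED`, and `remove_stopwords` are made up for this example. The ten-word list is the one on the slide; the slide's {be} result implies a longer list that also contains “or” and “not”, which is assumed here as `EXTENDED`.

```python
# The ten example stopwords from the slide.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "with", "are"}

# Assumed longer list, needed to reproduce the slide's {be} result.
EXTENDED = STOPWORDS | {"or", "not"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the lookup table."""
    return [t for t in tokens if t not in stopwords]

tokens = "to be or not to be".split()
print(remove_stopwords(tokens))                 # ['be', 'or', 'not', 'be']
print(set(remove_stopwords(tokens, EXTENDED)))  # {'be'}
```

The second call shows why stopwords are "not useless": an aggressive list reduces the whole phrase to a single term.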
Stemming
● Reduce variant word forms to a single “stem”
o Words
o Four approaches
  Table lookup - use a dictionary
  Successor variety - fancy suffix removal
  Affix removal - cut prefixes and suffixes (-’s, -ing, -ed, -s, in-, ad-, pre-, sub-)
  Character N-grams
Porter’s algorithm
● Removes suffixes in five stages
o Each rule depends on a suffix and the stem measure m
o Porter errors: organization/organ, doing/doe, past/paste, european/europe, resolve/resolution
Rule                Result
SSES -> SS          caresses -> caress
IES -> I            ponies -> poni, ties -> ti
SS -> SS            caress -> caress
S -> ∅              cats -> cat
(m>0) EED -> EE     feed -> feed
(*v*) ED -> ∅       plastered -> plaster
(*v*) ING -> ∅      motoring -> motor
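The rules in the table can be sketched as follows. This covers only the rules listed above (Porter's step 1a and the first part of step 1b) and leaves out the clean-up rules and the remaining stages, so it is an illustration rather than a full stemmer. All function names are chosen for this example.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant only at the start or after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m = number of vowel-to-consonant transitions (VC sequences) in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def contains_vowel(stem):
    """The (*v*) condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def step1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]          # IES -> I
    if word.endswith("ss"):
        return word               # SS -> SS
    if word.endswith("s"):
        return word[:-1]          # S -> (nothing)
    return word

def step1b(word):
    if word.endswith("eed"):
        # (m>0) EED -> EE; the longest matching suffix wins, so stop here.
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word

def stem_step1(word):
    return step1b(step1a(word.lower()))

for w in ["caresses", "ponies", "ties", "caress", "cats", "feed", "plastered", "motoring"]:
    print(w, "->", stem_step1(w))
```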
Indexing Implementation in Text Database
Index Data Structure
● Different types of index data structures for querying large text collections
o Signature File
o Inverted File
Index Data Structure: Signature File
● Each signature is F bits wide
● Each term descriptor has m bits set to 1 and the rest set to 0
● Superimpose (bitwise OR) the term descriptors of a document to obtain the document descriptor
Index Data Structure: Signature File
● Probabilistic indexing method
● Querying
o Form the query descriptor for the term
o Fetch the documents whose descriptor is a superset of the query descriptor (bitwise comparison)
o Wrong results are possible (false drops)
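Superimposed coding and the superset test can be sketched as below. The values F = 64 and m = 3, the use of SHA-256 to pick bit positions, and all function names are assumptions made for the example; the slides do not fix any of them.

```python
import hashlib

F = 64  # signature width in bits (assumed value)
M = 3   # bits set per term descriptor (assumed value)

def term_descriptor(term, f=F, m=M):
    """Set up to m of the f bits, chosen by hashing the term."""
    sig = 0
    for k in range(m):
        digest = hashlib.sha256(f"{k}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
    return sig

def document_descriptor(terms):
    """Superimpose (bitwise OR) all term descriptors of one document."""
    sig = 0
    for term in terms:
        sig |= term_descriptor(term)
    return sig

def may_contain(doc_sig, term):
    """True if every bit of the query descriptor is set in the document.

    A True answer can be a false drop; a False answer is always correct.
    """
    query = term_descriptor(term)
    return doc_sig & query == query

doc = document_descriptor(["to", "be", "or", "not"])
print(may_contain(doc, "not"))    # True: the term is in the document
print(may_contain(doc, "thing"))  # usually False, but a false drop is possible
```

Because a match only means "possibly present", candidate documents still have to be checked against the actual text.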
Index Data Structure: Inverted File
● Stores a mapping from content to its location
● Structure
o A directory of terms
o Posting lists of document IDs
Index Data Structure: Inverted File
● Query: not
o Comparing the query string against every document is slow
o Solution: inverted index
● Further example queries against the inverted index: be, thing
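A term directory with posting lists can be sketched with a dictionary. The two sample documents are hypothetical stand-ins for the figures in the original slides, and the function names are chosen for this example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted posting list of document IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)
    return dict(index)

def lookup(index, term):
    """One dictionary lookup instead of a string scan over every document."""
    return index.get(term.lower(), [])

# Hypothetical documents 0 and 1.
docs = {
    0: "to be or not to be",
    1: "the thing is not the question",
}
index = build_inverted_index(docs)
print(lookup(index, "not"))    # [0, 1]
print(lookup(index, "be"))     # [0]
print(lookup(index, "thing"))  # [1]
```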
Drawbacks of Signature File
● False matches (false drops) must be eliminated
● Requires more disk accesses
● Difficult to construct and maintain
● Larger than inverted file
Drawbacks of Inverted File
● Performance challenges caused by
o a huge number of documents
o an increasingly large number of users
● The space cost of the associated inverted lists ranges from gigabytes to terabytes
● Research on improving index performance
o more efficient index structures for low space cost and fast query processing
Compression of Inverted File
● Time cost of index
o Seek and retrieve the inverted list from disk into memory
o Transfer lists from memory into CPU cache
● Increase number of lists that can be cached
● Reduce number of disk accesses
Compression of Inverted File: d-gap
● Document IDs: 4 9 20 28 45 59 81 102 130 157 178 210 237 258
● Gaps between consecutive IDs: 5 11 8 17 14 22 21 28 27 21 32 27 21
● Stored list (first ID, then gaps): 4 5 11 8 17 14 22 21 28 27 21 32 27 21
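The d-gap transformation is a running difference, and decoding is a running sum. A minimal sketch using the posting list from the slide, with function names chosen for this example:

```python
from itertools import accumulate

def to_gaps(doc_ids):
    """Replace each document ID by its distance from the previous one."""
    prev, gaps = 0, []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def from_gaps(gaps):
    """Recover the original IDs with a running sum."""
    return list(accumulate(gaps))

postings = [4, 9, 20, 28, 45, 59, 81, 102, 130, 157, 178, 210, 237, 258]
gaps = to_gaps(postings)
print(gaps)  # [4, 5, 11, 8, 17, 14, 22, 21, 28, 27, 21, 32, 27, 21]
```

The largest value drops from 258 to 32, so each entry needs fewer bits before any further compression is applied.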
Compression of Inverted File:Simple-9
● Combination of bit alignment and word alignment
● Pack as many integers as possible into one 32-bit word
● Compression format
o Selector (4 bits)
o Data bits (28 bits)
Compression of Inverted File: Simple-9
● Compression method
o Example: all values are less than 8 → 3 bits each → selector c
o Example: all values are less than 32 → 5 bits each → selector e
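A greedy Simple-9 packer can be sketched as follows. The nine (count, bits) layouts are the standard Simple-9 table. Numbering the selectors 0 to 8 so that they line up with the slide's letters a to i (c = 3 bits, e = 5 bits) is an assumption, as is passing the value count `n` to the decoder so that padding in the last word can be dropped.

```python
# (values per word, bits per value) for selectors 0..8, i.e. a..i.
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]

def simple9_encode(values):
    """Pack non-negative integers below 2**28 into 32-bit words."""
    words, i = [], 0
    while i < len(values):
        for selector, (count, bits) in enumerate(SELECTORS):
            chunk = values[i:i + count]  # may be shorter at the tail
            if all(0 <= v < (1 << bits) for v in chunk):
                word = selector << 28    # 4-bit selector in the top bits
                for slot, v in enumerate(chunk):
                    word |= v << (slot * bits)
                words.append(word)
                i += len(chunk)
                break
        else:
            raise ValueError("value does not fit in 28 bits")
    return words

def simple9_decode(words, n):
    """Unpack words and keep the first n values (the rest is padding)."""
    out = []
    for word in words:
        count, bits = SELECTORS[word >> 28]
        mask = (1 << bits) - 1
        for slot in range(count):
            out.append((word >> (slot * bits)) & mask)
    return out[:n]

gaps = [4, 5, 11, 8, 17, 14, 22, 21, 28, 27, 21, 32, 27, 21]
words = simple9_encode(gaps)
print(len(words))                      # 3 words instead of 14 integers
print([w >> 28 for w in words])        # [4, 4, 5]: selectors e, e, f
print(simple9_decode(words, len(gaps)) == gaps)
```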
Compression of Inverted File: Bitlist
● Simple and very efficient encoding scheme
o Uses encoded numbers to represent a set of document IDs
o Uses only 0/1 to indicate whether a document contains a specific term
o Low space requirement
● naive inverted list
● 0/1 matrix
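The step from a naive inverted list to a 0/1 row can be illustrated with a plain bit vector cut into blocks of `base` documents. This is a simplified sketch of the 0/1 idea only and does not reproduce the actual Bitlist encoding from the paper. The function names and the sample document IDs are hypothetical.

```python
def to_blocks(doc_ids, base):
    """Group document IDs into blocks of `base` IDs.

    Each non-empty block becomes (block number, integer whose bits mark
    which documents of that block contain the term).
    """
    blocks = {}
    for doc_id in doc_ids:
        block, offset = divmod(doc_id, base)
        blocks[block] = blocks.get(block, 0) | (1 << offset)
    return sorted(blocks.items())

def from_blocks(blocks, base):
    """Expand the blocks back into a sorted list of document IDs."""
    return [
        block * base + offset
        for block, bits in blocks
        for offset in range(base)
        if (bits >> offset) & 1
    ]

postings = [0, 1, 3, 4, 9]              # hypothetical posting list
blocks = to_blocks(postings, base=4)
print(blocks)                           # [(0, 11), (1, 1), (2, 2)]
print(from_blocks(blocks, base=4))      # [0, 1, 3, 4, 9]
```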
Bitlist structure
● base number = 4
● bitlist, base = 4
● bitlist, base = 12
Bitlist: DocID reassignment
● before
● after
Reference
[1] Rahevar, Mrugendrasinh L., and Mehul C. Parikh. "Optimized index construction for large text collections using blocked sort-based indexing." Advanced Communication Control and Computing Technologies (ICACCCT), 2014 International Conference on. IEEE, 2014.
[2] Rao, Weixiong, et al. "Bitlist: New full-text index for low space cost and efficient keyword search." Proceedings of the VLDB Endowment 6.13 (2013): 1522-1533.
[3] Zhang, Jiangong, Xiaohui Long, and Torsten Suel. "Performance of compressed inverted list caching in search engines." Proceedings of the 17th international conference on World Wide Web. ACM, 2008.
[4] Zobel, Justin, and Alistair Moffat. "Inverted files for text search engines." ACM Computing Surveys (CSUR) 38.2 (2006): 6.
[5] Frakes, William B., and Ricardo Baeza-Yates. "Information retrieval: data structures and algorithms." ISBN-10 134638379 (1992).
Indexing Techniques For Full-Text Search
In wireless broadcast environment
Wireless mobile computing
● Broadcasting
o Effective technique to disseminate information to a massive number of clients through public broadcast channels
o Why?
  bandwidth efficiency
  energy efficiency
  scalability
Yon Dohn Chung, Member, IEEE, Sanghyun Yoo, and Myoung Ho Kim , “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010
Full-text search in Wireless mobile computing
● Full-text search is used in various information systems
● Previous work was developed for disk storage, not for “wireless channels”
● In disk-based storage, documents are stored in physical space, so clients can “jump” among different storage slots
● In on-air storage, documents are stored “sequentially” along the time line
Full-text search in Wireless mobile computing
[Figure: a contents provider broadcasts buckets of articles (breaking news, weather reports …) over a public channel; a client says: I want to find articles containing “Database System”]
Full-text search in Wireless mobile computing
Full Scan!
Problem?
● Energy consumption
o A mobile device has limited battery power, so energy consumption must be reduced
● Active mode <-> doze mode
o Active mode: the device performs operations
o Doze mode: the device does nothing
Metrics
● Traditional
o Number of disk accesses
● In wireless network
o Latency (access time): the time from query submission until the download of the target information is complete
o Energy (tuning time): the time during which the mobile device remains in active mode
Basic scheme
● Broadcast index buckets and data buckets
● Data access protocol of client
o [Initial Probe]: Receive the bucket currently being broadcast on the air, and check whether it is the first bucket of the index
o [IndexWait]: If the current bucket is not the first bucket of the index, wait until the first bucket of the next index arrives on the air
o [DataWait]: Find the target data addresses by using the index, and wait until the target data bucket arrives on the air
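The three protocol steps can be turned into a small calculation of access time and tuning time, both counted in buckets. The broadcast layout is an assumption made for this sketch: each cycle consists of `index_len` index buckets followed by `data_len` data buckets, the client dozes whenever it is waiting, and it wants a single data bucket at position `target` within the data segment. The function name `basic_scheme` is hypothetical.

```python
def basic_scheme(tune_in, index_len, data_len, target):
    """Return (access_time, tuning_time) in buckets for one query.

    tune_in: bucket slot at which the client starts listening
    target:  0-based position of the wanted bucket in the data segment
    """
    cycle = index_len + data_len
    pos = tune_in % cycle

    # Initial Probe: read the bucket currently on the air (active mode).
    tuning = 1
    now = tune_in + 1

    if pos == 0:
        # The probe was the first index bucket: read the rest of the index.
        tuning += index_len - 1
        now += index_len - 1
    else:
        # IndexWait: doze until the next index starts, then read all of it.
        now = tune_in + (cycle - pos)
        tuning += index_len
        now += index_len

    # DataWait: doze until the target data bucket, then download it.
    now += target
    tuning += 1
    now += 1
    return now - tune_in, tuning

print(basic_scheme(tune_in=0, index_len=2, data_len=8, target=3))  # (6, 3)
print(basic_scheme(tune_in=4, index_len=2, data_len=8, target=3))  # (12, 4)
```

In the second call the client waits 12 buckets but is active for only 4 of them, which is the gap between latency and energy that the two metrics capture.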
Naive: Inverted list method
● Problem
o Large IndexWait time → AccessTime increases
● Solution
o Replication / distribution
Improved: Inverted list + Index tree method
● Distribution
Evaluation
Reference
● Yon Dohn Chung, Member, IEEE, Sanghyun Yoo, and Myoung Ho Kim, “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010
Hash Indexing Scheme
● Tree-based indexing vs hash-based indexing
○ Hash-based indexing is more flexible and space efficient for full-text search in wireless data broadcast
Yang, Kai, et al. "A novel hash-based streaming scheme for energy efficient full-text search in wireless data broadcast." Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2011.
Basic-Hash Indexing Scheme
Reference
Yang, Kai, et al. "A novel hash-based streaming scheme for energy efficient full-text search in wireless data broadcast." Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2011.