
Text processing

Team 3

김승곤 박지웅 엄태건 최지헌 임유빈

Outline

1. Introduction

2. Text processing

3. Index techniques in database

4. Index techniques in wireless network

5. Apache Lucene

6. Apache Solr

7. Demo

5/18

5/20

What’s Text Processing?

● Mechanism for the manipulation of text

o Language processing

o Data structure

o Visualization

o Human factor

● Converting text to index terms

Necessity of Index

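Criteria for measuring database performance

1) Number of SQL statements that access the table

2) Business category

3) Real access patterns (RAP, Real Access Pattern), i.e. the access patterns left after duplicates are removed

4) Classification by table type

5) SQL performance information

6) Number of indexes per table

Example of index usage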

Text processing steps

1. Lexical Analysis

2. Elimination of stopwords

3. Stemming

4. Selection of index terms

5. Building a thesaurus

Lexical Analysis

● Converting byte stream to tokens

o Numbers and digits

o Hyphens: index as phrase, allow partial match; proximity information

o Punctuation

o Lexer: written by hand (easy and fast, but not flexible) or produced by a DFA (Deterministic Finite Automaton) generator (uses a state machine)

“Deaths from car accidents in 1989” -> {Deaths, car, accidents, from, 1989}

“Work-out”

“BS”, “B.S.”, “M.S.”, URLs
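The tokenization choices above can be sketched in a few lines of Python. This is a minimal hand-written lexer under stated assumptions, not the method used in the talk: the names `ABBREV_RE`, `TOKEN_RE`, and `tokenize` are hypothetical, tokens are lowercased, dotted abbreviations are collapsed so that “B.S.” and “BS” index identically, and hyphenated words are emitted both whole and in parts to allow partial matches. URLs are not handled.

```python
import re

# Assumption: dotted abbreviations such as "B.S." collapse to "BS".
ABBREV_RE = re.compile(r"\b(?:[A-Za-z]\.){2,}")
# A token is a run of letters/digits, optionally joined by hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def tokenize(text):
    """Convert raw text into a list of lowercase index tokens."""
    # Remove the periods inside abbreviations before scanning for tokens.
    text = ABBREV_RE.sub(lambda m: m.group().replace(".", ""), text)
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group().lower()
        tokens.append(token)
        # Index a hyphenated word as a phrase and as its parts.
        if "-" in token:
            tokens.extend(token.split("-"))
    return tokens


print(tokenize("Deaths from car accidents in 1989"))
print(tokenize("Work-out"))
print(tokenize("B.S. and M.S."))
```

Numbers are kept as tokens here; whether to index them at all is a policy decision this sketch does not settle.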

Elimination of Stopwords

● High frequency, but not useless

o For example: the, of, and, a, in, to, is, for, with, are

o Removing stopwords: “to be or not to be” -> {be}

o Statistical approach & Lookup table
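A lookup-table filter is the simplest way to remove stopwords. The sketch below assumes the ten words listed on the slide are the whole table (`STOPWORDS` and `remove_stopwords` are hypothetical names). With only these ten words, “or” and “not” survive, so reproducing the slide's {be} result would need a longer list.

```python
# Assumption: the stopword table is exactly the ten sample words from the slide.
STOPWORDS = frozenset(
    {"the", "of", "and", "a", "in", "to", "is", "for", "with", "are"}
)


def remove_stopwords(tokens):
    """Drop every token that appears in the stopword lookup table."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


print(remove_stopwords(["deaths", "from", "car", "accidents", "in", "1989"]))
print(remove_stopwords("to be or not to be".split()))
```

The second call shows why stopwords are "not useless": a phrase made almost entirely of stopwords loses most of its content once they are removed.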

Stemming

● Reduce variant word forms to a single “stem”

o Words

o Four approaches: table lookup (use a dictionary), successor variety (fancy suffix removal), affix removal (cut prefixes and suffixes), character N-grams

o Example affixes: -’s, -ing, -ed, -s, in-, ad-, pre-, sub-

Porter’s algorithm

● Removes suffixes in five stages

o Each depends on a suffix and the stem measure m

o Porter errors: organization/organ, doing/doe, past/paste, european/europe, resolve/resolution

Rule               Result
SSES -> SS         caresses -> caress
IES -> I           ponies -> poni, ties -> ti
SS -> SS           caress -> caress
S -> ∅             cats -> cat
(m>0) EED -> EE    feed -> feed
(*v*) ED -> ∅      plastered -> plaster
(*v*) ING -> ∅     motoring -> motor
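The rules in the table can be sketched as follows. This covers only the plural rules and the first half of the ED/ING rules, not the full five-stage algorithm, and the function names are hypothetical. The measure m counts vowel-consonant sequences in the stem, and (*v*) means the stem contains a vowel. The cleanup that Porter's algorithm applies after removing ED or ING (for example undoubling a final consonant) is left out.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' is a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def has_vowel(stem):
    """The (*v*) condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step1a(word):
    """Plural rules: SSES -> SS, IES -> I, SS -> SS, S -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def step1b(word):
    """(m>0) EED -> EE, (*v*) ED -> (nothing), (*v*) ING -> (nothing)."""
    if word.endswith("eed"):
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if has_vowel(stem) else word
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step1a(w))
for w in ["feed", "agreed", "plastered", "motoring", "sing"]:
    print(w, "->", step1b(w))
```

“feed” is unchanged because its stem “f” has m = 0, while “sing” is unchanged because the stem “s” contains no vowel.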

Indexing Implementation in Text Database

Index Data Structure

● Different types of index data structures for querying large text collections

o Signature File

o Inverted File

Index Data Structure: Signature File

● Each signature is F bits wide

● Each term descriptor has m bits set to 1 and the rest set to 0

● Superimpose (bitwise OR) the term descriptors of a document to obtain the document descriptor

Index Data Structure: Signature File

● Probabilistic indexing method

● Querying

o Form query descriptor for term

o Fetch documents whose descriptor is a superset of the query descriptor, by comparison

o Possibility of wrong result (false drop)
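Superimposed coding and the superset test can be sketched as below. The values F = 64 and m = 3, the use of SHA-256 to choose bit positions, and all function names are assumptions made for illustration; the slides do not specify them.

```python
import hashlib

F = 64  # assumed signature width in bits
M = 3   # assumed number of 1-bits per term descriptor


def term_descriptor(term, f=F, m=M):
    """Build an f-bit descriptor with exactly m bits set, chosen by hashing."""
    sig = 0
    counter = 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{term}:{counter}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
        counter += 1
    return sig


def document_descriptor(terms):
    """Superimpose (bitwise OR) the descriptors of all terms in a document."""
    sig = 0
    for term in terms:
        sig |= term_descriptor(term)
    return sig


def may_contain(doc_sig, term):
    """True if the document descriptor is a superset of the query descriptor.

    A True result can be a false drop, so the document itself must still be
    checked; a False result is always correct.
    """
    query = term_descriptor(term)
    return doc_sig & query == query


doc = document_descriptor(["text", "index", "search"])
print(may_contain(doc, "index"))
print(may_contain(doc, "wireless"))
```

The test never misses a term that is present; it can only report extra candidates, which is why the method is called probabilistic.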

Index Data Structure: Inverted File




● Store mapping from content to its location

● Structure

o a directory of terms

o posting lists of document IDs
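A minimal sketch of this structure uses a dictionary as the term directory and a sorted list of document IDs as each posting list. The two sample documents and the function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() ensures each document appears once per posting list.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)


def search(index, term):
    """Look a term up in the directory and return its posting list."""
    return index.get(term.lower(), [])


# Hypothetical two-document collection.
docs = ["to be or not to be", "the thing is not to be afraid"]
index = build_inverted_index(docs)
for query in ["not", "be", "thing"]:
    print(query, "->", search(index, query))
```

A query becomes one dictionary lookup instead of a string comparison against every document.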



Query: not

String comparison slow! Solution: Inverted index

[Figure: inverted index lookups for the queries “not”, “be”, and “thing”, with entries labeled 0 and 1]

Drawbacks of Signature File

● False matches (false drops) must be eliminated

● Requires more disk accesses

● Difficult to construct and maintain

● Larger than an inverted file

Drawbacks of Inverted File

● Performance challenges caused by

o huge number of documents

o increasingly large number of users

● The space cost of the associated inverted lists can range from gigabytes to terabytes

● Research to improve index performance

o more efficient index structures for low space cost and fast query processing

Compression of Inverted File

● Time cost of index

o Seek and retrieve inverted list from disk into memory

o Transfer lists from memory into CPU cache

● Increase number of lists that can be cached

● Reduce number of disk accesses

Compression of Inverted File: d-gap

Document IDs: 4 9 20 28 45 59 81 102 130 157 178 210 237 258

Gaps between consecutive IDs: 5 11 8 17 14 22 21 28 27 21 32 27 21

Stored list (first ID followed by the gaps): 4 5 11 8 17 14 22 21 28 27 21 32 27 21
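The d-gap transformation shown above can be sketched in a few lines, assuming a sorted posting list of document IDs; `to_dgaps` and `from_dgaps` are hypothetical names.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Keep the first ID, then store each ID as the gap from its predecessor."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def from_dgaps(gaps):
    """Recover the original IDs with a running sum over the gaps."""
    return list(accumulate(gaps))


ids = [4, 9, 20, 28, 45, 59, 81, 102, 130, 157, 178, 210, 237, 258]
gaps = to_dgaps(ids)
print(gaps)
print(from_dgaps(gaps) == ids)
```

The gaps are smaller numbers than the IDs, so a later bit-packing step needs fewer bits per value.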

Compression of Inverted File: Simple-9

● Combination of bit alignment and word alignment

● Pack as many integers as possible into one 32-bit word

● Compression format: selector (4-bit) | data bits (28-bit)

● Compression method

o Example: all values are less than 8 → 3 bits each → selector c

o Example: all values are less than 32 → 5 bits each → selector e

Compression of Inverted File: Bitlist

● Simple and very efficient encoding scheme

o Use encoded number to represent a set of document IDs

o Only use 0/1 to indicate whether a document contains a specific term

o Low space requirement
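The 0/1 idea on its own can be illustrated with one integer bitmap per term, where bit d is set when document d contains the term. This is only a sketch of the 0/1 matrix view under that assumption; it is not the Bitlist encoding itself, whose base-number layout is shown only in the figures. All names are hypothetical.

```python
def build_bitmaps(postings):
    """Turn each posting list into an integer whose bit d marks document d."""
    bitmaps = {}
    for term, doc_ids in postings.items():
        bits = 0
        for doc_id in doc_ids:
            bits |= 1 << doc_id
        bitmaps[term] = bits
    return bitmaps


def docs_in(bits):
    """List the document IDs whose bit is set."""
    return [d for d in range(bits.bit_length()) if (bits >> d) & 1]


# Hypothetical posting lists.
bitmaps = build_bitmaps({"not": [0, 1], "thing": [1], "be": [0, 1]})
# A conjunctive query is a bitwise AND of the term bitmaps.
print(docs_in(bitmaps["not"] & bitmaps["thing"]))
```

A row of the matrix is cheap to intersect, but a plain bitmap is wasteful for rare terms, which is the gap a compact encoding has to close.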

[Figure: naive inverted list vs. 0/1 matrix]

Bitlist structure

[Figure: bitlist structure with base number = 4; bitlist with base = 4 vs. bitlist with base = 12]

Bitlist: DocID reassignment

[Figure: before and after reassignment]

Reference

[1] Rahevar, Mrugendrasinh L., and Mehul C. Parikh. "Optimized index construction for large text collections using blocked sort-based indexing." Advanced Communication Control and Computing Technologies (ICACCCT), 2014 International Conference on. IEEE, 2014.

[2] Rao, Weixiong, et al. "Bitlist: New full-text index for low space cost and efficient keyword search." Proceedings of the VLDB Endowment 6.13 (2013): 1522-1533.

[3] Zhang, Jiangong, Xiaohui Long, and Torsten Suel. "Performance of compressed inverted list caching in search engines." Proceedings of the 17th international conference on World Wide Web. ACM, 2008.

[4] Zobel, Justin, and Alistair Moffat. "Inverted files for text search engines." ACM computing surveys (CSUR) 38.2 (2006): 6.

[5] Frakes, William B., and Ricardo Baeza-Yates. "Information retrieval: data structures and algorithms." ISBN-10 0134638379 (1992).

Indexing Techniques For Full-Text Search

In wireless broadcast environment

Wireless mobile computing

● Broadcasting

o Effective technique to disseminate information to a massive number of clients through public broadcast channels

o Why? Bandwidth efficiency, energy efficiency, scalability

Yon Dohn Chung, Sanghyun Yoo, and Myoung Ho Kim, “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010

Full-text search in Wireless mobile computing

● Full-text search is used in various information systems

● Previous work was developed for disk storage, not for “wireless channels”

● In disk-based storage, documents are stored in physical space, so clients can “jump” among different storage slots

● In on-air storage, documents are stored “sequentially” along the timeline


[Figure: a contents provider broadcasts buckets of articles (breaking news, weather reports … ) over a public channel. A client says “I want to find articles containing ‘Database System’” and has to perform a full scan of the broadcast.]


Problem?

● Energy consumption

o Mobile device has limited battery power, so energy consumption must be reduced

● Active mode <-> doze mode

o Active mode: the device performs operations

o Doze mode: the device does nothing


Metrics

● Traditional

o Number of disk accesses

● In wireless network

o Latency (access time): duration from the time of query submission to the time when the download of the target information is complete

o Energy (tuning time): duration for which the mobile device remains in active mode

Yon Dohn Chung, Sanghyun Yoo, and Myoung Ho Kim, “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010

Metrics


● Broadcast index buckets and data buckets

● Data access protocol of client

o [Initial Probe]: Receive the bucket currently being broadcast on the air, and check whether it is the first bucket of the index

o [IndexWait]: If the current bucket is not the first bucket of the index, wait until the first bucket of the next index arrives on the air

o [DataWait]: Find the target data addresses by using the index, and wait until the target data bucket arrives on the air
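The protocol can be turned into a small cost model that reports both metrics in units of buckets. The sketch assumes a simplified broadcast cycle with one index segment followed by the data buckets, a client that reads the whole index, a single target bucket, and a probed bucket that tells the client when the next index starts. These simplifications and the name `simulate_access` are not taken from the cited paper.

```python
def simulate_access(n_index, n_data, start, target):
    """Return (access_time, tuning_time) in buckets for one query.

    n_index: index buckets at the start of each cycle
    n_data:  data buckets following the index
    start:   cycle position at which the client tunes in (0 = first index bucket)
    target:  position of the wanted bucket among the data buckets (0-based)
    """
    cycle = n_index + n_data
    # Initial probe: one bucket is read in active mode.
    access, tuning = 1, 1
    if start == 0:
        # The probed bucket was already the first index bucket.
        index_left = n_index - 1
    else:
        # IndexWait: doze until the next cycle begins.
        access += cycle - start - 1
        index_left = n_index
    # Read the (rest of the) index in active mode.
    access += index_left
    tuning += index_left
    # DataWait: doze until the target bucket arrives, then download it.
    access += target
    access += 1
    tuning += 1
    return access, tuning


print(simulate_access(n_index=2, n_data=8, start=0, target=3))
print(simulate_access(n_index=2, n_data=8, start=5, target=3))
```

In this model tuning time stays near the index size plus two buckets wherever the client tunes in, while access time grows with the IndexWait, which is the cost the later slides try to reduce.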

Basic scheme


Naive: Inverted list method


Naive: Inverted list method

● Problem

o Large IndexWait time, so AccessTime increases

● Solution

o replication / distribution

Improved: Inverted list + Index tree method

Improved: Inverted list + Index tree method

● Distribution

Evaluation

Reference

● Yon Dohn Chung, Sanghyun Yoo, and Myoung Ho Kim, “Energy- and Latency-Efficient Processing of Full-Text Searches on a Wireless Broadcast Stream”, IEEE TKDE, 2010

Hash Indexing Scheme

● Tree-based indexing vs hash-based indexing

○ hash-based is more flexible and space efficient for full-text search in wireless data broadcast

Yang, Kai, et al. "A novel hash-based streaming scheme for energy efficient full-text search in wireless data broadcast." Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2011.

Basic-Hash Indexing Scheme

Reference

Yang, Kai, et al. "A novel hash-based streaming scheme for energy efficient full-text search in wireless data broadcast." Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2011.
