
1

Finding (Recently) Frequent Items in Distributed Data Streams

Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, Christopher Olston

CMU-CS-05

Speaker: 董原賓    Advisor: 柯佳伶

2

Introduction

Mining distributed data streams
Challenge: transferring data among nodes is costly
Goal: minimize communication requirements

Applications:
Monitor usage in large-scale distributed systems, e.g., a Content Delivery Network
Detect malicious activities in networked systems, e.g., detect worms

3

Definition
Data streams S1, S2, …, Sm, m ≥ 1
R: root node
ε: error tolerance, 0 ≤ ε ≤ s
s: minimum support
c(u): frequency of item u in S, u ∈ universe U of items
N = ∑u∈U c(u)
ĉ(u): estimated count of c(u), max{0, c(u) − ε·N} ≤ ĉ(u) ≤ c(u)

4

Definition
T: a period of time units, an epoch
α: decay rate, 0 < α ≤ 1
(ε, α)-synopsis: consists of S:ĉ(u) and S:n
l ≥ 2: number of levels in the hierarchy
ε ≥ ε1 ≥ ε2 ≥ … ≥ εl-1
d ≥ 2: fanout (degree) of all non-leaf nodes in the hierarchy
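Lossy counting, on which Algorithms 1a and 2a below are based, is a standard single-stream algorithm whose estimates meet the ĉ(u) guarantee above. A minimal Python sketch (function name and structure illustrative, not from the paper):

import math

def lossy_counting(stream, epsilon):
    """Maintain estimates c_hat satisfying
    max(0, c(u) - epsilon * n) <= c_hat(u) <= c(u)."""
    w = math.ceil(1 / epsilon)        # bucket width
    entries = {}                      # item -> [stored count, max undercount]
    n = 0
    for u in stream:
        n += 1
        bucket = math.ceil(n / w)     # id of the bucket currently being filled
        if u in entries:
            entries[u][0] += 1
        else:
            entries[u] = [1, bucket - 1]
        if n % w == 0:                # at bucket boundaries, prune rare items
            entries = {x: cd for x, cd in entries.items()
                       if cd[0] + cd[1] > bucket}
    return {x: cd[0] for x, cd in entries.items()}, n

The stored count never exceeds the true count c(u) and undercounts it by at most ε·n, which is exactly the synopsis guarantee above.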

5

Main Approach

1. Every monitor node Mi uses a single-stream approximate frequency counting algorithm (e.g., lossy counting)

2. The m monitor nodes M1, M2, …, Mm relay data every T time units to the central root node R

3. Every T time units, each monitor node sends its (εl-1, 1)-synopsis to its parent node

6

Main Approach

4. Each parent node combines the d (εl-1, 1)-synopses from its d children into a single (εl-2, 1)-synopsis using Algorithm 1a (based on lossy counting) or Algorithm 1b (based on majority counting)

5. The root node combines the d (ε1, 1)-synopses using Algorithm 2a (based on lossy counting) or Algorithm 2b (based on majority counting)

7

Algorithm 1a and 1b
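The original slide showed the pseudocode of Algorithms 1a and 1b as a figure. The Python sketch below reconstructs the combination step in the spirit of Algorithm 1a from the worked example on the following slides: sum the children's counts and stream lengths, charge the extra error budget (εparent − εchild)·n against every count, and delete items whose counts drop to zero or below. It is a sketch of that behavior, not the paper's exact pseudocode.

def combine_synopses(children, eps_child, eps_parent):
    """Combine child (eps_child, 1)-synopses into one (eps_parent, 1)-synopsis.
    children is a list of (c_hat, n) pairs, e.g. from lossy_counting above."""
    n = sum(child_n for _, child_n in children)
    combined = {}
    for c_hat, _ in children:
        for item, count in c_hat.items():
            combined[item] = combined.get(item, 0) + count
    # Spend the extra error budget: undercount every item by
    # (eps_parent - eps_child) * n, then drop non-positive counts.
    slack = (eps_parent - eps_child) * n
    return ({item: count - slack
             for item, count in combined.items() if count > slack}, n)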

8

Example

Use Algorithm 1a. 27 distinct items (I1~I27), partitioned into categories:
A: contains I1
B: contains I2~I14
C: contains I15~I27

ε1 ≈ ε = 0.05, ε2 = 0.03

9

Example

S1 = S3 = { I1:9, I2~I14:6, I15~I27:1 }
S2 = S4 = { I1:9, I2~I14:1, I15~I27:6 }

S1:n = S2:n = S3:n = S4:n = 100

The lossy counting algorithm leads to undercounting of each item's frequency by ε2 · 100 = 0.03 · 100 = 3

S1 = S3 = { I1:6, I2~I14:3, I15~I27:0 }
S2 = S4 = { I1:6, I2~I14:0, I15~I27:3 }

S1 = S3 = { I1:6, I2~I14:3 }

S2 = S4 = { I1:6, I15~I27:3 }

Items whose counts fall to zero or below are eliminated

Link load M1 → l1 = 14 (each of the four monitor-to-parent links carries 14 counts)

l1 = { I1:8 }

l2 = { I1:8 }

l1:n = S1:n + S2:n = 100 + 100 = 200
l1:ĉ(I1) = S1:ĉ(I1) + S2:ĉ(I1) = 6 + 6 = 12
l1:ĉ(I1) = l1:ĉ(I1) − (ε1 − ε2) · l1:n = 12 − (0.05 − 0.03) · 200 = 12 − 4 = 8

l1:ĉ(I2) = S1:ĉ(I2) + S2:ĉ(I2) = 3 + 0 = 3
l1:ĉ(I2) = l1:ĉ(I2) − (ε1 − ε2) · l1:n = 3 − (0.05 − 0.03) · 200 = 3 − 4 = −1 → delete

Link load l1 → R = 1 (each of the two links into the root carries 1 count)
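For concreteness, the combine_synopses sketch from the Algorithm 1a slide reproduces the numbers in this example:

# S1 and S2 after local lossy counting with eps2 = 0.03 over 100 items each
S1 = ({"I1": 6, **{f"I{i}": 3 for i in range(2, 15)}}, 100)
S2 = ({"I1": 6, **{f"I{i}": 3 for i in range(15, 28)}}, 100)

synopsis, n = combine_synopses([S1, S2], eps_child=0.03, eps_parent=0.05)
print(synopsis, n)   # -> roughly {'I1': 8.0} 200 (float rounding aside),
                     #    matching l1 = { I1:8 } above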

10

Example

11

Minimizing Total Load on the Root Node

Use Algorithm 1a at all applicable nodes and set εi = 0 for all 2 ≤ i ≤ l−1; we term this strategy MinRootLoad
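A small illustration of this gradient (helper name hypothetical):

def min_root_load_epsilons(eps1, l):
    """MinRootLoad gradient: eps_i = 0 for 2 <= i <= l-1, so no counts are
    pruned below the root's children; all pruning is deferred to the
    eps_1 combination feeding the root."""
    return [eps1] + [0.0] * (l - 2)    # [eps_1, eps_2, ..., eps_{l-1}]

print(min_root_load_epsilons(0.09, l=4))   # -> [0.09, 0.0, 0.0]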

12

Minimizing Worst Case Maximum Load on Any Link

Definition: for 2 ≤ i ≤ l−2, Δi = εi − εi+1, and Δl-1 = εl-1

I: the contents of all input streams S1, …, Sm

𝕀: the set of all possible instances of I

Communication hierarchy T defined by degree d and number of levels l

w: maximum load on any link

Worst-case load W = max over all I ∈ 𝕀 of w(I, T)

13

Minimizing Worst Case Maximum Load on Any Link

𝕀wc: the set of worst-case inputs; for all instances I ∈ 𝕀 − 𝕀wc and I′ ∈ 𝕀wc, w(I, T) ≤ w(I′, T) for any communication hierarchy T

Worst-case properties:

For any two input streams Si and Sj, there is no item occurrence common to both Si and Sj

For any input stream Si, all items occurring in Si occur with equal frequency

For any two input streams Si and Sj, both the number of item occurrences and the number of distinct items in Si and Sj are equal
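A sketch of an input family satisfying these three properties (pairwise-disjoint streams, uniform within-stream frequency, equal sizes); parameter names are illustrative:

def worst_case_streams(m, distinct_per_stream, freq):
    """Build m pairwise-disjoint streams, each with the same number of
    distinct items and every item occurring exactly freq times."""
    streams = []
    for i in range(m):
        base = i * distinct_per_stream           # disjoint item ranges
        items = [f"I{base + j + 1}" for j in range(distinct_per_stream)]
        streams.append([u for u in items for _ in range(freq)])
    return streams

for st in worst_case_streams(m=3, distinct_per_stream=2, freq=2):
    print(st)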

14

Minimizing Worst Case Maximum Load on Any Link

Level X: if all counts are dropped at level X, the most heavily loaded link(s) are the ones leading into level X

15

Minimizing Worst Case Maximum Load on Any Link

After solving the equations, we obtain Δi = ε1 · … for 2 ≤ i ≤ l−1, and Δl-1 = ε1 · …

The maximum possible load on any link is Lwc = …

We term this strategy MinMaxLoad_WC

16

Good Precision Gradients for Non-Worst-Case Inputs

Real data is unlikely to exhibit worst-case characteristics

Two extreme opposite cases:

1. Items on each input stream are disjoint. Solution: use strategy MinMaxLoad_WC

2. All input streams contain identical distributions of items. Solution: there is no benefit to delaying pruning, so set ε1 = ε2 = … = εl-1 = ε (strategy SS2)

17

Good Precision Gradients for Non-Worst-Case Inputs

Most real-world data falls somewhere between these two extremes

Li: the number of locally frequent items in Si

Gi: the number of items that are locally frequent in Si and globally frequent in S, where S = S1 ∪ S2 ∪ … ∪ Sm

Commonality parameter γ ∈ [0, 1]

18

Good Precision Gradients for Non-Worst-Case Inputs

A natural hybrid strategy is to use a linear combination of MinMaxLoad_WC and SS2

Set εi = (1 − γ) · … and εl-1 = (1 − γ) · …

We term this hybrid strategy MinMaxLoad_NWC
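The slide's expressions for εi are truncated above. Assuming the linear combination blends the MinMaxLoad_WC gradient with the uniform SS2 gradient weighted by the commonality γ (an assumption about the blend's form, not confirmed by the slides), a sketch:

def ss2_epsilons(eps, l):
    """SS2: no delayed pruning; every level uses the full tolerance eps."""
    return [eps] * (l - 1)

def min_max_load_nwc_epsilons(eps_wc, eps, gamma):
    """Hypothetical blend: gamma = 0 recovers the worst-case gradient
    MinMaxLoad_WC; gamma = 1 recovers SS2 (identical input streams)."""
    return [(1 - gamma) * e + gamma * eps for e in eps_wc]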

19

Experiment

Data sets:

1. Traffic logs from the Internet, used to identify hosts receiving large numbers of packets recently; data were collected for one full day

2. Java Servlet versions of two publicly available dynamic Web application benchmarks: RUBiS, modeled after eBay, an online auction site, and RUBBoS, modeled after Slashdot, an online bulletin board; each benchmark ran for 40 hours

20

Experiment

Simulated environment: 216 monitoring nodes (m = 216); a communication hierarchy of fanout six (d = 6) with four levels (l = 4); s = 0.01, ε = 0.1·s, and ε1 = 0.9·ε; the epoch T is 5 minutes for data set 1 and 15 minutes for data set 2

Data characteristics: data set 1 (Internet traffic) γ = 0.675; data set 2, Auction γ = 0.839 and Bulletin Board γ = 0.571
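A quick arithmetic check of the setup: with fanout d = 6 and l = 4 levels, the leaf count d^(l−1) matches the 216 monitors.

d, l = 6, 4
m = d ** (l - 1)     # leaf monitors in a complete fanout-6, 4-level hierarchy
s = 0.01
eps = 0.1 * s        # error tolerance 0.001
eps1 = 0.9 * eps     # level-1 error 0.0009
print(m, eps, eps1)  # -> 216 0.001 0.0009 (up to float rounding)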

21–24

Experiment (result figures)
