Discovering Web Access Patterns and Trends by Applying OLAP
and Data Mining Technology on Web logs
Data Engineering Lab
성 유 진
Abstract
Web server log file analysis
• server performance improvement
• system performance improvement
• customer targeting in electronic commerce
Problems and difficulties
• processing large raw log data is not easy
• data reduction
  – size and time
• current web log miners
  – slow, inflexible, difficult to maintain
  – frequency counts alone are not enough
WebLogMiner
• Virtual University / data mining
• OLAP and data mining techniques
• multi-dimensional data cube
• scalability, interactivity, variety, flexibility
Design of a Web log Miner
Web server log file information
• domain name of the request
• user name
• date and time of the request
• method of the request (GET, POST)
• name of the file requested
• result of the request (success, failure, error, etc.)
• size of the data sent back
• URL of the referring page
• identification of the client agent
• Example
210.114.3.64 - - [01/Jul/1998:17:34:05 0900] "GET /~yjsung/sign.html HTTP/1.1" 200 740
210.114.3.64 - - [01/Jul/1998:17:38:44 -0900] "POST /cgi-bin/yjsung/sign HTTP/1.1" 200 352
POST: used when the browser submits a filled-in form to the server
GET: used when requesting data from the server
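
The fields in the example entries can be pulled apart with a regular expression. The following is a minimal sketch in Python, not the parser used by WebLogMiner; the names `LOG_PATTERN` and `parse_log_line` are illustrative, and the sketch assumes entries with normalized spacing and only the seven fields visible in the example (no referrer or agent).

```python
import re
from datetime import datetime

# Fields visible in the example entries: host, identity, user,
# timestamp, request line, status code, and response size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    # Parse only the 20-character date-time part; the zone offset
    # stays as raw text inside record["time"].
    record["timestamp"] = datetime.strptime(
        record["time"][:20], "%d/%b/%Y:%H:%M:%S"
    )
    return record

line = ('210.114.3.64 - - [01/Jul/1998:17:38:44 -0900] '
        '"POST /cgi-bin/yjsung/sign HTTP/1.1" 200 352')
record = parse_log_line(line)
print(record["method"], record["path"], record["status"], record["size"])
```

Lines that do not match return `None`, so malformed entries can be counted or set aside rather than silently dropped.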
• Cache information
  – frequent backtracking and reloads: deficient design
  – client-side log
• Access count
  – not always a measure of interestingness
  – e.g. pages that must be passed through to reach a particular document
• Time and date
  – evaluate user interest by time spent
• Domain name
• Sequence of requests
  – can predict the next request and improve traffic
WebLogMiner 4 Stages
1. Filtering the data, creating relational DB
2. Data cube construction
3. OLAP is used
4. Data mining techniques are used
1. DATABASE CONSTRUCTION FROM SERVER LOG FILES
Data Cleansing and Transformation
• page graphics (sound and video) would typically be filtered out, but are preserved
• two types of transformation
  – without knowledge about the site (transforming the time into day, month, year, etc. is possible without server information)
  – with knowledge about the site: associating a server request with the intended action requires the site structure
• relational database
  – cleaned data and new implicit data are added
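
The site-independent transformation can be illustrated with a short Python sketch. It is an assumption-laden example, not the authors' procedure: the function names and the extension lists are hypothetical, and a real site would choose its own categories.

```python
from datetime import datetime

# Hypothetical extension lists; adjust to the site being analyzed.
GRAPHIC_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")
MEDIA_EXTENSIONS = (".wav", ".au", ".mpg", ".avi")

def resource_type(path):
    """Label a requested file so entries can be kept but told apart."""
    lower = path.lower()
    if lower.endswith(GRAPHIC_EXTENSIONS):
        return "graphic"
    if lower.endswith(MEDIA_EXTENSIONS):
        return "media"
    return "page"

def time_fields(timestamp):
    """Derive time dimensions that need no knowledge of the site."""
    return {
        "year": timestamp.year,
        "month": timestamp.month,
        "day": timestamp.day,
        "hour": timestamp.hour,
        "weekday": timestamp.weekday(),  # Monday is 0
    }

row = time_fields(datetime(1998, 7, 1, 17, 34, 5))
row["type"] = resource_type("/~yjsung/sign.html")
print(row)
```

Labeling the resource type instead of deleting entries keeps graphics, sound, and video available for later analysis while still allowing them to be excluded from a query.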
2. MULTI-DIMENSIONAL WEB LOG DATA CUBE CONSTRUCTION AND MANIPULATION
Data Cube
• the GROUP BY operator in SQL is used to compute aggregates on a set of attributes
  ☞ sum of sales by P, C: for each product, give a breakdown of how much of it was sold to each customer
• CUBE is the n-dimensional generalization of GROUP BY
  – gives remarkable flexibility to manipulate and view the data
  – allows OLAP operations such as drill-down, roll-up, slice and dice
• Attributes
  – URL
  – domain name
  – size of resource
  – time
  – . . .
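
To make the idea of CUBE as a generalized group-by concrete, here is a small in-memory sketch in Python. It is not how WebLogMiner builds its cube; `build_cube`, the `ALL` marker, and the sample records are assumptions made for illustration. Every record contributes to each combination of kept and aggregated dimensions, so roll-ups become simple lookups.

```python
from collections import defaultdict
from itertools import combinations

ALL = "*"  # marks a dimension that has been aggregated away

def build_cube(records, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (2**n group-bys)."""
    cube = defaultdict(int)
    for record in records:
        for k in range(len(dimensions) + 1):
            for kept in combinations(dimensions, k):
                key = tuple(
                    record[d] if d in kept else ALL for d in dimensions
                )
                cube[key] += record[measure]
    return dict(cube)

# Hypothetical cleaned log records: domain, hour, and bytes sent.
records = [
    {"domain": "edu", "hour": 17, "size": 740},
    {"domain": "edu", "hour": 18, "size": 352},
    {"domain": "com", "hour": 17, "size": 100},
]
cube = build_cube(records, ("domain", "hour"), "size")

print(cube[("edu", 17)])    # finest level: one domain, one hour
print(cube[("edu", ALL)])   # roll-up over hour
print(cube[(ALL, ALL)])     # grand total
# Slice: fix hour = 17 and list the per-domain cells.
print({k: v for k, v in cube.items() if k[1] == 17 and k[0] != ALL})
```

Materializing every combination is only practical for a few dimensions with small domains; it is used here because it keeps the correspondence with group-by easy to see.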
3. DATA MINING ON WEB LOG DATA CUBE AND WEB LOG DATABASE
Data Characterization
• find rules that summarize a user-defined data set
  ☞ the traffic on a web server for a given type of media at a particular time of day
Class comparison
• discover discriminant rules
  ☞ compare requests from two different web browsers
Association
• discover patterns in which accesses to different resources consistently occur together
Prediction
  ☞ access to a new resource on a given day can be predicted based on accesses to similar old resources on similar days
Classification
• can be used to develop a better understanding of each class in the web log database, and perhaps to restructure a web site or customize answers to requests based on classes of requests
Time-series analysis
• analyze data along time sequences to discover time-related interesting patterns …
  ☞ disclose patterns and trends that inform the improvement of the web server's services
Focus will be on time-series analysis because web log records are highly time-related
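
A very simple form of time-related analysis is to count requests per hour and smooth the resulting series. The Python sketch below is only an illustration of that idea, not the time-series technique used in WebLogMiner; the function names and the sample timestamps are assumptions.

```python
from collections import Counter
from datetime import datetime

def requests_per_hour(timestamps):
    """Return a 24-element list of request counts indexed by hour."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

def moving_average(series, window):
    """Smooth a series with a simple trailing window."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical request times taken from cleaned log records.
timestamps = [
    datetime(1998, 7, 1, 17, 34, 5),
    datetime(1998, 7, 1, 17, 38, 44),
    datetime(1998, 7, 1, 18, 2, 10),
]
hourly = requests_per_hour(timestamps)
peak_hour = max(range(24), key=lambda hour: hourly[hour])
print(peak_hour, hourly[peak_hour])
print(moving_average(hourly, 3)[15:18])
```

The same counting can be repeated per day, week, or month to compare periods and look for trends.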
Experiments with the web log miner
Virtual-U: six different major components
Goal: understand the usage and user behavior patterns
Data Cleaning and transformations
• all entries were mapped one-to-one into the relational database
• the fields site and user action were added
• Problems
  – extraneous information => identify those entries and eliminate them
  – multiple server requests by the same user action
  – the same server request by multiple user actions
  – local activities are not recorded
Multi-dimensional data cube construction and manipulation
• summarization (group-bys on different dimensions)
• request / domain / event / session / bandwidth / error / referring organization / browser summary
Examples
Fig. 2) OLAP analysis of Web log
Fig. 3) Typical event sequence and user behavior pattern analysis
Fig. 4) Web traffic analysis of Web log
Fig. 6) Event trees of month one to four
Discussion and Conclusion
WebLogMiner
• OLAP and data mining techniques
• multi-dimensional data cube
• major strengths
  – scalability, interactivity, variety, flexibility
Problems with current log files
• web servers should collect more information
• a new structure is needed ==> would simplify pre-processing