WWW servers and search engines

Page 1: WWW servers and search engines

WWW servers and search engines

2004, 劉震昌

Page 2: WWW servers and search engines

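Web browser and server
Web browser: a tool for reading HTML documents

Web browser (client) and Web server (e.g., running IIS)

1. Client: click a link
2. Client to server: send request
3. Server: find document
4. Server to client: return HTML document
5. Client: display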
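
The request cycle above can be tried on one machine. The following is a minimal sketch, not part of the original slides: it uses Python's standard http.server instead of IIS, and the page content, the handler name DocumentHandler, and the helper fetch_once are all made up for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical HTML document held by the server.
PAGE = b"<html><body><h1>Hello from the web server</h1></body></html>"


class DocumentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Server side: "find document" and "return HTML document".
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_once():
    """Start a local server, send one request as a client, return (status, body)."""
    server = HTTPServer(("127.0.0.1", 0), DocumentHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Bypass any configured proxy so the request goes straight to localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        # Client side: "send request" (what the browser does when a link is clicked).
        with opener.open(f"http://127.0.0.1:{port}/index.html", timeout=5) as response:
            return response.status, response.read()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = fetch_once()
    print(status, body.decode("utf-8"))  # a browser would render the HTML instead
```

Here the client and the server run on the same computer; on the real Web they are usually on different machines.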

Where is the web server?

Page 3: WWW servers and search engines

Probing the Internet (cont.)

Tools: tracert, ping

Figure: packets travel from the source through routers to the destination (www.yahoo.com.tw)

Packet: the unit of data transmission on a network

Page 4: WWW servers and search engines

Probing the Internet (How do you know you are on the Internet?)

ping www.yahoo.com.tw

Pinging rc.tpe.yahoo.com [202.1.237.23] with 32 bytes of data:

Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=5ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246

Ping statistics for 202.1.237.23:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 4ms, Maximum = 5ms, Average = 4ms
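
The statistics lines can be checked against the four replies. This is a small sketch, assuming the Windows-style reply format shown on this slide; the function names round_trip_times and summarize are made up for illustration.

```python
import re

# The four reply lines from the ping run on this slide.
PING_OUTPUT = """\
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=5ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
"""


def round_trip_times(output):
    """Extract the round-trip time in milliseconds from every reply line."""
    return [int(ms) for ms in re.findall(r"time[=<](\d+)ms", output)]


def summarize(times):
    """Return (minimum, maximum, average) of the round-trip times."""
    return min(times), max(times), sum(times) / len(times)


if __name__ == "__main__":
    times = round_trip_times(PING_OUTPUT)
    print(times)             # [4, 5, 4, 4]
    print(summarize(times))  # (4, 5, 4.25)
```

The exact average is 4.25 ms, which the ping output reports as a whole number (4ms).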

Page 5: WWW servers and search engines

The route from source to destination

tracert www.yahoo.com.tw

Tracing route to rc.tpe.yahoo.com [202.1.237.23]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  gateway.lan20.csie.ncnu.edu.tw [163.22.20.254]
  2    <1 ms    <1 ms    <1 ms  ip253.puli01.ncnu.edu.tw [163.22.1.253]
  3    <1 ms    <1 ms    <1 ms  ip090.puli255-64-203.ncnu.edu.tw [203.64.255.90]
  4     1 ms     1 ms     1 ms  140.128.251.38
  5    17 ms    74 ms     2 ms  tc-tanet-gw01.router.hinet.net [211.22.189.186]
  6     2 ms     1 ms     1 ms  211.22.189.190
  7     1 ms     1 ms     1 ms  tc-c12r2.router.hinet.net [211.22.189.74]
  8     4 ms     4 ms     4 ms  tp-s2-c12r2.router.hinet.net [210.65.200.30]
  9     4 ms     4 ms     4 ms  tp-s2-c6r8.router.hinet.net [211.22.35.181]
 10     9 ms     5 ms     6 ms  211.22.41.89
 11     5 ms     5 ms     5 ms  rc.tpe.yahoo.com [202.1.237.23]

Trace complete.
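
Each line of the trace is one router (hop) on the way to the destination. The sketch below shows one possible way to pull the hop number, host name, and IP address out of such lines. It assumes the Windows tracert layout shown above and uses only three of the eleven hops as sample data; the name parse_hops is made up for illustration.

```python
import re

# Three hop lines copied from the trace above (hops 1, 4 and 11).
TRACE_EXCERPT = """\
  1    <1 ms    <1 ms    <1 ms  gateway.lan20.csie.ncnu.edu.tw [163.22.20.254]
  4     1 ms     1 ms     1 ms  140.128.251.38
 11     5 ms     5 ms     5 ms  rc.tpe.yahoo.com [202.1.237.23]
"""

# hop number, three timing columns, optional "host [", then the IP address
HOP_LINE = re.compile(
    r"^\s*(\d+)\s+(?:<?\d+ ms\s+){3}(?:(\S+)\s+\[)?(\d{1,3}(?:\.\d{1,3}){3})\]?\s*$"
)


def parse_hops(output):
    """Return a list of (hop_number, host_name_or_None, ip_address)."""
    hops = []
    for line in output.splitlines():
        match = HOP_LINE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), match.group(3)))
    return hops


if __name__ == "__main__":
    for hop in parse_hops(TRACE_EXCERPT):
        print(hop)
```

A hop without a host name (such as hop 4) is returned with None in the host position.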

Page 6: WWW servers and search engines

Lab#5
Try ping and tracert to access www.google.com.tw
Record your results in a text file
Email to me with subject: Lab5 followed by your student ID number

Page 7: WWW servers and search engines

How do you host a site (WWW, ftp, …) with a dynamic IP?
DHCP (Dynamic Host Configuration Protocol)
About DHCP

IP:163.22.123.111
IP:163.22.123.123
...

If we want to communicate with him, what is the IP or domain name?

1. Run your own DNS (domain name server)
2. Dynamically register the IP and domain name

Page 8: WWW servers and search engines

www.no-ip.com

Host with a dynamic IP: 163.22.123.111
www.no-ip.com: DNS server
Kamiry.no-ip.com: registers the mapping between the IP and the domain name

Reference: No-IP user documentation
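
The idea of dynamic registration can be modeled in a few lines. This is a toy sketch only: it is not the No-IP API or client, the class DynamicDnsRegistry and its methods are hypothetical, and the two addresses are reused from the slides purely as example values.

```python
class DynamicDnsRegistry:
    """Toy stand-in for a dynamic DNS service (hypothetical, not the No-IP API)."""

    def __init__(self):
        self._records = {}  # domain name -> current IP address

    def register(self, domain_name, ip_address):
        # Called by the host each time DHCP hands it a (possibly new) address.
        self._records[domain_name.lower()] = ip_address

    def resolve(self, domain_name):
        # Called by anyone who wants to reach the host by name.
        return self._records.get(domain_name.lower())


if __name__ == "__main__":
    registry = DynamicDnsRegistry()
    registry.register("Kamiry.no-ip.com", "163.22.123.111")
    print(registry.resolve("kamiry.no-ip.com"))  # 163.22.123.111

    # Later the host receives a different address and registers again.
    registry.register("Kamiry.no-ip.com", "163.22.123.123")
    print(registry.resolve("kamiry.no-ip.com"))  # 163.22.123.123
```

Visitors keep using the same domain name; only the registered address behind it changes.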

Page 9: WWW servers and search engines

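Installing IIS (Internet Information Server)
Found on the Windows CD
Installation instructions
IIS configuration
Microsoft IIS is very widespread and has many security vulnerabilities, so a non-Microsoft WWW server can be used instead
  Ex. Apache, AnalogX, …
Reference document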

Page 10: WWW servers and search engines

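HW#3
Set up a WWW server on your own computer
Email the server's domain name to me
Put your personal web page on your own computer
The server must be running during the hours specified by the teaching assistant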

Page 11: WWW servers and search engines

Searching the Web

Ref: Chapter 13 in “Modern Information Retrieval”

Ricardo Baeza-Yates and Berthier Ribeiro-Neto

Page 12: WWW servers and search engines

Outline
Measuring the Web
Methods for searching the Web
  Search engines
  Web directories

Page 13: WWW servers and search engines

Searching the Web
The WWW started in 1989
The textual data alone is estimated to be on the order of one terabyte
Goal: how to efficiently manage, retrieve and filter information from the Web?

Page 14: WWW servers and search engines

Challenges
Distributed data
  Data spans many computers interconnected without a predefined topology
High percentage of volatile data
  40% of the Web changes every month
Large volume
Unstructured and redundant data
  30% of Web pages are (near) duplicates
Heterogeneous data
  Different languages

Page 15: WWW servers and search engines

Measuring the Web

Figure labels: Internet, URLs, WWW, Web server

*1998: 3M (3 million) servers
No. of servers = 1/10 of the no. of computers on the Internet

Page 16: WWW servers and search engines

Measuring the Web (cont.)
1998
  5 KB per Web page on average
  300M Web pages (300 million)
  300M * 5 KB = 1.5 terabytes
  Grows at a rate of 20M pages per month
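
The size estimate is simple arithmetic and can be reproduced directly. A minimal sketch, assuming decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes); the variable names are made up for illustration.

```python
# Figures from the slide (1998).
PAGES = 300_000_000            # 300M Web pages
BYTES_PER_PAGE = 5_000         # 5 KB per page on average (decimal kilobytes assumed)
NEW_PAGES_PER_MONTH = 20_000_000

total_bytes = PAGES * BYTES_PER_PAGE
total_terabytes = total_bytes / 10**12

monthly_growth_bytes = NEW_PAGES_PER_MONTH * BYTES_PER_PAGE
monthly_growth_gigabytes = monthly_growth_bytes / 10**9

if __name__ == "__main__":
    print(total_terabytes)           # 1.5
    print(monthly_growth_gigabytes)  # 100.0
```

At 20M new pages per month, the text of the Web grows by roughly 100 GB each month under the same 5 KB average.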

Page 17: WWW servers and search engines

Growth of the Web

Figure: number of Web pages and Web sites (in millions, scale 100 to 300) by year, 1996 to 1998

Page 18: WWW servers and search engines

Methods for searching the Web

Search engines
  Index the Web documents as a full-text database
  Alta Vista, Google, …
Web directories (portal site directories)
  Classify selected Web documents by subject
  Yahoo!

Page 19: WWW servers and search engines

Search engines concept
Model the Web as a database
All queries must be answered without accessing the Web pages

Figure: the user sends queries to the database
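
One common way to answer queries without visiting the pages again is an inverted index that is built once, in advance. The sketch below is an illustration under that assumption, not a description of how any particular search engine works; the page texts, URLs, and function names are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical pages, already downloaded when the index was built.
PAGES = {
    "http://example.com/a": "Web server and Web browser",
    "http://example.com/b": "Web search engine",
    "http://example.com/c": "Search engine crawler",
}

if __name__ == "__main__":
    index = build_index(PAGES)
    print(search(index, "web search"))             # only page b
    print(sorted(search(index, "search engine")))  # pages b and c
```

At query time only the index is consulted; the Web pages themselves are never fetched.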

Page 20: WWW servers and search engines

Search engines (cont.)
AltaVista (www.altavista.com)
  20 multi-processor machines
  130 GB of RAM each
  Over 500 GB of disk space each
  75% of the resources are used by the query engine

Page 21: WWW servers and search engines

The top search engines
Foreign
  Google (www.google.com)
  www.yahoo.com
  www.altavista.com
  Inktomi (www.inktomi.com)
  Statistics on search engines
    www.searchenginewatch.com
    http://imt.net/~notess/search
Taiwan
  Yahoo!/Kimo uses Google
  Openfind (www.openfind.com.tw) (Prof. 吳昇, National Chung Cheng University)
  Yam (www.yam.com.tw)

Page 22: WWW servers and search engines

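Search engines (cont.)
Centralized crawler-indexer architecture

Figure: centralized crawler-indexer architecture
  Query path: users, User Interface, Query Engine, Index database
  Indexing path: Web, Crawler, Indexer, Index database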

Page 23: WWW servers and search engines

User Interface
Query interface
  Keywords
  Boolean operators
Answer interface
  Rank the retrieved pages
    Statistics about term occurrence within the document
    Popularity
    Hyperlink information
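
The ranking signals listed above can be combined in many ways. The following toy sketch shows one simple possibility, chosen only for illustration: pages are ordered by how often the query terms occur, and ties are broken by the number of incoming hyperlinks. The page texts, link counts, and the function rank are all hypothetical.

```python
def rank(pages, inlink_counts, query_terms):
    """Order matching pages by term occurrences, then by number of incoming links."""
    scored = []
    for url, text in pages.items():
        words = text.lower().split()
        occurrences = sum(words.count(term.lower()) for term in query_terms)
        if occurrences:  # pages without any query term are not returned
            scored.append((occurrences, inlink_counts.get(url, 0), url))
    # Highest occurrence count first, then most in-links, then URL for stability.
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [url for _, _, url in scored]


# Hypothetical pages and hyperlink information.
PAGES = {
    "http://example.com/a": "search engines index the web",
    "http://example.com/b": "search engines rank pages",
    "http://example.com/c": "web directories classify pages",
    "http://example.com/d": "search search engines",
}
INLINK_COUNTS = {"http://example.com/a": 1, "http://example.com/b": 5}

if __name__ == "__main__":
    for url in rank(PAGES, INLINK_COUNTS, ["search", "engines"]):
        print(url)
```

Page d comes first (three occurrences); pages a and b both have two occurrences, so b wins on hyperlink count; page c is not returned.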

Page 24: WWW servers and search engines

Figure: centralized crawler-indexer architecture (users, User Interface, Query Engine, Index database, Indexer, Crawler, Web)

Page 25: WWW servers and search engines

Crawler
Also called robots, spiders, wanderers, walkers, and knowbots
Despite the name, a crawler runs on a local system and sends requests to remote Web servers
Method: start with a set of URLs, and from there extract other URLs
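
The method on this slide (start from a set of URLs, extract further URLs from each page) can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the "Web" here is a small in-memory dictionary so the example runs without a network, and the names LinkExtractor, crawl, and FAKE_WEB are made up.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first from the seeds; fetch(url) returns HTML or None."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # invalid link: nothing to index, nothing to extract
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical three-page Web; /missing.html is a dead link.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a> <a href="/missing.html">?</a>',
    "http://example.com/b.html": '<a href="/">home</a>',
}

if __name__ == "__main__":
    for url in crawl(["http://example.com/"], FAKE_WEB.get):
        print(url)
```

A real crawler would replace FAKE_WEB.get with a function that sends HTTP requests to remote Web servers.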

Page 26: WWW servers and search engines

Crawler (cont.)

Because of how the Web is traversed, the index of a search engine can be thought of as analogous to the stars in the sky: each page was seen at a different time, and some may no longer exist
Invalid links in search engines vary from 2% to 9%
The current fastest crawlers are able to traverse up to 10M Web pages per day ('98)
  300M / 10M = 30 days
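
The 30-day figure ignores the pages that appear while the crawl is running. The sketch below repeats the slide's calculation and then, as an extra assumption not made in the slides, spreads the 20M new pages per month evenly over a 30-day month to estimate how long a crawler would need to catch up.

```python
WEB_PAGES = 300_000_000           # size of the Web in 1998 (from the slides)
CRAWL_RATE_PER_DAY = 10_000_000   # fastest crawlers in 1998 (from the slides)
NEW_PAGES_PER_MONTH = 20_000_000  # growth rate (from the slides)
DAYS_PER_MONTH = 30               # assumption made for this sketch

# Calculation on the slide: time to traverse a Web that does not change.
full_crawl_days = WEB_PAGES / CRAWL_RATE_PER_DAY

# Extension: the crawler only gains on the Web at (crawl rate - growth rate).
growth_per_day = NEW_PAGES_PER_MONTH / DAYS_PER_MONTH
catch_up_days = WEB_PAGES / (CRAWL_RATE_PER_DAY - growth_per_day)

if __name__ == "__main__":
    print(full_crawl_days)          # 30.0
    print(round(catch_up_days, 1))  # 32.1
```

Under these assumptions, growth adds about two days to a full traversal.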

Page 27: WWW servers and search engines

Web directories
Classify the Web pages by categories
Directories are hierarchical taxonomies that classify human knowledge
Yahoo! has close to 1M pages classified
How to classify pages?
  Pages have to be submitted to the Web directories
  Classification is done manually by a few people
  Automatic classification is not yet mature
Not every page is classified

Page 28: WWW servers and search engines

Some Web directories

Web directory    URL                 Web sites (K)   Categories
Yahoo!           www.yahoo.com       750
LookSmart        www.looksmart.com   300             24
Lycos Subjects   a2z.lycos.com       50
eBLAST           www.eblast.com      125
NewHoo           www.newhoo.com      100             23
Magellan         www.mckinley.com    60
Netscape         www.netscape.com
Snap             www.snap.com

Page 29: WWW servers and search engines

Lab about search engines
Today 1:00~3:00

Page 30: WWW servers and search engines

Final typing test
10/20
Students who do not reach the standard lose 10 points from their semester total score