Transcript
Page 1: Parsing HTML Topic 3, Chapter 7 Network Programming Kansas State University at Salina

Parsing HTML

Topic 3, Chapter 7

Network Programming

Kansas State University at Salina

Page 2:

Picking information from an HTML page

A difficult problem
  HTML defines page layout, not content (an advantage of XML)
Very useful because of the volume of data available
If the format of the page changes, your program is broken.

Page 3:

HTML

Definition: Token: one piece of information in an HTML formatted page
  HTML tag: usually only relates to formatting
  URL or image reference
  Textual information

Must look at several tokens to determine context of the data

Start-tag, End-tag structure leads parsing code to use finite state machines and stacks.

( <TABLE> … </TABLE> )
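The stack idea can be sketched as follows. This is a minimal illustration, not code from the course: it is written for Python 3 (where the parser lives in html.parser), and the names NestingTracker, VOID_TAGS and text_with_context are hypothetical. Each start-tag is pushed, each end-tag pops back to its match, and every piece of text is recorded together with the tags that were open when it was seen.

```python
from html.parser import HTMLParser

# Elements that never take an end-tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class NestingTracker(HTMLParser):
    """Record the stack of open tags each time text is seen."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.found = []   # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start-tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


def text_with_context(html):
    """Return (path, text) pairs for every non-blank piece of text."""
    tracker = NestingTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.found


page = "<TABLE><TR><TD><H1>Tim Bower</H1></TD><TD>Salina</TD></TR></TABLE>"
for path, text in text_with_context(page):
    print(path, "->", text)
```

The path recorded with each text token is the context the slide refers to: the same words mean something different inside table/tr/td/h1 than inside a plain table cell.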

Page 4:

Tokens

{'data': [], 'type': 'StartTag', 'name': u'html'}
{'data': [], 'type': 'StartTag', 'name': u'head'}
{'data': u'\n ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'title'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'title'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'EndTag', 'name': u'head'}
{'data': u'\n\n', 'type': 'SpaceCharacters'}
{'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}
{'data': u' \n\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'table'}
{'data': u' ', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'tbody'}
{'data': [], 'type': 'StartTag', 'name': u'tr'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'td'}
{'data': u'\n', 'type': 'SpaceCharacters'}
{'data': [], 'type': 'StartTag', 'name': u'h1'}
{'data': u'Tim Bower', 'type': 'Characters'}
{'data': [], 'type': 'EndTag', 'name': u'h1'}

<HTML>

<HEAD>

<TITLE> Tim Bower </TITLE>

</HEAD>

<BODY BGCOLOR="lightyellow">

<TABLE> <TR>

<TD>

<H1>Tim Bower</H1>

Page 5:

Two main programming strategies

The call-back approach (HTMLParser shown in text book)
  Define your own class that extends the HTMLParser class
  Nice use of inheritance and polymorphism
  Pass the HTML page to the parser and it calls functions from your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous tags.

The document tree approach
  Parser builds a tree (data structure object) based on the page contents
  You iterate through the tree, or a list of tokens taken from the tree, looking for desired data.

Page 6:

HTMLParser

import sys
from HTMLParser import HTMLParser   # Python 2 module and class share a name

class TitleParser(HTMLParser):
    def __init__(self):
        self.title = ''
        self.readingtitle = 0       # 1 while inside <title> ... </title>
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.readingtitle = 1

    def handle_data(self, data):
        # Accumulate text only while inside the title element
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            print "*** %s ***" % self.title
            self.readingtitle = 0

fd = open(sys.argv[1])
tp = TitleParser()
tp.feed(fd.read())
fd.close()
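The attrs argument of handle_starttag is not used by TitleParser. As a rough sketch of what it carries, the class below collects URL and image references from start-tags. It assumes Python 3, where the class is imported from html.parser rather than from the Python 2 HTMLParser module used on this slide; LinkParser and find_links are hypothetical names chosen for the illustration.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect href and src values found in start-tags."""

    def __init__(self):
        super().__init__()
        self.links = []   # (tag, url) pairs in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; the parser
        # lower-cases tag and attribute names but not the values.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append((tag, value))


def find_links(html):
    """Return every (tag, url) pair found in the page text."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


sample = ('<BODY BGCOLOR="lightyellow">'
          '<A HREF="index.html">Home</A><IMG SRC="tim.png"></BODY>')
print(find_links(sample))
```

The same call-back structure applies as in TitleParser: the parser drives, and the subclass only reacts to the events it cares about.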

Page 7:

Argh! HTMLParser is fragile and hard to debug.

Traceback (most recent call last):
  File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\Topic 3 - Web\weatherParser.py", line 258, in <module>
    parser.feed(data)
  File "C:\Python25\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python25\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python25\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python25\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python25\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParseError: malformed start tag, at line 120, column 477

Page 8:

html5lib

Found on the Python Package Index
  Install setuptools, then use Python to install html5lib (see the README file).
  Both are on K-State Online.

Advantages:
  Robust, standards-based parser
  Filtering data after the page is parsed is easier to follow and debug than the call-back approach

Disadvantage:
  Little documentation of the API for traversing the tree

Page 9:

html5lib Usage

Build the tree:

import html5lib
from html5lib import treebuilders, treewalkers

p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
f = open("weather.html", "r")
dom_tree = p.parse(f)
f.close()

Loop through tokens:

walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)
passtags = [u'a', u'h1', u'h2', u'h3', u'h4', u'em',
            u'strong', u'br', u'img',
            u'dl', u'dt', u'dd']

for token in stream:
    # Don't show uninteresting tags
    if 'name' in token:
        if token['name'] in passtags:
            continue
    print token

Page 10:

The DOM tree alternative

The DOM tree may be used directly.
  Not documented with html5lib, but the xml.dom package is standard with Python.
  DOM trees are normally used with XML, but html5lib can make a DOM tree from HTML.

Walk through the tree by examining the child nodes of each node.

With knowledge of the page structure, you may be able to go almost directly to the desired information.

See chapter 8 and the posted file DOMtry.py.
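A small sketch of walking a DOM tree through child nodes is given here. It is not the posted DOMtry.py file. To stay within the standard library it builds the tree with xml.dom.minidom.parseString from a short, well-formed page rather than with html5lib, and it is written for Python 3; the helper names walk and first_text are hypothetical.

```python
from xml.dom import minidom


def first_text(node):
    """Return the text of the first text-node child, or ''."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            return child.data.strip()
    return ""


def walk(node, depth=0, lines=None):
    """Depth-first walk; record each element name indented by depth."""
    if lines is None:
        lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1
    for child in node.childNodes:
        walk(child, depth, lines)
    return lines


page = ("<html><head><title>Tim Bower</title></head>"
        "<body><table><tr><td><h1>Tim Bower</h1></td></tr></table></body></html>")
dom_tree = minidom.parseString(page)

# Full walk: every element, indented to show the nesting.
print("\n".join(walk(dom_tree)))

# Going almost directly to the data when the page structure is known.
title = dom_tree.getElementsByTagName("title")[0]
print(first_text(title))
```

The full walk is useful for discovering the structure of an unfamiliar page; once the structure is known, a call such as getElementsByTagName reaches the data with far less code.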

Page 11:

html5lib tokens

Stream of tokens is a list

Each token is a dictionary
  token['data'] is one of:
    String (unicode encoding)
    Empty list
    List of tuples for formatting attributes
  token['type']: StartTag, EndTag, Characters, SpaceCharacters
  token['name']: description of start and end tags (table, tr, td, h1, br, ul, li, …)

See the example of tokens on an earlier slide (Page 4)
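To make the three shapes of token['data'] concrete, here is a small sketch that handles each one. The token dictionaries are typed by hand in the layout shown in these slides, not produced by html5lib, and other html5lib versions may lay tokens out differently. The function name describe is hypothetical, and the code is written for Python 3.

```python
def describe(token):
    """Summarize one token dictionary as a short string."""
    kind = token["type"]
    data = token["data"]
    if kind == "StartTag":
        # data is [] or a list of (name, value) tuples
        attrs = dict(data)
        return "start %s %r" % (token["name"], attrs)
    if kind == "EndTag":
        return "end %s" % token["name"]
    if kind == "Characters":
        # data is the text itself
        return "text %r" % data
    # SpaceCharacters: whitespace only, usually ignored
    return "space"


tokens = [
    {"data": [], "type": "StartTag", "name": "title"},
    {"data": " ", "type": "SpaceCharacters"},
    {"data": "Tim Bower", "type": "Characters"},
    {"data": [], "type": "EndTag", "name": "title"},
    {"data": [("bgcolor", "lightyellow")], "type": "StartTag", "name": "body"},
]

for token in tokens:
    print(describe(token))
```

Note that only StartTag and EndTag tokens carry a 'name' key, which is why code that reads token['name'] has to check the type (or the key) first.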

Page 12:

html5lib token parsing

doingTitle = False
for token in stream:
    if 'name' in token:
        if token['name'] in passtags:
            continue
        else:
            tName = token['name']
    tType = token['type']
    if tType == 'StartTag':
        if tName == u'title':
            title = ''
            doingTitle = True
    if tType == 'EndTag':
        if tName == u'title':
            print "*** %s ***\n" % title
            doingTitle = False
    if tType == 'Characters':
        if doingTitle:
            title += token['data']