Parsing HTML Topic 3, Chapter 7 Network Programming Kansas State University at Salina

  • Parsing HTML
    Topic 3, Chapter 7
    Network Programming
    Kansas State University at Salina

  • Picking information from an HTML page
    - A difficult problem
    - HTML defines page layout, not content (an advantage of XML)
    - Very useful because of the volume of data available
    - If the format of the page changes, your program is broken.

  • HTML
    - Definition: a token is one piece of information in an HTML-formatted page
        - An HTML tag, which usually only relates to formatting
        - A URL or image reference
        - Textual information
    - You must look at several tokens to determine the context of the data.
    - The start-tag, end-tag structure leads parsing code to use finite state machines and stacks.
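
    The stack idea can be sketched as follows. This is a minimal illustration, not code from the slides: it assumes Python 3, where the standard parser lives in html.parser, and the names TagStackParser and text_with_context are made up for the example. Each piece of text is recorded together with the chain of tags that were open when it was seen.

```python
from html.parser import HTMLParser


class TagStackParser(HTMLParser):
    """Hypothetical helper: records the open-tag context of each text token."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.found = []   # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; this also discards
        # inner tags that were never closed (for example <br>).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.found.append(("/".join(self.stack), text))


def text_with_context(page):
    parser = TagStackParser()
    parser.feed(page)
    parser.close()
    return parser.found


sample = ("<html><head><title>Tim Bower</title></head>"
          "<body><h1>Tim Bower</h1></body></html>")
for path, text in text_with_context(sample):
    print(path, "->", text)
```

    The same text appears twice in the sample page, and only the stack contents tell the title apart from the heading.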

  • Tokens
    {'data': [], 'type': 'StartTag', 'name': u'html'}
    {'data': [], 'type': 'StartTag', 'name': u'head'}
    {'data': u'\n ', 'type': 'SpaceCharacters'}
    {'data': [], 'type': 'StartTag', 'name': u'title'}
    {'data': u' ', 'type': 'SpaceCharacters'}
    {'data': u'Tim Bower', 'type': 'Characters'}
    {'data': u' ', 'type': 'SpaceCharacters'}
    {'data': [], 'type': 'EndTag', 'name': u'title'}
    {'data': u'\n', 'type': 'SpaceCharacters'}
    {'data': [], 'type': 'EndTag', 'name': u'head'}
    {'data': u'\n\n', 'type': 'SpaceCharacters'}
    {'data': [(u'bgcolor', u'lightyellow')], 'type': 'StartTag', 'name': u'body'}
    {'data': u' \n\n', 'type': 'SpaceCharacters'}
    {'data': [], 'type': 'StartTag', 'name': u'table'}
    {'data': u' ', 'type': 'SpaceCharacters'}
    {'data': [], 'type': 'StartTag', 'name': u'tbody'}
    {'data': [], 'type': 'StartTag', 'name': u'tr'}
    {'data': u'\n', 'type': 'SpaceCharacters'}
    {'data': [], 'type': 'StartTag', 'name': u'td'}
    {'data': u'\n', 'type': 'SpaceCharacters'}
    {'data': [], 'type': 'StartTag', 'name': u'h1'}
    {'data': u'Tim Bower', 'type': 'Characters'}
    {'data': [], 'type': 'EndTag', 'name': u'h1'}


  • Two main programming strategies
    - The call-back approach (HTMLParser, shown in the textbook)
        - Define your own class that extends the HTMLParser class.
        - Nice use of inheritance and polymorphism.
        - Pass the HTML page to the parser, and it calls functions from your class as needed to process the start-tags, data elements, end-tags and a few other miscellaneous tags.
    - The document tree approach
        - The parser builds a tree (a data structure object) based on the page contents.
        - You iterate through the tree, or a list of tokens taken from the tree, looking for the desired data.

  • HTMLParser

    import sys
    from HTMLParser import HTMLParser   # Python 2 module name

    class TitleParser(HTMLParser):
        def __init__(self):
            self.title = ''
            self.readingtitle = 0
            HTMLParser.__init__(self)

        def handle_starttag(self, tag, attrs):
            # Entering <title>: start collecting text
            if tag == 'title':
                self.readingtitle = 1

        def handle_data(self, data):
            if self.readingtitle:
                self.title += data

        def handle_endtag(self, tag):
            # Leaving <title>: print what was collected
            if tag == 'title':
                print "*** %s ***" % self.title
                self.readingtitle = 0

    fd = open(sys.argv[1])
    tp = TitleParser()
    tp.feed(fd.read())   # the slide text breaks off at "tp.feed("; fd.read() is assumed
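
    For readers on Python 3, where the module was renamed html.parser, the same call-back idea might look like the sketch below. It is an adaptation under that assumption, not the textbook listing: it parses a string instead of a file named on the command line, and the helper find_title is a hypothetical name.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.title = ''
        self.readingtitle = False

    def handle_starttag(self, tag, attrs):
        # Called by the parser for every start tag
        if tag == 'title':
            self.readingtitle = True

    def handle_data(self, data):
        # Called for text; keep it only while inside <title>
        if self.readingtitle:
            self.title += data

    def handle_endtag(self, tag):
        if tag == 'title':
            self.readingtitle = False


def find_title(page):
    """Hypothetical helper: feed a page and return its stripped title text."""
    tp = TitleParser()
    tp.feed(page)
    tp.close()
    return tp.title.strip()


page = "<html><head><title> Tim Bower </title></head></html>"
print("*** %s ***" % find_title(page))
```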

  • Argh! HTMLParser is fragile and hard to debug.

    Traceback (most recent call last):
      File "C:\Users\tim\Documents\Classes\Net_Programming\Source_code\Topic 3 - Web\", line 258, in <module>
        parser.feed(data)
      File "C:\Python25\lib\", line 108, in feed
        self.goahead(0)
      File "C:\Python25\lib\", line 148, in goahead
        k = self.parse_starttag(i)
      File "C:\Python25\lib\", line 226, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "C:\Python25\lib\", line 301, in check_for_whole_start_tag
        self.error("malformed start tag")
      File "C:\Python25\lib\", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParseError: malformed start tag, at line 120, column 477

  • html5lib
    - Found on the Python Package Index
    - Install setuptools, then use Python to install html5lib (see the README file). Both are on K-State Online.
    - Advantages:
        - Robust, standards-based parser
        - Filtering data after the page is parsed is easier to follow and debug than the call-back approach
    - Disadvantage:
        - Documentation of the API for traversing the tree is lacking

  • html5lib Usage

    import html5lib
    from html5lib import treebuilders, treewalkers

    # Build the tree:
    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
    f = open("weather.html", "r")
    dom_tree = p.parse(f)
    f.close()

    # Loop through tokens:
    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)
    passtags = [u'a', u'h1', u'h2', u'h3', u'h4', u'em',
                u'strong', u'br', u'img',
                u'dl', u'dt', u'dd']
    for token in stream:
        # Don't show uninteresting stuff
        if token.has_key('name'):
            if token['name'] in passtags:
                continue
        print token

  • The DOM tree alternative
    - The DOM tree may be used directly.
    - Not documented with html5lib, but the xml.dom package is standard with Python.
    - DOM trees are normally used with XML, but html5lib can make a DOM tree from HTML.
    - Walk through the tree by examining the child nodes of each node. With knowledge of the page structure, you may be able to go almost directly to the desired information.
    - See chapter 8 and the posted file.
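
    Walking the child nodes can be sketched with the standard library alone. This is an illustration under stated assumptions, not the posted course file: it uses Python 3 and xml.dom.minidom on a small well-formed page instead of an html5lib-built tree, and the helpers find_first and node_text are made-up names.

```python
from xml.dom import minidom, Node


def find_first(node, tag_name):
    """Depth-first search of child nodes for the first element named tag_name."""
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            if child.tagName == tag_name:
                return child
            hit = find_first(child, tag_name)
            if hit is not None:
                return hit
    return None


def node_text(node):
    """Concatenate the text nodes directly under node."""
    return "".join(c.data for c in node.childNodes
                   if c.nodeType == Node.TEXT_NODE)


# Sample page, assumed well-formed so that minidom can parse it
page = ("<html><head><title>Tim Bower</title></head>"
        "<body><table><tr><td><h1>Tim Bower</h1></td></tr></table></body></html>")
dom_tree = minidom.parseString(page)
h1 = find_first(dom_tree, "h1")
print(node_text(h1))
```

    When the page structure is known, the search can start from a node deeper in the tree (for example the table cell) rather than from the document root.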

  • html5lib tokens
    - The stream of tokens is traversed like a list
    - Each token is a dictionary
    - token['data'] is one of:
        - String (unicode encoding)
        - Empty list
        - List of tuples for formatting attributes
    - token['type']: StartTag, EndTag, Characters, SpaceCharacters
    - token['name']: the name of a start or end tag (table, tr, td, h1, br, ul, li, ...)
    - See the example of tokens on the earlier Tokens slide

  • html5lib token parsing

    doingTitle = False
    for token in stream:
        if token.has_key('name'):
            if token['name'] in passtags:
                continue
            else:
                tName = token['name']
        tType = token['type']
        if tType == 'StartTag':
            if tName == u'title':
                title = ''
                doingTitle = True
        if tType == 'EndTag':
            if tName == u'title':
                print "*** %s ***\n" % title
                doingTitle = False
        if tType == 'Characters':
            if doingTitle:
                title += token['data']
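
    The token loop can also be tried without html5lib installed by handing it token dictionaries written by hand. The sketch below assumes Python 3 and tokens shaped like those on the Tokens slide; the function name title_from_tokens is hypothetical, and it returns the title instead of printing it.

```python
def title_from_tokens(stream):
    """Return the text between the title StartTag and EndTag tokens."""
    title = ''
    doing_title = False
    for token in stream:
        t_type = token['type']
        t_name = token.get('name')   # character tokens carry no 'name'
        if t_type == 'StartTag' and t_name == 'title':
            title = ''
            doing_title = True
        elif t_type == 'EndTag' and t_name == 'title':
            doing_title = False
        elif t_type == 'Characters' and doing_title:
            title += token['data']
    return title


# Hand-written tokens, modeled on the Tokens slide
tokens = [
    {'data': [], 'type': 'StartTag', 'name': 'title'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': ' ', 'type': 'SpaceCharacters'},
    {'data': [], 'type': 'EndTag', 'name': 'title'},
    {'data': [], 'type': 'StartTag', 'name': 'h1'},
    {'data': 'Tim Bower', 'type': 'Characters'},
    {'data': [], 'type': 'EndTag', 'name': 'h1'},
]
print("*** %s ***" % title_from_tokens(tokens))
```

    SpaceCharacters tokens are skipped, so the padding around the title text is dropped, and the h1 text is ignored because it arrives after the title EndTag.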