Annotation of MPIWGWeb/BeautifulSoup.py, revision 1.1.2.1

1.1.2.1 ! dwinter     1: """Beautiful Soup
        !             2: Elixir and Tonic
        !             3: "The Screen-Scraper's Friend"
        !             4: http://www.crummy.com/software/BeautifulSoup/
        !             5: 
        !             6: Beautiful Soup parses a (possibly invalid) XML or HTML document into a
        !             7: tree representation. It provides methods and Pythonic idioms that make
        !             8: it easy to navigate, search, and modify the tree.
        !             9: 
        !            10: A well-formed XML/HTML document yields a well-formed data
        !            11: structure. An ill-formed XML/HTML document yields a correspondingly
        !            12: ill-formed data structure. If your document is only locally
        !            13: well-formed, you can use this library to find and process the
        !            14: well-formed part of it.
        !            15: 
        !            16: Beautiful Soup works with Python 2.2 and up. It has no external
        !            17: dependencies, but you'll have more success at converting data to UTF-8
        !            18: if you also install these three packages:
        !            19: 
        !            20: * chardet, for auto-detecting character encodings
        !            21:   http://chardet.feedparser.org/
        !            22: * cjkcodecs and iconv_codec, which add more encodings to the ones supported
        !            23:   by stock Python.
        !            24:   http://cjkpython.i18n.org/
        !            25: 
        !            26: Beautiful Soup defines classes for two main parsing strategies:
        !            27: 
        !            28:  * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
        !            29:    language that kind of looks like XML.
        !            30: 
        !            31:  * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
        !            32:    or invalid. This class has web browser-like heuristics for
        !            33:    obtaining a sensible parse tree in the face of common HTML errors.
        !            34: 
        !            35: Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
        !            36: the encoding of an HTML or XML document, and converting it to
        !            37: Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.
        !            38: 
        !            39: For more than you ever wanted to know about Beautiful Soup, see the
        !            40: documentation:
        !            41: http://www.crummy.com/software/BeautifulSoup/documentation.html
        !            42: 
        !            43: Here, have some legalese:
        !            44: 
        !            45: Copyright (c) 2004-2010, Leonard Richardson
        !            46: 
        !            47: All rights reserved.
        !            48: 
        !            49: Redistribution and use in source and binary forms, with or without
        !            50: modification, are permitted provided that the following conditions are
        !            51: met:
        !            52: 
        !            53:   * Redistributions of source code must retain the above copyright
        !            54:     notice, this list of conditions and the following disclaimer.
        !            55: 
        !            56:   * Redistributions in binary form must reproduce the above
        !            57:     copyright notice, this list of conditions and the following
        !            58:     disclaimer in the documentation and/or other materials provided
        !            59:     with the distribution.
        !            60: 
        !            61:   * Neither the name of the the Beautiful Soup Consortium and All
        !            62:     Night Kosher Bakery nor the names of its contributors may be
        !            63:     used to endorse or promote products derived from this software
        !            64:     without specific prior written permission.
        !            65: 
        !            66: THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
        !            67: "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
        !            68: LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
        !            69: A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
        !            70: CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
        !            71: EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
        !            72: PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
        !            73: PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
        !            74: LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
        !            75: NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
        !            76: SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.
        !            77: 
        !            78: """
        !            79: from __future__ import generators
        !            80: 
        !            81: __author__ = "Leonard Richardson (leonardr@segfault.org)"
        !            82: __version__ = "3.2.0"
        !            83: __copyright__ = "Copyright (c) 2004-2010 Leonard Richardson"
        !            84: __license__ = "New-style BSD"
        !            85: 
        !            86: from sgmllib import SGMLParser, SGMLParseError
        !            87: import codecs
        !            88: import markupbase
        !            89: import types
        !            90: import re
        !            91: import sgmllib
        !            92: try:
        !            93:   from htmlentitydefs import name2codepoint
        !            94: except ImportError:
        !            95:   name2codepoint = {}
        !            96: try:
        !            97:     set
        !            98: except NameError:
        !            99:     from sets import Set as set
        !           100: 
        !           101: #These hacks make Beautiful Soup able to parse XML with namespaces
        !           102: sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
        !           103: markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match
        !           104: 
        !           105: DEFAULT_OUTPUT_ENCODING = "utf-8"
        !           106: 
        !           107: def _match_css_class(str):
        !           108:     """Build a RE to match the given CSS class."""
        !           109:     return re.compile(r"(^|.*\s)%s($|\s)" % str)
        !           110: 
        !           111: # First, the classes that represent markup elements.
        !           112: 
        !           113: class PageElement(object):
        !           114:     """Contains the navigational information for some part of the page
        !           115:     (either a tag or a piece of text)"""
        !           116: 
        !           117:     def setup(self, parent=None, previous=None):
        !           118:         """Sets up the initial relations between this element and
        !           119:         other elements."""
        !           120:         self.parent = parent
        !           121:         self.previous = previous
        !           122:         self.next = None
        !           123:         self.previousSibling = None
        !           124:         self.nextSibling = None
        !           125:         if self.parent and self.parent.contents:
        !           126:             self.previousSibling = self.parent.contents[-1]
        !           127:             self.previousSibling.nextSibling = self
        !           128: 
        !           129:     def replaceWith(self, replaceWith):
        !           130:         oldParent = self.parent
        !           131:         myIndex = self.parent.index(self)
        !           132:         if hasattr(replaceWith, "parent")\
        !           133:                   and replaceWith.parent is self.parent:
        !           134:             # We're replacing this element with one of its siblings.
        !           135:             index = replaceWith.parent.index(replaceWith)
        !           136:             if index and index < myIndex:
        !           137:                 # Furthermore, it comes before this element. That
        !           138:                 # means that when we extract it, the index of this
        !           139:                 # element will change.
        !           140:                 myIndex = myIndex - 1
        !           141:         self.extract()
        !           142:         oldParent.insert(myIndex, replaceWith)
        !           143: 
        !           144:     def replaceWithChildren(self):
        !           145:         myParent = self.parent
        !           146:         myIndex = self.parent.index(self)
        !           147:         self.extract()
        !           148:         reversedChildren = list(self.contents)
        !           149:         reversedChildren.reverse()
        !           150:         for child in reversedChildren:
        !           151:             myParent.insert(myIndex, child)
        !           152: 
        !           153:     def extract(self):
        !           154:         """Destructively rips this element out of the tree."""
        !           155:         if self.parent:
        !           156:             try:
        !           157:                 del self.parent.contents[self.parent.index(self)]
        !           158:             except ValueError:
        !           159:                 pass
        !           160: 
        !           161:         #Find the two elements that would be next to each other if
        !           162:         #this element (and any children) hadn't been parsed. Connect
        !           163:         #the two.
        !           164:         lastChild = self._lastRecursiveChild()
        !           165:         nextElement = lastChild.next
        !           166: 
        !           167:         if self.previous:
        !           168:             self.previous.next = nextElement
        !           169:         if nextElement:
        !           170:             nextElement.previous = self.previous
        !           171:         self.previous = None
        !           172:         lastChild.next = None
        !           173: 
        !           174:         self.parent = None
        !           175:         if self.previousSibling:
        !           176:             self.previousSibling.nextSibling = self.nextSibling
        !           177:         if self.nextSibling:
        !           178:             self.nextSibling.previousSibling = self.previousSibling
        !           179:         self.previousSibling = self.nextSibling = None
        !           180:         return self
        !           181: 
        !           182:     def _lastRecursiveChild(self):
        !           183:         "Finds the last element beneath this object to be parsed."
        !           184:         lastChild = self
        !           185:         while hasattr(lastChild, 'contents') and lastChild.contents:
        !           186:             lastChild = lastChild.contents[-1]
        !           187:         return lastChild
        !           188: 
        !           189:     def insert(self, position, newChild):
        !           190:         if isinstance(newChild, basestring) \
        !           191:             and not isinstance(newChild, NavigableString):
        !           192:             newChild = NavigableString(newChild)
        !           193: 
        !           194:         position =  min(position, len(self.contents))
        !           195:         if hasattr(newChild, 'parent') and newChild.parent is not None:
        !           196:             # We're 'inserting' an element that's already one
        !           197:             # of this object's children.
        !           198:             if newChild.parent is self:
        !           199:                 index = self.index(newChild)
        !           200:                 if index > position:
        !           201:                     # Furthermore we're moving it further down the
        !           202:                     # list of this object's children. That means that
        !           203:                     # when we extract this element, our target index
        !           204:                     # will jump down one.
        !           205:                     position = position - 1
        !           206:             newChild.extract()
        !           207: 
        !           208:         newChild.parent = self
        !           209:         previousChild = None
        !           210:         if position == 0:
        !           211:             newChild.previousSibling = None
        !           212:             newChild.previous = self
        !           213:         else:
        !           214:             previousChild = self.contents[position-1]
        !           215:             newChild.previousSibling = previousChild
        !           216:             newChild.previousSibling.nextSibling = newChild
        !           217:             newChild.previous = previousChild._lastRecursiveChild()
        !           218:         if newChild.previous:
        !           219:             newChild.previous.next = newChild
        !           220: 
        !           221:         newChildsLastElement = newChild._lastRecursiveChild()
        !           222: 
        !           223:         if position >= len(self.contents):
        !           224:             newChild.nextSibling = None
        !           225: 
        !           226:             parent = self
        !           227:             parentsNextSibling = None
        !           228:             while not parentsNextSibling:
        !           229:                 parentsNextSibling = parent.nextSibling
        !           230:                 parent = parent.parent
        !           231:                 if not parent: # This is the last element in the document.
        !           232:                     break
        !           233:             if parentsNextSibling:
        !           234:                 newChildsLastElement.next = parentsNextSibling
        !           235:             else:
        !           236:                 newChildsLastElement.next = None
        !           237:         else:
        !           238:             nextChild = self.contents[position]
        !           239:             newChild.nextSibling = nextChild
        !           240:             if newChild.nextSibling:
        !           241:                 newChild.nextSibling.previousSibling = newChild
        !           242:             newChildsLastElement.next = nextChild
        !           243: 
        !           244:         if newChildsLastElement.next:
        !           245:             newChildsLastElement.next.previous = newChildsLastElement
        !           246:         self.contents.insert(position, newChild)
        !           247: 
        !           248:     def append(self, tag):
        !           249:         """Appends the given tag to the contents of this tag."""
        !           250:         self.insert(len(self.contents), tag)
        !           251: 
        !           252:     def findNext(self, name=None, attrs={}, text=None, **kwargs):
        !           253:         """Returns the first item that matches the given criteria and
        !           254:         appears after this Tag in the document."""
        !           255:         return self._findOne(self.findAllNext, name, attrs, text, **kwargs)
        !           256: 
        !           257:     def findAllNext(self, name=None, attrs={}, text=None, limit=None,
        !           258:                     **kwargs):
        !           259:         """Returns all items that match the given criteria and appear
        !           260:         after this Tag in the document."""
        !           261:         return self._findAll(name, attrs, text, limit, self.nextGenerator,
        !           262:                              **kwargs)
        !           263: 
        !           264:     def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
        !           265:         """Returns the closest sibling to this Tag that matches the
        !           266:         given criteria and appears after this Tag in the document."""
        !           267:         return self._findOne(self.findNextSiblings, name, attrs, text,
        !           268:                              **kwargs)
        !           269: 
        !           270:     def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
        !           271:                          **kwargs):
        !           272:         """Returns the siblings of this Tag that match the given
        !           273:         criteria and appear after this Tag in the document."""
        !           274:         return self._findAll(name, attrs, text, limit,
        !           275:                              self.nextSiblingGenerator, **kwargs)
        !           276:     fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x
        !           277: 
        !           278:     def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
        !           279:         """Returns the first item that matches the given criteria and
        !           280:         appears before this Tag in the document."""
        !           281:         return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)
        !           282: 
        !           283:     def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
        !           284:                         **kwargs):
        !           285:         """Returns all items that match the given criteria and appear
        !           286:         before this Tag in the document."""
        !           287:         return self._findAll(name, attrs, text, limit, self.previousGenerator,
        !           288:                            **kwargs)
        !           289:     fetchPrevious = findAllPrevious # Compatibility with pre-3.x
        !           290: 
        !           291:     def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
        !           292:         """Returns the closest sibling to this Tag that matches the
        !           293:         given criteria and appears before this Tag in the document."""
        !           294:         return self._findOne(self.findPreviousSiblings, name, attrs, text,
        !           295:                              **kwargs)
        !           296: 
        !           297:     def findPreviousSiblings(self, name=None, attrs={}, text=None,
        !           298:                              limit=None, **kwargs):
        !           299:         """Returns the siblings of this Tag that match the given
        !           300:         criteria and appear before this Tag in the document."""
        !           301:         return self._findAll(name, attrs, text, limit,
        !           302:                              self.previousSiblingGenerator, **kwargs)
        !           303:     fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x
        !           304: 
        !           305:     def findParent(self, name=None, attrs={}, **kwargs):
        !           306:         """Returns the closest parent of this Tag that matches the given
        !           307:         criteria."""
        !           308:         # NOTE: We can't use _findOne because findParents takes a different
        !           309:         # set of arguments.
        !           310:         r = None
        !           311:         l = self.findParents(name, attrs, 1)
        !           312:         if l:
        !           313:             r = l[0]
        !           314:         return r
        !           315: 
        !           316:     def findParents(self, name=None, attrs={}, limit=None, **kwargs):
        !           317:         """Returns the parents of this Tag that match the given
        !           318:         criteria."""
        !           319: 
        !           320:         return self._findAll(name, attrs, None, limit, self.parentGenerator,
        !           321:                              **kwargs)
        !           322:     fetchParents = findParents # Compatibility with pre-3.x
        !           323: 
        !           324:     #These methods do the real heavy lifting.
        !           325: 
        !           326:     def _findOne(self, method, name, attrs, text, **kwargs):
        !           327:         r = None
        !           328:         l = method(name, attrs, text, 1, **kwargs)
        !           329:         if l:
        !           330:             r = l[0]
        !           331:         return r
        !           332: 
        !           333:     def _findAll(self, name, attrs, text, limit, generator, **kwargs):
        !           334:         "Iterates over a generator looking for things that match."
        !           335: 
        !           336:         if isinstance(name, SoupStrainer):
        !           337:             strainer = name
        !           338:         # (Possibly) special case some findAll*(...) searches
        !           339:         elif text is None and not limit and not attrs and not kwargs:
        !           340:             # findAll*(True)
        !           341:             if name is True:
        !           342:                 return [element for element in generator()
        !           343:                         if isinstance(element, Tag)]
        !           344:             # findAll*('tag-name')
        !           345:             elif isinstance(name, basestring):
        !           346:                 return [element for element in generator()
        !           347:                         if isinstance(element, Tag) and
        !           348:                         element.name == name]
        !           349:             else:
        !           350:                 strainer = SoupStrainer(name, attrs, text, **kwargs)
        !           351:         # Build a SoupStrainer
        !           352:         else:
        !           353:             strainer = SoupStrainer(name, attrs, text, **kwargs)
        !           354:         results = ResultSet(strainer)
        !           355:         g = generator()
        !           356:         while True:
        !           357:             try:
        !           358:                 i = g.next()
        !           359:             except StopIteration:
        !           360:                 break
        !           361:             if i:
        !           362:                 found = strainer.search(i)
        !           363:                 if found:
        !           364:                     results.append(found)
        !           365:                     if limit and len(results) >= limit:
        !           366:                         break
        !           367:         return results
        !           368: 
        !           369:     #These Generators can be used to navigate starting from both
        !           370:     #NavigableStrings and Tags.
        !           371:     def nextGenerator(self):
        !           372:         i = self
        !           373:         while i is not None:
        !           374:             i = i.next
        !           375:             yield i
        !           376: 
        !           377:     def nextSiblingGenerator(self):
        !           378:         i = self
        !           379:         while i is not None:
        !           380:             i = i.nextSibling
        !           381:             yield i
        !           382: 
        !           383:     def previousGenerator(self):
        !           384:         i = self
        !           385:         while i is not None:
        !           386:             i = i.previous
        !           387:             yield i
        !           388: 
        !           389:     def previousSiblingGenerator(self):
        !           390:         i = self
        !           391:         while i is not None:
        !           392:             i = i.previousSibling
        !           393:             yield i
        !           394: 
        !           395:     def parentGenerator(self):
        !           396:         i = self
        !           397:         while i is not None:
        !           398:             i = i.parent
        !           399:             yield i
        !           400: 
        !           401:     # Utility methods
        !           402:     def substituteEncoding(self, str, encoding=None):
        !           403:         encoding = encoding or "utf-8"
        !           404:         return str.replace("%SOUP-ENCODING%", encoding)
        !           405: 
        !           406:     def toEncoding(self, s, encoding=None):
        !           407:         """Encodes an object to a string in some encoding, or to Unicode.
        !           408:         ."""
        !           409:         if isinstance(s, unicode):
        !           410:             if encoding:
        !           411:                 s = s.encode(encoding)
        !           412:         elif isinstance(s, str):
        !           413:             if encoding:
        !           414:                 s = s.encode(encoding)
        !           415:             else:
        !           416:                 s = unicode(s)
        !           417:         else:
        !           418:             if encoding:
        !           419:                 s  = self.toEncoding(str(s), encoding)
        !           420:             else:
        !           421:                 s = unicode(s)
        !           422:         return s
        !           423: 
        !           424: class NavigableString(unicode, PageElement):
        !           425: 
        !           426:     def __new__(cls, value):
        !           427:         """Create a new NavigableString.
        !           428: 
        !           429:         When unpickling a NavigableString, this method is called with
        !           430:         the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
        !           431:         passed in to the superclass's __new__ or the superclass won't know
        !           432:         how to handle non-ASCII characters.
        !           433:         """
        !           434:         if isinstance(value, unicode):
        !           435:             return unicode.__new__(cls, value)
        !           436:         return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)
        !           437: 
        !           438:     def __getnewargs__(self):
        !           439:         return (NavigableString.__str__(self),)
        !           440: 
        !           441:     def __getattr__(self, attr):
        !           442:         """text.string gives you text. This is for backwards
        !           443:         compatibility for Navigable*String, but for CData* it lets you
        !           444:         get the string without the CData wrapper."""
        !           445:         if attr == 'string':
        !           446:             return self
        !           447:         else:
        !           448:             raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
        !           449: 
        !           450:     def __unicode__(self):
        !           451:         return str(self).decode(DEFAULT_OUTPUT_ENCODING)
        !           452: 
        !           453:     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        !           454:         if encoding:
        !           455:             return self.encode(encoding)
        !           456:         else:
        !           457:             return self
        !           458: 
        !           459: class CData(NavigableString):
        !           460: 
        !           461:     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        !           462:         return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding)
        !           463: 
        !           464: class ProcessingInstruction(NavigableString):
        !           465:     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        !           466:         output = self
        !           467:         if "%SOUP-ENCODING%" in output:
        !           468:             output = self.substituteEncoding(output, encoding)
        !           469:         return "<?%s?>" % self.toEncoding(output, encoding)
        !           470: 
        !           471: class Comment(NavigableString):
        !           472:     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        !           473:         return "<!--%s-->" % NavigableString.__str__(self, encoding)
        !           474: 
        !           475: class Declaration(NavigableString):
        !           476:     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        !           477:         return "<!%s>" % NavigableString.__str__(self, encoding)
        !           478: 
        !           479: class Tag(PageElement):
        !           480: 
        !           481:     """Represents a found HTML tag with its attributes and contents."""
        !           482: 
        !           483:     def _invert(h):
        !           484:         "Cheap function to invert a hash."
        !           485:         i = {}
        !           486:         for k,v in h.items():
        !           487:             i[v] = k
        !           488:         return i
        !           489: 
        !           490:     XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
        !           491:                                       "quot" : '"',
        !           492:                                       "amp" : "&",
        !           493:                                       "lt" : "<",
        !           494:                                       "gt" : ">" }
        !           495: 
        !           496:     XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)
        !           497: 
        !           498:     def _convertEntities(self, match):
        !           499:         """Used in a call to re.sub to replace HTML, XML, and numeric
        !           500:         entities with the appropriate Unicode characters. If HTML
        !           501:         entities are being converted, any unrecognized entities are
        !           502:         escaped."""
        !           503:         x = match.group(1)
        !           504:         if self.convertHTMLEntities and x in name2codepoint:
        !           505:             return unichr(name2codepoint[x])
        !           506:         elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
        !           507:             if self.convertXMLEntities:
        !           508:                 return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
        !           509:             else:
        !           510:                 return u'&%s;' % x
        !           511:         elif len(x) > 0 and x[0] == '#':
        !           512:             # Handle numeric entities
        !           513:             if len(x) > 1 and x[1] == 'x':
        !           514:                 return unichr(int(x[2:], 16))
        !           515:             else:
        !           516:                 return unichr(int(x[1:]))
        !           517: 
        !           518:         elif self.escapeUnrecognizedEntities:
        !           519:             return u'&amp;%s;' % x
        !           520:         else:
        !           521:             return u'&%s;' % x
        !           522: 
        !           523:     def __init__(self, parser, name, attrs=None, parent=None,
        !           524:                  previous=None):
        !           525:         "Basic constructor."
        !           526: 
        !           527:         # We don't actually store the parser object: that lets extracted
        !           528:         # chunks be garbage-collected
        !           529:         self.parserClass = parser.__class__
        !           530:         self.isSelfClosing = parser.isSelfClosingTag(name)
        !           531:         self.name = name
        !           532:         if attrs is None:
        !           533:             attrs = []
        !           534:         elif isinstance(attrs, dict):
        !           535:             attrs = attrs.items()
        !           536:         self.attrs = attrs
        !           537:         self.contents = []
        !           538:         self.setup(parent, previous)
        !           539:         self.hidden = False
        !           540:         self.containsSubstitutions = False
        !           541:         self.convertHTMLEntities = parser.convertHTMLEntities
        !           542:         self.convertXMLEntities = parser.convertXMLEntities
        !           543:         self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities
        !           544: 
        !           545:         # Convert any HTML, XML, or numeric entities in the attribute values.
        !           546:         convert = lambda(k, val): (k,
        !           547:                                    re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
        !           548:                                           self._convertEntities,
        !           549:                                           val))
        !           550:         self.attrs = map(convert, self.attrs)
        !           551: 
        !           552:     def getString(self):
        !           553:         if (len(self.contents) == 1
        !           554:             and isinstance(self.contents[0], NavigableString)):
        !           555:             return self.contents[0]
        !           556: 
        !           557:     def setString(self, string):
        !           558:         """Replace the contents of the tag with a string"""
        !           559:         self.clear()
        !           560:         self.append(string)
        !           561: 
        !           562:     string = property(getString, setString)
        !           563: 
        !           564:     def getText(self, separator=u""):
        !           565:         if not len(self.contents):
        !           566:             return u""
        !           567:         stopNode = self._lastRecursiveChild().next
        !           568:         strings = []
        !           569:         current = self.contents[0]
        !           570:         while current is not stopNode:
        !           571:             if isinstance(current, NavigableString):
        !           572:                 strings.append(current.strip())
        !           573:             current = current.next
        !           574:         return separator.join(strings)
        !           575: 
        !           576:     text = property(getText)
        !           577: 
        !           578:     def get(self, key, default=None):
        !           579:         """Returns the value of the 'key' attribute for the tag, or
        !           580:         the value given for 'default' if it doesn't have that
        !           581:         attribute."""
        !           582:         return self._getAttrMap().get(key, default)
        !           583: 
        !           584:     def clear(self):
        !           585:         """Extract all children."""
        !           586:         for child in self.contents[:]:
        !           587:             child.extract()
        !           588: 
        !           589:     def index(self, element):
        !           590:         for i, child in enumerate(self.contents):
        !           591:             if child is element:
        !           592:                 return i
        !           593:         raise ValueError("Tag.index: element not in tag")
        !           594: 
        !           595:     def has_key(self, key):
        !           596:         return self._getAttrMap().has_key(key)
        !           597: 
        !           598:     def __getitem__(self, key):
        !           599:         """tag[key] returns the value of the 'key' attribute for the tag,
        !           600:         and throws an exception if it's not there."""
        !           601:         return self._getAttrMap()[key]
        !           602: 
        !           603:     def __iter__(self):
        !           604:         "Iterating over a tag iterates over its contents."
        !           605:         return iter(self.contents)
        !           606: 
        !           607:     def __len__(self):
        !           608:         "The length of a tag is the length of its list of contents."
        !           609:         return len(self.contents)
        !           610: 
        !           611:     def __contains__(self, x):
        !           612:         return x in self.contents
        !           613: 
        !           614:     def __nonzero__(self):
        !           615:         "A tag is non-None even if it has no contents."
        !           616:         return True
        !           617: 
        !           618:     def __setitem__(self, key, value):
        !           619:         """Setting tag[key] sets the value of the 'key' attribute for the
        !           620:         tag."""
        !           621:         self._getAttrMap()
        !           622:         self.attrMap[key] = value
        !           623:         found = False
        !           624:         for i in range(0, len(self.attrs)):
        !           625:             if self.attrs[i][0] == key:
        !           626:                 self.attrs[i] = (key, value)
        !           627:                 found = True
        !           628:         if not found:
        !           629:             self.attrs.append((key, value))
        !           630:         self._getAttrMap()[key] = value
        !           631: 
        !           632:     def __delitem__(self, key):
        !           633:         "Deleting tag[key] deletes all 'key' attributes for the tag."
        !           634:         for item in self.attrs:
        !           635:             if item[0] == key:
        !           636:                 self.attrs.remove(item)
        !           637:                 #We don't break because bad HTML can define the same
        !           638:                 #attribute multiple times.
        !           639:             self._getAttrMap()
        !           640:             if self.attrMap.has_key(key):
        !           641:                 del self.attrMap[key]
        !           642: 
        !           643:     def __call__(self, *args, **kwargs):
        !           644:         """Calling a tag like a function is the same as calling its
        !           645:         findAll() method. Eg. tag('a') returns a list of all the A tags
        !           646:         found within this tag."""
        !           647:         return apply(self.findAll, args, kwargs)
        !           648: 
        !           649:     def __getattr__(self, tag):
        !           650:         #print "Getattr %s.%s" % (self.__class__, tag)
        !           651:         if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3:
        !           652:             return self.find(tag[:-3])
        !           653:         elif tag.find('__') != 0:
        !           654:             return self.find(tag)
        !           655:         raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag)
        !           656: 
        !           657:     def __eq__(self, other):
        !           658:         """Returns true iff this tag has the same name, the same attributes,
        !           659:         and the same contents (recursively) as the given tag.
        !           660: 
        !           661:         NOTE: right now this will return false if two tags have the
        !           662:         same attributes in a different order. Should this be fixed?"""
        !           663:         if other is self:
        !           664:             return True
        !           665:         if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other):
        !           666:             return False
        !           667:         for i in range(0, len(self.contents)):
        !           668:             if self.contents[i] != other.contents[i]:
        !           669:                 return False
        !           670:         return True
        !           671: 
        !           672:     def __ne__(self, other):
        !           673:         """Returns true iff this tag is not identical to the other tag,
        !           674:         as defined in __eq__."""
        !           675:         return not self == other
        !           676: 
        !           677:     def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        !           678:         """Renders this tag as a string."""
        !           679:         return self.__str__(encoding)
        !           680: 
        !           681:     def __unicode__(self):
        !           682:         return self.__str__(None)
        !           683: 
        !           684:     BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
        !           685:                                            + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
        !           686:                                            + ")")
        !           687: 
        !           688:     def _sub_entity(self, x):
        !           689:         """Used with a regular expression to substitute the
        !           690:         appropriate XML entity for an XML special character."""
        !           691:         return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"
        !           692: 
        !           693:     def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING,
        !           694:                 prettyPrint=False, indentLevel=0):
        !           695:         """Returns a string or Unicode representation of this tag and
        !           696:         its contents. To get Unicode, pass None for encoding.
        !           697: 
        !           698:         NOTE: since Python's HTML parser consumes whitespace, this
        !           699:         method is not certain to reproduce the whitespace present in
        !           700:         the original string."""
        !           701: 
        !           702:         encodedName = self.toEncoding(self.name, encoding)
        !           703: 
        !           704:         attrs = []
        !           705:         if self.attrs:
        !           706:             for key, val in self.attrs:
        !           707:                 fmt = '%s="%s"'
        !           708:                 if isinstance(val, basestring):
        !           709:                     if self.containsSubstitutions and '%SOUP-ENCODING%' in val:
        !           710:                         val = self.substituteEncoding(val, encoding)
        !           711: 
        !           712:                     # The attribute value either:
        !           713:                     #
        !           714:                     # * Contains no embedded double quotes or single quotes.
        !           715:                     #   No problem: we enclose it in double quotes.
        !           716:                     # * Contains embedded single quotes. No problem:
        !           717:                     #   double quotes work here too.
        !           718:                     # * Contains embedded double quotes. No problem:
        !           719:                     #   we enclose it in single quotes.
        !           720:                     # * Embeds both single _and_ double quotes. This
        !           721:                     #   can't happen naturally, but it can happen if
        !           722:                     #   you modify an attribute value after parsing
        !           723:                     #   the document. Now we have a bit of a
        !           724:                     #   problem. We solve it by enclosing the
        !           725:                     #   attribute in single quotes, and escaping any
        !           726:                     #   embedded single quotes to XML entities.
        !           727:                     if '"' in val:
        !           728:                         fmt = "%s='%s'"
        !           729:                         if "'" in val:
        !           730:                             # TODO: replace with apos when
        !           731:                             # appropriate.
        !           732:                             val = val.replace("'", "&squot;")
        !           733: 
        !           734:                     # Now we're okay w/r/t quotes. But the attribute
        !           735:                     # value might also contain angle brackets, or
        !           736:                     # ampersands that aren't part of entities. We need
        !           737:                     # to escape those to XML entities too.
        !           738:                     val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)
        !           739: 
        !           740:                 attrs.append(fmt % (self.toEncoding(key, encoding),
        !           741:                                     self.toEncoding(val, encoding)))
        !           742:         close = ''
        !           743:         closeTag = ''
        !           744:         if self.isSelfClosing:
        !           745:             close = ' /'
        !           746:         else:
        !           747:             closeTag = '</%s>' % encodedName
        !           748: 
        !           749:         indentTag, indentContents = 0, 0
        !           750:         if prettyPrint:
        !           751:             indentTag = indentLevel
        !           752:             space = (' ' * (indentTag-1))
        !           753:             indentContents = indentTag + 1
        !           754:         contents = self.renderContents(encoding, prettyPrint, indentContents)
        !           755:         if self.hidden:
        !           756:             s = contents
        !           757:         else:
        !           758:             s = []
        !           759:             attributeString = ''
        !           760:             if attrs:
        !           761:                 attributeString = ' ' + ' '.join(attrs)
        !           762:             if prettyPrint:
        !           763:                 s.append(space)
        !           764:             s.append('<%s%s%s>' % (encodedName, attributeString, close))
        !           765:             if prettyPrint:
        !           766:                 s.append("\n")
        !           767:             s.append(contents)
        !           768:             if prettyPrint and contents and contents[-1] != "\n":
        !           769:                 s.append("\n")
        !           770:             if prettyPrint and closeTag:
        !           771:                 s.append(space)
        !           772:             s.append(closeTag)
        !           773:             if prettyPrint and closeTag and self.nextSibling:
        !           774:                 s.append("\n")
        !           775:             s = ''.join(s)
        !           776:         return s
        !           777: 
        !           778:     def decompose(self):
        !           779:         """Recursively destroys the contents of this tree."""
        !           780:         self.extract()
        !           781:         if len(self.contents) == 0:
        !           782:             return
        !           783:         current = self.contents[0]
        !           784:         while current is not None:
        !           785:             next = current.next
        !           786:             if isinstance(current, Tag):
        !           787:                 del current.contents[:]
        !           788:             current.parent = None
        !           789:             current.previous = None
        !           790:             current.previousSibling = None
        !           791:             current.next = None
        !           792:             current.nextSibling = None
        !           793:             current = next
        !           794: 
        !           795:     def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
        !           796:         return self.__str__(encoding, True)
        !           797: 
        !           798:     def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
        !           799:                        prettyPrint=False, indentLevel=0):
        !           800:         """Renders the contents of this tag as a string in the given
        !           801:         encoding. If encoding is None, returns a Unicode string.."""
        !           802:         s=[]
        !           803:         for c in self:
        !           804:             text = None
        !           805:             if isinstance(c, NavigableString):
        !           806:                 text = c.__str__(encoding)
        !           807:             elif isinstance(c, Tag):
        !           808:                 s.append(c.__str__(encoding, prettyPrint, indentLevel))
        !           809:             if text and prettyPrint:
        !           810:                 text = text.strip()
        !           811:             if text:
        !           812:                 if prettyPrint:
        !           813:                     s.append(" " * (indentLevel-1))
        !           814:                 s.append(text)
        !           815:                 if prettyPrint:
        !           816:                     s.append("\n")
        !           817:         return ''.join(s)
        !           818: 
        !           819:     #Soup methods
        !           820: 
        !           821:     def find(self, name=None, attrs={}, recursive=True, text=None,
        !           822:              **kwargs):
        !           823:         """Return only the first child of this Tag matching the given
        !           824:         criteria."""
        !           825:         r = None
        !           826:         l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
        !           827:         if l:
        !           828:             r = l[0]
        !           829:         return r
        !           830:     findChild = find
        !           831: 
        !           832:     def findAll(self, name=None, attrs={}, recursive=True, text=None,
        !           833:                 limit=None, **kwargs):
        !           834:         """Extracts a list of Tag objects that match the given
        !           835:         criteria.  You can specify the name of the Tag and any
        !           836:         attributes you want the Tag to have.
        !           837: 
        !           838:         The value of a key-value pair in the 'attrs' map can be a
        !           839:         string, a list of strings, a regular expression object, or a
        !           840:         callable that takes a string and returns whether or not the
        !           841:         string matches for some custom definition of 'matches'. The
        !           842:         same is true of the tag name."""
        !           843:         generator = self.recursiveChildGenerator
        !           844:         if not recursive:
        !           845:             generator = self.childGenerator
        !           846:         return self._findAll(name, attrs, text, limit, generator, **kwargs)
        !           847:     findChildren = findAll
        !           848: 
        !           849:     # Pre-3.x compatibility methods
        !           850:     first = find
        !           851:     fetch = findAll
        !           852: 
        !           853:     def fetchText(self, text=None, recursive=True, limit=None):
        !           854:         return self.findAll(text=text, recursive=recursive, limit=limit)
        !           855: 
        !           856:     def firstText(self, text=None, recursive=True):
        !           857:         return self.find(text=text, recursive=recursive)
        !           858: 
        !           859:     #Private methods
        !           860: 
        !           861:     def _getAttrMap(self):
        !           862:         """Initializes a map representation of this tag's attributes,
        !           863:         if not already initialized."""
        !           864:         if not getattr(self, 'attrMap'):
        !           865:             self.attrMap = {}
        !           866:             for (key, value) in self.attrs:
        !           867:                 self.attrMap[key] = value
        !           868:         return self.attrMap
        !           869: 
        !           870:     #Generator methods
        !           871:     def childGenerator(self):
        !           872:         # Just use the iterator from the contents
        !           873:         return iter(self.contents)
        !           874: 
        !           875:     def recursiveChildGenerator(self):
        !           876:         if not len(self.contents):
        !           877:             raise StopIteration
        !           878:         stopNode = self._lastRecursiveChild().next
        !           879:         current = self.contents[0]
        !           880:         while current is not stopNode:
        !           881:             yield current
        !           882:             current = current.next
        !           883: 
        !           884: 
        !           885: # Next, a couple classes to represent queries and their results.
        !           886: class SoupStrainer:
        !           887:     """Encapsulates a number of ways of matching a markup element (tag or
        !           888:     text)."""
        !           889: 
        !           890:     def __init__(self, name=None, attrs={}, text=None, **kwargs):
        !           891:         self.name = name
        !           892:         if isinstance(attrs, basestring):
        !           893:             kwargs['class'] = _match_css_class(attrs)
        !           894:             attrs = None
        !           895:         if kwargs:
        !           896:             if attrs:
        !           897:                 attrs = attrs.copy()
        !           898:                 attrs.update(kwargs)
        !           899:             else:
        !           900:                 attrs = kwargs
        !           901:         self.attrs = attrs
        !           902:         self.text = text
        !           903: 
        !           904:     def __str__(self):
        !           905:         if self.text:
        !           906:             return self.text
        !           907:         else:
        !           908:             return "%s|%s" % (self.name, self.attrs)
        !           909: 
        !           910:     def searchTag(self, markupName=None, markupAttrs={}):
        !           911:         found = None
        !           912:         markup = None
        !           913:         if isinstance(markupName, Tag):
        !           914:             markup = markupName
        !           915:             markupAttrs = markup
        !           916:         callFunctionWithTagData = callable(self.name) \
        !           917:                                 and not isinstance(markupName, Tag)
        !           918: 
        !           919:         if (not self.name) \
        !           920:                or callFunctionWithTagData \
        !           921:                or (markup and self._matches(markup, self.name)) \
        !           922:                or (not markup and self._matches(markupName, self.name)):
        !           923:             if callFunctionWithTagData:
        !           924:                 match = self.name(markupName, markupAttrs)
        !           925:             else:
        !           926:                 match = True
        !           927:                 markupAttrMap = None
        !           928:                 for attr, matchAgainst in self.attrs.items():
        !           929:                     if not markupAttrMap:
        !           930:                          if hasattr(markupAttrs, 'get'):
        !           931:                             markupAttrMap = markupAttrs
        !           932:                          else:
        !           933:                             markupAttrMap = {}
        !           934:                             for k,v in markupAttrs:
        !           935:                                 markupAttrMap[k] = v
        !           936:                     attrValue = markupAttrMap.get(attr)
        !           937:                     if not self._matches(attrValue, matchAgainst):
        !           938:                         match = False
        !           939:                         break
        !           940:             if match:
        !           941:                 if markup:
        !           942:                     found = markup
        !           943:                 else:
        !           944:                     found = markupName
        !           945:         return found
        !           946: 
        !           947:     def search(self, markup):
        !           948:         #print 'looking for %s in %s' % (self, markup)
        !           949:         found = None
        !           950:         # If given a list of items, scan it for a text element that
        !           951:         # matches.
        !           952:         if hasattr(markup, "__iter__") \
        !           953:                 and not isinstance(markup, Tag):
        !           954:             for element in markup:
        !           955:                 if isinstance(element, NavigableString) \
        !           956:                        and self.search(element):
        !           957:                     found = element
        !           958:                     break
        !           959:         # If it's a Tag, make sure its name or attributes match.
        !           960:         # Don't bother with Tags if we're searching for text.
        !           961:         elif isinstance(markup, Tag):
        !           962:             if not self.text:
        !           963:                 found = self.searchTag(markup)
        !           964:         # If it's text, make sure the text matches.
        !           965:         elif isinstance(markup, NavigableString) or \
        !           966:                  isinstance(markup, basestring):
        !           967:             if self._matches(markup, self.text):
        !           968:                 found = markup
        !           969:         else:
        !           970:             raise Exception, "I don't know how to match against a %s" \
        !           971:                   % markup.__class__
        !           972:         return found
        !           973: 
        !           974:     def _matches(self, markup, matchAgainst):
        !           975:         #print "Matching %s against %s" % (markup, matchAgainst)
        !           976:         result = False
        !           977:         if matchAgainst is True:
        !           978:             result = markup is not None
        !           979:         elif callable(matchAgainst):
        !           980:             result = matchAgainst(markup)
        !           981:         else:
        !           982:             #Custom match methods take the tag as an argument, but all
        !           983:             #other ways of matching match the tag name as a string.
        !           984:             if isinstance(markup, Tag):
        !           985:                 markup = markup.name
        !           986:             if markup and not isinstance(markup, basestring):
        !           987:                 markup = unicode(markup)
        !           988:             #Now we know that chunk is either a string, or None.
        !           989:             if hasattr(matchAgainst, 'match'):
        !           990:                 # It's a regexp object.
        !           991:                 result = markup and matchAgainst.search(markup)
        !           992:             elif hasattr(matchAgainst, '__iter__'): # list-like
        !           993:                 result = markup in matchAgainst
        !           994:             elif hasattr(matchAgainst, 'items'):
        !           995:                 result = markup.has_key(matchAgainst)
        !           996:             elif matchAgainst and isinstance(markup, basestring):
        !           997:                 if isinstance(markup, unicode):
        !           998:                     matchAgainst = unicode(matchAgainst)
        !           999:                 else:
        !          1000:                     matchAgainst = str(matchAgainst)
        !          1001: 
        !          1002:             if not result:
        !          1003:                 result = matchAgainst == markup
        !          1004:         return result
        !          1005: 
        !          1006: class ResultSet(list):
        !          1007:     """A ResultSet is just a list that keeps track of the SoupStrainer
        !          1008:     that created it."""
        !          1009:     def __init__(self, source):
        !          1010:         list.__init__([])
        !          1011:         self.source = source
        !          1012: 
        !          1013: # Now, some helper functions.
        !          1014: 
        !          1015: def buildTagMap(default, *args):
        !          1016:     """Turns a list of maps, lists, or scalars into a single map.
        !          1017:     Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
        !          1018:     NESTING_RESET_TAGS maps out of lists and partial maps."""
        !          1019:     built = {}
        !          1020:     for portion in args:
        !          1021:         if hasattr(portion, 'items'):
        !          1022:             #It's a map. Merge it.
        !          1023:             for k,v in portion.items():
        !          1024:                 built[k] = v
        !          1025:         elif hasattr(portion, '__iter__'): # is a list
        !          1026:             #It's a list. Map each item to the default.
        !          1027:             for k in portion:
        !          1028:                 built[k] = default
        !          1029:         else:
        !          1030:             #It's a scalar. Map it to the default.
        !          1031:             built[portion] = default
        !          1032:     return built
        !          1033: 
        !          1034: # Now, the parser classes.
        !          1035: 
        !          1036: class BeautifulStoneSoup(Tag, SGMLParser):
        !          1037: 
        !          1038:     """This class contains the basic parser and search code. It defines
        !          1039:     a parser that knows nothing about tag behavior except for the
        !          1040:     following:
        !          1041: 
        !          1042:       You can't close a tag without closing all the tags it encloses.
        !          1043:       That is, "<foo><bar></foo>" actually means
        !          1044:       "<foo><bar></bar></foo>".
        !          1045: 
        !          1046:     [Another possible explanation is "<foo><bar /></foo>", but since
        !          1047:     this class defines no SELF_CLOSING_TAGS, it will never use that
        !          1048:     explanation.]
        !          1049: 
        !          1050:     This class is useful for parsing XML or made-up markup languages,
        !          1051:     or when BeautifulSoup makes an assumption counter to what you were
        !          1052:     expecting."""
        !          1053: 
        !          1054:     SELF_CLOSING_TAGS = {}
        !          1055:     NESTABLE_TAGS = {}
        !          1056:     RESET_NESTING_TAGS = {}
        !          1057:     QUOTE_TAGS = {}
        !          1058:     PRESERVE_WHITESPACE_TAGS = []
        !          1059: 
        !          1060:     MARKUP_MASSAGE = [(re.compile('(<[^<>]*)/>'),
        !          1061:                        lambda x: x.group(1) + ' />'),
        !          1062:                       (re.compile('<!\s+([^<>]*)>'),
        !          1063:                        lambda x: '<!' + x.group(1) + '>')
        !          1064:                       ]
        !          1065: 
        !          1066:     ROOT_TAG_NAME = u'[document]'
        !          1067: 
        !          1068:     HTML_ENTITIES = "html"
        !          1069:     XML_ENTITIES = "xml"
        !          1070:     XHTML_ENTITIES = "xhtml"
        !          1071:     # TODO: This only exists for backwards-compatibility
        !          1072:     ALL_ENTITIES = XHTML_ENTITIES
        !          1073: 
        !          1074:     # Used when determining whether a text node is all whitespace and
        !          1075:     # can be replaced with a single space. A text node that contains
        !          1076:     # fancy Unicode spaces (usually non-breaking) should be left
        !          1077:     # alone.
        !          1078:     STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }
        !          1079: 
        !          1080:     def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
        !          1081:                  markupMassage=True, smartQuotesTo=XML_ENTITIES,
        !          1082:                  convertEntities=None, selfClosingTags=None, isHTML=False):
        !          1083:         """The Soup object is initialized as the 'root tag', and the
        !          1084:         provided markup (which can be a string or a file-like object)
        !          1085:         is fed into the underlying parser.
        !          1086: 
        !          1087:         sgmllib will process most bad HTML, and the BeautifulSoup
        !          1088:         class has some tricks for dealing with some HTML that kills
        !          1089:         sgmllib, but Beautiful Soup can nonetheless choke or lose data
        !          1090:         if your data uses self-closing tags or declarations
        !          1091:         incorrectly.
        !          1092: 
        !          1093:         By default, Beautiful Soup uses regexes to sanitize input,
        !          1094:         avoiding the vast majority of these problems. If the problems
        !          1095:         don't apply to you, pass in False for markupMassage, and
        !          1096:         you'll get better performance.
        !          1097: 
        !          1098:         The default parser massage techniques fix the two most common
        !          1099:         instances of invalid HTML that choke sgmllib:
        !          1100: 
        !          1101:          <br/> (No space between name of closing tag and tag close)
        !          1102:          <! --Comment--> (Extraneous whitespace in declaration)
        !          1103: 
        !          1104:         You can pass in a custom list of (RE object, replace method)
        !          1105:         tuples to get Beautiful Soup to scrub your input the way you
        !          1106:         want."""
        !          1107: 
        !          1108:         self.parseOnlyThese = parseOnlyThese
        !          1109:         self.fromEncoding = fromEncoding
        !          1110:         self.smartQuotesTo = smartQuotesTo
        !          1111:         self.convertEntities = convertEntities
        !          1112:         # Set the rules for how we'll deal with the entities we
        !          1113:         # encounter
        !          1114:         if self.convertEntities:
        !          1115:             # It doesn't make sense to convert encoded characters to
        !          1116:             # entities even while you're converting entities to Unicode.
        !          1117:             # Just convert it all to Unicode.
        !          1118:             self.smartQuotesTo = None
        !          1119:             if convertEntities == self.HTML_ENTITIES:
        !          1120:                 self.convertXMLEntities = False
        !          1121:                 self.convertHTMLEntities = True
        !          1122:                 self.escapeUnrecognizedEntities = True
        !          1123:             elif convertEntities == self.XHTML_ENTITIES:
        !          1124:                 self.convertXMLEntities = True
        !          1125:                 self.convertHTMLEntities = True
        !          1126:                 self.escapeUnrecognizedEntities = False
        !          1127:             elif convertEntities == self.XML_ENTITIES:
        !          1128:                 self.convertXMLEntities = True
        !          1129:                 self.convertHTMLEntities = False
        !          1130:                 self.escapeUnrecognizedEntities = False
        !          1131:         else:
        !          1132:             self.convertXMLEntities = False
        !          1133:             self.convertHTMLEntities = False
        !          1134:             self.escapeUnrecognizedEntities = False
        !          1135: 
        !          1136:         self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        !          1137:         SGMLParser.__init__(self)
        !          1138: 
        !          1139:         if hasattr(markup, 'read'):        # It's a file-type object.
        !          1140:             markup = markup.read()
        !          1141:         self.markup = markup
        !          1142:         self.markupMassage = markupMassage
        !          1143:         try:
        !          1144:             self._feed(isHTML=isHTML)
        !          1145:         except StopParsing:
        !          1146:             pass
        !          1147:         self.markup = None                 # The markup can now be GCed
        !          1148: 
        !          1149:     def convert_charref(self, name):
        !          1150:         """This method fixes a bug in Python's SGMLParser."""
        !          1151:         try:
        !          1152:             n = int(name)
        !          1153:         except ValueError:
        !          1154:             return
        !          1155:         if not 0 <= n <= 127 : # ASCII ends at 127, not 255
        !          1156:             return
        !          1157:         return self.convert_codepoint(n)
        !          1158: 
        !          1159:     def _feed(self, inDocumentEncoding=None, isHTML=False):
        !          1160:         # Convert the document to Unicode.
        !          1161:         markup = self.markup
        !          1162:         if isinstance(markup, unicode):
        !          1163:             if not hasattr(self, 'originalEncoding'):
        !          1164:                 self.originalEncoding = None
        !          1165:         else:
        !          1166:             dammit = UnicodeDammit\
        !          1167:                      (markup, [self.fromEncoding, inDocumentEncoding],
        !          1168:                       smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
        !          1169:             markup = dammit.unicode
        !          1170:             self.originalEncoding = dammit.originalEncoding
        !          1171:             self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
        !          1172:         if markup:
        !          1173:             if self.markupMassage:
        !          1174:                 if not hasattr(self.markupMassage, "__iter__"):
        !          1175:                     self.markupMassage = self.MARKUP_MASSAGE
        !          1176:                 for fix, m in self.markupMassage:
        !          1177:                     markup = fix.sub(m, markup)
        !          1178:                 # TODO: We get rid of markupMassage so that the
        !          1179:                 # soup object can be deepcopied later on. Some
        !          1180:                 # Python installations can't copy regexes. If anyone
        !          1181:                 # was relying on the existence of markupMassage, this
        !          1182:                 # might cause problems.
        !          1183:                 del(self.markupMassage)
        !          1184:         self.reset()
        !          1185: 
        !          1186:         SGMLParser.feed(self, markup)
        !          1187:         # Close out any unfinished strings and close all the open tags.
        !          1188:         self.endData()
        !          1189:         while self.currentTag.name != self.ROOT_TAG_NAME:
        !          1190:             self.popTag()
        !          1191: 
        !          1192:     def __getattr__(self, methodName):
        !          1193:         """This method routes method call requests to either the SGMLParser
        !          1194:         superclass or the Tag superclass, depending on the method name."""
        !          1195:         #print "__getattr__ called on %s.%s" % (self.__class__, methodName)
        !          1196: 
        !          1197:         if methodName.startswith('start_') or methodName.startswith('end_') \
        !          1198:                or methodName.startswith('do_'):
        !          1199:             return SGMLParser.__getattr__(self, methodName)
        !          1200:         elif not methodName.startswith('__'):
        !          1201:             return Tag.__getattr__(self, methodName)
        !          1202:         else:
        !          1203:             raise AttributeError
        !          1204: 
        !          1205:     def isSelfClosingTag(self, name):
        !          1206:         """Returns true iff the given string is the name of a
        !          1207:         self-closing tag according to this parser."""
        !          1208:         return self.SELF_CLOSING_TAGS.has_key(name) \
        !          1209:                or self.instanceSelfClosingTags.has_key(name)
        !          1210: 
        !          1211:     def reset(self):
        !          1212:         Tag.__init__(self, self, self.ROOT_TAG_NAME)
        !          1213:         self.hidden = 1
        !          1214:         SGMLParser.reset(self)
        !          1215:         self.currentData = []
        !          1216:         self.currentTag = None
        !          1217:         self.tagStack = []
        !          1218:         self.quoteStack = []
        !          1219:         self.pushTag(self)
        !          1220: 
        !          1221:     def popTag(self):
        !          1222:         tag = self.tagStack.pop()
        !          1223: 
        !          1224:         #print "Pop", tag.name
        !          1225:         if self.tagStack:
        !          1226:             self.currentTag = self.tagStack[-1]
        !          1227:         return self.currentTag
        !          1228: 
        !          1229:     def pushTag(self, tag):
        !          1230:         #print "Push", tag.name
        !          1231:         if self.currentTag:
        !          1232:             self.currentTag.contents.append(tag)
        !          1233:         self.tagStack.append(tag)
        !          1234:         self.currentTag = self.tagStack[-1]
        !          1235: 
        !          1236:     def endData(self, containerClass=NavigableString):
        !          1237:         if self.currentData:
        !          1238:             currentData = u''.join(self.currentData)
        !          1239:             if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
        !          1240:                 not set([tag.name for tag in self.tagStack]).intersection(
        !          1241:                     self.PRESERVE_WHITESPACE_TAGS)):
        !          1242:                 if '\n' in currentData:
        !          1243:                     currentData = '\n'
        !          1244:                 else:
        !          1245:                     currentData = ' '
        !          1246:             self.currentData = []
        !          1247:             if self.parseOnlyThese and len(self.tagStack) <= 1 and \
        !          1248:                    (not self.parseOnlyThese.text or \
        !          1249:                     not self.parseOnlyThese.search(currentData)):
        !          1250:                 return
        !          1251:             o = containerClass(currentData)
        !          1252:             o.setup(self.currentTag, self.previous)
        !          1253:             if self.previous:
        !          1254:                 self.previous.next = o
        !          1255:             self.previous = o
        !          1256:             self.currentTag.contents.append(o)
        !          1257: 
        !          1258: 
        !          1259:     def _popToTag(self, name, inclusivePop=True):
        !          1260:         """Pops the tag stack up to and including the most recent
        !          1261:         instance of the given tag. If inclusivePop is false, pops the tag
        !          1262:         stack up to but *not* including the most recent instqance of
        !          1263:         the given tag."""
        !          1264:         #print "Popping to %s" % name
        !          1265:         if name == self.ROOT_TAG_NAME:
        !          1266:             return
        !          1267: 
        !          1268:         numPops = 0
        !          1269:         mostRecentTag = None
        !          1270:         for i in range(len(self.tagStack)-1, 0, -1):
        !          1271:             if name == self.tagStack[i].name:
        !          1272:                 numPops = len(self.tagStack)-i
        !          1273:                 break
        !          1274:         if not inclusivePop:
        !          1275:             numPops = numPops - 1
        !          1276: 
        !          1277:         for i in range(0, numPops):
        !          1278:             mostRecentTag = self.popTag()
        !          1279:         return mostRecentTag
        !          1280: 
        !          1281:     def _smartPop(self, name):
        !          1282: 
        !          1283:         """We need to pop up to the previous tag of this type, unless
        !          1284:         one of this tag's nesting reset triggers comes between this
        !          1285:         tag and the previous tag of this type, OR unless this tag is a
        !          1286:         generic nesting trigger and another generic nesting trigger
        !          1287:         comes between this tag and the previous tag of this type.
        !          1288: 
        !          1289:         Examples:
        !          1290:          <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
        !          1291:          <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
        !          1292:          <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.
        !          1293: 
        !          1294:          <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
        !          1295:          <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
        !          1296:          <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
        !          1297:         """
        !          1298: 
        !          1299:         nestingResetTriggers = self.NESTABLE_TAGS.get(name)
        !          1300:         isNestable = nestingResetTriggers != None
        !          1301:         isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
        !          1302:         popTo = None
        !          1303:         inclusive = True
        !          1304:         for i in range(len(self.tagStack)-1, 0, -1):
        !          1305:             p = self.tagStack[i]
        !          1306:             if (not p or p.name == name) and not isNestable:
        !          1307:                 #Non-nestable tags get popped to the top or to their
        !          1308:                 #last occurance.
        !          1309:                 popTo = name
        !          1310:                 break
        !          1311:             if (nestingResetTriggers is not None
        !          1312:                 and p.name in nestingResetTriggers) \
        !          1313:                 or (nestingResetTriggers is None and isResetNesting
        !          1314:                     and self.RESET_NESTING_TAGS.has_key(p.name)):
        !          1315: 
        !          1316:                 #If we encounter one of the nesting reset triggers
        !          1317:                 #peculiar to this tag, or we encounter another tag
        !          1318:                 #that causes nesting to reset, pop up to but not
        !          1319:                 #including that tag.
        !          1320:                 popTo = p.name
        !          1321:                 inclusive = False
        !          1322:                 break
        !          1323:             p = p.parent
        !          1324:         if popTo:
        !          1325:             self._popToTag(popTo, inclusive)
        !          1326: 
        !          1327:     def unknown_starttag(self, name, attrs, selfClosing=0):
        !          1328:         #print "Start tag %s: %s" % (name, attrs)
        !          1329:         if self.quoteStack:
        !          1330:             #This is not a real tag.
        !          1331:             #print "<%s> is not real!" % name
        !          1332:             attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
        !          1333:             self.handle_data('<%s%s>' % (name, attrs))
        !          1334:             return
        !          1335:         self.endData()
        !          1336: 
        !          1337:         if not self.isSelfClosingTag(name) and not selfClosing:
        !          1338:             self._smartPop(name)
        !          1339: 
        !          1340:         if self.parseOnlyThese and len(self.tagStack) <= 1 \
        !          1341:                and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
        !          1342:             return
        !          1343: 
        !          1344:         tag = Tag(self, name, attrs, self.currentTag, self.previous)
        !          1345:         if self.previous:
        !          1346:             self.previous.next = tag
        !          1347:         self.previous = tag
        !          1348:         self.pushTag(tag)
        !          1349:         if selfClosing or self.isSelfClosingTag(name):
        !          1350:             self.popTag()
        !          1351:         if name in self.QUOTE_TAGS:
        !          1352:             #print "Beginning quote (%s)" % name
        !          1353:             self.quoteStack.append(name)
        !          1354:             self.literal = 1
        !          1355:         return tag
        !          1356: 
        !          1357:     def unknown_endtag(self, name):
        !          1358:         #print "End tag %s" % name
        !          1359:         if self.quoteStack and self.quoteStack[-1] != name:
        !          1360:             #This is not a real end tag.
        !          1361:             #print "</%s> is not real!" % name
        !          1362:             self.handle_data('</%s>' % name)
        !          1363:             return
        !          1364:         self.endData()
        !          1365:         self._popToTag(name)
        !          1366:         if self.quoteStack and self.quoteStack[-1] == name:
        !          1367:             self.quoteStack.pop()
        !          1368:             self.literal = (len(self.quoteStack) > 0)
        !          1369: 
        !          1370:     def handle_data(self, data):
        !          1371:         self.currentData.append(data)
        !          1372: 
        !          1373:     def _toStringSubclass(self, text, subclass):
        !          1374:         """Adds a certain piece of text to the tree as a NavigableString
        !          1375:         subclass."""
        !          1376:         self.endData()
        !          1377:         self.handle_data(text)
        !          1378:         self.endData(subclass)
        !          1379: 
        !          1380:     def handle_pi(self, text):
        !          1381:         """Handle a processing instruction as a ProcessingInstruction
        !          1382:         object, possibly one with a %SOUP-ENCODING% slot into which an
        !          1383:         encoding will be plugged later."""
        !          1384:         if text[:3] == "xml":
        !          1385:             text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
        !          1386:         self._toStringSubclass(text, ProcessingInstruction)
        !          1387: 
        !          1388:     def handle_comment(self, text):
        !          1389:         "Handle comments as Comment objects."
        !          1390:         self._toStringSubclass(text, Comment)
        !          1391: 
        !          1392:     def handle_charref(self, ref):
        !          1393:         "Handle character references as data."
        !          1394:         if self.convertEntities:
        !          1395:             data = unichr(int(ref))
        !          1396:         else:
        !          1397:             data = '&#%s;' % ref
        !          1398:         self.handle_data(data)
        !          1399: 
        !          1400:     def handle_entityref(self, ref):
        !          1401:         """Handle entity references as data, possibly converting known
        !          1402:         HTML and/or XML entity references to the corresponding Unicode
        !          1403:         characters."""
        !          1404:         data = None
        !          1405:         if self.convertHTMLEntities:
        !          1406:             try:
        !          1407:                 data = unichr(name2codepoint[ref])
        !          1408:             except KeyError:
        !          1409:                 pass
        !          1410: 
        !          1411:         if not data and self.convertXMLEntities:
        !          1412:                 data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)
        !          1413: 
        !          1414:         if not data and self.convertHTMLEntities and \
        !          1415:             not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
        !          1416:                 # TODO: We've got a problem here. We're told this is
        !          1417:                 # an entity reference, but it's not an XML entity
        !          1418:                 # reference or an HTML entity reference. Nonetheless,
        !          1419:                 # the logical thing to do is to pass it through as an
        !          1420:                 # unrecognized entity reference.
        !          1421:                 #
        !          1422:                 # Except: when the input is "&carol;" this function
        !          1423:                 # will be called with input "carol". When the input is
        !          1424:                 # "AT&T", this function will be called with input
        !          1425:                 # "T". We have no way of knowing whether a semicolon
        !          1426:                 # was present originally, so we don't know whether
        !          1427:                 # this is an unknown entity or just a misplaced
        !          1428:                 # ampersand.
        !          1429:                 #
        !          1430:                 # The more common case is a misplaced ampersand, so I
        !          1431:                 # escape the ampersand and omit the trailing semicolon.
        !          1432:                 data = "&amp;%s" % ref
        !          1433:         if not data:
        !          1434:             # This case is different from the one above, because we
        !          1435:             # haven't already gone through a supposedly comprehensive
        !          1436:             # mapping of entities to Unicode characters. We might not
        !          1437:             # have gone through any mapping at all. So the chances are
        !          1438:             # very high that this is a real entity, and not a
        !          1439:             # misplaced ampersand.
        !          1440:             data = "&%s;" % ref
        !          1441:         self.handle_data(data)
        !          1442: 
        !          1443:     def handle_decl(self, data):
        !          1444:         "Handle DOCTYPEs and the like as Declaration objects."
        !          1445:         self._toStringSubclass(data, Declaration)
        !          1446: 
        !          1447:     def parse_declaration(self, i):
        !          1448:         """Treat a bogus SGML declaration as raw data. Treat a CDATA
        !          1449:         declaration as a CData object."""
        !          1450:         j = None
        !          1451:         if self.rawdata[i:i+9] == '<![CDATA[':
        !          1452:              k = self.rawdata.find(']]>', i)
        !          1453:              if k == -1:
        !          1454:                  k = len(self.rawdata)
        !          1455:              data = self.rawdata[i+9:k]
        !          1456:              j = k+3
        !          1457:              self._toStringSubclass(data, CData)
        !          1458:         else:
        !          1459:             try:
        !          1460:                 j = SGMLParser.parse_declaration(self, i)
        !          1461:             except SGMLParseError:
        !          1462:                 toHandle = self.rawdata[i:]
        !          1463:                 self.handle_data(toHandle)
        !          1464:                 j = i + len(toHandle)
        !          1465:         return j
        !          1466: 
        !          1467: class BeautifulSoup(BeautifulStoneSoup):
        !          1468: 
        !          1469:     """This parser knows the following facts about HTML:
        !          1470: 
        !          1471:     * Some tags have no closing tag and should be interpreted as being
        !          1472:       closed as soon as they are encountered.
        !          1473: 
        !          1474:     * The text inside some tags (ie. 'script') may contain tags which
        !          1475:       are not really part of the document and which should be parsed
        !          1476:       as text, not tags. If you want to parse the text as tags, you can
        !          1477:       always fetch it and parse it explicitly.
        !          1478: 
        !          1479:     * Tag nesting rules:
        !          1480: 
        !          1481:       Most tags can't be nested at all. For instance, the occurance of
        !          1482:       a <p> tag should implicitly close the previous <p> tag.
        !          1483: 
        !          1484:        <p>Para1<p>Para2
        !          1485:         should be transformed into:
        !          1486:        <p>Para1</p><p>Para2
        !          1487: 
        !          1488:       Some tags can be nested arbitrarily. For instance, the occurance
        !          1489:       of a <blockquote> tag should _not_ implicitly close the previous
        !          1490:       <blockquote> tag.
        !          1491: 
        !          1492:        Alice said: <blockquote>Bob said: <blockquote>Blah
        !          1493:         should NOT be transformed into:
        !          1494:        Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah
        !          1495: 
        !          1496:       Some tags can be nested, but the nesting is reset by the
        !          1497:       interposition of other tags. For instance, a <tr> tag should
        !          1498:       implicitly close the previous <tr> tag within the same <table>,
        !          1499:       but not close a <tr> tag in another table.
        !          1500: 
        !          1501:        <table><tr>Blah<tr>Blah
        !          1502:         should be transformed into:
        !          1503:        <table><tr>Blah</tr><tr>Blah
        !          1504:         but,
        !          1505:        <tr>Blah<table><tr>Blah
        !          1506:         should NOT be transformed into
        !          1507:        <tr>Blah<table></tr><tr>Blah
        !          1508: 
        !          1509:     Differing assumptions about tag nesting rules are a major source
        !          1510:     of problems with the BeautifulSoup class. If BeautifulSoup is not
        !          1511:     treating as nestable a tag your page author treats as nestable,
        !          1512:     try ICantBelieveItsBeautifulSoup, MinimalSoup, or
        !          1513:     BeautifulStoneSoup before writing your own subclass."""
        !          1514: 
        !          1515:     def __init__(self, *args, **kwargs):
        !          1516:         if not kwargs.has_key('smartQuotesTo'):
        !          1517:             kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        !          1518:         kwargs['isHTML'] = True
        !          1519:         BeautifulStoneSoup.__init__(self, *args, **kwargs)
        !          1520: 
        !          1521:     SELF_CLOSING_TAGS = buildTagMap(None,
        !          1522:                                     ('br' , 'hr', 'input', 'img', 'meta',
        !          1523:                                     'spacer', 'link', 'frame', 'base', 'col'))
        !          1524: 
        !          1525:     PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])
        !          1526: 
        !          1527:     QUOTE_TAGS = {'script' : None, 'textarea' : None}
        !          1528: 
        !          1529:     #According to the HTML standard, each of these inline tags can
        !          1530:     #contain another tag of the same type. Furthermore, it's common
        !          1531:     #to actually use these tags this way.
        !          1532:     NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
        !          1533:                             'center')
        !          1534: 
        !          1535:     #According to the HTML standard, these block tags can contain
        !          1536:     #another tag of the same type. Furthermore, it's common
        !          1537:     #to actually use these tags this way.
        !          1538:     NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')
        !          1539: 
        !          1540:     #Lists can contain other lists, but there are restrictions.
        !          1541:     NESTABLE_LIST_TAGS = { 'ol' : [],
        !          1542:                            'ul' : [],
        !          1543:                            'li' : ['ul', 'ol'],
        !          1544:                            'dl' : [],
        !          1545:                            'dd' : ['dl'],
        !          1546:                            'dt' : ['dl'] }
        !          1547: 
        !          1548:     #Tables can contain other tables, but there are restrictions.
        !          1549:     NESTABLE_TABLE_TAGS = {'table' : [],
        !          1550:                            'tr' : ['table', 'tbody', 'tfoot', 'thead'],
        !          1551:                            'td' : ['tr'],
        !          1552:                            'th' : ['tr'],
        !          1553:                            'thead' : ['table'],
        !          1554:                            'tbody' : ['table'],
        !          1555:                            'tfoot' : ['table'],
        !          1556:                            }
        !          1557: 
        !          1558:     NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')
        !          1559: 
        !          1560:     #If one of these tags is encountered, all tags up to the next tag of
        !          1561:     #this type are popped.
        !          1562:     RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
        !          1563:                                      NON_NESTABLE_BLOCK_TAGS,
        !          1564:                                      NESTABLE_LIST_TAGS,
        !          1565:                                      NESTABLE_TABLE_TAGS)
        !          1566: 
        !          1567:     NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
        !          1568:                                 NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
        !          1569: 
        !          1570:     # Used to detect the charset in a META tag; see start_meta
        !          1571:     CHARSET_RE = re.compile("((^|;)\s*charset=)([^;]*)", re.M)
        !          1572: 
        !          1573:     def start_meta(self, attrs):
        !          1574:         """Beautiful Soup can detect a charset included in a META tag,
        !          1575:         try to convert the document to that charset, and re-parse the
        !          1576:         document from the beginning."""
        !          1577:         httpEquiv = None
        !          1578:         contentType = None
        !          1579:         contentTypeIndex = None
        !          1580:         tagNeedsEncodingSubstitution = False
        !          1581: 
        !          1582:         for i in range(0, len(attrs)):
        !          1583:             key, value = attrs[i]
        !          1584:             key = key.lower()
        !          1585:             if key == 'http-equiv':
        !          1586:                 httpEquiv = value
        !          1587:             elif key == 'content':
        !          1588:                 contentType = value
        !          1589:                 contentTypeIndex = i
        !          1590: 
        !          1591:         if httpEquiv and contentType: # It's an interesting meta tag.
        !          1592:             match = self.CHARSET_RE.search(contentType)
        !          1593:             if match:
        !          1594:                 if (self.declaredHTMLEncoding is not None or
        !          1595:                     self.originalEncoding == self.fromEncoding):
        !          1596:                     # An HTML encoding was sniffed while converting
        !          1597:                     # the document to Unicode, or an HTML encoding was
        !          1598:                     # sniffed during a previous pass through the
        !          1599:                     # document, or an encoding was specified
        !          1600:                     # explicitly and it worked. Rewrite the meta tag.
        !          1601:                     def rewrite(match):
        !          1602:                         return match.group(1) + "%SOUP-ENCODING%"
        !          1603:                     newAttr = self.CHARSET_RE.sub(rewrite, contentType)
        !          1604:                     attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
        !          1605:                                                newAttr)
        !          1606:                     tagNeedsEncodingSubstitution = True
        !          1607:                 else:
        !          1608:                     # This is our first pass through the document.
        !          1609:                     # Go through it again with the encoding information.
        !          1610:                     newCharset = match.group(3)
        !          1611:                     if newCharset and newCharset != self.originalEncoding:
        !          1612:                         self.declaredHTMLEncoding = newCharset
        !          1613:                         self._feed(self.declaredHTMLEncoding)
        !          1614:                         raise StopParsing
        !          1615:                     pass
        !          1616:         tag = self.unknown_starttag("meta", attrs)
        !          1617:         if tag and tagNeedsEncodingSubstitution:
        !          1618:             tag.containsSubstitutions = True
        !          1619: 
        !          1620: class StopParsing(Exception):
        !          1621:     pass
        !          1622: 
        !          1623: class ICantBelieveItsBeautifulSoup(BeautifulSoup):
        !          1624: 
        !          1625:     """The BeautifulSoup class is oriented towards skipping over
        !          1626:     common HTML errors like unclosed tags. However, sometimes it makes
        !          1627:     errors of its own. For instance, consider this fragment:
        !          1628: 
        !          1629:      <b>Foo<b>Bar</b></b>
        !          1630: 
        !          1631:     This is perfectly valid (if bizarre) HTML. However, the
        !          1632:     BeautifulSoup class will implicitly close the first b tag when it
        !          1633:     encounters the second 'b'. It will think the author wrote
        !          1634:     "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
        !          1635:     there's no real-world reason to bold something that's already
        !          1636:     bold. When it encounters '</b></b>' it will close two more 'b'
        !          1637:     tags, for a grand total of three tags closed instead of two. This
        !          1638:     can throw off the rest of your document structure. The same is
        !          1639:     true of a number of other tags, listed below.
        !          1640: 
        !          1641:     It's much more common for someone to forget to close a 'b' tag
        !          1642:     than to actually use nested 'b' tags, and the BeautifulSoup class
        !          1643:     handles the common case. This class handles the not-co-common
        !          1644:     case: where you can't believe someone wrote what they did, but
        !          1645:     it's valid HTML and BeautifulSoup screwed up by assuming it
        !          1646:     wouldn't be."""
        !          1647: 
        !          1648:     I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
        !          1649:      ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
        !          1650:       'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b',
        !          1651:       'big')
        !          1652: 
        !          1653:     I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',)
        !          1654: 
        !          1655:     NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
        !          1656:                                 I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
        !          1657:                                 I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)
        !          1658: 
        !          1659: class MinimalSoup(BeautifulSoup):
        !          1660:     """The MinimalSoup class is for parsing HTML that contains
        !          1661:     pathologically bad markup. It makes no assumptions about tag
        !          1662:     nesting, but it does know which tags are self-closing, that
        !          1663:     <script> tags contain Javascript and should not be parsed, that
        !          1664:     META tags may contain encoding information, and so on.
        !          1665: 
        !          1666:     This also makes it better for subclassing than BeautifulStoneSoup
        !          1667:     or BeautifulSoup."""
        !          1668: 
        !          1669:     RESET_NESTING_TAGS = buildTagMap('noscript')
        !          1670:     NESTABLE_TAGS = {}
        !          1671: 
        !          1672: class BeautifulSOAP(BeautifulStoneSoup):
        !          1673:     """This class will push a tag with only a single string child into
        !          1674:     the tag's parent as an attribute. The attribute's name is the tag
        !          1675:     name, and the value is the string child. An example should give
        !          1676:     the flavor of the change:
        !          1677: 
        !          1678:     <foo><bar>baz</bar></foo>
        !          1679:      =>
        !          1680:     <foo bar="baz"><bar>baz</bar></foo>
        !          1681: 
        !          1682:     You can then access fooTag['bar'] instead of fooTag.barTag.string.
        !          1683: 
        !          1684:     This is, of course, useful for scraping structures that tend to
        !          1685:     use subelements instead of attributes, such as SOAP messages. Note
        !          1686:     that it modifies its input, so don't print the modified version
        !          1687:     out.
        !          1688: 
        !          1689:     I'm not sure how many people really want to use this class; let me
        !          1690:     know if you do. Mainly I like the name."""
        !          1691: 
        !          1692:     def popTag(self):
        !          1693:         if len(self.tagStack) > 1:
        !          1694:             tag = self.tagStack[-1]
        !          1695:             parent = self.tagStack[-2]
        !          1696:             parent._getAttrMap()
        !          1697:             if (isinstance(tag, Tag) and len(tag.contents) == 1 and
        !          1698:                 isinstance(tag.contents[0], NavigableString) and
        !          1699:                 not parent.attrMap.has_key(tag.name)):
        !          1700:                 parent[tag.name] = tag.contents[0]
        !          1701:         BeautifulStoneSoup.popTag(self)
        !          1702: 
        !          1703: #Enterprise class names! It has come to our attention that some people
        !          1704: #think the names of the Beautiful Soup parser classes are too silly
        !          1705: #and "unprofessional" for use in enterprise screen-scraping. We feel
        !          1706: #your pain! For such-minded folk, the Beautiful Soup Consortium And
        !          1707: #All-Night Kosher Bakery recommends renaming this file to
        !          1708: #"RobustParser.py" (or, in cases of extreme enterprisiness,
        !          1709: #"RobustParserBeanInterface.class") and using the following
        !          1710: #enterprise-friendly class aliases:
        !          1711: class RobustXMLParser(BeautifulStoneSoup):
        !          1712:     pass
        !          1713: class RobustHTMLParser(BeautifulSoup):
        !          1714:     pass
        !          1715: class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
        !          1716:     pass
        !          1717: class RobustInsanelyWackAssHTMLParser(MinimalSoup):
        !          1718:     pass
        !          1719: class SimplifyingSOAPParser(BeautifulSOAP):
        !          1720:     pass
        !          1721: 
        !          1722: ######################################################
        !          1723: #
        !          1724: # Bonus library: Unicode, Dammit
        !          1725: #
        !          1726: # This class forces XML data into a standard format (usually to UTF-8
        !          1727: # or Unicode).  It is heavily based on code from Mark Pilgrim's
        !          1728: # Universal Feed Parser. It does not rewrite the XML or HTML to
        !          1729: # reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
        !          1730: # (XML) and BeautifulSoup.start_meta (HTML).
        !          1731: 
        !          1732: # Autodetects character encodings.
        !          1733: # Download from http://chardet.feedparser.org/
        !          1734: try:
        !          1735:     import chardet
        !          1736: #    import chardet.constants
        !          1737: #    chardet.constants._debug = 1
        !          1738: except ImportError:
        !          1739:     chardet = None
        !          1740: 
        !          1741: # cjkcodecs and iconv_codec make Python know about more character encodings.
        !          1742: # Both are available from http://cjkpython.i18n.org/
        !          1743: # They're built in if you use Python 2.4.
        !          1744: try:
        !          1745:     import cjkcodecs.aliases
        !          1746: except ImportError:
        !          1747:     pass
        !          1748: try:
        !          1749:     import iconv_codec
        !          1750: except ImportError:
        !          1751:     pass
        !          1752: 
        !          1753: class UnicodeDammit:
        !          1754:     """A class for detecting the encoding of a *ML document and
        !          1755:     converting it to a Unicode string. If the source encoding is
        !          1756:     windows-1252, can replace MS smart quotes with their HTML or XML
        !          1757:     equivalents."""
        !          1758: 
        !          1759:     # This dictionary maps commonly seen values for "charset" in HTML
        !          1760:     # meta tags to the corresponding Python codec names. It only covers
        !          1761:     # values that aren't in Python's aliases and can't be determined
        !          1762:     # by the heuristics in find_codec.
        !          1763:     CHARSET_ALIASES = { "macintosh" : "mac-roman",
        !          1764:                         "x-sjis" : "shift-jis" }
        !          1765: 
        !          1766:     def __init__(self, markup, overrideEncodings=[],
        !          1767:                  smartQuotesTo='xml', isHTML=False):
        !          1768:         self.declaredHTMLEncoding = None
        !          1769:         self.markup, documentEncoding, sniffedEncoding = \
        !          1770:                      self._detectEncoding(markup, isHTML)
        !          1771:         self.smartQuotesTo = smartQuotesTo
        !          1772:         self.triedEncodings = []
        !          1773:         if markup == '' or isinstance(markup, unicode):
        !          1774:             self.originalEncoding = None
        !          1775:             self.unicode = unicode(markup)
        !          1776:             return
        !          1777: 
        !          1778:         u = None
        !          1779:         for proposedEncoding in overrideEncodings:
        !          1780:             u = self._convertFrom(proposedEncoding)
        !          1781:             if u: break
        !          1782:         if not u:
        !          1783:             for proposedEncoding in (documentEncoding, sniffedEncoding):
        !          1784:                 u = self._convertFrom(proposedEncoding)
        !          1785:                 if u: break
        !          1786: 
        !          1787:         # If no luck and we have auto-detection library, try that:
        !          1788:         if not u and chardet and not isinstance(self.markup, unicode):
        !          1789:             u = self._convertFrom(chardet.detect(self.markup)['encoding'])
        !          1790: 
        !          1791:         # As a last resort, try utf-8 and windows-1252:
        !          1792:         if not u:
        !          1793:             for proposed_encoding in ("utf-8", "windows-1252"):
        !          1794:                 u = self._convertFrom(proposed_encoding)
        !          1795:                 if u: break
        !          1796: 
        !          1797:         self.unicode = u
        !          1798:         if not u: self.originalEncoding = None
        !          1799: 
        !          1800:     def _subMSChar(self, orig):
        !          1801:         """Changes a MS smart quote character to an XML or HTML
        !          1802:         entity."""
        !          1803:         sub = self.MS_CHARS.get(orig)
        !          1804:         if isinstance(sub, tuple):
        !          1805:             if self.smartQuotesTo == 'xml':
        !          1806:                 sub = '&#x%s;' % sub[1]
        !          1807:             else:
        !          1808:                 sub = '&%s;' % sub[0]
        !          1809:         return sub
        !          1810: 
        !          1811:     def _convertFrom(self, proposed):
        !          1812:         proposed = self.find_codec(proposed)
        !          1813:         if not proposed or proposed in self.triedEncodings:
        !          1814:             return None
        !          1815:         self.triedEncodings.append(proposed)
        !          1816:         markup = self.markup
        !          1817: 
        !          1818:         # Convert smart quotes to HTML if coming from an encoding
        !          1819:         # that might have them.
        !          1820:         if self.smartQuotesTo and proposed.lower() in("windows-1252",
        !          1821:                                                       "iso-8859-1",
        !          1822:                                                       "iso-8859-2"):
        !          1823:             markup = re.compile("([\x80-\x9f])").sub \
        !          1824:                      (lambda(x): self._subMSChar(x.group(1)),
        !          1825:                       markup)
        !          1826: 
        !          1827:         try:
        !          1828:             # print "Trying to convert document to %s" % proposed
        !          1829:             u = self._toUnicode(markup, proposed)
        !          1830:             self.markup = u
        !          1831:             self.originalEncoding = proposed
        !          1832:         except Exception, e:
        !          1833:             # print "That didn't work!"
        !          1834:             # print e
        !          1835:             return None
        !          1836:         #print "Correct encoding: %s" % proposed
        !          1837:         return self.markup
        !          1838: 
        !          1839:     def _toUnicode(self, data, encoding):
        !          1840:         '''Given a string and its encoding, decodes the string into Unicode.
        !          1841:         %encoding is a string recognized by encodings.aliases'''
        !          1842: 
        !          1843:         # strip Byte Order Mark (if present)
        !          1844:         if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
        !          1845:                and (data[2:4] != '\x00\x00'):
        !          1846:             encoding = 'utf-16be'
        !          1847:             data = data[2:]
        !          1848:         elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
        !          1849:                  and (data[2:4] != '\x00\x00'):
        !          1850:             encoding = 'utf-16le'
        !          1851:             data = data[2:]
        !          1852:         elif data[:3] == '\xef\xbb\xbf':
        !          1853:             encoding = 'utf-8'
        !          1854:             data = data[3:]
        !          1855:         elif data[:4] == '\x00\x00\xfe\xff':
        !          1856:             encoding = 'utf-32be'
        !          1857:             data = data[4:]
        !          1858:         elif data[:4] == '\xff\xfe\x00\x00':
        !          1859:             encoding = 'utf-32le'
        !          1860:             data = data[4:]
        !          1861:         newdata = unicode(data, encoding)
        !          1862:         return newdata
        !          1863: 
        !          1864:     def _detectEncoding(self, xml_data, isHTML=False):
        !          1865:         """Given a document, tries to detect its XML encoding."""
        !          1866:         xml_encoding = sniffed_xml_encoding = None
        !          1867:         try:
        !          1868:             if xml_data[:4] == '\x4c\x6f\xa7\x94':
        !          1869:                 # EBCDIC
        !          1870:                 xml_data = self._ebcdic_to_ascii(xml_data)
        !          1871:             elif xml_data[:4] == '\x00\x3c\x00\x3f':
        !          1872:                 # UTF-16BE
        !          1873:                 sniffed_xml_encoding = 'utf-16be'
        !          1874:                 xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
        !          1875:             elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
        !          1876:                      and (xml_data[2:4] != '\x00\x00'):
        !          1877:                 # UTF-16BE with BOM
        !          1878:                 sniffed_xml_encoding = 'utf-16be'
        !          1879:                 xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
        !          1880:             elif xml_data[:4] == '\x3c\x00\x3f\x00':
        !          1881:                 # UTF-16LE
        !          1882:                 sniffed_xml_encoding = 'utf-16le'
        !          1883:                 xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
        !          1884:             elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
        !          1885:                      (xml_data[2:4] != '\x00\x00'):
        !          1886:                 # UTF-16LE with BOM
        !          1887:                 sniffed_xml_encoding = 'utf-16le'
        !          1888:                 xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
        !          1889:             elif xml_data[:4] == '\x00\x00\x00\x3c':
        !          1890:                 # UTF-32BE
        !          1891:                 sniffed_xml_encoding = 'utf-32be'
        !          1892:                 xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
        !          1893:             elif xml_data[:4] == '\x3c\x00\x00\x00':
        !          1894:                 # UTF-32LE
        !          1895:                 sniffed_xml_encoding = 'utf-32le'
        !          1896:                 xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
        !          1897:             elif xml_data[:4] == '\x00\x00\xfe\xff':
        !          1898:                 # UTF-32BE with BOM
        !          1899:                 sniffed_xml_encoding = 'utf-32be'
        !          1900:                 xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
        !          1901:             elif xml_data[:4] == '\xff\xfe\x00\x00':
        !          1902:                 # UTF-32LE with BOM
        !          1903:                 sniffed_xml_encoding = 'utf-32le'
        !          1904:                 xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
        !          1905:             elif xml_data[:3] == '\xef\xbb\xbf':
        !          1906:                 # UTF-8 with BOM
        !          1907:                 sniffed_xml_encoding = 'utf-8'
        !          1908:                 xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
        !          1909:             else:
        !          1910:                 sniffed_xml_encoding = 'ascii'
        !          1911:                 pass
        !          1912:         except:
        !          1913:             xml_encoding_match = None
        !          1914:         xml_encoding_match = re.compile(
        !          1915:             '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
        !          1916:         if not xml_encoding_match and isHTML:
        !          1917:             regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
        !          1918:             xml_encoding_match = regexp.search(xml_data)
        !          1919:         if xml_encoding_match is not None:
        !          1920:             xml_encoding = xml_encoding_match.groups()[0].lower()
        !          1921:             if isHTML:
        !          1922:                 self.declaredHTMLEncoding = xml_encoding
        !          1923:             if sniffed_xml_encoding and \
        !          1924:                (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
        !          1925:                                  'iso-10646-ucs-4', 'ucs-4', 'csucs4',
        !          1926:                                  'utf-16', 'utf-32', 'utf_16', 'utf_32',
        !          1927:                                  'utf16', 'u16')):
        !          1928:                 xml_encoding = sniffed_xml_encoding
        !          1929:         return xml_data, xml_encoding, sniffed_xml_encoding
        !          1930: 
        !          1931: 
        !          1932:     def find_codec(self, charset):
        !          1933:         return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
        !          1934:                or (charset and self._codec(charset.replace("-", ""))) \
        !          1935:                or (charset and self._codec(charset.replace("-", "_"))) \
        !          1936:                or charset
        !          1937: 
        !          1938:     def _codec(self, charset):
        !          1939:         if not charset: return charset
        !          1940:         codec = None
        !          1941:         try:
        !          1942:             codecs.lookup(charset)
        !          1943:             codec = charset
        !          1944:         except (LookupError, ValueError):
        !          1945:             pass
        !          1946:         return codec
        !          1947: 
        !          1948:     EBCDIC_TO_ASCII_MAP = None
        !          1949:     def _ebcdic_to_ascii(self, s):
        !          1950:         c = self.__class__
        !          1951:         if not c.EBCDIC_TO_ASCII_MAP:
        !          1952:             emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
        !          1953:                     16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
        !          1954:                     128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
        !          1955:                     144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
        !          1956:                     32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
        !          1957:                     38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
        !          1958:                     45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
        !          1959:                     186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
        !          1960:                     195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
        !          1961:                     201,202,106,107,108,109,110,111,112,113,114,203,204,205,
        !          1962:                     206,207,208,209,126,115,116,117,118,119,120,121,122,210,
        !          1963:                     211,212,213,214,215,216,217,218,219,220,221,222,223,224,
        !          1964:                     225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
        !          1965:                     73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
        !          1966:                     82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
        !          1967:                     90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
        !          1968:                     250,251,252,253,254,255)
        !          1969:             import string
        !          1970:             c.EBCDIC_TO_ASCII_MAP = string.maketrans( \
        !          1971:             ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
        !          1972:         return s.translate(c.EBCDIC_TO_ASCII_MAP)
        !          1973: 
        !          1974:     MS_CHARS = { '\x80' : ('euro', '20AC'),
        !          1975:                  '\x81' : ' ',
        !          1976:                  '\x82' : ('sbquo', '201A'),
        !          1977:                  '\x83' : ('fnof', '192'),
        !          1978:                  '\x84' : ('bdquo', '201E'),
        !          1979:                  '\x85' : ('hellip', '2026'),
        !          1980:                  '\x86' : ('dagger', '2020'),
        !          1981:                  '\x87' : ('Dagger', '2021'),
        !          1982:                  '\x88' : ('circ', '2C6'),
        !          1983:                  '\x89' : ('permil', '2030'),
        !          1984:                  '\x8A' : ('Scaron', '160'),
        !          1985:                  '\x8B' : ('lsaquo', '2039'),
        !          1986:                  '\x8C' : ('OElig', '152'),
        !          1987:                  '\x8D' : '?',
        !          1988:                  '\x8E' : ('#x17D', '17D'),
        !          1989:                  '\x8F' : '?',
        !          1990:                  '\x90' : '?',
        !          1991:                  '\x91' : ('lsquo', '2018'),
        !          1992:                  '\x92' : ('rsquo', '2019'),
        !          1993:                  '\x93' : ('ldquo', '201C'),
        !          1994:                  '\x94' : ('rdquo', '201D'),
        !          1995:                  '\x95' : ('bull', '2022'),
        !          1996:                  '\x96' : ('ndash', '2013'),
        !          1997:                  '\x97' : ('mdash', '2014'),
        !          1998:                  '\x98' : ('tilde', '2DC'),
        !          1999:                  '\x99' : ('trade', '2122'),
        !          2000:                  '\x9a' : ('scaron', '161'),
        !          2001:                  '\x9b' : ('rsaquo', '203A'),
        !          2002:                  '\x9c' : ('oelig', '153'),
        !          2003:                  '\x9d' : '?',
        !          2004:                  '\x9e' : ('#x17E', '17E'),
        !          2005:                  '\x9f' : ('Yuml', ''),}
        !          2006: 
        !          2007: #######################################################################
        !          2008: 
        !          2009: 
        !          2010: #By default, act as an HTML pretty-printer.
        !          2011: if __name__ == '__main__':
        !          2012:     import sys
        !          2013:     soup = BeautifulSoup(sys.stdin)
        !          2014:     print soup.prettify()

FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>