annotate BeautifulSoup.py @ 146:fc2480e1f30f

more style
author casties
date Fri, 31 May 2013 21:15:32 +0200
parents bca61e893fcc
children
0
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
"""Beautiful Soup
Elixir and Tonic
"The Screen-Scraper's Friend"
http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup parses a (possibly invalid) XML or HTML document into a
tree representation. It provides methods and Pythonic idioms that make
it easy to navigate, search, and modify the tree.

A well-formed XML/HTML document yields a well-formed data
structure. An ill-formed XML/HTML document yields a correspondingly
ill-formed data structure. If your document is only locally
well-formed, you can use this library to find and process the
well-formed part of it.

Beautiful Soup works with Python 2.2 and up. It has no external
dependencies, but you'll have more success at converting data to UTF-8
if you also install these three packages:

* chardet, for auto-detecting character encodings
  http://chardet.feedparser.org/
* cjkcodecs and iconv_codec, which add more encodings to the ones supported
  by stock Python.
  http://cjkpython.i18n.org/

Beautiful Soup defines classes for two main parsing strategies:

 * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific
   language that kind of looks like XML.

 * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid
   or invalid. This class has web browser-like heuristics for
   obtaining a sensible parse tree in the face of common HTML errors.

Beautiful Soup also defines a class (UnicodeDammit) for autodetecting
the encoding of an HTML or XML document, and converting it to
Unicode. Much of this code is taken from Mark Pilgrim's Universal Feed Parser.

For more than you ever wanted to know about Beautiful Soup, see the
documentation:
http://www.crummy.com/software/BeautifulSoup/documentation.html

Here, have some legalese:

Copyright (c) 2004-2010, Leonard Richardson

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

  * Redistributions of source code must retain the above copyright
    notice, this list of conditions and the following disclaimer.

  * Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials provided
    with the distribution.

  * Neither the name of the the Beautiful Soup Consortium and All
    Night Kosher Bakery nor the names of its contributors may be
    used to endorse or promote products derived from this software
    without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.

"""
from __future__ import generators

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "3.2.0"
__copyright__ = "Copyright (c) 2004-2010 Leonard Richardson"
__license__ = "New-style BSD"

from sgmllib import SGMLParser, SGMLParseError
import codecs
import markupbase
import types
import re
import sgmllib
try:
    from htmlentitydefs import name2codepoint
except ImportError:
    name2codepoint = {}
try:
    set
except NameError:
    from sets import Set as set

#These hacks make Beautiful Soup able to parse XML with namespaces
sgmllib.tagfind = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')
markupbase._declname_match = re.compile(r'[a-zA-Z][-_.:a-zA-Z0-9]*\s*').match

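The two patterns assigned above include `:` in the character class for names, which is what lets a namespaced name such as `dc:title` be read as one token. A minimal sketch of that effect, using only the `re` module; the variable `NAME` and the sample strings are made up for illustration, and nothing here touches `sgmllib` or `markupbase`:

```python
import re

# Same character class as the patterns above. NAME is a hypothetical
# local name used only in this sketch.
NAME = re.compile('[a-zA-Z][-_.:a-zA-Z0-9]*')

# A namespaced name is consumed as a single token, colon included.
namespaced = NAME.match('dc:title>text')

# A plain name behaves as before.
plain = NAME.match('title>text')

print(namespaced.group())
print(plain.group())
```

Matching stops at the first character outside the class, here the closing `>`.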
DEFAULT_OUTPUT_ENCODING = "utf-8"

def _match_css_class(str):
    """Build a RE to match the given CSS class."""
    return re.compile(r"(^|.*\s)%s($|\s)" % str)

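As a rough illustration of what the pattern built by `_match_css_class` accepts, the sketch below rebuilds the same expression with only the `re` module. The function name `match_css_class` and the sample class strings are assumptions made for this example.

```python
import re

def match_css_class(name):
    """Mirror of the helper above: match `name` as one whole class token."""
    return re.compile(r"(^|.*\s)%s($|\s)" % name)

pattern = match_css_class("note")

# The class may stand alone or sit anywhere in a space-separated list.
hits = [s for s in ("note", "big note", "note big") if pattern.match(s)]

# Substrings of longer class names are not matched.
misses = [s for s in ("footnote", "notes") if pattern.match(s)]
```

The class name is interpolated into the pattern without escaping, so a name containing regular-expression metacharacters would be interpreted as part of the pattern.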
# First, the classes that represent markup elements.

class PageElement(object):
    """Contains the navigational information for some part of the page
    (either a tag or a piece of text)"""

    def setup(self, parent=None, previous=None):
        """Sets up the initial relations between this element and
        other elements."""
        self.parent = parent
        self.previous = previous
        self.next = None
        self.previousSibling = None
        self.nextSibling = None
        if self.parent and self.parent.contents:
            self.previousSibling = self.parent.contents[-1]
            self.previousSibling.nextSibling = self

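The sibling bookkeeping in `setup` can be tried in isolation. This is a minimal sketch, not the library class: `Node` is a hypothetical stand-in that keeps only the attributes `setup` touches for siblings, and the caller appends each child to `contents` itself, since `setup` as shown does not.

```python
class Node(object):
    """Hypothetical stand-in for a parse-tree element (not the library class)."""

    def __init__(self, name):
        self.name = name
        self.contents = []
        self.parent = None
        self.previousSibling = None
        self.nextSibling = None

    def setup(self, parent=None):
        # Same sibling logic as PageElement.setup: link to the parent's
        # current last child, if there is one.
        self.parent = parent
        self.previousSibling = None
        self.nextSibling = None
        if self.parent and self.parent.contents:
            self.previousSibling = self.parent.contents[-1]
            self.previousSibling.nextSibling = self

root = Node("root")
for name in ("a", "b"):
    child = Node(name)
    child.setup(root)
    # setup() only wires the links; the caller adds the child afterwards.
    root.contents.append(child)

a, b = root.contents
```

Because `setup` reads `parent.contents[-1]`, it has to run before the new element is appended, otherwise the element would be linked to itself.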
    def replaceWith(self, replaceWith):
        """Replaces this element in its parent with the given element."""
        oldParent = self.parent
        myIndex = self.parent.index(self)
        if hasattr(replaceWith, "parent")\
                  and replaceWith.parent is self.parent:
            # We're replacing this element with one of its siblings.
            index = replaceWith.parent.index(replaceWith)
            if index and index < myIndex:
                # Furthermore, it comes before this element. That
                # means that when we extract it, the index of this
                # element will change.
                myIndex = myIndex - 1
        self.extract()
        oldParent.insert(myIndex, replaceWith)

    def replaceWithChildren(self):
        """Replaces this element with its own children, keeping their order."""
        myParent = self.parent
        myIndex = self.parent.index(self)
        self.extract()
        # Insert the children last-to-first at the same index so that
        # they end up in their original order.
        reversedChildren = list(self.contents)
        reversedChildren.reverse()
        for child in reversedChildren:
            myParent.insert(myIndex, child)

    def extract(self):
        """Destructively rips this element out of the tree."""
        if self.parent:
            try:
                del self.parent.contents[self.parent.index(self)]
            except ValueError:
                pass

        #Find the two elements that would be next to each other if
        #this element (and any children) hadn't been parsed. Connect
        #the two.
        lastChild = self._lastRecursiveChild()
        nextElement = lastChild.next

        if self.previous:
            self.previous.next = nextElement
        if nextElement:
            nextElement.previous = self.previous
        self.previous = None
        lastChild.next = None

        self.parent = None
        if self.previousSibling:
            self.previousSibling.nextSibling = self.nextSibling
        if self.nextSibling:
            self.nextSibling.previousSibling = self.previousSibling
        self.previousSibling = self.nextSibling = None
        return self

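The middle part of `extract` treats the elements as a doubly linked list in parse order and cuts out the run that starts at the element and ends at its last descendant. The following is a simplified sketch of just that step, under the assumption that only the `previous`/`next` links matter; `Node`, `chain` and `unlink` are hypothetical names, not part of the library.

```python
class Node(object):
    """Hypothetical element with only the document-order links."""

    def __init__(self, name):
        self.name = name
        self.previous = None
        self.next = None

def chain(*nodes):
    # Link nodes in document order, as a parser would while reading.
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.previous = right, left
    return nodes

def unlink(first, last):
    """Cut the run first..last out of the chain and reconnect its neighbours."""
    before, after = first.previous, last.next
    if before:
        before.next = after
    if after:
        after.previous = before
    first.previous = None
    last.next = None

a, b, c, d = chain(Node("a"), Node("b"), Node("c"), Node("d"))
unlink(b, c)  # b plays the extracted element, c its last descendant
```

The links inside the removed run (`b.next`, `c.previous`) are left alone, so the extracted subtree stays internally navigable.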
    def _lastRecursiveChild(self):
        "Finds the last element beneath this object to be parsed."
        lastChild = self
        while hasattr(lastChild, 'contents') and lastChild.contents:
            lastChild = lastChild.contents[-1]
        return lastChild

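The walk performed by `_lastRecursiveChild` can be illustrated on a toy tree. This sketch assumes a hypothetical `Node` class with nothing but a `contents` list; the free function `last_recursive_child` copies the loop above.

```python
class Node(object):
    """Hypothetical minimal element; only `contents` matters here."""

    def __init__(self, name, contents=None):
        self.name = name
        self.contents = contents or []

def last_recursive_child(node):
    # Follow the last child repeatedly until reaching a node without children.
    last = node
    while hasattr(last, 'contents') and last.contents:
        last = last.contents[-1]
    return last

tree = Node("div", [Node("p"), Node("ul", [Node("li1"), Node("li2")])])
leaf = Node("br")

deepest = last_recursive_child(tree)
same = last_recursive_child(leaf)
```

An element with no children is its own last recursive child, which is why `extract` can call the method on leaves and containers alike.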
    def insert(self, position, newChild):
        """Inserts newChild among this element's children at the given
        position, first removing it from wherever it currently is."""
        if isinstance(newChild, basestring) \
            and not isinstance(newChild, NavigableString):
            newChild = NavigableString(newChild)

        position = min(position, len(self.contents))
        if hasattr(newChild, 'parent') and newChild.parent is not None:
            # We're 'inserting' an element that's already one
            # of this object's children.
            if newChild.parent is self:
                index = self.index(newChild)
                if index > position:
                    # Furthermore we're moving it further down the
                    # list of this object's children. That means that
                    # when we extract this element, our target index
                    # will jump down one.
                    position = position - 1
            newChild.extract()

        newChild.parent = self
        previousChild = None
        if position == 0:
            newChild.previousSibling = None
            newChild.previous = self
        else:
            previousChild = self.contents[position-1]
            newChild.previousSibling = previousChild
            newChild.previousSibling.nextSibling = newChild
            newChild.previous = previousChild._lastRecursiveChild()
        if newChild.previous:
            newChild.previous.next = newChild

        newChildsLastElement = newChild._lastRecursiveChild()

        if position >= len(self.contents):
            newChild.nextSibling = None

            parent = self
            parentsNextSibling = None
            while not parentsNextSibling:
                parentsNextSibling = parent.nextSibling
                parent = parent.parent
                if not parent: # This is the last element in the document.
                    break
            if parentsNextSibling:
                newChildsLastElement.next = parentsNextSibling
            else:
                newChildsLastElement.next = None
        else:
            nextChild = self.contents[position]
            newChild.nextSibling = nextChild
casties
parents:
diff changeset
240 if newChild.nextSibling:
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
241 newChild.nextSibling.previousSibling = newChild
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
242 newChildsLastElement.next = nextChild
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
243
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
244 if newChildsLastElement.next:
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
245 newChildsLastElement.next.previous = newChildsLastElement
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
246 self.contents.insert(position, newChild)
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
247
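# Illustrative sketch, not part of BeautifulSoup: the sibling bookkeeping
# done by insert() can be tried on a toy node type. The class name Node and
# its label attribute are assumptions made for this example. The sketch is
# written for Python 3, tracks only previousSibling/nextSibling, and leaves
# out the document-order next/previous links that insert() also maintains.

```python
class Node(object):
    """Hypothetical toy node that tracks parent and sibling links only."""

    def __init__(self, label):
        self.label = label
        self.parent = None
        self.contents = []
        self.previousSibling = None
        self.nextSibling = None

    def insert(self, position, child):
        # Clamp the position so that inserting past the end means "append".
        position = min(position, len(self.contents))
        child.parent = self

        # Neighbours of the slot the child is about to occupy.
        before = self.contents[position - 1] if position > 0 else None
        after = (self.contents[position]
                 if position < len(self.contents) else None)

        # Wire the child to its neighbours, then the neighbours to the child.
        child.previousSibling = before
        child.nextSibling = after
        if before is not None:
            before.nextSibling = child
        if after is not None:
            after.previousSibling = child

        self.contents.insert(position, child)


root = Node("root")
a, b, c = Node("a"), Node("b"), Node("c")
root.insert(0, a)
root.insert(1, c)
root.insert(1, b)  # lands between a and c
labels = [node.label for node in root.contents]
```

# Inserting b at position 1 relinks both of its neighbours, which is the
# same pair of updates insert() makes before touching self.contents.
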
    def append(self, tag):
        """Appends the given tag to the contents of this tag."""
        self.insert(len(self.contents), tag)

    def findNext(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears after this Tag in the document."""
        return self._findOne(self.findAllNext, name, attrs, text, **kwargs)

    def findAllNext(self, name=None, attrs={}, text=None, limit=None,
                    **kwargs):
        """Returns all items that match the given criteria and appear
        after this Tag in the document."""
        return self._findAll(name, attrs, text, limit, self.nextGenerator,
                             **kwargs)

    def findNextSibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears after this Tag in the document."""
        return self._findOne(self.findNextSiblings, name, attrs, text,
                             **kwargs)

    def findNextSiblings(self, name=None, attrs={}, text=None, limit=None,
                         **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear after this Tag in the document."""
        return self._findAll(name, attrs, text, limit,
                             self.nextSiblingGenerator, **kwargs)
    fetchNextSiblings = findNextSiblings # Compatibility with pre-3.x

    def findPrevious(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the first item that matches the given criteria and
        appears before this Tag in the document."""
        return self._findOne(self.findAllPrevious, name, attrs, text, **kwargs)

    def findAllPrevious(self, name=None, attrs={}, text=None, limit=None,
                        **kwargs):
        """Returns all items that match the given criteria and appear
        before this Tag in the document."""
        return self._findAll(name, attrs, text, limit, self.previousGenerator,
                             **kwargs)
    fetchPrevious = findAllPrevious # Compatibility with pre-3.x

    def findPreviousSibling(self, name=None, attrs={}, text=None, **kwargs):
        """Returns the closest sibling to this Tag that matches the
        given criteria and appears before this Tag in the document."""
        return self._findOne(self.findPreviousSiblings, name, attrs, text,
                             **kwargs)

    def findPreviousSiblings(self, name=None, attrs={}, text=None,
                             limit=None, **kwargs):
        """Returns the siblings of this Tag that match the given
        criteria and appear before this Tag in the document."""
        return self._findAll(name, attrs, text, limit,
                             self.previousSiblingGenerator, **kwargs)
    fetchPreviousSiblings = findPreviousSiblings # Compatibility with pre-3.x

    def findParent(self, name=None, attrs={}, **kwargs):
        """Returns the closest parent of this Tag that matches the given
        criteria."""
        # NOTE: We can't use _findOne because findParents takes a different
        # set of arguments.
        r = None
        # Forward **kwargs so keyword filters are not silently dropped.
        l = self.findParents(name, attrs, 1, **kwargs)
        if l:
            r = l[0]
        return r

    def findParents(self, name=None, attrs={}, limit=None, **kwargs):
        """Returns the parents of this Tag that match the given
        criteria."""

        return self._findAll(name, attrs, None, limit, self.parentGenerator,
                             **kwargs)
    fetchParents = findParents # Compatibility with pre-3.x

    #These methods do the real heavy lifting.

    def _findOne(self, method, name, attrs, text, **kwargs):
        r = None
        l = method(name, attrs, text, 1, **kwargs)
        if l:
            r = l[0]
        return r

    def _findAll(self, name, attrs, text, limit, generator, **kwargs):
        "Iterates over a generator looking for things that match."

        if isinstance(name, SoupStrainer):
            strainer = name
        # (Possibly) special case some findAll*(...) searches
        elif text is None and not limit and not attrs and not kwargs:
            # findAll*(True)
            if name is True:
                return [element for element in generator()
                        if isinstance(element, Tag)]
            # findAll*('tag-name')
            elif isinstance(name, basestring):
                return [element for element in generator()
                        if isinstance(element, Tag) and
                        element.name == name]
            else:
                strainer = SoupStrainer(name, attrs, text, **kwargs)
        # Build a SoupStrainer
        else:
            strainer = SoupStrainer(name, attrs, text, **kwargs)
        results = ResultSet(strainer)
        g = generator()
        while True:
            try:
                i = g.next()
            except StopIteration:
                break
            if i:
                found = strainer.search(i)
                if found:
                    results.append(found)
                    if limit and len(results) >= limit:
                        break
        return results

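# Illustrative sketch, not part of BeautifulSoup: the _findOne/_findAll
# pairing reduces to "scan a generator, keep what matches, stop at a limit".
# The names find_all, find_one, numbers and is_even are assumptions made for
# this example, and a plain predicate stands in for SoupStrainer. Written for
# Python 3. Unlike the truthiness test in _findAll, the sketch skips only
# None so that falsy items such as 0 could still be matched.

```python
def find_all(generator, predicate, limit=None):
    """Collect items from generator() that satisfy predicate, up to limit."""
    results = []
    for item in generator():
        if item is None:
            # The navigation generators end with a None sentinel; skip it.
            continue
        if predicate(item):
            results.append(item)
            if limit and len(results) >= limit:
                break
    return results


def find_one(generator, predicate):
    """Return the first match or None, by asking find_all for one result."""
    found = find_all(generator, predicate, limit=1)
    return found[0] if found else None


def numbers():
    # Stand-in for nextGenerator and friends, including the trailing None.
    for value in [1, 2, 3, 4, 5, 6, None]:
        yield value


def is_even(value):
    return value % 2 == 0


all_even = find_all(numbers, is_even)
first_two = find_all(numbers, is_even, limit=2)
first_even = find_one(numbers, is_even)
no_match = find_one(numbers, lambda value: value > 10)
```
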
    #These Generators can be used to navigate starting from both
    #NavigableStrings and Tags.
    def nextGenerator(self):
        i = self
        while i is not None:
            i = i.next
            yield i

    def nextSiblingGenerator(self):
        i = self
        while i is not None:
            i = i.nextSibling
            yield i

    def previousGenerator(self):
        i = self
        while i is not None:
            i = i.previous
            yield i

    def previousSiblingGenerator(self):
        i = self
        while i is not None:
            i = i.previousSibling
            yield i

    def parentGenerator(self):
        i = self
        while i is not None:
            i = i.parent
            yield i

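# Illustrative sketch, not part of BeautifulSoup: each generator above
# advances first and yields second, so the starting element is never
# produced and the last value yielded is None. The Link class and the
# next_generator function are assumptions made for this example, which is
# written for Python 3.

```python
class Link(object):
    """Hypothetical singly linked element with a value and a next pointer."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def next_generator(start):
    # Same shape as nextGenerator: step, then yield, until None is reached.
    i = start
    while i is not None:
        i = i.next
        yield i


chain = Link("a", Link("b", Link("c")))
walked = [item.value if item is not None else None
          for item in next_generator(chain)]
```

# The trailing None is why _findAll tests each item before searching it.
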
    # Utility methods
    def substituteEncoding(self, str, encoding=None):
        encoding = encoding or "utf-8"
        return str.replace("%SOUP-ENCODING%", encoding)

    def toEncoding(self, s, encoding=None):
        """Encodes an object to a string in some encoding, or to Unicode."""
        if isinstance(s, unicode):
            if encoding:
                s = s.encode(encoding)
        elif isinstance(s, str):
            if encoding:
                s = s.encode(encoding)
            else:
                s = unicode(s)
        else:
            if encoding:
                s = self.toEncoding(str(s), encoding)
            else:
                s = unicode(s)
        return s

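# Illustrative sketch, not part of BeautifulSoup: substituteEncoding is a
# plain placeholder replacement with a "utf-8" fallback. The function name
# substitute_encoding and the sample declaration string are assumptions made
# for this example, which is written for Python 3.

```python
def substitute_encoding(text, encoding=None):
    """Replace the %SOUP-ENCODING% placeholder, defaulting to utf-8."""
    encoding = encoding or "utf-8"
    return text.replace("%SOUP-ENCODING%", encoding)


declaration = "xml version='1.0' encoding='%SOUP-ENCODING%'"
default_result = substitute_encoding(declaration)
latin_result = substitute_encoding(declaration, "iso-8859-1")
untouched = substitute_encoding("no placeholder here", "iso-8859-1")
```
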
class NavigableString(unicode, PageElement):

    def __new__(cls, value):
        """Create a new NavigableString.

        When unpickling a NavigableString, this method is called with
        the string in DEFAULT_OUTPUT_ENCODING. That encoding needs to be
        passed in to the superclass's __new__ or the superclass won't know
        how to handle non-ASCII characters.
        """
        if isinstance(value, unicode):
            return unicode.__new__(cls, value)
        return unicode.__new__(cls, value, DEFAULT_OUTPUT_ENCODING)

    def __getnewargs__(self):
        return (NavigableString.__str__(self),)

    def __getattr__(self, attr):
        """text.string gives you text. This is for backwards
        compatibility for Navigable*String, but for CData* it lets you
        get the string without the CData wrapper."""
        if attr == 'string':
            return self
        else:
            raise AttributeError("'%s' object has no attribute '%s'"
                                 % (self.__class__.__name__, attr))

    def __unicode__(self):
        return str(self).decode(DEFAULT_OUTPUT_ENCODING)

    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        if encoding:
            return self.encode(encoding)
        else:
            return self

class CData(NavigableString):

    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<![CDATA[%s]]>" % NavigableString.__str__(self, encoding)

class ProcessingInstruction(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        output = self
        if "%SOUP-ENCODING%" in output:
            output = self.substituteEncoding(output, encoding)
        return "<?%s?>" % self.toEncoding(output, encoding)

class Comment(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<!--%s-->" % NavigableString.__str__(self, encoding)

class Declaration(NavigableString):
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return "<!%s>" % NavigableString.__str__(self, encoding)

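The string subclasses above differ mainly in the delimiters they wrap around their text when rendered. The following is a minimal sketch of that idea in Python 3 (the listing itself is Python 2). It is independent of Beautiful Soup: the names `Text`, `CDataText`, `CommentText`, and `render` are hypothetical stand-ins, and encoding handling is left out.

```python
class Text(str):
    """A plain string node; rendering returns the text unchanged."""
    PREFIX = ""
    SUFFIX = ""

    def render(self):
        # Wrap the raw text in this node type's delimiters.
        return self.PREFIX + str(self) + self.SUFFIX


class CDataText(Text):
    PREFIX = "<![CDATA["
    SUFFIX = "]]>"


class CommentText(Text):
    PREFIX = "<!--"
    SUFFIX = "-->"


nodes = [Text("plain"), CDataText("1 < 2"), CommentText(" note ")]
rendered = [node.render() for node in nodes]
```

Because each node is still a `str`, the unwrapped text stays available through ordinary string operations, which is the property the `string` attribute in the listing relies on.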
class Tag(PageElement):

    """Represents a found HTML tag with its attributes and contents."""

    def _invert(h):
        "Cheap function to invert a hash."
        i = {}
        for k,v in h.items():
            i[v] = k
        return i

    XML_ENTITIES_TO_SPECIAL_CHARS = { "apos" : "'",
                                      "quot" : '"',
                                      "amp" : "&",
                                      "lt" : "<",
                                      "gt" : ">" }

    XML_SPECIAL_CHARS_TO_ENTITIES = _invert(XML_ENTITIES_TO_SPECIAL_CHARS)

    def _convertEntities(self, match):
        """Used in a call to re.sub to replace HTML, XML, and numeric
        entities with the appropriate Unicode characters. If HTML
        entities are being converted, any unrecognized entities are
        escaped."""
        x = match.group(1)
        if self.convertHTMLEntities and x in name2codepoint:
            return unichr(name2codepoint[x])
        elif x in self.XML_ENTITIES_TO_SPECIAL_CHARS:
            if self.convertXMLEntities:
                return self.XML_ENTITIES_TO_SPECIAL_CHARS[x]
            else:
                return u'&%s;' % x
        elif len(x) > 0 and x[0] == '#':
            # Handle numeric entities
            if len(x) > 1 and x[1] == 'x':
                return unichr(int(x[2:], 16))
            else:
                return unichr(int(x[1:]))

        elif self.escapeUnrecognizedEntities:
            return u'&amp;%s;' % x
        else:
            return u'&%s;' % x

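The entity conversion described in the docstring above can be tried in isolation. This is a simplified sketch in Python 3 (the listing is Python 2), not the library's code: `convert_entities`, `ENTITY_RE`, `XML_ENTITIES`, and the `escape_unrecognized` parameter are names assumed for illustration, and the sketch always converts HTML and XML entities instead of consulting per-parser flags.

```python
import re
from html.entities import name2codepoint

XML_ENTITIES = {"apos": "'", "quot": '"', "amp": "&", "lt": "<", "gt": ">"}

# Decimal reference, hexadecimal reference, or named entity.
ENTITY_RE = re.compile(r"&(#\d+|#x[0-9a-fA-F]+|\w+);")


def convert_entities(text, escape_unrecognized=False):
    """Replace named and numeric entities in text with characters."""

    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        if name in XML_ENTITIES:
            return XML_ENTITIES[name]
        if name.startswith("#x"):
            return chr(int(name[2:], 16))
        if name.startswith("#"):
            return chr(int(name[1:]))
        # Unknown named entity: keep it, optionally escaping the ampersand.
        return ("&amp;%s;" if escape_unrecognized else "&%s;") % name

    return ENTITY_RE.sub(replace, text)


converted = convert_entities("caf&eacute; &lt;b&gt; &#65;&#x42; &apos;")
```

Passing a function as the replacement argument of `re.sub` is what lets each match be resolved individually, which mirrors how `_convertEntities` is used in the constructor.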
    def __init__(self, parser, name, attrs=None, parent=None,
                 previous=None):
        "Basic constructor."

        # We don't actually store the parser object: that lets extracted
        # chunks be garbage-collected
        self.parserClass = parser.__class__
        self.isSelfClosing = parser.isSelfClosingTag(name)
        self.name = name
        if attrs is None:
            attrs = []
        elif isinstance(attrs, dict):
            attrs = attrs.items()
        self.attrs = attrs
        self.contents = []
        self.setup(parent, previous)
        self.hidden = False
        self.containsSubstitutions = False
        self.convertHTMLEntities = parser.convertHTMLEntities
        self.convertXMLEntities = parser.convertXMLEntities
        self.escapeUnrecognizedEntities = parser.escapeUnrecognizedEntities

        # Convert any HTML, XML, or numeric entities in the attribute values.
        convert = lambda(k, val): (k,
                                   re.sub("&(#\d+|#x[0-9a-fA-F]+|\w+);",
                                          self._convertEntities,
                                          val))
        self.attrs = map(convert, self.attrs)

    def getString(self):
        if (len(self.contents) == 1
            and isinstance(self.contents[0], NavigableString)):
            return self.contents[0]

    def setString(self, string):
        """Replace the contents of the tag with a string"""
        self.clear()
        self.append(string)

    string = property(getString, setString)

    def getText(self, separator=u""):
        if not len(self.contents):
            return u""
        stopNode = self._lastRecursiveChild().next
        strings = []
        current = self.contents[0]
        while current is not stopNode:
            if isinstance(current, NavigableString):
                strings.append(current.strip())
            current = current.next
        return separator.join(strings)

    text = property(getText)

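The effect of `getText` (strip every string found under a tag, then join the pieces with a separator) can be illustrated on a plain nested list. This Python 3 sketch is an assumption-laden stand-in: it recurses over lists instead of following the `next` pointers used in the listing, and `get_text` and `tree` are hypothetical names.

```python
def get_text(node, separator=""):
    """Join the stripped strings found anywhere under node."""
    strings = []

    def walk(item):
        if isinstance(item, str):
            # Leaf: keep the text without surrounding whitespace.
            strings.append(item.strip())
        else:
            # Container: visit children in document order.
            for child in item:
                walk(child)

    walk(node)
    return separator.join(strings)


tree = ["  Hello ", ["world", ["  !"]], " bye "]
joined = get_text(tree, " ")
```

With the default empty separator the pieces run together, so callers that want readable output usually pass a space or newline.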
    def get(self, key, default=None):
        """Returns the value of the 'key' attribute for the tag, or
        the value given for 'default' if it doesn't have that
        attribute."""
        return self._getAttrMap().get(key, default)

    def clear(self):
        """Extract all children."""
        for child in self.contents[:]:
            child.extract()

    def index(self, element):
        for i, child in enumerate(self.contents):
            if child is element:
                return i
        raise ValueError("Tag.index: element not in tag")

    def has_key(self, key):
        return self._getAttrMap().has_key(key)

    def __getitem__(self, key):
        """tag[key] returns the value of the 'key' attribute for the tag,
        and throws an exception if it's not there."""
        return self._getAttrMap()[key]

    def __iter__(self):
        "Iterating over a tag iterates over its contents."
        return iter(self.contents)

    def __len__(self):
        "The length of a tag is the length of its list of contents."
        return len(self.contents)

    def __contains__(self, x):
        return x in self.contents

    def __nonzero__(self):
        "A tag is non-None even if it has no contents."
        return True

    def __setitem__(self, key, value):
        """Setting tag[key] sets the value of the 'key' attribute for the
        tag."""
        self._getAttrMap()
        self.attrMap[key] = value
        found = False
        for i in range(0, len(self.attrs)):
            if self.attrs[i][0] == key:
                self.attrs[i] = (key, value)
                found = True
        if not found:
            self.attrs.append((key, value))
        self._getAttrMap()[key] = value

    def __delitem__(self, key):
        "Deleting tag[key] deletes all 'key' attributes for the tag."
        # Iterate over a copy: removing from the list being iterated
        # would skip the item that follows each removal.
        for item in self.attrs[:]:
            if item[0] == key:
                self.attrs.remove(item)
                #We don't break because bad HTML can define the same
                #attribute multiple times.
        self._getAttrMap()
        if self.attrMap.has_key(key):
            del self.attrMap[key]

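Deleting every occurrence of a repeated attribute from a list of pairs is sensitive to how the loop is written: removing from a list while iterating over that same list skips the element after each removal. The Python 3 sketch below contrasts the two loops on plain lists; `delete_attr_buggy` and `delete_attr` are hypothetical helper names, not part of Beautiful Soup.

```python
def delete_attr_buggy(attrs, key):
    """Remove (key, value) pairs while iterating the same list."""
    for item in attrs:
        if item[0] == key:
            # Later items shift left, so the next one is never visited.
            attrs.remove(item)
    return attrs


def delete_attr(attrs, key):
    """Remove every (key, value) pair by iterating over a copy."""
    for item in attrs[:]:
        if item[0] == key:
            attrs.remove(item)
    return attrs


duplicated = [("class", "a"), ("class", "b"), ("id", "x")]
leftover = delete_attr_buggy(list(duplicated), "class")
cleaned = delete_attr(list(duplicated), "class")
```

Here `leftover` still contains the second `class` pair, while `cleaned` contains only the `id` pair.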
    def __call__(self, *args, **kwargs):
        """Calling a tag like a function is the same as calling its
        findAll() method. E.g. tag('a') returns a list of all the A tags
        found within this tag."""
        return apply(self.findAll, args, kwargs)

    def __getattr__(self, tag):
        #print "Getattr %s.%s" % (self.__class__, tag)
        if len(tag) > 3 and tag.rfind('Tag') == len(tag)-3:
            return self.find(tag[:-3])
        elif tag.find('__') != 0:
            return self.find(tag)
        raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__, tag)

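The attribute-access shortcut in `__getattr__` (treat an unknown attribute as a child tag name, with an optional `Tag` suffix) can be sketched on a toy tree. This is a Python 3 illustration under stated assumptions: the `Node` class and its depth-first `find` are hypothetical and far simpler than the library's search machinery.

```python
class Node:
    """Tiny tree node whose attribute access searches descendants by name."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find(self, name):
        # Depth-first search for the first descendant with this name.
        for child in self.children:
            if child.name == name:
                return child
            found = child.find(name)
            if found is not None:
                return found
        return None

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        if len(attr) > 3 and attr.endswith("Tag"):
            return self.find(attr[:-3])
        if not attr.startswith("__"):
            return self.find(attr)
        raise AttributeError(
            "%r object has no attribute %r" % (type(self).__name__, attr))


doc = Node("html", [Node("body", [Node("p"), Node("title")])])
first_paragraph = doc.p
body = doc.bodyTag
```

A side effect of this design is that a misspelled attribute silently yields `None` instead of raising, which is why names starting with a double underscore are excluded from the lookup.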
    def __eq__(self, other):
        """Returns true iff this tag has the same name, the same attributes,
        and the same contents (recursively) as the given tag.

        NOTE: right now this will return false if two tags have the
        same attributes in a different order. Should this be fixed?"""
        if other is self:
            return True
        if not hasattr(other, 'name') or not hasattr(other, 'attrs') or not hasattr(other, 'contents') or self.name != other.name or self.attrs != other.attrs or len(self) != len(other):
            return False
        for i in range(0, len(self.contents)):
            if self.contents[i] != other.contents[i]:
                return False
        return True

    def __ne__(self, other):
        """Returns true iff this tag is not identical to the other tag,
        as defined in __eq__."""
        return not self == other

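The NOTE in the `__eq__` docstring follows from comparing attributes as lists of pairs. The short Python 3 sketch below shows the difference between list, dictionary, and sorted comparisons on example data; the variable names are illustrative only, and neither alternative is presented as what the library should do.

```python
first = [("href", "/a"), ("class", "nav")]
second = [("class", "nav"), ("href", "/a")]

# Lists of pairs compare element by element, so order matters.
same_as_lists = first == second

# Dictionaries compare by key/value content, ignoring order, but they
# collapse repeated attribute names to the last value seen.
same_as_dicts = dict(first) == dict(second)

# Sorting keeps duplicates while removing the dependence on order.
same_when_sorted = sorted(first) == sorted(second)
```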
    def __repr__(self, encoding=DEFAULT_OUTPUT_ENCODING):
        """Renders this tag as a string."""
        return self.__str__(encoding)

    def __unicode__(self):
        return self.__str__(None)

    BARE_AMPERSAND_OR_BRACKET = re.compile("([<>]|"
                                           + "&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;)"
                                           + ")")

    def _sub_entity(self, x):
        """Used with a regular expression to substitute the
        appropriate XML entity for an XML special character."""
        return "&" + self.XML_SPECIAL_CHARS_TO_ENTITIES[x.group(0)[0]] + ";"

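A minimal standalone sketch of the escaping step above, written for Python 3. The pattern is the one compiled as BARE_AMPERSAND_OR_BRACKET; the entity map and the function name escape_bare_specials are assumptions made for illustration, since XML_SPECIAL_CHARS_TO_ENTITIES is defined elsewhere in the file.

```python
import re

# Same pattern as BARE_AMPERSAND_OR_BRACKET: an angle bracket, or an
# ampersand that does not already start a numeric or named entity.
BARE_AMPERSAND_OR_BRACKET = re.compile(
    r"([<>]|&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;))")

# Assumed mapping from special character to entity name (hypothetical
# stand-in for XML_SPECIAL_CHARS_TO_ENTITIES).
SPECIAL_CHARS_TO_ENTITIES = {"<": "lt", ">": "gt", "&": "amp"}


def escape_bare_specials(value):
    """Escape brackets and bare ampersands, leaving existing entities alone."""
    def sub_entity(match):
        return "&" + SPECIAL_CHARS_TO_ENTITIES[match.group(0)[0]] + ";"
    return BARE_AMPERSAND_OR_BRACKET.sub(sub_entity, value)


print(escape_bare_specials("a < b & c"))
print(escape_bare_specials("&amp; &#38; &#x26;"))
```

Note that the lookahead treats any "&word;" sequence as an entity, so text such as "AT&T;" is left untouched.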
    def __str__(self, encoding=DEFAULT_OUTPUT_ENCODING,
                prettyPrint=False, indentLevel=0):
        """Returns a string or Unicode representation of this tag and
        its contents. To get Unicode, pass None for encoding.

        NOTE: since Python's HTML parser consumes whitespace, this
        method is not certain to reproduce the whitespace present in
        the original string."""

        encodedName = self.toEncoding(self.name, encoding)

        attrs = []
        if self.attrs:
            for key, val in self.attrs:
                fmt = '%s="%s"'
                if isinstance(val, basestring):
                    if self.containsSubstitutions and '%SOUP-ENCODING%' in val:
                        val = self.substituteEncoding(val, encoding)

                    # The attribute value either:
                    #
                    # * Contains no embedded double quotes or single quotes.
                    #   No problem: we enclose it in double quotes.
                    # * Contains embedded single quotes. No problem:
                    #   double quotes work here too.
                    # * Contains embedded double quotes. No problem:
                    #   we enclose it in single quotes.
                    # * Embeds both single _and_ double quotes. This
                    #   can't happen naturally, but it can happen if
                    #   you modify an attribute value after parsing
                    #   the document. Now we have a bit of a
                    #   problem. We solve it by enclosing the
                    #   attribute in single quotes, and escaping any
                    #   embedded single quotes to XML entities.
                    if '"' in val:
                        fmt = "%s='%s'"
                        if "'" in val:
                            # Use the numeric reference rather than
                            # &apos;, which HTML 4 does not define.
                            val = val.replace("'", "&#39;")

                    # Now we're okay w/r/t quotes. But the attribute
                    # value might also contain angle brackets, or
                    # ampersands that aren't part of entities. We need
                    # to escape those to XML entities too.
                    val = self.BARE_AMPERSAND_OR_BRACKET.sub(self._sub_entity, val)

                attrs.append(fmt % (self.toEncoding(key, encoding),
                                    self.toEncoding(val, encoding)))
        close = ''
        closeTag = ''
        if self.isSelfClosing:
            close = ' /'
        else:
            closeTag = '</%s>' % encodedName

        indentTag, indentContents = 0, 0
        if prettyPrint:
            indentTag = indentLevel
            space = (' ' * (indentTag-1))
            indentContents = indentTag + 1
        contents = self.renderContents(encoding, prettyPrint, indentContents)
        if self.hidden:
            s = contents
        else:
            s = []
            attributeString = ''
            if attrs:
                attributeString = ' ' + ' '.join(attrs)
            if prettyPrint:
                s.append(space)
            s.append('<%s%s%s>' % (encodedName, attributeString, close))
            if prettyPrint:
                s.append("\n")
            s.append(contents)
            if prettyPrint and contents and contents[-1] != "\n":
                s.append("\n")
            if prettyPrint and closeTag:
                s.append(space)
            s.append(closeTag)
            if prettyPrint and closeTag and self.nextSibling:
                s.append("\n")
            s = ''.join(s)
        return s

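The four quoting cases listed in the comments of __str__ can be sketched in isolation as follows (Python 3). The function name quote_attribute is hypothetical, and escaping the single quote as the numeric reference &#39; in the both-quotes case is an assumption of this sketch.

```python
def quote_attribute(key, val):
    """Pick a quote style for an attribute value, as described above."""
    fmt = '%s="%s"'              # default: enclose in double quotes
    if '"' in val:
        fmt = "%s='%s'"          # value has double quotes: use single quotes
        if "'" in val:
            # Both kinds of quote are present: escape the single quotes.
            val = val.replace("'", "&#39;")
    return fmt % (key, val)


print(quote_attribute("title", "plain"))
print(quote_attribute("title", "it's"))
print(quote_attribute("title", 'say "hi"'))
print(quote_attribute("title", 'say "hi", it\'s me'))
```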
    def decompose(self):
        """Recursively destroys the contents of this tree."""
        self.extract()
        if len(self.contents) == 0:
            return
        current = self.contents[0]
        while current is not None:
            next = current.next
            if isinstance(current, Tag):
                del current.contents[:]
            current.parent = None
            current.previous = None
            current.previousSibling = None
            current.next = None
            current.nextSibling = None
            current = next

    def prettify(self, encoding=DEFAULT_OUTPUT_ENCODING):
        return self.__str__(encoding, True)

    def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                       prettyPrint=False, indentLevel=0):
        """Renders the contents of this tag as a string in the given
        encoding. If encoding is None, returns a Unicode string."""
        s=[]
        for c in self:
            text = None
            if isinstance(c, NavigableString):
                text = c.__str__(encoding)
            elif isinstance(c, Tag):
                s.append(c.__str__(encoding, prettyPrint, indentLevel))
            if text and prettyPrint:
                text = text.strip()
            if text:
                if prettyPrint:
                    s.append(" " * (indentLevel-1))
                s.append(text)
                if prettyPrint:
                    s.append("\n")
        return ''.join(s)

    #Soup methods

    def find(self, name=None, attrs={}, recursive=True, text=None,
             **kwargs):
        """Return only the first child of this Tag matching the given
        criteria."""
        r = None
        l = self.findAll(name, attrs, recursive, text, 1, **kwargs)
        if l:
            r = l[0]
        return r
    findChild = find

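The relationship between find and findAll (a search capped at one result, then unwrapped) can be illustrated with a small standalone sketch in Python 3. The names find_all and find_first and the predicate-based interface are hypothetical; how the real _findAll applies its limit is not shown in this part of the file, so the early exit here is an assumption.

```python
def find_all(items, predicate, limit=None):
    """Collect matching items, stopping early once `limit` is reached."""
    results = []
    for item in items:
        if predicate(item):
            results.append(item)
            if limit and len(results) >= limit:
                break
    return results


def find_first(items, predicate):
    """Return the first match, or None when nothing matches."""
    found = find_all(items, predicate, limit=1)
    return found[0] if found else None


tags = ["p", "a", "div", "a"]
print(find_all(tags, lambda t: t == "a"))
print(find_first(tags, lambda t: t == "div"))
print(find_first(tags, lambda t: t == "table"))
```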
    def findAll(self, name=None, attrs={}, recursive=True, text=None,
                limit=None, **kwargs):
        """Extracts a list of Tag objects that match the given
        criteria.  You can specify the name of the Tag and any
        attributes you want the Tag to have.

        The value of a key-value pair in the 'attrs' map can be a
        string, a list of strings, a regular expression object, or a
        callable that takes a string and returns whether or not the
        string matches for some custom definition of 'matches'. The
        same is true of the tag name."""
        generator = self.recursiveChildGenerator
        if not recursive:
            generator = self.childGenerator
        return self._findAll(name, attrs, text, limit, generator, **kwargs)
    findChildren = findAll

    # Pre-3.x compatibility methods
    first = find
    fetch = findAll

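The findAll docstring lists four kinds of matcher: a string, a list of strings, a regular expression object, or a callable. A simplified standalone sketch of such a dispatch follows (Python 3). The function name matches is hypothetical, and this is not the library's own _matches method, which is defined elsewhere and handles more cases.

```python
import re


def matches(value, match_against):
    """Test `value` against a string, list, compiled regex, or callable."""
    if callable(match_against):
        # Custom definition of "matches" supplied by the caller.
        return bool(match_against(value))
    if hasattr(match_against, "search"):
        # Compiled regular expression object.
        return value is not None and match_against.search(value) is not None
    if isinstance(match_against, (list, tuple)):
        return value in match_against
    return value == match_against


print(matches("b", ["a", "b"]))
print(matches("header-1", re.compile(r"^header-\d+$")))
print(matches("abc", lambda v: v is not None and len(v) == 3))
```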
    def fetchText(self, text=None, recursive=True, limit=None):
        return self.findAll(text=text, recursive=recursive, limit=limit)

    def firstText(self, text=None, recursive=True):
        return self.find(text=text, recursive=recursive)

    #Private methods

    def _getAttrMap(self):
        """Initializes a map representation of this tag's attributes,
        if not already initialized."""
        if not getattr(self, 'attrMap'):
            self.attrMap = {}
            for (key, value) in self.attrs:
                self.attrMap[key] = value
        return self.attrMap

    #Generator methods
    def childGenerator(self):
        # Just use the iterator from the contents
        return iter(self.contents)

    def recursiveChildGenerator(self):
        if not len(self.contents):
            # A bare return ends the generator; raising StopIteration
            # here becomes a RuntimeError under PEP 479.
            return
        # Stop at the node that follows the last descendant in
        # document order.
        stopNode = self._lastRecursiveChild().next
        current = self.contents[0]
        while current is not stopNode:
            yield current
            current = current.next


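The traversal in recursiveChildGenerator relies on each node's next pointer following document order, and stops at the node that comes after the last descendant. A self-contained sketch of that idea follows (Python 3). The Node class and the helpers link_document_order and last_recursive_child are hypothetical stand-ins for the library's own tree plumbing.

```python
class Node:
    """Tiny stand-in for a parse-tree node (hypothetical)."""
    def __init__(self, name, contents=()):
        self.name = name
        self.contents = list(contents)
        self.next = None


def link_document_order(root):
    """Set each node's .next to its successor in pre-order."""
    order = []

    def walk(node):
        order.append(node)
        for child in node.contents:
            walk(child)

    walk(root)
    for current, following in zip(order, order[1:]):
        current.next = following


def last_recursive_child(node):
    """Follow last children down to the deepest final descendant."""
    while node.contents:
        node = node.contents[-1]
    return node


def recursive_children(node):
    """Yield all descendants of `node` in document order."""
    if not node.contents:
        return
    stop = last_recursive_child(node).next
    current = node.contents[0]
    while current is not stop:
        yield current
        current = current.next


a = Node("a", [Node("b"), Node("c")])
root = Node("root", [a, Node("d")])
link_document_order(root)
print([n.name for n in recursive_children(a)])
print([n.name for n in recursive_children(root)])
```

Because the stop node is computed up front, iterating over the children of a does not run on into its sibling d.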
# Next, a couple classes to represent queries and their results.
class SoupStrainer:
    """Encapsulates a number of ways of matching a markup element (tag or
    text)."""

    def __init__(self, name=None, attrs={}, text=None, **kwargs):
        self.name = name
        if isinstance(attrs, basestring):
            kwargs['class'] = _match_css_class(attrs)
            attrs = None
        if kwargs:
            if attrs:
                attrs = attrs.copy()
                attrs.update(kwargs)
            else:
                attrs = kwargs
        self.attrs = attrs
        self.text = text

    def __str__(self):
        if self.text:
            return self.text
        else:
            return "%s|%s" % (self.name, self.attrs)

    def searchTag(self, markupName=None, markupAttrs={}):
        found = None
        markup = None
        if isinstance(markupName, Tag):
            markup = markupName
            markupAttrs = markup
        callFunctionWithTagData = callable(self.name) \
                                and not isinstance(markupName, Tag)

        if (not self.name) \
               or callFunctionWithTagData \
               or (markup and self._matches(markup, self.name)) \
               or (not markup and self._matches(markupName, self.name)):
            if callFunctionWithTagData:
                match = self.name(markupName, markupAttrs)
            else:
                match = True
                markupAttrMap = None
                for attr, matchAgainst in self.attrs.items():
                    if not markupAttrMap:
                        if hasattr(markupAttrs, 'get'):
                            markupAttrMap = markupAttrs
                        else:
                            markupAttrMap = {}
                            for k,v in markupAttrs:
                                markupAttrMap[k] = v
                    attrValue = markupAttrMap.get(attr)
                    if not self._matches(attrValue, matchAgainst):
                        match = False
                        break
940 if match:
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
941 if markup:
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
942 found = markup
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
943 else:
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
944 found = markupName
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
945 return found

    def search(self, markup):
        #print 'looking for %s in %s' % (self, markup)
        found = None
        # If given a list of items, scan it for a text element that
        # matches.
        if hasattr(markup, "__iter__") \
                and not isinstance(markup, Tag):
            for element in markup:
                if isinstance(element, NavigableString) \
                       and self.search(element):
                    found = element
                    break
        # If it's a Tag, make sure its name or attributes match.
        # Don't bother with Tags if we're searching for text.
        elif isinstance(markup, Tag):
            if not self.text:
                found = self.searchTag(markup)
        # If it's text, make sure the text matches.
        elif isinstance(markup, NavigableString) or \
                 isinstance(markup, basestring):
            if self._matches(markup, self.text):
                found = markup
        else:
            raise Exception, "I don't know how to match against a %s" \
                  % markup.__class__
        return found

    def _matches(self, markup, matchAgainst):
        #print "Matching %s against %s" % (markup, matchAgainst)
        result = False
        if matchAgainst is True:
            result = markup is not None
        elif callable(matchAgainst):
            result = matchAgainst(markup)
        else:
            # Custom match methods take the tag as an argument, but all
            # other ways of matching match the tag name as a string.
            if isinstance(markup, Tag):
                markup = markup.name
            if markup and not isinstance(markup, basestring):
                markup = unicode(markup)
            # Now we know that markup is either a string, or None.
            if hasattr(matchAgainst, 'match'):
                # It's a regexp object.
                result = markup and matchAgainst.search(markup)
            elif hasattr(matchAgainst, '__iter__'): # list-like
                result = markup in matchAgainst
            elif hasattr(matchAgainst, 'items'):
                result = matchAgainst.has_key(markup)
            elif matchAgainst and isinstance(markup, basestring):
                if isinstance(markup, unicode):
                    matchAgainst = unicode(matchAgainst)
                else:
                    matchAgainst = str(matchAgainst)

            if not result:
                result = matchAgainst == markup
        return result
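
# Illustration only, not part of the original module: a minimal Python 3
# sketch of the matching rules that _matches applies. The names `matches`,
# `value` and `match_against` are hypothetical, and the Tag-to-name step and
# the unicode/str distinction are left out for brevity.

```python
import re


def matches(value, match_against):
    """Simplified restatement of the rule order: True, callable, regex,
    list-like, then plain string equality."""
    if match_against is True:
        # True matches anything that is present at all.
        return value is not None
    if callable(match_against):
        # A callable decides for itself.
        return bool(match_against(value))
    if value is not None and not isinstance(value, str):
        value = str(value)
    if hasattr(match_against, "search"):
        # Compiled regular expression: searched, not anchored.
        return value is not None and match_against.search(value) is not None
    if isinstance(match_against, (list, tuple, set, frozenset)):
        # List-like: membership test.
        return value in match_against
    if match_against is not None and not isinstance(match_against, str):
        match_against = str(match_against)
    return value == match_against
```

# Under these assumptions, matches("table", re.compile("^t")) and
# matches("b", ["a", "b"]) both succeed, while matches(None, True) does not.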

class ResultSet(list):
    """A ResultSet is just a list that keeps track of the SoupStrainer
    that created it."""
    def __init__(self, source):
        list.__init__(self, [])
        self.source = source

# Now, some helper functions.

def buildTagMap(default, *args):
    """Turns a list of maps, lists, or scalars into a single map.
    Used to build the SELF_CLOSING_TAGS, NESTABLE_TAGS, and
    RESET_NESTING_TAGS maps out of lists and partial maps."""
    built = {}
    for portion in args:
        if hasattr(portion, 'items'):
            # It's a map. Merge it.
            for k,v in portion.items():
                built[k] = v
        elif hasattr(portion, '__iter__'): # is a list
            # It's a list. Map each item to the default.
            for k in portion:
                built[k] = default
        else:
            # It's a scalar. Map it to the default.
            built[portion] = default
    return built
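
# Illustration only, not part of the original module: a hedged Python 3
# sketch of how buildTagMap merges its arguments. `build_tag_map` and the
# sample tag names are hypothetical. In Python 2 a str has no __iter__, so
# the hasattr test above separates lists from strings; in Python 3 a str is
# iterable, so this sketch uses an explicit isinstance check instead.

```python
def build_tag_map(default, *args):
    """Merge maps, lists, and scalars into one dict of tag name -> value."""
    built = {}
    for portion in args:
        if hasattr(portion, "items"):
            # A map: copy its entries as they are.
            built.update(portion)
        elif isinstance(portion, (list, tuple, set, frozenset)):
            # A list-like: every item maps to the default.
            for key in portion:
                built[key] = default
        else:
            # A scalar (for example a single tag name).
            built[portion] = default
    return built


tag_map = build_tag_map(None, ["br", "hr"], {"p": ["div"]}, "img")
# tag_map == {'br': None, 'hr': None, 'p': ['div'], 'img': None}
```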

# Now, the parser classes.

class BeautifulStoneSoup(Tag, SGMLParser):

    """This class contains the basic parser and search code. It defines
    a parser that knows nothing about tag behavior except for the
    following:

      You can't close a tag without closing all the tags it encloses.
      That is, "<foo><bar></foo>" actually means
      "<foo><bar></bar></foo>".

    [Another possible explanation is "<foo><bar /></foo>", but since
    this class defines no SELF_CLOSING_TAGS, it will never use that
    explanation.]

    This class is useful for parsing XML or made-up markup languages,
    or when BeautifulSoup makes an assumption counter to what you were
    expecting."""

    SELF_CLOSING_TAGS = {}
    NESTABLE_TAGS = {}
    RESET_NESTING_TAGS = {}
    QUOTE_TAGS = {}
    PRESERVE_WHITESPACE_TAGS = []

    MARKUP_MASSAGE = [(re.compile(r'(<[^<>]*)/>'),
                       lambda x: x.group(1) + ' />'),
                      (re.compile(r'<!\s+([^<>]*)>'),
                       lambda x: '<!' + x.group(1) + '>')
                      ]

    ROOT_TAG_NAME = u'[document]'

    HTML_ENTITIES = "html"
    XML_ENTITIES = "xml"
    XHTML_ENTITIES = "xhtml"
    # TODO: This only exists for backwards-compatibility
    ALL_ENTITIES = XHTML_ENTITIES

    # Used when determining whether a text node is all whitespace and
    # can be replaced with a single space. A text node that contains
    # fancy Unicode spaces (usually non-breaking) should be left
    # alone.
    STRIP_ASCII_SPACES = { 9: None, 10: None, 12: None, 13: None, 32: None, }

    def __init__(self, markup="", parseOnlyThese=None, fromEncoding=None,
                 markupMassage=True, smartQuotesTo=XML_ENTITIES,
                 convertEntities=None, selfClosingTags=None, isHTML=False):
        """The Soup object is initialized as the 'root tag', and the
        provided markup (which can be a string or a file-like object)
        is fed into the underlying parser.

        sgmllib will process most bad HTML, and the BeautifulSoup
        class has some tricks for dealing with some HTML that kills
        sgmllib, but Beautiful Soup can nonetheless choke or lose data
        if your data uses self-closing tags or declarations
        incorrectly.

        By default, Beautiful Soup uses regexes to sanitize input,
        avoiding the vast majority of these problems. If the problems
        don't apply to you, pass in False for markupMassage, and
        you'll get better performance.

        The default parser massage techniques fix the two most common
        instances of invalid HTML that choke sgmllib:

         <br/> (No space between name of closing tag and tag close)
         <! --Comment--> (Extraneous whitespace in declaration)

        You can pass in a custom list of (RE object, replace method)
        tuples to get Beautiful Soup to scrub your input the way you
        want."""

        self.parseOnlyThese = parseOnlyThese
        self.fromEncoding = fromEncoding
        self.smartQuotesTo = smartQuotesTo
        self.convertEntities = convertEntities
        # Set the rules for how we'll deal with the entities we
        # encounter
        if self.convertEntities:
            # It doesn't make sense to convert encoded characters to
            # entities even while you're converting entities to Unicode.
            # Just convert it all to Unicode.
            self.smartQuotesTo = None
        if convertEntities == self.HTML_ENTITIES:
            self.convertXMLEntities = False
            self.convertHTMLEntities = True
            self.escapeUnrecognizedEntities = True
        elif convertEntities == self.XHTML_ENTITIES:
            self.convertXMLEntities = True
            self.convertHTMLEntities = True
            self.escapeUnrecognizedEntities = False
        elif convertEntities == self.XML_ENTITIES:
            self.convertXMLEntities = True
            self.convertHTMLEntities = False
            self.escapeUnrecognizedEntities = False
        else:
            self.convertXMLEntities = False
            self.convertHTMLEntities = False
            self.escapeUnrecognizedEntities = False

        self.instanceSelfClosingTags = buildTagMap(None, selfClosingTags)
        SGMLParser.__init__(self)

        if hasattr(markup, 'read'):  # It's a file-type object.
            markup = markup.read()
        self.markup = markup
        self.markupMassage = markupMassage
        try:
            self._feed(isHTML=isHTML)
        except StopParsing:
            pass
        self.markup = None  # The markup can now be GCed
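
# Illustration only, not part of the original class: a small Python 3 sketch
# of what the two default MARKUP_MASSAGE rules do to the inputs named in the
# __init__ docstring. `MASSAGE` and `massage` are hypothetical names, and
# applying the rules in list order with re.sub is an assumption about how the
# parser uses them.

```python
import re

# The two (pattern, replacement) pairs, copied from MARKUP_MASSAGE.
MASSAGE = [
    (re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />'),
    (re.compile(r'<!\s+([^<>]*)>'), lambda m: '<!' + m.group(1) + '>'),
]


def massage(markup, rules=MASSAGE):
    """Apply each (regex, replacement) rule in order and return the result."""
    for pattern, replacement in rules:
        markup = pattern.sub(replacement, markup)
    return markup


cleaned = massage('<br/><! --Comment-->')
# cleaned == '<br /><!--Comment-->'
```

# A custom list of (compiled regex, replacement function) tuples in the same
# shape could be passed as `rules` to experiment with other clean-ups.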

    def convert_charref(self, name):
        """This method fixes a bug in Python's SGMLParser."""
        try:
            n = int(name)
        except ValueError:
            return
        if not 0 <= n <= 127 : # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n)
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
1158
    def _feed(self, inDocumentEncoding=None, isHTML=False):
        # Convert the document to Unicode.
        markup = self.markup
        if isinstance(markup, unicode):
            if not hasattr(self, 'originalEncoding'):
                self.originalEncoding = None
        else:
            dammit = UnicodeDammit(
                markup, [self.fromEncoding, inDocumentEncoding],
                smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
            markup = dammit.unicode
            self.originalEncoding = dammit.originalEncoding
            self.declaredHTMLEncoding = dammit.declaredHTMLEncoding
        if markup:
            if self.markupMassage:
                if not hasattr(self.markupMassage, "__iter__"):
                    self.markupMassage = self.MARKUP_MASSAGE
                for fix, m in self.markupMassage:
                    markup = fix.sub(m, markup)
                # TODO: We get rid of markupMassage so that the
                # soup object can be deepcopied later on. Some
                # Python installations can't copy regexes. If anyone
                # was relying on the existence of markupMassage, this
                # might cause problems.
                del(self.markupMassage)
        self.reset()

        SGMLParser.feed(self, markup)
        # Close out any unfinished strings and close all the open tags.
        self.endData()
        while self.currentTag.name != self.ROOT_TAG_NAME:
            self.popTag()

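The massage step in `_feed` applies each `(compiled regex, replacement)` pair to the markup in order. A minimal standalone sketch of that loop follows. The rule list `EXAMPLE_MASSAGE` and the helper `apply_massage` are names invented here for illustration, and the two rules are assumptions rather than a statement of what `MARKUP_MASSAGE` contains.

```python
import re

# Hypothetical massage rules, for illustration only.
EXAMPLE_MASSAGE = [
    # Put a space before the slash of a self-closing tag: <br/> becomes <br />
    (re.compile(r'(<[^<>]*)/>'), lambda m: m.group(1) + ' />'),
    # Remove whitespace after the "<!" of a declaration: <! DOCTYPE becomes <!DOCTYPE
    (re.compile(r'<!\s+([^<>]*)>'), lambda m: '<!' + m.group(1) + '>'),
]


def apply_massage(markup, rules):
    """Apply each (pattern, replacement) pair in order, like the loop in _feed."""
    for fix, m in rules:
        markup = fix.sub(m, markup)
    return markup


print(apply_massage('<! DOCTYPE html><br/>', EXAMPLE_MASSAGE))
```

Because the rules run in sequence, a later rule sees the output of the earlier ones, so the order of the list matters.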
    def __getattr__(self, methodName):
        """This method routes method call requests to either the SGMLParser
        superclass or the Tag superclass, depending on the method name."""
        #print "__getattr__ called on %s.%s" % (self.__class__, methodName)

        if methodName.startswith('start_') or methodName.startswith('end_') \
               or methodName.startswith('do_'):
            return SGMLParser.__getattr__(self, methodName)
        elif not methodName.startswith('__'):
            return Tag.__getattr__(self, methodName)
        else:
            raise AttributeError

    def isSelfClosingTag(self, name):
        """Returns true iff the given string is the name of a
        self-closing tag according to this parser."""
        return self.SELF_CLOSING_TAGS.has_key(name) \
               or self.instanceSelfClosingTags.has_key(name)

    def reset(self):
        # The soup object is itself the hidden root tag of the tree.
        Tag.__init__(self, self, self.ROOT_TAG_NAME)
        self.hidden = 1
        SGMLParser.reset(self)
        self.currentData = []
        self.currentTag = None
        self.tagStack = []
        self.quoteStack = []
        self.pushTag(self)

    def popTag(self):
        tag = self.tagStack.pop()

        #print "Pop", tag.name
        # The new current tag is whatever is now on top of the stack.
        if self.tagStack:
            self.currentTag = self.tagStack[-1]
        return self.currentTag

    def pushTag(self, tag):
        #print "Push", tag.name
        # Attach the tag to its parent before making it the current tag.
        if self.currentTag:
            self.currentTag.contents.append(tag)
        self.tagStack.append(tag)
        self.currentTag = self.tagStack[-1]

    def endData(self, containerClass=NavigableString):
        if self.currentData:
            currentData = u''.join(self.currentData)
            # Collapse a run of pure ASCII whitespace to a single newline
            # or space, unless an enclosing tag preserves whitespace.
            if (currentData.translate(self.STRIP_ASCII_SPACES) == '' and
                not set([tag.name for tag in self.tagStack]).intersection(
                    self.PRESERVE_WHITESPACE_TAGS)):
                if '\n' in currentData:
                    currentData = '\n'
                else:
                    currentData = ' '
            self.currentData = []
            # When parsing selectively, drop top-level text that the
            # parseOnlyThese strainer does not match.
            if self.parseOnlyThese and len(self.tagStack) <= 1 and \
                   (not self.parseOnlyThese.text or \
                    not self.parseOnlyThese.search(currentData)):
                return
            # Wrap the text in its container class and link it into the tree.
            o = containerClass(currentData)
            o.setup(self.currentTag, self.previous)
            if self.previous:
                self.previous.next = o
            self.previous = o
            self.currentTag.contents.append(o)


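The whitespace rule in `endData` can be tried in isolation. The sketch below is a simplified stand-in, not the parser's code: `collapse_whitespace` is a hypothetical helper, the open tags are plain name strings instead of `Tag` objects, and the default `preserve` set is an assumption standing in for `PRESERVE_WHITESPACE_TAGS`.

```python
ASCII_SPACES = ' \t\n\r\x0c'  # the characters treated as collapsible whitespace


def collapse_whitespace(chunks, open_tags, preserve=frozenset(['pre', 'textarea'])):
    """Join buffered text chunks and collapse an all-whitespace result.

    chunks    -- non-empty list of text pieces buffered since the last tag
    open_tags -- names of the currently open tags (assumed plain strings)
    preserve  -- tag names inside which whitespace is kept (assumed set)
    """
    data = ''.join(chunks)
    only_spaces = data.strip(ASCII_SPACES) == ''
    if only_spaces and not preserve.intersection(open_tags):
        # Keep a newline if there was one, otherwise a single space.
        data = '\n' if '\n' in data else ' '
    return data


print(repr(collapse_whitespace(['  ', '\n', '  '], ['html', 'body'])))
print(repr(collapse_whitespace(['  \n '], ['html', 'pre'])))
```

Text that contains any non-whitespace character passes through unchanged; only runs made up entirely of whitespace are collapsed.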
    def _popToTag(self, name, inclusivePop=True):
        """Pops the tag stack up to and including the most recent
        instance of the given tag. If inclusivePop is false, pops the tag
        stack up to but *not* including the most recent instance of
        the given tag."""
        #print "Popping to %s" % name
        if name == self.ROOT_TAG_NAME:
            return

        numPops = 0
        mostRecentTag = None
        for i in range(len(self.tagStack)-1, 0, -1):
            if name == self.tagStack[i].name:
                numPops = len(self.tagStack)-i
                break
        if not inclusivePop:
            numPops = numPops - 1

        for i in range(0, numPops):
            mostRecentTag = self.popTag()
        return mostRecentTag

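The pop count in `_popToTag` is easiest to see on a plain list. This is a sketch under simplifying assumptions: the stack holds tag names as strings with the root at index 0, and `pop_to_tag` is a hypothetical helper that returns the stack instead of the most recently exposed tag.

```python
def pop_to_tag(stack, name, inclusive=True):
    """Pop `stack` down to the most recent entry equal to `name`.

    With inclusive=True the matching entry is popped as well; with
    inclusive=False it is left on top. The root at index 0 is never popped.
    """
    num_pops = 0
    # Scan from the top of the stack, stopping before the root.
    for i in range(len(stack) - 1, 0, -1):
        if stack[i] == name:
            num_pops = len(stack) - i
            break
    if not inclusive:
        num_pops -= 1
    # A negative count (name not found, exclusive pop) pops nothing.
    for _ in range(max(num_pops, 0)):
        stack.pop()
    return stack


print(pop_to_tag(['[document]', 'p', 'b', 'i'], 'p'))
print(pop_to_tag(['[document]', 'p', 'b', 'i'], 'p', inclusive=False))
```

If the name is not on the stack, nothing is popped, which is why a stray end tag leaves the tree untouched.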
    def _smartPop(self, name):
        """We need to pop up to the previous tag of this type, unless
        one of this tag's nesting reset triggers comes between this
        tag and the previous tag of this type, OR unless this tag is a
        generic nesting trigger and another generic nesting trigger
        comes between this tag and the previous tag of this type.

        Examples:
         <p>Foo<b>Bar *<p>* should pop to 'p', not 'b'.
         <p>Foo<table>Bar *<p>* should pop to 'table', not 'p'.
         <p>Foo<table><tr>Bar *<p>* should pop to 'tr', not 'p'.

         <li><ul><li> *<li>* should pop to 'ul', not the first 'li'.
         <tr><table><tr> *<tr>* should pop to 'table', not the first 'tr'
         <td><tr><td> *<td>* should pop to 'tr', not the first 'td'
        """

        nestingResetTriggers = self.NESTABLE_TAGS.get(name)
        isNestable = nestingResetTriggers is not None
        isResetNesting = self.RESET_NESTING_TAGS.has_key(name)
        popTo = None
        inclusive = True
        for i in range(len(self.tagStack)-1, 0, -1):
            p = self.tagStack[i]
            if (not p or p.name == name) and not isNestable:
                #Non-nestable tags get popped to the top or to their
                #last occurrence.
                popTo = name
                break
            if (nestingResetTriggers is not None
                and p.name in nestingResetTriggers) \
                or (nestingResetTriggers is None and isResetNesting
                    and self.RESET_NESTING_TAGS.has_key(p.name)):

                #If we encounter one of the nesting reset triggers
                #peculiar to this tag, or we encounter another tag
                #that causes nesting to reset, pop up to but not
                #including that tag.
                popTo = p.name
                inclusive = False
                break
            p = p.parent
        if popTo:
            self._popToTag(popTo, inclusive)

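The docstring examples for `_smartPop` can be replayed on a list of tag names. The sketch below is a simplified model, not the parser itself: `smart_pop` is a hypothetical helper, and the `NESTABLE` and `RESET_NESTING` tables are deliberately abridged assumptions that cover only the tags used in the examples.

```python
# Abridged, illustrative tables. Keys of NESTABLE are nestable tags and the
# values are the tags that reset their nesting.
NESTABLE = {'table': [], 'tr': ['table'], 'td': ['tr'], 'ul': [], 'li': ['ul']}
RESET_NESTING = {'p', 'table', 'tr', 'td', 'ul', 'li'}


def smart_pop(stack, name):
    """Return the open-tag stack after the implicit closes caused by opening `name`.

    stack[0] is the document root and is never popped.
    """
    triggers = NESTABLE.get(name)
    is_nestable = triggers is not None
    for i in range(len(stack) - 1, 0, -1):
        open_name = stack[i]
        if open_name == name and not is_nestable:
            # Non-nestable tag: close everything down to and including
            # its previous occurrence.
            return stack[:i]
        if (is_nestable and open_name in triggers) or (
                not is_nestable and name in RESET_NESTING
                and open_name in RESET_NESTING):
            # A reset trigger sits in between: close down to it, but keep it open.
            return stack[:i + 1]
    return list(stack)


print(smart_pop(['[document]', 'p', 'b'], 'p'))
print(smart_pop(['[document]', 'li', 'ul', 'li'], 'li'))
```

The first case closes both `b` and the earlier `p`; the second closes only the inner `li`, because the `ul` between the two list items resets nesting.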
    def unknown_starttag(self, name, attrs, selfClosing=0):
        #print "Start tag %s: %s" % (name, attrs)
        if self.quoteStack:
            #This is not a real tag.
            #print "<%s> is not real!" % name
            attrs = ''.join([' %s="%s"' % (x, y) for x, y in attrs])
            self.handle_data('<%s%s>' % (name, attrs))
            return
        self.endData()

        if not self.isSelfClosingTag(name) and not selfClosing:
            self._smartPop(name)

        if self.parseOnlyThese and len(self.tagStack) <= 1 \
               and (self.parseOnlyThese.text or not self.parseOnlyThese.searchTag(name, attrs)):
            return

        tag = Tag(self, name, attrs, self.currentTag, self.previous)
        if self.previous:
            self.previous.next = tag
        self.previous = tag
        self.pushTag(tag)
        if selfClosing or self.isSelfClosingTag(name):
            self.popTag()
        if name in self.QUOTE_TAGS:
            #print "Beginning quote (%s)" % name
            self.quoteStack.append(name)
            self.literal = 1
        return tag

    def unknown_endtag(self, name):
        #print "End tag %s" % name
        if self.quoteStack and self.quoteStack[-1] != name:
            #This is not a real end tag.
            #print "</%s> is not real!" % name
            self.handle_data('</%s>' % name)
            return
        self.endData()
        self._popToTag(name)
        if self.quoteStack and self.quoteStack[-1] == name:
            self.quoteStack.pop()
            self.literal = (len(self.quoteStack) > 0)

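The quote stack used by `unknown_starttag` and `unknown_endtag` makes markup inside a quoting tag come out as text. A small event replay can illustrate this. Everything here is an assumption made for the sketch: the `replay` helper, the event tuples, and the contents of `QUOTE_TAGS`; attributes are ignored to keep it short.

```python
QUOTE_TAGS = {'script', 'textarea'}  # assumed set of quoting tags


def replay(events):
    """Replay ('start', name), ('end', name) and ('data', text) events.

    Returns (tags, text): the events accepted as real tags, and the
    accumulated text, where markup seen inside a quoting tag is kept
    as literal text.
    """
    quote_stack, tags, text = [], [], []
    for kind, value in events:
        if kind == 'data':
            text.append(value)
        elif kind == 'start':
            if quote_stack:
                # Inside a quoting tag: not a real tag, re-emit it as text.
                text.append('<%s>' % value)
                continue
            tags.append(('start', value))
            if value in QUOTE_TAGS:
                quote_stack.append(value)
        elif kind == 'end':
            if quote_stack and quote_stack[-1] != value:
                # Only the matching end tag can close the quoted region.
                text.append('</%s>' % value)
                continue
            tags.append(('end', value))
            if quote_stack and quote_stack[-1] == value:
                quote_stack.pop()
    return tags, ''.join(text)


events = [('start', 'script'), ('data', 'x = "'), ('start', 'b'),
          ('data', 'hi'), ('end', 'b'), ('data', '";'),
          ('end', 'script'), ('start', 'p')]
print(replay(events))
```

The `<b>` pair inside the script is returned as text, while the `p` that follows the closing `script` tag is treated as a real tag again.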
    def handle_data(self, data):
        self.currentData.append(data)

    def _toStringSubclass(self, text, subclass):
        """Adds a certain piece of text to the tree as a NavigableString
        subclass."""
        self.endData()
        self.handle_data(text)
        self.endData(subclass)

    def handle_pi(self, text):
        """Handle a processing instruction as a ProcessingInstruction
        object, possibly one with a %SOUP-ENCODING% slot into which an
        encoding will be plugged later."""
        if text[:3] == "xml":
            text = u"xml version='1.0' encoding='%SOUP-ENCODING%'"
        self._toStringSubclass(text, ProcessingInstruction)

bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
1388 def handle_comment(self, text):
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
1389 "Handle comments as Comment objects."
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
1390 self._toStringSubclass(text, Comment)

    def handle_charref(self, ref):
        "Handle character references as data."
        if self.convertEntities:
            data = unichr(int(ref))
        else:
            data = '&#%s;' % ref
        self.handle_data(data)

    def handle_entityref(self, ref):
        """Handle entity references as data, possibly converting known
        HTML and/or XML entity references to the corresponding Unicode
        characters."""
        data = None
        if self.convertHTMLEntities:
            try:
                data = unichr(name2codepoint[ref])
            except KeyError:
                pass

        if not data and self.convertXMLEntities:
            data = self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref)

        if not data and self.convertHTMLEntities and \
            not self.XML_ENTITIES_TO_SPECIAL_CHARS.get(ref):
            # TODO: We've got a problem here. We're told this is
            # an entity reference, but it's not an XML entity
            # reference or an HTML entity reference. Nonetheless,
            # the logical thing to do is to pass it through as an
            # unrecognized entity reference.
            #
            # Except: when the input is "&carol;" this function
            # will be called with input "carol". When the input is
            # "AT&T", this function will be called with input
            # "T". We have no way of knowing whether a semicolon
            # was present originally, so we don't know whether
            # this is an unknown entity or just a misplaced
            # ampersand.
            #
            # The more common case is a misplaced ampersand, so I
            # escape the ampersand and omit the trailing semicolon.
            data = "&amp;%s" % ref
        if not data:
            # This case is different from the one above, because we
            # haven't already gone through a supposedly comprehensive
            # mapping of entities to Unicode characters. We might not
            # have gone through any mapping at all. So the chances are
            # very high that this is a real entity, and not a
            # misplaced ampersand.
            data = "&%s;" % ref
        self.handle_data(data)
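The fallback order in handle_entityref can be tried in isolation. The following is a minimal standalone sketch in Python 3, not part of this module: `resolve_entity`, its keyword flags, and the `XML_ENTITIES` table (the five predefined XML entities) are illustrative assumptions and may differ from the class's own XML_ENTITIES_TO_SPECIAL_CHARS.

```python
from html.entities import name2codepoint

# Assumption: the five predefined XML entities; the module's own table may differ.
XML_ENTITIES = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}


def resolve_entity(ref, convert_html=False, convert_xml=False):
    """Return the text to emit for an entity reference named `ref`."""
    data = None
    if convert_html and ref in name2codepoint:
        # Known HTML entity: convert it to the Unicode character.
        data = chr(name2codepoint[ref])
    if not data and convert_xml:
        data = XML_ENTITIES.get(ref)
    if not data and convert_html and ref not in XML_ENTITIES:
        # The HTML table was consulted and did not know the name, so a
        # stray ampersand is the likelier explanation: escape it and
        # do not add a semicolon.
        data = "&amp;%s" % ref
    if not data:
        # No comprehensive lookup happened: assume a real entity and
        # pass it through unchanged.
        data = "&%s;" % ref
    return data
```

With HTML conversion on, an unknown name such as "T" (from "AT&T") comes back as an escaped ampersand, while with all conversion off an unknown name such as "carol" is passed through as "&carol;".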

    def handle_decl(self, data):
        "Handle DOCTYPEs and the like as Declaration objects."
        self._toStringSubclass(data, Declaration)

    def parse_declaration(self, i):
        """Treat a bogus SGML declaration as raw data. Treat a CDATA
        declaration as a CData object."""
        j = None
        if self.rawdata[i:i+9] == '<![CDATA[':
            k = self.rawdata.find(']]>', i)
            if k == -1:
                k = len(self.rawdata)
            data = self.rawdata[i+9:k]
            j = k+3
            self._toStringSubclass(data, CData)
        else:
            try:
                j = SGMLParser.parse_declaration(self, i)
            except SGMLParseError:
                toHandle = self.rawdata[i:]
                self.handle_data(toHandle)
                j = i + len(toHandle)
        return j
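The CDATA branch of parse_declaration is plain index arithmetic, which the standalone Python 3 sketch below mirrors. The function name `split_cdata` and its return shape are hypothetical and used only for illustration; the real method stores the payload in the tree rather than returning it.

```python
def split_cdata(rawdata, i):
    """Return (payload, next_index) for a CDATA section starting at index i."""
    marker = "<![CDATA["
    if rawdata[i:i + len(marker)] != marker:
        raise ValueError("no CDATA section at index %d" % i)
    k = rawdata.find("]]>", i)
    if k == -1:
        # Unterminated section: treat everything to the end as payload.
        k = len(rawdata)
    # Skip the 9-character opener; resume 3 characters past the closer.
    return rawdata[i + len(marker):k], k + 3
```

As in the method above, the returned index can point past the end of the input when the section is never closed.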

class BeautifulSoup(BeautifulStoneSoup):

    """This parser knows the following facts about HTML:

    * Some tags have no closing tag and should be interpreted as being
      closed as soon as they are encountered.

    * The text inside some tags (e.g. 'script') may contain tags which
      are not really part of the document and which should be parsed
      as text, not tags. If you want to parse the text as tags, you can
      always fetch it and parse it explicitly.

    * Tag nesting rules:

      Most tags can't be nested at all. For instance, the occurrence of
      a <p> tag should implicitly close the previous <p> tag.

       <p>Para1<p>Para2
        should be transformed into:
       <p>Para1</p><p>Para2

      Some tags can be nested arbitrarily. For instance, the occurrence
      of a <blockquote> tag should _not_ implicitly close the previous
      <blockquote> tag.

       Alice said: <blockquote>Bob said: <blockquote>Blah
        should NOT be transformed into:
       Alice said: <blockquote>Bob said: </blockquote><blockquote>Blah

      Some tags can be nested, but the nesting is reset by the
      interposition of other tags. For instance, a <tr> tag should
      implicitly close the previous <tr> tag within the same <table>,
      but not close a <tr> tag in another table.

       <table><tr>Blah<tr>Blah
        should be transformed into:
       <table><tr>Blah</tr><tr>Blah
        but,
       <tr>Blah<table><tr>Blah
        should NOT be transformed into
       <tr>Blah<table></tr><tr>Blah

    Differing assumptions about tag nesting rules are a major source
    of problems with the BeautifulSoup class. If BeautifulSoup is not
    treating as nestable a tag your page author treats as nestable,
    try ICantBelieveItsBeautifulSoup, MinimalSoup, or
    BeautifulStoneSoup before writing your own subclass."""

    def __init__(self, *args, **kwargs):
        if 'smartQuotesTo' not in kwargs:
            kwargs['smartQuotesTo'] = self.HTML_ENTITIES
        kwargs['isHTML'] = True
        BeautifulStoneSoup.__init__(self, *args, **kwargs)

    SELF_CLOSING_TAGS = buildTagMap(None,
                                    ('br', 'hr', 'input', 'img', 'meta',
                                     'spacer', 'link', 'frame', 'base', 'col'))

    PRESERVE_WHITESPACE_TAGS = set(['pre', 'textarea'])

    QUOTE_TAGS = {'script' : None, 'textarea' : None}

    #According to the HTML standard, each of these inline tags can
    #contain another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_INLINE_TAGS = ('span', 'font', 'q', 'object', 'bdo', 'sub', 'sup',
                            'center')

    #According to the HTML standard, these block tags can contain
    #another tag of the same type. Furthermore, it's common
    #to actually use these tags this way.
    NESTABLE_BLOCK_TAGS = ('blockquote', 'div', 'fieldset', 'ins', 'del')

    #Lists can contain other lists, but there are restrictions.
    NESTABLE_LIST_TAGS = { 'ol' : [],
                           'ul' : [],
                           'li' : ['ul', 'ol'],
                           'dl' : [],
                           'dd' : ['dl'],
                           'dt' : ['dl'] }

    #Tables can contain other tables, but there are restrictions.
    NESTABLE_TABLE_TAGS = {'table' : [],
                           'tr' : ['table', 'tbody', 'tfoot', 'thead'],
                           'td' : ['tr'],
                           'th' : ['tr'],
                           'thead' : ['table'],
                           'tbody' : ['table'],
                           'tfoot' : ['table'],
                           }

    NON_NESTABLE_BLOCK_TAGS = ('address', 'form', 'p', 'pre')

    #If one of these tags is encountered, all tags up to the next tag of
    #this type are popped.
    RESET_NESTING_TAGS = buildTagMap(None, NESTABLE_BLOCK_TAGS, 'noscript',
                                     NON_NESTABLE_BLOCK_TAGS,
                                     NESTABLE_LIST_TAGS,
                                     NESTABLE_TABLE_TAGS)

    NESTABLE_TAGS = buildTagMap([], NESTABLE_INLINE_TAGS, NESTABLE_BLOCK_TAGS,
                                NESTABLE_LIST_TAGS, NESTABLE_TABLE_TAGS)
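The tables above are merged by buildTagMap, which is defined elsewhere in this file. As a rough standalone illustration in Python 3, and assuming (as the call sites suggest) that dicts are merged as-is while tuples and single names are mapped to the default value, a helper of that kind might look like the hypothetical `build_tag_map` below. It is an approximation, not the module's implementation.

```python
def build_tag_map(default, *portions):
    """Merge dicts, sequences of names, and single names into one dict."""
    built = {}
    for portion in portions:
        if isinstance(portion, dict):
            # A partial map: keep its own values (e.g. 'li' -> ['ul', 'ol']).
            built.update(portion)
        elif isinstance(portion, (list, tuple, set)):
            # A sequence of tag names: each maps to the default.
            for name in portion:
                built[name] = default
        else:
            # A single tag name such as 'noscript'.
            built[portion] = default
    return built
```

Under that assumption, a name taken from a tuple carries the default (None or an empty list), while a name taken from a dict keeps the list of tags that reset its nesting.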

    # Used to detect the charset in a META tag; see start_meta
    CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)

    def start_meta(self, attrs):
        """Beautiful Soup can detect a charset included in a META tag,
        try to convert the document to that charset, and re-parse the
        document from the beginning."""
        httpEquiv = None
        contentType = None
        contentTypeIndex = None
        tagNeedsEncodingSubstitution = False

        for i in range(0, len(attrs)):
            key, value = attrs[i]
            key = key.lower()
            if key == 'http-equiv':
                httpEquiv = value
            elif key == 'content':
                contentType = value
                contentTypeIndex = i

        if httpEquiv and contentType: # It's an interesting meta tag.
            match = self.CHARSET_RE.search(contentType)
            if match:
                if (self.declaredHTMLEncoding is not None or
                    self.originalEncoding == self.fromEncoding):
                    # An HTML encoding was sniffed while converting
                    # the document to Unicode, or an HTML encoding was
                    # sniffed during a previous pass through the
                    # document, or an encoding was specified
                    # explicitly and it worked. Rewrite the meta tag.
                    def rewrite(match):
                        return match.group(1) + "%SOUP-ENCODING%"
                    newAttr = self.CHARSET_RE.sub(rewrite, contentType)
                    attrs[contentTypeIndex] = (attrs[contentTypeIndex][0],
                                               newAttr)
                    tagNeedsEncodingSubstitution = True
                else:
                    # This is our first pass through the document.
                    # Go through it again with the encoding information.
                    newCharset = match.group(3)
                    if newCharset and newCharset != self.originalEncoding:
                        self.declaredHTMLEncoding = newCharset
                        self._feed(self.declaredHTMLEncoding)
                        raise StopParsing
            pass
        tag = self.unknown_starttag("meta", attrs)
        if tag and tagNeedsEncodingSubstitution:
            tag.containsSubstitutions = True
bca61e893fcc first checkin of MPIWGWeb r2 branch from CVS into mercurial
casties
parents:
diff changeset
1619
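# Illustration only, not part of Beautiful Soup: a minimal standalone
# sketch (written for Python 3) of the charset rewrite that the meta-tag
# handling above performs. CHARSET_RE here is an assumed pattern for a
# Content-Type value; rewrite_charset and declared_charset are
# hypothetical helper names introduced for this example.

```python
import re

# Assumed pattern: group 1 is the "charset=" prefix (with any leading
# ";" and whitespace), group 3 is the declared charset itself.
CHARSET_RE = re.compile(r"((^|;)\s*charset=)([^;]*)", re.M)


def rewrite_charset(content_type, placeholder="%SOUP-ENCODING%"):
    """Replace the declared charset with a placeholder, keeping the prefix."""
    return CHARSET_RE.sub(lambda m: m.group(1) + placeholder, content_type)


def declared_charset(content_type):
    """Return the charset named in a Content-Type value, or None."""
    match = CHARSET_RE.search(content_type)
    return match.group(3) if match else None


print(rewrite_charset("text/html; charset=ISO-8859-1"))
print(declared_charset("text/html; charset=utf-8"))
```

# The placeholder is substituted with the real output encoding later,
# when the tree is rendered, which is why the tag is flagged as
# containing substitutions.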
class StopParsing(Exception):
    pass

class ICantBelieveItsBeautifulSoup(BeautifulSoup):

    """The BeautifulSoup class is oriented towards skipping over
    common HTML errors like unclosed tags. However, sometimes it makes
    errors of its own. For instance, consider this fragment:

     <b>Foo<b>Bar</b></b>

    This is perfectly valid (if bizarre) HTML. However, the
    BeautifulSoup class will implicitly close the first b tag when it
    encounters the second 'b'. It will think the author wrote
    "<b>Foo<b>Bar", and didn't close the first 'b' tag, because
    there's no real-world reason to bold something that's already
    bold. When it encounters '</b></b>' it will close two more 'b'
    tags, for a grand total of three tags closed instead of two. This
    can throw off the rest of your document structure. The same is
    true of a number of other tags, listed below.

    It's much more common for someone to forget to close a 'b' tag
    than to actually use nested 'b' tags, and the BeautifulSoup class
    handles the common case. This class handles the not-so-common
    case: where you can't believe someone wrote what they did, but
    it's valid HTML and BeautifulSoup screwed up by assuming it
    wouldn't be."""

    I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS = \
     ('em', 'big', 'i', 'small', 'tt', 'abbr', 'acronym', 'strong',
      'cite', 'code', 'dfn', 'kbd', 'samp', 'strong', 'var', 'b',
      'big')

    I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS = ('noscript',)

    NESTABLE_TAGS = buildTagMap([], BeautifulSoup.NESTABLE_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_BLOCK_TAGS,
                                I_CANT_BELIEVE_THEYRE_NESTABLE_INLINE_TAGS)

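# Illustration only, not part of Beautiful Soup: a deliberately
# simplified toy (written for Python 3) showing how treating a tag as
# nestable changes the resulting structure for <b>Foo<b>Bar</b></b>.
# It is a sketch of the idea, not the parser's actual algorithm;
# TOKEN_RE and text_depths are hypothetical names, and the tokenizer
# only understands attribute-free tags.

```python
import re

# Matches an opening/closing tag without attributes, or a run of text.
TOKEN_RE = re.compile(r"<(/?)(\w+)>|([^<]+)")


def text_depths(markup, nestable=()):
    """Return (text, depth) pairs, where depth counts the open tags."""
    stack = []
    depths = []
    for closing, name, text in TOKEN_RE.findall(markup):
        if text:
            depths.append((text, len(stack)))
        elif closing:
            # Close the most recent matching tag; ignore stray end tags.
            if name in stack:
                while stack.pop() != name:
                    pass
        else:
            # A repeated tag that is not nestable implicitly closes the
            # earlier one before the new one is opened.
            if name in stack and name not in nestable:
                while stack.pop() != name:
                    pass
            stack.append(name)
    return depths


print(text_depths("<b>Foo<b>Bar</b></b>"))
print(text_depths("<b>Foo<b>Bar</b></b>", nestable=("b",)))
```

# With 'b' non-nestable both strings end up at the same depth; with 'b'
# nestable, "Bar" sits one level deeper, as the markup literally says.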
class MinimalSoup(BeautifulSoup):
    """The MinimalSoup class is for parsing HTML that contains
    pathologically bad markup. It makes no assumptions about tag
    nesting, but it does know which tags are self-closing, that
    <script> tags contain Javascript and should not be parsed, that
    META tags may contain encoding information, and so on.

    This also makes it better for subclassing than BeautifulStoneSoup
    or BeautifulSoup."""

    RESET_NESTING_TAGS = buildTagMap('noscript')
    NESTABLE_TAGS = {}

class BeautifulSOAP(BeautifulStoneSoup):
    """This class will push a tag with only a single string child into
    the tag's parent as an attribute. The attribute's name is the tag
    name, and the value is the string child. An example should give
    the flavor of the change:

    <foo><bar>baz</bar></foo>
     =>
    <foo bar="baz"><bar>baz</bar></foo>

    You can then access fooTag['bar'] instead of fooTag.barTag.string.

    This is, of course, useful for scraping structures that tend to
    use subelements instead of attributes, such as SOAP messages. Note
    that it modifies its input, so don't print the modified version
    out.

    I'm not sure how many people really want to use this class; let me
    know if you do. Mainly I like the name."""

    def popTag(self):
        if len(self.tagStack) > 1:
            tag = self.tagStack[-1]
            parent = self.tagStack[-2]
            parent._getAttrMap()
            if (isinstance(tag, Tag) and len(tag.contents) == 1 and
                isinstance(tag.contents[0], NavigableString) and
                not parent.attrMap.has_key(tag.name)):
                parent[tag.name] = tag.contents[0]
        BeautifulStoneSoup.popTag(self)

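# Illustration only, not part of Beautiful Soup: a minimal sketch
# (written for Python 3) of the same "promote a string-only child to a
# parent attribute" idea, using the standard library's ElementTree
# instead of this module's tag stack. promote_string_children is a
# hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET


def promote_string_children(root):
    """Copy each text-only child onto its parent as an attribute.

    An attribute the parent already has is left untouched, mirroring
    the has_key check in popTag above.
    """
    for parent in root.iter():
        for child in parent:
            only_text = len(child) == 0 and child.text is not None
            if only_text and child.tag not in parent.attrib:
                parent.set(child.tag, child.text)
    return root


foo = promote_string_children(ET.fromstring("<foo><bar>baz</bar></foo>"))
print(foo.get("bar"))
```

# As in the docstring's example, the child element stays in the tree;
# the attribute is an additional shortcut, so the tree no longer
# matches the original input.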
#Enterprise class names! It has come to our attention that some people
#think the names of the Beautiful Soup parser classes are too silly
#and "unprofessional" for use in enterprise screen-scraping. We feel
#your pain! For such-minded folk, the Beautiful Soup Consortium And
#All-Night Kosher Bakery recommends renaming this file to
#"RobustParser.py" (or, in cases of extreme enterprisiness,
#"RobustParserBeanInterface.class") and using the following
#enterprise-friendly class aliases:
class RobustXMLParser(BeautifulStoneSoup):
    pass
class RobustHTMLParser(BeautifulSoup):
    pass
class RobustWackAssHTMLParser(ICantBelieveItsBeautifulSoup):
    pass
class RobustInsanelyWackAssHTMLParser(MinimalSoup):
    pass
class SimplifyingSOAPParser(BeautifulSOAP):
    pass

######################################################
#
# Bonus library: Unicode, Dammit
#
# This class forces XML data into a standard format (usually to UTF-8
# or Unicode). It is heavily based on code from Mark Pilgrim's
# Universal Feed Parser. It does not rewrite the XML or HTML to
# reflect a new encoding: that happens in BeautifulStoneSoup.handle_pi
# (XML) and BeautifulSoup.start_meta (HTML).

# Autodetects character encodings.
# Download from http://chardet.feedparser.org/
try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except ImportError:
    chardet = None

# cjkcodecs and iconv_codec make Python know about more character encodings.
# Both are available from http://cjkpython.i18n.org/
# They're built in if you use Python 2.4.
try:
    import cjkcodecs.aliases
except ImportError:
    pass
try:
    import iconv_codec
except ImportError:
    pass

class UnicodeDammit:
    """A class for detecting the encoding of a *ML document and
    converting it to a Unicode string. If the source encoding is
    windows-1252, can replace MS smart quotes with their HTML or XML
    equivalents."""

    # This dictionary maps commonly seen values for "charset" in HTML
    # meta tags to the corresponding Python codec names. It only covers
    # values that aren't in Python's aliases and can't be determined
    # by the heuristics in find_codec.
    CHARSET_ALIASES = { "macintosh" : "mac-roman",
                        "x-sjis" : "shift-jis" }

    def __init__(self, markup, overrideEncodings=[],
                 smartQuotesTo='xml', isHTML=False):
        self.declaredHTMLEncoding = None
        self.markup, documentEncoding, sniffedEncoding = \
                     self._detectEncoding(markup, isHTML)
        self.smartQuotesTo = smartQuotesTo
        self.triedEncodings = []
        if markup == '' or isinstance(markup, unicode):
            self.originalEncoding = None
            self.unicode = unicode(markup)
            return

        u = None
        for proposedEncoding in overrideEncodings:
            u = self._convertFrom(proposedEncoding)
            if u: break
        if not u:
            for proposedEncoding in (documentEncoding, sniffedEncoding):
                u = self._convertFrom(proposedEncoding)
                if u: break

        # If no luck and we have auto-detection library, try that:
        if not u and chardet and not isinstance(self.markup, unicode):
            u = self._convertFrom(chardet.detect(self.markup)['encoding'])

        # As a last resort, try utf-8 and windows-1252:
        if not u:
            for proposed_encoding in ("utf-8", "windows-1252"):
                u = self._convertFrom(proposed_encoding)
                if u: break

        self.unicode = u
        if not u: self.originalEncoding = None

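# Illustration only, not part of Beautiful Soup: a minimal standalone
# sketch (written for Python 3) of the fallback order used by the
# constructor above: caller-supplied encodings first, then utf-8 and
# windows-1252 as a last resort, never trying the same codec twice.
# It leaves out the declared/sniffed encodings and the optional chardet
# step; decode_with_fallbacks is a hypothetical name.

```python
def decode_with_fallbacks(data, candidates=()):
    """Return (text, encoding) for the first codec that decodes data.

    Returns (None, None) if every candidate fails.
    """
    tried = []
    for encoding in list(candidates) + ["utf-8", "windows-1252"]:
        if not encoding or encoding.lower() in tried:
            continue
        tried.append(encoding.lower())
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next one.
            continue
    return None, None


print(decode_with_fallbacks(b"caf\xc3\xa9"))
print(decode_with_fallbacks(b"caf\xe9"))
```

# The byte 0xE9 on its own is not valid UTF-8, so the second call falls
# through to windows-1252.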
    def _subMSChar(self, orig):
        """Changes a MS smart quote character to an XML or HTML
        entity."""
        sub = self.MS_CHARS.get(orig)
        if isinstance(sub, tuple):
            if self.smartQuotesTo == 'xml':
                sub = '&#x%s;' % sub[1]
            else:
                sub = '&%s;' % sub[0]
        return sub

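# Illustration only, not part of Beautiful Soup: a minimal sketch
# (written for Python 3, operating on bytes) of the smart-quote
# substitution described above. MS_CHARS_SAMPLE is a small assumed
# table covering four windows-1252 quote bytes, not the module's full
# MS_CHARS mapping; sub_smart_quotes is a hypothetical name.

```python
import re

# byte -> (HTML entity name, Unicode code point in hex)
MS_CHARS_SAMPLE = {
    b"\x91": ("lsquo", "2018"),
    b"\x92": ("rsquo", "2019"),
    b"\x93": ("ldquo", "201C"),
    b"\x94": ("rdquo", "201D"),
}


def sub_smart_quotes(data, smart_quotes_to="xml"):
    """Replace known smart-quote bytes with XML or HTML entities."""
    def replace(match):
        entry = MS_CHARS_SAMPLE.get(match.group(0))
        if entry is None:
            # Not in the sample table: leave the byte as it is.
            return match.group(0)
        name, codepoint = entry
        if smart_quotes_to == "xml":
            return ("&#x%s;" % codepoint).encode("ascii")
        return ("&%s;" % name).encode("ascii")

    return re.sub(rb"[\x80-\x9f]", replace, data)


print(sub_smart_quotes(b"\x93hi\x94"))
print(sub_smart_quotes(b"\x93hi\x94", smart_quotes_to="html"))
```

# Doing the substitution before decoding means the result is plain
# ASCII in those positions, whatever codec is tried next.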
    def _convertFrom(self, proposed):
        proposed = self.find_codec(proposed)
        if not proposed or proposed in self.triedEncodings:
            return None
        self.triedEncodings.append(proposed)
        markup = self.markup

        # Convert smart quotes to HTML if coming from an encoding
        # that might have them.
        if self.smartQuotesTo and proposed.lower() in("windows-1252",
                                                      "iso-8859-1",
                                                      "iso-8859-2"):
            markup = re.compile("([\x80-\x9f])").sub \
                     (lambda(x): self._subMSChar(x.group(1)),
                      markup)

        try:
            # print "Trying to convert document to %s" % proposed
            u = self._toUnicode(markup, proposed)
            self.markup = u
            self.originalEncoding = proposed
        except Exception, e:
            # print "That didn't work!"
            # print e
            return None
        #print "Correct encoding: %s" % proposed
        return self.markup

    def _toUnicode(self, data, encoding):
        '''Given a string and its encoding, decodes the string into Unicode.
        %encoding is a string recognized by encodings.aliases'''

        # strip Byte Order Mark (if present)
        if (len(data) >= 4) and (data[:2] == '\xfe\xff') \
               and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16be'
            data = data[2:]
        elif (len(data) >= 4) and (data[:2] == '\xff\xfe') \
                 and (data[2:4] != '\x00\x00'):
            encoding = 'utf-16le'
            data = data[2:]
        elif data[:3] == '\xef\xbb\xbf':
            encoding = 'utf-8'
            data = data[3:]
        elif data[:4] == '\x00\x00\xfe\xff':
            encoding = 'utf-32be'
            data = data[4:]
        elif data[:4] == '\xff\xfe\x00\x00':
            encoding = 'utf-32le'
            data = data[4:]
        newdata = unicode(data, encoding)
        return newdata

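The BOM checks in `_toUnicode` can be tried in isolation. What follows is a minimal Python 3 sketch working on `bytes`, not the module's own code; the helper name `strip_bom` is hypothetical, and the BOM constants come from the standard `codecs` module.

```python
import codecs

# Hypothetical helper (not part of BeautifulSoup): mirrors the order of the
# BOM checks in _toUnicode, but for Python 3 bytes objects.
def strip_bom(data, encoding):
    """Return (data_without_bom, encoding); a BOM overrides the proposed encoding."""
    # UTF-16 BOMs are only accepted when the next two bytes are not NUL,
    # so that a UTF-32LE BOM (ff fe 00 00) is not mistaken for UTF-16LE.
    if len(data) >= 4 and data[:2] == codecs.BOM_UTF16_BE and data[2:4] != b'\x00\x00':
        return data[2:], 'utf-16be'
    if len(data) >= 4 and data[:2] == codecs.BOM_UTF16_LE and data[2:4] != b'\x00\x00':
        return data[2:], 'utf-16le'
    if data[:3] == codecs.BOM_UTF8:
        return data[3:], 'utf-8'
    if data[:4] == codecs.BOM_UTF32_BE:
        return data[4:], 'utf-32be'
    if data[:4] == codecs.BOM_UTF32_LE:
        return data[4:], 'utf-32le'
    # No BOM found: keep the caller's proposed encoding.
    return data, encoding
```

A caller would then decode with the returned pair, for example `payload, enc = strip_bom(raw, 'ascii')` followed by `payload.decode(enc)`.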
    def _detectEncoding(self, xml_data, isHTML=False):
        """Given a document, tries to detect its XML encoding."""
        xml_encoding = sniffed_xml_encoding = None
        try:
            if xml_data[:4] == '\x4c\x6f\xa7\x94':
                # EBCDIC
                xml_data = self._ebcdic_to_ascii(xml_data)
            elif xml_data[:4] == '\x00\x3c\x00\x3f':
                # UTF-16BE
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data, 'utf-16be').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xfe\xff') \
                     and (xml_data[2:4] != '\x00\x00'):
                # UTF-16BE with BOM
                sniffed_xml_encoding = 'utf-16be'
                xml_data = unicode(xml_data[2:], 'utf-16be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x3f\x00':
                # UTF-16LE
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data, 'utf-16le').encode('utf-8')
            elif (len(xml_data) >= 4) and (xml_data[:2] == '\xff\xfe') and \
                     (xml_data[2:4] != '\x00\x00'):
                # UTF-16LE with BOM
                sniffed_xml_encoding = 'utf-16le'
                xml_data = unicode(xml_data[2:], 'utf-16le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\x00\x3c':
                # UTF-32BE
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data, 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\x3c\x00\x00\x00':
                # UTF-32LE
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data, 'utf-32le').encode('utf-8')
            elif xml_data[:4] == '\x00\x00\xfe\xff':
                # UTF-32BE with BOM
                sniffed_xml_encoding = 'utf-32be'
                xml_data = unicode(xml_data[4:], 'utf-32be').encode('utf-8')
            elif xml_data[:4] == '\xff\xfe\x00\x00':
                # UTF-32LE with BOM
                sniffed_xml_encoding = 'utf-32le'
                xml_data = unicode(xml_data[4:], 'utf-32le').encode('utf-8')
            elif xml_data[:3] == '\xef\xbb\xbf':
                # UTF-8 with BOM
                sniffed_xml_encoding = 'utf-8'
                xml_data = unicode(xml_data[3:], 'utf-8').encode('utf-8')
            else:
                sniffed_xml_encoding = 'ascii'
                pass
        except:
            xml_encoding_match = None
        xml_encoding_match = re.compile(
            '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
        if not xml_encoding_match and isHTML:
            regexp = re.compile('<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)
            xml_encoding_match = regexp.search(xml_data)
        if xml_encoding_match is not None:
            xml_encoding = xml_encoding_match.groups()[0].lower()
            if isHTML:
                self.declaredHTMLEncoding = xml_encoding
            if sniffed_xml_encoding and \
               (xml_encoding in ('iso-10646-ucs-2', 'ucs-2', 'csunicode',
                                 'iso-10646-ucs-4', 'ucs-4', 'csucs4',
                                 'utf-16', 'utf-32', 'utf_16', 'utf_32',
                                 'utf16', 'u16')):
                xml_encoding = sniffed_xml_encoding
        return xml_data, xml_encoding, sniffed_xml_encoding


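The second half of `_detectEncoding` looks for a declared encoding, first in an XML declaration and then, for HTML, in a `<meta>` tag. The sketch below isolates that step using the same two regular expressions, adapted to Python 3 `bytes`. The function name `declared_encoding` is hypothetical, and the sketch leaves out the byte-pattern sniffing and the override by the sniffed encoding.

```python
import re

# Same patterns as in _detectEncoding, written as raw bytes literals.
XML_DECL = re.compile(rb'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')
META_CHARSET = re.compile(rb'<\s*meta[^>]+charset=([^>]*?)[;\'">]', re.I)

# Hypothetical helper (not part of BeautifulSoup).
def declared_encoding(data, is_html=False):
    """Return the lower-cased declared encoding, or None if none is declared."""
    match = XML_DECL.match(data)
    if not match and is_html:
        # Only HTML documents fall back to a <meta ... charset=...> tag.
        match = META_CHARSET.search(data)
    if match is None:
        return None
    return match.group(1).decode('ascii', 'replace').lower()
```

For example, `declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')` yields `'iso-8859-1'`, while a document with only a `<meta>` charset yields a result only when `is_html=True`.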
    def find_codec(self, charset):
        return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \
               or (charset and self._codec(charset.replace("-", ""))) \
               or (charset and self._codec(charset.replace("-", "_"))) \
               or charset

    def _codec(self, charset):
        if not charset: return charset
        codec = None
        try:
            codecs.lookup(charset)
            codec = charset
        except (LookupError, ValueError):
            pass
        return codec

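`find_codec` tries progressively looser spellings of a charset name and returns the first one Python's codec registry accepts, or the original name if none is. A minimal standalone sketch of that fallback chain, written for Python 3, might look as follows. The names `codec_or_none`, `find_codec_sketch` and the `ALIASES` table are hypothetical stand-ins: the class's own `CHARSET_ALIASES` is defined outside this section.

```python
import codecs

# Hypothetical alias table, for illustration only.
ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}

def codec_or_none(charset):
    """Return charset if the codec registry knows it, else None."""
    if not charset:
        return charset
    try:
        codecs.lookup(charset)
        return charset
    except (LookupError, ValueError):
        return None

def find_codec_sketch(charset):
    # Try the alias (or the name itself), then the name with hyphens removed,
    # then with hyphens turned into underscores; finally give the name back.
    return (codec_or_none(ALIASES.get(charset, charset))
            or (charset and codec_or_none(charset.replace("-", "")))
            or (charset and codec_or_none(charset.replace("-", "_")))
            or charset)
```

Because the last operand is `charset` itself, an unknown name is passed through unchanged rather than rejected, so the caller finds out it is unusable only when decoding fails.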
    EBCDIC_TO_ASCII_MAP = None
    def _ebcdic_to_ascii(self, s):
        c = self.__class__
        if not c.EBCDIC_TO_ASCII_MAP:
            emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,
                    16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,
                    128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,
                    144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,
                    32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,
                    38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,
                    45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,
                    186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,
                    195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,
                    201,202,106,107,108,109,110,111,112,113,114,203,204,205,
                    206,207,208,209,126,115,116,117,118,119,120,121,122,210,
                    211,212,213,214,215,216,217,218,219,220,221,222,223,224,
                    225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,
                    73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,
                    82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,
                    90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,
                    250,251,252,253,254,255)
            import string
            c.EBCDIC_TO_ASCII_MAP = string.maketrans(
                ''.join(map(chr, range(256))), ''.join(map(chr, emap)))
        return s.translate(c.EBCDIC_TO_ASCII_MAP)

    MS_CHARS = { '\x80' : ('euro', '20AC'),
                 '\x81' : ' ',
                 '\x82' : ('sbquo', '201A'),
                 '\x83' : ('fnof', '192'),
                 '\x84' : ('bdquo', '201E'),
                 '\x85' : ('hellip', '2026'),
                 '\x86' : ('dagger', '2020'),
                 '\x87' : ('Dagger', '2021'),
                 '\x88' : ('circ', '2C6'),
                 '\x89' : ('permil', '2030'),
                 '\x8A' : ('Scaron', '160'),
                 '\x8B' : ('lsaquo', '2039'),
                 '\x8C' : ('OElig', '152'),
                 '\x8D' : '?',
                 '\x8E' : ('#x17D', '17D'),
                 '\x8F' : '?',
                 '\x90' : '?',
                 '\x91' : ('lsquo', '2018'),
                 '\x92' : ('rsquo', '2019'),
                 '\x93' : ('ldquo', '201C'),
                 '\x94' : ('rdquo', '201D'),
                 '\x95' : ('bull', '2022'),
                 '\x96' : ('ndash', '2013'),
                 '\x97' : ('mdash', '2014'),
                 '\x98' : ('tilde', '2DC'),
                 '\x99' : ('trade', '2122'),
                 '\x9a' : ('scaron', '161'),
                 '\x9b' : ('rsaquo', '203A'),
                 '\x9c' : ('oelig', '153'),
                 '\x9d' : '?',
                 '\x9e' : ('#x17E', '17E'),
                 '\x9f' : ('Yuml', '178'),}

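Each tuple in `MS_CHARS` pairs an HTML entity name with a hexadecimal code point, and the plain-string entries are literal replacements. The Python 3 sketch below shows one way such a table can drive the `[\x80-\x9f]` substitution seen in the conversion code. The body of `_subMSChar` is not part of this section, so the entity formatting here (`&name;` for HTML, `&#xHEX;` for XML) is an assumption, as are the names `MS_CHARS_SAMPLE` and `sub_ms_chars` and the `'?'` fallback for bytes missing from the excerpt.

```python
import re

# Small excerpt of the table above, keyed by bytes for Python 3.
MS_CHARS_SAMPLE = {
    b'\x81': ' ',
    b'\x91': ('lsquo', '2018'),
    b'\x92': ('rsquo', '2019'),
    b'\x93': ('ldquo', '201C'),
    b'\x94': ('rdquo', '201D'),
}

# Hypothetical helper (not part of BeautifulSoup).
def sub_ms_chars(markup, smart_quotes_to='html'):
    """Replace Windows-1252 bytes 0x80-0x9f with entities or literal text."""
    def replace(match):
        # Assumption: bytes missing from the excerpt become '?'.
        sub = MS_CHARS_SAMPLE.get(match.group(1), '?')
        if isinstance(sub, tuple):
            name, hex_code = sub
            sub = '&#x%s;' % hex_code if smart_quotes_to == 'xml' else '&%s;' % name
        # A bytes pattern requires a bytes replacement.
        return sub.encode('ascii')
    return re.sub(b'([\x80-\x9f])', replace, markup)
```

With this excerpt, `sub_ms_chars(b'\x93Hi\x94')` gives `b'&ldquo;Hi&rdquo;'`, and passing `'xml'` gives `b'&#x201C;Hi&#x201D;'`.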
#######################################################################


#By default, act as an HTML pretty-printer.
if __name__ == '__main__':
    import sys
    soup = BeautifulSoup(sys.stdin)
    print soup.prettify()