Files
old-nlp/venv/lib/python3.7/site-packages/bs4/__pycache__/dammit.cpython-37.pyc

239 lines
18 KiB
Plaintext

2019-10-20 13:16:49 +02:00
B
%<25>]4y<00> @sDdZdZddlZddlmZddlZddlZddlZdZyddl Z dd<06>Z
WnFe k
r<EFBFBD>yddl Z dd<06>Z
Wne k
r<EFBFBD>dd<06>Z
YnXYnXy ddl Z Wne k
r<EFBFBD>YnXd Zd
Ze<10>Ze<05>e<0F>d <0B>ej<14>e<05>e<0E>d <0B>ej<14>d <0C>ee<e<05>eej<14>e<05>eej<14>d <0C>ee<Gd d<0E>de<17>ZGdd<10>d<10>ZGdd<12>d<12>ZdS)aBBeautiful Soup bonus library: Unicode, Dammit
This library converts a bytestream to Unicode through any means
necessary. It is heavily based on code from Mark Pilgrim's Universal
Feed Parser. It works best on XML and HTML, but it does not rewrite the
XML or HTML to reflect a new encoding; that's the tree builder's job.
<EFBFBD>MIT<49>N)<01>codepoint2namecCst|t<01>rdSt<02>|<00>dS)N<>encoding)<04>
isinstance<EFBFBD>str<74>cchardet<65>detect)<01>s<>r
<00>6/tmp/pip-install-_x9nvcel/beautifulsoup4/bs4/dammit.py<70>chardet_dammits
r cCst|t<01>rdSt<02>|<00>dS)Nr)rr<00>chardetr)r r
r
r r "s
cCsdS)Nr
)r r
r
r r *sz$^\s*<\?.*encoding=['"](.*?)['"].*\?>z0<\s*meta[^>]+charset\s*=\s*["']?([^>]*?)[ /;'">]<5D>ascii)<02>html<6D>xmlc@s<>eZdZdZdd<03>Ze<04>\ZZZdddddd <09>Ze <09>
d
<EFBFBD>Z e <09>
d <0B>Z e d d <0A><00>Ze dd<0F><00>Ze dd<11><00>Ze ddd<14><01>Ze ddd<16><01>Ze dd<18><00>ZdS)<1C>EntitySubstitutionzASubstitute XML or HTML entities for the corresponding characters.cCsxi}i}g}dg}xFtt<01><02><00>|D]2\}}t|<04>}|dkrN|<02>|<06>|||<|||<q$Wdd<04>|<02>}||t<06>|<07>fS)N)<02>'<00>apos)<02>"rz[%s]<5D>)<08>listr<00>items<6D>chr<68>append<6E>join<69>re<72>compile)<08>lookupZreverse_lookupZcharacters_for_re<72>extra<72> codepoint<6E>name<6D> characterZ re_definitionr
r
r <00>_populate_class_variablesEs
 z,EntitySubstitution._populate_class_variablesr<00>quot<6F>amp<6D>lt<6C>gt)<05>'<27>"<22>&<26><<3C>>z&([<>]|&(?!#\d+;|#x[0-9a-fA-F]+;|\w+;))z([<>&])cCs|j<00>|<01>d<01><01>}d|S)Nrz&%s;)<03>CHARACTER_TO_HTML_ENTITY<54>get<65>group)<03>cls<6C>matchobj<62>entityr
r
r <00>_substitute_html_entityosz*EntitySubstitution._substitute_html_entitycCs|j|<01>d<01>}d|S)zmUsed with a regular expression to substitute the
appropriate XML entity for an XML special character.rz&%s;)<02>CHARACTER_TO_XML_ENTITYr.)r/r0r1r
r
r <00>_substitute_xml_entitytsz)EntitySubstitution._substitute_xml_entitycCs6d}d|kr*d|kr&d}|<01>d|<03>}nd}|||S)a*Make a value into a quoted XML attribute, possibly escaping it.
Most strings will be quoted using double quotes.
Bob's Bar -> "Bob's Bar"
If a string contains double quotes, it will be quoted using
single quotes.
Welcome to "my bar" -> 'Welcome to "my bar"'
If a string contains both single and double quotes, the
double quotes will be escaped, and the string will be quoted
using double quotes.
Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's Bar&quot;"
r(r'z&quot;)<01>replace)<04>self<6C>valueZ
quote_withZ replace_withr
r
r <00>quoted_attribute_value{sz)EntitySubstitution.quoted_attribute_valueFcCs"|j<00>|j|<01>}|r|<00>|<01>}|S)a Substitute XML entities for special XML characters.
:param value: A string to be substituted. The less-than sign
will become &lt;, the greater-than sign will become &gt;,
and any ampersands will become &amp;. If you want ampersands
that appear to be part of an entity definition to be left
alone, use substitute_xml_containing_entities() instead.
:param make_quoted_attribute: If True, then the string will be
quoted, as befits an attribute value.
)<04>AMPERSAND_OR_BRACKET<45>subr4r8)r/r7<00>make_quoted_attributer
r
r <00>substitute_xml<6D>s


z!EntitySubstitution.substitute_xmlcCs"|j<00>|j|<01>}|r|<00>|<01>}|S)a<>Substitute XML entities for special XML characters.
:param value: A string to be substituted. The less-than sign will
become &lt;, the greater-than sign will become &gt;, and any
ampersands that are not part of an entity definition will
become &amp;.
:param make_quoted_attribute: If True, then the string will be
quoted, as befits an attribute value.
)<04>BARE_AMPERSAND_OR_BRACKETr:r4r8)r/r7r;r
r
r <00>"substitute_xml_containing_entities<65>s


z5EntitySubstitution.substitute_xml_containing_entitiescCs|j<00>|j|<01>S)a<>Replace certain Unicode characters with named HTML entities.
This differs from data.encode(encoding, 'xmlcharrefreplace')
in that the goal is to make the result more readable (to those
with ASCII displays) rather than to recover from
errors. There's absolutely nothing wrong with a UTF-8 string
containing a LATIN SMALL LETTER E WITH ACUTE, but replacing that
character with "&eacute;" will make it more readable to some
people.
)<03>CHARACTER_TO_HTML_ENTITY_REr:r2)r/r r
r
r <00>substitute_html<6D>s z"EntitySubstitution.substitute_htmlN)F)F)<14>__name__<5F>
__module__<EFBFBD> __qualname__<5F>__doc__r"r,ZHTML_ENTITY_TO_CHARACTERr?r3rrr=r9<00> classmethodr2r4r8r<r>r@r
r
r
r rAs$ 

   %  rc@sHeZdZdZddd<05>Zdd<07>Zedd <09><00>Zed
d <0B><00>Z edd d <0A><01>Z
dS)<10>EncodingDetectora^Suggests a number of possible encodings for a bytestring.
Order of precedence:
1. Encodings you specifically tell EncodingDetector to try first
(the override_encodings argument to the constructor).
2. An encoding declared within the bytestring itself, either in an
XML declaration (if the bytestring is to be interpreted as an XML
document), or in a <meta> tag (if the bytestring is to be
interpreted as an HTML document.)
3. An encoding detected through textual analysis by chardet,
cchardet, or a similar external library.
4. UTF-8.
5. Windows-1252.
NFcCsN|pg|_|pg}tdd<02>|D<00><01>|_d|_||_d|_|<00>|<01>\|_|_dS)NcSsg|] }|<01><00><00>qSr
)<01>lower)<02>.0<EFBFBD>xr
r
r <00>
<listcomp><3E>sz-EncodingDetector.__init__.<locals>.<listcomp>) <09>override_encodings<67>set<65>exclude_encodings<67>chardet_encoding<6E>is_html<6D>declared_encoding<6E>strip_byte_order_mark<72>markup<75>sniffed_encoding)r6rRrKrOrMr
r
r <00>__init__<5F>s
zEncodingDetector.__init__cCs8|dk r4|<01><00>}||jkrdS||kr4|<02>|<01>dSdS)NFT)rGrM<00>add)r6r<00>triedr
r
r <00>_usable<6C>s

zEncodingDetector._usableccs<>t<00>}x |jD]}|<00>||<01>r|VqW|<00>|j|<01>r>|jV|jdkrZ|<00>|j|j<07>|_|<00>|j|<01>rp|jV|jdkr<>t |j<06>|_|<00>|j|<01>r<>|jVxdD]}|<00>||<01>r<>|Vq<>WdS)z<Yield a number of encodings that might work for this markup.N)zutf-8z windows-1252)
rLrKrWrSrP<00>find_declared_encodingrRrOrNr )r6rV<00>er
r
r <00> encodingss$  


 
 zEncodingDetector.encodingscCs<>d}t|t<01>r||fSt|<01>dkrT|dd<03>dkrT|dd<02>dkrTd}|dd<01>}n<>t|<01>dkr<>|dd<03>dkr<>|dd<02>dkr<>d}|dd<01>}nd|dd <09>d
kr<>d }|d d<01>}nB|dd<02>d kr<>d }|dd<01>}n |dd<02>dkr<>d}|dd<01>}||fS)zMIf a byte-order mark is present, strip it and return the encoding it implies.N<><00>s<00><>zzutf-16bes<00><>zutf-16le<6C>szutf-8s<00><>zutf-32bes<00><>zutf-32le)rr<00>len)r/<00>datarr
r
r rQ&s*
 z&EncodingDetector.strip_byte_order_markc Cs<>|rt|<01>}}nd}tdtt|<01>d<00><01>}t|t<04>r@tt}ntt}|d}|d}d} |j||d<07>}
|
s<EFBFBD>|r<>|j||d<07>}
|
dk r<>|
<EFBFBD><08>d} | r<>t| t<04>r<>| <09> d d
<EFBFBD>} | <09>
<EFBFBD>SdS) z<>Given a document, tries to find its declared encoding.
An XML encoding is declared at the beginning of the document.
An HTML encoding is declared in a <meta> tag, hopefully near the
beginning of the document.
iig<><67><EFBFBD><EFBFBD><EFBFBD><EFBFBD><EFBFBD>?rrN)<01>endposrrr5) r^<00>max<61>intr<00>bytes<65> encoding_resr<00>search<63>groups<70>decoderG) r/rRrOZsearch_entire_documentZ
xml_endposZ html_endpos<6F>res<65>xml_reZhtml_rerPZdeclared_encoding_matchr
r
r rX@s( 

 
 z'EncodingDetector.find_declared_encoding)NFN)FF) rArBrCrDrTrW<00>propertyrZrErQrXr
r
r
r rF<00>s

! rFc<00>@s<>eZdZdZddd<04>ZdddgZgdd gfd
d <0B>Zd d <0A>Zd<>dd<10>Zd<>dd<12>Z e
dd<14><00>Z dd<16>Z dd<18>Z dddddddd d!d"d#d$d%d&d'd&d&d(d)d*d+d,d-d.d/d0d1d2d3d&d4d5d6<64> Zd7dd8d9d:d;d<d=d>d?d@dAdBd&dCd&d&dDdDdEdEdFdGdHdIdJdKdLdMd&dNdOddPdQdRdSdTdUd@dVdWdXdYdPddZdGd[d\d]d^d_d`dadFd8dbdXdcdddedfd&dgdgdgdgdgdgdhdidjdjdjdjdkdkdkdkdldmdndndndndndFdndododododOdpdqdrdrdrdrdrdrdsdQdtdtdtdtdudududud[dvd[d[d[d[d[dwd[d`d`d`d`dxdpdxdy<64><79>Zdzd{d|d}d~dd<7F>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD>d<EFBFBD><64>zZd<>d<EFBFBD>d<EFBFBD>gZed<>d<>Zed<>d<>Ze<14>dd<>d<EFBFBD><64><01>ZdS(<00> UnicodeDammitz<74>A class for detecting the encoding of a *ML document and
converting it to a Unicode string. If the source encoding is
windows-1252, can replace MS smart quotes with their HTML or XML
equivalents.z mac-romanz shift-jis)<02> macintoshzx-sjis<69> windows-1252z
iso-8859-1z
iso-8859-2NFcCs<>||_g|_d|_||_t<04>t<06>|_t||||<05>|_ t
|t <0B>sF|dkr`||_ t |<01>|_ d|_dS|j j |_ d}x,|j jD] }|j j }|<00>|<07>}|dk rxPqxW|s<>x@|j jD]4}|dkr<>|<00>|d<04>}|dk r<>|j<07>d<05>d|_Pq<>W||_ |s<>d|_dS)NFrrr5zSSome characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.T)<12>smart_quotes_to<74>tried_encodingsZcontains_replacement_charactersrO<00>logging<6E> getLoggerrA<00>logrF<00>detectorrrrRZunicode_markup<75>original_encodingrZ<00> _convert_from<6F>warning)r6rRrKrnrOrM<00>urr
r
r rTus> 


 zUnicodeDammit.__init__cCs<>|<01>d<01>}|jdkr&|j<02>|<02><01><04>}nf|j<05>|<02>}t|<03>tkr<>|jdkrfd<04><04>|d<00><04>d<05><04>}q<>d<06><04>|d<00><04>d<05><04>}n|<03><04>}|S)z[Changes a MS smart quote character to an XML or HTML
entity, or an ASCII character.<2E>rrz&#x<>;r)r)r.rn<00>MS_CHARS_TO_ASCIIr-<00>encode<64>MS_CHARS<52>type<70>tuple)r6<00>match<63>origr:r
r
r <00> _sub_ms_char<61>s

  
zUnicodeDammit._sub_ms_char<61>strictc
Cs<>|<00>|<01>}|r||f|jkr dS|j<01>||f<02>|j}|jdk rf||jkrfd}t<06>|<04>}|<05>|j |<03>}y|<00>
|||<02>}||_||_ Wn"t k
r<EFBFBD>}zdSd}~XYnX|jS)Ns([<5B>-<2D>])) <0A>
find_codecrorrRrn<00>ENCODINGS_WITH_SMART_QUOTESrrr:r<><00> _to_unicodert<00> Exception)r6Zproposed<65>errorsrRZsmart_quotes_reZsmart_quotes_compiledrwrYr
r
r ru<00>s"




zUnicodeDammit._convert_fromcCs t|||<03>S)zGiven a string and its encoding, decodes the string into Unicode.
%encoding is a string recognized by encodings.aliases)r)r6r_rr<>r
r
r r<><00>szUnicodeDammit._to_unicodecCs|js
dS|jjS)N)rOrsrP)r6r
r
r <00>declared_html_encoding<6E>sz$UnicodeDammit.declared_html_encodingcCs`|<00>|j<01>||<01><02>pN|r*|<00>|<01>dd<02><02>pN|r@|<00>|<01>dd<03><02>pN|rL|<01><04>pN|}|r\|<02><04>SdS)N<>-r<00>_)<05>_codec<65>CHARSET_ALIASESr-r5rG)r6<00>charsetr7r
r
r r<><00>s zUnicodeDammit.find_codecc Cs<|s|Sd}yt<00>|<01>|}Wnttfk
r6YnX|S)N)<04>codecsr<00> LookupError<6F>
ValueError)r6r<><00>codecr
r
r r<><00>s
zUnicodeDammit._codec)<02>euroZ20AC<41> )<02>sbquoZ201A)<02>fnofZ192)<02>bdquoZ201E)<02>hellipZ2026)<02>daggerZ2020)<02>DaggerZ2021)<02>circZ2C6)<02>permilZ2030)<02>ScaronZ160)<02>lsaquoZ2039)<02>OEligZ152<35>?)z#x17DZ17D)<02>lsquoZ2018)<02>rsquoZ2019)<02>ldquoZ201C)<02>rdquoZ201D)<02>bullZ2022)<02>ndashZ2013)<02>mdashZ2014)<02>tildeZ2DC)<02>tradeZ2122)<02>scaronZ161)<02>rsaquoZ203A)<02>oeligZ153)z#x17EZ17E)<02>Yumlr) <20><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00>ZEUR<55>,<2C>fz,,z...<2E>+z++<2B>^<5E>%<25>Sr*ZOE<4F>Zr'r(<00>*r<>z--<2D>~z(TM)r r+Zoe<6F>z<>Y<>!<21>cZGBP<42>$ZYEN<45>|z..rz(th)z<<z(R)<29>oz+-<2D>2<>3)r'<00>acuterw<00>P<>1z>>z1/4z1/2z3/4<>AZAE<41>C<>E<>I<>D<>N<>O<>U<>b<>B<>aZaerY<00>i<>n<>/<2F>y)<29>r<EFBFBD>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<>r<><00><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00>s€ssƒs„s…s†s‡sˆs‰sŠssŒsŽsss“s”s•ss—s˜s™sšssœsžsŸs 
s¡s¢s£s¤s¥s¦s§s¨s©sªs«s¬s­s®s¯s°s±s²s³s´sµs¶s·s¸s¹sºs»s¼s½s¾s¿sÀsÁsÂsÃsÄsÅsÆsÇsÈsÉsÊsËsÌsÍsÎsÏsÐsÑsÒsÓsÔsÕsÖs×sØsÙsÚsÛsÜsÝsÞsßsàr<C3A0>sâsãsäsåsæsçsèsésêsësìsísîsïsðsñsòsósôsõsös÷søsùsúsûsüsýsþ)z<><7A><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><><00><>)r<>r<>r\)r<>r<>r])r<>r<>r[r<00><><EFBFBD><EFBFBD><EFBFBD>rx<00>utf8c Cs"|<03>dd<02><02><01>dkrtd<04><01>|<02><01>dkr0td<06><01>g}d}d}x<>|t|<01>kr<>||}t|t<05>sdt|<07>}||jkr<>||jkr<>xz|j D]$\}} }
||kr<>|| kr<>||
7}Pq<>Wq>|dkr<>||j
kr<>|<04> |||<06><00>|<04> |j
|<00>|d 7}|}q>|d 7}q>W|dk<02>r|S|<04> ||d
<EFBFBD><00>d <0B> |<04>S) a<>Fix characters from one encoding embedded in some other encoding.
Currently the only situation supported is Windows-1252 (or its
subset ISO-8859-1), embedded in UTF-8.
The input must be a bytestring. If you've already converted
the document to Unicode, you're too late.
The output is a bytestring in which `embedded_encoding`
characters have been converted to their `main_encoding`
equivalents.
r<>r<>)z windows-1252<35> windows_1252zPWindows-1252 and ISO-8859-1 are the only currently supported embedded encodings.)r<>zutf-8z4UTF-8 is the only currently supported main encoding.rrQrxN<>) r5rG<00>NotImplementedErrorr^rrb<00>ord<72>FIRST_MULTIBYTE_MARKER<45>LAST_MULTIBYTE_MARKER<45>MULTIBYTE_MARKERS_AND_SIZES<45>WINDOWS_1252_TO_UTF8rr) r/Zin_bytesZ main_encodingZembedded_encodingZ byte_chunksZ chunk_start<72>pos<6F>byte<74>start<72>end<6E>sizer
r
r <00> detwingle)s< 


 
zUnicodeDammit.detwingle)r<>)r<>)r<>rm)rArBrCrDr<>r<>rTr<>rur<>rjr<>r<>r<>r|rzr<>r<>r<>r<>rEr<>r
r
r
r rkbs`1

      rk)rD<00> __license__r<5F><00> html.entitiesrrrp<00>stringZ chardet_typerr <00> ImportErrorr Z iconv_codecZ xml_encodingZ html_meta<74>dictrdrr{r<>rcr<00>objectrrFrkr
r
r
r <00><module>s@
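
The quoting rule spelled out in the quoted_attribute_value docstring above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name quote_attribute_value is hypothetical, the code is not recovered from the bytecode, and it does not import bs4. It follows the three cases the docstring lists: double quotes by default, single quotes when the value contains double quotes, and escaped double quotes when the value contains both kinds.

```python
def quote_attribute_value(value: str) -> str:
    """Hypothetical helper that mirrors the docstring's quoting rule."""
    quote_with = '"'
    if '"' in value:
        if "'" in value:
            # Both quote kinds are present: escape the double quotes
            # and keep quoting with double quotes.
            value = value.replace('"', "&quot;")
        else:
            # Only double quotes are present: quote with single quotes.
            quote_with = "'"
    return quote_with + value + quote_with


if __name__ == "__main__":
    for sample in ("Bob's Bar", 'Welcome to "my bar"', 'Welcome to "Bob\'s Bar"'):
        print(quote_attribute_value(sample))
```

Escaping only in the both-quotes case keeps the common cases readable, since most values can be wrapped in one quote style or the other without any entity substitution.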