encutils - encoding detection collection for Python
what is it
A collection of helper functions to detect encodings of text files (like HTML, XHTML, XML, CSS, etc.) retrieved via HTTP, file or string.
getEncodingInfo is probably the main function of interest. It uses the other supplied functions itself, gathers all information together and supplies an EncodingInfo object with the following properties:
encoding: The guessed encoding
Encoding is the explicit or implicit encoding or None and always lowercase.
- from HTTP response
- from HTML <meta> element
- from XML declaration
>>> import encutils
>>> info = encutils.getEncodingInfo(url='http://cthedot.de/encutils/')
>>> print info  # = str(info)
utf-8
>>> info  # = repr(info)
<encutils.EncodingInfo object encoding='utf-8' mismatch=False at 0xb86d30>
>>> print info.logtext
HTTP media_type: text/html
HTTP encoding: utf-8
Encoding (probably): utf-8 (Mismatch: False)
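To illustrate the idea of gathering encoding hints from several places and flagging a mismatch, here is a minimal standard-library sketch. It is not the encutils implementation: the names http_info, XML_DECL, META_CHARSET and sketch_encoding_info are hypothetical, the regular expressions are deliberately simple, and the rule "first explicit encoding wins, HTTP first" is an assumption that ignores the media-type specific rules encutils applies.

```python
import re
from email.message import Message

# Hypothetical, simplified patterns; real-world markup needs more care.
XML_DECL = re.compile(r'<\?xml[^>]*\bencoding\s*=\s*["\']([\w.-]+)["\']')
META_CHARSET = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w.-]+)', re.IGNORECASE)


def http_info(content_type):
    """Split a Content-Type header value into (media_type, charset or None)."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_param("charset")
    return msg.get_content_type(), charset.lower() if charset else None


def sketch_encoding_info(content_type, text):
    """Collect explicit encodings from HTTP, XML declaration and HTML meta.

    Returns (encoding, mismatch, found). All encodings are lowercased so
    that 'UTF-8' and 'utf-8' are not reported as a mismatch.
    """
    _media_type, http_enc = http_info(content_type)
    xml = XML_DECL.match(text)        # declaration must be at the very start
    meta = META_CHARSET.search(text)  # meta element may appear anywhere
    found = {
        "http": http_enc,
        "xml": xml.group(1).lower() if xml else None,
        "meta": meta.group(1).lower() if meta else None,
    }
    explicit = [enc for enc in found.values() if enc]
    encoding = explicit[0] if explicit else None  # assumption: HTTP wins
    mismatch = len(set(explicit)) > 1             # sources disagree
    return encoding, mismatch, found


if __name__ == "__main__":
    html = ('<html><head><meta http-equiv="Content-Type" '
            'content="text/html; charset=ISO-8859-1"></head></html>')
    print(sketch_encoding_info("text/html; charset=UTF-8", html))
```

In this sketch a page served as UTF-8 whose meta element claims ISO-8859-1 is reported with mismatch set to True, which is the kind of conflict the mismatch property is meant to surface.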
Thanks to Robert Siemer for very helpful testing and improvements.
Tested with Python 2.6.2 on Windows Vista only.
- v0.9 090423
- invalid HTML (like < />) does not stop the encoding detection anymore
- fixed tryEncodings if chardet is not installed
- mismatch is False now if the MIME type is text/xml (or similar) and the XML encoding pseudo attribute defines an encoding, as the latter is ignored completely!
- default encoding for CSS is UTF-8 now if no other HTTP info is given. @charset encoding information is not used by encutils!
- log output for mismatch uses != instead of <> now
- fixed test cases which were not all tested :( (most embarrassing)
- v0.8.3.1 (not released standalone, only as part of cssutils)
optimized docs for Sphinx generations and added
- v0.8.3 080316
- fixed parsing of HTML meta element encoding information
- default encoding for text/css is assumed to be UTF-8 now (according to CSS specification). Information inside the actual CSS file is not yet used.
- v0.8.2 080315
- changed license to dual-license of LGPL and Creative Commons License
- v0.8.1 (released as part of cssutils only)
- improved inline documentation only
- v0.8 070814
- CHANGE: For content type text/html no default XML encoding is used anymore. This resulted in a reported mismatch which in most cases probably was none. An explicit XML encoding is still noted if XHTML is served as text/html, though!
FEATURE: new parameter
- FEATURE: added __repr__ method to EncodingInfo which outputs encoding and mismatch information
- BUGFIX: fixed bug in getMetaInfo which returned info that was not normalized (lowercased), which resulted in reported mismatches that were not really mismatches
- simplified package structure
- v0.7a2 061126
- repacked as package and egg, added generated docs too
- v0.7a1 061125
- v0.61 051101
- showing the help when run standalone uses pydoc now
- v0.6 050914
- new function getEncodingInfo(response=None, text=u'', log=None). Returns an EncodingInfo object which contains all information about the given text: encoding, mismatch, logtext, http_encoding, http_media_type, meta_encoding, meta_media_type, xml_encoding. The string value of EncodingInfo is the encoding itself. All parameters are optional, in contrast to the deprecated guessEncoding.
- guessEncoding is DEPRECATED in favour of getEncodingInfo
- format to utility function _splitContentTypeValue (including test) and replaced where it was used with cgi.parse_header (note to self: need to know stdlib better...)
- updated license to version 2.5
- v0.5 050823
- detectXMLEncoding from a cookbook recipe by Lars Tiede which itself is based on Paul Prescott's recipe
- all encodings returned are in lowercase from 0.5 because RFC 3023 uses lowercase names too
- bugfix: changed default character encodings
- fixed test_getTextType...
- v0.4.1 050820
- docs adjusted to epydoc automatic doc generation with ReST
- v0.4 050817
TODO: should autodetect XML encoding? maybe with parameter only?
- guessEncoding should work as defined in specifications
- constants _OTHER_TYPE, _XML_APPLICATION_TYPE, _TEXT_TYPE for different types of served XML files, internal use and probably not useful to users of encutils
- cleaned up code
- added getXMLEncoding(text) which finds explicit encoding
- v0.3 050730
- _splitContentTypeValue returns encoding if given uppercase now
- running standalone shows help now
- v0.2 050703
- cleaned up documentation and API
- v0.1 050626
- first release