encutils – encoding detection collection for Python

what is it

A collection of helper functions to detect encodings of text files (like HTML, XHTML, XML, CSS, etc.) retrieved via HTTP, file or string.

getEncodingInfo is probably the main function of interest. It uses the other supplied functions to gather all available information and returns an EncodingInfo object with the following properties: encoding, mismatch, logtext, http_encoding, http_media_type, meta_encoding, meta_media_type, xml_encoding.

example

>>> import encutils
>>> info = encutils.getEncodingInfo(url='http://cthedot.de/encutils/')

>>> print(info)  # = str(info)
utf-8

>>> info         # = repr(info)
<encutils.EncodingInfo object encoding='utf-8' mismatch=False at 0xb86d30>

>>> print(info.logtext)
HTTP media_type: text/html
HTTP encoding: utf-8
Encoding (probably): utf-8 (Mismatch: False)
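
The relationship between str(info), repr(info) and the mismatch flag can be illustrated without network access. The following is only a minimal sketch under simplifying assumptions, not the encutils implementation: the class name EncodingSummary and its two-argument constructor are hypothetical, and the rule "the HTTP header wins, a mismatch needs two differing declarations" is a simplification of what encutils actually checks.

```python
class EncodingSummary:
    """Hypothetical stand-in for illustration; this is NOT encutils.EncodingInfo."""

    def __init__(self, http_encoding=None, declared_encoding=None):
        self.http_encoding = http_encoding
        self.declared_encoding = declared_encoding
        # Simplifying assumption: the HTTP header wins when it is present.
        self.encoding = http_encoding or declared_encoding
        # Simplifying assumption: a mismatch needs two declarations that differ
        # (compared case-insensitively).
        self.mismatch = (
            http_encoding is not None
            and declared_encoding is not None
            and http_encoding.lower() != declared_encoding.lower()
        )

    def __str__(self):
        # As with EncodingInfo, the string value is the encoding itself.
        return self.encoding or ""

    def __repr__(self):
        return "<EncodingSummary encoding=%r mismatch=%r>" % (
            self.encoding,
            self.mismatch,
        )


info = EncodingSummary(http_encoding="utf-8", declared_encoding="iso-8859-1")
print(info)        # the encoding only
print(repr(info))  # encoding plus mismatch flag
```

Keeping the string value equal to the encoding lets the object be passed directly wherever an encoding name is expected, while the extra attributes stay available for diagnostics.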
  

Thanks to Robert Siemer for very helpful testing and improvements.

download

Since April 2014, encutils is available only as part of cssutils. Install the latest version via pip: pip install cssutils

change history

v0.9.8
part of cssutils, the API remains the same
v0.9 090423
  • invalid HTML (like < />) does not stop the encoding detection anymore
  • fixed tryEncodings if chardet is not installed
  • mismatch is now False if the MIME type is text/xml (or similar) and the XML encoding pseudo attribute defines an encoding, since the pseudo attribute is ignored completely in that case!
  • default encoding for CSS is UTF-8 now if no other HTTP info is given. @charset encoding information is not used by encutils!
  • log output for mismatch uses != instead of <> now
  • fixed test cases, not all of which were actually run :( (most embarrassing)
v0.8.3.1 (not released standalone, only as part of cssutils)
optimized docs for Sphinx generation and added encutils.VERSION
v0.8.3 080316
  • fixed parsing of HTML meta element encoding information
  • default encoding for text/css is assumed as UTF-8 now (according to CSS specification). Information inside the actual CSS file is not yet used.
v0.8.2 080315
changed license to dual-license of LGPL and Creative Commons License
v0.8.1 (released as part of cssutils only)
improved inline documentation only
v0.8 070814
  • CHANGE: For content type text/html no default XML encoding is used anymore. The default used to produce a mismatch which in most cases was probably not a real one. An explicit XML encoding is still noted, though, if XHTML is served as text/html!
  • FEATURE: new parameter getEncodingInfo(url='URL')
  • FEATURE: added __repr__ method to EncodingInfo which outputs encoding and mismatch information
  • BUGFIX: getMetaInfo returned values which were not normalized (lowercased), which resulted in mismatches being reported that were not real ones
  • simplified package structure
v0.7a2 061126
repacked as package and egg, added generated docs too
v0.7a1 061125
  • minor improvements to better work with cssutils csscapture (in cssutils v0.9.2)
  • tryEncodings uses chardet now if available
  • removed guessEncoding
v0.61 051101
showing the help when run standalone uses pydoc now
v0.6 050914
  • new function getEncodingInfo(response=None, text=u'', log=None).
    Returns an EncodingInfo object which contains all information about the given text: encoding, mismatch, logtext, http_encoding, http_media_type, meta_encoding, meta_media_type, xml_encoding.
    The string value of EncodingInfo is the encoding itself. All parameters are optional, in contrast to the deprecated guessEncoding function.
  • guessEncoding is DEPRECATED in favour of getEncodingInfo
  • added parameter format to utility function buildlog
  • removed _splitContentTypeValue (including test) and replaced it where it was used with cgi.parse_header (note to self: need to know stdlib better...)
  • updated license to version 2.5
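
The move to cgi.parse_header noted above comes down to splitting a Content-Type value into a media type and its parameters. Because the cgi module was removed from the standard library in Python 3.13, the sketch below uses email.message.Message for the same job. It is an illustration only, not encutils code, and the helper name parse_content_type is hypothetical.

```python
from email.message import Message


def parse_content_type(value):
    """Return (media_type, charset) for a Content-Type header value.

    charset is lowercased, or None if the header does not declare one.
    """
    msg = Message()
    msg["Content-Type"] = value
    media_type = msg.get_content_type()   # already lowercased by the stdlib
    charset = msg.get_param("charset")    # None if the parameter is absent
    if isinstance(charset, tuple):
        # RFC 2231 encoded parameters come back as (charset, language, value).
        charset = charset[2]
    if charset is not None:
        charset = charset.lower()
    return media_type, charset


print(parse_content_type("text/html; charset=UTF-8"))
print(parse_content_type("text/css"))
```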
v0.5 050823
  • replaced getXMLEncoding with detectXMLEncoding from a cookbook recipe by Lars Tiede which itself is based on Paul Prescod's recipe
  • all encodings returned are in lowercase from 0.5 because RFC 3023 uses lowercase names too
  • bugfix: changed default character encodings
  • added buildlog utility function
  • fixed test_getTextType...
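
Finding an explicit encoding in an XML declaration and returning it in lowercase, as described in the entries above, can be sketched with a regular expression. This is a simplified illustration and not the detectXMLEncoding recipe: it does no byte order mark handling and assumes an ASCII-compatible encoding. The function name xml_declared_encoding is hypothetical.

```python
import re

# Matches the encoding pseudo attribute of an XML declaration at the very
# start of the data. Assumption: the bytes are ASCII-compatible, so UTF-16
# or EBCDIC input is not covered by this sketch.
_XML_DECL = re.compile(
    rb"""<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_declared_encoding(data):
    """Return the lowercased declared encoding, or None if none is declared."""
    match = _XML_DECL.match(data)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


print(xml_declared_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(xml_declared_encoding(b"<a/>"))
```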
v0.4.1 050820
docs adjusted to epydoc automatic doc generation with ReST
v0.4 050817
  • added getXMLEncoding(text) which finds explicit encoding
    TODO: should autodetect XML encoding? maybe with parameter only?
  • guessEncoding should work as defined in specifications
  • constants _OTHER_TYPE, _XML_APPLICATION_TYPE, _TEXT_TYPE for different types of served XML files, internal use and probably not useful to users of encutils
  • cleaned up code
v0.3 050730
  • _splitContentTypeValue returns encoding if given uppercase now
  • running standalone shows help now
v0.2 050703
cleaned up documentation and API
v0.1 050626
first release