encutils

encutils – encoding detection collection for Python

what is it

A collection of helper functions to detect encodings of text files (like HTML, XHTML, XML, CSS, etc.) retrieved via HTTP, file or string.

getEncodingInfo is probably the main function of interest. It uses the other supplied functions, gathers their results, and returns an EncodingInfo object with the following properties:

encoding : The guessed encoding. This is the explicit or implicit encoding, or None, and is always lowercase.
from HTTP response
- http_encoding
- http_media_type
from HTML <meta> element
- meta_encoding
- meta_media_type
from XML declaration
- xml_encoding
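
To illustrate where two of these values come from, here is a minimal standard-library sketch. It is not the encutils implementation: the helper names parse_content_type and xml_declared_encoding are hypothetical, and the XML check assumes an ASCII-compatible byte stream (so it would miss UTF-16 documents).

```python
import re
from email.message import Message

# Hypothetical helpers for illustration only; not part of the encutils API.

def parse_content_type(header_value):
    """Return (media_type, charset) from an HTTP Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    media_type = msg.get_content_type()   # the stdlib returns this lowercased
    charset = msg.get_param("charset")    # None if no charset parameter
    if isinstance(charset, str):
        charset = charset.lower()
    else:
        charset = None
    return media_type, charset

# Matches an XML declaration at the very start of the data.
# Assumption: the bytes are ASCII-compatible (e.g. UTF-8 or ISO-8859-x).
_XML_DECL = re.compile(rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][\w.-]*)[\"']")

def xml_declared_encoding(data):
    """Return the lowercased encoding of the XML declaration, or None."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii").lower() if match else None
```

Both helpers normalize their result to lowercase, mirroring the convention described above for the EncodingInfo properties.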

example

>>> import encutils
>>> info = encutils.getEncodingInfo(url='http://cthedot.de/encutils/')

>>> print(info)  # = str(info)
utf-8

>>> info         # = repr(info)
<encutils.EncodingInfo object encoding='utf-8' mismatch=False at 0xb86d30>

>>> print(info.logtext)
HTTP media_type: text/html
HTTP encoding: utf-8
Encoding (probably): utf-8 (Mismatch: False)
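
The mismatch flag in the log can be pictured with a deliberately simplified sketch. The function guess_encoding below is hypothetical, and both its priority order (HTTP, then meta, then XML) and its mismatch rule are assumptions; the real rules in encutils also depend on the media type.

```python
# Simplified illustration only; not the rules encutils actually applies.

def guess_encoding(http_encoding=None, meta_encoding=None, xml_encoding=None):
    """Return (encoding, mismatch) from the explicitly declared encodings.

    Assumption: the first declared source wins, in the order
    HTTP header, HTML meta element, XML declaration.
    """
    declared = [
        enc.lower()
        for enc in (http_encoding, meta_encoding, xml_encoding)
        if enc
    ]
    encoding = declared[0] if declared else None
    # A mismatch means that at least two sources disagree.
    mismatch = len(set(declared)) > 1
    return encoding, mismatch
```

Under these assumptions, guess_encoding(http_encoding="UTF-8") yields ("utf-8", False), while conflicting HTTP and meta declarations set the flag to True.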

Thanks to Robert Siemer for very helpful testing and improvements.

download

Since April 2014, encutils is only available as part of cssutils. Install the latest version via pip: pip install cssutils.

change history

v0.9.8

part of cssutils, the API remains the same

v0.9 090423

invalid HTML (like < />) does not stop the encoding detection anymore
fixed tryEncodings if chardet is not installed
mismatch is now False if the MIME type is text/xml (or similar) and the XML encoding pseudo-attribute defines an encoding, as that declaration is ignored completely in this case!
default encoding for CSS is now UTF-8 if no other HTTP info is given. @charset encoding information is not used by encutils!
log output for mismatch uses != instead of <> now
fixed test cases, not all of which were actually run (most embarrassing)

v0.8.3.1 (not released standalone, only as part of cssutils)

optimized docs for Sphinx generation and added encutils.VERSION

v0.8.3 080316

fixed parsing of HTML meta element encoding information
default encoding for text/css is assumed as UTF-8 now (according to CSS specification). Information inside the actual CSS file is not yet used.

v0.8.2 080315

changed license to dual-license of LGPL and Creative Commons License

v0.8.1 (released as part of cssutils only)

improved inline documentation only

v0.8 070814

CHANGE: For content type text/html no default XML encoding is used anymore. The default resulted in a mismatch which in most cases was probably not a real one. An explicit XML encoding is still noted if XHTML is served as text/html!
FEATURE: new parameter getEncodingInfo(url='URL')
FEATURE: added __repr__ method to EncodingInfo which outputs encoding and mismatch information
BUGFIX: fixed bug in getMetaInfo which returned values that were not normalized (lowercased), resulting in mismatches that were not real ones
simplified package structure

v0.7a2 061126

repacked as package and egg, added generated docs too

v0.7a1 061125

minor improvements to better work with cssutils csscapture (in cssutils v0.9.2)
tryEncodings uses chardet now if available
removed guessEncoding

v0.61 051101

showing the help when run standalone uses pydoc now

v0.6 050914

new function getEncodingInfo(response=None, text=u'', log=None).
Returns an EncodingInfo object which contains all information about the given text: encoding, mismatch, logtext, http_encoding, http_media_type, meta_encoding, meta_media_type, xml_encoding.
String value of EncodingInfo is the encoding itself. All parameters are optional in contrast to the deprecated guessEncoding function.
guessEncoding is DEPRECATED in favour of getEncodingInfo
added parameter format to utility function buildlog.
removed _splitContentTypeValue (including test) and replaced it with cgi.parse_header wherever it was used (note to self: need to know stdlib better...)
updated license to version 2.5

v0.5 050823

replaced getXMLEncoding with detectXMLEncoding from a cookbook recipe by Lars Tiede which itself is based on Paul Prescod's recipe
all encodings returned are in lowercase from 0.5 because RFC 3023 uses lowercase names too
bugfix: changed default character encodings
added buildlog utility function
fixed test_getTextType...

v0.4.1 050820

docs adjusted to epydoc automatic doc generation with ReST

v0.4 050817

added getXMLEncoding(text) which finds explicit encoding
TODO: should autodetect XML encoding? maybe with parameter only?
guessEncoding should work as defined in specifications
constants _OTHER_TYPE, _XML_APPLICATION_TYPE, _TEXT_TYPE for different types of served XML files, internal use and probably not useful to users of encutils
cleaned up code

v0.3 050730

_splitContentTypeValue now also returns the encoding if it is given in uppercase
running standalone shows help now

v0.2 050703

cleaned up documentation and API

v0.1 050626

first release