encutils – encoding detection collection for Python
what is it
A collection of helper functions to detect encodings of text files (like HTML, XHTML, XML, CSS, etc.) retrieved via HTTP, file or string.
getEncodingInfo is probably the main function of interest. It uses the other supplied functions, gathers all information together, and returns an EncodingInfo object with the following properties:
- encoding: the guessed encoding. This is the explicit or implicit encoding, or None, and is always lowercase.
- from the HTTP response: http_encoding, http_media_type
- from the HTML <meta> element: meta_encoding, meta_media_type
- from the XML declaration: xml_encoding
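The three sources listed above can be illustrated with a short, standard-library-only sketch. It is not encutils' own code: the names parse_content_type and gather are hypothetical, the regular expressions are deliberately simplified and assume ASCII-compatible input, and the Content-Type value is parsed with email.message.Message (the change history below notes that encutils itself used cgi.parse_header for this step).

```python
import re
from email.message import Message

# Simplified patterns (assumption): real-world HTML and XML need more care.
_XML_DECL = re.compile(
    br'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    br'\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
_META = re.compile(
    br'<meta\s+http-equiv\s*=\s*["\']?content-type["\']?\s+'
    br'content\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def parse_content_type(value):
    """Split a Content-Type value into (media_type, charset), both lowercase.

    charset is None if the value does not name one.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


def gather(content_type_header, body):
    """Collect encoding hints from header, <meta> element and XML declaration.

    Hypothetical helper: body is expected to be bytes, the header a str or None.
    """
    info = dict.fromkeys([
        "http_media_type", "http_encoding",
        "meta_media_type", "meta_encoding", "xml_encoding",
    ])
    if content_type_header:
        info["http_media_type"], info["http_encoding"] = parse_content_type(
            content_type_header
        )
    meta = _META.search(body)
    if meta:
        info["meta_media_type"], info["meta_encoding"] = parse_content_type(
            meta.group(1).decode("ascii", "ignore")
        )
    xml = _XML_DECL.match(body)
    if xml:
        # Lowercase so that values from all three sources compare equal.
        info["xml_encoding"] = xml.group(1).decode("ascii").lower()
    return info


sample = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8"></head></html>'
)
print(gather("text/html; charset=UTF-8", sample))
```

Lowercasing every value at the point of extraction keeps later comparisons between the sources simple, which is the same reason the encoding property is always lowercase.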
example
    >>> import encutils
    >>> info = encutils.getEncodingInfo(url='http://cthedot.de/encutils/')
    >>> print(info)  # = str(info)
    utf-8
    >>> info  # = repr(info)
    <encutils.EncodingInfo object encoding='utf-8' mismatch=False at 0xb86d30>
    >>> print(info.logtext)
    HTTP media_type: text/html
    HTTP encoding: utf-8
    Encoding (probably): utf-8 (Mismatch: False)
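The difference between str(info) and repr(info) in the session above can be mimicked by a small class. This is only a sketch of that behaviour: EncodingInfoSketch is a hypothetical stand-in, not the real EncodingInfo, and returning an empty string when no encoding is known is an assumption.

```python
class EncodingInfoSketch(object):
    """Hypothetical stand-in showing the str/repr behaviour only."""

    def __init__(self, encoding=None, mismatch=False):
        self.encoding = encoding
        self.mismatch = mismatch

    def __str__(self):
        # The string value is the encoding itself; the empty-string
        # fallback for an unknown encoding is an assumption.
        return self.encoding or ""

    def __repr__(self):
        # Summarize encoding and mismatch, followed by the object's id.
        return "<%s object encoding=%r mismatch=%r at 0x%x>" % (
            type(self).__name__, self.encoding, self.mismatch, id(self))


info = EncodingInfoSketch("utf-8")
print(info)        # prints the encoding
print(repr(info))  # prints the angle-bracket summary
```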
Thanks to Robert Siemer for very helpful testing and improvements.
download
As of April 2014, encutils is only available as part of cssutils. Install the latest version via pip:
    pip install cssutils
change history
- v0.9.8
- part of cssutils, the API remains the same
- v0.9 090423
- invalid HTML (like < />) does not stop the encoding detection anymore
- fixed tryEncodings if chardet is not installed
- mismatch is now False if the MIME type is text/xml (or similar) and the XML encoding pseudo-attribute defines an encoding, as the pseudo-attribute is ignored completely in that case!
- default encoding for CSS is now UTF-8 if no other HTTP info is given. @charset encoding information is not used by encutils!
- log output for mismatch now uses != instead of <>
- fixed test cases which were not all being run (most embarrassing)
- v0.8.3.1 (not released standalone, only as part of cssutils)
- optimized docs for Sphinx generation and added encutils.VERSION
- v0.8.3 080316
- fixed parsing of HTML meta element encoding information
- default encoding for text/css is now assumed to be UTF-8 (according to the CSS specification). Information inside the actual CSS file is not yet used.
- v0.8.2 080315
- changed license to dual-license of LGPL and Creative Commons License
- v0.8.1 (released as part of cssutils only)
- improved inline documentation only
- v0.8 070814
- CHANGE: For content type text/html, no default XML encoding is used anymore. The default resulted in a reported mismatch which in most cases probably was none. An explicit XML encoding is still noted, though, if XHTML is saved with text/html!
- FEATURE: new parameter url, as in getEncodingInfo(url='URL')
- FEATURE: added __repr__ method to EncodingInfo which outputs encoding and mismatch information
- BUGFIX: fixed bug in getMetaInfo which returned values that were not normalized (lowercased), which resulted in reported mismatches that were not real ones
- simplified package structure
- v0.7a2 061126
- repacked as package and egg, added generated docs too
- v0.7a1 061125
- v0.61 051101
- showing the help when run standalone uses pydoc now
- v0.6 050914
- new function getEncodingInfo(response=None, text=u'', log=None). Returns an EncodingInfo object which contains all information about the given text: encoding, mismatch, logtext, http_encoding, http_media_type, meta_encoding, meta_media_type, xml_encoding. The string value of EncodingInfo is the encoding itself. All parameters are optional, in contrast to the deprecated guessEncoding function.
- guessEncoding is DEPRECATED in favour of getEncodingInfo
- added parameter format to utility function buildlog
- removed _splitContentTypeValue (including test) and replaced it where it was used with cgi.parse_header (note to self: need to know stdlib better...)
- updated license to version 2.5
- v0.5 050823
- replaced getXMLEncoding with detectXMLEncoding from a cookbook recipe by Lars Tiede, which itself is based on Paul Prescod's recipe
- all encodings returned are lowercase as of 0.5 because RFC 3023 uses lowercase names too
- bugfix: changed default character encodings
- added buildlog utility function
- fixed test_getTextType...
- v0.4.1 050820
- docs adjusted to epydoc automatic doc generation with ReST
- v0.4 050817
- added getXMLEncoding(text) which finds the explicit encoding. TODO: should autodetect XML encoding? maybe with parameter only?
- guessEncoding should work as defined in specifications
- constants _OTHER_TYPE, _XML_APPLICATION_TYPE, _TEXT_TYPE for different types of served XML files; for internal use and probably not useful to users of encutils
- cleaned up code
- v0.3 050730
- _splitContentTypeValue returns encoding if given uppercase now
- running standalone shows help now
- v0.2 050703
- cleaned up documentation and API
- v0.1 050626
- first release
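Two rules recorded in the v0.9 and v0.8.3 entries above lend themselves to a small illustration: the XML declaration is ignored for text/xml and similar types, so it cannot cause a mismatch, and text/css defaults to UTF-8 when the HTTP header names no encoding. The sketch below encodes only those two rules. The function name resolve, the exact set of "text/xml or similar" types, and the fallback for every other media type are assumptions, not encutils' actual logic.

```python
# Media types treated as "text/xml or similar" here; the exact set is an assumption.
TEXT_XML_TYPES = {"text/xml", "text/xml-external-parsed-entity"}


def resolve(media_type, http_encoding=None, xml_encoding=None):
    """Return (encoding, mismatch) following the two changelog rules.

    Hypothetical helper; everything outside those two rules is a guess.
    """
    if media_type in TEXT_XML_TYPES:
        # The XML declaration is ignored for these types, so it can never
        # conflict with the HTTP header.
        return http_encoding, False
    if media_type == "text/css" and http_encoding is None:
        # CSS defaults to UTF-8 when the header gives no encoding.
        return "utf-8", False
    # Assumed fallback: prefer the HTTP header and flag a disagreement
    # only when both sources name an encoding and they differ.
    mismatch = (None not in (http_encoding, xml_encoding)
                and http_encoding != xml_encoding)
    return http_encoding or xml_encoding, mismatch


print(resolve("text/xml", "utf-8", "iso-8859-1"))
print(resolve("text/css"))
```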