W3C HTML batch Validator in Python

what is it

A script that calls the W3C HTML Validator in batch mode. Adapted from a Perl version.
It needs the httplib_multipart extension to Python's httplib module, which originally comes from a Python Cookbook recipe. The version used and available here has been updated to use httplib.HTTPConnection, which is available from Python 2.0 and speaks HTTP/1.1; as a result it does not work with Python versions below 2.0.
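
To illustrate what a multipart helper of this kind has to do, here is a minimal sketch of building a multipart/form-data request body. It is not the bundled httplib_multipart module: the function name encode_multipart and its parameters are made up for this example, it is written in Python 3 syntax, and only the "text/html" default content type is taken from the change history below.

```python
import uuid


def encode_multipart(fields, files, default_type='text/html'):
    """Build a multipart/form-data body (hypothetical helper, not httplib_multipart).

    fields: list of (name, value) string pairs
    files:  list of (name, filename, content_bytes) tuples
    Returns (content_type_header_value, body_bytes).
    """
    boundary = uuid.uuid4().hex
    delimiter = ('--' + boundary).encode('ascii')
    parts = []
    # Plain form fields: one part each, value follows an empty line.
    for name, value in fields:
        parts += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"' % name).encode('utf-8'),
            b'',
            value.encode('utf-8'),
        ]
    # File fields: same layout plus a filename and a Content-Type header.
    for name, filename, content in files:
        parts += [
            delimiter,
            ('Content-Disposition: form-data; name="%s"; filename="%s"'
             % (name, filename)).encode('utf-8'),
            ('Content-Type: %s' % default_type).encode('ascii'),
            b'',
            content,
        ]
    # Closing delimiter, followed by a final line break.
    parts += [delimiter + b'--', b'']
    body = b'\r\n'.join(parts)
    return 'multipart/form-data; boundary=' + boundary, body
```

The returned header value would go into the Content-Type header of a POST request and the bytes into its body. The form field name to use for the uploaded file depends on the validator's upload form and is not specified here.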


example package

validate.zip v1.7 090620, a complete example which should run unmodified.

individual files


> python validate.py exampleconfig.txt

Tested with Python 2.6.2 on Vista only.

config options

The config file is basically a Python file in which some variables are (re-)defined. You may specify the following options in it (this may not be the most elegant or safe way, but it works well for this short script). All options are optional and have reasonable default values (apart from your website-specific URL and path information, of course).

URL to the validator, default is constant w3 = 'validator.w3.org:80'
You may want to use the URL to a locally installed copy of the W3C validator.
FILESERVERROOT = 'e:\\files'
Absolute path to your local file root. On Unix use / as the separator, on Windows use \\.
The script looks here for files to validate.
Validate all files starting from this path (VALIDATEPATH) in FILESERVERROOT.
SKIPPATHS = ['include', 'WEB-INF']
Skip all files and subdirectories in these paths.
EXTS = ['html', 'htm']
Validate files with these extensions.
LOCALSERVERURL = 'http://example.org'
If UPLOAD=0, the validator GETs the files to validate from this URL. For local servers you might need to use the UPLOAD=1 option, in which case this URL is not used at all.
If UPLOAD=1 (POST), the HTML to validate will be fetched from LOCALSERVERURL, else from FILESERVERROOT\VALIDATEPATH. Files to be fetched will always be the files in FILESERVERROOT\VALIDATEPATH.
REPORTDIR = '__validator'
Reports are saved in this directory.
If =1, automatically open report pages in the default HTML viewer (normally a web browser).
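
As a rough illustration of a config file that is "basically a Python file", the following sketch starts from a set of defaults and lets the config file override them by executing it. This is an assumption about how such a mechanism can work, not the code in validate.py. The names DEFAULTS and load_config are hypothetical, the values are copied from the examples above, and the UPLOAD default shown is a guess.

```python
# Illustrative defaults only; the real defaults in validate.py may differ.
DEFAULTS = {
    'FILESERVERROOT': 'e:\\files',
    'SKIPPATHS': ['include', 'WEB-INF'],
    'EXTS': ['html', 'htm'],
    'LOCALSERVERURL': 'http://example.org',
    'UPLOAD': 0,  # assumed default, not confirmed by the script
    'REPORTDIR': '__validator',
}


def load_config(path, defaults=DEFAULTS):
    """Return the defaults, overridden by upper-case names set in the config file."""
    settings = dict(defaults)
    with open(path) as f:
        source = f.read()
    namespace = {}
    # The config file is executed as ordinary Python code, so it must be trusted.
    exec(compile(source, path, 'exec'), namespace)
    for name, value in namespace.items():
        if name.isupper():  # skips helper names such as __builtins__
            settings[name] = value
    return settings
```

A config file containing only the line EXTS = ['html'] would then change that one option and leave all others at their defaults. Because the file is executed, this approach is convenient but only suitable for config files you wrote yourself.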

Also see the comments in the files, which should be enough to get you started.
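
The interplay of the path options can be sketched as follows. This is a hedged example, not the script's own code: collect_files is a hypothetical name, and it assumes that SKIPPATHS entries are matched against directory names at any depth and that EXTS are compared case-insensitively, neither of which is stated above.

```python
import os


def collect_files(root, skippaths, exts):
    """Return paths (relative to root) of files below root with a wanted extension."""
    wanted = tuple('.' + ext.lower() for ext in exts)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into skipped dirs.
        dirnames[:] = [d for d in dirnames if d not in skippaths]
        for name in filenames:
            if name.lower().endswith(wanted):
                full = os.path.join(dirpath, name)
                found.append(os.path.relpath(full, root))
    return sorted(found)
```

To restrict validation to a part of the site, root could be the FILESERVERROOT and VALIDATEPATH values joined with os.path.join, with SKIPPATHS and EXTS passed as the remaining arguments.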

change history

v1.7 090620
  • BUGFIX: checks for Valid or Invalid adapted to changes of W3C HTML Validator HTML (the check is really naive!)
  • BUGFIX: fixed saving of reports
  • improvement of output
  • added valid and invalid example HTML
v1.6 060606
  • new option UPLOADFROMURL to retrieve files from e.g. a local server which the w3 validator may not be able to access. Useful if dynamic pages like .php or .jsp pages need to be validated.
v1.5 040910
  • bugfix: reported number of files validated was always 1
v1.4 040908
  • added option SKIPPATHS
  • modified httplib_multipart to use "text/html" as default content-type
v1.3 040907
  • added option VALIDATEPATH to validate only parts of a website
  • renamed most options to more meaningful names
v1.2 040906
  • httplib_multipart rewrite to use httplib.HTTPConnection
  • rewrote most of the script
v1.1 040906 not released
  • upload on local validator works now (because of HTTP/1.1)
v1.0 040903 not released
  • first working version