batch validator

W3C HTML batch Validator in Python

what is it

A script that calls the W3C HTML Validator in batch mode, adapted from a Perl version.
It needs httplib_multipart, an extension to Python's httplib module which originates from a Python Cookbook recipe. The version used and available here has been updated to use httplib.HTTPConnection, which is available from Python 2.0 and uses HTTP/1.1. It therefore does not work with Python versions below 2.0.

download

example package

validate.zip v1.7 090620, a complete example which should run unmodified.

individual files

validate.py
exampleconfig.txt (you may need to adapt the line endings to your OS)
httplib_multipart.py v1.4 040908 (replace .txt with .py extension)
This is a modified version of httplib_multipart which uses "text/html" as the default content-type if mimetypes is unable to recognize the file type. This makes it possible to validate pages with extensions like ".shtml", ".jsp", ".php" or ".py". For such pages you can only use the UPLOAD=1 option if UPLOADFROMURL=1 is also specified (available from v1.6). In this case the script fetches the pages from the local LOCALSERVERURL.
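
The content-type fallback described above can be sketched as follows. This is only an illustration and not the code from httplib_multipart.py: the function name guess_content_type is made up for this example, and it is written for a current Python rather than the Python 2 the script was tested with.

```python
import mimetypes


def guess_content_type(filename, default="text/html"):
    """Return the MIME type guessed from the file name, or `default` if unknown.

    Hypothetical helper for illustration; not part of httplib_multipart.py.
    """
    content_type, _encoding = mimetypes.guess_type(filename)
    # guess_type returns (None, None) for extensions it does not know
    return content_type or default
```

Any extension that mimetypes does not know is then uploaded as "text/html" instead of being rejected or sent with a generic binary type.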

usage

> python validate.py exampleconfig.txt

Tested with Python 2.6.2 on Vista only.

config options

The config file is basically a Python file in which some variables are (re-)defined, and you may specify the following options in it. This may not be the most elegant or safe way, but it works well for this short script. All options are optional and have reasonable default values (apart from your website-specific URL and path information, of course).

VALIDATORURL = w3: URL to the validator, default is constant w3 = 'validator.w3.org:80'
You may want to use the URL to a locally installed copy of the W3C validator.
FILESERVERROOT = 'e:\\files': Absolute path to your local file root. On Unix use / as the separator, on Windows use \\.
The script looks here for files to validate.
VALIDATEPATH = 'home': Validate all files starting from this path in FILESERVERROOT.
SKIPPATHS = ['include', 'WEB-INF']: Skip all files and subdirectories in these paths.
EXTS = ['html', 'htm']: Validate files with these extensions.
LOCALSERVERURL = 'http://example.org': If UPLOAD=0, the validator GETs the files to validate from this URL. For local servers you might need to use the UPLOAD=1 option, in which case this URL is only used if UPLOADFROMURL=1.
UPLOAD=1: If =1, the files from FILESERVERROOT\VALIDATEPATH are uploaded by POST. If =0, the validator GETs the pages from LOCALSERVERURL/VALIDATEPATH.
UPLOADFROMURL=1: Only relevant if UPLOAD=1 (POST). If =1, the HTML to validate is fetched from LOCALSERVERURL, else it is read from FILESERVERROOT\VALIDATEPATH. In both cases the list of pages to validate is taken from the files in FILESERVERROOT\VALIDATEPATH.
REPORTDIR = '__validator': Reports are saved in this directory.
OPENREPORTS = 0: If =1, automatically open report pages in the default HTML viewer (normally a web browser).
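
How a config file that is "basically a Python file" can override defaults might look like the sketch below. It is not taken from validate.py: the names DEFAULTS, parse_config and load_config are made up for this example, and the values in DEFAULTS are simply the example values from the option list above, which are not necessarily the script's real defaults.

```python
# Constant mentioned in the option list for VALIDATORURL.
W3 = "validator.w3.org:80"

# Example values copied from the option list; assumed, not the script's
# actual defaults.
DEFAULTS = {
    "VALIDATORURL": W3,
    "FILESERVERROOT": "e:\\files",
    "VALIDATEPATH": "home",
    "SKIPPATHS": ["include", "WEB-INF"],
    "EXTS": ["html", "htm"],
    "LOCALSERVERURL": "http://example.org",
    "UPLOAD": 1,
    "UPLOADFROMURL": 1,
    "REPORTDIR": "__validator",
    "OPENREPORTS": 0,
}


def parse_config(source, defaults=DEFAULTS):
    """Run config text as Python code and overlay known options on the defaults."""
    namespace = {"w3": W3}  # lets the config say VALIDATORURL = w3
    exec(source, namespace)  # runs arbitrary code: only use trusted config files
    config = dict(defaults)
    for name in defaults:
        if name in namespace:
            config[name] = namespace[name]
    return config


def load_config(path, defaults=DEFAULTS):
    """Read a config file such as exampleconfig.txt and parse it."""
    with open(path) as f:
        return parse_config(f.read(), defaults)
```

Executing the config as code is what makes this approach short but not safe: anything in the file is run, so it is only suitable for config files you wrote yourself.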

Also see the comments in the files, which should be enough to get you started.
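
The file selection controlled by FILESERVERROOT, VALIDATEPATH, SKIPPATHS and EXTS can be pictured with the following sketch. It is an illustration only and not the code from validate.py: the function names find_files and page_url are made up, and it assumes that SKIPPATHS entries are matched against directory names at any depth, which may differ from what the script does.

```python
import os


def find_files(root, validate_path, skip_paths, exts):
    """Yield files to validate as paths relative to `root`, using '/' separators."""
    start = os.path.join(root, validate_path)
    wanted = {"." + ext.lower() for ext in exts}
    for dirpath, dirnames, filenames in os.walk(start):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_paths)
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in wanted:
                full = os.path.join(dirpath, name)
                yield os.path.relpath(full, root).replace(os.sep, "/")


def page_url(local_server_url, rel_path):
    """Map a relative file path to the URL the page would have on LOCALSERVERURL."""
    return local_server_url.rstrip("/") + "/" + rel_path
```

With UPLOAD=0 the validator would be pointed at the result of page_url for each file, while with UPLOAD=1 the file found on disk (or, with UPLOADFROMURL=1, the page behind that URL) would be uploaded instead.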

change history

v1.7 090620

BUGFIX: checks for Valid or Invalid adapted to changes of W3C HTML Validator HTML (the check is really naive!)
BUGFIX: fixed saving of reports
improvement of output
added valid and invalid example HTML

v1.6 060606

new option UPLOADFROMURL to retrieve files from e.g. a local server which the w3 validator may not be able to access. Useful if dynamic pages like .php or .jsp pages need to be validated.

v1.5 040910

bugfix: reported number of files validated was always 1

v1.4 040908

added option SKIPPATHS
modified httplib_multipart to use "text/html" as default content-type

v1.3 040907

added option VALIDATEPATH to validate only parts of a website
renamed most options to more meaningful names

v1.2 040906

httplib_multipart rewrite to use httplib.HTTPConnection
rewrote most of the script

v1.1 040906 not released

upload on local validator works now (because of HTTP/1.1)

v1.0 040903 not released

first working version