About Validator.nu

Validator.nu is validation 2.0.

The Pitch

No DTD-Based Validation

Basic Usage

Validator.nu has two facets: generic (complex UI) and (X)HTML5 (simple UI).

Enter the URL (http, https or data IRI to be exact) of the document you want to validate in the field labeled “Document” and submit the form. That’s all it takes in most cases.
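As a rough illustration of the data IRI option, the sketch below packs a small document into a data: URL and builds a link that asks the validator to check it. The doc query parameter and the https://validator.nu/ base address are assumptions taken from the service's Web service interface, not from this page, and both helper names are hypothetical.

```python
from urllib.parse import quote, urlencode


def to_data_iri(markup: str, media_type: str = "text/html") -> str:
    """Percent-encode markup into a data: IRI (RFC 2397, non-base64 form)."""
    return "data:" + media_type + ";charset=utf-8," + quote(markup, safe="")


def validation_link(document_iri: str, base: str = "https://validator.nu/") -> str:
    """Build a link that asks the validator to check document_iri.

    Assumption: the "Document" form field is submitted as a `doc` parameter.
    """
    return base + "?" + urlencode({"doc": document_iri})


iri = to_data_iri("<!DOCTYPE html><title>t</title><p>Hello")
print(validation_link(iri))
```

Percent-encoding keeps the sketch simple. A base64 data IRI would work just as well for content that is mostly non-ASCII.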

In the (X)HTML5 facet, the parser and the schema will be chosen based on the HTTP Content-Type of the document. In the generic facet, the parser will be chosen based on the HTTP Content-Type and a preset schema will be chosen based on the root namespace (for XML) or the doctype (for text/html).

Alternative Modes of Input

For simplicity, the HTML5 facet only shows UI for validation by URL. Validation by text area and by file upload are available in the generic facet.

Here are bookmarklets:

There is a command-line script that uploads documents from the local filesystem to the (X)HTML5 validator. Integration into vim is available.

The CMLComp Validator makes it easy to upload CML files for checking.

Configurability

Schemas

When the field for schemas is left empty, the validator will try to choose a schema on its own. If you are not happy with the guessed preset, you can specify a schema either by selecting a preset or by entering a space-separated list of schema URLs (http, https or data IRIs). In addition to actual schemas, you may use certain special URLs to invoke checkers that seem like special schemas but aren’t actually implemented as schemas.

Parser

If the automatic choice of parser does not work for you, you can choose the parser manually. The choice of parser affects the HTTP Accept request header that is sent.

Be lax about HTTP Content-Type

When the lax option is set, text/html, text/xsl and text/plain are allowed as XML content types and text/plain is allowed as an HTML content type and, if the URL ends with .rnc, as a Compact Syntax content type. Also, in the lax mode the US-ASCII default for text/* XML types is not enforced.

Normally, schemas using the RELAX NG XML syntax, Schematron schemas and the XML documents to be validated are expected to be served using an XML content type. Schemas using the RELAX NG Compact Syntax are expected to be served using application/relax-ng-compact-syntax content type. (The unregistered application/vnd.relax-ng.rnc content type is also understood.) HTML documents are expected to be served as text/html.
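The content type rules above can be summarized as a small decision function. This is only a sketch of the rules as described here, not the validator's actual code. The set of types treated as "XML content types" in strict mode (application/xml, text/xml and anything ending in +xml) is an assumption, and the function and parameter names are hypothetical.

```python
# Assumption: what counts as an XML content type in strict mode.
XML_TYPES = {"application/xml", "text/xml"}
# Content types named above for the RELAX NG Compact Syntax.
RNC_TYPES = {
    "application/relax-ng-compact-syntax",
    "application/vnd.relax-ng.rnc",
}


def content_type_ok(content_type, kind, lax=False, url=""):
    """Return True if content_type is acceptable for kind.

    kind is "xml", "html" or "rnc" (Compact Syntax schema).
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    ct = content_type.split(";")[0].strip().lower()
    if kind == "xml":
        strict = ct in XML_TYPES or ct.endswith("+xml")
        return strict or (lax and ct in {"text/html", "text/xsl", "text/plain"})
    if kind == "html":
        return ct == "text/html" or (lax and ct == "text/plain")
    if kind == "rnc":
        lax_ok = lax and ct == "text/plain" and url.endswith(".rnc")
        return ct in RNC_TYPES or lax_ok
    raise ValueError("unknown kind: " + kind)


print(content_type_ok("text/plain", "html"))            # strict mode
print(content_type_ok("text/plain", "html", lax=True))  # lax mode
```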

Show Image Report

When the “Show Image Report” checkbox is set, a report concerning the textual alternatives of img elements in the XHTML namespace is shown for accessibility review.

Show Source

You may check the “Show Source” checkbox to show the decoded source of the document being checked. Please note that the source may not be shown in its entirety if the parser encounters a fatal error. Moreover, the show source feature shows the decoded Unicode source. Erroneous byte sequences in the original source and characters that would render the validator output as non-conforming (e.g. U+0000) are not represented faithfully.

Web Service API

If you want to create your own alternative mode of input or want to call Validator.nu (or your own local copy) from within your own application, there is a RESTful Web service API. In addition to the modes of input that work from HTML forms, you can also POST the document to be checked as an HTTP entity body. In addition to the default HTML output, the messages are also available as XHTML, XML, JSON, GNU error format and plain text.
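A minimal sketch of the POST mode might look like the following. It assumes a local instance at http://localhost:8888/ and assumes that the output format is selected with an out=json query parameter, which comes from the Web service documentation rather than from this page. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_check_request(markup, content_type="text/html; charset=utf-8",
                        service="http://localhost:8888/", out="json"):
    """Prepare a POST request carrying the document as the entity body.

    Assumption: `out` selects the message format (for example "json").
    """
    url = service + "?" + urlencode({"out": out})
    return Request(url, data=markup, method="POST",
                   headers={"Content-Type": content_type})


def check(markup, **options):
    """Send the document to the service and return the parsed JSON reply."""
    with urlopen(build_check_request(markup, **options)) as response:
        return json.load(response)


# Building the request does not touch the network; check() does.
request = build_check_request(b"<!DOCTYPE html><title>t</title><p>Hello")
print(request.get_method(), request.full_url)
```

Calling check(b"...") sends the request. Splitting request construction from sending makes it easy to inspect what would go over the wire before pointing the code at a shared instance.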

Preset Schemas

HTML5 (experimental)

HTML5 (text/html-compatible content models)

HTML5+ARIA (experimental)

HTML5 with ARIA (unendorsed integration prototype)

Mike(tm) Smith has generated documentation for this schema.

HTML 4.01 Strict + IRI / XHTML 1.0 Strict + IRI

XHTML 1.0 Strict with IRI support. Generally suitable for HTML 4.01 Strict checking as well, although there are theoretically wrong corner cases. Uses backported HTML5 datatypes.

HTML 4.01 Transitional + IRI / XHTML 1.0 Transitional + IRI

XHTML 1.0 Transitional with IRI support. Generally suitable for HTML 4.01 Transitional checking as well, although there are theoretically wrong corner cases. Uses backported HTML5 datatypes.

HTML 4.01 Frameset + IRI / XHTML 1.0 Frameset + IRI

XHTML 1.0 Frameset with IRI support. Generally suitable for HTML 4.01 Frameset checking as well, although there are theoretically wrong corner cases. Uses backported HTML5 datatypes. Do not use. :-)

XHTML5 (experimental)

XHTML5 (XML-compatible content models)

XHTML5+ARIA, SVG 1.1 plus MathML 2.0 (experimental)

XHTML5 with ARIA (unendorsed integration prototype), SVG 1.1, MathML 2.0 and holes for OpenMath, RDF and Inkscape cruft.

XHTML 1.0 Strict, SVG 1.1, MathML 2.0 + IRI

XHTML 1.0 (not 1.1), SVG 1.1 and MathML 2.0 with IRI support.

XHTML 1.0 Strict, Ruby, SVG 1.1, MathML 2.0 + IRI

XHTML 1.0 (not 1.1), Ruby, SVG 1.1 and MathML 2.0 with IRI support.

XHTML Basic + IRI

A schema for XHTML Basic with IRI support. Suitable for use with the HTML parser.

SVG 1.1 + IRI

SVG 1.1 Full with IRI support (Inkscape cruft not permitted).

Non-Schema Checkers

The service supports a few special pseudo-schema URIs that map to checkers written in a Turing-complete programming language.

http://c.validator.nu/table/

Checks (X)HTML table integrity. The current implementation should be considered a prototype that has not yet been updated to match the latest spec language for HTML5. (See more detailed discussion.)

http://c.validator.nu/nfc/

Checks that constructs in the document tree are in the Unicode Normalization Form C and don’t start with a “composing character”. Using this pseudo-schema also enables normalization checking of source text. (See more detailed discussion.)
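To make the two conditions concrete, here is a small sketch of the kind of check involved, applied to a single string such as an attribute value. It is not the validator's implementation. In particular, it approximates "composing character" with a non-zero canonical combining class, which is narrower than the full definition, and the function name is hypothetical.

```python
import unicodedata


def nfc_problems(text):
    """Return a list of normalization problems found in text."""
    problems = []
    # Condition 1: the text must already be in Normalization Form C.
    if unicodedata.normalize("NFC", text) != text:
        problems.append("not in NFC")
    # Condition 2 (approximation): the text must not start with a
    # character that has a non-zero canonical combining class.
    if text and unicodedata.combining(text[0]) != 0:
        problems.append("starts with a combining character")
    return problems


print(nfc_problems("\u00e9"))    # precomposed e with acute
print(nfc_problems("e\u0301"))   # e followed by combining acute
print(nfc_problems("\u0301a"))   # combining acute at the start
```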

http://c.validator.nu/text-content/

Checks the text content of the (X)HTML5 meter, progress and time elements for conformance. (This is a prototype with liberties taken.)

http://c.validator.nu/unchecked/

Warns about RDF, OpenMath and Inkscape holes and about the use of version="1.0" in SVG.

http://c.validator.nu/usemap/

Checks the usemap attribute for referential integrity.
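Referential integrity here means that every usemap reference points at a map that actually exists in the document. A simplified sketch using Python's standard HTML parser is shown below. The real checker works on the parsed document tree and follows the spec's matching rules, so treat the class and function names, and the exact-match comparison, as illustrative assumptions.

```python
from html.parser import HTMLParser


class UsemapCollector(HTMLParser):
    """Collect map names and usemap references while parsing."""

    def __init__(self):
        super().__init__()
        self.map_names = set()
        self.references = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "map" and attributes.get("name"):
            self.map_names.add(attributes["name"])
        usemap = attributes.get("usemap")
        if usemap is not None:
            self.references.append(usemap)


def dangling_usemaps(markup):
    """Return usemap values that do not point at a map in the document."""
    collector = UsemapCollector()
    collector.feed(markup)
    collector.close()
    return [ref for ref in collector.references
            if not ref.startswith("#") or ref[1:] not in collector.map_names]


page = ('<img src="a.png" usemap="#nav" alt="">'
        '<map name="nav"></map>'
        '<img src="b.png" usemap="#missing" alt="">')
print(dangling_usemaps(page))
```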

http://c.validator.nu/all/

Shorthand for http://c.validator.nu/table/ http://c.validator.nu/nfc/ http://c.validator.nu/text-content/ http://c.validator.nu/unchecked/ http://c.validator.nu/usemap/.

http://c.validator.nu/all-html4/

Shorthand for http://c.validator.nu/table/ http://c.validator.nu/nfc/ http://c.validator.nu/unchecked/ http://c.validator.nu/usemap/.
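The service expands these two shorthands itself. The sketch below only spells out what they stand for, which can be handy when you want to drop a single checker from the set. The function name and the example schema URL are hypothetical.

```python
CHECKER_BASE = "http://c.validator.nu/"

# Expansions as listed above.
SHORTHANDS = {
    CHECKER_BASE + "all/": ["table", "nfc", "text-content", "unchecked", "usemap"],
    CHECKER_BASE + "all-html4/": ["table", "nfc", "unchecked", "usemap"],
}


def expand_schema_list(schema_field):
    """Replace shorthand URLs in a space-separated schema list."""
    expanded = []
    for url in schema_field.split():
        names = SHORTHANDS.get(url)
        if names is None:
            expanded.append(url)  # an ordinary schema or checker URL
        else:
            expanded.extend(CHECKER_BASE + name + "/" for name in names)
    return " ".join(expanded)


print(expand_schema_list("http://example.com/s.rnc http://c.validator.nu/all-html4/"))
```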

http://c.validator.nu/debug/

Dumps parse events as warnings.

FAQ

My server gives the HTML5 validator a 406 status. What’s up?

Your server cannot properly deal with an Accept header that does not have */* in it. Chances are that you are using Apache 1.3, PHP and MultiViews together. MultiViews thinks the type of your page is application/x-httpd-php, which isn’t in the Accept header. Apache 2 does not have this problem.

Can I get a “Valid HTML5” badge?

No, Validator.nu does not give badges.

I have observed that once people are given badges they start to feel entitled to the badges and become hostile if the validation service is changed so that some documents that previously were proclaimed valid no longer are. I do not want to deliberately incite an opposition to bug fixes. I know some of the schemas are not as tight as the corresponding spec prose. If I make them tighter, consider it a bug fix. Moreover, the HTML 5 spec is still changing, so the schema will change as well. Finally, I may (and even intend to) change the namespace associations of preset schemas in the future.

In addition to the problem with changing the validator after badges have been awarded, badges don’t provide value to the readers of validated pages. Validation is a tool for you as a page author—not something your readers need to verify. However, if you are writing about Web authoring and want to refer others to Validator.nu, please, by all means feel free to link to Validator.nu.

Java? Eww. Why didn’t you write it in Python or Ruby?

By the time Ruby on Rails hit everyone’s radar, this project was already underway. However, Ruby would still have been a bad choice had I considered it seriously earlier. Ruby lacks a solid Unicode infrastructure. I’ve already been in a situation where I had to stop writing app code and spend time writing the very basics of Unicode infrastructure. I don’t want to be in that situation again. Ruby lacks solid XML infrastructure as well.

I chose Java over Python for three reasons: SAX, Jing and more experience with Java. Apart from Java feeling like a more secure choice because I had more experience with it, the choice between Java and Python also comes down to infrastructure. Having a platform-wide unified way for plugging together XML tools is extremely important when what you are doing entails plugging together XML tools efficiently.

Java is in a unique position when it comes to XML tool infrastructure. Java has a lot of XML-related libraries available and they pretty much all plug into the same interface. Not only is there a platform-wide XML API, it also happens to be one of the most complete and correct of the XML APIs around. From the point of view of RELAX NG, Java being the language Jing is written in is an extremely important consideration. Jing is a seriously good piece of software. Moreover, Java is the native language of the extensibility interface for RELAX NG datatype libraries.

While I’m on a soap box, I should mention that ICU4J is a seriously good piece of software, too, and having Java’s notion of Unicode frozen as UTF-16 from the dawn of time until eternity is very important considering the stability of infrastructure. It is a horribly bad idea that the meaning of Python programs changes (due to datatypes changing underneath) depending on how the interpreter was compiled. Unicode is optimized for 16-bit units. The stability of sticking to UTF-16 in RAM everywhere outweighs the theoretical purity of UTF-32 in RAM. (On disk and network, use UTF-8, of course.)

I do want to make the validator functionality available to applications that are not written in Java, though. This is why Validator.nu has a Web service interface that can be used either with the instance running at validator.nu or with your private instance running at localhost. I encourage you to write a wrapper library for the Web service in your favorite programming language.

What’s wrong with DTDs?

I think DTDs are bad in four ways:

  1. DTDs pollute the document with schema-specific syntax. Since the document itself declares the rules, the question answered by DTD validation is not the question that should be asked. DTD validation answers the question “Does this document conform to the rules it declares itself?” The interesting question is “Does this document conform to these rules?” when the person who asks the question chooses the rules the question is about.

  2. DTDs mix a validation mechanism, an inclusion mechanism and an infoset augmentation mechanism. The inclusion mechanism is mainly used for character entities, which solve (but only if the DTD is processed and processing it is not required!) an input problem by burdening the recipient instead of keeping input matters between the editing software and the document author.

  3. DTDs aren’t particularly expressive.

  4. DTDs don’t support Namespaces in XML.

I hope providing an online validation service for RELAX NG removes the excuse that DTDs are needed for online validators.

Validation has a clear and precise meaning. Can’t you kids read ISO 8879?

“Validation” and “validator” in the name and the user interface of the service refer to the ISO/IEC FDIS 19757-2 definition of “validator” (which performs validation), to the Schematron “validation” function (which is performed by a validator), and to the HTML 5 definition of “validator”.

Known Issues and Ideas for Future Development

Schemas for XHTML 1.0 are used for HTML 4.01, because XHTML 1.0 is supposed to be a reformulation of HTML 4.01 in XML. However, there are some subtle spec bugs introduced in the reformulation. For this reason, some errors for HTML 4.01 are wrong. For example, XHTML 1.0 (in the DTD) forbids the name attribute on the form element, although it is allowed in HTML 4.01.

Please refer to the bug tracker for other known issues and for ideas for future development.

Reporting Bugs and Getting Help

The preferred forum for discussing issues related to using the (X)HTML5 validator is the WHATWG Help mailing list. The preferred forum for discussing issues related to implementing (X)HTML5 validators in general and this one in particular is the WHATWG Implementors mailing list. Bugs should be reported to Validator.nu Bugzilla.

Feature Details for Custom Schemas

Source Code

The source code and the dependencies can be obtained using a Python-based (no XML situps!) build script:

First, set the JAVA_HOME environment variable properly: export JAVA_HOME=/usr/lib/jvm/java-6-openjdk on Ubuntu or export JAVA_HOME=/Library/Java/Home on Mac OS X.

mkdir checker
cd checker
hg clone https://bitbucket.org/validator/build build
python build/build.py all
python build/build.py all

(Yes, the last line is there twice intentionally. Running the script twice tends to fix a ClassCastException on the first run.) This will download, build and run the system at http://localhost:8888/. For other options, please run python build/build.py --help instead. Please note that the dependencies are big. The script will spend time downloading stuff. The script requires Python, Mercurial, Subversion and JDK 5 or later (JDK 6 and Hardy’s OpenJDK work). (Tested on Mac OS X and Ubuntu with the openjdk-6-jdk package. On Windows, the build completes but the app crashes on startup.) Note: The script wants to see a Sun-compatible jar executable. Debian fastjar will not work.

Deployment

The above example starts a standalone HTTP server with debug messages printed to the console. To use AJP13 instead, use --ajp=on. A log4j configuration for deployment can be given using the --log4j= option. There is a sample file in validator/log4j-deployment-sample.properties. The directory extras/ is searched for additional jars for the classpath. For example, if you configure log4j to send email, you should put the Java Mail API and JavaBeans Activation Framework jars in extras/.

Acknowledgments

I would like to thank the Mozilla Foundation and the Mozilla Corporation for funding this project.

I would like to thank James Clark for writing Jing and for championing RELAX NG and XML. I would also like to thank everyone who tested the development builds, the writers of test cases and everyone who has developed library code and schemas that the service uses.

Mike(tm) Smith has contributed numerous fixes and updates to HTML5 validation.

Philip Jägenstedt contributed Microdata validation support.

The XHTML 1.0 schemas were originally written by James Clark and have been improved by Petr Nálevka.

fantasai designed the (X)HTML5 schema framework, wrote the (X)HTML5 Core schemas and helped along the way when I added features.

JavaScript bits, the favicon and a lot of bug reports were contributed by Simon Pieters.

The schemas for RELAX NG and XSLT were written by James Clark.

The principal author of the schema for DocBook is Norman Walsh.

The SVG schemas come from the W3C.

The MathML schema was written by Yutaka Furubayashi.

Test cases written by fantasai, Anne van Kesteren and Christoph Schneegans were very useful in developing this service.

This product includes software developed by The Apache Software Foundation (http://www.apache.org/).

This product uses The SAXON XSLT Processor from Michael Kay.

Validome by The Validome Team

Focuses on HTML, XHTML, WML. Uses SGML DTDs and custom code for HTML. Uses XSD and custom code for XHTML. Recently added support for RSS and Atom, but that feature is still in flux.

XHTML 1.0 schema validator by Christoph Schneegans

Validates using the XSD implementation of XHTML 1.0.

Relaxed by Petr Nálevka

Uses RELAX NG and Schematron for validating XHTML and HTML. (The XHTML 1.0 schemas offered here as presets are based on the schemas used in Relaxed.)

Page Valet by WebThing / Nick Kew

DTD-based SGML and XML validation.

Feed Validator by Sam Ruby, Mark Pilgrim, Joseph Walton, and Phil Ringnalda

Checks Atom and RSS feeds. Uses Python as the schema language. :-)

The W3C CSS Validation Service

Checks CSS style sheets.

The W3C Markup Validation Service

DTD-based SGML and XML validation.

Terms of service

#include "common-sense.h"
#include "disclaimer.h"

If you do not accept these terms or the Privacy Policy below, do not use the service.

This service is provided in the hope that it is useful. Neither Henri Sivonen nor anyone else has any obligation to provide this service to you. The service or any part thereof may be discontinued at any time without notice. There is absolutely no warranty. There is no guarantee of a level of service. If you need a guaranteed level of service, you should probably run your own instance of the software.

Please use the service reasonably. If you call it from your own blog, that’s cool. If you need a validator as a part of a massively traffic-generating blog hosting service, please run your own instance.

Privacy policy

When you access the validation service, data about the access is logged for the purpose of understanding the use of the service, identifying popular resources for retrieval to local storage and acting on abuse.

The HTTP request/response pair between your user agent and the service is logged in the “combined” format (without identd check). The logged data includes the network address of the remote host from which the request came, the HTTP authentication name (if for whatever reason supplied; not requested by the service), the date and time of the request, the first line of the request including the HTTP version, the path part of the URL and the query string containing the validator arguments, the HTTP “Referer” header (where you came from) and HTTP “User-Agent” header (the name and version of your browser).
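For readers unfamiliar with the “combined” format, the sketch below parses one such line into the fields listed above. The sample line is made up for illustration (documentation-range address, invented user agent), and the pattern is a simplification that does not handle escaped quotes inside header values.

```python
import re

# host ident user [time] "request" status size "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields for a combined-format line, or None."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


# Made-up sample entry; "-" marks a field that was not supplied.
sample = ('192.0.2.1 - - [10/Oct/2008:13:55:36 +0300] '
          '"GET /?doc=http%3A%2F%2Fexample.com%2F HTTP/1.1" 200 2326 '
          '"http://example.org/" "ExampleBrowser/1.0"')
entry = parse_combined(sample)
print(entry["host"], entry["request"])
```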

Additionally, the URLs of the HTTP requests made by the validator are logged. Some internal error conditions may also be logged. When an internal error condition is logged, the log entry may include data entered by you or pertaining to the resources your request caused the validator to process. Finally, (X)HTML5 validation errors are logged for documents that are retrieved from the Web (i.e. for documents that are world-readable anyway).

The logs are readable by me (Henri Sivonen) and, technically, by the administrators of the hosting provider. I have no intent of sharing raw log entries with others (except with law enforcement officials if necessary). However, I reserve the right to publish aggregate statistics derived from the logs.