There has been a lot of useful feedback on my previous post, both in the comments and on xml-dev, so I thought I would summarize my current thinking.
It's important to be clear about the objectives. First of all, MicroXML is not trying to replace or change XML. If you love XML just as it is, don't worry: XML is not going away. Relative to XML, my objectives for MicroXML are:
- Compatible: any well-formed MicroXML document should be a well-formed XML document.
- Simpler and easier: easier to understand, easier to learn, easier to remember, easier to generate, easier to parse.
- HTML5-friendly, thus easing the creation of documents that are simultaneously valid HTML5 and well-formed XML.
JSON is a good, simple, extensible format for data. But there's currently no good, simple, extensible format for documents. That's the niche I see for MicroXML. Actually, extensible is not quite the right word; generalized (in the SGML sense) is probably better: I mean something that doesn't build in tag names with predefined semantics. HTML5 is extensible, but it's not generalized.
There are a few technical changes that I think are desirable.
- Namespaces. It's easier to start simple and add functionality later, rather than vice-versa, so I am inclined to start with the simplest thing that could possibly work: no colons in element or attribute names (other than xml:* attributes); "xmlns" is treated as just another attribute. This makes MicroXML backwards compatible with XML Namespaces, which I think is a big win.
- DOCTYPE declaration. Allowing an empty DOCTYPE declaration <!DOCTYPE foo> with no internal or external subset adds little complexity and helps a great deal with HTML5-friendliness. It should be a well-formedness constraint that the name in the DOCTYPE declaration match the name of the document element.
- Data model. It's a fundamental part of XML processing that <foo/> is equivalent to <foo></foo>. I don't think MicroXML should change that, which means that the data model should not have a flag saying whether an element uses the empty-element syntax. This is inconsistent with HTML5, which does not allow these two forms to be used interchangeably. However, I think the goal of HTML5-friendliness has to be balanced against the goal of simple and easy and, in this case, I think simple and easy wins. For the same reason, I would leave the DOCTYPE declaration out of the data model.
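To make the data model and naming points concrete, here is a minimal Python sketch. It is only an illustration under some assumptions: `to_model`, `parse`, `element_name_ok` and `attribute_name_ok` are hypothetical names, and the standard library's `xml.etree.ElementTree` (a full XML parser that does its own namespace processing) merely stands in for a MicroXML parser, so the sample inputs avoid `xmlns` and `xml:*` attributes.

```python
import xml.etree.ElementTree as ET


def to_model(elem):
    """Reduce a parsed element to a (name, attributes, children) tuple.

    There is deliberately no field recording empty-element syntax and
    no field for the DOCTYPE declaration.
    """
    children = []
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(to_model(child))
        if child.tail:
            children.append(child.tail)
    return (elem.tag, dict(elem.attrib), children)


def parse(text):
    # ElementTree is only a stand-in for a real MicroXML parser.
    return to_model(ET.fromstring(text))


def element_name_ok(name):
    """Element names may not contain a colon."""
    return name != "" and ":" not in name


def attribute_name_ok(name):
    """Attribute names may not contain a colon, apart from an xml: prefix."""
    local = name[4:] if name.startswith("xml:") else name
    return local != "" and ":" not in local
```

With this model, `parse("<foo/>")`, `parse("<foo></foo>")` and `parse("<!DOCTYPE foo><foo/>")` all produce the same value, and `attribute_name_ok` accepts `xmlns` and `xml:lang` but rejects `xlink:href` and `xmlns:x`.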
Here's an updated grammar.
    # Documents
    document ::= comments (doctype comments)? element comments
    comments ::= (comment | s)*
    doctype ::= "<!DOCTYPE" s+ name s* ">"

    # Elements
    element ::= startTag content endTag
              | emptyElementTag
    content ::= (element | comment | dataChar | charRef)*
    startTag ::= '<' name (s+ attribute)* s* '>'
    emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
    endTag ::= '</' name s* '>'

    # Attributes
    attribute ::= attributeName s* '=' s* attributeValue
    attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                     | "'" ((attributeValueChar - "'") | charRef)* "'"
    attributeValueChar ::= char - ('<'|'&')
    attributeName ::= "xml:"? name

    # Data characters
    dataChar ::= char - ('<'|'&'|'>')

    # Character references
    charRef ::= decCharRef | hexCharRef | namedCharRef
    decCharRef ::= '&#' [0-9]+ ';'
    hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
    namedCharRef ::= '&' charName ';'
    charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'

    # Comments
    comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
    # Enforce the HTML5 restriction that comments cannot start with '>' or '->'
    commentContentStart ::= (char - ('-'|'>')) | ('-' (char - ('-'|'>')))
    # As in XML 1.0
    commentContentContinue ::= (char - '-') | ('-' (char - '-'))

    # Names
    name ::= nameStartChar nameChar*
    nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF]
                    | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F]
                    | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD]
                    | [#x10000-#xEFFFF]
    nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

    # White space
    s ::= #x9 | #xA | #xD | #x20

    # Characters
    char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
    forbiddenChar ::= surrogateChar | #xFFFE | #xFFFF
    surrogateChar ::= [#xD800-#xDFFF]
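As a rough check that the lexical parts of this grammar are easy to implement, here is a Python sketch that transcribes the name, character reference and comment productions into regular expressions. It is not a parser, and the identifiers (`NAME_RE`, `ATTR_NAME_RE`, `COMMENT_RE`, `is_char`, `resolve_char_ref`) are hypothetical. Two things go beyond what the grammar itself states and should be read as assumptions: `resolve_char_ref` rejects references to code points outside `char`, as XML 1.0 does, and `COMMENT_RE` checks only the hyphen rules, not that every character is a `char`.

```python
import re

# nameStartChar and nameChar, written as regex character-class bodies.
NAME_START = (
    "A-Za-z_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
NAME_CHAR = NAME_START + "0-9.\u00B7\u0300-\u036F\u203F-\u2040\\-"

NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")
ATTR_NAME_RE = re.compile(f"(?:xml:)?[{NAME_START}][{NAME_CHAR}]*")

# charRef ::= decCharRef | hexCharRef | namedCharRef
CHAR_REF_RE = re.compile(
    r"&(?:#([0-9]+)|#x([0-9a-fA-F]+)|(amp|lt|gt|quot|apos));"
)
NAMED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# comment ::= '<!--' (commentContentStart commentContentContinue*)? '-->'
COMMENT_RE = re.compile(r"<!--(?:(?:[^\->]|-[^\->])(?:[^\-]|-[^\-])*)?-->")


def is_char(cp):
    """char ::= s | ([#x21-#x10FFFF] - forbiddenChar)"""
    if cp in (0x9, 0xA, 0xD, 0x20):
        return True
    if not 0x21 <= cp <= 0x10FFFF:
        return False
    return not (0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF))


def resolve_char_ref(ref):
    """Return the character denoted by a charRef, or raise ValueError."""
    m = CHAR_REF_RE.fullmatch(ref)
    if m is None:
        raise ValueError(f"not a charRef: {ref!r}")
    dec, hexa, named = m.groups()
    if named:
        return NAMED[named]
    cp = int(dec) if dec else int(hexa, 16)
    # Assumption (borrowed from XML 1.0): the target must itself be a char.
    if not is_char(cp):
        raise ValueError(f"charRef to forbidden code point: {ref!r}")
    return chr(cp)
```

For example, `NAME_RE.fullmatch` accepts `foo-bar.1` but not `1foo` or `a:b`, `resolve_char_ref` accepts `&#x41;` and `&amp;` but raises on `&nbsp;`, and `COMMENT_RE.fullmatch` accepts `<!-- a -->` but not `<!--->-->` or `<!-- a -- b -->`.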