Talk:XML/Archive 4

From Wikipedia, the free encyclopedia

Some XML terms

I believe the main text needs a bit more beef, but I hesitate to add such content since it's a bit too much for most people... you can keyword-search from inside Visual Studio 2005 if you have it. --Raylopez99 20:57, 1 October 2007 (UTC)


XML's set of tools helps developers in creating web pages but its usefulness goes well beyond that. XML, in combination with other standards, makes it possible to define the content of a document separately from its formatting, making it easy to reuse that content in other applications or for other presentation environments. Most importantly, XML provides a basic syntax that can be used to share information between different kinds of computers, different applications, and different organizations without needing to pass through many layers of conversion.[4] —Preceding unsigned comment added by 220.226.191.107 (talk) 05:06, 28 February 2009 (UTC)

SGML

This article states "Its predecessor, SGML, has been in use since 1986, so there is extensive experience and software available." But the article on SGML states "few SGML-aware programs existed when XML was created". Which is it? Few programs, or extensive software? Can't be both? 10:48, 12 December 2008 (UTC) —Preceding unsigned comment added by 195.72.173.51 (talk)

I assume that sentence was supposed to mean "(extensive experience) + (software)", as opposed to "extensive (experience + software)". Software included WordPerfect Office (see for example WordPerfect Office 2000 review in PCPro July 1999, WordPerfect Office 2002 review at DTPStudio and Corel WordPerfect Office X3), Adobe FrameMaker + SGML (see Adobe FrameMaker; with versions for Windows, Macintosh and Unix), the OmniMark programming language, various SGML parsers (including James Clark's famous sgmls), tools for viewing and editing DTDs, etcetera. In fact, Peter Flynn published a complete book on the subject: Understanding SGML and XML Tools: Practical programs for handling structured text Boston/Dordrecht/London: Kluwer Academic Publishers, 1998. ISBN 0-7923-8169-6. (XML was brand new at the time, but development of XML software had already started while the spec was under development.)
With regard to experience, this is something that may be inferred from the body of literature on the subject. (For example, search Amazon.) --ChristopheS (talk) 12:28, 12 February 2009 (UTC)
It is a contradiction. But there were many SGML-specific applications that operated directly on the SGML, but not many applications that happened to use SGML as a side issue. While with XML, everything uses it, but relatively few applications manipulate it directly. This is largely because SGML matches publishing requirements, and because it was so complex that implementing it was too much of an effort for it to be used as an afterthought, the way XML is. Rick Jelliffe (talk) 14:52, 10 August 2009 (UTC)

Disadvantages of XML

XML vs. binary is a disadvantage (the first item in the list)? Why not the disadvantage of XML vs. a ham sandwich? XML is not a binary format. It should be compared to other markup systems, not to a format that it is clearly not in the family of. —Preceding unsigned comment added by 205.229.50.10 (talk) 23:36, 16 December 2008 (UTC)

If you have a credible cite that indicates XML and Ham Sandwiches as reasonably capable of mutual substitution, or otherwise capable of meeting the same or similar requirements as alternative technologies, then even that comparison might merit some mention. As long as there is a meaningful basis for comparison and it is within the scope of the article, then it is potentially relevant to the article. dr.ef.tymac (talk) 18:09, 19 January 2009 (UTC)
Ehhm, as regards edibility, ham sandwiches rulezz, but XML might also be fine if it generates money that generates ham sandwiches... Prawn would also be fine, and maybe a pizza and a couple of beers on Friday.
Besides that, you're partially right. Foremost, XML vs. "binary" is a weird comparison, since gzipping an XML document will make it "binary". XML could be compared to other things according to usage, where the most obvious competitor is SGML. XML bundled with its "obvious partners" such as XSLT, XSL:foo, XPath, XLink and such could be used for documentation with a unique DTD, and then it could be compared to e.g. DocBook, but such specific comparisons should be highlighted as per-topic. ... said: Rursus (bork²) 13:41, 11 February 2009 (UTC)

Confused

I came here to learn about XML; I know almost nothing about it. The section on Well-formedness confuses me, as it doesn't appear to make any sense:

The only indispensable syntactical requirement is that the document has exactly one root element (also known as the document element), i.e. the text must be enclosed between a root start-tag and a corresponding end-tag, known as a 'well-formed' XML document: <book>This is a book... </book>

First of all, it's confusing that "root element" and "document element" appear to be fundamental terms, yet they're merely in bold. Seems like if it's an "indispensable syntactical requirement," it might warrant an entry--or at least a complete definition somewhere on the page.

Second, it seems like a non sequitur to say that an XML document must have "exactly one element, i.e. the text must be enclosed between a root start-tag and a corresponding end-tag..." Where's the one element? I read a statement--another requirement, in fact.

To top off this strange sentence, this is followed by "known as a 'well-formed' XML document." What is known as a well-formed document, the element? The text?

In the example, is "This is a book..." the element? The whole well-formed document?

Someone should explain this better. I'd like to understand.

Thanks. —Preceding unsigned comment added by 152.3.112.145 (talk) 21:03, 9 March 2009 (UTC)
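
To make the terms in the question above concrete, here is a minimal sketch using Python's standard-library parser (the choice of Python and of ElementTree is an assumption of this illustration, not something the article prescribes). It treats the whole string as the document, the `<book>...</book>` pair plus its content as the single root element, and "This is a book... " as that element's content.

```python
import xml.etree.ElementTree as ET

# The whole string is the document; it holds exactly one root element.
document = "<book>This is a book... </book>"
root = ET.fromstring(document)

# An element is the start-tag, its content, and the end-tag taken together.
print(root.tag)    # element name taken from the <book> ... </book> tags
print(root.text)   # the content enclosed by those tags

# A second top-level element means the text is no longer a well-formed document.
try:
    ET.fromstring("<book>one</book><book>two</book>")
    second_root_accepted = True
except ET.ParseError:
    second_root_accepted = False
print(second_root_accepted)
```

So "well-formed" describes the document as a whole, not the element or the text inside it.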

"This article may be too long to comfortably read and navigate."

I suggest that this article is actually of an appropriate length, given the complexity of the topic. I would prefer NOT to see it broken up into subtopics. (Of course, links to subtopics would be welcomed, but I don't think there's too much text here/too many subtopics for this particular topic.)

I'm guessing that this is an automatically-generated warning from a bot. If others agree with me on this point (of the article NOT being too long), perhaps someone knows how to get rid of the warning AND flag it somehow so the bot won't re-add the warning. "???"

Aloha, philiptdotcom (talk) 02:29, 2 July 2009 (UTC)

I agree. Does anyone have an objection to us removing this notice? --Nigelj (talk) 12:11, 2 July 2009 (UTC)
It was me who tagged it I think. The article currently reads like it was copied out of a textbook; non-technical readers would have little hope of finishing it before falling asleep at their keyboards. That something is a "complicated topic" only excuses it from our guidelines on length if it is very well-written, such as our various Presidential biographies. This article could easily be cut down to a more manageable length without it being less valuable; for instance, the minutiae on validation belong on XML validation (itself a very poor article presently) and not here. Feel free to work on this if you want; I'll try to do so myself at some point, which is what the tag is there to remind me of. Chris Cunningham (not at work) - talk 12:26, 2 July 2009 (UTC)
I took the liberty of removing the tag (before reading this discussion). I did that on the basis that the amount of information presented is probably about right for most readers. That of course is a personal judgement. I'm not sure whether the tag was added before or after Tim Bray's rewrite. Mhkay (talk) 09:57, 6 August 2009 (UTC)
Before. --Cybercobra (talk) 22:29, 7 August 2009 (UTC)

This article is a mess

There are maybe two or three smaller-sized coherent WP entries struggling to get out in here. I'm going to invest a few hours in the next few days in trying to make it smaller and cleaner. There are quite a few statements currently here that need some supporting references. I think the WP policies requiring such support are sane and will try to follow them. Tim Bray (talk) 03:28, 29 July 2009 (UTC)

I'm a rather competent XML practitioner and would be willing to provide a sanity check on the structure and content. Artcolman (talk) 20:50, 30 July 2009 (UTC)

Entity references and Escaping

Are &amp; , &lt; , &gt; , &quot; and &apos; the only characters that NEED to be escaped ("need" meaning "required by the XML standard")? Or are there other characters that need to be escaped? Or even fewer than those 5 ( &gt; doesn't seem to be necessary to correctly parse XML - in contrast to the other 4)? This is relevant (at least to me) since XML can be used with many different encodings.
And related to this: Is the character encoding used for the entire document or only for its "content" (leaving the tags, i.e. the structure around this content, in ASCII or whatever standard encoding XML is supposed to use (seems to be UTF8))? E.g., if for some strange reason I were to use EBCDIC as encoding, would the & and < used in XML be the UTF-8 & respectively UTF-8 < or the EBCDIC & respectively EBCDIC < ?
And what about "<?xml version="1.1" encoding="EBCDIC"?>"? Would that line be itself encoded in EBCDIC or in UTF-8? Catskineater (talk) 20:55, 2 August 2009 (UTC)

Sometimes ', ", and > need to be escaped, ' and " in attribute values, and > after the string "]]". The character encoding always applies to the whole document, markup and content. Tim Bray (talk) 00:37, 5 August 2009 (UTC)
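
As a rough illustration of the escaping rules just described, the sketch below uses Python's standard `xml.sax.saxutils` helpers (an assumption of this example; any XML library would do). The element name `note` and attribute name `author` are hypothetical. The helpers escape `&`, `<` and `>` in content, and quote characters only inside the attribute value, and the round trip through a parser shows the original strings come back unchanged.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

content = 'AT&T: 1 < 2, and ]]> must not appear literally'
value = 'He said "it\'s fine" & left'

# escape() rewrites &, < and > for use in character data.
escaped_content = escape(content)

# quoteattr() adds the surrounding quotes and escapes whichever quote
# character it has to, in addition to &, < and >.
quoted_value = quoteattr(value)

# "note" and "author" are made-up names for this illustration.
document = f"<note author={quoted_value}>{escaped_content}</note>"
note = ET.fromstring(document)

print(escaped_content)
print(quoted_value)
print(note.text == content, note.get("author") == value)
```

Escaping `>` everywhere, as `escape()` does, is more than the spec strictly requires, but it is a simple way to guarantee that `]]>` never appears in content.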
Wow, wouldn't that make XML lacking external encoding information impossible to parse? To even parse the first line '<?xml version="1.1" encoding="some_strange_encoding"?>' describing the used encoding you have to know that same encoding. You could guess based on statistical information, but that doesn't sound robust. With encodings that are supersets of ASCII you won't notice the problem, but there are examples like EBCDIC which aren't supersets of ASCII. Catskineater (talk) 17:16, 5 August 2009 (UTC)
In principle, yes. XML parsers aren't actually required to understand any encodings except UTF-8 and UTF-16 with a BOM. Some parsers understand EBCDIC of various flavors, others don't: there is an explanation of how to detect EBCDIC in the non-normative Appendix F. Conceivably, some encodings might not be detectable at all: the fortunately fictional encoding US-BSCII, for example, which is the same as US-ASCII except it interchanges 'a' and 'b', meaning that "us-bscii" in US-BSCII is encoded with exactly the same bytes as "us-ascii" in US-ASCII. Fortunately encodings that are neither UTF-16, UTF-32, ASCII supersets (including UTF-8 and the ISO 8859 group), nor EBCDIC variants are very rare. --John Cowan (talk) 02:05, 9 August 2009 (UTC)
There are of course many ASCII variants too, though they are becoming rare: but in most cases they don't vary the basic subset of characters [a-zA-Z0-9] which are typically found in the encoding name. Some do, however, substitute the lower-case latin letters with upper-case letters from another alphabet such as Greek or Cyrillic. Mhkay (talk) 02:52, 9 August 2009 (UTC)
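
The bootstrapping step discussed in this thread can be sketched as follows. This is a simplified reading of the byte patterns in the non-normative Appendix F of the XML 1.0 spec, written in Python purely for illustration. The function name and the returned labels are made up for this sketch, and a real parser does considerably more.

```python
def sniff_encoding_family(data: bytes) -> str:
    """Guess the encoding family from the first bytes, before any parsing."""
    signatures = [
        # Byte order marks; four-byte forms first, since FF FE 00 00 begins with FF FE.
        (b"\x00\x00\xfe\xff", "UCS-4 big-endian"),
        (b"\xff\xfe\x00\x00", "UCS-4 little-endian"),
        (b"\xfe\xff", "UTF-16 big-endian"),
        (b"\xff\xfe", "UTF-16 little-endian"),
        (b"\xef\xbb\xbf", "UTF-8"),
        # No byte order mark: look for how "<" or "<?xm" would be encoded.
        (b"\x00\x00\x00\x3c", "UCS-4 big-endian"),
        (b"\x3c\x00\x00\x00", "UCS-4 little-endian"),
        (b"\x00\x3c\x00\x3f", "UTF-16 big-endian"),
        (b"\x3c\x00\x3f\x00", "UTF-16 little-endian"),
        (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),
        (b"\x4c\x6f\xa7\x94", "EBCDIC"),
    ]
    for prefix, family in signatures:
        if data.startswith(prefix):
            return family
    # Nothing matched: without a declaration or BOM the default is UTF-8.
    return "UTF-8 by default"


declaration = '<?xml version="1.0" encoding="cp037"?><r/>'
print(sniff_encoding_family(declaration.encode("cp037")))      # an EBCDIC code page
print(sniff_encoding_family(declaration.encode("utf-8")))
print(sniff_encoding_family(declaration.encode("utf-16-be")))
```

Once the family is known, the parser can read the declaration far enough to find the exact encoding name, which is why the fictional US-BSCII case would defeat this scheme.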

Rewriting

I'm proceeding through from start to end, trying to leave things behind in a sane state as I pass through. So far, I've whacked a bunch of text that seemed superfluous or in the wrong place but left it behind in comments in case the right place should appear.

I spent some time going back and forth through the XML spec, and it's just too big, there's no way all the syntactic variations and corner cases can sanely be described in this entry. So it seems to me that it's important to make sure that the important stuff: elements, attributes, encoding, escaping, and so on, be well-described. Should there be an ancillary article on "XML Syntax" or some such where there could be a deep-dive on stuff nobody cares about like NOTATION and unparsed entities and so on?

So, what I'm trying to do is leave this in a condition where what's left is an opinion-free well-referenced tour through the important pieces of XML. Some things are sorely lacking: an introduction to the "XML stack" - XML & the Infoset & XSD & XSLT & XPath & RelaxNG and so on and so on.

I'd really welcome opinions about how to handle the stupid verbose unhelpful "pro & con" section. I'm really unconvinced that an encyclopaedic article on XML is actually the place for an argument about whether it's good or not. Here's what it is, here's where it's been used.

What *is* needed, and I'm not sure whether it's in this article or not, is some discussion of XML & other formats that are in common use for data interchange: ASN.1, YAML, and JSON leap to mind. Someone keeps talking up s-expressions but I've never actually seen them used for industrial data interchange. Tim Bray (talk) 00:46, 5 August 2009 (UTC)

A Comparison of data serialization formats would certainly be interesting. Regarding syntax, there is certainly precedent from several programming language articles to have a separate sub-article on syntax. --Cybercobra (talk) 01:40, 5 August 2009 (UTC)
The "pro & con" section is unhelpful because it is only a list of disconnected topics IMHO. However I think it is impossible to remove it because a lot of editors would want to have their way and list whatever criticisms they could find. I propose to have only a short chapter here, mainly linking to another article with this list. Hervegirod (talk) 09:41, 5 August 2009 (UTC)
"a lot of editors would want to have their way" doesn't seem like a good argument to me. Wikipedia experts: is there a tag you can put on a section saying you think it should be deleted, sort of a last-call? Tim Bray (talk) 19:44, 5 August 2009 (UTC)
I agree with you (and you are an XML expert!), but I don't know if it's possible to put such a tag without "securing" some place for the criticisms, at least those which are properly sourced. I made this suggestion because I remember a similar situation with the Java (programming language) article. There was a "Criticism" paragraph which at the end became a long list of disconnected criticisms. Creating a specific article (which has its own problems, I admit) was a way to avoid this, and now the main article is just linking the specific "Criticism" article. However I know it's really an imperfect solution. Hervegirod (talk) 19:59, 5 August 2009 (UTC)

Let's emphasize how XML helped with the adoption of UTF-8. 99.56.139.29 (talk) 02:39, 5 August 2009 (UTC)

It would help novices if the opening paragraphs made clear that the term "XML" can mean 3 different things, depending on context: (1) the syntax and rules defined by the W3C XML 1.* specs; (2) any vocabulary based on the XML spec (whether or not there is an associated schema); and (3) the set of XML languages applied to a particular problem set, as in "we are using XML technology for information exchanges". Ken Sall (talk) 22:28, 5 August 2009 (UTC)

Ken - good idea. I'd recast (3) slightly as "xml technology in general", in that (nice) example sentence, when they say XML they mean one or more XML languages and parsers and APIs and transformers and so on. Why don't you go ahead and add that? Tim Bray (talk) 23:16, 5 August 2009 (UTC)
Actually, "XML" is used very widely to mean anything except XML (the markup language). In particular, it is used to mean typed data objects that form a series of trees by XQuery people and merchants, and typed+validated trees (PSVI) by XSD people and merchants. And it is used to stand for the full WS-* stack sometimes: how often do we read comments on "how complicated XML is" only to find they are not discussing XML (the markup language) at all! It is like how Java EE people habitually use "Java" instead of "Java EE". Rick Jelliffe (talk) 06:48, 6 August 2009 (UTC)
Tim and Rick, I've added an "About the term XML" section. Couldn't figure out how to make the link to the Category:XML page which I think would be a good jumping off point for those interested in the "XML technologies" aspect. Ken Sall (talk) 01:51, 7 August 2009 (UTC)

Misc

How can XML be "fee free" when Microsoft is not allowed to use it, and has patented some of it? http://blogs.zdnet.com/BTL/?p=22595&tag=nl.e550 http://i.zdnet.com/blogs/msfti4icomplaint.pdf http://i.zdnet.com/blogs/msfti4ijudgment.pdf —Preceding unsigned comment added by A6zzz (talk • contribs) 22:30, 12 August 2009 (UTC)

No-one has patented XML and everyone is free to use it. Some people claim to have patented some processes that happen to use XML, and therefore restrict your ability to use such processes. But to claim you can't use XML because of patents is like claiming you can't use a spring because someone has patented a mousetrap that uses springs. Mhkay (talk) 13:59, 13 August 2009 (UTC)


The section introducing elements would be improved by noting that XML names cannot contain spaces. We take it for granted, but that <person name="fred"> means there is a person with an attribute name is not obvious: I have had people (COBOL programmers and database people) who want to read it as a property "person name" which has a value "fred". Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)
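
The reading described above can be checked with any parser. A minimal sketch, using Python's standard library as an assumed tool:

```python
import xml.etree.ElementTree as ET

person = ET.fromstring('<person name="fred"/>')

print(person.tag)      # the element name ends at the first space
print(person.attrib)   # what follows is an attribute called "name"
```

The parser reports an element named `person` carrying an attribute `name`, not a property called "person name".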

The section on Sources does not mention Charles Goldfarb. This is a major omission, bordering on insult, since many of the ideas in XML come directly from him and his efforts. I suggest the second sentence of the first paragraph should be "Many of the ideas in SGML in turn are attributed to Charles Goldfarb." Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)

The section on Sources mentions Steve deRose. I also was involved in implementing a WF-class parser in 1989 as part of the RISP LISP SGML text processor at Unicode, Tokyo. And even when people used a full SGML parser such as OmniMark, normalizing the SGML was a common step. So XML reflected many years of common practise. I don't think it is important enough to warrant a change, just noting it. Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)

Not to diss Steve, but I also built a WF-class parser at the New OED project in 87-89. Rick, why don't you go ahead and make some of the changes you're suggesting? Tim Bray (talk) 08:30, 6 August 2009 (UTC)

The section on Sources is a little squiffy on ERCS, but I don't expect there is any value in teasing it out. ERCS came first: some of its recommendations (extended naming rules) made it into SGML (ENR)[[1]], which I co-drafted before the XML effort started; XML itself adopted/adapted other parts of it (hex NCR, the particular naming rules); and then SGML was again fixed to support other leftover parts of XML [[2]]. I don't think it is important enough to warrant a change, just noting it. Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)

XML Base should be mentioned in the Related Specifications. I don't know how much xml:base is used, but the XML Infoset spec depends on it as do XML Schema, RELAX NG, XPath 2.0, XSLT 2.0, XQuery. I think the ordering of the listed specifications should as much as possible reflect layering, so I would put it after XML Namespaces and before XML Infoset. I would move xml:id to after XML Infoset (since XPath 2.0 depends on it, but XML Infoset doesn't). James Clark (talk) 22:16, 7 August 2009 (UTC)

The section on xml:id was seriously in error. The language I changed was "allows an author to confer ID-ness (in the sense used in a DTD) on an attribute, by naming it in an xml:id attribute". That sounds as if you give an attribute name as the value of an xml:id attribute, and the named attribute becomes an ID. We could have done that, but we did something simpler: any attribute named xml:id is an ID, even in a document whose DTD does not define it so. So my wording is "allows an author to confer ID-ness (in the sense used in a DTD) on an attribute, by naming it xml:id." —Preceding unsigned comment added by Johnwcowan (talk • contribs) 02:14, 9 August 2009 (UTC)
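
To illustrate the corrected wording, here is a small sketch in Python. The element names and ID values are made up, and ElementTree is only an assumed tool: it does not itself implement xml:id processing (for example, uniqueness checks), so the lookup table is built by hand. No DTD is involved; the attribute is treated as an ID purely because it is named xml:id.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved "xml" prefix under its namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

doc = ET.fromstring(
    '<doc>'
    '<section xml:id="intro">Hello</section>'
    '<section xml:id="end">Bye</section>'
    '</doc>'
)

# Index every element that carries an attribute named xml:id.
by_id = {
    el.get(XML_ID): el
    for el in doc.iter()
    if el.get(XML_ID) is not None
}

print(sorted(by_id))
print(by_id["intro"].text)
```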

As an aside, I have just edited the entry on SGML in various ways which I hope will make it a more useful companion to the XML entry, mainly at the top. It still has a lot of insane minutiae about syntax, all icing and no cake, so anyone else from that era is welcome to improve it further. Rick Jelliffe (talk) 17:31, 10 August 2009 (UTC)

The section on related standards is entirely disposable. As standards they are hardly the most important or typical standards that use XML. Their only connection is that they are all ISO standards: so what? Rick Jelliffe (talk) 09:39, 18 August 2009 (UTC)

Zhang/XimpleWare/VTD-XML

There's a minor edit war going on here, where J. Zhang, founder of XimpleWare, is trying to promote his company's product by putting plugs for his VTD-XML into this article. BTW I dropped by XimpleWare via a Google search and got a warning about it being a known host for browser-attack viruses. Tim Bray (talk) 21:05, 7 August 2009 (UTC)

Yes, I've been one of those deleting Zhang's contributions. He (or she?) has also been conducting a dialogue on my talk page and in private email. To be fair to Zhang, I don't think this is commercially-motivated advertising, it stems from a deep conviction that their technology is highly significant and will change the world. But for this article, the rule has to be that they first need to convince the world of its significance through means other than Wikipedia. Mhkay (talk) 09:23, 8 August 2009 (UTC)

DTD

The sentence "DTD is still used in many applications because of its ubiquity." has a strong tautological feel. I'd go ahead and edit it (to, e.g., "Despite these perceived limitations, DTD is still in widespread use.") but perhaps there's a more subtle point that I'm missing - something to do with the specific mention of applications. I.e., DTDs are still used by applications because there's so much XML out there that uses DTDs. Was that the intention motivating the mention of applications? If so, the sentence could still be changed to feel less tautological. Terry Jones (talk) 21:54, 8 August 2009 (UTC)

I've changed this to "DTD technology ...". I think what's intended here is to convey that many models are still defined with DTDs because there's a pretty good guarantee that most XML software will then support it (of course, that's not the whole story ... but a fuller analysis of why DTDs haven't gone away would perhaps not be right for this article?) Alexbrn (talk) 07:08, 9 August 2009 (UTC)

There is a new paragraph that says that DTDs are hard to read because of parameter entities. But I think it is bogus. Who thinks that XSD is easier to *read* for example? People really like RELAX NG compact syntax, and *every* statement in RELAX NG is a parameter-entity equivalent. I would remove this sentence or phrase it in some neutral way, and (because I don't think it is universally accepted) get some citation. Rick Jelliffe (talk) 08:29, 25 August 2009 (UTC)

Referencing the spec

This article needs a policy decision as to which version of the XML spec should be referenced by default, and then someone needs to go through and look at every reference to the spec; check out the References section to get a feel for the current disorder. My feeling would be to stick with www.w3.org/TR/REC-xml/ everywhere that a specific edition isn't being called out, but I haven't thought it over deeply. Tim Bray (talk) 04:47, 10 August 2009 (UTC)

I think that's the way to go. Hervegirod (talk) 17:38, 10 August 2009 (UTC)

Something for Dummies?

I know this article is not intended to be "XML for Dummies" but it doesn't at all help the kind of people that I try to explain XML to when they ask me what I do for a living. I generally do this by pointing out the syntactical similarities to HTML and the fact that you can define your own tags. Probably this is not the best way to describe XML but something near this level would be a good thing, preferably before the word "lexical" is ever used, and with an example. 173.32.243.222 (talk) 22:43, 10 August 2009 (UTC)

I like the XML-for-dummies text that's now serving as the article's lede, but I don't think it comes close to what Wikipedia:Lead wants, so I'm going to have a whack at a better lede. For the moment, I'm going to stash the for-dummies text here and see if we can find a home for it somewhere else in the article. Tim Bray (talk) 21:39, 13 August 2009 (UTC)

XML (Extensible Markup Language) is a way of marking up structured documents, that is, documents in which the markup primarily indicates the content's purpose rather than its formatting. Like HTML, it uses tags (for example <para>) and attributes (for example <image file="face.jpg"/>) but there are some very important differences. Among them are:

  • Tags must be balanced: a start-tag such as <name> must be paired with an end-tag </name> (the tags, plus their content, comprise an element), or else the tag must use the empty-tag syntax (<name/>) in which case it cannot have content, but may have attributes.
  • Elements must be properly nested: if element B's start-tag occurs after element A's start-tag, then B's end-tag must occur before A's end-tag (for example, <firstname>Fred<lastname>Fish</firstname></lastname> is not allowed).
  • The values of attributes must be enclosed in matching single- or double-quotes.
  • The document must contain exactly one top-level element (known as the root element) that encloses the rest of the elements.

Documents that meet the above four requirements, and certain others, are said to be well-formed. All XML documents must be well-formed.
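
The four requirements listed above can be tried out against a parser. The following is a minimal sketch using Python's standard library, chosen here only for illustration. The helper name `is_well_formed` and the sample strings are made up, and a real parser enforces further rules as well.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


examples = {
    "balanced and nested": "<name><firstname>Fred</firstname><lastname>Fish</lastname></name>",
    "unbalanced tag": "<name>Fred",
    "improper nesting": "<name><firstname>Fred<lastname>Fish</firstname></lastname></name>",
    "unquoted attribute": "<image file=face.jpg/>",
    "two top-level elements": "<a/><b/>",
}

for label, text in examples.items():
    print(label, is_well_formed(text))
```

Only the first sample is accepted; each of the others breaks exactly one of the four rules.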

  • Instead of using a more or less fixed set of elements and attributes, as in HTML, designers may create their own.
  • Optionally, XML documents may be tested against a set of formally-declared rules (known as a DTD or Schema) that specify how the elements can be arranged and what attributes they have. This is called validation.

HTML elements have built-in rendering support in web browsers, but XML documents must be accompanied by a cascading style sheet in order to be displayed graphically in a browser. XML documents may also be rendered online or printed using specialized applications.

Backwards and in heels

The lede still showed its debt to what must have been tutorials, guidelines for use, or marketing materials about why XML mattered. I compressed and reworded it, and moved some material into a ref. See if it still scans.

The different uses of 'XML language' v. 'XML dialect'; and 'specification' v. 'standard' should be cleared up, in all documents about XML. It's a big task, so let's start here :-) Someone who's done more work in the field should add a better definition of an XML dialect to the list of key terms. +sj+ 00:50, 12 August 2009 (UTC)

The new lede is not terrible, but I don't think there's consensus in the XML community that "dialect" is the correct term. My impression is that when people talk about XML-based markup vocabularies (the only term that is somewhat blessed in normative text, cf the namespaces spec), they use the term "XML Languages". So I disagree with your global move to "dialects". Any other opinions? Tim Bray (talk) 14:57, 12 August 2009 (UTC)
I hadn't come across this term "lede" before, but I see it's American slang, so that's not surprising. I think the introductory text before the table of contents should be much more concise, and it should describe XML for what it is, not for how it differs from HTML - I don't think that one should assume that the people who want to know what XML is will already be familiar with HTML, and describing it in terms of the differences is very unhelpful to those who aren't. Mhkay (talk) 02:46, 14 August 2009 (UTC)

Problems and errors

I see that in the course of a few days recently, this article has been almost entirely rewritten. That means we now have a preponderance of the opinions and viewpoints of just a few editors where previously we had the combined consensus of hundreds of contributors.

Ho hum.

Here are the first few problems I see already:

  1. The lead should summarise the article, not the technical spec, or try to be a mini-howto
  2. The bold characters in the lead should emphasise the name of the article, not 'file="face.jpg"'
  3. The fundamental concept within an XML document is the element, not the 'tag'
  4. XML text documents today are increasingly considered to be mere serialisations of in-memory XML DOMs (See X/HTML5)
  5. 'Elements must not overlap' is definitely not the second most important thing for the layman to know about XML! Nor is it the second part of the article the lede is summarising
  6. The theory that 'XML' has three distinct colloquial meanings is pure unreferenced personal original research and should be cited or removed immediately
  7. Our new article seems not to have grasped character encoding at all: Having said 'Almost every legal Unicode character may appear in an XML document', it goes on to give an example which restricts the character encoding to "encoding='ISO-8859-1'", which is almost unheard-of in practical XML usage
  8. It even goes so far as to say that 中 should be referred to as &#20013; or &#x4e2d; and teaches us how to avoid the correct use of Unicode altogether with "I &lt;3 J&#xF6;rg" (and all this in a Unicode XHTML document downloaded from Wikipedia!) And we used to have the valid example '<俄語>Китайська мова</俄語>' to make the whole point succinctly.

I can't get up the energy to carry on - I'm only about a quarter of the way through. Only to note that we've also gone down from about 50 cited sources to 25 as well.

I find it hard that some people's vanity was so huge that they felt that they could do better in a few hours than all that collaborative effort over so many years. Of course there were issues with the old text but there are babies and there are bathwaters too.

I don't know where to start.

--Nigelj (talk) 10:25, 13 August 2009 (UTC)

Unfortunately, with a subject like XML, there are a lot of people who know a little about it, and this tends to result over time in an article that contains a lot of things that are only half-true, and many things that are irrelevant. The new version is a great improvement, a much better place to start. Of course, further improvements are possible. Please don't accuse an expert who knows his stuff better than anyone and who volunteers his time to do this work of "vanity" - that's plain silly, and devalues the rest of your comments. Mhkay (talk) 13:37, 13 August 2009 (UTC)
I am not about to bandy real-world professional project experience, University or professional qualifications with you or anyone else with regard to this subject, but I assure you that I know more than "a little about it", thank you very much. Since I started volunteering my own time and expertise to this article in August 2005 (and to WP in general in 2004), I have seen several major re-writes of other important articles. The normal approach is to draft the rewrite somewhere in user- or talk-space and to invite others to view, comment and contribute for a few weeks or months. This ensures that the best of the old article's content and goodwill are maintained along with any new material, while the irrelevancies can be cropped. Deciding off-line amongst yourselves that a small group of you are "a bunch of highly-expert new contributors [now brought] on board" and going ahead in article-space with little general discussion is what I was referring to as a kind of vanity. Now please stop puffing yourselves up to be some superior kind of contributors, and then we can all cut out these personal attacks. And we can all get on with fixing the article, which, as is gracefully acknowledged below, is currently a little 'buggy'. --Nigelj (talk) 20:33, 13 August 2009 (UTC)
The old article was rambling, way too long, full of irrelevancies, and neglected to mention all sorts of important stuff. We've brought a bunch of highly-expert new contributors on board in the last couple of weeks and have a shorter, tighter article. I think that it's, while imperfect, quite a bit better. You're free to disagree. Tim Bray (talk) 16:56, 13 August 2009 (UTC)

Now, with respect to Nigelj's detail points:

  1. Agreed that the lede, while helpful to those seeking an introduction, is inappropriate. Will try again. Also agree about inappropriate boldfacing and tag/element.
  2. I'd have to disagree with the claim that "XML text documents today are increasingly considered to be mere serialisations of in-memory XML DOMs", while agreeing that that is the HTML5 viewpoint. For the primacy of syntax over object models or APIs on the Web, see Architecture of the World Wide Web.
  3. "The theory that 'XML' has three distinct colloquial meanings is pure unreferenced personal original research and should be cited or removed immediately". This is an important truth in the market, and I think important to know for someone who comes here wondering "what is this XML I'm supposed to use to solve the problem?" I understand the no-original research principle and will look for supporting citations.
  4. "Our new article seems not to have grasped character encoding at all: Having said 'Almost every legal Unicode character may appear in an XML document', it goes on to give an example which restricts the character encoding to "encoding='ISO-8859-1'"... If that's what the current text made you think it said, that's a bug & should be fixed.
  5. "It even goes so far as to say that 中 should be referred to as" - The point was that if you can't type in 中 on your keyboard, or you're stuck with using ISO-latin, you can still have it in your XML document. If there's any suggestion of "should", that's a bug. Tim Bray (talk) 16:56, 13 August 2009 (UTC)
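Point 5 above can be illustrated with a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree` parser. The helper name `text_of_root` and the sample document are hypothetical, chosen only for this illustration: the document is declared and stored as ISO-8859-1, which has no byte for 中, yet the numeric character reference `&#20013;` still yields that character.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: declared (and stored) as ISO-8859-1,
# an encoding that cannot represent U+4E2D directly.
DOC = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<word>&#20013;</word>"
).encode("iso-8859-1")


def text_of_root(data: bytes) -> str:
    """Parse the bytes and return the text content of the root element."""
    return ET.fromstring(data).text


# The character reference is resolved by the parser, independently of
# the byte encoding used to store the document.
assert text_of_root(DOC) == "\u4e2d"
```

The reference is an option for authors who cannot type or store the character directly; a document saved in UTF-8 could simply contain 中 itself.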
Hi Tim. With regard to your point 2, Architecture of the World Wide Web says, "If the URI owner has provided more than one representation (in different formats such as HTML, PNG, or RDF; in different languages such as English and Spanish; or transformed dynamically according to the hardware or software capabilities of the recipient), the resulting representation may depend on negotiation between the user agent and server." This Recommendation (although published in 2004, and so much older than what I meant by 'today') seems set on the idea that what is returned over the wire is a representation of the resource: the resource is something else that is held on the server, and among the list of representation example formats, XML could easily have been added to HTML, RDF etc, I think. What is made more clear in current W3C documents like HTML5 is that the resource itself may be considered to be the in-memory DOM. The XML 1.0 spec actually begins, "Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents" - data objects called documents, not text documents. I only mentioned this as I felt it might be a good place to start, not explicitly but in the mind's eye as it were: like the BNF in the XML spec, an XML document is... a prologue, an element and some optional things like comments, processing instructions and whitespace. Then an element is... two tags and some optional content possibly including other elements, or an empty tag. A start-tag or an empty tag may contain attributes, an attribute is... etc. At some point we get down to numeric and named entities and that's where we get to &amp;, &#20013; and why they exist etc. I'm just suggesting one possible plan or structure that means we don't have to gloss over important points with partial explanations at the beginning, as can happen when we introduce them too soon to want, or be able, to explain them properly. 
If we start with the big picture and keep it simple, we can drill into detail later on, still building a true representation of what XML actually is and isn't, what it can do and what it can't, what it can represent, what can be done to it etc. Just a suggestion. --Nigelj (talk) 21:23, 13 August 2009 (UTC)
This turns out to be a really important argument that keeps coming back. I'm pretty convinced that one of the reasons the Web has worked so well is that it doesn't try to do APIs and object models across the net, but interoperates on the basis of syntax. If you read the Webarch document carefully I'm pretty sure it's clear that representations are just what you say they are conceptually, but they manifest in the real world as sequences of bytes. I'm also 100% sure that the XML 1.0 spec is very clear that what it's defining is a syntax that applies to sequences of Unicode characters. I agree that HTML5 is trying a bold experiment in a new direction; tie the DOM and the syntax together at the hip. While some of the HTML5 work looks likely to be very popular, e.g. <video> & <canvas>, this other model/syntax stuff remains at this point a very interesting science experiment. One of the reasons XML caught on is that it didn't initially stray outside the realm of syntax. The fact that there are multiple API flavors (stream/DOM/whatever) is a beneficial side-effect of this. Tim Bray (talk) 21:36, 13 August 2009 (UTC)
There are certainly many people who like to think of the data structure as primary, and the lexical form as a "mere" serialization. Often that's the way I think, and it's certainly the way I encourage people to think when they are programming against XML. However, it's not what the specs say, and I think the specs carry some authority. Apart from anything else, the lack of a definitive and universal data model for XML is one of its most notable features. There are many data models for XML, not one, and they are all derived from the textual syntax, not the other way around. Mhkay (talk) 03:11, 14 August 2009 (UTC)
XML manifestly is a syntax/language not an API or object model. There is no other way to read the XML specification. The element may be more important than the tag, but in XML there are no elements without tags. It certainly might be good to start with the XML information set, but that is something only implicit in XML and variable. Rick Jelliffe (talk) 10:39, 20 August 2009 (UTC)
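The point made in this thread, that several API flavours are derived from one textual syntax, can be sketched roughly as follows. This assumes Python's standard-library `xml.dom.minidom` (tree/DOM style) and `xml.sax` (streaming style); the function names and the sample document are hypothetical and not taken from the article.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical sample document: one sequence of bytes, read two ways.
DOC = b"<greeting><to>world</to><to>wiki</to></greeting>"


def names_via_dom(data: bytes) -> list:
    """Build the whole tree in memory, then list element names in document order."""
    dom = xml.dom.minidom.parseString(data)
    return [el.tagName for el in dom.getElementsByTagName("*")]


class _NameCollector(xml.sax.ContentHandler):
    """Record each element name as its start-tag streams past."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def names_via_sax(data: bytes) -> list:
    """Stream through the document without building a tree."""
    handler = _NameCollector()
    xml.sax.parseString(data, handler)
    return handler.names


# Both views are derived from the same text and agree on what it contains.
assert names_via_dom(DOC) == names_via_sax(DOC)
```

Neither interface is privileged here: each is just one way of consuming the same characters, which is the sense in which the data models follow from the syntax rather than the reverse.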

Draft alternate lede[edit]

Here's a draft of an alternate text for the lede (would also replace the "Meanings" section). I've tried to follow Wikipedia:LEAD; I think this draft aligns nicely with the TOC. Input? Tim Bray (talk) 23:03, 13 August 2009 (UTC)

XML (Extensible Markup Language) is a set of rules for encoding documents electronically. It is defined in the XML 1.0 Specification produced by the W3C and several other related specifications; all are fee-free open standards.[1] As of 2009, hundreds of XML-based languages have been developed,[2] including RSS, Atom, SOAP, and XHTML. XML has become the default file format for most office-productivity tools, including Microsoft Office, OpenOffice.org, AbiWord, and Apple's iWork.

XML’s design goals emphasize simplicity, generality, and usability over the Internet.[3] It is a textual data format, with strong support via Unicode for the languages of the world. Although XML’s design focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.

There are a variety of programming interfaces which software developers may use to access XML data, and several schema systems designed to aid in the definition of XML-based languages.

For further coverage of the many XML languages and technologies, see the XML Category.

Looks good to me. It's accurate and not over-technical. I'm not 100% comfortable with "encoding", because of the potential confusion with character encoding, but I can't think of anything better. I would leave out the quotes around "schema", which suggest that the word is somehow being used improperly. I do wonder what "usability over the Internet" is actually supposed to mean: usability is a measure of the efficiency and effectiveness of a user performing a task, and I don't know what the task is - presumably doing *something* over the internet, but I'm not sure what. Mhkay (talk) 02:37, 14 August 2009 (UTC)
Agreed, mostly. I'd welcome a better word than "encoding", sigh. I took the quotes off schema. "Usability" is taken more or less directly from the spec's "goals" section which says "straightforwardly usable" Tim Bray (talk) 03:40, 14 August 2009 (UTC)
I don't like the first verb being "provides". The lead sentence should identify up-front what the thing is. i.e., the first verb used should be "is". --Cybercobra (talk) 04:24, 14 August 2009 (UTC)
Works for me. Tim Bray (talk) 05:28, 14 August 2009 (UTC)
I would strike the last sentence; it's an improper self-reference. --Cybercobra (talk) 06:35, 14 August 2009 (UTC) Also, XML's SGML origins should be mentioned. Its error handling might also merit a mention/sentence; this I'm less sure about. --Cybercobra (talk) 06:39, 14 August 2009 (UTC)
Thanks for all the detail edits. Revised the last sentence, I do think it's important to squeeze in a category pointer in the lede somehow or other. Got a better idea? (Surely there must be a common Wikipedia idiom for doing this.) As for SGML, I think its importance is now historical and thus the coverage in the Origins section is fine. As for error-handling, the draft lede is already kind of long. Tim Bray (talk) 07:25, 14 August 2009 (UTC)
Actually, the lede can be up to 4 paragraphs. --Cybercobra (talk) 06:19, 15 August 2009 (UTC)

Typographical conventions[edit]

The current article has inconsistencies. The following constructs appear in the article:

  • individual characters, for example A and 中
  • syntax characters, e.g. < and &
  • small chunks of XML, e.g. <line-break/>
  • multi-line examples, e.g. at the bottom of the "Key Concepts" section

The potential typographical tools are: double-quotes, bold, italic, and monospace via the "code" and "source" elements. Does Wikipedia have a set of rules that we can follow? I'm cleaning up some of the more obvious inconsistencies but it would be nice to standardize carefully end to end. Tim Bray (talk) 05:19, 15 August 2009 (UTC)

I've now made them consistent. Code in <code></code>, characters in quotes. Bold should only be used when introducing terminology and italic is only used for emphasis or to format the title of a work, not for quoted text. That leaves quotes and monospace, which is what I went with. --Cybercobra (talk) 06:12, 15 August 2009 (UTC)

Disposable sections[edit]

Copying Rick's remark from above:

The section on related standards is entirely disposable. As standards they are hardly the most important or typical standards that use XML. Their only connection is that they are all ISO standards: so what? Rick Jelliffe (talk) 09:39, 18 August 2009 (UTC)

+1 on removing that section. Also I note the four trailing sections:

  • 10 See also
  • 11 References
  • 12 Further reading
  • 13 External links

They look redundant and I can't figure out which rule sorted which references into which section. Could someone with more Wikipedia experience check it out and see if there's scope for reorg/simplification. Most of the contents look perfectly reasonable. Tim Bray (talk) 15:03, 18 August 2009 (UTC)

There was 1 redundant link, everything seems to be in the right section. The External links may need pruning. --Cybercobra (talk) 17:28, 18 August 2009 (UTC)

NeoOffice, AbiWord[edit]

Someone seems to think these applications are important enough to justify a mention in the lead paragraph of the XML article. I don't think they are. I have no views on the merits of the products, but they get 0.5m and 2.3m Google hits respectively, and there are thousands of XML applications with more than that: we can't mention them all. (We don't mention SVG, for example, and that gets 13m.) It's a list of examples that's there to demonstrate the truth of an assertion, and three examples is enough. Any more than that is advertising. Mhkay (talk) 21:22, 22 September 2009 (UTC)

The lede should summarise the main points of the article[edit]

At the moment it seems to cover its own, different ground.

One trouble with this is that, if there is anything controversial in the lede, there is no room to explain the finer points without the lede getting unduly long. I'm not happy with the implication of saying "As of 2009 [...] XML-based formats have become the default for [...] Microsoft Office (Office Open XML) [and] OpenOffice.org (OpenDocument)". While MS Office may be just changing over in 2009, XML-based formats have been the default on OOo for a decade (since it was StarOffice). Lumping them together in this way is unfair on the reader, giving a false impression unless they follow links and do their own research. It comes from trying to make too many points in one sentence, which in turn is brought about by trying to cover new ground in the lede that is not properly discussed in the body of the article, which itself is in contravention of the guidelines.

The main sections of the article are

  1. Key terminology
  2. Characters and escaping
  3. Well-formedness and error-handling
  4. Schemas and validation
  5. Related specifications
  6. Use on the Internet
  7. Programming interfaces
  8. History

We either need to discuss the appropriateness of having these sections in the first place, or be content to summarise them in the lede. --Nigelj (talk) 08:49, 4 October 2009 (UTC)

The lede can contain more than just mere summary of subsections; in the case of the sentence in question, in the context of its surrounding paragraph, it is demonstrating the widespreadness of XML. The statement is not inaccurate; it does not say that no XML office document formats were the default prior to that date, but rather that as of that date all the major programs in the area have standardized on XML (vs. just one of them). --Cybercobra (talk) 09:19, 4 October 2009 (UTC)

XML vs SGML[edit]

Perhaps there should be a section on what exactly the differences are or what form the special case takes. (Or maybe expand on "Sources".) 118.90.15.97 (talk) 09:42, 28 November 2009 (UTC)

Reasons for Popularity[edit]

Why is XML so popular? Why do so many protocols use XML, despite its extreme inefficiency? Could we cover that in the article? --Matthew Bauer (talk) 04:12, 5 December 2009 (UTC)

I think you would have difficulty drafting text for the article that answers that question in an objective way. You would certainly have to find published material that discusses the question, and ensure that any opinions expressed are attributed to reputable sources.
My own view (and you might find published articles or talks that express this) is that the primary reasons were (a) the fact that XML handled both documents and data at a time when the web badly needed a technology that could do both; (b) the fact that it was good enough to meet all the requirements; (c) the fact that it was very cheap to implement; (d) the fact that there were no major competitors at the time it was launched, and that all the influential players (W3C, Microsoft, Sun, IBM, Oracle etc) endorsed it. There had been alternative technologies with better performance and better functionality for years (such as ASN.1) but they were horrendously expensive to implement. Mhkay (talk) 23:23, 6 December 2009 (UTC)

Apparent Contradiction[edit]

Under the heading "Comments," the article states

The string " -- " (double-hyphen) is not allowed, and entities must not be recognized within comments.

An example of a valid comment: "<!-- no need to escape <code> & such in comments -->" 

So is the double-hyphen allowed in comments, or not? Moioci (talk) 23:47, 7 December 2009 (UTC)

The double-hyphen is used to start and terminate the comment; other double-hyphens aren't allowed in the text of the comment apparently.

<!-- This is valid -->

<!-- This -- is -- not -->

Will try and edit to clarify. --Cybercobra (talk) 01:43, 8 December 2009 (UTC)
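The rule can be checked with a small sketch, assuming Python's standard-library `xml.etree.ElementTree` (which is backed by the expat parser). The helper `is_well_formed` and the wrapping `<doc>` element are hypothetical, added only so that each comment sits inside a complete document.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# The delimiters <!-- and --> are fine; a "--" inside the comment body is not.
assert is_well_formed("<doc><!-- This is valid --></doc>")
assert not is_well_formed("<doc><!-- This -- is -- not --></doc>")

# Markup characters need no escaping inside a comment.
assert is_well_formed("<doc><!-- no need to escape <code> & such in comments --></doc>")
```

So the two statements in the article are consistent: the double-hyphen appears only as part of the comment delimiters, never within the comment text itself.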

Don't make the mistake of thinking that every nuance of the spec has to be in this article. It's supposed to be an encyclopaedic overview, not a detailed reference. OK, I know there are people who will try to write XML using Wikipedia as their only source of information, but they are not our target audience. Mhkay (talk) 13:14, 8 December 2009 (UTC)

Criticisms of XML missing[edit]

I notice that on the 6th of August, any discussion of the issues of XML became limited to in-place comments in the corresponding sections. Whilst this is appropriate for someone reading the complete article, anyone skimming the resource (as is typically done when someone is researching) will be left with an artificial perspective. Whilst I can understand that the previous list could well have suffered from endless expansion, it is nonetheless appropriate for an article on any subject to detail the views of its detractors, and the list did appear to be referenced reasonably. LinaMishima (talk) 17:10, 22 December 2009 (UTC)

Personally, I think Wikipedia articles (especially on a technical subject) are best when they stick to facts. Opinions about strengths and weaknesses will always be opinions, even if they were first aired in a place other than Wikipedia. Note that it wasn't just the negative opinions that went in that rewrite, it was also the positive ones. Mhkay (talk) 13:59, 23 December 2009 (UTC)
That someone has performed analysis, praise or criticism is a fact - we don't get to discount this, especially when those involved are notable, their comments hold weight, and may even be backed up with yet more verifiable details. If anything, objective details on a subject are already easy to find without having to visit Wikipedia. Without collecting such information, researchers cannot use Wikipedia as a summary source - they are ultimately forced elsewhere. HTML talks about features missing, DisplayPort is compared with other formats, SOAP details such information, JSON also has comparisons. More notably, Opera (web browser) (FA) features discussion of its reception, as does Twitter (GA) and YouTube (GA). Perhaps the most appropriate thing to do would be to merge the separate sections from articles on similar technologies, and write an XML-focused summary for the section. LinaMishima (talk) 05:04, 24 December 2009 (UTC)
Yeah, around that time, in August 2009, we were told that we were "a lot of people who know a little about it", and that someone had "brought a bunch of highly-expert new contributors on board in the last couple of weeks". We were told not to "accuse an expert" of not being better than all of us put together. So, I left them to it. I can't be bothered with that attitude on WP or anywhere else. --Nigelj (talk) 12:27, 24 December 2009 (UTC)