This is the talk page for discussing improvements to the XML article.
This is not a forum for general discussion of the article's subject.

This article is within the scope of WikiProject Internet, a collaborative effort to improve the coverage of the Internet on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as B-class on Wikipedia's content assessment scale.
This article has been rated as High-importance on the project's importance scale.

The following Wikipedia contributor may be personally or professionally connected to the subject of this article. Relevant policies and guidelines may include conflict of interest, autobiography, and neutral point of view.

TimBray (talk · contribs) / Bray, Tim This user has contributed to the article.

Some XML terms

I believe the main text needs a bit more beef, but I hesitate to add such content since it's a bit too much for most people... you can do a keyword search from inside Visual Studio 2005 if you have it. --Raylopez99 20:57, 1 October 2007 (UTC)[reply]

XML's set of tools helps developers in creating web pages but its usefulness goes well beyond that. XML, in combination with other standards, makes it possible to define the content of a document separately from its formatting, making it easy to reuse that content in other applications or for other presentation environments. Most importantly, XML provides a basic syntax that can be used to share information between different kinds of computers, different applications, and different organizations without needing to pass through many layers of conversion.[4] —Preceding unsigned comment added by 220.226.191.107 (talk) 05:06, 28 February 2009 (UTC)[reply]

SGML

This article states "Its predecessor, SGML, has been in use since 1986, so there is extensive experience and software available." But the article on SGML states "few SGML-aware programs existed when XML was created". Which is it? Few programs, or extensive software? Can't be both? 10:48, 12 December 2008 (UTC) —Preceding unsigned comment added by 195.72.173.51 (talk)

I assume that sentence was supposed to mean "(extensive experience) + (software)", as opposed to "extensive (experience + software)". Software included WordPerfect Office (see, for example, the WordPerfect Office 2000 review in PCPro, July 1999, the WordPerfect Office 2002 review at DTPStudio, and Corel WordPerfect Office X3), Adobe FrameMaker + SGML (see Adobe FrameMaker; with versions for Windows, Macintosh and Unix), the OmniMark programming language, various SGML parsers (including James Clark's famous sgmls), tools for viewing and editing DTDs, and so on. In fact, Peter Flynn published a complete book on the subject: Understanding SGML and XML Tools: Practical programs for handling structured text. Boston/Dordrecht/London: Kluwer Academic Publishers, 1998. ISBN 0-7923-8169-6. (XML was brand new at the time, but development of XML software had already started while the spec was under development.)

With regard to experience, this is something that may be inferred from the body of literature on the subject. (For example, search Amazon.) --ChristopheS (talk) 12:28, 12 February 2009 (UTC)[reply]

Disadvantages of XML

XML vs. binary is a disadvantage (the first item in the list)? Why not the disadvantage of XML vs. a ham sandwich? XML is not a binary format. It should be compared to other markup systems, not to a format that it is clearly not in the family of. —Preceding unsigned comment added by 205.229.50.10 (talk) 23:36, 16 December 2008 (UTC)[reply]

If you have a credible cite that indicates XML and Ham Sandwiches as reasonably capable of mutual substitution, or otherwise capable of meeting the same or similar requirements as alternative technologies, then even that comparison might merit some mention. As long as there is a meaningful basis for comparison and it is within the scope of the article, then it is potentially relevant to the article. dr.ef.tymac (talk) 18:09, 19 January 2009 (UTC)[reply]

Ehhm, with regard to edibility, ham sandwiches rule, but XML might also be fine if it generates money that generates ham sandwiches... Prawns would also be fine, and maybe a pizza and a couple of beers on Friday.

Besides that, you're partially right. Above all, XML vs. "binary" is a weird comparison, since gzipping an XML document makes it "binary". XML could be compared to other things according to usage, where the most obvious competitor is SGML. XML bundled with its "obvious partners" such as XSLT, XSL:foo, XPath, XLink and the like could be used for documentation with a unique DTD, and then it could be compared to, e.g., DocBook, but such specific comparisons should be highlighted as per-topic. ... said: Rursus (bork²) 13:41, 11 February 2009 (UTC)[reply]

Confused

I came here to learn about XML; I know almost nothing about it. The section on Well-formedness confuses me, as it doesn't appear to make any sense:

The only indispensable syntactical requirement is that the document has exactly one root element (also known as the document element), i.e. the text must be enclosed between a root start-tag and a corresponding end-tag, known as a 'well-formed' XML document: <book>This is a book... </book>

First of all, it's confusing that "root element" and "document element" appear to be fundamental terms, yet they're merely in bold. Seems like if it's an "indispensable syntactical requirement," it might warrant an entry--or at least a complete definition somewhere on the page.

Second, it seems like a non sequitur to say that an XML document must have "exactly one element, i.e. the text must be enclosed between a root start-tag and a corresponding end-tag..." Where's the one element? I read a statement--another requirement, in fact.

To top off this strange sentence, this is followed by "known as a 'well-formed' XML document." What is known as a well-formed document, the element? The text?

In the example, is "This is a book..." the element? The whole well-formed document?

Someone should explain this better. I'd like to understand.

Thanks. —Preceding unsigned comment added by 152.3.112.145 (talk) 21:03, 9 March 2009 (UTC)[reply]
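For what it's worth, the "exactly one root element" requirement quoted above can be tried out directly. The following is only a minimal sketch using Python's standard xml.etree.ElementTree parser; the helper name is_well_formed is made up for this illustration and is not a term from the XML spec. In the quoted example, the root element is the whole of <book>...</book> (start-tag, text, and end-tag together), "This is a book... " is the text inside it, and the document as a whole is what is called well-formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as a well-formed document.

    Hypothetical helper for illustration only.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One root element enclosing all of the text: accepted.
print(is_well_formed("<book>This is a book... </book>"))   # True
# Two top-level elements, so there is no single root: rejected.
print(is_well_formed("<book>One</book><book>Two</book>"))  # False
# Text with no enclosing element at all: rejected.
print(is_well_formed("This is a book..."))                 # False

root = ET.fromstring("<book>This is a book... </book>")
print(root.tag)   # name of the root (document) element
print(root.text)  # the text enclosed by the root element
```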

"This article may be too long to comfortably read and navigate."

I suggest that this article is actually of an appropriate length, given the complexity of the topic. I would prefer NOT to see it broken up into subtopics. (Of course, links to subtopics would be welcomed, but I don't think there's too much text here/too many subtopics for this particular topic.)

I'm guessing that this is an automatically-generated warning from a bot. If others agree with me on this point (of the article NOT being too long), perhaps someone knows how to get rid of the warning AND flag it somehow so the bot won't re-add the warning. "???"

Aloha, philiptdotcom (talk) 02:29, 2 July 2009 (UTC)[reply]

I agree. Does anyone have an objection to us removing this notice? --Nigelj (talk) 12:11, 2 July 2009 (UTC)[reply]

It was me who tagged it, I think. The article currently reads like it was copied out of a textbook; non-technical readers would have little hope of finishing it before falling asleep at their keyboards. That something is a "complicated topic" only excuses it from our guidelines on length if it is very well-written, such as our various Presidential biographies. This article could easily be cut down to a more manageable length without it being less valuable; for instance, the minutiae on validation belong on XML validation (itself a very poor article presently) and not here. Feel free to work on this if you want; I'll try to do so myself at some point, which is what the tag is there to remind me of. Chris Cunningham (not at work) - talk 12:26, 2 July 2009 (UTC)[reply]

I took the liberty of removing the tag (before reading this discussion). I did that on the basis that the amount of information presented is probably about right for most readers. That of course is a personal judgement. I'm not sure whether the tag was added before or after Tim Bray's rewrite. Mhkay (talk) 09:57, 6 August 2009 (UTC)[reply]

This article is a mess

There are maybe two or three smaller-sized coherent WP entries struggling to get out in here. I'm going to invest a few hours in the next few days in trying to make it smaller and cleaner. There are quite a few statements currently here that need some supporting references. I think the WP policies requiring such support are sane and will try to follow them. Tim Bray (talk) 03:28, 29 July 2009 (UTC)[reply]

I'm a rather competent XML practitioner and would be willing to provide a sanity check on the structure and content. Artcolman (talk) 20:50, 30 July 2009 (UTC)[reply]

Entity references and Escaping

Are & , < , > , " and ' the only characters that NEED to be escaped ("need" meaning "required by the XML standard")? Or are there other characters that need to be escaped? Or even fewer than those 5 ( > doesn't seem to be necessary to correctly parse XML, in contrast to the other 4)? This is relevant (at least to me) since XML can be used with many different encodings.
And related to this: Is the character encoding used for the entire document or only for its "content", leaving the tags (i.e. the structure around this content) in ASCII or whatever standard encoding XML is supposed to use (seems to be UTF-8)? E.g., if for some strange reason I were to use EBCDIC as the encoding, would the & and < used in the markup be the UTF-8 & and < or the EBCDIC & and < ?
And what about "<?xml version="1.1" encoding="EBCDIC"?>"? Would that line be itself encoded in EBCDIC or in UTF-8? Catskineater (talk) 20:55, 2 August 2009 (UTC)[reply]

Sometimes ', ", and > need to be escaped: ' and " in attribute values, and > after the string "]]". The character encoding always applies to the whole document, markup and content. Tim Bray (talk) 00:37, 5 August 2009 (UTC)[reply]
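As an illustration of the escaping rules in the reply above, here is a small sketch using Python's standard library (xml.sax.saxutils.escape and quoteattr, plus xml.etree.ElementTree to parse the result back). The element name e, the attribute name a, and the sample strings are made up for the example. Note that escape() always escapes >, which is more than strictly required but takes care of the "]]>" case.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Sample values chosen to contain all five "special" characters.
content = 'a < b && c ]]> "d"'
attr = "it's \"quoted\" & <odd>"

# escape() replaces &, < and > in character data;
# quoteattr() additionally picks a quote style and escapes quotes as needed.
doc = "<e a=%s>%s</e>" % (quoteattr(attr), escape(content))
print(doc)

# Parsing the escaped document gives back the original strings.
root = ET.fromstring(doc)
print(root.text == content)
print(root.get("a") == attr)


def parses(text: str) -> bool:
    """Hypothetical helper: True if the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parses("<e>a & b</e>"))  # a bare & is rejected
print(parses("<e>a > b</e>"))  # a lone > in content is accepted
```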

Wow, wouldn't that make XML lacking external encoding information impossible to parse? To even parse the first line '<?xml version="1.1" encoding="some_strange_encoding"?>' describing the used encoding you have to know that same encoding. You could guess based on statistical information, but that doesn't sound robust. With encodings that are supersets of ASCII you won't notice the problem, but there are examples like EBCDIC which aren't supersets of ASCII. Catskineater (talk) 17:16, 5 August 2009 (UTC)[reply]
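On the bootstrapping question: the XML 1.0 specification has a non-normative appendix on autodetection of character encodings. The idea is that a declaration must begin with the characters "<?xml", so the first few bytes reveal the encoding family well enough to read the rest of the declaration. The following is only a rough sketch of that idea, not a conforming implementation: it ignores byte order marks and UCS-4, the function names are made up, and using Python's cp037 codec as a stand-in for "some flavour of EBCDIC" is an assumption of this example.

```python
import re

# First four bytes of "<?xm" in a few encoding families.
# The cp037 entry is an assumed stand-in for EBCDIC in general.
SIGNATURES = [
    (b"\x3c\x3f\x78\x6d", "ascii"),      # ASCII-compatible: UTF-8, ISO 8859-x, ...
    (b"\x4c\x6f\xa7\x94", "cp037"),      # EBCDIC family
    (b"\x00\x3c\x00\x3f", "utf-16-be"),  # UTF-16 big-endian without a BOM
    (b"\x3c\x00\x3f\x00", "utf-16-le"),  # UTF-16 little-endian without a BOM
]


def guess_family(data: bytes) -> str:
    """Guess a codec good enough to read the XML declaration itself."""
    for signature, codec in SIGNATURES:
        if data.startswith(signature):
            return codec
    return "utf-8"  # no declaration recognised: fall back to UTF-8


def declared_encoding(data: bytes) -> str:
    """Return the encoding named in the declaration, or 'UTF-8' if none."""
    head = data[:200].decode(guess_family(data), errors="replace")
    match = re.search(
        r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']', head
    )
    return match.group(1) if match else "UTF-8"


ebcdic_doc = '<?xml version="1.0" encoding="cp037"?><a/>'.encode("cp037")
print(guess_family(ebcdic_doc))
print(declared_encoding(ebcdic_doc))
```

So the parser does have to guess, but only among a handful of families, and the guess rests on the fixed first characters of the declaration rather than on statistics.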

Rewriting

I'm proceeding through from start to end, trying to leave things behind in a sane state as I pass through. So far, I've whacked a bunch of text that seemed superfluous or in the wrong place but left it behind in comments in case the right place should appear.

I spent some time going back and forth through the XML spec, and it's just too big; there's no way all the syntactic variations and corner cases can sanely be described in this entry. So it seems to me that it's important to make sure that the important stuff (elements, attributes, encoding, escaping, and so on) is well described. Should there be an ancillary article on "XML Syntax" or some such, where there could be a deep dive on stuff nobody cares about, like NOTATION and unparsed entities?

So, what I'm trying to do is leave this in a condition where what's left is an opinion-free well-referenced tour through the important pieces of XML. Some things are sorely lacking: an introduction to the "XML stack" - XML & the Infoset & XSD & XSLT & XPath & RelaxNG and so on and so on.

I'd really welcome opinions about how to handle the stupid verbose unhelpful "pro & con" section. I'm really unconvinced that an encyclopaedic article on XML is actually the place for an argument about whether it's good or not. Here's what it is, here's where it's been used.

What *is* needed, and I'm not sure whether it's in this article or not, is some discussion of XML & other formats that are in common use for data interchange: ASN.1, YAML, and JSON leap to mind. Someone keeps talking up s-expressions but I've never actually seen them used for industrial data interchange. Tim Bray (talk) 00:46, 5 August 2009 (UTC)[reply]

A Comparison of data serialization formats would certainly be interesting. Regarding syntax, there is certainly precedent from several programming language articles to have a separate sub-article on syntax. --Cybercobra (talk) 01:40, 5 August 2009 (UTC)[reply]

The "pro & con" section is unhelpful because it is only a list of disconnected topics, IMHO. However, I think it is impossible to remove it, because a lot of editors would want to have their way and list whatever criticisms they could find. I propose to have only a short section here, mainly linking to another article with this list. Hervegirod (talk) 09:41, 5 August 2009 (UTC)[reply]

"a lot of editors would want to have their way" doesn't seem like a good argument to me. Wikipedia experts: is there a tag you can put on a section saying you think it should be deleted, sort of a last-call? Tim Bray (talk) 19:44, 5 August 2009 (UTC)[reply]

I agree with you (and you are an XML expert!), but I don't know if it's possible to put such a tag without "securing" some place for the criticisms, at least those which are properly sourced. I made this suggestion because I remember a similar situation with the Java (programming language) article. There was a "Criticism" paragraph which in the end became a long list of disconnected criticisms. Creating a specific article (which has its own problems, I admit) was a way to avoid this, and now the main article just links to the specific "Criticism" article. However, I know it's really an imperfect solution. Hervegirod (talk) 19:59, 5 August 2009 (UTC)[reply]

Let's emphasize how XML helped with the adoption of UTF-8. 99.56.139.29 (talk) 02:39, 5 August 2009 (UTC)[reply]

It would help novices if the opening paragraphs made clear that the term "XML" can mean 3 different things, depending on context: (1) the syntax and rules defined by the W3C XML 1.* specs; (2) any vocabulary based on the XML spec (whether or not there is an associated schema); and (3) the set of XML languages applied to a particular problem set, as in "we are using XML technology for information exchanges". Ken Sall (talk) 22:28, 5 August 2009 (UTC)[reply]

Ken - good idea. I'd recast (3) slightly as "XML technology in general": in that (nice) example sentence, when they say XML they mean one or more XML languages and parsers and APIs and transformers and so on. Why don't you go ahead and add that? Tim Bray (talk) 23:16, 5 August 2009 (UTC)[reply]

Actually, "XML" is used very widely to mean anything except XML (the markup language). In particular, it is used to mean typed data objects that form a series of trees by XQuery people and merchants, and typed+validated trees (PSVI) by XSD people and merchants. And it is used to stand for the full WS-* stack sometimes: how often do we read comments on "how complicated XML is" only to find they are not discussing XML (the markup language) at all! It is like how Java EE people habitually use "Java" instead of "Java EE". Rick Jelliffe (talk) 06:48, 6 August 2009 (UTC)[reply]

Misc

The Versions section is incorrect in stating that all editions of XML 1.0 prior to the 5th edition used Unicode 2.0. The 4th edition uses Unicode 3.1, for example. The text should be corrected to something like "Prior to its fifth edition release, XML 1.0 differed from XML 1.1 in the requirements of characters used for element and attribute names: in the first four editions of XML 1.0 the characters were exclusively enumerated using a specific version of the Unicode standard (Unicode 2.0 to Unicode 3.1). The fifth edition substitutes the mechanism of XML 1.1, which is more future-proof but reduces Redundancy (information theory) and therefore degrades error checking." Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)[reply]

The section introducing elements would be improved by noting that XML names cannot contain spaces. We take it for granted, but it is not obvious that <person name="fred"> means there is a person element with an attribute called name: I have had people (COBOL programmers and database people) who want to read it as a property "person name" which has the value "fred". Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)[reply]
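The point about names and spaces can be shown with a short sketch, here using Python's standard xml.etree.ElementTree parser (the choice of parser and the second, deliberately broken, input are just for illustration). The space ends the element name, so what follows it is read as an attribute.

```python
import xml.etree.ElementTree as ET

# The space ends the element name; name="fred" is an attribute of <person>.
root = ET.fromstring('<person name="fred"/>')
print(root.tag)     # person
print(root.attrib)  # {'name': 'fred'}

# Trying to use "person name" as a single element name is not well-formed.
try:
    ET.fromstring("<person name>fred</person name>")
    accepted = True
except ET.ParseError:
    accepted = False
print(accepted)     # False
```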

The section on Sources does not mention Charles Goldfarb. This is a major omission, bordering on insult, since many of the ideas in XML come directly from him and his efforts. I suggest the second sentence of the first paragraph should be "Many of the ideas in SGML in turn are attributed to Charles Goldfarb." Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)[reply]

The section on Sources mentions Steve deRose. I also was involved in implementing a WF-class parser in 1989 as part of the RISP LISP SGML text processor at Unicode, Tokyo. And even when people used a full SGML parser such as OmniMark, normalizing the SGML was a common step. So XML reflected many years of common practise. I don't think it is important enough to warrant a change, just noting it. Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)[reply]

Not to diss Steve, but I also built a WF-class parser at the New OED project in 87-89. Rick, why don't you go ahead and make some of the changes you're suggesting? Tim Bray (talk) 08:30, 6 August 2009 (UTC)[reply]

The section on Sources is a little squiffy on ERCS, but I don't expect there is any value in teasing it out. ERCS came first: some of its recommendations (extended naming rules) made it into SGML (ENR)[[1]], which I co-drafted before the XML effort started; XML itself then adopted/adapted other parts of it (hex NCR, the particular naming rules); and then SGML was again fixed to support other leftover parts of XML [[2]]. I don't think it is important enough to warrant a change, just noting it. Rick Jelliffe (talk) 07:31, 6 August 2009 (UTC)[reply]

Re-Archive?

Since we've made a bunch of major changes, most of the comments on this Talk page are now irrelevant. Are there any particular procedures or incantations for spinning most of it off to another Archive page? If so, could someone who knows how please go ahead and do it (and then remove this paragraph)? Tim Bray (talk) 17:10, 6 August 2009 (UTC)[reply]