User:Proteins/Writing scripts for Wikipedia
Scripts are amazing. They give you nearly unlimited power to analyze Wikipedia articles, to modify their appearance and even to add new elements. For example, you can count the number of polysyllabic words (analysis), color the words according to their syllables (modification) and create interactive dialogs for the reader (addition). In general, scripts do not affect the underlying article, the one stored in the database, so multiple people can view the same article according to their own preferences by using different scripts.
Wiki markup, HTML code and the DOM tree
Wikipedia articles can be represented in three forms: as wiki-markup, as HTML and as a DOM tree. This section explains the difference, and how you can see and modify the same article in each of its forms.
A typical article on Wikipedia is written in standard wiki-markup. For example, bold-faced words are enclosed in triple single quotes, as in '''bold-faced words'''. I believe this is the form of the article stored in the Wikipedia database. To see and modify the article in this form, click on the "edit this page" tab at the top of the article.
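When a page is served, the MediaWiki software turns this markup into HTML; the triple-quote bold markup, for instance, becomes a B element. The following toy sketch mimics only that one conversion, to make the relation between the two forms concrete. The function name boldMarkupToHtml is made up for this illustration, and MediaWiki's real parser is far more elaborate than a single regular expression.

```javascript
// Toy illustration only: converts '''bold''' wiki-markup into <b>bold</b>.
// MediaWiki's real parser handles many more cases (italics, nesting, etc.).
function boldMarkupToHtml(wikiText) {
  // Non-greedy match, so each '''...''' pair is converted separately.
  return wikiText.replace(/'''(.+?)'''/g, "<b>$1</b>");
}

console.log(boldMarkupToHtml("These are '''bold-faced words''' in a sentence."));
// prints: These are <b>bold-faced words</b> in a sentence.
```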
Here are instructions to view the HTML code in different browsers:
- In Firefox 3, you can type "Ctrl-U" and a pop-up window should appear containing the HTML code. Alternatively, you can click on the "View" menu, which is located in the top menu bar between the "Edit" and "History" menus. Within the View menu, click on "Page source", which is the next-to-last choice, and the same pop-up window should appear.
- In Opera, typing Ctrl-U also shows the HTML code in a pop-up window. Alternatively, you can click on the "View" menu, which is located in the top menu bar between the "Edit" and "Bookmarks" menus. Within the View menu, click on "Source", which is the sixth choice, and the same pop-up window should appear.
- In Google Chrome, typing Ctrl-U also shows the HTML code in a pop-up window. Alternatively, you can click on the "control this page" icon at the far right of the search bar. On the resulting submenu, click on "Developer", and on its submenu, click on "View source".
- In Safari, typing Ctrl-ALT-U shows the HTML code in a pop-up window. Alternatively, you can click on the "View" menu, located as usual between the "Edit" and "History" menus in the top menu bar. Under the View menu, you can click on "View source" for the same pop-up window.
- In Internet Explorer 7, click on the "Page" menu, which is to the left of the "Tools" menu, on the line just above the frame showing the article. Third from the bottom of that menu is the choice "View source", which will open the HTML in a pop-up window using Notepad.
The DOM tree
How to view the DOM tree in your browser
Most browsers allow you to see the DOM tree, which is the browser's internal representation of the webpage. The following instructions should allow you to see it in different browsers:
- In Firefox 3, the best approach is to download an add-on known as "DOM Inspector". Once added, it should appear under the "Tools" menu in the top bar of the browser, which is next to the "Bookmarks" menu. DOM Inspector can also be activated using the keyboard shortcut Ctrl-Shift-I.
- In Google Chrome, right-clicking on any part of the page summons a menu. At the bottom of that menu is the choice "Inspect element", which shows the position of the element in the DOM tree.
- In Internet Explorer 7, the Internet Explorer Developer Toolbar, a free download from Microsoft, is used to show the DOM tree. This toolbar can be found at the far right, behind the double arrows that are to the right of the "Tools" menu, which is itself to the right of the "Page" menu.
- In Safari, click on the "Develop" menu and select the choice "Show Web Inspector". The Develop menu is located in the topmost menu bar, between the "Bookmarks" and "Window" menus. If the Develop menu is not there, click on the "Edit" menu and select its last element, "Preferences". A window will pop up, on which you choose the last tab, labeled "Advanced". At the bottom of the Advanced screen is a checkbox labeled "Show Develop menu in menu bar." Clicking this checkbox should introduce the Develop menu in the menu bar.
- In Opera, the equivalent DOM inspector can be turned on by clicking on the "Tools" menu in the top menu bar (sandwiched between the "Widgets" and "Help" menus). Under the Tools menu, click on the "Advanced" submenu, and from the resulting sub-sub-menu, choose "Developer Tools". This should turn on an analysis system at the bottom of the screen, which incidentally can also be detached into a window of its own. Within this analysis window, clicking on the "DOM" tab should reveal the DOM tree. One drawback of this inspector seems to be that it does not reveal the changes in the DOM tree after your script has run. Instead, it reloads the webpage afresh, always showing the original unmodified DOM tree.
The DOM tree of typical Wikipedia pages
Inspecting the DOM tree of Wikipedia articles will reveal a common architecture. The main content of the article is contained inside a DIV element with the id label "bodyContent"; to reach this crucial node, however, you need to drill down a few levels. The bodyContent node is found under the "content" node, which in turn is under the "column-content" node, which in turn is under the "globalWrapper" node, which in turn is under the standard BODY node, which is under the HTML node, which is under the "document" node, the top of the DOM tree. Thus, to reach bodyContent, you need to follow this sequence of child nodes (sometimes called a "trail" through the document, or an XPath):
document → HTML → BODY → globalWrapper → column-content → content → bodyContent
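The same trail can be recovered from a script by starting at a node and climbing upward through its parentNode links. The sketch below is one way to do that; the function name trailTo is invented for this example, and the choice to label each node by its id when it has one (and by its node name otherwise) is simply a convenience.

```javascript
// Climb from a node to the top of the DOM tree, collecting one label per
// node: its id if it has one, otherwise its node name (such as "BODY").
function trailTo(node) {
  const labels = [];
  for (let current = node; current; current = current.parentNode) {
    // unshift puts each ancestor in front, so the top of the tree comes first.
    labels.unshift(current.id || current.nodeName);
  }
  return labels;
}

// This part only runs inside a browser, where `document` exists.
if (typeof document !== "undefined") {
  const bodyContent = document.getElementById("bodyContent");
  if (bodyContent) {
    console.log(trailTo(bodyContent).join(" → "));
  }
}
```

Note that browsers report the name of the topmost node as "#document" rather than "document". The ids printed for the intermediate nodes depend on the page layout in use.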
Why are so many levels necessary before getting to the main article? The MediaWiki software uses these other levels to add all the extra decorations found on the page. For example, the user commands along the upper edge at the right, such as your user name, your user talk page, your preferences, etc. are found under the "column-one" node, which is the sibling node of the "column-content" node. So are the tabs at the top of the page such as "article", "talk", "edit this page", etc. as well as the menus for navigation, search, interaction and toolbox in the left-hand column. By placing these in a separate node, they can be located and manipulated independently from the content.
Looking inside the bodyContent node using a DOM inspector reveals all the HTML code that makes up the article. For example, typical section headings are contained under H2 nodes, whereas successive subsections are contained under H3, H4 and H5 nodes. Normal text is contained in paragraph nodes labeled "P". Unordered (that is, bullet-pointed) lists and ordered (that is, numbered) lists are contained under UL and OL nodes, respectively; individual items in both cases are contained under LI (list item) nodes. Indented text, which is produced by a leading colon in wiki-markup, is represented as a list labeled DL, with the indented text contained under a DD node. In some cases, a DL list is actually a definition list, one that has defined, boldfaced terms contained under a DT node; these terms are generated using an initial semicolon in wiki-markup. Larger-scale groupings of HTML nodes can be made using DIV and SPAN tags.
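A script can take a quick census of these node types with the standard getElementsByTagName method, which returns every descendant with a given tag name. The following is only a sketch; the helper name countTags and the particular list of tags are choices made for this example.

```javascript
// Count how many descendants of `root` carry each of the given tag names.
function countTags(root, tagNames) {
  const counts = {};
  tagNames.forEach(function (tag) {
    counts[tag] = root.getElementsByTagName(tag).length;
  });
  return counts;
}

// This part only runs inside a browser, where `document` exists.
if (typeof document !== "undefined") {
  const bodyContent = document.getElementById("bodyContent");
  if (bodyContent) {
    console.log(countTags(bodyContent, ["H2", "H3", "P", "UL", "OL", "DL"]));
  }
}
```

Passing bodyContent rather than document as the root restricts the count to the article itself, leaving out the menus and tabs that live elsewhere in the tree.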
Getting a handle on elements in the DOM tree
Originally, there were competing models for the DOM tree between Internet Explorer and all other browsers. Thus, different scripts would have to be written for different browsers. Fortunately, the DOM model has been largely—but not completely!—standardized, so that one Wikipedia script is likely to work the same on most browsers.
Note to self: describe here how to determine the browser type and how to work with it.
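One common alternative to identifying the browser by name is to test directly for the feature you need. Event handling is a well-known case: the standard DOM model provides addEventListener, whereas older versions of Internet Explorer provided attachEvent instead. The sketch below illustrates the idea; the function name addEvent and the strings it returns are made up for this example.

```javascript
// Attach an event handler using whichever event model the element supports.
// Returns a label saying which model was used (labels are illustrative only).
function addEvent(element, type, handler) {
  if (element.addEventListener) {
    // Standard DOM event model.
    element.addEventListener(type, handler, false);
    return "standard";
  }
  if (element.attachEvent) {
    // Older Internet Explorer model; event names take an "on" prefix.
    element.attachEvent("on" + type, handler);
    return "legacy-ie";
  }
  return "unsupported";
}

// This part only runs inside a browser, where `window` exists.
if (typeof window !== "undefined") {
  addEvent(window, "load", function () {
    console.log("Page has finished loading.");
  });
}
```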
The highest-level objects used in scripts are the document and window objects.
Note to self: describe here how to do basic manipulations of the document and windows.
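As a small taste of what these two objects offer, the sketch below reads a few standard properties from each: the page title and the elements of the tree from document, and the page address from window. The function name describePage and the shape of the object it returns are assumptions made for this example, not part of any Wikipedia interface.

```javascript
// Gather a few basic facts about the current page from the two top-level
// objects. Both are passed in as arguments so the function is easy to test.
function describePage(doc, win) {
  return {
    title: doc.title,                                   // text shown in the title bar
    url: win.location.href,                             // full address of the page
    elementCount: doc.getElementsByTagName("*").length  // every element in the tree
  };
}

// This part only runs inside a browser, where both objects exist.
if (typeof document !== "undefined" && typeof window !== "undefined") {
  console.log(describePage(document, window));
}
```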
A specific element in a DOM tree can be obtained through its id code. For example, the bodyContent node can be obtained by the command
var body_content = document.getElementById("bodyContent");
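Since getElementById returns null when no element has the requested id, it is prudent to check the result before using it. The following sketch looks up an element and, if it exists, changes its background color, which alters only the reader's view and not the stored article. The function name tintElement and the color value are hypothetical choices for this example.

```javascript
// Look up an element by id and tint its background.
// Returns true on success, false if no element has that id.
function tintElement(doc, id, color) {
  const element = doc.getElementById(id);
  if (!element) {
    return false; // getElementById returns null for an unknown id
  }
  element.style.backgroundColor = color;
  return true;
}

// This part only runs inside a browser, where `document` exists.
if (typeof document !== "undefined") {
  tintElement(document, "bodyContent", "lightyellow");
}
```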