User:Edward Z. Yang/Wikipedia Bot in PHP

The Wikipedia Bot in PHP is meant to be a PHP alternative to pywikipedia. The Linkfix bot is built on this incomplete framework.

Not all of us can read Python. Python is also not especially well suited to creating web sites (it can be done, but PHP was created specifically for that task!). Plus, I don't know how to write Python. All these factors meant we needed a Wikipedia Bot... in PHP.

I don't have plans for the bot to actually make writes. What it will do is parse Wikipedia pages and recombine the data in interesting ways. Some ideas I have:

  1. Personal Edit Tracking - The set of scripts will, in conjunction with your watchlist, allow you to mark edits as "unread." Edits are cached as they are made, and then, after a week's vacation, the scripts present them in a readable digest listing the diffs of all the changes. You can then mark that set of diffs as read so that it no longer shows up on your homepage.
  2. Watchlist... Recent Changes style - Show all recent edits for the pages on your watchlist, not just the most recent one.
  3. Watchlist Scraper - Sometimes Wikipedia's just really slow. Wouldn't it be nice if your computer automatically grabbed the contents of your watchlist every few minutes so that a (slightly stale but fast) copy could be displayed? I have this implemented locally, and am only wondering how to package it.
  4. Talk Page Parser - Take information from talk pages and parse it into bulletin board format. This lets you easily follow conversations on busy talk pages.
  5. Link checkers - Click through every link on a page and see whether it leads to a redirect or a disambiguation page. Returns an easy-to-read text file to aid in the correction of these links (LinkFix bot).
  6. Elementary wikitext parsing - Yes, Parser.php is pretty terrible. Some of the stuff's not that difficult to do though, right?
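
Idea 3 (the watchlist scraper) boils down to a time-based cache: serve the stored copy while it is fresh enough, and refetch otherwise. The following is only a minimal sketch of that idea, not the locally implemented version mentioned above. The function name sketch_cached_fetch, the cache file path and the $fetch callback are all assumptions made for illustration.

```php
<?php
// Hypothetical helper, not part of the bot: returns the cached copy if it is
// younger than $maxAge seconds, otherwise calls $fetch (any callable that
// returns a string) and stores the fresh result in $cacheFile.
function sketch_cached_fetch($cacheFile, $maxAge, $fetch) {
    clearstatcache(); // make sure filemtime() is not answered from PHP's stat cache
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
        return file_get_contents($cacheFile); // slightly stale but fast
    }
    $fresh = call_user_func($fetch);          // slow path: hit the network
    file_put_contents($cacheFile, $fresh);
    return $fresh;
}
```

Called with a $maxAge of a few minutes and a callback that downloads the watchlist, repeated page loads within that window would be served from disk instead of from Wikipedia.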

Of course, much of this hasn't been coded yet. But I do have some usable tools.

Some source code

For the interested. Probably not very useful. Probably stale too.

Wikipedia_Bot.php

<?php

//We should make compatibility settings for this include list
include_once('ezy/EZY_CommandLine.php');
include_once('simpletest/browser.php');

include_once('Wikipedia_Parser.php');
include_once('Wikipedia_Extractor.php');

class Wikipedia_Bot_CommandLine extends EZY_CommandLine {
    var $output_base = 0;
    var $newline = "\r\n";
    var $output_wrap = 80;
}

/**
 * This class is a PHP implementation of a Bot for Wikipedia.
 * 
 * Dependent on SimpleTest's SimpleBrowser and EZY_CommandLine, this class
 * effectively acts like a bot by providing a wrapper for common tasks in
 * Wikipedia. As of now, it can only read articles.
 * 
 * The most notable sections of this bot are the routine_*() functions. These
 * combine all the functions of Wikipedia_Bot and Wikipedia_Parser (another
 * class that handles simple parsing of wikitext and article names) to perform
 * useful tasks. Normally, you'd want to call these, and future development
 * will be primarily relegated to these routines.
 * 
 * In the future, we might move out the routines to their own classes of related
 * functions.
 * 
 * @author Edward Z. Yang <edwardzyang@thewritingpot.com>
 * @copyright Edward Z. Yang 2005
 */
class Wikipedia_Bot
{
    
    /**
     * Contains instance of EZY_CommandLine for handling command line output.
     * 
     * @var object EZY_CommandLine
     */
    var $cl;
    
    /**
     * Contains instance of EZY_CommandLine for handling output to files.
     * 
     * @var object EZY_CommandLine
     */
    var $fl;
    
    /**
     * Contains instance of SimpleBrowser for handling webbrowsing.
     * 
     * @var object SimpleBrowser
     */
    var $browser;
    
    /**
     * Specifies whether or not script should beep when certain events happen
     * 
     * Currently, this script requires a command called wav (which plays wav
     * files) to be registered. Obviously, this is extremely inflexible.
     * Unfortunately, I have no clue how to make a beep happen without using
     * an external file.
     * 
     * THIS DOES NOT CONTROL WHETHER OR NOT OUTPUT IS GIVEN. Output is always
     * given: that's how this script works. Maybe that's not such a good idea.
     * 
     * Consider creating several different levels via bitmasking. Then change
     * into int.
     * 
     * @var bool
     */
    var $quiet = true;
    
    var $_name;         //name of current page
    var $_URL;          //url of project
    var $_URLPageView;  //url of current page
    var $_URLPageEdit;  //url of current page's edit page
    var $_pageSource;   //contents of page's source
    var $_pageHTML;     //contents of page's HTML
    
    /**
     * Constructor.
     * 
     * Instantiates SimpleBrowser and two Wikipedia_Bot_CommandLine objects, a
     * subclass of EZY_CommandLine that presets its configuration parameters.
     * 
     * We probably want to add an $argv argument at some point so the class can
     * handle command line parsing.
     */
    function Wikipedia_Bot() {
        $this->browser =& new SimpleBrowser();
        $this->cl =& new Wikipedia_Bot_CommandLine;
        $this->fl =& new Wikipedia_Bot_CommandLine;
    }
    
    /**
     * Creates a log file, automatically creating directories as needed
     * 
     * @param string filename
     * @return resource created file
     */
    function createLog($name) {
        if (isset($this->fh)) {
           fclose($this->fh);
           unset($this->fh);
        }
        
        $dir = explode('/',$name);
        
        $size = count($dir);
        $old = getcwd();
        chdir('output');
        foreach ($dir as $key => $value) {
            if ($key + 1 == $size) {
                break;
            }
            if (!file_exists($value . '/')) {
                mkdir($value . '/');
            }
            chdir($value);
        }
        chdir($old);
        
        return fopen('output/' . $name,'w');
    }
    
    /**
     * Sets user agent on SimpleBrowser.
     * 
     * @param string keyword
     */
    function setUserAgent($keyword) {
        if ($keyword == 'cloak') {
            $this->browser->addHeader('User-Agent: Mozilla/5.0 (Windows; U; '
            . 'Windows NT 5.1; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6');
        } elseif ($keyword == 'declare') {
            $this->browser->addHeader('User-Agent: Wikipedia_Bot; PHP Reading '
            . 'Assistant');
        }
    }
    
    /**
     * Sets the URL of the corresponding Wikimedia project.
     * 
     * @param string Wikimedia project's url
     */
    function setURL($url) {
        $this->_URL = $url;
    }
    
    /**#@+
     * @access private
     */
    
    /**
     * Encodes a name into a form suitable for use in /wiki/
     * 
     * @param string name
     * @return string safe name
     */
    function _encode($name) {
        $name = str_replace(' ','_',$name);
        $name = urlencode($name);
        return $name;
    }
    
    /**
     * Returns URL of a page
     * 
     * @param string name
     * @return string URL
     */
    function _getURLView($name) {
        $name = $this->_encode($name);
        $url = 'http://' . $this->_URL . '/w/index.php?title=' . $name;
        return $url;
    }
    
    /**
     * Returns URL of a page's edit page
     * 
     * @param string name
     * @return string URL
     */
    function _getURLEdit($name) {
        $url = $this->_getURLView($name);
        $url .=  '&action=edit';
        return $url;
    }
    
    /**#@-*/
    
    /**
     * Points the browser at an article and initializes URLs.
     * 
     * @param string name
     */
    function point($name) {
        $this->_name = $name;
        $this->_URLPageView = $this->_getURLView($name);
        $this->_URLPageEdit = $this->_getURLEdit($name);
    }
    
    /**
     * Grabs wikitext for a page. Also implies that we want to edit the page.
     */
    function startEdit() {
        if($this->browser->get($this->_URLPageEdit)) {
            $html = $this->browser->getContent();
            $extract = Wikipedia_Extractor::editpageWikitext($html);
            $this->_pageSource = $extract;
            return true;
        } else {
            return false;
        }
    }
    
    /**
     * Returns the wikitext of a page.
     * 
     * @return string wikitext
     */
    function returnSource() {
        return $this->_pageSource;
    }
    
    /**
     * Returns the HTML of a page.
     * 
     * @return string html
     */
    function returnHTML() {
        $this->browser->get($this->_URLPageView);
        $extract=Wikipedia_Extractor::generalContent($this->browser->getContent());
        $this->_pageHTML = $extract;
        return $this->_pageHTML;
    }
    
    /**
     * Makes the computer go "Vuvuvuv!"
     */
    function beep($override = false) {
        if (!$this->quiet || $override) {
            shell_exec('wav alert.wav');
        }
    }
    
    /**
     * Creates a LinkFix Dump.
     */
    function routine_linkFix() {
        $link_array = Wikipedia_Parser::getAllWikilinks($this->_pageSource);
        $check_array = array();
        
        $name_cache = array();
        $rdb_cache = array();
        
        $name = $this->_name;
        
        if (empty($link_array)) {
            $this->cl->output('There were no links in the page.');
            return;
        }
        
        $line = $link_array[sizeof($link_array) - 1]['line'];
        
        //Pad line numbers to the digit count of the largest one; log10()
        //would come up one short at exact powers of ten (10, 100, ...)
        $padto = strlen((string) $line);
        
        $routine = 'LinkFix_dump';
        $date = date('Y-m-d.H-i-s');
        $safe_name = $this->_encode($name);
        $file = $routine . '/' . $safe_name . '/' . $date . '.txt';
        
        $this->fl->handle = $this->createLog($file);
        $this->cl->handle = fopen('php://stdout', 'w');
        
        $this->fl->output("LinkFix Dump\n{$name}\n$date",
          CL_TITLE | CL_PRESERVE);
        
        $count_checked = 0;
        $count_skipped = 0;
        $count_error = 0;
        $count_hit = 0;
        $drop_error = array();
        $time = time();
        
        $this->cl->output("Checking Links...", CL_TITLE);
        
        
        
        foreach ($link_array as $value) {
            
            $title = Wikipedia_Parser::formatTitle($value['link']);
            $line = str_pad((string) $value['line'], $padto) . ' ';
            
            if ($title == $name || $title == '') {
                continue;
            }
            
            //A little caching
            if (!empty($name_cache[$title])) {
                $count_skipped++;
                continue;
            }
            
            $name_cache[$title] = true;
            
            if (!Wikipedia_Parser::askCheckWikilink($title)) {
                continue;
            }
            
            $this->point($title);
            
            if(!$this->startEdit()) {
                $count_error++;
                $this->fl->output($line . '[['.$title.']] -> ERROR');
                continue;
            }
            
            $text = $this->returnSource();
            if (Wikipedia_Parser::is_redirect($text,$title) ||
                Wikipedia_Parser::is_disambig($text) == DISAMBIG_DEFINITE) {
                
                $tohere = Wikipedia_Parser::substrRedirectLocation($text);
                
                if ($tohere) {
                    $message = ' -> ' . '[['.$tohere.']]';
                } else {
                    $message = ' -> DISAMBIG';
                }
                
                $this->fl->output($line . '[['.$title.']]' . $message);
                
                unset($name_cache[$title]);
                $name_cache[$tohere] = true;
                
                $count_hit++;
            }
            
            $count_checked++;
            
            if ($count_checked % 10 == 0) {
                $utime = time() - $time;
                $minutes = (string) floor($utime/60);
                $seconds = (string) $utime % 60;
                $seconds = $seconds > 9 ? $seconds : '0'.$seconds;
                $utime = "$minutes:$seconds";
                $this->cl->output("Checked $count_checked at $utime");
            }
            
        }
        
        $this->cl->output("Status", CL_TITLE);
        $this->cl->output("Checked: $count_checked");
        $this->cl->output("Skipped: $count_skipped");
        $this->cl->output("Errors: $count_error");
        $this->cl->output("Hits: $count_hit");
        
        $utime = time() - $time;
        $minutes = (string) floor($utime/60);
        $seconds = (string) $utime % 60;
        $seconds = $seconds > 9 ? $seconds : '0'.$seconds;
        $utime = "$minutes:$seconds";
        
        $this->cl->output("Time: $utime");
        
        $this->fl->output('# DONE');
        
    }
    
    
}

?>
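
The URL handling in point(), _encode(), _getURLView() and _getURLEdit() can be tried out without SimpleTest or EZY_CommandLine. The sketch below restates that logic as standalone functions. The sketch_ function names and the explicit $host parameter (standing in for $this->_URL) are assumptions made for illustration and are not part of Wikipedia_Bot.

```php
<?php
// Hypothetical standalone helpers mirroring _encode(), _getURLView() and
// _getURLEdit(); the sketch_ names are not part of Wikipedia_Bot.
function sketch_encode($name) {
    // Spaces become underscores first, then everything else is URL-encoded.
    return urlencode(str_replace(' ', '_', $name));
}

function sketch_url_view($host, $name) {
    // $host plays the role of the value passed to setURL().
    return 'http://' . $host . '/w/index.php?title=' . sketch_encode($name);
}

function sketch_url_edit($host, $name) {
    // The edit page is the view URL plus an action parameter.
    return sketch_url_view($host, $name) . '&action=edit';
}
```

For example, sketch_url_edit('en.wikipedia.org', 'Albert Einstein') yields http://en.wikipedia.org/w/index.php?title=Albert_Einstein&action=edit, which is the page that startEdit() would fetch.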

Wikipedia_Parser.php

<?php

include_once('ezy/EZY_FormatText.php');

define('DISAMBIG_NOT',0);
define('DISAMBIG_MAYBE',1);
define('DISAMBIG_DEFINITE',2);

define('FORMAT_KEEPANCHOR',1);

class Wikipedia_Parser extends EZY_FormatText {
    
    function getAllWikilinks($wikitext) {
        
        //No wikilinks!
        if (strpos($wikitext, '[[') === false) {
            return array();
        }
        
        $newline = Wikipedia_Parser::detectNewlines($wikitext);
        
        //Parse it!
        $currentLocation = 0;
        $currentIndex = 0;
        $currentLine = 1;
        $collection = array();
        
        while (strpos($wikitext, '[[', $currentLocation) !== false) {
            
            $closeEndBracketLocation = strpos($wikitext, ']]', $currentLocation);
            $startBracketLocation = strpos($wikitext, '[[', $currentLocation);
            $endBracketLocation = strpos($wikitext, ']]', $startBracketLocation);
            $nextBracketLocation =  strpos($wikitext, '[[', $startBracketLocation + 2);
            
            $slice = substr($wikitext, $currentLocation,
              $startBracketLocation + 2 - $currentLocation);
            $currentLine += substr_count($slice, $newline);
            
            //account for an errant bracket in between
            if ($closeEndBracketLocation !== $nextBracketLocation) {
                //actually, it has no effect
            }
            //stop at an unterminated link; there is nothing left to extract
            if ($endBracketLocation === false) {
                break;
            }
            //account for a weird jump
            if ($nextBracketLocation !== false && $nextBracketLocation < $endBracketLocation) {
                $currentLocation = $nextBracketLocation;
                continue;
            }
            
            $wikiLink = substr($wikitext,$startBracketLocation+2,$endBracketLocation-$startBracketLocation-2);
            if (strpos($wikiLink,'|') !== false) {
                $wikiLinkArray = explode('|',$wikiLink);
                $wikiCore = $wikiLinkArray[0];
            } else {
                $wikiCore = $wikiLink;
            }
            $collection[$currentIndex]['link'] = $wikiCore;
            $collection[$currentIndex]['location'] = $startBracketLocation;
            $collection[$currentIndex]['line'] = $currentLine;
            $currentIndex++;
            
            $currentLocation = $startBracketLocation + 2;
            
        }
        return $collection;
        
    }
    
    function replaceWikilinks($wikitext, $instructions) {
        
        $internal = array();
        foreach($instructions as $key => $value) {
            $internal[$key] = $value['location'];
        }
        array_multisort(
            $internal,
            SORT_DESC,
            SORT_NUMERIC,
            $instructions
        );
        
        $size = sizeof($instructions);
        for($i = 0; $i < $size; $i++) {
            
            $display = $instructions[$i]['display'];
            $link = $instructions[$i]['link'];
            
            //Information about this wikilink
            $startBracket = $instructions[$i]['location'];
            //Make sure we weren't passed a bad location
            if ($wikitext{$startBracket} != '[' ||
                $wikitext{$startBracket+1} != '[') {
                continue;
            }
            $startWikilink = $startBracket + 2;
            $startDisplay = strpos($wikitext,'|',$startWikilink);
            $endWikilink = strpos($wikitext,']]',$startWikilink);
            
            //make sure we didn't catch an errant startdisplay
            if ($startDisplay === false || $startDisplay > $endWikilink) {
                $startDisplay = false;
            }
            
            $endBracket = $endWikilink + 2;
            $endTrail = $endBracket;
            $length = strlen($wikitext);
            while($endTrail < $length && ctype_lower($wikitext{$endTrail})) {
                $endTrail++;
            }
            
            //Operate
            if ($display !== false) {
                $wikitext = substr_replace($wikitext,'',$endBracket,
                  $endTrail - $endBracket);
                if ($startDisplay !== false) {
                    //Gut the whole thing, it's got a display, we've got a link
                    //and a display
                    $wikitext = substr_replace($wikitext,$link.'|'.$display,
                      $startWikilink, $endWikilink - $startWikilink);
                } else {
                    //Okay, there is a display we want to add in, but no
                    //existing display. Just splice it in.
                    $wikitext = substr_replace($wikitext,$link.'|'.$display,
                      $startWikilink,$endWikilink - $startWikilink);
                }
            } else {
                $trail = substr($wikitext,$endBracket,$endTrail - $endBracket);
                $wikitext = substr_replace($wikitext,'',$endBracket,
                  $endTrail - $endBracket);
                if ($startDisplay !== false) {
                    //There's a display, so tack the trail on and then replace
                    //the links
                    $wikitext = substr_replace($wikitext,$trail,
                      $endWikilink, 0);
                    $wikitext = substr_replace($wikitext,$link,$startWikilink,
                      $startDisplay-$startWikilink);
                } else {
                    //No display, so tack on trail and add the new link with a |
                    $wikitext = substr_replace($wikitext,$trail,$endWikilink,0);
                    $wikitext = substr_replace($wikitext,$link.'|',
                      $startWikilink,0);
                }
            }
            
        }
        
        return $wikitext;
        
    }
    
    function substrRedirectLocation($wikitext) {
        preg_match('/^#REDIRECT[ ]*\[\[([^\]]+)\]\]/i',$wikitext,$matches);
        if (isset($matches[1])) {
            return Wikipedia_Parser::formatTitle($matches[1]);
        } else {
            return '';
        }
    }
    
    function formatTitle($title,$flag = 0) {
        if ($title{0} == ':') {
            $title = substr($title, 1);
        }
        if (strpos($title, '#') !== false && !($flag & FORMAT_KEEPANCHOR)) {
            $title = substr($title, 0, strpos($title, '#'));
        }
        if ($title == '') {
            return '';
        }
        $title = ucfirst($title);
        $title = str_replace('_',' ',$title);
        return $title;
    }
    
    function is_redirect($wikitext, $name = false) {
        if ($name !== false) { //exception: don't count [[31 June]] style redirects
            if (Wikipedia_Parser::is_date($name)) {
                return false;
            }
        }
        return preg_match('/^#REDIRECT[ ]*\[\[[^\]]+\]\]/i',$wikitext);
    }
    
    function is_disambig($wikitext) {
        if(preg_match('/({{disambig}}|{{TLAdisambig}})/i',$wikitext)) {
            return DISAMBIG_DEFINITE;
        } elseif (preg_match('/({{Otheruses}}|{{Otheruses-number}}|{{Otherplaces}})/i',$wikitext)) {
            return DISAMBIG_MAYBE;
        } elseif (preg_match('/({{Otheruses2\|[^}]+}}|{{Otheruses3\|[^}]+}}|{{Otherplaces2\|[^}]+}})/i',$wikitext)) {
            return DISAMBIG_MAYBE;
        } else {
            return DISAMBIG_NOT;
        }
    }
    
    function is_year($title) {
        if (is_numeric($title)) {
            return true;
        }
        if (substr($title,-2) == 'BC' &&
          is_numeric(substr($title, 0, -3))) {
            return true;
        }
        return false;
    }
    
    function is_date($title) {
        if (preg_match('/^\d{1,2} (January|February|March|April|May|June|July|'.
          'August|September|October|November|December)$/',$title)) {
            return true;
        } elseif (preg_match('/^(January|February|March|April|May|June|July|Au'.
          'gust|September|October|November|December) \d{1,2}$/',$title)) {
            return true;
        }
        return false;
    }
    
    function is_image($title) {
        if ($title{0} == ':') {
            $title = substr($title,1);
        }
        if (substr($title,0,6) == 'Image:' && strlen($title) > 6) {
            return true;
        }
        return false;
    }
    
    function is_category($title) {
        if ($title{0} == ':') {
            $title = substr($title,1);
        }
        if (substr($title,0,9) == 'Category:' && strlen($title) > 9) {
            return true;
        }
        return false;
    }
    
    function is_interwiki($title) {
        if (strlen($title) > 3 && $title{2} == ':') {
            return true;
        }
        return false;
    }
    
    function is_heading($line) {
        if (strpos($line,"\n") !== false || strpos($line,"\r") !== false) {
            return false;
        }
        return preg_match('/^((?:=){1,6})[^=]+\1/', $line);
    }
    
    function getSignature($line) {
        $date = preg_match("/(\d{2}:\d{2}, \S+ \d{1,2}, \d{4} \(UTC\)|" .
            "\d{2}:\d{2}, \d{1,2} \S+ \d{4} \(UTC\)|" .
            "\d{2}:\d{2}, \d{4} \S+ \d{1,2} \(UTC\)|" .
            "\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \(UTC\)|" .
            "\d{2}:\d{2}:\d{2}, \d{4}-\d{2}-\d{2} \(UTC\)" .
            ")\s*$/",$line, $matches);
        
        if (!$date) {
            return false;
        }
        $links = array_reverse(Wikipedia_Parser::getAllWikilinks($line));
        foreach ($links as $value) {
            $link = $value['link'];
            $namespace = Wikipedia_Parser::getNamespace($link);
            $namespace = Wikipedia_Parser::formatTitle($namespace);
            //formatTitle() has already turned underscores into spaces
            if ($namespace === 'User' || $namespace === 'User talk') {
                return array('user'=>substr($link,strpos($link,':')+1),'time'=>$matches[0]);
            }
        }
        return false;
    }
    
    function getNamespace($title) {
        if (strpos($title,':') === false) {
            return '';
        }
        if ($title{0} == ':') {
            $title = substr($title,1,strpos($title,':',1));
        }
        return substr($title,0,strpos($title,':'));
    }
    
    function askCheckWikilink($title) {
        if (empty($title)) {
            return false;
        }
        
        if (strpos($title,':') === strlen($title) - 1) {
            return false;
        }
        
        if (Wikipedia_Parser::is_year($title)) {
            return false;
        }
        if (Wikipedia_Parser::is_date($title)) {
            return false;
        }
        if (Wikipedia_Parser::is_interwiki($title)) {
            return false;
        }
        
        $namespace = Wikipedia_Parser::getNamespace($title);
        if ($namespace !== '') {
            return false;
        }
        
        //special cases
        if ($title == 'Documentary') {
            return false;
        }
        
        return true;
    }
    
    function parseTalk($wikitext) {
        
        $newline = Wikipedia_Parser::detectNewlines($wikitext);
        
        $wikitext_array = explode($newline, $wikitext);
        $DEFINITION_LIST = array('#'=>true,'*'=>true,':'=>true,';'=>true);
        $DEFINITION_LIST_STRING = '#*:;';
        
        $i = 0;
        $ini_array = array();
        foreach ($wikitext_array as $key => $line) {
            
            if (trim($line) == '') {
                continue;
            }
            
            if (Wikipedia_Parser::is_heading($line)) {
                if ($key === 0) {
                    continue;
                } else {
                    return false;
                }
            }
            
            $ini_array[$i] = array();
            
            $length = strlen($line);
            for ($j = 0;
                 $j < $length && isset($DEFINITION_LIST[$line{$j}]);
                 $j++) {
            }
            $ini_array[$i]['depth'] = $j;
            
            $signature = Wikipedia_Parser::getSignature($line);
            $user = isset($signature['user']) ? $signature['user'] : false;
            $time = isset($signature['time']) ? $signature['time'] : false;
            
            if ($user !== false) {
                for ($j = $i - 1;
                     $j >= 0 && $ini_array[$j]['user'] === false &&
                       $ini_array[$j]['depth'] >= $ini_array[$i]['depth'];
                     $j--) {
                    $ini_array[$j]['user'] = $user;
                    if ($ini_array[$j]['depth'] > $ini_array[$i]['depth']) {
                        $ini_array[$j]['line'] = str_repeat(':',
                          $ini_array[$j]['depth'] - $ini_array[$i]['depth']) .
                          $ini_array[$j]['line'];
                    }
                    $ini_array[$j]['depth'] = $ini_array[$i]['depth'];
                }
                $j++;
                $ini_array[$j]['commentstart'] = true;
            }
            
            $ini_array[$i]['user'] = $user;
            $ini_array[$i]['time'] = $time;
            $ini_array[$i]['line'] = ltrim($line, $DEFINITION_LIST_STRING);
            
            $i++;
            
        }
        
        
        $ret_array = array();
        $i = -1;
        $last_author = false;
        foreach ($ini_array as $value) {
            
            if (isset($value['commentstart'])) {
                $i++;
                $ret_array[$i]['body'] = $value['line'];
                $ret_array[$i]['depth'] = $value['depth'];
            } else {
                $ret_array[$i]['body'] .= $newline.$newline.$value['line'];
            }
            
            if ($value['time'] && $value['user']) {
                $ret_array[$i]['time'] = $value['time'];
                $ret_array[$i]['user'] = $value['user'];
            }
            
        }
        
        return $ret_array;
        
    }
    
    function parseSections($wikitext) {
        
        $newline = Wikipedia_Parser::detectNewlines($wikitext);
        $wikitext_array = explode($newline, $wikitext);
        
        $ret_array = array();
        $ret_array[0]['body'] = '';
        $header_array = array();
        $i = 0;
        foreach ($wikitext_array as $key => $value) {
            if (isset($header_array[$key]) || Wikipedia_Parser::is_heading($value)) {
                $header_array[$key] = true;
                $i++;
                $ret_array[$i]['title'] = Wikipedia_Parser::getHeading($value);
                $ret_array[$i]['body'] = '';
                continue;
            }
            if (trim($value) === '' &&
              (isset($header_array[$key-1]) || $key === 0 ||
               $lookahead = Wikipedia_Parser::is_heading($wikitext_array[$key+1]))) {
                if (isset($lookahead)) {
                    $header_array[$key-1] = true;
                }
                continue;
            }
            $ret_array[$i]['body'] .= $value . $newline;
            
        }
        
        return $ret_array;
        
    }
    
    function getHeading($line) {
        if (!Wikipedia_Parser::is_heading($line)) {
            return false;
        }
        preg_match('/^((?:=){1,6})([^=]+)\1/', $line, $matches);
        return trim($matches[2]);
    }
    
}

?>
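
To show the shape of the data that getAllWikilinks() hands to routine_linkFix(), here is a minimal standalone sketch that returns the same 'link', 'location' and 'line' keys. It swaps in a regular expression for the strpos() scan used above, so it is not the parser's own algorithm. It also assumes "\n" newlines rather than calling detectNewlines(), and the name sketch_wikilinks is hypothetical.

```php
<?php
// Hypothetical helper, not part of Wikipedia_Parser: a regex-based variant
// of getAllWikilinks() that assumes Unix ("\n") newlines.
function sketch_wikilinks($wikitext) {
    $links = array();
    // Match [[target]] or [[target|display]]; the target itself may not
    // contain brackets or pipes.
    preg_match_all('/\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]/', $wikitext,
      $matches, PREG_OFFSET_CAPTURE);
    foreach ($matches[1] as $i => $target) {
        $offset = $matches[0][$i][1]; // byte offset of the opening [[
        $links[] = array(
            'link'     => $target[0],
            'location' => $offset,
            // Line number = 1 + number of newlines before the link.
            'line'     => 1 + substr_count(substr($wikitext, 0, $offset), "\n"),
        );
    }
    return $links;
}
```

Given "See [[Apple]] and\n[[Banana (fruit)|bananas]].", the sketch reports Apple on line 1 and Banana (fruit) on line 2, with the display text after the pipe dropped.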

Wikipedia_Extractor.php

<?php

class Wikipedia_Extractor
{
    
    /**
     * Takes output from Edit Page and grabs Wikitext
     * 
     * Current algorithm takes the first occurrence of a textbox, and gets the
     * contents inside it. Handles decoding of HTML via internal functions.
     * No regular expressions.
     */
    function editpageWikitext($html) {
        $start = strpos($html,'<textarea ');
        $end = strpos($html,'</textarea>') + strlen('</textarea>');
        $textbox = substr($html,$start, $end - $start);
        $textbox = strip_tags($textbox);
        $textbox = html_entity_decode($textbox);
        return $textbox;
    }
    
    /**
     * Takes output from article and grabs only article HTML
     * 
     * Uses the template's built-in <!-- start content --> hooks to locate the
     * text. Fixes URLs so that they're absolute, although it only does so for
     * certain prefixes (i.e. it could be improved). No regular expressions.
     */
    function generalContent($html) {
        
        $tag_start = "<!-- start content -->";
        $tag_end = "<!-- end content -->";
        
        $response_start = strpos($html, $tag_start) + strlen($tag_start);
        $response_end = strpos($html, $tag_end); 
        $data = substr($html,$response_start,($response_end - $response_start)); 
        
        //Change URLs to Absolute
        $data = str_replace('"/w/index.php',
          '"http://en.wikipedia.org/w/index.php',$data);
        $data = str_replace('"/wiki/', '"http://en.wikipedia.org/wiki/',$data);
        
        return $data;
    }
    
    /**
     * Takes output from article and grabs server information
     * 
     * Output in form of array with keys 'server' and 'time'. Grabs information
     * by looking for the last comment. Uses perl regexps.
     */
    
    function generalServer($html) {
        
        $regexp = "#<!-- Served by (\w+) in (\d*\.\d+) secs. -->\s*</body>\s*</html>\s*$#";
        $result = preg_match($regexp, $html, $matches);
        
        if ($result) {
            $data['server'] = $matches[1];
            $data['time'] = (float) $matches[2];
        } else {
            $data = false;
        }
        
        return $data;
    }
    
    function _strtotime($string) {
        preg_match('/^(\d{2}):(\d{2}), (.+)$/', $string, $matches);
        return strtotime($matches[3] . ' ' . $matches[1] . ':' . $matches[2] . 'Z');
    }
    
    function historyContent($html) {
        
        $pattern = 
        '~' .
        '<li>' .
        '\((?:<a href="/w/index.php\?title=[^&]+&amp;diff=\d+&amp;oldid=\d+" title="[^"]+">)?cur(?:</a>)?\)' .
        ' ' .
        '\((?:<a href="/w/index.php\?title=[^&]+&amp;diff=\d+&amp;oldid=\d+" title="[^"]+">)?last(?:</a>)?\)' .
        ' ' . 
        '<input type="radio" value="\d+"(?: style="visibility:hidden")?(?: checked="checked")? name="oldid" />' .
        '<input type="radio" value="\d+"(?: checked="checked")? name="diff" />' .
        ' ' .
        '<a href="/w/index.php\?title=.+?&amp;oldid=\d+" title=".+?">\d{2}:\d{2}, .+? \d{1,2}, \d{4}</a>' .
        ' ' .
        '<span class=\'history-user\'><a (?:href="/wiki/User:.+?"|href="/w/index.php\?title=User:[^&]+&amp;action=edit" class="new") title="User:.+?">.+?</a></span>' .
        '(?: <span class="minor">m</span>)?' .
        '(?: <span class=\'comment\'>\(.+?\)</span>)?' .
        '</li>' .
        '~'
        ;
        var_dump($pattern);
        preg_match_all($pattern, $html, $matches);
        
        var_dump($matches);
        
    }
    
}

?>
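
The textarea extraction in editpageWikitext() can be exercised on a small HTML fragment. The sketch below is a standalone variant written under a few assumptions: it slices out the textarea body instead of calling strip_tags(), it returns false when no textarea is found (the original does not check for this), and it expects no '>' inside the textarea's attribute values. The function name is hypothetical.

```php
<?php
// Hypothetical standalone variant of editpageWikitext(); the function name
// and the false return value for a missing textarea are assumptions.
function sketch_extract_wikitext($html) {
    $open = strpos($html, '<textarea');
    if ($open === false) {
        return false; // no edit box on the page
    }
    $contentStart = strpos($html, '>', $open);
    if ($contentStart === false) {
        return false; // malformed opening tag
    }
    $contentStart++; // skip past the '>' that ends the opening tag
    $close = strpos($html, '</textarea>', $contentStart);
    if ($close === false) {
        return false; // unterminated textarea
    }
    // The edit page delivers the wikitext entity-encoded, so decode it.
    return html_entity_decode(
        substr($html, $contentStart, $close - $contentStart));
}
```

For instance, a page containing a textarea whose body is "Hello &amp;amp; [[World]]" comes back as "Hello & [[World]]".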

EZY_FormatText.php

<?php

class EZY_FormatText
{
    
    function rmNewlines($msg, $replace_with = ' ') {
        $msg = str_replace("\r\n",$replace_with,$msg);
        $msg = str_replace("\n",  $replace_with,$msg);
        $msg = str_replace("\r",  $replace_with,$msg);
        return $msg;
    }
    
    function detectNewlines($msg) {
        $newline_windows = substr_count($msg, "\r\n");
        $newline_unix = substr_count($msg, "\n") - $newline_windows;
        $newline_mac = substr_count($msg, "\r") - $newline_windows;
        
        //Index each style by its count. Later assignments overwrite earlier
        //ones, so ties are resolved in favor of \r\n, then \n, then \r.
        $array[$newline_mac] = "\r";
        $array[$newline_unix] = "\n";
        $array[$newline_windows] = "\r\n";
        
        $count = max($newline_windows, $newline_unix, $newline_mac);
        
        return $array[$count];
    }
    
    function normNewlines($msg, $newline) {
        if ($newline == "\n") {
            $msg = str_replace("\r\n","\n",$msg);
            $msg = str_replace("\r","\n",$msg);
        } elseif ($newline == "\r\n") {
            //Collapse every style to "\n" first, then expand each "\n" to
            //"\r\n", so no existing "\r\n" pair is doubled and literal
            //backslashes in the text are left untouched.
            $msg = str_replace("\r\n","\n",$msg);
            $msg = str_replace("\r","\n",$msg);
            $msg = str_replace("\n","\r\n",$msg);
        } elseif ($newline == "\r") {
            $msg = str_replace("\r\n","\r",$msg);
            $msg = str_replace("\n","\r",$msg);
        }
        return $msg;
    }
    
    //Detects the newline style automatically; use normNewlines() on the
    //result to change it afterwards.
    function wrap($msg, $wrap = 75, $indent = 0, $wrapindent = 0) { 
        
        $s_indent = str_repeat(' ',$indent);
        $s_wrapindent = str_repeat(' ',$wrapindent);
        
        $newlines = EZY_FormatText::detectNewlines($msg);
        $msg_chunks = explode($newlines, $msg);
        
        $newmsg = '';
        foreach ($msg_chunks as $value) {
            //Wrap once at full width to locate the first line break.
            $msg_wrap = wordwrap($value, $wrap - $indent, $newlines, true);
            $break = strpos($msg_wrap, $newlines);
            if ($wrapindent > 0 && $break !== false) {
                //The first line keeps the full width; the remainder is
                //re-wrapped narrower to leave room for the hanging indent.
                $lead = array(substr($msg_wrap, 0, $break));
                $rest_string = trim(substr($value, $break));
                $rest_wrap = wordwrap($rest_string, $wrap-$indent-$wrapindent,
                  $newlines, true);
                $array = array_merge($lead, explode($newlines, $rest_wrap));
            } else {
                $array = explode($newlines, $msg_wrap);
            }
            foreach ($array as $number => $string) {
                if ($indent) {
                    $newmsg .= $s_indent;
                }
                if ($number > 0 && $wrapindent) {
                    $newmsg .= $s_wrapindent;
                }
                $newmsg .= $string;
                $newmsg .= $newlines;
            }
        }
        $newmsg = substr($newmsg,0,-strlen($newlines));
        return $newmsg;
    }
    
    function _parseBorderOptions($opt) {
        $new['width'] = isset($opt['width']) ? $opt['width'] : false;
        if (!$new['width']) {
            return array();
        }
        if ($new['width'] == 1) {
            $new['style'] = isset($opt['style']) ? $opt['style'] : false;
            if ($new['style'] === false) {
                return array();
            }
        } else {
            //Widths other than 1 are not handled yet.
        }
        $new['length'] = isset($opt['length']) ? $opt['length'] : 75;
        $new['indent'] = isset($opt['indent']) ? $opt['indent'] : 0;
        return $new;
    }
    
    //Incomplete: only the vertical model (top and/or bottom borders) is
    //implemented so far.
    function addBorders($msg,$top=array(),$right=array(),$bottom=array(),
      $left=array()) {
        
        $newline = EZY_FormatText::detectNewlines($msg);
        
        $top    = EZY_FormatText::_parseBorderOptions($top);
        $right  = EZY_FormatText::_parseBorderOptions($right);
        $bottom = EZY_FormatText::_parseBorderOptions($bottom);
        $left   = EZY_FormatText::_parseBorderOptions($left);
        
        if (empty($left) && empty($right) && (!empty($top) || !empty($bottom))){
            //Vertical model
            
            if (!empty($top)) {
                $indent = str_repeat(' ',$top['indent']);
                if ($top['width'] == 1) {
                    //Repeat the style string to fill the requested length.
                    $line = str_repeat($top['style'],
                      (int) floor($top['length'] / strlen($top['style'])));
                    $msg = $indent . $line . $newline . $msg;
                }
            }
            
            if (!empty($bottom)) {
                $indent = str_repeat(' ',$bottom['indent']);
                if ($bottom['width'] == 1) {
                    //Repeat the style string to fill the requested length.
                    $line = str_repeat($bottom['style'],
                      (int) floor($bottom['length'] / strlen($bottom['style'])));
                    $msg = $msg . $newline . $indent . $line;
                }
            }
            
        } elseif (empty($top) && empty($bottom) &&
          (!empty($left) || !empty($right))) {
            //Horizontal model
            
        } else {
            //Box model
           
        }
        return $msg;
    }
    
    
    
}

?>
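
A single-pass regular expression is one possible alternative to the chained str_replace calls in normNewlines(). The sketch below is only an illustration of that idea: the function name normNewlinesRegex and the sample string are assumptions, not part of EZY_FormatText.

```php
<?php
// Hypothetical alternative to EZY_FormatText::normNewlines(); the function
// name is an assumption made for this sketch.
function normNewlinesRegex($msg, $newline = "\n") {
    // "\r\n" is listed first so that a Windows pair is consumed as one line
    // break instead of two. Replacement text is not rescanned, so the target
    // style may itself contain "\r" or "\n".
    return preg_replace('/\r\n|\r|\n/', $newline, $msg);
}

// Made-up sample mixing all three newline styles.
$mixed   = "one\r\ntwo\rthree\nfour";
$unix    = normNewlinesRegex($mixed, "\n");
$windows = normNewlinesRegex($mixed, "\r\n");
```

Since the text is scanned only once, no intermediate placeholder is needed, and a literal backslash followed by "n" in the input passes through unchanged.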