User talk:ST47/perlwikipedia

Unix

I'm running ActivePerl on a Windows box, and perlwikipedia doesn't seem to work. Am I correct that pw assumes a Unix environment? For instance, perlwikipedia.pm says:

system("test -s \".perlwikipedia-$editor-cookies\"");

Is "test" a Unix command? – Quadell (talk) (random) 17:44, 24 May 2007 (UTC)

I have a similar issue, and others have reported it as well. Shadow doesn't get too happy when we talk about Windows around him, though, so I suppose it won't be fixed. I think it's an encoding error, and I've said several times that I was going to look at it, but I never got around to it. --ST47Talk 18:04, 25 May 2007 (UTC)
Oops! Yeah, "test" is a Unix command; with -s it checks that a file exists and is not empty. I'll write a better handler for this code ASAP and commit it to SVN. Shadow1 (talk) 18:58, 25 May 2007 (UTC)
Ok, the code should now work on Windows/ActiveState if you update your working copy. Thanks for reminding me that not everyone uses Linux! Shadow1 (talk) 19:08, 25 May 2007 (UTC)
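
For readers hitting the same problem: Perl's built-in -s file test operator makes the same check without shelling out, so it behaves the same on Windows and Unix. The following is a minimal sketch, not the code that was committed; the helper name cookie_file_has_data and the editor name are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file exists and is non-empty,
# which is the condition `test -s FILE` checks on Unix.
sub cookie_file_has_data {
    my ($path) = @_;
    return ( -s $path ) ? 1 : 0;
}

my $editor = 'ExampleBot';    # illustrative name, not from the module
my $file   = ".perlwikipedia-$editor-cookies";
print cookie_file_has_data($file)
    ? "cookie file found\n"
    : "no cookie data\n";
```

Because -s is evaluated by Perl itself, no external "test" binary has to be on the PATH.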

Thanks for the quick turnaround! Now it still errors out, but at a different location. When I try to log in, I get:

Error requesting Special%3AUserlogin: 403 Forbidden

When I turn on debug, just before it dies it tells me

Retrieving http://en.wikipedia.org/w/index.php?title=Special%3AUserlogin&action=edit

Of course I can enter this URL in my browser and not get a 403 error. Is this an incompatibility with Windows, or something else? Any ideas? – Quadell (talk) (random) 19:55, 25 May 2007 (UTC)

That's because the user agent is blocked. You need to change it to something specific to your bot if you want to do anything. --ST47Talk 23:57, 25 May 2007 (UTC)
I'm not sure I understand. I'm just running login.pl from here. I don't see anywhere to set the user agent. Is this related to the line "my $editor = Perlwikipedia->new('Bot/WP/EN/E/ExtranetBot');" in this code you wrote? – Quadell (talk) (random) 02:32, 26 May 2007 (UTC)
I believe the command is $editor->{mech}->agent('w/e'); --ST47Talk 11:25, 27 May 2007 (UTC)

List?

Another question: Is there a list (or category) of bots using perlwikipedia? – Quadell (talk) (random) 17:54, 24 May 2007 (UTC)

I just created Category:Perlwikipedia bots. Shadow1 (talk) 18:58, 25 May 2007 (UTC)
Thanks! If my bot gets approved, I'll add it. – Quadell (talk) (random) 19:56, 25 May 2007 (UTC)

More tech support

Hi. Perlwikipedia looks like a great tool, and I'd love to use it, but I can't get it to work. The supposed test script, login.pl, does not seem to work as-is. (I get "Error requesting Special%3AUserlogin: 403 Forbidden".) ST47, above, suggested I add the line "$editor->{mech}->agent('w/e');" to specify the user agent. When I do that, I get this error: "There is no form named "userlogin" at C:/Perl/lib/Perlwikipedia.pm line 102. Died at C:/Perl/lib/WWW/Mechanize.pm line 1684."

If I can't get this to work, I'll have to find some other way to interface with Wikipedia. Any help anyone could provide would be greatly appreciated. (I'm using ActivePerl on a Windows box, by the way.) Thanks, – Quadell (talk) (random) 14:21, 28 May 2007 (UTC)

First, 'w/e' means 'whatever', so replace that with something descriptive. I usually use Bot/WP/EN/ST47/BotName. I don't know what that error means, but make sure you have the latest version and such. --ST47Talk 14:31, 28 May 2007 (UTC)
Unless you are using the passwordless login method that I described on the Google Code wiki, there is no reason you should need to use Login.pl. It's a script that is designed to fetch the login data for your bot's account and place it into a file so that your bot can log into Wikipedia without using a password in cleartext. From what I've seen, the source code you're using should work perfectly fine if you insert the bot's password into the right place in the login() call. Shadow1 19:06, 30 May 2007 (UTC)
It may be that my modules (LWP, Mechanize, etc.) were not installed correctly. I'm investigating. – Quadell (talk) (random) 12:53, 31 May 2007 (UTC)
That was it. With LWP and Mechanize reinstalled, it works fine. Huzzah! – Quadell (talk) (random) 14:42, 31 May 2007 (UTC)

New problem. It logs in fine, but when attempting to get_text, on a Windows system, it puts itself in an endless loop. (It works fine on a *nix system.) My code looks like this:

use Perlwikipedia;
use strict;
my $pw=Perlwikipedia->new();
$pw->{debug} = 1;
$pw->{mech}->agent('Bot/WP/EN/Quadell/polbot');
my $login_status=$pw->login('Polbot','(my password)');
die "I can't log in." unless ($login_status eq 'Success');
my $html = $pw->get_text('User:Polbot');

The output on a Windows system (with debug on) is as follows:

Retrieving http://en.wikipedia.org/w/index.php?title=Special%3AUserlogin&action=edit
Login as "Polbot" succeeded.
Retrieving http://en.wikipedia.org/w/index.php?title=User%3APolbot&action=edit&oldid=&section=
Retrieving http://en.wikipedia.org/w/index.php?title=&action=edit
Retrieving http://en.wikipedia.org/w/index.php?title=&action=edit
Retrieving http://en.wikipedia.org/w/index.php?title=&action=edit
Retrieving http://en.wikipedia.org/w/index.php?title=&action=edit
. . .

It continues trying to load a page with no title specified until I cancel the program. This seems to be because m/var wgAction = "edit"/ doesn't match, so the until condition is never met. Debugging, I tried to print $res->content from within the get_text definition, and it seems to be complete gobbledygook. Is there an encoding problem, maybe? – Quadell (talk) (random) 15:58, 31 May 2007 (UTC)

Install the module Compress::Zlib. For some reason, the servers like to return gzip-compressed content, so installing this module should fix the last of your problems. Shadow1 (talk) 16:22, 31 May 2007 (UTC)
I installed Compress::Zlib, but it does the same thing. – Quadell (talk) (random) 17:35, 31 May 2007 (UTC)
The only other problem I can think of is that there's something wrong with your installation of ActiveState/WWW::Mechanize that's causing it to not properly decode the content. In the actual Perlwikipedia.pm file, change

use WWW::Mechanize;

to

use WWW::Mechanize::Gzip;

and

WWW::Mechanize->new( cookie_jar => {}, onerror => \&Carp::carp );

to

WWW::Mechanize::Gzip->new( cookie_jar => {}, onerror => \&Carp::carp );

Other than that, I really can't help you much more. Shadow1 (talk) 19:23, 1 June 2007 (UTC)

Actually, no, never mind that. The author of WWW::Mechanize recently removed support for decoding Gzipped content via content(), so make sure you're using the latest version of the module. It should be version 1.30. Update the module and you should be fine. Shadow1 (talk) 13:14, 2 June 2007 (UTC)
I have the latest WWW::Mechanize, v1.30. It's not a problem with Mechanize. The following code works as expected:
my $agent = WWW::Mechanize->new( agent => 'polbot' );
$agent->get("http://en.wikipedia.org/w/index.php?title=Main_page&action=view");
print ($agent->{content});
But this code hangs forever:
my $pw = Perlwikipedia->new();
$pw->{mech}->agent('Bot/WP/EN/Quadell/polbot');
print ($pw->get_text('Main page'));
I've repeated this error on a different Windows box with a fresh ActivePerl and Mechanize install. As of now, it looks to me like PerlWikipedia does not work on ActivePerl on Windows. – Quadell (talk) (random) 02:12, 6 June 2007 (UTC)
Ok, change all instances of "->content" in get_text to "->decoded_content" and see if that works. If it does, then it's some sort of odd problem with ActiveState's Mechanize, although I just tested Perlwikipedia on my Windows machine and it worked fine. Shadow1 (talk) 19:40, 6 June 2007 (UTC)
That worked! I'm befuddled as to why I have this problem and you don't, but I'm certainly glad to have a fix. Thanks for all your help! – Quadell (talk) (random) 19:53, 6 June 2007 (UTC)
I'm guessing that there are code differences between ActiveState's Mechanize and CPAN's, but that's water under the bridge. Thanks for helping to resolve this issue; I'll change the code accordingly and commit it to SVN. Shadow1 (talk) 20:33, 6 June 2007 (UTC)
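
The symptom in this thread (a body that looks like gobbledygook and never matches the wgAction pattern) is what a gzip-compressed response looks like before it is decoded. The sketch below reproduces that locally with core modules only. It is not Perlwikipedia or Mechanize code: the page text is a made-up stand-in, and has_marker is a hypothetical helper.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Stand-in for an edit page; only the marker line matters here.
my $page = '<html><script>var wgAction = "edit";</script>'
    . ( '<p>filler text</p>' x 20 )
    . '</html>';

# Roughly what arrives when the server sends Content-Encoding: gzip.
my $raw;
gzip( \$page => \$raw ) or die "gzip failed: $GzipError";

# The decoding step, analogous to what decoded_content applies.
my $decoded;
gunzip( \$raw => \$decoded ) or die "gunzip failed: $GunzipError";

# Hypothetical helper: the same pattern get_text waits for.
sub has_marker { return $_[0] =~ m/var wgAction = "edit"/ ? 1 : 0 }

printf "raw body matches:     %s\n", has_marker($raw)     ? 'yes' : 'no';
printf "decoded body matches: %s\n", has_marker($decoded) ? 'yes' : 'no';
```

A loop that waits for the marker in the undecoded body can therefore never finish, which is consistent with the endless "Retrieving" output reported above.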

New sub I created

Hey. I created a new sub that I use in my Perlwikipedia.pm. You might want to consider adding it to the official release. You pass in an image name, it returns an array of all articles that include the image (from the "File links" list).

=item get_file_links($pagename)

Returns array containing the pages that link to an image or other media.

=cut

sub get_file_links {
    my $self     = shift;
    my $pagename = shift;

    # Fetch the rendered image description page.
    my $res = $self->_get( $pagename, 'view' );
    unless ($res) { return; }

    # Capture the body of the "File links" list.
    unless ( $res->decoded_content =~ m/\(pages on other projects are not listed\):<\/div><\/p>\n<ul>(.*?)\n<\/ul>/s ) { return; }
    my $linklist = $1;

    # Each list item carries the linking page's name in its title attribute.
    my @return;
    foreach my $article ( split /\n/, $linklist ) {
        if ( $article =~ m/<li><a href="[^"]*" title="([^"]*)">/ ) {
            push @return, $1;
        }
    }
    return @return;
}
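
To show what the two patterns in the sub extract, here is a standalone sketch that applies them to a hand-written HTML fragment. The fragment is an assumption shaped to fit the patterns, not a captured Wikipedia page, and extract_file_links is a hypothetical name for the parsing half of get_file_links.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the parsing step in get_file_links.
sub extract_file_links {
    my ($html) = @_;
    return
        unless $html =~
        m/\(pages on other projects are not listed\):<\/div><\/p>\n<ul>(.*?)\n<\/ul>/s;
    my $linklist = $1;
    my @titles;
    foreach my $item ( split /\n/, $linklist ) {
        push @titles, $1 if $item =~ m/<li><a href="[^"]*" title="([^"]*)">/;
    }
    return @titles;
}

# Made-up sample input, written to match the layout the patterns expect.
my $sample = <<'HTML';
<div>The following pages link to this file (pages on other projects are not listed):</div></p>
<ul>
<li><a href="/wiki/Example_article" title="Example article">Example article</a></li>
<li><a href="/wiki/Another_page" title="Another page">Another page</a></li>
</ul>
HTML

print "$_\n" for extract_file_links($sample);
```

Since the match depends on the exact wording and markup of the rendered page, any change to the skin or interface language would make the sub return nothing.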

what_links_here

The behavior of what_links_here seems problematic to me. It is currently returning not only pages that link to the specified page, but also pages that link to redirects to the specified page. However, it doesn't return the first page that links to a redirect.

For example, look at Jill Gascoine and Jill Gascoigne. Jill Gascoigne is a redirect to Jill Gascoine. If I compare the what_links_here results for both pages, the results for Jill Gascoine include all of those for Jill Gascoigne except Morecambe and Wise, which is missing.

It seems to me that what_links_here should only return pages that actually link to the requested page. Returning links to redirects doesn't seem that useful as I would rather specifically request what_links_here on the redirect if that's what I want, but perhaps I'm overlooking something.

So, I recommend either that:

  1. what_links_here be fixed to return the first page linking to a redirect; or
  2. what_links_here's Special:Whatlinkshere screen-scrape be replaced with a call to api.php, which only returns direct links.

The benefit of the second is that api.php also supports filtering by namespace which would be convenient in some applications.

If there is interest in the api.php approach, I am willing to write the patch for that. -- JLaTondre 19:41, 4 August 2007 (UTC)

It should work now. Shadow1 (talk) 13:44, 25 August 2007 (UTC)
Thanks. -- JLaTondre 23:56, 26 August 2007 (UTC)
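
For reference, the api.php approach proposed in this section could start from a request like the one built below. This is a sketch under assumptions, not the change that was made to the module: backlinks_url and escape_param are hypothetical helpers, and the limit and output format are arbitrary choices. The parameters used (list=backlinks, bltitle, blnamespace, bllimit) are part of the MediaWiki API.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a value as UTF-8 for a query string.
sub escape_param {
    my ($text) = @_;
    my $bytes = encode( 'UTF-8', $text );
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# Hypothetical helper: build a backlinks query restricted to one namespace.
sub backlinks_url {
    my ( $title, $namespace ) = @_;
    my @params = (
        action      => 'query',
        list        => 'backlinks',
        bltitle     => $title,
        blnamespace => $namespace,
        bllimit     => 500,
        format      => 'json',
    );
    my @pairs;
    while ( my ( $key, $value ) = splice @params, 0, 2 ) {
        push @pairs, "$key=" . escape_param($value);
    }
    return 'http://en.wikipedia.org/w/api.php?' . join( '&', @pairs );
}

print backlinks_url( 'Jill Gascoine', 0 ), "\n";
```

The blnamespace parameter is the namespace filter mentioned above; 0 selects the article namespace.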

CPAN

I started using this module and it looks fine. Of all the Perl bot frameworks I tried, this is the first that I could install and make do the right thing pretty quickly. Thanks for your work!

A question: Is there a reason you don't host this module on CPAN? CPAN is the natural place to look for Perl code, but anyone who searches CPAN for "MediaWiki" today finds the module of that name, which has impressive documentation, but appears to be unmaintained. Finding your framework on Wikipedia wasn't so trivial. --Amir E. Aharoni (talk) 15:47, 2 June 2008 (UTC)

Unicode

I'm running Ubuntu with Perl 5.8 in an all-UTF-8 environment. Perlwikipedia (today's SVN) seems to assume the terminal runs Latin-1. --LA2 (talk) 22:12, 11 July 2008 (UTC)
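
The report above does not say where the Latin-1 assumption comes from, so the following does not diagnose it. It is only a sketch of how a calling script can declare a UTF-8 terminal explicitly, using core Perl; utf8_bytes is a hypothetical helper added for inspection.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Declare the terminal's encoding once. Without a layer like this, Perl
# writes characters below 256 as single Latin-1 bytes.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: the bytes such a layer emits for a string.
sub utf8_bytes { return encode( 'UTF-8', $_[0] ) }

my $title = "H\x{e4}ufigkeitsverteilung";    # a-umlaut written as an escape
print "$title\n";
printf "%d characters, %d bytes as UTF-8\n",
    length($title), length( utf8_bytes($title) );
```

Whether this is sufficient depends on whether the strings returned by the module are already decoded character strings, which the thread does not establish.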

Categories

The function get_pages_in_category() seems to retrieve a web page and follow the "next 200" link. This of course has a different name in other languages of Wikipedia. The bot should be able to use an API call instead, to retrieve the full list of category members. See http://en.wikipedia.org/w/api.php for documentation. --LA2 (talk) 22:15, 11 July 2008 (UTC)
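
The API route suggested here avoids the language-dependent "next 200" link because paging is driven by a continuation token in the response. The loop below is a sketch of that idea with the network call left out: $fetch is a hypothetical caller-supplied code ref that sends the parameters to api.php and returns the decoded reply, and the canned replies are invented for illustration. The parameter names (list=categorymembers, cmtitle, cmlimit, cmcontinue) and the "continue" block follow the current MediaWiki API, which may differ from what was available in 2008.

```perl
use strict;
use warnings;

# Collect all member titles, following the API's continuation token.
sub category_members {
    my ( $category, $fetch ) = @_;
    my %params = (
        action  => 'query',
        list    => 'categorymembers',
        cmtitle => "Category:$category",
        cmlimit => 500,
        format  => 'json',
    );
    my @titles;
    while (1) {
        my $reply = $fetch->( {%params} );
        push @titles,
            map { $_->{title} } @{ $reply->{query}{categorymembers} || [] };
        my $next = $reply->{continue} or last;
        %params = ( %params, %$next );    # carries cmcontinue forward
    }
    return @titles;
}

# Invented replies standing in for two pages of API output.
my @canned = (
    {   query    => { categorymembers => [ { title => 'Page A' }, { title => 'Page B' } ] },
        continue => { cmcontinue => 'page|C|3', continue => '-||' },
    },
    { query => { categorymembers => [ { title => 'Page C' } ] } },
);
my $fake_fetch = sub { return shift @canned };

print join( ', ', category_members( 'Example', $fake_fetch ) ), "\n";
```

Passing the fetcher in as a code ref keeps the paging logic testable without touching the network.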

German Umlaute

If I use German umlauts in a MediaWiki article, I get page links like Zweidimensionale_H%C3%A4ufigkeitsverteilung_-_Zweidimensionale_H%C3%A4ufigkeitstabellen. If I pass this to get_text, I get empty content back. In a browser the link works; any idea what I can do? I extracted the link with Perl from the HTML of Special:Allpages. -- sigbert 14:50, 21 Aug 2008

I found a solution by modifying Perlwikipedia.pm. I added $self->{getesc}=0; under sub new, and under sub _get I replaced the line my $no_escape = shift || 0; with my $no_escape = shift || $self->{getesc};. Now I can control from outside whether uri_escape_utf8 is applied. --Sigbert (talk) 12:30, 17 September 2008 (UTC)
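
The fix described above suggests the title was being escaped a second time. The sketch below illustrates that effect and an alternative that leaves the module untouched: decode the link taken from the HTML before passing it on. The two helpers are core-only stand-ins written for this example (similar in purpose to uri_escape_utf8 and uri_unescape from URI::Escape), and their names are assumptions.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Stand-in for a UTF-8 percent-encoder.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $bytes = encode( 'UTF-8', $text );
    $bytes =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# Stand-in for the reverse step: percent-decode, then decode UTF-8.
sub percent_decode_utf8 {
    my ($escaped) = @_;
    ( my $bytes = $escaped ) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode( 'UTF-8', $bytes );
}

my $from_html = 'Zweidimensionale_H%C3%A4ufigkeitsverteilung';

# Escaping the already-escaped link turns each "%" into "%25".
my $twice = percent_encode_utf8($from_html);

# Decoding first yields the real title, which then escapes correctly.
my $title = percent_decode_utf8($from_html);
my $once  = percent_encode_utf8($title);

print "escaped twice: $twice\n";
print "escaped once:  $once\n";
```

With the decoded title in hand, a call such as get_text could be given the plain page name and the module's own escaping would then be applied exactly once.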

Dagothbot

At one of my wikis, I would like to develop a bot called DagothBot. I have XAMPP installed. Can Perlwikipedia work with the "perl.exe" file in XAMPP or do I need to download Perl from perl.org? I use x10Hosting to host the wiki. Dagoth Ur, Mad God 09:48, 18 September 2008 (UTC)