Wikipedia talk:Database download
- Please note that questions about the database download are more likely to be answered on the wikitech-l mailing list than on this talk page.
Downloading Wikipedia, step-by-step instructions
Can someone replace this paragraph with step-by-step instructions on how to download Wikipedia into a usable format for offline use?
- What software programs do you need? (I've read that MySQL is no longer being supported. I checked Apache's website but have no idea which of their many programs is needed to make a usable offline Wikipedia.)
- What are the web addresses of those programs?
- What version of those programs do you need?
- Is a dump what you need to download?
- Can someone write a brief layman's translation of the different files at http://download.wikimedia.org/enwiki/latest/ (what exactly each file contains, or what you need to download if you are trying to do a certain thing, e.g. you want the latest version of the articles in text only)? You can get a general idea from the file names, but it's still not very clear.
- What program do you need to download the files on that page?
- If you get Wikipedia downloaded and set up, how do you update it, or replace a single article with a newer version?
The instructions on m:Help:Running MediaWiki on Debian GNU/Linux tell you how to set up a fresh MediaWiki installation. From there, you will still need to get the XML dump and run mwdumper.
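To give a rough idea of what "getting the XML dump" involves, here is a minimal Python sketch that streams pages out of a bz2-compressed dump without unpacking it to disk. It is not mwdumper and does not import anything into a database. The function names are made up for illustration, and the sample XML (including its namespace URI) is only a tiny stand-in for a real pages-articles file.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, text) for every <page> in a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop this page's children to keep memory use low


# Tiny in-memory stand-in for a real pages-articles.xml.bz2 file (assumption).
sample = bz2.compress(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>A</title><revision><text>alpha</text></revision></page>"
    b"<page><title>B</title><revision><text>beta</text></revision></page>"
    b"</mediawiki>"
)
with bz2.open(io.BytesIO(sample), "rb") as stream:
    pages = list(iter_pages(stream))
```

For a real dump, the same function should work on `bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb")`, reading one page at a time.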
Image dumps are defective or inaccessible; image torrent tracker doesn't work either
All links to image dumps in this article are bad, and none of the torrents are working. As of now there is no way to download the images other than scraping the article pages. This will choke Wikipedia's bandwidth, but it looks like people have no other choice. I think sysops should take a look at this, unless of course this is intentional. Is it?
Why No Wikipedia Torrents?
I noticed everyone whining about downloads taking up precious Wikipedia bandwidth, which makes me wonder why the most popular and largest downloads are not available as torrents. Project Gutenberg has the option to download ebook CDs and DVDs via torrent. • Sbmeirow • Talk • 21:39, 21 December 2010 (UTC)
- The dumps change quickly. They are generated every month or so, so a dump you are downloading gets out of date very soon. By the way, there are some unofficial wiki torrents here: http://thepiratebay.org/search/wikipedia/0/99/0 This is being discussed here too. emijrp (talk) 13:56, 23 December 2010 (UTC)
- So what if the dumps change quickly? Just make publishing a new .torrent file part of the procedure for posting the bz2 files. It's not rocket science.
- I suggest you start a dialog with some of the Linux distributions (Debian, Fedora, et al.); they manage to distribute their ISOs via torrents with great success. I believe the Linux ISOs are of comparable size and frequency to Wikipedia dumps.
- As for unofficial torrents going up on thepiratebay: no matter how Wikipedia distributes the dumps, there will always be someone who attempts to publish their own torrents. Do what the Linux distribution people do and ignore them.
- Currently I'm getting a download rate of 11Kb/sec on these dumps. If you made them available as torrents, you would have next to no bandwidth problems. —Preceding unsigned comment added by 203.214.66.249 (talk) 23:46, 16 February 2011 (UTC)
- Another advantage to torrents is being able to download the dump on an unreliable network. — Preceding unsigned comment added by 61.128.234.243 (talk) 03:00, 10 June 2011 (UTC)
Copyright infringements
- Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information.
Does anyone else feel that, even though it says 'most' and copyright infringements probably are more common among images, the statement is potentially confusing, as it seems to suggest that copyright infringements are unique to images? I was thinking of something like 'or believed', and perhaps at the end 'Remember, some content may be copyright infringements (which should be deleted).' Nil Einne (talk) 14:39, 16 January 2011 (UTC)
- Is there any way to easily download all of these images? I could write a quick script to look at the tags on each File: page and filter out the non-free images.--RaptorHunter (talk) 22:53, 24 April 2011 (UTC)
Split dump into 26 parts (A to Z)
Splitting into volumes has been done for several hundred years to manage the huge size of encyclopedias. It would make the dump more manageable for users to download, and maybe it would help in creating the dump images too. Paragraph 2 states "The dump process has had difficulty scaling to cope with the size of the very largest wikis". Can you provide more info on what kind of difficulties were encountered? Why can't "traditional" splitting into A - Z help the dump process? 00:56, 3 February 2011 (UTC) — Preceding unsigned comment added by Luky83 (talk • contribs)
- AFAIK the main problem was that creating the dumps took a very long time, which should hopefully be solved now, and your proposal wouldn't help with that. Also, there are several reasons why it would be a bad idea:
- People would have to download 26 files instead of one. That's needlessly tedious.
- The dumps aren't read by people (at least not directly), but by programs. There is not much advantage for programs in this division.
- Some articles don't start with one of the 26 letters of the English alphabet, but that could be easily solved by a 27th part.
- Svick (talk) 13:39, 10 February 2011 (UTC)
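To make the "27th part" point concrete, here is a small Python sketch of how titles could be assigned to A-Z parts plus one catch-all. The function name and the "other" label are invented for this example; nothing like this is part of the actual dump process.

```python
def bucket_for(title):
    """Return 'A'..'Z' for titles starting with an English letter, else 'other'."""
    first = title[:1].upper()
    if len(first) == 1 and "A" <= first <= "Z":
        return first
    return "other"  # the hypothetical 27th part: digits, accents, symbols


titles = ["Paris", "apple", "4chan", "Éire"]
buckets = [bucket_for(t) for t in titles]
```

Titles such as "4chan" or "Éire" land in the catch-all part, which is the case the third point above refers to.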
Dump
Is there a dump of Wikipedia articles as clean, plain text, without the links and references markup like [[ ]] and <references>? I'm not sure whether this information is in the article. Thanks. 71.33.206.156 (talk) 10:24, 5 March 2011 (UTC)
- It should be easy to run something like sed or a Python script on the database dump to strip all of that formatting out. — Preceding unsigned comment added by RaptorHunter (talk • contribs) 02:47, 11 March 2011 (UTC)
- Use this tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor — Preceding unsigned comment added by 61.128.234.243 (talk) 03:42, 10 June 2011 (UTC)
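As a rough illustration of the "Python script" idea, here is a regex-based sketch that strips a few common constructs. It is not how the Wikipedia Extractor tool works and it is far from a full wikitext parser (tables, files, HTML comments and many edge cases are ignored). All names in it are made up for this example.

```python
import re

REF_SELF = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)                # <ref name="x"/>
REF_PAIR = re.compile(r"<ref\b[^>/]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")                              # innermost {{...}}
LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]")             # [[target|label]]
QUOTES = re.compile(r"'{2,}")                                         # ''italic'', '''bold'''


def strip_markup(wikitext):
    """Remove refs and templates, and reduce links to their visible label."""
    text = REF_SELF.sub("", wikitext)
    text = REF_PAIR.sub("", text)
    # Templates can nest, so peel off the innermost ones until nothing changes.
    previous = None
    while previous != text:
        previous = text
        text = TEMPLATE.sub("", text)
    text = LINK.sub(r"\1", text)
    return QUOTES.sub("", text)


example = (
    "'''Paris''' is the capital of [[France]]."
    '<ref name="a">{{cite web|url=x}}</ref>'
    " It lies on [[Seine|the Seine]]."
    '<ref name="a"/>{{Infobox|a={{b}}}}'
)
plain = strip_markup(example)
```

Here `plain` ends up as "Paris is the capital of France. It lies on the Seine."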
Database Torrent
I have created a Wikipedia Torrent so that other people can download the dump without wasting Wikipedia's bandwidth.
I will seed this for a while, but I am really hoping that Wikipedia could publish an official torrent itself every year containing validated articles (checked for vandalism and quality), and then seed that torrent indefinitely. This would reduce strain on the servers and make obtaining a database dump much simpler. It would also serve as a snapshot in time, so that users could browse the 2010 or 2011 Wikipedia.
Torrent Link — Preceding unsigned comment added by RaptorHunter (talk • contribs) 02:58, 11 March 2011 (UTC)
- At this exact moment, it looks like there are 15 seeders with 100% of the file available to share via torrent at http://thepiratebay.org/torrent/6234361 • Sbmeirow • Talk • 06:19, 25 April 2011 (UTC)
Wikipedia Namespace
I wrote a script to cut all of the Wikipedia namespace out of the database dump. This includes stuff like WP:AFD, WP:Village Pump, and hundreds of pages of archives that most people don't want. Removing it saves about 1.2 GiB, or 19%. I also cut out the File:, MediaWiki:, and Help: namespaces.
Tell me if you think those namespaces should be cut from my next torrent.--RaptorHunter (talk) 22:47, 24 April 2011 (UTC)
- It would make sense to remove stuff that isn't important in order to shrink the size, but unfortunately I don't know what should or shouldn't be in the database. I highly recommend that you contact all vendors and people that have written Wikipedia apps to ask for their input. • Sbmeirow • Talk • 06:11, 25 April 2011 (UTC)
- Is there any way you can re-script this to remove every page except those in the Template namespace, such as Template:Infobox user or Template:War? I ask because I've been looking to get hold of just the Template namespace, but everything I've seen requires several arbitrary steps that are both confusing and not documented well at all. --RowenStipe (talk) 10:43, 19 September 2011 (UTC)
- I have posted the script I use on pastebin. http://pastebin.com/7sg8eLeX
- Just edit the bad_titles= line to cut whatever namespaces you want. The usage comment at the top of the script explains how to run it on Linux. — Preceding unsigned comment added by 71.194.190.179 (talk) 22:22, 29 December 2011 (UTC)
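For readers who cannot reach the pastebin script, here is a minimal Python sketch of the same idea: deciding by title prefix which pages to drop. This is not the posted script. `BAD_PREFIXES` and `is_unwanted` are hypothetical names chosen for the example, and a real filter would apply the check while streaming through the dump.

```python
# Namespaces mentioned in this thread as candidates for removal (assumed list).
BAD_PREFIXES = ("Wikipedia:", "File:", "MediaWiki:", "Help:")


def is_unwanted(title, bad_prefixes=BAD_PREFIXES):
    """True if the page title falls in one of the namespaces to cut."""
    return title.startswith(bad_prefixes)


titles = [
    "Paris",
    "Wikipedia:Village pump",
    "Template:Infobox user",
    "File:Example.jpg",
    "Help:Contents",
]
kept = [t for t in titles if not is_unwanted(t)]

# The inverse request above: keep only pages in the Template namespace.
templates_only = [t for t in titles if t.startswith("Template:")]
```

Passing a tuple to `str.startswith` checks all prefixes in one call, so adding or removing a namespace only means editing the tuple.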
Full English Dumps Failing
This article explains that the full dumps of the English-language Wikipedia have been failing, but doesn't indicate whether or not the problem is going to be resolved. Can we expect to see a full extract any time soon, or even for this to become routine again? — Preceding unsigned comment added by Jdownie (talk • contribs) 11:56, 22 April 2011 (UTC)
May 2011 Database Dump
I have created a new torrent for the May 2011 database dump. This torrent resolves some of the problems with the old torrent. It's also a gigabyte smaller (I cut out all of the Wikipedia: namespace pages). --RaptorHunter (talk) 01:57, 29 May 2011 (UTC)
http://thepiratebay.org/torrent/6430796
June 2011 Database Dump much smaller than May 2011 Database Dump
It appears that the file http://download.wikimedia.org/enwiki/20110620/enwiki-20110620-pages-articles.xml.bz2 is 5.8 GB, whereas http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-articles.xml.bz2 is 6.8 GB. The big difference between these two files suggests a problem in the dump process. Any idea what might have happened here? Thanks! — Preceding unsigned comment added by 65.119.214.2 (talk) 20:28, 7 July 2011 (UTC)
- See a message on the xmldatadumps-l mailing list. User<Svick>.Talk(); 17:38, 8 July 2011 (UTC)
History and logs
Is there any chance of getting just the history pages and logs for the wiki? I am interested in researching editing patterns and behaviors and I don't need any article text (especially not every version of article text). Or is there a better place to get this information? — Bility (talk) 20:06, 15 August 2011 (UTC)
- Yes, go to http://dumps.wikimedia.org/enwiki/latest/ and download the pages-logging.xml.gz and stub-meta-history.xml.gz files only. emijrp (talk) 22:39, 15 August 2011 (UTC)
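For the editing-patterns use case, a hedged Python sketch of counting revisions per contributor from a gzip-compressed history stub. It assumes each revision carries a `<contributor>` element containing either a `<username>` or an `<ip>`. The function name and the tiny in-memory sample are invented for the example.

```python
import gzip
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_edits(xml_stream):
    """Count revisions per contributor (username or IP) in a history stub."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]  # ignore any XML namespace prefix
        if name in ("username", "ip"):
            counts[elem.text] += 1
        elif name == "page":
            elem.clear()  # free the finished page's revisions
    return counts


# Tiny in-memory stand-in for stub-meta-history.xml.gz (assumed structure).
sample = gzip.compress(
    b"<mediawiki><page><title>A</title>"
    b"<revision><contributor><username>Alice</username></contributor></revision>"
    b"<revision><contributor><ip>10.0.0.1</ip></contributor></revision>"
    b"</page><page><title>B</title>"
    b"<revision><contributor><username>Alice</username></contributor></revision>"
    b"</page></mediawiki>"
)
with gzip.open(io.BytesIO(sample), "rb") as stream:
    edits = count_edits(stream)
```

Because the file is read as a stream, the full history never has to fit in memory at once.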
Text Viewer
I have the latest dump of Wikipedia articles as an .xml file. I hope to sort it, keep only the articles on a few topics, and then load those into a MySQL database as a supplement to an online dictionary.
Which text editor do people use for the 33 GB XML text file (UltraEdit / Vim)? I only have 8 GB of RAM on this computer, and I'm not sure whether the editor will load just the part I'm currently viewing, use a scratch disk, or simply crash.
Also, the filename is enwiki-latest-pages-articles.xml.bz2.xml.bz2.xml - I used 7-Zip to expand the first level, but what would I use to uncompress next? Isn't XML already uncompressed?
Thanks to anyone who is familiar with this. --Adam00 (talk) 16:37, 13 October 2011 (UTC)
- It's not very rich in features (though it can run macros), but TextPad just opens the files as if they were 1 KB text files. When you scroll, it loads that part of the file. If you scroll a lot, it will fill up the RAM. 84.106.26.81 (talk) 23:45, 1 January 2012 (UTC)
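On the two practical questions above (is the file still compressed, and can part of it be viewed without loading all of it), here is a small Python sketch. It relies on bzip2 files starting with the bytes "BZh". The function names and the temporary sample files are made up for the example.

```python
import bz2
import os
import tempfile


def looks_like_bzip2(path):
    """True if the file starts with the bzip2 signature 'BZh'."""
    with open(path, "rb") as f:
        return f.read(3) == b"BZh"


def read_window(path, offset, size=4096):
    """Read `size` bytes starting at `offset` without loading the whole file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


# Small temporary files standing in for the real 33 GB dump.
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "sample.xml")
    packed = os.path.join(tmp, "sample.xml.bz2")
    with open(plain, "wb") as f:
        f.write(b"<mediawiki><page>example</page></mediawiki>")
    with open(packed, "wb") as f:
        f.write(bz2.compress(b"<mediawiki/>"))
    plain_is_bz2 = looks_like_bzip2(plain)
    packed_is_bz2 = looks_like_bzip2(packed)
    window = read_window(plain, 11, 6)
```

If the check is false for the expanded file, it is most likely already plain XML, whatever extensions the filename carries.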