Wikipedia talk:Database download


Please note that questions about the database download are more likely to be answered on the wikitech-l mailing list than on this talk page.

freecache.org

Would adding http://freecache.org to the database dump files save bandwidth? Perhaps this could be tried for files smaller than 1 GB and larger than 5 MB. -Alterego

FreeCache has been deactivated. Edward (talk) 23:06, 19 October 2010 (UTC)[reply]
    • Also, in relation to incremental updates, it seems that the only reason FreeCache wouldn't work is because old files aren't accessed often, so they aren't cached. If some method could be devised whereby everyone who needed incremental updates accessed the files within the same time period, perhaps via an automated client, you could better utilize the ISP's bandwidth. I could be way off here. -Alterego

Downloading Wikipedia, step-by-step instructions

Can someone replace this paragraph with step-by-step instructions on how to download Wikipedia into a usable format for offline use?

  • What software programs do you need? (I've read that MySQL is no longer supported; I checked Apache's website but have no idea which of their many programs is needed to make a usable offline Wikipedia.)
  • What are the web addresses of those programs?
  • What version of those programs do you need?
  • Is a dump what you need to download?
  • Can someone write a brief layman's explanation of the different files at http://download.wikimedia.org/enwiki/latest/ (what exactly each file contains, or what you need to download if you are trying to do a certain thing, e.g. you want the latest version of the articles in text only)? You can get a general idea from the title, but it's still not very clear.
  • What program do you need to download the files on that page?
  • If you get Wikipedia downloaded and set up, how do you update it, or replace a single article with a newer version?

The instructions at m:Help:Running MediaWiki on Debian GNU/Linux tell you how to install a fresh MediaWiki installation. From there, you will still need to get the XML dump and run mwdumper.
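
For reference, the import step typically looks something like the following (a sketch based on the usual mwdumper invocation; the database name, user and exact flags depend on your installation):

    # stream the compressed dump through mwdumper into MySQL;
    # "wikiuser" and "wikidb" are placeholders for your own setup
    java -jar mwdumper.jar --format=sql:1.5 enwiki-latest-pages-articles.xml.bz2 \
      | mysql -u wikiuser -p wikidb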

Database size?

Huge article, lots of information, but not a single hint as to the approximate size of the Wikipedia database... Can anyone help me find that out?--Procrastinating@talk2me 13:13, 25 December 2006 (UTC)[reply]

The last en.wiki dump I downloaded, enwiki-20061130-pages-articles.xml, expands to 7.93 GB. Reedy Boy 14:39, 25 December 2006 (UTC)[reply]
I think the indexes need an additional 30GB or so. I'm not sure yet because for the latest dump rebuildall.php has not completed after several days. (SEWilco 00:35, 22 January 2007 (UTC))[reply]
There's no information as to the size of the images, sound, video, etc. The only data I could find was this link from 2008, which said the images were 420 GB. This is probably all of the languages put together, since articles share images. But I agree, the article should include some information about the DATABASE SIZE, since no one is ever going to print Wikipedia on paper. Brinerustle (talk) 21:06, 22 October 2009 (UTC)[reply]

Wikimedia Commons images: 7M and growing

Some info about the growth rate of uploaded images per month. Sizes are in MB. The largest monthly batch of images was uploaded in 2010-07: 207,946 images, +345 GB, 1.6 MB per image on average. Regards. emijrp (talk) 11:36, 16 August 2010 (UTC)[reply]

date	images count	total size (MB)	avg size per image (MB)
2003-1	1	1.29717636	1.297176361084
2004-9	685	180.10230923	0.262923079163
2004-10	4171	607.82356930	0.145726101486
2004-11	3853	692.86077213	0.179823714543
2004-12	6413	1433.85031033	0.223584954050
2005-1	7754	1862.35064888	0.240179345999
2005-2	8500	2974.20539951	0.349906517590
2005-3	12635	3643.41866493	0.288359213687
2005-4	15358	5223.11017132	0.340090517731
2005-6	18633	6828.90809345	0.366495362714
2005-7	19968	8779.54927349	0.439680953200
2005-11	21977	9490.01213932	0.431815631766
2005-5	27811	10488.68217564	0.377141497092
2005-12	28540	10900.89676094	0.381951533320
2005-8	29787	12007.92203808	0.403126264413
2005-9	28484	13409.05960655	0.470757604499
2006-2	36208	14259.63428211	0.393825515966
2006-1	35650	14718.57859612	0.412863354730
2005-10	31436	16389.36539268	0.521356578212
2006-3	41763	18516.05062675	0.443360166338
2006-4	46645	21974.19114399	0.471094246843
2006-5	49399	28121.86408234	0.569280027578
2006-6	50893	28473.23626709	0.559472545676
2006-12	52577	29190.01748085	0.555186060080
2006-7	58384	30763.06628609	0.526909192349
2006-10	59611	32019.11058044	0.537134263482
2006-11	57070	32646.13846588	0.572036770035
2006-9	64991	35881.70751190	0.552102714405
2007-2	64787	37625.21335030	0.580752517485
2007-1	71989	38771.36503792	0.538573463139
2006-8	78263	48580.74944115	0.620737122793
2007-3	97556	51218.63132572	0.525017746994
2007-6	92656	60563.04164696	0.653633241743
2007-4	116127	69539.80562592	0.598825472336
2007-5	94416	69565.31412792	0.736795819860
2007-12	96256	72016.71782589	0.748179000020
2007-7	110470	72699.43968487	0.658092148863
2008-2	103295	73830.05222321	0.714749525371
2007-11	118250	78178.20839214	0.661126498031
2008-1	114507	80664.45367908	0.704449978421
2008-3	115732	84991.75799370	0.734384249764
2007-10	112011	85709.30096245	0.765186463494
2007-8	120324	87061.07968998	0.723555397842
2008-4	102342	93631.80365562	0.914891282715
2007-9	125487	95482.06631756	0.760892094939
2008-6	101048	95703.97241211	0.947113969718
2008-11	120943	110902.88121986	0.916984705356
2008-5	116491	112381.90718269	0.964726091996
2008-9	134362	114676.82158184	0.853491475133
2008-10	114676	116692.37883282	1.017583267927
2009-2	127162	119990.37194729	0.943602427984
2008-12	193181	120587.25548649	0.624219025093
2009-1	126947	121332.83420753	0.955775514250
2008-7	119541	122324.61021996	1.023285820095
2008-8	119471	124409.18394279	1.041333745786
2010-8	89776	164473.27284527	1.832040554773
2009-9	132803	176394.38845158	1.328240991932
2010-2	158509	182738.31155205	1.152857639327
2009-5	143389	182773.11001110	1.274666187860
2009-6	160070	186919.01145649	1.167732938442
2009-11	178118	188819.04353809	1.060078394874
2009-4	196699	202346.40691471	1.028710908112
2010-3	178439	206399.28073311	1.156693776210
2010-1	361253	216406.85650921	0.599045147055
2009-12	143797	217440.78340054	1.512137133602
2009-7	164040	230185.27826881	1.403226519561
2009-8	167079	250747.52404118	1.500772233741
2009-3	209453	260923.17462921	1.245736153835
2010-6	173658	270246.74141026	1.556200931775
2010-4	208518	297715.36278248	1.427768167652
2010-5	203093	297775.34260082	1.466201900611
2009-10	272818	329581.74736500	1.208064524207
2010-7	207946	345394.14518356	1.660979990880
This is 6.95788800953294 TB in total. Rich Farmbrough, 15:43, 17 November 2010 (UTC).[reply]

Image dumps are defective or inaccessible; images torrent tracker doesn't work either

All links to image dumps in this article are bad, and none of the BitTorrent links are working. As of now there is no way to download the images other than scraping the article pages. This will choke Wikipedia's bandwidth, but it looks like people have no other choice. I think sysops should take a look at this, unless of course this is intentional... is it?

Another offline Wikipedia idea...But would this be acceptable (re)use of Wikipedia's data?

Hi everyone! :-)

After reading the Download Wikipedia pages earlier today, a few ideas have popped into my mind for a compacted, stripped down version of Wikipedia that would (Hopefully) end up being small enough to fit on standard DVD-5 media, and might be a useful resource for communities and regions where Internet access is either unavailable or prohibitively expensive. The idea that I've currently got in mind is primarily based on the idea of storing everything in 6-bit encoding (Then BZIPped) which - Though leading to a loss of anything other than very basic formatting and line/paragraph breaks - Would still retain the text of the articles themselves, which is the most important part of Wikipedia's content! :-)

Anyhow, I'm prone to getting ideas in mind, and then never acting upon them or getting sidetracked by other things...So unless I ever get all of the necessary bits 'n' pieces already coded and ready for testing, I'm not even going to consider drinking 7.00GB of Wikipedia's bandwidth on a mere personal whim. That said - Before I give more thought to the ideas in mind - I just wanted to check a few things with more knowledgeable users:

  • I believe that all text content on Wikipedia is freely available/downloadable under the GFDL (Including everything in pages-articles.xml.bz2). Is this correct, or are there copyrighted text passages that I might have to strip out in any such derivative work?
  • Given that my idea centres around preserving only the article texts in their original form - With formatting, TeX notation, links to external images/pages etc. removed where felt necessary - Would such use still fall under the scope of the GFDL, or would I be regarded as having created a plagiarised work using Wikipedia's data?
  • Because of the differing sizes of assorted media - And because, if I actually got around to doing something with this idea, I'd rather create something that could be scaled up or down to fit the target media - Someone could create a smaller offline Wikipedia to fit on a 2GB pen-drive using a DVD version (~4.37GB) as the source if they so needed. Because such smaller versions would be made simply and quickly by just truncating the index and article files as appropriate, it'd help to have the most useful/popular articles sorted to the start of the article file. Therefore: Are any hit counts (Preferably counting only human visits - Not automated ones) kept against article pages, and - If so - Can these hit counts be downloaded separately, without having to download the huge ~36GB Wiki dump?
  • Finally: For obvious reasons, I wouldn't want to try downloading the whole 6.30GB compressed articles archive until I was certain that I'd created a workable conversion process and storage format that warranted a full-scale test...But having the top 5-10MB of the compressed dump would be useful for earlier testing and bug blasting. Would using Wget to do this on an occasional basis cause any problems or headaches?

Farewell for now, and many thanks in advance for any advice! >:-)

+++ DieselDragon +++ (Talk) - 22 October 2010CE = 23:44, 22 October 2010 (UTC)[reply]

I don't think a bzipped 6-bit-encoded image would be any smaller than bzipped 8-bit-encoded. Compression is about information theory, and the information in both versions is the same. Wizzy 07:39, 23 October 2010 (UTC)[reply]
I was thinking of using 6-bit encoding with a constrained alphabet/character table and possibly a database for common words to preserve the encyclopaedic text on its own (The element that I consider the most important) whilst disposing of less crucial elements such as images, wiki-markup, XML wrappers, and possibly links. This would serve the purpose of stripping out everything bar the core content itself, which should (In theory) reduce the size of the encyclopedia to manageable everyday proportions whilst still keeping all of the essential elements. :-)
A simple analogy would be taking a newspaper - Removing all images, adverts, visual formatting and other unnecessary elements - And keeping just the text on its own...Resulting in a document that's much smaller (In data terms) than the original. Using the encoding that I have in mind, the text-string "Wikipedia" which normally occupies 72 bits on its own would be reduced to 54 bits. That said, my understanding of compression processes has never reached past the fundamental stage (LZW gives me enough of a headache!) so I don't know if 6-bit would compress as well as its 8-bit equivalent. :-)
+++ DieselDragon +++ (Talk) - 26 October 2010CE = 14:10, 26 October 2010 (UTC)[reply]
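For anyone curious, the 72-to-54-bit arithmetic above can be illustrated with a toy packer (a sketch only; the 64-character alphabet below is an invented example, not a proposed spec):

    # Toy illustration of 6-bit packing: map a 64-character alphabet to
    # 6-bit codes and pack them into bytes. The alphabet is made up here.
    ALPHABET = ("abcdefghijklmnopqrstuvwxyz"
                "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "0123456789 .")
    CODE = {ch: i for i, ch in enumerate(ALPHABET)}

    def pack6(text):
        bits, nbits, out = 0, 0, bytearray()
        for ch in text:
            bits = (bits << 6) | CODE[ch]   # append a 6-bit code
            nbits += 6
            while nbits >= 8:               # emit full bytes as they fill up
                nbits -= 8
                out.append((bits >> nbits) & 0xFF)
        if nbits:
            out.append((bits << (8 - nbits)) & 0xFF)  # pad the last byte
        return bytes(out)

    word = "Wikipedia"
    print(len(word) * 8, "bits as 8-bit text ->", len(word) * 6,
          "bits packed, stored in", len(pack6(word)), "bytes")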
I agree with Wizzy that using some custom encoding won't help you much, if at all. Using some other compression algorithm (7zip?) would probably give you better results. Also, downloading the dump for yourself, even just to try something out isn't wasting Wikimedia bandwidth – those files are published so that they can be downloaded. Using wget the way you describe should be okay too.
Wikipedia switched licenses, so it's now licensed under CC-BY-SA. That means that you can do pretty much anything with the texts (including changing them in any way) as long as you give proper credit to the authors and release your modifications under the same license.
Svick (talk) 17:25, 28 October 2010 (UTC)[reply]
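In practice, grabbing just the first few megabytes for testing might look like this (a sketch; the byte count is arbitrary, the URL is the usual "latest" path, and it assumes the download server honours HTTP range requests):

    # fetch roughly the first 10 MB of the compressed dump
    curl -r 0-10485759 -o sample.xml.bz2 \
      http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Since bzip2 compresses in independent blocks, the truncated archive should still decompress up to the last complete block.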
Have you looked at Wikipedia:Version_1.0_Editorial_Team?Smallman12q (talk) 22:20, 28 October 2010 (UTC)[reply]
  • Text only (articles, redirects, templates and categories) would fit on a DVD, I trow. If not, then the place to look for savings is pulling out HTML comments, spurious spaces, persondata, external links (since we are positing no Internet access), interwiki links, maybe infoboxes, nav-boxes and so on.
  • Best compression is apparently arithmetic encoding, and certainly the 7z/b7 compression seems far more efficient.
  • Hit data is available; it is about 50 MB per hour, compressed. I have been thinking about consolidating this, for other reasons. Let me know if you decide to go ahead with this and I will look harder at doing that.

Regards, Rich Farmbrough, 10:47, 17 November 2010 (UTC).[reply]

What is pages-meta-current.xml.bz2?

I note that it isn't currently possible to download a data dump due to server maintenance issues; however, I did find a dump of a file called enwiki-20091017-pages-meta-current.xml.bz2 on BitTorrent via The Pirate Bay. Is this the same data as pages-articles.xml.bz2?

One further query: has the Wikimedia Foundation considered making its data dumps available via BitTorrent? If you did, it might remove some load from your servers. -- Cabalamat (talk) 01:40, 14 December 2010 (UTC)[reply]

The download server is up now. As far as I know, using BitTorrent wasn't considered, presumably because the load from the downloads is not a problem (but there is nothing stopping you or anyone else from creating those torrents). As for your question, the difference between pages-meta-current and pages-articles is that the former contains all pages, while the latter doesn't contain talk pages and user pages. Svick (talk) 01:34, 20 December 2010 (UTC)[reply]

Where is the Dump of this Wikia?

Hi all, a couple of days ago I posted my problem on the LyricsWiki Help Desk page, but nobody answered! I thought there was a relationship between Wikia projects and Wikipedia, so I decided to ask here. This is the problem I wrote there:

"Hi, I'm lookin' for this wikia's dump, like any other wikia I went to "Statistics" page, and the Dump file (Which were in the bottom of the page in other Wikias) was not there!! Why there is not any dumps? Isn't it free in all Wiki projects? "

So, [1] should contain a dump of LyricsWiki, but it doesn't! Why? Isn't it a free download? -- MehranVB talk | mail 16:14, 21 December 2010 (UTC)[reply]

Wikia and Wikipedia aren't really related, so this is a bad place to ask. I'd suggest asking at Wikia's Community Central Forum. Svick (talk) 22:54, 25 December 2010 (UTC)[reply]
Thanks! --MehranVB talk | mail 06:35, 26 December 2010 (UTC)[reply]

Why No Wikipedia Torrents?

I noticed everyone whining about downloads taking up precious Wikipedia bandwidth, which makes me wonder why the most popular and largest downloads are NOT available as torrents. Project Gutenberg has the option to download ebook CDs and DVDs via torrent. • SbmeirowTalk • 21:39, 21 December 2010 (UTC)[reply]

The dumps change quickly; they are generated every month or so, so by the time you are downloading a dump, it gets out of date very soon. By the way, there are some unofficial wiki torrents here: http://thepiratebay.org/search/wikipedia/0/99/0 This is being discussed here too. emijrp (talk) 13:56, 23 December 2010 (UTC)[reply]
So what if the dumps change quickly? Just make publishing a new .torrent file part of the procedures for posting the bz2 files. It's not rocket science.
I suggest you start a dialog with some of the Linux distributions (Debian, Fedora, et al.) - they manage to distribute their ISOs via torrents with great success. I believe the Linux ISOs are of comparable size and frequency to Wikipedia dumps.
As for unofficial torrents going up on The Pirate Bay - it doesn't matter how Wikipedia distributes the dumps, there will always be someone who attempts to publish their own torrents. Do what the Linux distribution people do - ignore them.
Currently I'm getting a download rate of 11Kb/sec on these dumps. If you made them available as torrents you would have next to no bandwidth problems. —Preceding unsigned comment added by 203.214.66.249 (talk) 23:46, 16 February 2011 (UTC)[reply]
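For what it's worth, generating the .torrent file itself is a one-liner with a tool such as mktorrent (the tracker URL below is a placeholder, and the dump filename is just an example):

    mktorrent -a http://tracker.example.org/announce \
      -o enwiki-20110526-pages-articles.torrent \
      enwiki-20110526-pages-articles.xml.bz2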
I agree 100% with the above comment. I'm sure it wouldn't be difficult to find people to get involved as part of the torrent swarm. • SbmeirowTalk • 02:21, 17 February 2011 (UTC)[reply]
I haven't heard anyone from WMF complaining about dump bandwidth, so I think this is a solution to a problem that doesn't exist. Svick (talk) 16:52, 19 February 2011 (UTC)[reply]
Bandwidth isn't my issue; the time it takes ME to download through the Wikipedia TINY STRAW is the issue! • SbmeirowTalk • 21:32, 19 February 2011 (UTC)[reply]
Ah, sorry I misread. That's odd, when I tried downloading a dump just now, the speed was between 500 KB/s and 1 MB/s most of the time. And I didn't have problems with slow downloads in the past either. Svick (talk) 21:59, 19 February 2011 (UTC)[reply]
Another advantage to torrents is being able to download the dump on an unreliable network. — Preceding unsigned comment added by 61.128.234.243 (talk) 03:00, 10 June 2011 (UTC)[reply]

Copyright infringements

Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information.

Does anyone else feel that - even though it says 'most', and it's probably true that copyright infringements are more common among images - the statement is potentially confusing, as it seems to suggest copyright infringements are something unique to images? I was thinking of something like 'or believed'. And perhaps at the end: 'Remember some content may be a copyright infringement (which should be deleted).' Nil Einne (talk) 14:39, 16 January 2011 (UTC)
Is there any way to easily download all of these images? I could write a quick script to look at the tags on each File: page and filter out the non-free images.--RaptorHunter (talk) 22:53, 24 April 2011 (UTC)[reply]
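
A rough sketch of what such a script could look like, using the MediaWiki API to fetch the categories of a File: page and treating anything in a "non-free" category as unusable (the "non-free" test is a heuristic for illustration, not an official flag):

    # check whether a File: page sits in any "non-free" category
    import requests  # third-party HTTP library, assumed available

    API = "https://en.wikipedia.org/w/api.php"

    def is_probably_free(file_title):
        params = {"action": "query", "titles": file_title,
                  "prop": "categories", "cllimit": "max", "format": "json"}
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            for cat in page.get("categories", []):
                if "non-free" in cat["title"].lower():
                    return False
        return True

    print(is_probably_free("File:Example.jpg"))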

Split dump into 26 parts (A to Z)

It has been done for several hundred years to manage the huge sizes of encyclopedias. It would be more manageable for users to download, and maybe it would help in creating the dump images too. Paragraph 2 states "The dump process has had difficulty scaling to cope with the size of the very largest wikis". Can you provide more info on what kind of difficulties were encountered? Why can't "traditional" splitting into A - Z help the dump process? 00:56, 3 February 2011 (UTC) — Preceding unsigned comment added by Luky83 (talkcontribs)

AFAIK the main problem was that creating the dumps took a very long time, which should hopefully be solved now, and your proposal wouldn't help with that. Also, there are several reasons why it would be a bad idea:
  1. People would have to download 26 files instead of one. That's needlessly tedious.
  2. The dumps aren't read by people (at least not directly), but by programs. There is not much advantage for programs in this division.
  3. Some articles don't start with one of the 26 letters of the English alphabet, but that could be easily solved by a 27th part.
Svick (talk) 13:39, 10 February 2011 (UTC)[reply]

Dump

Is there a dump of Wikipedia articles without the brackets? Clean text without the bracketed links and references, like [[ ]] and <ref>s? Clean and plain text. I'm just not sure whether I can find this information in the article. Thanks. 71.33.206.156 (talk) 10:24, 5 March 2011 (UTC)[reply]

It should be easy to run something like sed or a python script on the database dump to strip all of that formatting out. — Preceding unsigned comment added by RaptorHunter (talkcontribs) 02:47, 11 March 2011 (UTC)[reply]
Use this tool: http://medialab.di.unipi.it/wiki/Wikipedia_Extractor — Preceding unsigned comment added by 61.128.234.243 (talk) 03:42, 10 June 2011 (UTC)[reply]
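As a very rough illustration of the sed/Python approach suggested above, something like the following strips the most common markup; real wikitext (nested templates, tables) is messier, so treat it only as a starting point:

    import re

    def strip_markup(text):
        text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # <ref>...</ref>
        text = re.sub(r"<ref[^/>]*/>", "", text)                          # self-closing refs
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)                        # simple templates
        text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)     # [[link|label]] -> label
        return re.sub(r"'{2,}", "", text)                                 # bold/italic quotes

    print(strip_markup("'''Foo''' is a [[bar|baz]].<ref>source</ref>"))
    # prints: Foo is a baz.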

Database Torrent

I have created a Wikipedia Torrent so that other people can download the dump without wasting Wikipedia's bandwidth.

I will seed this for a while, but I am really hoping that Wikipedia could publish an official torrent itself every year containing validated articles (checked for vandalism and quality), and then seed that torrent indefinitely. This would reduce strain on the servers and make the process of obtaining a database dump much simpler and easier. It would also serve as a snapshot in time, so that users could browse the 2010 or 2011 Wikipedia.

Torrent Link — Preceding unsigned comment added by RaptorHunter (talkcontribs) 02:58, 11 March 2011 (UTC)[reply]

At this exact moment, it looks like there are 15 seeders with 100% of the file available to share via torrent at http://thepiratebay.org/torrent/6234361 • SbmeirowTalk • 06:19, 25 April 2011 (UTC)[reply]

Wikipedia Namespace

I wrote a script to cut out all of the Wikipedia namespace from the database dump. This includes stuff like WP:AFD, WP:Village Pump, and hundreds of pages of archives that most people don't want. Removing it saves about 1.2 GiB, or 19%. I also cut out the File:, MediaWiki: and Help: namespaces.

Tell me if you think those namespaces should be cut from my next torrent.--RaptorHunter (talk) 22:47, 24 April 2011 (UTC)[reply]
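For anyone wanting to try something similar, the core of such a filter can be sketched like this - streaming the dump and keeping only pages whose titles don't start with the unwanted prefixes (it assumes the standard <page>/<title> layout of the XML export, and it only prints the kept titles rather than rebuilding a dump):

    import bz2
    import xml.etree.ElementTree as ET

    SKIP_PREFIXES = ("Wikipedia:", "File:", "MediaWiki:", "Help:")

    def local(tag):
        return tag.rsplit("}", 1)[-1]   # drop the XML namespace prefix

    def kept_titles(dump_path):
        with bz2.BZ2File(dump_path) as f:
            for _, elem in ET.iterparse(f):
                if local(elem.tag) == "page":
                    title = next(c.text for c in elem if local(c.tag) == "title")
                    if not title.startswith(SKIP_PREFIXES):
                        yield title
                    elem.clear()        # free memory as we go

    for title in kept_titles("enwiki-latest-pages-articles.xml.bz2"):
        print(title)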

It would make sense to remove stuff that isn't important so as to shrink the size, but unfortunately I don't know what should or shouldn't be in the database. I highly recommend that you contact all vendors / people that have written Wikipedia apps to ask for their input. • SbmeirowTalk • 06:11, 25 April 2011 (UTC)[reply]
Is there any way you can re-script this to remove every article except those in the Template namespace, such as Template:Infobox user or Template:War? I ask because I've been looking to get hold of just the Template namespace, but everything I've seen requires several arbitrary steps that are both confusing and poorly documented. --RowenStipe (talk) 10:43, 19 September 2011 (UTC)[reply]

Full English Dumps Failing

This article explains that the full dumps of the English-language Wikipedia have been failing, but doesn't indicate whether or not the problem is going to be resolved. Can we expect to see a full extract any time soon, or even for this to become routine again? — Preceding unsigned comment added by Jdownie (talkcontribs) 11:56, 22 April 2011 (UTC)[reply]

May 2011 Database Dump.

I have created a new torrent for the May 2011 database dump. This torrent resolves some of the problems with the old torrent. It's also a gigabyte smaller (I cut out all of the Wikipedia: namespace pages). --RaptorHunter (talk) 01:57, 29 May 2011 (UTC)[reply]

http://thepiratebay.org/torrent/6430796

June 2011 Database Dump much smaller than May 2011 Database Dump

It appears that the http://download.wikimedia.org/enwiki/20110620/enwiki-20110620-pages-articles.xml.bz2 file is 5.8GB, whereas http://download.wikimedia.org/enwiki/20110526/enwiki-20110526-pages-articles.xml.bz2 is 6.8GB. The big difference between these two files suggests a problem in the dump process. Any idea about what might have happened here? Thanks! — Preceding unsigned comment added by 65.119.214.2 (talk) 20:28, 7 July 2011 (UTC)[reply]

See a message on the xmldatadumps-l mailing list. User<Svick>.Talk(); 17:38, 8 July 2011 (UTC)[reply]

History and logs

Is there any chance of getting just the history pages and logs for the wiki? I am interested in researching editing patterns and behaviors and I don't need any article text (especially not every version of article text). Or is there a better place to get this information? — Bility (talk) 20:06, 15 August 2011 (UTC)[reply]

Yes, go to http://dumps.wikimedia.org/enwiki/latest/ and download the pages-logging.xml.gz and stub-meta-history.xml.gz files only. emijrp (talk) 22:39, 15 August 2011 (UTC)[reply]
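Concretely, that download might look something like this (the file names assume the usual enwiki-latest- prefix used in that directory; check the listing for the exact names):

    wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-logging.xml.gz
    wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-stub-meta-history.xml.gz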
Excellent, thank you! — Bility (talk) 23:57, 15 August 2011 (UTC)[reply]

Text Viewer

I have the latest dump of Wikipedia articles as an .xml file, in the hope of sorting it and using only articles on a few topics, then uploading them to a MySQL database as a supplement to an online dictionary.

Which text editor do people use for the 33 GB XML file (UltraEdit / Vim)? I only have 8 GB of RAM on this computer; I'm not sure if an editor will just load whatever part I'm currently viewing, use a scratch disk, or just crash.

Also, the filename is enwiki-latest-pages-articles.xml.bz2.xml.bz2.xml - I used 7-Zip to expand the first level, but what would I use to uncompress the next? Isn't XML already uncompressed?

Thanks to anyone who is familiar with this.
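
For reference, one way to look inside a file that size without a text editor is to stream or split it from the command line (a sketch assuming GNU tools; the filename matches the one above):

    # look at the first few thousand lines without loading the whole file
    head -n 5000 enwiki-latest-pages-articles.xml.bz2.xml.bz2.xml | less

    # or split it into 1 GB pieces that an ordinary editor can open
    split -b 1G enwiki-latest-pages-articles.xml.bz2.xml.bz2.xml part_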