Wikipedia:Database download: Difference between revisions
Rescuing 5 sources and tagging 0 as dead.) #IABot (v2.0.9.3 Tags: Reverted IABotManagementConsole [1.2] |
→Offline Wikipedia readers: removed broken section link, per talk page |
||
(19 intermediate revisions by 11 users not shown) | |||
Line 1: | Line 1: | ||
{{Short description| |
{{Short description|Downloading dumps of the wiki database}} |
||
{{for|scheduling, related tools etc.|m:Data dumps{{!}}Data dumps on Meta-Wiki}} |
{{for|scheduling, related tools etc.|m:Data dumps{{!}}Data dumps on Meta-Wiki}} |
||
{{redirect|WP:DD|deletion discussions|Wikipedia:Deletion discussions}} |
{{redirect|WP:DD|deletion discussions|Wikipedia:Deletion discussions}} |
||
Line 5: | Line 5: | ||
{{Wikipedia how-to|WP:DUMP|WP:DUMPS}} |
{{Wikipedia how-to|WP:DUMP|WP:DUMPS}} |
||
{{Reader help}} |
{{Reader help}} |
||
Wikipedia offers free copies of all available content to interested users. These databases can be used for [[Wikipedia:Mirrors and forks|mirroring]], personal use, informal backups, offline use or database queries (such as for [[Wikipedia:Maintenance]]). All text content is licensed under the [[Wikipedia:Text of Creative Commons Attribution-ShareAlike 3.0 Unported License|Creative Commons Attribution-ShareAlike 3.0 License]] (CC-BY-SA), and most is additionally licensed under the [[Wikipedia:Text of the GNU Free Documentation License|GNU Free Documentation License]] (GFDL).<ref>See {{section link|Wikipedia:Reusing_Wikipedia_content#Re-use_of_text_under_the_GNU_Free_Documentation_License}} for more information on compatibility with the GFDL.</ref> Images and other files are available under [[Wikipedia:Image copyright tags|different terms]], as detailed on their description pages. For our advice about complying with these licenses, see [[Wikipedia:Copyrights]]. |
Wikipedia offers free copies of all available content to interested users. These databases can be used for [[Wikipedia:Mirrors and forks|mirroring]], personal use, informal backups, offline use or database queries (such as for [[Wikipedia:Maintenance]]). All text content is licensed under the [[Wikipedia:Text of the Creative Commons Attribution-ShareAlike 3.0 Unported License|Creative Commons Attribution-ShareAlike 3.0 License]] (CC-BY-SA), and most is additionally licensed under the [[Wikipedia:Text of the GNU Free Documentation License|GNU Free Documentation License]] (GFDL).<ref>See {{section link|Wikipedia:Reusing_Wikipedia_content#Re-use_of_text_under_the_GNU_Free_Documentation_License}} for more information on compatibility with the GFDL.</ref> Images and other files are available under [[Wikipedia:Image copyright tags|different terms]], as detailed on their description pages. For our advice about complying with these licenses, see [[Wikipedia:Copyrights]]. |
||
== Offline Wikipedia readers == |
== Offline Wikipedia readers == |
||
Some of the many ways to read Wikipedia while offline: |
Some of the many ways to read Wikipedia while offline: |
||
* [[ |
* [[Kiwix]]: ({{Section link||Kiwix}}) - [https://library.kiwix.org/#lang=eng index of images] (2024) |
||
* [[ |
* [[XOWA]]: ({{Section link||XOWA}}) - [http://xowa.org/home/wiki/Dashboard/Image_databases.html index of images] (2015) |
||
* WikiTaxi: {{Section link||WikiTaxi (for Windows)}} |
* WikiTaxi: {{Section link||WikiTaxi (for Windows)}} |
||
* aarddict: {{Section link||Aard Dictionary}} |
* aarddict: {{Section link||Aard Dictionary / Aard 2}} |
||
* BzReader: {{Section link||BzReader and MzReader (for Windows)}} |
* BzReader: {{Section link||BzReader and MzReader (for Windows)}} |
||
⚫ | |||
* Wiki as E-Book: {{Section link||E-book}} |
|||
* WikiFilter: {{Section link||WikiFilter}} |
* WikiFilter: {{Section link||WikiFilter}} |
||
* Wikipedia on rockbox: {{Section link||Wikiviewer for Rockbox}} |
* Wikipedia on rockbox: {{Section link||Wikiviewer for Rockbox}} |
||
⚫ | |||
Some of them are mobile applications – see "[[list of Wikipedia mobile applications]]". |
Some of them are mobile applications – see "[[list of Wikipedia mobile applications]]". |
||
==Where do I get |
==Where do I get the dumps?== |
||
===English-language Wikipedia=== |
===English-language Wikipedia=== |
||
* Dumps from any Wikimedia Foundation project: {{URL|//dumps.wikimedia.org/}} and the [[iarchive:wikimediadownloads|Internet Archive]] |
* Dumps from any Wikimedia Foundation project: {{URL|//dumps.wikimedia.org/}} and the [[iarchive:wikimediadownloads|Internet Archive]] |
||
Line 61: | Line 60: | ||
Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is (as of September 2013) available from mirrors but not offered directly from Wikimedia servers. See the [[m:Mirroring Wikimedia project XML dumps#Media|list of current mirrors]]. You should [[Rsync|rsync]] from the mirror, then fill in the missing images from [//upload.wikimedia.org upload.wikimedia.org]; when downloading from <code>upload.wikimedia.org</code> you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate [[user agent]] string with contact info (email address) so ops can contact you if there's an issue. You should be getting checksums from the mediawiki API and verifying them. The [[mw:API:Etiquette|API Etiquette]] page contains some guidelines, although not all of them apply (for example, because upload.wikimedia.org isn't MediaWiki, there is no <code>maxlag</code> parameter). |
Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is (as of September 2013) available from mirrors but not offered directly from Wikimedia servers. See the [[m:Mirroring Wikimedia project XML dumps#Media|list of current mirrors]]. You should [[Rsync|rsync]] from the mirror, then fill in the missing images from [//upload.wikimedia.org upload.wikimedia.org]; when downloading from <code>upload.wikimedia.org</code> you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate [[user agent]] string with contact info (email address) so ops can contact you if there's an issue. You should be getting checksums from the mediawiki API and verifying them. The [[mw:API:Etiquette|API Etiquette]] page contains some guidelines, although not all of them apply (for example, because upload.wikimedia.org isn't MediaWiki, there is no <code>maxlag</code> parameter). |
||
Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many [[Wikipedia:File copyright tags/Free licenses|free licenses]], in the [[Wikipedia:File copyright tags/Public domain|public domain]], believed to be [[Wikipedia:File copyright tags/Non-free|fair use]], or even copyright infringements (which should be [[WP:IFD|deleted]]). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from [//dumps.wikimedia.org/ dumps.wikimedia.org]. In conclusion, download these images at your own risk ([//dumps.wikimedia.org/legal.html Legal]) |
Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many [[Wikipedia:File copyright tags/Free licenses|free licenses]], in the [[Wikipedia:File copyright tags/Public domain|public domain]], believed to be [[Wikipedia:File copyright tags/Non-free|fair use]], or even copyright infringements (which should be [[WP:IFD|deleted]]). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from [//dumps.wikimedia.org/ dumps.wikimedia.org]. In conclusion, download these images at your own risk ([//dumps.wikimedia.org/legal.html Legal]). |
||
==Dealing with compressed files== |
==Dealing with compressed files== |
||
Compressed dump files are significantly compressed, thus after being decompressed will take up '''large''' amounts of drive space. A large list of decompression programs are described in [[ |
Compressed dump files are significantly compressed, thus after being decompressed will take up '''large''' amounts of drive space. A large list of decompression programs are described in [[comparison of file archivers]]. The following programs in particular can be used to decompress bzip2, [[.bz2]], [[.zip]], and [[.7z]] files. |
||
;[[Microsoft Windows|Windows]] |
;[[Microsoft Windows|Windows]] |
||
Line 91: | Line 90: | ||
The older the software in a computing device, the more likely it will have a 2 GB file limit somewhere in the system. This is due to older software using 32-bit integers for file indexing, which limits file sizes to 2^31 bytes (2 GB) (for signed integers), or 2^32 (4 GB) (for unsigned integers). Older [[C (programming language)|C]] [[library (computing)|programming libraries]] have this 2 or 4 GB limit, but the newer file libraries have been converted to 64-bit integers thus supporting file sizes up to 2^63 or 2^64 bytes (8 or 16 [[exabyte|EB]]). |
The older the software in a computing device, the more likely it will have a 2 GB file limit somewhere in the system. This is due to older software using 32-bit integers for file indexing, which limits file sizes to 2^31 bytes (2 GB) (for signed integers), or 2^32 (4 GB) (for unsigned integers). Older [[C (programming language)|C]] [[library (computing)|programming libraries]] have this 2 or 4 GB limit, but the newer file libraries have been converted to 64-bit integers thus supporting file sizes up to 2^63 or 2^64 bytes (8 or 16 [[exabyte|EB]]). |
||
Before starting a download of a large file, check the storage device to ensure its file system can support files of such a large size, |
Before starting a download of a large file, check the storage device to ensure its file system can support files of such a large size, check the amount of free space to ensure that it can hold the downloaded file, and make sure the device(s) you'll use the storage with are able to read your chosen file system. |
||
===File system limits=== |
===File system limits=== |
||
Line 102: | Line 101: | ||
* [[ReFS]] supports files up to 16 [[exabyte|EB]]. |
* [[ReFS]] supports files up to 16 [[exabyte|EB]]. |
||
;[[Macintosh]] (Mac) |
;[[Macintosh]] (Mac) |
||
* [[HFS Plus]] (HFS+) supports files up to 8 [[exabyte |
* [[HFS Plus]] (HFS+) (Also known as Mac OS Extended) supports files up to 8 EiB (8 exbibytes) (2^63 bytes).<ref name="AppleVolumecomparison">{{Cite web |title=Volume Format Comparison |url=https://developer.apple.com/library/archive/documentation/FileManagement/Conceptual/APFS_Guide/VolumeFormatComparison/VolumeFormatComparison.html |access-date=2023-11-19 |website=developer.apple.com}}</ref> An exbibyte is similar to an [[exabyte]]. HFS Plus is supported on [[macOS]] 10.2+ and [[iOS]]. It was the default file system for [[macOS]] computers prior to the release of [[macOS High Sierra]] in 2017 when it was replaced as default with [[Apple File System]], [[APFS]]. |
||
* [[APFS]] supports files up to 8 exbibytes (2^63 bytes).<ref name="AppleVolumecomparison" /> |
|||
;[[Linux]] |
;[[Linux]] |
||
* [[ext2]] and [[ext3]] supports files up to 16 GB, but up to 2 TB with larger block sizes. See https://users.suse.com/~aj/linux_lfs.html for more information. |
* [[ext2]] and [[ext3]] supports files up to 16 GB, but up to 2 TB with larger block sizes. See https://users.suse.com/~aj/linux_lfs.html for more information. |
||
Line 130: | Line 130: | ||
* 64-bit kernel 2.4.x systems have an 8 EB limit for all file systems. |
* 64-bit kernel 2.4.x systems have an 8 EB limit for all file systems. |
||
* 32-bit kernel 2.6.x systems without option CONFIG_LBD have a 2 TB limit for all file systems. |
* 32-bit kernel 2.6.x systems without option CONFIG_LBD have a 2 TB limit for all file systems. |
||
* 32-bit kernel 2.6.x systems with option CONFIG_LBD and all 64-bit kernel 2.6.x systems have an 8 ZB limit for all file systems.<ref> |
* 32-bit kernel 2.6.x systems with option CONFIG_LBD and all 64-bit kernel 2.6.x systems have an 8 ZB limit for all file systems.<ref> [http://www.suse.com/~aj/linux_lfs.html Large File Support in Linux]</ref> |
||
[[Android (operating system)|Android]]: |
|||
Android is based on Linux, which determines its base limits. |
|||
* Internal storage: |
* Internal storage: |
||
** [[Android (operating system)|Android]] 2.3 and later uses the [[ext4]] file system.<ref> |
** [[Android (operating system)|Android]] 2.3 and later uses the [[ext4]] file system.<ref>[http://www.h-online.com/open/news/item/Android-2-3-Gingerbread-to-use-Ext4-file-system-1152775.html Android 2.2 and before used YAFFS file system; December 14, 2010.]</ref> |
||
** [[Android (operating system)|Android]] 2.2 and earlier uses the [[YAFFS]]2 file system. |
** [[Android (operating system)|Android]] 2.2 and earlier uses the [[YAFFS]]2 file system. |
||
* External storage slots: |
* External storage slots: |
||
Line 141: | Line 141: | ||
** Android 2.3 and later supports ext4 file system. |
** Android 2.3 and later supports ext4 file system. |
||
; [[Apple Inc.|Apple]] [[iOS]] (see [[List of |
; [[Apple Inc.|Apple]] [[iOS]] (see [[List of iPhone models]]): |
||
* All devices support [[HFS Plus]] (HFS+) for internal storage. No devices have external storage slots. Devices on 10.3 or later run [[Apple File System]] supporting a max file size of 8 EB. |
* All devices support [[HFS Plus]] (HFS+) for internal storage. No devices have external storage slots. Devices on 10.3 or later run [[Apple File System]] supporting a max file size of 8 EB. |
||
Line 147: | Line 147: | ||
====Detect corrupted files==== |
====Detect corrupted files==== |
||
It is useful to check the [[MD5]] sums (provided in a file in the download directory) to make sure the download was complete and accurate. This can be checked by running the "md5sum" command on the files downloaded. Given their sizes, this may take some time to calculate. Due to the technical details of how files are stored, ''file sizes'' may be reported differently on different filesystems, and so are not necessarily reliable. Also, corruption may have occurred during the download, though this is unlikely. |
It is useful to check the [[MD5]] sums (provided in a file in the download directory) to make sure the download was complete and accurate. This can be checked by running the "md5sum" command on the files downloaded. Given their sizes, this may take some time to calculate. Due to the technical details of how files are stored, ''file sizes'' may be reported differently on different filesystems, and so are not necessarily reliable. Also, corruption may have occurred during the download, though this is unlikely. |
||
====Reformatting external USB drives==== |
|||
If you plan to download Wikipedia Dump files to one computer and use an external [[USB flash drive]] or [[Disk enclosure|hard drive]] to copy them to other computers, then you will run into the 4 GB FAT32 file size limit. To work around this limit, reformat the >4 GB USB drive to a file system that supports larger file sizes. If working exclusively with Windows computers, then reformat the USB drive to NTFS file system. |
|||
====Linux and Unix==== |
====Linux and Unix==== |
||
Line 230: | Line 227: | ||
===Kiwix=== |
===Kiwix=== |
||
[[File:Kiwix on Android.jpg|right|thumb|Kiwix on an Android tablet]] |
[[File:Kiwix on Android.jpg|right|thumb|Kiwix on an Android tablet]] |
||
[[Kiwix]] is by far the largest offline distribution of [[Wikipedia]] to date. As an offline reader, Kiwix works with a library of contents that are zim files: you can pick & choose whichever [[Wikimedia_Foundation#Wikimedia_projects|Wikimedia project]] (Wikipedia in any language, [[Wiktionary]], [[Wikisource]], etc.), as well as [[TED Talks]], [[PhET Interactive Simulations|PhET Interactive Maths & Physics simulations]], [[ |
[[Kiwix]] is by far the largest offline distribution of [[Wikipedia]] to date. As an offline reader, Kiwix works with a library of contents that are zim files: you can pick & choose whichever [[Wikimedia_Foundation#Wikimedia_projects|Wikimedia project]] (Wikipedia in any language, [[Wiktionary]], [[Wikisource]], etc.), as well as [[TED Talks]], [[PhET Interactive Simulations|PhET Interactive Maths & Physics simulations]], [[Project Gutenberg]], etc. |
||
It is free and open source, and currently available for download on: |
It is free and open source, and currently available for download on: |
||
Line 239: | Line 236: | ||
* [https://flathub.org/apps/details/org.kiwix.desktop GNU/Linux] |
* [https://flathub.org/apps/details/org.kiwix.desktop GNU/Linux] |
||
... as well as extensions for [https://chrome.google.com/webstore/detail/kiwix/donaljnlmapmngakoipdmehbfcioahhk Chrome] & [https://addons.mozilla.org/fr/firefox/addon/kiwix-offline/ Firefox] browsers, server solutions, etc. See [https://kiwix.org official Website] for the complete Kiwix portfolio. |
... as well as extensions for [https://chrome.google.com/webstore/detail/kiwix/donaljnlmapmngakoipdmehbfcioahhk Chrome] & [https://addons.mozilla.org/fr/firefox/addon/kiwix-offline/ Firefox] browsers, server solutions, etc. See [https://www.kiwix.org/en/ official Website] for the complete Kiwix portfolio. |
||
===Aard Dictionary / Aard 2=== |
===Aard Dictionary / Aard 2=== |
||
Line 245: | Line 242: | ||
It also has a successor [http://aarddict.org/ Aard 2]. |
It also has a successor [http://aarddict.org/ Aard 2]. |
||
=== E-book === |
|||
The [http://wiki-as-ebook.sourceforge.net/ wiki-as-ebook] store provides ebooks created from a large set of Wikipedia articles with grayscale images for e-book-readers (2013). |
|||
=== Wikiviewer for [[Rockbox]] === |
=== Wikiviewer for [[Rockbox]] === |
||
Line 254: | Line 248: | ||
===Old dumps=== |
===Old dumps=== |
||
* The static version of Wikipedia created by Wikimedia: http://static.wikipedia.org/ |
* The static version of Wikipedia created by Wikimedia: http://static.wikipedia.org/ Feb. 11, 2013 - This is apparently offline now. There was no content. |
||
*[http://www.tommasoconforti.com/ Wiki2static] (site down {{asof|2005|10|lc=on}}) '''was''' an experimental program set up by [[User:Alfio]] to generate html dumps, inclusive of images, search function and alphabetical index. At the linked site experimental dumps and the script itself can be downloaded. As an example it was used to generate these copies of [http://fixedreference.org/en/20040424/wikipedia/Main_Page English WikiPedia 24 April 04] |
*[http://www.tommasoconforti.com/ Wiki2static] (site down {{asof|2005|10|lc=on}}) '''was''' an experimental program set up by [[User:Alfio]] to generate html dumps, inclusive of images, search function and alphabetical index. At the linked site experimental dumps and the script itself can be downloaded. As an example it was used to generate these copies of [http://fixedreference.org/en/20040424/wikipedia/Main_Page English WikiPedia 24 April 04], [https://web.archive.org/web/20040618150011/http://fixedreference.org/simple/20040501/wikipedia/Main_Page Simple WikiPedia 1 May 04](old database) format and [http://july.fixedreference.org/en/20040724/wikipedia/Main_Page English WikiPedia 24 July 04][http://july.fixedreference.org/simple/20040724/wikipedia/Main_Page Simple WikiPedia 24 July 04], [http://july.fixedreference.org/fr/20040727/wikipedia/Accueil WikiPedia Francais 27 Juillet 2004] (new format). [[User:BozMo|BozMo]] uses a version to generate periodic static copies at [http://fixedreference.org/ fixed reference] (site down as of October 2017). |
||
==Dynamic HTML generation from a local XML database dump== |
==Dynamic HTML generation from a local XML database dump== |
||
Line 371: | Line 365: | ||
*[[DBpedia]] |
*[[DBpedia]] |
||
*[[WikiReader]] |
*[[WikiReader]] |
||
*[[ |
*[[mw:Help:Export]] |
||
*[[m:Help:Downloading pages]] |
*[[m:Help:Downloading pages]] |
||
*[[m:Import]] |
*[[m:Help:Import]] |
||
*[[Meta:Data dumps/Other tools]], for related tools, e.g. extractors and "dump readers" |
*[[Meta:Data dumps/Other tools]], for related tools, e.g. extractors and "dump readers" |
||
*[[Wikipedia:Wikipedia CD Selection]] |
*[[Wikipedia:Wikipedia CD Selection]] |
Revision as of 15:44, 17 March 2024
This help page is a how-to guide. It explains concepts or processes used by the Wikipedia community. It is not one of Wikipedia's policies or guidelines, and may reflect varying levels of consensus. |
Readers' FAQ |
---|
Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA), and most is additionally licensed under the GNU Free Documentation License (GFDL).[1] Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.
Offline Wikipedia readers
Some of the many ways to read Wikipedia while offline:
- Kiwix: (§ Kiwix) - index of images (2024)
- XOWA: (§ XOWA) - index of images (2015)
- WikiTaxi: § WikiTaxi (for Windows)
- aarddict: § Aard Dictionary / Aard 2
- BzReader: § BzReader and MzReader (for Windows)
- WikiFilter: § WikiFilter
- Wikipedia on rockbox: § Wikiviewer for Rockbox
- Selected Wikipedia articles as a printed document: Help:Printing
Some of them are mobile applications – see "list of Wikipedia mobile applications".
Where do I get the dumps?
English-language Wikipedia
- Dumps from any Wikimedia Foundation project: dumps
.wikimedia .org and the Internet Archive - English Wikipedia dumps in SQL and XML: dumps
.wikimedia .org /enwiki / and the Internet Archive - Download the data dump using a BitTorrent client (torrenting has many benefits and reduces server load, saving bandwidth costs).
- pages-articles-multistream.xml.bz2 – Current revisions only, no talk or user pages; this is probably what you want, and is over 19 GB compressed (expands to over 86 GB when decompressed).
- pages-meta-current.xml.bz2 – Current revisions only, all pages (including talk)
- abstract.xml.gz – page abstracts
- all-titles-in-ns0.gz – Article titles only (with redirects)
- SQL files for the pages and links are also available
- All revisions, all pages: These files expand to multiple terabytes of text. Please only download these if you know you can cope with this quantity of data. Go to Latest Dumps and look out for all the files that have 'pages-meta-history' in their name.
- To download a subset of the database in XML format, such as a specific category or a list of articles see: Special:Export, usage of which is described at Help:Export.
- Wiki front-end software: MediaWiki [1].
- Database backend software: MySQL.
- Image dumps: See below.
Should I get multistream?
TL;DR: GET THE MULTISTREAM VERSION! (and the corresponding index file, pages-articles-multistream-index.txt.bz2)
pages-articles.xml.bz2 and pages-articles-multistream.xml.bz2 both contain the same xml contents. So if you unpack either, you get the same data. But with multistream, it is possible to get an article from the archive without unpacking the whole thing. Your reader should handle this for you, if your reader doesn't support it it will work anyway since multistream and non-multistream contain the same xml. The only downside to multistream is that it is marginally larger. You might be tempted to get the smaller non-multistream archive, but this will be useless if you don't unpack it. And it will unpack to ~5-10 times its original size. Penny wise, pound foolish. Get multistream.
NOTE THAT the multistream dump file contains multiple bz2 'streams' (bz2 header, body, footer) concatenated together into one file, in contrast to the vanilla file which contains one stream. Each separate 'stream' (or really, file) in the multistream dump contains 100 pages, except possibly the last one.
How to use multistream?
For multistream, you can get an index file, pages-articles-multistream-index.txt.bz2. The first field of this index is the number of bytes to seek into the compressed archive pages-articles-multistream.xml.bz2, the second is the article ID, the third the article title.
Cut a small part out of the archive with dd using the byte offset as found in the index. You could then either bzip2 decompress it or use bzip2recover, and search the first file for the article ID.
See https://docs.python.org/3/library/bz2.html#bz2.BZ2Decompressor for info about such multistream files and about how to decompress them with python; see also https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dumps/+/ariel/toys/bz2multistream/README.txt and related files for an old working toy.
Other languages
In the dumps
Where are the uploaded files (image, audio, video, etc.)?
Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is (as of September 2013) available from mirrors but not offered directly from Wikimedia servers. See the list of current mirrors. You should rsync from the mirror, then fill in the missing images from upload.wikimedia.org; when downloading from upload.wikimedia.org
you should throttle yourself to 1 cache miss per second (you can check headers on a response to see if was a hit or miss and then back off when you get a miss) and you shouldn't use more than one or two simultaneous HTTP connections. In any case, make sure you have an accurate user agent string with contact info (email address) so ops can contact you if there's an issue. You should be getting checksums from the mediawiki API and verifying them. The API Etiquette page contains some guidelines, although not all of them apply (for example, because upload.wikimedia.org isn't MediaWiki, there is no maxlag
parameter).
Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (Legal).
Dealing with compressed files
Compressed dump files are significantly compressed, thus after being decompressed will take up large amounts of drive space. A large list of decompression programs are described in comparison of file archivers. The following programs in particular can be used to decompress bzip2, .bz2, .zip, and .7z files.
Beginning with Windows XP, a basic decompression program enables decompression of zip files.[2][3] Among others, the following can be used to decompress bzip2 files.
- bzip2 (command-line) (from here) is available for free under a BSD license.
- 7-Zip is available for free under an LGPL license.
- WinRAR
- WinZip
- Macintosh (Mac)
- macOS ships with the command-line bzip2 tool.
- GNU/Linux
- Most GNU/Linux distributions ship with the command-line bzip2 tool.
- Some BSD systems ship with the command-line bzip2 tool as part of the operating system. Others, such as OpenBSD, provide it as a package which must first be installed.
- Notes
- Some older versions of bzip2 may not be able to handle files larger than 2 GB, so make sure you have the latest version if you experience any problems.
- Some older archives are compressed with gzip, which is compatible with PKZIP (the most common Windows format).
Dealing with large files
As files grow in size, so does the likelihood they will exceed some limit of a computing device. Each operating system, file system, hard storage device, and software (application) has a maximum file size limit. Each one of these will likely have a different maximum, and the lowest limit of all of them will become the file size limit for a storage device.
The older the software in a computing device, the more likely it will have a 2 GB file limit somewhere in the system. This is due to older software using 32-bit integers for file indexing, which limits file sizes to 2^31 bytes (2 GB) (for signed integers), or 2^32 (4 GB) (for unsigned integers). Older C programming libraries have this 2 or 4 GB limit, but the newer file libraries have been converted to 64-bit integers thus supporting file sizes up to 2^63 or 2^64 bytes (8 or 16 EB).
Before starting a download of a large file, check the storage device to ensure its file system can support files of such a large size, check the amount of free space to ensure that it can hold the downloaded file, and make sure the device(s) you'll use the storage with are able to read your chosen file system.
File system limits
There are two limits for a file system: the file system size limit, and the file system limit. In general, since the file size limit is less than the file system limit, the larger file system limits are a moot point. A large percentage of users assume they can create files up to the size of their storage device, but are wrong in their assumption. For example, a 16 GB storage device formatted as FAT32 file system has a file limit of 4 GB for any single file. The following is a list of the most common file systems, and see Comparison of file systems for additional detailed information.
- FAT16 supports files up to 4 GB. FAT16 is the factory format of smaller USB drives and all SD cards that are 2 GB or smaller.
- FAT32 supports files up to 4 GB. FAT32 is the factory format of larger USB drives and all SDHC cards that are 4 GB or larger.
- exFAT supports files up to 127 PB. exFAT is the factory format of all SDXC cards, but is incompatible with most flavors of UNIX due to licensing problems.
- NTFS supports files up to 16 TB. NTFS is the default file system for modern Windows computers, including Windows 2000, Windows XP, and all their successors to date. Versions after Windows 8 can support larger files if the file system is formatted with a larger cluster size.
- ReFS supports files up to 16 EB.
- Macintosh (Mac)
- HFS Plus (HFS+) (Also known as Mac OS Extended) supports files up to 8 EiB (8 exbibytes) (2^63 bytes).[4] An exbibyte is similar to an exabyte. HFS Plus is supported on macOS 10.2+ and iOS. It was the default file system for macOS computers prior to the release of macOS High Sierra in 2017 when it was replaced as default with Apple File System, APFS.
- APFS supports files up to 8 exbibytes (2^63 bytes).[4]
- ext2 and ext3 supports files up to 16 GB, but up to 2 TB with larger block sizes. See https://users.suse.com/~aj/linux_lfs.html for more information.
- ext4 supports files up to 16 TB, using 4 KB block size. (limit removed in e2fsprogs-1.42 (2012))
- XFS supports files up to 8 EB.
- ReiserFS supports files up to 1 EB, 8 TB on 32-bit systems.
- JFS supports files up to 4 PB.
- Btrfs supports files up to 16 EB.
- NILFS supports files up to 8 EB.
- YAFFS2 supports files up to 2 GB
- ZFS supports files up to 16 EB.
- FreeBSD and other BSDs
- Unix File System (UFS) supports files up to 8 ZiB.
Operating system limits
Each operating system has internal file system limits for file size and drive size, which is independent of the file system or physical media. If the operating system has any limits lower than the file system or physical media, then the OS limits will be the real limit.
- Windows 95, 98, ME have a 4 GB limit for all file sizes.
- Windows XP has a 16 TB limit for all file sizes.
- Windows 7 has a 16 TB limit for all file sizes.
- Windows 8, 10, and Server 2012 have a 256 TB limit for all file sizes.
- 32-bit kernel 2.4.x systems have a 2 TB limit for all file systems.
- 64-bit kernel 2.4.x systems have an 8 EB limit for all file systems.
- 32-bit kernel 2.6.x systems without option CONFIG_LBD have a 2 TB limit for all file systems.
- 32-bit kernel 2.6.x systems with option CONFIG_LBD and all 64-bit kernel 2.6.x systems have an 8 ZB limit for all file systems.[5]
Android: Android is based on Linux, which determines its base limits.
- Internal storage:
- External storage slots:
- All Android devices should support FAT16, FAT32, ext2 file systems.
- Android 2.3 and later supports ext4 file system.
- Apple iOS (see List of iPhone models)
- All devices support HFS Plus (HFS+) for internal storage. No devices have external storage slots. Devices on 10.3 or later run Apple File System supporting a max file size of 8 EB.
Tips
Detect corrupted files
It is useful to check the MD5 sums (provided in a file in the download directory) to make sure the download was complete and accurate. This can be checked by running the "md5sum" command on the files downloaded. Given their sizes, this may take some time to calculate. Due to the technical details of how files are stored, file sizes may be reported differently on different filesystems, and so are not necessarily reliable. Also, corruption may have occurred during the download, though this is unlikely.
Linux and Unix
If you seem to be hitting the 2 GB limit, try using wget version 1.10 or greater, cURL version 7.11.1-1 or greater, or a recent version of lynx (using -dump). Also, you can resume downloads (for example wget -c).
Why not just retrieve data from wikipedia.org at runtime?
Suppose you are building a piece of software that at certain points displays information that came from Wikipedia. If you want your program to display the information in a different way than can be seen in the live version, you'll probably need the wikicode that is used to enter it, instead of the finished HTML.
Also, if you want to get all the data, you'll probably want to transfer it in the most efficient way that's possible. The wikipedia.org servers need to do quite a bit of work to convert the wikicode into HTML. That's time consuming both for you and for the wikipedia.org servers, so simply spidering all pages is not the way to go.
To access any article in XML, one at a time, access Special:Export/Title of the article.
Read more about this at Special:Export.
Please be aware that live mirrors of Wikipedia that are dynamically loaded from the Wikimedia servers are prohibited. Please see Wikipedia:Mirrors and forks.
Please do not use a web crawler
Please do not use a web crawler to download large numbers of articles. Aggressive crawling of the server can cause a dramatic slow-down of Wikipedia.
Sample blocked crawler email
- IP address nnn.nnn.nnn.nnn was retrieving up to 50 pages per second from wikipedia.org addresses. Something like at least a second delay between requests is reasonable. Please respect that setting. If you must exceed it a little, do so only during the least busy times shown in our site load graphs at stats
.wikimedia .org /EN /ChartsWikipediaZZ .htm. It's worth noting that to crawl the whole site at one hit per second will take several weeks. The originating IP is now blocked or will be shortly. Please contact us if you want it unblocked. Please don't try to circumvent it – we'll just block your whole IP range.
- If you want information on how to get our content more efficiently, we offer a variety of methods, including weekly database dumps which you can load into MySQL and crawl locally at any rate you find convenient. Tools are also available which will do that for you as often as you like once you have the infrastructure in place.
- Instead of an email reply you may prefer to visit #mediawiki connect at irc.libera.chat to discuss your options with our team.
Doing SQL queries on the current database dump
You can do SQL queries on the current database dump using Quarry (as a replacement for the disabled Special:Asksql page).
Database schema
SQL schema
See also: mw:Manual:Database layout
The sql file used to initialize a MediaWiki database can be found here.
XML schema
The XML schema for each dump is defined at the top of the file and described in the MediaWiki export help page.
Help to parse dumps for use in scripts
- Wikipedia:Computer help desk/ParseMediaWikiDump describes the Perl Parse::MediaWikiDump library, which can parse XML dumps.
- Wikipedia preprocessor (wikiprep.pl) is a Perl script that preprocesses raw XML dumps and builds link tables, category hierarchies, collects anchor text for each article etc.
- Wikipedia SQL dump parser is a .NET library to read MySQL dumps without the need to use MySQL database
- WikiDumpParser – a .NET Core library to parse the database dumps.
- Dictionary Builder is a Rust program that can parse XML dumps and extract entries in files
- Scripts for parsing Wikipedia dumps – Python based scripts for parsing sql.gz files from wikipedia dumps.
- parse-mediawiki-sql – a Rust library for quickly parsing the SQL dump files with minimal memory allocation
- gitlab.com/tozd/go/mediawiki – a Go package providing utilities for processing Wikipedia and Wikidata dumps.
Doing Hadoop MapReduce on the Wikipedia current database dump
You can do Hadoop MapReduce queries on the current database dump, but you will need an extension to the InputRecordFormat to have each <page> </page> be a single mapper input. A working set of java methods (jobControl, mapper, reducer, and XmlInputRecordFormat) is available at Hadoop on the Wikipedia
Help to import dumps into MySQL
See:
Wikimedia Enterprise HTML Dumps
As part of Wikimedia Enterprise a partial mirror of HTML dumps is made public. Dumps are produced for a specific set of namespaces and wikis, and then made available for public download. Each dump output file consists of a tar.gz archive which, when uncompressed and untarred, contains one file, with a single line per article, in json format. This is currently an experimental service.
Static HTML tree dumps for mirroring or CD distribution
MediaWiki 1.5 includes routines to dump a wiki to HTML, rendering the HTML with the same parser used on a live wiki. As the following page states, putting one of these dumps on the web unmodified will constitute a trademark violation. They are intended for private viewing in an intranet or desktop installation.
- If you want to draft a traditional website in Mediawiki and dump it to HTML format, you might want to try mw2html by User:Connelly.
- If you'd like to help develop dump-to-static HTML tools, please drop us a note on the developers' mailing list.
- Static HTML dumps are now available here.
See also:
- mw:Alternative parsers lists some other not working options for getting static HTML dumps
- Wikipedia:Snapshots
- Wikipedia:TomeRaider database
Kiwix
Kiwix is by far the largest offline distribution of Wikipedia to date. As an offline reader, Kiwix works with a library of contents that are zim files: you can pick & choose whichever Wikimedia project (Wikipedia in any language, Wiktionary, Wikisource, etc.), as well as TED Talks, PhET Interactive Maths & Physics simulations, Project Gutenberg, etc.
It is free and open source, and currently available for download on:
... as well as extensions for Chrome & Firefox browsers, server solutions, etc. See official Website for the complete Kiwix portfolio.
Aard Dictionary / Aard 2
Aard Dictionary is an offline Wikipedia reader. No images. Cross-platform for Windows, Mac, Linux, Android, Maemo. Runs on rooted Nook and Sony PRS-T1 eBooks readers.
It also has a successor Aard 2.
Wikiviewer for Rockbox
The wikiviewer plugin for rockbox permits viewing converted Wikipedia dumps on many Rockbox devices. It needs a custom build and conversion of the wiki dumps using the instructions available at http://www.rockbox.org/tracker/4755 . The conversion recompresses the file and splits it into 1 GB files and an index file which all need to be in the same folder on the device or micro sd card.
Old dumps
- The static version of Wikipedia created by Wikimedia: http://static.wikipedia.org/ Feb. 11, 2013 - This is apparently offline now. There was no content.
- Wiki2static (site down as of October 2005[update]) was an experimental program set up by User:Alfio to generate html dumps, inclusive of images, search function and alphabetical index. At the linked site experimental dumps and the script itself can be downloaded. As an example it was used to generate these copies of English WikiPedia 24 April 04, Simple WikiPedia 1 May 04(old database) format and English WikiPedia 24 July 04Simple WikiPedia 24 July 04, WikiPedia Francais 27 Juillet 2004 (new format). BozMo uses a version to generate periodic static copies at fixed reference (site down as of October 2017).
Dynamic HTML generation from a local XML database dump
Instead of converting a database dump file to many pieces of static HTML, one can also use a dynamic HTML generator. Browsing a wiki page is just like browsing a Wiki site, but the content is fetched and converted from a local dump file on request from the browser.
XOWA
XOWA is a free, open-source application that helps download Wikipedia to a computer. Access all of Wikipedia offline, without an internet connection! It is currently in the beta stage of development, but is functional. It is available for download here.
Features
- Displays all articles from Wikipedia without an internet connection.
- Download a complete, recent copy of English Wikipedia.
- Display 5.2+ million articles in full HTML formatting.
- Show images within an article. Access 3.7+ million images using the offline image databases.
- Works with any Wikimedia wiki, including Wikipedia, Wiktionary, Wikisource, Wikiquote, Wikivoyage (also some non-wmf dumps)
- Works with any non-English language wiki such as French Wikipedia, German Wikisource, Dutch Wikivoyage, etc.
- Works with other specialized wikis such as Wikidata, Wikimedia Commons, Wikispecies, or any other MediaWiki generated dump
- Set up over 660+ other wikis including:
- English Wiktionary
- English Wikisource
- English Wikiquote
- English Wikivoyage
- Non-English wikis, such as French Wiktionary, German Wikisource, Dutch Wikivoyage
- Wikidata
- Wikimedia Commons
- Wikispecies
- ... and many more!
- Update your wiki whenever you want, using Wikimedia's database backups.
- Navigate between offline wikis. Click on "Look up this word in Wiktionary" and instantly view the page in Wiktionary.
- Edit articles to remove vandalism or errors.
- Install to a flash memory card for portability to other machines.
- Run on Windows, Linux and Mac OS X.
- View the HTML for any wiki page.
- Search for any page by title using a Wikipedia-like Search box.
- Browse pages by alphabetical order using Special:AllPages.
- Find a word on a page.
- Access a history of viewed pages.
- Bookmark your favorite pages.
- Downloads images and other files on demand (when connected to the internet)
- Sets up Simple Wikipedia in less than 5 minutes
- Can be customized at many levels: from keyboard shortcuts to HTML layouts to internal options
Main features
- Very fast searching
- Keyword (actually, title words) based searching
- Search produces multiple possible articles: you can choose amongst them
- LaTeX based rendering for mathematical formulae
- Minimal space requirements: the original .bz2 file plus the index
- Very fast installation (a matter of hours) compared to loading the dump into MySQL
WikiFilter
WikiFilter is a program which allows you to browse over 100 dump files without visiting a Wiki site.
WikiFilter system requirements
- A recent Windows version (Windows XP is fine; Windows 98 and ME won't work because they don't have NTFS support)
- A fair bit of hard drive space (to install you will need about 12–15 Gigabytes; afterwards you will only need about 10 Gigabytes)
How to set up WikiFilter
- Start downloading a Wikipedia database dump file such as an English Wikipedia dump. It is best to use a download manager such as GetRight so you can resume downloading the file even if your computer crashes or is shut down during the download.
- Download XAMPPLITE from [2] (you must get the 1.5.0 version for it to work). Make sure to pick the file whose filename ends with .exe
- Install/extract it to C:\XAMPPLITE.
- Download WikiFilter 2.3 from this site: http://sourceforge.net/projects/wikifilter. You will have a choice of files to download, so make sure that you pick the 2.3 version. Extract it to C:\WIKIFILTER.
- Copy the WikiFilter.so into your C:\XAMPPLITE\apache\modules folder.
- Edit your C:\xampplite\apache\conf\httpd.conf file, and add the following line:
- LoadModule WikiFilter_module "C:/XAMPPLITE/apache/modules/WikiFilter.so"
- When your Wikipedia file has finished downloading, uncompress it into your C:\WIKIFILTER folder. (I used WinRAR http://www.rarlab.com/ demo version – BitZipper http://www.bitzipper.com/winrar.html works well too.)
- Run WikiFilter (WikiIndex.exe), and go to your C:\WIKIFILTER folder, and drag and drop the XML file into the window, click Load, then Start.
- After it finishes, exit the window, and go to your C:\XAMPPLITE folder. Run the setup_xampp.bat file to configure xampp.
- When you finish with that, run the Xampp-Control.exe file, and start Apache.
- Browse to http://localhost/wiki and see if it works
- If it doesn't work, see the forums.
WikiTaxi (for Windows)
WikiTaxi is an offline-reader for wikis in MediaWiki format. It enables users to search and browse popular wikis like Wikipedia, Wikiquote, or WikiNews, without being connected to the Internet. WikiTaxi works well with different languages like English, German, Turkish, and others but has a problem with right-to-left language scripts. WikiTaxi does not display images.
WikiTaxi system requirements
- Any Windows version starting from Windows 95 or later. Large File support (greater than 4 GB which requires an exFAT filesystem) for the huge wikis (English only at the time of this writing).
- It also works on Linux with Wine.
- 16 MB RAM minimum for the WikiTaxi reader, 128 MB recommended for the importer (more for speed).
- Storage space for the WikiTaxi database. This requires about 11.7 GiB for the English Wikipedia (as of 5 April 2011), 2 GB for German, less for other Wikis. These figures are likely to grow in the future.
WikiTaxi usage
- Download WikiTaxi and extract to an empty folder. No installation is otherwise required.
- Download the XML database dump (*.xml.bz2) of your favorite wiki.
- Run WikiTaxi_Importer.exe to import the database dump into a WikiTaxi database. The importer takes care to uncompress the dump as it imports, so make sure to save your drive space and do not uncompress beforehand.
- When the import is finished, start up WikiTaxi.exe and open the generated database file. You can start searching, browsing, and reading immediately.
- After a successful import, the XML dump file is no longer needed and can be deleted to reclaim disk space.
- To update an offline Wiki for WikiTaxi, download and import a more recent database dump.
For WikiTaxi reading, only two files are required: WikiTaxi.exe and the .taxi database. Copy them to any storage device (memory stick or memory card) or burn them to a CD or DVD and take your Wikipedia with you wherever you go!
BzReader and MzReader (for Windows)
BzReader is an offline Wikipedia reader with fast search capabilities. It renders the Wiki text into HTML and doesn't need to decompress the database. Requires Microsoft .NET framework 2.0.
MzReader by Mun206 works with (though is not affiliated with) BzReader, and allows further rendering of wikicode into better HTML, including an interpretation of the monobook skin. It aims to make pages more readable. Requires Microsoft Visual Basic 6.0 Runtime, which is not supplied with the download. Also requires Inet Control and Internet Controls (Internet Explorer 6 ActiveX), which are packaged with the download.
EPWING
Offline Wikipedia database in EPWING dictionary format, which is common and an out-dated Japanese Industrial Standards (JIS) in Japan, can be read including thumbnail images and tables with some rendering limits, on any systems where a reader is available (Boookends). There are many free and commercial readers for Windows (including Mobile), Mac OS X, iOS (iPhone, iPad), Android, Unix-Linux-BSD, DOS, and Java-based browser applications (EPWING Viewers).
Mirror building
WP-MIRROR
- Important: WP-mirror hasn't been supported since 2014, and community verification is needed that it actually works. See talk page.
WP-MIRROR is a free utility for mirroring any desired set of WMF wikis. That is, it builds a wiki farm that the user can browse locally. WP-MIRROR builds a complete mirror with original size media files. WP-MIRROR is available for download.
See also
- DBpedia
- WikiReader
- mw:Help:Export
- m:Help:Downloading pages
- m:Help:Import
- Meta:Data dumps/Other tools, for related tools, e.g. extractors and "dump readers"
- Wikipedia:Wikipedia CD Selection
- Wikipedia:Size of Wikipedia
- meta:Mirroring Wikimedia project XML dumps
- meta:Static version tools
- Wikimedia offline projects
References
- ^ See Wikipedia:Reusing Wikipedia content § Re-use of text under the GNU Free Documentation License for more information on compatibility with the GFDL.
- ^ "Benchmarked: What's the Best File Compression Format?". How To Geek. How-To Geek, LLC. Retrieved 18 January 2017.
- ^ "Zip and unzip files". Microsoft. Microsoft. Retrieved 18 January 2017.
- ^ a b "Volume Format Comparison". developer.apple.com. Retrieved 2023-11-19.
- ^ Large File Support in Linux
- ^ Android 2.2 and before used YAFFS file system; December 14, 2010.
External links
- Wikimedia downloads.
- Domas visits logs (read this!). Also, old data in the Internet Archive.
- Wikimedia mailing lists archives.
- User:Emijrp/Wikipedia Archive. An effort to find all the Wiki[mp]edia available data, and to encourage people to download it and save it around the globe.
- Script to download all Wikipedia 7z dumps.