Data dumps

See also mw:Help:Export and mw:Help:Import
The Wikimedia Foundation is requesting help to ensure that as many copies as possible are available of all Wikimedia database dumps. Please volunteer to host a mirror if you have access to sufficient storage and bandwidth.

Summary

Description

WMF releases data dumps of Wikipedia and all WMF projects on a regular basis.

Content

  • Text and metadata of current or all revisions of all pages as XML files
  • Most database tables as sql files
    • Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
    • Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
    • Media metadata (image, oldimage tables)
    • Info about each page (page, page_props, page_restrictions tables)
    • Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
    • List of all pages that are redirects and their targets (redirect table)
    • Log data, including blocks, protection, deletion, uploads (logging table)
    • Misc bits (interwiki, site_stats, user_groups tables)
  • Experimental add/change dumps (no moves or deletes, plus some other limitations; see https://wikitech.wikimedia.org/wiki/Dumps/Adds-changes_dumps), available at http://dumps.wikimedia.org/other/incr/

  • Stub-prefixed dumps for some projects, which contain only header information for pages and revisions, without the actual page text
  • Media bundles for each project, separated into files uploaded to the project and files from Commons

Images: see here

  • Static HTML dumps for 2007–2008: http://dumps.wikimedia.org/other/static_html_dumps/

(see more)

Download

You can download the latest dumps (those from roughly the last year) here: dumps.wikimedia.org/enwiki/ for the English Wikipedia, dumps.wikimedia.org/dewiki/ for the German Wikipedia, and so on.

Archives: dumps.wikimedia.org/archive/

Current mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.
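For example, a dump file can be streamed to disk in manageable chunks using nothing but the Python standard library. This is only a sketch: the URL and file name are illustrative (check the download page for current names), and it does not resume interrupted transfers.

  import shutil
  import urllib.request

  # Illustrative URL: the current pages-articles dump of the English Wikipedia.
  url = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

  with urllib.request.urlopen(url) as response:
      # Copy in 1 MiB chunks so the whole file is never held in memory.
      with open("enwiki-latest-pages-articles.xml.bz2", "wb") as out:
          shutil.copyfileobj(response, out, length=1 << 20)

A dedicated download tool will handle retries and resuming better; the sketch only illustrates that the files are ordinary HTTP downloads.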

Many older dumps can be found at the Internet Archive.

Data format

XML dumps since 2010 are in the wrapper format described at Export format (schema). Files are compressed in bzip2 (.bz2) and .7z format.
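As a concrete illustration of that format, the following Python sketch (3.8+ for the {*} namespace wildcard) streams page titles out of a compressed pages-articles file without unpacking it to disk. The file name is illustrative, and the export-schema namespace varies by version, so tags are matched loosely.

  import bz2
  import xml.etree.ElementTree as ET

  def iter_titles(path):
      # Yield page titles from a pages-articles .bz2 dump, one page at a time.
      with bz2.open(path, "rb") as stream:
          for _event, elem in ET.iterparse(stream):
              # Tags carry the export-schema namespace, e.g.
              # "{http://www.mediawiki.org/xml/export-0.11/}page".
              if elem.tag.endswith("}page") or elem.tag == "page":
                  yield elem.findtext("{*}title")
                  elem.clear()  # discard the processed page to keep memory use flat

  # Illustrative file name; substitute the dump you actually downloaded.
  for i, title in enumerate(iter_titles("enwiki-latest-pages-articles.xml.bz2")):
      print(title)
      if i == 9:
          break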

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.

How to and examples

See examples of importing dumps into a MySQL database, with step-by-step instructions, here.
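As a rough sketch of that process (assuming a local MySQL server, an already-created database named wikidump, and an illustrative table dump file), a compressed .sql.gz file can be streamed straight into the mysql client without unpacking it first:

  import gzip
  import subprocess

  # Illustrative names: adjust the file, database and credentials to your setup.
  DUMP = "enwiki-latest-page.sql.gz"

  mysql = subprocess.Popen(
      ["mysql", "--user=root", "wikidump"],
      stdin=subprocess.PIPE,
  )
  with gzip.open(DUMP, "rb") as sql:
      # Feed the decompressed SQL to the client in 1 MiB chunks.
      for chunk in iter(lambda: sql.read(1 << 20), b""):
          mysql.stdin.write(chunk)
  mysql.stdin.close()
  mysql.wait()

Piping gunzip output into mysql from the shell achieves the same thing; the step-by-step page linked above walks through the full procedure.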

Existing tools

Available tools are listed in several locations, but the information is not always up to date.

Access

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Support

Research projects using data from this source

What is this all about?

Wikimedia provides public dumps of our wikis' content:

  • for archival/backup purposes
  • for offline use
  • for academic research
  • for bot use
  • for republishing (don't forget to follow the license terms)
  • for fun!

Please follow the XML Data Dumps mailing list, by reading the archives or subscribing, for up-to-date news about the dumps; you can also make inquiries about them there. If you cannot download the dump you want because it no longer exists, or if you have other issues with the files, you can ping the developers there.

Warning on time and size

Before attempting to download any of the wikis or their components, please read the time and size information below carefully. Because some file collections run to terabytes, downloads can take days or even weeks. (See also our FAQ on the size of the English-language Wikipedia dumps.) Be sure you understand your storage capabilities before attempting downloads. Note that there are a number of versions that are friendlier in size and content, which you can tailor to your capacity by including or excluding images, talk pages, and so on. A careful read of the information below will save a lot of headaches compared to jumping straight into downloads.

Faster archives and servers

Once you're sure you've selected the smallest dataset which fits your purpose, make sure to get it in the most efficient way:

  • download them compressed in 7z format, which can be 10 times smaller than bz2 and decompresses faster[1] (once the 7z download is complete, you can pipe the content with 7z e -so; see also the full manual);
  • download from one of the dump mirrors, which can be much faster, especially if they are near you network-wise (e.g. if you are both on a GÉANT/Internet2/etc. network);
  • if (de)compression takes more time than expected, make sure you have downloaded either the 7z or the multistream.xml.bz2 files and that your software supports multithreading (such as pbzip2/lbzip2 for bz2 decompression, p7zip with lzma-utils 5.1.3+ or 5.2+[2] for compression only,[3] or xargs/parallel for multiple 7z LZMA decompressions); if you write to disk, keep in mind that LZMA decompression is likely faster than your disk can handle[4][5] (see the sketch after this list).
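The sketch mentioned in the last point above: assuming the p7zip command-line tool is installed (and with an illustrative file name), a .7z dump can be consumed as a stream so the decompressed XML never touches the disk.

  import subprocess

  # Illustrative file name; any pages-meta-history .7z part works the same way.
  DUMP = "enwiki-latest-pages-meta-history1.xml.7z"

  # "7z e -so" extracts to standard output instead of to a file.
  with subprocess.Popen(["7z", "e", "-so", DUMP], stdout=subprocess.PIPE) as proc:
      pages = sum(line.count(b"<page>") for line in proc.stdout)

  print(pages, "pages in", DUMP)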

What's available and where

It's all explained here: what's available and where you can download it.

How often dumps are produced

All databases are dumped on three hosts which generate dumps simultaneously. The largest database, enwiki, takes about 14 days for a full run to complete. Wikidata is not far behind.

We produce full dumps with all historical page content once a month; this dump run starts at the beginning of each month.

We produce partial dumps with current page content only, also once a month, starting about two-thirds of the way through the month.

Failures in the dump process are generally dealt with by cleaning up the underlying issue and letting the automated runner rerun the job.

See wikitech:Dumps/Current_Architecture for more information about the processes and the dump architecture.

Larger databases such as jawiki, dewiki, and frwiki can take a long time to run, especially when compressing the full edit history or creating split stub dumps. If you see a dump seemingly stuck on one of these for a few hours or even days, it is likely not dead but simply processing a lot of data. You can check that file sizes are increasing, or that more revisions are being processed, by reloading the web page for the dump.

The download site shows the status of each dump: if it's in progress, when it was last dumped, etc.

Monitoring dump generation

If you are interested in a particular wiki and run date (e.g. frwiki, the "full" run that starts on the 1st of the month), you can check the file dumpstatus.json in the corresponding directory, i.e. for 1 April 2019's frwiki run you would look at https://dumps.wikimedia.org/frwiki/20190401/dumpstatus.json and so on. See Data dumps/Status format for more information on the format of these output files. If you are interested in getting information on all wikis, you can check the https://dumps.wikimedia.org/index.json file which aggregates the per-run json files for the most recent run across all wikis.
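For instance, a small script can poll that file and report the state of each job. The run URL below is the 1 April 2019 frwiki example from above (old runs are eventually removed, so substitute a recent date), and the "jobs" and "status" field names are those described on the status format page.

  import json
  import urllib.request

  # Example run from the text above; old runs disappear, so pick a recent date.
  URL = "https://dumps.wikimedia.org/frwiki/20190401/dumpstatus.json"

  with urllib.request.urlopen(URL) as response:
      run = json.load(response)

  # Each job (stub dumps, table dumps, full history, ...) reports its own status.
  for name, job in sorted(run["jobs"].items()):
      print(name, "->", job.get("status", "unknown"))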

Feeds for last dump produced

If you're interested in a file, you can subscribe to the RSS feed for it, so that you know when a new version is produced. No more time spent opening the web page, no more dumps missed and hungry bots without their XML ration.

The URL can be found in the latest/ directory for the wiki (database name) in question: for instance

dumps.wikimedia.org/metawiki/latest/

contains the feed

dumps.wikimedia.org/metawiki/latest/metawiki-latest-pages-meta-history.xml.bz2-rss.xml

for the last *-pages-meta-history.xml.bz2 dump produced.

You can use services that turn RSS feeds to email notifications (like Blogtrottr).
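If you would rather poll a feed yourself, a short script will do. The sketch below uses the metawiki feed URL shown above and assumes only the standard RSS channel/item layout.

  import urllib.request
  import xml.etree.ElementTree as ET

  FEED = ("https://dumps.wikimedia.org/metawiki/latest/"
          "metawiki-latest-pages-meta-history.xml.bz2-rss.xml")

  with urllib.request.urlopen(FEED) as response:
      channel = ET.parse(response).getroot().find("channel")

  # The newest dump is announced as the first (usually only) item in the feed.
  item = channel.find("item") if channel is not None else None
  if item is not None:
      print(item.findtext("title"))
      print(item.findtext("link"))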

Format of the dump files

The format of the various files available for download is explained here.

Download tools

You can download the XML/SQL files and the media bundles using a web client of your choice, but there are also tools for bulk downloading you may wish to use.

Tools for import

Here's your basic list of tools for importing.

Other tools

Check out and/or add to this partial list of other tools for working with the dumps, including parsers and offline readers.

Producing your own dumps

MediaWiki 1.5 and above includes a command-line maintenance script, dumpBackup.php, which can be used to produce XML dumps directly, with or without page history.
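A minimal sketch of invoking it, assuming a MediaWiki installation at /var/www/mediawiki (the path is hypothetical): --current dumps only the latest revision of each page, while --full includes the complete history.

  import subprocess

  # Hypothetical installation path; the script runs from the wiki's own tree.
  WIKI_DIR = "/var/www/mediawiki"

  with open("my-wiki-current.xml", "wb") as out:
      subprocess.run(
          ["php", "maintenance/dumpBackup.php", "--current"],
          cwd=WIKI_DIR,
          stdout=out,
          check=True,
      )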

The programs which manage our multi-database dump process are available in our source repository but would need some tweaking to be used outside of Wikimedia.

You can generate dumps from public wikis using WikiTeam tools.

Step by step importing

We documented the process to set up a small non-English-language wiki with not too many fancy extensions, using the standard MySQL database backend, on a Linux platform. Read the example or add your own.

See also the MediaWiki manual page on importing XML dumps.
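As a rough sketch of the XML side of that process (the installation path and dump file name are illustrative), MediaWiki's importDump.php maintenance script reads a dump from standard input:

  import subprocess

  # Hypothetical paths; importDump.php also accepts a file name as an argument.
  WIKI_DIR = "/var/www/mediawiki"

  with open("smallwiki-latest-pages-articles.xml", "rb") as dump:
      subprocess.run(
          ["php", "maintenance/importDump.php"],
          cwd=WIKI_DIR,
          stdin=dump,
          check=True,
      )

Rebuilding links and other derived data is usually needed afterwards (see the manual page linked above).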

Where to go for help

If you have trouble importing the files, or problems with the appearance of the pages after import, check our import issues list.

If you don't find the answer there or you have other problems with the dump files, you can:

  • Ask in #mediawiki on irc.freenode.net, although help is not always available there.
  • Ask on the xmldatadumps-l (quicker) or the wikitech-l mailing lists.

Alternatively, if you have a specific bug to report:

  • File a bug at Phabricator under the Dumps Generation project.

For French speakers, see also fr:Wikipédia:Requêtes XML.

FAQ

Some questions come up often enough that we have a FAQ for you to check out.

See also

On the dumps:

On related projects:

References

  1. For instance it may take less than 2 hours of wall clock time to decompress the whole 100+ GiB of the compressed full dumps of the English Wikipedia: phabricator:P4751. Using 2 CPU cores, it takes less than a day to decompress and grep all the revisions for a string: phabricator:P4750.
  2. https://sourceforge.net/p/lzmautils/discussion/708858/thread/d37155d1/#d8af
  3. https://sourceforge.net/p/sevenzip/discussion/45797/thread/40ce93af/#5d93
  4. https://sourceforge.net/p/sevenzip/discussion/45797/thread/136f029b/#32ad
  5. Many scripts can read directly from bz2/7z files, such as wikistats or Python scripts; others recommend reading from the piped decompressed content, such as wikiq and mwdiffs.