Research:Data

From Meta, a Wikimedia project coordination wiki

Jump to: navigation, search

Languages:	English · português

This page is an overview of the various sources of open-licensed data published by the Wikimedia Foundation or about Wikimedia projects. The information is intended to help community members, developers and researchers learn about available data sources and find the data they need for their work.

If you have any questions, you might find the answer in the Frequently Asked Questions about Data.

If you wish to donate or document any additional data sources, you can use the Wikimedia organization on DataHub.

See also Wikistats, Statistics and proposals.

Quick glance[edit]

Data Dumps (details)

Homepage | Download

Dumps of all WMF projects for backup, offline use, research, etc.

Wiki content, revisions, metadata, and page-to-page and outside links
XML and SQL format
once/twice a month
large file sizes

API (details)

Homepage

The API provides direct, high-level access to the data contained in MediaWiki databases through HTTP requests to the web service.

Meta info about the wiki and logged-in user, properties of pages (revisions, content, etc.) and lists of pages based on criteria
JSON, WDDX, XML, YAML, and PHP's native serialization format

Tool Labs (details)

Homepage @ Wikitech

Tool Labs allows you to connect to shared server resources and query a copy of the database (with some lag).

acts as a standard web server hosting web-based tools
command-line tools
account required

Recent changes stream (details)

Homepage

Wikimedia broadcasts every change to every Wikimedia wiki using the Socket.IO protocol.

Pageview Stats (details)

Homepage | Download

Raw pageview dumps (not unique hits) based on squid server logs.

Project, title of the page, number of requests, size of the content
Delimited and JSON
Aggregated hourly

WikiStats (details)

Homepage | Download

Reports in 25+ languages based on data dumps and server log files.

Unique visits, page views, active editors and more
Intermediate CSV files available.
Graphical presentation.
Monthly

DBpedia (details)

Homepage

DBpedia extracts structured data from Wikipedia, allows users to run complex queries and link Wikipedia data to other data sets.

RDF ,N-triplets, SPARQL endpoint, Linked Data
billions of triplets of info in a consistent Ontology

DataHub (details)

Homepage

A collection of various Wikimedia-related datasets.

smaller (usually one-time) surveys/studies
dbpedia lite, DBpedia-Live and others
EPIC/Oxford quality assessment

Data dumps[edit]

Home page[edit]

Data dumps

Description[edit]

WMF publishes data dumps of Wikipedia and all WMF projects on a regular basis. English Wikipedia is dumped once a month, while smaller projects are often dumped twice a month.

Content[edit]

Text and metadata of current or all revisions of all pages as XML files
Most database tables as sql files
- Page-to-page link lists (pagelinks, categorylinks, imagelinks, templatelinks tables)
- Lists of pages with links outside of the project (externallinks, iwlinks, langlinks tables)
- Media metadata (image, oldimage tables)
- Info about each page (page, page_props, page_restrictions tables)
- Titles of all pages in the main namespace, i.e. all articles (*-all-titles-in-ns0.gz)
- List of all pages that are redirects and their targets (redirect table)
- Log data, including blocks, protection, deletion, uploads (logging table)
- Misc bits (interwiki, site_stats, user_groups tables)
experimental add/change dumps (no moves and deletes + some other limitations) https://wikitech.wikimedia.org/wiki/Dumps/Adds-changes_dumps

http://dumps.wikimedia.org/other/incr/

Stub-prefixed dumps for some projects which only have header info for pages and revisions without actual content
Media bundles for each project, separated into files uploaded to the project and files from Commons

Images : See here

Static HTML dumps for 2007-2008

http://dumps.wikimedia.org/other/static_html_dumps/

(see more)

Download[edit]

You can download the latest dumps (for the last year) here (dumps.wikimedia.org/enwiki/ for English Wikipedia, dumps.wikimedia.org/dewiki/ for German Wikipedia, etc).

Archives : dumps.wikimedia.org/archive/

Current mirrors offer an alternative to the download page.

Due to large file sizes, using a download tool is recommended.

Data format[edit]

XML dumps since 2010 are in the wrapper format described at Export format (schema). Files are compressed in bzip2 (.bz2) and .7z format.

SQL dumps are provided as dumps of entire tables, using mysqldump.

Some older dumps exist in various formats.

How to and examples[edit]

See examples of importing dumps in a MySQL database with step-by-step instructions here .

Existing tools[edit]

Available tools are listed in the following locations, but information is not always up-to-date:

Access[edit]

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages.

Support[edit]

Maintainer: Ariel Glenn

Mailing list: xmldatadumps-l

Research projects using data from this source[edit]

"A Breakdown of Quality Flaws in Wikipedia" examines cleanup tags on the English Wikipedia using a January 2011 dump
"There is No Deadline – Time Evolution of Wikipedia Discussions" looks at the time evolution of Wikipedia discussions, and how it correlates to editing activity, based on 9.4 million comments from the March 12, 2010 dump
"Understanding collaboration in Wikipedia" mines a complete dump of the English Wikipedia (225 million article edits) for insights into open collaboration
"Dynamics of Conflicts in Wikipedia" takes the revision history from the dump to extract the reverts based on the text comparison to study the dynamics of editorial wars in multiple language editions

API[edit]

Description[edit]

The web service API provides direct, high-level access to the data contained in MediaWiki databases. Client programs can log in to a wiki, get data, and post changes automatically by making HTTP requests.

Content[edit]

Meta information about the wiki and the logged-in user
Properties of pages, including page revisions and content, external links, categories, templates,etc.
Lists of pages that match certain criteria

Endpoint[edit]

To query the database you send a HTTP GET request to the desired endpoint (example http://en.wikipedia.org/w/api.php for English Wikipedia) setting the action parameter to "query" and defining the query details the URL.

13.08.1996

How to and examples[edit]

Here's a simple example:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xml&titles=Main%20Page

This means fetch (action=query) the content (rvprop=content) of the most recent revision of Main Page (titles=Main%20Page) of English Wikipedia (http://en.wikipedia.org/w/api.php? )in XML format (format=xml). You can paste the URL in a browser to see the output.

Further ( and more complex) examples can be found here.

Also see :

Existing tools[edit]

To try out the API interactively, use the Api Sandbox.

Access[edit]

To use the API, your application or client might need to log in.

Before you start, learn about the API etiquette.

Researchers could be given Special access rights on case-to-case bases.

All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL).

Support[edit]

FAQ: http://www.mediawiki.org/wiki/API:FAQ

Mailing list: mediawiki-api

Tool Labs[edit]

NOTE: In 2014 Tool Labs replaced the "Toolserver" server cluster managed by WMDE.

Home page[edit]

http://tools.wmflabs.org/

Description[edit]

Tool Labs hosts command line or web-based tools, which can query copies of the database. Copies are generally real-time but sometimes replication lag occurs.

Content[edit]

Tool Labs hosts copies of the databases of all Wikimedia projects including Commons. You are allowed use the contents of the database as long as you don't violate the rules.

Data format[edit]

Learn more about the current database schema.

How to[edit]

Using Tool Labs requires familiarity with Unix/Linux command line, SSL keys, SQL/databases, and some programming.

To start using the Tool Labs, see this handy guide: R:Labs2/Getting started with Tool Labs. The steps are summarized as follows:

register labs account (register)
- Add your public SSH key under preferences
request access to Tool Labs (request)
SSH to tools.wmflabs.org using your private SSH key

Existing tools[edit]

See http://tools.wmflabs.org/

Support[edit]

On IRC: #wikimedia-labs on Freenode, a great place to ask questions, get help, and meet other Tool Lab developers. See Help:IRC for more information.

Via mailing list: Labs-l@lists.wikimedia.org A list for announcements and discussion related to the Wikimedia Labs project. You can find the archives here: http://lists.wikimedia.org/pipermail/labs-l/

Found a bug?: Bugs can be posted to Bugzilla: https://bugzilla.wikimedia.org/enter_bug.cgi?product=Wikimedia%20Labs&component=tools

Projects using Tool Labs / Toolserver data[edit]

On the old toolserver:

"Circadian patterns of Wikipedia editorial activity: A demographic analysis" analyzed "34 Wikipedias in different languages [trying] to characterize and find the universalities and differences in temporal activity patterns of editors", with the underlying data provided by the German Wikimedia chapter from the toolserver.
"Feeling the Pulse of a Wiki: Visualization of Recent Changes in Wikipedia" describes a tool hosted on Toolserver providing Recent Changes visualization to aid admins

Recent changes stream[edit]

See wikitech:RCStream to subscribe to Recent changes to all Wikimedia wikis. This broadcasts edits and other changes as they happen; confirmation that an edit has completed is typically faster over this than through the browser. You specify the wikis whose edits you want to receive.

Old IRC recent changes feed[edit]

Wikimedia also has IRC feeds of recent changes hosted on the irc.wikimedia.org server. RCStream is more robust and easier to parse, but the old system is still operational and its details follow.

Changes shown automatically as they happen.
Feeds for each wiki in a separate channel.
Filtered feeds available with cloak

Data and format[edit]

Each wiki edit is reflected in the wiki's IRC channel.Displayed URLs give the cumulative differences produced by the edit concerned and any subsequent edits. The time is not listed but timestamping may be provided by your IRC-client.

The format of each edit summary is :

[page_title] [URL_of_the_revision] * [user] * [size_of_the_edit] [edit_summary]

You can see some examples below:

<rc-pmtpa> Talk:Duke of York's Picture House, Brighton http://en.wikipedia.org/w/index.php?diff=542604907&oldid=498947324 *Fortdj33* (-14) Updated classification

<rc-pmtpa> Bloody Sunday (1887) http://en.wikipedia.org/w/index.php?diff=542604908&oldid=542604828 *03184.61.149.187* (-2371) /* Aftermath */

Location[edit]

IRC feeds are hosted on the irc.wikimedia.org server.

Every one of the >730 Wikimedia wikis has an IRC RC feed. The channel name is #lang.project. For example, the channel for German Wikibooks channel is #de.wikibooks.

Existing tools[edit]

wm-bot lets you get IRC feeds filtered according to your needs. You can define a list of pages and get notifications of revisions on those pages only.
WikiStream uses IRC feeds to illustrate the amount of activity happening on Wikimedia projects.
wikimon is a WebSocket-oriented monitor for the IRC feeds

Access[edit]

Anyone can access IRC feeds. However, you need a wm-bot.

Pageview statistics[edit]

Home page[edit]

http://dumps.wikimedia.org/other/pagecounts-raw/
http://dumps.wikimedia.org/other/pagecounts-ez/merged/ highly compacted monthly aggregates, without loss of hourly resolution

Content[edit]

Each request of a page reaches one of Wikimedia's squid caching hosts. The project name, the size of the page requested, and the title of the page requested are logged and aggregated hourly. English statistics are available since 2007 and non-English since 2008.

Files starting with "projectcount" contain total hits per project per hour statistics. A separate set with repaired counts is maintained as well (several cases of multi-month underreporting could be fixed from secondary sources)

Note: These are not unique hits and changed titles/moves are counted separately.

Download[edit]

http://dumps.wikimedia.org/other/pagecounts-raw/
http://dumps.wikimedia.org/other/pagecounts-ez/projectcounts (repaired projectcounts)

Data format[edit]

Delimited format :

[Project] [Article_name] [Number_of_requests] [Size of the content returned]

where Project is in the form language.project using abbreviations described here.

Examples:

    fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624

means that the French Wikibooks page with title "Special:Recherche/Achille_Baraguey_d%5C%27Hilliers" was viewed 1 time in the last hour and the size of the content returned was 624.

    en Main_Page 242332 4737756101

we see that the main page of the English language Wikipedia was requested over 240 thousand times during the specific hour.

Data in JSON format is available at http://stats.grok.se/.

Existing tools[edit]

You can interactively browse the page view statistis and get data in JSON format at http://stats.grok.se/.

The following tools also use pageview statistics:

Article traffic statistics
GLAMourous - Commons image usage on Wikimedia projects
Top 100 articles for 2012 for each project

Support[edit]

Maintainer: http://stats.grok.se/ is maintained by User:Henrik

Research projects using data from this source[edit]

Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data combines the page view statistics for articles on movies from the raw page view dumps with the editorial activity data from the toolserver database to predict the financial success of movies
Wikipedia-Zugriffszahlen bestätigen Second-Screen-Trend (in German) studies how the TV schedule influences Wikipedia pageviews
More examples

WikiStats[edit]

Home page[edit]

http://stats.wikimedia.org/

Also see: mw:Analytics/Wikistats

Description[edit]

Wikistats is an informal but widely recognized name for a set of reports developed by Erik Zachte since 2003, which provide monthly trend information for all Wikimedia projects and wikis based on XML data dumps and squid server traffic.

Content[edit]

Thousands of monthly reports in 25+ languages about:

unique visitors
editor activity
page views (overall and mobile only)
article creation
browser usage

Special reports (some are one time, some regular) about:

growth per project and language
pageview and edits per project and language
server requests and traffic surges
edits & reverts
user feedback
bot activity
mailing lists

Data format[edit]

Final reports are presened in table and chart form. Intermediate files are avaialable in CSV format.

Download[edit]

CSV files

Project counts repackaged yearly

Existing tools[edit]

The scripts used to generate the CSV files (WikiCounts.pl + WikiCounts*.pm) and reports (WikiReports.pl + WikiReports*.pm )are available for download here.

Support[edit]

Maintainer: Erik Zachte

DBpedia[edit]

Home page[edit]

http://dbpedia.org

Description[edit]

DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.

Content[edit]

English version of the DBpedia knowledge base

describes 3.77 million things
2.35 million are classified in a consistent Ontology(persons, places, creative works like music albums, films and video games, organizations like companies and educational institutions, species, diseases, etc.

Localized versions of DBpedia in 111 language

together describe 20.8 million things, out of which 10.5 million overlap (are interlinked) with concepts from the English DBpedia

The data set also features:

about 2 billion pieces of information (RDF triples)
labels and abstracts for >10 million unique things in up to 111 different languages
millions of
- links to images
- links to external web pages
- data links into external RDF data sets
- links to Wikipedia categories
- YAGO categories

Data format[edit]

RDF/XML
Turtle
N-Triplets
SPARQL endpoint

Download[edit]

http://wiki.dbpedia.org/Downloads38 has download links for all the data sets, different formats and languages.

http://dbpedia.org/sparql - DBpedia's SPARQL endpoint

How to and examples[edit]

Use cases shows the different ways you can use DBpedia data ( such as improving Wikipedia search or adding Wikipedia content to your webpage)
Applications shows the various applications of DBpedia including faceted browsers, visualization, URI lookup, NLP and others.

Existing tools[edit]

DBpedia Spotlight is a tool for annotating mentions of DBpedia resources in text, providing a solution for linking unstructured information sources to the Linked Open Data cloud through DBpedia.
RelFinder is a tool for interactive relationship discovery in RDF data

Access[edit]

DBpedia data from version 3.4 on is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 License and the GNU Free Documentation License.

Support[edit]

Mailing list: DBpedia Discuss

"Biographical Social Networks on Wikipedia - A cross-cultural study of links that made history" uses data extracted from DBpedia to study how biographies on Wikipedia vary depending on language/culture.
See more DBpedia related publications, blog posts and projects here.

DataHub[edit]

The Wikimedia organization on the Open Knowledge Foundation's DataHub is a collection of datasets about Wikipedia and other projects run by the Wikimedia Foundation.

The DataHub repository is meant to become the place where all Wikimedia-related data sources are documented. The collection is open to contributions and researchers are encouraged to donate relevant datasets.

The Wikimedia group on DataHub points to some additional data sources not listed on this page. Some examples are:

dbpedia lite , which uses the API to extract structured data from Wikipedia ( not affiliated with DBpedia))
EPIC/Oxford quality assesmtent of Wikipedia by experts
Wikipedia Banner Challenge data
Wikipedia Editor Engagement Experiments: Timestamp position modification

Wikivoyage also maintains data on its own DataHub:

Hotels/restaurants/attractions data as CSV/OSM/OBF
Tourism guide for offline use

Retrieved from "https://meta.wikimedia.org/w/index.php?title=Research:Data&oldid=15506754"

Categories:

Research:Data

Quick glance[edit]

Data Dumps (details)

API (details)

Tool Labs (details)

Recent changes stream (details)

Pageview Stats (details)

WikiStats (details)

DBpedia (details)

DataHub (details)

Data dumps[edit]

Home page[edit]

Description[edit]

Content[edit]

Download[edit]

Data format[edit]

How to and examples[edit]

Existing tools[edit]

Access[edit]

Support[edit]

Research projects using data from this source[edit]

API[edit]

Description[edit]

Content[edit]

Endpoint[edit]

How to and examples[edit]

Existing tools[edit]

Access[edit]

Support[edit]

Tool Labs[edit]

Home page[edit]

Description[edit]

Content[edit]

Data format[edit]

How to[edit]

Existing tools[edit]

Support[edit]

Projects using Tool Labs / Toolserver data[edit]

Recent changes stream[edit]

Old IRC recent changes feed[edit]

Data and format[edit]

Location[edit]

Existing tools[edit]

Access[edit]

Pageview statistics[edit]

Home page[edit]

Content[edit]

Download[edit]

Data format[edit]

Existing tools[edit]

Support[edit]

Research projects using data from this source[edit]

WikiStats[edit]

Home page[edit]

Description[edit]

Content[edit]

Data format[edit]

Download[edit]

Existing tools[edit]

Support[edit]

DBpedia[edit]

Home page[edit]

Description[edit]

Content[edit]

Data format[edit]

Download[edit]

How to and examples[edit]

Existing tools[edit]

Access[edit]

Support[edit]

Research projects using data from this source[edit]

DataHub[edit]

Navigation menu

Search