Analytics Engineering
The Wikimedia Foundation's Analytics Engineering team is part of the Technology department.
The Analytics Engineering team's primary responsibility is to "empower and support data-informed decision making across the Foundation and the Community".
We make Wikimedia-related data available for querying and analysis to both the WMF and the various wiki communities and stakeholders.
We develop infrastructure so all our users, both within the Foundation and within the different communities, can access data in a self-service fashion that is consistent with the values of the movement.
We keep all our documentation here on Wikitech. See also this FAQ.
About us - Analytics/Team
Contact
If you have questions about our work or the infrastructure we provide, you can contact us in two ways:
- on our public mailing list, analytics@lists.wikimedia.org (subscribe, archives)
- in our public IRC channel, #wikimedia-analytics. You can use the keyword a-team to ping us so that we notice your question.
- during our office hours, held since January 14, 2019 on the second Monday of every month. Add them to your calendar, or let us know if that time is too early for you and we can hold a second session when needed.
Work organization
The analytics team uses Phabricator to track its projects.
- https://phabricator.wikimedia.org/tag/analytics/ for backlog triage
- https://phabricator.wikimedia.org/tag/analytics-kanban/ for in progress tasks
Prioritization
Datasets
- Webrequests [Traffic logs] and derived tables, including:
- Pageviews [Filtered traffic logs] [TODO - Revamp and add various systems and key differences in schema and usage]
- Inter-language [Traffic between different languages of the same project family]
- Unique Devices [Estimates of unique devices at the project or project-family level]
- Mediawiki raw databases
- EventLogging (in the event database in hive)
- Edits history, Page history, User history
- Other reports
- Clickstream
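Several of the datasets above, including pageviews, are also exposed publicly through the AQS REST API listed under Systems below. As a minimal sketch, the following builds a per-article pageviews URL for the public endpoint; the article title and date range are arbitrary examples, not values taken from this page.

```python
# Sketch of querying the public AQS (Analytics Query Service) pageviews
# endpoint for a single article. Title and dates below are placeholders.
from urllib.parse import quote

AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build the AQS per-article pageviews URL."""
    return (f"{AQS_BASE}/per-article/{project}/{access}/{agent}/"
            f"{quote(article, safe='')}/{granularity}/{start}/{end}")

url = per_article_url("en.wikipedia", "Albert_Einstein", "20240101", "20240107")
print(url)
# The URL can then be fetched with any HTTP client, e.g.:
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(url))
#   for item in data["items"]:
#       print(item["timestamp"], item["views"])
```

Article titles must be URL-encoded (hence `quote` with `safe=''`), since titles may contain characters such as slashes that would otherwise break the path.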
Systems
We maintain various systems that allow querying our datasets in different ways.
System name and link | Type | Accessibility
---|---|---
Archiva | Repository for Java archives | Private |
AQS - Analytics Query Service | REST API for analytics data | Public |
Clients (stat100X) | Analytics client nodes to access Hadoop and various services | Private |
Cluster (Hadoop, Gobblin, Hive, Oozie, Spark...) | Hadoop | Private |
Dashiki | Framework for building dashboards | Public |
Druid | Data storage engine optimized for exploratory analytics | Private |
EventLogging | Ad-hoc streaming pipeline | Private |
EventStreams | Mediawiki events streams | Public |
Hue | Web interface for Hive, Oozie, and other Cluster services | Private |
Kafka | Data transport and streaming system | Private |
MariaDB | Data storage for MediaWiki replicas and EventLogging | Private |
Matomo (formerly known as Piwik) | Small-scale web analytics platform | Private |
ReportUpdater | Job Scheduler | Private |
Superset | Web interface for data visualization and exploration | Private |
Jupyter | Hosted notebooks for data analysis | Private |
Turnilo | Web interface for exploring data stored in Druid | Private |
Wikistats (1 and 2) | Community Dashboard with high-level metrics | Public |
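Of the systems above, EventStreams is publicly accessible and serves MediaWiki events over Server-Sent Events (SSE). As a rough sketch, the parser below handles the generic SSE wire format; the sample event lines are invented for illustration and are not real stream output.

```python
# Minimal sketch of consuming SSE output such as that served by
# EventStreams. The sample lines below are invented for illustration.
import json

def parse_sse(lines):
    """Yield the JSON payload of each SSE event carried in data: lines."""
    data_parts = []
    for line in lines:
        if line.startswith("data:"):
            data_parts.append(line[len("data:"):].strip())
        elif line == "" and data_parts:  # a blank line terminates one event
            yield json.loads("\n".join(data_parts))
            data_parts = []

# Hypothetical sample in the SSE format; a real client would iterate over
# the response lines of e.g.
# https://stream.wikimedia.org/v2/stream/recentchange instead.
sample = [
    "event: message",
    'data: {"wiki": "enwiki", "type": "edit", "title": "Example"}',
    "",
]
events = list(parse_sse(sample))
print(events[0]["wiki"])  # → enwiki
```

Buffering `data:` lines until the blank separator line matters because a single SSE event may split its payload across several `data:` lines.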
The list of scheduled manual maintenance tasks is documented here.
Try it out! Analytics/Tutorials
We'd rather you have fun with our data :)
Please check the link above for something that might help you, and let us know if you don't find what you're after.
Table of Contents
Go to the Analytics/TOC page for a list of all the pages we have under Analytics.