Production Excellence #32: May 2021

17:47, Monday, 21 June 2021 UTC

How’d we do in our pursuit of operational excellence last month? Read on to find out!

Incidents

Zero incidents recorded in the past month. Yay! That's only six months after November 2020, the last month without documented incidents (Incident stats).

Remember to review Preventive measures in Phabricator, which are action items filed after an incident.


Trends

In May, we unfortunately saw a repeat of the worrying pattern we saw in April, but with higher numbers. We found 54 new errors. This is the most new errors in a single month since the Excellence monthly began three years ago in 2018. About half of these (29 of 54) remain unresolved as of writing, two weeks into the following month.


Month-over-month plots based on spreadsheet data.


New errors in May

Below is a snapshot of just the 54 new issues found last month, listed by their code steward.

Be mindful that the reporting of errors is not in itself a negative point. I think it should be celebrated when teams have good telemetry, detect their issues early, and address them within their development cycle. It might be more worrisome when teams lack telemetry or time to find such issues, or can't keep up with the pace at which issues are found.

Anti Harassment Tools None.
Community Tech None.
Editing Team +2, -1 Cite (T283755); OOUI (T282176).
Growth Team +17, -4 Add-Link (T281960); GrowthExperiments (T281525 T281703 T283546 T283638 T283924); Echo (T282446); Recent-changes (T282047 T282726); StructuredDiscussions (T281521 T281523 T281782 T281784 T282069 T282146 T282599 T282605).
Language Team +1 Translate extension (T283828).
Parsing Team +1 Parsoid (T281932).
Reading Web None.
Structured Data None.
Product Infra Team +1 WikimediaEvents (T282580).
Analytics None.
Performance Team None.
Platform Engineering +16, -11 MediaWiki-API (T282122); MediaWiki-General (T282173); MediaWiki-Page-derived-data (T281714 T281802 T282180 T283282); MediaWiki-Revision-backend (T282145 T282723 T282825 T283170); MediaWiki-User-management (T283167); MW Expedition (T281526 T281981 T282038 T282181 T283196).
Search Platform +3, -2 CirrusSearch (T282036 T282207); GeoData (T282735).
WMDE TechWish +2, -1 Revision-Slider (T282067); VisualEditor Template dialog (T283511).
WMDE Wikidata +3, -1 Wikibase (T282534 T283198 T283862).
No owner +7, -6 CentralAuth (T282834 T283635); Change-tagging (T283098 T283099); MapSources (T282833); MediaWiki-Page-information (T283751); Other (T283252).

Outstanding errors

Take a look at the workboard and look for tasks that could use your help.

View Workboard

Summary over recent months:

Aug 2019 (0 of 14 left) ✅ Last task resolved! -1
Jan 2020 (1 of 7 left) ⚠️ Unchanged (over one year old).
Mar 2020 (2 of 2 left) ⚠️ Unchanged (over one year old).
Apr 2020 (4 of 14 left) ⬇️ One task resolved. -1
May 2020 (5 of 14 left) ⚠️ Unchanged (over one year old).
Jun 2020 (5 of 14 left) ⚠️ Unchanged (over one year old).
Jul 2020 (4 of 24 issues) ⏸ —
Aug 2020 (12 of 53 issues) ⬇️ One task resolved. -1
Sep 2020 (7 of 33 issues) ⏸ —
Oct 2020 (19 of 69 issues) ⬇️ One task resolved. -1
Nov 2020 (8 of 38 issues) ⬇️ One task resolved. -1
Dec 2020 (7 of 33 issues) ⏸ —
Jan 2021 (3 of 50 issues) ⏸ —
Feb 2021 (7 of 20 issues) ⬇️ One task resolved. -1
Mar 2021 (14 of 48 issues) ⬇️ Four tasks resolved. -4
Apr 2021 (23 of 42 issues) ⬇️ Two tasks resolved. -2
May 2021 (29 of 54 issues) 54 new issues found, of which 29 remain open. +54; -25
Tally
133 issues open, as of Excellence #31 (12 May 2021).
-12 issues closed, of the previous 133 open issues.
+29 new issues that survived May 2021.
150 issues open, as of today (12 June 2021).
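
The tally is a running balance: the previously open count, minus the issues closed since, plus the new issues that survived the month. A minimal sketch of that bookkeeping, using the figures above (the helper name `tally` is hypothetical and only for illustration):

```javascript
// Hypothetical helper: running balance of open production errors.
function tally(previousOpen, closed, survivedNew) {
  return previousOpen - closed + survivedNew;
}

// Figures taken from the tally above.
const openNow = tally(133, 12, 29);
console.log(openNow); // 150
```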

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:
Incident status, Wikitech.
Wikimedia incident stats by Krinkle, CodePen.
Production error data (spreadsheet and plots).
Phabricator report charts for Wikimedia-production-error project.

Tech News issue #25, 2021 (June 21, 2021)

00:00, Monday, 21 June 2021 UTC

Tech News issue #24, 2021 (June 14, 2021)

00:00, Monday, 14 June 2021 UTC

For WMF staff's inspiration week, I decided to take a step back from my work building out a new skin architecture and a redesign of Vector, and put myself into the shoes of a skin developer to see whether the changes my team had made actually made life easier. As a secondary objective, I was interested in how a MediaWiki skin could be written in Vue.js and what the challenges were to get there.

What I built

I decided to build a new skin called Alexandria named after the Great Library. The design was modeled on an open-source project and website I volunteer for called OpenLibrary.org.

I began by creating a skin that was JavaScript only.

I generated most of my boilerplate using the skins.wmflabs.org tool. I tweaked it so it gave me all the client-side tooling I needed.

Once I had that, I added a few more advanced features to my skin. In particular, I created a PHP class that extended the core SkinMustache class to allow me to extend the data given by core.

I wanted to be able to render my skin in JavaScript, so I needed to pass the data in PHP to the client. To do this, I added a template data value that represented a stringified JSON of the entire data that would be passed to the template like so:
https://github.com/jdlrobson/Alexandria/blob/master/SkinAlexandria.php#L29

The skin template rendered this JSON into a data attribute.

<div id="ol-app" data-json="{{data-json}}"></div>

This rendered a blank screen that was readable by JavaScript. While not a useful skin, from here, I was able to begin using Vue.js to parse that data attribute and pass it through Vue components to render the skin. https://github.com/jdlrobson/Alexandria/blob/master/resources/skin.js#L20
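
As a rough illustration of the client-side half of this hand-off, the sketch below reads the JSON back out of the data attribute. It is not the code from the linked skin.js: the function name `readSkinData`, the stub element and the `html-title` key are assumptions made for the example.

```javascript
// Minimal sketch: read the JSON that the server-rendered template placed in
// the data-json attribute. `el` only needs a getAttribute() method, so this
// works with a real DOM element or with a plain stub.
function readSkinData(el) {
  const raw = el.getAttribute('data-json');
  if (!raw) {
    return {};
  }
  try {
    return JSON.parse(raw);
  } catch (e) {
    // Malformed JSON: fall back to an empty object rather than break the page.
    return {};
  }
}

// Stub standing in for document.getElementById('ol-app') outside a browser.
// The 'html-title' key is an illustrative value only.
const stubEl = {
  getAttribute: (name) => (name === 'data-json' ? '{"html-title":"Alexandria"}' : null)
};
const skinData = readSkinData(stubEl);
console.log(skinData['html-title']);
```

In a browser, the parsed object would then be handed to the root Vue component; the skin.js link above shows how the author wired that up.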

From here, I was up and running. I had a skin made from a Vue.js component that said hello world!

Building a skin with Vue.

This really was a breeze.

I made use of the wvui library (http://github.com/wikimedia/wvui) for existing standard components, such as TypeaheadSearch and Button, by including the wvui and vue libraries in my main skin module.

Keen to test out the work Roan and others did to request ES6 only modules, I decided to allow myself the luxury of writing code in ES6 and excluding older browsers.

https://github.com/jdlrobson/Alexandria/blob/master/skin.json#L86

When components didn’t exist in the wvui library, I made them. Thinking in terms of components led to well-scoped CSS. Aside from components inside the wvui library, I ended up creating components such as App.vue, AppArticle.vue, AppBanner.vue, AppFooter.vue, AppHeader.vue, DropdownMenu.vue, FooterMenu.vue, Portlet.vue, TypeaheadSearch.vue.

One thing that was frustrating as I created and renamed these components is that I had to update them in the skin.json manifest. It was unintuitive and I often forgot to do it, which made development a little more tedious. I captured this in a Phabricator ticket, as I think it’s something that could be much better in the developer experience: https://phabricator.wikimedia.org/T283388.

While wvui and Vue.js were on npm, I was relying on some styles inside MediaWiki core, so I couldn’t use Vite or Parcel.js without lots of scaffolding. I decided to develop without hot reloading, which slowed me down a lot as I was doing a lot of page refreshing.

Using the OpenLibrary project I copied across the CSS I needed. The resulting CSS was much better organized and scoped than the original project as I was constantly thinking about reuse.

Styling article content

To generate articles to test on, I used the MobileFrontend content provider (https://www.mediawiki.org/wiki/Extension:MobileFrontend#Testing_with_articles_on_a_foreign_wiki_(live_data)). I ran into a few issues with that, and found a few CSS rules for toggling in Minerva that should have been in MobileFrontend, so I ended up submitting patches to deal with that: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/693503, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/696644

Styles for thumbnails and table of contents are provided by core. Both of these styles didn’t fit in with the aesthetics of my design, so I ended up adding override CSS. I would have preferred not to spend any time styling these elements, so I raised Phabricator tickets lest I forget to revisit our defaults: https://phabricator.wikimedia.org/T283836 and https://phabricator.wikimedia.org/T283396.

I wanted to do a lot with the content, such as moving all images and the infobox to the left, but after wrestling with inline styles and CSS grid I gave up. It would be great if the parser marked up articles in a way that lent itself better to a grid system, but sadly it doesn’t. I didn’t know what tasks to raise here, as I didn’t think about it too deeply, but I want to acknowledge that this was a point of pain.

Server side rendering

I then turned my attention to server-side rendering. MediaWiki currently doesn’t support server-side rendering of Vue components. (https://phabricator.wikimedia.org/T272878)

To server-side render Vue components, a Node.js service is advised which PHP can request HTML from. Because I’m a little crazy and didn’t want to set up such a service, for now, I explored the differences between Mustache and Vue.js templates. I built a Node script that imported Vue components, found the template tag, and then traversed the DOM of that template rewriting it node by node recursively to a Mustache equivalent. Constraining myself to the minimal work possible I managed to create https://github.com/jdlrobson/Alexandria/blob/master/ssr.js

This mostly worked but of course, I ran into a few problems.

This couldn’t load the general components in the wvui library. For these, I had to define fallback templates such as https://github.com/jdlrobson/Alexandria/blob/master/resources/TypeaheadSearch.vue. For the search widget, I ended up rendering a form fallback that looked nothing like the JavaScript Vue version, but that was fine. I made sure the script threw an error so that I’d never load it accidentally in my JavaScript application.

I have a bad habit of giving Vue component props the same name as attributes. I had to stop this. A parameter id became menuId for example. This allowed me to avoid too much complexity in my Mustache template generator, to know when I was dealing with an attribute or a template property.

With for loops, it was easier to parse v-for="a in list" than it was to parse something like v-for="a in data.list", so I made sure that was a constraint in the Vue templates I was writing.

I decided against computed properties as these involved JavaScript so those were not supported in my proof of concept.
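
To give a feel for what such a rewrite involves under the constraints above, here is a minimal sketch. It is not the author's ssr.js: it uses regular expressions instead of the DOM traversal described earlier, the function name `vueToMustache` is hypothetical, and it only handles a simple, non-nested v-for="item in list" element plus plain interpolation.

```javascript
// Minimal sketch of rewriting a restricted Vue template into Mustache.
// Supports only `v-for="item in list"` on a simple (non-nested) element and
// `{{ item.prop }}` / `{{ prop }}` interpolation.
function vueToMustache(template) {
  // <li v-for="a in list">...</li>  =>  {{#list}}<li>...</li>{{/list}}
  const loops = /<(\w+)([^>]*?)\s+v-for="(\w+) in (\w+)"([^>]*)>([\s\S]*?)<\/\1>/g;
  let out = template.replace(loops, (match, tag, before, item, list, after, body) => {
    // Inside a Mustache section the loop variable is implicit, so
    // `{{ a.title }}` becomes `{{title}}` and `{{ a }}` becomes `{{.}}`.
    const inner = body
      .replace(new RegExp('\\{\\{\\s*' + item + '\\.(\\w+)\\s*\\}\\}', 'g'), '{{$1}}')
      .replace(new RegExp('\\{\\{\\s*' + item + '\\s*\\}\\}', 'g'), '{{.}}');
    return '{{#' + list + '}}<' + tag + before + after + '>' + inner +
      '</' + tag + '>{{/' + list + '}}';
  });
  // Remaining interpolations: strip the whitespace Vue allows inside braces.
  out = out.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, '{{$1}}');
  return out;
}

const vueTemplate = '<ul><li v-for="a in list">{{ a.title }}</li></ul><p>{{ heading }}</p>';
console.log(vueToMustache(vueTemplate));
```

Because the list name must match `\w+`, a loop over `data.list` is simply left untouched, which mirrors the constraint on v-for described above.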

Now that I was generating a template via a build step, my skin was working with JavaScript loading. However, loading styles for the new experience became my next problem. I had included all my styles in a Vue template, so I now needed them out. I extended my build script to generate a stylesheet as well, but later backtracked on that and pulled the styles out of the Vue components. I opened https://phabricator.wikimedia.org/T283882 to discuss best practices for that.

API-driven frontend!

Now I had a skin rendering in Vue via JavaScript. It would be silly not to play to its strengths and make it a single-page application, loading content from JS. Unfortunately, there was no Skin API, only APIs for generating content, and various things can vary on a page, such as the JavaScript/CSS loaded, mw.config values and even items in menus. Eek!

A while back I made https://github.com/jdlrobson/mediawiki-skins-skinjson to help with skin development. It allows you to see a JSON representation of the data that a skin template can render. I repurposed this as an API for my app. Pages rendered via API calls and JavaScript with ease. The wire-up was very small: https://github.com/jdlrobson/Alexandria/blob/master/resources/App.vue#L252 Maybe something for the Product-Infrastructure-Team-Backlog?

This allowed me to load articles, however, I ran into technical debt. Most MediaWiki extensions expect to be run on page load. Some work, because of the use of mw.hook. For event handlers bound to the body tag using a proxy pattern, things just worked. We clearly need to use that pattern more if we ever want to go down this route.

E.g.

$('body').on('click', '.uls-button', loadULS);
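
The mw.hook case mentioned above works for a similar reason: as I understand it, a hook can be fired again every time new content is inserted, and handlers that subscribe late are still called with the most recently fired data. The sketch below is a simplified stand-in for that behaviour, not MediaWiki's implementation; `createHook` and the page titles are hypothetical.

```javascript
// Simplified sketch of a hook with "memory", in the spirit of mw.hook.
// Not MediaWiki's implementation; `createHook` is a hypothetical name.
function createHook() {
  const handlers = [];
  let lastArgs = null; // arguments of the most recent fire(), if any
  return {
    add(handler) {
      handlers.push(handler);
      // Late subscribers still see the most recent event.
      if (lastArgs) {
        handler(...lastArgs);
      }
      return this;
    },
    fire(...args) {
      lastArgs = args;
      handlers.forEach((handler) => handler(...args));
      return this;
    }
  };
}

// Simulate two client-side navigations in a single-page skin.
const contentHook = createHook();
const seen = [];
contentHook.add((title) => seen.push('early:' + title));
contentHook.fire('Page A');
contentHook.add((title) => seen.push('late:' + title));
contentHook.fire('Page B');
console.log(seen);
```

Code written this way does not care whether the content arrived with the initial page load or from a later API call, which is what a single-page skin needs.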

https://skins-demo.wmflabs.org/wiki/Alexandria?useskin=alexandria demonstrates article loading using the Parsoid API

Reflection

  • Converting Vue templates to Mustache is possible with constraints
  • There are a few kinks in ResourceLoader that need to be worked out
  • API-driven skins are possible if we're willing to put in the effort across our codebases to build the APIs and rethink how our existing features load.

Production Excellence #31: April 2021

17:39, Saturday, 12 June 2021 UTC

How’d we do in our pursuit of operational excellence last month? Read on to find out!

Incidents

6 documented incidents. That's above the historical average of 3–4 per month.

Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.


Trends

In April, we saw a continuation of the healthy trend that started this January — a trend where the back of the line is moving forward at least as quickly as the front of the line. We did take a little breather in March where we almost broke even, but otherwise the trend is going well.

Last month we bade farewell to the production errors we found in July 2019. This month we cleared out the column for October 2019.

One point of concern is that we did encounter a high number of new production errors — errors that we failed to catch during development, code review, continuous integration, beta testing, or pre-deployment checks. Where we used to discover about a dozen of those a month, we found 42 during this month. As of writing, 17 of the 42 April-discovered errors have been resolved.

The "Old" column (generally tracking pre-2019 tasks) grew for the first time in six months. This increase can largely be attributed to improved telemetry of client-side errors uncovering issues in under-resourced products, such as the old Kaltura video player.


Month-over-month plots based on spreadsheet data.


Outstanding errors

View Workboard

Summary over recent months, per spreadsheet:

Aug 2019 (1 of 14 left) ⚠️ Unchanged (over one year old).
Oct 2019 (0 of 12 left) ✅ Last three tasks resolved! -3
Jan 2020 (1 of 7 left) ⚠️ Unchanged (over one year old).
Mar 2020 (2 of 2 left) ⚠️ Unchanged (over one year old).
Apr 2020 (5 of 14 left) ⚠️ Unchanged (over one year old).
May 2020 (5 of 14 left) ⏸ —
Jun 2020 (5 of 14 left) ⬇️ One task resolved. -1
Jul 2020 (4 of 24 issues) ⬇️ One task resolved. -1
Aug 2020 (13 of 53 issues) ⬇️ Two tasks resolved. -2
Sep 2020 (7 of 33 issues) ⏸ —
Oct 2020 (20 of 69 issues) ⬇️ Two tasks resolved. -2
Nov 2020 (9 of 38 issues) ⏸ —
Dec 2020 (7 of 33 issues) ⬇️ Four tasks resolved. -4
Jan 2021 (3 of 50 issues) ⬇️ One task resolved. -1
Feb 2021 (8 of 20 issues) ⬇️ One task resolved. -1
Mar 2021 (18 of 48 issues) ⬇️ Sixteen tasks resolved. -16
Apr 2021 (25 of 42 issues) 42 new issues found, of which 25 remained open. +42; -17
Tally
139 issues open, as of Excellence #30 (March 2021).
-31 issues closed, of the previously open issues.
+25 new issues that survived April 2021.
133 issues open, as of today (12 May 2021).

Take a look at the workboard and look for tasks that could use your help:

View Workboard


Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in production!

Until next time,

– Timo Tijhof


🎥 McMurphy: That nurse, man... she, uh, she ain't honest.
Doctor: Ah now, look. Miss Ratched is one of the finest nurses we've got in this institution.
McMurphy: Ha! Well […] She likes a rigged game, know what I mean?

Tech News issue #23, 2021 (June 7, 2021)

00:00, Monday, 07 June 2021 UTC

Sponsored Phabricator Improvements

15:46, Saturday, 05 June 2021 UTC

In T135327, the WMF Technical Collaboration team collected a list of Phabricator bugs and feature requests from the Wikimedia Developer Community. After identifying the most promising requests from the community, these were presented to Phacility (the organization that builds and maintains Phabricator) for sponsored prioritization.

I am very pleased to report that we are already seeing the benefits of this initiative. Several sponsored improvements have landed on https://phabricator.wikimedia.org/ over the past few weeks. For an overview of what's landed recently, read on!

Fixed

The following tasks are now resolved:

Notice that three of those have task numbers lower than 2000. Those long-standing tasks date from the first months of WMF's Phabricator evaluation and RFC period. When those tasks were originally filed, Phabricator was just a test install running in WMF Labs. For me, it's especially satisfying to close so many long-standing issues that have affected many of us for more than a year.

Work in Progress

Several more issues were identified for sponsorship which are still awaiting a complete solution. Some of these are at least partially fixed and some are still pending. You can find out more details by reading the comments on each task linked below.

Other recent changes

Besides the sponsored features and bug fixes, there are several other recent improvements which are worth mentioning.

Milestones now include Next / Previous navigation

Recurring calendar events also gained next / previous navigation

New feature for Maniphest tasks: dependency graph

This very helpful feature displays a graphical representation of a task's Parents and Subtasks.

Initially there was an issue with this feature that made tasks with many relationships unable to load. This was exacerbated by the historical use of "tracking tasks" in the Wikimedia Bugzilla context. Thankfully after a quick patch from @epriestley (the primary author of Phabricator) and lots of help and testing from @Danny_B and @Paladox, @mmodell was able to deploy a fix for the issue a little over 24 hours after it was discovered.

Here's to yet more fruitful collaborations with upstream Phabricator!

Tech News issue #22, 2021 (May 31, 2021)

00:00, Monday, 31 May 2021 UTC

Tech News issue #21, 2021 (May 24, 2021)

00:00, Monday, 24 May 2021 UTC

Tech News issue #20, 2021 (May 17, 2021)

00:00, Monday, 17 May 2021 UTC

Production Excellence #30: March 2021

22:57, Wednesday, 12 May 2021 UTC

How’d we do in our pursuit of operational excellence last month? Read on to find out!

Incidents

2 documented incidents. That's average for this time of year, when we usually had 1-4 incidents.

Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.


Trends

In March we made significant progress on the outstanding errors of previous months. Several of the 2020 months are finally starting to empty out. But with over 30 new tasks from March itself remaining, we did not break even, and ended up slightly higher than last month. This could be reversing two positive trends, but I hope not.

Firstly, there was a steep increase in the number of new production errors that were not resolved within the same month. This runs counter to the positive trend we started in November. The past four months typically saw 10-20 errors outlive their month of discovery, and this past month saw 34 of its 48 new errors remain unresolved.

Secondly, we saw the overall number of unresolved errors increase again. This January began a downward trend for the first time in thirteen months, which continued nicely through February. But, this past month we broke even and even pushed upward by one task. I hope this is just a breather and we can continue our way downward.


Month-over-month plots based on spreadsheet data.


Outstanding errors

Take a look at the workboard and look for tasks that could use your help:

View Workboard

Summary over recent months, per spreadsheet:

Jul 2019 (0 of 18 left) ✅ Last two tasks resolved! -2
Aug 2019 (1 of 14 left) ⚠️ Unchanged (over one year old).
Oct 2019 (3 of 12 left) ⬇️ One task resolved. -1
Nov 2019 (0 of 5 left) ✅ Last task resolved! -1
Dec 2019 (0 of 9 left) ✅ Last task resolved! -1
Jan 2020 (1 of 7 left) ⬇️ One task resolved. -1
Feb 2020 (0 of 7 left) ✅ Last task resolved! -1
Mar 2020 (2 of 2 left) ⚠️ Unchanged (over one year old).
Apr 2020 (5 of 14 left) ⬇️ Four tasks resolved. -4
May 2020 (5 of 14 left) ⬇️ One task resolved. -1
Jun 2020 (6 of 14 left) ⬇️ One task resolved. -1
Jul 2020 (5 of 24 issues) ⬇️ Four tasks resolved. -4
Aug 2020 (15 of 53 issues) ⬇️ Five tasks resolved. -5
Sep 2020 (7 of 33 issues) ⬇️ One task resolved. -1
Oct 2020 (22 of 69 issues) ⬇️ Four tasks resolved. -4
Nov 2020 (9 of 38 issues) ⬇️ Two tasks resolved. -2
Dec 2020 (11 of 33 issues) ⬇️ One task resolved. -1
Jan 2021 (4 of 50 issues) ⬇️ One task resolved. -1
Feb 2021 (9 of 20 issues) ⬇️ Two tasks resolved. -2
Mar 2021 (34 of 48 issues) 34 new tasks survived and remain unresolved. +48; -14
Tally
138 issues open, as of Excellence #29 (6 Mar 2021).
-33 issues closed, of the previous 138 open issues.
+34 new issues that survived March 2021.
139 issues open, as of today (2 Apr 2021).

Thanks!

Thank you to everyone who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:

Incident status, Wikitech.
Wikimedia incident stats by Krinkle, CodePen.
Production Excellence: Month-over-month spreadsheet and plot.
Report charts for Wikimedia-production-error project, Phabricator.

Tech News issue #19, 2021 (May 10, 2021)

00:00, Monday, 10 May 2021 UTC

Tech News issue #18, 2021 (May 3, 2021)

00:00, Monday, 03 May 2021 UTC

Tech News issue #17, 2021 (April 26, 2021)

00:00, Monday, 26 April 2021 UTC

Tech News issue #16, 2021 (April 19, 2021)

00:00, Monday, 19 April 2021 UTC

Tracking memory issue in a Java application

13:01, Friday, 02 April 2021 UTC

One of the critical pieces of our infrastructure is Gerrit. It hosts most of our git repositories and is the primary code review interface. Gerrit is written in the Java programming language, which runs in the Java Virtual Machine (JVM). For a couple of years we have been struggling with memory issues which eventually led to an unresponsive service and unattended restarts. The symptoms were the usual ones: the application responses getting slower and degrading until server-side errors render the service unusable. Eventually the JVM terminates with:

java.lang.OutOfMemoryError: Java heap space

This post is my journey toward identifying the root cause and having it fixed up by the upstream developers. Given I barely knew anything about Java and much less about its ecosystem and tooling, I have learned more than a few things on the road and felt like it was worth sharing.

Prior work

The first meaningful task was in June 2019 (T225166), which over several months has led us to:

  • replace aging underlying hardware
  • tune the memory garbage collector and switch to the G1 garbage collector
  • raise the amount of memory allocated to the JVM (the heap)
  • upgrade the Debian operating system by two major releases (Jessie → Stretch → Buster)
  • conduct a major upgrade of Gerrit (June 2020, Gerrit 2.15 → 3.2)
  • move bots crawling the repositories to a replica
  • fix the lack of cache in a MediaWiki extension that was querying Gerrit more than it should have

All of those were sane operations that are part of any application life-cycle; some were meant to address other issues. Raising the maximum heap size (20G to 32G) definitely reduced the frequency of crashes.

Still, we had memory filling up over and over. The graph below shows the memory usage from September 2019 to September 2020. The increase of maximum heap usage in October 2020 is the JVM heap being raised from 20G to 32G. Each of the "little green hills" corresponds to memory filling up until we either restarted Gerrit or the JVM crashed unattended:

Zooming in on a week, it is clear that the memory was almost entirely filled until we had to restart:

This had to stop. Complaints about Gerrit being unresponsive, SRE having to respond to java.lang.OutOfMemoryError: Java heap space, or us having to "proactively" restart before a weekend: none of these were good practice. Back and fresh from vacation, I filed a new task, T263008, in September 2020 and started to tackle the problem in my spare time. Would I be able to find my way in an ecosystem totally unknown to me?

Challenge accepted!

Stuff learned

  • Routine maintenance is definitely needed.
  • Don't expect things to magically resolve themselves; commit to thoroughly identifying the root cause instead of hoping.

Looking at memory

Since the JVM runs out of memory, let's look at memory allocation. The JDK provides several utilities to interact with a running JVM, be it to attach a debugger, write a copy of the whole heap, or send admin commands to the JVM.

jmap lets one take a full capture of the memory used by a Java virtual machine. It has to run as the same user as the application (we use Unix username gerrit2) and when having multiple JDKs installed, one has to make sure to invoke the jmap that is provided by the Java version running the targeted JVM.

Dumping the memory is then a single magic command:

sudo -u gerrit2 /usr/lib/jvm/java-8-openjdk-amd64/bin/jmap \
  -dump:live,format=b,file=/var/lib/gerrit-202009170755.hprof <pid of java process here>

It takes a few minutes depending on the number of objects. The resulting .hprof file is a binary format, which can be interpreted by various tools.

jhat, a Java heap analyzer, is provided by the JDK along with jmap. I ran it with tracking of object allocations disabled (-stack false) as well as tracking of references to objects (-refs false), since otherwise, even with 64G of RAM and 32 cores, it took a few hours and eventually crashed. That is due to the insane amount of live objects. On the server I thus ran:

/usr/lib/jvm/java-8-openjdk-amd64/bin/jhat -stack false -refs false gerrit-202009170755.hprof

It spawns a web service which I can reach from my machine over ssh using some port redirection, and then open in a web browser:

ssh  -C -L 8080:ip6-localhost:7000 gerrit1001.wikimedia.org &
xdg-open http://ip6-localhost:8080/

Instance Counts for All Classes (excluding native types)

2237744 instances of class org.eclipse.jgit.lib.ObjectId
2128766 instances of class org.eclipse.jgit.lib.ObjectIdRef$PeeledNonTag
735294 instances of class org.eclipse.jetty.util.thread.Locker
735294 instances of class org.eclipse.jetty.util.thread.Locker$Lock
735283 instances of class org.eclipse.jetty.server.session.Session
...

Another view shows 3.5G of byte arrays.

I got pointed to https://heaphero.io/, however the file is too large to upload and it contains sensitive information (credentials, users' personal information) which we cannot share with a third party.

Nothing really conclusive at this point: the heap dump had been taken shortly after a restart, when Gerrit was not in trouble.

Eventually I found that JavaMelody has a view providing the exact same information without all the trouble of figuring out the proper set of parameters for jmap, jhat and ssh. Just browse to the monitoring page and:

Stuff learned

  • jmap issues commands to the JVM, including taking a heap dump
  • jhat runs the analysis, with some options required to make it workable
  • Use JavaMelody instead

JVM handling of out of memory error

An idea was to take a heap dump whenever the JVM encounters an out of memory error. That can be turned on by passing the extended option HeapDumpOnOutOfMemoryError to the JVM and specifying where the dump will be written to with HeapDumpPath:

java \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/srv/gerrit \
  -jar gerrit.war ...

And sure enough, the next time it ran out of memory:

Nov 07 13:43:35 gerrit2001 java[30197]: java.lang.OutOfMemoryError: Java heap space
Nov 07 13:43:35 gerrit2001 java[30197]: Dumping heap to /srv/gerrit/java_pid30197.hprof ...
Nov 07 13:47:02 gerrit2001 java[30197]: Heap dump file created [35616147146 bytes in 206.962 secs]

This resulted in a 34GB dump file, which was not convenient for a full analysis. Even with 16G of heap for the analysis and a couple of hours of CPU churning, it was not of any help.

And at this point the JVM is still around, the java process is still there and thus systemd does not restart the service for us even though we have instructed it to do so:

/lib/systemd/system/gerrit.service
[Service]
ExecStart=java -jar gerrit.war
Restart=always
RestartSec=2s

That led to our Gerrit replica being down for a whole weekend with no alarm whatsoever (T267517). I imagine the reason for the JVM not exiting on an OutOfMemoryError is to let one investigate the cause. Just like the heap dump, the behavior can be configured, via the ExitOnOutOfMemoryError extended option:

java -XX:+ExitOnOutOfMemoryError

Next time the JVM will exit causing systemd to notice the service went away and so it will happily restart it again.

Stuff learned

  • automatic heap dumping with the JVM for future analysis
  • Be sure to have the JVM exit when running out of memory so systemd will restart the service
  • Process can be up while still not serving its purpose

Side track to jgit cache

When I filed the task, I suspected enabling git protocol version 2 (J199) on CI might have been the root cause. That eventually led me to look at how Gerrit caches git operations. Being a Java application, it does not use the regular git command but a pure Java implementation, jgit, a project started by the same author as Gerrit (Shawn Pearce).

To speed up operations, jgit keeps git objects in memory with various tuning settings. You can read more about it at T263008#6601490 , but in the end it was of no use for the problem. @thcipriani would later point out that jgit cache does not overgrow past its limit:

The investigation was not a good lead, but surely it prompted us to have a better view as to what is going on in the jgit cache. But to do so we would need to expose historical metrics of the status of the cache.

Stuff learned

  • jgit has in-memory caches which hold frequently accessed repositories / objects in the JVM memory, speeding up access to them.
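
To picture what "does not overgrow past its limit" means, here is a minimal sketch of a size-capped, least-recently-used cache. It is not jgit's actual implementation; the class name, the byte limit and the eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class BoundedCache:
    """Illustrative LRU cache capped by total payload size in bytes."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # oldest entry first

    def get(self, key):
        data = self._entries.get(key)
        if data is not None:
            # Mark the entry as recently used.
            self._entries.move_to_end(key)
        return data

    def put(self, key, data):
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = data
        self.used_bytes += len(data)
        # Evict least recently used entries until we fit the limit again.
        while self.used_bytes > self.limit_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)


cache = BoundedCache(limit_bytes=10)
cache.put("a", b"1234")
cache.put("b", b"1234")
cache.get("a")             # "a" becomes the most recently used entry
cache.put("c", b"1234")    # over the limit: "b" is evicted
print(cache.get("b"), cache.used_bytes)
```

A cache built this way can be mis-sized, but it cannot by itself grow without bound, which is why the leak had to be elsewhere.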

Metrics collection

We always had trouble determining whether our jgit cache was properly sized and tuned it randomly with little information. Eventually I found out that Gerrit does have a wide range of metrics available which are described at https://gerrit.wikimedia.org/r/Documentation/metrics.html . I always wondered how we could access them without having to write a plugin.

The first step was to add the metrics-reporter-jmx plugin. It registers all the metrics with JMX, a Java system to manage resources. That is then exposed by JavaMelody and at least let us browse the metrics:

I long had a task to get those metrics exposed (T184086) but never had a strong enough incentive to work on it. The idea was to expose those metrics to the Prometheus monitoring system which would scrape them and make them available in Grafana. They can be exposed using the metrics-reporter-prometheus plugin. There is some configuration required to create an authentication token that lets Prometheus scrape the metrics and it is then all set and collected.
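
For readers unfamiliar with scraping, the sketch below shows roughly what a scraper does: send a GET request carrying a bearer token, then read "name value" lines from the text response. The URL, the token and the metric names are hypothetical, no request is actually sent, and the parser only handles simple lines without timestamps.

```python
import urllib.request


def build_scrape_request(url, token):
    """Prepare a GET request carrying the bearer token; nothing is sent here."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def parse_metrics(text):
    """Parse simple 'name value' lines of the Prometheus text format."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


# Hypothetical endpoint, token and payload, for illustration only.
request = build_scrape_request("https://gerrit.example.org/metrics", "s3cret")
SAMPLE = """\
# HELP example_cache_hits Hypothetical counter.
# TYPE example_cache_hits counter
example_cache_hits 1027
example_cache_misses 42
"""
metrics = parse_metrics(SAMPLE)
print(request.get_header("Authorization"), metrics)
```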

In Grafana, discovering which metrics are of interest might be daunting. Surely for the jgit cache it is only a few metrics we are interested in and crafting a basic dashboard for it is simple enough. But since we now collect all those metrics, surely we should have dashboards for anything that could be of interest to us.

While browsing the Gerrit upstream repositories, I found an unadvertised repository: gerrit/gerrit-monitoring. The project aims at deploying to Kubernetes a monitoring stack for Gerrit composed of Grafana, Loki, Prometheus and Promtail. While browsing the code, I found out they already had a Grafana template which I could import to our Grafana instance with a few small modifications.

During the Gerrit Virtual Summit I raised that as a potentially interesting project for the whole community and surely a few days later:

In the end we have a few useful Grafana dashboards, the ones imported from the gerrit-monitoring repo are suffixed with (upstream): https://grafana.wikimedia.org/dashboards/f/5AnaHr2Mk/gerrit

And I crafted one dedicated to jgit cache: https://grafana.wikimedia.org/d/8YPId9hGz/jgit-block-cache

Stuff learned

  • Prometheus scraping system with auth token
  • Querying Prometheus metrics in Grafana and its vector selection mechanism
  • Other Gerrit administrators have already created visualizations
  • Raising our reuse prompted upstream to advertise their solution further, which hopefully has led to more adoption.

Despair

After a couple months, there was no good lead. The issue has been around for a while, in a programming language I don't know with assisting tooling completely alien to me. I even found jcmd to issue commands to the JVM, such as dumping a class histogram, the same view provided by JavaMelody:

$ sudo -u gerrit2 jcmd 2347 GC.class_histogram
num     #instances         #bytes  class name
----------------------------------------------
   5:      10042773     1205132760  org.eclipse.jetty.server.session.SessionData
   8:      10042773      883764024  org.eclipse.jetty.server.session.Session
  11:      10042773      482053104  org.eclipse.jetty.server.session.Session$SessionInactivityTimer$1
  13:      10042779      321368928  org.eclipse.jetty.util.thread.Locker
  14:      10042773      321368736  org.eclipse.jetty.server.session.Session$SessionInactivityTimer
  17:      10042779      241026696  org.eclipse.jetty.util.thread.Locker$Lock

That is quite handy when already in a terminal: it saves a few clicks to switch to a browser, head to JavaMelody and find the link.
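
Since the histogram is plain text, it is easy to post-process. The sketch below is only an illustration (the parser is mine, not a JVM tool): it reads the rows quoted above and adds up the bytes held by the session-related classes.

```python
HISTOGRAM = """\
 num     #instances         #bytes  class name
----------------------------------------------
   5:      10042773     1205132760  org.eclipse.jetty.server.session.SessionData
   8:      10042773      883764024  org.eclipse.jetty.server.session.Session
  11:      10042773      482053104  org.eclipse.jetty.server.session.Session$SessionInactivityTimer$1
  13:      10042779      321368928  org.eclipse.jetty.util.thread.Locker
  14:      10042773      321368736  org.eclipse.jetty.server.session.Session$SessionInactivityTimer
  17:      10042779      241026696  org.eclipse.jetty.util.thread.Locker$Lock
"""


def parse_histogram(text):
    """Return (class name, instances, bytes) for each 'rank: n bytes class' row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly four fields and a rank ending with ':'.
        if len(fields) == 4 and fields[0].endswith(":"):
            rows.append((fields[3], int(fields[1]), int(fields[2])))
    return rows


rows = parse_histogram(HISTOGRAM)
total_bytes = sum(size for _, _, size in rows)
print(f"{len(rows)} classes, {total_bytes / 1e9:.2f} GB")
```

Summed up, those six classes alone account for roughly 3.45 GB of heap.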

But it is the last week of work of the year.

Christmas is in two days.

Kids are messing up all around the home office since we are under lockdown.

Despair.

Out of rage I just stalled the task, shamelessly hoping for the Java 11 and Gerrit 3.3 upgrades to solve this. Much like we hoped the system would be fixed by upgrading.

Wait..

10 million?

TEN MILLION ??

TEN TO THE POWER OF SEVEN ???

WHY ARE THERE TEN MILLION HTTP SESSIONS HELD IN GERRIT !!!!!!?11??!!??

10042773  org.eclipse.jetty.server.session.SessionData

There. Right there. It was there since the start. In plain sight. And surely 19 hours later Gerrit had created 500k sessions for 56 MBytes of memory. It is slowly but surely leaking memory.

Stuff learned

  • Everything clears up once one has found the root cause

When upstream saves you

At this point it was just an intuition, albeit a strong one. I do not know much about Java or Gerrit internals, so I turned to upstream developers for further assistance. But first, I had to reproduce the issue and investigate a bit more to give as many details as possible when filing a bug report.

Reproduction

I copied a small heap dump I took just a few minutes after Gerrit got restarted; it had a manageable size, making it easier to investigate. Since I am not that familiar with the Java debugging tools, I went with what I call a clickodrome interface, a UI that lets you interact solely with mouse clicks: https://visualvm.github.io/

Once the heap dump is loaded, I could easily access objects. Notably the org.eclipse.jetty.server.session.Session objects had a property expiry=0, often an indication of no expiry at all. Expired sessions are cleared by Jetty via a HouseKeeper thread which inspects sessions and deletes expired ones. I have confirmed it does run every 600 seconds, but since sessions are set to not expire, they pile up leading to the memory leak.

On December 24th, a day before Christmas, I filed a private security issue to upstream (now public): https://bugs.chromium.org/p/gerrit/issues/detail?id=13858

After the Christmas and weekend break upstream acknowledged and I did more investigating to pinpoint the source of the issue. The sessions are created by a SessionHandler and debug logs show: dftMaxIdleSec=-1 or Default maximum idle seconds set to -1, which means that by default the sessions are created without any expiry. The Jetty debug log then gave a bit more insight:

DEBUG org.eclipse.jetty.server.session : Session xxxx is immortal && no inactivity eviction

It is immortal and is thus never picked up by the session cleaner:

DEBUG org.eclipse.jetty.server.session : org.eclipse.jetty.server.session.SessionHandler
==dftMaxIdleSec=-1 scavenging session ids []
                                          ^^^ --- empty array
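
To make the mechanism concrete, here is a toy model of it. This is not Jetty's code: the class and function names are invented, and the only behaviour taken from the findings above is that a non-positive maximum idle time yields a session with expiry 0, which the periodic cleaner never considers expired.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    expiry: float  # absolute deadline in seconds; 0 means "never expires"

    def is_expired(self, now):
        return self.expiry > 0 and self.expiry <= now


def new_session(session_id, now, max_idle_sec):
    # A non-positive maximum idle time yields an immortal session.
    expiry = now + max_idle_sec if max_idle_sec > 0 else 0
    return Session(session_id, expiry)


def scavenge(sessions, now):
    """Drop expired sessions and return the ids that were removed."""
    expired = [sid for sid, s in sessions.items() if s.is_expired(now)]
    for sid in expired:
        del sessions[sid]
    return expired


sessions = {}
for i in range(3):
    sid = f"immortal-{i}"
    sessions[sid] = new_session(sid, now=0, max_idle_sec=-1)
sessions["mortal"] = new_session("mortal", now=0, max_idle_sec=300)

removed = scavenge(sessions, now=600)  # one housekeeping pass after 600 seconds
print(removed, len(sessions))
```

Only the session created with a positive idle time is collected; the others stay in memory however many times the cleaner runs.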

Our Gerrit instance has several plugins and the leak can potentially come from one of them. I then booted a dummy Gerrit on my machine (java -jar gerrit-3.3.war), cloned the built-in All-Projects.git repository repeatedly and observed objects with VisualVM. Jetty sessions with no expiry were created, which rules out plugins and points at Gerrit itself. Upstream developer Luca Milanesio pointed out that Gerrit creates a Jetty session which is intended for plugins. I have also narrowed down the leak to only be triggered by git operations made over HTTP. Eventually, by commenting out a single line of Gerrit code, I eliminated the memory leak and upstream pointed at a change released a few versions ago that may have been the cause.

Upstream then went on to reproduce on their side, took some measurements before and after commenting out the line and confirmed the leak (750 bytes for each git request made over HTTP). Given the amount of traffic we received from humans, systems or bots, it is not surprising we ended up hitting the JVM memory limit rather quickly.
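
A back-of-the-envelope estimate shows how such a small leak adds up. The session count, the 19 hour window and the 750 bytes come from this post; the heap budget is a made-up figure, and I assume one leaked session per git request over HTTP.

```python
SESSIONS_OBSERVED = 500_000        # sessions created (from the post)
HOURS_OBSERVED = 19                # observation window (from the post)
BYTES_PER_REQUEST = 750            # upstream's measurement (from the post)
HEAP_BUDGET_BYTES = 8 * 1024**3    # hypothetical: 8 GiB of heap left to leak into

# Assumption: each session corresponds to one leaking git-over-HTTP request.
sessions_per_hour = SESSIONS_OBSERVED / HOURS_OBSERVED
leak_bytes_per_hour = sessions_per_hour * BYTES_PER_REQUEST
hours_to_exhaust = HEAP_BUDGET_BYTES / leak_bytes_per_hour

print(f"{sessions_per_hour:.0f} sessions/hour")
print(f"{leak_bytes_per_hour / 1e6:.1f} MB leaked per hour")
print(f"{hours_to_exhaust / 24:.1f} days to leak the hypothetical budget")
```

Under those assumptions the leak is around 20 MB per hour, so a server left alone would run out of that budget in less than three weeks.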

Eventually the fix was merged and new Gerrit versions were released. We upgraded to the new release and haven't restarted Gerrit since then. Problem solved!

Stuff learned

  • Even with no knowledge about a programming language, if you can build and run it, you can still debug using print or the universal optimization operator: //.
  • Quickly acknowledge upstream hints, ideas and recommendations. Even if it is to dismiss one of their leads.
  • Write a report, this blog.

Thank you upstream developers Luca Milanesio and David Ostrovsky for fixing the issue!

Thank you @dancy for the added clarifications as well as typos and grammar fixes.


Production Excellence #29: February 2021

01:03, Saturday, 06 2021 March UTC

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

3 documented incidents last month, [1] which is average for the time of year. [2]

Learn about these incidents at Incident status on Wikitech, and their Preventive measures in Phabricator.

For those with NDA-restricted access, there may be additional private incident reports 🔒 available.

💡 Did you know: Our Incident reports have switched to using the ISO date format in their titles and listings, for improved readability and edit-ability (esp. when publishing on a later date). So long 20210221, and hello 2021-02-21!

📊 Trends

In February we saw a continuation of the new downward trend that began this January, which came after twelve months of continued rising. Let's make sure this trend sticks with us as we work our way through the debt, whilst also learning to have a healthy week-to-week iteration where we monitor and follow-up on any new developments such that they don't introduce lasting regressions.

The recent tally (issues filed since we started reporting in March 2019) is down to 138 unresolved errors, from 152 last month. The old backlog (pre-2019 issues) also continued its 5-month streak and is down to 148, from 160 last month. If this progress continues we'll soon have fewer "Old" issues than "Recent" issues. Possibly by the start of 2022 we may be able to report and focus only on our rotation through recent issues, as hopefully by then we are balancing our work such that issues reported this month are addressed mostly in the same month, or otherwise later that quarter within 2-3 months. Visually that would manifest as the colored chunks having a short life on the chart with each drawn at a sharp downwards angle – instead of dragged out where it was building up an ever-taller shortcake. I do like cake, but I prefer the kind I can eat. 🍰

Month-over-month plots based on spreadsheet data. [3] [4]


📖 Outstanding errors

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 issues left): no change.
  • ⚠️ August 2019 (1 of 14 issues): no change.
  • ⚠️ October 2019 (4 of 12 issues): no change.
  • ⚠️ November 2019 (1 of 5 issues): no change.
  • ⚠️ December 2019 (1 of 9 issues): One task resolved (-1).
  • ⚠️ January 2020 (2 of 7 issues): no change.
  • ⚠️ February 2020 (1 of 7 issues): no change.
  • ⚠️ March 2020 (2 of 2 issues): no change.
  • April 2020 (9 of 14 issues left): no change.
  • May 2020 (6 of 14 issues left): no change.
  • June 2020 (7 of 14 issues left): no change.
  • July 2020 (9 of 24 new issues): no change.
  • August 2020 (20 of 53 new issues): Two tasks resolved (-2).
  • September 2020 (9 of 33 new issues): Five tasks resolved (-5).
  • October 2020 (26 of 69 new issues): Five tasks resolved (-5).
  • November 2020 (11 of 38 new issues): Three tasks resolved (-3).
  • December 2020 (12 of 33 new issues): Seven tasks resolved (-7).
  • January 2021 (5 of 50 new issues): Two tasks resolved (-2).
  • February 2021: 11 of 20 new issues survived the month and remained unresolved (+20; -9)
Recent tally
152 issues open, as of Excellence #28 (16 Feb 2021).
-25 issues closed since, of the previous 152 open issues.
+11 new issues that survived Feb 2021.
138 issues open, as of today 5 Mar 2021.

For the on-going month of March 2021, we've got 12 new issues so far.

Take a look at the workboard and look for tasks that could use your help!

View Workboard


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incident status Wikitech.
[2] Wikimedia incident stats by Krinkle, CodePen.
[3] Month-over-month, Production Excellence spreadsheet.
[4] Open tasks, Wikimedia-prod-error, Phabricator.

Production Excellence #28: January 2021

23:57, Friday, 05 2021 March UTC

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

1 documented incident last month. That's the third month in a row that we are at or near zero major incidents – not bad! [1] [2]

Learn about recent incidents at Incident status on Wikitech, or Preventive measures in Phabricator.

💡 Did you know: Our Incident status page provides a green-yellow status reflection over the past ten days, with a link to the most recent incident doc if there was any during that time.

📊 Trends

This January saw a small recovery in our otherwise negative upward trend. For the first time in twelve months, more reports were closed than new reports outlived the previous month without resolution. What happened twelve months ago? In January 2020, we also saw a small recovery during the otherwise upward trend before and after it.

Perhaps it's something about the post-December holidays that temporarily improves the quality and/or reduces the quantity of code changes. Only time will tell if this is the start of a new positive trend, or merely a post-holiday break. [3]

While our month-to-month trend might not (yet) be improving, we do see persistent improvements in our overall backlog of pre-2019 reports. This is in part because we generally don't file new reports there, so it makes sense that it doesn't go back up, but it's still good to see downward progress every month, unlike with reports from more recent months which often see no change month-to-month (see "Outstanding errors" below, for example).

This positive trend on our "Old" backlog started in October 2020 and has consistently progressed every month since then (refer to the "Old" numbers in red on the below chart, or the same column in the spreadsheet). [3][4]


📖 Outstanding errors

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 issues left): no change.
  • ⚠️ August 2019 (1 of 14 issues): no change.
  • ✅ September 2019 (0 of 12 issues): Last two tasks were resolved (-2).
  • ⚠️ October 2019 (4 of 12 issues): One task resolved (-1).
  • ⚠️ November 2019 (1 of 5 issues): no change.
  • ⚠️ December 2019 (2 of 9 issues): Two tasks resolved (-2).
  • ⚠️ January 2020 (2 of 7 issues): no change.
  • ⚠️ February 2020 (1 of 7 issues left): One task resolved (-1).
  • March 2020 (2 of 2 issues left): no change.
  • April 2020 (9 of 14 issues left): no change.
  • May 2020 (6 of 14 issues left): One task resolved (-1).
  • June 2020 (7 of 14 issues left): no change.
  • July 2020 (9 of 24 new issues): no change.
  • August 2020 (22 of 53 new issues): One task resolved (-1).
  • September 2020 (13 of 33 new issues): One task resolved (-1).
  • October 2020 (31 of 69 new issues): Four tasks fixed (-4).
  • November 2020 (14 of 38 new issues): no change.
  • December 2020 (19 of 33 new issues): Three tasks resolved (-3).
  • January 2021: 7 of 50 new issues survived the month and remained unresolved (+50; -43)
Recent tally
160 issues open, as of Excellence #27 (4 Feb 2021).
-15 issues closed since, of the previous 160 open issues.
+7 new issues that survived January 2021.
152 issues open, as of today (16 Feb 2021).

January saw +50 new production errors reported in a single month, which is an unfortunate all-time high. However, we've also done remarkably well on addressing 43 of them within a month, when the potential root cause and diagnostics data were still fresh in our minds. Well done!

For the on-going month of February, there have been 16 new issues reported so far.

Take a look at the workboard and look for tasks that could use your help!

View Workboard


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incident status Wikitech.
[2] Wikimedia incident stats by Krinkle, CodePen.
[3] Month-over-month, Production Excellence spreadsheet.
[4] Open tasks, Wikimedia-prod-error, Phabricator.

Gerrit now automatically adds reviewers

10:19, Friday, 05 2021 March UTC
WARNING: 20210305 the reviewers by blame Gerrit plugin got disabled after it got announced by this blog post. It turns out the author of a change is not necessarily an adequate reviewer suggestion in our context and some were being added as reviewers for a whole lot more code than they would expect. The post still has some worthy information as to how one can find reviewers.

Finding reviewers for a change is often a challenge, especially for a newcomer or folks proposing changes to projects they are not familiar with. Since January 16th, 2019, Gerrit automatically adds reviewers on your behalf based on who last changed the code you are affecting.

Antoine "@hashar" Musso explains what led us to enable that feature and how to configure it to fit your project. He also offers tips on how to seek more reviewers, based on years of experience.


When uploading a new patch, reviewers should be added automatically: that is the subject of task T91190, opened almost four years ago (March 2015). I declined the task since we already have the Reviewer bot (see section below), but @Tgr found a plugin for Gerrit which analyzes the code history with git blame and uses that to determine potential reviewers for a change. It took us a while to add that particular Gerrit plugin and the first version we installed was not compatible with our Gerrit version. The plugin was upgraded yesterday (Jan 16th) and is working fine (T101131).

Let's have a look at the functionality the plugin provides, and how it can be configured per repository. I will then offer a refresher of how one can search for reviewers based on git history.

Reviewers by blame plugin

NOTE: the reviewers by blame plugin was removed the day after this announcement got posted. This section thus does not apply to the Wikimedia Gerrit instance anymore. It is left here for historical reasons.

The Gerrit plugin looks at the affected code using git blame and extracts the top three past authors, which are then added as reviewers to the change on your behalf. Added reviewers will thus receive a notification showing you have asked them for code review.

The configuration is done on a per project basis and inherits from the parent project. Without any tweaks, your project inherits the configuration from All-Projects. If you are a project owner, you can adjust the configuration. As an example the configuration for operations/mediawiki-config which shows inherited values and an exception to not process a file named InitialiseSettings.php:

The three settings are described in the documentation for the plugin:

plugin.reviewers-by-blame.maxReviewers
The maximum number of reviewers that should be added to a change by this plugin.
By default 3.

plugin.reviewers-by-blame.ignoreFileRegEx
Ignore files where the filename matches the given regular expression when computing the reviewers. If empty or not set, no files are ignored.
By default not set.

plugin.reviewers-by-blame.ignoreSubjectRegEx
Ignore commits where the subject of the commit messages matches the given regular expression. If empty or not set, no commits are ignored.
By default not set.
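
To show how the three settings interact, here is a rough sketch of the selection logic. It is not the plugin's code: the function, the input shape (one tuple per blamed line) and the sample authors are invented, and I assume the regular expressions are applied as a search rather than a full match.

```python
import re
from collections import Counter


def pick_reviewers(blamed_lines, max_reviewers=3,
                   ignore_file_regex=None, ignore_subject_regex=None):
    """blamed_lines: iterable of (file_path, commit_subject, author) tuples,
    one per affected line, as a blame pass over the change might yield."""
    counts = Counter()
    for path, subject, author in blamed_lines:
        if ignore_file_regex and re.search(ignore_file_regex, path):
            continue  # mirrors ignoreFileRegEx
        if ignore_subject_regex and re.search(ignore_subject_regex, subject):
            continue  # mirrors ignoreSubjectRegEx
        counts[author] += 1
    # mirrors maxReviewers: keep the authors owning the most lines
    return [author for author, _ in counts.most_common(max_reviewers)]


# Hypothetical blame data for illustration.
blamed = [
    ("includes/Foo.php", "Fix bug", "alice"),
    ("includes/Foo.php", "Fix bug", "alice"),
    ("includes/Foo.php", "Refactor", "bob"),
    ("wmf-config/InitialiseSettings.php", "Config tweak", "carol"),
    ("wmf-config/InitialiseSettings.php", "Config tweak", "carol"),
    ("wmf-config/InitialiseSettings.php", "Config tweak", "carol"),
]
everyone = pick_reviewers(blamed)
filtered = pick_reviewers(blamed, ignore_file_regex=r"InitialiseSettings\.php$")
print(everyone, filtered)
```

With the file exception in place, the author who only touched InitialiseSettings.php is no longer suggested.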

By making past authors aware of a change to code they previously altered, I believe you will get more reviews and hopefully get your changes approved faster.

Previously we had other methods to add reviewers, one opt-in based and the others being cumbersome manual steps. They should be used to complement the Gerrit reviewers by blame plugin, and I am giving an overview of each of them in the following sections.

Gerrit watchlist

The original system from Gerrit lets you watch projects, similar to a user watch list on MediaWiki. In Gerrit preferences, one can get notified for new changes, patchsets, comments... Simply indicate a repository, optionally a search query and you will receive email notifications for matching events.

The attached image shows my watched projects configuration. I thus receive notifications for any changes made to the integration/config repository as well as for changes in mediawiki/core which affect either composer.json or one of the Wikimedia deployment branches for that repo.

One drawback is that we cannot watch a whole hierarchy of projects such as mediawiki and all its descendants, which would be helpful to watch our deployment branch. It is still useful when you are the primary maintainer of a repository since you can keep track of all activity for the repository.

Reviewer bot

The reviewer bot was written by Merlijn van Deen (@valhallasw). It is similar to the Gerrit watched projects feature, with some major benefits:

  • watcher is added as a reviewer, the author thus knows you were notified
  • it supports watching a hierarchy of projects (eg: mediawiki/*)
  • the file/branch filtering might be easier to grasp compared to Gerrit search queries
  • the watchers are stored at a central place which is public to anyone, making it easy to add others as reviewers.

One registers reviewers on a single wiki page: https://www.mediawiki.org/wiki/Git/Reviewers.

Each repository filter is a wikitext section (eg: === mediawiki/core ===) followed by a wikitext template and a file filter using python fnmatch. Some examples:

Listen to any changes that touch i18n:

== Listen to repository groups ==
=== * ===
* {{Gerrit-reviewer|JohnDoe|file_regexp=<nowiki>i18n</nowiki>}}

Listen to MediaWiki core search related code:

=== mediawiki/core ===
* {{Gerrit-reviewer|JaneDoe|file_regexp=<nowiki>^includes/search/</nowiki>}}
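
The matching can be pictured with the sketch below. It is not the bot's code: the table layout and function name are invented, and I assume repository sections are matched with fnmatch while file_regexp is searched for in each changed path.

```python
import re
from fnmatch import fnmatch

# (repository pattern, reviewer, file_regexp or None), mirroring the two examples.
WATCHERS = [
    ("*", "JohnDoe", r"i18n"),
    ("mediawiki/core", "JaneDoe", r"^includes/search/"),
]


def reviewers_for(repository, changed_files):
    """Return the reviewers whose filters match this change."""
    found = []
    for repo_pattern, reviewer, file_regexp in WATCHERS:
        if not fnmatch(repository, repo_pattern):
            continue
        if file_regexp is None or any(re.search(file_regexp, f) for f in changed_files):
            found.append(reviewer)
    return found


core_search = reviewers_for("mediawiki/core", ["includes/search/SearchEngine.php"])
extension_i18n = reviewers_for("mediawiki/extensions/Foo", ["i18n/en.json"])
print(core_search, extension_i18n)
```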

The system works great, given maintainers remember to register on the page and that the files are not moved around. The bot is not that well known though and most repositories do not have any reviewers listed.

Inspecting git history

Another source of reviewers is the git history: one can easily retrieve a list of past authors, who should be good candidates to review code. I typically use git shortlog --summary --no-merges for that (--no-merges filters out merge commits crafted by Gerrit when a change is submitted). Example for the MediaWiki job queue system:

$ git shortlog --no-merges --summary --since "one year ago" includes/jobqueue/|sort -n|tail -n4
     3 Petr Pchelko
     4 Brad Jorsch
     4 Umherirrender
    16 Aaron Schulz

Which gives me four candidates that acted on that directory over a year.

Past reviewers from git notes

When a patch is merged, Gerrit records the votes and the canonical URL of the change in git notes. They are available under refs/notes/review. Once the notes are fetched, they can be shown in git show or git log by passing --show-notes=review: for each commit, the notes are displayed after the commit message and show votes among other metadata:

$ git fetch origin refs/notes/review:refs/notes/review  # assumes the Gerrit remote is named "origin"
$ git log --no-merges --show-notes=review -n1
commit e1d2c92ac69b6537866c742d8e9006f98d0e82e8
Author: Gergő Tisza <tgr.huwiki@gmail.com>
Date:   Wed Jan 16 18:14:52 2019 -0800

    Fix error reporting in MovePage
    
    Bug: T210739
    Change-Id: I8f6c9647ee949b33fd4daeae6aed6b94bb1988aa

Notes (review):
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Verified+2: jenkins-bot
    Submitted-by: jenkins-bot
    Submitted-at: Thu, 17 Jan 2019 05:02:23 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/484825
    Project: mediawiki/core
    Branch: refs/heads/master

And I can then get the list of authors that previously voted Code-Review +2 for a given path. Using the previous example of includes/jobqueue/ over a year, the list is slightly different:

$ git log --show-notes=review --since "1 year ago" includes/jobqueue/|grep 'Code-Review+2:'|sort|uniq -c|sort -n|tail -n5
      2     Code-Review+2: Umherirrender <umherirrender_de.wp@web.de>
      3     Code-Review+2: Jforrester <jforrester@wikimedia.org>
      3     Code-Review+2: Mobrovac <mobrovac@wikimedia.org>
      9     Code-Review+2: Aaron Schulz <aschulz@wikimedia.org>
     18     Code-Review+2: Krinkle <krinklemail@gmail.com>

User Krinkle has approved a lot of patches, even though he doesn't show up in the list of authors obtained by the previous method (inspecting git history).

Conclusion

The Gerrit reviewers by blame plugin acts automatically, which offers a good chance your newly uploaded patch will get reviewers added out of the box. For finer tweaking one should register as a reviewer on https://www.mediawiki.org/wiki/Git/Reviewers which benefits everyone. The last course of action is to complement those with the git log history.

For any remarks, support, concerns, reach out on IRC freenode channel #wikimedia-releng or file a task in Phabricator.

Thank you @thcipriani for the proofreading and English fixes.

Production Excellence #27: December 2020

18:35, Thursday, 04 2021 February UTC

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

1 documented incident in December. [1] In previous years, December typically had 4 or fewer documented incidents. [3]

Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.


📊 Trends

Month-over-month plots based on spreadsheet data. [4] [2]


📖 Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 issues left): no change.
  • ⚠️ August 2019 (1 of 14 issues): no change.
  • ⚠️ September 2019 (2 of 12 issues): One task resolved (-1).
  • ⚠️ October 2019 (5 of 12 issues): no change.
  • ⚠️ November 2019 (1 of 5 issues): no change.
  • ⚠️ December 2019 (4 of 9 issues): no change.
  • ⚠️ January 2020 (2 of 7 issues): no change.
  • February 2020 (2 of 7 issues left): no change.
  • March 2020 (2 of 2 issues left): no change.
  • April 2020 (9 of 14 issues left): no change.
  • May 2020 (7 of 14 issues left): no change.
  • June 2020 (7 of 14 issues left): no change.
  • July 2020 (9 of 24 new issues): no change.
  • August 2020 (23 of 53 new issues): no change.
  • September 2020 (13 of 33 new issues): One task resolved (-1).
  • October 2020 (35 of 69 new issues): Four issues fixed (-4).
  • November 2020 (14 of 38 new issues): Five issues fixed (-5).
  • December 2020: 22 of 33 new issues survived the month and remained unresolved (+33; -11)
Recent tally
149 as of Excellence #26 (15 Dec 2020).
-11 closed of the 149 recent issues.
+22 new issues survived December 2020.
160 as of 27 Jan 2021.

🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


Footnotes:

[1] Incident documentation 2020, Wikitech.
[2] Open tasks, Wikimedia-prod-error, Phabricator.
[3] Wikimedia incident stats by Krinkle, CodePen.
[4] Month-over-month, Production Excellence spreadsheet.

Production Excellence #26: November 2020

18:34, Thursday, 04 2021 February UTC

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

Zero documented incidents in November. [1] That's the only month this year without any (publicly documented) incidents. In 2019, November was also the only such month. [3]

Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.


📊 Trends

The overall increase in errors was relatively low this past month, similar to the November-December period last year.

What's new is that we can start to see a positive trend emerging in the backlogs where we've shrunk issue count three months in a row, from the 233 high in October, down to the 181 we have in the ol' backlog today.

Month-over-month plots based on spreadsheet data. [4]


📖 Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

  • ⚠️ July 2019 (2 of 18 tasks): One task closed (-1).
  • ⚠️ August 2019 (1 of 14 tasks): no change.
  • ⚠️ September 2019 (3 of 12 tasks): no change.
  • ⚠️ October 2019 (5 of 12 tasks): no change.
  • ⚠️ November 2019 (1 of 5 tasks): no change.
  • ⚠️ December 2019 (3 of 9 tasks left): no change.
  • January 2020 (3 of 7 tasks left): One task closed (-1).
  • February (2 of 7 tasks left): no change.
  • March (2 of 2 tasks left): no change.
  • April (9 of 14 tasks left): no change.
  • May (7 of 14 tasks left): no change.
  • June (7 of 14 tasks left): no change.
  • July 2020 (9 of 24 new tasks): no change.
  • August 2020 (23 of 53 new tasks): Three tasks closed (-3).
  • September 2020 (14 of 33 new tasks): One task closed (-1).
  • October 2020 (39 of 69 new tasks): Six tasks closed (-6).
  • November 2020: 19 of 38 new tasks survived the month and remain open today (+38; -19)
Recent tally
142 as of Excellence #25 (23 Oct 2020).
-12 closed of the 142 recent tasks.
+19 survived November 2020.
149 as of today, 15 Dec 2020.

The on-going month of December has 19 unresolved tasks so far.


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


❝   The plot "thickens" as they say. Why, by the way? Is it a soup metaphor? ❞

Footnotes:

[1] Incident documentation 2020, Wikitech.
[2] Open tasks, Wikimedia-prod-error, Phabricator.
[3] Wikimedia incident stats, Krinkle, CodePen.
[4] Month-over-month, Production Excellence (spreadsheet).

Perf Matters at Wikipedia in 2015

00:33, Thursday, 31 2020 December UTC

Hello, WANObjectCache

This year we achieved another milestone in our multi-year effort to prepare Wikipedia for serving traffic from multiple data centres.

The MediaWiki application that powers Wikipedia relies heavily on object caching. We use Memcached as a horizontally scaled key-value store, and we’d like to keep the cache local to each data centre. This minimises dependencies between data centres, and makes better use of storage capacity (based on local needs).

Aaron Schulz devised a strategy that makes MediaWiki caching compatible with the requirements of a multi-DC architecture. Previously, when source data changed, MediaWiki would recompute and replace the cache value. Now, MediaWiki broadcasts “purge” events for cache keys. Each data centre receives these and sets a “tombstone”, a marker lasting a few seconds that limits any set-value operations for that key to a minuscule time-to-live. This makes it tolerable for recache-on-miss logic to recompute the cache value using local replica databases, even though they might have several seconds of replication lag. Heartbeats are used to detect the replication lag of the databases involved during any re-computation of a cache value. When that lag is more than a few seconds (a large portion of the tombstone period), the corresponding cache set-value operation automatically uses a low time-to-live. This means that large amounts of replication lag are tolerated.

This and other aspects of WANObjectCache’s design allow MediaWiki to trust that cached values are not substantially more stale than a local replica database, provided that cross-DC broadcasting of tiny in-memory tombstones is not disrupted.
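
To make the tombstone idea concrete, here is a minimal Python sketch of one data centre's cache. It is not the real WANObjectCache interface: the class name, the method names, and the three constants are assumptions picked for illustration, and the post only says "a few seconds" for the real values.

```python
import time

# All three values are assumptions for illustration; the post only says "a few seconds".
TOMBSTONE_TTL = 10.0  # how long a purge tombstone lasts
LOW_TTL = 1.0         # tiny TTL applied while a value might be stale
MAX_SAFE_LAG = 3.0    # replication lag above which a set gets LOW_TTL


class ToyWanCache:
    """In-memory stand-in for one data centre's cache (hypothetical API)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._values = {}      # key -> (value, expires_at)
        self._tombstones = {}  # key -> tombstone expires_at

    def purge(self, key):
        """Handle a broadcast purge: drop the value and set a tombstone."""
        self._values.pop(key, None)
        self._tombstones[key] = self._clock() + TOMBSTONE_TTL

    def _tombstoned(self, key):
        return self._tombstones.get(key, float("-inf")) > self._clock()

    def set(self, key, value, ttl, replica_lag=0.0):
        """Store a value; clamp the TTL if the value could be stale."""
        if self._tombstoned(key) or replica_lag > MAX_SAFE_LAG:
            ttl = min(ttl, LOW_TTL)
        self._values[key] = (value, self._clock() + ttl)
        return ttl  # the TTL that was actually applied

    def get(self, key):
        entry = self._values.get(key)
        if entry is None or entry[1] <= self._clock():
            return None
        return entry[0]

    def get_or_compute(self, key, ttl, compute, replica_lag=0.0):
        """Recache-on-miss: compute from the local replica and store the result."""
        value = self.get(key)
        if value is None:
            value = compute()
            self.set(key, value, ttl, replica_lag)
        return value
```

In this sketch a value recomputed right after a purge only lives for `LOW_TTL`, so a stale read from a lagged replica corrects itself quickly instead of being cached for the full TTL.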


First paint time now under 900ms

In July we set out a goal: improve page load performance so our median first paint time would go down from approximately 1.5 seconds to under a second – and stay under it!

I identified synchronous scripts as the single-biggest task blocking the browser, between the start of a page navigation and the first visual change seen by Wikipedia readers. We had used async scripts before, but converting these last two scripts to be asynchronous was easier said than done.

There were several blockers to this change, including the use of embedded scripts by interactive features. These were partly migrated to CSS-only solutions. For the other features, we introduced the notion of “delayed inline scripts”. Embedded scripts now wrap their code in a closure and add it to an array. After the module loader arrives, we process the closures from the array and execute the code within.
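
The queue-of-closures pattern can be sketched in a few lines. The production code is JavaScript; the Python below only mirrors the control flow, and the class and method names are made up for this illustration.

```python
class DelayedScriptQueue:
    """Collects closures until the module loader is available (hypothetical names)."""

    def __init__(self):
        self._pending = []
        self._loader = None

    def push(self, closure):
        """Queue a closure, or run it at once if the loader has already arrived."""
        if self._loader is None:
            self._pending.append(closure)
        else:
            closure(self._loader)

    def loader_arrived(self, loader):
        """Record the loader and run every queued closure in insertion order."""
        self._loader = loader
        pending, self._pending = self._pending, []
        for closure in pending:
            closure(loader)
```

Embedded scripts call `push()` without caring whether the loader exists yet, which is what lets the loader itself be fetched asynchronously.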

Another major blocker was the subset of community-developed gadgets that didn’t yet use the module loader (introduced in 2011). These legacy scripts assumed a global scope for variables, and depended on browser behaviour specific to serially loaded, synchronous, scripts. Between July 2015 and August 2015, I worked with the community to develop a migration guide. And, after a short deprecation period, the legacy loader was removed.


Hello, WebPageTest

Previously, we only collected performance metrics for Wikipedia from sampled real-user page loads. This is super and helps detect trends, regressions, and other changes at large. But, to truly understand the characteristics of what made a page load a certain way, we need synthetic testing as well.

Synthetic testing offers frame-by-frame video captures, waterfall graphs, performance timelines, and above-the-fold visual progression. We can run these automatically (e.g. every hour) for many URLs, on many different browsers and devices, and from different geographic locations. These tests allow us to understand the performance, and analyse it. We can then compare runs over any period of time, and across different factors. It also gives us snapshots of how pages were built at a certain point in time.

The results are automatically recorded into a database every hour, and we use Grafana to visualise the data.

In 2015 Peter built out the synthetic testing infrastructure for Wikimedia, from scratch. We use the open-source WebPageTest software. To read more about its operation, check Wikitech.


The journey to Thumbor begins

Gilles evaluated various thumbnailing services for MediaWiki. The open-source Thumbor software came out as the most promising candidate.

Gilles implemented support for Thumbor in the MediaWiki-Vagrant development environment.

To read more about our journey to Thumbor, read The Journey to Thumbor (part 1).


Save timing reduced by 50%

Save timing is one of the key performance metrics for Wikipedia. It measures the time from when a user presses “Publish changes” when editing – until the user’s browser starts to receive a response. During this time, many things happen. MediaWiki parses the wiki-markup into HTML, which can involve page macros, sub-queries, templates, and other parser extensions. These inputs must be saved to a database. There may also be some cascading updates, such as the page’s membership in a category. And last but not least, there is the network latency between the user’s device and our data centres.

This year saw a 50% reduction in save timing. At the beginning of the year, median save timing was 2.0 seconds (quarterly report). By June, it was down to 1.6 seconds (report), and in September 2015, we reached 1.0 seconds! (report)

The effort to reduce save timing was led by Aaron Schulz. The impact that followed was the result of hundreds of changes to MediaWiki core and to extensions.

Deferring tasks to post-send

Many of these changes involved deferring work to happen post-send. That is, after the server sends the HTTP response to the user and closes the main database transaction. Examples of tasks that now happen post-send are: cascading updates, emitting “recent changes” objects to the database and to pub-sub feeds, and doing automatic user rights promotions for the editing user based on their current age and total edit count.

Aaron also implemented the “async write” feature in the multi-backend object cache interface. MediaWiki uses this for storing the parser cache HTML in both Memcached (tier 1) and MySQL (tier 2). The second write now happens post-send.

By re-ordering these tasks to occur post-send, the server can send a response back to the user sooner.
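
As a rough illustration of the ordering, here is a toy request handler in Python. It is not MediaWiki's deferred-update API; the class, the method names, and the event labels are assumptions chosen to mirror the examples in this section.

```python
class ToyRequest:
    """Models one edit request; names and steps are illustrative only."""

    def __init__(self):
        self.events = []     # ordered record of what happened
        self._deferred = []  # callables to run after the response is sent

    def defer(self, label):
        """Schedule a unit of work for the post-send phase."""
        self._deferred.append(lambda: self.events.append(label))

    def handle_save(self):
        self.events.append("save revision")
        # Work the user does not need to wait for is queued, not executed.
        self.defer("cascading updates")
        self.defer("recent changes entry")
        self.defer("user rights autopromotion")
        self.events.append("commit main transaction")
        self.events.append("send response")
        # Post-send phase: the user already has their response.
        while self._deferred:
            self._deferred.pop(0)()
        return self.events
```

Calling `ToyRequest().handle_save()` returns the events in order, with "send response" ahead of all three deferred tasks.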

Working with the database, instead of against it

A major category of changes were improvements to database queries. For example, reducing lock contention in SQL, refactoring code in a way that reduces the amount of work done between two write queries in the same transaction, splitting large queries into smaller ones, and avoiding use of database master connections whenever possible.

These optimisations reduced the chances of queries being stalled, and allowed them to complete more quickly.

Avoid synchronous cache re-computations

The aforementioned work on WANObjectCache also helped a lot. Whenever we converted a feature to use this interface, we reduced the amount of blocking cache computation that happened mid-request. WANObjectCache also performs probabilistic preemptive refreshes of near-expiring values, which can prevent cache stampedes.
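
A probabilistic early refresh can be sketched as follows. The post does not give the formula WANObjectCache uses, so the linear ramp and the `low_fraction` parameter below are assumptions made purely for illustration.

```python
import random


def should_refresh_early(remaining_ttl, total_ttl, low_fraction=0.2, rng=random.random):
    """Decide whether this request should recompute a value before it expires.

    Assumption: the chance of refreshing rises linearly from 0 to 1 over the
    last `low_fraction` of the value's lifetime.
    """
    window = total_ttl * low_fraction
    if remaining_ttl <= 0:
        return True   # already expired: must recompute
    if remaining_ttl >= window:
        return False  # still fresh: never refresh early
    chance = 1.0 - (remaining_ttl / window)
    return rng() < chance
```

Because only a small, random share of requests refreshes a value as it nears expiry, the recomputation is usually done by one request ahead of time, instead of by every request at the moment the key expires.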

Profiling can be expensive

We disabled the performance profiler of the AbuseFilter extension in production. AbuseFilter allows privileged users to write rules that may prevent edits based on certain heuristics. Its profiler would record how long the rules took to inspect an edit, allowing users to optimise them. The way the profiler worked, though, added a significant slowdown to the editing process. Work began later in 2016 to create a new profiler, which has since been completed.

And more

Lots of small things, including fixing the User object cache, which existed but wasn’t working. We also avoid caching values in Memcached if computing them is faster than the Memcached latency required to fetch them!

We also improved latency of file operations by switching more LBYL-style coding patterns to EAFP-style code. Rather than checking whether a file exists, is readable, and then checking when it was last modified – do only the latter and handle any errors. This is both faster and more correct (due to LBYL race conditions).
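
The difference between the two styles is easiest to see side by side. MediaWiki is written in PHP; the Python sketch below only illustrates the pattern, and both function names are invented for this example.

```python
import os


def mtime_lbyl(path):
    """Look Before You Leap: up to three filesystem calls, with a race between them."""
    if os.path.exists(path) and os.access(path, os.R_OK):
        # The file may be deleted between the checks above and the stat below.
        return os.stat(path).st_mtime
    return None


def mtime_eafp(path):
    """Easier to Ask Forgiveness than Permission: one call, errors handled after."""
    try:
        return os.stat(path).st_mtime
    except OSError:
        return None
```

The EAFP version makes a single system call on the happy path and cannot be tripped up by the file changing between a check and its use.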


So long, Sajax!

Sajax was a library for invoking a subroutine on the server, and receiving its return value as JSON from client-side JavaScript. In March 2006, it was adopted in MediaWiki to power the autocomplete feature of the search input field.

The Sajax library had a utility for creating an XMLHttpRequest object in a cross-browser-compatible way. MediaWiki deprecated Sajax in favour of jQuery.ajax and the MediaWiki API. Yet, years later in 2015, this tiny part of Sajax remained popular in Wikimedia's ecosystem of community-developed gadgets.

The legacy library was loaded by default on all Wikipedia page views for nearly a decade. During a performance inspection this year, Ori Livneh decided it was high time to finish this migration. Goodbye Sajax!


Further reading

This year also saw the switch to encrypt all Wikimedia traffic with TLS by default.

Mentioned tasks: T107399, T105391, T109666, T110858, T55120.

Runnable runbooks

18:59, Tuesday, 15 2020 December UTC

Recently there has been a small effort on the Release-Engineering-Team to encode some of our institutional knowledge as runbooks linked from a page in the team's wiki space.

What are runbooks, you might ask? This is how they are described on the aforementioned wiki page:

This is a list of runbooks for the Wikimedia Release Engineering Team, covering step-by-step lists of what to do when things need doing, especially when things go wrong.

So runbooks are each essentially a sequence of commands, intended to be pasted into a shell by a human. Step by step instructions that are intended to help the reader accomplish an anticipated task or resolve a previously-encountered issue.

Presumably runbooks are created when someone encounters an issue, and, recognizing that it might happen again, helpfully documents the steps that were used to resolve said issue.

This all seems pretty sensible at first glance. This type of documentation can be really valuable when you're in an unexpected situation or trying to accomplish a task that you've never attempted before. Just about anyone reading this probably has some experience running shell commands pasted from some online tutorials, setup instructions for a program, etc.

Despite the obvious value runbooks can provide, I've come to harbor a fairly strong aversion to the idea of encoding what are essentially shell scripts as individual commands on a wiki page. As someone whose job involves a lot of automation, I would usually much prefer a shell script, a python program, or even a "maintenance script" over a runbook.

After a lot of contemplation, I've identified a few reasons that I don't like runbooks on wiki pages:

  • Runbooks are tedious and prone to human errors.
    • It's easy to lose track of where you are in the process.
    • It's easy to accidentally skip a step.
    • It's easy to make typos.
  • A script can be code reviewed and version controlled in git.
  • A script can validate its arguments, which helps to catch typos.
  • I think that command line terminal input is more like code than it is prose. I am more comfortable editing code in my usual text editor as opposed to editing in a web browser. The wikitext editor is sufficient for basic text editing, and visual editor is quite nice for rich text editing, but neither is ideal for editing code.
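
To illustrate the point about argument validation from the list above, here is a minimal sketch using Python's standard argparse module. The helper, its arguments, the version format, and the group names are all hypothetical and chosen only for the example.

```python
import argparse
import re

# Assumed version format for illustration, e.g. "1.36.0-wmf.20".
VERSION_RE = re.compile(r"^\d+\.\d+\.\d+-wmf\.\d+$")


def version(value):
    """argparse type function: reject anything that is not a well-formed version."""
    if not VERSION_RE.match(value):
        raise argparse.ArgumentTypeError(f"not a valid version: {value!r}")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Hypothetical deploy helper")
    parser.add_argument("version", type=version, help="version to deploy")
    parser.add_argument(
        "--group",
        choices=["group0", "group1", "group2"],  # hypothetical target names
        required=True,
        help="deployment target",
    )
    return parser
```

With this in place, a mistyped version or group makes the script exit with a usage error before any command runs, which a pasted runbook command cannot do.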

I do realize that MediaWiki does version control. I also realize that sometimes you just can't be bothered to write and debug a robust shell script to address some rare circumstances. The cost is high and it's uncertain whether the script will be worth such an effort. In those situations a runbook might be the perfect way to contribute to collective knowledge without investing a lot of time into perfecting a script.

My favorite web comic, xkcd, has a few things to say about this subject:

"The General Problem" xkcd #974. "Automation" xkcd #1319. "Is It Worth the Time?" xkcd #1205.

Potential Solutions

I've been pondering a solution to these issues for a long time, mostly motivated by the pain I have experienced (and the mistakes I've made) while executing the biggest runbook of all on a regular basis.

Over the past couple of years I've come across some promising ideas which I think can help with the problems I've identified with runbooks. I think that one of the most interesting is Do-nothing scripting. Dan Slimmon identifies some of the same problems that I've detailed here. He uses the term *slog* to refer to long and tedious procedures like the Wikimedia Train Deploys. The proposed solution comes in the form of a do-nothing script. You should go read that article; it's not very long. Here are a few relevant quotes:

Almost any slog can be turned into a do-nothing script. A do-nothing script is a script that encodes the instructions of a slog, encapsulating each step in a function.

...

At first glance, it might not be obvious that this script provides value. Maybe it looks like all we’ve done is make the instructions harder to read. But the value of a do-nothing script is immense:

  • It’s now much less likely that you’ll lose your place and skip a step. This makes it easier to maintain focus and power through the slog.
  • Each step of the procedure is now encapsulated in a function, which makes it possible to replace the text in any given step with code that performs the action automatically.
  • Over time, you’ll develop a library of useful steps, which will make future automation tasks more efficient.

A do-nothing script doesn’t save your team any manual effort. It lowers the activation energy for automating tasks, which allows the team to eliminate toil over time.
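
A minimal sketch of the pattern quoted above might look like this in Python. The two steps and the `context` dictionary are invented for illustration and are not taken from the article or from any real Wikimedia procedure.

```python
def announce(context):
    """Step 1: still fully manual, so it only returns instructions."""
    return f"Tell the team that the {context['name']} procedure is starting."


def check_dashboard(context):
    """Step 2: a candidate for later automation."""
    return "Open the monitoring dashboard and confirm there are no new errors."


STEPS = [announce, check_dashboard]


def run_slog(steps, context, confirm=input, out=print):
    """Show each step's instructions in order and wait for the operator."""
    for number, step in enumerate(steps, start=1):
        out(f"[{number}/{len(steps)}] {step(context)}")
        confirm("Press Enter once this step is done: ")
    out("Done.")
```

Running `run_slog(STEPS, {"name": "example"})` walks the operator through both steps. Automating a step later means replacing the body of one function with code that performs the action, without touching the rest of the script.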

I was inspired by this and I think it's a fairly clever solution to the problems identified. What if we combined the best aspects of gradual automation with the best aspects of a wiki-based runbook? Others were inspired by this as well, resulting in tools like braintree/runbook, codedown and the one I'm most interested in, rundoc.

Runnable Runbooks

My ideal tool would combine code and instructions in a free-form "literate programming" style. By following some simple conventions in our runbooks we can use a tool to parse and execute the embedded code blocks in a controlled manner. With a little bit of tooling we can gain many benefits:

  • The tooling will keep track of the steps to execute, ensuring that no steps are missed.
  • Ensure that errors aren't missed by carefully checking / logging the result of each step.
  • We could also provide a mechanism for inputting the values of any variables / arguments and validate the format of user input.
  • With flexible control flow management we can even allow resuming from anywhere in the middle of a runbook after an aborted run.
  • Manual steps can just consist of a block of prose that gets displayed to the operator. With embedded markup we can format the instructions nicely and render them in the terminal using [Rich][7]. Once the operator confirms that the step is complete then the workflow moves on to the next step.
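
The core of such a tool is small. The sketch below is not how rundoc or any other named project works; it is a hypothetical minimal version that splits a Markdown-style runbook into prose and code steps, runs the code steps through the shell, stops at the first failure, and can resume from a given step. Treating every prose block as a manual step that needs confirmation is an assumption of this sketch.

```python
import subprocess

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence


def parse_steps(text):
    """Split runbook text into ("prose", str) and ("code", str) steps."""
    steps, buffer, in_code = [], [], False

    def flush():
        body = "\n".join(buffer).strip()
        if body:
            steps.append(("code" if in_code else "prose", body))

    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            flush()
            buffer, in_code = [], not in_code
        else:
            buffer.append(line)
    flush()
    return steps


def run_steps(steps, start=0, confirm=input, run=subprocess.run):
    """Run steps from index `start`; return the index of the first failed step, or None."""
    for index in range(start, len(steps)):
        kind, body = steps[index]
        if kind == "prose":
            print(body)
            confirm("Press Enter when this manual step is done: ")
        else:
            result = run(body, shell=True)
            if result.returncode != 0:
                return index  # pass this back as `start` to resume after a fix
    return None
```

Returning the failing index is the simplest form of the resume feature: after fixing the problem, the operator calls `run_steps(steps, start=index)` instead of starting over.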

Prior Art

I've found a few projects that already implement many of these ideas. Here are a few of the most relevant:

The one I'm most interested in is Rundoc. It's almost exactly the tool that I would have created. In fact, I started writing code before discovering rundoc but once I realized how closely this matched my ideal solution, I decided to abandon my effort. Instead I will add a couple of missing features to Rundoc in order to get everything that I want and hopefully I can contribute my enhancements back upstream for the benefit of others.

Demo: https://asciinema.org/a/MKyiFbsGzzizqsGgpI4Jkvxmx
Source: https://github.com/20after4/rundoc

References

[1]: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Runbooks "runbooks"
[2]: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys "Train deploys"
[3]: https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-the-key-to-gradual-automation/ "Do-nothing scripting: the key to gradual automation by Dan Slimmon"
[4]: https://github.com/braintree/runbook "runbook by braintree"
[5]: https://github.com/earldouglas/codedown "codedown by earldouglas"
[6]: https://github.com/eclecticiq/rundoc "rundoc by eclecticiq"
[7]: https://rich.readthedocs.io/en/latest/ "Rich python library"

Changes and improvements to PHPUnit testing in MediaWiki

10:32, Wednesday, 25 2020 November UTC

Building off the work done at the Prague Hackathon (T216260), we're happy to announce some significant changes and improvements to the PHP testing tools included with MediaWiki.

PHP unit tests can now be run statically, without installing MediaWiki

You can now download MediaWiki, run composer install, and then composer phpunit:unit to run core's unit test suite (T89432).

The standard PHPUnit entrypoint can be used, instead of the PHPUnit Maintenance class

You can now use the plain PHPUnit entrypoint at vendor/bin/phpunit instead of the MediaWiki maintenance class which wraps PHPUnit (tests/phpunit/phpunit.php).

Both the unit tests and integration tests can be executed with the standard phpunit entrypoint (vendor/bin/phpunit) or if you prefer, with the composer scripts defined in composer.json (e.g. composer phpunit:unit). We accomplished this by writing a new bootstrap.php file (the old one which the maintenance class uses was moved to tests/phpunit/bootstrap.maintenance.php) which executes the minimal amount of code necessary to make core, extension and skin classes discoverable by test classes.

Tests should be placed in tests/phpunit/{integration,unit}

Integration tests should be placed in tests/phpunit/integration while unit tests go in tests/phpunit/unit; these are discoverable by the new test suites (T87781). It sounds obvious now to write this, but a nice side effect is that by organizing tests into these directories it's immediately clear to authors and reviewers what type of test one is looking at.

Introducing MediaWikiUnitTestCase

A new base test case, MediaWikiUnitTestCase, has been introduced with a minimal amount of boilerplate (a @covers validator, checks that globals are disabled and that the tests are in the proper directory, and the default PHPUnit 4 and 6 compatibility layer). The MediaWikiTestCase has been renamed to MediaWikiIntegrationTestCase for clarity.

Please migrate tests to be unit tests where appropriate

A significant portion of core's unit tests have been ported to use MediaWikiUnitTestCase, approximately 50% of the total. We have also worked on porting extension tests to the unit/integration directories. @Ladsgroup wrote a helpful script to assist with automating the identification and moving of unit tests, see P8702. Migrating tests from MediaWikiIntegrationTestCase to MediaWikiUnitTestCase makes them faster.

Note that unit tests in CI are still run with the PHPUnit maintenance class (tests/phpunit/phpunit.php), so when reviewing unit test patches please execute them locally with vendor/bin/phpunit /path/to/tests/phpunit/unit or composer phpunit -- /path/to/tests/phpunit/unit.

Generating code coverage is now faster

The PHPUnit configuration file now resides at the root of the repository, and is called phpunit.xml.dist. (As an aside, you can copy this to phpunit.xml and make local changes, as that file is git-ignored, although you should not need to do that.) We made a modification (T192078) to the PHPUnit configuration inside MediaWiki to speed up code coverage generation. This makes it feasible to have a split window in your IDE (e.g. PhpStorm), run "Debug with coverage", and see the results in your editor fairly quickly after running the tests.

What is next?

Things we are working on:

  • Porting core tests to integration/unit.
  • Porting extension tests to integration/unit.
  • Removing legacy testsuites or ensuring they can be run in a different way (passing the directory name for example).
  • Switching CI to use the new entrypoint for unit tests, then for unit and integration tests.

Help is wanted in all areas of the above! We can be found in the #wikimedia-codehealth channel and via the phab issues linked in this post.

Credits

The above work has been done and supported by Máté (@TK-999), Amir (@Ladsgroup), Kosta (@kostajh), James (@Jdforrester-WMF), Timo (@Krinkle), Leszek (@WMDE-leszek), Kunal (@Legoktm), Daniel (@daniel), Michael Große (@Michael), Adam (@awight), Antoine (@hashar), JR (@Jrbranaa) and Greg (@greg) along with several others. Thank you!

Thanks for reading, and happy testing!

Amir, Kosta, & Máté

Production Excellence #25: October 2020

05:50, Tuesday, 24 2020 November UTC

How’d we do in our strive for operational excellence last month? Read on to find out!

📈 Incidents

2 documented incidents in October. [1] Historically, that's just below the median of 3 for this time of year. [3]

Learn about recent incidents at Incident documentation on Wikitech, or Preventive measures in Phabricator.


📊 Trends

Month-over-month plots based on spreadsheet data. [4]


📖 Outstanding errors

Take a look at the workboard and look for tasks that could use your help.
https://phabricator.wikimedia.org/tag/wikimedia-production-error/

Summary over recent months:

  • ⚠️ July 2019 (3 of 18 tasks): One task closed.
  • ⚠️ August 2019 (1 of 14 tasks): no change.
  • ⚠️ September 2019 (3 of 12 tasks): no change.
  • ⚠️ October 2019 (5 of 12 tasks): One task closed.
  • ⚠️ November 2019 (1 of 5 tasks): Two tasks closed.
  • December (3 of 9 tasks left), no change.
  • January 2020 (4 of 7 tasks left), no change.
  • February (2 of 7 tasks left), no change.
  • March (2 of 2 tasks left), no change.
  • April (9 of 14 tasks left): One task closed.
  • May (7 of 14 tasks left): no change.
  • June (7 of 14 tasks left): no change.
  • July 2020 (9 of 24 new tasks): One task closed.
  • August 2020 (26 of 53 new tasks): Five tasks closed.
  • September 2020 (15 of 33 new tasks): Two tasks closed.
  • October 2020: 45 of 69 new tasks survived the month of October and remain open today.
Recent tally
110 as of Excellence #24 (23rd Oct).
-13 closed of the 110 recent tasks.
+45 survived October 2020.
142 as of today, 23rd Nov.

For the on-going month of November, there are 25 new tasks so far.


🎉 Thanks!

Thank you to everyone else who helped by reporting, investigating, or resolving problems in Wikimedia production. Thanks!

Until next time,

– Timo Tijhof


 👤  Howard Salomon:

❝   Problem is when they arrest you, you get put on the justice train, and the train has no brain. ❞  

Footnotes:

[1] Incident documentation 2020, Wikitech
[2] Open tasks in Wikimedia-prod-error, Phabricator
[3] Wikimedia incident stats by Krinkle, CodePen
[4] Month-over-month, Production Excellence (spreadsheet)

CI now updates your deployment-charts

23:46, Tuesday, 17 2020 November UTC

If you're making changes to a service that is deployed to Kubernetes, it sure is annoying to have to update the helm deployment-chart values with the newest image version before you deploy. At least, that's how I felt when developing on our dockerfile-generating service, blubber.

Over the last two months we've added

And I'm excited to say that CI can now handle updating image versions for you (after your change has merged), in the form of a change to deployment-charts that you'll need to +2 in Gerrit. Here's what you need to do to get this working in your repo:

Add the following to your .pipeline/config.yaml file's publish stage:

promote: true

The above assumes the defaults, which are the same as if you had added:

promote:
  - chart: "${setup.projectShortName}"  # The project name
    environments: []                    # All environments
    version: '${.imageTag}'             # The image published in this stage

You can specify any of these values, and you can promote to multiple charts, for example:

promote:
  - chart: "echostore"
    environments: ["staging", "codfw"]
  - chart: "sessionstore"

The above values would promote the production image published after merging to all environments for the sessionstore service, and only the staging and codfw environments for the echostore service. You can see more examples at https://wikitech.wikimedia.org/wiki/PipelineLib/Reference#Promote

If your containerized service doesn't yet have a .pipeline/config.yaml, now is a great time to migrate it! This tutorial can help you with the basics: https://wikitech.wikimedia.org/wiki/Deployment_pipeline/Migration/Tutorial#Publishing_Docker_Images

This is just one step closer to achieving continuous delivery of our containerized services! I'm looking forward to continuing to make improvements in that area.

From student to professor: Amanda Levendowski

17:23, Monday, 16 2020 November UTC

This fall, we’re celebrating the 10th anniversary of the Wikipedia Student Program with a series of blog posts telling the story of the program in the United States and Canada.

Amanda Levendowski was a law school student 10 years ago when her professor assigned her to edit a Wikipedia article as a class assignment, part of the pilot program of what is now known as the Wikipedia Student Program. She tackled the article on the FAIR USE Act, a piece of failed copyright reform legislation introduced by Rep. Zoe Lofgren. And she was hooked.

“It felt so impactful to be able to contribute to this repository of knowledge that everyone I knew was using and leave behind something valuable,” Amanda says.

When her class ended, she wasn’t done with Wikipedia. She developed an independent study in law school to create the article about revenge porn because she was writing a scholarly piece about it and noticed that there wasn’t a Wikipedia article about the problem.

“That article has been viewed more than 1 million times — it’s probably gonna have more views than any piece of scholarship I write for the rest of my life,” she says.

She continued editing herself, even appearing in a 2015 “60 Minutes” piece about editing Wikipedia. (“There was a lot of footage that was understandably left on the cutting-room floor, but I’ll always remember wryly responding to Morley Safer when he suggested that copyright law was a little outdated and maybe a little boring — I think I said something like, ‘I’m sure many of your producers who rely on fair use would disagree.’ Who says that to Morley Safer?!” she recalls.) But she attributes her ongoing dedication to Wikipedia in part to Barbara Ringer.

“The year I graduated from law school, I overhauled the article about Ringer, the lead architect of the 1976 Copyright Act, the law around which much of my professional life revolves, during a WikiCon edit-a-thon,” she explains (the hero image on this blog post is of Amanda speaking at WikiConference USA in 2014). “There is something meditative about making an article better, about sharing an untold story, that I couldn’t resist wanting to continue experiencing alongside my students. And in the process, I found this stunning quote from Ringer about how the public interest of copyright law should be ‘to provide the widest possible access to information of all kinds.’ It’s hard to hear that and not think of Wikipedia and its mission.”

And now the student has become a professor herself. Amanda’s an Associate Professor of Law and Director, Intellectual Property and Information Policy Clinic at the Georgetown University Law Center. And she assigns her students to edit Wikipedia as a class assignment, of course.

One such student is Laura Ahmed, who is interested in the intersection of intellectual property and privacy law. Laura, who graduated in spring 2020, was both excited and nervous to tackle a Wikipedia assignment, making improvements to the article on the current Supreme Court case Google v. Oracle America, on the copyrightability of APIs and fair use.

“It is almost certainly going to have a substantial impact on software development in the United States, so I think it’s important for the information that is out there about the case to be accurate. That is what made me so nervous about it; it’s such a critical issue and I wanted to be sure that anything I was saying about it was adequately supported by facts,” she says. “Amanda was really great though about helping me get started and build up my confidence to edit the page. When we were editing, COVID-19 had just caused the Supreme Court to postpone several arguments, including this case. So Amanda suggested I start there, and once I’d made that one change it felt easier to go into the substance of the case and change some of the article to better reflect the legal arguments that are being made in the case.”

While Laura found the time constraints of a class assignment challenging, she thought the assignment was critical for both Wikipedia’s readers and her own hands-on learning as a law student.

“This assignment made me really think critically about what I’ve learned in law school and how I can use that knowledge in productive, but unexpected ways,” Laura explains. “When you’re a law student, you tend to forget that a lot of legal concepts aren’t common knowledge. So a lot of cases on Wikipedia really could benefit from a first or second year law student going in and just clarifying what the court actually said or what has actually happened with a case. It’s a nice reminder that we have more to contribute than we think.”

This reflection is exactly what Amanda experienced as a student herself, and is now seeing as an instructor. She reflects back on the American Bar Association’s Model Rules of Professional Conduct: “As a member of a learned profession…a lawyer should further the public’s understanding of and confidence in the rule of law and the justice system because legal institutions in a constitutional democracy depend on popular participation and support to maintain their authority.”

“It’s hard to imagine a more powerful way to further the public’s understanding of law and justice than by empowering law students to improve Wikipedia articles about those laws: it teaches the public, but it also teaches the students the twin skillsets of editing and the value of giving knowledge back to our communities,” Amanda says. “This community isn’t perfect, but I’m so inspired by the many, many volunteers who are striving to make it better. I’m proud to include myself and my students among them, and I’m excited to see where we are another decade out.”

Image: Geraldshields11, CC BY-SA 3.0, via Wikimedia Commons

Arabic and the Web

00:00, Monday, 16 2020 November UTC


I remember a Wikipedia workshop organized by the Institute of Computer Science at the University of Oxford. The question was why, with around half a billion Arabic speakers, Arabic content makes up less than 5%, and of that 5%, perhaps only a third is useful. Whatever answers and suggested solutions that question found, the road to supporting more content is still long.

Arabic speakers tend to be multilingual, perhaps unlike Americans or Europeans, for example. You will always find people who have mastered a second language, such as Algerians who speak French and Egyptians who speak English.

Languages have a history here: when the mother tongue comes second or third, we spend time learning languages instead of science. Many fall behind in their knowledge if they don't master the language, and the rest do not succeed because they cannot understand the culture of the language.

But in reality, how many people live in Algeria? How many contributors are from Algeria? And how many Algerians add encyclopedic content?

I can't answer that here, but I have retrieved the 140-page report of the study, conducted by Oxford, in which I shared my thoughts. I recommend its excellent analyses.

In summary: we need to focus on the important objectives to define, organize and direct the work on this topic.

Permanent, adaptation and recurrence. I am optimistic about our future at this time.

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2430912

(This text was collected from several replies in a conversation that took place on a social media page, with light editing.)

Wikicite from the ground up: references

11:23, Sunday, 15 2020 November UTC
When a point is made in Wikipedia, or when a statement is made in Wikidata, best practice is to include a reference. The same is true of a scholarly paper; its references are typically found in a references section.

Wikicite is a project that brings many scholarly papers into Wikidata. As beautiful as it is, it is a top-down process. As an ordinary editor, there is a lot that you can do to enrich the result.

The paper "Can trophic rewilding reduce the impact of fire in a more flammable world?" has a DOI, and its PDF includes a reference section. It takes a lot of effort to add the authors and the papers it cites to Wikidata, but the visibility of the paper improves and so does the visibility of the papers it cites. Scholia shows that, at this time, this paper is not used as a reference in Wikipedia.

There is now a template that retrieves information from Wikidata for its reference data. It will be great when it is widely adopted because it provides an additional pathway from Wikipedia to the used references and the information relating to the reference.

So what can we do to improve the quality of the data in Wikidata? First, the processes that import the bulk of new data are essential and need to be appreciated as such. The next part is enabling a community to improve the data. A recent paper explained what can be done with a top-down approach. All kinds of decisions were made for us, and the result feels like a one-off project.

When ORCID is considered to be our partner, it makes sense to invite people registered at ORCID to contribute to Wikidata. Their papers can be uploaded from ORCID into Wikidata, and their co-authors and references can be linked by these people. As they do this while being logged into ORCID, we are assured of their personal involvement and can use this as a reference.

The quality of such a reference is better than that of our current references, which came with a link to an "author name string". Who knows whether the disambiguation was correct? When a paper is linked to at least one known ORCID person with public information, we have a link we can verify, and consequently it becomes a link we can trust. Once the link to a person with an ORCID identifier is established, we can ask that person to acknowledge the changes that happen to his or her papers. Our quality is enhanced and a sense of community with ORCID is established.

Thanks, GerardM

Wikicite from the ground up: "Trophic rewilding"

11:15, Sunday, 15 2020 November UTC
In nature conservation, trophic rewilding and trophic cascades are important topics. When an animal like the howler monkey is no longer around, it no longer distributes the seeds of trees. The likely effect is that, in time, those plants are no longer part of the ecosystem. Reintroducing the howler monkey restores the relation; it is considered an example of trophic rewilding.

At Wikipedia there is no article about trophic rewilding. As someone famously said, references are the most important part of a Wikipedia article, so let's start by finding references.

There is a longstanding process of importing data about scholarly papers, all kinds of scholarly papers. Some of them have "trophic rewilding" in their title. Trophic rewilding was not known as a subject in Wikidata, so it was easy enough to look for "trophic rewilding" in titles and add it as a subject. Slowly but surely the Scholia representation evolves. More papers mean more authors, and more authors known to have collaborated on multiple publications. More citations are found for these papers, and by inference they have a relation to the subject.
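The "look for the phrase in titles and add it as a subject" step can be sketched in a few lines of Python. This is only an illustration, not the tooling actually used: it assumes you already have a list of (item ID, title) pairs, and it produces tab-separated QuickStatements-style lines using P921, the Wikidata property for "main subject". The function name and every item ID below are hypothetical placeholders.

```python
def main_subject_statements(papers, phrase, topic_qid):
    """Return tab-separated statement lines for papers whose title contains phrase."""
    needle = phrase.casefold()
    lines = []
    for qid, title in papers:
        # Case-insensitive match on the title only.
        if needle in title.casefold():
            # P921 is the Wikidata property "main subject".
            lines.append(f"{qid}\tP921\t{topic_qid}")
    return lines


# Placeholder item IDs; the second and third titles are made up for the example.
# Do not submit these lines to Wikidata as they are.
papers = [
    ("Q111", "Can trophic rewilding reduce the impact of fire in a more flammable world?"),
    ("Q222", "Seed dispersal by howler monkeys"),
    ("Q333", "Trophic Rewilding: a review"),
]

for line in main_subject_statements(papers, "trophic rewilding", "Q999"):
    print(line)
```

A title match is only a starting point; each match still deserves a human look before the subject is added.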

The initial set of data is already good enough to get a grasp of the subject, but when you want more, you can use Scholia to look for missing data, such as missing authors. The author disambiguator aids in finding papers for the missing author. With such iterations, the Scholia for trophic rewilding becomes more complete.

Another avenue to improve the coverage of a subject is by adding "cites work" in Wikidata for a paper like this one. Not all cited works are known to Wikidata, but the effect can be impressive. NB: the citations are often found in a PDF and not in the article.

Slowly but surely, all the scholarly references to be used for a new article become available. You can use a template in the article to link to the (evolving) Scholia. The best bit is that you can add this template to an existing Wikipedia article as well, providing a scholarly rabbit hole for interested readers.
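To see the effect of "cites work" statements, one could list the works cited by papers on a subject. The sketch below only builds the text of a SPARQL query for the Wikidata Query Service; it does not send it. It assumes the subject already has a Wikidata item, uses P921 ("main subject") and P2860 ("cites work"), and the function name and the item ID in the example are hypothetical.

```python
def cited_works_query(topic_qid, limit=100):
    """Build a SPARQL query listing works cited by papers about topic_qid."""
    # Accept only plain item IDs such as "Q42" so nothing else leaks into the query.
    if not (topic_qid.startswith("Q") and topic_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {topic_qid!r}")
    return f"""SELECT DISTINCT ?cited ?citedLabel WHERE {{
  ?paper wdt:P921 wd:{topic_qid} .
  ?paper wdt:P2860 ?cited .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}"""


# "Q999" is a placeholder, not the real item for trophic rewilding.
print(cited_works_query("Q999", limit=50))
```

The printed query can be pasted into the query service by hand; works that are cited but missing from Wikidata will not show up, which is exactly the gap an editor can fill.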

Thanks, GerardM