Reduce DiscussionTools' usage of the parser cache
Open, Needs Triage, Public

Description

This task represents two streams of work:

  1. The longer-term work associated with reducing DiscussionTools' usage of the parser cache and
  2. The near-term work of modifying the parser cache expiry to reduce its usage so the Editing Team can proceed with scaling DiscussionTools features.

Plans

This section contains the in-progress plan for reducing DiscussionTools' usage of the parser cache.

Near-term plan for reducing parser cache usage

| Step | Description | Ticket | Status |
| 1a | Pre-deploy: Draft plan for interim mitigation with Performance-Team and DBA. | (this task) | |
| 1b | Pre-deploy: Write down how Performance-Team and DBA will monitor the outcome. | T280602 | |
| 2 | Execute the mitigation plan. | T280605 | |
| 3 | Post-deploy: Evaluate impact on site performance (for at least 21 days). | T280606 | |
| 4b | Post-deploy: Ramp parser cache retention back up while keeping an eye on parser cache utilization. | T280604 | |

Longer-term plan for decreasing #discussion-tools's usage of the parser cache

| Step | Description | Ticket | Notes |
| 1 | Avoid splitting the parser cache on user language | T280295 | |
| 2 | Avoid splitting the parser cache on opt-out wikis | T279864 | |
| 3 | Deploy to more wikis as opt-out | T275256 | |

Related Objects

Event Timeline

ppelberg renamed this task from "Decide on a plan to reduce DiscussionTools' dependence on the parser cache" to "Decide on a plan to reduce DiscussionTools' usage of the parser cache". (Apr 20 2021, 6:07 PM)
ppelberg updated the task description.
ppelberg renamed this task from "Decide on a plan to reduce DiscussionTools' usage of the parser cache" to "Reduce DiscussionTools' usage of the parser cache". (Apr 21 2021, 4:47 PM)

@LSobanski As we discussed the other day, we will be merging T280295 and T279864 this week, to be deployed next week. Let us know if you see any issues with that.

@Esanders thanks for the heads up. Since these are expected to decrease the disk usage, I am nothing but supportive :)

@Krinkle + @LSobanski: below is an update about the Editing Team's plans for scaling DiscussionTools features to more projects as opt-out settings.

The below is for y'all's awareness: I don't see anything in it changing or impacting the steps we've defined in this "epic". Although, if you see this differently, please say as much in a comment...

Editing Team's plan for scaling DiscussionTools features

  1. This week, we're beginning conversations with volunteers at ~25 Wikipedias inviting their feedback about our plans to offer the Reply Tool as an opt-out setting (T262331) at their project. We will not be making any commitments about specific deployment dates considering these dates depend on us resolving the parser cache utilization issue. T281533 captures the work involved with having said "conversations."
  2. Once we start receiving consent from wikis to turn the Reply Tool on by default, we'll comment here asking y'all about the parser cache's utilization status so we can, in turn, provide updates to projects about when they can potentially expect to see the Reply Tool available to everyone at their projects.
  3. Once we are comfortable with the parser cache's utilization, we'll proceed with offering the Reply Tool as an opt-out setting at the projects referenced in "2."

Update: 1 July 2021

Documenting the next steps that emerged in the two meetings related to this issue today...

Next steps

  • @Krinkle to verify whether the optimizations made in T282761 have been effective: T280606
  • @Krinkle to estimate the growth in demand for Parser Cache storage: T285993 / T280604
  • Editing Team to estimate the growth in DiscussionTools' demand for Parser Cache storage: T285995 [i]

i. The need for this estimate emerged in a second conversation between @DannyH, @marcella, @DAbad, and myself.

  • @Krinkle to verify whether the optimizations made in T282761 have been effective

That's T280606: Post-deployment: evaluate impact on site performance.

  • @Krinkle to estimate the growth in demand for Parser Cache storage: T285993

This'll be part of T280604: Post-deployment: (partly) ramp parser cache retention back up; moved as a subtask there.

  • @Krinkle to verify whether the optimizations made in T282761 have been effective

That's T280606: Post-deployment: evaluate impact on site performance.

Noted.

  • @Krinkle to estimate the growth in demand for Parser Cache storage: T285993

This'll be part of T280604: Post-deployment: (partly) ramp parser cache retention back up; moved as a subtask there.

Noted. Excellent. Thank you.

I fell down the ParserCache rabbit hole while investigating T285987: Do not generate full html parser output at the end of Wikibase edit requests (unrelated to DiscussionTools but related to ParserCache). I have some results I would like to share; where should I post my numbers?

I didn't know where else to put this, so I'm posting my findings here. I did a 1:256 sampling and checked the keys; in total we have 550M PC entries.

I'm struggling to see how DiscussionTools can cause issues for the parser cache. Its current fragmentation is next to nothing (0.28% extra rows, currently around 1.4M rows). Maybe the reduction of expiry has helped, but I would like to see some numbers on that.

The actual problem is the parser cache entries of commons. They currently make up 29% of all parser cache entries, over 160M rows. For comparison, this is more than all PC entries of enwiki, wikidata, zhwiki, frwiki, dewiki, and enwiktionary combined. I think it's related to a bot that purges all pages on commons, or it may be due to refreshLinks jobs or the CirrusSearch sanitizer job misbehaving (or a combination of all three). This needs a much deeper and closer look.

Looking at commons a bit closer, out of 160M entries (see the classification sketch after this list):

  • 136M rows are non-canonical and only 24M rows are canonical
  • 100M rows have wb=3 on them. I don't know what wikibase is supposed to do on commons for parsercache but this doesn't sound right at all. We don't have new termbox there.
  • 108M are not render requests and 60M are render requests.
  • 52M are fragmentation due to user language not being English.
  • 39M rows are because of 'responsiveimages=0'.
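For transparency, here is a minimal sketch of the kind of bucketing behind these counts. The '!'-separated key layout and the bucket names are assumptions based on the example keys in this thread, not my exact script:

```php
<?php
/**
 * Bucket sampled ParserCache keys by their '!'-separated option fragments,
 * e.g. commonswiki:pcache:idhash:12345-0!userlang=de!responsiveimages=0
 * (key layout assumed from the examples in this thread).
 */
function bucketParserCacheKeys( array $sampledKeys ): array {
	$counts = [ 'canonical' => 0, 'wb=3' => 0, 'userlang' => 0, 'responsiveimages=0' => 0 ];
	foreach ( $sampledKeys as $key ) {
		$fragments = explode( '!', $key );
		array_shift( $fragments ); // drop the page-identifying prefix
		if ( $fragments === [] || $fragments === [ 'canonical' ] ) {
			$counts['canonical']++;
			continue;
		}
		foreach ( $fragments as $frag ) {
			if ( $frag === 'wb=3' ) {
				$counts['wb=3']++;
			} elseif ( strpos( $frag, 'userlang=' ) === 0 && $frag !== 'userlang=en' ) {
				$counts['userlang']++; // split on a non-English user language
			} elseif ( $frag === 'responsiveimages=0' ) {
				$counts['responsiveimages=0']++;
			}
		}
	}
	return $counts;
}
```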

I'll keep looking at this in more depth and keep you all posted.

Some random stuff I found:

  • People can fragment parsercache by choosing random languages. For example I found an entry with userlang=-4787_or_5036=(select_(case_when_(5036=4595)_then_5036_else_(select_4595_union_select_4274)_end))--_emdu
  • TMH seems to be using ParserCache as a general purpose cache in ApiTimedText. I found entries like commonswiki:apitimedtext:Thai_National_Anthem_-_US_Navy_Band.ogg.ru.srt:srt:srt there. This is not much but has potential to explode.
  • There is a general problem of bots editing pages and triggering a parsed entry while no one is actually looking at them. E.g. ruwikinews, a very small wiki in terms of traffic, apparently now has 15M ParserCache rows (ten times bigger than all of DiscussionTools' overhead), mostly because they recently imported a lot of news from an old site. We could rethink this and maybe avoid parsing the page and storing a PC entry if the bot flag is set.

I'll dig more and let you know.
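A minimal sketch of that last bullet's idea, under the assumption that the check would live wherever MediaWiki eagerly warms the ParserCache after a save. EDIT_FORCE_BOT is a real MediaWiki EDIT_* bit flag; the function and its call site are hypothetical:

```php
<?php
// Hypothetical sketch: skip the eager ParserCache warm-up for bot-flagged
// edits, so the page is only parsed (and cached) on the first real view.
// Where exactly this check would live is an open design question.
function shouldWarmParserCacheOnSave( int $flags ): bool {
	if ( $flags & EDIT_FORCE_BOT ) {
		return false; // bot edit: let the first page view fill the cache
	}
	return true;
}
```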

100M rows have wb=3 on them. I don't know what wikibase is supposed to do on commons for parsercache but this doesn't sound right at all. We don't have new termbox there.

This is added by WikibaseRepo and will probably appear in ALL commons and wikidata (and the associated test site) pcache keys:
https://github.com/wikimedia/Wikibase/blob/c1791fbca79be6f14b42a4117367ddaa1e618023/repo/includes/RepoHooks.php#L1069-L1073
Though this has consistently been 3 for years now, so no extra splitting should be happening here.
We could probably drop this.

Not sure why only some percentage of commons entries seem to have this? The hook looks like it always adds it?
Could it be something to do with MCR? Not sure.
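For reference, a paraphrased sketch of what the linked hook does (the exact code is at the URL above; the ParserOptionsRegister hook also takes a third `$lazyLoad` parameter, omitted here). Since the value has been pinned at 3 for years, in principle it labels keys rather than splitting them:

```php
<?php
// Paraphrase of the linked RepoHooks.php: Wikibase registers a 'wb' parser
// option that participates in the ParserCache key and has stayed at 3.
function onParserOptionsRegister( array &$defaults, array &$inCacheKey ) {
	$defaults['wb'] = 3;       // bumped only for incompatible output changes
	$inCacheKey['wb'] = true;  // include the 'wb' option in the cache key
}
```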

There is a general problem of bots editing pages and triggering a parsed entry while no one is actually looking at them. E.g. ruwikinews, a very small wiki in terms of traffic, apparently now has 15M ParserCache rows (ten times bigger than all of DiscussionTools' overhead), mostly because they recently imported a lot of news from an old site. We could rethink this and maybe avoid parsing the page and storing a PC entry if the bot flag is set.

This is also a problem for Wikidata, and we are going to stop it from happening in T285987: Do not generate full html parser output at the end of Wikibase edit requests.

Change 708520 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/TimedMediaHandler@master] Avoid using ParserCache as a general purpose cache

https://gerrit.wikimedia.org/r/708520
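If I read the patch's intent right, the gist is to move these entries out of ParserCache and into the regular WAN object cache. A rough sketch of that approach (not the actual patch; the key parts, TTL, and convertTimedText() helper are illustrative stand-ins):

```php
<?php
use MediaWiki\MediaWikiServices;

// Sketch: cache a converted subtitle track in the main WAN object cache
// instead of ParserCache, recomputing it on demand when the entry expires.
function getConvertedTimedText( string $fileName, string $lang, string $format ): string {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();
	return $cache->getWithSetCallback(
		$cache->makeKey( 'timedtext-conversion', $fileName, $lang, $format ),
		$cache::TTL_DAY,
		static function () use ( $fileName, $lang, $format ) {
			// The expensive step: fetch and convert the subtitle track.
			return convertTimedText( $fileName, $lang, $format ); // hypothetical helper
		}
	);
}
```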

I'm struggling to see how DiscussionTools can cause issues for the parser cache. Its current fragmentation is next to nothing

The fragmentation issue in DT was solved at the source many months ago, and later further reduced by the shortened retention, so it is expected to be very low now.

commons. They currently make up 29% of all parser cache entries, over 160M rows. For comparison, this is more than all PC entries of enwiki, wikidata, zhwiki, frwiki, dewiki, and enwiktionary combined.

Thanks, this is very nice. We hadn't yet tried to break it down this way. Right now, though, I'd say we're not actively looking to decrease it. Previous experience does tell us that even low hit rates are useful in PC, given the high cost of generating entries. I'm actually thinking about a possible future where PC is more like ExternalStore, in that it would not have a TTL at all but would be basically append-only (apart from replacing entries with current revisions, and applying deletions). Especially as we get closer to Parsoid being used for page views, which has a relatively strong need to have an expansion ready to go at all times. As well as improving performance for page views more broadly by getting the miss rate so low that we could potentially even serve an error if a PC entry is missing (and queue a job or something). This will require a lot more work, but it shows a rough long-term direction that I'm considering. (Nothing is decided yet.)
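Purely to illustrate that direction (nothing decided, as said): a page view in that world would fail fast on a PC miss and queue a re-parse, roughly like this. ParserCache::get(), JobQueueGroup, and HttpError are real MediaWiki pieces; the ParsePageJob class is hypothetical:

```php
<?php
// Illustrative only: serve an error on a ParserCache miss and enqueue a
// parse job, instead of parsing inline during the page view.
$output = $parserCache->get( $wikiPage, $parserOptions );
if ( !$output ) {
	JobQueueGroup::singleton()->lazyPush( new ParsePageJob( $wikiPage->getTitle() ) );
	throw new HttpError( 503, 'Rendering queued, please retry shortly' );
}
```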

Some random stuff I found:

  • People can fragment parsercache by choosing random languages. For example I found an entry with userlang=-4787_or_5036…

This is required for the int-lang hack. These should be given a shortened TTL, the same as for old revisions and non-canonical preferences; but for as long as we support this feature, they are still worth caching, I imagine.

I'm hoping, in the next 1-2 years, to deprecate and remove this feature, as it seems the various purposes for it have viable alternatives nowadays. It'll take a long time to migrate, but during the migration we could potentially disable caching at some point, or severely limit which wikis/namespaces it is cached for, and eventually disable it entirely (e.g. by normalising the parameter to a valid language code).
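For concreteness, the shortened-TTL idea could look roughly like this. The detection condition is illustrative; ParserOutput::updateCacheExpiry() is real and can only lower the effective expiry, never raise it:

```php
<?php
use MediaWiki\MediaWikiServices;

// Sketch: cap the retention of parser output that was split on a
// non-canonical user language (e.g. via ?uselang=), similar to what is
// done for old revisions. The condition is illustrative.
$contentLangCode = MediaWikiServices::getInstance()->getContentLanguage()->getCode();
if ( $parserOptions->getUserLang() !== $contentLangCode ) {
	$parserOutput->updateCacheExpiry( 86400 ); // keep for at most one day
}
```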

  • TMH seems to be using ParserCache as a general purpose cache in ApiTimedText. I found entries like commonswiki:apitimedtext:Thai_National_Anthem_-_US_Navy_Band.ogg.ru.srt:srt:srt there. This is not much but has potential to explode.

Ack. I think we may have one or two other things like this. These are basically using PC as if it were the MainStash, where we are currently short on space. Being worked on at T212129.

There is a general problem of bots editing pages and triggering a parsed entry while no one is actually looking at them. E.g. ruwikinews, a very small wiki in terms of traffic, apparently now has 15M ParserCache rows (ten times bigger than all of DiscussionTools' overhead), mostly because they recently imported a lot of news from an old site. We could rethink this and maybe avoid parsing the page and storing a PC entry if the bot flag is set.

As mentioned above, PC benefits a lot from the long tail, so intentional measures not to pre-cache entries during edits would affect the performance of API queries, jobs, and eventually page views. It may be good to have this as one of several emergency levers we can pull to reduce load, but I'm not sure about it as a general policy.

In general, though, I think right now we're stable, and I'd prefer not to make major changes to demand if we can avoid it until this task and its subtasks are completed.