en.planet.wikimedia

June 07, 2017

Wikimedia Performance Team

Improving time-to-logo performance with preload resource hints

One of the goals of the Wikimedia Performance Team is to improve the performance of MediaWiki and the broader software stack used on Wikimedia wikis. In this article we’ll describe a small performance improvement we’ve implemented for MediaWiki and recently deployed to production for Wikimedia. It highlights some of the unique problems we encounter on Wikimedia sites and how new web standards can be leveraged to improve performance.

Logo as CSS background

The MediaWiki logo is defined as a CSS background image on an element. This is historically for caching reasons: MediaWiki deployments tend to cache pages as a whole, and changing the logo would thus require invalidating all pages if the logo were a regular <img> tag. By defining it as a CSS background, updating the logo only requires invalidating the stylesheet where it resides. This constraint has significant implications for when the logo loads.

In the loading sequence of a web page, browsers will give a relatively low priority to CSS background images. In practice, assuming an empty browser cache, this means that the MediaWiki logo loads quite late, after most images that are part of the page content have been loaded. To the viewer, this results in the page loading somewhat out of order: images that aren’t necessarily in view are loaded first, and the logo is one of the last images to be loaded. This breaks the de facto expectation that a web page’s content loads from top to bottom.

This phenomenon extends the average duration of an imaginary metric one could call time-to-logo. The point in time when the logo appears is an important mental milestone, as it’s when a visitor has visual confirmation that they’ve landed on the right website. The high time-to-logo caused by the CSS background limitation is felt even more on slow internet connections, where the logo can take seconds to appear, long after the page’s text and other images lower on the page have loaded.

The preload resource hint

We have been looking for a solution to this problem for some time, and a relatively new browser feature has enabled us to develop a workaround. The preload resource hint, developed by the W3C, allows us to inform the browser early that the logo will be needed at some point on the page. This feature can be combined with CSS media queries, which in our case means that the browser will only preload the right version of the logo for the current pixel density/zoom. This is essential, as we don’t want to preload a version of the logo that the page won’t need. Browser cache is also respected, meaning that all we’re doing is loading the logo a lot earlier than it naturally would load, which is exactly what we were looking for. In fact, with the preload hint the browser knows it needs the logo even sooner than it would if the logo were displayed as an <img> element.
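
To make the mechanism concrete, here is a minimal sketch of what one of these hints amounts to when expressed from JavaScript via the DOM; the href and media query values are taken from the production header shown further below:

    var link = document.createElement('link');
    link.rel = 'preload';
    link.as = 'image'; // tells the browser what kind of resource this is, for prioritization
    link.href = '/static/images/project-logos/enwiki-2x.png';
    link.media = '(min-resolution: 2dppx)'; // only preload on high-density screens
    document.head.appendChild(link);

In practice no script is needed at all: the same hint can be written as a plain <link> tag in the markup or, as described next, sent as an HTTP response header.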

The preload hint for the site logo has been deployed to production for all Wikimedia wikis. It can easily be spotted in the response headers of pages that display the logo (the vast majority - if not all - pages on wikis for desktop users). This actually leverages a little-known browser feature whereby the equivalent of <link> tags can be sent as Link response headers, which in this situation lets us inform the browser even sooner that the logo will be needed.

Link: </static/images/project-logos/enwiki.png>;rel=preload;as=image;media=not all and (min-resolution:1.5dppx),</static/images/project-logos/enwiki-1.5x.png>;rel=preload;as=image;media=(min-resolution:1.5dppx) and (max-resolution:1.999999dppx),</static/images/project-logos/enwiki-2x.png>;rel=preload;as=image;media=(min-resolution:2dppx)

Measuring the impact

To confirm the expected impact of logo preloading, we recorded a before and after video using synthetic testing with Sitespeed.io, on a simulated slow internet connection, for a large page (the Barack Obama article on English Wikipedia), where the problem was more dramatic. The left pane is the article loading without logo preloading, the right pane is with logo preloading enabled. Focus your attention on the top-left of the article, where the Wikipedia logo is expected to appear:

Unfortunately, current JavaScript APIs in the browser aren’t advanced enough to let us measure something as fine-grained as time-to-logo directly for real users, which means we can only speculate about the extent of the impact in the real world. The web performance field is making progress toward more user-centric metrics, such as First Meaningful Paint, but we’re still very far from being able to collect such metrics directly from users.

In our case, the difference seen in synthetic testing is dramatic enough that we have a high level of confidence that it has made the user experience better in the real world for many people.

The preload resource hint isn’t supported by all major web browsers yet. When more browsers support it, MediaWiki will automatically benefit from it. We hope that wikis as large as Wikipedia relying on this very useful browser feature will be an incentive for more browsers to support it.

by Gilles (Gilles Dubuc) at June 07, 2017 05:04 PM

Wikimedia Foundation

Building communities to support free knowledge: Addis Wang

Photo by Victor Grigas, CC BY-SA 3.0.

Every November, people from all around the world volunteer their time as part of Wikipedia Asian Month to write Wikipedia articles about the world’s largest continent. A participant in the contest who writes a few Wikipedia articles can get postcards featuring famous Asian monuments, provided the articles meet certain quality standards.

The plan to encourage volunteer editors with simple yet relevant rewards has blossomed into thousands of new articles added to Wikipedia.

And who conceptualized all of this and made it a reality? Addis Wang first thought of Wikipedia Asian Month in 2015 and worked to coordinate the project with other Wikipedians.

The Chinese volunteer joined Wikipedia in 2008 at the age of 16; he has always been concerned with Wikipedia’s presence in his country, China, and the wider Asian region. Wang extensively edited the Chinese Wikipedia for years, until he came to believe that more sustainable change comes when the editing experience is shared with others. That’s when he turned his efforts toward outreach activities that encourage more people to edit, build healthier Wikipedia communities, and connect Wikipedians who don’t know each other.

“I feel most proud when a small effort I make can uniquely help knowledge spread,” says Wang. “I’ve created thousands of articles on the Chinese Wikipedia, but my current project, Wikipedia Asian Month, is the one I’m most proud of. It shows how Wikipedians around the world collaborate and contribute to one common goal. It also has significantly assisted the development of many local communities and small language Wikipedia projects, especially those in Asia.”

According to Wang, a strong encyclopedia needs a community to support its existence, and those community members will not have the motivation to volunteer before learning about the project and understanding its needs.

“We want to share our knowledge with more people and want more people to edit,” Wang explains. “But we can’t say, ‘Hey, come edit Wikipedia’ to someone who doesn’t know that Wikipedia even exists.”

To help raise awareness about Wikipedia in his country, where local for-profit competitors host most of the local online content, Wang co-founded the first community user group for the Wikimedia movement in China in 2013. The group held regular meetups for existing Wikipedians and other activities for those interested in learning about Wikipedia. In addition, one of their very first activities was hosting a Wiki Loves Monuments contest, which is held globally but organized by country. The Chinese Wikimedia community joined Wiki Loves Monuments for the first time this year thanks to the efforts of Wang and others.

Moving to study at Ohio State University in 2012 might have separated Wang from the Wikipedia community in China, but that was no excuse for him to quit contributing. Instead, Wang quickly started working with his fellow students on establishing a Wikipedian community on and off campus in Ohio.

“A long time ago, Wikipedian Kevin Payravi started a Wikipedia club that caught my attention, so I joined him,” Wang recalls. “We are also looking for opportunities to encourage more campus-based communities to practice what we’re doing.”

The student group aimed to invite students and educators to Wikipedia editing events; it quickly grew and collaborated on starting a community user group for the state of Ohio in 2016. The new group has organized several projects since its founding, including Wiki Loves Monuments 2016 in the United States.

For Wang, Wikipedia has introduced a learning revolution, and that’s what keeps him willing to support it day after day.

“Wikipedia allows people from all over the world to share the world’s knowledge, and you can simply access it for free,” says Wang. “Because we use Wikipedia every day, it’s hard to imagine how we could find information without it. Online content is filled with unverified information and soft advertisements, books are limited and going to a library is time-consuming. This also depends on whether or not there is a decent library in your community, which is still uncommon, even today. That’s why I support Wikipedia and believe that it is so important.”

Interview by Jonathan Curiel, Senior Development Communications Manager
Profile by Samir Elsharbaty, Digital Content Intern
Wikimedia Foundation

by Jonathan Curiel and Samir Elsharbaty at June 07, 2017 02:27 PM

Wikimedia Scoring Platform Team

Join my Reddit AMA about Wikipedia and ethical, transparent AI

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2017-May/000163.html)

Hey everybody,

TL;DR: I wanted to let you know about an upcoming experimental Reddit AMA ("ask me anything") chat we have planned. It will focus on artificial intelligence on Wikipedia and how we're working to counteract vandalism while also making life better for newcomers.

We plan to hold this chat on June 1st at 21:00 UTC (14:00 PDT) in the /r/IAmA subreddit[1]. I'd love to answer any questions you have about these topics, and I'll send a follow-up email to this thread shortly before the AMA begins.


For those who don't know who I am, I create artificial intelligences[2] that support the volunteers who edit Wikipedia[3]. I've been fascinated by the ways that crowds of volunteers build massive, high quality information resources like Wikipedia for over ten years.

For more background, I research and then design technologies that make it easier to spot vandalism in Wikipedia—which helps support the hundreds of thousands of editors who make productive contributions. I also think a lot about the dynamics between communities and new users—and ways to make communities inviting and welcoming to both long-time community members and newcomers who may not be aware of community norms. For a quick sampling of my work, check out my most impactful research paper about Wikipedia[4], some recent coverage of my work from *Wired*[5], the master list of my projects on my WMF staff user page[6], the documentation for the technology team I run[10], or the home page for Wikimedia Research[9].

This AMA, which I'm doing with the Foundation's Communications department, is somewhat of an experiment. The intended audience for this chat is people who might not currently be a part of our community but have questions about the way we work—as well as potential research collaborators who might want to work with our data or tools. Many may be familiar with Wikipedia but not with the work we do as a community behind the scenes.

I'll be talking about the work I'm doing with the ethics of AI and how we think about artificial intelligence on Wikipedia, and ways we’re working to counteract vandalism on the world’s largest crowdsourced source of knowledge—like the ORES extension[7], which you may have seen highlighting possibly problematic edits on your watchlist and in RecentChanges.

I’d love for you to join this chat and ask questions. If you don’t use Reddit or prefer not to, we will also be taking questions on ORES' MediaWiki talk page[8] and posting answers to both threads.

  1. https://www.reddit.com/r/IAmA/
  2. https://en.wikipedia.org/wiki/Artificial_intelligence
  3. https://www.mediawiki.org/wiki/ORES
  4. http://www-users.cs.umn.edu/~halfak/publications/The_Rise_and_Decline/halfaker13rise-preprint.pdf
  5. https://www.wired.com/2015/12/wikipedia-is-using-ai-to-expand-the-ranks-of-human-editors/
  6. https://en.wikipedia.org/wiki/User:Halfak_(WMF)
  7. https://www.mediawiki.org/wiki/Extension:ORES
  8. https://www.mediawiki.org/wiki/Talk:ORES
  9. https://www.mediawiki.org/wiki/Wikimedia_Research
  10. https://www.mediawiki.org/wiki/Wikimedia_Scoring_Platform_team

-@Halfak
Principal Research Scientist @ WMF
User:EpochFail / User:Halfak (WMF)

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 07, 2017 07:31 AM

June 06, 2017

Wikimedia Scoring Platform Team

Status update (June 3rd, 2017)

Hey folks,

I'll be starting to post updates here on the phame blog from now on, but if you'd prefer to be notified via the mailing lists we used to post to, that's OK. I'll make sure that the highlights and the links to these posts get pushed there too.

We had a big presence at the Wikimedia Hackathon 2017 in Vienna. We kicked off a lot of new language-focused collaborations and we deployed a new Item Quality model for Wikidata.

French and Finnish Wikipedias now have advanced edit quality prediction support!

ORES is available through api.php again via rvprop=oresscores and rcprop=oresscores.
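
For example, a recent changes query along these lines (illustrative; other prop combinations work too) will include ORES scores with each change:

    https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rcprop=title|ids|oresscores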

Wiki labels now has a new stats reporting interface. Check out https://labels.wmflabs.org/stats

We had a major hiccup when failing over to CODFW, but we worked it out and ORES is very happy again.

See the sections below for details.

Labeling campaigns

We deployed a new edit quality labeling campaign to English Wiktionary (T165876) and we're looking for someone who can work as a liaison for this task. We've also deployed secondary labeling campaigns to Finnish Wikipedia (T166558) and Turkish Wikipedia (T164672). These secondary campaigns help us improve ORES accuracy.

Outreach & comms

We hosted a session at the Wikimedia Hackathon to tell people about ORES and show how to work with us to get support for your local wiki (T165397). We also worked with the Collaboration Team to announce that the ORES Review Tool would not be enabled by default and that the New Filters would be deployed as a beta feature (T163153).

New development

Lots of things here. In our modeling library, we implemented the basics of Greek and Bengali language assets so that we can start working on prediction models (T166793, T162620). After talking to people at the Wikimedia Hackathon about peculiar language overlaps, we implemented a regex exclusions strategy (T166793) that will allow us to clearly state that "ha" is not laughing in Hungarian or Italian, though it is in a lot of other contexts.

We also spent some time exploring the overlap of the "damaging" and "goodfaith" models on Wikipedia (T163995). We were able to show that there's useful overlap that will allow editors working on newcomer socialization to find goodfaith newcomers who are running into trouble. The Collaboration Team adjusted the thresholds in the New Filters in response to our analysis (T164621).

Using data from Wiki labels (T157495), we trained a basic item quality model for Wikidata (T164862) and demonstrated it at the Wikimedia Hackathon (T166054). We used data from Wiki labels (T130261, T163012) to build advanced edit quality models for French and Finnish Wikipedia (T130282, T163013), and those are now deployed in ORES (T166047).

We implemented a new stats reporting interface in Wiki labels (T139956) and announced it (T166529). This interface makes it easier for people managing campaigns in Wiki labels to track progress. It's been a long time coming. Props to @Ladsgroup for doing a bunch of work to make it happen.

Finally, we implemented a new "score_revisions" utility that makes it quick and easy to generate scores for a set of revisions using the ORES service (T164547). This is really useful for researchers who want lots of scores and would like to avoid taking down ORES. Personally, I've been using it to audit ORES.

Maintenance and robustness

We did a major deployment of ORES in mid-April (T162892) that had some serious problems in CODFW but not EQIAD, which was super confusing (T163950), so we re-routed traffic to EQIAD (350487). While investigating, we found out that some timeouts (T163944) and server errors (T163171, T163764, T163798) were due to the same problem: there were two servers in CODFW that we didn't know existed, so they weren't getting new deployments and were poisoning our worker queue with old code!

We also fixed a couple of regressions that popped up in the ORES Review Tool while new work was being done on the New Filters (T165011, T164984). We fixed some weird tokenization issues caused by diacritics in Bengali not being handled correctly (T164767).

We re-enabled ORES in api.php (T163687). Props to @Tgr for making this happen.

We fixed some issues with the ORES swagger documentation (T162184) and some UI issues in Wiki labels related to button colors (T163222) and confusing error messages (T138563).

Documentation

We finished off some data-flow diagrams for ORES (T154441). As part of transitioning to a Wikimedia Foundation team (Scoring Platform! Woot!), we've moved all the documentation for ORES and our team to MediaWiki.org (T164991). Also, as part of the Tech Ops experimentation with failovers across datacenters, we updated our Grafana metrics tracking to split metrics by datacenter (T163212). This helped us quite a bit in diagnosing the deployment issues discussed in the last section.

That's all folks. I hope you enjoyed the new format!

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 06, 2017 11:56 PM

Wikimedia Foundation

Sacrificing freedom of expression and collaboration online to enforce copyright in Europe?

Photo by Kain Kalju, CC BY 2.0.

Respect for copyright laws is a fundamental part of Wikipedia’s culture, and embedded in the free online encyclopedia’s five central pillars. Contributors diligently monitor new edits for compliance with copyright and collaboratively resolve disputes over the permissible use of a work or its status. When a rightsholder finds that their work is used without permission on a Wikimedia Project, we encourage them to talk to the community of editors to address the issue. If this does not lead to a resolution, under the Digital Millennium Copyright Act (the US analogue to the EU’s E-Commerce Directive), rightsholders can notify the Wikimedia Foundation, as the content host, of the alleged copyright infringement. The fact that we only received twelve such notices (only four of which were valid) in the second half of 2016 is a testament to the diligence of Wikipedia editors and the accuracy of human-based detection of copyright infringement.

Yet, because European lawmakers see copyright infringement as a problem on other platforms, they are currently debating a proposal for a new EU copyright directive that—if applied to Wikipedia—would put the site’s well-functioning system in peril. Article 13 of the proposal would require “information society services” that store large amounts of content uploaded by users (as Wikimedia does) to take measures to prevent uploads of content that infringes copyright. A radical new “compromise amendment” would apply this requirement to all services—not just ones hosting “large amounts” of content. The Commission’s proposal suggests that hosts implement upload filters, or, as they call them, “effective content recognition technologies”. Some large, for-profit platforms already have such filtering technologies in place, but we strongly oppose any law that would make them mandatory. Filtering technologies have many flaws, and a requirement to implement them would be detrimental to the efficient and effective global online collaboration that has been Wikipedia’s foundation for the past 16 years.

First, filters are often too broad in their application because they aren’t able to account for the context in which a work is used. Automated content detection generally has no knowledge of licenses or other agreements between users, platforms, and rightsholders that may be in place. Such filtering systems also fail to make good case-by-case decisions that would take into consideration copyright laws in various countries, which may actually allow for the use of a work online. As a result, a lot of culturally or otherwise valuable works are caught as “false positives” by the detection systems and consequently taken off the platforms. In fact, automated takedowns are such a prevalent phenomenon that researchers have seen the need to document them in order to provide transparency around these processes that affect freedom of expression online, and maybe even the rule of law. Moreover, such filter systems have also been shown to create additional opportunities for attacks on users’ privacy.

Second, mandatory filtering technology that scans all uploads to a platform can be used for all kinds of purposes, not just copyright enforcement. Automatic content filters can also monitor expression and target illicit or unwanted speech, for instance under the guise of anti-terrorism policies. In other words: they can be repurposed for extensive surveillance of online communications. While intended to address copyright infringement, Art. 13 of the proposed copyright directive would actually lay the groundwork for mass-surveillance that threatens the privacy and free speech of all internet users, including Wikipedians who research and write about potentially controversial topics. All Europeans should be as concerned about these threats as the EU Wikimedia communities and we at the Wikimedia Foundation are.

Third, the broad and vague language of Art. 13 and the compromise amendment would undermine collaborative projects that rely on the ability of individuals around the world to discuss controversial issues and develop content together. Free knowledge that is inclusive, democratic, and verifiable can only flourish when the people sharing knowledge can engage with each other on platforms that have reasonable and transparent takedown practices. People’s ability to express themselves online shouldn’t depend on their skill at navigating opaque and capricious filtering algorithms. Automatic content filtering based on rightsholders’ interpretation of the law would—without a doubt—run counter to these principles of human collaboration that have made the Wikimedia projects so effective and successful.

Finally, automatic content detection systems are very expensive: YouTube spent USD 60 million to develop Content ID. Requiring all platforms to implement these filters would put young startups that cannot afford to build or buy them at a tremendous disadvantage. This would hurt, not foster, the digital single market in the European Union, as it would create a tremendous competitive advantage for platforms that have already implemented such filters or are able to pay for them. The result would be diminished innovation and diversity in the European interior market and less choice for European internet users.

As currently written, Art. 13 would harm freedom of expression online by inducing large-scale implementation of content detection systems. For many Europeans, Wikipedia is an important source of free knowledge. It is built by volunteers who need to be able to discuss edits with other contributors without opaque interference from automatic filters. Therefore, we urge the European Parliament and the Council to avert this threat to free expression and access to knowledge by striking Art. 13 from the proposed directive. If the provision stays, the directive will be a setback to true modernization of European copyright policy.

Jan Gerlach, Public Policy Manager
Wikimedia Foundation

by Jan Gerlach at June 06, 2017 03:00 PM

Brion Vibber

Brain dump: x86 emulation in WebAssembly

This is a quick brain dump of my recent musings on feasibility of a WebAssembly-based in-browser emulator for x86 and x86-64 processors… partially in the hopes of freeing up my brain for main project work. 😉

My big side project for some time has been ogv.js, an in-browser video player framework which uses emscripten to cross-compile C libraries for the codecs into JavaScript or, experimentally, the new WebAssembly target. That got me interested in how WebAssembly works at the low level, and how C/C++ programs work, and how we can mishmash them together in ways never intended by gods or humans.

Specifically, I’m thinking it would be fun to make an x86-64 Linux process-level emulator built around a WebAssembly implementation. This would let you load a native Linux executable into a web browser and run it, say, on your iPad. Slowly. 🙂

System vs process emulation

System emulators provide the functioning of an entire computer system, with emulated software-hardware interfaces: you load up a full kernel-mode operating system image which talks to the emulated hardware. This is what you use for playing old video games, or running an old or experimental operating system. This can require emulating lots of detailed behavior of a system, which might be tricky or slow, and programs may not integrate well with the surrounding environment because they live in a tiny computer within a computer.

Process emulators work at the level of a single user-mode process, which means you only have to emulate up to the system call layer. Older Mac users may remember their shiny new Intel Macs running old PowerPC applications through the “Rosetta” emulator for instance. QEMU on Linux can be set up to handle similar cross-arch emulated execution, for testing or to make some cross-compilation scenarios easier.

A process emulator has some attraction because the model is simpler inside the process… If you don’t have to handle interrupts and task switches, you can run more instructions together in a row; elide some state changes; all kinds of fun things. You might not have to implement indirect page tables for memory access. You might even be able to get away with modeling some function calls as function calls, and loops as loops!

WebAssembly instances and Linux processes

There are many similarities, which is no coincidence as WebAssembly is designed to run C/C++ programs similarly to how they work in Linux/Unix or Windows while being shoehornable into a JavaScript virtual machine. 🙂

An instantiated WebAssembly module has a “linear memory” (a contiguous block of memory addressable via byte indexing), analogous to the address space of a Linux process. You can read and write int and float values of various sizes anywhere you like, and interpretation of bytewise data is up to you.

Like a native process, the module can request more memory from the environment, which will be placed at the end. (“grow_memory” operator somewhat analogous to Linux “brk” syscall, or some usages of “mmap”.) Unlike a native process, usable memory always starts at 0 (so you can dereference a NULL pointer!) and there’s no way to have a “sparse” address space by mapping things to arbitrary locations.

The module can also have “global variables” which live outside this address space — but they cannot be dynamically indexed, so you cannot have arrays or any dynamic structures there. In WebAssembly built via emscripten, globals are used only for some special linking structures because they don’t quite map to any C/C++ construct, but hand-written code can use them freely.

The biggest difference from native processes is that WebAssembly code doesn’t live in the linear memory space. Function definitions have their own linear index space (which can’t be dynamically indexed: references are fixed at compile time), plus there’s a “table” of indirect function references (which can be dynamically indexed into). Function pointers in WebAssembly thus aren’t actually pointers to the instructions in linear memory like on native — they’re indexes into the table of dynamic function references.

Likewise, the call stack and local variables live outside linear memory. (Note that C/C++ code built with emscripten will maintain its own parallel stack in linear memory in order to provide arrays, variables that have pointers taken to them, etc.)

WebAssembly’s actual opcodes are oriented as a stack machine, which is meant to be easy to verify and compile into more efficient register-based code at runtime.

Branching and control flow

In WebAssembly, control flow is limited, with one-way branches possible only to a containing block (i.e. breaking out of a loop). Subroutine calls are only to defined functions (either directly by compile-time reference, or indirectly via the function table).

Control flow is probably the hardest thing to make really match up from native code — which lets you jump to any instruction in memory from any other — to compiled WebAssembly.

It’s easy enough to handle craaaazy native branching in an interpreter loop. Pseudocode:

loop {
    instruction = decode_instruction(ip)
    instruction.execute() // update ip and any registers, etc
}

In that case, a JMP or CALL or whatever just updates the instruction pointer when you execute it, and you continue on your merry way from the new position.

But what if we wanted to eke more performance out of it by compiling multiple instructions into a single function? That lets us elide unnecessary state changes (updating instruction pointers, registers, flags, etc when they’re immediately overridden) and may even give opportunity to let the compiler re-optimize things further.

A start is to combine runs of instructions that end in a branch or system call (QEMU calls them “translation units”) into a compiled function, then call those in the loop instead of individual instructions:

loop {
    tu = cached_or_compiled_tu(ip)
    tu.execute() // update registers, ip, etc as we go
}

So instead of decoding and executing an instruction at a time, we’re decoding several instructions, compiling a new function that runs them, and then running that. Nice, if we have to run it multiple times! But…. possibly not worth as much as we want, since a lot of those instruction runs will be really short, and there’ll be function call overhead on every run. And, it seems like it would kill CPU branch prediction and such, by essentially moving all branches to a single place (the tu.execute()).

QEMU goes further in its dynamic translation emulators, modifying the TUs to branch directly to each other in runtime discovery. It’s all very funky and scary looking…

But QEMU’s technique of modifying trampolines in the live code won’t work as we can’t modify running code to insert jump instructions… and even if we could, there are no one-way jumps, and using call instructions risks exploding the call stack on what’s actually a loop (there’s no proper tail call optimization in WebAssembly).

Relooper

What can be done, though, is to compile bigger, better, badder functions.

When emscripten is generating JavaScript or WebAssembly from your C/C++ program’s LLVM intermediate language, it tries to reconstruct high-level control structures within each function from a more limited soup of local branches. These then get re-compiled back into branch soup by the JIT compiler, but efficiently. 😉

The binaryen WebAssembly code gen library provides this “relooper” algorithm too: you pass in blocks of instructions, possible branches, and the conditions around them, and it’ll spit out some nicer branch structure if possible, or an ugly one if not.

I’m pretty sure it should be possible to take a detected loop cycle of separate TUs and create a combined TU that’s been “relooped” into something more efficient.

BBBBuuuuutttttt all this sounds expensive in terms of setup. Might want to hold off on any compilation until a loop cycle is detected, for instance, and just let the interpreter roll on one-off code.

Modifying runtime code in WebAssembly

Code is not addressable or modifiable within a live module instance; unlike in native code you can’t just write instructions into memory and jump to the pointer.

In fact, you can’t actually add code to a WebAssembly module. So how are we going to add our functions at runtime? There are two tricks:

First, multiple module instances can use the same linear memory buffer.

Second, the tables for indirect function calls can list “foreign” functions, such as JavaScript functions or WebAssembly functions from a totally unrelated module. And those tables are modifiable at runtime (from the JavaScript side of the border).

These can be used to do full-on dynamic linking of libraries, but all we really need is to be able to add a new function that can be indirect-called, which will run the compiled version of some number of instructions (perhaps even looping natively!) and then return back to the main emulator runtime when it reaches a branch it doesn’t contain.
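
Here’s a hedged JavaScript sketch of how those two tricks combine at runtime. The module bytes, the nextFreeIndex bookkeeping, and the export name tu_1234 are all invented for illustration; both modules are assumed to import the same memory and table as env.memory and env.table:

    // (inside an async function)
    const memory = new WebAssembly.Memory({ initial: 256 });
    const table = new WebAssembly.Table({ element: 'anyfunc', initial: 1024 });
    const imports = { env: { memory: memory, table: table } };

    // Instantiate the long-lived emulator module once:
    const emulator = await WebAssembly.instantiate(emulatorBytes, imports);

    // Later, at runtime: compile a tiny module containing one new compiled TU.
    // Because it imports the same memory and table, it sees the same "RAM":
    const tu = await WebAssembly.instantiate(tuBytes, imports);
    table.set(nextFreeIndex, tu.instance.exports.tu_1234);
    // The emulator can now reach the new code with call_indirect on that index.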

Function calls

Since x86 has a nice handy CALL instruction, and doesn’t just rely on convention, it could be possible to model calls to already-cached TUs as indirect function calls, which may perform better than exiting out to the loop and coming back in. But they’d probably need to be guarded for early exit, for several reasons… if we haven’t compiled the entirety of the relooped code path from start to exit of the function, then we have to exit back out. A guard check on IP and early-return should be able to do that in a fairly sane way.

function tu_1234() {
    // loop
    do {
        // calc loop condition -> set zero_flag
        ip = 1235
        if (!zero_flag) {
            break
        }
        ip = 1236
        // CALL 4567
        tu = cached_or_compiled_tu(4567)
        tu.execute()
        if (ip != 1236) {
            // only partway through. back to emulator loop,
            // possibly unwinding a long stack 🙂
            return
        }
        // more code, updating ip as we go
    } while (true)
}

I think this makes some kind of sense. But if we’re decoding instructions + creating output on the fly, it could take a few iterations through to produce a full compiled set, and exiting a loop early might be … ugly.

It’s possible that all this is a horrible pipe dream, or would perform too badly for JIT compilation anyway.

But it could still be fun for ahead-of-time compilation. 😉 Which is complicated… a lot … by the fact that you don’t have the positions of all functions known ahead of time. Plus, if there’s dynamic linking or JIT compilation inside the process, well, none of that’s even present ahead of time.

Prior art: v86

I’ve been looking a lot at v86, a JavaScript-based x86 system emulator. v86 is a straight-up interpreter, with instruction decoding and execution mixed together a bit, but it feels fairly straightforwardly written and easy to follow when I look at things in the code.

v86 uses a set of aliased typed arrays for the system memory, another set for the register file, and then some variables/properties for misc flags and things.

Some quick notes:

  • a register file in an array means accesses at different sizes are easy (al vs ax vs eax), and you can easily index into it from the operand selector bits of the instruction, as opposed to using a variable per register (see the sketch just after this list)
  • is there overhead from all the object property accesses etc? would it be more efficient to do everything within a big linear memory?
  • as a system emulator there’s some extra overhead to things like protected mode memory accesses (page tables! who knows what!) that could be avoided on a per-process model
  • 64-bit emulation would be hard in JavaScript due to lack of 64-bit integers (argh!)
  • as an interpreter, instruction decode overhead is repeated during loops!
  • to avoid expensive calculations of the flags register bits, most arithmetic operations that would change the flags instead save the inputs for the flag calculations, which get done on demand. This still is often redundant because flags may get immediately rewritten by the next instruction, but is cheaper than actually calculating them.
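
Here’s a minimal JavaScript sketch of that aliased register file idea; the layout is illustrative rather than v86’s actual one, and it assumes a little-endian host:

    var regBuffer = new ArrayBuffer(8 * 4); // eight 32-bit general-purpose registers
    var reg32 = new Uint32Array(regBuffer); // eax, ecx, edx, ebx, ...
    var reg16 = new Uint16Array(regBuffer); // ax is reg16[0], cx is reg16[2], ...
    var reg8 = new Uint8Array(regBuffer);   // al is reg8[0], ah is reg8[1], ...

    var EAX = 0;
    reg32[EAX] = 0x12345678;
    // reg16[EAX * 2] === 0x5678 (ax) and reg8[EAX * 4] === 0x78 (al),
    // with no masking or shifting needed to read the narrower views.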

WebAssembly possibilities

First, since WebAssembly supports only one linear memory buffer at a time, the register file and perhaps some other data would need to live there. Most likely want a layout with the register file and other data at the beginning of memory, with the rest of memory after a fixed point belonging to the emulated process.

Putting all the emulator’s non-RAM state in the beginning means a process emulator can request more memory on demand via Linux ‘brk’ syscall, which would be implemented via the ‘grow_memory’ operator.
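
As a sketch of that layout in action (function and variable names invented for illustration, not a real emulator’s API), a brk-style syscall on top of a growable memory might look like:

    var PAGE_SIZE = 65536; // WebAssembly memory grows in 64 KiB pages
    var memory = new WebAssembly.Memory({ initial: 256 }); // 16 MiB to start
    var currentBrk = 0x100000; // program break, placed past the emulator's own state

    function sysBrk(requestedBrk) {
        if (requestedBrk > memory.buffer.byteLength) {
            var pages = Math.ceil((requestedBrk - memory.buffer.byteLength) / PAGE_SIZE);
            memory.grow(pages); // the JS-side equivalent of grow_memory
        }
        if (requestedBrk > currentBrk) {
            currentBrk = requestedBrk;
        }
        return currentBrk; // Linux brk() returns the (possibly unchanged) break
    }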

64-bit math

WebAssembly supports 64-bit integer memory accesses and arithmetic, unlike JavaScript! The only limitation is that you can’t (yet) export a function that returns or accepts an i64 to or from JavaScript-land. That means if we keep our opcode implementations in WebAssembly functions, they can efficiently handle 64-bit ops.

However WebAssembly’s initial version allows only 32-bit memory addressing. This may not be a huge problem for emulating 64-bit processes that don’t grow that large, though, as long as the executable doesn’t need to be loaded at a specific address (which would mean a sparse address space).

Sparse address spaces could be emulated with indirection into a “real” memory that’s in a sub-4GB space, which would be needed for a system emulator anyway.

Linux details

Statically linked ELF binaries would be easiest to model. More complex to do dynamic linking, need to pass a bundle of files in and do fix-ups etc.

Questions: are executables normally PIC as well as libraries, or do they want a default load address? (Which would break the direct-memory-access model and require some indirection for sparse address space.)

Answer: normally Linux x86_64 executables are not PIE, and want to be loaded at 0x400000 or maybe some other random place. D’oh! But… in the common case, you could simplify that as a single offset.

Syscall on 32-bit is ‘int $0x80’; on 64-bit it’s the ‘syscall’ instruction. Syscalls would probably mostly need to be implemented on the JS side, poking at the memory and registers of the emulated process state and then returning.
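
A hedged sketch of what that JS-side dispatch could look like for 64-bit, using the x86-64 convention (rax holds the syscall number; rdi, rsi, rdx hold the first arguments). The state object and the sysWrite/sysBrk helpers are invented for illustration:

    function handleSyscall(state) {
        switch (state.rax) {
            case 1: // write(fd, buf, count)
                state.rax = sysWrite(state.rdi, state.rsi, state.rdx);
                break;
            case 12: // brk(addr)
                state.rax = sysBrk(state.rdi);
                break;
            default:
                state.rax = -38; // -ENOSYS: not implemented (yet)
        }
    }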

To do network i/o would probably need to be able to block and return to the emulator… so like a function call bowing out early due to an uncompiled branch being taken, would potentially need an “early exit” from the middle of a combined TU if it does a syscall that ends up being async. On the other hand, if a syscall can be done sync, might be nice not to pay that penalty.

Could also need async syscalls for multi-process stuff via web workers… anything that must call back to main thread would need to do async.

For 64-bit, JS code would have to …. painfully … deal with 32-bit half-words. Awesome. 😉

Multiprocessing

WebAssembly’s initial version has no facility for multiple threads accessing the same memory, which means no threads. However, this is planned to come in the future…

Processes with separate address spaces could be implemented by putting each process emulator in a Web Worker, and having them communicate via messages sent to the main thread through syscalls. This forces any syscall that might need global state to be async.
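
For instance (all names here are illustrative), the main thread could play “kernel” for a set of Worker-hosted processes:

    // Main thread: one Worker per emulated process, syscalls as messages.
    var proc = new Worker('emulated-process.js');
    proc.onmessage = function (e) {
        if (e.data.type === 'syscall') {
            dispatchGlobalSyscall(e.data).then(function (result) {
                proc.postMessage({ type: 'syscall-result', id: e.data.id, result: result });
            });
        }
    };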

Prior art: Browsix

Browsix provides a POSIX-like environment based around web techs, with processes modeled in Web Workers and syscalls done via async messages. (C/C++ programs can be compiled to work in Browsix with a modified emscripten.) Pretty sweet ideas. 🙂

I know they’re working on WebAssembly processes as well, and were also looking into synchronous syscalls via SharedArrayBuffer/Atomics, so this might be an interesting area to watch.

Could it be possible to make a Linux binary loader for the Browsix kernel? Maybe!

Would it be possible to make graphical Linux binaries work, with some kind of JS X11 or Wayland server? …mmmmmmaaaaybe? 😀

Closing thoughts

This all sounds like tons of fun, but may have no use other than learning a lot about some low-level tech that’s interesting.

by brion at June 06, 2017 08:59 AM

Wikimedia Scoring Platform Team

Status update (September 28th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-September/000102.html)

Hey,

This is the 23rd weekly update from the revision scoring team that we have sent to this mailing list.

New development

  • We implemented and demonstrated a linguistic/stylometric processing strategy that should give us more signal for finding vandalism and spam[1]. See the discussion on the AI list[2].
  • As part of our support for the Collaboration Team, we've been producing tables of model statistics that correspond to sets of thresholds[3]. This helps their designers work on strategies for reporting prediction confidence in an intuitive way.

Maintenance and robustness

  • We had a major downtime event that was caused by our logs being too verbose. We've recovered and turned down the log level[4].
  • We made sure that halfak gets pings when ores.wikimedia.org goes down[5].

Datasets

  • We created a database on Wikimedia Labs that provides access to a dataset containing a complete set of article quality predictions for English Wikipedia[6]. See our announcements[7,8,9].
  1. https://phabricator.wikimedia.org/T146335 -- Implement a basic scoring strategy for PCFGs
  2. https://lists.wikimedia.org/pipermail/ai/2016-September/000098.html
  3. https://phabricator.wikimedia.org/T146280 -- Produce tables of stats for damaging and goodfaith models
  4. https://phabricator.wikimedia.org/T146581 -- celery log level is INFO causing disruption on ORES service
  5. https://phabricator.wikimedia.org/T146720 -- Ensure that halfak gets emails when ores.wikimedia.org goes down
  6. https://phabricator.wikimedia.org/T106278 -- Setup a db on labsdb for article quality that is publicly accessible
  7. https://phabricator.wikimedia.org/T146156 -- Announce article quality database in labsdb
  8. https://lists.wikimedia.org/pipermail/ai/2016-September/000091.html
  9. https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_149#ORES_article_quality_data_as_a_database_table

Sincerely,
Aaron from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 06, 2017 05:41 AM

June 05, 2017

Wiki Education Foundation

Biology, bats, and becoming Wikipedian

Elysia Webb is a graduate student at the University of Florida. In Spring 2017 she contributed to Wikipedia for a course on Principles of Systematic Biology, taught by Emily Sessa. In this post she shares her experience improving the article on the Florida bonneted bat and the editing she has continued to do outside of class, focusing on other bat species and bat-related topics (for example, bat flight, greater long-nosed bat, and Maclaud’s horseshoe bat).

For years, I have been a consumer of Wikipedia. Now, I’m also a contributor.

Elysia Webb
Image: Elysia Webb headshot.jpg, by Elysia Webb, CC BY-SA 4.0, via Wikimedia Commons.

I didn’t get an introduction to the nuts and bolts of contributing to Wikipedia until my first year of graduate school. All the students in my systematic biology class had to create or drastically alter a page, and the topics were just as diverse as our fields of study: calibrating the molecular clock, obscure genera of flowering plants, and subfamilies of moths are examples of topics we covered. As for me, I wrote about my study species—an endangered New World bat.

The article for my bat already existed, but it was deficient in multiple areas. I took it to my sandbox to flesh it out, using the skills that I had picked up in the training modules. I created a distribution map of my species; I added paragraphs of text. Upon seeing that there were no freely available images of this bat, which was why the article had no photos, I uploaded my first image to Wikimedia Commons: an image I had snapped with my cell phone, which became the public face of this bat.

I watched as the article grew from a skeleton to a true Wikipedia article, with sections, citations, and images, and I felt pride at knowing that I was the impetus for the change. I realized that the page was part of a WikiProject, and the project’s classification of the page changed from “start” class to “B” class, indicative of a vast improvement.

After that first article, I was hooked.

Here I was, a ready-made researcher in training, used to ferreting out information from academic journals, and compiling sources for cohesive proposals or overviews. My academic status allowed me to pass through paywalls into the gated communities of online journals, and pull out citable information. The semester that I signed up for my Wikipedia account, I surpassed a thousand edits. Now that I knew how to edit articles and had the tools to do so, I had no excuse not to.

It’s surprising that the simple act of being assigned a Wikipedia page to edit would be so transformative. It’s made me put a lot of thought into the value of science and scientific writing for everyday people. I strongly value the processes of peer review and publishing papers to share information between scientists, but I now believe that the spread of information should be vertical as well, extending past the scientist stratum. Science needs intermediaries who can translate tough jargon and concepts for anyone to read and understand. Science needs resources like Wikipedia so that the public can have free, accessible information. Wikipedia also needs scientists to hold it to its principles of truth and neutrality.

I have always seen the tremendous value of Wikipedia as a consumer, but it took an assignment via the Wiki Education Foundation to make me see the value of contributing to Wikipedia. At the end of the semester, the instructor asked for our feedback; this was the first time she had made Wikipedia an assignment, and she wanted to know if we recommended that she do the same for future classes. Unanimously, we did. I’m glad that my instructor had the prescience to create a room of Wikipedia editors, who have cumulatively added almost one hundred thousand words across over two hundred articles to the mainspace in the span of a few months. Knowing how to edit Wikipedia has emboldened us to do so, we all agreed. Not only are we emboldened, though; we also feel charged to correct untruths where we see them, and provide information where it is missing.

I am proud to consider myself a biologist, a researcher, and now, an Editor; the Wikipedia Community gives me hope that the truth will always have defenders, and Wiki Ed gives me hope that the number will keep growing.

Image: Florida bonneted bat (Eumops floridanus).jpg, by Elysia Webb, CC BY-SA 4.0, via Wikimedia Commons.

by Ryan McGrady at June 05, 2017 06:04 PM

Wikimedia Cloud Services

#wikimedia-labs irc channel renamed to #wikimedia-cloud

The first very visible step in the plan to rename things away from the term 'labs' happened around 2017-06-05 15:00Z, when IRC admins made the #wikimedia-labs IRC channel on Freenode invite-only and set up an automatic redirect to the new #wikimedia-cloud channel.

If you were running a bot in the old channel or using an IRC bouncer like ZNC with "sticky" channel subscriptions you may need to make some manual adjustments to your configuration.


by bd808 (Bryan Davis) at June 05, 2017 03:13 PM

Gerard Meijssen

#Wikimedia - Felix Andries Vening Meinesz and the sum of all #knowledge

Mr Vening Meinesz was an important Dutch scientist. There are a few pointers to his relevance: he was a member of several august bodies and he received many awards, from several countries.

One of the medals is the "Alexander Agassiz Medal". When you look at the English Wikipedia article, you find no red links, even though many award winners do not have an article. Otto S. Pettersson, for instance, was known to Wikidata but was not associated with the award. And when you google another awardee, it turns out that the 1926 award winner was most likely Jacob Bjerknes, not Wilhelm Bjerknes. Even sources get it wrong.

In many ways, awards are rather boring. Getting all the information right is a lot of work and when it is to be written in Wikipedia articles, there are too many Wikipedias and awards for all the awards to get an article in all of them.

When awards are fed into Wikipedia articles from Wikidata, as is done for sources, it becomes a lot more manageable. Increasingly, Wikidata knows about more awards for more people. What does it take to reach the necessary tipping point? Which Wikipedia will consider this first?
Thanks,
     GerardM

by Gerard Meijssen (noreply@blogger.com) at June 05, 2017 09:53 AM

Wikimedia Performance Team

Investigating a performance improvement

Last week @Jdlrobson pinged me by email about a performance improvement his team noticed for large wiki articles on the mobile site in our synthetic tests run on WebPageTest. The improvement looked like this, a sudden drop in SpeedIndex (where lower is better):

SpeedIndex is described this way by Pat Meenan, the author of WebPageTest:

The Speed Index is the average time at which visible parts of the page are displayed. It is expressed in milliseconds and dependent on size of the view port.

The actual mathematical definition is a bit more complicated, but essentially it captures, from an end-user perspective, a score of how fast visual completeness is reached above the fold, for a given viewport size and a given internet connection speed. In the case of the WebPageTest run spotted by @Jdlrobson the viewport is mobile phone-sized and the speed is simulated 3G. We run different profiles in WebPageTest that represent different kinds of devices and internet speeds, because sometimes performance changes only affect some devices, some types of connection speeds and some types of wiki pages, even.
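
For reference, WebPageTest computes SpeedIndex as (roughly) the area above the visual-completeness curve:

    SpeedIndex = \int_0^{t_{end}} (1 - VC(t)) \, dt

where VC(t) is the fraction of the viewport that is visually complete at time t. A page that fills its viewport quickly accumulates less area and therefore scores lower (better).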

As we do for every suspected performance improvement or regression with an unknown cause, we filed a task, tagged it with the Performance-Team tag, and began investigating: T166373: Investigate apparent performance improvement around 2017-05-24. Creating a task directly and tagging us is also a good way to get our attention.

Comparing synthetic testing and real user metrics

When a change like this happens in synthetic testing, we first verify whether or not a similar change was seen in our real user metrics. Specifically, Navigation Timing in Grafana.

SpeedIndex can't be measured for real users. Real user metrics are limited by the APIs available in the browser, which are very basic compared to what WebPageTest can do. There's no way to tell the visual completeness of the whole page from client-side code.
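
For a sense of how coarse those in-browser APIs are, this is roughly the level of detail the Navigation Timing API exposes (a sketch; our actual instrumentation is more involved):

    var t = performance.timing;
    // Coarse milestones, in milliseconds since the start of navigation:
    console.log('responseStart:', t.responseStart - t.navigationStart);
    console.log('domInteractive:', t.domInteractive - t.navigationStart);
    console.log('loadEventEnd:', t.loadEventEnd - t.navigationStart);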

The main real user metric we track is firstPaint. However, firstPaint measures something very different from SpeedIndex: firstPaint is when the web browser starts painting anything on the page, whereas SpeedIndex is about how fast visual completion of the viewport happens. Essentially, SpeedIndex is about the phase that happens after firstPaint, which real user metrics can't measure. But since they're on the same timeline, it's common for a SpeedIndex change to come with a variation in real user metrics like firstPaint. When that happens, it makes the investigation easier, because we know it's not an issue in our telemetry but a real effect. When there's no correlation between synthetic testing metrics and real user metrics, we just have to keep investigating in more depth.

This fundamental difference means that some performance improvements can improve SpeedIndex while not changing firstPaint or any other Navigation Timing metric. It's unfortunate in the sense that we know performance has improved; we just can't measure how much it did in the real world for our users. This is exactly what we were seeing here: real user metrics didn't improve during that period. Which doesn't mean that performance didn't really improve for people; as we'll see later, it did. It's also fundamental to understand that Navigation Timing is only a partial view of performance, and that some performance changes simply cannot be measured from real user data at this time.

Comparing WebPageTest runs

The next logical step was to compare WebPageTest runs before and after the performance change. Our synthetic tests, which run continuously, can be consulted on our public WebPageTest instance. WebPageTest's UI isn't the best suited for our use case, so here's a walkthrough of where to look. First you want to click on the test history section, which brings you to this view:

Then click on the show tests from all users checkbox. You should now see all our test runs:

We test a number of pages for the desktop and mobile sites, using various simulated internet connection speeds, etc. Finding the tests you're interested in in this history view requires manual labour: you need to manually search for the labels you're interested in, as the search box only applies to the URL.

WebPageTest supports a great feature to compare different runs from the history view. We won't get into that here, though, as the difference is visible from the screenshots of the runs alone. After combing through the history view, I found two runs of the same test (the Sweden article on English Wikipedia, browsing the mobile site in Chrome with a simulated 3G connection), before and after the SpeedIndex drop.

Before:

After:

It's obvious that the content above the fold changed. The new version displays mostly text above the fold, where the old version had images. This explains the SpeedIndex improvement: it's faster to load text than an image, which means that users get content they can consume above the fold faster. This is more dramatic on slow connections, which is why this performance improvement showed up on our synthetic testing that simulated a 3G connection.

Deliberate or accidental change?

The next part of the investigation was to determine whether that was an accidental change, or a deliberate one. For this, the first place to look is the Wikimedia Server Admin Log. Whenever changes are deployed to Wikimedia production, log entries are added there. Deployments can be individual patches or our weekly deployment train. This part of the investigation is simple: we're simply going through the log, looking for anything that happened around the time the performance change happened.

And sure enough, we found this log entry around the time of the performance change:

18:31 thcipriani@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: mobileFrontend: Move first paragraph before infobox T150325 (duration: 00m 41s)

The task quoted in that log entry, T150325: Move first paragraph before infobox on stable, is a deliberate change to improve the user experience by showing the first section of an article at the top rather than the infobox. While making this change, @phuedx also improved the performance for users on slow internet connections, who will now see the first section of an article above the fold, which they can start reading early, instead of a mostly empty infobox whose images are still loading.

by Gilles (Gilles Dubuc) at June 05, 2017 08:57 AM

Gerard Meijssen

#Wikidata - a #young face for #science


These are members of the "Jonge Academie". They are Dutch scientists, and for this academy to remain young, membership expires after ten years. Similar academies exist in several other countries, such as Pakistan and Belgium.

With politicians riding roughshod over scientific facts, the rejuvenation of science is important. The notion that scientists are old is all too easy, and it is equally easy to dismiss young scientists for a lack of relevance. Check out these Dutch scientists: they are relevant and will be for a long time.
Thanks,
      GerardM

by Gerard Meijssen (noreply@blogger.com) at June 05, 2017 07:26 AM

Tech News

Tech News issue #23, 2017 (June 5, 2017)


June 05, 2017 12:00 AM

June 03, 2017

Wikimedia Scoring Platform Team

Status update (April 14th, 2017)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2017-April/000154.html)

Hey folks,

In this update, I'm going to change some things up to try to make it easier for you to consume. The biggest change you'll notice is that I've broken up the [#] references in each section. I hope that saves you some scrolling and confusion. You'll also notice that I have changed the subject line from "Revision scoring" to "Scoring Platform" because it's now clear that, come July, I'll be leading a new team with that name at the Wikimedia Foundation. There'll be an announcement about that coming once our budget is finalized. I'll try to keep this subject consistent for the foreseeable future so that your email clients will continue to group the updates into one big thread.

Deployments & maintenance:

In this cycle, we've gotten better at tracking our deployments and noting what changes go out with each deployment. You can click on the Phab task for a deployment and look at the sub-tasks to find out what was deployed. We've had 3 deployments of ORES since mid-March[1,2,3]. We've had two deployments to Wikilabels[4,5] and we've added a maintenance notice for a short period of downtime that's coming up on April 21st[6,7].

  1. https://phabricator.wikimedia.org/T160279 -- Deploy ores in prod (Mid-March)
  2. https://phabricator.wikimedia.org/T160638 -- Deploy ORES late march
  3. https://phabricator.wikimedia.org/T161748 -- Deploy ORES early April
  4. https://phabricator.wikimedia.org/T161002 -- Late march wikilabels deployment
  5. https://phabricator.wikimedia.org/T163016 -- Deploy Wikilabels mid-April
  6. https://phabricator.wikimedia.org/T162888 -- Add header to Wikilabels that warns of upcoming maintenance.
  7. https://phabricator.wikimedia.org/T162265 -- Manage wikilabels for labsdb1004 maintenance

Making ORES better:

We've been working to make ORES easier to extend and more useful. ORES now reports its relevant versions at https://ores.wikimedia.org/versions[8] (see the sketch after the reference list below). We've also reduced the complexity of our "precaching" system that scores edits before you ask for them[9,10]. We're taking advantage of logstash to store and query our logs[11]. We've also implemented some nice abstractions for requests and responses in ORES[12] that allowed us to improve our metrics tracking substantially[13].

  1. https://phabricator.wikimedia.org/T155814 -- Expose version of the service and its dependencies
  2. https://phabricator.wikimedia.org/T148714 -- Create generalized "precache" endpoint for ORES
  3. https://phabricator.wikimedia.org/T162627 -- Switch /precache to be a POST end point
  4. https://phabricator.wikimedia.org/T149010 -- Send ORES logs to logstash
  5. https://phabricator.wikimedia.org/T159502 -- Exclude precaching requests from cache_miss/cache_hit metrics
  6. https://phabricator.wikimedia.org/T161526 -- Implement ScoreRequest/ScoreResponse pattern in ORES
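
As a quick illustration, here's a minimal Python sketch that queries the new endpoint. Only the URL comes from the update above; the script just prints whatever version information ORES returns, without assuming a particular layout.

import json
import urllib.request

# Fetch and pretty-print the version info ORES reports about itself
# and its dependencies.
with urllib.request.urlopen("https://ores.wikimedia.org/versions") as resp:
    print(json.dumps(json.load(resp), indent=2))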

New functionality:

In the last month and a half, we've added basic support to Korean Wikipedia[14,15]. Props to Revi for helping us work through a bunch of issues with our Korean language support[16,17,18].

We've also gotten the ORES Review tool deployed to Hebrew Wikipedia[19,20,21,22] and Estonian Wikipedia[23,24,25]. We're also working with the Collaboration team to implement the threshold test statistics that they need to tune their new Edit Review interface[26], and we're working towards making this kind of work self-serve so that the product team and other tool developers won't have to wait on us to implement these threshold stats in the future[27].

  1. https://phabricator.wikimedia.org/T161617 -- Deploy reverted model for kowiki
  2. https://phabricator.wikimedia.org/T161616 -- Train/test reverted model for kowiki
  3. https://phabricator.wikimedia.org/T160752 -- Korean generated word lists are in chinese
  4. https://phabricator.wikimedia.org/T160757 -- Add language support for Korean
  5. https://phabricator.wikimedia.org/T160755 -- Fix tokenization for Korean
  6. https://phabricator.wikimedia.org/T161621 -- Deploy ORES Review Tool for hewiki
  7. https://phabricator.wikimedia.org/T130284 -- Deploy edit quality models for hewiki
  8. https://phabricator.wikimedia.org/T160930 -- Train damaging and goodfaith models for hewiki
  9. https://phabricator.wikimedia.org/T130263 -- Complete hewiki edit quality campaign
  10. https://phabricator.wikimedia.org/T159609 -- Deploy ORES review tool to etwiki
  11. https://phabricator.wikimedia.org/T130280 -- Deploy edit quality models for etwiki
  12. https://phabricator.wikimedia.org/T129702 -- Complete etwiki edit quality campaign
  13. https://phabricator.wikimedia.org/T162377 -- Implement additional test_stats in editquality
  14. https://phabricator.wikimedia.org/T162217 -- Implement "thresholds", deprecate "pile of tests_stats"

ORES training / labeling campaigns:

Thanks to a lot of networking at the Wikimedia Conference and some help from Ijon (Asaf Bartov), we've found a bunch of new collaborators to help us deploy ORES to new wikis. As is critical in this process, we need to deploy labeling campaigns so that Wikipedians can help us train ORES.

We've got new editquality labeling campaigns deployed to Albanian[28], Finnish[29], Latvian[30], Korean[31], and Turkish[32] Wikipedias.

We've also been working on a new type of model: "Item quality" in Wikidata. We've deployed, labeled, and analyzed a pilot[33], fixed some critical bugs that came up[34,35], and we've finally launched a 5k item campaign which is already 17% done[36]! See https://www.wikidata.org/wiki/Wikidata:Item_quality_campaign if you'd like to help us out.

  1. https://phabricator.wikimedia.org/T161981 -- Edit quality campaign for Albanian Wikipedia
  2. https://phabricator.wikimedia.org/T161905 -- Edit quality campaign for Finnish Wikipedia
  3. https://phabricator.wikimedia.org/T162032 -- Edit quality campaign for Latvian Wikipedia
  4. https://phabricator.wikimedia.org/T161622 -- Deploy editquality campaign in Korean Wikipedia
  5. https://phabricator.wikimedia.org/T161977 -- Start v2 editquality campaign for trwiki
  6. https://phabricator.wikimedia.org/T159570 -- Deploy the pilot of Wikidata item quality campaign
  7. https://phabricator.wikimedia.org/T160256 -- Wikidata items render badly in Wikilabels
  8. https://phabricator.wikimedia.org/T162530 -- Implement "unwanted pages" filtering strategy for Wikidata
  9. https://phabricator.wikimedia.org/T157493 -- Deploy Wikidata item quality campaign

Bug fixing:

As usual, we had a few weird bugs that got in our way. We needed to move to a bigger virtual machine in "Beta Labs" because our models take up a bunch of hard drive space[37]. We found that Wikilabels wasn't removing expired tasks correctly and that this was making it difficult to finish labeling campaigns[38]. We also had a lot of right-to-left issues when we did an upgrade of OOjs UI[39]. And we fixed an old bug with https://translatewiki.net in one of our message keys[40].

  1. https://phabricator.wikimedia.org/T160762 -- deployment-ores-redis /srv/ redis is too small (500MBytes)
  2. https://phabricator.wikimedia.org/T161521 -- Wikilabels is not cleaning up expired tasks for Wikidata item quality campaign
  3. https://phabricator.wikimedia.org/T161533 -- Fix RTL issues in Wikilabels after OOjs UI upgrade
  4. https://phabricator.wikimedia.org/T132197 -- qqq for a wiki-ai message cannot be loaded

-@Halfak
Principal Research Scientist
Head of the Scoring Platform Team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 06:30 PM

Status update (March 16th, 2017)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2017-March/000145.html)

Hey folks!

I should really stop calling this a weekly update because it's getting a bit silly at this point. :) But if it were a weekly update, it would cover the weeks of 42 - 46.

Highlights:

  • 3 new models: Finnish Wikipedia (reverted) and Estonian Wikipedia (damaging & goodfaith)
  • We estimated and agreed on funding for ORES servers in the next year with Operations
  • We published a paper about vandalism detection in Wikidata and a blog post about the massive effect of some initiatives on coverage of Women Scientists in Wikipedia.

New development:

  • We added recall-based threshold metrics to the new draftquality model, which should help tool devs know which new page creations to highlight for review[1]
  • We added optional notices for ORES pages which will help us visually distinguish our experimental install on WMFlabs from the prod install (ores.wikimedia.org)[2]
  • We added basic language support for Finnish (Thanks 4shadoww)[3] and deployed a 'reverted' model[4]
  • We led a discussion in Wikidata about "item quality" that resulted in a Wikipedia 1.0-like scale for Wikidata quality[5,6] and designed a Wikilabels form to capture the gist of it[7]
  • We enabled the ORES Review Tool on Czech Wikipedia[8]
  • We configured ChangeProp to use our new minified JSON output to save bandwidth[9]
  • We extended the Estonian language assets (Thanks Cumbril)[10] and deployed the 'damaging' and 'goodfaith' models[11,12]
  • We enabled a testing model for 'goodfaith' on the Beta Cluster to make it easier for the Collaboration team to run tests with their new filter interface[13]
  • We created a new "precache" endpoint that will allow us to de-duplicate configuration with ChangeProp and handle all routing in ORES locally[14]

Resourcing:

  • We completed a 2 year estimate of ORES resource needs and discussed funding (capital expenditure) for ORES in the coming fiscal year[15]. This will allow us to continue to grow ORES both in number of models and in scoring capacity.

Communications:

  • Amir improved the KDD paper based on review feedback[16] and got it published[17]
  • We published a blog post about our measurements of WikiProject Women Scientists[18,19] -- "The Keilana Effect"
  • Thanks to Cumbril's work, the Estonian labeling campaign was finished[20]

Deployments:

  • In early February, we deployed a new set of translations to Wikilabels (specifically targeting Romanian Wikipedia)[21]
  • In mid-February, we deployed some fixes to ORES documentation and response formatting[22]
  • In mid-March, we deployed 3 new scoring models and ORES notices[23]

Maintenance and robustness:

  • We fixed a serious issue in the "mwoauth" library that Wikilabels depends on[24]
  • We reduced the number of revisions per request that we could receive via api.php[25]
  • We investigated a scap issue that broke ORES deployment[26]
  • We fixed a minor issue with JSON minification behavior[27] and hard-coding of the location of ORES in the documentation[28]
  • We improved performance of ORES filters on MediaWiki[29]
  • We improved the language describing ORES behavior on Special:Contributions[30]
  • We added a notice to the Wikipages that Dexbot maintains about its behavior[31]
  • We added notices to ores.wmflabs.org about its experimental nature[32]
  • We fixed some issues with testing Finnish language assets[33]
  • We fixed some styling issues that resulted from an upgrade of OOJS UI[34]
  1. https://phabricator.wikimedia.org/T157454 -- Add recall based thresholds to draftquality model
  2. https://phabricator.wikimedia.org/T150962 -- Add an optional notice to ORES main and ui pages
  3. https://phabricator.wikimedia.org/T158587 -- Add language support for Finnish
  4. https://phabricator.wikimedia.org/T160228 -- Train/test reverted model for fiwiki
  5. https://phabricator.wikimedia.org/T157489 -- [Discuss] item quality in Wikidata
  6. https://www.wikidata.org/wiki/Wikidata:Item_quality
  7. https://phabricator.wikimedia.org/T155828 -- Design item_quality form for Wikidata
  8. https://phabricator.wikimedia.org/T151611 -- Enable ORES Review Tool on Czech Wikipedia
  9. https://phabricator.wikimedia.org/T157693 -- Use minified JSON format in ChangeProp
  10. https://phabricator.wikimedia.org/T160193 -- Extend estonian language assets from Wiki page
  11. https://phabricator.wikimedia.org/T159608 -- Train/test damaging/goodfaith models for etwiki
  12. https://phabricator.wikimedia.org/T130280 -- Deploy edit quality models for etwiki
  13. https://phabricator.wikimedia.org/T160467 -- Enable 'goodfaith' on testwiki on Beta Cluster
  14. https://phabricator.wikimedia.org/T148714 -- Create generalized "precache" endpoint for ORES
  15. https://phabricator.wikimedia.org/T157222 -- Estimate ORES capex for FY2017-18
  16. https://phabricator.wikimedia.org/T148443 -- Improve the KDD paper based on the review
  17. https://arxiv.org/abs/1703.03861
  18. https://phabricator.wikimedia.org/T160078 -- Blog post about wp10 measurements of Women Scientists
  19. https://blog.wikimedia.org/2017/03/07/the-keilana-effect/
  20. https://phabricator.wikimedia.org/T129702 -- Complete etwiki edit quality campaign
  21. https://phabricator.wikimedia.org/T157580 -- Deploy Romanian translations for Wiki labels
  22. https://phabricator.wikimedia.org/T157842 -- Prod deployment of ORES
  23. https://phabricator.wikimedia.org/T160279 -- Deploy ores in prod (Mid-March)
  24. https://phabricator.wikimedia.org/T157858 -- mwoauth is broken
  25. https://phabricator.wikimedia.org/T157983 -- Reduce the number of revisions that can be requested in one batch
  26. https://phabricator.wikimedia.org/T157623 -- Investigate failed ORES deployment
  27. https://phabricator.wikimedia.org/T157721 -- Investigate default JSON minification behavior in production
  28. https://phabricator.wikimedia.org/T157723 -- ORES swagger is hard-coded for wmflabs
  29. https://phabricator.wikimedia.org/T152585 -- rcshow=oresreview is slow
  30. https://phabricator.wikimedia.org/T158862 -- Fix message in Special:Contributions
  31. https://phabricator.wikimedia.org/T158899 -- Add notice about Dexbot overwriting manual changes to our tracking table.
  32. https://phabricator.wikimedia.org/T159055 -- Add a notice to ores-wmflabs-deploy about "experimental" nature
  33. https://phabricator.wikimedia.org/T160192 -- Fix testing issues in finnish language assets
  34. https://phabricator.wikimedia.org/T160258 -- Fix minor styling issues with OOJS-UI in wikilabels

Sincerely,
Aaron from the Scoring Platform team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 06:25 PM

AI Wishlist initialized and a new Phab Tag (January 31st, 2017)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2017-January/000134.html)

Hey folks,

I hosted the AI Wishlist session at the Developer Summit (T147710). At that
session, we brainstormed a set of AIs that we think would be interesting to
implement. Generally, I asked people to do their best to follow a template
that would help us remember why the AI was important, what it would help
with, and what resources might help get it implemented.

Well, I've taken all of the notes and filed a large set of phab tasks under
a new "artificial-intelligence" tag. Please review all of the fun, new
proposals that are listed there and make sure you subscribe to those that
you're interested in.

See artificial-intelligence

-@Halfak

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 06:21 PM

New version of Wiki labels (January 8th, 2017)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2017-January/000130.html)

Hello,

Wikilabels [1] is the system for labeling edits for ORES. Until now, users would have to visit a page on Wikipedia, for example WP:Labels [2], install a gadget, and then label edits for ORES. With the new version (0.4.0) deployed today, you can go directly to the Wikilabels home page, for example https://labels.wmflabs.org/ui/enwiki, and label edits from there. If you installed the gadget, you can remove it now. We also added some minification and bundling to improve performance.

Labeling edits helps ORES work more accurately, and in case the ORES review tool is not enabled on your wiki, you can provide this data for us using Wikilabels so we can enable it for your wiki as well!

[1] https://meta.wikimedia.org/wiki/Wiki_labels
[2] https://en.wikipedia.org/wiki/Wikipedia:Labels

Best

Amir Sarabadani Tafreshi (@Ladsgroup)
Software Engineer (contractor)

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 06:20 PM

Status update (November 29th, 2016)

(The post was copied from https://lists.wikimedia.org/pipermail/ai/2016-November/000118.html)

Hey,

This is the 30th and 31st weekly update from the revision scoring team that
we have sent to this mailing list. We accidentally skipped a week again.

New development:

  • We added a new "lowest" sensitivity level to the ORES review tool. This new sensitivity level will only flag edits that ORES is very confident are actually damaging[1].
  • We applied the MediaWiki standard color palette to Wikilabels[2]
  • We generated a manually censored public dataset of spam/vandalism/attack pages[3]. This will help others to develop spam, vandalism and attack page detection models. See the publication of the dataset[4].
  • We've implemented color-based confidence reporting for ORES damage detection[5]

Maintenance and robustness:

  • We updated the version of OOjs UI that gets bundled with Wiki labels[6] and moved the static assets to a new repository[7]
  • We fixed an issue in the revscoring library[8] that caused ORES to return invalid JSON and rendered the UI useless[9].

Communications:

  • We gave a 3 minute presentation on the state of ORES to Victoria Coleman, the WMF's new CTO[10].
  • We performed a basic analysis of Wikipedia article quality trends using the dataset we released a few weeks ago[11]. We'll have a more substantial analysis soon.
  • We made a post on the ORES review tool talk page[12,13] detailing how we plan to incorporate a new filtering strategy into the ORES review tool. Please join the discussion there.
  1. https://phabricator.wikimedia.org/T150224 -- Add "Lowest" ORES sensitivity for fpr=0.1
  2. https://phabricator.wikimedia.org/T151119 -- Apply ui standardization color palette to Wikilabels
  3. https://phabricator.wikimedia.org/T150307 -- Create manually vetted dataset of spam/vandalism/attack pages
  4. https://dx.doi.org/10.6084/m9.figshare.4245035
  5. https://phabricator.wikimedia.org/T144922 -- Visually report damaging confidence
  6. https://phabricator.wikimedia.org/T151222 -- Update bundled OOJS-ui with Wikilabels
  7. https://github.com/wiki-ai/flask-oojsui
  8. https://phabricator.wikimedia.org/T150961 -- ORES ui is broken (text field disabled)
  9. https://github.com/wiki-ai/ores/issues/177
  10. https://phabricator.wikimedia.org/T150544 -- ORES (a 2-3 minute presentation)
  11. https://phabricator.wikimedia.org/T151214 -- Basic analysis of Wikipedia quality using monthly predictions
  12. https://phabricator.wikimedia.org/T150858 -- Post about ORES review tool including ERI filters
  13. https://www.mediawiki.org/wiki/Topic:Tflhjj5x1numzg67

Sincerely,
Aaron from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 06:17 PM

Changes coming to ORES review tool (November 26th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-November/000117.html)

Hey,

With the merge of 320328 [1] and 320341 [2], two major changes will come to
the ORES review tool:
1- You will see one more option for ORES sensitivity, called "Lowest". It
means that if you choose it, it only flags edits that are very likely to be
vandalism.
2- Coloring of rows will be completely different. You will see several
colors instead of one, and as ORES' confidence grows, the colors will tend
to be more noticeable. It goes without saying that you can change these
colors in your own CSS. I put a screenshot in [3] and you can test it on
https://en.wikipedia.beta.wmflabs.org or https://mw-revscoring.wmflabs.org

Feedback is always welcome

  1. https://gerrit.wikimedia.org/r/#/c/320328/
  2. https://gerrit.wikimedia.org/r/#/c/320341/
  3. https://phabricator.wikimedia.org/T144922#2824696

Best

Amir Sarabadani Tafreshi (@Ladsgroup)
Software Engineer (contractor)

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:54 PM

Awesome AI topics in need of discussion (Dev Summit) (November 18th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-November/000116.html)

Hey folks,

I'm your friendly facilitator who forgot that today was the last day to gather discussion on a set of topics for the Dev Summit. I might be a bit biased, but I think they are all pretty interesting, so I'm reaching out with a quick overview to see if I can spur some interest from y'all. Check 'em out:

If you're interested, please drop a note or a token in the task. BTW, you don't have to physically attend the dev summit in order to participate. I'll make sure that IRC and Etherpad are shared with all remote attendees who want to attend the sessions I'm helping to organize. I've heard that there will be additional facilities for remote attendees (maybe a youtube stream!?) this year, but I can't confirm yet.

-Aaron

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:52 PM

Including new filter interface in ORES review tool (November 18th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-November/000115.html)

Hey folks,

I made a post at mw:Topic:Tflhjj5x1numzg67 about including the
new advanced filtering interface that the Collaboration Team is working on
in the ORES beta feature. See the original post and add any discussion
there.

Here's a demo of what the new filtering interface will look like:
https://www.mediawiki.org/wiki/File:New-feature_demo%E2%80%94smart_Recent_Changes_filtering_with_ORES.webm

-Aaron

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:50 PM

Status update (November 10th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-November/000114.html)

Hey,

This is the 29th weekly update from the revision scoring team that we have
sent to this mailing list.

Deployments:

  • We deployed logging changes to ORES that will reduce the verbosity[1]
  • We also deployed revscoring 1.3.0 and new models built with it to WMF labs[2]. This won't change anything important from a user-perspective, but it paves the way for developing new modeling strategies.

Maintenance and robustness:

  • We fixed puppet so that log file directories are also created on the celery worker nodes (affects wmflabs)[3]
  • We fixed an issue with our recall_at_fpr metric, which was incorrectly defined, and implemented a recall_at_precision metric to take its place[4] (a rough sketch of the new metric follows this list)
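
The new metric can be computed from a precision/recall curve. Here's a rough sketch using scikit-learn; it illustrates the statistic itself, not the team's actual revscoring implementation.

import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at_precision(y_true, scores, min_precision):
    # Scan the precision/recall curve for the highest recall whose
    # operating point still meets the precision floor.
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    ok = precision[:-1] >= min_precision  # the final curve point has no threshold
    if not ok.any():
        return None
    best = int(np.argmax(recall[:-1] * ok))
    return {"recall": recall[best],
            "precision": precision[best],
            "threshold": thresholds[best]}

# Example with toy labels and model scores:
print(recall_at_precision([0, 0, 1, 1, 1], [0.1, 0.6, 0.4, 0.8, 0.9],
                          min_precision=0.75))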

New development:

  • We've made a lot of progress on modeling sentences and have just started experimenting with a sentence model from featured articles[5]
  • We're reviewing a dataset of spam/vandalism/attack new page creations for public release[6]. This dataset will help our collaborators work with us on modeling the quality of drafts and supporting new page triage.
  1. https://phabricator.wikimedia.org/T149730 -- Deploy logging changes to ORES
  2. https://phabricator.wikimedia.org/T150447 -- Deploy revscoring 1.3.0 and updated editquality and wikiclass to wmflabs
  3. https://phabricator.wikimedia.org/T149925 -- /srv/log/ores/ not created on worker nodes
  4. https://phabricator.wikimedia.org/T149825 -- Implement recall at precision (and fix FPR metrics)
  5. https://phabricator.wikimedia.org/T148867 -- Implement sentences datascources & experiment with normalization.
  6. https://phabricator.wikimedia.org/T150307 -- Create manually vetted dataset of spam/vandalism/attack pages

Sincerely,
Aaron from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:48 PM

Status update (November 3rd, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-November/000113.html)

Hey,
This is the 28th weekly update from the revision scoring team that we have
sent to this mailing list.

Publications:

  • "New dataset shows fifteen years of Wikipedia’s quality trends" posted on the Wikimedia Blog [1,2]
  • Halfaker, Aaron (2016): Monthly Wikipedia article quality predictions. figshare. https://doi.org/10.6084/m9.figshare.3859800 [3,4]

Maintenance and robustness:

  • Now, most of the ORES extension source code is covered by continuous integration tests [5]
  • In order to keep track of changes in ORES Grafana dashboards, we now keep their JSON content on GitHub [6]
  • Implemented a new metric for Grafana: datasources_extracted [7]

New development:

  • Thanks to the Reading team, the ORES extension now has API modules to expose ORES scores [8], filter recent changes and watchlists [9], and expose ORES model data [10]
  1. https://blog.wikimedia.org/2016/10/27/wikipedia-quality-trends-dataset/
  2. https://phabricator.wikimedia.org/T146709
  3. https://dx.doi.org/10.6084/m9.figshare.3859800
  4. https://phabricator.wikimedia.org/T145332
  5. https://phabricator.wikimedia.org/T146560
  6. https://phabricator.wikimedia.org/T149347
  7. https://phabricator.wikimedia.org/T149199
  8. https://phabricator.wikimedia.org/T143614
  9. https://phabricator.wikimedia.org/T143616
  10. https://phabricator.wikimedia.org/T143617

Sincerely,
@Ladsgroup from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:34 PM

Status update (August 2nd, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-August/000049.html)

Hey,

This is the 15th weekly update from the revision scoring team that we have
sent to this mailing list.

New developments:

  • We'll no longer unnecessarily load the models into memory on the web workers[2].
  • We can now score multiple models against the same revision ID for (essentially) free[1].
  • Our precaching system will take advantage of this to drop load by about 3X[3].
  • Update wmflabs deploy repo for new version of ORES[4].

Documentation & maintenance:

  • We completed deployment and maintenance docs for Wiki labels[5], which means we've now got complete docs for our systems[6].
  • We implemented basic continuous integration tests for the ORES extension[7].

Downtime:

  • We had a 1-hour downtime while trying to deploy new code to ores.wikimedia.org[8]. We've filed two critical tasks to make sure we don't make the same mistake again[9,10].
  1. https://phabricator.wikimedia.org/T134606 - Score multiple models with the same cached dependencies
  2. https://phabricator.wikimedia.org/T139407 - Don't load models into memory of web workers
  3. https://phabricator.wikimedia.org/T141376 - Update precached to group requests by model
  4. https://phabricator.wikimedia.org/T141377 - Update wmflabs deploy repo for new version of ORES
  5. https://phabricator.wikimedia.org/T131768 - Wikilabels deployment docs
  6. https://phabricator.wikimedia.org/T106271 - Document maintenance tasks
  7. https://phabricator.wikimedia.org/T140455 - CI test for ORES extension
  8. https://wikitech.wikimedia.org/wiki/Incident_documentation/20160801-ORES
  9. https://phabricator.wikimedia.org/T141823 - Set up password on ORES Beta redis server
  10. https://phabricator.wikimedia.org/T141825 - Config beta ORES extension to use the beta ORES service

Sincerely,
Aaron from the Revision Scoring team

Edit: Note that when I copied this post, I forgot to copy the followups from the same month. See them all here:

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:29 PM

NLP (PCFG) work (September 28th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-September/000098.html)

I've been looking at some recent work that used Probabilistic Context-free Grammars[1,2] to detect vandalism in Wikipedia. I wanted to send a quick message to share some progress.

I've built a python library[3] that implements a really simple PCFG training and scoring strategy and written a quick demo of how it can work. In the following demo, I show how we can build a probabilistic grammar using the "I'm a Little Teapot" song[4]. Note how sentences that are not characteristic of the song score lower. Note that scores are log-scaled.

>>> # Setup (assumed; not shown in the original post): TreeScorer comes
>>> # from the 'kasami' library[3], and bllip_parse is a helper that
>>> # wraps the BLLIP reranking parser to turn a sentence into a parse
>>> # tree that kasami can consume.
>>> from kasami import TreeScorer
>>> sentences = [
...              "I am a little teapot",
...              "Here is my handle",
...              "Here is my spout",
...              "When I get all steamed up I just shout tip me over and pour me out",
...              "I am a very special pot",
...              "It is true",
...              "Here is an example of what I can do",
...              "I can turn my handle into a spout",
...              "Tip me over and pour me out"]
>>>
>>>
>>> teapot_grammar = TreeScorer.from_tree_bank(bllip_parse(s) for s in sentences)
>>>
>>> teapot_grammar.score(bllip_parse("Here is a little teapot"))
-9.392661928770137
>>> teapot_grammar.score(bllip_parse("It is my handle"))
-10.296301543090733
>>> teapot_grammar.score(bllip_parse("I am a spout"))
-10.40166205874856
>>> teapot_grammar.score(bllip_parse("Your teapot is gay"))
-12.96352974967269
>>> teapot_grammar.score(bllip_parse("Your mom's teapot is asldasnldansldal"))
-19.424997926026403

This work is inspired by work that Arthur Tilley (@aetilley) did on our team last year[5]. The 'kasami' library[3] represents a narrow slice of Arthur's work.

Next, I'm working on building out revscoring to implement some features
that use the scoring strategy on sentences modified in an edit. I'm hoping
that this type of feature engineering will allow us to catch edits that
make articles more/less notable. I'm also targeting spammy language and
insults.

  1. https://en.wikipedia.org/wiki/Stochastic_context-free_grammar
  2. http://pub.cs.sunysb.edu/~rob/papers/acl11_vandal.pdf
  3. https://github.com/halfak/kasami
  4. https://en.wikipedia.org/wiki/I%27m_a_Little_Teapot
  5. https://github.com/aetilley/pcfg

-@Halfak

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:20 PM

ORES scores injected into JS config (October 29th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-October/000112.html)

Hey,

With the deployment of I42300f9 [1] in wmf.23, which happened several days ago (depending on your wiki), ORES scores are injected into MediaWiki JavaScript config variables. You can access ORES data on Special:RecentChanges/Watchlist/Contributions using mw.config.get('oresData'). This opens up a whole new level of functionality for gadgets. For example, I re-wrote a huge script called ScoredRevisions [2] in several lines [3]. Also, since it doesn't need to connect to ores.wikimedia.org, it's much faster than the original gadget. You can also write scripts to sort rows in recent changes based on their ORES scores, etc. As a fun task I made my recent changes look like a rainbow :D [4] [5]

The next level is to inject ORES thresholds as mediawiki config variables so we can write up wiki-agnostic gadgets.

I would really appreciate comments or ideas :)

  1. https://gerrit.wikimedia.org/r/#/c/314449/
  2. https://github.com/he7d3r/mw-gadget-ScoredRevisions/blob/master/src/ScoredRevisions.js
  3. https://gist.github.com/Ladsgroup/e67e40500b64dd99dc7ab5c2fa34f261
  4. https://phabricator.wikimedia.org/T144922#2736504
  5. https://phab.wmfusercontent.org/file/data/hoibxop7mn4s2cooz4lz/PHID-FILE-w7bg5zw6a323zug7e6mj/pasted_file

Best,
@Ladsgroup

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:19 PM

Status update (October 24th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-October/000111.html)

Hey,

This is the 26th and 27th weekly update from the revision scoring team that
we have sent to this mailing list. We forgot to send the update for last week!

Last week, we were featured in Research's quarterly review. In the last 3
months, we achieved our goals to expand the ORES extension to 6 wikis (we
made it to 8!) and to release datasets of article quality predictions. The
minutes from the quarterly review are not yet online, but once they are,
you'll be able to see them at [1].

Maintenance and robustness:

  • We discussed and decided on a set of strategies for handling goodfaith/naive DOS attacks on ORES[2]
  • We fixed an i18n issue in Wiki Labels[3]
  • We updated the article quality models (wikiclass/wp10) to use revscoring 1.3.0[4]
  • We investigated and solved a memory leak in our pre-caching utility[5]
  • We configured celery to send its logs to a place where we can read them for easier debugging[6]
  • We deployed a set of schema changes to constrain the ORES Review Tools database appropriately[7]
  • Also worth noting is that the services cluster (SCB) has been expanded[8]; ORES has now doubled in capacity

Datasets

  • We discussed how to make the historical article quality dataset available via Quarry[9]. Regretfully, it seems that we'll not be able to do that for at least a couple of months.

New development

  • We've implemented embedding of machine-readable scores in a JS variable on-wiki[10]. This will make it easier for tool developers to experiment with new ways of displaying Special:RecentChanges. It's also a necessary precondition for adding color-based signaling of ORES' confidence about an edit.
  1. https://meta.wikimedia.org/wiki/Wikimedia_Foundation_metrics_and_activities_meetings/Quarterly_reviews/Research,_Design_Research,_Analytics,_and_Performance,_October_2016
  2. https://phabricator.wikimedia.org/T148347 -- [Discuss] DOS attacks on ORES. What to do?
  3. https://phabricator.wikimedia.org/T139587 -- Revision not found error unformatted and not localized
  4. https://phabricator.wikimedia.org/T147201 -- Update wikiclass for revscoring 1.3.0
  5. https://phabricator.wikimedia.org/T146500 -- Investigate memory leak in precached
  6. https://phabricator.wikimedia.org/T147898 -- Send celery logs to /srv/log/ores instead of /var/lib/daemon.log
  7. https://phabricator.wikimedia.org/T147734 -- Review and deploy 309825
  8. https://phabricator.wikimedia.org/T147903 -- Expand SCB cluster
  9. https://phabricator.wikimedia.org/T146718 -- [Discuss] Hosting the monthly article quality dataset on labsDB
  10. https://phabricator.wikimedia.org/T143611 -- Embed machine readable ores scores as data on pages where ORES scores things

Sincerely,
Aaron from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:16 PM

Status update (October 11th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-October/000106.html)

Hey,

This is the 24th and 25th weekly update from the revision scoring team that
we have sent to this mailing list. We skipped a week due to travel and other
work.

Maintenance and robustness:

  • We improved the performance of RecentChanges filtering in the ORES extension[1]
  • We built and ran a maintenance script to clean up duplicate cached data for the ORES extension[2,3]
  • We updated the editquality models for the new version of revscoring (1.3.0)[4] and made some upstream changes to json2tsv to make that easier[5]
  • We quieted down some of our error reporting so that our logs take up less space[6]

Datasets:

  • We generated a dataset that uses the "wp10" prediction model to assess article quality in monthly intervals for English, French, and Russian Wikipedia[7]. This should enable new research into the quality dynamics of these wikis.
  • We generated a dataset of vandalism, spam, and attack page creations for building a new "draft quality" model[8]

Communication:

  • Presented about transparent/open AI development practices around ORES at the Association of Internet Researchers[9]

New development:

  • We've made substantial progress towards adding ORES data to MediaWiki's api.php endpoints with rcshow=oresreview[10] and rvprop=ores[11] (a sketch of what such a query might look like follows the reference list)
  1. https://phabricator.wikimedia.org/T146111 -- hidenondamaging=1 query is extremely slow on enwiki
  2. https://phabricator.wikimedia.org/T145356 -- Ensure ORES data violating constraints do not affect production
  3. https://phabricator.wikimedia.org/T145503 -- Build a maintenance script to clean up duplicate data
  4. https://phabricator.wikimedia.org/T146410 -- Update editquality for revscoring 1.3.0
  5. https://phabricator.wikimedia.org/T146939 -- Add type decoding support to tsv2json
  6. https://phabricator.wikimedia.org/T146680 -- Quiet result.get Warning in tasks
  7. https://phabricator.wikimedia.org/T145655 -- Generate monthly article quality dataset
  8. https://phabricator.wikimedia.org/T135644 -- Generate spam and vandalism new page creation dataset
  9. https://phabricator.wikimedia.org/T147706 -- Present about ORES transparency at AoIR
  10. https://phabricator.wikimedia.org/T143616 -- Introduce rcshow=oresreview and similar ones
  11. https://phabricator.wikimedia.org/T143614 -- Introduce ORES rvprop
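
Purely as an illustration, here's what a query against the new endpoint might look like once rcshow=oresreview lands. Everything beyond the rcshow value follows standard MediaWiki API conventions and is an assumption here, not part of the update.

import json
import urllib.request
from urllib.parse import urlencode

# Hypothetical query: list recent changes that ORES flags for review.
params = urlencode({
    "action": "query",
    "list": "recentchanges",
    "rcshow": "oresreview",  # the new filter named above
    "format": "json",
})
url = "https://en.wikipedia.org/w/api.php?" + params
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

for rc in data.get("query", {}).get("recentchanges", []):
    print(rc.get("title"), rc.get("revid"))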

Sincerely,
Aaron from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:14 PM

Status update (September 22nd, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-September/000095.html)

Hey,

This is the 22nd weekly update from the revision scoring team that we have sent to this mailing list.

UI work:

  • We configured the default threshold for the ORES review tool on Wikidata to be more strict (higher recall, lower precision)[1]
  • We fixed a display issue on Special:Contributions where the filters would not wrap[2]

Increasing model fitness:

  • We finished demonstrating model fitness gains using hash-vector features[3]. Next, we'll be working to get the hash-vector features implemented in revscoring/ORES[4].
  • We implemented a new strategy for training and testing on all data using cross-validation[5]. This will both increase the fitness of the models and make the statistics reported more robust (a rough sketch of both ideas follows this list).
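
A rough sketch of both ideas using scikit-learn, with a hypothetical load_labeled_edits() standing in for real data loading; this illustrates the techniques, not the actual revscoring/editquality code.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

texts, damaging = load_labeled_edits()  # hypothetical loader

model = make_pipeline(
    HashingVectorizer(n_features=2 ** 16),  # fixed-size hashed bag of words
    LogisticRegression(max_iter=1000),
)

# Report test statistics from cross-validation over *all* labeled data...
predictions = cross_val_predict(model, texts, damaging, cv=10)
print(classification_report(damaging, predictions))

# ...then fit the model that actually ships on all of the data.
model.fit(texts, damaging)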

Maintenance and robustness

  • We fixed an indexing issue in ores_model that prevented the deployment of updated models[6].
  • We did a minor investigation into a short period of degraded service quality on WMF Labs[7]
  1. https://phabricator.wikimedia.org/T144784 -- Change default threshold for Wikidata to high
  2. https://phabricator.wikimedia.org/T143518 -- Filter on user contribs has nowrap, causing issues
  3. https://phabricator.wikimedia.org/T128087 -- [Spike] Investigate HashingVectorizer
  4. https://phabricator.wikimedia.org/T145812 -- Implement ~100 most important hash vector features in editquality models
  5. https://phabricator.wikimedia.org/T142953 -- Train on all data, Report test statistics on cross-validation
  6. https://phabricator.wikimedia.org/T144432 -- oresm_model index should not be unique
  7. https://phabricator.wikimedia.org/T145353 -- Investigate short period of ores-web-03 insanity

Sincerely,
Aaron from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:03 PM

Status update (September 14th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-September/000088.html)

Hey,

This is the 21st weekly update from the revision scoring team that we have
sent to this mailing list.

New development

  • We received a request to get moving on Spanish Wikibooks support, so we dug in:
  • We deployed a new Wiki labels campaign[1]
  • We fixed an issue in Wiki labels that prevented requests from *.wikibooks.org[2]
  • We trained a basic "revert" detection model that seems to be pretty effective[3]
  • We also generated a dataset of article quality scores for English Wikipedia[4]. You can download it here: [5]

This week, we invested in some long-term tasks. If you review our
Phabricator board, you'll see substantial progress in improving our damage
detection models with hashing vectorization strategies[6,7], implementing
a more robust model testing strategy[8], and implementing some advanced
natural language processing strategies[9,10]. Stay tuned for the
completion of these activities in the coming weeks.

  1. https://phabricator.wikimedia.org/T143962 -- Add uniqueness constraints to ores_classification
  2. https://phabricator.wikimedia.org/T145406 -- Fix CORS for wikibooks
  3. https://phabricator.wikimedia.org/T145428 -- Train/test reverted model for Spanish Wikibooks
  4. https://phabricator.wikimedia.org/T135684 -- Generate recent article quality scores for English Wikipedia
  5. https://datasets.wikimedia.org/public-datasets/enwiki/article_quality/wp10-scores-enwiki-20160820.tsv.bz2
  6. https://phabricator.wikimedia.org/T128087 -- [Spike] Investigate HashingVectorizer
  7. https://en.wikipedia.org/wiki/Feature_hashing
  8. https://phabricator.wikimedia.org/T142953 -- Train on all data, Report test statistics on cross-validation
  9. https://phabricator.wikimedia.org/T144636 -- Implement PCFG features
  10. https://en.wikipedia.org/wiki/Stochastic_context-free_grammar

Sincerely,
Aaron from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 05:01 PM

Status update (September 6th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-September/000087.html)

Hey,

This is the 20th weekly update from the revision scoring team that we have
sent to this mailing list.

New development:

  • We implemented the basic functionality for handling bag of words and other types of abstract feature vectors in revscoring. [1] This required some changes to some dependencies as well. [2]
  • We extended the user-group related features to include more of the dominant groups outside of English Wikipedia [3] and incremented the models that changed substantially [4]

Documentation:

  • We extended the documentation at mw:Extension:ORES to make it easier for new developers to work with us. [5]

Resourcing:

  • We discussed the team's resourcing needs (hardware, engineering, and community liaison support) with Wes Moran. [6]

Maintenance and robustness:

  • We addressed a variety of issues around caching and how the ORES extension loads new data
  • ORES now returns headers that will disable secondary caching (see the sketch after this list). [7]
  • Our maintenance scripts will circumvent caches that do not listen to no-cache headers. [8, 9]
  • We fixed an issue where the ORES review tool would duplicate items in Special:RecentChanges. [10]
  • We standardized the extraction pattern for the enwiktionary model so that it looks similar to other models. [11]
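
A minimal sketch of the caching fix, assuming a Flask-style web service; ORES' actual code differs in detail, and the route and payload here are placeholders.

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v2/scores/<wiki>/<model>/<rev_id>")
def score(wiki, model, rev_id):
    result = {"prediction": False}  # placeholder for a real score lookup
    resp = jsonify(result)
    # Tell secondary caches not to hold on to responses, so stale
    # scores from old model versions aren't served.
    resp.headers["Cache-Control"] = "no-cache, max-age=0"
    return resp
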
  1. https://phabricator.wikimedia.org/T132580 -- Implement abstraction for Sparse Feature Vectors
  2. https://phabricator.wikimedia.org/T144430 -- Update yamlconf so that import_path can handle deep attributes
  3. https://phabricator.wikimedia.org/T143909 -- Extend user group features
  4. https://phabricator.wikimedia.org/T144855 -- Increment ruwiki editquality models
  5. https://phabricator.wikimedia.org/T144676 -- Improve technical documentation in Extension:ORES in mediawiki.ore
  6. https://phabricator.wikimedia.org/T144517 -- ORES and Product: resourcing discussion
  7. https://phabricator.wikimedia.org/T144193 -- Set max-age header to 0 seconds for ORES to quiet secondary caches
  8. https://phabricator.wikimedia.org/T144196 -- Get model version needs to invalidate cache
  9. https://phabricator.wikimedia.org/T144195 -- Check model version replaces every time it runs.
  10. https://phabricator.wikimedia.org/T144233 -- Redundant results in ORES review tool
  11. https://phabricator.wikimedia.org/T144605 -- Fix makefile entry for enwiktionary.rev_reverted.20k_2016.tsv

Sincerely,
Aaron from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 04:59 PM

ORES review tool deployment status (September 3rd, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-September/000085.html)

Hey folks,

I recently received an email asking for more information about how to get
the ORES review tool[1] deployed in more wikis (we currently support 8:
wikidata, fawiki, enwiki, nlwiki, ptwiki, plwiki, trwiki, ruwiki). I
figured that this summary should be shared more broadly, so I'm pasting it
below.


This is the best guide that we have for users requesting support right now:
https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/Get_support

There's a lot that we need to do in response to requests for support. We
currently have Wiki labels[2] edit quality campaigns running in

http://labels.wmflabs.org/campaigns/arwiki/?campaigns=stats (2464/4977 labels)
http://labels.wmflabs.org/campaigns/azwiki/?campaigns=stats (0/5000 labels)
http://labels.wmflabs.org/campaigns/dewiki/?campaigns=stats (182/4177 labels)
http://labels.wmflabs.org/campaigns/enwiki/?campaigns=stats (350/6333 labels, extension)
http://labels.wmflabs.org/campaigns/eswiki/?campaigns=stats (1210/8434 labels)
http://labels.wmflabs.org/campaigns/etwiki/?campaigns=stats (824/4678 labels)
http://labels.wmflabs.org/campaigns/fawiki/?campaigns=stats (1781/3156 labels, extension)
http://labels.wmflabs.org/campaigns/frwiki/?campaigns=stats (274/5000 labels)
http://labels.wmflabs.org/campaigns/hewiki/?campaigns=stats (1069/5000 labels)
http://labels.wmflabs.org/campaigns/huwiki/?campaigns=stats (518/5000 labels)
http://labels.wmflabs.org/campaigns/idwiki/?campaigns=stats (50/2200 labels)
http://labels.wmflabs.org/campaigns/itwiki/?campaigns=stats (591/5390 labels)
http://labels.wmflabs.org/campaigns/jawiki/?campaigns=stats (356/9514 labels)
http://labels.wmflabs.org/campaigns/nowiki/?campaigns=stats (1815/5000 labels)
http://labels.wmflabs.org/campaigns/svwiki/?campaigns=stats (1657/5000 labels)
http://labels.wmflabs.org/campaigns/ukwiki/?campaigns=stats (161/3318 labels)
http://labels.wmflabs.org/campaigns/urwiki/?campaigns=stats (153/5000 labels)
http://labels.wmflabs.org/campaigns/viwiki/?campaigns=stats (0/5000 labels)

Note that we're developing a nice dashboard for this. See http://tools.wmflabs.org/dexbot/tools/wikilabels_stats.php

Note that two campaigns are labeled "extension" because we already have
support for those wikis, but we are running campaigns to extend the
observations in our labeled datasets for higher fitness. In order to get
these campaigns done, we need a local Wikipedian (or liaison) to call
attention to the campaign and make sure that questions get answered and
work continues. The wikis that already have support are those wikis where
we found a strong local collaborator to help.

  • Wikidata - User:Ladsgroup
  • Ptwiki - User:He7d3r
  • Trwiki - User:WhiteCat
  • Enwiki - User:EpochFail
  • Fawiki - User:Ladsgroup
  • Ruwiki - User:Putnik
  • Nlwiki - User:Krinkle
  • Plwiki - User:Tar_Lócesilion

As of right now, I'm the only person who is officially working on ORES full
time. Amir (User:Ladsgroup) is funded to work on ORES 4 hours per week
through WMDE. So, any time that someone asks something from us, we, in
turn, ask for support in order to be able to do it. The most critical
support we could get for moving faster would be (1) community liaison
support for identifying local collaborators and driving the current Wiki
labels campaigns and (2) engineering support to turn the icky
user-script[3] into a proper extension.

  1. https://www.mediawiki.org/wiki/ORES_review_tool
  2. https://meta.wikimedia.org/wiki/Wiki_labels
  3. https://meta.wikimedia.org/wiki/Wiki_labels#Installation

-Aaron

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 04:57 PM

Deployment of ORES review tool in English Wikipedia as a beta feature (August 23rd, 2016)

We, the Revision Scoring Team, are happy to announce the deployment of the ORES review tool as a beta feature on *English Wikipedia*. Once enabled, ORES highlights edits that are likely to be damaging in Special:RecentChanges, Special:Watchlist and Special:Contributions to help you prioritize your patrolling work. ORES detects damaging edits using a basic prediction model based on past damage.

ORES is an experimental technology. We encourage you to take advantage of it but also to be skeptical of the predictions made. It's a tool to support you – it can't replace you. Please reach out to us with your questions and concerns.

Documentation: mw:ORES review tool, mw:Extension:ORES, and m:ORES

Bugs & feature requests: Scoring-platform-team-Backlog

IRC: #wikimedia-ai

Sincerely,
Amir from the Revision Scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 04:52 PM

New models coming to ORES & notes (August 19th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-August/000068.html)

Hey folks,

We've been working on generating some updated models for ORES. These
models will behave slightly differently from the models that we currently
have deployed. This is a natural artifact of retraining the models on the
*exact same data* again because of some random properties of the learning
algorithms. So, for the most part, this should be a non-issue for any
tools that use ORES. However, I wanted to take this opportunity to
highlight some of the facilities ORES provides to help automatically detect
and adjust for these types of changes.

Versions

ORES provides information about all of the models. This information
includes a model version number. If you are caching ORES scores locally,
we recommend invalidating old scores whenever this model number changes.
For example, https://ores.wikimedia.org/v2/scores/enwiki/damaging/12345678
currently returns

{
  "scores": {
    "enwiki": {
      "damaging": {
        "scores": {
          "12345678": {
            "prediction": false,
            "probability": {
              "false": 0.7141333465390294,
              "true": 0.28586665346097057
            }
          }
        },
        "version": "0.1.1"
      }
    }
  }
}

This score was generated with the "0.1.1" version of the model. But once
we deploy the new models, the same request will return:

{
  "scores": {
    "enwiki": {
      "damaging": {
        "scores": {
          "12345678": {
            "prediction": false,
            "probability": {
              "false": 0.8204647324045306,
              "true": 0.17953526759546945
            }
          }
        },
        "version": "0.1.2"
      }
    }
  }
}

Note that the version number changes to "0.1.2" and the probabilities
change slightly. In this case, we're essentially re-training the same
model in a similar way, so we increment the "patch" number.

However, we're switching modeling strategies for the article quality models
(enwiki-wp10, frwiki-wp10 & ruwiki-wp10), so those versions increment the
minor version from "0.3.2" to "0.4.0". You may see more substantial
changes in prediction probabilities with those models, but quick
spot-checking suggests that the changes are not substantial.
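
Here's a minimal sketch of the cache-invalidation recommendation above, based on the response layout shown; a toy client, not an official library.

import json
import urllib.request

cache = {"version": None, "scores": {}}  # rev_id -> score document

def damaging_score(rev_id):
    if rev_id in cache["scores"]:
        return cache["scores"][rev_id]
    url = "https://ores.wikimedia.org/v2/scores/enwiki/damaging/%d" % rev_id
    with urllib.request.urlopen(url) as resp:
        doc = json.load(resp)
    model = doc["scores"]["enwiki"]["damaging"]
    if model["version"] != cache["version"]:
        cache["scores"].clear()  # the model changed: old scores are stale
        cache["version"] = model["version"]
    cache["scores"][rev_id] = model["scores"][str(rev_id)]
    return cache["scores"][rev_id]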

Test statistics and thresholding

So, many tools that use our edit quality models (reverted, damaging and
goodfaith) will set thresholds for flagging edits for review. In order to
support these tools, we produce test statistics that suggest useful
thresholds.

https://ores.wmflabs.org/v2/scores/enwiki/damaging/?model_info=test_stats
produces:

...
      "filter_rate_at_recall(min_recall=0.75)": {
        "filter_rate": 0.869,
        "recall": 0.752,
        "threshold": 0.492
      },
      "filter_rate_at_recall(min_recall=0.9)": {
        "filter_rate": 0.753,
        "recall": 0.902,
        "threshold": 0.173
      },
...

These two statistics show useful thresholds for detecting damaging edits.
E.g. if you want to be sure that you catch nearly all vandalism (and are OK
with a higher false-positive rate), set the threshold at 0.173, but if
you'd like to catch most vandalism with almost no false-positives, set the
threshold at 0.492. These fields can be read automatically by tools so
that they do not need to be manually updated every time that we deploy a
new model.
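
For example, a tool could read the suggested threshold automatically along these lines. This is a rough sketch: the enclosing JSON structure is elided above, so we search for the statistic rather than hard-coding its path.

import json
import urllib.request

url = ("https://ores.wmflabs.org/v2/scores/enwiki/damaging/"
       "?model_info=test_stats")
with urllib.request.urlopen(url) as resp:
    doc = json.load(resp)

def find_stat(node, name):
    # Walk the JSON for the named statistic instead of assuming a path.
    if isinstance(node, dict):
        if name in node:
            return node[name]
        for value in node.values():
            found = find_stat(value, name)
            if found is not None:
                return found
    return None

stat = find_stat(doc, "filter_rate_at_recall(min_recall=0.9)")
if stat is not None:
    print("Flag edits scoring above", stat["threshold"])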

Let me know if you have any questions and happy hacking!

-Aaron

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 04:44 PM

ORES going into production (June 22, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-June/000036.html)

Hey folks,

We (The Revision Scoring Team[1]) are happy to announce the deployment of
the ORES service[2] in production at a new address:
https://ores.wikimedia.org. This will replace the old Wikimedia Labs
address soon: https://ores.wmflabs.org. Along with this new location, we
are running on more predictable infrastructure that will allow us to
increase our uptime and to make the service available to MediaWiki and
extensions directly.

We've also begun deploying the ORES review tool[3] as a beta feature on
Wikidata and Persian Wikipedia in order to trial the fully integrated
extension. Once enabled, the ORES review tool highlights edits that are
likely to be damaging in Special:RecentChanges to help you prioritize your
patrolling work. ORES is an experimental technology. We encourage you to
take advantage of it but also to be skeptical of the predictions made.
Please reach out to us with your questions and concerns.

We'll soon begin to deploy the ORES review tool to more wikis. Next up are
English, Portuguese, Russian, Dutch and Turkish Wikipedias. We can deploy
to these wikis because those communities have completed Wiki labels[4]
campaigns that help train ORES' classifiers to differentiate good-faith
mistakes from vandalism. If you'd like to get the ORES review tool
deployed on your wiki, please reach out to us for help setting up or
completing a Wiki labels campaign on your local wiki. Wikimania
participants can also attend our workshop[5] during the hackathon to get
help setting up ORES for their local wiki.

Documentation:

Bugs & feature requests: https://phabricator.wikimedia.org/tag/revision-scoring-as-a-service-backlog/
IRC: #wikimedia-ai[6]

  1. https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service#Team
  2. https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service
  3. https://www.mediawiki.org/wiki/ORES_review_tool
  4. https://meta.wikimedia.org/wiki/Wiki_labels
  5. https://phabricator.wikimedia.org/T134628
  6. https://webchat.freenode.net/?channels=#wikimedia-ai

Stay tuned for an update about deprecation of ores.wmflabs.org and
announcements of support for new wikis. Please feel free to reach out to
us with any questions/ideas.

-Aaron

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 04:13 PM

Status update (July 6th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-July/000039.html)

Hey,
This is the 11th weekly update from the revision scoring team that we have
sent to this mailing list.

*New developments:*

  • The ORES review tool is now enabled as a beta feature on Dutch Wikipedia. More wikis to come this week [1].
  • We have a basic edit quality model for Czech Wikipedia ready and merged, to be deployed this week [2].
  • We also have basic models for English Wiktionary. This is the second non-Wikipedia project we support, after Wikidata [3].
  • Thanks to Tar Lócesilion, the Polish edit quality campaign is complete. We are working on building damaging and goodfaith models at the moment [4].

*Maintenance and robustness:*

  • We decreased our web capacity in order to reduce memory pressure on SCB nodes. You should not get any overload errors since our capacity is still very high, but if you do, please contact us immediately and we will bring it back up [5].
  • We improved the documentation on the ores.wikimedia.org page a little bit, to be deployed this week [6].

We are working on a rather big refactor of ORES which will give us a
performance boost when scoring multiple models at the same time [7] and
reduce memory usage [8]. Feel free to chime in and give us feedback [9].

  1. https://phabricator.wikimedia.org/T139432
  2. https://phabricator.wikimedia.org/T138885
  3. https://phabricator.wikimedia.org/T138630
  4. https://phabricator.wikimedia.org/T130269
  5. https://phabricator.wikimedia.org/T139177
  6. https://phabricator.wikimedia.org/T138089
  7. https://phabricator.wikimedia.org/T134606
  8. https://phabricator.wikimedia.org/T139407
  9. https://phabricator.wikimedia.org/T139408

Sincerely,
Amir from the Revision Scoring team.

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 04:11 PM

Status update (June 14th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-June/000033.html)

Hi folks,

This is the 8th weekly update for the revision scoring team that we have
sent to this mailing list.

New developments:

  • The ORES extension got deployed on Persian Wikipedia. [1] Give it a try! [2]
  • The article quality model ("wp10" model) is now working for Russian Wikipedia. [3] It will be deployed this week.
  • We deployed an article topic campaign for English Wikipedia [4]
  • ores.wikimedia.org now has a Grafana dashboard [5]

Maintenance and robustness:

  • ORES Icinga alerting didn't work when workers were down[13]; it got fixed [6]
  • We finished load testing ores.wikimedia.org and it held up quite alright [7] [8]
  • CORS handling moved to the uWSGI level so it works in prod too [9]
  • Deploying new versions of ORES in prod and labs now has a proper documentation page [10] [11]
  • We had intermittent spikes of errored revisions; this got resolved [12]
  1. https://phabricator.wikimedia.org/T130211
  2. https://fa.wikipedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures
  3. https://phabricator.wikimedia.org/T131635
  4. https://phabricator.wikimedia.org/T137325
  5. https://phabricator.wikimedia.org/T137367
  6. https://phabricator.wikimedia.org/T137592
  7. https://phabricator.wikimedia.org/T137365
  8. https://phabricator.wikimedia.org/T137131
  9. https://phabricator.wikimedia.org/T137433
  10. https://phabricator.wikimedia.org/T137570
  11. https://wikitech.wikimedia.org/wiki/Ores/Deployment
  12. https://phabricator.wikimedia.org/T134109
  13. https://wikitech.wikimedia.org/wiki/Incident_documentation/20160610-ORES

Best,
Amir

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 04:04 PM

Gerard Meijssen

#Wikipedia - Bias and the King #Faisal International #Prize

The King Faisal International Prize is an international award recognising five distinct areas. They are: Service to Islam, Islamic Studies, Arabic language and literature, Medicine and Science.

When a Wikipedia is to write in a NPOV way about the King Faisal International Prize, all five categories need to be included. Just listing Medicine and Science and having the article as a "science award" ignores the scientific realities in the other three categories or prevents the inclusion of other theological or literature awards.

This is an unfounded bias and remediation is needed in order to achieve NPOV.
Thanks,
     GerardM

by Gerard Meijssen (noreply@blogger.com) at June 03, 2017 03:57 PM

Wikimedia Scoring Platform Team

Status update (June 7th, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-June/000032.html)

Hey folks,

This is the 7th weekly update for the revision scoring team that we have
sent to this mailing list.

New developments:

  • The production version of ORES (ores.wikimedia.org) is live! We are still testing and it might break sometimes [1]. (A minimal usage sketch follows below.)
  • Norwegian basic support is completed. We will deploy this very soon. [2] [3]
  • ores-experiment.wmflabs.org is a new setup to run experimental models. [4]
  • We implemented a demo of dependent tasks in Celery, our distributed processing environment [5], which brings us closer to a key performance improvement [6].
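
If you want to poke at the live service yourself, below is a minimal sketch of requesting a damaging score over plain HTTP with Python's requests library. The /v3/scores path and the response layout are assumptions based on the public ORES documentation and may differ between API versions, so treat it as illustrative rather than canonical.

# Minimal sketch: ask ORES how likely a revision is to be damaging.
# The v3 path and response shape are assumptions; check
# https://ores.wikimedia.org/ for the current API before relying on this.
import requests

def damaging_probability(wiki, rev_id):
    url = "https://ores.wikimedia.org/v3/scores/%s/%s" % (wiki, rev_id)
    resp = requests.get(url, params={"models": "damaging"}, timeout=30)
    resp.raise_for_status()
    score = resp.json()[wiki]["scores"][str(rev_id)]["damaging"]["score"]
    return score["probability"]["true"]

# Example: score a (hypothetical) English Wikipedia revision ID.
print(damaging_probability("enwiki", 12345678))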

Maintenance and robustness:

  1. https://phabricator.wikimedia.org/T106867
  2. https://phabricator.wikimedia.org/T131856
  3. https://phabricator.wikimedia.org/T131855
  4. https://ores-experiment.wmflabs.org/
  5. https://phabricator.wikimedia.org/T136875
  6. https://phabricator.wikimedia.org/T134606
  7. https://phabricator.wikimedia.org/T137003
  8. https://phabricator.wikimedia.org/T130872

Sincerely,
-Amir

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 03:54 PM

Status update (May 10, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-May/000026.html)

Hey folks,

This is the weekly update for the Revision Scoring project for the week of
May 2nd through May 8th.

New developments:

  • Solved some issues that were blocking a major performance improvement for score requests using multiple models[2]
  • Improved the performance of feature extraction for features that use mwparserfromhell[3,4]
  • We applied regex performance optimizations to badwords and informal word detection for many languages[9]

Maintenance and robustness:

  • Solved a regression in ScoredRevisions that caused most revisions in RecentChanges to not be scored[1]
  • Set ORES load balancer to rebalance on 500 responses from a web node[5]
  • Enabled CORS for error responses from ORES -- this makes it easier to report errors from a gadget on a wiki[6]
  • Made the staging instance of Wikilabels[7] look a lot more like the production instance[8]

  1. https://phabricator.wikimedia.org/T134601
  2. https://phabricator.wikimedia.org/T134781
  3. https://mwparserfromhell.readthedocs.io
  4. https://phabricator.wikimedia.org/T134780
  5. https://phabricator.wikimedia.org/T111806
  6. https://phabricator.wikimedia.org/T119325
  7. https://labels-staging.wmflabs.org/gadget/
  8. https://phabricator.wikimedia.org/T134627
  9. https://phabricator.wikimedia.org/T134267

Stay tuned!
--Aaron

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 03:52 PM

Status update (May 23rd, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-May/000030.html)

Hey,
This is the fifth weekly update from the revision scoring team sent to
this mailing list.

New developments:

  • We got the Swedish basic model ready to deploy; deployment is likely to happen next week [1] [2]
  • We generated lists of bad words for every Wikipedia with more than 100K articles (with a few exceptions) [3]

Maintenance and robustness:

  • We enabled CORS for Wikimedia wikis in Wikilabels, and write actions are no longer allowed via GET [4]
  • We are using systemd watchdogs in precaching to make sure it stays alive [5]
  • We are changing some settings in nginx and uwsgi in order to finalize the move to production [6]

  1. https://phabricator.wikimedia.org/T131450
  2. https://phabricator.wikimedia.org/T135604
  3. https://phabricator.wikimedia.org/T134629
  4. https://phabricator.wikimedia.org/T135377
  5. https://phabricator.wikimedia.org/T135941
  6. https://phabricator.wikimedia.org/T135655

Sincerely,
The Revision scoring team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 03:52 PM

Status update (April 25, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-April/000019.html)

Hello! This is our first weekly update posted to this mailing list.

New Developments

  • Now you can abandon tasks you don't want to review in Wikilabels (T105521)
  • We collect user-agents in ORES requests (T113754)
  • Precaching in ORES will be a daemon and more selective (T106638)

Progress in supporting new languages

  • Russian reverted, damaging, and goodfaith models are built. They look good and will be deployed this week.
  • Hungarian reverted model is built, will be deployed this week. Campaign for goodfaith and damaging is loaded in Wikilabels.
  • The Japanese reverted model is built, but there are still some issues to work out. (T133405)

Active Labeling campaigns

  • Edit quality (damaging and good faith)
    • Wikipedias: Arabic, Azerbaijani, Dutch, German, French, Hebrew, Hungarian, Indonesian, Italian, Japanese, Norwegian, Persian (v2), Polish, Spanish, Ukrainian, Urdu, Vietnamese
    • Wikidata
  • Edit type
    • English Wikipedia

Sincerely,
The Revision Scoring team.
https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service#Team

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 03:49 PM

Status update (May 2nd, 2016)

(This post was copied from https://lists.wikimedia.org/pipermail/ai/2016-May/000022.html)

The second weekly update on the Revision Scoring project.

New developments

  • ORES has a Graphite dashboard now [1,2]
  • Deploying new campaigns and testing Wikilabels got easier [3,4]
  • Revscoring feature extraction got about 13% faster [5]
  • We deployed new versions of ORES and Wikilabels [6,7]
  • The Wikidata ScoredRevisions gadget had a serious issue; it got fixed [8]

Progress in supporting new languages

  • Wikidata damaging and goodfaith models are built and deployed [9,10,11]
  • Dutch damaging and goodfaith models are built and deployed [12]
  • We are working on language utilities for Tamil [13]

Active Labeling campaigns

  • Edit quality (damaging and good faith)
    • Wikipedias: Arabic, Azerbaijani, German, French, Hebrew, Hungarian, Indonesian, Italian, Japanese, Norwegian, Persian (v2), Polish, Spanish, Ukrainian, Urdu, Vietnamese
  • Edit type
    • English Wikipedia

  1. https://grafana.wikimedia.org/dashboard/db/ores
  2. https://phabricator.wikimedia.org/T127594
  3. https://phabricator.wikimedia.org/T133557
  4. https://phabricator.wikimedia.org/T102336
  5. https://github.com/wiki-ai/revscoring/pull/268
  6. https://phabricator.wikimedia.org/T134032
  7. https://phabricator.wikimedia.org/T134174
  8. https://phabricator.wikimedia.org/T133903
  9. https://phabricator.wikimedia.org/T130274
  10. https://phabricator.wikimedia.org/T130301
  11. https://lists.wikimedia.org/pipermail/wikidata/2016-May/008641.html
  12. https://phabricator.wikimedia.org/T133563
  13. https://phabricator.wikimedia.org/T134105

by Halfak (Aaron Halfaker, EpochFail, halfak) at June 03, 2017 03:46 PM

Weekly OSM

weeklyOSM 358

23/05/2017-29/05/2017

Corporate members

The recent corporate members at the "Supporter" and "Bronze" levels.

Mapping

  • Note number one million was created in Hong Kong.
  • You can vote for the newly proposed tag name:language = ISO code until June 6th.
  • Santamariense proposes a new tag for livestock dips. These are rural structures that livestock pass through to eliminate parasites, and they are useful landmarks for orientation in otherwise uniform landscapes.
  • User nammala writes about some interesting cases of inconsistent and detailed mapping activity in OpenStreetMap.
  • Maning Sambale reports on a crucial milestone: emojis can finally be used in name tags 😎

Community

  • Mapillary would like to fill some gaps in Berlin, Germany, and therefore starts “Complete the Map Challenge – Berlin“.
  • Julio May has carried out a two-day mapping workshop in the Dominican Republic for FUNDECOR.
  • The Belgian community voted Marek Kleciak as Mapper of the Month for May 2017. He has been (and still is) a pioneer in the field of 3D mapping, in the area:highway tagging (which was further expanded) and is strongly engaged in Nepal.
  • This discussion on the osm-dev mailing list compares several ways of printing a large area to PDF.

OpenStreetMap Foundation

  • The OSM blog published the list of current corporate members. We thank them here as well for their support! We welcome further organisations to check the corporate membership tiers and contact the Foundation with further enquiries.
  • The monthly CWG meeting took place on May 26th.

Events

  • From June 2nd to 4th, SotM France (State of the Map France) takes place in Avignon. In addition, "OSM kids" events took place in several schools from May 30th to June 1st.
  • This year SotM France has a very nice additional program for students (FR) (automatic translation). All the information can be found in the wiki (FR) (automatic translation).
  • SotM 2017 has published the list of talks and workshops; a detailed program is coming soon. Did you get your early bird ticket already?

Humanitarian OSM

  • HOT considers developing a new tool for organized field mapping.
  • HOT calls for urgent help on the current priority mapping tasks for the OSM response to the Ebola outbreak: validation for projects #10 and #13, and mapping in project #9.
  • On the HOT mailing list, there is a debate about how meaningful it is to tag buildings with only a node, and about whether to display such nodes in the HOT map style.

Maps

  • Stefan Keller shows all important medieval festivals in Switzerland on a map.
  • OpenAndroMaps now provides multilingual maps.
  • Viktor Garske announced the project WinilooC to find public toilets.
  • Sven Geggus asked about the implications of the OSM Carto 4.0 changes for the German-derived style.

Open Data

  • Apple provides building footprints and heights under ODbL for France and Denmark.

Software

  • Dana Sulit writes in the Mapbox blog about carto-cam. This tool can help designers to evaluate custom styles.
  • OSRM (see also release table) has added the multi-level Dijkstra (MLD) routing algorithm in addition to contraction hierarchies (CH). This allows e.g. a fast and flexible change of edge weights or multiple parallel profiles.

Releases

Software | Version | Release date | Comment
Kurviger Free * | 10.0.26 | 2017-05-17 | Various improvements.
OpenStreetMap Carto Style | 4.0.0 | 2017-05-22 | No info.
Mapzen Lost | 3.0.0 | 2017-05-25 | Many changes; please read the release info.
PyOsmium | 2.12.3 | 2017-05-25 | One new feature and two bugfixes.
SQLite | 3.19.2 | 2017-05-25 | Bugfix release.
Maps.me iOS * | 7.3.4 | 2017-05-26 | Bugfix release.
QGIS | 2.18.9 | 2017-05-26 | No info.
Overpass-Turbo | 2017-05-28 | 2017-05-28 | Many changes during the last days; please read the release info.
JOSM | 12275 | 2017-05-29 | Minor enhancements.
Komoot Android * | var | 2017-05-29 | No info.
Maps.me Android * | var | 2017-05-29 | More compact map data, new map style for driving, transliteration into Latin.
OSRM Backend | 5.7.3 | 2017-05-29 | Bugfix release.

Provided by the OSM Software Watchlist. Timestamp: 2017-05-29 22:52:18+02 UTC

(*) unfree software. See: freesoftware.

Did you know …

  • … the OsmHydrant map created by Robert Koch with all hydrants’ locations and features extracted from OSM data?

Other “geo” things

  • The U.S. Bureau of Ocean Energy Management releases the highest-resolution bathymetry map of the Gulf of Mexico to date: it shows the dynamic geological structures caused by extensive salt deposits underneath the seabed sediments.
  • The Guardian publishes new scientific discoveries about a potential second layer of tectonic plates within Earth’s mantle. This would explain the strong earthquake activity in the Pacific region.
  • What are your favorite places in the world? Found any interesting destinations you want to travel to while exploring the map? Check out the world wonders Mapbox's geocoding team has discovered!
  • Seán Lynch calls for contributions on OpenLitterMap, as he plans to trace and model the path of litter along rain and sewage systems.

Upcoming Events

Where | What | When | Country
Berlin | #CompletetheMap – Mapillary photo mapping | 25/05/2017-15/06/2017 | germany
Taipei | OSM Taipei Meetup, MozSpace | 05/06/2017 | taiwan
Toronto | Mappy Hour | 05/06/2017 | canada
Rostock | Rostocker Treffen | 06/06/2017 | germany
London | Missing Maps London Mapathon, Royal Geographic Society | 06/06/2017 | uk
Stuttgart | Stuttgarter Stammtisch | 07/06/2017 | germany
Montreal | OSM monthly meetup at BANQ | 07/06/2017 | canada
Praha/Brno/Ostrava | Kvartální pivo | 07/06/2017 | czech republic
Munich | Münchner Stammtisch | 08/06/2017 | germany
Falkensee | 108. Berlin-Brandenburg Stammtisch | 09/06/2017 | germany
Russia | Tula Mapping Party, Tula | 10/06/2017-11/06/2017 |
Suita | 【西国街道#05・初心者向け】万博探索マッピングパーティ | 10/06/2017 | japan
Tokyo | 第1回 東京!街歩かない!マッピングバーティ | 10/06/2017 | japan
Manila | San Juan City Mapa-thon by MapAm❤re – Juan more time!, San Juan | 10/06/2017 | philippines
Passau | Mappertreffen | 12/06/2017 | germany
Nantes | Rencontres mensuelles | 13/06/2017 | france
Lyon | Rencontre mensuelle libre | 13/06/2017 | france
Freiberg | Stammtisch Freiberg | 15/06/2017 | germany
Zittau | OSM-Stammtisch Zittau | 16/06/2017 | germany
Tokyo | 東京!街歩き!マッピングパーティ:第9回 旧芝離宮恩賜庭園 | 17/06/2017 | japan
Bonn | Bonner Stammtisch | 20/06/2017 | germany
Lüneburg | Mappertreffen Lüneburg | 20/06/2017 | germany
Nottingham | Nottingham Pub Meetup | 20/06/2017 | united kingdom
Scotland | Edinburgh | 20/06/2017 | united kingdom
Karlsruhe | Stammtisch | 21/06/2017 | germany
Salzburg | AGIT2017 | 05/07/2017-07/07/2017 | austria
Kampala | State of the Map Africa 2017 | 08/07/2017-10/07/2017 | uganda
Champs-sur-Marne (Marne-la-Vallée) | FOSS4G Europe 2017 at ENSG Cité Descartes | 18/07/2017-22/07/2017 | france
Boston | FOSS4G 2017 | 14/08/2017-19/08/2017 | united states
Aizu-wakamatsu Shi | State of the Map 2017 | 18/08/2017-20/08/2017 | japan
Patan | State of the Map Asia 2017 | 23/09/2017-24/09/2017 | nepal
Boulder | State of the Map U.S. 2017 | 19/10/2017-22/10/2017 | united states
Buenos Aires | FOSS4G+State of the Map Argentina 2017 | 23/10/2017-28/10/2017 | argentina
Lima | State of the Map LatAm 2017 | 29/11/2017-02/12/2017 | perú

Note: If you would like to see your event here, please add it to the calendar. Only events entered there will appear in weeklyOSM. Please check your event in our public calendar preview and correct it where appropriate.

This weeklyOSM was produced by Anne Ghisla, Nakaner, Peda, Rogehm, SeleneYang, Spec80, derFred, jinalfoflia.

by weeklyteam at June 03, 2017 09:27 AM

June 02, 2017

Wiki Education Foundation

Speaking about fake news at Diablo Valley College

Last month, I was honored to be part of the Contra Costa Community College District's Civic Engagement Speaker Series, with the provocative title "Fake news: A threat to our society?" The event was moderated by Diablo Valley College Journalism Department Chair Mary Mazzoco. Joining me on the panel were East Bay Times education reporter Sam Richards, UC Berkeley e-learning and information studies librarian Cody Hennesy, California State Senator Bill Dodd, and student trustee Kwame Baah-Arhin.

My co-panelists had some really interesting insights into fake news. Several people mentioned the history of fake news from the beginnings of the printing press to Bat Boy to satirical news sites. But the Internet, with its mass accessibility as a platform for publishing and consuming information, has enabled the spread of dubiously sourced information. Panelists agreed that more critical digital literacy was needed.

In particular, I appreciated Cody’s great presentation, in which he talked about the UC Berkeley library guide on evaluating resources, which encourages people to consider the source, the platform, themselves, and the world. I was also interested to learn about Senator Dodd’s bill, which he has introduced in the California State Legislature, to create a model curriculum around media literacy for K-12 students.

I spoke about how a recent Stanford study described students' digital literacy skills as "bleak", and how Wiki Ed's Classroom Program is working to change that. Our research shows that students who edit Wikipedia articles as part of our classes gain those critical media literacy skills that are otherwise lacking.

You can watch the video of the talk here. My thanks to the gracious hosts at the Contra Costa Community College District and Diablo Valley College.

by LiAnna Davis at June 02, 2017 09:16 PM

WMF Release Engineering

New feature: Embed videos from Commons into Phabricator markup

I just finished deploying an update to Phabricator which includes a simple but rather useful feature:

T116515: Enable embedding of media from Wikimedia Commons

You can now embed videos from Wikimedia Commons into any Task, Comment or Post. Just paste the Commons URL to embed the standard Commons player in an iframe. For example, this URL:

https://commons.wikimedia.org/wiki/File:Saving_and_sharing_search_queries_in_Phabricator.webm

Produces this embedded video:

by mmodell (Mukunda Modell) at June 02, 2017 07:58 PM

Mahmoud Hashemi

Wikicite 2017, and the 7 features Wikidata needs most

At Wikicite 2017, discussions revolved around an ambitious goal to use Wikidata to create a central citation database for all the references on Wikipedia. Citations are the essential building blocks of verifiability, a core tenet of Wikipedia. This project aims to give citations the first-class treatment they deserve.

We saw three important questions emerge at the conference:

  • What does a good citation database look like?
  • How can we build this on Wikidata?
  • How can we integrate this with Wikipedia?

These are hard questions. To answer them, Wikicite brought together:

  • Expert ontologists and librarians specializing in citation and reference modeling.
  • Groups like Crossref with treasure troves of rich bibliographic data.
  • Developers and data scientists with experience importing datasets into Wikidata.

Wikicite may be young, but clear progress has been made already. Wikidata now boasts some great collections of bibliographic data, like the Zika corpus and data from the genewiki project. Some Wikipedias, like French and Russian, are experimenting with generating citations using Wikidata. Some citation databases are integrated with Visual Editor to make it easy to add rich citations on Wikipedia, which can hopefully one day be added to Wikidata for further reuse and tracking.

There are still a few features that Wikidata needs to be a first-class host for citation data. Even the best structured data takes time to define in Wikidata’s precise terms of items, properties, modifiers, and qualifications. Although it’s possible to use some handy tools on Wikidata for bulk actions, it often requires changing your dataset to match the tool’s specific format, or writing bespoke code for your dataset. It’s still challenging to ensure data is high quality, well-sourced, and ready for long-term maintenance.

In listening to researchers’ talks, discussing with experts in working groups, and workshopping code with some of Wikidata’s soon-to-be biggest users, we determined that Wikidata needs seven features for true Wikicite readiness:

  1. Bulk data import. There must be an easy process for loading large amounts of data onto Wikidata. There are a few partial tools, like QuickStatements, which, while itself aptly-named, is just one part of an often-arduous workflow. Other people have written custom bots to import their specific dataset, on top of libraries like Wikidata Integrator or pywikibot. Without help from an experienced Wikidata contributor, there is not an easy self-service way to move data in bulk. (See the sketch after this list for a taste of the current workflow.)

  2. Sets. Wikidata needs a feature to track and curate specific groups of items. Sets are a necessary concept to answer questions about a complete group. Right now, you can use Wikidata to tell you facts about the states in Austria, but it cannot tell you the complete list of all states in Austria. Sets are key for curators to perform this sort of cross-sectional data management.

  3. Data management tools. Data curators need tools to monitor data of interest. Wikidata is big. Basic tools like watchlists were designed for Wikipedia articles, at a much smaller scale and with a much coarser granularity than the Wikidata statement or qualifier. An institution that donates data to Wikidata may want to monitor thousands (maybe hundreds of thousands) of items and properties. Donors of complete datasets will want to watch their data for deletions, additions, and edits.

  4. Grouping edit events. At the moment, many community members are adding data to Wikidata in bulk, but this is a fact that Wikidata's user interface struggles to represent. Wikidata currently offers a piecemeal history of users' individual edits, and encourages editors to add citations and references for individual statements. These features are vital, but we need a higher-level grouping feature for higher-level data uploads. For instance, it would be helpful to have an "upload ID" for associated edits across many claims. It would also be useful to have a dedicated namespace for human- and machine-readable documentation of the data load process, a kind of README that addresses the whole action. This kind of documentation not only helps community members get answers to questions before, during, and after large-scale activity, but it also helps future data donors learn about and follow best practices.

  5. A draft or staging space. There should be a way for people to add content to Wikidata without directly modifying “live” data. Currently, when something is added to Wikidata it is immediately mixed in with everything else. It’s daunting for new users to have to get it right on the first try, let alone take quick corrective action in the face of inevitable mistakes. Modeling a dataset in Wikidata’s terms requires using Wikidata’s specific collection of items and properties. You may not see how your data fits into Wikidata—perhaps requiring new properties and items—until you begin to add it. Experienced Wikidata volunteers may review data to ensure it’s high quality, but it would be better to enable this collaborative process before data is part of the project’s official collection. You should be able to upload your data to a staging space on Wikidata, ensure it’s high quality and properly structured, and then publish it when it’s ready. The PrimarySources tool is a community-driven start to this, but such a vital feature needs support from the core. In the longer term, this feature is a small step toward maximizing Wikidata consistency, by setting the stage to transactionally add and modify large-scale data. It would be helpful to have data cleanup tools, similar to OpenRefine, available for data staging.

  6. Data models. Wikidata needs new ways to collaborate on new kinds of items. Specifically, we need a better way to reach consensus on models for certain standard types of data. Currently, it's possible to describe the same entity in multiple ways, and, lacking a forum for this process, it's hard to discuss the differences. See, for instance, the drastically different ways that various subway lines are described as Wikidata items. Additionally, some models may want to impose certain constraints on instances, or at least indicate if an item complies with its model. Looking to the future, tools for collaborative data modeling would grow to include a library of data models unlike any other.

  7. Point in time links. There should be a way to share a dataset from Wikidata at a given point in time. Wikidata, like Wikipedia, is continuously changing. Wikipedia supports linking to a specific revision of an article at a point in time using a permalink, and you can do the same for a specific Wikidata item. However, Wikidata places special emphasis on relationships between items, yet does not extend the permalink feature to these relationships. If you run a query on the Wikidata Query Service (the SPARQL endpoint for Wikidata), and then share the query with someone else, they may see different results.
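
To make the bulk-import pain in point 1 concrete, here is a small illustrative Python sketch that turns a toy bibliographic record into QuickStatements V1 commands (tab-separated item/property/value lines). P31 (instance of), P1476 (title) and P577 (publication date) are real Wikidata properties, but the dataset, the helper function, and the surrounding workflow are hypothetical; a real import would still involve matching against existing items and careful review.

# Illustrative sketch: emit QuickStatements V1 commands for new items.
# The input data and this helper are made up; P31/P1476/P577 are real
# Wikidata properties, and Q13442814 is the item for "scholarly article".
papers = [
    {"title": "An example paper", "date": "+2017-01-15T00:00:00Z/11"},
]

def to_quickstatements(paper):
    lines = ["CREATE"]
    lines.append("LAST\tP31\tQ13442814")                   # instance of: scholarly article
    lines.append('LAST\tP1476\ten:"%s"' % paper["title"])  # title (monolingual text)
    lines.append("LAST\tP577\t%s" % paper["date"])         # publication date
    return "\n".join(lines)

for paper in papers:
    print(to_quickstatements(paper))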

These seven features came up consistently across several groups and discussions at Wikicite. With a room full of problem solvers, several good projects are already underway to provide community-based solutions in these areas. Among the handful that were started at the conference, we are pleased to share that we've started work on Handcart, a tool for simplifying medium-sized bulk imports, for citation data and much, much more. We believe trying to fix a problem is the best way to learn its details and nuances.

Wikicite made a strong case that Wikidata has a lot of valuable potential for citations, and citations are crucial for Wikipedia. As we work to address these missing features in Wikidata, we are happy to be part of the Wikicite movement to build a more verifiable world.

Thanks for inviting us, Wikicite, hope to see everyone again next time!

(Photos are CC-BY and can be found here and here)

June 02, 2017 05:28 PM

Wiki Education Foundation

Offsetting negative externalities with positive

Erin George is Assistant Professor of Economics at Hood College. In this post she talks about her experience incorporating Wikipedia into her course on Environmental Economics.

Erin George
Image: Erin George 2.jpg, by Joann Lee, CC BY-SA 4.0, via Wikimedia Commons.

At the heart of environmental economics is the study of pollution. Economists define pollution as a negative externality, the negative spillovers of a transaction that harm individuals who were neither the buyer nor the seller. A classic example of a negative externality is secondhand smoke.

Since environmental economics is chiefly concerned with these negative effects on third parties, I decided to create a series of class activities that do the opposite, wherein students provide a service with positive externalities. A positive externality is an action that creates a benefit to third parties beyond those engaged in a transaction. In my environmental economics course for the past three years, students have been involved in two assignments that focus on providing a positive benefit to their local and global communities: (1) a local service learning group project and (2) an individual creation or expansion of a Wikipedia page.

The course consists of students, sophomores to seniors, from a variety of academic disciplines: economics majors and minors, business majors taking the course as an elective, environmental science and policy majors, and occasionally students from other majors, such as political science, taking the course as an elective. Thus, students enter the class with a wide variation of knowledge about both the environment and economics. Tackling the development of a Wikipedia page allows students to highlight their particular strengths and interests; students with political science and policy interests tend to edit pages on policies and treaties, students from other countries more often edit pages of greater interest to their home localities, and economics students tend to focus on environmental economic theories.

In the past years, students created and expanded pages on a wide variety of topics: the environment of Bosnia and Herzegovina, the environment of Pennsylvania, locally unwanted land uses, material efficiency, groundwater banking, the genuine progress indicator, conservation finance, and the Minamata Convention on Mercury. They exceeded my goal of giving back their knowledge to the greater community.

The assignment stretches over the course of the semester in many small assignments, where students learn the technical side of Wikipedia and the basics of writing with a neutral point of view. Having students write in Wikipedia requires delving deep into a topic, enhancing understanding of that topic. At the same time, writing in Wikipedia teaches the basics of markup, so that students are introduced to some tools that they may not otherwise be exposed to. Wikipedia’s Visual Editor is exceedingly easy to use to update pages, so that even students who are initially hesitant are able to successfully edit their own pages. I have used Wiki Ed’s training resources for educators to a great extent and find them helpful in creating the assignment and in navigating Wikipedia with students.

Students by and large enjoy working with Wikipedia. One student commented, “I really enjoyed seeing my work online. I felt a sense of accomplishment after I completed it. More so than I usually do after I finish a paper. Sometimes, I Google my topic and just look at my Wikipedia page, check if anyone edited it. I wish I could see how many people have seen it!” After the course ends, a few students continue editing on Wikipedia. For example, one student commented, “I also appreciated the Wikipedia project in which I edited a Wikipedia article related to an environmental topic. Now, I am inspired to edit Wikipedia articles over [winter] break.”

In a bit of juxtaposition, the main difficulties in running the project turned out to be some of the most profound learning experiences for students. Several students struggled to come up with a topic that was notable with verifiable research in reliable sources and that did not already have a lot of information on it. While I introduced students to the stub pages on Wikipedia, many had difficulty in identifying a stub page where there was enough well-sourced documentation to add value. Several students ended up changing topics during the semester. This was probably the most difficult part of the project.

As an educator, perhaps the most enriching part of using Wikipedia is watching students take pride in their pages and exhibit more excitement about writing for Wikipedia than they ever do in writing a term paper. While there is always some pushback because the assignment is unusual, I have enjoyed utilizing it in my course. In a class that otherwise focuses on so many negatives (negative externalities, the negative effects of pollution and environmental degradation), students see how their contributions to the commons have a positive impact on the world around them. Students making positive externalities out of their work on negative externalities? Check.

by Guest Contributor at June 02, 2017 04:33 PM

Wikimedia Cloud Services

Manage Instance on Horizon (only)

For nearly a year, Horizon has supported instance management. It is altogether a better tool than the Special:NovaInstance page on Wikitech -- Horizon provides more useful status information for VMs, and has much better configuration management (for example, changing security groups for already-running instances).

So... I've just now removed the sidebar 'Manage Instances' link on Wikitech, and will shortly be disabling the Special:NovaInstance page as well. This is one more (small) step down the road to standardizing on Horizon as the one-stop OpenStack management tool.

by Andrew (Andrew Bogott) at June 02, 2017 12:02 AM

June 01, 2017

Wiki Education Foundation

Monthly Report for April 2017

Highlights

  • Product Manager Sage Ross announced a new Dashboard feature, Authorship Highlighting, which makes it easy for instructors to see each student’s contributions to an article.
  • Community Engagement Manager Ryan McGrady announced a new Visiting Scholar at Northeastern University, Rosie Stephenson-Goodknight, 2016 co-Wikipedian of the Year and co-founder of WikiProject Women in Red. She will be working with the Women Writers Project at Northeastern.
  • Research Fellow Zach McDowell and Director of Programs LiAnna Davis attended the Creative Commons Global Summit to present research findings on student motivations in peer production communities.
  • At the American Chemical Society’s spring annual meeting, Educational Partnerships Manager Jami Mathewson and Outreach Manager Samantha Weald spoke to ACS members about joining the Classroom Program, and nearly 100 instructors signed up to learn more.

Programs

Ye Li, Scholarly Communications and Instruction Librarian at the Colorado School of Mines, and Jami at the American Chemical Society’s (ACS) spring annual meeting in San Francisco.

Educational Partnerships

Jami and Samantha kicked off the month at the American Chemical Society’s (ACS) spring annual meeting in San Francisco. They spoke to ACS members about joining the Classroom Program, and nearly 100 instructors signed up to learn more. Chemistry students have done great work on Wikipedia, and we’re excited to see continued growth thanks to our partnership with ACS and the visibility they give us to their members.

Jami also joined the Association for Women in Mathematics (AWM) at their research symposium in Los Angeles, exploring the opportunities to recruit Classroom Program instructors through Wikipedia edit-a-thons. AWM achieved their goal of adding women in mathematics to Wikipedia, as 14 editors contributed nearly 6,000 words about women mathematicians. However, this edit-a-thon experience proved to us once again that instructors attending edit-a-thons to focus on their own education and skills do not necessarily attend to learn about new pedagogical tools. Thus, Wiki Ed will not pursue edit-a-thons as an outreach tool in the future. Still, we will continue partnering with AWM, who began sponsoring two Visiting Scholars, to focus on strengthening Wikipedia’s representation of women in math.

Attendees of the Association for Women in Mathematics Symposium in Los Angeles.

LiAnna joined longtime program instructor Jonathan Obar for a 2-hour workshop at York University in Canada. Faculty and librarians learned more about teaching with Wikipedia, and two students from Jonathan’s class spoke about their experiences being assigned to write Wikipedia articles, and why they thought it was a great experience.

 

Classroom Program

Status of the Classroom Program for Spring 2017 in numbers, as of April 30:

  • 357 Wiki Ed-supported courses were in progress (164, or 46%, were led by returning instructors)
  • 7,363 student editors were enrolled
  • 62% of students were up-to-date with the student training.
  • Students edited 7,500 articles, created 672 new entries, and added 4.66 million words.

Though the Spring 2017 term is beginning to wrap up, our students are still busy drafting their work and moving it into the article main space, and many of our courses on the quarter system are just getting underway with their Wikipedia assignments. It would be an understatement to say that the Classroom Program has experienced rapid growth in the last several years. Since Fall 2014 Wiki Ed has more than tripled the number of courses it supports. While this level of growth has affected the ways we support courses, it ultimately means that thousands of students are learning how to identify reliable sources of information and how to engage in the public sphere. It also means that Wikipedia is seeing improvement in often underdeveloped academic subjects, ranging from Practical Botany to Cultural Representations of Sexuality.

Growth at the expense of quality is an empty victory, which is why we’ve spent a considerable amount of time thinking about best practices and procedures over the past several terms. Our technical tools and resources allow us to support an ever-growing cohort of classes and students, but we still strive to foster a deep relationship between Wiki Ed and the instructors we work with. It’s not enough that Wiki Ed provide instructors with a template to run their Wikipedia assignments; we want to involve them in a growing community of academics who care about the greater dissemination of knowledge and who wish to provide their students with a pedagogical experience that goes far beyond the classroom.

Toward this end, we’ve gradually introduced a variety of ways for instructors to engage with us. In Fall 2016, we began running Wiki Ed Office Hours during which instructors are invited to speak with Wiki Ed staff about their Wikipedia assignments and to interact with other instructors teaching with Wikipedia. We also began sending a monthly newsletter to program participants so they can find out about the latest happenings at Wiki Ed, from conferences we’re attending to highlighting a variety of posts from our instructors about their experiences using a Wikipedia assignment. Our instructors are not just our program participants, but our partners in our endeavors to improve student media literacy and improve Wikipedia. This has been the guiding principle of the Spring 2017 term and will continue in the coming years.

A student in Carwil Bjork-James’ Human Rights of Indigenous Peoples class at Vanderbilt University improved Wikipedia’s article about Lean Bear, a Cheyenne chief, and uploaded this image to use in the article.
Image: Lean Bear.jpg, public domain, via Wikimedia Commons.

Student work highlights:

Students who want to create Wikipedia articles about new topics often gravitate towards subjects with a very narrow focus (where there are more likely to be notability issues) or a subject at the intersection of two or more other topics (where their work may be flagged as “novel synthesis” or too “essay-like“). Articles that are more mid-level in the hierarchy are often more fruitful candidates for expansion. The creation of the diagnostic microbiology article, by a student in Scott Mulrooney’s Microbial Biotechnology class, fills an important gap between more specific articles about individual diagnostic tests and a top-level article subject like microbiology. Industrial microbiology occupies a similar space. While that article has a history dating back to 2006, it had been deleted for copyright reasons and replaced with a short stub. Another student in the class expanded that short stub into a substantial article that discusses the role of microbiology in the medical, chemical, agricultural and food industries, thereby filling another important gap. Others created articles related to bioremediation — specifically, the role of microorganisms in the bioremediation of PCBs and oil spills. Other students in the class created or substantially expanded articles on a host of other topics, including protein engineering, host tropism (the movement of pathogens within hosts), and the production of antibiotics.

The rights of indigenous peoples have been in the news in the wake of the Dakota Access Pipeline protests. One student in Carwil Bjork-James’ Human Rights of Indigenous Peoples class created an article about police brutality against Native Americans, while another created an article about the protection of Native American sites in Florida. Students in the class also substantially expanded articles about the Toba people (one of the largest indigenous groups in Argentina), the Miskito people (one of the largest indigenous groups in Nicaragua), and the Maleku people (a small indigenous group in Costa Rica). Other students made significant improvements to articles about the Red Power movement, slavery among Native Americans in the United States, the North American fur trade and the biography of Lean Bear, a Cheyenne chief.

An ion channel is a type of pore in the cell membrane which allows specific molecules to enter the cell. A channel blocker is a molecule which blocks specific ion channels, producing specific physiological responses. Channel blockers are particularly important as pharmacological drugs, but the Wikipedia article was just a short stub until students in Michelle Mynlieff’s Neurobiology class expanded it into a substantial article which includes information about types of channel blockers and their physiological and clinical importance. They also wrote about Foix–Chavany–Marie syndrome, a neuropathological disorder that involves facial paresis. KCNB1 is a specific ion channel. Students in the class expanded the short Wikipedia article into a far more substantive one, adding information about its function as well as sections about its structure, regulation and role in disease. Despite the special constraints on editing articles about biomedical topics on Wikipedia, students in the class were able to substantially expand these and other short articles and stubs.

Community Engagement

Rosie Stephenson-Goodknight is Visiting Scholar at Northeastern University.
Image: Rosie Stephenson-Goodknight.jpg, by Victor Grigas, CC BY-SA 3.0, via Wikimedia Commons.

This month we were happy to announce the placement of a new Wikipedia Visiting Scholar at Northeastern University. Rosie Stephenson-Goodknight is a prolific, long-time Wikipedian who has received extensive recognition, including being named 2016 co-Wikipedian of the Year, for her advocacy of important Wikipedia-related issues and coordination of major community projects like WikiProject Women in Red. Sponsoring Rosie at Northeastern is the Women Writers Project, a long-term collaboration researching, collecting, encoding, sharing, and disseminating information about early women’s writing. As Visiting Scholar, Rosie will have access to Northeastern’s library resources to use in developing articles about women writers. For more information about this collaboration, see our blog post from earlier in the month.

Existing Visiting Scholars continued their great work. Smithsonian Libraries Visiting Scholar, User:Czar, brought his article about 1:54, an annual contemporary African art fair held in London, up to Good Article status. Meanwhile, Gary Greenbaum, Visiting Scholar at George Mason University, took the article on a 1953 oil painting by Chinese artist Dong Xiwen, The Founding Ceremony of the Nation, to Featured Article.

Also this month we highlighted some of the content that came out of a collaboration between WikiProject Women in Red and WikiProject Women Scientists during 2016’s Year of Science. Thanks to all the volunteers who wrote hundreds of articles about women in science for this Celebrating Women Scientists Online Editathon. For more information, and some highlights from the event, see our blog post.

Program Support

Communications

Blog posts:

External media:

Digital Infrastructure

We added an Authorship Highlighting tool to the Dashboard to make it easier for instructors to view student contributions to Wikipedia.

The biggest news in April was the launch of Authorship Highlighting, a Dashboard feature that makes it easy for instructors to see each student’s contributions to an article. This is a first step toward addressing one of the biggest unmet instructor needs — tools for reviewing and evaluating student work — and feedback so far has been enthusiastic. Work continues to refine some of the bugs and edge cases that we’ve discovered since the launch.

We also finished our grant-funded project with the AgileVentures development community. Although we didn’t get as much participation as we had hoped, it was a valuable learning experience, and we did receive several contributions of non-trivial bug fixes and small improvements to the Dashboard user experience.

Performance work on our digital tools continued behind the scenes in April, along with many infrastructure and library updates, improvements to our automated test suite, and small features and design tweaks to streamline Wiki Ed staff workflows.

Research and Academic Engagement

Zach has been working on finishing a draft of a white paper summarizing the Student Learning Outcomes Fall 2016 research, along with preliminary analysis.

Zach joined LiAnna and Executive Director Frank Schulenburg for the Hewlett Foundation's Open Educational Resources Meeting, discussing OER/OEP research and potential research futures. He and LiAnna also attended the Creative Commons Global Summit to present research findings on student motivations in peer production communities. Attendees at the Summit engaged in roundtable discussions, asked great questions, shared what the work meant to them, and expressed interest in engaging in the open education practice of Wikipedia-based class assignments. LiAnna and Zach also participated in a group discussion of the Creative Commons Open Education Platform, part of the Creative Commons strategic planning process.

Finance & Administration / Fundraising

Finance & Administration

Expenses for April 2017

For the month of April, our expenses were $131,467 versus our approved budget of $152,685. The $21k variance continues to be primarily due to staffing vacancies ($32k). However, some of the savings from the vacancies were offset by the need for some temporary help ($5k). The remaining variance is attributed to timing differences of a few expenses.

Our year-to-date expenditures were $1,486,614. We continue to be well below our budgeted expenditure of $1,918,502 by almost $432k. As with the monthly variance, a large portion of the variance is a result of staffing vacancies ($195k) offset by the need for some temporary help ($12k). There also continue to be savings and some timing differences with: Professional Services ($69k); Travel ($87k); Marketing and Fundraising Events ($29k); Board and Staff meetings ($46k); and Printing ($19k).

Year to date expenses as of April 2017

Fundraising

Frank, LiAnna, and Zach attended the Hewlett OER Meeting in Toronto at the end of April. The gathering of organizations working in the open educational resources and open educational practice space was a good opportunity to network, foster collaborations, and share learnings across different organizations working in the open education space. Each organization was given 2 minutes to talk about equity in their work; LiAnna used this time to talk about Wiki Ed’s work in content gaps around gender. Additionally, Wiki Ed instructor Amin Azzam led an unconference session with Frank and LiAnna about open educational practice.

Office of the ED

Every year in April, Wiki Ed's senior leadership team gathers at the Green Gulch Farm Zen Center and plans for the time ahead.

Current priorities:

  • Finalizing next year’s annual plan and budget
  • Preparing for a major new campaign

After spending the first half of the month in Europe (at the Wikimedia Conference and on vacation), Frank focused on the development of the annual plan and budget during the rest of the month. He and the other members of the senior management team spent two days at the local Zen monastery to discuss the next fiscal year and to finalize the annual plan document. This year, our new Director of Development and Strategy, TJ Bliss, joined the group and provided additional input.

Also in April, Frank worked on sustaining relationships with existing funders, participated in communications sessions for our new major campaign, and attended Hewlett’s OER grantee meeting in Toronto.

by Ryan McGrady at June 01, 2017 09:43 PM

Successes and learnings from the Year of Science

In 2016, Wiki Ed kicked off the largest targeted content initiative ever undertaken in the Wikimedia world: the Wikipedia Year of Science. When the year wrapped up in December, our programs for the initiative had added nearly 5 million words of content to the English Wikipedia. More than 6,300 students edited more than 5,700 articles on STEM and social science subjects on Wikipedia, and improved biographies of 150 women scientists. The amount of content added during the Year of Science is impressive: nearly 5 million words fills 3.5 full volumes of the last print edition of Encyclopædia Britannica.

But simply touting our numeric successes isn’t enough. As part of Wiki Ed’s commitment to the Wikimedia movement, we also wanted to evaluate our work — and document and share our learnings.

We’ve published our Year of Science evaluation on Meta-Wiki, the community wiki about Wikimedia projects. The report highlights what we did as part of the Year of Science in the Classroom Program, the Visiting Scholars Program, and other projects, as well as indications of what worked, what didn’t, and what we learned from the Year of Science. Our goal in publishing this report is to enable other groups considering creating a large-scale content initiative like the Year of Science to be able to learn from what we did.

I encourage anyone who’s interested to read the detailed report, or for a TL;DR version, check out the green “Learnings” boxes. We welcome questions on the talk page of the report, via comments on this blog post, or at Wikimania 2017 in Montreal, where I’ll be presenting on the Year of Science initiative and what we learned from it.

by LiAnna Davis at June 01, 2017 03:42 PM

Wikimedia Foundation

Czech–Polish ‘Wikiexpedition’ ends with over three thousand photos of historic Silesia

Photo by Aktron, CC BY-SA 4.0.

Time constraints. Weather. Short period of daylight. Such were the issues faced in visually capturing Silesia, a historic region primarily located in rural southwestern Poland, for use in Wikipedia articles. Still, 3,600 images have now been uploaded to Wikimedia Commons, 300 of which are currently in use on various-language Wikipedias.

The odyssey started in Prague, the capital of the Czech Republic. Two Wikipedians jumped into a shared car and drove to the far eastern side of the country, stopping in Opava, the historical capital of Silesia. The region was chosen due to its remote location, being far from major cities in the Czech Republic and Poland, and its lack of active Wikipedians. These factors made it a perfect location for a systematic photographic event.

The team stayed in major towns at night and traveled around all the villages during the day. From the lowlands to the High Ash Mountains, everything that could be covered was done—hundreds of pictures per day. The major focus was on settlements that were not well-photographed on Wikimedia Commons. Between 25 and 28 villages were photographed per day, with the evenings reserved for editing, sorting, describing, categorizing and uploading the data. Night photography was also part of the Wikiexpedition.

After day five, our Polish colleague and third participant, who had traveled a whole day across Poland, joined us in Kędzierzyn-Koźle, Poland. He focused on railroads and transportation infrastructure, while the Czech members continued to photograph settlements in the area, which added a new dimension to the expedition.

Photo by Juandev, CC BY-SA 3.0.

The three of us took 3,650 freely licensed images in eight days, all of which are now available on Wikimedia Commons. More than 212 settlements and points of interest were photographed. Over 1,000 images (~27.5%) were used in more than 330 articles, especially on the Czech and Polish Wikipedias, but also on the English, Spanish and others. Imports to Wikidata were arranged. The event had a high output, though it could hardly compete with Czech Wiki Loves Monuments, Wiki Loves Earth or other large contests. Nevertheless, the fact that just three people contributed so much in eight days makes this approach highly promising. Repeating the same method with more people could provide even more material for Commons and WMF projects. The financial aspect of the entire project should be taken into account—at 0.19 USD/0.17 EUR per photo, the Silesian Wikiexpedition came in significantly below the average cost of similar events. Most of the urbanized areas of the Czech Republic are already documented on Wikimedia Commons, but the remote regions are far less covered. With Wikiexpeditions, this could be addressed effectively within a few years.

Still, these wikiexpeditions are not simple to organize or execute. First, planning these trips means figuring out what has already been adequately photographed. Second, once you've planned them, those plans can be wrong; automatically obtained data can be erroneous, misleading, and sometimes completely useless. Time saved in preparation can be lost when photographing, as it is easy to get stranded in a sandpit or in the middle of a field where one was expecting a village. The geographic coordinates are, after all, only numbers interpreted by Wikidata and OpenStreetMap.

Third, the photography is hurried in order to fit in as many areas as possible within a short amount of time. Fourth, the processing, sorting, and describing of the data—plus the actual uploading process—takes quite a long time. Keeping ordered lists of what you photographed is a must.

Photo by Aktron, CC BY-SA 4.0.

With this successful Wikiexpedition as an important milestone, we have arranged another for 2017. This so-called Wikiexpedition West (Wikiexpedice Západ) incorporated more people, covered broader topics (such as natural reserves and historical monuments) and represented another step in our learning process. Nine locals participated, and we were able to create four teams. We are now processing images uploaded to Commons and placing them in Wikipedias.

Further plans include other regions and countries. Poland, as a large country, remains in focus for the summer season, so another event in central or southern Poland might take place. In the meantime, Shared Knowledge and Wikimedia User Group Greece plan a joint event in September 2017.

Wikiexpeditions can be fun, they are essential to Wikimedia projects such as Commons or Wikipedia, and they also bring the community together. They can have interesting side effects, such as promotion of the encyclopedia in rural areas or engaging volunteers in remote places. Our good practice encourages followers in other countries. Most of the world is still missing from Wikipedia, both pictures and text, and we all have a chance to change this.

If you’ve got interested, please read our manuals, follow us on Facebook (Wikiexpedition group) or join the upcoming expeditions planned for the Central European region.

Let’s illustrate Wikipedia together!

Wikimedia Czech Republic

by Jan Lochman, Jan Loužek and Tomasz Ganicz at June 01, 2017 02:30 PM

Gerard Meijssen

#Wikipedia - another #German #Award


It is funny in its own way that the only award winners that have no "red" or "blue" link on this award page are Wikipedians; German Wikipedians. They won the 2016 GDCh-Preis für Journalisten und Schriftsteller.

Typically we do not give much attention to our achievements and as such the understated attention can be understood. At Wikidata we need to have an item in order to recognise an award winner. As there was a photo of the award ceremony, it was obvious to add it to the item for these award winners as well.
Thanks,
      GerardM

by Gerard Meijssen (noreply@blogger.com) at June 01, 2017 11:14 AM

T Shrinivasan

Project Idea – Telegram bot to translate strings for Open Source Projects

At the Wikimedia hackathon, I saw a demo of using a Telegram bot to translate strings from translatewiki.net.

Here are the notes about it.

============

Telegram Translation Bot: https://phabricator.wikimedia.org/T131664 DONE

Translate on translatewiki.net without leaving your Telegram app

Code: https://github.com/amire80/mediawiki-telegram-bot/

mediawiki.org page: https://www.mediawiki.org/wiki/User:Amire80/chat_bot_draft

Phabricator: amire80 * Wikipedia: Amire80 * Twitter: @aharoni

Amir E. Aharoni and Taras Bunyk presenting

Justin Du (MtDu), Taras Bunyk, and help from Brian Wolff, Madhvuvishy, bd808, Niklas Laxström, Jon Robson, and more people!

“Most people don’t speak English”

Translatewiki.net – thousands of messages to translate

can now translate through this simple mobile app instead of needing to load the full site in a browser

selects untranslated strings, in your preferred languages, sends them to you, and you translate, and it submits them to translatewiki

Long messages are automatically skipped to suit use on mobile.

============

I am thinking that we can build a similar bot to translate strings for Mozilla and OpenStreetMap.

I need your inputs/thoughts/ideas on this.

Image: Translate.

These links may help in building a Telegram bot for translations.

https://github.com/zanata/zanata-python-client

https://translate.zanata.org

https://translate.zanata.org/iteration/view/TamilMap/1/settings?dswid=1182

Use this command to get the PO file as /tmp/ta.po:

zanata po pull --url https://translate.zanata.org/ --project-id TamilMap --project-version 1 --transdir /tmp

We can process the PO file using polib:

http://polib.readthedocs.io/en/latest/quickstart.html
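For example, a minimal sketch of reading and updating the pulled file with polib might look like this (the /tmp/ta.po path and the sample Tamil string are just placeholders):

    import polib

    # Load the PO file pulled from Zanata above.
    po = polib.pofile('/tmp/ta.po')
    print('Translated so far: {}%'.format(po.percent_translated()))

    # Walk the strings that still need a translation.
    for entry in po.untranslated_entries():
        print(entry.msgid)

    # Fill in one translation and write the file back to disk.
    untranslated = po.untranslated_entries()
    if untranslated:
        untranslated[0].msgstr = 'வணக்கம்'  # placeholder translation
        po.save()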

There are many python libraries to create a telegram bot.

http://telepot.readthedocs.io/en/latest/

https://khashtamov.com/en/how-to-create-a-telegram-bot-using-python/

https://blog.pythonanywhere.com/148/

https://www.codementor.io/garethdwyer/building-a-telegram-bot-using-python-part-1-goi5fncay
With all these tools – a library to create the bot, polib to process the PO files, and Zanata to host the translations – we can connect them all, as in the sketch below.
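Here is a minimal sketch of that idea, assuming telepot for the bot side; the bot token is a placeholder you would get from Telegram's @BotFather, the /tmp/ta.po file is the one pulled earlier, and pushing saved translations back to Zanata is left as a later step:

    import time

    import polib
    import telepot
    from telepot.loop import MessageLoop

    po = polib.pofile('/tmp/ta.po')  # pulled from Zanata as shown earlier
    pending = {}                     # chat_id -> entry awaiting a translation

    def handle(msg):
        content_type, chat_type, chat_id = telepot.glance(msg)
        if content_type != 'text':
            return
        text = msg['text']
        if text == '/next':
            # Send the next untranslated string to this user.
            untranslated = po.untranslated_entries()
            if not untranslated:
                bot.sendMessage(chat_id, 'Nothing left to translate!')
                return
            pending[chat_id] = untranslated[0]
            bot.sendMessage(chat_id, 'Please translate: ' + untranslated[0].msgid)
        elif chat_id in pending:
            # Treat any other text as the translation of the pending string.
            pending.pop(chat_id).msgstr = text
            po.save()                # a real bot would also push back to Zanata
            bot.sendMessage(chat_id, 'Saved. Send /next for another string.')

    bot = telepot.Bot('YOUR-BOT-TOKEN')  # placeholder token
    MessageLoop(bot, handle).run_as_thread()
    while True:
        time.sleep(10)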
If anyone is interested in programming this, reply here.

Thanks.

Image sources:

http://www.asktrustdee.com/2016/03/my-top-5-telegram-bot.html
https://commons.wikimedia.org/wiki/File:Translate_en-ta.png | CC-By-SA


by tshrinivasan at June 01, 2017 05:46 AM

May 31, 2017

Wikimedia Cloud Services

Updated `webservice` command deployed

The v0.37 build of rOSTW operations-software-tools-webservice has been deployed to Tool-Labs hosts and Tools-Kubernetes Docker images.

This release contains the following changes:

Kubernetes webservices will need to be restarted to pick up the new Docker images with the updated package installed. The new Docker images also contain the latest packages from the upstream apt repositories which may provide some minor bug fixes. We are not currently tracking the exact versions of all installed packages, so we cannot provide a detailed list of the changes.

by bd808 (Bryan Davis) at May 31, 2017 08:18 PM

Wiki Education Foundation

Announcing the new Brown University Visiting Scholar, Eryk Salvaggio

One of the wonders of Wikipedia is that it’s written almost entirely by volunteers. Together, hundreds of thousands of people have contributed to more than 5.3 million articles on the English language version alone. There is a challenge for Wikipedia inherent in its volunteer model, however: subjects that many people know about, that many people are interested in, and which don’t require dense research or specialized training, are far more likely to be the subject of well-developed articles. Sports, video games, and film, for example, are well represented. Some specialized areas like medicine and military history have found groups of highly dedicated editors, but, in general, academic subjects are more likely to be either omitted or poorly covered.

When editors do want to improve an academic subject, they often find that the sources they need are trapped behind expensive paywalls or only accessible via certain institutions. It’s for subjects like these that the Wikipedia Visiting Scholars program shines. Through it, Wikipedians form relationships with educational institutions. Volunteers are given access to the institution’s library resources, like databases, ebooks, and other materials, and agree to use them to improve articles in a topic area of mutual interest. The institutions increase the impact of their holdings while helping to make a difference in an important subject area, and the Wikipedian is empowered to make contributions without being limited by source access.

Eryk Salvaggio
Image: Eryk Salvaggio.jpg, by Owlsmcgee, CC BY-SA 4.0, via Wikimedia Commons.

I’m happy to announce the newest Wikipedia Visiting Scholar, Eryk Salvaggio, sponsored by Brown University’s John Nicholas Brown Center for Public Humanities and Cultural Heritage. Eryk will be using Brown’s resources to improve articles in one of those academic areas that could use a bit more attention: ethnic studies.

Eryk has been a Wikipedian for about two years, editing as User:Owlsmcgee, and in that time has made substantial contributions to a number of articles on Japanese culture, women in the arts, and political prisoners. In particular, he brought the article on State Shinto to Good Article status and either created or made significant improvements to the articles on women in Japan, Itako, women in Shinto, Shinto wedding, Ayoka Chenzira, and the 2010 Manezhnaya Square riot trials.

Eryk has a background in journalism, communications, and academia. He was the online editor of the Bangor Daily News before moving abroad to write and teach English in Japan. He earned a master’s degree in Global Communications from the London School of Economics in 2013, where he focused on media portrayals of ethnic minorities and migrant communities in Japan and the UK. His name may be familiar to readers of this blog, as Eryk previously served as Wiki Ed’s Communications Manager. He is currently Event & Social Media Manager for swissnex San Francisco, where he writes about intersections of art, science, research, and technology.

“I’m excited to be back at the intersection of Wikipedia and education as a Visiting Scholar,” Eryk said. “I’m a little bit in love with the Wikipedia project, and with Wiki Ed’s mission to craft solid, reliable content to fill in Wikipedia’s missing pieces. Thanks to Brown University, I’ll be able to focus on bringing forward information related to the history, art, and culture of global diasporas and the issues related to migration in the United States and abroad.”

Sponsoring Eryk at Brown is Jim McGrath, Postdoctoral Fellow in Digital Public Humanities at the John Nicholas Brown Center for Public Humanities and Cultural Heritage.

“Wikipedia is one of the world’s largest public humanities projects, and we’re delighted to work with Eryk to make it better,” he said. “We’re particularly excited about the opportunity to think about improving resources related to Ethnic Studies on Wikipedia. With Eryk’s help, we can think more about how the research we do on Brown’s campus can make a bigger impact on an invaluable digital resource that millions of us rely on every day. The Visiting Scholars Program seems like a great way for a campus like Brown to get its students, faculty, and community partners thinking more about Wikipedia and its relationship to our ongoing work in Ethnic Studies, American Studies, and Public Humanities initiatives. We can’t wait to get started!”

Image: Nightingale-Brown House, by Kenneth C. Zirkel, CC BY-SA 3.0, via Wikimedia Commons.

by Ryan McGrady at May 31, 2017 06:56 PM

T Shrinivasan

How Wikimedia movement should be in 2030?

Today, we had a discussion on the strategy for the Wikimedia movement for 2030 with a few Tamil Wikipedians and friends from the media, government, and academia.

The Wikimedia Foundation is planning what we should focus on in the wiki ecosystem to make it even better for people in 2030.

I added the following thoughts.

1. Space for adding tiny data.

In the future, there will be drastic changes in computers and input devices. There will be voice input. Computers will be embedded in all devices, and they will communicate with each other. They will be enabled with augmented reality, virtual reality, and artificial intelligence to gather and express various data. There should be a common data source to get any data from, and the Wikimedia movement should be that common data source for all devices.

There should be options to give input as tiny bits. Knowledge should be shareable by anyone, in any form; it should not be only in text form or as full articles. Sound input and tiny bits of input should be allowed in the wiki, and that content should be automatically translated into many languages.

For example, I should be able to ask a device, “Hey, what movie is showing in theatres near me?” and the device should get that data from the wiki. I should be able to ask, “What is the price of a TV in Chennai and in Austria?” and it should get the details from the wiki and reply by voice in my language. The wiki should allow these kinds of data.

2. Decentralised Wikis

Git-like decentralised wiki editing would enable more content to come from countries with poor internet connectivity.

The following are the inputs from other friends.

1. Archiving old books, like Google’s one-million-books project

2. Archiving old photos, pamphlets, advertisements, magazines

3. Connect with many organizations and governments to get their archives released under a CC license

4. Connect with mobile/camera manufacturers to add CC license details within the device

5. Connect with social media sites like Facebook to add options to release media files under a CC license

6. Finding false news within the flood of information will be a huge problem in the coming days. Is it possible for Wikipedia to verify and authenticate the news being shared on social media?

7. The wiki should help build educational materials for school and college students, so that the whole world gets free resources for study.

8. Data should be added in a uniform way, so that it can be transformed into multiple formats and languages automatically.

I hope these will be discussed by the Foundation team and taken forward.

After the meeting, I met the photographers Dillibabu and Sudharsan. They agreed to contribute thousands of their high-quality photographs taken around India. I will upload them to Commons and will share the details once that starts.

Thanks to the Wikimedia Foundation for starting these kinds of discussions. We need to plan for the future and make it happen. Visit http://2030.wikimedia.org/ and share your thoughts on building a better future for the Wikimedia projects.

I am also learning how to set plans for the FreeTamilEbooks.com and kaniyam.com projects. Let us dream big and make those dreams come true.

A few more photos are here: https://www.flickr.com/photos/tshrinivasan/albums/72157684407240936


by tshrinivasan at May 31, 2017 06:13 PM

Wikimedia Tech Blog

What we learned by making both newcomers and experienced participants feel connected and engaged at the Vienna Hackathon

The mentoring program started with an introductory session on the first day, where mentors introduced themselves and pitched project ideas. From left to right: mentors Amir Sarabadani and Aaron Halfaker, coordinator Sonja Fischbauer. Photo by Manfred Werner, CC BY-SA.

At the Wikimedia Hackathon, held from May 19–21 in Vienna, Austria, we wanted to create a warm, welcoming environment for newcomers—while also making it really easy for both new and experienced participants to learn from each other and work on their technical projects.

Below are a few things we did to make both newcomers and experienced developers feel connected and engaged before, during, and after the event. Feel free to adapt any of these for future events you run, and if you have ideas for our next event, please post your feedback.

We made it really easy for participants to find potential collaborators and projects before the Hackathon began.

With 260 people expected to come to Vienna from 48 nationalities and 27 chapters and user groups, we wanted to make it really easy for participants to connect and learn about the projects they could work on. We directed participants to introduce themselves along with their interests on MediaWiki, gave them a list of suggested ways to prepare for the hackathon, and invited them to a group chat with other attendees. We also created a list of featured hackathon tasks, tagging both their difficulty and whether they were appropriate for newcomers. Participants were also directed to a Phabricator board containing more proposed sessions and skills they could learn from others.

We created a hackathon that appealed to long-time MediaWiki developers while also making it easy for newcomers to jump right in.

Hackathons can be daunting for newcomers: there’s limited time to work on technical projects and a lot to learn. This year, we launched a successful mentoring program that paired our 56 mentees with 30 experienced hackathon participants who signed up specifically to welcome new people into the Wikimedia technical community. In addition to a group chat and mentoring area, participants in the mentoring program were invited to participate in daily meetings and provided with resources to make their hackathon experience less overwhelming.

To coordinate these efforts, Wikimedia Austria reinforced their team in the months leading up to the hackathon. They hired Sonja Fischbauer as a freelance outreach coordinator, and were the first host organisation to hire a person specifically for that role.

“To onboard all the wonderful newcomers we were able to bring to the hackathon, we developed the mentoring program,” says Fischbauer. “Mentors worked exclusively with newcomers for the weekend, on beginner-friendly tasks, with the aim that everyone could achieve a result by the end of the event. The program put structure to something that’s always been a core value of the Wikimedia community: the passion to share knowledge, to collaborate and to create something new together. It was amazing to see how much effort the mentors put towards helping others, and how much fun it was for everyone! We really hope that we created a spark here, something that will catch fire and grow in the future.”

Because mentors were paired exclusively with mentees for the entire weekend, this reinforced working together in small groups on projects designed to make it easy to contribute. As a result, many newcomers were even able to demonstrate the results of their work during the showcase on the last day.

Wikimedia Austria (WMAT) also invited local newcomers to pre-hackathon sessions specifically designed for them, so that they could dip their toes into the world of MediaWiki before committing to a three-day hackathon and meet other participants in advance. One of these preparation workshops was held exclusively for women and LGBT participants, to create a safe and inclusive learning space. These events helped newcomers feel welcome and gave them the opportunity to install any necessary software on their computers before the hackathon, so they could start hacking immediately once they arrived at the event. In addition, WMAT encouraged partner communities across Central and Eastern Europe (in Hungary, Romania, Greece, and the Czech Republic) to organize pre-hackathons, and each of them sent some of their local newcomers to the main event in Vienna.

We made it easy for attendees to document their experiences, which helped let others know about the hackathon.

Anyone who has attended a hackathon knows that it’s important to document what happens during or shortly after the event before people return to their normal lives. This year, we created a volunteer group specifically for people interested in blogging about their experiences in Vienna. Participants were given ideas, resources, and shown example posts to help them get their creative juices flowing.

Participants included:

Inviting people to blog and giving them resources and guidance not only helps participants write; it also helps us reach potential new collaborators or contributors who find out about the event through a personal blog.

We organized the hackathon as a partnership with our attendees, which made people feel included in the process and the outcome.

Like all of our technical events, the Vienna Hackathon was developed and run in conjunction with community members and staff in Vienna and around the globe. There were several ways for people to get involved and share their expertise, including:

  • Participating in a robust volunteer program, which included photography, blogging, being a social host, and helping document the hackathon throughout the event.
  • Adding their skills and expertise to a skillshare wall—people were asked to either offer a skill they could teach others or request help from others on a specific topic. This ended up working really well, and is one of the benefits of getting people together in person.
  • Making announcements during the opening of the event. We heard from 45 attendees about how they could help teams or people who could use a specific skillset throughout the event.
  • Participating in the project showcase, which took place on the last day of the hackathon and detailed more than 40 projects that emerged over the event.

We saw the hackathon not just as a singular event, but a way to bring people together to learn before going home to work on their projects.

Hackathons bring people together in person for short periods of time in order to learn new skills, find new collaborators, and plan future work that can be completed asynchronously. But that also means that hackathons shouldn’t be thought of as a singular event in time—in other words, they shouldn’t end when participants go home. We see technical hackathons as a way to jumpstart ideas, bring collaborators together, and help them learn from each other before taking those ideas and conversations home to inspire new, creative ways of thinking about projects.

This also means that our hackathon projects are ongoing and available if you’re looking for a new project to work on. Some of the projects that came out of this hackathon include:

How to start working on any of these projects

By now, you may be thinking “These are awesome projects! How can I work on them or the many other amazing things people made at the Hackathon?”

You can start working on any of these projects by looking at this list. There, you can express your interest by commenting on the Phabricator task. You can also always reach out on wikitech-l, our mailing list aimed at the technical community.

And looking ahead, the next Wikimedia Hackathon will take place at Wikimania in Montreal (Canada) on August 9–10, 2017. The next European Hackathon will be at the Universitat Autònoma de Barcelona (Spain) in Spring 2018. We hope to see you at one of our events!

Rachel Farrand, Events Program Manager, Developer Relations, Wikimedia Foundation
Srishti Sethi, Developer Advocate, Developer Relations, Wikimedia Foundation
Claudia Garad, Executive Director, Wikimedia Austria (Österreich)
Sonja Fischbauer, Freelancer, Wikimedia Austria (Österreich)

Thank you to Melody Kramer (Wikimedia Foundation Communications) for her help putting together this post.

by Rachel Farrand, Srishti Sethi, Claudia Garad and Sonja Fischbauer at May 31, 2017 04:33 PM

Wikipedia Weekly

Wikipedia Weekly #123 – Citations, WikiCite and Wikidata in Vienna

This Wikipedia Weekly podcast episode covers the WikiCite 2017 conference in Vienna, Austria, May 23-25, 2017. We discuss a brief history of citations and references in Wikipedia, and what a gathering of librarians, ontologists and Wikimedians did at WikiCite. We give a quick overview of Wikidata and its relations to citations, and how the Citoid citation system works by using the Zotero open source citation system.

We also discuss a brief research project Andrew and Rob did to measure how well Wikipedia’s Visual Editor/Citoid performs when citing the most-used news sources on English Wikipedia. TL;DR: more than half the time, auto-generating citations from news sources using Visual Editor results in an incomplete citation. We discuss ways to resolve this, such as improving Zotero Translators to scrape web pages more effectively.

Participants: Andrew Lih (User:Fuzheado), Rob Fernandez (User:Gamaliel)

Links:

Opening music: “At The Count” by Broke For Free, licensed under CC-BY-3.0; closing music: “Things Will Settle Down Eventually” by 86 Sandals, licensed under CC-BY-SA-3.0.

All original content of this podcast is licensed under CC-BY-SA-4.0.

by admin at May 31, 2017 02:50 PM

May 30, 2017

Wikimedia Cloud Services

Project-wide sudo policies in Horizon

When @Ryan_Lane first built OpenStackManager and Wikitech, one of the first features he added was an interface to set up project-wide sudo policies via LDAP.

I've basically never thought about it, and assumed that no one was using it. A few months ago various Labs people were discussing sudo policies and it turned out that we all totally misunderstood how they worked, thinking that they derived from Keystone roles rather than from a custom per-project setup. I immediately declared "No one is using this, we should just rip out all that code" and then ran a report to prove my point... and I turned out to be WRONG. There are a whole lot of different custom sudo policies set up in a whole lot of different projects.
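For readers curious what such a report might involve: since the standard sudo-ldap schema stores policies as sudoRole entries, a read-only query with python-ldap could look roughly like the sketch below. The server URI and base DN are hypothetical placeholders, not Wikimedia's actual tree.

    import ldap

    # Hypothetical connection; an anonymous bind suffices for a read-only report.
    conn = ldap.initialize('ldap://ldap.example.org')
    conn.simple_bind_s()

    # Find every sudoRole entry under a hypothetical projects subtree and
    # print the attributes that make up each policy.
    results = conn.search_s(
        'ou=projects,dc=example,dc=org',   # placeholder base DN
        ldap.SCOPE_SUBTREE,
        '(objectClass=sudoRole)',
        ['sudoUser', 'sudoHost', 'sudoCommand', 'sudoOption'],
    )
    for dn, attrs in results:
        print(dn, attrs)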

So... rather than ripping out the code, I've implemented a new sudo interface that runs in Horizon. [T162097] It is a bit slow, and only slightly easier to use than the old OpenStackManager interface, but it gets us one step closer to moving all VPS user interfaces to Horizon. [T161553]

For the moment, users can edit the same policies either on Horizon or on Wikitech. If I don't get complaints, then I'll remove the UI from Wikitech in a few weeks.

by Andrew (Andrew Bogott) at May 30, 2017 11:17 PM