create tool to crunch metrics for views (play started) of video and audio files
Closed, Resolved · Public

Description

This is a request for a UI; the raw data is available.

Currently the only measure of usage of videos and audio files is the number of page views of the pages on which they appear, measured using tools like stats.grok.se. Because videos and audio files must be clicked on to play, using page views is a very inaccurate measure.

A tool to crunch the number of times videos and audio files have been played across all Wikimedia projects would offer a much more accurate understanding. The tool should be easy to use and should provide metrics both for individual files and for collections of files, e.g. a Commons category. It could also take the form of a built-in view counter on file pages on Commons, in a similar style to YouTube.

This metric is especially important for people (like me) working with partner organisations that are considering making content available under an open license. I'm not currently able to offer these organisations any sort of useful information about how many times their content has been seen, whereas commercial sites that do not require an open license, e.g. YouTube, can offer the number of video views.

Event Timeline

John_Cummings assigned this task to kevinator.
John_Cummings raised the priority of this task from to Needs Triage.
John_Cummings updated the task description. (Show Details)
John_Cummings added a project: Analytics.
John_Cummings added a subscriber: John_Cummings.

@Nemo_bis - there is no easy way to get at this data through either the Commons interface or a tool. I think what he is requesting is a tool or visible stat that helps improve the visibility of these numbers/stats, something like a stats.grok.se equivalent.

@Sadads Yes, this is exactly it: something that is easy to use and can provide metrics for both individual files and collections of files, e.g. a Commons category.

@Mrjohncummings you might want to update your request then, to reflect the need for a tool or more transparent interface for this data.

John_Cummings renamed this task from "Publicly available view counter of videos and audio files" to "Publicly available and easy to use metrics for plays of video and audio files". (Oct 23 2015, 2:32 PM)
John_Cummings updated the task description. (Show Details)
John_Cummings set Security to None.

@Sadads done :) Apologies to analytics team if it is still not clear, happy to discuss further.

jeremyb-phone renamed this task from "Publicly available and easy to use metrics for plays of video and audio files" to "create tool to crunch metrics for views (play started) of video and audio files". (Oct 23 2015, 3:14 PM)
jeremyb-phone updated the task description. (Show Details)

Is there a Phabricator project like "tool requests"? Does Analytics use workboards? (@Aklapper)

I'm surprised that this got assigned to me suddenly? I guess you want me to work on this? Why was I chosen? :P

Based on some very crude estimates, it looks like there's about 500 GB of raw data there… (after uncompressing; give or take a couple hundred). If we were to limit it to audio/video files only, it's more like 5-20 GB. I'm not sure how reasonable it would be to keep this amount of data on Labs.

Well then, I went ahead, created a tool and crunched the data. Anybody, please feel free to write a UI for it :)

Here's a 58 MB bzipped file containing the number of views per day, generated from the raw data (third column), for audio/video files only. It expands to 814 MB of JSON – a lot less than I expected, since it seems that many of the files viewed every day are the same. Small enough to load the whole thing into memory :) There were 538,965 unique file names (some actually pointing to non-existent files).

updated to cover whole year 2015: (small snippet to see the format: )

The tool I wrote to make this: https://github.com/MatmaRex/commons-media-views (it can also be run daily to update the dataset). It's written in Ruby (tested with Ruby 2.0.0) and requires at least 4 GB of RAM to run (with the current size of the data). No doubt it could be rewritten to be more efficient, but it was fast enough to process the whole data set in a few days, and takes 10-15 minutes on my machine to process an additional day.
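
For anyone curious about the kind of processing involved, here is a minimal sketch (not the actual tool's code) that tallies plays per audio/video file from one day's mediacounts TSV dump. The zero-indexed columns 0, 3, 4 and 16 follow the discussion further down in this task; the extension list is an assumption.

```ruby
require 'json'

# File extensions treated as audio/video; this list is an assumption,
# not the tool's actual filter.
AV_EXTENSIONS = /\.(ogg|oga|ogv|webm|mid|wav|flac|opus|mp3)\z/i

counts = Hash.new(0)
ARGF.each_line do |line|
  cols = line.chomp.split("\t")
  name = cols[0]
  next unless name =~ AV_EXTENSIONS
  # Zero-indexed columns 3, 4 and 16: requests for the raw file, a
  # transcoded audio file, and a transcoded movie file (the request
  # types treated as "plays" in this thread).
  plays = cols.values_at(3, 4, 16).map(&:to_i).inject(0, :+)
  counts[name] += plays if plays > 0
end

# Emit one JSON object mapping file name to plays for the day.
puts JSON.generate(counts)
```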

matmarex added a subscriber: matmarex.

Wonderful! For the UI it would be very helpful if you could both see the plays of an individual video and also pick a category on Commons and see the video plays for that category.

I have created the infrastructure for logging the play counts in a central database. Currently I am working on ingesting all the historical data going back to the first day of data on January 1, 2015. Once that is done there will be a daily script that adds the prior day's values.
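
For illustration, a rough sketch of what such a daily ingest step could look like, assuming a SQLite database and a per-day JSON file of counts; the schema, file names, and storage engine are all assumptions, not necessarily what the tool uses.

```ruby
require 'date'
require 'json'
require 'sqlite3' # gem install sqlite3

# Hypothetical schema; the real tool's database layout may differ.
db = SQLite3::Database.new('mediaplaycounts.db')
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS play_counts (
    file  TEXT    NOT NULL,
    date  TEXT    NOT NULL, -- YYYYMMDD
    plays INTEGER NOT NULL,
    PRIMARY KEY (file, date)
  )
SQL

# Add the prior day's per-file values, e.g. from a JSON file produced
# by a counting step like the sketch earlier in this task.
date = (Date.today - 1).strftime('%Y%m%d')
counts = JSON.parse(File.read("counts-#{date}.json")) # hypothetical name

db.transaction do
  counts.each do |file, plays|
    db.execute('INSERT OR REPLACE INTO play_counts VALUES (?, ?, ?)',
               [file, date, plays])
  end
end
```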

I have also created the corresponding API methods for querying this data, including for a specific date, the last 30 days, the last 90 days, and an arbitrary date range, for an individual file or for a category of files (with subcategories). I still need to create a web interface for accessing these APIs. There will be an online user-friendly interface and an API endpoint that returns JSON. It may be a bit slow.

We have an API: https://tools.wmflabs.org/mediaplaycounts/api/1/FilePlaycount/date/Donning_PPE-_Engage_Trained_Observer_CDC02.webm/20150101
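
A quick usage sketch: the URL pattern follows the example above, but the shape of the JSON response is an assumption here (see the API documentation for the actual format).

```ruby
require 'json'
require 'net/http'
require 'uri'

url = URI('https://tools.wmflabs.org/mediaplaycounts/api/1/FilePlaycount/' \
          'date/Donning_PPE-_Engage_Trained_Observer_CDC02.webm/20150101')
response = Net::HTTP.get(url) # returns the response body as a string
puts JSON.parse(response).inspect
```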

Once I map the different API URLs to the framework itself, I will begin work on a lookup interface.

The API documentation is located at P4339 if you are interested in developing tools around these metrics. (Pinging @MusikAnimal.) Note that the dataset is incomplete; work to ingest all the data from 1 January 2015 to the present is ongoing. As of this writing, the ingest script is around midway through April 2016.

The web interface that lets you do lookups without bothering with API query URLs is forthcoming.

Awesome! I will look into adding this to Pageviews Analysis :) So this goes off of the raw dumps at https://dumps.wikimedia.org/other/mediacounts/ ?

Correct. More specifically, it goes off of columns 0, 3, 4, and 16, which (I believe) are the counts referring to actual plays of the media content, as opposed to previews.

Going to close this task as complete since the metrics crunching is now underway; T149642 is the task for implementing the UI.

All the past data has now been ingested, covering 1 January 2015 to 1 November 2016. Daily ingests will take place at around 20:00 UTC.
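
(For reference, that schedule could be expressed as a crontab entry along these lines; the paths are hypothetical:)

```
# Hypothetical crontab entry: run the daily ingest at 20:00 UTC
0 20 * * * /usr/bin/ruby /data/project/mediaplaycounts/ingest_daily.rb
```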

I'm really glad this work is happening. As for which columns to include, here is some feedback:

The file layout is as follows:

#1: Base name
#2: Total sum of response size for base name
#3: Total requests
#4: Requests for the raw - original - unmodified file
#5: Requests for a transcoded audio file
#6: Requests sound data1 (reserved)
#7: Requests sound data2 (reserved)
#8: Requests for a transcoded image file
#9: Requests for a transcoded image file ( 0 <= width <= 199)
#10: Requests for a transcoded image file ( 200 <= width <= 399)
#11: Requests for a transcoded image file ( 400 <= width <= 599)
#12: Requests for a transcoded image file ( 600 <= width <= 799)
#13: Requests for a transcoded image file ( 800 <= width <= 999)
#14: Requests for a transcoded image file (1000 <= width )
#15: Requests image data1 (reserved)
#16: Requests image data2 (reserved)
#17: Requests for a transcoded movie file
#18: Requests for a transcoded movie file ( 0 <= height <= 239)
#19: Requests for a transcoded movie file (240 <= height <= 479)
#20: Requests for a transcoded movie file (480 <= height )
#21: Requests movie data1 (reserved)
#22: Requests movie data2 (reserved)
#23: Requests with an internal referer (from WMF domains)
#24: Requests with an external referer (from non-WMF domains)
#25: Requests with unknown referer

If I'm correct, columns 0, 3, 4, and 16 would map to this list (starting with 1) as:
#1: Base name
#4: Requests for the raw - original - unmodified file
#5: Requests for a transcoded audio file
#17: Requests for a transcoded movie file

Is this correct?

That is correct @ezachte, or at least those are the columns I went with (so I hope they're correct!).

@harej-NIOSH thanks

These columns are a very minimal set. The idea behind counting image requests in bins per image size was that it allows making a distinction between thumbs (shown within a page, not necessarily related to what the reader was looking for, and maybe even outside the visible window) and images a reader specifically wanted to see in more detail. The latter are the ones I would report to museum directors (my credo: let's err on the side of modesty; even then our numbers are high beyond belief). Of course, the supplied explanation/definition should mention this restriction.

Some images are even used as small navigation icons, so what is depicted is hardly discernible. Adding view counts for those icons to a monthly total per collection in a report for these parties is, in my opinion, misinformation (and I have seen such reports more than once). A decent threshold for image size would alleviate that problem (it's not perfect; some small images would be discarded as false negatives). A more subtle refinement: allow images below the threshold if no larger version was ever requested.
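
A rough sketch of that rule under the column layout above (zero-indexed: column 8 is the 0-199px width bin, columns 9-13 are the bins at 200px and wider); this is a thought experiment, not part of the deployed tool.

```ruby
small = Hash.new(0) # requests in the 0-199px width bin
large = Hash.new(0) # requests at >= 200px width

ARGF.each_line do |line|
  cols = line.chomp.split("\t")
  name = cols[0]
  small[name] += cols[8].to_i
  large[name] += cols.values_at(9, 10, 11, 12, 13).map(&:to_i).inject(0, :+)
end

# Report views above the threshold; fall back to the small bin only
# for files that were never requested at a larger size.
reportable = large.reject { |_, n| n.zero? }
small.each do |name, n|
  reportable[name] = n if n > 0 && large[name].zero?
end
```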

Here is an example of an image that is mostly requested at a very minimal display size:
/wikipedia/commons/c/cd/Socrates.png
Its main usage is as an icon within templates.

The image was requested as follows on Nov 20, 2016:
total: 179076
larger than 200 pixels wide: 1123 (about 0.006 of the total, i.e. less than 1 percent)

Cheers,
Erik

@ezachte I would be interested in the data for full-size images vs. thumbnails/icons, but this API focuses on the number of times a playable media file has been played. That said, extending it to include this additional data should not be difficult.