create tool to crunch metrics for views (play started) of video and audio files
Closed, Resolved · Public

Description

This is a request for a UI; the raw data is available.

Currently the only measure of usage of videos and audio files is the number of page views of the pages on which they appear, measured using tools like stats.grok.se. Because videos and audio files must be clicked on to play, using page views is a very inaccurate measure.

A tool to crunch the number of times videos and audio files have been played across all Wikimedia projects would offer a much more accurate understanding. The tool should be easy to use and should provide metrics both for individual files and for collections of files, e.g. a Commons category. It could also take the form of a built-in view counter on file pages on Commons, in a similar style to YouTube.

This metric is especially important for people (like me) working with partner organisations that are considering making content available under an open license. I'm not currently able to offer these organisations any sort of useful information about how many times their content has been seen, whereas commercial sites that do not require an open license, e.g. YouTube, can offer the number of video views.

Event Timeline

John_Cummings assigned this task to kevinator.
John_Cummings raised the priority of this task from to Needs Triage.
John_Cummings updated the task description. (Show Details)
John_Cummings added a project: Analytics.
John_Cummings added a subscriber: John_Cummings.

@Nemo_bis - there is no easy way to get at this data through either the Commons interface or a tool. I think what he is requesting is a tool or visible stat that helps improve the visibility of these numbers/stats, something like a stats.grok.se equivalent.

@Sadads Yes, this is exactly it: something that is easy to use and can provide metrics for both individual files and collections of files, e.g. a Commons category.

@Mrjohncummings you might want to update your request then, to reflect the need for a tool or more transparent interface for this data.

John_Cummings renamed this task from "Publicly available view counter of videos and audio files" to "Publicly available and easy to use metrics for plays of video and audio files". (Oct 23 2015, 2:32 PM)
John_Cummings updated the task description. (Show Details)
John_Cummings set Security to None.

@Sadads done :) Apologies to analytics team if it is still not clear, happy to discuss further.

jeremyb-phone renamed this task from "Publicly available and easy to use metrics for plays of video and audio files" to "create tool to crunch metrics for views (play started) of video and audio files". (Oct 23 2015, 3:14 PM)
jeremyb-phone updated the task description. (Show Details)

Is there a Phabricator project like "tool requests"? Does Analytics use workboards? (@Aklapper)

I'm surprised that this got assigned to me suddenly? I guess you want me to work on this? Why was I chosen? :P

Based on some very crude estimates, it looks like there's about 500 GB of raw data there… (after uncompressing; give or take a couple hundred). If we were to limit it to audio/video files only, it's more like 5-20 GB. I'm not sure how reasonable it would be to keep this amount of data on Labs.

Well then, I went ahead, created a tool and crunched the data. Anybody, please feel free to write a UI for it :)

Here's a 58 MB bzipped file containing the number of views per day, generated from the raw data (third column), for audio/video files only. It expands to 814 MB of JSON – a lot less than I expected, since it seems that many of the files viewed every day are the same. Small enough to load the whole thing into memory :) There were 538,965 unique file names (some actually pointing to non-existent files).

updated to cover whole year 2015: (small snippet to see the format: )

The tool I wrote to make this: https://github.com/MatmaRex/commons-media-views (it can also be run daily to update the dataset). It's written in Ruby (tested with Ruby 2.0.0) and requires at least 4 GB of RAM to run (with the current size of the data). No doubt it could be rewritten to be more efficient, but it was fast enough to process the whole data set in a few days, and takes 10-15 minutes on my machine to process an additional day.
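
For anyone curious about the kind of processing involved, here is a minimal sketch (not the actual tool's code) that tallies plays per audio/video file from one day's mediacounts TSV dump. The zero-indexed columns 0, 3, 4 and 16 follow the discussion further down in this task; the extension list is an assumption.

```ruby
require 'json'

# File extensions treated as audio/video; this list is an assumption,
# not the tool's actual filter.
AV_EXTENSIONS = /\.(ogg|oga|ogv|webm|mid|wav|flac|opus|mp3)\z/i

counts = Hash.new(0)
ARGF.each_line do |line|
  cols = line.chomp.split("\t")
  name = cols[0]
  next unless name =~ AV_EXTENSIONS
  # Zero-indexed columns 3, 4 and 16: requests for the raw file, a
  # transcoded audio file, and a transcoded movie file (the request
  # types treated as "plays" in this thread).
  plays = cols.values_at(3, 4, 16).map(&:to_i).inject(0, :+)
  counts[name] += plays if plays > 0
end

# Emit one JSON object mapping file name to plays for the day.
puts JSON.generate(counts)
```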

matmarex added a subscriber: matmarex.

Wonderful! For the UI it would be very helpful if you could both see the plays of an individual video and also pick a category on Commons and see the video plays for that category.

I have created the infrastructure for logging the play counts in a central database. Currently I am working on ingesting all the historical data going back to the first day of data on January 1, 2015. Once that is done there will be a daily script that adds the prior day's values.
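
For illustration, a rough sketch of what such a daily ingest step could look like, assuming a SQLite database and a per-day JSON file of counts; the schema, file names, and storage engine are all assumptions, not necessarily what the tool uses.

```ruby
require 'date'
require 'json'
require 'sqlite3' # gem install sqlite3

# Hypothetical schema; the real tool's database layout may differ.
db = SQLite3::Database.new('mediaplaycounts.db')
db.execute <<-SQL
  CREATE TABLE IF NOT EXISTS play_counts (
    file  TEXT    NOT NULL,
    date  TEXT    NOT NULL, -- YYYYMMDD
    plays INTEGER NOT NULL,
    PRIMARY KEY (file, date)
  )
SQL

# Add the prior day's per-file values, e.g. from a JSON file produced
# by a counting step like the sketch earlier in this task.
date = (Date.today - 1).strftime('%Y%m%d')
counts = JSON.parse(File.read("counts-#{date}.json")) # hypothetical name

db.transaction do
  counts.each do |file, plays|
    db.execute('INSERT OR REPLACE INTO play_counts VALUES (?, ?, ?)',
               [file, date, plays])
  end
end
```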

I have also created the corresponding API methods for querying this data, including for a specific date, the last 30 days, the last 90 days, and an arbitrary date range, for an individual file or for a category of files (with subcategories). I still need to create a web interface for accessing these APIs. There will be an online user-friendly interface and an API endpoint that returns JSON. It may be a bit slow.

We have an API: https://tools.wmflabs.org/mediaplaycounts/api/1/FilePlaycount/date/Donning_PPE-_Engage_Trained_Observer_CDC02.webm/20150101
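
A quick usage sketch: the URL pattern follows the example above, but the shape of the JSON response is an assumption here (see the API documentation for the actual format).

```ruby
require 'json'
require 'net/http'
require 'uri'

url = URI('https://tools.wmflabs.org/mediaplaycounts/api/1/FilePlaycount/' \
          'date/Donning_PPE-_Engage_Trained_Observer_CDC02.webm/20150101')
response = Net::HTTP.get(url) # returns the response body as a string
puts JSON.parse(response).inspect
```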

Once I map the different API URLs to the framework itself, I will begin work on a lookup interface.

The API documentation is located at P4339 if you are interested in developing tools around these metrics. (Pinging @MusikAnimal.) Note that the dataset is incomplete; work to ingest all the data from 1 January 2015 to the present is ongoing. As of this writing, the ingest script is around midway through April 2016.

The web interface that lets you do lookups without bothering with API query URLs is forthcoming.

Awesome! I will look into adding this to Pageviews Analysis :) So this goes off of the raw dumps at https://dumps.wikimedia.org/other/mediacounts/ ?

Correct. More specifically, it goes off of columns 0, 3, 4, and 16, which (I believe) are the counts referring to actual plays of the media content, as opposed to previews.

Going to close this task as complete since the metrics crunching is now underway; T149642 is the task for implementing the UI.

All the past data has now been ingested, covering 1 January 2015 to 1 November 2016. Daily ingests will take place at around 20:00 UTC.
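
(For reference, that schedule could be expressed as a crontab entry along these lines; the paths are hypothetical:)

```
# Hypothetical crontab entry: run the daily ingest at 20:00 UTC
0 20 * * * /usr/bin/ruby /data/project/mediaplaycounts/ingest_daily.rb
```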

I'm really glad this work is happening. As for which columns to include, here is some feedback:

The file layout is as follows:

#1: Base name
#2: Total sum of response size for base name
#3: Total requests
#4: Requests for the raw - original - unmodified file
#5: Requests for a transcoded audio file
#6: Requests sound data1 (reserved)
#7: Requests sound data2 (reserved)
#8: Requests for a transcoded image file
#9: Requests for a transcoded image file ( 0 <= width <= 199)
#10: Requests for a transcoded image file ( 200 <= width <= 399)
#11: Requests for a transcoded image file ( 400 <= width <= 599)
#12: Requests for a transcoded image file ( 600 <= width <= 799)
#13: Requests for a transcoded image file ( 800 <= width <= 999)
#14: Requests for a transcoded image file (1000 <= width )
#15: Requests image data1 (reserved)
#16: Requests image data2 (reserved)
#17: Requests for a transcoded movie file
#18: Requests for a transcoded movie file ( 0 <= height <= 239)
#19: Requests for a transcoded movie file (240 <= height <= 479)
#20: Requests for a transcoded movie file (480 <= height )
#21: Requests movie data1 (reserved)
#22: Requests movie data2 (reserved)
#23: Requests with an internal referer (from WMF domains)
#24: Requests with an external referer (from non-WMF domains)
#25: Requests with unknown referer

If I'm correct, columns 0, 3, 4, and 16 would map to this list (starting with 1) as:
#1: Base name
#4: Requests for the raw - original - unmodified file
#5: Requests for a transcoded audio file
#17: Requests for a transcoded movie file

Is this correct?

That is correct @ezachte, or at least those are the columns I went with (so I hope they're correct!).

@harej-NIOSH thanks

These columns are a very minimal set. The idea behind counting image requests in bins per image size was that it allows making a distinction between thumbs (shown within a page, not necessarily related to what the reader was looking for, and maybe even outside the visible window) and images a reader specifically wanted to see in more detail. The latter are the ones I would report to museum directors (my credo: let's err on the side of modesty; even then our numbers are high beyond belief). Of course, the supplied explanation/definition should mention this restriction.

Some images are even used as small navigation icons, so what is depicted is hardly discernible. Adding view counts for those icons to a monthly total per collection in a report for these parties is, in my opinion, misinformation (and I have seen such reports more than once). A decent threshold for image size would alleviate that problem (it's not perfect; some small images would be discarded as false negatives). A more subtle refinement: allow images below the threshold if no larger version was ever requested.
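
A rough sketch of that rule under the column layout above (zero-indexed: column 8 is the 0-199px width bin, columns 9-13 are the bins at 200px and wider); this is a thought experiment, not part of the deployed tool.

```ruby
small = Hash.new(0) # requests in the 0-199px width bin
large = Hash.new(0) # requests at >= 200px width

ARGF.each_line do |line|
  cols = line.chomp.split("\t")
  name = cols[0]
  small[name] += cols[8].to_i
  large[name] += cols.values_at(9, 10, 11, 12, 13).map(&:to_i).inject(0, :+)
end

# Report views above the threshold; fall back to the small bin only
# for files that were never requested at a larger size.
reportable = large.reject { |_, n| n.zero? }
small.each do |name, n|
  reportable[name] = n if n > 0 && large[name].zero?
end
```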

Here is an example of an image that is mostly requested at a very minimal display size:
/wikipedia/commons/c/cd/Socrates.png
Its main usage is as an icon within templates.

The image was requested as follows on Nov 20, 2016:
total: 179076
larger than 200 pixels wide: 1123 (about 0.006 of the total, i.e. less than 1 percent)

Cheers,
Erik

@ezachte I would be interested in the data for full-size images vs. thumbnails/icons, but this API focuses on the number of times a playable media file has been played. That said, extending it to include this additional data should not be difficult.