I like writing about my work, and sharing some of the problems I’ve been able to solve. But unfortunately, at the end of the day, I write about less than 1% of the things I actually do – and so when I was writing about my Regional NSW Syndication project, I realised it’s been three years since I last wrote about OnAirHope. A lot has changed with Hope Media’s on-air metadata system in that time.
With changes to the volume of media we publish to our website, and the introduction of the Regional NSW audio, we realised that maintaining a dedicated file server to store all our audio was very ‘2005’. The idea of using AWS S3 for storage came up around the same time as a business requirement came through to gather some more in-depth and accurate stats about our downloadable audio.
As a result, OnAirHope is the database, connection broker, and statistical engine for all our downloadable audio content.
On a local server, we use Thimeo’s WatchCat to find new audio files in a watch-folder, process and normalise the audio, convert the file to MP3, and then hand it off to an OnAirHope upload utility (built in Python).
This upload utility creates a database entry in OnAirHope, and then uploads the file to S3. OnAirHope provides a unique audio URL to our content producers, ready to embed in a WordPress Post or Podcast feed.
Whenever a file is accessed by an end-user, OnAirHope creates a one-time download token and redirects you to that file on S3. It also logs an entry in the database, used for statistical analysis later on. We also store a classification of where each download comes from, so we can see which channels are most popular.
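A minimal sketch of that request flow, under stated assumptions: the post doesn't describe how the tokens or the source classification are actually implemented, so the `TokenStore` class, the `classify_source` rules, the `src` query parameter, and the redirect URL are all hypothetical. It only illustrates the shape of "log the download, issue a single-use token, redirect".

```python
import secrets
import time
from typing import Dict, List, Optional, Tuple


class TokenStore:
    """In-memory stand-in for a token table; a real system would persist this."""

    def __init__(self) -> None:
        self._tokens: Dict[str, Tuple[int, float]] = {}

    def issue(self, audio_id: int, ttl: float = 60.0, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(16)
        self._tokens[token] = (audio_id, now + ttl)
        return token

    def redeem(self, token: str, now: Optional[float] = None) -> Optional[int]:
        now = time.time() if now is None else now
        entry = self._tokens.pop(token, None)  # pop() makes the token single-use
        if entry is None:
            return None
        audio_id, expires_at = entry
        return audio_id if now <= expires_at else None


def classify_source(user_agent: str, query: Dict[str, str]) -> str:
    """Bucket a request into a download channel (rules are illustrative only)."""
    if query.get("src") == "player":
        return "web-player"
    if query.get("src") == "button":
        return "download-button"
    ua = user_agent.lower()
    if any(hint in ua for hint in ("podcast", "itunes", "overcast")):
        return "podcast-feed"
    return "other"


def handle_request(store: TokenStore, log: List[dict], audio_id: int,
                   user_agent: str, query: Dict[str, str],
                   now: Optional[float] = None) -> str:
    """Log the download, then return a redirect URL carrying a fresh token."""
    log.append({"audio_id": audio_id, "source": classify_source(user_agent, query)})
    token = store.issue(audio_id, now=now)
    # Hypothetical storage host; the real redirect target is the file on S3.
    return f"https://storage.example.invalid/audio/{audio_id}?token={token}"
```

Logging before the redirect means every request is counted, including ones where the client never finishes the download.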
This example file (screenshot below) shows you how many times the audio was loaded in our embedded web-player, how many people hit the ‘download’ button next to the player, how many people downloaded the file via an XML Podcast Feed, and any other downloads.
We can also see a chart showing listener retention throughout a specific file. This example chart shows 450 people loaded the file, but only 21 listened through to the 20-second point in the file. This chart only reflects data from our embedded web player, and is generated by an AJAX call that is hooked into the jQuery jPlayer ‘timeupdate’ event callback.
As James Cridland discovered, player stats across the industry are artificially high due to the way browsers pre-load audio files before the user actually hits ‘play’. This is why we show the decay chart – so we know who is actually ‘listening’, as opposed to a drive-by on the article page.
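The decay chart boils down to counting how many plays got at least as far as each time mark. Here is a rough sketch, assuming the player reports the furthest playback position (in seconds) reached by each load, with 0 meaning the file was pre-loaded but never played. The `retention_curve` function and its parameters are hypothetical, not the actual OnAirHope code.

```python
from typing import Iterable, List, Optional, Tuple


def retention_curve(max_positions: Iterable[float],
                    bucket_seconds: int = 10,
                    duration: Optional[float] = None) -> List[Tuple[int, int]]:
    """Return (time_mark, listeners_who_reached_it) pairs.

    max_positions: furthest playback position per load, in seconds.
    duration: file length in seconds; defaults to the furthest position seen.
    """
    positions = list(max_positions)
    if not positions:
        return []
    end = duration if duration is not None else max(positions)
    marks = range(0, int(end) + 1, bucket_seconds)
    return [(mark, sum(1 for p in positions if p >= mark)) for mark in marks]


# Made-up positions shaped like the example above: 450 loads, 21 real listeners.
loads = [0] * 429 + [25] * 21
curve = retention_curve(loads, bucket_seconds=10, duration=30)
```

The first point of the curve is the inflated "loads" figure, and the drop to the next mark shows how much of it was browser pre-loading rather than listening.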
Icecast Streaming Statistics
Back in 2011, I built a system called ‘IcyStats’ – an Icecast logfile parser. While it never took off commercially, Hope had been using this for a number of years due to the lack of a better solution (I’d had to hack at it a few times to make it scale, but even then it was a bit underwhelming in the performance and reliability arenas). As our streaming setup grew, and split across multiple servers, I had to find a better option.
The solution was to build some Streaming Statistics into OnAirHope.
Each of our Icecast Streaming Servers sends the latest access log file to S3 whenever the log is rotated (in our case, many times a day). OnAirHope then has a worker process which finds these logfiles, imports the data, and then creates a number of cached summary reports for the relevant stations.
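The import step might look something like the sketch below. It assumes the access log uses the Apache-combined-style layout with a trailing connection duration in seconds, which is what Icecast writes by default as far as I know; check it against your own logs. The `parse_line` and `summarise` helpers and the summary fields are hypothetical, and fetching the rotated files from S3 is left out.

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Optional

# ip - user [timestamp] "METHOD /mount PROTO" status bytes "referer" "agent" duration
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" (?P<duration>\d+)$'
)


def parse_line(line: str) -> Optional[dict]:
    """Parse one access-log line; returns None for lines that don't match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    return {
        "ip": m["ip"],
        "mount": m["path"].split("?", 1)[0],  # drop any query string
        "status": int(m["status"]),
        "bytes": int(m["bytes"]),
        "agent": m["agent"],
        "logged_at": datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z"),
        "duration": int(m["duration"]),
    }


def summarise(lines: Iterable[str]) -> Dict[str, dict]:
    """Build a per-mount summary of successful listener sessions."""
    summary = defaultdict(lambda: {"sessions": 0, "listener_seconds": 0})
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["status"] != 200:
            continue
        row = summary[entry["mount"]]
        row["sessions"] += 1
        row["listener_seconds"] += entry["duration"]
    return dict(summary)
```

Caching the output of something like `summarise` per station keeps the report pages fast, since the raw logs only need to be read once per rotation.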
Here’s an example of the stats for one station:
This station happens to be an inactive Icecast station I keep around for testing, and the stats shown here are meaningless. You may wonder why the stats are so high. One of the interesting problems faced in online statistics is the impact of ‘robot views’ or ‘phantom users’ – a problem plaguing online advertisers. There is a very high quantity of ‘stream rippers’, or servers which stay connected to a stream with no one really listening, generally for the purposes of relaying streams or doing analysis on the audio. You’ll note ‘DE’ is a popular country to listen from – because server colocation and bandwidth are very cheap in Germany.
The solution, which I’ve partially implemented, is to detect connections in server IP ranges and exclude them from the statistics. This is a fairly effective solution, but requires constant IP WHOIS and ASN lookups – as well as a lookup table of company names and a classification determining if they primarily act as a server provider (bad) or a consumer ISP (good).
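As a rough illustration of the filtering step, the sketch below checks each listener IP against a small table of networks that have already been classified. The table contents, owner names, and function names are all hypothetical (the ranges are reserved documentation addresses, not real providers), and the WHOIS/ASN lookups that would populate such a table are not shown.

```python
import ipaddress
from typing import List

# Hypothetical lookup table: (network, owner, classification).
NETWORK_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "ExampleHosting GmbH", "server"),
    (ipaddress.ip_network("198.51.100.0/24"), "Example Broadband", "consumer"),
]


def classify_ip(ip: str) -> str:
    """Return 'server', 'consumer', or 'unknown' for a listener IP."""
    addr = ipaddress.ip_address(ip)
    for network, _owner, kind in NETWORK_TABLE:
        if addr in network:
            return kind
    return "unknown"


def filter_listeners(sessions: List[dict]) -> List[dict]:
    """Drop sessions that come from networks classified as server providers."""
    return [s for s in sessions if classify_ip(s["ip"]) != "server"]
```

Unknown networks are kept here rather than dropped, so the filter errs towards over-counting; flipping that default would be a reasonable choice if most unclassified traffic turned out to be bots.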
Unfortunately, I haven’t found any documented examples of anyone else doing this sort of filtering well, so any logfile reporting needs to be taken with a grain of salt. As a community station, we don’t sell anything based on this data, but instead use it as a guide to growth of our online listener base.
This feature provides a steady stream of data on what our listeners enjoy and don’t enjoy. While this data isn’t used as the sole metric for song selection and rotation, it does provide additional insight for our Music Directors.
There are basic protections in place to limit votes to one per song, per user (a mixture of Device ID, Cookies, and IP Address).
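One way such a check could work is sketched below. How the three identifiers are actually combined isn't described, so this assumes the strictest reading: a vote is rejected if any identifier has already voted on that song. The `VoteLedger` class and the +1/-1 vote values are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


class VoteLedger:
    """In-memory sketch of one-vote-per-song-per-user enforcement."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[int, str, str]] = set()
        self.totals: Dict[int, int] = {}

    def cast(self, song_id: int, vote: int, device_id: Optional[str] = None,
             cookie_id: Optional[str] = None, ip: Optional[str] = None) -> bool:
        """Record a vote (+1 or -1); return False if it looks like a repeat."""
        identifiers = (("device", device_id), ("cookie", cookie_id), ("ip", ip))
        keys = {(song_id, kind, value) for kind, value in identifiers if value}
        # Reject anonymous votes, and votes where any identifier was seen before.
        if not keys or keys & self._seen:
            return False
        self._seen |= keys
        self.totals[song_id] = self.totals.get(song_id, 0) + vote
        return True
```

Matching on IP alone will block legitimate voters who share a connection (an office or a household), so a softer rule, such as only using the IP when no Device ID or cookie is present, may be a better trade-off.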