Monday, December 30, 2013

Which radio station really has the best variety of songs?


Listening to the radio


Years ago, I listened to good ol' FM radio, and I pretty much had something playing 24/7 (as long as nobody complained about the music). For some reason, though, my "preferred" stations always ended up being acquired by another entity/competitor/whatever, like WLOL and The Edge.


Eventually, I grew tired of the seemingly "limited music variety" offered by the usual radio stations, so these days I usually stream something (still filled with commercials) from Spotify, Pandora, and other sources, where I have a little more control over the variety (although even these providers seem limited in their breadth).

Since it's completely unfair to blatantly claim that the local terrestrial FM radio stations have a limited/repetitive song catalog, I figured I'd "be scientific" and actually collect data on what the stations are playing, and then visualize/graph the information to see what would show up. 

There are probably better ways to spot trends within data, but I prefer to view (and explain) things using visual means, so my approach is to simply present things, and "just see what happens".




Accumulating the data


I figured it'd be easy to gather the data, since most of the local stations' websites provide the "currently playing" song data, and if sites like tunein.com and radiosearchengine.com can provide the information, I should be able to figure it out too.

I started with 93X (KXXR), which seems to be the only "current rock-music station" in the area (yeah, pretty sad). I went to their website, and then fired up the developer tools to identify the means used to retrieve the "currently playing" song information. At the time (Dec 2013), the site used a jQuery JSONP call to retrieve the information, which is a decent workaround if the data provider doesn't support CORS.
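For illustration, a JSONP response is just a JavaScript function call wrapped around a JSON payload, so a non-browser client can strip the wrapper and parse what's inside. This is a minimal sketch, not the code from this project; the callback name and the payload fields are hypothetical.

```javascript
// Strip a JSONP wrapper such as `someCallback({...});` and parse the JSON inside.
function parseJsonp(body) {
  const start = body.indexOf('(');
  const end = body.lastIndexOf(')');
  if (start === -1 || end <= start) {
    throw new Error('Not a JSONP response');
  }
  return JSON.parse(body.slice(start + 1, end));
}

// Hypothetical response body; a real station's callback name and fields will differ.
const sample = 'nowPlayingCallback({"artist":"Example Artist","title":"Example Song"});';
const song = parseJsonp(sample);
console.log(song.artist + ' - ' + song.title);
```

Taking the text between the first "(" and the last ")" keeps the sketch tolerant of a trailing semicolon or whitespace.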

All I had to do was visit the station's website, clear the "Network" tab after everything had finished loading, and then wait for the next song to start. Eventually, entries would appear as the web page would poll the currently playing song, and then update things when needed.

I coded up a simple bit of JavaScript using Node's http get routine and configured it to poll the website every so often. After a little while, my data collector stopped working. I pointed a real web browser at the station's website to see what the problem was, and I was presented with a "You've been blacklisted" type of message from the station's hosting provider! I thought it was a tad ironic that a station that presents itself with a "bad-assed barely legal" image had declared that I was too deviant to be allowed access to their website's information.
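A poller along these lines might look like the sketch below. It is an assumption-laden illustration rather than the original collector: the endpoint URL, the one-minute interval, and the artist/title field names are all hypothetical. The change detector is there so that polling the same song repeatedly records it only once.

```javascript
const http = require('http');

// Fetch a URL and hand the response body to callback(err, body).
function fetchBody(url, callback) {
  http.get(url, (res) => {
    let body = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => callback(null, body));
  }).on('error', callback);
}

// Returns a function that reports true only when the song differs from
// the previously seen one.
function createChangeDetector() {
  let last = null;
  return function hasChanged(song) {
    const key = song.artist + '|' + song.title;
    if (key === last) return false;
    last = key;
    return true;
  };
}

// Usage (not run here): poll a hypothetical endpoint once a minute.
// const hasChanged = createChangeDetector();
// setInterval(() => {
//   fetchBody('http://station.example/nowplaying', (err, body) => {
//     if (err) return console.error(err.message);
//     const song = JSON.parse(body);
//     if (hasChanged(song)) console.log(song);
//   });
// }, 60 * 1000);
```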

My first thought was that they're trying to pretend that they can restrict access to that data in the same way that Major League Baseball tried to "copyright" facts and stats in order to deter the Fantasy Baseball leagues. Eventually, I figured out that it was a simple DDoS type of filter, and that all I had to do was make my data requests look "more like a real web browser".

One of the (many) great features in Chrome's Dev Tools is the option "Copy as cURL". After a few days, the blacklisting automatically cleared (or maybe my outgoing IP address changed), and I was able to access the station's site again. I opened Chrome's Dev Tools once again, and this time I copied everything that the original request used.

As I expected, the request was full of header settings, referer (yeah, it's misspelled), and cookies. I slam-dunked most of the pieces within that request into my "defined stations" database/collection, then hooked up the wonderful node-curl npm module, and unleashed my data-collection code upon the information. After a few hours, the data collection seemed stable, and hadn't been blacklisted. Mission accomplished!
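To make the "look like a real browser" idea concrete, here is a rough sketch that builds request options from a stored station record. Note the swap: it targets Node's built-in http module rather than node-curl, and the station record, its field names, the URL, and the header values are all hypothetical stand-ins for whatever "Copy as cURL" actually produced.

```javascript
// A hypothetical "defined station" record holding values copied from the browser.
const station = {
  name: 'Example FM',
  url: 'http://station.example/api/nowplaying?format=json',
  referer: 'http://station.example/',
  cookie: 'session=abc123',
  userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/31.0 Safari/537.36'
};

// Turn a station record into an options object for http.request().
function buildRequestOptions(st) {
  const u = new URL(st.url);
  return {
    hostname: u.hostname,
    port: u.port || 80, // u.port is '' when the URL has no explicit port
    path: u.pathname + u.search,
    method: 'GET',
    headers: {
      'User-Agent': st.userAgent,
      'Referer': st.referer,
      'Cookie': st.cookie,
      'Accept': 'application/json, text/javascript, */*'
    }
  };
}

const options = buildRequestOptions(station);
console.log(options.headers);
// To send it: require('http').request(options, (res) => { /* read body */ }).end();
```

Keeping the headers in the station record means each station can carry its own referer and cookies without any change to the collection code.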


Deploying the app


As difficult as all of that coding seems, deploying an app is actually the most difficult part (I keep planning on writing up my thoughts, explanations, and experiences about that topic, but it hasn't happened yet). 

When designing/developing an app, you need to have the end-goal in mind from the beginning, or it's going to be a complete train wreck. As a result, when I started imagining up this experiment, I specifically had NodeJS and MongoDB in mind so that I could deploy/use OpenShift (RedHat) and MongoLab.

The team over at OpenShift have a guide to help get started with a new NodeJS app, although I chose not to use their MongoDB cartridge, and preferred to manage the database myself. I eventually discovered that there's a quickstart guide for using OpenShift and MongoLab together, but by then I already had things working. Figures.

I personally love the PaaS, which is basically a "more evolved" form of a web-app container (I've been doing Java Apps since 1997). When I started deploying Java apps "to the cloud" years ago, I was disappointed in the amount of "wakeup time" that was required after hitting an app that had "gone to sleep"  (If you're wondering, Google's App Engine seems to have the fastest/best "wakeup time" for Java-based apps, but maybe that will be another "research project" someday).  Anyway, from my experience, NodeJS apps seem to have an extremely quick wakeup/response time (because humans should never have to wait for a computer).

Getting the app deployed was a breeze (due to the straightforward git-based interface used by OpenShift). I eventually coded up a simple REST-based interface so that I could peek on the collected data without having to dive into the MongoDB shell/console.
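A "peek at the data" endpoint can be very small. The sketch below is only an assumption of what such an interface could look like: the /plays route, the station query parameter, and the in-memory array standing in for the MongoDB collection are all hypothetical.

```javascript
const http = require('http');

// In-memory stand-in for the MongoDB collection (hypothetical sample data).
const plays = [
  { station: 'KXXR', artist: 'Example Artist', title: 'Example Song' },
  { station: 'KQRS', artist: 'Another Artist', title: 'Another Song' }
];

// Handles GET /plays and GET /plays?station=KXXR; anything else gets a 404.
function handleRequest(req, res) {
  const url = new URL(req.url, 'http://localhost');
  if (req.method !== 'GET' || url.pathname !== '/plays') {
    res.writeHead(404, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: 'not found' }));
    return;
  }
  const station = url.searchParams.get('station');
  const result = station ? plays.filter((p) => p.station === station) : plays;
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify(result));
}

// To serve it (not run here): http.createServer(handleRequest).listen(8080);
```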

After a few days, I discovered that there were times where the data stream would contain entries like "Song information not available", which I obviously didn't want to include within my collected data (along with other "glitches" in the data stream). So, it took a few iterations of refinement and data scrubbing/cleansing in order to get a "clean" set of song information. 
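The scrubbing step can be sketched as a simple filter. Only the "Song information not available" string comes from the data stream described above; the other placeholder values and the artist/title field names are hypothetical examples of similar glitches.

```javascript
// Placeholder values to reject (compared after trimming and lowercasing).
const PLACEHOLDERS = ['song information not available', 'unknown', ''];

// True when the entry has a usable artist and title.
function isRealSong(entry) {
  if (!entry || typeof entry.artist !== 'string' || typeof entry.title !== 'string') {
    return false;
  }
  const artist = entry.artist.trim().toLowerCase();
  const title = entry.title.trim().toLowerCase();
  return !PLACEHOLDERS.includes(artist) && !PLACEHOLDERS.includes(title);
}

// Drop glitch entries and normalize whitespace on the rest.
function scrub(entries) {
  return entries
    .filter(isRealSong)
    .map((e) => ({ artist: e.artist.trim(), title: e.title.trim() }));
}

const raw = [
  { artist: ' Example Artist ', title: 'Example Song' },
  { artist: '', title: 'Song information not available' },
  { artist: 'Another Artist' } // missing title
];
console.log(scrub(raw));
```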


Looking at the data from a different view


After a few days of collecting data, and improving the process (which is still ongoing), I wanted to get a better look at what I had accumulated. I've been wanting to do "more sophisticated" graphs with the D3 library, because everything I've done with D3 thus far was akin to printing simple drinking straws with a Stratasys printer.

I had an idea of how I wanted to "see the data", and after lots of searching, I discovered a graph called a "Cluster Dendrogram" with an example, which appears to use a data-protocol similar to Flare.  I updated my NodeJS code to provide a data-feed similar to what the graph expected/used, and had things working pretty easily (which is freakin' amazing, because that almost never happens in real life). 
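The Flare-style data used by the D3 hierarchy examples is a tree of nodes with a name, and either children or a size. As a rough sketch of the conversion (the station → artist → song nesting and the use of play count as size are assumptions, not necessarily what the deployed app does):

```javascript
// Group plays into a station -> artist -> song tree, counting plays per song.
function toFlare(stationName, plays) {
  const artists = new Map(); // artist -> Map(title -> play count)
  for (const { artist, title } of plays) {
    if (!artists.has(artist)) artists.set(artist, new Map());
    const titles = artists.get(artist);
    titles.set(title, (titles.get(title) || 0) + 1);
  }
  return {
    name: stationName,
    children: Array.from(artists, ([artist, titles]) => ({
      name: artist,
      children: Array.from(titles, ([title, count]) => ({ name: title, size: count }))
    }))
  };
}

// Hypothetical sample data.
const tree = toFlare('Example FM', [
  { artist: 'Artist A', title: 'Song 1' },
  { artist: 'Artist A', title: 'Song 1' },
  { artist: 'Artist B', title: 'Song 2' }
]);
console.log(JSON.stringify(tree, null, 2));
```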


After a few more iterations (and deployments) to my OpenShift instance, I was able to generate a graph of the songs played in the last 48 hours by the radio station - and as I had guessed, a small number of artists/bands make up a "significant portion" of its song catalog.
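One way to put a number on "a small number of artists make up a significant portion" is to compute each artist's share of the total plays. This is a hedged sketch with made-up sample data; the field names are assumptions.

```javascript
// Count plays per artist and return them sorted by play count, highest first.
function artistShare(plays) {
  const counts = {};
  for (const p of plays) {
    counts[p.artist] = (counts[p.artist] || 0) + 1;
  }
  return Object.keys(counts)
    .map((artist) => ({
      artist,
      plays: counts[artist],
      share: counts[artist] / plays.length
    }))
    .sort((a, b) => b.plays - a.plays);
}

// Hypothetical sample: one artist accounts for three of four plays.
const shares = artistShare([
  { artist: 'Artist A', title: 'Song 1' },
  { artist: 'Artist B', title: 'Song 2' },
  { artist: 'Artist A', title: 'Song 3' },
  { artist: 'Artist A', title: 'Song 1' }
]);
console.log(shares);
```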


The graph seemed interesting enough, but data isn't very meaningful unless you have a baseline or some other means of comparing things. So I used my existing JSON/cURL based collection engine on the websites for KQRS (which is a "classic rock" sister station of KXXR, and thus was pretty easy to set up), and the station KDWB (yeah, I know), which was a bit more challenging due to the way their JSON data feed/response is structured.

After a few more late-evening sessions (after the kids were in bed) of coding, tweaking, and deploying, I was able to obtain more graphs of the song selections for these stations.  

The station KQRS shows a "wide variety" of songs, because each song in the graph is pretty much played just once. 



The station KDWB is on the opposite end of the spectrum (being a more Top 40 type of format), and thus shows a very small number of artists/songs, each broadcast many times within the time period.



If you want to see the entire graphs in real-time, you can find them here:
http://noderadio-panurgy.rhcloud.com/graph.html

March 4, 2014 - deployment update - I've noticed that the RedHat/OpenShift instance tends to go to "sleep" after an unspecified period, which kills the radio station polling. I've also had ongoing issues with the radio station's ISP blocking my app due to suspected "DDoS Activity". So, I tweaked the code and made it work on OpenShift and CloudFoundry, and then deployed the app to a few other PaaS providers:

AppFog (CloudFoundry, using AWS): http://noderadio.aws.af.cm
Updated July 1, 2014 - AppFog no longer supports instances on the HP Cloud, moved to AppFog on AWS/East.

IBM BlueMix (CloudFoundry @ SoftLayer): http://noderadio.mybluemix.net/
Updated July 1, 2014 - IBM BlueMix domain names have changed, now that they've officially launched the service.

Yet another update - Nov 27, 2014: Tried out Heroku's integration with DropBox and deployed an instance to their service, which increases the resiliency of the application now that it's running on yet another (free) PaaS:  http://noderadio.herokuapp.com

A nice bonus about this is that it's super easy to "scale wide" and provide capacity (and redundancy) by deploying the same app/code to multiple services.

Conclusion or consensus?


After watching the graphs evolve, it definitely appears that stations whose format is biased towards the newer music tend to repeat songs more frequently, and thus "have less variety". On the other end of that spectrum are the stations whose format is biased towards classic (or older) songs, and thus can draw upon a wider variety of songs (and hence, less repetition).
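The comparison between stations can be boiled down to a single rough metric: distinct songs divided by total plays in the window. A value near 1 means almost nothing repeated, while a low value means heavy rotation. This is an illustrative sketch (the metric itself and the field names are assumptions, not something the graphs compute):

```javascript
// Ratio of distinct songs to total plays; 0 for an empty list.
function varietyRatio(plays) {
  if (plays.length === 0) return 0;
  const unique = new Set(plays.map((p) => p.artist + '|' + p.title));
  return unique.size / plays.length;
}

// Hypothetical samples: a heavy-rotation list and a no-repeat list.
const heavyRotation = [
  { artist: 'Artist A', title: 'Hit 1' },
  { artist: 'Artist A', title: 'Hit 1' },
  { artist: 'Artist B', title: 'Hit 2' },
  { artist: 'Artist B', title: 'Hit 2' }
];
const noRepeats = [
  { artist: 'Artist C', title: 'Track 1' },
  { artist: 'Artist D', title: 'Track 2' }
];
console.log(varietyRatio(heavyRotation), varietyRatio(noRepeats));
```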

So yeah, I basically spent two weeks of evenings and weekends to "prove" something that was pretty much "commonly known". Fortunately, my actual goal was to become more familiar with NodeJS, OpenShift, MongoDB (and its "schemaless nature" - which does not mean that the data isn't organized), MongoLab, and D3 - and in that regard, the experiment was a huge success.

I guess I'll stick with Spotify, and songs like BT - Skylarking, which is great background music for coding (and I love night-time long-exposure photography). 

Add a comment below if you happen to have a favorite song/artist/station for coding music!

P.S. I finally pushed the code to a GitHub repo: panurgy/noderadio. Check back for updates, or follow me on Twitter.