Sunday, February 21, 2021

Accumulating and graphing FM station song history (Noderadio v2.0)

tl;dr

This is the second generation of my (super minimalistic) graphing/visualization project, which visibly shows which FM radio stations have lots of variety (KQRS, KZJK, KQQL), and which don't (KDWB).

Check it out:  https://codepen.io/panurgy/full/YzpGpap?s=kqrs

Introduction

About eight years ago, I built a data-collection system that accumulated and graphed information about the song variety (or lack thereof) across several FM radio stations around Minneapolis MN.  A few years after setting that up, things changed at the various hosting providers, and the entire system fell apart (Heroku Cedar is deprecated, mLab was acquired and shut down, etc).

A few months ago, my curiosity was rekindled about the variety of songs, so I revisited this project and updated it - using a newer generation of hosting solutions/options.  Some of the code remains the same (and still mentions "use strict" within the functions!), but other parts were rebuilt from scratch.

Getting the data

The first step was figuring out the data acquisition.  All of the radio stations have new websites, which changed (and broke) my song-info collection code (which essentially relied upon "screen scraping" the info from the station's website).  

Rather than writing and deploying server-side code, I used RunKit to create an assortment of individual endpoints, each of which obtains and returns the "now playing" song information for a specific station. Every station provides its info a bit differently, and they fall into three general categories:

  • HTML data - the station's website sends an HTML string, which contains DOM elements, and the song info is buried within it. The npm package cheerio works great at parsing out the info.
  • JSON data - the station's website returns a JSON string, which is super easy to parse/use. Most of the stations use this format.
  • WebSocket - the station's site opens a WebSocket, and uses an "ask/reply" protocol that responds with a JSON string.
All of this information is discoverable using the browser's dev-tools; the biggest challenge is finding the "needle in the haystack" within all the other network requests (advertisements, metrics, trackers, and more trackers).
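As a rough illustration of the JSON case, here's a minimal sketch that turns a raw response into one common record shape. The payload layout (nowPlaying.artist, nowPlaying.title) and the function name parseNowPlaying are made up for this example; every real station uses its own field names, so treat this as a pattern rather than the code behind my endpoints.

```javascript
// Hypothetical payload: real stations each use their own field names.
const samplePayload = JSON.stringify({
  nowPlaying: { artist: "  Tom Petty ", title: "Free Fallin'" },
});

// Convert a raw JSON string into a common { artist, title, timestamp } record.
// `now` is injectable so the function is easy to test.
function parseNowPlaying(rawJson, now = Date.now()) {
  const data = JSON.parse(rawJson);
  const current = data.nowPlaying || {}; // missing during ads (assumption)
  return {
    artist: (current.artist || "").trim(),
    title: (current.title || "").trim(),
    timestamp: now, // epoch milliseconds
  };
}

console.log(parseNowPlaying(samplePayload));
```

The HTML and WebSocket cases would end by producing the same record shape, so everything downstream only has to understand one format.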

Here's the list of currently supported stations:


Storing the data


The next step was finding a place to store the data.  I needed something that could hold a "large quantity" of information, and possessed the ability to index/query upon the song's timestamp. Previously, I used a MongoDB instance (which worked great!), but this time I had issues getting the recent Node/MongoDB drivers working (some native-code dependency wasn't working), so I searched for something else that's "fully cloud" with a REST API.

I decided upon Google Cloud Firestore, because it's collection-based, capable of storing large quantities of data, and provides fast indexing/querying. I also knew that Zapier has an integration that easily connects with Cloud Firestore (disclaimer - I built that integration).



I decided to create a collection for each station, and used the "UNIX/Epoch timestamp" as the document's identifier.  I could have shortened the timestamps from milliseconds down to seconds (and reduced the length by three characters), but decided to leave things as milliseconds, since that works best with JavaScript's Date object. (btw - unix/epoch timestamps are "the best/only" way to store/preserve timestamps in a database/persistence - buy me a beer and I'll share my experiences).
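To make the millisecond choice concrete, here's a small sketch of going from a Date to a document identifier and back. The helper names toDocumentId and fromDocumentId are only for illustration, not something from the actual project.

```javascript
// Build a document ID from the moment a song was observed.
function toDocumentId(date) {
  return String(date.getTime()); // epoch milliseconds, as a string
}

// Recover a Date from a document ID; no parsing or time zone math needed.
function fromDocumentId(id) {
  return new Date(Number(id));
}

const exampleId = toDocumentId(new Date(Date.UTC(2021, 1, 21, 12, 0, 0)));
console.log(exampleId); // a 13-digit string for present-day dates
console.log(fromDocumentId(exampleId).toISOString());
```

Had I stored seconds instead, every read would need a "multiply by 1000" step before handing the value to Date.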


Running the data collection


Once I had those two pieces worked out, the next step was setting up something that ran every few minutes, retrieved the "current song" from the station's website, and placed it into the database. The easiest option was a group of Zaps at Zapier - with a Zap for each station, which looks like this:


  • The trigger uses Zapier's Webhook integration, which calls one of the RunKit endpoints that I created
  • The filter step discards any data that's missing the song/artist info, which indicates the station is playing advertisements
  • The final step saves the data into the correct collection
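The filter step is configured inside Zapier rather than written as code, but its logic is roughly equivalent to the sketch below. The record shape (artist, title) and the function name hasSongInfo are assumptions for illustration.

```javascript
// Returns true only when both the artist and the title are non-empty strings.
// Records that fail this check are treated as "station is playing ads".
function hasSongInfo(record) {
  const isFilled = (value) => typeof value === "string" && value.trim() !== "";
  return isFilled(record.artist) && isFilled(record.title);
}

// Made-up samples: one real song, two "advertisement" responses.
const samples = [
  { artist: "Prince", title: "Purple Rain" },
  { artist: "", title: "" },
  { title: "   " },
];

console.log(samples.filter(hasSongInfo));
```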

Since the Zaps run/poll about every five minutes, it's possible that a station might play a short song in between two cycles, but this isn't a production-grade experiment.  Eventually, I might convert/replace the Zaps with Cloudflare's Cron triggers for Workers (assuming I have enough free time and attention span, which is unlikely).  In the meantime, Zaps are a fantastic way to get things up and running quickly and easily (with observability into how well things are running).

I currently have these five Zaps running, each configured for a specific station:


Rebuilding the front-end


As the Zaps gradually accumulated the historical data, I started working on rebuilding the front-end.  Since the visualization/graphing code is very minimalistic (it's a single static page), and doesn't require any special server, I decided to use a CodePen to host this piece.  Most of the original (and obsolete) front-end code still worked, so I only had to rebuild the parts that interacted with the database - switching things from MongoDB over to Cloud Firestore.  The two biggest challenges were:
  • Converting the query - in MongoDB, querying is pretty easy: the database call passes over a "fairly simple" JSON object that contains the search's settings.  In Cloud Firestore, the JSON object is a "bit more complex" and requires a Structured Query.
  • Reading the data - in MongoDB, the query returns an array which contains the documents/objects from the database (thus the objects received match the objects in the database).  In Cloud Firestore, the documents aren't "simple JSON", but rather a more detailed format which contains lots of meta-information about each of the document's fields/data-types.  Fortunately, StackOverflow had the answer I needed, to convert those document objects into "plain objects".
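Both challenges can be sketched in a few lines. The first function builds a request body for the Firestore REST API's runQuery method, and the other two flatten the typed values of a returned document into a plain object. This is my own minimal version, not the StackOverflow answer mentioned above, and it assumes each document also carries a numeric timestamp field plus artist and title fields; the project name and sample document are invented.

```javascript
// Build a Structured Query body selecting songs within [startMs, endMs).
// Assumes each document has an integer `timestamp` field (an assumption).
function buildRangeQuery(collectionId, startMs, endMs) {
  const timestampFilter = (op, ms) => ({
    fieldFilter: {
      field: { fieldPath: "timestamp" },
      op,
      value: { integerValue: String(ms) }, // REST sends integers as strings
    },
  });
  return {
    structuredQuery: {
      from: [{ collectionId }],
      where: {
        compositeFilter: {
          op: "AND",
          filters: [
            timestampFilter("GREATER_THAN_OR_EQUAL", startMs),
            timestampFilter("LESS_THAN", endMs),
          ],
        },
      },
      orderBy: [{ field: { fieldPath: "timestamp" }, direction: "ASCENDING" }],
    },
  };
}

// Convert one typed Firestore value into a plain JavaScript value.
function fromFirestoreValue(value) {
  if ("stringValue" in value) return value.stringValue;
  if ("integerValue" in value) return Number(value.integerValue);
  if ("doubleValue" in value) return value.doubleValue;
  if ("booleanValue" in value) return value.booleanValue;
  if ("nullValue" in value) return null;
  if ("timestampValue" in value) return new Date(value.timestampValue);
  if ("mapValue" in value) return fromFirestoreFields(value.mapValue.fields || {});
  if ("arrayValue" in value) return (value.arrayValue.values || []).map(fromFirestoreValue);
  return undefined; // bytes, references, and geo points are not handled here
}

// Convert a document's `fields` map into a plain object.
function fromFirestoreFields(fields) {
  const plain = {};
  for (const [key, value] of Object.entries(fields)) {
    plain[key] = fromFirestoreValue(value);
  }
  return plain;
}

// Invented sample in the shape the REST API returns.
const sampleDocument = {
  name: "projects/example-project/databases/(default)/documents/kqrs/1613908800000",
  fields: {
    artist: { stringValue: "Prince" },
    title: { stringValue: "Purple Rain" },
    timestamp: { integerValue: "1613908800000" },
  },
};

console.log(JSON.stringify(buildRangeQuery("kqrs", 1613822400000, 1613995200000), null, 2));
console.log(fromFirestoreFields(sampleDocument.fields));
```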

Viewing the results


After all that work (and a few days to collect enough historical data), the results show which stations have "lots of variety", and which have "little variety" (and thus, overplay a small set of songs).
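The "variety" measurement boils down to counting how often each song shows up in the collected records. Here's a simplified sketch of that tally; the record shape and the sample data are invented, and the real page does more work to draw the graph.

```javascript
// Count how many times each song appears, most-played first.
function countPlays(records) {
  const counts = new Map();
  for (const { artist, title } of records) {
    const key = `${artist} - ${title}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Invented sample history for illustration.
const history = [
  { artist: "Artist A", title: "Song 1" },
  { artist: "Artist B", title: "Song 2" },
  { artist: "Artist A", title: "Song 1" },
  { artist: "Artist C", title: "Song 3" },
  { artist: "Artist A", title: "Song 1" },
];

for (const [song, plays] of countPlays(history)) {
  console.log(`${plays}x ${song}`);
}
```

A station with lots of variety produces a long list where nearly every count is 1 or 2, while a station with little variety has a short list with a few very large counts at the top.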

For example, KZJK is a Jack FM station, which plays a wide (and sometimes random) assortment of music.  During a two-day span, most songs were only played once or twice:





On the flipside, KDWB is a Top 40 station, which plays a small collection of songs more frequently.  During a two-day span, some songs were played 20 times:





Next Steps

There's a long list of enhancements/improvements that I'd like to perform, but it's unlikely that I'll have the time needed to work on them, such as:
  • Switching from Zaps to Cloudflare, for faster sampling intervals
  • Set up some metrics with Datadog to monitor the data-collection workers
  • Possibly set up Sentry.io error logging when things break/fail
  • Update and clean up the code, and rearrange it into something more polished

Conclusion

Whenever I'm listening to FM stations, I occasionally encounter a song that seems like it's played "all the time" - but now there's a convenient way to find out whether that's true, or just my perspective (or crappy luck, and listening to a station only when "that song" is playing).

If you'd like to replicate this experiment using your local/favorite FM radio stations, most of the pieces are free to use (RunKit, Cloud Firestore, CodePen).  Zapier's Webhook integration requires a paid plan (but may be available during their introductory trial period?); otherwise, you can purchase a plan and cancel anytime. Overall, the biggest challenge is figuring out how to collect/parse the song information from a station's website.