Big Data Project: Part 1 (Preparation)

April 20, 2019
Last updated: April 20, 2019 20:06

Never lose sight of your dreams. This project is our dream, and our goal is to make it a reality. It will be a long road, but it will be worthwhile wherever we end up. Our dream is to visualize and analyze data spanning the entire music industry. We want to take this subjective art, augment it with objective facts and numbers, and learn how music has changed over time. We can only imagine what we might learn from this. This blog will become a series of posts in which we take you along on our journey to data insight.

What are we looking for?

As much data as possible. We are considering the MusicBrainz database for all music metadata, as it is a well-maintained, community-edited encyclopedia of everything music. It is open source and free for anyone to access and download. It is also an enormous relational database, and we are still working out the best way to capture and analyze it.
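To give a feel for what analyzing a relational music database could look like, here is a minimal sketch using Python's built-in sqlite3 module. The two tables, their columns, and the sample rows are made up for illustration; they are not the real MusicBrainz schema, which is far larger. The helper releases_per_decade is likewise a hypothetical name.

```python
import sqlite3

# Toy schema for illustration only; NOT the real MusicBrainz schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE release (
        id        INTEGER PRIMARY KEY,
        artist_id INTEGER NOT NULL REFERENCES artist(id),
        title     TEXT NOT NULL,
        year      INTEGER NOT NULL
    );
""")

# Made-up sample rows standing in for real metadata.
conn.executemany(
    "INSERT INTO artist (id, name) VALUES (?, ?)",
    [(1, "Artist A"), (2, "Artist B")],
)
conn.executemany(
    "INSERT INTO release (artist_id, title, year) VALUES (?, ?, ?)",
    [
        (1, "First", 1987),
        (1, "Second", 1992),
        (2, "Debut", 1995),
        (2, "Follow-up", 2003),
    ],
)


def releases_per_decade(connection):
    """Count releases per decade using integer division on the year."""
    return connection.execute(
        """
        SELECT (year / 10) * 10 AS decade, COUNT(*) AS n
        FROM release
        GROUP BY decade
        ORDER BY decade
        """
    ).fetchall()


for decade, count in releases_per_decade(conn):
    print(decade, count)
```

The same kind of grouped query could later be pointed at genres or labels instead of release years, once we know which tables we actually end up with.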

We are also looking to incorporate song lyrics. We are thinking about integrating the Genius API or other services that offer lyrics online for free. In a perfect world we would also have streaming data. However, this is something we can only gather from our partners, as streaming services do not give anyone access to their raw data (majors get access to their data API and labels to their label services, while artists only get Spotify for Artists).

Why are we doing this?

As music lovers and data enthusiasts, we have always wondered what data on the music industry could unveil. Could you see how genres are born and then become more niche as time goes on? Can you see topics in lyrics change over time? There are so many questions you could potentially answer with data. At the same time, it allows for limitless exploration and the ability to draw conclusions based on more than just industry expertise.

What is next?

We will select the data sources we want to use and decide how to combine and centralize them so we can run analyses on them. This will most likely be a server running in the cloud, so that we can scale memory, CPUs and availability as needed. In general, it comes down to what is most efficient and effective for us. We hope to have more news on this in the next few months.
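As a rough idea of what combining sources could involve, the sketch below joins metadata records with lyrics records on a normalized (artist, title) key. Everything in it is an assumption for illustration: the field names, the sample records, and the helpers normalize and combine are hypothetical, and real sources would need a sturdier matching strategy than lowercased names.

```python
def normalize(text):
    """Lowercase and collapse whitespace so near-identical names match."""
    return " ".join(text.lower().split())


def combine(metadata_rows, lyrics_rows):
    """Attach lyrics to each metadata row; lyrics is None when no match exists."""
    lyrics_by_key = {
        (normalize(row["artist"]), normalize(row["title"])): row["lyrics"]
        for row in lyrics_rows
    }
    combined = []
    for row in metadata_rows:
        key = (normalize(row["artist"]), normalize(row["title"]))
        combined.append({**row, "lyrics": lyrics_by_key.get(key)})
    return combined


# Made-up records standing in for a metadata source and a lyrics source.
metadata = [
    {"artist": "Artist A", "title": "First Song", "year": 1987},
    {"artist": "Artist B", "title": "Debut", "year": 1995},
]
lyrics = [
    {"artist": "artist a", "title": "First  Song", "lyrics": "la la la"},
]

for record in combine(metadata, lyrics):
    print(record)
```

Keeping unmatched rows with an empty lyrics field, rather than dropping them, means we can always see how much of the metadata still lacks lyrics.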