CrossFit Open Data

Admittedly, I’m a CrossFitter. I even coach part-time at CrossFit Hierarchy Ivy City (stop by and check us out if you’re in NE DC). Between my day job, my part-time coaching, and my own workouts, I decided to take a break from the Android app I’ve been working on to play around with Python data analysis for something that might tie into other areas of my life.

CrossFit + Data Analysis?

Every year, CrossFit HQ hosts the CrossFit Games. To qualify for the Games, athletes have to complete the five-week CrossFit Open in the top 50 for their region and then finish in the top five of their multi-day Regional event. Over 200K men and 140K women of all shapes and ages competed in the CrossFit Open in 2016. The beautiful thing for data nerds is that all of that data is freely available online. Additionally, every registered athlete has a profile page on Games.Crossfit.com where they can list their gym, height, weight, max lifts, and benchmark scores. Realizing this, and considering the number of open-source data analysis libraries available for Python, I really wanted to play with that data. In fact, several people have worked on CrossFit Open data analysis over the years (http://cfganalysis.blogspot.com/, https://github.com/swiftsam/CrossfitRankings, http://onthesharpend.com/2014/03/09/crossfit-open-2014-stats-ranks-and-data-nerding/, etc.).

I have several thoughts on what type of information I want to pull out. My initial look is going to be at the performance of different gyms (or affiliates or “boxes”). My theory is that by looking at the scores of athletes who compete from a given gym over the years, we could start to determine which gyms (on average) show the most significant improvement year over year. From there, it may be worth looking at their programming to see how they have focused their efforts.

Gathering Score Data

To start, I decided to gather the data that I wanted to sift through. After noticing that Games.Crossfit.com had a filterable leaderboard, I figured I could build a Python web scraper to pull down the data. Fortunately, I quickly discovered OpeNG, an open Angular dashboard with an easily accessible API.

Source Code on GitHub: https://github.com/captamericadevs/CFOpenData

Initially, I turned to my sequential programming roots and decided to try scraping the performance data using Python’s requests library. I started with a single HTTP GET request with a “number per page” parameter equal to the total number of athletes in the Male Rx division (over 140K). Unfortunately, the server rejected this (surprise, surprise), so I decided to break the athletes down into pages and make a separate GET request for each page. With 5950 pages at 30 athletes per page, sequential GET requests were quickly revealed to be a bad idea. After briefly considering Python’s asynchronous requests library grequests, I decided to use the new-to-3.4 library asyncio. Asyncio allows for single-threaded concurrent coroutines, which lets the code be written in a sequential fashion. The concurrency comes from an event loop that runs on a single thread. While one event is executing, no other task runs on the thread; however, when that event is awaiting an action, it can suspend itself (i.e. yield from) to allow the next event to begin execution. Since the HTTP GET requests all need to execute and then await a response from the server, this seemed like a good solution.
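To make that concrete, here is a minimal sketch of the asyncio idea, assuming aiohttp as the HTTP client (the post only names asyncio, so the client and the endpoint here are placeholders, not the real OpeNG API). The original code targeted Python 3.4’s yield from coroutines, but the shape is the same: schedule every GET and let the event loop interleave the waiting.

```python
import asyncio
import aiohttp

# Hypothetical leaderboard endpoint -- a placeholder, not the real API URL.
LEADERBOARD_URL = "https://example.com/leaderboard"

async def fetch_page(session, page):
    """GET a single leaderboard page and return its body."""
    async with session.get(LEADERBOARD_URL, params={"page": page}) as resp:
        return await resp.text()

async def fetch_all(pages):
    """One coroutine per page; gather() lets their waits overlap."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_page(session, p) for p in pages))

if __name__ == "__main__":
    # e.g. the first ten of the ~5950 Male Rx pages
    pages_html = asyncio.get_event_loop().run_until_complete(fetch_all(range(1, 11)))
```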

In extractScores.py, I implemented the asyncio event loop to create events for all 5950 pages and quickly outran the number of sockets available. Clearly some sort of throttling needed to be implemented so that there were enough resources to listen for responses to the 5950 HTTP GET requests. The solution I settled on segments the total number of pages into smaller blocks. Based on a previous “IOError: cannot watch more than 1024 sockets”, I decided to keep the block size under 1024 pages. The block size also dictates the number of semaphores generated to control the number of concurrent events. Finally, after responses are received for all the requests in a block (averaging 13s on my system and connection), I extract the scores and store them in a CSV. The CSV is appended after each block, so even on a crash, the data from previous blocks remains available.
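Below is a sketch of that throttling scheme, not the exact extractScores.py implementation: pages are processed in blocks kept under the 1024-socket ceiling, a semaphore caps the number of in-flight requests, and the CSV is appended after every block so a crash only costs the block in progress. The URL and JSON field names are assumptions.

```python
import asyncio
import csv
import aiohttp

LEADERBOARD_URL = "https://example.com/leaderboard"   # placeholder endpoint
BLOCK_SIZE = 900                                       # stay under the 1024-socket limit
TOTAL_PAGES = 5950

async def fetch_page(session, sem, page):
    async with sem:                                    # semaphore limits concurrent sockets
        async with session.get(LEADERBOARD_URL, params={"page": page}) as resp:
            return await resp.json()

async def fetch_block(pages):
    sem = asyncio.Semaphore(BLOCK_SIZE)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_page(session, sem, p) for p in pages))

def scrape_scores():
    loop = asyncio.get_event_loop()
    for start in range(1, TOTAL_PAGES + 1, BLOCK_SIZE):
        block = range(start, min(start + BLOCK_SIZE, TOTAL_PAGES + 1))
        results = loop.run_until_complete(fetch_block(block))
        # append so data from earlier blocks survives a crash
        with open("scores.csv", "a", newline="") as f:
            writer = csv.writer(f)
            for page_json in results:
                for athlete in page_json.get("athletes", []):   # field names assumed
                    writer.writerow([athlete.get("athleteid"), athlete.get("score")])
```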

During the score extraction, I also extract the athleteID. As stated previously, each athlete has a unique ID that lets you access their profile page at http://games.crossfit.com/athlete/{athleteID}. Every time the script extracts scores, it appends to a list of athlete IDs for the next step.
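Something like this, where the IDs collected from each page feed the list of profile URLs (the URL pattern is from the post; the JSON field name is an assumption):

```python
# Collect athlete IDs alongside the scores, then build profile URLs for the next step.
athlete_ids = []

def collect_ids(page_json):
    for athlete in page_json.get("athletes", []):      # field name assumed
        athlete_ids.append(athlete.get("athleteid"))

profile_urls = ["http://games.crossfit.com/athlete/{}".format(aid) for aid in athlete_ids]
```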

Gathering Profile Data

Using the list of IDs, I followed nearly the same process to extract profile data. The difference is that instead of 30 athlete scores per HTTP GET request, each request fetches a single athlete’s profile (so ~140K GET requests). This may take a few minutes. This class also employs the basic .find() function of BeautifulSoup4 to scrape the right data off the profile page.
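The parsing side looks roughly like this; the tag and class names are illustrative guesses, not the actual games.crossfit.com markup.

```python
from bs4 import BeautifulSoup

def parse_profile(html):
    """Pull label/value pairs off a profile page with BeautifulSoup4's .find()."""
    soup = BeautifulSoup(html, "html.parser")
    profile = {}
    # hypothetical: stats laid out as label/value rows in a table
    stats = soup.find("table", class_="profile-stats")
    if stats:
        for row in stats.find_all("tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if len(cells) == 2:
                profile[cells[0]] = cells[1]
    return profile
```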

Gathering Affiliate Data

Like the OpeNG API, athlete profile data on games.crossfit.com is static and easily scrapable with BeautifulSoup4. However, I ran into a small issue with the official CrossFit Affiliates page: it uses an AJAX loader to load additional gyms once the browser scrolls to the bottom of the page. This means the full data isn’t available in the loaded HTML file for BeautifulSoup to read. But I knew the data had to be stored somewhere nearby, since at some point during loading the page has to request it. So I opened up the handy-dandy Chrome Developer Tools under the Settings menu in my browser.

Under the Network tab, you can load a page and view each of the GET calls made by the source. One of the calls from the affiliates list page MUST be for the data table we want access to. Thanks to the AJAX loader, all I had to do was scroll to the bottom of the page and watch what loads in the Network tab’s Name column.

[Screenshot: the cfaffiliatedata request appearing in the Network tab]

Boom!

So now I know that the page is providing parameters to a PHP file (https://www.crossfit.com/cf/find-a-box.php?page=3&country=&state=&city=&type=Commercial) to load the data. By following the link, you’ll see that the PHP file returns JSON-formatted data. I can simply loop through multiple calls to this PHP file, supplying increasing page numbers, until the returned JSON object equals {“affiliates”:[]}. How cool is that? Three years ago you could have made $150 doing this.
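That loop can be as simple as the following; a plain requests version is shown for illustration, though the real script may batch these calls the same way as the score pages.

```python
import requests

BOX_URL = "https://www.crossfit.com/cf/find-a-box.php"

def fetch_affiliates():
    """Page through find-a-box.php until it returns an empty affiliates list."""
    affiliates = []
    page = 1
    while True:
        resp = requests.get(BOX_URL, params={
            "page": page, "country": "", "state": "", "city": "", "type": "Commercial",
        })
        batch = resp.json().get("affiliates", [])
        if not batch:          # {"affiliates": []} means we've paged past the last gym
            break
        affiliates.extend(batch)
        page += 1
    return affiliates
```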

Now I’ve got tables of performance by athlete and week over each year of the CrossFit Open. I also have tables with details of individual athletes’ profile characteristics. And I have a table of all the CrossFit gyms, their websites, locations, and more. Tables on tables on tables. Now let me find something to do with it all.
