Open Data from 2012-2016

Registration for the 2017 CrossFit Open has begun. Continuing from my previous post, I have finished pulling data for the Open events from 2012 to 2016. I excluded 2011 (the first official year) due to its low participation and limited data availability. Additionally, because I am looking for trends and additional divisions have been added over the years, I only downloaded the data for the core Men’s and Women’s Rx’d divisions each year. That is, the Master’s and Teen divisions didn’t exist in 2012, so I didn’t download those for 2016. That will eliminate some performance data from individuals who crossed into other divisions, but it should provide consistent enough data points for those in the “core” divisions for me to conduct a valid analysis.

To get the data, I first scraped the leaderboard for each year, logging the scores of each athlete who registered. The Pandas DataFrame was built in extractScores.py using these headers:

columns=('Id', 'Name', 'Division', 'OverallRank', 'Rank', 'Wk1_Score', 'Wk1_Rank', 'Wk1a_Score', 'Wk1a_Rank', 'Wk2_Score', 'Wk2_Rank', 'Wk3_Score', 'Wk3_Rank', 'Wk4_Score', 'Wk4_Rank', 'Wk5_Score', 'Wk5_Rank')
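As a rough illustration of how scraped rows could be collected under those headers, here is a minimal sketch. The function name `rows_to_frame` and the example row are made up for illustration; this is not the actual contents of extractScores.py, and the scraping step itself is left out.

```python
import pandas as pd

# Column headers as listed in the post.
COLUMNS = ('Id', 'Name', 'Division', 'OverallRank', 'Rank',
           'Wk1_Score', 'Wk1_Rank', 'Wk1a_Score', 'Wk1a_Rank',
           'Wk2_Score', 'Wk2_Rank', 'Wk3_Score', 'Wk3_Rank',
           'Wk4_Score', 'Wk4_Rank', 'Wk5_Score', 'Wk5_Rank')


def rows_to_frame(rows):
    """Build a DataFrame from scraped rows (dicts keyed by column name).

    Hypothetical helper. Columns a row lacks (for example the Wk1a
    columns outside 2015) are left as NaN.
    """
    return pd.DataFrame(rows, columns=list(COLUMNS))


# Made-up example row, not real leaderboard data.
example_rows = [
    {'Id': 1, 'Name': 'Example Athlete', 'Division': 'Men',
     'OverallRank': 10, 'Rank': 10, 'Wk1_Score': '150', 'Wk1_Rank': 12},
]
scores = rows_to_frame(example_rows)
```

Keeping every year on the same 17 columns means the yearly files can be loaded the same way later, with the Wk1a columns simply empty for years other than 2015.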

Note that only 2015 included a Wk1a event; all other years had only Wk1 through Wk5. For help reading the table, please see this page at games.crossfit.com.

Once I had downloaded and saved a CSV file of the Men’s and Women’s scores for each year, I loaded each score sheet using Dataprocess\CFOpenDataProcess.py. There I took the Id and Rank columns for each year and created a matrix indicating whether each Id registered for and completed the Open in each year. The result is a matrix like the following:

Id  2012 Reg  2012 Finish  2013 Reg  2013 Finish  2014 Reg  2014 Finish  2015 Reg  2015 Finish  2016 Reg  2016 Finish
82  1         1            1         1            1         1            1         0            0         0
84  1         1            1         1            1         1            1         1            0         0
86  1         1            1         1            1         1            1         1            1         1
88  1         1            1         1            1         1            1         1            1         1
92  1         1            1         1            1         1            1         1            1         1
93  1         1            1         1            1         1            1         1            1         0
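A matrix like this could be assembled along the following lines. This is only a sketch: the function name is hypothetical, and the rule for "Finish" is an assumption (an Id counts as registered if it appears on that year's leaderboard, and as finished if its Rank parses as a number), since the exact rule used in CFOpenDataProcess.py is not spelled out here.

```python
import pandas as pd


def participation_matrix(frames_by_year):
    """Return one row per Id with '<year> Reg' and '<year> Finish' flags.

    frames_by_year maps a year to that year's score DataFrame, which
    needs at least the 'Id' and 'Rank' columns.
    ASSUMPTION: registered = the Id appears in that year's data;
    finished = its Rank is numeric. The real rule may differ.
    """
    all_ids = sorted(set().union(*(df['Id'] for df in frames_by_year.values())))
    matrix = pd.DataFrame(index=pd.Index(all_ids, name='Id'))
    for year in sorted(frames_by_year):
        df = frames_by_year[year].drop_duplicates('Id').set_index('Id')
        rank = pd.to_numeric(df['Rank'], errors='coerce')
        registered = pd.Series(matrix.index.isin(df.index), index=matrix.index)
        # Ids missing from this year are filled in as "not finished".
        finished = rank.notna().reindex(matrix.index, fill_value=False)
        matrix[f'{year} Reg'] = registered.astype(int)
        matrix[f'{year} Finish'] = finished.astype(int)
    return matrix.reset_index()


# Tiny made-up input: two years, three athletes.
frames = {
    2012: pd.DataFrame({'Id': [82, 84], 'Rank': [5, 9]}),
    2013: pd.DataFrame({'Id': [82, 86], 'Rank': [None, 3]}),
}
m = participation_matrix(frames)
```

In the made-up input, Id 82 registers in both years but has no rank in 2013, so it gets a 1 under "2013 Reg" and a 0 under "2013 Finish".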

Using these participation matrices and matplotlib, I was able to plot registration versus participation for five years of the CrossFit Open.

[Figure: Individual Men's Rx Participation 2012-2016]

[Figure: Individual Women's Rx Participation 2012-2016]
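For anyone wanting to reproduce charts like these from a participation matrix, the counting and plotting could look roughly like this. Both function names are hypothetical and the grouped-bar layout is a guess at a reasonable chart type, not necessarily what the original plots used. The sample rows are the six-row excerpt shown earlier, so the totals are for that excerpt only.

```python
def yearly_totals(rows, years):
    """Count registered and finished athletes per year.

    rows: iterable of dicts keyed like the matrix header, for example
    from csv.DictReader over a saved matrix (values may be strings).
    """
    registered = {year: 0 for year in years}
    finished = {year: 0 for year in years}
    for row in rows:
        for year in years:
            registered[year] += int(row[f'{year} Reg'])
            finished[year] += int(row[f'{year} Finish'])
    return registered, finished


def plot_participation(registered, finished, title):
    """Draw grouped bars of registered vs. finished athletes per year."""
    # Imported here so the counting above works without matplotlib.
    import matplotlib.pyplot as plt

    years = sorted(registered)
    x = list(range(len(years)))
    width = 0.4
    fig, ax = plt.subplots()
    ax.bar([i - width / 2 for i in x], [registered[y] for y in years],
           width, label='Registered')
    ax.bar([i + width / 2 for i in x], [finished[y] for y in years],
           width, label='Finished')
    ax.set_xticks(x)
    ax.set_xticklabels([str(y) for y in years])
    ax.set_ylabel('Athletes')
    ax.set_title(title)
    ax.legend()
    return fig


YEARS = [2012, 2013, 2014, 2015, 2016]
HEADER = ['Id'] + [f'{y} {kind}' for y in YEARS for kind in ('Reg', 'Finish')]
SAMPLE = [  # the six rows excerpted in the post
    [82, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    [84, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [86, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [88, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [92, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [93, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
]
rows = [dict(zip(HEADER, values)) for values in SAMPLE]
registered, finished = yearly_totals(rows, YEARS)
```

Calling `plot_participation(registered, finished, "Individual Men's Rx Participation 2012-2016")` returns a matplotlib figure that can be saved with `fig.savefig(...)`.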

From here, I’m interested in looking at trends in gym and athlete performance over the years. To start, I needed profile information for all of the athletes who participated in the Individual Rx divisions from 2012 to 2016. In DataExtract\gatherProfiles.py, I read in the participation matrices to build a comprehensive list of athlete Ids. Passing this list into getProfile.py, I then downloaded the available stats for all the Ids still up on games.crossfit.com. While I was working on this script, the games.crossfit.com site was redesigned, so I had to rewrite the BeautifulSoup functions from the initial build.
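The Id-gathering step could be sketched as below. The function name and the `already_downloaded` option are hypothetical additions for illustration (the second is handy when a long download is interrupted, as happened with the site redesign), and the profile scraping itself is omitted because it depends on the page layout.

```python
def collect_ids(matrices, already_downloaded=()):
    """Return the sorted, de-duplicated Ids found across matrices.

    matrices: iterable of row collections, each row a dict with an 'Id'
    key (for example the men's and women's participation matrices).
    already_downloaded: Ids to skip when resuming an interrupted run
    (hypothetical convenience, not from the original scripts).
    """
    skip = {int(i) for i in already_downloaded}
    ids = set()
    for rows in matrices:
        for row in rows:
            athlete_id = int(row['Id'])
            if athlete_id not in skip:
                ids.add(athlete_id)
    return sorted(ids)


# Made-up rows; Ids read from CSV files would arrive as strings.
men = [{'Id': '82'}, {'Id': '84'}]
women = [{'Id': '84'}, {'Id': '90'}]
todo = collect_ids([men, women], already_downloaded=[82])
```

Using a set means an Id that shows up in more than one matrix is only fetched once.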

Now that I have the athletes, their scores, their stats, and information about their gyms, I have some ideas about what I can pull out regarding affiliate participation. I have also been exploring whether I can build a prediction system that estimates performance in a workout based on athlete stats. Yes, I write equations on the whiteboard.

[Image: predictor equations on the whiteboard]

We’ll see what I can come up with.

Original Post
Score Data Sheets 2012-2016
Profile Sheets
Participation Matrices
