Registration for the CrossFit Open 2017 has begun. Continuing from my previous post, I have finished pulling data for the Open events from 2012 to 2016. I excluded 2011 (the first official year) due to low participation and limited data availability. Additionally, because I am looking for trends and new divisions have been added each year, I only downloaded the data for the core Men’s and Women’s Rx’d divisions each year. That is, the Master’s and Teen divisions didn’t exist in 2012, so I didn’t download them for 2016 either. That will eliminate some performance data from individuals who crossed into other divisions, but it should provide consistent enough data points for those in the “core” divisions for me to conduct a valid analysis.
In order to get the data, I first went in and scraped the leaderboard for each year, logging the scores of each athlete who registered. The Pandas DataFrame was built in extractScores.py using these headers:
columns=('Id', 'Name', 'Division', 'OverallRank', 'Rank', 'Wk1_Score', 'Wk1_Rank', 'Wk1a_Score', 'Wk1a_Rank', 'Wk2_Score', 'Wk2_Rank', 'Wk3_Score', 'Wk3_Rank', 'Wk4_Score', 'Wk4_Rank', 'Wk5_Score', 'Wk5_Rank')
Note that only 2015 included a Wk1a event. All other years ran Wk1 through Wk5. For help reading the table section, please see this page at games.crossfit.com.
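As a rough sketch of how a frame with these headers could be assembled: the helper name `build_scores_frame`, the sample row, and the choice to drop the Wk1a columns for every year other than 2015 are all assumptions for illustration, not the actual contents of extractScores.py.

```python
import pandas as pd

COLUMNS = ('Id', 'Name', 'Division', 'OverallRank', 'Rank',
           'Wk1_Score', 'Wk1_Rank', 'Wk1a_Score', 'Wk1a_Rank',
           'Wk2_Score', 'Wk2_Rank', 'Wk3_Score', 'Wk3_Rank',
           'Wk4_Score', 'Wk4_Rank', 'Wk5_Score', 'Wk5_Rank')


def build_scores_frame(rows, year):
    """Build a scores DataFrame from a list of dicts keyed by column name.

    Hypothetical helper: columns missing from a row are filled with NaN.
    """
    df = pd.DataFrame(rows, columns=list(COLUMNS))
    if year != 2015:
        # Assumption: only 2015 had a Wk1a event, so drop those columns otherwise.
        df = df.drop(columns=['Wk1a_Score', 'Wk1a_Rank'])
    return df


# Made-up row standing in for one scraped leaderboard entry.
sample_rows = [
    {'Id': 1001, 'Name': 'Athlete A', 'Division': 'Men',
     'OverallRank': 1, 'Rank': 1, 'Wk1_Score': '225', 'Wk1_Rank': 3},
]

scores_2015 = build_scores_frame(sample_rows, 2015)
scores_2016 = build_scores_frame(sample_rows, 2016)
# scores_2016.to_csv('scores_2016_men.csv', index=False)  # file name is illustrative
```

Keeping one fixed column tuple means every year's CSV lines up the same way, which makes the later year-over-year comparison simpler.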
Once I had downloaded and saved a CSV file of the Men’s and Women’s scores for each year, I loaded each score spreadsheet using Dataprocess\CFOpenDataProcess.py. Here I took the Id and Rank columns for each year and created a matrix indicating whether each Id registered for and completed the Open in each year. The result is a matrix with the following columns:
| Id | 2012 Reg | 2012 Finish | 2013 Reg | 2013 Finish | 2014 Reg | 2014 Finish | 2015 Reg | 2015 Finish | 2016 Reg | 2016 Finish |
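A minimal sketch of how a matrix like this could be built from the per-year score frames. The function name `participation_matrix` and the demo data are hypothetical, and the rule used here (an Id on the leaderboard counts as registered, and a non-empty Rank counts as finished) is my assumption for illustration rather than the exact logic in CFOpenDataProcess.py.

```python
import pandas as pd


def participation_matrix(scores_by_year):
    """scores_by_year: dict mapping year -> DataFrame with 'Id' and 'Rank' columns."""
    # Comprehensive, sorted list of every Id seen in any year.
    all_ids = sorted(set().union(*(df['Id'] for df in scores_by_year.values())))
    matrix = pd.DataFrame(index=pd.Index(all_ids, name='Id'))
    for year, df in sorted(scores_by_year.items()):
        registered = df['Id']
        # Assumption: a missing Rank means the athlete did not finish.
        finished = df.loc[df['Rank'].notna(), 'Id']
        matrix[f'{year} Reg'] = matrix.index.isin(registered)
        matrix[f'{year} Finish'] = matrix.index.isin(finished)
    return matrix.reset_index()


# Made-up data for two years, just to show the shape of the result.
demo = {
    2012: pd.DataFrame({'Id': [1, 2], 'Rank': [10, None]}),
    2013: pd.DataFrame({'Id': [2, 3], 'Rank': [5, 7]}),
}
matrix = participation_matrix(demo)
print(matrix)
```

Each row is one athlete Id, with a pair of boolean columns per year.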
Using these participation matrices and matplotlib, I was able to plot registration versus participation for five years of the CrossFit Open.
Individual Mens Rx Participation 2012-2016
Individual Womens Rx Participation 2012-2016
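For anyone wanting to reproduce plots like these, here is a small sketch using matplotlib. The grouped bar chart, the function names `yearly_counts` and `plot_participation`, and the demo matrix are assumptions for illustration; they are not necessarily how the original figures were drawn.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def yearly_counts(matrix, years):
    """Count registered and finished athletes per year from the boolean columns."""
    reg = [int(matrix[f'{y} Reg'].sum()) for y in years]
    fin = [int(matrix[f'{y} Finish'].sum()) for y in years]
    return reg, fin


def plot_participation(matrix, years, title):
    """Draw side-by-side bars of registered vs. finished athletes for each year."""
    reg, fin = yearly_counts(matrix, years)
    positions = list(range(len(years)))
    width = 0.4
    fig, ax = plt.subplots()
    ax.bar([p - width / 2 for p in positions], reg, width, label='Registered')
    ax.bar([p + width / 2 for p in positions], fin, width, label='Finished')
    ax.set_xticks(positions)
    ax.set_xticklabels([str(y) for y in years])
    ax.set_ylabel('Athletes')
    ax.set_title(title)
    ax.legend()
    return fig


# Made-up participation matrix with the same column layout as above.
demo_matrix = pd.DataFrame({
    'Id': [1, 2, 3],
    '2012 Reg': [True, True, False], '2012 Finish': [True, False, False],
    '2013 Reg': [False, True, True], '2013 Finish': [False, True, True],
})
counts = yearly_counts(demo_matrix, [2012, 2013])
fig = plot_participation(demo_matrix, [2012, 2013], 'Demo Participation')
# fig.savefig('participation_demo.png')  # output file name is illustrative
plt.close(fig)
```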
From here, I’m interested in looking at trends in gym and athlete performance over the years. To start, I needed profile information for all of the athletes who participated in the Individual Rx divisions from 2012 to 2016. In DataExtract\gatherProfiles.py, I read in the participation matrices to build a comprehensive list of athlete Ids. Passing this list into getProfile.py, I then downloaded the available stats for all the Ids still up on games.crossfit.com. While I was working on this script, the games.crossfit.com site was redesigned, so I had to rewrite the BeautifulSoup functions from the initial build.
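The Id-gathering step can be sketched roughly as follows. The names `gather_ids`, `collect_profiles`, and `fake_fetch` are hypothetical, and the actual download and BeautifulSoup parsing are left behind a caller-supplied function because the page structure of games.crossfit.com is not described here.

```python
import pandas as pd


def gather_ids(*matrices):
    """Merge the 'Id' columns of several participation matrices into one sorted list."""
    ids = set()
    for matrix in matrices:
        ids.update(int(i) for i in matrix['Id'])
    return sorted(ids)


def collect_profiles(ids, fetch_profile):
    """Call fetch_profile(athlete_id) for each Id.

    fetch_profile is a stand-in for the real scraper; it is assumed to return
    a dict of stats, or None when the profile is no longer available.
    """
    profiles, missing = {}, []
    for athlete_id in ids:
        profile = fetch_profile(athlete_id)
        if profile is None:
            missing.append(athlete_id)
        else:
            profiles[athlete_id] = profile
    return profiles, missing


# Made-up matrices and a fake fetcher, purely to exercise the flow.
men = pd.DataFrame({'Id': [3, 1]})
women = pd.DataFrame({'Id': [2, 3]})


def fake_fetch(athlete_id):
    return None if athlete_id == 2 else {'Id': athlete_id}


all_ids = gather_ids(men, women)
profiles, missing = collect_profiles(all_ids, fake_fetch)
```

Tracking the missing Ids separately makes it easy to see how many profiles had already been taken down.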
Now that I have the athletes, their scores, their stats, and information about their gyms, I have some ideas about what I can pull regarding affiliate participation. I have also been exploring whether I can build a prediction system that estimates performance in a workout based on athlete stats. Yes, I write equations on the whiteboard.
We’ll see what I can come up with.