Olympics Boxplot
[7/23/2012 Addition: I’ve updated these plots using ggplot2 to look nicer. They can be found here.]
Recently, I saw this pretty cool chart at the Washington Post (I originally saw the chart at this wonderful blog here) about the ages of olympians from the past three olympics. I commented to myself that I thought it would be more interesting with boxplots of the data, rather than simple ranges, and I also wondered what it would look like if we used data from all of the past olympics.
So, I wrote some R code and began scraping sports-reference.com/olympics to get a data set with all of the olympic athletes from all of the games. This took me quite some time (and work kept getting in the way), but I eventually got it right and collected the data.
Here are some of the resulting graphs:
Below is a graph of side-by-side boxplots of age for each sport by gender, with blue for male, pink for female, and green for mixed competition. And no, the 11-year-old female swimmer is not a typo like I originally thought.
The previous graph was kind of messy, so I’ve sorted this one by median age. Not surprisingly, female gymnastics and rhythmic gymnastics have the lowest median ages of competitors, while equestrianism has the highest median age at over 35 years.
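For anyone who wants to try the sorting step, here is a minimal sketch of one way to order boxplots by median in base R. It is not necessarily how the graph above was made. The `athletes` data frame and its ages are made-up toy values, not the scraped data set.

```r
# Hypothetical toy data: invented ages, NOT the scraped Olympic data.
athletes <- data.frame(
  sport = rep(c("Swimming", "Equestrianism", "Gymnastics"), each = 5),
  age   = c(20, 22, 23, 25, 41,   # Swimming, median 23
            28, 33, 36, 40, 52,   # Equestrianism, median 36
            15, 16, 17, 18, 19)   # Gymnastics, median 17
)

# reorder() re-levels the factor by the median age within each sport,
# so boxplot() draws the boxes from lowest to highest median.
athletes$sport <- reorder(factor(athletes$sport), athletes$age, FUN = median)

boxplot(age ~ sport, data = athletes, las = 2, xlab = "", ylab = "Age")
```

Without the `reorder()` call, the boxes would appear in alphabetical order of the sport names.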
The previous two graphs were only for the years of 2000-2008, so I re-did the previous graph using data from all of the olympics. Since the obvious question arising from this graph is “What is roque?”, I have saved you the trouble of googling it by providing a Wikipedia link for roque.
This graph is boxplots of age by year with the color representing the host continent.
[7/15/2012 Correction: The original post had the 1956 box colored blue for Europe. However, commenter Mules points out that 1956 should actually be yellow for Australia. They are correct and the correction has been made. However, as I point out in response, I’m not totally wrong: The equestrian events had to be held in Stockholm, Sweden due to quarantine restrictions.]
[7/23/2012 Correction: The graph below had some mistakes in it, including an olympian who was over 90. This was pointed out by Kate, and has been corrected.]
And finally, we have overall age by gender.
Cheers.
Posted on July 9, 2012, in Math Pictures, Olympics, R, Sports. 27 Comments.
Very cool. Thanks for sharing. There are some old equestrians (septuagenarians)!
So this is very nit picky but I like my boxplots to have closed small circles
pch=19
with an alphacol=rgb(.1,.1,.1,.5)
so I can see when the outliers are thick. Could you post your code to github so I can be a nitpicking “perfectionist” with the charts? 😉
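The suggestion above can be sketched in base R roughly as follows. This assumes the semi-transparent color is meant for the outlier points, which `boxplot()` styles through `outpch` and `outcol`. The `ages` vector is invented toy data, and `out_col` is just an illustrative variable name.

```r
# Hypothetical toy data with a pile of identical outliers at age 45.
ages <- c(rep(22:26, each = 10), rep(45, 8), 60)

# Filled circles (pch 19) in a half-transparent dark gray: overlapping
# outliers stack up and look darker than a single isolated point.
out_col <- rgb(0.1, 0.1, 0.1, 0.5)

# outpch and outcol apply only to the outlier points, not the box.
stats <- boxplot(ages, outpch = 19, outcol = out_col, ylab = "Age")
```

Here the eight stacked points at 45 render noticeably darker than the lone point at 60.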
thank you!!!
I agree that it would be very fun to see the code that made these. A nice educational resource and a jumping off point for other ways to slice the same data. The CRAPL (http://matt.might.net/articles/crapl/) is an entertaining way to put this sort of work out in public.
I think I finally got github working. Here (should) be my Olympics code. Feedback (politely) is welcome.
I really like the idea of CRAPL. And this quote from here sums up my code pretty well:
Just a small point but on the “age by year with the color representing the host continent” graph 1956 should be yellow
You are correct. Although, I’m not totally wrong: The equestrian events had to be held in Stockholm, Sweden due to quarantine restrictions. No, actually I am totally wrong.
Thank you for pointing this out to me. The correction has been made.
I’d like to see a graph of sport v age v sex v result (say for the top 5-8 in each event, but if it’s only the medalists [top 3], that’s ok). That would reveal what age range is needed to be competitive in which sports, and exclude the (many) ceremonial entries from countries that don’t really have a serious program in that particular sport (e.g. think Jamaica in the bobsled).
This would help illuminate which events truly require youth over experience/training (and perhaps point out which events don’t require much athleticism).
I’d also like to see a graph adding in muscle v fat measures (e.g. BMI). I think there might be some surprises there (for US folk, anyway ;).
…Dave…
I’ll post the boxplot with medal winners, as that’s easy to do.
As for the BMI suggestion, I’d like to see that plot, too. But I don’t have that data. If you’d like to send me that data, I’d be happy to make that graph.
Thanks – should be interesting.
I’ve never run across a reasonable body fat percentage data set for any sport, let alone all olympians. I just tried once again to track one down, and didn’t find anything other than the usual numbers for a few, selected athletes. My initial interest in that stat came from a report that the US did this during a combined training camp in Colo Springs many years ago (and there were some surprises), but I was never able to find that data set, nor even confirm the story.
Upon review, BMI would be a meaningless measure for this purpose (many olympic athletes rate as obese on that scale). Accurate body fat measurements are complex to get (e.g. an MRI), and I’m sure many countries are very protective of their athletes, which makes me fairly doubtful that anyone has this data in one place. Perhaps each country has this data though — however it would be measured by different methodologies, complicating the comparison.
If I find such a data set though, I’ll certainly let you know.
…Dave…
Here’s the boxplot for only medal winners from 2000-2008: http://bit.ly/O1mOtt
Very nice. I’m not sure of the sort order – avg of men’s and women’s ages?
Who are the 40+ swimmers? Seems odd.
How are the teams sports represented (again possible masking effects – men’s soccer/football seems a very narrow range to me)?
Women’s cycling age is a surprise to me.
Men’s soccer (or “football” as the rest of the world refers to it “incorrectly”).
According to Wikipedia:
Swimming:
Dara Torres was 41 in 2008 and she won three silver medals:
http://www.sports-reference.com/olympics/athletes/to/dara-torres-1.html
The graphs are sorted by median.
Cheers.
Note: as a little tweak, you can use FUN=paste0 instead of FUN=paste, sep="".
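A small sketch of that tweak, assuming it refers to an `outer()` call that builds labels from two character vectors. The `sports` and `genders` vectors are hypothetical examples, not taken from the posted code.

```r
# Hypothetical label pieces, for illustration only.
sports  <- c("Swimming", "Cycling")
genders <- c("M", "F")

# outer() forwards extra arguments such as sep to FUN, so both calls
# build the same matrix of every sport/gender combination.
labels_paste  <- outer(sports, genders, FUN = paste, sep = "")
labels_paste0 <- outer(sports, genders, FUN = paste0)

print(labels_paste0)
```

`paste0()` is simply `paste()` with the separator fixed to the empty string, so the second form saves an argument.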
I had read before that the oldest Olympian ever was a Swedish shooter named Oscar Swahn and that he was in his 70s. But in the 4th graph it looks like there is someone who is 92 years old from 1932! Is that right? Am I missing something?
ps I have really been enjoying these.
Thanks for pointing this out. The correction has been made here.
That doesn’t look right. Must be a mistake. I’ll take a look at it tomorrow.
Hi, great plots. Would it be possible to get a copy of the data (or your scraping scripts)? I would love to use this data set for my class.
The code is here. If you use my code, please mention this blog.
Thanks for the code. Of course I provide a direct link to your website.