Jukola is a 7-leg orienteering relay event held in Finland every June since 1949.
It is also the biggest relay event in the world. See the Wikipedia article for more information.
This is what the start looks like:
Being a data geek and an orienteer (I have been running Jukola since 2003), I will be exploring data kindly provided by the organizers. This is the first post in the series.
First, let's do some imports.
from lxml import etree
import pandas as pd
import urllib2
import re
Results are available for 1949-2014, but only those from 1992 onward are provided as XML, so let's start from 1992 for now.
TODO: Got to hack a Scrapy parser for older results.
%%time
YEARS = range(1992, 2015)
PATH = "http://results.jukola.com/tulokset/results_j%d_ju.xml"
COLUMNS = ["name", "teamnro", "place", "teamid", "result", "tsecs", "members"]
valid_years = []
def iter_teams(root):
    # one tuple per team; optional elements fall back to an empty string
    for i in root.findall(".//team"):
        name = i.find("teamname").text
        teamnro = int(i.find("teamnro").text)
        teamid = int(i.find("teamid").text)
        place_elem = i.find("placement")
        place = int(place_elem.text) if place_elem is not None else ""
        result_elem = i.find("result")
        result = result_elem.text if result_elem is not None else ""
        tsecs_elem = i.find("tsecs")
        tsecs = float(tsecs_elem.text) if tsecs_elem is not None else ""
        members_elem = i.findall(".//nm")
        members = ",".join([m.text.strip() for m in members_elem])
        yield (name, teamnro, place, teamid, result, tsecs, members)
def iter_years(years):
    for y in years:
        print(y)
        f = None  # so that the finally clause is safe if urlopen fails
        try:
            f = urllib2.urlopen(PATH % (y))
            # fix broken XML: escape bare ampersands
            data = re.sub("&(?!amp)", "&amp;", f.read())
            root = etree.fromstring(data)
            df = pd.DataFrame.from_records([i for i in iter_teams(root)], columns=COLUMNS)
            df["year"] = y
            valid_years.append(y)
            yield df
        except (urllib2.URLError, etree.XMLSyntaxError) as e:
            print(e)
        finally:
            if f: f.close()
bdf = pd.concat(list(iter_years(YEARS)))
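To make the ampersand repair and the handling of optional elements easier to follow, here is a minimal sketch on a made-up XML snippet. It is an illustration under assumptions, not the code used for the post: it relies on the standard library's `xml.etree.ElementTree` rather than lxml, the sample document and the `parse_teams` helper are hypothetical, and missing fields become `None` instead of an empty string. Only the element names are taken from the loader above.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample: the second team has a bare "&" and no placement or tsecs.
SAMPLE = """<results>
  <team><teamname>Team Alpha</teamname><teamnro>1</teamnro>
    <placement>1</placement><tsecs>25000</tsecs>
    <leg><nm> Runner A </nm></leg><leg><nm>Runner B</nm></leg></team>
  <team><teamname>Alpha & Beta</teamname><teamnro>2</teamnro>
    <leg><nm>Runner C</nm></leg></team>
</results>"""


def parse_teams(xml_text):
    """Return a list of dicts, one per team; missing fields become None."""
    # Escape every "&" that does not already start "&amp;".
    fixed = re.sub(r"&(?!amp;)", "&amp;", xml_text)
    root = ET.fromstring(fixed)
    teams = []
    for team in root.findall(".//team"):
        place = team.find("placement")
        tsecs = team.find("tsecs")
        teams.append({
            "name": team.find("teamname").text,
            "teamnro": int(team.find("teamnro").text),
            "place": int(place.text) if place is not None else None,
            "tsecs": float(tsecs.text) if tsecs is not None else None,
            "members": [nm.text.strip() for nm in team.findall(".//nm")],
        })
    return teams


teams = parse_teams(SAMPLE)
```

Note that this simple pattern would also rewrite other entities such as `&lt;`, so it only suits files where bare ampersands are the sole problem.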
So now we have everything we need. Let's start with a simple query: Kalevan Rasti's road to success.
bdf[(bdf.name == "Kalevan Rasti") & (bdf.teamnro == 1)]
Someday my club will get to the Top-100.
bdf[(bdf.name == "Northern Wind") & (bdf.teamnro == 1)]
Remember that we use only the 1992-2014 data for now, so these are the most recent Jukola heroes.
Extract runners from the teams first.
def iter_members(bdf):
    for _, row in bdf.iterrows():
        for m in row["members"].split(","):
            yield (row["name"], row["teamnro"], row["place"], row["year"], m)
runners = pd.DataFrame.from_records([i for i in iter_members(bdf)], columns=["name", "teamnro", "place", "year", "runner"])
# remove empty names
# a bit of Finnish for you: "Ei juoksijaa" means "no runner"
runners = runners[runners.runner != " "]
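The step from team rows to runner rows can be sketched without pandas. This is a hedged illustration only: `explode_members` and the sample tuples are hypothetical, and unlike the filter above it drops any name that is empty after stripping whitespace.

```python
def explode_members(teams):
    """Turn (name, teamnro, place, year, members) tuples into one row per runner.

    `members` is a comma-separated string, as built by the loader.
    Names that are empty after stripping whitespace are skipped.
    """
    rows = []
    for name, teamnro, place, year, members in teams:
        for runner in members.split(","):
            runner = runner.strip()
            if runner:
                rows.append((name, teamnro, place, year, runner))
    return rows


# Made-up input: the first team has two trailing empty slots, the second has no names.
sample_teams = [
    ("Team Alpha", 1, 12, 2001, "Runner A, Runner B,,"),
    ("Team Beta", 2, "", 2001, ""),
]
runner_rows = explode_members(sample_teams)
```

Each resulting row repeats the team fields, which is what makes the later per-runner grouping possible.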
Extract those who get to the top-15.
top_runners = runners[runners.place <= 15].groupby(["name", "teamnro", "runner"]).year.count().reset_index()
Group by the number of top-15 finishes (after count(), the year column holds that number). Note that only one runner made the top 15 thirteen times. Guess who.
top_runners.groupby("year").count()
Guessed right: Thierry Gueorgiou.
top_runners.sort_values("year", ascending=False)[0:20]
Has anyone run every Jukola since 1992 (that is, 23 times)?
runners.groupby(["runner"]).year.count().reset_index().sort_values("year", ascending=False)[0:10]
Hmm, missing names and namesakes obviously end up at the top. Let's also group by team name and team number.
masters = runners.groupby(["runner", "name", "teamnro"]).year.count().reset_index()
masters[masters.year == 23]
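Why the extra grouping keys matter can be shown with a small sketch using `collections.Counter`. The rows and names below are made up for illustration; only the idea of counting per (runner, team name, team number) comes from the query above.

```python
from collections import Counter

# Made-up rows: (team name, team number, year, runner).
# Two different people share the name "J. Virtanen".
rows = [
    ("Team Alpha", 1, 2012, "J. Virtanen"),
    ("Team Alpha", 1, 2013, "J. Virtanen"),
    ("Team Beta", 3, 2013, "J. Virtanen"),
    ("Team Beta", 3, 2013, "M. Korhonen"),
]

# Counting by name alone merges the namesakes into one "runner".
by_name = Counter(runner for _, _, _, runner in rows)

# Counting by (runner, team name, team number) keeps them apart.
by_name_and_team = Counter((runner, name, nro) for name, nro, _, runner in rows)
```

The trade-off is that a runner who changed clubs or moved between a club's teams is split into several groups, so counts obtained this way are a lower bound on a person's appearances.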
Let's find those who made it to the top-100.
masters_top100 = runners[runners.place <= 100].groupby(["runner", "name", "teamnro"]).year.count().reset_index()
masters_top100.sort_values("year", ascending=False)[0:10]
We all know the answer, but let's prove it with data.
Remember what we have.
bdf.head()
A bit of preprocessing.
# drop teams that did not finish, sorry guys.
bdf = bdf[bdf.tsecs != ""]
bdf["hour"] = bdf.tsecs.astype(int) / 60 / 60 - 1
What I want to plot is the distribution of finishing times per year. The further a distribution is shifted to the right, the longer teams needed to complete the relay.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")
sns.set_context(rc={"figure.figsize": (8, 4)})
g = sns.FacetGrid(bdf, row="year", size=1, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(plt.hist, "hour", color="steelblue", bins=bins, lw=0, alpha=0.8)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()
Let's play with the plots and add a probability density estimate.
g = sns.FacetGrid(bdf, row="year", size=1.2, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(sns.distplot, "hour", color="steelblue", bins=bins)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()
Perhaps we had better try boxplots.
g = sns.FacetGrid(bdf, row="year", size=1.2, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(sns.boxplot, "hour", color="steelblue", vert=False)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()
Kytäjä-Jukola (2010), as expected, seems to be the hardest.
Speaking about the fastest Jukolas, I remember Jämi-Jukola in 2004 was held around the airfield area, and it was full-speed.
The plots show that the fastest Jukola (by median time to complete the race) seems to be the 1994 Pyhä-Luosto Jukola, held about 50 km north of the Arctic Circle.
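Reading medians off boxplots is approximate, so a comparison by median finishing time can also be computed directly. The sketch below uses only the standard library; the `median_hours_by_year` helper and the numbers are made up for illustration, and it converts seconds to hours without the one-hour offset applied in the preprocessing step.

```python
from collections import defaultdict
from statistics import median

# Made-up (year, tsecs) pairs; None marks a team that did not finish.
results = [
    (2001, 30600.0), (2001, 32400.0), (2001, 36000.0),
    (2002, 39600.0), (2002, 43200.0), (2002, None),
]


def median_hours_by_year(results):
    """Return {year: median finishing time in hours}, ignoring non-finishers."""
    hours = defaultdict(list)
    for year, tsecs in results:
        if tsecs is not None:
            hours[year].append(tsecs / 3600.0)
    return {year: median(values) for year, values in hours.items()}


medians = median_hours_by_year(results)
fastest_year = min(medians, key=medians.get)  # year with the lowest median
```

With real data, the same dictionary sorted by value would give a ranking of years from fastest to slowest.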
Let's see what you need to get to the Top-100.
top100 = bdf[bdf.place <= 100]
top100["delta"] = top100.groupby("year").hour.transform(lambda x: x - x.min())
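The `transform` call subtracts each year's best time from every team's time in that year. A plain-Python sketch of the same idea, with a hypothetical `add_delta` helper and made-up numbers, looks like this:

```python
def add_delta(rows):
    """Append each team's gap to its year's fastest time.

    `rows` holds (year, place, hour) tuples; the result holds
    (year, place, hour, delta) tuples in the same order.
    """
    best = {}
    for year, _, hour in rows:
        best[year] = min(hour, best.get(year, hour))
    return [(year, place, hour, hour - best[year]) for year, place, hour in rows]


# Made-up finishing times in hours.
sample = [
    (2001, 1, 8.0), (2001, 2, 8.25),
    (2002, 1, 9.5), (2002, 2, 10.0),
]
with_delta = add_delta(sample)
```

The winner of each year gets a delta of zero, and every other team's delta is the time it lost to that winner.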
Here is the time the 100th team lost to the winner.
sns.set_context(rc={"figure.figsize": (12, 4)})
sns.barplot("year", "delta", data=top100[top100.place==100])
plt.ylabel("delta (hr)")
plt.plot()
Wow, in 1992 the Top-100 teams were packed into 40 minutes!
sns.set_context(rc={"figure.figsize": (8, 4)})
g = sns.FacetGrid(top100, row="year", size=1.2, aspect=2.5, xlim=(0, 3), row_order=valid_years[::-1])
bins = np.linspace(0, 3, 200)
g.map(plt.hist, "delta", color="steelblue", bins=bins, lw=0, alpha=0.8)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([10])
plt.xticks([1, 2, 3])
plt.plot()
g = sns.FacetGrid(top100, row="year", size=1.2, aspect=2.5, xlim=(0, 3), row_order=valid_years[::-1])
g.map(sns.boxplot, "delta", color="steelblue", vert=False)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.xticks([0, 1, 2, 3])
plt.plot()