Mining Jukola relay results

Jukola is a 7-leg orienteering relay event held in Finland every June since 1949.

It is also the biggest relay event in the world. See the Wikipedia article for more information.

This is what the start looks like:

Being a data geek and an orienteer (I have been running Jukola since 2003), I will be exploring the data kindly provided by the organizers. This is the first post in the series.

Getting the data

First, let's do some imports.

In [1]:
from lxml import etree
import pandas as pd
import urllib2 
import re

Results for 1949-2014 are available, but only those from 1992 onward are provided as XML, so let's start from 1992 for now.

TODO: Got to hack a Scrapy parser for older results.

In [2]:
%%time
YEARS = range(1992, 2015)
PATH = "http://results.jukola.com/tulokset/results_j%d_ju.xml"
COLUMNS = ["name", "teamnro", "place", "teamid", "result", "tsecs", "members"]
valid_years = []

def iter_teams(root):
    for i in root.findall(".//team"):
        name = i.find("teamname").text
        teamnro = int(i.find("teamnro").text)
        teamid =  int(i.find("teamid").text)
        place_elem = i.find("placement")
        place = int(place_elem.text) if place_elem is not None else ""
        result_elem = i.find("result")
        result = result_elem.text if result_elem is not None else ""
        tsecs_elem = i.find("tsecs")
        tsecs = float(tsecs_elem.text) if tsecs_elem is not None else ""
        members_elem = i.findall(".//nm")
        members = ",".join([m.text.strip() for m in members_elem])
        yield (name, teamnro, place, teamid, result, tsecs, members)

def iter_years(years):
    for y in years:
        print(y)
        f = None  # so the finally block is safe when urlopen itself fails
        try:
            f = urllib2.urlopen(PATH % y)
            # fix broken XML: escape bare ampersands that are not already "&amp;"
            data = re.sub("&(?!amp;)", "&amp;", f.read())
            root = etree.fromstring(data)
            df = pd.DataFrame.from_records(list(iter_teams(root)), columns=COLUMNS)
            df["year"] = y
            valid_years.append(y)
            yield df
        except (urllib2.URLError, etree.XMLSyntaxError) as e:
            print(e)
        finally:
            if f is not None:
                f.close()
        
bdf = pd.concat(list(iter_years(YEARS)))
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
CPU times: user 24.3 s, sys: 6.78 s, total: 31 s
Wall time: 3min 41s
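
To make the "fix broken XML" step and the team parsing easier to follow, here is a minimal sketch on a made-up document. It assumes the fix is meant to escape bare ampersands, it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the sample XML (including the leg wrapper element and the helper names escape_bare_ampersands and parse_teams) is hypothetical; only the tag names teamname, teamnro, placement and nm come from the code above.

```python
import re
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Made-up sample in the shape iter_teams() expects; the real files may nest differently.
SAMPLE = """<results>
  <team>
    <teamname>Foo & Bar OK</teamname>
    <teamnro>1</teamnro>
    <placement>3</placement>
    <leg><nm> Runner One </nm></leg>
    <leg><nm>Runner Two</nm></leg>
  </team>
  <team>
    <teamname>A &amp; B</teamname>
    <teamnro>2</teamnro>
    <leg><nm>Runner Three</nm></leg>
  </team>
</results>"""


def escape_bare_ampersands(text):
    # Escape every "&" that does not already start "&amp;".
    # Caveat: other entities such as "&lt;" would be double-escaped too.
    return re.sub(r"&(?!amp;)", "&amp;", text)


def parse_teams(xml_text):
    root = ET.fromstring(escape_bare_ampersands(xml_text))
    teams = []
    for team in root.findall(".//team"):
        placement = team.find("placement")
        teams.append({
            "name": team.find("teamname").text,
            "teamnro": int(team.find("teamnro").text),
            # None marks a team without a placement element
            "place": int(placement.text) if placement is not None else None,
            "members": [nm.text.strip() for nm in team.findall(".//nm")],
        })
    return teams


teams = parse_teams(SAMPLE)
for t in teams:
    print(t)
```

Without the escaping step the parser rejects the first team name, because a bare "&" is not well-formed XML. Using None rather than an empty string for a missing placement is a design choice of this sketch, not what the notebook does.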

Getting hands dirty

So now we have everything we need. Let's start with a simple query: Kalevan Rasti's road to success.

In [4]:
bdf[(bdf.name == "Kalevan Rasti") & (bdf.teamnro == 1)]
Out[4]:
name teamnro place teamid result tsecs members year
29 Kalevan Rasti 1 30 11 8:21:12 30072 Janne Märkälä,Jussi Silvennoinen,Arto Muhonen,... 1992
29 Kalevan Rasti 1 30 19 8:22:40 30160 Jari Heikkinen,Jussi Silvennoinen,Arto Muhonen... 1993
25 Kalevan Rasti 1 26 30 7:29:03 26943 Jari Heikkinen,Pekka Inkeri,Mika Elomäki,Antti... 1994
29 Kalevan Rasti 1 30 26 9:25:47 33947 Petri Vainio,Marko Pitkänen,Arto Muhonen,Antti... 1995
14 Kalevan Rasti 1 15 30 8:00:32 28832 Petri Vainio,Harri Romppanen,Arto Muhonen,Raun... 1996
14 Kalevan Rasti 1 15 15 8:17:49 29869 Ville Halonen,Jouni Mäkäräinen,Arto Muhonen,Ju... 1997
11 Kalevan Rasti 1 12 15 8:19:24 29964 Jari Heikkinen,Arto Muhonen,Harri Romppanen,Ra... 1998
4 Kalevan Rasti 1 5 12 7:41:17 27677 Petri Vainio,Jouni Mäkäräinen,Harri Romppanen,... 1999
3 Kalevan Rasti 1 4 61 7:30:37 27037 Petri Vainio,Antti Tolonen,Jari Heikkinen,Thie... 2001
1 Kalevan Rasti 1 2 4 7:47:38 28058 Antti Tolonen,Tommi Tölkkö,Mikael Boström,Petr... 2002
2 Kalevan Rasti 1 3 2 7:40:58 27658 Mika Venho,Petri Vainio,Tommi Tölkkö,Mikael Bo... 2003
0 Kalevan Rasti 1 1 3 6:51:51 24711 Mikael Boström,Samuli Launiainen,Tommi Tölkkö,... 2004
0 Kalevan Rasti 1 1 1 7:41:17 27677 Harri Romppanen,Mikael Boström,Tommi Tölkkö,Ha... 2005
1 Kalevan Rasti 1 2 1 8:18:50 29930 Harri Romppanen,Hannu Airila,Tommi Tölkkö,Samu... 2006
0 Kalevan Rasti 1 1 2 7:44:16 27856 Philippe Adamski,Samuli Launiainen,Tommi Tölkk... 2007
5 Kalevan Rasti 1 6 1 8:21:55 30115 Antti Nurmonen,Miika Hernelahti,Tommi Tölkkö,H... 2008
1 Kalevan Rasti 1 2 6 8:03:06 28986 Jan Prochazka,Tommi Tölkkö,Hannu Airila,Miika ... 2009
1 Kalevan Rasti 1 2 2 8:39:49 31189 Jan Prochazka,Tommi Tölkkö,Hannu Airila,Simo M... 2010
1 Kalevan Rasti 1 2 2 7:41:31 27691 Jan Prochazka,Hannu Airila,Thierry Gueorgiou,S... 2011
0 Kalevan Rasti 1 1 2 7:56:02 28562 Kiril Nikolov,Jarkko Huovila,Simo-Pekka Fincke... 2012
0 Kalevan Rasti 1 1 1 7:27:58 26878 Jere Pajunen,Simo-Pekka Fincke,Philippe Adamsk... 2013
0 Kalevan Rasti 1 1 1 7:59:01 28741 Jere Pajunen,Philippe Adamski,Aaro Asikainen,H... 2014

Someday my club will get to the Top-100.

In [5]:
bdf[(bdf.name == "Northern Wind") & (bdf.teamnro == 1)]
Out[5]:
name teamnro place teamid result tsecs members year
251 Northern Wind 1 252 493 10:07:31 36451 Alexander Khramov,Pavel Kalaidin,Vladimir Serg... 2003
195 Northern Wind 1 196 252 8:27:35 30455 Alexander Khramov,Pavel Kalaidin,Dmitriyl Somo... 2004
260 Northern Wind 1 261 196 10:13:43 36823 Artem Lebedev,Fedor Krasnienko,Viacheslav Krug... 2005
166 Northern Wind 1 167 261 10:44:26 38666 Artem Lebedev,Sergey Petrov,Viacheslav Kruglov... 2006
167 Northern Wind 1 168 167 10:13:37 36817 Dmitriy Sorokin,Pavel Petrov,Viacheslav Kruglo... 2007
183 Northern Wind 1 184 168 10:38:09 38289 Petr Zaslonkin,Pavel Petrov,Georgy Mavchun,Dmi... 2008
132 Northern Wind 1 133 184 9:52:38 35558 Dmitriy Sorokin,Pavel Petrov,Petr Zaslonkin,Al... 2009
169 Northern Wind 1 170 133 11:52:21 42741 Dmitriy Sorokin,Pavel Petrov,Viacheslav Kruglo... 2010
213 Northern Wind 1 214 170 10:18:08 37088 Georgiy Mavchun,Dmitriy Sorokin,Pavel Petrov,V... 2011
294 Northern Wind 1 295 214 11:32:34 41554 Dmitriy Sorokin,Pavel Petrov,Andrey Fershalov,... 2012
382 Northern Wind 1 383 295 10:56:27 39387 Dmitriy Sorokin,Pavel Petrov,Pavel Kalaidin,De... 2013
145 Northern Wind 1 146 383 10:25:03 37503 Dmitriy Sorokin,Georgiy Mavchun,Slava Kruglov,... 2014

Exploring top runners

Remember that we only use 1992-2014 data for now, so these are the most recent Jukola heroes.

Extract runners from the teams first.

In [6]:
def iter_members(bdf):
    for _, row in bdf.iterrows():
        for m in row["members"].split(","):
            yield(row["name"], row["teamnro"], row["place"], row["year"], m)
In [7]:
runners = pd.DataFrame.from_records([i for i in iter_members(bdf)], columns=["name", "teamnro", "place", "year", "runner"])
In [8]:
# remove empty names
# a bit of Finnish for you: "Ei juoksijaa" means "no runners"
runners = runners[runners.runner != " "]

Extract those who made the top 15. (Strictly speaking, the filter place < 15 below keeps places 1-14.)

In [9]:
top_runners = runners[runners.place < 15].groupby(["name", "teamnro", "runner"]).year.count().reset_index()

Now group by the count: after the previous step, the year column holds the number of such finishes per runner. Note that only one runner did it 13 times. Guess who.

In [10]:
top_runners.groupby("year").count()
Out[10]:
      name  teamnro  runner
year
1      720      720     720
2      228      228     228
3      103      103     103
4       51       51      51
5       28       28      28
6       20       20      20
7       14       14      14
8        6        6       6
9        8        8       8
10       4        4       4
11       2        2       2
12       1        1       1
13       1        1       1
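
The two-step counting used here (finishes per runner first, then runners per finish count) can be sketched without pandas. The rows below are made up for illustration, and the names finishes and histogram are hypothetical; the cut-off mirrors the place < 15 filter from the cell above.

```python
from collections import Counter

# Hypothetical (team, team number, place, year, runner) rows, made up for illustration.
rows = [
    ("Team A", 1, 3, 2001, "Runner X"),
    ("Team A", 1, 9, 2002, "Runner X"),
    ("Team A", 1, 40, 2003, "Runner X"),  # outside the cut-off, not counted
    ("Team A", 1, 3, 2001, "Runner Y"),
    ("Team B", 1, 14, 2002, "Runner Z"),
]

# Step 1: number of qualifying finishes per (team, team number, runner).
finishes = Counter(
    (team, nro, runner)
    for team, nro, place, year, runner in rows
    if place < 15
)

# Step 2: how many runners share each finish count.
histogram = Counter(finishes.values())

print(dict(finishes))
print(sorted(histogram.items()))
```

The second counter is what the table above shows: its index is a number of finishes, not a calendar year.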

Guessed right: Thierry Gueorgiou.

In [11]:
top_runners.sort_values("year", ascending=False)[0:20]
Out[11]:
name teamnro runner year
510 Kalevan Rasti 1 Thierry Gueorgiou 13
508 Kalevan Rasti 1 Simo Martomaa 12
202 Halden SK 1 Mats Haldin 11
120 Delta 1 Valentin Novikov 11
492 Kalevan Rasti 1 Hannu Airila 10
215 Halden SK 1 Tore Sandvik 10
1158 Vaajakosken Terä 1 Jonne Lakanen 10
184 Halden SK 1 Emil Wingstedt 10
861 Paimion Rasti 1 Markus Lindeqvist 9
104 Delta 1 Leonid Novikov 9
108 Delta 1 Miika Hernelahti 9
916 Rajamäen Rykmentti 1 Marko Väisänen 9
512 Kalevan Rasti 1 Tommi Tölkkö 9
185 Halden SK 1 Erik Axelsson 9
1153 Vaajakosken Terä 1 Antti Anttonen 9
1156 Vaajakosken Terä 1 Jani Lakanen 9
1177 Vehkalahden Veikot 1 Tero Föhr 8
867 Paimion Rasti 1 Teemu Väre 8
1125 Turun Suunnistajat 1 Janne Salmi 8
1159 Vaajakosken Terä 1 Jouni Kahelin 8

Has anyone run every Jukola since 1992 (that is, 23 times)?

In [12]:
runners.groupby(["runner"]).year.count().reset_index().sort_values("year", ascending=False)[0:10]
Out[12]:
                 runner  year
0                        2939
43993                 |   131
4                     .    79
23708   Markku Virtanen    72
13222     Jari Järvinen    60
39141   Timo Hämäläinen    59
22803  Magnus Johansson    57
36722   Stefan Karlsson    56
39233     Timo Korhonen    53
30910    Pekka Valtonen    52

Hmm, missing names obviously top the list, along with namesakes. Let's also group by team name and team number.

In [13]:
masters = runners.groupby(["runner", "name", "teamnro"]).year.count().reset_index()
In [14]:
masters[masters.year == 23]
Out[14]:
runner name teamnro year
5700 Antti Puro Turun NMKY 1 23
18376 Hannu Lavonen Miehikkälän Vilkas 1 23
20443 Heikki Hietalahti Halsuan Toivo 1 23
21335 Heikki Vallbacka Kortesjärven Järvi-Veikot 1 23
35870 Juha Lerssi Kaustisen Pohjan-Veikot 1 23
39725 Jussi Elo Laitilan Jyske 1 23
51420 Markku Nieminen Pälkäneen Lukko 1 23
56297 Matti Yliluikki Suomusjärven Sisu 1 23
71056 Petri Kauppinen Halsuan Toivo 1 23
74088 Reijo Kuoppala Nastolan Terä 1 23
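
Why the grouping key matters can be shown on a toy example. The names, clubs and years below are invented, and by_name / by_name_and_team are hypothetical helper names; the sketch only illustrates how adding team name and team number to the key separates namesakes.

```python
from collections import Counter

# Invented (runner, club, team number, year) rows: two different people share one name.
appearances = [
    ("Matti Meikalainen", "Club North", 1, 2010),
    ("Matti Meikalainen", "Club North", 1, 2011),
    ("Matti Meikalainen", "Club South", 2, 2010),  # a namesake in another club
    ("Matti Meikalainen", "Club South", 2, 2011),
]

# Keyed by name only: the namesakes are merged into one "runner".
by_name = Counter(name for name, club, nro, year in appearances)

# Keyed by name, club and team number: each person is counted separately.
by_name_and_team = Counter((name, club, nro) for name, club, nro, year in appearances)

print(by_name)
print(by_name_and_team)
```

Four appearances in two years is impossible for a single runner, which is how namesakes inflate the name-only count. The stricter key has its own blind spot: a runner who changed club or ran for a different team number is split into several entries, so the list of 23-time runners above only contains people who stayed with the same team throughout.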

Let's find those who made it to the top-100.

In [15]:
masters_top100 = runners[runners.place <= 100].groupby(["runner", "name", "teamnro"]).year.count().reset_index()
In [16]:
masters_top100.sort_values("year", ascending=False)[0:10]
Out[16]:
runner name teamnro year
5836 Petteri Laitinen Kangasala SK 1 20
2460 Janne Salmi Turun Suunnistajat 1 20
6593 Sören Nymalm Pargas IF 1 18
3873 Lasse Jansson Gävle OK 1 17
4269 Markus Lindeqvist Paimion Rasti 1 16
4242 Marko Väisänen Rajamäen Rykmentti 1 16
5737 Peter Öberg OK Hällen 1 15
2469 Janne Weckman Vehkalahden Veikot 1 15
3065 Jostein Andersen Kristiansand OK 1 15
3174 Juha Peltola MS Parma 1 15

Which Jukola was the hardest?

We all know the answer, but let's prove it with data.

Remember what we have.

In [17]:
bdf.head()
Out[17]:
name teamnro place teamid result tsecs members year
0 IK Hakarpspojkarna 1 1 16 8:04:21 29061 Andreas Rangert,Bo Granstedt,Fridolf Fskilsson... 1992
1 Rajamäen Rykmentti 1 2 15 8:04:48 29088 Marko Väisänen,Ari Enroth,Jouni Mähönen,Petri ... 1992
2 IFK Södertälje 1 3 1 8:05:00 29100 Lennart Olsson,Thomas Andersson,Kari Enckell,M... 1992
3 Sundsvalls OK 1 4 9 8:08:23 29303 Thomas Rex,Lars Nordin,Ulrik Olsson,Lars Pette... 1992
4 Angelniemen Ankkuri 1 5 2 8:08:27 29307 Petteri Kähäri,Tero Heikkilä,Ari Kattainen,Mik... 1992

A bit of preprocessing.

In [18]:
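# drop teams that did not finish, sorry guys
bdf = bdf[bdf.tsecs != ""].copy()  # copy so the new column is not set on a view
# seconds -> hours; note that the "- 1" shifts every time one hour earlier
bdf["hour"] = bdf.tsecs.astype(int) / 60 / 60 - 1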
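
As a sanity check on the time columns, here is a small sketch that converts a result string to seconds and then to hours. The function names result_to_seconds and seconds_to_hours are hypothetical; the input is the Kalevan Rasti 1992 row shown earlier. The sketch does the plain conversion only and leaves out the one-hour offset subtracted in the cell above.

```python
def result_to_seconds(result):
    # "H:MM:SS" -> total seconds
    hours, minutes, seconds = (int(part) for part in result.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hours(tsecs):
    # float division so the code behaves the same on Python 2 and 3
    return tsecs / 3600.0


secs = result_to_seconds("8:21:12")
print(secs)  # 30072, the same value as the tsecs column in that row
print(round(seconds_to_hours(secs), 3))
```

So, at least for this row, tsecs is simply the result expressed in seconds.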

What I want to plot is the distribution of finishing times per year. The further it is skewed to the right, the longer teams needed to complete the relay.

In [19]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")
sns.set_context(rc={"figure.figsize": (8, 4)})
In [28]:
g = sns.FacetGrid(bdf, row="year", size=1, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(plt.hist, "hour", color="steelblue", bins=bins, lw=0, alpha=0.8)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()
Out[28]:
[]

Let's play with the plots a bit and add a probability density estimate.

In [21]:
g = sns.FacetGrid(bdf, row="year", size=1.2, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(sns.distplot, "hour", color="steelblue", bins=bins)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()
Out[21]:
[]

Perhaps we had better try boxplots.

In [22]:
g = sns.FacetGrid(bdf, row="year", size=1.2, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(sns.boxplot, "hour", color="steelblue", vert=False)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()
Out[22]:
[]

Kytäjä-Jukola (2010), as expected, seems to be the hardest.

Speaking of the fastest Jukolas, I remember Jämi-Jukola in 2004 was held around the airfield area, and it was full-speed.

The plots show that the fastest Jukola (by median time to complete the race) seems to be the 1994 Pyhä-Luosto Jukola, which was held 50 km north of the Arctic Circle.
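
The ranking by median can also be computed directly instead of being read off the boxplots. The sketch below uses made-up finishing times (NOT real Jukola results) and hypothetical names (median, by_year, medians); it only illustrates the comparison.

```python
from collections import defaultdict


def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2.0


# Made-up (year, finishing time in hours) pairs, NOT real Jukola results.
times = [
    (2001, 9.0), (2001, 10.0), (2001, 14.0),
    (2002, 8.0), (2002, 8.5), (2002, 9.0), (2002, 12.0),
]

by_year = defaultdict(list)
for year, hours in times:
    by_year[year].append(hours)

medians = {year: median(hours) for year, hours in by_year.items()}
fastest = min(medians, key=medians.get)
slowest = max(medians, key=medians.get)
print(medians, fastest, slowest)
```

The median is a reasonable yardstick here because a handful of very slow teams stretches the right tail of the distribution and would pull the mean with it.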

Time to get to Top-100

Let's see what you need to get to the Top-100.

In [24]:
top100 = bdf[bdf.place <= 100].copy()  # copy so the new column is not set on a view
top100["delta"] = top100.groupby("year").hour.transform(lambda x: x - x.min())
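
The delta is each team's time minus the winner's time in the same year. A minimal pandas-free sketch, using the tsecs values of the first five 1992 rows from bdf.head() above (the names winner_time and deltas are hypothetical):

```python
# (year, place, tsecs) taken from the first five 1992 rows of bdf.head() above.
rows = [
    (1992, 1, 29061),
    (1992, 2, 29088),
    (1992, 3, 29100),
    (1992, 4, 29303),
    (1992, 5, 29307),
]

# Fastest time per year, i.e. the winner's time.
winner_time = {}
for year, place, tsecs in rows:
    winner_time[year] = min(tsecs, winner_time.get(year, tsecs))

# Seconds lost to the winner for every team.
deltas = [(year, place, tsecs - winner_time[year]) for year, place, tsecs in rows]

for year, place, delta in deltas:
    print("%d place %d: +%d:%02d" % ((year, place) + divmod(delta, 60)))
```

Because the winner's time is subtracted within each year, any constant offset applied to the hour column cancels out of the delta.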

Here is how much time the 100th team lost to the winner.

In [25]:
sns.set_context(rc={"figure.figsize": (12, 4)})
sns.barplot("year", "delta", data=top100[top100.place==100])
plt.ylabel("delta (hr)")
plt.plot()
Out[25]:
[]

Wow, in 1992 the top 100 teams were packed into 40 minutes!

Backup plots

In [26]:
sns.set_context(rc={"figure.figsize": (8, 4)})
g = sns.FacetGrid(top100, row="year", size=1.2, aspect=2.5, xlim=(0, 3), row_order=valid_years[::-1])
bins = np.linspace(0, 3, 200)
g.map(plt.hist, "delta", color="steelblue", bins=bins, lw=0, alpha=0.8)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([10])
plt.xticks([1, 2, 3])
plt.plot()
Out[26]:
[]
In [27]:
g = sns.FacetGrid(top100, row="year", size=1.2, aspect=2.5, xlim=(0, 3), row_order=valid_years[::-1])
g.map(sns.boxplot, "delta", color="steelblue", vert=False)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.xticks([0, 1, 2, 3])
plt.plot()
Out[27]:
[]