Mining Jukola relay results

Jukola is a 7-leg orienteering relay event held in Finland every June since 1949.

It is also the biggest relay event in the world. See the Wikipedia article for more information.

This is what the start looks like:

Being a data geek and an orienteer (I have been running Jukola since 2003), I will be exploring the data kindly provided by the organizers. This is the first post in the series.

Getting the data

First, let's do some imports.

from lxml import etree
import pandas as pd
import urllib2 
import re

Results for 1949-2014 are available, but only the results from 1992 onward are provided as XML, so let's start from 1992 for now.

TODO: hack together a Scrapy parser for the older results.

%%time
YEARS = range(1992, 2015)
PATH = "http://results.jukola.com/tulokset/results_j%d_ju.xml"
COLUMNS = ["name", "teamnro", "place", "teamid", "result", "tsecs", "members"]
valid_years = []

def iter_teams(root):
    for i in root.findall(".//team"):
        name = i.find("teamname").text
        teamnro = int(i.find("teamnro").text)
        teamid =  int(i.find("teamid").text)
        place_elem = i.find("placement")
        place = int(place_elem.text) if place_elem is not None else ""
        result_elem = i.find("result")
        result = result_elem.text if result_elem is not None else ""
        tsecs_elem = i.find("tsecs")
        tsecs = float(tsecs_elem.text) if tsecs_elem is not None else ""
        members_elem = i.findall(".//nm")
        members = ",".join([m.text.strip() for m in members_elem])
        yield (name, teamnro, place, teamid, result, tsecs, members)

def iter_years(years):
    for y in years:
        print(y)
        f = None  # keeps the finally clause safe if urlopen itself fails
        try:
            f = urllib2.urlopen(PATH % y)
            # fix broken XML: escape bare ampersands
            data = re.sub("&(?!amp)", "&amp;", f.read())
            root = etree.fromstring(data)
            df = pd.DataFrame.from_records(list(iter_teams(root)), columns=COLUMNS)
            df["year"] = y
            valid_years.append(y)
            yield df
        except (urllib2.URLError, etree.XMLSyntaxError) as e:
            print(e)
        finally:
            if f is not None:
                f.close()
        
bdf = pd.concat(list(iter_years(YEARS)))

1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
CPU times: user 24.3 s, sys: 6.78 s, total: 31 s
Wall time: 3min 41s
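
To make the parsing step concrete, here is a minimal sketch that applies the same two operations (escaping bare ampersands, then reading the team fields) to a tiny hand-made fragment. The tag names team, teamname, teamnro, teamid, placement, result, tsecs and nm are the ones used in iter_teams above; the sample teams, their values, the leg wrapper element and the helper name parse_teams are invented for illustration. The standard library's xml.etree.ElementTree stands in for lxml so the snippet has no dependencies, and missing values are marked with None rather than an empty string.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment: tag names follow iter_teams, all values are invented.
SAMPLE = """
<results>
  <team>
    <teamname>Rasti & Kompassi</teamname>
    <teamnro>1</teamnro>
    <teamid>7</teamid>
    <placement>3</placement>
    <result>8:05:00</result>
    <tsecs>29100</tsecs>
    <leg><nm> Runner One </nm></leg>
    <leg><nm>Runner Two</nm></leg>
  </team>
  <team>
    <teamname>Did Not Finish</teamname>
    <teamnro>2</teamnro>
    <teamid>8</teamid>
    <leg><nm>Runner Three</nm></leg>
  </team>
</results>
"""

def parse_teams(raw):
    # A bare "&" is not valid XML, so escape it before parsing
    # (same regular expression as in iter_years).
    fixed = re.sub("&(?!amp)", "&amp;", raw)
    root = ET.fromstring(fixed)
    teams = []
    for team in root.findall(".//team"):
        place = team.find("placement")
        tsecs = team.find("tsecs")
        teams.append({
            "name": team.find("teamname").text,
            "teamnro": int(team.find("teamnro").text),
            # Teams without a placement or a time are kept, marked with None.
            "place": int(place.text) if place is not None else None,
            "tsecs": float(tsecs.text) if tsecs is not None else None,
            "members": [m.text.strip() for m in team.findall(".//nm")],
        })
    return teams

teams = parse_teams(SAMPLE)
```

The second team has no placement or tsecs element, which is how a team that did not finish is assumed to look here; the real files may differ.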

Getting hands dirty

So now we have everything we need. Let's start with a simple query: Kalevan Rasti's road to success.

bdf[(bdf.name == "Kalevan Rasti") & (bdf.teamnro == 1)]

Someday my club will get to the Top-100.

bdf[(bdf.name == "Northern Wind") & (bdf.teamnro == 1)]

Exploring top runners

Remember that we only use the 1992-2014 data for now, so these are the most recent Jukola heroes.

Extract runners from the teams first.

def iter_members(bdf):
    for _, row in bdf.iterrows():
        for m in row["members"].split(","):
            yield(row["name"], row["teamnro"], row["place"], row["year"], m)

runners = pd.DataFrame.from_records([i for i in iter_members(bdf)], columns=["name", "teamnro", "place", "year", "runner"])

# remove empty names
# learning you some finnish: juoksijaa Ei means No runners
runners = runners[runners.runner != " "]
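
The flattening step can be pictured without pandas: each team row with a comma-separated members string becomes one row per runner. Below is a minimal sketch with invented teams and runner names; explode_members is a hypothetical helper, not part of the code above.

```python
# Hypothetical team rows: (name, teamnro, place, year, members)
team_rows = [
    ("Club A", 1, 12, 2013, "Anna Alpha,Ben Beta,"),
    ("Club B", 2, 340, 2013, "Cara Gamma"),
]

def explode_members(rows):
    # One output row per runner, repeating the team-level fields.
    for name, teamnro, place, year, members in rows:
        for runner in members.split(","):
            yield (name, teamnro, place, year, runner)

runner_rows = list(explode_members(team_rows))

# A trailing comma yields an empty name, so filter on the empty string.
named_rows = [r for r in runner_rows if r[4].strip() != ""]
```

Because the names were stripped while parsing, blanks will most likely appear as empty strings rather than a single space, so a comparison against " " may leave them in.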

Extract those who made it to the Top-15.

# note: "< 15" keeps places 1-14; use "<= 15" to include 15th place as well
top_runners = runners[runners.place < 15].groupby(["name", "teamnro", "runner"]).year.count().reset_index()

Group by the number of top finishes. Note that only one runner made the Top-15 13 times. Guess who.

top_runners.groupby("year").count()

Guessed right: Thierry Gueorgiou.

top_runners.sort_values("year", ascending=False)[0:20]
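
The two-step counting used for the top runners (top finishes per runner, then how many runners share each count) can be sketched in plain Python with collections.Counter. The rows and the TOP cut-off below are invented for illustration.

```python
from collections import Counter

# Hypothetical (team, teamnro, runner, year, place) rows
rows = [
    ("Club A", 1, "Anna Alpha", 2011, 3),
    ("Club A", 1, "Anna Alpha", 2012, 1),
    ("Club A", 1, "Ben Beta", 2012, 1),
    ("Club B", 1, "Cara Gamma", 2012, 14),
    ("Club B", 1, "Cara Gamma", 2013, 40),
]

TOP = 15  # cut-off placement, inclusive

# Step 1: number of top finishes per (team, teamnro, runner)
finishes = Counter(
    (team, nro, runner)
    for team, nro, runner, year, place in rows
    if place <= TOP
)

# Step 2: how many runners reached each number of top finishes
distribution = Counter(finishes.values())
```

Here distribution plays the role of the grouped count table: its keys are numbers of top finishes and its values are numbers of runners.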

Is there anyone who has run every Jukola since 1992 (that is, 23 times)?

runners.groupby(["runner"]).year.count().reset_index().sort_values("year", ascending=False)[0:10]

Hmm, missing names obviously end up at the top, along with namesakes. Let's also group by team name and team number.

masters = runners.groupby(["runner", "name", "teamnro"]).year.count().reset_index()

masters[masters.year == 23]
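
Why the extra grouping keys matter: counting by name alone merges different people who share a name, while a key of (runner, team, team number) keeps them apart, at the cost of splitting anyone who changed clubs or team numbers. A sketch with invented rows, using a placeholder name:

```python
from collections import Counter

# Hypothetical (runner, team, teamnro, year) rows:
# two different people who happen to share a name
rows = [
    ("Matti Meikäläinen", "Club A", 1, 2012),
    ("Matti Meikäläinen", "Club A", 1, 2013),
    ("Matti Meikäläinen", "Club B", 3, 2013),
    ("Matti Meikäläinen", "Club B", 3, 2014),
]

# Counting by name only: 4 starts in 3 years, which one person cannot do
by_name = Counter(runner for runner, team, nro, year in rows)

# Counting by name, team and team number separates the namesakes
by_name_and_team = Counter((runner, team, nro) for runner, team, nro, year in rows)
```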

Let's find those who made it to the top-100.

masters_top100 = runners[runners.place <= 100].groupby(["runner", "name", "teamnro"]).year.count().reset_index()

masters_top100.sort_values("year", ascending=False)[0:10]

Which Jukola was the hardest?

We all know the answer, but let's prove it with data.

Remember what we have.

bdf.head()

A bit of preprocessing.

# remove teams that did not finish, sorry guys.
bdf = bdf[bdf.tsecs != ""].copy()  # copy, so the new column is not written to a slice
bdf["hour"] = bdf.tsecs.astype(int) / 60 / 60 - 1
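
As a quick check of the unit conversion, here is a small sketch that turns tsecs into decimal hours and back into the h:mm:ss format of the result column. The helper names to_hours and to_hms are made up; the sample value is the 1992 winner's tsecs of 29061, listed with the result 8:04:21 in the bdf.head() output. The expression in the post also subtracts 1 from the hours; that offset is left out here because its purpose is not stated.

```python
def to_hours(tsecs):
    # Decimal hours, convenient for a histogram axis
    return tsecs / 3600.0

def to_hms(tsecs):
    # h:mm:ss string, the format of the "result" column
    total = int(tsecs)
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return "%d:%02d:%02d" % (hours, minutes, seconds)

winner_1992 = 29061  # tsecs of the 1992 winner

hours_1992 = to_hours(winner_1992)
result_1992 = to_hms(winner_1992)
```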

What I want to plot is the distribution of finishing times per year. The more it is skewed to the right, the longer teams needed to complete the relay.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="darkgrid")
sns.set_context(rc={"figure.figsize": (8, 4)})

g = sns.FacetGrid(bdf, row="year", size=1, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(plt.hist, "hour", color="steelblue", bins=bins, lw=0, alpha=0.8)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()


Let's play with the plots and add a probability density estimate.

g = sns.FacetGrid(bdf, row="year", size=1.2, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(sns.distplot, "hour", color="steelblue", bins=bins)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()


Perhaps we had better try boxplots.

g = sns.FacetGrid(bdf, row="year", size=1.2, aspect=2, xlim=(0, 25), row_order=valid_years[::-1])
bins = np.linspace(0, 25, 200)
g.map(sns.boxplot, "hour", color="steelblue", vert=False)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([])
plt.xticks([6, 12, 18, 24])
plt.plot()


Kytäjä-Jukola (2010), as expected, seems to be the hardest.

Speaking of the fastest Jukolas, I remember that Jämi-Jukola in 2004 was held around the airfield area, and it was full speed.

The plots show that the fastest Jukola (by median time to complete the race) seems to be the 1994 Pyhä-Luosto Jukola, which took place 50 km north of the Arctic Circle.
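
Reading medians off stacked boxplots is a bit error-prone, so the same comparison can also be done numerically. The sketch below ranks years by median finishing time using only the standard library; the year labels "A" and "B" and all times are invented, so it says nothing about the real ranking.

```python
import statistics
from collections import defaultdict

# Hypothetical (year label, finishing time in hours) pairs
times = [
    ("A", 7.5), ("A", 9.0), ("A", 10.5), ("A", 12.0), ("A", 16.0),
    ("B", 8.0), ("B", 10.0), ("B", 12.5), ("B", 15.0), ("B", 21.0),
]

by_year = defaultdict(list)
for year, hours in times:
    by_year[year].append(hours)

# Median finishing time per year: the line in the middle of each box
medians = {year: statistics.median(vals) for year, vals in by_year.items()}

# Years ordered from fastest to slowest by median
ranking = sorted(medians, key=medians.get)
```

With the bdf frame from this post, the equivalent would presumably be a groupby on year followed by a median of the hour column.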

Time to get to Top-100

Let's see what you need to get to the Top-100.

top100 = bdf[bdf.place <= 100].copy()  # copy, so adding a column does not write to a slice of bdf
top100["delta"] = top100.groupby("year").hour.transform(lambda x: x - x.min())
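
What the groupby/transform line computes is each team's gap to the fastest time of the same year. A plain-Python sketch of that logic, with invented year labels and times:

```python
# Hypothetical (year label, place, finishing time in hours) rows
rows = [
    ("A", 1, 7.0), ("A", 2, 7.1), ("A", 100, 7.75),
    ("B", 1, 8.0), ("B", 2, 8.5), ("B", 100, 9.5),
]

# Fastest time per year, i.e. the winner's time
best = {}
for year, place, hours in rows:
    best[year] = min(hours, best.get(year, hours))

# Gap to the winner for every team, in hours
deltas = [(year, place, hours - best[year]) for year, place, hours in rows]

# Gap of the 100th team, converted to whole minutes
gap_100_min = {year: round(delta * 60) for year, place, delta in deltas if place == 100}
```

Multiplying the delta by 60 gives minutes, which is the unit that is easiest to compare between years.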

Here is the time the 100th team lost to the winner.

sns.set_context(rc={"figure.figsize": (12, 4)})
sns.barplot("year", "delta", data=top100[top100.place==100])
plt.ylabel("delta (hr)")
plt.plot()


Wow, in 1992 the Top-100 teams were packed into 40 minutes!

Backup plots

sns.set_context(rc={"figure.figsize": (8, 4)})
g = sns.FacetGrid(top100, row="year", size=1.2, aspect=2.5, xlim=(0, 3), row_order=valid_years[::-1])
bins = np.linspace(0, 3, 200)
g.map(plt.hist, "delta", color="steelblue", bins=bins, lw=0, alpha=0.8)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.yticks([10])
plt.xticks([1, 2, 3])
plt.plot()


g = sns.FacetGrid(top100, row="year", size=1.2, aspect=2.5, xlim=(0, 3), row_order=valid_years[::-1])
g.map(sns.boxplot, "delta", color="steelblue", vert=False)
g.set_titles(template="{row_name}")
plt.xlabel("")
plt.xticks([0, 1, 2, 3])
plt.plot()


	name	teamnro	place	teamid	result	tsecs	members	year
29	Kalevan Rasti	1	30	11	8:21:12	30072	Janne Märkälä,Jussi Silvennoinen,Arto Muhonen,...	1992
29	Kalevan Rasti	1	30	19	8:22:40	30160	Jari Heikkinen,Jussi Silvennoinen,Arto Muhonen...	1993
25	Kalevan Rasti	1	26	30	7:29:03	26943	Jari Heikkinen,Pekka Inkeri,Mika Elomäki,Antti...	1994
29	Kalevan Rasti	1	30	26	9:25:47	33947	Petri Vainio,Marko Pitkänen,Arto Muhonen,Antti...	1995
14	Kalevan Rasti	1	15	30	8:00:32	28832	Petri Vainio,Harri Romppanen,Arto Muhonen,Raun...	1996
14	Kalevan Rasti	1	15	15	8:17:49	29869	Ville Halonen,Jouni Mäkäräinen,Arto Muhonen,Ju...	1997
11	Kalevan Rasti	1	12	15	8:19:24	29964	Jari Heikkinen,Arto Muhonen,Harri Romppanen,Ra...	1998
4	Kalevan Rasti	1	5	12	7:41:17	27677	Petri Vainio,Jouni Mäkäräinen,Harri Romppanen,...	1999
3	Kalevan Rasti	1	4	61	7:30:37	27037	Petri Vainio,Antti Tolonen,Jari Heikkinen,Thie...	2001
1	Kalevan Rasti	1	2	4	7:47:38	28058	Antti Tolonen,Tommi Tölkkö,Mikael Boström,Petr...	2002
2	Kalevan Rasti	1	3	2	7:40:58	27658	Mika Venho,Petri Vainio,Tommi Tölkkö,Mikael Bo...	2003
0	Kalevan Rasti	1	1	3	6:51:51	24711	Mikael Boström,Samuli Launiainen,Tommi Tölkkö,...	2004
0	Kalevan Rasti	1	1	1	7:41:17	27677	Harri Romppanen,Mikael Boström,Tommi Tölkkö,Ha...	2005
1	Kalevan Rasti	1	2	1	8:18:50	29930	Harri Romppanen,Hannu Airila,Tommi Tölkkö,Samu...	2006
0	Kalevan Rasti	1	1	2	7:44:16	27856	Philippe Adamski,Samuli Launiainen,Tommi Tölkk...	2007
5	Kalevan Rasti	1	6	1	8:21:55	30115	Antti Nurmonen,Miika Hernelahti,Tommi Tölkkö,H...	2008
1	Kalevan Rasti	1	2	6	8:03:06	28986	Jan Prochazka,Tommi Tölkkö,Hannu Airila,Miika ...	2009
1	Kalevan Rasti	1	2	2	8:39:49	31189	Jan Prochazka,Tommi Tölkkö,Hannu Airila,Simo M...	2010
1	Kalevan Rasti	1	2	2	7:41:31	27691	Jan Prochazka,Hannu Airila,Thierry Gueorgiou,S...	2011
0	Kalevan Rasti	1	1	2	7:56:02	28562	Kiril Nikolov,Jarkko Huovila,Simo-Pekka Fincke...	2012
0	Kalevan Rasti	1	1	1	7:27:58	26878	Jere Pajunen,Simo-Pekka Fincke,Philippe Adamsk...	2013
0	Kalevan Rasti	1	1	1	7:59:01	28741	Jere Pajunen,Philippe Adamski,Aaro Asikainen,H...	2014

	name	teamnro	place	teamid	result	tsecs	members	year
251	Northern Wind	1	252	493	10:07:31	36451	Alexander Khramov,Pavel Kalaidin,Vladimir Serg...	2003
195	Northern Wind	1	196	252	8:27:35	30455	Alexander Khramov,Pavel Kalaidin,Dmitriyl Somo...	2004
260	Northern Wind	1	261	196	10:13:43	36823	Artem Lebedev,Fedor Krasnienko,Viacheslav Krug...	2005
166	Northern Wind	1	167	261	10:44:26	38666	Artem Lebedev,Sergey Petrov,Viacheslav Kruglov...	2006
167	Northern Wind	1	168	167	10:13:37	36817	Dmitriy Sorokin,Pavel Petrov,Viacheslav Kruglo...	2007
183	Northern Wind	1	184	168	10:38:09	38289	Petr Zaslonkin,Pavel Petrov,Georgy Mavchun,Dmi...	2008
132	Northern Wind	1	133	184	9:52:38	35558	Dmitriy Sorokin,Pavel Petrov,Petr Zaslonkin,Al...	2009
169	Northern Wind	1	170	133	11:52:21	42741	Dmitriy Sorokin,Pavel Petrov,Viacheslav Kruglo...	2010
213	Northern Wind	1	214	170	10:18:08	37088	Georgiy Mavchun,Dmitriy Sorokin,Pavel Petrov,V...	2011
294	Northern Wind	1	295	214	11:32:34	41554	Dmitriy Sorokin,Pavel Petrov,Andrey Fershalov,...	2012
382	Northern Wind	1	383	295	10:56:27	39387	Dmitriy Sorokin,Pavel Petrov,Pavel Kalaidin,De...	2013
145	Northern Wind	1	146	383	10:25:03	37503	Dmitriy Sorokin,Georgiy Mavchun,Slava Kruglov,...	2014

	name	teamnro	runner
year
1	720	720	720
2	228	228	228
3	103	103	103
4	51	51	51
5	28	28	28
6	20	20	20
7	14	14	14
8	6	6	6
9	8	8	8
10	4	4	4
11	2	2	2
12	1	1	1
13	1	1	1

	name	teamnro	runner	year
510	Kalevan Rasti	1	Thierry Gueorgiou	13
508	Kalevan Rasti	1	Simo Martomaa	12
202	Halden SK	1	Mats Haldin	11
120	Delta	1	Valentin Novikov	11
492	Kalevan Rasti	1	Hannu Airila	10
215	Halden SK	1	Tore Sandvik	10
1158	Vaajakosken Terä	1	Jonne Lakanen	10
184	Halden SK	1	Emil Wingstedt	10
861	Paimion Rasti	1	Markus Lindeqvist	9
104	Delta	1	Leonid Novikov	9
108	Delta	1	Miika Hernelahti	9
916	Rajamäen Rykmentti	1	Marko Väisänen	9
512	Kalevan Rasti	1	Tommi Tölkkö	9
185	Halden SK	1	Erik Axelsson	9
1153	Vaajakosken Terä	1	Antti Anttonen	9
1156	Vaajakosken Terä	1	Jani Lakanen	9
1177	Vehkalahden Veikot	1	Tero Föhr	8
867	Paimion Rasti	1	Teemu Väre	8
1125	Turun Suunnistajat	1	Janne Salmi	8
1159	Vaajakosken Terä	1	Jouni Kahelin	8

	runner	year
0		2939
43993	\|	131
4	.	79
23708	Markku Virtanen	72
13222	Jari Järvinen	60
39141	Timo Hämäläinen	59
22803	Magnus Johansson	57
36722	Stefan Karlsson	56
39233	Timo Korhonen	53
30910	Pekka Valtonen	52

	runner	name	teamnro	year
5700	Antti Puro	Turun NMKY	1	23
18376	Hannu Lavonen	Miehikkälän Vilkas	1	23
20443	Heikki Hietalahti	Halsuan Toivo	1	23
21335	Heikki Vallbacka	Kortesjärven Järvi-Veikot	1	23
35870	Juha Lerssi	Kaustisen Pohjan-Veikot	1	23
39725	Jussi Elo	Laitilan Jyske	1	23
51420	Markku Nieminen	Pälkäneen Lukko	1	23
56297	Matti Yliluikki	Suomusjärven Sisu	1	23
71056	Petri Kauppinen	Halsuan Toivo	1	23
74088	Reijo Kuoppala	Nastolan Terä	1	23

	runner	name	teamnro	year
5836	Petteri Laitinen	Kangasala SK	1	20
2460	Janne Salmi	Turun Suunnistajat	1	20
6593	Sören Nymalm	Pargas IF	1	18
3873	Lasse Jansson	Gävle OK	1	17
4269	Markus Lindeqvist	Paimion Rasti	1	16
4242	Marko Väisänen	Rajamäen Rykmentti	1	16
5737	Peter Öberg	OK Hällen	1	15
2469	Janne Weckman	Vehkalahden Veikot	1	15
3065	Jostein Andersen	Kristiansand OK	1	15
3174	Juha Peltola	MS Parma	1	15

	name	teamnro	place	teamid	result	tsecs	members	year
0	IK Hakarpspojkarna	1	1	16	8:04:21	29061	Andreas Rangert,Bo Granstedt,Fridolf Fskilsson...	1992
1	Rajamäen Rykmentti	1	2	15	8:04:48	29088	Marko Väisänen,Ari Enroth,Jouni Mähönen,Petri ...	1992
2	IFK Södertälje	1	3	1	8:05:00	29100	Lennart Olsson,Thomas Andersson,Kari Enckell,M...	1992
3	Sundsvalls OK	1	4	9	8:08:23	29303	Thomas Rex,Lars Nordin,Ulrik Olsson,Lars Pette...	1992
4	Angelniemen Ankkuri	1	5	2	8:08:27	29307	Petteri Kähäri,Tero Heikkilä,Ari Kattainen,Mik...	1992