sun, 09-sep-2018, 10:54

# Introduction

In previous posts (Fairbanks Race Predictor, Equinox from Santa Claus, Equinox from Gold Discovery) I’ve looked at predicting Equinox Marathon results based on results from earlier races. In all those cases I’ve looked at single race comparisons: how results from Gold Discovery can predict Marathon times, for example. In this post I’ll look at all the Usibelli Series races I completed this year to see how they can inform my expectations for next Saturday’s Equinox Marathon.

# Methods

I’ve been collecting the results from all Usibelli Series races since 2010. Using that data, grouped by the name of the person racing and year, I found all runners who completed the same set of Usibelli Series races that I finished in 2018, along with their Equinox Marathon finish pace. Between 2010 and 2017 there are 160 records that match.

The data looks like this. crr is that person’s Chena River Run pace in minutes per mile, msr is Midnight Sun Run pace for the same person and year, rotv is the pace from Run of the Valkyries, gdr is the Gold Discovery Run, and em is Equinox Marathon pace for that same person and year.

| crr | msr | rotv | gdr | em |
|---|---|---|---|---|
| 8.1559 | 8.8817 | 8.1833 | 10.2848 | 11.8683 |
| 8.7210 | 9.1387 | 9.2120 | 11.0152 | 13.6796 |
| 8.7946 | 9.0640 | 9.0077 | 11.3565 | 13.1755 |
| 9.4409 | 10.6091 | 9.6250 | 11.2080 | 13.1719 |
| 7.3581 | 7.1836 | 7.1310 | 8.0001 | 9.6565 |
| 7.4731 | 7.5349 | 7.4700 | 8.2465 | 9.8359 |
| ... | ... | ... | ... | ... |
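
The matching step described above (keep only runner-years that cover the full set of races) can be sketched in Python. This is a minimal illustration, not the code used for the analysis, which is in R at the end of the post. The function name `complete_sets`, the tuple layout, and the runners in `rows` are all made up for the example.

```python
from collections import defaultdict

# Races I finished this season plus the race to predict,
# using the short column names from the table above.
REQUIRED = {"crr", "msr", "rotv", "gdr", "em"}


def complete_sets(results, required=REQUIRED):
    """Return {(name, year): {race: pace}} for runner-years covering `required`.

    `results` is an iterable of (name, year, race, pace) tuples. A runner-year
    with a duplicated race entry is dropped, because two people sharing a name
    can't be told apart.
    """
    by_runner_year = defaultdict(dict)
    duplicated = set()
    for name, year, race, pace in results:
        if race not in required:
            continue
        key = (name, year)
        if race in by_runner_year[key]:
            duplicated.add(key)
        by_runner_year[key][race] = pace
    return {
        key: paces
        for key, paces in by_runner_year.items()
        if key not in duplicated and set(paces) == set(required)
    }


# Hypothetical runners and paces, for illustration only
rows = [
    ("runner a", 2016, "crr", 8.16), ("runner a", 2016, "msr", 8.88),
    ("runner a", 2016, "rotv", 8.18), ("runner a", 2016, "gdr", 10.28),
    ("runner a", 2016, "em", 11.87),
    # runner b skipped two races, so this year is not a match
    ("runner b", 2016, "crr", 7.36), ("runner b", 2016, "gdr", 8.00),
    ("runner b", 2016, "em", 9.66),
    ("runner a", 2017, "crr", 8.30), ("runner a", 2017, "em", 12.10),
]

matched = complete_sets(rows)
print(sorted(matched))  # [('runner a', 2016)]
```

Each surviving runner-year becomes one row of the table, with one pace column per race.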

I will use two methods to predict Equinox Marathon times from these records: multivariate linear regression and Random Forest.

The R code for the analysis appears at the end of this post.

# Results

## Linear regression

We start with linear regression, which isn’t entirely appropriate for this analysis because the independent variables (pre-Equinox race pace times) aren’t really independent of one another. A person who runs a 6 minute pace in the Chena River Run is likely to also be someone who runs Gold Discovery faster than the average runner. This relationship, in fact, is the basis for this analysis.

I started with a model that includes all the races I completed in 2018, but pace time for the Midnight Sun Run wasn’t statistically significant so I removed it from the final model, which included Chena River Run, Run of the Valkyries, and Gold Discovery.

This model is significant, as are all the coefficients except the intercept, and the model explains nearly 80% of the variation in the data:

```
##
## Call:
## lm(formula = em ~ crr + gdr + rotv, data = input_pivot)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.8837 -0.6534 -0.2265  0.3549  5.8273
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   0.6217     0.5692   1.092 0.276420
## crr          -0.3723     0.1346  -2.765 0.006380 **
## gdr           0.8422     0.1169   7.206 2.32e-11 ***
## rotv          0.7607     0.2119   3.591 0.000442 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.278 on 156 degrees of freedom
## Multiple R-squared:  0.786,  Adjusted R-squared:  0.7819
## F-statistic:   191 on 3 and 156 DF,  p-value: < 2.2e-16
```

Using this model and my 2018 results, my overall pace and finish times for Equinox are predicted to be 10:45 and 4:41:50. The 95% confidence intervals for these predictions are 10:30–11:01 and 4:35:11–4:48:28.
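
To make the fitted model concrete, here is a small Python sketch that applies the coefficients from the summary above to the first row of the sample data table. It only illustrates how a prediction is assembled from the coefficients; the helper names `predict_em_pace` and `format_pace` are my own and are not part of the analysis code.

```python
# Coefficients copied from the lm() summary above
COEF = {"intercept": 0.6217, "crr": -0.3723, "gdr": 0.8422, "rotv": 0.7607}


def predict_em_pace(crr, gdr, rotv, coef=COEF):
    """Predicted Equinox pace (minutes per mile) from three earlier race paces."""
    return (coef["intercept"]
            + coef["crr"] * crr
            + coef["gdr"] * gdr
            + coef["rotv"] * rotv)


def format_pace(minutes):
    """Format decimal minutes per mile as M:SS, rounded to the nearest second."""
    mm, ss = divmod(round(minutes * 60), 60)
    return f"{mm}:{ss:02d}"


# First row of the sample data: crr, rotv, gdr, and the actual em pace
predicted = predict_em_pace(crr=8.1559, gdr=10.2848, rotv=8.1833)
actual = 11.8683

print(format_pace(predicted))  # 12:28
print(format_pace(actual))     # 11:52
```

For that runner the model is about 36 seconds per mile too slow, which is well inside the residual standard error of 1.278 minutes per mile.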

## Random Forest

Random Forest is another regression method but it doesn’t require independent variables be independent of one another. Here are the results of building 5,000 random trees from the data:

```
##
## Call:
##  randomForest(formula = em ~ ., data = input_pivot, ntree = 5000)
##                Type of random forest: regression
##                      Number of trees: 5000
## No. of variables tried at each split: 1
##
##           Mean of squared residuals: 1.87325
##                     % Var explained: 74.82

##      IncNodePurity
## crr       260.8279
## gdr       321.3691
## msr       268.0936
## rotv      295.4250
```

This model, which includes all race results, explains just under 75% of the variation in the data. You can see from the importance result that Gold Discovery results factor more heavily in the result than earlier races in the season like Chena River Run and the Midnight Sun Run.

Using this model, my predicted pace is 10:13 and my finish time is 4:27:46. The 95% confidence intervals are 9:23–11:40 and 4:05:58–5:05:34. You’ll notice that the confidence intervals are wider than with linear regression, probably because there are fewer assumptions with Random Forest and less power.
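
The Random Forest interval comes from the spread of the individual tree predictions: the appendix code takes the 2.5% and 97.5% quantiles across all 5,000 trees. A minimal Python sketch of that idea follows. The function names are mine, the tree predictions are made up, and the quantile uses linear interpolation between order statistics, which is what R's `quantile()` does by default.

```python
def quantile(values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def forest_interval(tree_predictions, level=0.95):
    """Lower and upper bounds covering `level` of the per-tree predictions."""
    tail = (1 - level) / 2
    return quantile(tree_predictions, tail), quantile(tree_predictions, 1 - tail)


# Hypothetical per-tree pace predictions (minutes per mile)
tree_predictions = [9.4, 9.8, 10.0, 10.1, 10.2, 10.3, 10.5, 10.9, 11.6]
low, high = forest_interval(tree_predictions)
print(low, high)
```

An interval built this way describes how much the trees disagree with each other, so it is not directly comparable to the confidence interval from the linear model.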

# Conclusion

My number one goal for this year’s Equinox Marathon is simply to finish without injuring myself, something I wasn’t able to do the last time I ran the whole race in 2013. I finished in 4:49:28 with an overall pace of 11:02, but the race or my training for it resulted in a torn hip labrum.

If I’m able to finish uninjured, I’d like to beat my time from 2013. These results suggest I should have no problem achieving that second goal. Knowing how much faster these predictions are than my 2013 time, I can perhaps race conservatively and still get a personal best.

# Appendix - R code

```
library(tidyverse)
library(RPostgres)
library(lubridate)
library(glue)
library(randomForest)
library(knitr)

races <- dbConnect(Postgres(),
host = "localhost",
dbname = "races")

all_races <- races %>%
tbl("all_races")

usibelli_races <- tibble(race = c("Chena River Run",
"Midnight Sun Run",
"Jim Loftus Mile",
"Run of the Valkyries",
"Gold Discovery Run",
"Santa Claus Half Marathon",
"Golden Heart Trail Run",
"Equinox Marathon"))

css_2018 <- all_races %>%
inner_join(usibelli_races, copy = TRUE) %>%
filter(year == 2018,
name == "Christopher Swingley") %>%
collect()

candidate_races <- css_2018 %>%
select(race) %>%
bind_rows(tibble(race = c("Equinox Marathon")))

input_data <- all_races %>%
inner_join(candidate_races, copy = TRUE) %>%
filter(!is.na(gender), !is.na(birth_year)) %>%
collect()

input_pivot <- input_data %>%
    group_by(race, name, year) %>%
    mutate(n = n()) %>%
    filter(n == 1) %>%
    ungroup() %>%
    select(name, year, race, pace_min) %>%
    # one row per runner and year, one pace column per race
    spread(race, pace_min) %>%
    rename(crr = `Chena River Run`,
           msr = `Midnight Sun Run`,
           rotv = `Run of the Valkyries`,
           gdr = `Gold Discovery Run`,
           em = `Equinox Marathon`) %>%
    filter(!is.na(crr), !is.na(msr), !is.na(rotv),
           !is.na(gdr), !is.na(em)) %>%
    select(-c(name, year))

css_2018_pivot <- css_2018 %>%
    select(name, year, race, pace_min) %>%
    # pivot my 2018 results into the same one-column-per-race layout
    spread(race, pace_min) %>%
    rename(crr = `Chena River Run`,
           msr = `Midnight Sun Run`,
           rotv = `Run of the Valkyries`,
           gdr = `Gold Discovery Run`) %>%
    select(-c(name, year))

pace <- function(minutes) {
mm = floor(minutes)
seconds = (minutes - mm) * 60

glue('{mm}:{sprintf("%02.0f", seconds)}')
}

finish_time <- function(minutes) {
hh = floor(minutes / 60.0)
min = minutes - (hh * 60)
mm = floor(min)
seconds = (min - mm) * 60

glue('{hh}:{sprintf("%02d", mm)}:{sprintf("%02.0f", seconds)}')
}

lm_model <- lm(em ~ crr + gdr + rotv,
data = input_pivot)

summary(lm_model)

prediction <- predict(lm_model, css_2018_pivot,
interval = "confidence", level = 0.95)

prediction

rf <- randomForest(em ~ .,
data = input_pivot,
ntree = 5000)
rf
importance(rf)

rfp_all <- predict(rf, css_2018_pivot, predict.all = TRUE)

rfp_all$aggregate

rf_ci <- quantile(rfp_all$individual, c(0.025, 0.975))

rf_ci
```
sat, 29-oct-2016, 21:14

Equinox Marathon Relay leg 2, 2016

# Introduction

A couple years ago I compared racing data between two races (Gold Discovery and Equinox, Santa Claus and Equinox) in the same season for all runners that ran in both events. The result was an estimate of how fast I might run the Equinox Marathon based on my times for Gold Discovery and the Santa Claus Half Marathon.

Several years have passed. I’ve run more races and collected more racing data for all the major Fairbanks races, and I wanted to run the same analysis for all combinations of races.

# Data

The data comes from a database I’ve built of race times for all competitors, mostly coming from the results available from Chronotrack, but including some race results from SportAlaska.

We started by loading the required R packages and reading in all the racing data, a small subset of which looks like this.

| race | year | name | finish_time | birth_year | sex |
|---|---|---|---|---|---|
| Beat Beethoven | 2015 | thomas mcclelland | 00:21:49 | 1995 | M |
| Equinox Marathon | 2015 | jennifer paniati | 06:24:14 | 1989 | F |
| Equinox Marathon | 2014 | kris starkey | 06:35:55 | 1972 | F |
| Midnight Sun Run | 2014 | kathy toohey | 01:10:42 | 1960 | F |
| Midnight Sun Run | 2016 | steven rast | 01:59:41 | 1960 | M |
| Equinox Marathon | 2013 | elizabeth smith | 09:18:53 | 1987 | F |
| ... | ... | ... | ... | ... | ... |

Next we loaded in the names and distances of the races and combined this with the individual racing data. The data from Chronotrack doesn’t include the mileage and we will need that to calculate pace (minutes per mile).

My database doesn’t have complete information about all the racers that competed, and in some cases the information for a runner in one race conflicts with the information for the same runner in a different race. In order to resolve this, we generated a list of runners, grouped by their name, and threw out racers where their name matches but their gender was reported differently from one race to the next. Please understand we’re not doing this to exclude those who have changed their gender identity along the way, but to eliminate possible bias from data entry mistakes.
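
The appendix code resolves these conflicts by counting how often each gender is reported for a name, dropping the name when the counts tie, and otherwise keeping the majority value. Here is a rough Python sketch of that rule. The function name `resolve_gender` and the runner names are hypothetical, and it assumes records arrive as simple (name, sex) pairs.

```python
from collections import Counter, defaultdict


def resolve_gender(records):
    """Map each name to its majority reported gender, dropping exact ties.

    `records` is an iterable of (name, sex) pairs where sex is 'M', 'F',
    or None when the race results didn't report it.
    """
    counts = defaultdict(Counter)
    for name, sex in records:
        counts[name][sex] += 1

    resolved = {}
    for name, tally in counts.items():
        if tally["M"] == tally["F"]:
            # Tied (or never reported): can't pick a value, so drop the runner
            continue
        resolved[name] = "M" if tally["M"] > tally["F"] else "F"
    return resolved


# Hypothetical records for illustration
records = [
    ("pat doe", "M"), ("pat doe", "M"), ("pat doe", "F"),  # majority M
    ("sam roe", "M"), ("sam roe", "F"),                    # tie, dropped
    ("lee poe", "F"), ("lee poe", None),                   # F
]
print(resolve_gender(records))  # {'pat doe': 'M', 'lee poe': 'F'}
```

A stricter variant would drop every name that ever appears with both values; the tie rule keeps more data at the cost of trusting the majority.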

Finally, we combined the racers with the individual racing data, substituting our corrected runner information for what appeared in the individual race’s data. We also calculated minutes per mile (pace) and the age of the runner during the year of the race (age). Because we’re assigning a birth year to the minimum reported year from all races, our age variable won’t change during the running season, which is closer to the way age categories are calculated in Europe. Finally, we removed results where pace was greater than 20 minutes per mile for races longer than ten miles, and greater than 16 minute miles for races less than ten miles. These are likely to be outliers, or competitors not running the race.

| name | birth_year | gender | race_str | year | miles | minutes | pace | age |
|---|---|---|---|---|---|---|---|---|
| aaron austin | 1983 | M | midnight_sun_run | 2014 | 6.2 | 50.60 | 8.16 | 31 |
| aaron bravo | 1999 | M | midnight_sun_run | 2013 | 6.2 | 45.26 | 7.30 | 14 |
| aaron bravo | 1999 | M | midnight_sun_run | 2014 | 6.2 | 40.08 | 6.46 | 15 |
| aaron bravo | 1999 | M | midnight_sun_run | 2015 | 6.2 | 36.65 | 5.91 | 16 |
| aaron bravo | 1999 | M | midnight_sun_run | 2016 | 6.2 | 36.31 | 5.85 | 17 |
| aaron bravo | 1999 | M | spruce_tree_classic | 2014 | 6.0 | 42.17 | 7.03 | 15 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
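
The pace calculation and the outlier cutoffs described above are simple enough to sketch in Python. This is only an illustration of the rules in the text (the real processing is in the R appendix). The helper names are mine, and the 26.2 mile marathon distance is an assumption, since the race distance table isn't shown in the post.

```python
def to_minutes(finish_time):
    """Convert 'HH:MM:SS' (or 'MM:SS') to minutes."""
    parts = [float(p) for p in finish_time.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)  # treat a missing hours field as zero
    hours, minutes, seconds = parts
    return hours * 60 + minutes + seconds / 60


def keep_result(miles, pace):
    """Apply the cutoffs: under 20 min/mile for 10+ mile races, under 16 otherwise."""
    limit = 20 if miles >= 10 else 16
    return pace < limit


# Finish times taken from the raw data table earlier in the post
msr_pace = to_minutes("01:10:42") / 6.2    # Midnight Sun Run, 6.2 miles
em_pace = to_minutes("09:18:53") / 26.2    # Equinox Marathon, assumed 26.2 miles

print(round(msr_pace, 2), keep_result(6.2, msr_pace))   # 11.4 True
print(round(em_pace, 2), keep_result(26.2, em_pace))    # 21.33 False
```

The 9:18:53 Equinox finish works out to more than 21 minutes per mile, so it falls on the wrong side of the 20 minute cutoff and is removed.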

We combined all available results for each runner in all years they participated such that the resulting rows are grouped by runner and year and columns are the races themselves. The values in each cell represent the pace for the runner × year × race combination.

For example, here’s the first six rows for runners that completed Beat Beethoven and the Chena River Run in the years I have data. I also included the column for the Midnight Sun Run in the table, but the actual data has a column for all the major Fairbanks races. You’ll see that two of the six runners listed ran BB and CRR but didn’t run MSR in that year.

| name | gender | age | year | beat_beethoven | chena_river_run | midnight_sun_run |
|---|---|---|---|---|---|---|
| aaron schooley | M | 36 | 2016 | 8.19 | 8.15 | 8.88 |
| abby fett | F | 33 | 2014 | 10.68 | 10.34 | 11.59 |
| abby fett | F | 35 | 2016 | 11.97 | 12.58 | NA |
| abigail haas | F | 11 | 2015 | 9.34 | 8.29 | NA |
| abigail haas | F | 12 | 2016 | 8.48 | 7.90 | 11.40 |
| aimee hughes | F | 43 | 2015 | 11.32 | 9.50 | 10.69 |
| ... | ... | ... | ... | ... | ... | ... |

With this data, we built a whole series of linear models, one for each race combination. We created a series of formula strings and objects for all the combinations, then executed them using map(). We combined the start and predicted race names with the linear models, and used glance() and tidy() from the broom package to turn the models into statistics and coefficients.
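
The formula strings are just every ordered pairing of an earlier race with a later one. Here is a short Python sketch of that step using `itertools.combinations`. The race list is copied from the appendix, while the function name `race_formulas` is made up for this example.

```python
from itertools import combinations

# Race order copied from the R appendix
MAIN_RACES = [
    "beat_beethoven", "chena_river_run", "midnight_sun_run",
    "run_of_the_valkyries", "gold_discovery_run",
    "santa_claus_half_marathon", "golden_heart_trail_run",
    "equinox_marathon", "hoodoo_half_marathon",
]


def race_formulas(races=MAIN_RACES):
    """Build 'later ~ earlier + gender + age' for every pair of races."""
    return [
        f"{later} ~ {earlier} + gender + age"
        for earlier, later in combinations(races, 2)
    ]


formulas = race_formulas()
print(len(formulas))   # 36
print(formulas[0])     # chena_river_run ~ beat_beethoven + gender + age
```

Nine races give 36 pairs, so 36 models are fit before any of them are reduced.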

All of the models between races were highly significant, but many of them contain coefficients that aren’t significantly different than zero. That means that including that term (age, gender or first race pace) isn’t adding anything useful to the model. We used the significance of each term to reduce our models so they only contained coefficients that were significant and regenerated the statistics and coefficients for these reduced models.
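
Reducing a model means keeping only the terms whose coefficients are significant and rebuilding the formula from what is left. The sketch below shows that idea in Python. The function name `reduced_formula` and the p-values in the example are hypothetical; the 0.05 threshold and the `genderM` to `gender` renaming follow the appendix code.

```python
def reduced_formula(predicted_race, coefficients, alpha=0.05):
    """Rebuild a model formula from the significant terms only.

    `coefficients` is a list of (term, p_value) pairs as reported for the
    full model. Returns None when no term survives.
    """
    kept = []
    for term, p_value in coefficients:
        if term == "(Intercept)" or p_value >= alpha:
            continue
        # The fitted model reports the factor level (genderM), not the variable
        kept.append("gender" if term == "genderM" else term)
    if not kept:
        return None
    return f"{predicted_race} ~ {' + '.join(kept)}"


# Hypothetical coefficient table for one full model
full_model = [
    ("(Intercept)", 0.30),
    ("run_of_the_valkyries", 1e-9),
    ("genderM", 0.40),
    ("age", 0.01),
]
print(reduced_formula("hoodoo_half_marathon", full_model))
# hoodoo_half_marathon ~ run_of_the_valkyries + age
```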

The full R code appears at the bottom of this post.

# Results

Here are the statistics from the ten best performing models (based on R²).

| start_race | predicted_race | n | R² | p-value |
|---|---|---|---|---|
| run_of_the_valkyries | golden_heart_trail_run | 40 | 0.956 | 0 |
| golden_heart_trail_run | equinox_marathon | 36 | 0.908 | 0 |
| santa_claus_half_marathon | golden_heart_trail_run | 34 | 0.896 | 0 |
| midnight_sun_run | gold_discovery_run | 139 | 0.887 | 0 |
| beat_beethoven | golden_heart_trail_run | 32 | 0.886 | 0 |
| run_of_the_valkyries | gold_discovery_run | 44 | 0.877 | 0 |
| midnight_sun_run | golden_heart_trail_run | 52 | 0.877 | 0 |
| gold_discovery_run | santa_claus_half_marathon | 111 | 0.876 | 0 |
| chena_river_run | golden_heart_trail_run | 44 | 0.873 | 0 |
| run_of_the_valkyries | santa_claus_half_marathon | 91 | 0.851 | 0 |

It’s interesting how many times the Golden Heart Trail Run appears on this list since that run is something of an outlier in the Usibelli running series because it’s the only race entirely on trails. Maybe it’s because its distance (5K) is comparable with a lot of the earlier races in the season, but because it’s on trails it matches well with the later races that are at least partially on trails like Gold Discovery or Equinox.

Here are the ten worst models.

| start_race | predicted_race | n | R² | p-value |
|---|---|---|---|---|
| midnight_sun_run | equinox_marathon | 431 | 0.525 | 0 |
| beat_beethoven | hoodoo_half_marathon | 87 | 0.533 | 0 |
| beat_beethoven | midnight_sun_run | 818 | 0.570 | 0 |
| chena_river_run | equinox_marathon | 196 | 0.572 | 0 |
| equinox_marathon | hoodoo_half_marathon | 90 | 0.584 | 0 |
| beat_beethoven | equinox_marathon | 265 | 0.585 | 0 |
| gold_discovery_run | hoodoo_half_marathon | 41 | 0.599 | 0 |
| beat_beethoven | santa_claus_half_marathon | 163 | 0.612 | 0 |
| run_of_the_valkyries | equinox_marathon | 125 | 0.642 | 0 |
| midnight_sun_run | hoodoo_half_marathon | 118 | 0.657 | 0 |

Most of these models are shorter races like Beat Beethoven or the Chena River Run predicting longer races like Equinox or one of the half marathons. Even so, each model explains more than half the variation in the data, which isn’t terrible.

# Application

Now that we have all our models and their coefficients, we can use them to make predictions of future performance. I’ve written an online calculator based on the reduced models that lets you predict your race results as you go through the running season. The calculator is here: Fairbanks Running Race Converter.

For example, I ran a 7:41 pace for Run of the Valkyries this year. Entering that, plus my age and gender, into the converter predicts an 8:57 pace for the first running of the HooDoo Half Marathon. The R² for this model was a respectable 0.71 even though only 23 runners ran both races this year (including me). My actual pace for HooDoo was 8:18, so I came in quite a bit faster than this. No wonder my knee and hip hurt after the race! Using my time from the Golden Heart Trail Run, the converter predicts a HooDoo Half pace of 8:16.2, less than a minute off my 1:48:11 finish.
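
To see how close that last prediction is, the predicted pace can be turned back into a finish time. This Python sketch does the arithmetic. The helper names are my own, and the 13.1 mile half marathon distance is an assumption because the post doesn't list race distances.

```python
HALF_MARATHON_MILES = 13.1  # assumed distance for the HooDoo Half


def pace_to_minutes(pace):
    """Convert an 'M:SS.s' pace string to decimal minutes per mile."""
    minutes, seconds = pace.split(":")
    return int(minutes) + float(seconds) / 60


def hms_to_seconds(finish_time):
    """Convert 'H:MM:SS' to whole seconds."""
    hours, minutes, seconds = (int(part) for part in finish_time.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total_seconds):
    """Format whole seconds as H:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


predicted = round(pace_to_minutes("8:16.2") * HALF_MARATHON_MILES * 60)
actual = hms_to_seconds("1:48:11")

print(format_hms(predicted))   # 1:48:20
print(predicted - actual)      # 9
```

Under that distance assumption the Golden Heart Trail Run model misses the actual finish by about nine seconds.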

# Appendix: R code

```
library(tidyverse)
library(lubridate)
library(broom)
library(scales)   # pretty_breaks()
library(viridis)  # scale_color_viridis()

races_db <- src_postgres(host="localhost", dbname="races")

combined_races <- tbl(races_db, build_sql(
"SELECT race, year, lower(name) AS name, finish_time,
year - age AS birth_year, sex
FROM chronotrack
UNION
SELECT race, year, lower(name) AS name, finish_time,
birth_year,
CASE WHEN age_class ~ 'M' THEN 'M' ELSE 'F' END AS sex
UNION
SELECT race, year, lower(name) AS name, finish_time,
NULL AS birth_year, NULL AS sex
FROM other"))

races <- tbl(races_db, build_sql(
"SELECT race,
lower(regexp_replace(race, '[ ’]', '_', 'g')) AS race_str,
date_part('year', date) AS year,
miles
FROM races"))

racing_data <- combined_races %>%
inner_join(races) %>%
filter(!is.na(finish_time))

racers <- racing_data %>%
group_by(name) %>%
summarize(races=n(),
birth_year=min(birth_year),
gender_filter=ifelse(sum(ifelse(sex=='M',1,0))==
sum(ifelse(sex=='F',1,0)),
FALSE, TRUE),
gender=ifelse(sum(ifelse(sex=='M',1,0))>
sum(ifelse(sex=='F',1,0)),
'M', 'F')) %>%
ungroup() %>%
filter(gender_filter) %>%
select(-gender_filter)

racing_data_filled <- racing_data %>%
inner_join(racers, by="name") %>%
mutate(birth_year=birth_year.y) %>%
select(name, birth_year, gender, race_str, year, miles, finish_time) %>%
group_by(name, race_str, year) %>%
mutate(n=n()) %>%
filter(!is.na(birth_year), n==1) %>%
ungroup() %>%
collect() %>%
mutate(fixed=ifelse(grepl('[0-9]+:[0-9]+:[0-9.]+', finish_time),
finish_time,
paste0('00:', finish_time)),
minutes=as.numeric(seconds(hms(fixed)))/60.0,
pace=minutes/miles,
age=year-birth_year,
age_class=as.integer(age/10)*10,
group=paste0(gender, age_class),
gender=as.factor(gender)) %>%
filter((miles<10 & pace<16) | (miles>=10 & pace<20)) %>%
select(-fixed, -finish_time, -n)

speeds_combined <- racing_data_filled %>%
    select(name, gender, age, age_class, group, race_str, year, pace) %>%
    # one row per runner and year, one pace column per race
    spread(race_str, pace)

main_races <- c('beat_beethoven', 'chena_river_run', 'midnight_sun_run',
'run_of_the_valkyries', 'gold_discovery_run',
'santa_claus_half_marathon', 'golden_heart_trail_run',
'equinox_marathon', 'hoodoo_half_marathon')

race_formula_str <-
lapply(seq(1, length(main_races)-1),
function(i)
lapply(seq(i+1, length(main_races)),
function(j) paste(main_races[[j]], '~',
main_races[[i]],
'+ gender', '+ age'))) %>%
unlist()

race_formulas <- lapply(race_formula_str, function(i) as.formula(i)) %>%
unlist()

lm_models <- map(race_formulas, ~ lm(.x, data=speeds_combined))

models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*',
'\\1',
race_formula_str),
levels=main_races),
predicted_race=factor(gsub('([^ ]+).*',
'\\1',
race_formula_str),
levels=main_races),
lm_models=lm_models) %>%
arrange(start_race, predicted_race)

model_stats <- glance(models %>% rowwise(), lm_models)
model_coefficients <- tidy(models %>% rowwise(), lm_models)

reduced_formula_str <- model_coefficients %>%
ungroup() %>%
filter(p.value<0.05, term!='(Intercept)') %>%
mutate(term=gsub('genderM', 'gender', term)) %>%
group_by(predicted_race, start_race) %>%
summarize(independent_vars=paste(term, collapse=" + ")) %>%
ungroup() %>%
transmute(reduced_formulas=paste(predicted_race, independent_vars, sep=' ~ '))

reduced_formula_str <- reduced_formula_str$reduced_formulas

reduced_race_formulas <- lapply(reduced_formula_str,
function(i) as.formula(i)) %>% unlist()

reduced_lm_models <- map(reduced_race_formulas, ~ lm(.x, data=speeds_combined))

n_from_lm <- function(model) {
summary_object <- summary(model)

summary_object$df[1] + summary_object$df[2]
}

reduced_models <- tibble(start_race=factor(gsub('.* ~ ([^ ]+).*', '\\1', reduced_formula_str),
levels=main_races),
predicted_race=factor(gsub('([^ ]+).*', '\\1', reduced_formula_str),
levels=main_races),
lm_models=reduced_lm_models) %>%
arrange(start_race, predicted_race) %>%
rowwise() %>%
mutate(n=n_from_lm(lm_models))

reduced_model_stats <- glance(reduced_models %>% rowwise(), lm_models)
reduced_model_coefficients <- tidy(reduced_models %>% rowwise(), lm_models) %>%
ungroup()

coefficients_and_stats <- reduced_model_stats %>%
inner_join(reduced_model_coefficients,
by=c("start_race", "predicted_race", "n")) %>%
select(start_race, predicted_race, n, r.squared, term, estimate)

write_csv(coefficients_and_stats,
"coefficients.csv")

make_scatterplot <- function(start_race, predicted_race) {
age_limits <- speeds_combined %>%
filter_(paste("!is.na(", start_race, ")"),
paste("!is.na(", predicted_race, ")")) %>%
summarize(min=min(age), max=max(age)) %>%
unlist()

q <- ggplot(data=speeds_combined,
aes_string(x=start_race, y=predicted_race)) +
# plasma works better with a grey background
# theme_bw() +
geom_abline(slope=1, color="darkred", alpha=0.5) +
geom_smooth(method="lm", se=FALSE) +
geom_point(aes(shape=gender, color=age)) +
scale_color_viridis(option="plasma",
limits=age_limits) +
scale_x_continuous(breaks=pretty_breaks(n=10)) +
scale_y_continuous(breaks=pretty_breaks(n=6))

svg_filename <- paste0(paste(start_race, predicted_race, sep="-"), ".svg")

height <- 9
width <- 16
resize <- 0.75

svg(svg_filename, height=height*resize, width=width*resize)
print(q)
dev.off()
}

lapply(seq(1, length(main_races)-1),
function(i)
lapply(seq(i+1, length(main_races)),
function(j)
make_scatterplot(main_races[[i]], main_races[[j]])
))
```
sun, 04-aug-2013, 09:35

How will I do?

My last blog post compared the time for the men who ran both the 2012 Gold Discovery Run and the Equinox Marathon in order to give me an idea of what sort of Equinox finish time I can expect. Here, I’ll do the same thing for the 2012 Santa Claus Half Marathon.

Yesterday I ran the half marathon, finishing in 1:53:08, which is an average pace of 8.63 / 8:38 minutes per mile. I’m recovering from a mild calf strain, so I ran the race very conservatively until I felt like I could trust my legs.

I converted the SportAlaska PDF files the same way as before, and read the data in from the CSV files. Looking at the data, there are a few outliers in this comparison as well. In addition to being outside of most of the points, they are also times that aren’t close to my expected pace, so they are less relevant for predicting my own Equinox finish. Here’s the code to remove them, and perform the linear regression:

```
combined <- combined[!(combined$sc_pace > 11.0 | combined$eq_pace > 14.5),]
model <- lm(eq_pace ~ sc_pace, data=combined)
summary(model)

Call:
lm(formula = eq_pace ~ sc_pace, data = combined)

Residuals:
Min       1Q   Median       3Q      Max
-1.08263 -0.39018  0.02476  0.30194  1.27824

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.11209    0.61948  -1.795   0.0793 .
sc_pace      1.44310    0.07174  20.115   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5692 on 45 degrees of freedom
Multiple R-squared: 0.8999,     Adjusted R-squared: 0.8977
F-statistic: 404.6 on 1 and 45 DF,  p-value: < 2.2e-16
```

There were fewer male runners in 2012 that ran both Santa Claus and Equinox, but we get similar regression statistics. The model and coefficient are significant, and the variation in Santa Claus pace times explains just under 90% of the variation in Equinox times. That’s pretty good.

Here’s a plot of the results:

As before, the blue line shows the model relationship, and the grey area surrounding it shows the 95% confidence interval around that line. This interval represents the range over which 95% of the expected values should appear. The red line is the 1:1 line. As you’d expect for a race twice as long, all the Equinox pace times are significantly slower than for Santa Claus.

There were fewer similar runners in this data set:

2012 Race Results

| Runner | DOB | Santa Claus Pace | Equinox Time | Equinox Pace |
|---|---|---|---|---|
| John Scherzer | 1972 | 8:17 | 4:49 | 11:01 |
| Greg Newby | 1965 | 8:30 | 5:03 | 11:33 |
| Trent Hubbard | 1972 | 8:31 | 4:48 | 11:00 |
This analysis predicts that I should be able to finish Equinox in just under five hours, which is pretty close to what I found when using Gold Discovery times in my last post. The model predicts a pace of 11:20 and an Equinox finish time of four hours and 57 minutes, and these results are within the range of the three similar runners listed above. Since I was running conservatively in the half marathon, and will probably try to do the same for Equinox, five hours seems like a good goal to shoot for.

sat, 27-jul-2013, 08:03

Gold Discovery Run, 2013

This spring I ran the Beat Beethoven 5K and had such a good time that I decided to give running another try. I’d tried adding running to my usual exercise routines in the past, but knee problems always sidelined me after a couple months. It’s been three months of slow increases in mileage using a marathon training plan by Hal Higdon, and so far so good.

My goal for this year, beyond staying healthy, is to participate in the 51st running of the Equinox Marathon here in Fairbanks.

One of the challenges for a beginning runner is how to pace yourself during a race and how to know what your body can handle. Since Beat Beethoven I've run in the Lulu’s 10K, the Midnight Sun Run (another 10K), and last weekend I ran the 16.5 mile Gold Discovery Run from Cleary Summit down to Silver Gulch Brewery. I completed the race in two hours and twenty-nine minutes, at a pace of 9:02 minutes per mile. Based on this performance, I should be able to estimate my finish time and pace for Equinox by comparing the times for runners that participated in the 2012 Gold Discovery and Equinox.

The first challenge is extracting the data from the PDF files SportAlaska publishes after the race. I found that opening the PDF result files, selecting all the text on each page, and pasting it into a text file is the best way to preserve the formatting of each line. Then I process it through a Python function that extracts the bits I want:

```
import re


# The function header was missing from this listing; the name is assumed.
def parse_line(line):
    """ lines appear to contain:
        place, bib, name, town (sometimes missing), state (sometimes missing),
        birth_year, age_class, class_place, finish_time, off_win, pace,
        points (often missing) """
    fields = line.split()
    place = int(fields.pop(0))
    bib = int(fields.pop(0))
    # The name runs until a token that is all capitals (the last name)
    name = fields.pop(0)
    while True:
        n = fields.pop(0)
        name = '{} {}'.format(name, n)
        if re.search('^[A-Z.-]+$', n):
            break
    # Collect town and state tokens until the four-digit birth year
    pre_birth_year = []
    pre_birth_year.append(fields.pop(0))
    while True:
        try:
            f = fields.pop(0)
        except IndexError:
            print("Warning: couldn't parse: '{0}'".format(line.strip()))
            break
        else:
            if re.search('^[0-9]{4}$', f):
                birth_year = int(f)
                break
            else:
                pre_birth_year.append(f)
    if re.search('^[A-Z]{2}$', pre_birth_year[-1]):
        state = pre_birth_year[-1]
        town = ' '.join(pre_birth_year[:-1])
    else:
        state = None
        town = None
    try:
        (age_class, class_place, finish_time, off_win, pace) = fields[:5]
        class_place = int(class_place[1:-1])
        finish_minutes = time_to_min(finish_time)
        fpace = strpace_to_fpace(pace)
    except ValueError:
        print("Warning: couldn't parse: '{0}', skipping".format(
            line.strip()))
        return None
    else:
        return (place, bib, name, town, state, birth_year, age_class,
                class_place, finish_time, finish_minutes, off_win,
                pace, fpace)
```

The function uses a couple of helper functions that convert pace and time strings into floating point numbers, which are easier to analyze.

```
def strpace_to_fpace(p):
    """ Converts a "MM:SS" pace to a float (minutes) """
    (mm, ss) = p.split(':')
    (mm, ss) = [int(x) for x in (mm, ss)]
    fpace = mm + (float(ss) / 60.0)

    return fpace


def time_to_min(t):
    """ Converts an HH:MM:SS time to a float (minutes) """
    (hh, mm, ss) = t.split(':')
    (hh, mm) = [int(x) for x in (hh, mm)]
    ss = float(ss)
    minutes = (hh * 60) + mm + (ss / 60.0)

    return minutes
```

Once I process the Gold Discovery and Equinox result files through this routine, I dump the results in a properly formatted comma-delimited file, read the data into R and combine the two race results files by matching the runner’s name. Note that these results only include the men competing in the race.

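```
gd <- read.csv('gd_2012_men.csv', header=TRUE)
gd <- gd[,c('name', 'birth_year', 'finish_minutes', 'fpace')]
# The Equinox read was missing from this listing; the file name is assumed
eq <- read.csv('eq_2012_men.csv', header=TRUE)
eq <- eq[,c('name', 'birth_year', 'finish_minutes', 'fpace')]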
combined <- merge(gd, eq, by='name')
names(combined) <- c('name', 'birth_year', 'gd_finish', 'gd_pace',
'year', 'eq_finish', 'eq_pace')
```
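
The merge above is an inner join on runner name: only men who appear in both result files survive. For readers more comfortable with Python, here is a rough equivalent using plain dictionaries. The function name `merge_by_name` and the sample rows are made up, and the sketch assumes each name appears at most once per race.

```python
def merge_by_name(gd_rows, eq_rows):
    """Inner-join two lists of result dicts on the 'name' field."""
    eq_by_name = {row["name"]: row for row in eq_rows}
    combined = []
    for gd in gd_rows:
        eq = eq_by_name.get(gd["name"])
        if eq is None:
            continue  # ran Gold Discovery but not Equinox
        combined.append({
            "name": gd["name"],
            "birth_year": gd["birth_year"],
            "gd_finish": gd["finish_minutes"],
            "gd_pace": gd["fpace"],
            "eq_finish": eq["finish_minutes"],
            "eq_pace": eq["fpace"],
        })
    return combined


# Hypothetical rows for illustration
gd_rows = [
    {"name": "Runner One", "birth_year": 1970, "finish_minutes": 150.0, "fpace": 9.09},
    {"name": "Runner Two", "birth_year": 1985, "finish_minutes": 120.0, "fpace": 7.27},
]
eq_rows = [
    {"name": "Runner One", "birth_year": 1970, "finish_minutes": 290.0, "fpace": 11.07},
    {"name": "Runner Three", "birth_year": 1990, "finish_minutes": 240.0, "fpace": 9.16},
]

combined = merge_by_name(gd_rows, eq_rows)
print([row["name"] for row in combined])  # ['Runner One']
```

Matching on name alone is fragile when two runners share a name or a name is spelled differently between races, which is one reason to inspect the combined data before fitting anything.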

When I look at a plot of the data I can see four outliers; two where the runners ran Equinox much faster based on their Gold Discovery pace, and two where the opposite was the case. The two races are two months apart, so I think it’s reasonable to exclude these four rows from the data since all manner of things could happen to a runner in two months of hard training (or on race day!).

```
attach(combined)
combined <- combined[!((gd_pace > 10 & gd_pace < 11 & eq_pace > 15)
| (gd_pace > 15)),]
```

Let’s test the hypothesis that we can predict Equinox pace from Gold Discovery Pace:

```
model <- lm(eq_pace ~ gd_pace, data=combined)
summary(model)

Call:
lm(formula = eq_pace ~ gd_pace, data = combined)

Residuals:
Min       1Q   Median       3Q      Max
-1.47121 -0.36833 -0.04207  0.51361  1.42971

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.77392    0.52233   1.482    0.145
gd_pace      1.08880    0.05433  20.042   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6503 on 48 degrees of freedom
Multiple R-squared:  0.8933,    Adjusted R-squared:  0.891
F-statistic: 401.7 on 1 and 48 DF,  p-value: < 2.2e-16
```

Indeed, we can explain 89% of the variation in Equinox Marathon pace times using Gold Discovery pace times, and both the model and the model coefficient are significant.

Here’s what the results look like:

The red line shows a relationship where the Gold Discovery pace is identical to the Equinox pace for each runner. Because the actual data (and the predicted results based on the regression model) are above this line, all the runners were slower in the longer (and harder) Equinox Marathon.

As for me, my 9:02 Gold Discovery pace should translate into an Equinox pace around 10:30. Here are the 2012 runners who were born within ten years of me, and who finished within ten minutes of my 2013 Gold Discovery time:

2012 Race Results

| Runner | DOB | Gold Discovery Time | Equinox Time | Equinox Pace |
|---|---|---|---|---|
| Dan Bross | 1964 | 2:24 | 4:20 | 9:55 |
| Chris Hartman | 1969 | 2:25 | 4:45 | 10:53 |
| Mike Hayes | 1972 | 2:27 | 4:58 | 11:22 |
| Ben Roth | 1968 | 2:28 | 4:47 | 10:57 |
| Jim Brader | 1965 | 2:31 | 4:09 | 9:30 |
| Erik Anderson | 1971 | 2:32 | 5:03 | 11:34 |
| John Scherzer | 1972 | 2:33 | 4:49 | 11:01 |
| Trent Hubbard | 1972 | 2:33 | 4:48 | 11:00 |

Based on this, and the regression results, I expect to finish the Equinox Marathon in just under five hours if my training over the next two months goes well.

sat, 01-dec-2012, 07:41

It’s now December 1st and the last time we got new snow was on November 11th. In my last post I looked at the lengths of snow-free periods in the available weather data for Fairbanks, now at 20 days. That’s a long time, but what I’m interested in looking at today is whether the monthly pattern of snowfall in Fairbanks is changing.

The Alaska Dog Musher’s Association holds a series of weekly sprint races starting at the beginning of December. For the past several years—and this year—there hasn’t been enough snow to hold the earliest of the races because it takes a certain depth of snowpack to allow a snow hook to hold a team back should the driver need to stop. I’m curious to know if scheduling a bunch of races in December and early January is wishful thinking, or if we used to get a lot of snow earlier in the season than we do now. In other words, has the pattern of snowfall in Fairbanks changed?

One way to get at this is to look at the earliest date in the “winter year” (which I’m defining as starting on September 1st, since we do sometimes get significant snowfall in September) by which 12 inches of snow has fallen. Here’s what that relationship looks like:

And the results from a linear regression:

```
Call:
lm(formula = winter_doy ~ winter_year, data = first_foot)

Residuals:
Min      1Q  Median      3Q     Max
-60.676 -25.149  -0.596  20.984  77.152

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -498.5005   462.7571  -1.077    0.286
winter_year    0.3067     0.2336   1.313    0.194

Residual standard error: 33.81 on 60 degrees of freedom
Multiple R-squared: 0.02793,    Adjusted R-squared: 0.01173
F-statistic: 1.724 on 1 and 60 DF,  p-value: 0.1942
```

According to these results the date of the first foot of snow is getting later in the year, but it’s not significant, so we can’t say with any authority that the pattern we see isn’t just random. Worse, this analysis could be confounded by what appears to be a decline in the total yearly snowfall in Fairbanks:

This relationship (less snow every year) has even less statistical significance. If we combine the two analyses, however, the relationship comes closer to significance:

```
Call:
lm(formula = winter_year ~ winter_doy * snow, data = yearly_data)

Residuals:
   Min     1Q Median     3Q    Max
-35.15 -11.78   0.49  14.15  32.13

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      1.947e+03  2.082e+01  93.520   <2e-16 ***
winter_doy       4.297e-01  1.869e-01   2.299   0.0251 *
snow             5.248e-01  2.877e-01   1.824   0.0733 .
winter_doy:snow -7.022e-03  3.184e-03  -2.206   0.0314 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.95 on 58 degrees of freedom
Multiple R-squared: 0.1078,     Adjusted R-squared: 0.06163
F-statistic: 2.336 on 3 and 58 DF,  p-value: 0.08317
```

Here we’re “predicting” winter year based on the yearly snowfall, the first date where a foot of snow had fallen, and the interaction between the two. Despite the near-significance of the model and the parameters, it doesn’t do a very good job of explaining the data (almost 90% of the variation is unexplained by this model).
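The “almost 90%” figure comes straight from the R-squared value in the summary. A quick sketch of the arithmetic, using only numbers copied from the output above (the variable names are mine):

```r
# Values copied from the regression summary above
r_squared <- 0.1078
resid_df <- 58    # residual degrees of freedom
n_terms <- 3      # winter_doy, snow and their interaction

# Observations = residual df + estimated terms + intercept
n_obs <- resid_df + n_terms + 1

# Share of the variation in winter_year the model leaves unexplained
unexplained <- 1 - r_squared

# Adjusted R-squared penalizes the fit for the number of terms used
adj_r_squared <- 1 - unexplained * (n_obs - 1) / resid_df

print(c(unexplained = unexplained, adj_r_squared = adj_r_squared))
```

The adjusted value lands within rounding of the 0.06163 that R reports, which is an even less flattering measure of how much this model explains.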

One problem with boiling the data down to one or two values for each year is that we’re reducing the amount of data being analyzed, lowering our power to detect a significant relationship between the pattern of snowfall and year. Here’s what the overall pattern for all years looks like:

And the individual plots for each year in the record:

Because “winter month” isn’t a continuous variable, we can’t use normal linear regression to evaluate the relationship between year and monthly snowfall. Instead we’ll use multinomial logistic regression to investigate the relationship between which month is the snowiest, and year:

```
library(nnet)
model <- multinom(data = snowiest_month, winter_month ~ winter_year)
summary(model)

Call:
multinom(formula = winter_month ~ winter_year, data = snowiest_month)

Coefficients:
  (Intercept)  winter_year
3    30.66572 -0.015149192
4    62.88013 -0.031771508
5    38.97096 -0.019623059
6    13.66039 -0.006941225
7   -68.88398  0.034023510
8   -79.64274  0.039217108

Std. Errors:
   (Intercept)  winter_year
3 9.992962e-08 0.0001979617
4 1.158940e-07 0.0002289479
5 1.120780e-07 0.0002218092
6 1.170249e-07 0.0002320081
7 1.668613e-07 0.0003326432
8 1.955969e-07 0.0003901701

Residual Deviance: 221.5413
AIC: 245.5413
```

I’m not exactly sure how to interpret the results, but typically you’re looking to see if the intercepts and coefficients are significantly different from zero. The coefficients here are far larger in magnitude than their standard errors, which would imply that they are.
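One common way to make that comparison explicit is a Wald test: divide each coefficient by its standard error and look the ratio up against a normal distribution. `summary()` on a `multinom` fit doesn’t print p-values, so this is a rough sketch of doing it by hand. The `toy` data frame is randomly generated and only stands in for `snowiest_month`, and centring the year is my own choice to keep the fit numerically well behaved, not something from the analysis above.

```r
library(nnet)

# Hypothetical stand-in for snowiest_month: one row per winter year,
# with a randomly chosen "snowiest" winter month (2 = October ... 8 = April)
set.seed(1)
toy <- data.frame(winter_year = 1949:2012)
toy$winter_month <- factor(sample(2:8, nrow(toy), replace = TRUE))

# Centre the year so the optimizer works with small numbers (my assumption)
toy$year_c <- toy$winter_year - mean(toy$winter_year)
toy_model <- multinom(winter_month ~ year_c, data = toy, trace = FALSE)

# Wald z statistics: coefficient divided by its standard error
s <- summary(toy_model)
z <- s$coefficients / s$standard.errors

# Two-sided p-values from the standard normal distribution
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
print(round(p, 3))
```

Because the toy months are random, the year terms here shouldn’t be expected to show anything meaningful. The same two lines applied to `summary(model)` would give the p-values for the real fit.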

In order to examine what they have to say, we’ll calculate the probability curves for whether each month will wind up as the snowiest month, and plot the results by year.

```
library(reshape2)  # provides melt(); the post doesn't say which package was used
library(ggplot2)

# Predicted probability of each month being the snowiest, for every winter year
fit_snowiest <- data.frame(winter_year = 1949:2012)
probs <- cbind(fit_snowiest, predict(model, newdata = fit_snowiest, "probs"))
probs.melted <- melt(probs, id.vars = 'winter_year')
names(probs.melted) <- c('winter_year', 'winter_month', 'probability')
probs.melted$month <- factor(probs.melted$winter_month)
# Relabel winter month numbers with month names (2 = October ... 8 = April)
levels(probs.melted$month) <-
    list('oct' = 2, 'nov' = 3, 'dec' = 4, 'jan' = 5, 'feb' = 6, 'mar' = 7, 'apr' = 8)
q <- ggplot(data = probs.melted, aes(x = winter_year, y = probability, colour = month))
q + theme_bw() + geom_line(size = 1) + scale_y_continuous(name = "Model probability") +
    scale_x_continuous(name = 'Winter year', breaks = seq(1945, 2015, 5)) +
    ggtitle('Snowiest month probabilities by year from logistic regression model,\nFairbanks Airport station') +
    scale_colour_manual(values =
        c("violet", "blue", "cyan", "green", "#FFCC00", "orange", "red"))
```

The result:

Here’s how you interpret this graph. Each line shows how likely it is that a month will be the snowiest month (November is always the snowiest month because it always has the highest probabilities). The order of the lines for any year indicates the monthly order of snowiness (in 1950, November, December and January were predicted to be the snowiest months, in that order), and months with a negative slope are getting less snowy overall (November, December, January).

November is the snowiest month for all years, but it’s declining, as is snow in December and January. October, February, March and April are increasing. From these results, it appears that we’re getting more snow at the very beginning (October) and at the end of the winter, and less in the middle of the winter.
