sun, 31-jan-2010, 13:40

I recently saw a pair of blog posts showing how to make heatmaps with straight R and with ggplot2. Basketball doesn’t really interest me, so I figured I’d attempt to do the same thing for the 2010 Oakland Athletics 40-man roster. Results are at the bottom of the post.

First, I needed to get the 40-man roster:

$ w3m -dump "http://oakland.athletics.mlb.com/team/roster_40man.jsp?c_id=oak" > 40man

Then trim it down so it’s just a listing of the players’ names.

Next, get the Baseball Databank (BDB) database from http://baseball-databank.org/, then convert it and load it into a PostgreSQL database using mysql2pgsql.perl.

A Python script reads the names from the roster, and dumps a CSV file of the batting and pitching data for the past two seasons for the players passed in.

$ cat 40man_names | ./get_two-year_batter_stats.py

The batting data looks like this:

            name  , age,   g,    ba,   obp,   slg,   ops,  rc,   hrr,    kr,   bbr
Daric Barton (1B) ,  25, 194, 0.238, 0.342, 0.365, 0.707,  73, 0.017, 0.173, 0.134
Travis Buck (RF)  ,  27,  74, 0.223, 0.289, 0.392, 0.682,  28, 0.035, 0.202, 0.073
Chris Carter (LF) ,  28,  13, 0.261, 0.320, 0.261, 0.581,   1, 0.000, 0.360, 0.080
...

I’ve used the counting stats in the BDB to calculate batting average (ba), on-base percentage (obp), slugging percentage (slg), OPS (on-base percentage + slugging percentage), runs created (rc), home run rate (hrr), strikeout rate (kr) and walk rate (bbr).
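
As a rough illustration of how those rate stats fall out of counting stats, here is a minimal Python sketch. The function name `batting_rates` and its parameters are hypothetical, and dividing the home run, strikeout and walk rates by plate appearances is an assumption rather than something stated above. The input totals are made up so that the output reproduces the Chris Carter row; they are not taken from the BDB. Runs created is left out because the post doesn’t say which version of the formula was used.

```python
def batting_rates(ab, h, doubles, triples, hr, bb, so, hbp=0, sf=0):
    """Derive rate stats from counting stats (hypothetical helper)."""
    # Simplified plate appearances: at-bats, walks, hit-by-pitch, sac flies
    pa = ab + bb + hbp + sf
    total_bases = h + doubles + 2 * triples + 3 * hr
    ba = h / ab
    obp = (h + bb + hbp) / pa
    slg = total_bases / ab
    return {
        "ba": round(ba, 3),
        "obp": round(obp, 3),
        "slg": round(slg, 3),
        "ops": round(obp + slg, 3),
        # Assumption: the three rates are per plate appearance
        "hrr": round(hr / pa, 3),
        "kr": round(so / pa, 3),
        "bbr": round(bb / pa, 3),
    }

# Illustrative totals only (not real BDB values)
carter = batting_rates(ab=23, h=6, doubles=0, triples=0, hr=0, bb=2, so=9)
print(carter)
```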

And the pitching data:

            name   , age,   g,     ip,  w,   l,  sv,    wp,    lp,    wf,   era,    k9,   bb9,   hr9
Brett Anderson (P) ,  22,  30, 175.33, 11,  11,   0,  0.37,  0.37,  0.00,  4.06,  7.70,  2.36,  1.03
Andrew Bailey (P)  ,  26,  68,  83.33,  6,   3,  26,  0.09,  0.04,  0.04,  1.84,  9.83,  2.92,  0.54
Jerry Blevins (P)  ,  27,  56,  60.00,  1,   3,   0,  0.02,  0.05, -0.04,  3.75,  8.70,  3.30,  0.60
...

Here I’ve calculated innings pitched (ip), winning percentage (wp), losing percentage (lp), win frequency (wf), earned run average (era), strikeouts per nine innings (k9), walks per nine (bb9), and home runs given up per nine innings (hr9). All these stats are for the last two Major League seasons.
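
The pitching columns can be sketched the same way. In the sample rows above, wp, lp and wf look consistent with wins per game, losses per game and their difference; that reading is inferred from the three rows shown, not stated in the post. The helper name `pitching_rates` is hypothetical, and the counting totals below are made up so that the output matches the Brett Anderson row. The BDB stores innings as outs recorded (IPouts), hence the division by three.

```python
def pitching_rates(g, w, l, ipouts, er, so, bb, hr):
    """Derive pitching rate stats from counting stats (hypothetical helper)."""
    ip = ipouts / 3  # outs recorded -> innings pitched

    def per_nine(count):
        return round(9 * count / ip, 2)

    return {
        "ip": round(ip, 2),
        # Assumption: wp, lp and wf are per game appeared in
        "wp": round(w / g, 2),
        "lp": round(l / g, 2),
        "wf": round((w - l) / g, 2),
        "era": per_nine(er),
        "k9": per_nine(so),
        "bb9": per_nine(bb),
        "hr9": per_nine(hr),
    }

# Illustrative totals only (not real BDB values)
anderson = pitching_rates(g=30, w=11, l=11, ipouts=526, er=79, so=150, bb=46, hr=20)
print(anderson)
```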

Finally, generate the heat maps in R. For batting statistics:

library(ggplot2)
mlb <- read.csv('batting.csv')
mlb$name <- with(mlb, reorder(name, ops))  # order players by OPS
# Long format, then rescale each statistic to 0-1 so every column shares one colour scale.
# melt() and ddply() come from the reshape and plyr packages, which library(ggplot2) attached at the time.
mlb.m <- melt(mlb)
mlb.m <- ddply(mlb.m, .(variable), transform, rescale = rescale(value))
(p <- ggplot(mlb.m, aes(variable, name)) +
    geom_tile(aes(fill = rescale), colour = "white") +
    scale_fill_gradient(low = "gold", high = "darkgreen"))
base_size <- 14
# opts(), theme_blank() and theme_text() are the 2010 API; later ggplot2 releases
# use theme(), element_blank() and element_text() instead.
p + theme_grey(base_size = base_size) + labs(x = "", y = "") +
    scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 0)) +
    opts(legend.position = "none", axis.ticks = theme_blank(),
        axis.text.x = theme_text(size = base_size * 0.8, angle = 0, hjust = 0.5, colour = "black"),
        axis.text.y = theme_text(size = base_size * 0.8, lineheight = 0.9, colour = "black", hjust = 1))

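The pitching heatmap uses the same code, except that the third line, which sets the player order, uses the reciprocal of ERA so that the lowest ERA lands where the highest OPS did (the 0.1 avoids dividing by zero for a 0.00 ERA):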

mlb$name <- with(mlb, reorder(name, 1/(era+0.1)))
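
For readers who don’t use R, the rescaling step in the code above amounts to a per-column min-max scaling. This is a minimal Python sketch of that idea, not the author’s code; the `rescale` function here is a stand-in written for illustration, and the values are the ops and kr columns from the three sample batting rows.

```python
def rescale(values):
    """Linearly map a list of numbers onto the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Barton, Buck, Carter from the sample batting rows
batting = {
    "ops": [0.707, 0.682, 0.581],
    "kr": [0.173, 0.202, 0.360],
}

# Each statistic is rescaled on its own, so columns with very
# different ranges can share a single colour gradient.
scaled = {stat: rescale(vals) for stat, vals in batting.items()}
print(scaled)
```

One consequence worth keeping in mind when reading the charts: every column is coloured the same way, so the highest value gets the darkest green even for a stat like strikeout rate, where lower is better for a hitter.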
    

The results:

A’s batting heatmap, ordered by OPS

A’s pitching heatmap, ordered by ERA

You have to keep the number of games (or innings pitched for pitchers) in mind when you look at these charts. I don’t even know who some of those guys are, probably because they’ve only barely played in the majors. It might make some sense to split the pitching plot into plots for starters and relievers, but I’d need a good way to determine a pitcher’s status (innings pitched divided by games beyond some threshold, perhaps?).
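
A minimal sketch of that starter/reliever idea, using innings pitched per game. The function name `pitcher_role` and the 3.0 innings-per-game cutoff are hypothetical; a real cutoff would need checking against pitchers whose roles are known. The inputs are the ip and g columns from the three sample pitching rows.

```python
def pitcher_role(ip, g, threshold=3.0):
    """Guess a pitcher's role from innings per appearance.

    The threshold is an assumed, untuned cutoff.
    """
    return "starter" if ip / g >= threshold else "reliever"

# (innings pitched, games) from the sample pitching rows
sample = {
    "Brett Anderson": (175.33, 30),
    "Andrew Bailey": (83.33, 68),
    "Jerry Blevins": (60.00, 56),
}

roles = {name: pitcher_role(ip, g) for name, (ip, g) in sample.items()}
print(roles)
```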

As for the A’s, I like their pitching, but have serious doubts about their offense. I sure hope some of the younger guys on this chart start reaching their power potential because having Jack Cust as your only offensive weapon doesn’t bode well for the team scoring runs.

fri, 23-oct-2009, 17:22

DNR pond

frozen DNR pond

It’s been almost a month since I last discussed the first true snowfall date (when the snow that falls stays on the ground for the entire winter) in Fairbanks, and we’re still without snow on the ground. It hasn’t been that cold yet, but the average temperature is far enough below freezing that the local ponds have started to freeze. Without snow, there’s a lot of ice skating going on around town. I’m hoping to head out this weekend and do some skating on the pond in the photo above. Still, most folks in Fairbanks are hoping for snow.

Since my last post, I’ve gotten access to data from the National Climatic Data Center, and have been working on getting it all processed into a database. I’ve worked out a procedure for processing the daily COOP data, which means I can repeat my earlier snow depth analysis with a longer (and more consistent) data set. The following figure shows the same basic analysis as in my previous post, but now I’ve got data from 1948 to 2008.

Snow depth histogram

The latest date for the first true snowfall was November 11th, 1962, and we’re almost three weeks away from that date. But we’re also on the right side of the distribution: the mean (and median) date is October 14th, and we’re 9 days past that with no significant snow in the forecast. I’ve also marked the earliest (September 13th, 1992) and latest (November 1st, 1997) first snowfall dates in recent history. 1992 was the year the snow fell while the leaves were still on the trees, causing major power outages and a lot of damage. I think 1997 was the year that we didn’t get much snow at all, which caused a lot of problems for water and septic lines buried in the ground. A deep snowpack provides a good insulating layer that keeps buried water lines from freezing, and in 1997 a lot of things froze.
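
The day counts in that paragraph are simple date arithmetic. As a small sketch, with the post date (October 23rd, 2009) standing in for “today” and the historical dates projected onto 2009 (the variable names are just for illustration):

```python
from datetime import date

today = date(2009, 10, 23)             # date of this post
mean_first_snow = date(2009, 10, 14)   # mean (and median) date, projected onto 2009
latest_on_record = date(2009, 11, 11)  # the 1962 record, projected onto 2009

days_past_mean = (today - mean_first_snow).days
days_until_latest = (latest_on_record - today).days

print(days_past_mean, "days past the mean date")
print(days_until_latest, "days until the latest date on record")
```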


Great Horned Owl

Great Horned Owl, digi-scoped with my iPhone

This is also the time of the year when some of the winter birds start making themselves less scarce. We saw our first Pine Grosbeaks of the year three days later than last year’s first observation, a Northern Goshawk flew over a couple of weeks ago, and we got some great views of this Great Horned Owl on Saturday. Andrea took some spectacular photos with her digital camera, and I experimented with my iPhone and the scope we bought in Homer this year. It’s quite a challenge to get the tiny iPhone lens properly oriented with the eyepiece image in the scope, but the photos are pretty impressive when you get it all set up. Even a pretty wimpy camera becomes powerful when looking through a nice scope.

Winter is on its way, just a bit late this year. I’ve been taking advantage by riding my bike to work fairly often. Earlier in the week I replaced my normal tires with carbide-studded tires, so I’ll be ready when the ice and snow finally come.

tags: DNR pond  GHOW  owl  R  snowfall  weather 
fri, 25-sep-2009, 18:21

Piper and Nika on the Creek

Piper and Nika on the Creek, Feb 2009

On Wednesday I reported the results of my analysis examining the average date of first snow recorded at the Fairbanks Airport weather station. It was based on the snow_flag boolean field in the ISD database. In that post I mentioned that examining snow depth data might show the date on which permanent snow (snow that lasts all winter) first falls in Fairbanks. I’m calling this the first “true” snowfall of the season.

For this analysis I looked at the snow depth field in the ISD database for the Fairbanks station. Snow depth is present for the years 1973 through 1999, but isn’t in the database before 1973. I’m not sure why it’s not in there after 1999, but luckily I’ve been collecting and archiving the data in the Fairbanks Daily Climate Summary (which includes a snow depth measurement) since late 2000. Combining those two data sets, I’ve got data for 27 years.

The SQL query I came up with gives a good estimate of what we’re interested in, but it isn’t perfect because it only finds the first date on which snow stays on the ground for at least a week. In a place like Fairbanks where the turn to winter is so rapid and so dependent on the high albedo of snow cover, I think it’s close enough to the truth. Unfortunately, the query is brutally slow because it involves six (!) inner self-joins. The idea is to join the table containing snow depth data against itself, incrementing the date by one day at each join. The result set before the WHERE clause is applied is the data for each date, plus the data for the six days following that date. The WHERE clause requires that snow depth on all seven of those dates is above zero. This large query is a subquery of the main query, which selects the earliest date found in each year.

There must be a better way to deal with conditions like this where we’re interested in the consecutive nature of the phenomenon, but I couldn’t figure out any other way to handle it in SQL, so here it is:

SELECT year, min(date) FROM
    (
        SELECT extract(year from a.dt) AS year,
            to_char(extract(month from a.dt), '00') ||
                '-' ||
                ltrim(to_char(extract(day from a.dt), '00')) AS date
        FROM isd_daily AS a
            INNER JOIN isd_daily AS b
                ON a.isd_id=b.isd_id AND
                    a.dt=b.dt - interval '1 day'
            INNER JOIN isd_daily AS c
                ON a.isd_id=c.isd_id AND
                    a.dt=c.dt - interval '2 days'
            INNER JOIN isd_daily AS d
                ON a.isd_id=d.isd_id AND
                    a.dt=d.dt - interval '3 day'
            INNER JOIN isd_daily AS e
                ON a.isd_id=e.isd_id AND
                    a.dt=e.dt - interval '4 day'
            INNER JOIN isd_daily AS f
                ON a.isd_id=f.isd_id AND
                    a.dt=f.dt - interval '5 day'
            INNER JOIN isd_daily AS g
                ON a.isd_id=g.isd_id AND
                    a.dt=g.dt - interval '6 day'
        WHERE a.isd_id = '702610-26411' AND
            a.snow_depth > 0 AND
            b.snow_depth > 0 AND
            c.snow_depth > 0 AND
            d.snow_depth > 0 AND
            e.snow_depth > 0 AND
            f.snow_depth > 0 AND
            g.snow_depth > 0 AND
            extract(month from a.dt) > 7
    ) AS snow_depth_conseq
GROUP BY year
ORDER BY year;

See what I mean? It’s pretty ugly. Running the result through the same R script as in my previous snowfall post yields this plot:

First true snowfall histogram

Between 1973 and 2008 we’ve gotten snow lasting the whole winter starting as early as September 12th (that was the infamous 1992), and as late as the first of November (1976). The median date is October 13th, which matches my impression. Now that the leaves have largely fallen off the trees, I’m hoping we get our first true snowfall on the early end of the distribution. We’ve still got a few things to take care of (a couple new dog houses, insulating the repaired septic line, etc.), but once those are done, I’m ready for the Creek to freeze and snow to blanket the trails.
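
Outside of SQL, the “seven consecutive days with snow on the ground” condition is easy to express as a scan over the dates. This is a minimal Python sketch of that alternative, not what was used for the plot; the function name `first_true_snowfall` and the sample depths are made up, and it handles one season at a time rather than grouping by year.

```python
from datetime import date, timedelta

def first_true_snowfall(depth_by_date, run_length=7, first_month=8):
    """Return the earliest date that starts `run_length` consecutive days
    with snow depth above zero, or None if there is no such run.

    Missing days and missing depths count as no snow, which mirrors
    the inner joins and the > 0 tests in the SQL.
    """
    for day in sorted(depth_by_date):
        if day.month < first_month:  # same filter as extract(month ...) > 7
            continue
        window = (day + timedelta(days=i) for i in range(run_length))
        if all((depth_by_date.get(d) or 0) > 0 for d in window):
            return day
    return None

# Made-up season: bare ground from September 1st, a two-day dusting
# that melts, then snow that stays from October 10th onward.
start = date(2008, 9, 1)
depths = {start + timedelta(days=i): 0 for i in range(61)}
depths[date(2008, 9, 20)] = 1
depths[date(2008, 9, 21)] = 1
for i in range(39, 61):  # October 10th through October 31st
    depths[start + timedelta(days=i)] = 3

print(first_true_snowfall(depths))
```

The scan reads each day’s depth from a dictionary instead of joining the table against itself six times, so the length of the required run becomes a parameter rather than a count of joins.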

tags: Nika  Piper  R  snowfall  weather 
wed, 23-sep-2009, 17:32

Snow on moss

We got our first dusting of snow last night. It stuck around until after noon, allowing me to take the photo on the right when I went for a walk with Nika around the peat bog. You can really tell where the permafrost is by the thick layer of insulating moss that keeps the ground frozen, and is keeping the snow from melting in the photo.

Every year when the first snow falls it seems like it’s earlier than the last, and there’s usually some discussion at the office about how short the summer turned out to be. The early snows of 1992 that knocked out power for days all over town are also normally mentioned. I decided to look and see if I had some data that could place this year’s first snowfall in a historical context.

One of the few free long-term weather datasets that’s available from the National Climatic Data Center is the Integrated Surface Dataset (ISD), which contains daily weather observations for more than 20,000 stations. The Fairbanks Airport station has been in operation for more than 100 years, but it moved in 1946, so I only used data from 1946–2008. In addition to a series of numerical observations (minimum and maximum temperature, pressure, wind speed, etc.), the dataset contains several fields used to indicate whether a particular phenomenon was observed during that day. One of them, snow_flag, is defined as: “True indicates there was a report of snow or ice pellets during the day.”

That’s perfect. Snow depth is another parameter I considered, but this data wasn’t collected until the mid-70s, and it doesn’t really help us answer the question because most of the time the first snowfall of the year doesn’t last long enough to be recorded as snow on the ground.

Here’s the SQL query to find the earliest snowfall date for each year for the Fairbanks Airport station:

SELECT year, min(date) FROM (
        SELECT extract(year from dt) AS year,
            to_char(extract(month from dt), '00') ||
                '-' ||
                ltrim(to_char(extract(day from dt), '00')) AS date,
            snow_flag
        FROM isd_daily
        WHERE isd_id = '702610-26411'
            AND extract(month from dt) > 7
            AND snow_flag = 't'
    ) AS snow_flag_sub
GROUP BY year
ORDER BY year;

Mix in a little R:

fs <- read.table("first_snow_mm-dd", header=TRUE, row.names=1)
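# as.Date() fills in the current year when the format has none, so the
# 2009 marker below only lines up when this is run during 2009
fs$date<-as.Date(fs$date, "%m-%d")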
png("first_snow_mm-dd.png", height=500, width=500, units="px", pointsize=12)
hist(fs$date, breaks="weeks", labels=FALSE,
    xlab="Date of first snowfall",
    main="First snowfall reported, Fairbanks Airport (PAFA) station",
    plot=TRUE, freq=TRUE,
    ylim=c(0, 20), col="gray60")
text(as.Date("2009-09-23"), 19, "⇦ 2009", srt=90, col="darkred")
dev.off()

And you get this plot:

First snowfall histogram

You can see from the plot that the first snowfall comes somewhere between August 3rd and October 26th, with the week of September 21st being the most common. So we’re right on schedule this year.

Another analysis that I’ve been meaning to do is to find the average date when the snow that falls lasts the entire winter. Since I’ve been in Fairbanks, my estimate of this date is the second week of October, but I’ve never actually looked it up to see if that’s true or not. Unfortunately, this requires good snow depth data, and the ISD dataset doesn’t have snow depth for Fairbanks prior to 1975. It’s also a bit more complicated than looking for the earliest snow_flag = 't' because you need to examine future rows to know if the snow depth observation you’re examining lasted more than a few days.


Why isn’t all the data collected by the Weather Service freely available? Public money was used to collect, analyze, and archive it, so I think it should be made available to the public that paid for it.

tags: R  snowfall  weather 
tue, 13-jan-2009, 17:40

bookshelf

Bookshelf

With the exception of some newer fiction and our cookbooks, we haven't organized our books since we moved a year and a half ago. But the majority of our books are in Bookpedia, so we should be able to have Bookpedia help us organize them. Step one is finding out how much space a book takes up. So I went around measuring the number of books contained in a foot of space on each shelf of several of our bookshelves. A quick foray into R:
> books_per_foot = c(14,11,17,11,10,18,12,12,
          13,16,15,14,13,14,15,8,10,10,11,11)
> summary(books_per_foot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   8.00   11.00   12.50   12.75   14.25   18.00
> mean(books_per_foot)
[1] 12.75
> sd(books_per_foot)
[1] 2.613225
With that information (12.75 books per foot of shelf space with a standard deviation of 2.61 books) and the total length of each bookshelf, it ought to be relatively easy to extract a listing of what books to put on each bookshelf. Actually moving them will be more of a challenge!
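
As a rough sketch of that last step, here is one way the listing could be produced in Python. The measurements are the ones above; the function name `assign_to_shelves`, the shelf lengths and the placeholder titles are all hypothetical, since the Bookpedia export format isn't covered here.

```python
import statistics

books_per_foot = [14, 11, 17, 11, 10, 18, 12, 12,
                  13, 16, 15, 14, 13, 14, 15, 8, 10, 10, 11, 11]
density = statistics.mean(books_per_foot)   # books per foot of shelf
spread = statistics.stdev(books_per_foot)   # sample standard deviation, as in R's sd()

def assign_to_shelves(titles, shelf_feet, per_foot=density):
    """Split an ordered list of titles across shelves by estimated capacity."""
    plan, start = [], 0
    for feet in shelf_feet:
        capacity = int(feet * per_foot)  # round down to whole books
        plan.append(titles[start:start + capacity])
        start += capacity
    return plan

# Placeholder titles standing in for a sorted Bookpedia export
titles = [f"Book {n:03d}" for n in range(1, 101)]
plan = assign_to_shelves(titles, shelf_feet=[3, 3, 2.5])  # made-up shelf lengths

for number, shelf in enumerate(plan, start=1):
    print(f"Shelf {number}: {len(shelf)} books, {shelf[0]} to {shelf[-1]}")
```

Since the capacity is only an estimate, and varies by about 2.6 books per foot, it is probably wise to leave some slack on each shelf.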
tags: Bookpedia  books  bookshelves  R 
