There's probably no point in producing yet another replication of Tamino's work. Nevertheless, the theoretical problem itself did pique my interest in an academic sense. In particular, Tamino had mentioned in his preliminary results post that he was making all grid rows 10° tall, except for the northernmost row, which was 20° tall. However, he was counting its surface area as if it were 10° tall. Evidently, this has to do with the fact that there are relatively few temperature stations near the Arctic. That row is not equivalent, in a statistical sense, to other rows.
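As a rough illustration of why the height of a row matters: on a sphere, the area between two latitudes is proportional to the difference of the sines of those latitudes. The sketch below uses a hypothetical helper, band_area_fraction, that is not part of any tool mentioned here, and it assumes the northernmost row spans 70° to 90°. Exactly how Tamino weighted that row is not spelled out above, so this only shows the sizes involved.

```python
import math

def band_area_fraction(lat_south_deg, lat_north_deg):
    """Fraction of a sphere's surface lying between two latitudes (in degrees)."""
    south = math.sin(math.radians(lat_south_deg))
    north = math.sin(math.radians(lat_north_deg))
    return (north - south) / 2.0

# Assumed layout: the 20-degree-tall northernmost row covers 70N to 90N.
full_row = band_area_fraction(70, 90)
lower_half = band_area_fraction(70, 80)
upper_half = band_area_fraction(80, 90)
equator_row = band_area_fraction(0, 10)

print(f"70-90: {full_row:.4f}  70-80: {lower_half:.4f}  "
      f"80-90: {upper_half:.4f}  0-10: {equator_row:.4f}")
```

Under these assumptions the whole 70° to 90° row covers only about 3% of the globe, which is less than a single 10° row at the equator.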
It occurred to me that an alternative way to partition a grid would be to create grid cells (or grid boxes, if you prefer) that have the same number of stations in them. I call them statistically equivalent grid cells. So I started to write some code.
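To give a feel for the idea, here is a minimal sketch of one way to build cells with equal station counts: sort stations by latitude, cut them into bands holding the same number of stations, then cut each band by longitude in the same manner. This is only an illustration under my own assumptions. The function names and the sample stations are hypothetical, and it is not necessarily how GHCN Processor partitions its grid.

```python
def equal_count_chunks(items, n_chunks):
    """Split a list into n_chunks consecutive pieces whose sizes differ by at most 1."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def equal_count_cells(stations, n_bands, cells_per_band):
    """stations: list of (latitude, longitude) tuples. Returns a list of cells."""
    by_latitude = sorted(stations, key=lambda s: s[0])
    cells = []
    for band in equal_count_chunks(by_latitude, n_bands):
        by_longitude = sorted(band, key=lambda s: s[1])
        cells.extend(equal_count_chunks(by_longitude, cells_per_band))
    return cells

# Hypothetical stations, deliberately clustered at mid-latitudes.
stations = [(lat, lon)
            for lat in (35, 40, 45, 50, 52, 55, 60, 80)
            for lon in (-120, -90, -60, 0, 30, 60)]

cells = equal_count_cells(stations, n_bands=4, cells_per_band=3)
print(len(cells), [len(c) for c in cells])
```

With this kind of partition, sparsely covered regions end up in geographically large cells and dense regions in small ones, so each cell average is based on the same amount of data.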
I ended up writing not just a script, but a full-fledged command-line tool that can produce a temperature record for a pretty much arbitrary set of stations. For example, it can produce a temperature series for urban areas of Brazil, or hilly regions of the Northern Hemisphere. All you do is pass a simple expression to the tool. Additionally, it is written in a modular manner, so it's relatively easy for others to contribute new grid partitioning methods, new adjustments, new station combination methods, station filters, and output formats. At least a couple of alternatives are already supported in each case.
I have set up a SourceForge project page for this project, and other projects and code that I might release through my blogs in the future.
GHCN Processor is open source software. It is released under the terms of the Apache License 2.0.
You need to have a Java Runtime Environment 1.5+ installed and in your PATH. It's not uncommon for PCs to already have it. You can check with:
GHCN Processor can be downloaded from the Files Section of the project page.
You can unzip the ZIP file in a directory of your choice. Then you need to add the bin directory of the distribution to your PATH environment variable. In Windows, you need to right-click My Computer, then select Properties / Advanced / Environment Variables.
Documentation on tool options can be found in the readme.txt file shipped with the GHCN Processor distribution. Any other posts I write about GHCN Processor should be available via the ghcn processor tag.
A default global temperature record can be created with:
ghcnp -o /tmp/ghcn-global.csv
The first time you run the tool, it will download inventory and data files from the GHCN v2 FTP site. Note that if you happen to interrupt the download, you could end up with incomplete files. In this case, run the tool again with the
The output is in CSV format, readable by spreadsheet software. By default it will contain one year per row, with 14 columns each. You can also produce a "monthly format" file with one month per row, as follows.
ghcnp -of monthly -o /tmp/ghcn-global-monthly.csv
By default, the tool uses the adjusted data set from GHCN v2. To use the raw, unadjusted data, try:
ghcnp -dt mean -o /tmp/ghcn-global-raw.csv
To use what I call "statistically equivalent grid cells", try:
ghcnp -gt seg -o /tmp/ghcn-global-seg.csv
The default combination method for stations that are in the same grid cell is a simple "temperature anomaly" method, with a base period of 1950 to 1981 by default. I've implemented an experimental "optimal" method that you can use as follows.
ghcnp -scm olem -o /tmp/ghcn-global-olem.csv
I'll discuss this in more detail on another occasion.
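For readers unfamiliar with the simple "temperature anomaly" approach mentioned above, here is a minimal sketch of the general idea. Each station's values are expressed relative to its own mean over the base period, and the resulting anomalies are averaged across stations. The assumptions are mine: annual values rather than monthly ones, a base period that includes both 1950 and 1981, and hypothetical function names and station data. It should not be read as GHCN Processor's actual implementation.

```python
BASE_START, BASE_END = 1950, 1981  # base period named in the text, assumed inclusive

def anomalies(series):
    """series: dict of year -> temperature. Returns dict of year -> anomaly.

    Stations with no data in the base period are skipped (empty result).
    """
    base_values = [t for year, t in series.items() if BASE_START <= year <= BASE_END]
    if not base_values:
        return {}
    baseline = sum(base_values) / len(base_values)
    return {year: t - baseline for year, t in series.items()}

def combine(station_series):
    """Average the anomalies of all stations that report in a given year."""
    per_year = {}
    for series in station_series:
        for year, anomaly in anomalies(series).items():
            per_year.setdefault(year, []).append(anomaly)
    return {year: sum(v) / len(v) for year, v in sorted(per_year.items())}

# Two hypothetical stations with very different absolute temperatures.
warm = {1950: 20.0, 1981: 21.0, 2000: 22.0}
cold = {1950: 5.0, 1981: 5.0, 2000: 6.0}

combined = combine([warm, cold])
print(combined)
```

Working with anomalies rather than absolute temperatures means that a warm station and a cold station in the same cell can be averaged without the result depending on which of them happens to be reporting in a given year.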
To produce a regional temperature record, you should always pass the -reg option. The grid partitioning algorithm works differently when it's not attempting to produce a global grid. The following command produces a temperature series for airports in the United Kingdom.
ghcnp -reg -include " country=='UNITED KINGDOM' && is_airport " -o /tmp/uk-airports.csv
This alternative syntax also works:
ghcnp -reg -include " country eq 'UNITED KINGDOM' && is_airport " -o /tmp/uk-airports.csv
The expression after the -include option should be enclosed in double quotes, and its syntax is that of EL (basically the same as Java or C expression syntax). String literals in the expression must be enclosed in single quotes. The names of countries must be written exactly as they appear in the GHCN v2 country codes file.
As another example, let's get a temperature series for stations in the southern hemisphere that are in forested areas.
ghcnp -reg -include "latitude < 0 && vegetation eq 'FO' " -o /tmp/sh-forested.csv
GHCN Processor in this case tells us that there are only 7 such stations with temperature data in the adjusted data set.
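Conceptually, a filter expression like the ones above can be thought of as a boolean test applied to each station's properties. The sketch below mimics the two example filters on a handful of made-up station records. The property names come from the commands above. The records, the select helper, and the use of None for non-forested vegetation are hypothetical.

```python
# Hypothetical station records; real ones would come from the GHCN inventory.
stations = [
    {"id": "S1", "country": "UNITED KINGDOM", "is_airport": True,
     "latitude": 51.5, "vegetation": None},
    {"id": "S2", "country": "UNITED KINGDOM", "is_airport": False,
     "latitude": 55.9, "vegetation": "FO"},
    {"id": "S3", "country": "BRAZIL", "is_airport": True,
     "latitude": -22.9, "vegetation": None},
    {"id": "S4", "country": "BRAZIL", "is_airport": False,
     "latitude": -3.1, "vegetation": "FO"},
]

def select(stations, predicate):
    """Return the ids of the stations for which predicate(station) is true."""
    return [s["id"] for s in stations if predicate(s)]

# Counterpart of: country=='UNITED KINGDOM' && is_airport
uk_airports = select(
    stations, lambda s: s["country"] == "UNITED KINGDOM" and s["is_airport"])

# Counterpart of: latitude < 0 && vegetation eq 'FO'
sh_forested = select(
    stations, lambda s: s["latitude"] < 0 and s["vegetation"] == "FO")

print(uk_airports, sh_forested)
```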
In addition to station properties you find in the inventory file of GHCN, there are a number of properties added by GHCN Processor:
mean_temperature. The properties
last_year tell you the years when the station started and finished reporting, respectively.
To reproduce Tamino's work, we can simply run the following commands:
ghcnp -include "last_year <= 1991" -o /tmp/wattergate-dropped.csv
ghcnp -include "last_year > 1991" -o /tmp/wattergate-kept.csv
GHCN Processor tells us that 2899 stations reported only until 1991 or earlier, whereas 1872 stations were reporting after 1991. That's in the adjusted data set. Notice that they are completely disjoint sets of stations. I can assure you that any half-decent grid partitioning method and station combination method will produce essentially the same results as Tamino's.
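The reason the two sets cannot overlap is that the two filters are exact complements: every station's last reporting year is either at most 1991 or greater than 1991. A small sketch with made-up records shows the split. It assumes last_year is simply the latest year for which a station has data, and the station ids and values are hypothetical.

```python
def last_year(series):
    """Latest year with data; series is a dict of year -> temperature."""
    return max(series)

# Hypothetical station records.
records = {
    "S1": {1950: 10.1, 1990: 10.4},
    "S2": {1960: 3.2, 1991: 3.0},
    "S3": {1970: 15.0, 2005: 15.6},
}

dropped = {sid for sid, series in records.items() if last_year(series) <= 1991}
kept = {sid for sid, series in records.items() if last_year(series) > 1991}

print(sorted(dropped), sorted(kept))
```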
GHCN Processor relies on the following third-party libraries and environments:
- Apache Commons Math 2.0 (Apache License 2.0)
- JUEL - Java Unified Expression Language (Apache License 2.0)
Of course, this tool wouldn't be possible without the GHCN v2 database.