Saturday, March 20, 2010

GHCN Processor 1.0

As widely reported, Tamino debunked a key "sceptic" claim to the effect that the drop-out of stations in GHCN around 1990 introduced a false warming trend. Tamino's result has been replicated multiple times.

There's probably no point in producing yet another replication of Tamino's work. Nevertheless, the underlying problem did pique my interest in an academic sense. In particular, Tamino mentioned in his preliminary-results post that he made all grid rows 10° tall, except for the northernmost row, which was 20° tall; however, he counted its surface area as if it were 10° tall. Evidently, this has to do with the fact that there are relatively few temperature stations near the Arctic. That row is not equivalent, in a statistical sense, to the other rows.

It occurred to me that an alternative way to partition a grid would be to create grid cells (or grid boxes, if you prefer) that have the same number of stations in them. I call them statistically equivalent grid cells. So I started to write some code.
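To illustrate the idea with a simplified sketch (my illustration, not necessarily how the tool implements it): sort the stations by latitude, slice them into rows containing equal numbers of stations, then slice each row by longitude the same way.

import java.util.*;

// Simplified sketch of equal-count ("statistically equivalent") gridding.
// Names and structure are illustrative only.
public class SegSketch {

    static class Station {
        final double lat, lon;
        Station(double lat, double lon) { this.lat = lat; this.lon = lon; }
    }

    // Splits the stations into 'rows' latitude bands of equal station
    // count, then splits each band into 'cellsPerRow' cells of equal
    // station count by longitude.
    static List<List<Station>> partition(List<Station> stations,
                                         int rows, int cellsPerRow) {
        List<Station> byLat = new ArrayList<Station>(stations);
        Collections.sort(byLat, new Comparator<Station>() {
            public int compare(Station a, Station b) {
                return Double.compare(a.lat, b.lat);
            }
        });
        List<List<Station>> cells = new ArrayList<List<Station>>();
        int n = byLat.size();
        for (int r = 0; r < rows; r++) {
            // Each band gets an (almost) equal share of the stations.
            List<Station> band = new ArrayList<Station>(
                    byLat.subList(r * n / rows, (r + 1) * n / rows));
            Collections.sort(band, new Comparator<Station>() {
                public int compare(Station a, Station b) {
                    return Double.compare(a.lon, b.lon);
                }
            });
            int m = band.size();
            for (int c = 0; c < cellsPerRow; c++) {
                cells.add(new ArrayList<Station>(
                        band.subList(c * m / cellsPerRow, (c + 1) * m / cellsPerRow)));
            }
        }
        return cells;
    }
}

The cells end up with irregular geographic sizes, but each contains about the same number of stations, which is the point.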

I ended up writing not just a script, but a full-fledged command-line tool that can produce a temperature record for a pretty much arbitrary set of stations. For example, it can produce a temperature series for urban areas of Brazil, or hilly regions of the Northern Hemisphere. All you do is pass a simple expression to the tool. Additionally, it is written in a modular manner, so it's relatively easy for others to contribute new grid partitioning methods, adjustments, station combination methods, station filters, and output formats. At least a couple of alternatives are already supported in each case.
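To give an idea of what "modular" means here, a pluggable grid partitioning method reduces to an interface along these lines (a hypothetical shape for illustration; the actual interfaces in the code may be named and organized differently):

import java.util.List;

// Hypothetical sketch of an extension point; illustrative only.
public interface GridPartitioner<S> {
    // Groups the given stations into grid cells.
    List<List<S>> partition(List<S> stations);
}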

I have set up a SourceForge project page for this project, and other projects and code that I might release through my blogs in the future.

License

GHCN Processor is open source software. It is released under the terms of the Apache License 2.0.

Requirements

You need a Java Runtime Environment (1.5 or later) installed and in your PATH. Many PCs already have one. You can check with:

java -version

Download

GHCN Processor can be downloaded from the Files Section of the project page.

Installation

Unzip the ZIP file into a directory of your choice. Then add the bin directory of the distribution to your PATH environment variable. On Windows, right-click My Computer and select Properties / Advanced / Environment Variables. On Linux or Mac OS X, append the bin directory to the PATH setting in your shell profile.

Documentation

Documentation on tool options can be found in the readme.txt file shipped with the GHCN Processor distribution. Any other posts I write about GHCN Processor should be available via the ghcn processor tag.

Global Examples

A default global temperature record can be created with:

ghcnp -o /tmp/ghcn-global.csv

The first time you run the tool, it downloads the inventory and data files from the GHCN v2 FTP site. If the download is interrupted, you could end up with incomplete files; in that case, run the tool again with the -f option.

The output is in CSV format, readable by spreadsheet software. By default it will contain one year per row, with 14 columns each. You can also produce a "monthly format" file with one month per row, as follows.

ghcnp -of monthly -o /tmp/ghcn-global-monthly.csv
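Either file is plain CSV, so post-processing is straightforward in any language. For example, a few lines of Java can load the rows (the path below is from the example above; check the file's header row to see which column is which):

import java.io.*;
import java.util.Arrays;

// Minimal reader for ghcnp CSV output; illustrative only.
public class ReadGhcnpCsv {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new FileReader("/tmp/ghcn-global.csv"));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",");
                System.out.println(Arrays.toString(fields));
            }
        } finally {
            in.close();
        }
    }
}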

By default, the tool uses the adjusted data set from GHCN v2. To use the raw, unadjusted data, try:

ghcnp -dt mean -o /tmp/ghcn-global-raw.csv

To use what I call "statistically equivalent grid cells", try:

ghcnp -gt seg -o /tmp/ghcn-global-seg.csv

The default combination method for stations that are in the same grid cell is a simple "temperature anomaly" method, with a base period of 1950 to 1981 by default. I've implemented an experimental "optimal" method that you can use as follows.

ghcnp -scm olem -o /tmp/ghcn-global-olem.csv

I'll discuss this in more detail on another occasion.
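In the meantime, here's a rough sketch of what the default anomaly combination does, simplified to one value per year (the real method presumably aligns by calendar month; this is my illustration, not the tool's code): each station's series is shifted by that station's own base-period mean, and the shifted series are then averaged.

import java.util.*;

// Simplified sketch of the default anomaly combination; illustrative only.
public class AnomalySketch {

    // series: one map per station, from year to temperature.
    // Returns a map from year to the mean anomaly across stations.
    static Map<Integer, Double> combine(List<Map<Integer, Double>> series,
                                        int baseStart, int baseEnd) {
        Map<Integer, Double> sum = new TreeMap<Integer, Double>();
        Map<Integer, Integer> count = new TreeMap<Integer, Integer>();
        for (Map<Integer, Double> station : series) {
            // The station's mean temperature over the base period.
            double base = 0.0;
            int n = 0;
            for (int y = baseStart; y <= baseEnd; y++) {
                Double t = station.get(y);
                if (t != null) { base += t; n++; }
            }
            if (n == 0) continue; // no base-period data; skip the station
            base /= n;
            // Accumulate the station's anomalies.
            for (Map.Entry<Integer, Double> e : station.entrySet()) {
                int y = e.getKey();
                double anomaly = e.getValue() - base;
                Double s = sum.get(y);
                sum.put(y, s == null ? anomaly : s + anomaly);
                Integer c = count.get(y);
                count.put(y, c == null ? 1 : c + 1);
            }
        }
        Map<Integer, Double> mean = new TreeMap<Integer, Double>();
        for (Map.Entry<Integer, Double> e : sum.entrySet()) {
            mean.put(e.getKey(), e.getValue() / count.get(e.getKey()));
        }
        return mean;
    }
}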

Regional Examples

To produce a regional temperature record, you should always pass a -reg option. The grid partitioning algorithm works differently when it's not attempting to produce a global grid. The following command produces a temperature series for airports in the United Kingdom.

ghcnp -reg -include " country=='UNITED KINGDOM' && is_airport " -o /tmp/uk-airports.csv

This alternative syntax also works:

ghcnp -reg -include " country eq 'UNITED KINGDOM' && is_airport " -o /tmp/uk-airports.csv

The expression passed to the -include option should be enclosed in double quotes, and its syntax is that of EL (basically the same as Java or C expression syntax). String literals in the expression must be enclosed in single quotes. The names of countries must be written exactly as they appear in the GHCN v2 country codes file.

As another example, let's get a temperature series for stations in the southern hemisphere that are in forested areas.

ghcnp -reg -include "latitude < 0 && vegetation eq 'FO' " -o /tmp/sh-forested.csv

GHCN Processor in this case tells us that there are only 7 such stations with temperature data in the adjusted data set.

Tamino Reproduction

In addition to the station properties you find in the GHCN inventory file, there are a number of properties added by GHCN Processor: slope, variance, first_year, last_year, and mean_temperature. The properties first_year and last_year tell you the years when the station started and finished reporting, respectively.
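For reference, slope and variance for a station's series could be computed along these lines (my assumption: an ordinary least-squares trend against time, and the plain variance of the series; the tool's exact definitions may differ):

// Sketch of per-station statistics; definitions are my assumption.
public class StationStatsSketch {

    // Ordinary least-squares slope of temperature against year.
    static double slope(double[] years, double[] temps) {
        int n = years.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += years[i]; my += temps[i]; }
        mx /= n;
        my /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (years[i] - mx) * (temps[i] - my);
            sxx += (years[i] - mx) * (years[i] - mx);
        }
        return sxy / sxx; // degrees per year
    }

    // Population variance of the series.
    static double variance(double[] temps) {
        double m = 0;
        for (double t : temps) m += t;
        m /= temps.length;
        double v = 0;
        for (double t : temps) v += (t - m) * (t - m);
        return v / temps.length;
    }
}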

To reproduce Tamino's work, we can simply run the following commands:

ghcnp -include "last_year <= 1991" -o /tmp/wattergate-dropped.csv

ghcnp -include "last_year > 1991" -o /tmp/wattergate-kept.csv

GHCN Processor tells us that 2899 stations reported only until 1991 or earlier, whereas 1872 stations were still reporting after 1991. That's in the adjusted data set. Notice that these are completely disjoint sets of stations. I can assure you that any half-decent grid partitioning method and station combination method will produce essentially the same results as Tamino's.

Acknowledgments

GHCN Processor relies on a number of third-party libraries and environments.

Of course, this tool wouldn't be possible without the GHCN v2 database.

11 comments:

Pekko said...

Many thanks. This is much appreciated. It's nice to see a good OO design applied to this particular problem.

tomcpp said...

Just because I'd like to actually know the answer to this question:

How do we tell if this database is trustworthy? And I've actually taught at a university, so please don't tell me "scientists aren't biased"; that just won't do.

Joseph said...

How do we tell if this database is trustworthy? And I've actually taught at a university, so please don't tell me "scientists aren't biased"; that just won't do.

@tomcpp: Any reason to think it wouldn't be? It's hard to imagine the raw measurements would be biased. Note that there's an adjusted data set, but the raw unadjusted measurements are also available.

PhilScadden said...

tomcpp: Well, look at the origins of the data: numerous different organizations everywhere contributing. That would require quite a conspiracy. A trick by the database collators? Well, if the reported temperature for your part of the world differed from your country's met office records, don't you think they would notice? And of course there is the supporting evidence. The record shows the same characteristics (trends, maxima and minima) as records from satellite MSU, from satellite SST, and from temperature proxies, e.g. global glacial volume and sea level in particular. I guess you could go to each country and pay the local agency, if required, for their data for comparison. I think I would have to have some strong reason to suspect GHCN of fraud (e.g. a variance with other records) before I'd bother.

Joseph said...

Plus no one has convincingly demonstrated that there is a systematic warming bias in the adjustments, when all stations are looked at. Quite the contrary.

Zeke said...

Well done. It's nice to see more folks creating spatial weighting and gridding programs.

That said, I'm unsure of how effective your method will be in accounting for biases in the spatial distribution of stations. If grid cells have a size that is defined by the number of stations available, you will end up interpolating temperatures over quite long distances in regions with few stations (I presume each resulting grid cell is still weighted by its size relative to all the grid cells available when calculating global anomalies).

Ron Broberg over at the Whiteboard has been doing some interesting work in creating independent metadata characteristics for GHCN and USHCN stations (e.g. via spatial pop density data, urban boundaries, satellite nightlights, etc.) that you might find useful. I've been playing around with them at Lucia's place to look into UHI issues of late.

Joseph said...

@Zeke: Sure. The fact that some regions have a scarcity of stations is a problem all in itself, and it's not really solvable, except by adding stations at those locations. GHCN has few stations over sea. It has no stations in Antarctica south of 80°S (except two in the unadjusted data).

When you have grid boxes that are 5x5 (or 5x10, the GHCN Processor default), that's also problematic. What do you do with boxes that have no stations?

In the Southern Hemisphere, you'll have lots of boxes with no stations. You'll have lots of boxes with 1 station. Basically, you end up having to produce an average for a (latitude) row, and then you average the rows, weighted for surface area of the row.
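For the record, the surface area of a latitude band is proportional to the difference of the sines of its bounding latitudes, so the row weight is simple to compute (an illustration, not the tool's actual code):

// Relative surface area of a latitude band; illustrative only.
public class RowWeightSketch {
    static double bandWeight(double latBottomDeg, double latTopDeg) {
        return Math.sin(Math.toRadians(latTopDeg))
             - Math.sin(Math.toRadians(latBottomDeg));
    }
}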

With the "statistically equivalent grid cells" method, you don't have that issue. You can even count Antarctica as much as that's possible.

There are other differences, as you can imagine.

Zeke said...

Joseph,

Fair enough. It just might be safer to put an upper bound on the allowed distance from a station, to avoid weirdness if you start looking at smaller subsets of stations (e.g. only urban ones).

I wonder if you could produce a map of the resulting grid cells? It would give a better sense of how the dynamic sizes work in practice.

Joseph said...

@Zeke: Yes, that's a good idea for a feature. I have looked at them while debugging. The southernmost row is the tallest one for 5 stations per cell: about 23 degrees (as I recall) for the adjusted database.

Of course, you have control over the number of stations per cell, and the number of cells per non-polar row. Polar rows have exactly 1 cell in them.

steven said...

Nice work,

At some point I think it would make sense for somebody to create a set of synthetic data sets, so that we can test all the systems with the same data; notably, GISS uses different data than others (see my details over at Lucia's). Anyway, the idea would be to create a series of files in the same format as GHCN's v2.mean. The test data would consist of a variety of data and data problems. For starters, a v2.mean data set that is complete from 1850 to the present, with data for every station and every month (like all zeros), would be a good first test to throw at all the various approaches.

Then test data sets can be created with a variety of "problems": missing months, missing stations, various trends.

Finally, have a look at Zeke's suggestion and visit Ron's page, where we look at metadata. It needs a good looking at. WRT raw data: it isn't raw, but that shouldn't stop folks from comparing methods.

John Baez said...

The two links to posts by Tamino at the top of this blog entry no longer work.