Tuesday, March 30, 2010

US Rural vs. Urban Temperature Stations

There's an article by Edward R. Long titled "Contiguous U.S. Temperature Trends Using NCDC Raw and Adjusted Data For One-Per-State Rural and Urban Station Sets." It claims to show that in the raw/unadjusted NCDC data, urban U.S. stations have a warming trend that diverges from that of rural stations, whereas in the adjusted data, the rural trend has been adjusted to "take on the time-line characteristics of urban data." In not so many words, it claims that NCDC data has been fudged. Not surprisingly, Long's article appears to be quite popular in "sceptic" circles.

The methodology of the article is peculiar. First, why only analyze the U.S.? Even though the U.S. has more stations than any other single country, its surface area is only 2% that of Earth.

More importantly, why pick only one rural and one urban station from each state? What was the criteria used to pick each state's stations? Was it random? The article does not clarify, so it lends itself to accusations of cherry-picking.

It's fairly easy to verify Long's claims with GHCN Processor. A quick verification takes perhaps 10 minutes if you're familiar with the tool's options.

First, we can get rural and urban temperature anomaly series for the U.S. from the adjusted data, with the following commands, respectively:

ghcnp -include "country eq 'UNITED STATES OF AMERICA' && population_type eq 'R'" -reg -o /tmp/us-rural.csv

ghcnp -include "country eq 'UNITED STATES OF AMERICA' && population_type eq 'U'" -reg -o /tmp/us-urban.csv

GHCN Processor informs us that it processed 867 rural stations, and 117 urban stations. That's plenty more than the article analyzes. Let's take a look at a graph of both series.

This is still consistent with Long's claims. What we want to confirm is whether the rural trend is significantly less steep in the raw unadjusted data. To get rural and urban series from the unadjusted data file, I used the following commands, respectively:

ghcnp -dt mean -include "country eq 'UNITED STATES OF AMERICA' && population_type eq 'R'" -reg -o /tmp/us-rural-raw.csv

ghcnp -dt mean -include "country eq 'UNITED STATES OF AMERICA' && population_type eq 'U'" -reg -o /tmp/us-urban-raw.csv

In this case, the tool informs us that 1046 rural stations have been analyzed, compared to 392 urban stations. A graph of the series follows.

To be thorough, let's also get 12-year running averages of the unadjusted series. That's what the article does.

This graph is very different to Figure 6 from Long's article, and it doesn't support Long's conclusions by any stretch of the imagination. It's also clear that while Long's urban trend is roughly correct, the 48 rural stations he picked are not representative of the 1046 stations GHCN Processor retrieves out of the raw data file. Why they are not representative can only be speculated upon, but I have some ideas.

The divergence between urban and rural stations that exists prior to the reference period (1950 to 1981) might be something that needs to be looked into further, but it's not too surprising. The farther back you go, the fewer the number of stations that report. There's simply more error in older data.

Addendum (3/31/2010)

In comments, steven suggests that GHCN v2 population assessments are old and may no longer be applicable. For example, a station might be near a town that used to have less than 10,000 people, and classified as 'R', but then the town grew.

Intuitively, it doesn't seem likely that this would be sufficient to explain away the findings, and it certainly doesn't address Dr. Long's choice of only 48 rural stations in the U.S. But I try to keep an open mind about these types of arguments, within reason.

Fully addressing steven's objection would take substantial work, but we can do the next best thing. Steven indicates that population density is what really matters. Let's take a look at a population density map of the United States (from Wikimedia):

Here's another such map. A portion of the U.S. (basically, the mid-west) has considerably lower population density than the rest of the country. Let's define this low-density region as that bounded by longitudes -95° and -115°, which excludes California. A longitude map of the US can be found here.

A typical rural station in the low-density region should not be as likely to be near an urban area as a rural station in high-density areas of the U.S. Additionally, the population density around the station should be lower for rural stations in the low-density region, in average. Does that make sense?

With GHCN Processor we can easily obtain an unadjusted temperature anomaly series only for rural stations in the low-density region, as follows.

ghcnp -dt mean -include "country eq 'UNITED STATES OF AMERICA' && population_type eq 'R' && longitude < -95 && longitude > -115" -reg -o /tmp/us-really-rural-raw.csv

I've calculated 12-year running averages of the rural low-density series, and plotted it along with the full-rural series.

Things haven't really changed, have they?

There seems to be somewhat of an offset between both station sets, which is interesting to some extent. Apparently, "really rural" stations were warm relative to all rural stations during the baseline period (1950 to 1981.)

BTW, there are 432 "really rural" stations in the unadjusted data file.

Sunday, March 28, 2010

The Average Temperature of Earth

If you Google the average temperature of Earth, you'll find a couple of frequent estimates: 13°C and 15°C. GISTemp data files carry a note that says:
Best estimate for absolute global mean for 1951-1980 is 14.0 deg-C or 57.2 deg-F...

GHCN Processor 1.1 has a -abs option that causes the tool to write out "absolute measurement" averages, as opposed to temperature anomalies. Additionally, simulations I've run indicate that the tool's default cell and station combination method (the linear-equations-based method) is adequate for this sort of application.

You can get a global absolute measurement series with the following command.

ghchp -gt seg -abs -o /tmp/ghcn-global-abs.csv

In this case the tool says 4771 stations have been analyzed. GHCN v2 has considerably more stations, however. A lot of them are dropped in the adjusted data set because they don't have enough data points. To get a series based on the raw data, you can run:

ghcnp -dt mean -gt seg -abs -o /tmp/ghcn-global-abs-noadj.csv

The tool now analyzes 7067 out of 7280 stations. (GHCN Processor 1.1 has an implicit filter that drops any station without at least 10 data points in any given month.)

It probably shouldn't come as a surprise that there are differences in the results we obtain with each data set. With the adjusted data, the average for the baseline period 1950 to 1981 is 14.7°C. With the raw data, the average is 14.1°C. The reason for the discrepancy probably has to do with the sorts of stations that don't make it into the adjusted data set, typically because they haven't reported long enough. They might be cold stations, like stations near Antarctica, a region with a substantial scarcity of stations in the adjusted data set.

Let me post a graph of both series, just so there's no confusion.

So it's basically an offset difference between the two. If I'm correct about it being due to stations near Antarctica, one caveat would be elevation. It appears that stations in Antarctica are high up, and this is a problem. We could filter stations by elevation, but then we basically drop Antarctica, like the adjusted data set does.

I believe 14.1°C is probably a low estimate. It's also pretty clear that GHCN v2 is land-biased.

There does seem to be a slight slope difference (0.15°C/century) between the adjusted and raw series. This can't be anywhere near significance, and again, it probably has to do with the sorts of stations that don't make it into the adjusted data set. I wouldn't be surprised if "sceptics" manage to make a big deal of it, though.

GHCN Processor 1.1

Version 1.1 of GHCN Processor is now available for download. Relative to Version 1.0, the highlights are:

  • Implemented the first difference station combination method (Peterson et al. 1998.)
  • Added -abs option, which causes the tool to produce an "absolute measurement" series, as opposed to an "anomaly" series.
  • Added a -ccm option (similar to -scm) that sets the grid cell/box combination method.
  • The default station and grid cell combination method is the linear-equation-based method (olem.) This decision was based on simulations I ran comparing all 3 supported methods, involving series with equal and different underlying slopes.
  • Added -og option, which lets you write grid partitioning information to a CSV-format file. This feature is informative if you use the -gt seg option. (Zeke suggested producing a map. This is the next best thing.)

See the release-notes.txt and readme.txt files for additional details.

Monday, March 22, 2010

How GHCN Processor Combines Stations

To combine temperature series from stations that occupy the same grid cell/box, it's normally not a good idea to simply calculate an average of the measurements. Actually, this would work just fine if all stations had complete data for the entire period of interest (e.g. 1880 to 2009.) They typically do not, however. Moreover, the absolute temperature measurements at one station can be systematically higher or lower than those of another station, even if the stations are close to one another. Because of this, if you just average stations, you will probably end up with "step" changes that do not accurately reflect reality.

Tamino proposed an optimal solution to the problem not long ago. Then RomanM wrote about a "more optimal" method, which he later revised in his Plan C post.

The problem both Tamino and RomanM took on involves finding an offset for each station-month (μi for station i.) Once you find the right offsets, you can add each temperature series to their corresponding offsets in order to "normalize" the stations, so to speak. Then you can average out the "normalized" series.

I believe that's the right problem to solve. My approach to solving it was different, nonetheless, and rather simple, in my view. The approach is already implemented and shipped with GHCN Processor 1.0. You can browse the source code of the implementation in CVS. (You need to pass a -scm olem option to the tool in order to use the method, otherwise it just uses a simple "temperature anomaly" method.)

Let's take as a given that you can always find a difference or delta between two temperature stations. Between stations i and j, let's call the difference ΔTi,j.

Note that the difference between offsets μi and μj is equal to -ΔTi,j. That's obvious enough, isn't it? The solution is easier to explain with an example. Let's take a grid cell with only 3 stations. The following two equations hold for such a set of stations:

(1) μ0 - μ1 = -ΔT0,1
(2) μ0 - μ2 = -ΔT0,2

Additionally, the sum of all offsets should be zero. This way, if you have data from all stations in a given year, the sum of the "normalized" measurements will be the same as the sum of the original absolute measurements. Hence, we have a third equation:

(3) μ0 + μ1 + μ2 = 0

Equations 1, 2 and 3 are nothing more than a system of linear equations: 3 equations with 3 variables. It's actually simpler that a typical system of linear equations. Notice that if you simply add the equations, you can easily solve for μ0:

μ0 = -(ΔT0,1 + ΔT0,2) / 3

Once you have μ0, you can easily solve for μ1 (with equation 1) and μ2 (with equation 2).


You might have noticed that there's one more equation we could have used: one involving μ1 and μ2. In fact, a valid alternative system of 3 equations is the following:

(1) μ1 - μ2 = -ΔT1,2
(2) μ1 - μ0 = -ΔT1,0
(3) μ0 + μ1 + μ2 = 0

While both equation systems are valid, the first equation system was biased toward station 0, whereas the second equation system is biased toward station 1. If you only solve one system of n equations biased toward station i, the solution is similar to the reference station method (Hansen & Lebedeff 1987.) The main limitation of this method is that it's dependent on the reliability of a single station: the reference station.

The full algorithm solves n systems of n equations each, with station i (from 0 to n-1) being the reference station in each case. Then it averages out the offsets that result from solving each of the equation systems. A straight average would be completely unbiased. The implementation uses a weighted average, where the weight of a reference station is the number of years with available data. The solution is thus biased, but only to the extent that some stations have more complete data than other stations.


I thought that was a sound plan, but I wondered if it worked in practice. I wrote a simulation to find out.

The simulation consists of 15 stations with a common non-linear trend, and some common noise. On top of that, each station has noise and seasonality unique to it. Stations start out with complete data, and we calculate a "expected" temperature series with these data. Then we remove data points at random from each station – anywhere from 0% to 80% of them. We then apply a station combination algorithm and come up with a "estimated" temperature series. I have uploaded the source code of the simulation into CVS, where you will find all other details.

Let's start by evaluating how a simple "temperature anomaly" method does with this simulation. I'm interested in how well it predicts the absolute temperature measurements, not just the anomalies. It's not too bad. A linear regression of "expected" vs. "estimated" temperatures has a correlation coefficient R2 of 0.987, with a slope of 0.99. Ideally, the slope should be 1. There's also an intercept of -0.251. This should ideally be zero.

Now let's see how the linear equation-based method does. The correlation coefficient R2 is 0.997 in this case. The slope is basically 1, and the intercept is basically zero. It's nearly an ideal solution.

Saturday, March 20, 2010

GHCN Processor 1.0

As widely reported, Tamino debunked a key "sceptic" claim to the effect that the drop-out of stations in GHCN around 1990 introduced a false warming trend. Tamino's result has been replicated multiple times.

There's probably no point in producing yet another replication of Tamino's work. The theoretical problem itself did peak my interest in an academic sense, nevertheless. In particular, Tamino had mentioned in his preliminary results post that he was making all grid rows 10° tall, except for the northernmost row, which was 20° tall. However, he was counting its surface area as if it were 10° tall. Evidently, this has to do with the fact that there are relatively few temperature stations near the Arctic. That row is not equivalent, in a statistical sense, to other rows.

It occurred to me that an alternative way to partition a grid would be to create grid cells (or grid boxes, if you prefer) that have the same number of stations in them. I call them statistically equivalent grid cells. So I started to write some code.

I ended up writing not just a script, but a full-fledged command-line tool that can produce a temperature record for a pretty-much arbitrary set of stations. For example, it can produce a temperature series for urban areas of Brazil, or hilly regions of the Northern Hemisphere. All you do is pass a simple expression to the tool. Additionally, it is written in a modular manner, so it's relatively easy for others to contribute new grid partitioning methods, new adjustments, new station combination methods, station filters, and output formats. At least a couple alternatives are already supported in each case.

I have set up a SourceForge project page for this project, and other projects and code that I might release through my blogs in the future.


GHCN Processor is open source software. It is released under the terms of the Apache License 2.0.


You need to have a Java Runtime Environment 1.5+ installed and in your PATH. It's not uncommon for PCs to already have it. You can check with:

java -version


GHCN Processor can be downloaded from the Files Section of the project page.


You can unzip the ZIP file in a directory of your choice. Then you need to add the bin directory of the distribution to your PATH environment variable. In Windows, you need to right-click My Computer, then select Properties / Advanced / Environment Variables.


Documentation on tool options can be found in the readme.txt file shipped with the GHCN Processor distribution. Any other posts I write about GHCN Processor should be available via the ghcn processor tag.

Global Examples

A default global temperature record can be created with:

ghcnp -o /tmp/ghcn-global.csv

The first time you run the tool, it will download inventory and data files from the GHCN v2 FTP site. Note that if you happen to interrupt the download, you could end up with incomplete files. In this case, run the tool again with the -f option.

The output is in CSV format, readable by spreadsheet software. By default it will contain one year per row, with 14 columns each. You can also produce a "monthly format" file with one month per row, as follows.

ghcnp -of monthly -o /tmp/ghcn-global-monthly.csv

By default, the tool uses the adjusted data set from GHCN v2. To use the raw, unadjusted data, try:

ghcnp -dt mean -o /tmp/ghcn-global-raw.csv

To use what I call "statistically equivalent grid cells", try:

ghcnp -gt seg -o /tmp/ghcn-global-seg.csv

The default combination method for stations that are in the same grid cell is a simple "temperature anomaly" method, with a base period if 1950 to 1981 by default. I've implemented a experimental "optimal" method that you can use as follows.

ghcnp -scm olem -o /tmp/ghcn-global-olem.csv

I'll discuss this in more detail on another occasion.

Regional Examples

To produce a regional temperature record, you should always pass a -reg option. The grid partitioning algorithm works differently when it's not attempting to produce a global grid. The following command produces a temperature series for airports in the United Kingdom.

ghcnp -reg -include " country=='UNITED KINGDOM' && is_airport " -o /tmp/uk-airports.csv

This alternative syntax also works:

ghcnp -reg -include " country eq 'UNITED KINGDOM' && is_airport " -o /tmp/uk-airports.csv

The expression after the -include option should be enclosed in double-quotes, and its syntax is that of EL (basically the same as Java or C expression syntax.) String literals in the expression must be enclosed in single quotes. The names of countries must be written exactly as they appear in the GHCN v2 country codes file.

As another example, let's get a temperature series for stations in the southern hemisphere that are in forested areas.

ghcnp -reg -include "latitude < 0 && vegetation eq 'FO' " -o /tmp/sh-forested.csv

GHCN Processor in this case tells us that there are only 7 such stations with temperature data in the adjusted data set.

Tamino Reproduction

In addition to station properties you find in the inventory file of GHCN, there are a number of properties added by GHCN Processor: slope, variance, first_year, last_year, and mean_temperature. The properties first_year and last_year tell you the years when the station started and finished reporting, respectively.

To reproduce Tamino's work, we can simply run the following commands:

ghcnp -include "last_year <= 1991" -o /tmp/wattergate-dropped.csv

ghcnp -include "last_year > 1991" -o /tmp/wattergate-kept.csv

GHCN Processor tells us that 2899 stations reported only until 1991 or earlier, whereas 1872 stations were reporting after 1991. That's in the adjusted data set. Notice that they are completely disjoint sets of stations. I can assure you; any half-decent grid partitioning method and station combination method will produce essentially the same results as Tamino's.


GHCN Processor relies on the following third-party libraries and environments:

Of course, this tool wouldn't be possible without the GHCN v2 database.

Friday, March 5, 2010

Essentially The Same Graph, Different Smoothing

This is probably the last post I will write on the topic of tropical cyclones in a long time. I don't intend to make this an all-tropical-cyclones-all-the-time blog. I promise.

I thought I should mention a paper I found – Mann & Emanuel (2006) – that contains a graph of storm counts and sea surface temperatures, shown below.

The smoothing is somewhat different to that of the graphs I've produced. It's decadal smoothing. They apparently use SST data for the Atlantic averaged over August-September-October. (I did something like this in my first analysis, but I do not believe it makes sense anymore.)

Mann & Emanuel then compare the decadally smoothed series, and find a correlation coefficient R of 0.73 (p < 0.001 one-tailed.) Based on these results, the authors concluded that:
There is a strong historical relationship between tropical Atlantic SST and tropical cyclone activity extending back through the late nineteenth century.

The way I see correlation coefficients, 0.5 is "not bad," 0.75 is "pretty good," 0.9 is "very good," and 0.99+ is "law of physics."

The confidence level is what really matters when it comes to causality, in my view. In my statistical analysis, where data had all the original noise, I found a detrended association with 99.993% confidence (equivalent to p < 0.00007) – allowing for a lag of 1 year.

I wondered, nevertheless: What would the correlation coefficient be like if we compare the 15-year central moving averages of the data I've been using? Let's do this only considering data since 1870, like Mann & Emanuel do. Let's also allow for a lag, given that such is evident in the graph.

I found something surprising. The optimal lag is 7 years, not a few years like I had presumed. The correlation coefficient R at this lag is 0.924 – well within "very good" territory.

You know what's interesting about the 7-year lag? 2005 minus 1998 is 7. I might come back to that some other time.

I conclude that Mann & Emanuel were on the right track, but they didn't make a strong enough case. There's a presentation by Tom Knutson of NOAA where he mentions the graph from Mann & Emanuel (2006), apparently thinking it's interesting, but wonders if the storm record is reliable enough to produce a graph like that. Then he goes on to discuss the long-term trend, which may or may not be statistically significant (as I already explained, causality and trend significance are different things), and the projections of computer simulations.

Thursday, March 4, 2010

Literature on the Number of Tropical Cyclones

It's not exactly true that my claims about the number of tropical cyclones tracking SSTs are at odds with observational science. They might be at odds with some models, as discussed in 4AR WGI

See, for example, Webster et al. (2005). This paper has some interesting graphs too, but the time span is short.
We conclude that global data indicate a 30-year trend toward more frequent and intense hurricanes, corroborated by the results of the recent regional assessment.

I should also note that determining causality and determining whether there's a statistically significant upward trend are technically different endeavors. You can have one without the other. This applies to other effects of AGW as well, like sea level rise. Does that make sense?

Wednesday, March 3, 2010

Recent Storm Seasons Have Been Mild, Right?

OK. The graph with the 15-year (or 17-year, or 21-year) smooth NH SST and Atlantic storm series looks good, almost too good. I was reminded of that fact recently. But then, you might ask, how do we reconcile this with the fact that Atlantic storm seasons after 2005 have been mild? Should I believe my lying eyes, or... my lying eyes?

Mild compared to what? Anything is mild when you compare it to the Katrina season.

You can search Wikipedia for each year's count, e.g. google "2006 Atlantic Hurricane Season." These are the figures:

2006: 9
2007: 15
2008: 16
2009: 9

The average is 12.25. The average between 1944 and 1990 was 9.8. I'm not saying recent seasons are significantly more busy, but you can't claim they are milder than the 1944-1990 average either.

Clearly, 2007 and 2008 were both considerably more busy than normal. They probably went unnoticed because, again, what could possibly compare to the 2005 season?

Note that I never claimed storms resulting from global warming were going to be the end of us all, or anything of the sort. I haven't even attempted to model it yet. Any projection I made carried a caveat: that the association would have to be linear. But it can't really be linear, can it? I can't say, for example, if the number of storms saturates after a certain temperature is reached.

I never even claimed that there has been a statistically significant rise in the number of storms. It seems reasonable this is the case, but I just haven't done this analysis. What I did is determine whether storms track sea-surface temperatures, and they do.

I will say one thing, though. Do not be surprised if the 2011 or 2012 seasons are quite busy. I'm saying that because 2010 is an El Niño year. It's not the case that busy seasons always follow El Niño years, or that all El Niño years end up producing busy seasons. But based on a quick comparison, it often does happen that way.

Too Easy To Be True?

Back in June of 2008, I wrote a statistical analysis of the association between northern hemisphere sea surface temperatures and the number of named storms in the Atlantic basin. I determined, with 99.993% confidence, that indeed there was such an association. I had controlled for coincidental trends (otherwise known as a spurious correlation) by means of detrending.

A commenter over at Climate Audit tacitly admitted reproducing my analysis, but pointed out that if he detrended the series using 6th-order polynomial trendlines, the association no longer held. I noted that if you allow for a lag of 1 year between the series, even the 6th-order detrending resulted in a statistically significant association, despite the loss of information that necessarily results from such a detrending. The existence of the lag between temperatures and named storms would soon become crystal clear to me and my (not very many) readers.

During the back and forth with the Climate Audit commenter, I realized that if you simply smooth out the noise from both series, the association becomes graphically evident, and a lot more convincing – I thought – than a statistical analysis. The first graph I posted used nothing more than a 21-year central moving average for smoothing. The results were remarkable and the graph was remarkably easy to produce.

As it turns out, it was also remarkably difficult to believe. A few months went by, and a reader (paulm) posted a link to my analysis over at the Accuweather.com GW blog. The graph was met with a got-to-be-fake type of response.

When I found out, this obviously made me upset. You can tell I was upset as I was explaining my very lenient comment policy (see sidebar) in my response to the Accuweather.com incident. I even posted the spreadsheet for verification. There were no further falsification accusations after I did this, and I thought that was the end of it.

Fast forward a year and a half. Deltoid has a recent post on the topic, quoting various IPCC statements, and I basically commented that the IPCC was wrong, in my view, in regards to the number of tropical cyclones not changing in response to global warming. The data told me otherwise.

I guess questioning an IPCC claim was a mistake, wasn't it? I might have also broken some social norm I'm not aware of or something. Some of the regulars started to talk to me as if I were one of the resident trolls, like El Gordo or Spangled Drongo. They basically accused me of fraud and trying to deceive, in a way that is not dissimilar to how the CRU team are accused of fraud and so forth.

Things calmed down after I, once again, posted the spreadsheet. I appreciate Bernard J's semi-apology.

For reference, below are the links to the data and the spreadsheet. I posted the spreadsheet at a more permanent location.

What else can I do? You've seen my comment policy. Should I post screenshots too?

I'd hate to think that my most interesting graphs are assumed to be fake a priori. Is my graph of Red Sea sea level and Vostok temperatures a fake? What about my graph of the natural spline interpolation of the Law Dome CO2 data? What about the one with the Mann et al. (2008) reconstruction and CO2 at the time of the industrial revolution? What about the graph with the Greenland temperature reconstruction?

I realize I often post claims that you can't just Google to confirm, and I realize that people are sometimes paranoid. That's why I have the comment policy I have.