Monday, March 22, 2010

How GHCN Processor Combines Stations

To combine temperature series from stations that occupy the same grid cell/box, it's normally not a good idea to simply average the measurements. That would work just fine if all stations had complete data for the entire period of interest (e.g., 1880 to 2009). They typically do not, however. Moreover, the absolute temperatures measured at one station can be systematically higher or lower than those of another station, even if the two are close to one another. Because of this, if you just average stations, you will probably end up with "step" changes that do not accurately reflect reality.

Tamino proposed an optimal solution to the problem not long ago. Then RomanM wrote about a "more optimal" method, which he later revised in his Plan C post.

The problem both Tamino and RomanM took on involves finding an offset for each station-month (μi for station i). Once you find the right offsets, you can add each temperature series to its corresponding offset in order to "normalize" the stations, so to speak. Then you can average the "normalized" series.
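The normalize-then-average step can be sketched in a few lines. This is a Python toy, not the actual Java implementation shipped with GHCN Processor; `None` stands in for a missing reading:

```python
def combine(series_by_station, offsets):
    """Add each station's offset to its series, then average the
    'normalized' readings available in each period."""
    n_periods = len(series_by_station[0])
    combined = []
    for t in range(n_periods):
        vals = [s[t] + mu for s, mu in zip(series_by_station, offsets)
                if s[t] is not None]
        combined.append(sum(vals) / len(vals) if vals else None)
    return combined
```

With offsets of +1 and -1, two stations reading 10 and 12 in the same period both normalize to 11, so no "step" appears when one of them drops out.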

I believe that's the right problem to solve. My approach to solving it was different, though, and rather simple in my view. The approach is already implemented and shipped with GHCN Processor 1.0. You can browse the source code of the implementation in CVS. (You need to pass a -scm olem option to the tool in order to use this method; otherwise it uses a simple "temperature anomaly" method.)

Let's take as a given that you can always find a difference or delta between two temperature stations. Between stations i and j, let's call the difference ΔTi,j.
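The post doesn't spell out how ΔTi,j is computed; one natural definition, assumed here, is the mean of the differences over the periods where both stations report:

```python
def pairwise_delta(series_i, series_j):
    """Mean of T_i - T_j over periods where both stations have data.
    This definition is an assumption; the post only takes the delta
    as a given. `None` marks a missing reading."""
    common = [(a, b) for a, b in zip(series_i, series_j)
              if a is not None and b is not None]
    if not common:
        raise ValueError("stations have no overlapping data")
    return sum(a - b for a, b in common) / len(common)
```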

Note that the difference between offsets μi and μj is equal to -ΔTi,j. That's obvious enough, isn't it? (Once normalized, Ti + μi and Tj + μj should agree, so μi - μj = Tj - Ti = -ΔTi,j.) The solution is easier to explain with an example. Let's take a grid cell with only 3 stations. The following two equations hold for such a set of stations:

(1) μ0 - μ1 = -ΔT0,1
(2) μ0 - μ2 = -ΔT0,2

Additionally, the sum of all offsets should be zero. This way, if you have data from all stations in a given year, the sum of the "normalized" measurements will be the same as the sum of the original absolute measurements. Hence, we have a third equation:

(3) μ0 + μ1 + μ2 = 0

Equations 1, 2 and 3 are nothing more than a system of linear equations: 3 equations with 3 variables. It's actually simpler than a typical system of linear equations. Notice that if you simply add the three equations, the μ1 and μ2 terms cancel, and you can easily solve for μ0:

μ0 = -(ΔT0,1 + ΔT0,2) / 3

Once you have μ0, you can easily solve for μ1 (with equation 1) and μ2 (with equation 2).
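The whole 3-station solution is a few lines. A Python sketch of the derivation above (not the CVS implementation):

```python
def offsets_ref0(d01, d02):
    """Solve equations (1)-(3) with station 0 as the reference.
    d01 = ΔT0,1 and d02 = ΔT0,2."""
    mu0 = -(d01 + d02) / 3.0   # from adding (1), (2) and (3)
    mu1 = mu0 + d01            # from (1): μ0 - μ1 = -ΔT0,1
    mu2 = mu0 + d02            # from (2): μ0 - μ2 = -ΔT0,2
    return mu0, mu1, mu2
```

Note that the three offsets always sum to zero, as equation (3) requires.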


You might have noticed that there's one more equation we could have used: one involving μ1 and μ2. In fact, a valid alternative system of 3 equations is the following:

(1) μ1 - μ2 = -ΔT1,2
(2) μ1 - μ0 = -ΔT1,0
(3) μ0 + μ1 + μ2 = 0

While both equation systems are valid, the first equation system is biased toward station 0, whereas the second equation system is biased toward station 1. If you only solve one system of n equations biased toward station i, the solution is similar to the reference station method (Hansen & Lebedeff 1987). The main limitation of this method is that it's dependent on the reliability of a single station: the reference station.

The full algorithm solves n systems of n equations each, with station i (from 0 to n-1) being the reference station in each case. Then it averages out the offsets that result from solving each of the equation systems. A straight average would be completely unbiased. The implementation uses a weighted average, where the weight of a reference station is the number of years with available data. The solution is thus biased, but only to the extent that some stations have more complete data than other stations.
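For n stations, the same summation trick gives μref = -(Σj ΔTref,j) / n for the reference station, and μj = μref + ΔTref,j for the others. A Python sketch of the full procedure (an illustration of the description above, not the CVS implementation; the delta matrix and weights are inputs):

```python
def combine_offsets(deltas, weights):
    """One offset vector per choice of reference station, then a
    weighted average of those vectors. deltas[i][j] = ΔT i,j, with
    deltas[i][i] = 0 and deltas[j][i] = -deltas[i][j]; weights are
    assumed to be each station's number of years with data."""
    n = len(deltas)
    per_ref = []
    for ref in range(n):
        mu_ref = -sum(deltas[ref]) / n               # μref
        per_ref.append([mu_ref + deltas[ref][j]      # μj = μref + Δref,j
                        for j in range(n)])
    total_w = sum(weights)
    return [sum(weights[r] * per_ref[r][j] for r in range(n)) / total_w
            for j in range(n)]
```

When the deltas are perfectly self-consistent (ΔTi,j = ΔTi,k + ΔTk,j), every reference station yields the same offsets, so the weighting only matters when noisy or incomplete data make the systems disagree.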


I thought that was a sound plan, but I wondered if it worked in practice. I wrote a simulation to find out.

The simulation consists of 15 stations with a common non-linear trend and some common noise. On top of that, each station has noise and seasonality unique to it. Stations start out with complete data, and we calculate an "expected" temperature series from these data. Then we remove data points at random from each station – anywhere from 0% to 80% of them. We then apply a station combination algorithm and come up with an "estimated" temperature series. I have uploaded the source code of the simulation to CVS, where you will find all other details.
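A toy Python version of such a setup might look as follows. The parameter values are illustrative, not the ones used in the actual simulation in CVS:

```python
import math
import random

def simulate_stations(n_stations=15, n_months=1200, seed=42):
    """Toy version of the simulation described above: a shared
    non-linear trend plus shared noise, with per-station baseline,
    seasonality and noise; data points are then dropped at random
    (0% to 80% per station). All constants are illustrative."""
    rng = random.Random(seed)
    # Common signal: a slow quadratic trend plus common noise.
    common = [2e-6 * t * t + rng.gauss(0.0, 0.2) for t in range(n_months)]
    stations = []
    for _ in range(n_stations):
        base = rng.gauss(10.0, 3.0)      # station-specific level
        amp = rng.uniform(0.5, 2.0)      # station-specific seasonality
        full = [base + common[t]
                + amp * math.sin(2 * math.pi * t / 12)
                + rng.gauss(0.0, 0.3)
                for t in range(n_months)]
        drop = rng.uniform(0.0, 0.8)     # fraction of points to remove
        stations.append([None if rng.random() < drop else v for v in full])
    return stations
```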

Let's start by evaluating how a simple "temperature anomaly" method does in this simulation. I'm interested in how well it predicts the absolute temperature measurements, not just the anomalies. It's not too bad: a linear regression of "expected" vs. "estimated" temperatures has a correlation coefficient R² of 0.987, with a slope of 0.99. Ideally, the slope should be 1. There's also an intercept of -0.251, which should ideally be zero.

Now let's see how the linear equation-based method does. The correlation coefficient R² is 0.997 in this case. The slope is basically 1, and the intercept is basically zero. It's nearly an ideal solution.
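Figures like these come from an ordinary least-squares fit of "expected" against "estimated" temperatures. A minimal sketch of that computation:

```python
def linreg(x, y):
    """Ordinary least-squares fit of y on x.
    Returns (slope, intercept, R²)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy * sxy / (sxx * syy)
    return slope, intercept, r2
```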
