Thursday, December 3, 2009

Statistical Proof of Anthropogenic Global Warming v2.0

Given that I'm again looking at climate data, I decided to write a follow-up to the post that launched this blog (originally here). That was an analysis intended for skeptical laypersons, illustrated graphically at each step. Essentially, I demonstrated that you can associate CO2 emissions with temperatures in a way that eliminates the possibility of coincidence.

You see, it's very easy to associate CO2 with temperatures in a "naive" way. It's also very easy to associate the number of pirates with temperatures this way. In fact, you can associate any two trends in this manner.

I suggested removing the trends from the data, and then comparing the resulting detrended series. I explained it a little differently back then, but that's what it amounts to. Since that time I have learned this is called detrended cross-correlation analysis.

There were a number of criticisms of my analysis posted in comments. I didn't find any of them very convincing, to be honest, but they can be addressed as follows.
  1. Instead of using northern-hemisphere land temperatures, I will use global sea-surface temperatures (HadSST2 data set.) The criticism was that fluctuations of CO2 emissions could be associated with (i.e. confounded by) fluctuations of the urban heat island effect. I think that's a bit of a stretch, and this sort of objection has been addressed elsewhere, but I thought I would simply use sea-surface temperatures to eliminate the potential confound altogether.
  2. Instead of using cumulative CO2 emissions, I will use an ice core-based CO2 reconstruction from Etheridge et al. (1998), combined with Mauna Loa CO2 data. I will use Mauna Loa data only for 1979 onwards, and I will adjust it by subtracting 0.996 ppmv from each data point. There's an excellent match between the Etheridge et al. data and Mauna Loa data for the years 1958 to 1978, but there's a tiny offset of 0.996 ppmv between the two.
  3. I will use the logarithm of the CO2 concentration. This is theoretically more accurate. The equilibrium temperature of the planet depends logarithmically on the concentration of greenhouse gases. That's why climate scientists talk about sensitivity to CO2 doubling.
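For readers who want to reproduce points 2 and 3, here is a minimal sketch in Python of how the two CO2 records could be spliced and log-transformed. The function name, the dictionary layout, and the sample values are assumptions made for illustration (the numbers are placeholders, not the real data); only the 1979 switch year and the 0.996 ppmv offset come from the text above.

```python
import math

MAUNA_LOA_OFFSET_PPMV = 0.996  # offset stated in the post

def splice_co2(ice_core, mauna_loa, switch_year=1979):
    """Combine two {year: ppmv} records into one {year: ln(ppmv)} series.

    Ice-core values are used before switch_year; Mauna Loa values,
    reduced by the offset, are used from switch_year onwards.
    """
    combined = {y: v for y, v in ice_core.items() if y < switch_year}
    for y, v in mauna_loa.items():
        if y >= switch_year:
            combined[y] = v - MAUNA_LOA_OFFSET_PPMV
    return {y: math.log(v) for y, v in sorted(combined.items())}

# Placeholder values for illustration only, not the real records
ice_core = {1977: 333.0, 1978: 334.5}
mauna_loa = {1978: 335.4, 1979: 336.8, 1980: 338.7}
log_co2 = splice_co2(ice_core, mauna_loa)
```

The natural logarithm is used here; any other base only rescales the series by a constant, so it does not affect a correlation.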

I will start by posting a graph of the original sea surface temperature (SST) and (logarithm of) CO2 series. This is Figure 1.

I said we would remove the trend from the series. What exactly is the trend, you might ask. There are many ways to model a trend. We could use a straight line, but then you could say that the series are not linear. The wobbles around each of the linear trends might also coincide. A polynomial trendline can model trends that are not linear. A 2nd-order polynomial trendline might be enough. I went straight for the cubic or 3rd-order polynomial trendline, which gives a slightly better fit. Those are the yellow lines that you see in Figure 1.

To detrend a series, you simply subtract the trendline from the series. The result of the operation can be seen in Figure 2 below.
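The detrending step can be sketched in a few lines of Python, assuming NumPy is available. This is not the code used for the figures; the function name and the synthetic series are illustrative. It fits a cubic by least squares and subtracts it, which is what the paragraph above describes.

```python
import numpy as np

def detrend(years, values, order=3):
    """Subtract a least-squares polynomial trendline from a series.

    Returns (residual, trend), where residual = values - trend.
    """
    years = np.asarray(years, dtype=float)
    values = np.asarray(values, dtype=float)
    x = years - years.mean()  # centering keeps the fit well conditioned
    coeffs = np.polyfit(x, values, order)
    trend = np.polyval(coeffs, x)
    return values - trend, trend

# Synthetic series: a cubic trend plus a small wobble (not real data)
years = np.arange(1850, 2009)
t = (years - 1850) / 100.0
series = 0.3 * t**3 - 0.2 * t + 0.05 * np.sin(years / 3.0)
residual, trend = detrend(years, series)
```

Passing order=1 or order=2 gives the straight-line and 2nd-order alternatives mentioned above, so the sensitivity of the result to the choice of trendline is easy to check.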



The scales are a little different, but you should be able to tell, visually, how Figure 2 is obtained from Figure 1.

I added a vertical dashed brown line around the year 1890. I suspect data is simply wrong prior to 1890, for no other reason than the fact that it doesn't look good visually. I will analyze both the entire range (1850-2008) and the shorter one (1890-2008). Curiously, I had previously found something similar in a comparison of SSTs and named storms, but I assumed the storm data was the one in error.

Obviously, changes in CO2 won't be reflected in temperature fluctuations instantaneously. It takes time for heat to be trapped. Or, to put it in more technical terms, it is the equilibrium temperature that is proportional to the logarithm of the CO2 concentration, and the system takes time to approach equilibrium. So there should be a lag.

Looking at the whole series, the association between the detrended series starts to become statistically significant at a lag of 6 years. The best lag (based on correlation coefficient) is 15 years. At a lag of 15 years, the association is significant with 99.997% confidence.
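A lag scan of this kind can be sketched in plain Python as follows. The function names and the synthetic test series are assumptions made for illustration. The significance estimate uses the Fisher z-transform with a normal approximation, which is one common choice and not necessarily the test behind the confidence figures quoted above; it also treats the yearly values as independent, which smoothed or auto-correlated series are not, so it will tend to overstate significance.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlation(co2, temp, lag):
    """Correlate co2[t] with temp[t + lag] for lag >= 0; returns (r, n)."""
    if lag == 0:
        x, y = co2, temp
    else:
        x, y = co2[:-lag], temp[lag:]
    return pearson(x, y), len(x)

def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (requires |r| < 1)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))

def best_lag(co2, temp, max_lag=30):
    """Return (lag, r) with the largest r among lags 0..max_lag."""
    scores = [(lagged_correlation(co2, temp, k)[0], k) for k in range(max_lag + 1)]
    r, lag = max(scores)
    return lag, r

# Synthetic check (not real data): temp is co2 delayed by 4 steps
co2 = [math.sin(i / 5.0) + 0.1 * math.cos(i * 1.7) for i in range(120)]
temp = [0.0] * 4 + co2[:-4]
lag, r = best_lag(co2, temp)
```

Because the best lag is picked from many candidates, the p-value at that lag is optimistic unless a multiple-comparisons correction is applied.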

In my previous analysis I had found a lag of 10 years was the best lag, not 15. Interestingly, a lag of 10 years is exactly the best lag in the current analysis if I only consider data starting at 1890. In my estimation, 10 years is the true best lag, and this is completely justifiable theoretically by other means.

When you look at data from 1890 onwards, even a lag of zero will result in a statistically significant association. With a lag of 10 years, the association is significant with 99.99996% confidence. I think it's worth looking at a graphical representation of this association: the following scatter chart (Figure 3).



In Figure 3, each dot represents a year. The graph tells us that the higher the detrended CO2 concentration in a given year, the higher the detrended sea-surface temperature, 10 years later. We can also calculate the 95% confidence interval of the slope of the trend, which is 13.7 to 29.7 in this case.
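The slope and its confidence interval come from an ordinary least-squares fit of the lagged, detrended temperature on the detrended CO2 series. Here is a minimal sketch in plain Python; the function name and the synthetic data are illustrative, and the 1.96 multiplier is a normal approximation to the t critical value, which is slightly larger for a sample of roughly a hundred years.

```python
import math

def slope_confidence_interval(x, y, z_crit=1.96):
    """OLS slope of y on x with an approximate 95% confidence interval."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return slope, (slope - z_crit * std_err, slope + z_crit * std_err)

# Synthetic data (not real): true slope 2 plus a deterministic wobble
x = [i / 100.0 for i in range(100)]
y = [2.0 * a + 0.1 * math.sin(37.0 * i) for i, a in enumerate(x)]
slope, (low, high) = slope_confidence_interval(x, y)
```

An interval that excludes zero, as the one quoted above does, is what makes the association significant at the 95% level under the usual regression assumptions.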

Given the methodology used, and the direction of the lag, this result can't be anything but indicative of a causal association between CO2 and sea-surface temperatures.

The association can't be explained by:
  • Coincidence.- Because we removed series trends, and because the associations are highly significant, mere-chance coincidences are exceedingly improbable.
  • Correlation is not causation.- For the same reasons, because of the lag, and because there appear to be no confounds, the only plausible explanation for the correlation is causation in this case.
  • Urban heat island.- The urban heat island effect should not be relevant to sea-surface temperatures.
  • Error and bias.- Errors would tend to hinder the analysis rather than help, just like we see with the data prior to 1890. Any systematic bias should be taken care of by the detrending step.
  • Conspiracy.- It would be preposterous to suppose that someone doctored the data sets (in just the right manner) anticipating this type of analysis several years in advance.
  • Assorted Non-sequiturs.- Arguments such as "Al Gore sucks" can be dismissed off-hand.

20 comments:

Nik said...

Likely to be moderated out of http://greenfyre.wordpress.com/2009/12/03/fox-news-praises-jon-stewart-for-climategate-coverage/#comments

"How could Stewart have handled it? Well, take the “hide the decline” piece. How hard would it have been to follow the talking head with this graph"

Well you *see* those four "knuckles" at the right of your graph? They are indeed the real deal, as you say, except maybe just maybe we all might stop cutting the older values off and show the whole thing like the NOAA web site does:

http://i46.tinypic.com/2akb9j.jpg

Joseph said...

@Nik: That's entirely off-topic, and it's not clear what you're trying to say. Your picture suggests that if you detrend a series, you end up with a series that has a flat trend. No kidding.

Wrinkled Retainer said...

I think Nik should go to Denial Depot. He might learn something.

Ian Enting said...

Nice, but no way is this a proof -- not enough data. The Etheridge numbers that you link to are, as the file says, a smoothed curve (probably a smoothing spline, maybe using code that I wrote (using routines from Carl de Boor's book)), so the numbers are not independent.

Joseph said...

@Ian: Yes, it's a spline. There's quite a bit of auto-correlation, no doubt, but as I noted, there's an association there with 99.99996% confidence. In my previous analysis (with land temperatures) I got an association with 99.99999999% confidence.

I doubt any corrections of any kind would make the associations non-significant.

Pekko said...

I want to see if I can replicate your results. So far I've been able to replicate the data sets (I get plots exactly like yours). Now I'm trying to figure out how to do the actual correlation analysis.

Why not just use the xcorr function to find the lag at which the greatest correlation occurs?

I get this result. The global minimum occurs at lag +20, the global maximum at lag +51, however, there's also a local maximum at lag -15.

Don't know what to make of this since I've never really used the xcorr function.

Pekko said...

Well, perhaps the negative correlation at lag +20 can be discarded as physically unreasonable since it would imply negative correlation between CO2 and temperature (a rise in temperatures causing a decline in the atmospheric CO2 content). Maybe the maximum positive correlation at lag +51 is too late to be reasonable (maybe it's not statistically significant; there should be some indication of statistical significance here; also, what physical process would it indicate? surely not the Milankovitch cycles). This would leave the local maximum at lag -15 as the most plausible peak. (It's getting late here, so perhaps I shouldn't post this, but here goes anyways. I hope I wasn't reading the cross correlation backwards! :-)

Joseph said...

Why not just use the xcorr function to find the lag at which the greatest correlation occurs?

@Pekko: Well, I don't use Python. What I did is simply check the correlation coefficient at different lags. Then you can plot the lag vs. correlation coefficient series and figure out the maximum. You can do this even in Excel. I have a routine in Java too, which can produce a non-integer lag by means of spline interpolation.

I should note that when I add all the available noise to the CO2 data (as described in my latest post) I'm getting a lag of 14 years with HadCRUT3.

I'm not sure why you get different results for lag, without understanding the function you're using.

Actually, I bet your -15 lag means that temperature lags CO2 by 15 years, does it not? That's exactly the lag I get for the series from the post, when I analyze since 1850.

Can you see graphically that it shouldn't be more than 10 to 20 years?

Regarding statistical significance, once you find the right lag, you can test for significance. For example, you can calculate the confidence interval of the slope of a linear regression of the series. Since you do multiple comparisons to get to this point, the pertinent corrections should be applied, but this is not going to be a problem.

Joseph said...

BTW, Pekko, you might want to try it with HadCRUT3. That's less noisy than HadSST2, and, I suspect, more accurate.

I used HadSST2 in the post simply to address a potential objection.

Pekko said...

Thanks for the input Joseph. I realize that not everyone uses python.

I did initially think about estimating the lagged correlations like you did. (However, I didn't realize you can do non-integer lags with splines. What a brilliant idea. :-)

I think the results should, nevertheless, be more or less the same. However, I'll dig deeper into this when I have more time.

Yes, I think the lag of -15 means that CO2 leads temperatures 15 years. I input CO2 as the first parameter to the function, temperatures as the second.

By the way, I like the way you approach these issues, quantitatively and analytically. (You have lots of interesting stuff even Tamino doesn't have. :-)

Joseph said...

Thanks :)

I'm a big fan of Tamino's blog. I learn a lot there.

Thomas Palm said...

Uptake of CO2 in the oceans should vary a bit with temperature, and you'd expect an effect from variations in growth of vegetation on land too. So temperature should affect detrended CO2 levels. This explanation doesn't fit with the lag you observe, but I suspect it may play havoc with the statistical significance of your result.

Joseph said...

@Thomas: That's a pertinent observation, and with the analysis from this post alone I'd have a hard time addressing it. But it just so happens that the original analysis addressed that.

Basically, back then I didn't use CO2 concentrations. I calculated cumulative human-produced CO2 emissions based on data from CDIAC.

Note that the slope of the linear regression is perhaps steeper than one would expect for CO2. I'm sure there's some GHG confounding there, like from methane. Maybe there are feedbacks, like you note. Or perhaps sharp CO2 fluctuations trigger stronger El Nino events - just speculating there.

Anonymous said...

Unfortunately, any time you get a regression with 99.99999% that probably means you have regressed a tautology somehow.

We know that CO2 is not the only climate forcer in existence (solar changes, volcanoes, methane, N2O, black carbon, CFCs, sulfate aerosols, and internal variability all play roles). Therefore, even in a perfectly clean system, you'd only expect CO2 to have a 50-60% correlation.

(I will note that I do not doubt that CO2 and the other GHGs are the cause for most of the warming in the past 50 years, but I am more convinced by the "fingerprints" - strat cooling vs. surface warming, northern hemisphere patterns following aerosols while global patterns follow GHGs, etc - and the basic theory, than by any random correlation)

-Marcus

Anonymous said...

Oh. Ignore my previous comment - I read "99.9999x% significant" as "99.999x% correlation".

I still hold to the view that correlations are not as robust as other methodologies, and that, especially for CO2 fluctuations of the magnitude of the 1900s to 1940s, I would be highly surprised by a clear signal, but... it isn't totally out of the question.

Joseph said...

Right, that's the confidence level, not to be confused with a correlation coefficient. (A correlation coefficient that high would only be possible if I'm modeling a simple law of physics or something like that.)

In addition to historical data analyses like this, there's plausibility and a theoretical foundation, of course, which is important to have in science. It is well known that CO2 absorbs radiation of certain wavelengths.

Ian Jordan said...

The statement of a 99.997% significance in correlation between polynomial-detrended log(atmospheric CO2) and SST over a range of lags is quite a remarkable claim. The visual appearance of Figure 3 does not suggest the correlation is so strong in spite of the stated 95% confidence bounds on the slope.

As you do not state the magnitude of the correlation coefficients, could you please post a plot of the correlation coefficients vs. lag? Such coefficients yield some information about the amount of variation attributable to the correlated quantities, and the complement with '1' yields information about the relative magnitude of the external noise and other sources of variation.

Thanks.

ian jordan

Frank Walters said...

I see how you have done part of the 20th century and how you have done the correlation.

How do you know that you have not got spurious correlation?

How about doing the same for 1850 to 1950?

The variations in temperature in that period do not seem to correspond with the logarithm of CO2 in the atmosphere.

Have you taken into account the effects of the AMO and the PDO?

Your blog should also satisfy people who have some statistical background and some knowledge of oceanography.