Thursday, November 26, 2009

VERY ARTIFICIAL quote-mining

The CRU stolen emails incident is a big mess, isn't it? What I mean is that it could easily hinder real work that needs to get done, not just in climate science.

I thought I'd lend a hand as a computer scientist who is entirely independent of the politics of AGW. As you can see, I have taken an interest in the topic in the past, but most of the time I'm involved in science debates of a completely different nature.

I've read through some of the quote-mined emails, and I don't really see anything that looks like conspiracy talk or an attempt to cover up data. In my view, it mostly amounts to innocuous chatter among scientists, discussion of statistical techniques, speculation, and some honest skepticism. For example, that 2008 was a "cold-ish" year is no secret, and I myself had said that if 2009 is also a cold year, it's possible IPCC predictions might be falsified. (Incidentally, 2009 is not turning out to be one.)

When it comes to accusations of "data cooking," there is one post that caught my attention: Hiding the Decline: Part 1 – The Adventure Begins by Eric S. Raymond. Admittedly, it initially caught my attention because I've thought highly of Mr. Raymond ever since I read The Cathedral and the Bazaar, about 10 years ago. Mr. Raymond opens the post with a snippet of IDL code:

;
; Apply a VERY ARTIFICAL correction for decline!!
;
yrloc=[1400,findgen(19)*5.+1904]
valadj=[0.,0.,0.,0.,0.,-0.1,-0.25,-0.3,0.,-0.1,0.3,0.8,1.2,1.7,2.5,2.6,2.6,$
2.6,2.6,2.6]*0.75 ; fudge factor
if n_elements(yrloc) ne n_elements(valadj) then message,'Oooops!'
;
yearlyadj=interpol(valadj,yrloc,timey)

This certainly looked dodgy to me at first glance. Basically, it looks like a correction that arbitrarily lowers the series from the mid-1920s to the mid-1940s and raises it from about 1950 onward, leveling off in the mid-1970s. Mr. Raymond posts a graph of the valadj array (which he calls "coefficients") and proclaims that "this isn’t just a smoking gun, it’s a siege cannon with the barrel still hot."
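To see what the quoted lines compute, here is a minimal Python sketch of the same arithmetic. It is my own re-expression for illustration, not CRU's code: the function name yearly_adjustment is hypothetical, and I'm assuming interpol is doing its default linear interpolation and that timey stays inside the 1400 to 1994 knot range.

```python
# Python re-expression of the quoted IDL lines (a sketch, not the CRU code).
# IDL's findgen(19)*5.+1904 yields 1904, 1909, ..., 1994.
yrloc = [1400.0] + [1904.0 + 5.0 * i for i in range(19)]
valadj = [0.75 * v for v in (
    0.0, 0.0, 0.0, 0.0, 0.0, -0.1, -0.25, -0.3, 0.0, -0.1,
    0.3, 0.8, 1.2, 1.7, 2.5, 2.6, 2.6, 2.6, 2.6, 2.6)]
assert len(yrloc) == len(valadj)  # same purpose as the IDL 'Oooops!' check


def yearly_adjustment(year):
    """Linearly interpolate the adjustment between knots (assumed interpol behaviour)."""
    if not yrloc[0] <= year <= yrloc[-1]:
        raise ValueError("year outside the knot range")
    knots = list(zip(yrloc, valadj))
    for (x0, y0), (x1, y1) in zip(knots, knots[1:]):
        if x0 <= year <= x1:
            return y0 + (y1 - y0) * (year - x0) / (x1 - x0)


# Print the adjustment at a few sample years.
for year in (1900, 1934, 1949, 1974, 1994):
    print(year, round(yearly_adjustment(year), 4))
```

Each knot value is the listed number multiplied by 0.75, so the largest adjustment is 2.6 × 0.75 = 1.95, in the plot's "normalised anomalies" units.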

I decided to take a closer look. I found a copy of the FOI2009.zip file, still available at RapidShare. I'm not providing a link, because I'm not completely sure that's legal.

I will post the entire code from the file in question below. I'll highlight the snippet quoted by Mr. Raymond in yellow, and I'll highlight in green other parts of the file that I'd like to discuss.

;
; Now prepare for plotting
;

loadct,39
multi_plot,nrow=3,layout='caption'
if !d.name eq 'X' then begin
window,ysize=800
!p.font=-1
endif else begin
!p.font=0
device,/helvetica,/bold,font_size=18
endelse
def_1color,20,color='red'
def_1color,21,color='blue'
def_1color,22,color='black'
;
restore,'compbest_fixed1950.idlsave'
;
plot,timey,comptemp(*,3),/nodata,$
/xstyle,xrange=[1881,1994],xtitle='Year',$
/ystyle,yrange=[-3,3],ytitle='Normalised anomalies',$
; title='Northern Hemisphere temperatures, MXD and corrected MXD'
title='Northern Hemisphere temperatures and MXD reconstruction'
;
yyy=reform(comptemp(*,2))
;mknormal,yyy,timey,refperiod=[1881,1940]
filter_cru,5.,/nan,tsin=yyy,tslow=tslow
oplot,timey,tslow,thick=5,color=22
yyy=reform(compmxd(*,2,1))
;mknormal,yyy,timey,refperiod=[1881,1940]
;
; Apply a VERY ARTIFICAL correction for decline!!
;
yrloc=[1400,findgen(19)*5.+1904]
valadj=[0.,0.,0.,0.,0.,-0.1,-0.25,-0.3,0.,-0.1,0.3,0.8,1.2,1.7,2.5,2.6,2.6,$
2.6,2.6,2.6]*0.75 ; fudge factor
if n_elements(yrloc) ne n_elements(valadj) then message,'Oooops!'
;
yearlyadj=interpol(valadj,yrloc,timey)

;
;filter_cru,5.,/nan,tsin=yyy+yearlyadj,tslow=tslow
;oplot,timey,tslow,thick=5,color=20
;
filter_cru,5.,/nan,tsin=yyy,tslow=tslow
oplot,timey,tslow,thick=5,color=21
;

oplot,!x.crange,[0.,0.],linestyle=1
;
plot,[0,1],/nodata,xstyle=4,ystyle=4
;legend,['Northern Hemisphere April-September instrumental temperature',$
; 'Northern Hemisphere MXD',$
; 'Northern Hemisphere MXD corrected for decline'],$
; colors=[22,21,20],thick=[3,3,3],margin=0.6,spacing=1.5
legend,['Northern Hemisphere April-September instrumental temperature',$
'Northern Hemisphere MXD'],$
colors=[22,21],thick=[3,3],margin=0.6,spacing=1.5
;
end

Let's talk about the most important finding first. Notice the second section highlighted in green, right below the snippet quoted by Mr. Raymond: the four statements that follow the yearlyadj assignment. The first two start with ";", which means they are commented out. Why is that important? The yearly adjustment is assigned to the variable yearlyadj. The only other reference to yearlyadj in the file is in the first commented-out line, where it says yyy+yearlyadj. Notice that the corresponding line that is not commented out uses only yyy in its place. In other words, as this code stands, the adjustment is computed but never applied to anything that gets plotted.
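Anyone can repeat this check mechanically. Below is a rough Python sketch that lists the non-comment lines of an IDL source mentioning a given variable. The helper name live_references is made up for this post, and it makes a simplifying assumption: everything after the first ";" on a line is a comment, which would be wrong for a ";" inside a string literal.

```python
import re


def live_references(idl_source, name):
    """Return (line_number, code) pairs where `name` appears outside a comment.

    Simplification: text after the first ';' is treated as a comment,
    so a ';' inside a string literal would be misread.
    """
    # IDL identifiers are case-insensitive, hence IGNORECASE.
    pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    hits = []
    for number, line in enumerate(idl_source.splitlines(), start=1):
        code = line.split(";", 1)[0]
        if pattern.search(code):
            hits.append((number, code.strip()))
    return hits


# The five relevant lines from the file discussed above.
excerpt = """\
yearlyadj=interpol(valadj,yrloc,timey)
;filter_cru,5.,/nan,tsin=yyy+yearlyadj,tslow=tslow
;oplot,timey,tslow,thick=5,color=20
filter_cru,5.,/nan,tsin=yyy,tslow=tslow
oplot,timey,tslow,thick=5,color=21
"""

print(live_references(excerpt, "yearlyadj"))
```

On this excerpt the only live hit is the assignment on the first line; the one place the variable would be consumed sits behind a ";".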

You might ask, why is this "VERY ARTIFICIAL" correction there at all then? I can only guess and speculate. When you're writing software, and you find bugs, a non-brute-force way to debug code is to propose hypotheses as to what is causing the bugs. Then you test these hypotheses. One way to test hypotheses might involve fudging code; trying out different ideas. When I do this I might add strings like "%%%%" to the code, so I know I need to remove that code later. I suppose adding something like "VERY ARTIFICIAL" would work as well.

My guess is that at some point the scientists wanted to see what the plot would look like with this correction, but this correction was obviously not part of the final version of the code.

Note also that this is code for plotting data. It's not code for producing a data set. Claims to the effect that the "data was cooked" seem spurious and overly dramatic. At worst, a graph might not have exactly reflected the raw data, but I wouldn't worry about raw data sets being compromised by the code above. (When it comes to the "hockey stick," what matters is what the raw data tells us.)

To borrow the words of blogger Skeptico regarding a similar incident 4 years ago, I think AGW deniers need to grow up a little.

21 comments:

dhogaza said...

This discussion of the divergence problem may help ...

"The divergence problem has important consequences for the utilization of tree ring records from temperature-limited boreal sites in hemispheric-scale proxy temperature reconstructions. The principal difficulty is that the divergence disallows the direct calibration of tree growth indices with instrumental temperature data over recent decades (the period of greatest warmth over the last 150 years), impeding the use of such data in climatic reconstructions. Consequently, when such data are included, a bias is imparted during the calibration period in the generation of the regression coefficients. Residuals from such regression analyses should thus be assessed for biases related to divergence, as this bias can result in an overestimation of past temperatures and an underestimation of the relative magnitude of recent warming (Briffa et al. 1998a and b)."

So about the time this code was written, Briffa was exploring the divergence problem (for some datasets, recent tree ring data doesn't track the instrumental record well, for reasons which are not well understood).

Look at the graph title for the adjusted data (commented out):

"Northern Hemisphere temperatures, MXD and corrected MXD"

MXD is "maximum latewood density", thought to be a temperature marker, as conifers close to their altitudinal or latitudinal range limits do most of their growth late in the growing season when temps finally reach a point favorable for growth.

So they were looking at (before it was commented out) the instrumental record, raw MXD data, and I assume then MXD data adjusted as though it were accurately tracking the instrumental record in later decades.

This looks to me like a poke at the data looking at what it takes to reconcile the divergence between recent MXD data in the tree ring chronology and the instrumental record.

The graph actually plotted:

"Northern Hemisphere temperatures and MXD reconstruction"

would demonstrate the magnitude of the divergence problem for this tree ring chronology over recent decades.

I'm just shocked, I say, shocked, that one of the world's leading researchers into the magnitude, causes, and significance of the divergence problem would actually write one or more programs probing the temperature vs. MXD data which is the source of that problem.

Here's a link to a good, relatively recent (accepted for publication 2007) review paper on the divergence problem. Briffa's name comes up a lot ...

http://www.ldeo.columbia.edu/~liepert/pdf/DArrigo_GPC2007.pdf

Joseph said...

@dhogaza: Thanks for the additional information. That certainly makes sense.

If you really wanted to cover up data, would you leave comments in code using words like "fudge" and "VERY ARTIFICIAL?" No. You would try to obfuscate it. Clearly that's there to remind the programmers that this is not final code, but only a temporary "test" of some sort.

dhogaza said...

If you really wanted to cover up data, would you leave comments in code using words like "fudge" and "VERY ARTIFICIAL?"

Ah, but as Eric Raymond pointed out, the comment said "VERY ARTIFICAL".

That hides it from all but the most brilliant right-wing whack-job software engineers like Eric Raymond ...

Joseph said...

I think the word "fudge" might be what allowed this code to be located. That word is often used by coders and I don't doubt Eric Raymond is aware of that.

Sean said...

This is a good analysis and it does change my mind as to what these files might mean.

Any chance you could use what has been released to see what the code actually does?

The basic problem is that the code that creates the temperature records (proxy and instrumental) has never been released, at least prior to this episode. So all we can do is look at snippets and speculate as to what the scientists are doing.

It would be great if they just released all the code so we could all at least agree what it does.

Hank Roberts said...

Ever wonder if strong AI exists?

Google turned this up just now:

http://geekz.co.uk/shop/store/show/eler-tshirt

And the verification word for this posting is:

"lartl"

http://en.wiktionary.org/wiki/LART

Joseph said...

@Sean: I've been trying to locate the original data to try to piece things together. The hacked archive doesn't appear to have all the files, though. Specifically, I'd like to know what this does:


;
; Get regional tree lists and rbar
;
restore,filename='reglists.idlsave'



I think "tree" above refers to actual trees, not the data structure "tree". So I think dhogaza's explanation (which is Mann's explanation as well) about the divergence problem having to do with tree ring records is the most parsimonious, least paranoid, and most likely explanation.

I definitely agree that having everything be open source and having a way to clearly and fairly easily follow the provenance of data sets would be good for everyone, albeit difficult perhaps. It could help reassure fence-sitters.

Techskeptic said...

Wow. I have heard of quote mining the bible. I've heard of quote mining a research paper, but I have never heard of quote mining computer code!

It sort of reminds me that sometimes we have two choices. For example, we could take a red pill and learn about evidence that Merck may have known something about Vioxx badness before they took it off the shelves, or we could take a blue pill and see how the Huffington Post is providing some dangerous flu misinformation.

Anonymous said...

The original post is a lie. I checked the actual code lifted from the site (i.e. briffa_sep98_e) and it clearly uses the fudged data in its calculations.

Who is worse ... scientists who lie, or the bozo who lies on their behalf?

Pathetic.

Joseph said...

@Anon: This post (like Mr. Raymond's) is about briffa_sep98_d, not briffa_sep98_e. Before making accusations, you'd do well to check what people are talking about.

Yes, there's a file in a separate directory structure that doesn't comment out the code (briffa_sep98_e).

That still misses the point, though. The file is clearly marked "**************** VERY ARTIFICIAL ***********" and so forth. What is the purpose of these comments in your view?

There's no evidence that the MXD plots (they are not temperatures) produced by these scripts were ever used in the published literature. This is clearly also code from 1998, and you can check Briffa et al. (1998) to see if in fact the artificially corrected plots were used. Hint: They were not.

Anonymous said...

I'm well aware that the file I cited is not the same one, as well as the fact that you are not seeing the big picture.

If I can draw an analogy ... you find a gun in a house that doesn't look like it was fired, but has a note next to it that says "for killing people". I find a gun in the same house with the same note that WAS clearly fired ... and you don't have the least bit of concern.

Then, I point you to email conversations where the owner of the house says he will destroy the guns and notes rather than turn them over, as well as many other notes that show he is financially and personally motivated to kill people.

Your only position is that there is "no evidence" that shows if and when the gun was fired, therefore he should keep his guns and be left alone.

Oh please. How naive can you be?

Joseph said...

@Anon: What's amazing is that you can't see where your analogy is wrong. The gun does not actually say "for killing people." It says "****** CAUTION: DO NOT USE ******"

Anonymous said...

It says "APPLY ARTIFICIAL CORRECTION". What ganja are you smoking? If it were not to be used, why put it there? What possible purpose could the code have had in the first place, and why keep it there after it was no longer useful? Hogwash!

Anyone who says that climate science (in fact ALL of science) has not been damaged by these research 'hooligans' is not credible. They deliberately hid their data and defeated the peer-review process designed to validate their work. This was not science by any definition. They cannot be trusted, and frankly neither can anyone who can't read the writing on the wall in this case.

Joseph said...

I'm not the one who suggested it was for killing people. Someone might be smoking something, but it ain't me.

Clearly, the purpose of the code is to produce some plots that someone (at CRU) would look at, probably to test some ideas or hypotheses, or a proposed correction model.

In the "d" file, you'll note there are other parts commented out that go with the artificial correction. Specifically, notice the legend "Northern Hemisphere MXD corrected for decline." Now, explain to me, Anon, why someone would put that legend right on the plot.

The corrected plot also has a different color than the normal plot. Why?

They deliberately hid their data and defeated the peer-review process designed to validate their work.

This is rhetoric with absolutely no merit.

Josh said...

Gavin from RC already stated that none of the uncommented code was used in any published data (and it would be easy to see if it was because the resulting graphs would be nonsensical). See post #99 here: http://www.realclimate.org/index.php/archives/2009/11/something-is-x-in-the-state-of-denmark/comment-page-2/#comments

Joseph said...

Thanks for that info, Josh.

Anonymous said...

Oh ... Gavin from RC said it?! Well why didn't you say so?!?! That changes everything.

You people are so trusting. This whole issue is all about how the CRU refused FOI requests to open the kimono and show how they derived their data (code and all). This is anti-science.

The 'rhetoric' that Joseph observed is a reference to leaked emails from CRU which he clearly has not read before forming his opinion. Of note is an email from Phil Jones to Michael Mann stating that he would rather destroy the code and data than hand it over.

http://www.eastangliaemails.com/emails.php?eid=490&filename=1107454306.txt

Hmmm ... I wonder why he would say that. Gavin, please say it ain't so!

Have a look at the other goodies that caused "Dr." Jones to step down:

http://sweetness-light.com/archive/some-summaries-of-the-cru-emails

The intent is all very clear, and nothing produced by this group is to be trusted until verified.

Joseph said...

You people are so trusting. This whole issue is all about how the CRU refused FOI requests to open the kimono and show how they derived their data (code and all). This is anti-science.

That's not the issue we were discussing. You're changing the subject, Anon. Refusing FOI requests has nothing to do with being anti-science, BTW. They probably have their reasons.

Granted, that's one thing they might get in trouble for. We'll see what the investigations turn up.

Anonymous said...

It *is* the issue since you are giving them the benefit of the doubt when they clearly deserve none.

And I'm sorry, but there is no excuse for refusing to share data used in published papers. Not doing so is indeed anti-science.
Now they say they lost the raw data ... which was not even theirs to begin with.

Does anyone believe them?

Joseph said...

@Anon: You're spreading misinformation. There's no reason to think any raw data has been lost permanently.

Anonymous said...

What rock have you been hiding under? On top of the emails from Phil Jones where he asks everyone to destroy AR4 data, the CRU admitted to "losing" the data:

http://www.timesonline.co.uk/tol/news/environment/article6936328.ece#cid=OTC-RSS&attr=797084

I have some real estate in Florida I'd like to sell you...