Monday, December 7, 2009

A trick to "hide the decline" in autism data

The SwiftHack controversy reminded me of something that is done routinely in epidemiology when working with data on prevalence by birth year.

I was thinking of what is perhaps the most cited email of SwiftHack; this one by Phil Jones:

From: Phil Jones
To: ray bradley ,mann@xxxxx.xxx, mhughes@xxxx.xxx
Subject: Diagram for WMO Statement
Date: Tue, 16 Nov 1999 13:31:15 +0000
Cc: k.briffa@xxx.xx.xx,t.osborn@xxxx.xxx

Dear Ray, Mike and Malcolm,
Once Tim’s got a diagram here we’ll send that either later today or first thing tomorrow. I’ve just completed Mike’s Nature trick of adding in the real temps to each series for the last 20 years (ie from 1981 onwards) amd from 1961 for Keith’s to hide the decline. Mike’s series got the annual land and marine values while the other two got April-Sept for NH land N of 20N. The latter two are real for 1999, while the estimate for 1999 for NH combined is +0.44C wrt 61-90. The Global estimate for 1999 with data through Oct is +0.35C cf. 0.57 for 1998.

Thanks for the comments, Ray.

Cheers
Phil


Some people seem to be under the impression that the "decline" refers to the decline in temperatures shown in some databases when you cherry-pick 1998 as the starting year in temperature slope calculations. The email was written in 1999, so that can't be it, first of all.

Second, how exactly do you fudge data by adding in "real" data? That doesn't make any sense on the surface.

Once you parse the statement further, you realize that "Mike's Nature trick" is a reference to Figure 5b of Mann, Bradley & Hughes (1998), a paper published in Nature whose first author is Mike Mann. The figure follows.



The figure is not very clear, unfortunately, because it's in black and white, but you can see it's simply a graph of multiple time series: (1) A historical temperature reconstruction up to 1980, (2) The instrumental record (referred to as "actual data" in the figure, and "real temps" by Phil Jones), and (3) a 50-year low-pass filter of the reconstructed series. Really, there's nothing underhanded or nefarious about plotting multiple series in one graph. It's not much of a "trick" at all.

In addition to this, Phil Jones says he added the real temps to Keith Briffa's series, starting in 1961, in order to "hide the decline." That's the interesting part, isn't it? After some digging (hat tip Deltoid) it's clear the "decline" after 1961 refers to something related to tree-ring data that was documented in Briffa et al. (1998). Figure 6 of this paper follows.



It's pretty clear why you'd want to add the real temps starting at about 1961. Again, I don't think there's anything wrong with this.

Some people might insist, however, that throwing away the last part of a series because it doesn't show what you think it should show is just wrong. It's not valid scientific methodology, and so on and so forth, ad nauseam.

Is it wrong? Are you sure? Let me show you something that I bet most readers of this blog have not seen.



This is a graph of the administrative prevalence of autism by birth year. The X axis is the year of birth. For example, the administrative prevalence of autism for children born in 1992, according to special education counts, was 34.5 in 10,000 as reported in 2003. The prevalence is much lower for children born in the year 2000: 12 in 10,000.

Is it actually true that the prevalence of autism is that low for children born in 2000? Absolutely not. That's what the 2003 report said. The 2009 report will tell you something very different. You see, it takes time for children to get diagnosed with autism. If they are too young, they might not have any label at all, or they might be classified under a different label.

Autism prevalence by birth year series will always show a decline for the most recent birth years. They also change as the series are revised year after year. (That's why I suggest using prevalence at age by report year instead, but I digress.)

If you must use a prevalence by birth year series, there's a "trick" you can use to "hide the decline," and this "trick" has a name: You left-censor the series. In an administrative autism series, you probably should not consider children below the age of 8.
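Here is a minimal sketch of what that censoring looks like in code. The function name, the `min_age` parameter, and the approximation of age as report year minus birth year are illustrative assumptions, not something taken from any particular dataset or tool. The only data points used are the two quoted above from the 2003 report.

```python
def censor_recent_cohorts(prevalence_by_birth_year, report_year, min_age=8):
    """Drop birth cohorts that were younger than min_age at report time.

    Assumption: age is approximated as report_year - birth_year.
    """
    cutoff = report_year - min_age
    return {
        birth_year: rate
        for birth_year, rate in prevalence_by_birth_year.items()
        if birth_year <= cutoff
    }


# Administrative prevalence per 10,000, as reported in 2003.
# Only the two values mentioned in the post are included here.
reported_2003 = {1992: 34.5, 2000: 12.0}

censored = censor_recent_cohorts(reported_2003, report_year=2003)
print(censored)  # {1992: 34.5}
```

The 2000 cohort was about 3 years old in 2003, so it is dropped. The 1992 cohort was about 11, so it stays.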

I have used this "trick" myself in a blog post titled "Is Precipitation Associated with Autism? Now I'm Quite Sure It's Not." I said: "I'm left-censoring autism caseload starting at 2000." So there you go. Joseph, from the Residual Analysis blog, has admitted to using a "trick" to "hide the decline." Feel free to quote me on that.

Once again, I think AGW deniers need to grow up a little.

8 comments:

Anonymous said...

Help me out here Joseph. Are you saying if we just wait long enough, the fact that the tree ring data no longer reflect "real" temperatures will just go away? That seems to be what your autism analogy is telling us, but I may be wrong (always a problem with analogies, that's why I usually like to stick to the "real" argument).

And I'm not sure how your autism analogy applies to the real reason we use tree ring data: We use it to try to figure out what temperatures were like long-long ago. If tree ring data since 1960 don't reflect "real" temperatures, why should we think they do since 1000?

In addition to Michael Mann's rant about the "specious" accusation that he did what you now say is perfectly normal, and which I think you have to agree Phil Jones actually did do in 1999, the other key issue is, if proxy data need to be "hidden" because they no longer reflect "real" temperature, what confidence do we have that they reflect "real" temperature from 1000 - 1850?

John M

Joseph said...

Help me out here Joseph. Are you saying if we just wait long enough, the fact that the tree ring data no longer reflect "real" temperatures will just go away? That seems to be what your autism analogy is telling us, but I may be wrong (always a problem with analogies, that's why I usually like to stick to the "real" argument).

It depends on the cause of the tree-ring divergence. I understand the cause is not known. If the cause is that tree rings are too young, then yes, it would be just like the autism problem.

If the cause is something else, like acid rain, then it's a little different, but you'd still be justified in left-censoring the data.

If tree ring data since 1960 don't reflect "real" temperatures, why should we think they do since 1000?

That's a good question, and I'd like to study the subject in more detail. I have some reasons to suspect the tree-ring-based reconstructions are too warm in the 17th century.

Of course, there are many other types of reconstructions, so you can compare and contrast, and cross-validate them.

BTW, when Mike Mann complained that they have never "grafted" two series to his knowledge, it's quite possible he was being honest. I don't think that's a very important topic of discussion either way.

Anonymous said...

Joseph,

"I understand the cause is not known. If the cause is that tree rings are too young, then yes, it would be just like the autism problem.

If the cause is something else, like acid rain, then it's a little different, but you'd still be justified in left-censoring the data."

If the cause is not known, there is no justification in ignoring the current data and assuming the "old" data is accurate.

And even if it's an "age" issue, your autism analogy is not really on point. I assume that all of the autism data come from the same type of source---presumably public health records with an "official" diagnosis. My guess is that they're not put together by using simplistic proxies (e.g., the number of head injuries reported to hospitals in 1935), so your "censoring" technique in autism statistics is not really relevant to the truncating of proxy data sets and blending in instrumental temperature data.

Your censoring in autism statistics is done because of a very well understood lag in diagnosis.

It doesn't throw out apples and replace them with oranges.

And I can understand you wanting to drop the discussion about the Michael Mann quote. Especially since he was copied on the email where Jones described how he was going to put together the WMO figure.

John M

Joseph said...

If the cause is not known, there is no justification in ignoring the current data and assuming the "old" data is accurate.

They clearly keep using it because it appears that the divergence problem is the result of something recent. It matches pretty well pre-1960, for over 100 years.

If it's an issue of low threshold, then it's problematic, and there are papers on this.

Do proxies have limitations? Sure.

And even if it's an "age" issue, your autism analogy is not really on point. I assume that all of the autism data come from the same type of source---presumably public health records with an "official" diagnosis. My guess is that they're not put together by using simplistic proxies

You'd be surprised. Those counts are quite inaccurate, especially in the past. That's why I call them 'administrative.' They have to do with primary disability classifications in the special education system. Lots of kids not in SpEd are autistic, and lots more are probably in the mental retardation or specific learning disability categories.

Ian said...

Really nice analogy, and a very good data censoring example.

Whether the reason for censoring autism reports matches the reason for censoring tree ring data is immaterial to your purpose. As you wrote: "Some people might insist, however, that throwing away the last part of a series because it doesn't show what you think it should show is just wrong. It's not valid scientific methodology, and so on and so forth, ad nauseam."

Thanks for writing this post.

Anonymous said...

Hi Joseph,

Please critique what I have to say about your analogy.

Looking to find temperature, we study tree ring data as a proxy, matching it year by year against modern instrumental measurements, and essentially "forecast" backwards from there.

So if we were to take modern reporting of autism cases and compare it to actual measured presence in sample populations, the reported cases would give us a proxy for questions about actual presence in the population.

What if, starting at 1984 or from 1992 onwards, the relation between reported cases and actual presence in the population completely went haywire?



Would you censor, and continue along, estimating autism for 1850 AD?

I'd think that you darn well desperately need to know what happened at approx 1984, or 1992, don't you?

I mean, before disregarding the most modern times, you need a firm reason for the change in the relation between sample measured presence of autism in the population and the reporting of it?

Joseph said...

@Anon: I think I get your analogy. But see, in this case, the birth-year counts from the special education system are a proxy already.

It's a proxy that not only has problems with recent counts. It can be shown that it has problems with old counts as well.

I believe you're asking: How do you distinguish if the proxy is wrong or the instrumental data is wrong?

In the case of birth-year prevalence, the issue is obvious. (The average age at diagnosis is 3 or so, and late diagnoses are common.)

In the case of temperatures, if you have to choose between thermometers and tree rings, thermometers look like the obvious choice.

Anonymous said...

Thanks,
Yes, I do realize reported cases are already a proxy, and that obvious problems exist, which then can be explained as above, or because of more awareness, etc.
And so you come to a reasonable understanding.

You have an explanation for the counterfactual behaviour of your proxy.
Or...in climate science, you declare that maybe it's the fertilization and censor the whole thing.

:)