cc:  John.Lanzante@noaa.gov, Melissa Free <Melissa.Free@noaa.gov>,  Peter Thorne <peter.thorne@metoffice.gov.uk>, Dian Seidel <dian.seidel@noaa.gov>, Tom Wigley <wigley@cgd.ucar.edu>,  Karl Taylor <taylor13@llnl.gov>, Thomas R Karl <Thomas.R.Karl@noaa.gov>, Carl Mears <mears@remss.com>,  "David C. Bader" <bader2@llnl.gov>, "'Francis W. Zwiers'" <francis.zwiers@ec.gc.ca>,  Frank Wentz <frank.wentz@remss.com>, Leopold Haimberger <leopold.haimberger@univie.ac.at>,  "Michael C. MacCracken" <mmaccrac@comcast.net>, Phil Jones <p.jones@uea.ac.uk>,  Steve Sherwood <Steven.Sherwood@yale.edu>, Steve Klein <klein21@mail.llnl.gov>, Tim Osborn <t.osborn@uea.ac.uk>,  Gavin Schmidt <gschmidt@giss.nasa.gov>, "Hack, James J." <jhack@ornl.gov>
date: Tue, 15 Jan 2008 20:04:09 -0800
from: Ben Santer <santer1@llnl.gov>
subject: Re: Updated Figures
to: Susan Solomon <Susan.Solomon@noaa.gov>

<x-flowed>
Dear folks,

Once again, thanks for all your help with the Figures for our paper. Let 
me briefly summarize the current "state of play" with regard to the Figures.

Figure 1: One candidate for Figure 1 is the Figure I distributed last 
night, which tried to illustrate how differences in the autocorrelation 
structure of the regression residuals (basically red noise and white 
noise) affect the estimate of the standard error of a linear trend. This 
Figure did not meet with unanimous approval. Some of you liked it, some 
of you didn't. I am currently trying to improve it. I hope to finalize 
the Figure tomorrow.
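
For anyone who wants to see the effect numerically, here is a minimal 
sketch (illustrative only; it is not the code used to produce the 
Figure). It assumes the usual lag-1 adjustment, n{e} = n{t} * (1 - r{1}) 
/ (1 + r{1}), with r{1} estimated from the regression residuals. The 
function name, the AR-1 coefficient, and the synthetic series are all 
hypothetical.

```python
import numpy as np

def trend_and_standard_errors(y):
    """Least-squares trend of y, with unadjusted and adjusted standard errors.

    Assumption: n_e = n_t * (1 - r1) / (1 + r1), where r1 is the lag-1
    autocorrelation of the regression residuals.
    """
    y = np.asarray(y, dtype=float)
    n_t = y.size
    t = np.arange(n_t, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (intercept + slope * t)
    # Lag-1 autocorrelation of the regression residuals.
    r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
    n_e = n_t * (1.0 - r1) / (1.0 + r1)
    sum_sq_t = np.sum((t - t.mean()) ** 2)
    sse = np.sum(resid ** 2)
    # Same formula twice; only the sample size used for the residual
    # variance differs (n_t for "unadjusted", n_e for "adjusted").
    se_unadj = np.sqrt(sse / (n_t - 2.0) / sum_sq_t)
    se_adj = np.sqrt(sse / (n_e - 2.0) / sum_sq_t)
    return slope, se_unadj, se_adj, r1, n_e

rng = np.random.default_rng(0)
n_t = 252  # number of months in a 21-year record
white = rng.normal(size=n_t)
red = np.zeros(n_t)
for i in range(1, n_t):
    # Hypothetical AR-1 coefficient, chosen only to give strong persistence.
    red[i] = 0.88 * red[i - 1] + white[i]

for name, series in (("white", white), ("red", red)):
    slope, se_u, se_a, r1, n_e = trend_and_standard_errors(series)
    print(f"{name}: r1={r1:.2f}  n_e={n_e:.0f}  adjusted/unadjusted SE={se_a / se_u:.2f}")
```

For the white-noise series the two standard errors should be nearly 
identical; for the red-noise series the adjusted value should be 
several times larger.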

Figure 2: Susan suggested that the Figure showing tropical T2LT changes 
in five realizations of the MRI model's 20c3m experiment (as well as in 
the MRI ensemble mean) would serve well as Figure 2. She also proposed 
some modifications to this Figure, which I have not yet implemented. I 
will finish this Figure tomorrow.

Figure 3: This is the two-panel Figure (appended) which attempts to 
visually illustrate differences between the "Santer et al. trend 
comparison" (panel A) and the "Douglass et al. trend comparison" (panel 
B). Steve Sherwood's keen eyes spotted an error in the first version of 
this Figure. I had incorrectly plotted trend results for T2 (not T2LT) 
in panel A, and T2LT in panel B. This error has been corrected. I've 
also incorporated a number of comments from Tom Wigley regarding the 
legend.

Figure 4: This is now a three-panel Figure (appended), which shows the 
normalized trend difference for our "paired trends" test. Panel A gives 
results (in the form of a histogram) for tests of the T2LT trend in UAH 
against T2LT trends in 49 individual 20c3m realizations. Panel B 
provides similar results for paired trend tests involving RSS T2LT data. 
In the previous version of this Figure, the UAH- and RSS-based 
histograms were plotted together in the same panel, which made things a 
little messy. At Karl's suggestion, the numerator of the normalized 
trend difference statistic (d) is now "model trend minus observed trend" 
(not "observed minus model"), which is why the UAH-based distribution of 
d is positively skewed. Panel C shows the results of "Intra-ensemble" 
tests. At John Lanzante's suggestion, these are now shown as a histogram 
rather than in the form of colored symbols for individual models. Note 
that the histogram in panel C is now symmetrical about zero. I've 
followed the advice that many of you gave me, and performed the 
"intra-ensemble" tests using all (non-identical) trend pairs in any 
individual model's ensemble of 20c3m realizations - i.e., b(1) vs. b(2), 
b(1) vs. b(3), b(2) vs. b(3), b(2) vs. b(1), etc. This yields a total of 
124 tests.
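
The pairing scheme can be sketched in a few lines. This is an 
illustration only: the trends and standard errors are made-up numbers, 
and I am assuming the usual form of the statistic, d = (b1 - b2) / 
sqrt(s1^2 + s2^2). Because each pair enters in both orders, every d has 
a partner -d, which is why the histogram is symmetrical about zero.

```python
from itertools import permutations
import math

def intra_ensemble_d(trends, std_errs):
    """Normalized trend differences for all ordered, non-identical pairs."""
    members = range(len(trends))
    return [
        (trends[i] - trends[j]) / math.hypot(std_errs[i], std_errs[j])
        for i, j in permutations(members, 2)
    ]

# Hypothetical four-member ensemble (degrees C/decade); not model output.
trends = [0.10, 0.25, 0.18, 0.30]
std_errs = [0.12, 0.10, 0.11, 0.13]

d_values = intra_ensemble_d(trends, std_errs)
print(len(d_values))  # 4 * 3 = 12 ordered pairs

# Each d is matched by a -d, so the sorted lists mirror each other.
mirrored = sorted(-d for d in d_values)
print(all(math.isclose(a, b, abs_tol=1e-12)
          for a, b in zip(sorted(d_values), mirrored)))
```

An ensemble with n members contributes n * (n - 1) ordered pairs; the 
total number of tests is the sum of this over all models.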

Figure 4 clearly shows that the values of the normalized trend 
difference obtained through internal variability alone are comparable in 
size to the actual values of d obtained in "model-vs-RSS" tests. Even 
the "model-vs-UAH" d results show substantial overlap with the 
"intra-ensemble" values of d.

Figure 5: John Lanzante and Tom Wigley made a number of suggestions 
about "decluttering" this Figure (appended), which shows the results of 
trend tests performed with synthetically-generated data. I have tried to 
implement all of their suggestions. I think the Figure is now much clearer.

With regard to Figure 5, John, Tom, Karl, and Steve had a number of 
suggestions about the "positive bias" issue. Today, with help from Karl, 
I performed a number of sensitivity tests. All of my comments now 
pertain to rejection rate results obtained using the version of our 
"paired trends" test that incorporates adjustment of the standard error 
for temporal autocorrelation effects.

John had suggested using a square root transformation to address the 
inherently skewed nature of the rejection rate distributions. This 
transformation does indeed shift the mean of the empirically-determined 
rejection rate distributions closer to their theoretical expectation 
values. Given this result, Karl suggested that we should also look at 
cube root (and higher-order root) transformations. As expected, the 
higher the order of the root transformation, the greater the shift 
towards the theoretical values. This is encouraging, although I still 
think we have the non-trivial problem of explaining and justifying any 
distribution transformation we decide to use (I can already hear Fred 
Singer saying "There goes Santer, manipulating data yet again!"). The 
current version of Figure 5 does NOT show "transformed" results.
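
A minimal sketch of the root transformation, for concreteness. The 
rejection rates here are hypothetical draws from a skewed distribution, 
not our results, and the function name is my own. The steps are: take 
the root of each rate, average, then reverse the transformation.

```python
import numpy as np

def root_transformed_mean(rates, order=2):
    """Average the order-th roots of the rates, then reverse the transform."""
    rates = np.asarray(rates, dtype=float)
    return float(np.mean(rates ** (1.0 / order)) ** order)

# Hypothetical, positively skewed rejection rates (in percent).
rng = np.random.default_rng(1)
rates = rng.gamma(shape=2.0, scale=3.0, size=1000)

print(f"plain mean: {rates.mean():.2f}")
for order in (2, 3, 4):
    print(f"root order {order}: {root_transformed_mean(rates, order):.2f}")
```

Note that the downward shift with increasing root order follows from 
the power-mean inequality, so it occurs for any non-negative sample that 
is not constant. That is one reason the choice of transformation would 
need a justification of its own.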

A number of other factors also influence the rejection rates obtained 
with synthetic data. These include:

a) The selected value of the AR-1 coefficient in the autoregressive model;

b) The stipulated amplitude of the noise in the AR-1 model;

c) Whether the actual value of r{1} (the lag-1 autocorrelation) that is 
determined from each synthetic time series is estimated from the raw 
anomalies of that time series or from the regression residuals.

Until now, I've been estimating r{1} from the regression residuals, not 
from the raw anomaly data. In the results described in our 2000 JGR 
paper, Doug Nychka's experiments with synthetic data provided some 
justification for this choice. What I learned today, however, is that 
estimation of r{1} from the raw anomaly data yields rejection rates that 
are very close to the theoretical expectation values, even without any 
transformation of the rejection rate distributions. This makes sense, 
since any trend in the synthetic time series will inflate the value of 
r{1}, decrease the effective sample size n{e}, and thereby inflate the 
standard error of the trend, making it more difficult to reject the null 
hypothesis, and lowering rejection rates. A large randomly-generated 
trend will also make it easier to reject the null hypothesis, so there 
must be some competitive effects here.
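
The first of these effects is easy to demonstrate with a small sketch. 
Everything here is hypothetical: the series is white noise plus a 
deliberately exaggerated trend, the function names are my own, and the 
effective sample size uses the assumed adjustment n{e} = n{t} * (1 - 
r{1}) / (1 + r{1}).

```python
import numpy as np

def lag1_autocorrelation(x):
    """Lag-1 autocorrelation, r{1}, of a one-dimensional series."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

def effective_sample_size(n_t, r1):
    """Assumed adjustment: n{e} = n{t} * (1 - r{1}) / (1 + r{1})."""
    return n_t * (1.0 - r1) / (1.0 + r1)

rng = np.random.default_rng(2)
n_t = 252
months = np.arange(n_t, dtype=float)
# Hypothetical series: white noise plus an exaggerated trend of
# 1.0 degrees C/decade, so that the effect is easy to see.
anomalies = (1.0 / 120.0) * months + rng.normal(scale=0.3, size=n_t)

# Regression residuals: the anomalies with the least-squares trend removed.
slope, intercept = np.polyfit(months, anomalies, 1)
residuals = anomalies - (intercept + slope * months)

r1_raw = lag1_autocorrelation(anomalies)
r1_res = lag1_autocorrelation(residuals)
print(f"raw anomalies: r1={r1_raw:.2f}  n_e={effective_sample_size(n_t, r1_raw):.0f}")
print(f"residuals:     r1={r1_res:.2f}  n_e={effective_sample_size(n_t, r1_res):.0f}")
```

With the values quoted earlier for the RSS data (n{t} = 252, r{1} = 
0.88), the assumed formula gives n{e} of about 16.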

Does this mean that we should repeat all trend tests (both 
model-vs-observed and model-vs-model) with r{1} estimated from raw data? 
I don't know. I'm hoping that Tom will be able to clarify this point 
with Doug Nychka.

The bottom line here is that - for our "paired trends" test with 
adjustment of the S.E. for lag-1 autocorrelation effects - the 
sensitivity of rejection rate results to different choices in a), b), 
and c) appears to be fairly small (of order 1-2%). In other words, 
depending on the choices in a), b), and c), we obtain "average" 
rejection rates for 5% tests that vary between 4.7% and 7%. [As used 
here, "average" means "averaged over tests performed using 1000 
different realizations of 100 different synthetic time series".] The key 
point here is that this 1-2% uncertainty in the empirically-determined 
rejection rate is very small relative to:

i) The difference in rejection rates between our test and the Douglass 
et al. test in Figure 5 (this difference is typically 60% or larger);

ii) The difference in rejection rates in "adjusted" and "unadjusted" 
forms of the "paired trends" test (which is of order 55% in Figure 5).

Clearly, there is scope for more work in this area. But at this stage, I 
think we've almost reached the point of diminishing returns. Unless 
someone has a better idea, I suggest that we go with the current version 
of Figure 5, and briefly mention the sensitivity of the 
empirically-determined rejection rates to some of the choices noted above.
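
For reference, the kind of experiment behind these rejection rates can 
be sketched as follows. This is a stripped-down illustration, not the 
code used for Figure 5: the AR-1 coefficient, noise amplitude, number of 
trials, and function names are all hypothetical, r{1} is estimated from 
the regression residuals, and a Normal distribution is used for the 5% 
critical value. Both series in each pair come from the same AR-1 
process with no imposed trend, so the null hypothesis is true by 
construction.

```python
import numpy as np

Z_CRIT_5PCT = 1.96  # two-sided 5% critical value of the Normal distribution

def ar1_series(n_t, phi, sigma, rng):
    """AR-1 noise with coefficient phi and innovation standard deviation sigma."""
    x = np.empty(n_t)
    x[0] = rng.normal(scale=sigma / np.sqrt(1.0 - phi ** 2))
    for i in range(1, n_t):
        x[i] = phi * x[i - 1] + rng.normal(scale=sigma)
    return x

def trend_and_se(y, adjust):
    """Least-squares trend and its standard error, optionally lag-1 adjusted."""
    n_t = y.size
    t = np.arange(n_t, dtype=float)
    slope, intercept = np.polyfit(t, y, 1)
    resid = y - (intercept + slope * t)
    n_eff = float(n_t)
    if adjust:
        r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
        n_eff = n_t * (1.0 - r1) / (1.0 + r1)
    n_eff = max(n_eff, 3.0)  # guard (my own choice) against tiny n_eff
    se = np.sqrt(np.sum(resid ** 2) / (n_eff - 2.0) / np.sum((t - t.mean()) ** 2))
    return slope, se

def rejection_rate(adjust, n_trials=500, n_t=252, phi=0.8, sigma=0.2, seed=3):
    """Fraction of paired trend tests that reject at the nominal 5% level."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        b1, s1 = trend_and_se(ar1_series(n_t, phi, sigma, rng), adjust)
        b2, s2 = trend_and_se(ar1_series(n_t, phi, sigma, rng), adjust)
        d = (b1 - b2) / np.hypot(s1, s2)
        rejections += int(abs(d) > Z_CRIT_5PCT)
    return rejections / n_trials

print(f"unadjusted: {rejection_rate(adjust=False):.1%}")
print(f"adjusted:   {rejection_rate(adjust=True):.1%}")
```

The unadjusted test should reject far more often than the nominal 5%, 
while the adjusted test should stay much closer to it. The Douglass et 
al. test is not implemented in this sketch.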

Figure 6: This is the "vertical trend profile" Figure. Peter (with 
assistance from Leo) has been working hard to finalize this. I'm hoping 
that Peter will be able to distribute a version of this Figure tomorrow.

That's all for tonight, folks!

With best regards,

Ben

Susan Solomon wrote:
> Dear Ben et al
> 
> These diagrams seem to me to do a superb job of explaining things.  They 
> will greatly help this paper to reach the interested non-specialist. 
> Unfortunately, a lot of people may now be swayed by the incorrect 
> arguments advanced recently by Douglass et al. and their supporters; this 
> paper will be critical to many of those, who really wish to understand 
> but need the simple and honest helping hand that these figures provide. 
> Bravo.
> 
> I want to make a few small suggestions that may help to make these very 
> excellent illustrations even more effective.
> 
> I think Figure 2 should be figure 1 - i.e., show the reason why 
> auto-correlation is so important here right up front, and very 
> explicitly.   I love this figure.  Showing it before the MRI examples 
> helps the reader understand how to look at the MRI model results, so show 
> it first.
> 
> Then the old figure 1 becomes the new figure 2. It may be best to put 
> the ensemble mean on the top instead of the bottom, then follow that 
> with each of the realizations.   If it doesn't make things look too 
> busy, it may also be useful to include red and blue trends and adjusted 
> and unadjusted errors for the ensemble, and for each realization in the 
> panels below the ensemble at top.   This would allow the figure itself 
> to sequentially make the important point that Ben is making in words:  
> that the MRI model should not have been able to generate realization #1 
> if the Douglass et al. approach to computing errors were right -- and 
> that establishes beyond doubt that their approach is not viable.
> 
> Ben, many thanks -- this work is a great service to our science.
> best
> Susan
> 
> 
> 
> 
> 
> At 10:28 PM -0800 1/14/08, Ben Santer wrote:
>> Dear folks,
>>
>> Thanks for all of your comments. I am truly astonished and gratified. 
>> You have read a large number of lengthy Santer emails; examined 
>> different iterations of proposed Figures in microscopic detail; 
>> thought long and hard about the scientific issues raised by the 
>> Douglass et al. IJC paper; provided valuable advice on tricky 
>> statistical points, forcing uncertainties, and "response strategies"; 
>> supplied results from state-of-the-art radiosonde datasets; and have 
>> helped to draft the "vertical trend profile" Figure (Peter and Leo). 
>> All of you have given generously of your time, despite the many other 
>> commitments you must have. I'm very fortunate that I've had the 
>> benefit of your collective expertise.
>>
>> It's rather late at night, so this will be a short email. I'm 
>> appending two new Figures. I suggest that these should be the first 
>> two Figures of  our paper.
>>
>> I decided to produce these Figures based on comments that I received 
>> from Susan. I've reproduced below an excerpt from Susan's email.
>>
>> ========================================================================== 
>>
>> Excerpt from Susan Solomon's email of January 10th, 2008:
>>
>> "I do want to urge you to think about the need for your paper to 
>> communicate clearly not just to specialists but to any educated 
>> person. While I understand what you are doing, I think your work is 
>> not going to penetrate very well with the type of person we most need 
>> to reach: interested and smart lay person who is not a statistician."
>>
>> "In the hope of helping you bridge that gap even better, I pose the 
>> following questions:"
>>
>> 1) "If you had to explain to a nonspecialist what temporal 
>> auto-correlation is, and why it matters in this case, what would you 
>> say? Imagine you are talking to a very smart person who never took 
>> statistics."
>>
>> 2) "Next, imagine that you had to show such a person a graph or a 
>> table to illustrate the problem.  What would you show them?"
>>
>> 3) "If you had to explain to the interested but uneducated lay person 
>> why your error bars on the data are larger than those of Douglass et 
>> al., what would you tell them?"
>> ========================================================================== 
>>
>>
>> The first of the two new Figures is targeted at Susan's "interested 
>> and smart lay person who is not a statistician." It shows the tropical 
>> T2LT changes in the five realizations of the 20c3m experiment 
>> performed with the MRI-CGCM2.3.2 model. This Figure makes a number of 
>> useful points:
>>
>> a) It shows the general problem of trying to fit linear trends to 
>> inherently noisy time series;
>>
>> b) It illustrates that for our short (21-year) analysis period, the 
>> overall linear trend is a relatively small component of the total 
>> variability of the tropical T2LT data. The total variability is 
>> dominated by temperature fluctuations on ENSO timescales;
>>
>> c) It shows that the five different 20c3m realizations have five 
>> different sequences of internally-generated variability, and visually 
>> illustrates that the temperature response to the forcings changed in 
>> the 20c3m experiment is signal + noise, not pure signal;
>>
>> d) It highlights deficiencies in the Douglass et al. "consistency 
>> test". Although these five MRI 20c3m realizations were generated with 
>> the same model, using exactly the same external forcings, realization 
>> 1 yields a tropical trend (over 1979 to 1999) that is close to zero, 
>> while realizations 2 through 5 yield positive trends, ranging from 
>> 0.28 to 0.37 degrees C/decade. If one applied the Douglass et al. test 
>> to the MRI data, one would erroneously conclude that realization 1 
>> could not have been generated by the MRI model!
>>
>> e) It shows that, in the five-member MRI ensemble, averaging over 
>> different realizations of the 20c3m experiment reduces the amplitude 
>> of internally-generated variability, but does not totally remove this 
>> variability. Even in the ensemble-mean time series in panel F, there 
>> is still inherent statistical uncertainty in fitting a linear trend to 
>> the data. Although we've all noted that Douglass et al. do not account 
>> for statistical uncertainty in fitting a linear trend to the 
>> observations, it's also worth pointing out that they ignore the 
>> statistical uncertainty in the estimate of any individual model's 
>> ensemble-mean trend.
>>
>> After giving the reader some sense of the character of the T2LT 
>> anomaly time series, we move on to the temporal autocorrelation issue, 
>> and thus get to the heart of Susan's three questions. Enter the new Figure 2.
>>
>> The new Figure 2 has two columns. The wide column on the left hand 
>> side shows two different sets of time series. The narrow column on the 
>> right hand side shows the estimated least-squares trends for these 
>> time series, together with their unadjusted and adjusted 95% 
>> confidence intervals.
>>
>> Consider the RSS results first (panel A). The linear trend in the 
>> tropical T2LT data, estimated from a time series of 252 months in 
>> length, is +0.166 degrees C/decade. Visually, it's immediately obvious 
>> that the departures of the original data from this least-squares fit 
>> (the residuals) are not randomly distributed in time - there is some 
>> persistence (temporal autocorrelation) of temperature information from 
>> one month to the next. Because of this, the number of independent time 
>> samples in the time series is much less than the actual number of time 
>> samples, n{t} (252 months). The lag-1 autocorrelation of the residuals 
>> is a simple measure of persistence, and can be used to obtain an 
>> estimate of the number of independent (or "effective") time samples, 
>> n{e}. In the case of the RSS data in panel A, r{1} is 0.88, and n{e} 
>> is roughly 16.
>>
>> The number of independent time samples influences the estimated value 
>> of s{b}, the standard error of the fitted linear trend. If we make the 
>> naive (and incorrect) assumption that n{e} = n{t} = 252 (i.e., 
>> that the 252 residuals constitute 252 independent samples), then the 
>> estimated 1-sigma standard error is very small (roughly +/- 0.03 
>> degrees C/decade). If we use the effective sample size n{e} to 
>> estimate the 1-sigma standard error, it is over four times larger (+/- 
>> 0.13 degrees C/decade).
>>
>> In our method for testing the significance of differences between the 
>> trends in two different time series, we employ the effective sample 
>> size to calculate standard errors that have been "adjusted" for 
>> temporal autocorrelation of the residuals.
>>
>> The bottom panel of Figure 2 shows a synthetic time series, in which 
>> Gaussian noise was added to the RSS T2LT trend of +0.166 degrees 
>> C/decade. The amplitude of the noise was chosen so that the temporal 
>> standard deviation of the synthetic data (0.31 degrees C) was 
>> identical to the standard deviation of the original RSS data. By 
>> chance, the addition of this particular realization of the noise 
>> slightly increases the original trend in the RSS data, so that the 
>> synthetic data has an overall trend of +0.174 degrees C/decade.
>>
>> Panel 2B illustrates the case of virtually no temporal autocorrelation 
>> of the residuals. The lag-1 autocorrelation of the residuals in panel 
>> B is very close to zero [r{1} = 0.034], which yields an effective 
>> sample size of 236. In other words, n{e} is close to n{t} - the number 
>> of effectively independent time samples is almost as large as the 
>> actual number of time samples. Because of this, the adjusted and 
>> unadjusted standard errors of the trend are both small and virtually 
>> identical.
>>
>> I hope these two Figures go some way towards addressing Susan's 
>> concerns. It took me a bit of time to generate them, which is why I've 
>> not yet completed the revisions to all of the other Figures. However, 
>> I did have some time to perform experiments to investigate the small 
>> "positive bias" in the rejection rates obtained with synthetic data. 
>> At the suggestion of Tom Wigley, I looked at whether this bias might 
>> be related to my use of a Normal distribution (rather than a 
>> t-distribution) to estimate p-values. For the t-distribution case, the 
>> sample size enters the picture not only in the calculation of the 
>> standard error of the linear trend, s{b}, but also in the degrees of 
>> freedom (DOF) used in the t-test. For the "unadjusted" case, the 
>> sample size is 480 [i.e., n1{t} + n2{t} - 24, where n1{t} and 
>> n2{t} are the actual number of time samples in the first and second 
>> time series]. The "minus 24" arises because in each time series, we've 
>> defined anomalies relative to climatological monthly means. For the 
>> "adjusted" case, I believe that the sample size is n1{e} + n2{e}. I 
>> don't think the "minus 24" term comes into it (although I may be 
>> wrong), since our estimate of the lag-1 autocorrelation of the 
>> regression residuals, and hence of n{e}, already depends on how we've 
>> defined anomalies (e.g., r{1} would be different if we define 
>> monthly-mean anomalies relative to climatological annual means rather 
>> than climatological monthly means).
>>
>> This is more detail than you probably ever want to know. The bottom 
>> line is that, even if I use the Student's t-distribution for 
>> calculating rejection rates (and irrespective of whether I use n{t} or 
>> n{e} for estimating the DOF for the t-test), I still get a small 
>> positive bias in rejection rates. Rejection rates for the Normal 
>> distribution case and t-distribution case are very similar.
>>
>> Tomorrow I'll experiment with John's suggested transformation of the 
>> rejection rate distribution.
>>
>> So much for my "short email"! It's now late at night, so hopefully 
>> this isn't too incoherent.
>>
>> With best regards,
>>
>> Ben
>>
>> John Lanzante wrote:
>>> Dear Ben and All,
>>>
>>> After returning to the office earlier in the week after a couple of 
>>> weeks
>>> off during the holidays, I had the best of intentions of responding to
>>> some of the earlier emails. Unfortunately it has taken the better 
>>> part of
>>> the week for me to shovel out my avalanche of email. [This has a lot to
>>> do with the remarkable progress that has been made -- kudos to Ben 
>>> and others
>>> who have made this possible]. At this point I'd like to add my 2 
>>> cents worth
>>> (although with the declining dollar I'm not sure it's worth that much 
>>> any more)
>>> on several issues, some from earlier email and some from the last day 
>>> or two.
>>>
>>> I had given some thought as to where this article might be submitted.
>>> Although that issue has been settled (IJC) I'd like to add a few related
>>> thoughts regarding the focus of the paper. I think Ben has brokered the
>>> best possible deal, an expedited paper in IJC, that is not treated as a
>>> comment. But I'm a little confused as to whether our paper will be 
>>> titled
>>> "Comments on ... by Douglass et al." or whether we have a bit more 
>>> latitude.
>>>
>>> While I'm not suggesting anything beyond a short paper, it might be 
>>> possible
>>> to "spin" this in more general terms as a brief update, while at the 
>>> same
>>> time addressing Douglass et al. as part of this. We could begin in the
>>> introduction by saying that this general topic has been much studied and
>>> debated in the recent past [e.g. NRC (2000), the Science (2005) 
>>> papers, and
>>> CCSP (2006)] but that new developments since these works warrant 
>>> revisiting
>>> the issue. We could consider Douglass et al. as one of several new
>>> developments. We could perhaps title the paper something like 
>>> "Revisiting
>>> temperature trends in the atmosphere". The main conclusion will be 
>>> that, in
>>> stark contrast to Douglass et al., the new evidence from the last 
>>> couple of
>>> years has strengthened the conclusion of CCSP (2006) that there is no
>>> meaningful discrepancy between models and observations.
>>>
>>> In an earlier email Ben suggested an outline for the paper:
>>>
>>>   1) Point out flaws in the statistical approach used by Douglass et al.
>>>
>>>   2) Show results from significance testing done properly.
>>>
>>>   3) Show a figure with different estimates of radiosonde temperature 
>>> trends
>>>      illustrating the structural uncertainty.
>>>
>>>   4) Discuss complementary evidence supporting the finding that the 
>>> tropical
>>>      lower troposphere has warmed over the satellite era.
>>>
>>> I think this is fine but I'd like to suggest a couple of other items. 
>>> First,
>>> some mention could be made regarding the structural uncertainty in 
>>> satellite
>>> datasets. We could have 3a) for sondes and 3b) for satellite data. The
>>> satellite issue could be handled as briefly as in a paragraph, or with a
>>> bit more work and discussion a figure or table (with some trends). 
>>> The main
>>> point to get across is that it's not just UAH vs. RSS (with an 
>>> implied edge
>>> to UAH because its trends agree better with sondes); it's actually UAH vs.
>>> all others (RSS, UMD and Zou et al.). There are complications in 
>>> adding UMD
>>> and Zou et al. to the discussion, but these can be handled either
>>> qualitatively or quantitatively. The complication with UMD is that it 
>>> only
>>> exists for T2, which has stratospheric influences (and UMD does not 
>>> have a
>>> corresponding  measure for T4 which could be used to remove the 
>>> stratospheric
>>> effects). The complication with Zou et al. is that the data begin in 
>>> 1987,
>>> rather than 1979 (as for the other satellite products).
>>>
>>> It would be possible to use the Fu method to remove the stratospheric
>>> influences from UMD using T4 measures from either or both UAH and 
>>> RSS. It
>>> would be possible to directly compare trends from Zou et al. with 
>>> UAH, RSS
>>> & UMD for a time period starting in 1987. So, in theory we could include
>>> some trend estimates from all 4 satellite datasets in apples vs. apples
>>> comparisons. But perhaps this is more work than is warranted for this 
>>> project.
>>> Then at the very least we can mention that in apples vs. apples 
>>> comparisons made
>>> in CCSP (2006) UMD showed more tropospheric warming than both UAH and 
>>> RSS,
>>> and in comparisons made by Zou et al. their dataset showed more 
>>> warming than
>>> both UAH and RSS. Taken together this evidence leaves UAH as the 
>>> "outlier"
>>> compared to the other 3 datasets. Furthermore, better trend agreement 
>>> between
>>> UAH and some sonde data is not necessarily "good" since the sonde 
>>> data in
>>> question are likely to be afflicted with considerable spurious 
>>> cooling biases.
>>>
>>> The second item that I'd suggest be added to Ben's earlier outline 
>>> (perhaps
>>> as item 5) is a discussion of the issues that Susan raised in earlier 
>>> emails.
>>> The main point is that there is now some evidence that inadequacies 
>>> in the
>>> AR4 model formulations pertaining to the treatment of stratospheric 
>>> ozone may
>>> contribute to spurious cooling trends in the troposphere.
>>>
>>> Regarding Ben's Fig. 1 -- this is a very nice graphical presentation 
>>> of the
>>> differences in methodology between the current work and Douglass et 
>>> al. However, I would suggest a cautionary statement to the effect 
>>> that while error
>>> bars are useful for illustrative purposes, the use of overlapping 
>>> error bars
>>> is not advocated for testing statistical significance between two 
>>> variables
>>> following Lanzante (2005).
>>>    Lanzante, J. R., 2005: A cautionary note on the use of error bars.
>>>    Journal of Climate,  18(17), 3699-3703.
>>> This is also motivation for application of the two-sample test that 
>>> Ben has
>>> implemented.
>>>
>>> Ben wrote:
>>>> So why is there a small positive bias in the empirically-determined 
>>>> rejection rates? Karl believes that the answer may be partly linked to
>>>> the skewness of the empirically-determined rejection rate 
>>>> distributions.
>>> [NB: this is in regard to Ben's Fig. 3 which shows that the rejection 
>>> rate
>>> in simulations using synthetic data appears to be slightly positively 
>>> biased
>>> compared to the nominal (expected) rate].
>>>
>>> I would note that the distribution of rejection rates is like the 
>>> distribution
>>> of precipitation in that it is bounded by zero. A quick-and-dirty way to
>>> explore this possibility using a "trick" used with precipitation data 
>>> is to
>>> apply a square root transformation to the rejection rates, average 
>>> these, then reverse transform the average. The square root 
>>> transformation should yield
>>> data that is more nearly Gaussian than the untransformed data.
>>>
>>> Ben wrote:
>>>> Figure 3: As Mike suggested, I've removed the legend from the 
>>>> interior of the Figure (it's now below the Figure), and have added 
>>>> arrows to indicate the theoretically-expected rejection rates for 
>>>> 5%, 10%, and 20% tests. As Dian suggested, I've changed the colors 
>>>> and thicknesses of the lines indicating results for the "paired 
>>>> trends". Visually, attention is now drawn to the results we think 
>>>> are most reasonable - the results for the paired trend tests with 
>>>> standard errors adjusted for temporal autocorrelation effects.
>>>
>>> I actually liked the earlier version of Fig. 3 better in some regards.
>>> The labeling is now rather busy. How about going back to dotted, thin
>>> and thick curves to designate 5%, 10%, and 20%, and also placing labels
>>> (5%/10%/20%) on or near each curve? Then using just three colors to
>>> differentiate between Douglass, paired/no_SE_adj, and paired/with_SE_adj
>>> it will only be necessary to have 3 legends: one for each of the 
>>> three colors.
>>> This would eliminate most of the legends.
>>>
>>> Another topic of recent discussion is what radiosonde datasets to 
>>> include
>>> in the trend figure. My own personal preference would be to have all 
>>> available
>>> datasets shown in the figure. However, I would defer to the individual
>>> dataset creators if they feel uncomfortable about including sets that 
>>> are
>>> not yet published.
>>>
>>> Peter also raised the point about trends being derived differently for
>>> different datasets. To the extent possible it would be desirable to
>>> have things done the same for all datasets. This is especially true for
>>> using the same time period and the same method to perform the 
>>> regression.
>>> Another issue is the conversion of station data to area-averaged 
>>> data. It's
>>> usually easier to ensure consistency if one person computes the trends
>>> from the raw data using the same procedures rather than having several
>>> people provide the trend estimates.
>>>
>>> Karl Taylor wrote:
>>>> The lower panel <of Figure 2> ...
>>>> ... By chance the mean of the results is displaced negatively ...
>>>> ... I contend that the likelihood of getting a difference of x is equal
>>>> to the likelihood of getting a difference of -x ...
>>>> ... I would like to see each difference plotted twice, once with a 
>>>> positive
>>>> sign and again with a negative sign ...
>>>> ... One of the unfortunate problems with the asymmetry of the 
>>>> current figure is that to a casual reader it might suggest a 
>>>> consistency between the intra-ensemble distributions and the 
>>>> model-obs distributions that is not real
>>>> Ben and I have already discussed this point, and I think we're both 
>>>> still a bit unsure on what's the best thing to do here.  Perhaps 
>>>> others can provide convincing arguments for keeping the figure as is 
>>>> or making it symmetric as I suggest.
>>>
>>> I agree with Karl in regard to both his concern for misinterpretation as
>>> well as his suggested solution. In the limit as N goes to infinity we
>>> expect the distribution to be symmetric since we're comparing the 
>>> model data
>>> with itself. The problem we are encountering is due to finite sample 
>>> effects.
>>> For simplicity Ben used a limited number of unique combinations -- using
>>> full bootstrapping the problem should go away. Karl's suggestion 
>>> seems like
>>> a simple and effective way around the problem.
>>>
>>> Karl Taylor wrote:
>>>> It would appear that if we believe FGOALS or MIROC, then the 
>>>> differences between many of the model runs and obs are not likely to 
>>>> be due to chance alone, but indicate a real discrepancy ... This 
>>>> would seem
>>>> to indicate that our conclusion depends on which model ensembles we 
>>>> have
>>>> most confidence in.
>>>
>>> Given the tiny sample sizes, I'm not sure one can make any meaningful
>>> statements regarding differences between models, particularly with 
>>> regard to
>>> some measure of variability such as is implied by the width of a 
>>> distribution.
>>> This raises another issue regarding Fig. 2 -- why show the results 
>>> separately
>>> for each model? This does not seem to be relevant to this project. Our
>>> objective is to show that the models as a collection are not 
>>> inconsistent
>>> with the observations -- not that any particular model is more or less
>>> consistent with the observations. Furthermore showing results  for 
>>> different
>>> models tempts the reader to make such comparisons. Why not just 
>>> aggregate the
>>> results over all models and produce a histogram? This would also 
>>> simplify
>>> the figure.
>>>
>>> Best regards,
>>>
>>> _____John
>>>
>>
>>
>> -- 
>> ---------------------------------------------------------------------------- 
>>
>> Benjamin D. Santer
>> Program for Climate Model Diagnosis and Intercomparison
>> Lawrence Livermore National Laboratory
>> P.O. Box 808, Mail Stop L-103
>> Livermore, CA 94550, U.S.A.
>> Tel:   (925) 422-2486
>> FAX:   (925) 422-7675
>> email: santer1@llnl.gov
>> ---------------------------------------------------------------------------- 
>>
>>
>>
>> Attachment converted: Junior:time_series4.pdf (PDF) (00233261)
>> Attachment converted: Junior:error_compare3.pdf (PDF) (00233262)
> 




</x-flowed>

Attachment Converted: "c:\eudora\attach\NEW_IJC_figure03.pdf"

Attachment Converted: "c:\eudora\attach\NEW_IJC_figure04.pdf"

Attachment Converted: "c:\eudora\attach\NEW_IJC_figure05.pdf"