Hydrologic Model Error and Determining Correlation With Observed Biological Conditions

Introduction: Modeling a Hydrologic Foundation
This series of blogs is an attempt to outline the basic premise and assumptions behind using a hydrologic model to estimate the baseline flow regime at an un-gaged location for the purpose of establishing flow-ecology relationships.  This baseline flow regime, when combined with known magnitudes of water withdrawal, discharge, and impoundment releases, will permit us to estimate the extent and nature of changes to the flow regime as expressed by a variety of hydrologic indicators.  The specific values calculated for a suitable suite of hydrologic indicators (IHA and HIT, for example) are believed to describe a stream's suitability for supporting specific types of aquatic life.  The use of models for this purpose relies upon an underlying assumption: that modeled flow regimes with some reasonable resemblance to the actual flow regimes will be able to represent trends in hydrologic alteration (insofar as sites will be identified as having generally greater or lesser alteration) even if the exact magnitude of alteration at individual sites is subject to substantial error.  While these errors will prevent us from determining exact thresholds of ecological tolerance to alteration, the approach should still allow us to identify the types of alteration that have the greatest impact on the health of individual species, life-stages, guilds or general indicators of biological health.

This series of blogs will go through the assumptions that underlie this approach, and explore the effects of model error on the statistics that describe relationships between causes and correlatives.  The blogs in this series will progress from the simple and hypothetical to the analysis of actual complex flow-ecology relationships:

  1. The examination of the effect of simulated error on a hypothetical flow-ecology relationship, to test the ability of simple linear regression to sift through noise to identify the signal.
  2. The examination of model error for a known relationship between impervious area, flow regimes, and biological health (from the Potomac River ELOHA study).
  3. The exploration of simulated flows and measured ecological indicators in the Virginia HWI project.

Episode 1: Random Errors on a Mathematical Function and Their Effect on Linear Regression
Our first hypothesis is that linear regression can sift through the noise to find the signal in some causal relationship.  Before trying this hypothesis out on a complex biological system, we will first explore it in the context of a simple, linear algebraic function with error (noise) deliberately introduced.

Figure 1 shows a hypothetical flow-ecology relationship for some stream.  Imagine that for this stream there is a linear relationship between decreases in the annual 30-day minimum flow and biological health, described by the equation y = -0.5x + 100, which literally says that for every 1 percentage point of reduction in the 30-day low flow (our x-value), we will see a 0.5 percentage point decrease in our y-value (% health metric).  The ecological health indicator is displayed on the y-axis (scaled from 100% to 0% health), decreasing from 100% to 50% over the range of x-values.  The x-axis is our flow alteration metric, which in this case we are calling “Percent Reduction in 30-day minimum”.

Figure 1: Hypothetical Flow Alteration - Ecological Response Equation

The values that result from this equation are shown in Table 1, with the column “Bio-Health Indicator” showing the y-values (biological response to our hypothetical flow alteration) and the column labelled “% Decrease in 30-day Min” showing the x-values (the percent alteration of our metric).  Superimposed on our graph in Figure 1 is an equation that was generated by the simple least squares regression routine in MS Excel.  If you look at the equation on the graph in Figure 1, you will notice that the linear regression routine describes the equation exactly as it was set up: y = -0.5x + 100.  In other words, our data, which were error free, were exactly described by our regression.
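For readers who want to reproduce the error-free fit outside of a spreadsheet, here is a minimal sketch in Python.  The analysis in this post was done in Excel, so the function name `fit_line` and the layout are our own illustration, not the original workbook.  The x and y values are the ones listed in Table 1; with those values the fitted slope is negative, since health falls as the percent reduction rises.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept; also returns R2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# x: percent reduction in 30-day minimum (4, 8, ..., 100), as in Table 1
xs = list(range(4, 101, 4))
# y: bio-health indicator (98, 96, ..., 50), as in Table 1
ys = [100 - 0.5 * x for x in xs]

slope, intercept, r_squared = fit_line(xs, ys)
print(f"slope={slope:.3f} intercept={intercept:.3f} R2={r_squared:.3f}")
# prints: slope=-0.500 intercept=100.000 R2=1.000
```

With no error in the data, least squares has nothing to average over, so it returns the generating line and an R2 of 1.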

Figure 2: Linear Function with 25 Percentage Point Random Error Induced.  R2 = 0.7769, p = 2.68e-10.

In order to explore the effects of random error, a function was created which introduced an error into our x-values.  This mimics a site whose health indicator corresponds to its actual flow alteration level, but whose modeled alteration level is erroneous because of hydrologic model error, clouding our ability to perceive the relationship.  Five (5) different levels of modeled error were induced, ranging from up to +-25 percentage points (e25(x)) to +-300 percentage points (e300(x)).  For example, for e25(x), each x-value was increased or decreased by a randomly generated number between -25 and +25; for e300(x), the x-values were offset by a randomly generated number between -300 and +300.  Figure 2 shows the results of this analysis with the e25 function applied.  You can see that while a fair amount of scatter has been introduced into our graph due to the error in x-values, the regression line still mimics that of our base equation, our R2 value remains relatively high (0.7769), and our p-value is essentially zero (see Table 1, p = 2.68E-10).  That is, if there were no real relationship, the chance of seeing an apparent relationship this strong would be practically nil.  Figures 3-6 show the results of this exercise for induced error ranges from 50 to 300 percentage points and their impact upon regression parameters.
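The error-induction exercise can be sketched in a few lines of Python.  This is an illustration under stated assumptions, not the original spreadsheet: we assume the random offsets are drawn uniformly from the stated range, the seed is arbitrary, and the helper names `fit_line` and `add_uniform_error` are ours.  Because the draws are random, the slopes and R2 values it prints will differ from those in the figures.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept; also returns R2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, (sxy * sxy) / (sxx * syy)


def add_uniform_error(xs, max_error, rng):
    """Offset each x by a random amount drawn uniformly from [-max_error, +max_error]."""
    return [x + rng.uniform(-max_error, max_error) for x in xs]


rng = random.Random(1)               # arbitrary fixed seed so a run is repeatable
xs = list(range(4, 101, 4))          # true percent reduction in 30-day minimum
ys = [100 - 0.5 * x for x in xs]     # health responds to the TRUE alteration

results = {}
for max_error in (25, 50, 75, 200, 300):
    noisy_xs = add_uniform_error(xs, max_error, rng)   # what the "model" reports
    results[max_error] = fit_line(noisy_xs, ys)        # regress y on the noisy x

for max_error, (slope, intercept, r_squared) in results.items():
    print(f"e{max_error}: slope={slope:.3f} intercept={intercept:.1f} R2={r_squared:.3f}")
```

Note that only the x-values are disturbed; the y-values stay tied to the true alteration, which is the situation described above for a site with hydrologic model error.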

Figure 3: Linear Function with 50 Percentage Point Random Error Induced. R2 = 0.40, p = 0.0035

It can be seen that as the magnitude of the induced errors grows, our R2 value generally decreases, reflecting the fact that less of the variation in y is explained by x.  Similarly, the value for p increases, reflecting a lower level of confidence that the observed phenomenon is in fact real.  It should be noted, however, that in Figures 3 and 4 the p-value remains very low (significant at the 99% confidence level) while the R2 value changes substantially.  Once a certain level of error is reached, however, our p-value begins to increase rapidly.  Another important factor in evaluating p-values is the sample size.  For a given R2, larger sample sizes will result in smaller p-values (greater significance), even when R2 is very low, because the larger sample size increases the confidence that a genuine relationship exists, even if its strength is very small.  Also interesting to note is that the slope of our line becomes less and less similar to that of our actual relationship.  This indicates the effect of model error on our ability to predict actual responses, as opposed to simply verifying that some relationship exists and is of a certain general direction (i.e., lowering the 30-day low flow will decrease health).
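The flattening slope is what one would expect when the predictor itself carries random error, an effect often called regression dilution or attenuation.  As a rough sketch, assume the error is independent of the true x and drawn uniformly from +-E, so that its variance is E^2/3.  In a large sample the fitted slope then shrinks toward zero by the factor var(x) / (var(x) + var(error)), and because y here is an exact function of the true x, the expected R2 is about the same factor.  The function name `attenuation_factor` is ours, and the uniform-error assumption is ours as well.

```python
def attenuation_factor(xs, max_error):
    """Large-sample shrinkage of the slope when x carries uniform +-max_error noise."""
    n = len(xs)
    mean_x = sum(xs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n   # spread of the true x-values
    var_error = max_error ** 2 / 3                    # variance of Uniform(-E, +E)
    return var_x / (var_x + var_error)


xs = list(range(4, 101, 4))   # true percent reduction in 30-day minimum
true_slope = -0.5             # slope of the error-free relationship in Table 1

factors = {e: attenuation_factor(xs, e) for e in (25, 50, 75, 200, 300)}
for max_error, factor in factors.items():
    print(f"e{max_error}: factor={factor:.3f} expected slope={true_slope * factor:.3f}")
```

These are long-run expectations only.  With just 25 points, a single random draw can land well away from them, which is one reason the R2 values in the figures do not fall in a perfectly orderly sequence.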

Figure 4: Linear Function with 75 Percentage Point Random Error Induced. R2 = 0.5537, p = 0.0035.

This hypothetical example showed us some of what to expect from the use of hydrologic models to estimate flow alteration.  We should expect, first, that R2 values will suffer as model error increases.  We should also expect that p-values will increase, though we might hope that they will increase at a slower rate than R2 values decrease, provided that sample sizes are large enough.  Generally speaking, sample size is a huge limiter in establishing flow-ecology relationships: we have relatively sparse biological sampling networks, and even sparser networks of continuously monitored flow gages.  One great advantage of this approach is that we are using hydrologic models to expand the flow time series record to encompass all of our biological monitoring points.
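To see why sample size matters so much, it helps to look at how the test statistic for a simple linear regression depends on n.  For a fit with correlation r, the slope's t statistic is r * sqrt(n - 2) / sqrt(1 - r^2), so a fixed R2 yields a larger t as n grows.  The sketch below holds R2 at 0.05 and varies n.  The sample sizes are arbitrary, the function names are ours, and the p-value uses the normal curve as a stand-in for Student's t, which understates p somewhat at small n.

```python
import math


def t_statistic(r_squared, n):
    """t statistic for the slope of a simple linear regression fitted to n points."""
    return math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))


def approx_two_sided_p(t):
    """Two-sided p-value from the standard normal curve (approximation to Student's t)."""
    return math.erfc(abs(t) / math.sqrt(2.0))


r_squared = 0.05   # a weak relationship, held fixed
p_by_n = {}
for n in (25, 100, 400):
    t = t_statistic(r_squared, n)
    p_by_n[n] = approx_two_sided_p(t)
    print(f"n={n}: t={t:.2f} approximate p={p_by_n[n]:.2g}")
```

The same weak relationship that is indistinguishable from noise at 25 sites becomes hard to dismiss at a few hundred, which is the practical argument for using modeled flows to bring every biological monitoring point into the analysis.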

Figure 5: Linear Function with 200 Percentage Point Random Error Induced.  R2 = 0.3675, p = 0.0013.

Figure 6: Linear Function with 300 Percentage Point Random Error Induced.  R2 = 0.05, p = 0.64.

Table 1: Hypothetical flow alteration (x, % Decrease in 30-day Min), biological response (y, Bio-Health Indicator), and x-values with induced random error.  The R2 and p rows give the regression statistics for y against each error column.

    x    y    e25(x)    e50(x)    e75(x)   e200(x)   e300(x)
R2              0.78      0.40      0.55      0.37      0.05
p           2.68E-10    0.0035    0.0013      0.40      0.64
    4   98       -15       -10       -37       129      -125
    8   96        -9        10       -44      -108       304
   12   94        15        -5       -20       -80        73
   16   92        19        17        49       155      -184
   20   90        31        53        95       -19       -55
   24   88        37        46        62        28       178
   28   86        11        26       103      -162      -227
   32   84        44        48        65        50      -217
   36   82        11        61        90       142        76
   40   80        63        36        20       -78       276
   44   78        62        85       -10       -64      -120
   48   76        62        78        19       103       231
   52   74        50        15        -1       158        37
   56   72        78        37       110         3      -160
   60   70        49        11        96       104       180
   64   68        66        49        78       -43       212
   68   66        84       110       136        22       216
   72   64        48       104       124        50       134
   76   62        91        82        41        82      -129
   80   60        63       104        82       167       313
   84   58        65       112       107       225       322
   88   56        79       121        75       165       -86
   92   54        84        95       102       -47        78
   96   52       108       130        49       229       -26
  100   50        82        63       144       160      -166
