Introduction: Modeling a Hydrologic Foundation
This series of blog posts outlines the basic premise and assumptions behind using a hydrologic model to estimate the baseline flow regime at an ungaged location for the purpose of establishing flow-ecology relationships. This baseline flow regime, when combined with known magnitudes of water withdrawal, discharge, and impoundment releases, will permit us to estimate the extent and nature of changes to the flow regime as expressed by a variety of hydrologic indicators. The specific values calculated for a suitable suite of hydrologic indicators (IHA and HIT, for example) are believed to describe a stream's suitability for supporting specific types of aquatic life. The use of models for this purpose relies upon an underlying assumption: that modeled flow regimes with some reasonable resemblance to the actual flow regimes will be able to represent trends in hydrologic alteration (insofar as sites will be identified as having generally greater or lesser alteration), even if the exact magnitude of alteration at individual sites is subject to substantial error. While these errors will prevent us from determining exact thresholds of ecological tolerance to alteration, the approach should still allow us to identify the types of alteration that have the greatest impact on the health of individual species, life stages, guilds, or general indicators of biological health.
This series will go through the assumptions that underlie this approach and explore the effects of model error on the statistics that describe relationships between causes and correlates. The posts will progress from the simple and hypothetical to the analysis of actual, complex flow-ecology relationships:
- The examination of simulated error in a hypothetical flow-ecology relationship, testing the ability of simple linear regression to sift through noise and identify the signal.
- The examination of model error for a known relationship between impervious area, flow regimes, and biological health (from the Potomac River ELOHA study).
- The exploration of simulated flows and measured ecological indicators in the Virginia HWI project.
Episode 1: Random Errors on a Mathematical Function and Their Effect on Linear Regression
Our first hypothesis is that linear regression can sift through the noise to find the signal in some causal relationship. Before trying this hypothesis out on a complex biological system, we will first explore it in the context of a simple, linear algebraic function with error (noise) deliberately introduced.
Figure 1 shows a hypothetical flow-ecology relationship for some stream. Imagine that for this stream there is a linear relationship between decreases in the annual 30-day minimum flow and biological health, described by the equation y = 0.5x + 100: for every 1 percent decrease in our x-value (30-day low flow), we will see a 0.5 percentage-point decrease in our y-value (% health metric). Our ecological health indicator is displayed on the y-axis (from 100% to 0% health), decreasing from 100% to 50% over the range of x-values. The x-axis is our flow alteration metric, which we are calling “Percent Reduction in 30-day minimum”.
The values that result from this equation are shown in Table 1, with the column “Bio-Health Indicator” showing the y-values (the biological response to our hypothetical flow alteration) and the column labelled “% Decrease in 30-day Min” showing the x-values (the percent alteration of our metric). Superimposed on the graph in Figure 1 is an equation generated by the simple least-squares regression tool in MS Excel. If you look at that equation, you will notice that the linear regression routine reproduces the equation exactly as it was set up: y = 0.5x + 100. In other words, our error-free data were described exactly by our regression.
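The error-free case can be reproduced outside a spreadsheet. The following is a minimal sketch, not the original workbook: the function name `fit_line` and the x grid (a signed percent change from 0 down to -100 in steps of 5, the convention under which y = 0.5x + 100 falls from 100 to 50) are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Assumed x grid: percent change in the 30-day minimum, 0 to -100 in steps of 5.
xs = [-5.0 * i for i in range(21)]
# Hypothetical health indicator from the post's equation, with no error added.
ys = [0.5 * x + 100.0 for x in xs]

slope, intercept, r_squared = fit_line(xs, ys)
print(f"slope={slope:.4f} intercept={intercept:.4f} R2={r_squared:.4f}")
```

Because the data lie exactly on the line, the fit should return the slope and intercept that were used to build them, with an R2 of 1.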
To explore the effects of random error, a function was created that introduces an error into our x-values. This represents a site whose health indicator corresponds to its actual flow alteration level, but whose modeled alteration level is wrong because of hydrologic model error, clouding our ability to perceive the relationship. Five levels of model error were induced, ranging from ±25 percentage points (e25(x)) to ±300 percentage points (e300(x)). For example, for e25(x), each x-value was increased or decreased by a randomly generated number between -25 and +25; for e300(x), the x-values were offset by a randomly generated number between -300 and +300. Figure 2 shows the results of this analysis with the e25 function applied. You can see that while a fair amount of scatter has been introduced into our graph by the error in x-values, the regression line still mimics that of our base equation, our R² value remains relatively high (0.7769), and our p-value is essentially zero (see Table 1: p = 2.68E-10). That is, if no relationship existed, a fit this strong would almost never arise by chance. Figures 3-6 show the results of this exercise for induced error ranges from ±50 to ±300 percentage points and their impact on the regression parameters.
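The error experiment described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the original spreadsheet function: the helper names `fit_line` and `add_uniform_error`, the x grid, the random seed, and the three intermediate error levels (50, 100, 200) are all hypothetical, since the post only gives the smallest and largest levels. Errors are drawn from a uniform distribution, matching the "randomly generated number between -25 and +25" description.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def add_uniform_error(xs, half_width, rng):
    """Offset each x by a random amount drawn uniformly from [-half_width, +half_width]."""
    return [x + rng.uniform(-half_width, half_width) for x in xs]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed x grid: true percent change in the 30-day minimum, 0 to -100.
true_xs = [-5.0 * i for i in range(21)]
# Health responds to the TRUE alteration, so the y-values carry no error.
ys = [0.5 * x + 100.0 for x in true_xs]

results = {}
for half_width in (25, 50, 100, 200, 300):  # 50, 100, 200 are assumed levels
    modeled_xs = add_uniform_error(true_xs, half_width, rng)
    results[half_width] = fit_line(modeled_xs, ys)
    slope, intercept, r_squared = results[half_width]
    print(f"e{half_width}: slope={slope:.3f} intercept={intercept:.1f} R2={r_squared:.3f}")
```

Individual runs will differ from the figures in this post because the random draws differ; changing the seed gives a feel for how much the fitted slope and R2 vary from one realization to the next.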
As the magnitude of the induced errors grows, our R² value decreases, reflecting the fact that less of the variation in y is explained by x. Similarly, the p-value increases, reflecting a lower level of confidence that the observed phenomenon is in fact real. Note, however, that in Figures 3 and 4 the p-value remains very low (significant at the 99% confidence level) while the R² value changes substantially. Once a certain level of error is reached, the p-value begins to increase rapidly. Another important factor in evaluating p-values is the sample size. Larger sample sizes will result in smaller p-values (greater significance) even in the presence of very low R² values, because the larger sample increases the confidence that a genuine relationship exists, even if its strength is very small. It is also interesting to note that the slope of our line becomes less and less similar to that of the actual relationship. This shows the effect of model error on our ability to predict actual responses, as opposed to simply verifying that some relationship exists and has a certain general direction (i.e., lowering the 30-day low flow will decrease health).
This hypothetical example shows us some of what to expect from the use of hydrologic models to estimate flow alteration. We should expect, first, that R² values will suffer as model error increases. We should also expect that p-values will increase, though we might hope that they will increase at a slower rate than R² values decrease, provided that sample sizes are large enough. Generally speaking, sample size is a major limitation in establishing flow-ecology relationships: we have relatively sparse biological sampling networks, and even sparser networks of continuously monitored flow gages. One great advantage of this approach is that we are using hydrologic models to expand the flow time series record to encompass all of our biological monitoring points.
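These patterns are what the standard errors-in-variables result (often called regression dilution) would lead us to expect. The sketch below is not part of the original analysis, and its symbols (x* for the modeled alteration, e for the model error, w for the half-width of the error range, λ for the attenuation factor, n for the sample size) are introduced here for illustration. It assumes the error is independent of the true x, has mean zero, and that y is an exact linear function of the true x, as in our hypothetical example. The limits hold for large samples; any single small sample will scatter around them.

```latex
\begin{align*}
  x^{*} &= x + e, \qquad y = a + \beta x
    && \text{(modeled alteration = true alteration + error)} \\
  \lambda &= \frac{\sigma_x^{2}}{\sigma_x^{2} + \sigma_e^{2}},
    \qquad \sigma_e^{2} = \frac{w^{2}}{3}
    \ \text{ for } e \sim \mathrm{Uniform}(-w,\, w) \\
  \hat{\beta} &\to \lambda\,\beta, \qquad R^{2} \to \lambda
    && \text{(slope and } R^{2} \text{ both shrink toward zero)} \\
  |t| &= \sqrt{\frac{R^{2}\,(n-2)}{1 - R^{2}}}
    && \text{(slope test statistic, } n-2 \text{ degrees of freedom)}
\end{align*}
```

As a rough check, if the true x-values are spread evenly over a 100-point range, their variance is about 100²/12 ≈ 833; with w = 25 the error variance is about 208, giving λ ≈ 0.8, which is in line with the R² reported for the e25 case. The last line shows why sample size matters: at a fixed R², the test statistic grows roughly with the square root of n, so a weak relationship can still be highly significant when enough sites are sampled.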
| % Decrease in 30-day Min | Bio-Health Indicator | R² = 0.78, p = 2.68E-10 | R² = 0.40, p = 0.0035 | R² = 0.55, p = 0.0013 | R² = 0.37, p = 0.40 | R² = 0.05, p = 0.64 |