At some point after setting out to use the output from the CBP5 Watershed Model in my meta-modeling system (WOOOMM), it became readily apparent that the model produces far more data in its output files than many applications need, particularly those interested only in hydrology. Moreover, I often only need to be "loosely coupled" with the model: I really just want to take the model's output and do something with it, rather than feed results back to the model and try to change its inputs. Thus, it became worthwhile to put together a system for caching and retrieving the model's output data for use by widgets in the meta-modeling system.

During this process, I learned the pros and cons of database indices, specifically the following yin-yang: lots of indices can speed up READS from a table, but to achieve high-speed WRITES, the number of indices must be minimized. The short of it: a table with several million records and 4 indices describing important geographic and scenario-specific details took 10+ minutes to insert another 120,000 records (because the indices were being recalculated as rows were inserted), but by keeping only the most important index and dropping the others, I reduced that insert time to under 10 seconds. The long of it (including how I got here) follows:
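The drop-then-rebuild pattern behind that speedup can be sketched as follows. The table and index names here are invented for illustration, and SQLite stands in for PostgreSQL so the example is self-contained; in PostgreSQL the same `DROP INDEX` / bulk `INSERT` / `CREATE INDEX` sequence applies.

```python
import sqlite3

# Hypothetical schema loosely modeled on the post: model output rows tagged
# with scenario and geographic identifiers. SQLite stands in for PostgreSQL
# here purely to keep the sketch self-contained and runnable.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE model_output (
    scenarioid INTEGER, segmentid TEXT, ts INTEGER, flow REAL)""")

# Several indices make READS fast, but each one is maintained on every INSERT.
for col in ("scenarioid", "segmentid", "ts"):
    cur.execute(f"CREATE INDEX idx_{col} ON model_output({col})")

rows = [(1, "seg01", t, 0.5 * t) for t in range(120_000)]

# Bulk-load pattern: drop the nonessential indices, insert, then rebuild them
# once at the end, instead of updating every index row by row.
cur.execute("DROP INDEX idx_segmentid")
cur.execute("DROP INDEX idx_ts")
cur.executemany("INSERT INTO model_output VALUES (?, ?, ?, ?)", rows)
cur.execute("CREATE INDEX idx_segmentid ON model_output(segmentid)")
cur.execute("CREATE INDEX idx_ts ON model_output(ts)")
conn.commit()

count = cur.execute("SELECT COUNT(*) FROM model_output").fetchone()[0]
```

Rebuilding an index once over the finished table is far cheaper than incrementally maintaining it across 120,000 inserts, which is where the 10+ minutes went.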
My first crack at doing this was not particularly scalable: I wrote a routine to convert the model's output database files, known as WDMs, into text files, then parsed those text files and harvested the data I was interested in. This kept me from crushing memory with extraneous data (I could take only what I wanted and discard the rest), but it was time consuming and had to be repeated each time I ran the model simulation. It was also non-scalable in another sense: I had to link an output WDM to every object that I wanted to pull data from, and I had to keep a copy of each WDM accessible in order to use multiple watershed model output areas. Not only was this a bit of a drag from a model setup standpoint, but if the CBP5 model output changed (say, a new calibration is issued, which happens every 6-12 months), or if I ran experiments in another part of the model that required re-running the CBP5 output, then all the files had to be swapped out.
The second iteration used a PostgreSQL database to house the model output. This proved to be a nice way to store the data: by giving each model run a specific "scenario ID," I could store multiple iterations of the model's outputs (say, a historical run, a projected future run, a future run with global warming, etc.), each tagged with a scenario ID to keep them distinct from one another. When a model widget that uses this data needs to change scenarios, all it does is swap its scenario ID; all other data stays the same, and it gets the new info. This was a good solution, but it soon became apparent that I needed to learn a thing or two about high-volume databases (with frequent WRITES) and indices.
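The scenario-ID scheme can be sketched like so. The table, column, and helper names are invented for illustration, and SQLite again stands in for PostgreSQL so the example runs on its own; the point is only that a widget switches scenarios by changing a single parameter in its query.

```python
import sqlite3

# Hypothetical sketch of the scenario-ID tagging scheme described above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE model_output (
    scenarioid INTEGER, segmentid TEXT, ts INTEGER, flow REAL)""")
cur.executemany(
    "INSERT INTO model_output VALUES (?, ?, ?, ?)",
    [(1, "seg01", 0, 10.0),   # scenario 1: e.g. a historical run
     (2, "seg01", 0, 12.5)],  # scenario 2: e.g. a projected future run
)

def get_flows(scenarioid, segmentid="seg01"):
    """A widget 'swaps scenarios' by changing only this one parameter."""
    return cur.execute(
        "SELECT ts, flow FROM model_output "
        "WHERE scenarioid = ? AND segmentid = ?",
        (scenarioid, segmentid),
    ).fetchall()

historical = get_flows(1)  # [(0, 10.0)]
future = get_flows(2)      # [(0, 12.5)]
```

Because every row carries its scenario ID, multiple complete model runs coexist in the same table, and nothing else about the widget's query or configuration has to change.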