What’s the cost of NULL data in PostgreSQL?

What’s the cost of a column with a null value?  In other words, if you have a table with a handful of columns, but a large number of rows have NULL values in one or more columns, when are you better off to just add a second table, linked to the first with a foreign key, to hold the data for those columns that are more “ephemeral”?

Drupal Nodes, Drupal Custom Entities, Geospatial Raster, Geospatial Vector
I ask this question for two seemingly unrelated pursuits: the issue of nodes vs. custom entity tables in Drupal and the issue of raster vs. vector storage in GIS. Both of these areas have similar aspects to the question.  In Drupal, when creating a complex data entity, you are faced with the question of using Drupal’s own internal Node+Fields plumbing to represent your multi-attribute data, or of creating a custom Entity with it’s own storage table where each of your Entities attributes can reside in the same table.  The Node+Fields approach is really easy to get going with, all accomplishable with ease from the Drupal AdminUI in a browser, but which suffers from really complex joins when you have more than 2 or 3 attribute fields attached to your node.  In other words, for every additional attribute (field) you want, Drupal will add another table to store it.  The custom Entity approach on the other hand eliminate much of the join complexity with since you just create your attributes as columns in a single table, but involves quite a bit of coding (although some use ECK – though I’ve struggled with some buggy performance there).  In GIS, raster data is a very efficient storage technique, but for every attribute you want at a place, you essentially need to add an additional raster, whereas, a vector can have additional attributes by adding more columns to it’s single attribute table (not to mention that vectors can have a variable resolution).  In Drupal, the Node+Fields approach is simpler to create, but less efficient for storage and data handling, and the Custom Entity approach is more complex to create, but more efficient for storage.  The Raster in GIS is more efficient for storage, but more complex to query, and the vector is less efficient for storage, but easier to create, and easier to query.

A Tale of Two Tables
We’ll create 2 tables with 2 float8 columns on them.  We will also find out where postgresql stores them in the file system so that we can check them out.

create table nvtest (var1 float8, var2 float8);
create table nvtest_null (var1 float8, var2 float8);
ptest=# select pg_relation_filepath(‘nvtest’);
pg_relation_filepath
———————-
base/922881/922882
(1 row)
ptest=# select pg_relation_filepath(‘nvtest_null’);

pg_relation_filepath
———————-
base/922881/922885
(1 row)

ptest=# insert into nvtest (var1, var2) select v1,v1 from (select generate_series(1,101000) as v1) as foo;
INSERT 0 100000
ptest=# insert into nvtest_null(var1) select v1 from (select generate_series(1,101000) as v1) as foo;
INSERT 0 100000

What’s the storage totals for each of our tables?  The table with non-NULL in both columns was 4.3M, while the table with NULL in the second column filled 3.5M, about 18% less data (– ( note in postgres 8.x storage can be estimated in Gig by SELECT relfilenode, relpages * 8.0 / 1000000.0 FROM pg_class WHERE relname = ‘nvtest_null’;  ).  So, having a column with a NULL value in it is quite economical.

[postgres@dbase2 ~]$ ls -lhrt /data/postgres/data/base/922881/922882
-rw——-. 1 postgres postgres 4.3M Mar 20 12:36 /data/postgres/data/base/922881/922882
[postgres@dbase2 ~]$ ls -lhrt /data/postgres/data/base/922881/922885
-rw——-. 1 postgres postgres 3.5M Mar 20 12:36 /data/postgres/data/base/922881/922885

What if we create a table with only 1 column?  This table has identical storage requirement of the table with 2 columns but NULL values in the second column.

ptest=# create table nvtest_none (var1 float8);
CREATE TABLE
ptest=# insert into nvtest_none(var1) select v1 from (select generate_series(1,1000) as v1) as foo;
INSERT 0 1000
ptest=# insert into nvtest_none(var1) select v1 from (select generate_series(1,100000) as v1) as foo;
INSERT 0 100000
ptest=# select pg_relation_filepath(‘nvtest_none’);
pg_relation_filepath
———————-
base/922881/923548
(1 row)

ptest=# \q
[postgres@dbase2 ~]$ ls -lhrt /data/postgres/data/base/922881/923548
-rw——-. 1 postgres postgres 3.5M Mar 20 12:42 /data/postgres/data/base/922881/923548

What if we create fill half of the second variable column with null, the other half with data?  Storage requirement only increases modestly.  Suggesting that the large part of storage is actually in row overhead, adding additional columns only increases requirements a smig

ptest=# create table nvtest_some (var1 float8, var2 float8);
CREATE TABLE
ptest=# insert into nvtest_some(var1,var2) select v1,v1 from (select generate_series(1, 50500) as v1) as foo;
INSERT 0 50500
ptest=# insert into nvtest_some(var1) select v1 from (select generate_series(1, 50500) as v1) as foo;
INSERT 0 50500
ptest=# select pg_relation_filepath(‘nvtest_some’);
pg_relation_filepath
———————-
base/922881/923551
(1 row)

ptest=# \q
[postgres@dbase2 ~]$ ls -lhrt /data/postgres/data/base/922881/923551
-rw——-. 1 postgres postgres 3.9M Mar 20 12:45 /data/postgres/data/base/922881/923551

To test the hypothesis that row overhead is a large share of the storage requirement, we will create another two column table and insert 100K records with null in both columns.  The resulting storage requirement (2.8M) is over half (65%) of the requirement of the table with non-null values in both columns (4.3M).

ptest=# create table nvtest_nullnull (var1 float8, var2 float8);
CREATE TABLE
ptest=# insert into nvtest_nullnull(var1,var2) select null,null from (select generate_series(1, 101000) as v1) as foo;
INSERT 0 101000
ptest=# select pg_relation_filepath(‘nvtest_nullnull’);
pg_relation_filepath
———————-
base/922881/923557
(1 row)

ptest=# \q
[postgres@dbase2 ~]$ ls -lhrt /data/postgres/data/base/922881/923557
-rw——-. 1 postgres postgres 2.8M Mar 20 12:58 /data/postgres/data/base/922881/923557

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s