Entity Properties or Fields: Storage & Query Speed Costs in Drupal

There are two options for associating attribute data with entities in Drupal: properties and fields. Properties (often referred to as “meta-data”) are simply columns on the entity’s base table. Fields are stored in separate tables, one table for every additional attribute the user wishes to associate with their entity. While fields are highly desirable from an Admin UI standpoint and lower the barrier to entry for data modeling (they don’t require you to program custom entity modules), their use carries very real performance costs in virtually all cases and greatly increases disk storage. Thus, for data-intensive operations, one must quantify the cost of using fields in terms of performance and storage. Below is a case study comparing the two approaches.
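
For context, a Drupal field is more than a bare value column: each field gets its own field_data_* table carrying several bookkeeping columns and a composite primary key. The sketch below is a rough approximation of what Drupal 7 generates for a single-value text field named tsfrag (exact column names and types vary by field type and version, so treat this as illustrative only). Note that the minimal two-column stand-in used in the benchmark below is, if anything, a best case for fields.

create table field_data_field_tsfrag (
  entity_type varchar(128) not null,  -- owning entity type
  bundle      varchar(128) not null,  -- entity sub-type (bundle)
  deleted     smallint     not null default 0,
  entity_id   bigint       not null,  -- the entity being described
  revision_id bigint,
  language    varchar(32)  not null,
  delta       bigint       not null,  -- position, for multi-value fields
  field_tsfrag_value varchar(32),     -- the actual data
  primary key (entity_type, entity_id, deleted, delta, language)
);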

Adding a varchar(32) entity property with NULL data.

In this example I have a time series table (named ‘dh_timeseries’) with ~1.8 million records. The table stores a large amount of metadata, but its primary purpose is to store a float value (tsvalue) of a given data type (varid) for some other entity (featureid) at a given point in time (tstime). The table itself takes up 208MB. I want to add a 32-character text fragment (we’ll call it tsfrag) to the table, and will explore the cost of doing so with a property versus a field. First, we’ll look at the cost of having this data on the timeseries table itself, as a column (property) with NULL values. For this example we create a copy of our table, but with an additional column. Table 1 shows that simply having a column that is NULL costs the database virtually nothing: the storage requirement is identical.

Query 1: Storing null tsfrag data in the Entity table.

-- Recreate the working copy with an extra (all-NULL) tsfrag column
drop table if exists dh_timeseries_tmp;
create table dh_timeseries_tmp (
  tid integer, tstime bigint, utcoffset bigint, tsvalue real,
  featureid bigint, varid bigint, status smallint, module varchar(255),
  tsendtime bigint, bundle varchar(64), timeline smallint,
  entity_type varchar(64), tsfrag varchar(32)
);
insert into dh_timeseries_tmp (tid, tstime, utcoffset, tsvalue, featureid,
  varid, status, module, tsendtime, bundle, timeline, entity_type)
select tid, tstime, utcoffset, tsvalue, featureid, varid, status, module,
  tsendtime, bundle, timeline, entity_type
from dh_timeseries;
vacuum analyze dh_timeseries_tmp;
-- Compare on-disk size (relpages counts 8KB pages)
select relname, relpages, round(8.0 * relpages / 1024) as size_mb
from pg_class
where relname in ('dh_timeseries', 'dh_timeseries_tmp')
order by relname;

Table 1: Two tables with identical structure, with the second table having a single varchar(32) column added, but with NULL data in the column.

         relname          | relpages | size_mb
--------------------------+----------+---------
 dh_timeseries            |    26644 |     208
 dh_timeseries_tmp        |    28688 |     208
(2 rows)
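
Why can a NULL column cost next to nothing? PostgreSQL does not store NULLs in the tuple’s data area at all; it tracks them in a small per-row null bitmap, which often fits into existing header padding. A quick, self-contained illustration using pg_column_size (the 32-byte figures assume a typical 64-bit build):

select pg_column_size(row(1::bigint))                    as without_null_col,
       pg_column_size(row(1::bigint, null::varchar(32))) as with_null_col;
-- both report 32 bytes: the NULL column adds nothing to the row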

Adding a varchar(32) entity property with 8 char data.

If we recreate our time series table with tsfrag populated with non-null data (8 characters in length) in each and every row, our associated storage increases by 16MB (see Table 2).

Query 2: Storing not-null tsfrag data in the Entity table.

drop table if exists dh_timeseries_tmp;
create table dh_timeseries_tmp (
  tid integer, tstime bigint, utcoffset bigint, tsvalue real,
  featureid bigint, varid bigint, status smallint, module varchar(255),
  tsendtime bigint, bundle varchar(64), timeline smallint,
  entity_type varchar(64), tsfrag varchar(32)
);
-- This time, populate tsfrag with an 8-character value in every row
insert into dh_timeseries_tmp (tid, tstime, utcoffset, tsvalue, featureid,
  varid, status, module, tsendtime, bundle, timeline, entity_type, tsfrag)
select tid, tstime, utcoffset, tsvalue, featureid, varid, status, module,
  tsendtime, bundle, timeline, entity_type, '12345678'
from dh_timeseries;
vacuum analyze dh_timeseries_tmp;
select relname, relpages, round(8.0 * relpages / 1024) as size_mb
from pg_class
where relname in ('dh_timeseries', 'dh_timeseries_tmp')
order by relname;

Table 2: Two tables with identical structure, with the second table having a single varchar(32) column added, with 8 characters of data in each record.

         relname          | relpages | size_mb
--------------------------+----------+---------
 dh_timeseries            |    26644 |     208
 dh_timeseries_tmp        |    28688 |     224
(2 rows)
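
A quick back-of-the-envelope check of the 16MB figure: each stored value is 8 characters plus a 1-byte varlena header (PostgreSQL uses the short header form for strings under 127 bytes), so roughly 9 bytes per row before alignment. In SQL:

select pg_size_pretty(9::bigint * count(*)) as expected_growth
from dh_timeseries;
-- roughly 15-16MB for ~1.8 million rows, in line with Table 2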

Adding a varchar(32) field with 8 char data.

Now we add a table, dh_timeseries_field_tmp, that is roughly equivalent to a Drupal field in structure, linked to the dh_timeseries table with a foreign key (tid). Storage cost for the field table is 93MB, compared to the 16MB of extra storage needed to add tsfrag as a column on the base table. (If anything this understates the cost, since a real field_data_* table carries additional bookkeeping columns, as sketched earlier.)

Query 3: Storing tsfrag data as a Drupal Field.

drop table if exists dh_timeseries_field_tmp;
create table dh_timeseries_field_tmp (tid bigint, tsfrag varchar(32));
insert into dh_timeseries_field_tmp (tid, tsfrag)
select tid, '12345678' from dh_timeseries;
vacuum analyze dh_timeseries_field_tmp;
select relname, relpages, round(8.0 * relpages / 1024) as size_mb
from pg_class
where relname in ('dh_timeseries', 'dh_timeseries_tmp',
                  'dh_timeseries_field_tmp')
order by relname;

Table 3: Storage requirements of storing a varchar(32) as a Drupal field.

         relname          | relpages | size_mb
--------------------------+----------+---------
 dh_timeseries            |    26644 |     208
 dh_timeseries_field_tmp  |    11871 |      93
 dh_timeseries_tmp        |    28688 |     224
(3 rows)

Verifying Storage Amounts

Just to be certain that the technique above is not producing inflated estimates, you can go to the file system and verify.

Query 4: Verifying storage amounts in the file system.

-- In psql: locate each relation's file, relative to the data directory
SELECT pg_relation_filepath(oid), relpages
FROM pg_class
WHERE relname in ('dh_timeseries', 'dh_timeseries_tmp', 'dh_timeseries_field_tmp');

Then, from a shell in the PostgreSQL data directory:

ls -lh base/17721/898373 base/17721/898376 base/17721/882655
-rw------- 1 postgres postgres 209M Dec  7 16:12 base/17721/882655
-rw------- 1 postgres postgres 225M Dec 12 11:55 base/17721/898373
-rw------- 1 postgres postgres  93M Dec 12 11:59 base/17721/898376
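
If you would rather stay inside psql, the built-in size functions give the same cross-check; pg_relation_size reports the main data fork, while pg_total_relation_size also counts indexes and TOAST:

select relname,
       pg_size_pretty(pg_relation_size(oid))       as table_size,
       pg_size_pretty(pg_total_relation_size(oid)) as total_size
from pg_class
where relname in ('dh_timeseries', 'dh_timeseries_tmp', 'dh_timeseries_field_tmp');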

Timing & Joins

Now consider the time cost of joining this field table when queries are run. First, we need an index on the field table’s tid column (which will also cost us!). Then we have the added performance hit of the join itself. In this example, a query that grabs 237K records from the original timeseries table takes around 650ms. The new timeseries table with the tsfrag text column takes about the same (in this run it was actually a tad faster, but that is within normal run-to-run variation). Joining the text fragment “field” to the original timeseries table takes a whopping 1600ms, ~2.5 times as long as the query on the table with tsfrag included as a property.

Query 5: Timing queries against the property column versus a join to the field table.

-- Index the field table's foreign key so the join can use it
create index tsf_tid_tix on dh_timeseries_field_tmp(tid);
vacuum analyze dh_timeseries_field_tmp;

\timing
-- Baseline: filtered select from the original table
copy (select * from dh_timeseries where varid = 8)
  to '/tmp/dump.txt';
COPY 237627
Time: 653.347 ms

-- Same filter against the table carrying tsfrag as a column (property)
copy (select * from dh_timeseries_tmp where varid = 8)
  to '/tmp/dump.txt';
COPY 237627
Time: 630.405 ms

-- Same filter, joining tsfrag in from the field table
copy (select a.*, b.tsfrag
      from dh_timeseries as a, dh_timeseries_field_tmp as b
      where a.varid = 8 and b.tid = a.tid)
  to '/tmp/dump.txt';
COPY 237627
Time: 1595.167 ms
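
To see where the extra time goes, you can ask the planner directly. The exact plan depends on table statistics; on data like this the planner would typically choose a hash join, or a nested loop driven by the tsf_tid_tix index:

explain analyze
select a.*, b.tsfrag
from dh_timeseries as a
join dh_timeseries_field_tmp as b on b.tid = a.tid
where a.varid = 8;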

Table 4: Storage requirements of the FK index on a Drupal field table with 1.8 million records.

         relname          | relpages | size_mb
--------------------------+----------+---------
 dh_timeseries            |    26644 |     208
 dh_timeseries_field_tmp  |    11871 |      93
 dh_timeseries_tmp        |    28688 |     224
 tsf_tid_tix              |     5112 |      40
(4 rows)

Total Cost of Fielding

In sum, adding a field increased the storage requirement for the tsfrag data by over 700% (16MB as a column on dh_timeseries_tmp, versus 133MB as a field table plus its index). It also increased query time for a simple filtered select by a factor of ~2.5.
