Decreasing the index size on wide columns

If you have indexes on columns that are quite large but are never used for range scans (greater than, less than, BETWEEN), it is wise to compact the index to cut back on disk usage and free up memory you may desperately need.

For example, suppose you have a username column that stores usernames as text, with an average value width of around 24 bytes. With 1 million users that means 24 MB just for keeping the values, and when the index is accessed regularly the database does its best to keep all of it in memory.
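The back-of-the-envelope arithmetic above can be checked directly in PostgreSQL (the figures are the ones assumed in this article, not measurements):

```sql
-- 1 million usernames at ~24 bytes each, values only
SELECT pg_size_pretty(24::bigint * 1000000);
```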

The solution itself is simple: create a functional (expression) index on a hash of the column. Using the PostgreSQL built-in hash function hashtext(text), the space needed to store each value drops from 24 bytes to 4 bytes (actual index tuples also carry additional data used internally).

CREATE INDEX my_ix ON users (hashtext(username));
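To see the difference on disk, you can compare this index against a plain b-tree on the raw column. This sketch assumes a users table with a username column; the second index name is illustrative:

```sql
-- plain b-tree on the raw text column, for comparison
CREATE INDEX my_ix_plain ON users (username);

-- report both index sizes in human-readable form
SELECT pg_size_pretty(pg_relation_size('my_ix'))       AS hash_index,
       pg_size_pretty(pg_relation_size('my_ix_plain')) AS text_index;
```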

This statement generates a much more compact index, as it only has to store 4 bytes per user. With a functional index the raw data is no longer kept inside the index; instead, the result returned by the hash function is stored there. For a given username the stored value would be, for example, hashtext(username) = 1408893908, a 4-byte integer. The SELECT statement used to look up a user is as follows:

SELECT * FROM users WHERE hashtext(username) = hashtext($1) AND username = $1;

Hashing can produce collisions (different usernames can end up with the same hash), which is why we also need the exact-match criterion (username = $1). This SELECT uses the hashtext index to find the 1..N rows matching the calculated hash and then keeps only the ones whose username is an exact match.
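You can confirm that the planner actually uses the hash index for this lookup with EXPLAIN; the plan should show a scan on my_ix with the exact-match condition applied as a filter ('alice' below stands in for any username):

```sql
EXPLAIN
SELECT * FROM users
WHERE hashtext(username) = hashtext('alice')
  AND username = 'alice';
```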

Just a few numbers from a quick test with 100K rows of 24-byte text each:
Size of btree(text) = 516096 bytes
Size of btree(hashtext(text)) = 245760 bytes
That is a roughly 52% reduction in index size.
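A quick test along these lines could be reproduced as follows (table and index names are illustrative, and exact sizes will vary with PostgreSQL version and fill factor):

```sql
-- 100K rows of 24-character random text
CREATE TABLE t AS
SELECT substr(md5(random()::text), 1, 24) AS v
FROM generate_series(1, 100000);

CREATE INDEX t_text_ix ON t (v);
CREATE INDEX t_hash_ix ON t (hashtext(v));

-- compare the on-disk sizes of the two indexes
SELECT pg_relation_size('t_text_ix') AS text_index_bytes,
       pg_relation_size('t_hash_ix') AS hash_index_bytes;
```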
