PostgreSQL substring search

As most of you have probably discovered there is no nice way to do substring search in PostgreSQL however where’s a will there’s a way. The following post is about overcoming this limit, as with all such cases it is somewhat specific to the problem that we are solving and has it’s drawbacks. However i hope this might be of some help for the really desperate folks.This somewhat hackish approach relies on the full text search contrib module called Tsearch2 by Oleg Bartunov and Teodor Sigaev. I tried to use their trigram module first but it didn’t really help me out so here is the solution: Let’s create a really simple substring search on email addresses, for input we have the following table:


    Table "public.tmp_email"
  Column  |   Type   | Modifiers
----------+----------+-----------
 email    | text     |
 trigrams | tsvector |

trigrams is a column added solely for the purpose of being able to help us to search emails by substrings, it contains all the unique sequential 3 letter combinations used in the email field. For example for an email ‘kristo@eesti.ee’ it would consist of

o@e est sti .ee @ee sto ris ees ist ti. kri i.e to@

We will do this with a simple plpython function (can be done in plpgsql also but it looks nicer in python)


CREATE FUNCTION text_to_trigrams(i_txt text) RETURNS text AS $$
    trigrams = {}
    for i in range(len(i_txt)-2):
        trigrams[i_txt[i:i+3]] = 1
    return ' '.join(trigrams.keys())
$$ LANGUAGE plpythonu SECURITY DEFINER;

In a production system you would probably be calculating the trigrams in an update/insert trigger on the table doing the same operation provided in the following query:


update tmp_email set trigrams = to_tsvector('simple',text_to_trigrams(email));

The principle itself is really simple, after splitting up the string in email column into these unique 3 letter combinations we will create a full text index on that column (trigram column) which will enable us to search for a predefined set of trigrams inside a string.


create index idx_email_substring on tmp_email using gin (trigrams);

Be warned that the index is quite hefty. In my testcase with live e-mail addresses the index was just a bit bigger than the table itself, remember i warned you about the drawbacks ;) However the search itself has a quite acceptable performance, but before doing the search we will have to create the search condition. This is achived by creating a query that consists of the trigrams that we are looking for, eg. we want to find users with e-mail address that includes the substring ‘%to@skype%’


test=# select replace(text_to_trigrams('to@skype'),' ',' & ');
              replace
-----------------------------------
 o@s & sky & @sk & ype & to@ & kyp

This query will find all the trigrams that are contained within the string and replace the spaces we needed for tsvector creation with & operations needed for the tsquery. So using this generated string we can do the search in the following way:


test=# explain analyze select * from tmp_email where trigrams @@ to_tsquery('simple','o@s & sky & @sk & ype & to@ & kyp');
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_email_substring on tmp_email  (cost=0.00..15.50 rows=14 width=64) (actual time=0.086..3.349 rows=2 loops=1)
   Index Cond: (trigrams @@ '''o'' & ''s'' & ''sky'' & ''sk'' & ''ype'' & ''to'' & ''kyp'''::tsquery)
 Total runtime: 3.379 ms

test=# explain analyze select count(*) from tmp_email ;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=676.60..676.61 rows=1 width=0) (actual time=45.656..45.657 rows=1 loops=1)
   ->  Seq Scan on tmp_email  (cost=0.00..642.28 rows=13728 width=0) (actual time=0.009..27.038 rows=13728 loops=1)
 Total runtiows=13728 loops=1) Total runtime: 45.701 ms(3 rows)test=# select count(*) from tmp_email ; count ------- 13728

test=# select count(*) from tmp_email ;
 count
-------
 13728

I know that this is not the nicest way to do this but until we have some generic substring search implemented into PostgreSQL this is at least one way to do it for certain cases. If you want to know in more detail how the previously described solution works i suggest you take a loot at the GIN spec. Be aware that this currently only works if the substring is at least 3 characters long.

About these ads

About this entry