While experimenting with building a simple Twitter clone in Postgres, I found that I needed a way to parse hashtags and mentions from tweets like:

#example tweet - #testing with @postgresql

Imagine that we have a table called tweets defined with the following structure:

CREATE TABLE tweets (  
  id        uuid PRIMARY KEY NOT NULL DEFAULT uuid_generate_v4(),
  post      text NOT NULL,
  hashtags  text[] NOT NULL DEFAULT '{}',
  mentions  text[] NOT NULL DEFAULT '{}'
);

Wouldn't it be nice if the hashtag and mention tokens were automatically parsed when we INSERT posts into the tweets table like the following:

INSERT INTO tweets (post)  
VALUES ('#example tweet - #testing with @postgresql');

SELECT * FROM tweets;

                   id                  |                     post                   |      hashtags     |   mentions
---------------------------------------+--------------------------------------------+-------------------+--------------
 e133820e-7329-4852-b40b-6e9b7e2fa69d  | #example tweet - #testing with @postgresql | {example,testing} | {postgresql}
(1 row)

It turns out that this is pretty easy to achieve with Postgres! First we need to define a function to parse tokens from content and return an array of text.

CREATE FUNCTION parse_tokens(content text, prefix text)  
  RETURNS text[] AS $$
    DECLARE
      regex text;
      matches text;
      subquery text;
      captures text;
      tokens text[];
    BEGIN
      regex := prefix || '(\S+)';
      matches := 'regexp_matches($1, $2, $3) as captures';
      subquery := '(SELECT ' || matches || ' ORDER BY captures) as matches';
      captures := 'array_agg(matches.captures[1])';

      EXECUTE 'SELECT ' || captures || ' FROM ' || subquery
      INTO tokens
      USING LOWER(content), regex, 'g';

      IF tokens IS NULL THEN
        tokens = '{}';
      END IF;

      RETURN tokens;
    END;
  $$ LANGUAGE plpgsql STABLE;

Let's test it out by parsing hashtags from a tweet:

SELECT parse_tokens('#example tweet - #testing with @postgresql', '#');

       tokens
-------------------
 {example,testing}
 (1 row)

Parsing mentions from a tweet is just as simple:

SELECT parse_tokens('#example tweet - #testing with @postgresql', '@');

    tokens
--------------
 {postgresql}
 (1 row)

Now that our parse_tokens function is working, we need to define some triggers to parse hashtags and mentions when a tweet record is inserted or updated.

CREATE TRIGGER parse_hashtags  
  BEFORE INSERT OR UPDATE ON tweets
  FOR EACH ROW EXECUTE PROCEDURE parse_hashtags_from_post();

CREATE FUNCTION parse_hashtags_from_post()  
  RETURNS trigger AS $$
    BEGIN
      NEW.hashtags = parse_tokens(NEW.post, '#');
      RETURN NEW;
    END;
  $$ LANGUAGE plpgsql;

CREATE TRIGGER parse_mentions  
  BEFORE INSERT OR UPDATE ON tweets
  FOR EACH ROW EXECUTE PROCEDURE parse_mentions_from_post();

CREATE FUNCTION parse_mentions_from_post()  
  RETURNS trigger AS $$
    BEGIN
      NEW.mentions = parse_tokens(NEW.post, '@');
      RETURN NEW;
    END;
  $$ LANGUAGE plpgsql;

Now when we create or update tweets the hashtags and mentions fields are automatically updated! This whole process was pretty fun and interesting to get working. I look forward to attempting to push even more logic down into Postgres!