Profile faxon

PostgreSQL indexing OVERLAPS query with tsrange for scheduling

Recently I worked on a project with a complex scheduling component. Matt Smith and Ray Lucke wrote the scheduling query that handled the business needs of the application. However, once it hit production at scale we found that one part, the calculation of the number of concurrent reservations, was too time consuming. I took it upon myself to find a way to better index the query for performance.

The scenario was something like a 24-hour restaurant or gym that takes reservations with limited availability depending on staffing. The request was that the application be able to throttle reservations based on starting time and concurrent customers during any given hour. In the application a user selects a day and is given a list of available reservation times. When customers schedule a reservation they receive different pricing based on availability: prices increase as the requested slot fills. This encourages customers to spread out their reservations so everyone receives the best service possible.

PostgreSQL provides a lot of helpful ways to address this problem. In this post I want to lay out the specifics of the problem we encountered, explain my first approach and where it ran into trouble, and provide a solution that improves on the initial work.

Setup

The abbreviated table structure looks like this:

CREATE TABLE reservations (
  id SERIAL PRIMARY KEY,
  starts_at timestamp without time zone,
  ends_at timestamp without time zone
);

CREATE TABLE throttles (
  id SERIAL PRIMARY KEY,
  hour timestamp without time zone,
  starting integer,
  concurrent integer
);

CREATE UNIQUE INDEX index_throttles_on_hour ON throttles USING btree (hour);

In the actual application throttles are created on the fly from a template that administrative users can define. If you're following along you can generate some sample data.

INSERT INTO throttles (hour, starting, concurrent) ( 
  SELECT hour,
  CASE WHEN hour::time < '07:00:00' THEN 5
       WHEN hour::time < '17:00:00' THEN 20
       WHEN hour::time < '21:00:00' THEN 40
       ELSE 5
  END AS starting,
  CASE WHEN hour::time < '07:00:00' THEN 10
       WHEN hour::time < '17:00:00' THEN 40
       WHEN hour::time < '21:00:00' THEN 80
       ELSE 10
  END AS concurrent
  FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                       '2020-01-01 00:00:00'::TIMESTAMP WITHOUT time ZONE,
                       '1:00:0'::interval) AS hour
);

The resulting table should look like this:

SELECT * FROM throttles;
 id  |        hour         | starting | concurrent
-----+---------------------+----------+------------
   1 | 2014-11-20 00:00:00 |        5 |         10
   2 | 2014-11-20 01:00:00 |        5 |         10
   3 | 2014-11-20 02:00:00 |        5 |         10
   4 | 2014-11-20 03:00:00 |        5 |         10
   5 | 2014-11-20 04:00:00 |        5 |         10
 ...
44929 rows in set (8.90 sec)

Reservation start times can be scheduled every 15 minutes. Throughout this post queries will be for the current day; in the actual application a start and end time are passed to generate_series to create the slots for the requested day. Using PostgreSQL's generate_series function we can find the number of starting and concurrent reservations allowed for each reservation slot.

SELECT generated_series.slot,
  sum(throttles.starting) AS throttle_starting,
  sum(throttles.concurrent) AS throttle_concurrent
FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                     (current_date + interval '23 hours 59 minutes')::TIMESTAMP WITHOUT time ZONE,
                     '0:15:0'::interval)
     generated_series(slot)
INNER JOIN throttles
  ON (throttles.hour = date_trunc('hour'::text, generated_series.slot))
GROUP BY generated_series.slot
ORDER by slot;
        slot         | throttle_starting | throttle_concurrent
---------------------+-------------------+---------------------
 2014-11-20 00:00:00 |                 5 |                  10
 2014-11-20 00:15:00 |                 5 |                  10
 ...
 2014-11-20 16:45:00 |                20 |                  40
 2014-11-20 17:00:00 |                40 |                  80
 ...
 2014-11-20 23:30:00 |                 5 |                  10
 2014-11-20 23:45:00 |                 5 |                  10
96 rows in set (0.02 sec)

On its own this isn't all that helpful, but it shows how to use the generate_series function to create the slot column for the reservable times on any given day.

We can populate a few reservations for today to get started actually determining the slot availability.

INSERT INTO reservations (starts_at, ends_at)
  VALUES (current_date::TIMESTAMP WITHOUT time ZONE,
          (current_date + interval '1 hour')::TIMESTAMP WITHOUT time ZONE);
INSERT INTO reservations (starts_at, ends_at)
  VALUES ((current_date + interval '30 minutes')::TIMESTAMP WITHOUT time ZONE,
          (current_date + interval '1 hour')::TIMESTAMP WITHOUT time ZONE);
INSERT INTO reservations (starts_at, ends_at)
  VALUES ((current_date + interval '1 hour')::TIMESTAMP WITHOUT time ZONE,
          (current_date + interval '2 hours')::TIMESTAMP WITHOUT time ZONE);
INSERT INTO reservations (starts_at, ends_at)
  VALUES (current_date::TIMESTAMP WITHOUT time ZONE,
          (current_date + interval '2 hours')::TIMESTAMP WITHOUT time ZONE);

Visually the reservations look like this:

0:00 0:15 0:30 0:45 1:00 1:15 1:30 1:45 2:00
|-------------------|                          reservation 1 (0:00 - 1:00)
          |---------|                          reservation 2 (0:30 - 1:00)
                    |-------------------|      reservation 3 (1:00 - 2:00)
|---------------------------------------|      reservation 4 (0:00 - 2:00)

Now we can query the hourly_starting and hourly_concurrent reservation counts by using subselects against the new reservation data. To keep things brief I'll use the same FROM statement to generate the series of slots, but swap the throttle columns out for hourly_starting and hourly_concurrent. Using the OVERLAPS operator we can determine how many reservations run concurrently during the hour containing a given slot.

SELECT generated_series.slot,
  sum((SELECT count(*)
       FROM reservations
       WHERE reservations.starts_at
         BETWEEN date_trunc('hour'::text, generated_series.slot)
           AND (date_trunc('hour'::text, generated_series.slot) + '00:59:59'::interval)
     )) AS hourly_starting,
  sum((SELECT COUNT(*)
       FROM reservations
       WHERE (reservations.starts_at,
              reservations.ends_at)
             OVERLAPS
             (date_trunc('hour'::text, generated_series.slot),
              date_trunc('hour'::text, generated_series.slot) + '01:00:00'::interval)
     )) AS hourly_concurrent
FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                    (current_date + interval '23 hours 59 minutes')::TIMESTAMP WITHOUT time ZONE,
                    '0:15:0'::interval)
     generated_series(slot)
INNER JOIN throttles
  ON (throttles.hour = date_trunc('hour'::text, generated_series.slot))
GROUP BY generated_series.slot
ORDER by slot;
        slot         | hourly_starting | hourly_concurrent
---------------------+-----------------+-------------------
 2014-11-20 00:00:00 |               3 |                 3
 2014-11-20 00:15:00 |               3 |                 3
 2014-11-20 00:30:00 |               3 |                 3
 2014-11-20 00:45:00 |               3 |                 3
 2014-11-20 01:00:00 |               1 |                 2
 2014-11-20 01:15:00 |               1 |                 2
 2014-11-20 01:30:00 |               1 |                 2
 2014-11-20 01:45:00 |               1 |                 2
 2014-11-20 02:00:00 |               0 |                 0
 ...
96 rows in set (0.03 sec)

This lines up with what we would expect given the inserted reservations records.

Replacing the subselects we can see the slot_starting and slot_concurrent counts:

SELECT generated_series.slot,
  sum((SELECT COUNT(*)
       FROM reservations
       WHERE (reservations.starts_at = generated_series.slot))) AS slot_starting,
  sum((SELECT COUNT(*)
       FROM reservations
       WHERE (reservations.starts_at,
              reservations.ends_at)
             OVERLAPS
             (generated_series.slot,
              (generated_series.slot + '0:15:0'::interval))
     )) AS slot_concurrent
FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                    (current_date + interval '23 hours 59 minutes')::TIMESTAMP WITHOUT time ZONE,
                    '0:15:0'::interval)
     generated_series(slot)
INNER JOIN throttles
  ON (throttles.hour = date_trunc('hour'::text, generated_series.slot))
GROUP BY generated_series.slot
ORDER by slot;
        slot         | slot_starting | slot_concurrent
---------------------+---------------+-----------------
 2014-11-20 00:00:00 |             2 |               2
 2014-11-20 00:15:00 |             0 |               2
 2014-11-20 00:30:00 |             1 |               3
 2014-11-20 00:45:00 |             0 |               3
 2014-11-20 01:00:00 |             1 |               2
 2014-11-20 01:15:00 |             0 |               2
 2014-11-20 01:30:00 |             0 |               2
 2014-11-20 01:45:00 |             0 |               2
 2014-11-20 02:00:00 |             0 |               0
 ...
96 rows in set (0.03 sec)

Combining the queries for throttle_starting, throttle_concurrent, hourly_starting, hourly_concurrent, slot_starting and slot_concurrent gives us the data we need to provide users with available times.

At scale

We happily used this in development and continued on with the application. But, when we went to production things slowed down drastically. To illustrate the problem let's populate the reservations table with one hour reservations starting every half hour.

INSERT INTO reservations (starts_at, ends_at) (
  SELECT starts_at, (starts_at + interval '1 hour')::TIMESTAMP WITHOUT time ZONE AS ends_at
  FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                       '2020-01-01 00:00:00'::TIMESTAMP WITHOUT time ZONE,
                       '00:30:0'::interval) AS starts_at);

The combined query to get all of the starting and concurrent data takes around three seconds on my machine. Separating out each of the columns, we find two that are very slow: the hourly_starting query takes around 2.62 seconds and the hourly_concurrent section takes 2.92 seconds. That's an eternity in database time. Under production load, with multiple requests running the query, the database process spikes and things grind to a halt.

Isolating hourly_starting

Let's look at the hourly_starting portion of the query first. Running an EXPLAIN shows two areas of interest:

EXPLAIN SELECT generated_series.slot,
  sum((SELECT count(*)
       FROM reservations
       WHERE reservations.starts_at
         BETWEEN date_trunc('hour'::text, generated_series.slot)
           AND (date_trunc('hour'::text, generated_series.slot) + '00:59:59'::interval)
     )) AS hourly_starting
FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                    (current_date + interval '23 hours 59 minutes')::TIMESTAMP WITHOUT time ZONE,
                    '0:15:0'::interval)
     generated_series(slot)
INNER JOIN throttles
  ON (throttles.hour = date_trunc('hour'::text, generated_series.slot))
GROUP BY generated_series.slot
ORDER by slot;
HashAggregate  (cost=2597702.92..2597704.92 rows=200 width=8)
-- snip
SubPlan 1
->  Aggregate  (cost=3435.99..3436.01 rows=1 width=0)
  ->  Seq Scan on reservations  (cost=0.00..3434.87 rows=449 width=0)
    Filter: ((starts_at >= date_trunc('hour'::text, generated_series.slot)) AND (starts_at <= (date_trunc('hour'::text, generated_series.slot) + '00:59:59'::interval)))

Even though the HashAggregate has a high cost, it is a red herring for our purposes. We know in production we'll only be looking for times within a single day, so the generated_series.slot sort will only happen across 96 rows. The real culprit is the subplan: the query compares starts_at against the beginning and end of each slot's hour, but must do a sequential scan over every reservation record to do it. The fix is fairly straightforward: adding an index on the starts_at column resolves the problem.

CREATE INDEX index_reservations_on_starts_at ON reservations USING btree (starts_at);

Isolating hourly_concurrent

Addressing the query behind the hourly_concurrent column of the final result is trickier. Let's look at the EXPLAIN:

EXPLAIN SELECT generated_series.slot,
  sum((SELECT COUNT(*)
       FROM reservations
       WHERE (reservations.starts_at,
              reservations.ends_at)
             OVERLAPS
             (date_trunc('hour'::text, generated_series.slot),
              date_trunc('hour'::text, generated_series.slot) + '01:00:00'::interval)
     )) AS hourly_concurrent
FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                    (current_date + interval '23 hours 59 minutes')::TIMESTAMP WITHOUT time ZONE,
                    '0:15:0'::interval)
     generated_series(slot)
INNER JOIN throttles
  ON (throttles.hour = date_trunc('hour'::text, generated_series.slot))
GROUP BY generated_series.slot;
Sort Key: generated_series.slot
HashAggregate  (cost=2446812.92..2446814.92 rows=200 width=8)
-- snip 
SubPlan 1
->  Aggregate  (cost=2445.11..2445.12 rows=1 width=0)
  ->  Seq Scan on reservations  (cost=0.00..2370.22 rows=29954 width=0)
    Filter: "overlaps"(starts_at, ends_at, date_trunc('hour'::text, generated_series.slot), (date_trunc('hour'::text, generated_series.slot) + '01:00:00'::interval))

The sequential scan in the OVERLAPS query shows that it, again, has to scan every reservation record without any indexing. Adding indexes on the starts_at and ends_at columns does not help here because OVERLAPS cannot take advantage of them.

In a rush to find a solution I added a WHERE clause to the reservations subselect for hourly_concurrent that would be applied before the OVERLAPS comparison, reducing the number of rows OVERLAPS had to examine:

SELECT generated_series.slot,
  sum((SELECT COUNT(*)
       FROM reservations
       WHERE (reservations.starts_at >= current_date::TIMESTAMP WITHOUT time ZONE)
             AND (reservations.ends_at < (current_date + interval '1 day')::TIMESTAMP WITHOUT time ZONE)
             AND (reservations.starts_at,
                  reservations.ends_at)
                 OVERLAPS
                 (date_trunc('hour'::text, generated_series.slot),
                  date_trunc('hour'::text, generated_series.slot) + '01:00:00'::interval)
     )) AS hourly_concurrent
FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                    (current_date + interval '23 hours 59 minutes')::TIMESTAMP WITHOUT time ZONE,
                    '0:15:0'::interval)
     generated_series(slot)
INNER JOIN throttles
  ON (throttles.hour = date_trunc('hour'::text, generated_series.slot))
GROUP BY generated_series.slot;

This got things running again, but had a logic error. If a reservation started on the previous day and carried over into the day we're requesting, it was never considered by the OVERLAPS comparison. More importantly, it didn't feel right, but at the time (with production running poorly) I could not find a way to index the starts_at and ends_at columns in a way that OVERLAPS could take advantage of. After rolling out this quick fix I started looking for a proper solution.
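If you want to see the gap, insert a reservation that spills over midnight (this row is only for illustration and isn't part of the sample data above):

```sql
-- A reservation from 23:30 yesterday to 00:30 today overlaps today's
-- first two slots, but starts_at >= current_date is false for it, so
-- the quick-fix WHERE clause filters it out before OVERLAPS ever runs.
INSERT INTO reservations (starts_at, ends_at)
  VALUES ((current_date - interval '30 minutes')::TIMESTAMP WITHOUT time ZONE,
          (current_date + interval '30 minutes')::TIMESTAMP WITHOUT time ZONE);
```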

Indexing using tsrange

The solution is a feature added in PostgreSQL 9.2: the tsrange type. tsrange columns can be indexed with GiST or SP-GiST. Let's add a tsrange column called period on the reservations table to store the starts_at and ends_at time range, add a GiST index, and load up the existing data.

-- add the interval column
ALTER TABLE reservations
ADD COLUMN period tsrange;

-- add the tsrange index
CREATE INDEX reservation_period_gist_idx on reservations USING GiST (period);

-- add existing starts_at, ends_at data to the period column
UPDATE reservations
SET period = ('[' || starts_at || ',' || ends_at || ')')::tsrange;

You may notice that the range literals start with a [ and close with a ). These define the range's bounds: [ makes the lower bound inclusive, while ) makes the upper bound exclusive, so a reservation ending at exactly 1:00 does not overlap a range starting at 1:00. To compare against the new period column we use the && range overlap operator.
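The bound semantics are easy to check with range literals, no tables required:

```sql
-- '[' includes the lower bound and ')' excludes the upper bound,
-- so two back-to-back hours do not overlap...
SELECT tsrange('2014-11-20 00:00', '2014-11-20 01:00', '[)')
    && tsrange('2014-11-20 01:00', '2014-11-20 02:00', '[)');  -- false

-- ...while any shared interior time does.
SELECT tsrange('2014-11-20 00:00', '2014-11-20 01:00', '[)')
    && tsrange('2014-11-20 00:30', '2014-11-20 01:30', '[)');  -- true
```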

We can now run our concurrent query using the indexed period column with much faster results:

SELECT generated_series.slot,
  sum((SELECT COUNT(*)
       FROM reservations
       WHERE period && tsrange(date_trunc('hour'::text, generated_series.slot),
                               (date_trunc('hour'::text, generated_series.slot) + '01:00:00'::interval))
     )) AS hourly_concurrent
FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                    (current_date + interval '23 hours 59 minutes')::TIMESTAMP WITHOUT time ZONE,
                    '0:15:0'::interval)
     generated_series(slot)
  INNER JOIN throttles ON (throttles.hour = date_trunc('hour'::text, generated_series.slot))
GROUP BY generated_series.slot
ORDER BY slot;
        slot         | hourly_concurrent
---------------------+-------------------
 2014-11-20 00:00:00 |                 2
 ...
 2014-11-20 23:45:00 |                 2
96 rows in set (0.02 sec)
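Running EXPLAIN on a && query confirms why (exact costs and plan shapes will differ from machine to machine): the planner can now reach for the GiST index instead of a sequential scan.

```sql
EXPLAIN SELECT count(*)
FROM reservations
WHERE period && tsrange(current_date::TIMESTAMP WITHOUT time ZONE,
                        (current_date + interval '1 hour')::TIMESTAMP WITHOUT time ZONE);
-- Look for an index scan (e.g. a Bitmap Index Scan) on
-- reservation_period_gist_idx in place of the earlier Seq Scan.
```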

Bringing the query into the application

This project happens to be a Rails app, and managing tsrange fields from the application is awkward. To keep things simple I prefer having the reservation starts_at and ends_at fields automatically update the period column so ActiveRecord never has to be aware of it. Usually in a Rails app I would avoid triggers because ActiveRecord can keep all of that logic in the application. In this case I'm willing to make an exception for two reasons. First, the scheduling query is managed by a service object that already sits outside of ActiveRecord; the period column is only used by the scheduling query, making it a helper for the scheduler service object. Second, I don't want to fiddle with getting ActiveRecord to set a tsrange field.

The following trigger will update the period column every time a row on the reservations table is inserted or updated:

CREATE OR REPLACE FUNCTION update_reservation_period()
RETURNS trigger AS $$
BEGIN
  NEW.period := ('[' || NEW.starts_at || ',' || NEW.ends_at || ')')::tsrange;
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

DROP TRIGGER IF EXISTS set_reservation_period ON reservations;

CREATE TRIGGER set_reservation_period
BEFORE INSERT OR UPDATE ON reservations
FOR EACH ROW
WHEN (NEW.starts_at IS NOT NULL AND NEW.ends_at IS NOT NULL AND NEW.starts_at < NEW.ends_at)
EXECUTE PROCEDURE update_reservation_period();
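A quick way to verify the trigger is doing its job is to insert a reservation and inspect the generated period column; for a valid range it should come back as a half-open tsrange (the timestamps here are arbitrary):

```sql
-- Insert a reservation and check that the trigger populated period.
INSERT INTO reservations (starts_at, ends_at)
  VALUES ('2014-11-21 09:00:00', '2014-11-21 10:00:00');

SELECT starts_at, ends_at, period
FROM reservations
WHERE starts_at = '2014-11-21 09:00:00';
```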

With the query performing well we can add some sugar columns that let the service object report fill percentages, which allows the application to determine which slots to promote. PostgreSQL's WITH clause lets us wrap our existing query and provide the extra columns in the result. Bringing it all together looks like:

WITH all_slots AS (
  SELECT generated_series.slot,
    sum(throttles.starting) AS throttle_starting,
    sum(throttles.concurrent) AS throttle_concurrent,
    sum((SELECT count(*)
         FROM reservations
         WHERE reservations.starts_at
           BETWEEN date_trunc('hour'::text, generated_series.slot)
             AND (date_trunc('hour'::text, generated_series.slot) + '00:59:59'::interval)
       )) AS hourly_starting,
    sum((SELECT COUNT(*)
         FROM reservations
         WHERE period && tsrange(date_trunc('hour'::text, generated_series.slot),
                                 (date_trunc('hour'::text, generated_series.slot) + '01:00:00'::interval))
        )) AS hourly_concurrent,
    sum((SELECT COUNT(*)
         FROM reservations
         WHERE (reservations.starts_at = generated_series.slot))) AS slot_starting,
    sum((SELECT COUNT(*)
         FROM reservations
         WHERE period && tsrange(date_trunc('hour'::text, generated_series.slot),
                                 (date_trunc('hour'::text, generated_series.slot) + '00:15:00'::interval))
       )) AS slot_concurrent
  FROM generate_series(current_date::TIMESTAMP WITHOUT time ZONE,
                        (current_date + interval '23 hours 59 minutes')::TIMESTAMP WITHOUT time ZONE,
                        '0:15:0'::interval)
       generated_series(slot)
    INNER JOIN throttles ON (throttles.hour = date_trunc('hour'::text, generated_series.slot))
  GROUP BY generated_series.slot
)
SELECT all_slots.*,
  (all_slots.slot_starting / all_slots.throttle_starting)::float AS starting_fill,
  (all_slots.slot_concurrent / all_slots.throttle_concurrent)::float AS concurrent_fill,
  (all_slots.slot_starting < all_slots.throttle_starting) AS starting_available,
  (all_slots.slot_concurrent < all_slots.throttle_concurrent) AS concurrent_available
FROM all_slots
ORDER BY slot;

The result of the query has too many columns to show here but returns in 0.05 seconds, well within reason for our needs. In our implementation the Rails application is still responsible for throwing out slot results that would overlap when concurrent_available is false. There are also scenarios where administrators can override the set throttles, so it's convenient to have the data on hand. Depending on the need, it would also be possible to use the previous SQL statement in a subselect and remove starting times where concurrent_available is false.
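As a hedged sketch of that idea (not something we shipped, and with the CTE body elided for brevity), the filtering could look like this:

```sql
-- Sketch: reuse the WITH all_slots query from above, keeping only slots
-- that remain open on both the starting and concurrent dimensions.
WITH all_slots AS (
  -- ... the combined query from the previous section ...
)
SELECT all_slots.slot
FROM all_slots
WHERE all_slots.slot_starting < all_slots.throttle_starting
  AND all_slots.slot_concurrent < all_slots.throttle_concurrent
ORDER BY all_slots.slot;
```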

Trade-offs

Adding the period column to reservations means the GiST index must be maintained on every insert, which incurs some overhead. The insert statement for reservations starting every half hour created 89,857 records. Without indexes it ran on my machine in 0.39 seconds. With the starts_at index and the trigger populating the indexed period column it ran in 0.93 seconds. That's within reason for our current needs.

In summary, while incredibly useful, the PostgreSQL OVERLAPS operator can have a significant impact on performance when run across a large number of rows. Storing the starting and ending times in a tsrange column indexed with GiST provides huge performance gains.

References

Matt Smith and Ray Lucke introduced me to a lot of the neat things PostgreSQL can do with generate_series, times, and WITH queries. They get the credit for building this query and for setting up a great framework that made the final tuning easy.

Erwin Brandstetter has an incredibly helpful answer to the Database Administrators Stack Exchange question "Optimizing queries on a range of timestamps (two columns)" that pointed me in the direction of using tsrange to index the starts_at and ends_at columns.

Summary: PostgreSQL offers some great ways to work with time-based queries. The most performant way to use them is not obvious given the plethora of options available in PostgreSQL.