Haseeb Annadamban

How to Add PostgreSQL Index based on Query Plan as a Rails Developer

· 2691 words · 13 minutes to read

As Rails developers, we often focus on writing clean, efficient Ruby code. However, as an application grows, optimizing database performance becomes equally crucial for keeping it responsive. One of the most powerful tools in our arsenal for database optimization is indexing. Here I will talk about PostgreSQL indexing from a Rails developer’s perspective. The query plan generated by EXPLAIN will help us make informed decisions about index creation, ensuring our database queries run as efficiently as possible.

This blog post is part 2 of my previous post Understanding The Output of EXPLAIN Command As a Rails Developer. Please make sure you understand the concepts explained there so that the topics in this post make sense.

Not All Types of Queries Can Benefit From Indexes

It’s up to PostgreSQL to decide whether it wants to use an index, because using an index adds overhead. In some cases, adding an index will not improve performance: when the overhead makes the sequential scan the faster path, PostgreSQL chooses the seq scan. The exception is an index-only scan, which is always fast.

Data distribution also plays a huge part in this. Suppose you have a users table with a large number of records and 90% of the users have the status “active”. Then PostgreSQL will prefer a sequential scan for User.where(status: "active"). Now suppose 2% of the users have the status “banned”. Then PostgreSQL will use the index for User.where(status: "banned").
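We can watch the planner make this choice from the Rails console. A minimal sketch, assuming a User model with an indexed status column; the plans in the comments are what PostgreSQL will typically pick:

puts User.where(status: "active").explain  # typically: Seq Scan on users
puts User.where(status: "banned").explain  # typically: Index Scan (or Bitmap Index Scan) on users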

Here are the situations where sequential scans are preferred:

1. Queries Returning a Large Portion of the Table

If your query is going to return a significant percentage of the rows in a table, PostgreSQL might choose a sequential scan instead of using an index. The exact threshold depends on various factors, including table size, system configuration, and hardware.

2. Small Tables

For tables with only a few hundred or thousand rows, a sequential scan can be faster than using an index. The overhead of reading the index and then fetching the data might exceed the time it takes to simply read through all the rows. There are exceptions, though: when sorting is involved, when there is a table join, or when a queried column has a uniqueness constraint, PostgreSQL can still prefer an index.

3. High Selectivity

If your WHERE clause matches a large portion of the rows, a sequential scan might be more efficient.

Selectivity

Selectivity is a crucial concept when it comes to index effectiveness. It is the proportion of rows that match a given condition out of the total rows in the table.

Selectivity is low when only a few rows of the table end up in the result. The closer the value is to 0, the more effective index access becomes; the higher it gets, the less effective an index scan will be. Past a certain point, a seq scan becomes more effective than an index scan. But if we can get an index-only scan, an index is always faster regardless of the selectivity.
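You can get a rough feel for a filter’s selectivity from the Rails console. A minimal sketch, assuming a User model; it runs two exact counts, which can be slow on very large tables:

matching = User.where(status: "banned").count
total    = User.count
selectivity = matching.fdiv(total) # closer to 0 => an index helps more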

Short queries and long queries

Indexes are beneficial only for short queries.

A short query is not defined by the length of the SQL text or the size of the data returned. A query can be called short if only a small number of rows are needed to produce the result. In the special case where multiple tables are involved, consider all the rows from all the tables: if only a small portion of them is required, it is a short query. A long query needs a large portion of the table rows to get to the result, for example analytical queries. The distinction matters most for large tables.

Selectivity determines whether a query is short or long, and only short queries will benefit from being indexed. Most of the queries we use in a web application are short queries.

Indexing

Now that you know about selectivity and short queries, let’s start with indexing.

Where to Add Index

Consider this code:

query = Order.joins(:customer, order_items: :product)
             .select('orders.order_id, customers.customer_name, products.product_name')
             .where('orders.order_date >= ?', '2023-08-20')
             .where(order_status: 'shipped')
             .where(customers: { country: 'USA' })
             .where(products: { category: 'Electronics' })

The equivalent SQL query will be:

SELECT o.order_id, c.customer_name, p.product_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= '2023-08-20'
  AND o.order_status = 'shipped'
  AND c.country = 'USA'
  AND p.category = 'Electronics';

This is a complex query, and it looks like it could benefit from indexing. So, let’s start by generating the query plan and decide where to add an index based on it. To get the query plan we can use ActiveRecord’s #explain method:

puts query.explain

It will generate an output like this. It might look like a lot, but don’t worry: we only need to check the Seq Scan nodes right now. Reading query plans will become easier by the end of this post.

                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=5262.57..59286.27 rows=5085 width=59)
   Hash Cond: (order_items.product_id = products.id)
   ->  Nested Loop  (cost=5155.70..59111.67 rows=25781 width=45)
         ->  Hash Join  (cost=5155.27..31476.62 rows=8647 width=45)
               Hash Cond: (orders.customer_id = customers.id)
               ->  Seq Scan on orders  (cost=0.00..26205.50 rows=44131 width=38)
                     Filter: ((order_date >= '2023-08-20 00:00:00'::timestamp without time zone) AND ((order_status)::text = 'shipped'::text))
               ->  Hash  (cost=4638.50..4638.50 rows=41342 width=23)
                     ->  Seq Scan on customers  (cost=0.00..4638.50 rows=41342 width=23)
                           Filter: ((country)::text = 'USA'::text)
         ->  Index Scan using index_order_items_on_order_id on order_items  (cost=0.43..3.16 rows=4 width=16)
               Index Cond: (order_id = orders.id)
   ->  Hash  (cost=97.00..97.00 rows=789 width=30)
         ->  Seq Scan on products  (cost=0.00..97.00 rows=789 width=30)
               Filter: ((category)::text = 'Electronics'::text)

The most restrictive index has the most impact. For this query, let’s look at which filter is the most restrictive, i.e., has the lowest selectivity. Let’s assume there are close to 10,000 products, 100,000 customers, and more than one million orders.

Now we will check selectivity.

In this:

->  Seq Scan on customers  (cost=0.00..4638.50 rows=41342 width=23)
                           Filter: ((country)::text = 'USA'::text)

Right now we only need to look at the rows value (i.e., rows=41342).

Let’s check each seq scan and its selectivity.

  1. Seq scan on the orders table - As we assumed earlier, there are one million orders. For the filter condition order_date >= '2023-08-20 00:00:00' AND order_status = 'shipped' the estimated row count is 44131. So, selectivity is 44131/1000000 = 0.044 = 4.4%. It means the filter removes about 96% of the rows.
  2. Seq scan on the customers table - We have 100,000 customers. The estimated row count is 41342. So, selectivity is 0.41, or 41%.
  3. Seq scan on the products table - We have 10,000 products. The estimated row count is 789. So, selectivity is 0.0789, or 7.89%.

The lower the selectivity, the higher the impact of the index. So, the index on the orders table has the most impact for this particular query, and an index on the customers table might also have some impact. However, an index scan on the customers table’s country field alone might not be very useful. Another thing to note: since we have two filters, order_date >= '2023-08-20 00:00:00' AND order_status = 'shipped', on the orders table, a compound index will be faster than individual indexes. In a few cases, individual indexes can even slow down a query when multiple conditions exist.

In a Rails migration, we can do this:

add_index :orders, [:order_date, :order_status]
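After running the migration, re-running puts query.explain should show the new index replacing the seq scan on orders. The node below is illustrative, assuming the default Rails index name:

puts query.explain
# ->  Bitmap Index Scan on index_orders_on_order_date_and_order_status ...
#       Index Cond: ((order_date >= '2023-08-20 00:00:00') AND (order_status = 'shipped'))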

Now let’s take a look at the types of indexes.

Use B-TREE Indexes for Greater Than, Less Than, and Between Operations

Basic indexes use a B-TREE structure to store data, hence they are called B-TREE indexes. They support equality, greater than, less than, and range operations. Strings have a well-defined sort order too (determined by the column’s collation), so these operations work on string columns as well.

Here is an example index:

CREATE INDEX index_users_on_last_name ON users (last_name);

The Rails equivalent will be,

add_index :users, :last_name

Add the default B-TREE indexes in regular cases involving equality, greater than, less than, and range operations.
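For instance, all of these queries can be served by the index above. A sketch, assuming a User model with a last_name column:

User.where(last_name: "Doe")            # equality
User.where("last_name > ?", "M")        # greater than
User.where(last_name: "A".."D")         # range (BETWEEN)
User.order(:last_name).limit(10)        # sorting can use the index too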

Multi-Column Indexes

An index is called a compound index when it covers multiple columns. With individual indexes, PostgreSQL has to scan multiple indexes and combine the results, and each single condition is less selective than the combined one. With a compound index, the selectivity becomes a function of all the filters together, so the index is more effective.

Let’s add one compound index. In a Rails migration:

add_index :orders, [:order_status, :order_date, :order_id]

It will add a compound index to support filtering involving [order_status, order_date, order_id].

This compound index is applicable for filters involving order_status alone too, as well as [order_status, order_date] and [order_status, order_id]. So you don’t have to add these indexes separately.

But it won’t work for order_date or order_id alone, nor for [order_date, order_id], as the sketch below shows. It’s all about which column is chosen as the first column, so keep this in mind. There are other criteria for choosing which column should go first: if you use the most restrictive condition at the beginning, PostgreSQL can scan the next columns in the index more efficiently and improve performance further. So, rather than picking the most widely applicable column as the first one, maybe just add more indexes as needed. For compound indexes, prefer to choose the column ordering based on selectivity.
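Here is the leftmost-prefix rule in query form, a sketch using the columns from the migration above (the values are hypothetical):

# Can use the compound index (order_status, the first column, is filtered):
Order.where(order_status: "shipped")
Order.where(order_status: "shipped", order_date: Date.new(2023, 9, 9))

# Cannot use it (the first column is missing from the filter):
Order.where(order_date: Date.new(2023, 9, 9))
Order.where(order_date: Date.new(2023, 9, 9), order_id: "ABC-123")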

Covering indexes

Remember, an index-only scan is always faster than a seq scan regardless of the selectivity. A covering index includes columns that appear in the select part of the query but not in the filter conditions, so the query can be answered from the index alone.

Consider this query:

Order.where(order_date: "2023-09-09", order_status: "shipped").select(:order_id)

To get an index-only scan we could add this Rails migration:

add_index :orders, [:order_date, :order_status, :order_id]

But since the query never filters by order_id, indexing it as a key column is a waste of space.

Instead, order_id can be ‘included’ as a non-key column using a covering index:

add_index :orders, [:order_date, :order_status], include: :order_id

This way, we get the benefits of index-only scans without having to build the full index structure for order_id. That means better storage utilization. This will produce the following query plan:

Index Only Scan using index_orders_on_order_date_and_order_status on orders  (cost=0.42..222.90 rows=324 width=22)
   Index Cond: ((order_date = '2023-09-09 00:00:00'::timestamp without time zone) AND (order_status = 'shipped'::text))

Indexes When SQL Functions Are Involved

Consider this SQL:

SELECT * FROM users WHERE LOWER(name) = 'john doe';

A plain B-TREE index on the column (like add_index :users, :name) will not work here, since every name needs to be passed through the SQL function first. So, we can create a migration that indexes the expression itself:

class AddLowerIndexToUserName < ActiveRecord::Migration[7.2]
  def change
    add_index :users, "LOWER(name)"
  end
end

Now PostgreSQL stores the precomputed LOWER(name) values in the index, so it does not have to compute the function for every row and will be able to use this index.
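From Rails, the query that can take advantage of this expression index looks like this (a sketch, assuming a User model):

User.where("LOWER(name) = ?", "john doe")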

Unique Index

A unique index can be added when we need a uniqueness validation in our model. One usual example is the email uniqueness index and validation generated by the devise gem. A unique index prevents duplicate values from being added to the column at the database level and also provides fast lookups. Unique indexes have perfect selectivity: a match is at most one row. This is how you add a unique index in Rails:

add_index :users, :email, unique: true

Also, adding a unique index makes it easier for the uniqueness validation to do its job: the validation’s lookup query can use the index, and the constraint catches duplicates that slip past the validation in race conditions.
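A typical pairing looks like this. A sketch: the validation gives friendly error messages, while the index enforces uniqueness at the database level:

class User < ApplicationRecord
  # The validation produces a nice error message; the unique index backs
  # it up against concurrent inserts the validation alone cannot catch.
  validates :email, uniqueness: true
end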

Partial Index

Partial indexing is a feature that allows you to create an index on a subset of rows in a table, rather than indexing all rows. This can be very useful for reducing index size in certain scenarios.

For example, when you soft-delete records instead of deleting them, it is a common practice to use a deleted_at timestamp: if the value is NULL the record is not deleted; if it has a value, the record is deleted. In this case, we only need to consider WHERE deleted_at IS NULL for many of the indexes.

Rails supports PostgreSQL partial indexes in migrations. We can make use of that. Add the following in a migration:

add_index :products, :name, where: "deleted_at IS NULL"
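Note that only queries whose WHERE clause implies the index condition can use the partial index. A sketch, assuming a Product model:

# Can use the partial index:
Product.where(deleted_at: nil).where(name: "Keyboard")

# Cannot: this query may need soft-deleted rows too.
Product.where(name: "Keyboard")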

Partial indexes can also be very query specific. It is perfectly fine to create a partial index for one specific query.

Imagine we have a products table and we observe that we need an index for this particular base query, onto which we will add more filters:

@products = Product.where(category: 'Electronics', in_stock: true)
                   .where('rating > ?', 4.0)

We can add a migration for this specific index:

 add_index :products, [:category, :rating],
           name: 'index_electronics_high_rated_in_stock',
           where: "category = 'Electronics' AND rating > 4.0 AND in_stock = true"

This example is a bit contrived, but there are many scenarios where this will help.

Index for LIKE Query

As I mentioned earlier, B-TREE indexes do not support the LIKE operation. So, here are the alternatives: for prefix searches we can use a B-TREE index with a specific opclass, and for other cases we can add a GIN index.

A prefix search is equivalent to a Ruby filter with start_with?.

SELECT * FROM customers WHERE customer_name LIKE 'John%'
Customer.where("customer_name LIKE ?", "John%")

A sample query plan will be:

->  Seq Scan on customers  (cost=0.00..3552.47 rows=12 width=44)
      Filter: ((customer_name)::text ~~ 'John%'::text)

In this case, we can add the operator class varchar_pattern_ops. There is also text_pattern_ops for text columns; both work the same way.

Add this in a Rails migration:

add_index :customers, :customer_name, opclass: :varchar_pattern_ops

Now the query plan will become:

Index Scan using index_customers_on_customer_name on customers  (cost=0.42..8.44 rows=21 width=44)
   Index Cond: (((customer_name)::text ~>=~ 'John'::text) AND ((customer_name)::text ~<~ 'Joho'::text))
   Filter: ((customer_name)::text ~~ 'John%'::text)

See the clever trick used by PostgreSQL 😃: it rewrites the prefix match as a range scan between 'John' and 'Joho'. Genius, right? Let’s carry on.

Even though this index is a B-TREE index, it will not be useful for ordering operations like <, <=, >, or >= under the default collation (ordinary equality can still use it). For range operations, add a separate index with the default operator class.

But what if my search is case insensitive, like this:

Customer.where("lower(customer_name) LIKE ?", "john%")

Then create the index like this:

add_index :customers, "LOWER(customer_name) varchar_pattern_ops"

For non-prefix LIKE searches and full-text search, the above index will not work.

Suppose we want to search order_id for a string appearing anywhere in the value:

Order.where("order_id LIKE ?", "%EEC5%")

The query plan without an index is:

 Gather  (cost=1000.00..17831.75 rows=96 width=71)
   Workers Planned: 2
   ->  Parallel Seq Scan on orders  (cost=0.00..16822.15 rows=40 width=71)
         Filter: ((order_id)::text ~~ '%EEC5%'::text)

Here, Parallel and Workers Planned mean PostgreSQL will use multiple worker processes to speed up the scan. That is a good thing, but it can make the output of EXPLAIN ANALYZE a little unreliable. That is out of scope here; just note that it is still a seq scan.

The pg_trgm extension in PostgreSQL implements trigram indexes. Trigram means it will split each word into 3-character chunks using a sliding window. For example, “haseeb annadamban” will split into:

{"  a","  h"," an"," ha",ada,amb,"an ",ann,ase,ban,dam,"eb ",eeb,has,mba,nad,nna,see}

To enable this extension, add this in a Rails migration:

def change
  enable_extension "pg_trgm"
end
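Once the extension is enabled, you can inspect the trigrams PostgreSQL generates using pg_trgm’s show_trgm function. A quick console check:

ActiveRecord::Base.connection.select_value("SELECT show_trgm('haseeb annadamban')")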

We will use a GIN index with pg_trgm’s trigram operator class to speed up our search. This can be used for speeding up full-text searches too.

class AddIndexOrderIdForSearch < ActiveRecord::Migration[7.2]
  disable_ddl_transaction!

  def change
    add_index :orders, :order_id, using: :gin, opclass: :gin_trgm_ops, algorithm: :concurrently
  end
end

The algorithm is :concurrently because index creation takes time on a large amount of data, and concurrent creation avoids blocking writes. disable_ddl_transaction! is needed because, as per PostgreSQL, “CREATE INDEX CONCURRENTLY cannot run inside a transaction block”.

Now you can see this index being used for the above query:

 Bitmap Heap Scan on orders  (cost=34.64..393.90 rows=96 width=71)
   Recheck Cond: ((order_id)::text ~~ '%EEC5%'::text)
   ->  Bitmap Index Scan on index_orders_on_order_id  (cost=0.00..34.61 rows=96 width=0)
         Index Cond: ((order_id)::text ~~ '%EEC5%'::text)

But note that if we only need to search the orders of a single customer, PostgreSQL can use the index on customer_id in the orders table, and the above index might not be necessary.

Index Usage Stats

Not all indexes are useful. Sometimes you will have to check pg_stat_all_indexes to see which indexes are unused (but don’t touch primary key indexes, no matter what).
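A quick way to spot candidates is to look for indexes whose scan count is zero. A sketch run from the Rails console; idx_scan counts how many times the index has been used since the statistics were last reset:

unused = ActiveRecord::Base.connection.select_all(<<~SQL)
  SELECT relname AS table_name, indexrelname AS index_name, idx_scan
  FROM pg_stat_all_indexes
  WHERE schemaname = 'public' AND idx_scan = 0
  ORDER BY relname
SQL
unused.rows.each { |table, index, _scans| puts "#{table}: #{index}" }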

Wrapping up

There is a lot to learn, but all this information is critical for adding indexes confidently, based on informed decisions. I hope this is helpful.

Your Feedback Matters!

I'd love to hear your thoughts on this. Connect with me on LinkedIn or Twitter.