While working on a project in which our team of Java developers also served as DBAs, we ran into a serious database performance crisis. Our application worked fine in the development environment, but a few features were quite slow in the production-like environment. It turned out that, because of the volume of data in the production database, some queries were slow. With the release date approaching, we tried rewriting those queries in several different ways without any significant performance gain. This is when indexes came to the rescue. Database indexes are an efficient tool for improving database performance. In this blog, I discuss how we can use them to improve the performance of our application's database. Though the concepts discussed apply to most relational databases, I have used MySQL for the examples. In many places I have used the MySQL world database, available here.
Database indexes are data structures that facilitate fast lookup of rows in a table by maintaining a copy of the indexed columns in a form that can be searched efficiently. Index entries also contain links to the corresponding rows in the table for actual row retrieval. A simple representation of an index is shown below, though in practice a more complex data structure (e.g., B-Tree, Hash, R-Tree) implements the index.
In the absence of indexes, a table has to be searched from beginning to end to find the matching rows, which has a time complexity of O(n). If the table contains a large number of rows, this leads to slow query execution. Indexes on the search columns make the lookup efficient; e.g., a B-Tree based index has a lookup complexity of O(log(n)).
You can compare database indexes with real-world book indexes or telephone directories. Imagine how difficult it would be to find a topic in a book without a proper index. In the same way, all but very small databases benefit from indexes. Just as a normal telephone directory only helps you search by name and not by phone number, an index only speeds up queries on the indexed column and does not help for other columns. If you need to search on multiple columns, you need either multiple indexes or composite indexes.
Indexes definitely improve searches on the indexed column, but they come at a cost. Every insert and delete on the table, and every update of an indexed column, takes a little longer because the corresponding index entries must be inserted, removed or updated too. An index also requires storage space of its own, increasing the overall storage requirements of the database. So, one should use indexes wisely.
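The difference between the two search strategies can be sketched in a few lines of Python. This is only an illustration, not how MySQL is implemented: the `rows` table, the `name_index` list and both function names are made up for the example, and a sorted list searched with binary search stands in for a real B-Tree.

```python
from bisect import bisect_left

# Hypothetical table: (id, name) rows stored in insertion order.
rows = [(1, "Meera"), (2, "Arjun"), (3, "Zoya"), (4, "Kabir"), (5, "Divya")]


def full_scan(name):
    """Without an index: compare every row, O(n)."""
    return [row for row in rows if row[1] == name]


# The "index": sorted (name, row position) pairs, a stand-in for a B-Tree.
# Each entry keeps a link (the position) back to the row in the table.
name_index = sorted((row[1], pos) for pos, row in enumerate(rows))


def index_lookup(name):
    """With an index: binary search for the first match, O(log n), then follow the links."""
    i = bisect_left(name_index, (name, -1))
    matches = []
    while i < len(name_index) and name_index[i][0] == name:
        matches.append(rows[name_index[i][1]])
        i += 1
    return matches
```

Both functions return the same rows; the indexed version simply skips most of the comparisons, which is what matters once the table has millions of rows instead of five.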
Below are a few examples of various possible ways to create indexes in MySQL:
1) Specify the index in CREATE TABLE itself.
CREATE TABLE customer (
    id INT PRIMARY KEY,
    name VARCHAR(20) NOT NULL,
    address VARCHAR(50),
    INDEX name_index (name(5) DESC)
);
2) Use ALTER TABLE ... ADD INDEX, when the table already exists.
ALTER TABLE customer ADD INDEX name_index (name(5) DESC);
3) Use CREATE INDEX ... ON, when the table already exists.
CREATE INDEX name_index ON customer (name(5) DESC);
Various data structures can be used internally to maintain indexes, each serving a different purpose. B-Tree, Hash and R-Tree are a few popular types, and full-text search uses its own kind of index. Which types you can choose from depends on the storage engine. It is important to understand how they work in order to use them properly.
A B-Tree index maintains the indexed values in a B-Tree, i.e., in sorted order, so sorting on the indexed column does not involve extra expense. It is efficient for exact-match as well as range-based comparisons, e.g., =, <, <=, >, >=, <>, BETWEEN. It can also be used for LIKE expressions if the pattern does not start with a wildcard character.
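Why ordering makes range and prefix searches cheap, and a leading wildcard expensive, can be sketched as follows. The sorted `names` list is a hypothetical stand-in for the ordered keys of a B-Tree index, and the three function names are invented for this example.

```python
from bisect import bisect_left, bisect_right

# Sorted keys stand in for the ordered entries of a B-Tree index on a name column.
names = sorted(["Agra", "Ambala", "Bhiwani", "Bhopal", "Delhi", "Noida", "Rohtak"])


def range_scan(low, high):
    """BETWEEN low AND high: two binary searches, then one contiguous slice."""
    return names[bisect_left(names, low):bisect_right(names, high)]


def prefix_scan(prefix):
    """LIKE 'prefix%': all matches sit in one contiguous run of the sorted keys."""
    start = bisect_left(names, prefix)
    end = start
    while end < len(names) and names[end].startswith(prefix):
        end += 1
    return names[start:end]


def suffix_scan(suffix):
    """LIKE '%suffix': matches are scattered, so every key has to be checked."""
    return [n for n in names if n.endswith(suffix)]
```

`range_scan` and `prefix_scan` jump straight to the first candidate and stop at the first non-match, while `suffix_scan` has no choice but to visit every key.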
A hash index applies a hash function to each value in the indexed column, and the resulting hash is used for the lookup instead of the actual value. The benefit is that a lookup by hash value is much faster than a comparison-based search on the actual values. The downside is that hash indexes are efficient only for exact-match comparisons and do not work for range lookups (e.g., <, <=, >, >=, BETWEEN). A hash index also cannot be used to avoid a sort operation, as it does not maintain any specific ordering.
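A Python dictionary is itself a hash table, so it gives a rough feel for this trade-off. The `rows` data and the function names below are hypothetical, and the sketch says nothing about how any particular storage engine lays out its hash indexes.

```python
from collections import defaultdict

# Hypothetical (id, country code) rows.
rows = [(1, "IND"), (2, "USA"), (3, "IND"), (4, "NLD")]

# Hash index: key -> list of row ids. The dict hashes each key internally.
hash_index = defaultdict(list)
for row_id, code in rows:
    hash_index[code].append(row_id)


def equals(code):
    """WHERE code = ?: a single hash lookup, independent of table size."""
    return hash_index.get(code, [])


def greater_than(code):
    """WHERE code > ?: hashing discards ordering, so every key is examined."""
    return sorted(rid for key, ids in hash_index.items() if key > code for rid in ids)
```

The exact-match lookup touches one bucket, whereas the range condition degenerates into a scan over all keys, which is why a hash index is not chosen for such queries.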
An R-Tree index is a tree data structure used to handle spatial data. It is generally used for geographic information, e.g., finding a fuel pump within 5 km of a specific location.
A full-text index works well for regular text search. Full-text indexes generally use inverted indexes, which store a list of words and the documents in which each word appears. A LIKE condition does not use a full-text index; we need MATCH ... AGAINST in MySQL (CONTAINS in MS SQL) to use one. So a LIKE with a pattern that starts with a wildcard, e.g., '%xyz', cannot be served efficiently by any of these index types and can be extremely slow.
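The inverted-index idea can be sketched in a few lines. The `docs` collection and the `match_all` helper are invented for illustration; real full-text engines additionally handle stop words, stemming and relevance ranking, none of which is modelled here.

```python
from collections import defaultdict

# Hypothetical documents: id -> text.
docs = {
    1: "fast database index",
    2: "slow query log",
    3: "database query tuning",
}

# Inverted index: word -> set of ids of the documents containing that word.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted[word].add(doc_id)


def match_all(*words):
    """Return ids of documents containing every given word, without scanning any text."""
    sets = [inverted.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*sets)) if sets else []
```

A search becomes a handful of set lookups and intersections, instead of a substring test against every document as LIKE '%word%' would require.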
A clustered index determines the order in which records are physically stored in a table, so there can be only one clustered index per table. MySQL InnoDB uses the primary key as the clustered index. It provides very fast lookup, and retrieving the data does not require an additional lookup because the index and the data reside in the same place. Clustered indexes can be costly for columns that change frequently, as each change involves relocating the row to a new physical location.
All other indexes are called non-clustered or secondary indexes. A table can have multiple secondary indexes. In MySQL InnoDB, every secondary index entry also contains the clustered index key (the primary key), which is used to fetch the row; in that sense each secondary index also covers the primary key columns (more on covering indexes in a later section). So, we should avoid a primary key that is much larger than required, as that would unnecessarily waste space in every secondary index too. Secondary indexes need a separate lookup to fetch the row, as the row data is not stored in their leaf nodes.
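The two-step lookup through a secondary index can be sketched as below. This is a simplified model under stated assumptions: plain dictionaries stand in for the tree structures, and the table contents and names (`clustered`, `secondary_district`, `find_by_district`) are hypothetical.

```python
# Clustered index: primary key -> full row (the row is stored with the key).
clustered = {
    1: {"id": 1, "name": "Rohtak", "district": "Haryana"},
    2: {"id": 2, "name": "Noida", "district": "Uttar Pradesh"},
    3: {"id": 3, "name": "Bhiwani", "district": "Haryana"},
}

# Secondary index on district: indexed value -> primary keys only, no row data.
secondary_district = {}
for pk, row in clustered.items():
    secondary_district.setdefault(row["district"], []).append(pk)


def find_by_district(district):
    """Step 1: secondary index gives primary keys. Step 2: clustered index gives rows."""
    pks = secondary_district.get(district, [])
    return [clustered[pk] for pk in pks]
```

Because every secondary entry stores a copy of the primary key, a wide primary key is paid for once per secondary index, which is the reason for keeping it small.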
Composite indexes are indexes on multiple columns. They can be helpful in certain circumstances, e.g., multiple conditions joined with AND in the WHERE clause, or a WHERE and ORDER BY combination on different fields.
Example: An index on (CountryCode, District, Name) can be used in any of the scenarios below:
SELECT * FROM city WHERE CountryCode = 'IND' AND District = 'Haryana' AND Name = 'Rohtak';
SELECT * FROM city WHERE CountryCode = 'IND' AND District = 'Haryana';
SELECT * FROM city WHERE CountryCode = 'IND';
We can drop the trailing AND conditions as long as the remaining ones follow the column order of the index; however, this index would not be of any help in the queries below:
WHERE District = 'Haryana' AND Name = 'Bhiwani' -- no Country specified
WHERE District = 'Haryana' -- no Country specified
WHERE District = 'Haryana' OR Name = 'Noida' -- OR used instead of AND
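The rule in these examples can be written down as a small helper. It is a deliberate simplification that only models AND-ed equality conditions; the function name `usable_prefix` is invented here, and a real optimizer considers many more factors.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that AND-ed equality conditions can use.

    Walks the index columns in order and stops at the first one that has no
    condition, because columns after a gap can no longer narrow the search.
    """
    used = []
    for col in index_columns:
        if col not in equality_columns:
            break
        used.append(col)
    return used


# The composite index from the example above.
city_index = ("CountryCode", "District", "Name")
```

For instance, `usable_prefix(city_index, {"CountryCode", "District"})` yields both columns, while `usable_prefix(city_index, {"District", "Name"})` yields an empty list because the leading CountryCode column is missing.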
Example: Similarly, an index on (District, Name) can be useful for this WHERE-ORDER BY combination:
WHERE District = 'Haryana' ORDER BY Name
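Why a (District, Name) index removes the need to sort can be seen with sorted tuples. The `cities` data and the `cities_in` function are hypothetical, and a sorted list again stands in for the B-Tree.

```python
from bisect import bisect_left

# Hypothetical (District, Name) pairs in arbitrary table order.
cities = [
    ("Haryana", "Rohtak"),
    ("Punjab", "Amritsar"),
    ("Haryana", "Bhiwani"),
    ("Haryana", "Ambala"),
]

# Composite index: entries sorted by District first, then by Name.
district_name_index = sorted(cities)


def cities_in(district):
    """WHERE District = ? ORDER BY Name, answered without any sort step."""
    start = bisect_left(district_name_index, (district, ""))
    result = []
    for d, name in district_name_index[start:]:
        if d != district:
            break  # left the run of entries for this district
        result.append(name)
    return result
```

Within one district the index entries are already ordered by name, so reading them in index order produces the ORDER BY result directly.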
Covering indexes reduce disk reads by also including the other columns used in the query (even those not used in WHERE/GROUP BY/ORDER BY). The whole query can then be served from the index itself, without any extra read from the table for the additional fields. This can improve performance greatly; however, too many covering indexes add overhead in terms of disk space and insert/update/delete operations. So, covering indexes are best reserved for very slow or frequently used queries. In MySQL InnoDB, every secondary index also includes the primary key in its leaf nodes, so it automatically covers the primary key columns as well.
Example: Observe how the query below benefits significantly from a covering index.
SELECT CountryCode, SUM(Population) FROM City GROUP BY CountryCode;
ALTER TABLE City ADD INDEX indexCntry (CountryCode); -- normal index on CountryCode, reduces query execution time
ALTER TABLE City ADD INDEX indexCntryPop (CountryCode, Population); -- covering index, reduces query execution time significantly
If multiple indexes are available for a query, the MySQL query optimizer chooses the one it estimates will perform best. E.g., for the query above, if both indexCntry and indexCntryPop are available, MySQL uses indexCntryPop; once we drop indexCntryPop, it uses indexCntry.
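The saving from a covering index can be illustrated by counting table reads. The sketch below uses a made-up three-row table with placeholder values and hypothetical names (`index_cntry`, `index_cntry_pop`); it only models which structure has to be touched, not actual I/O.

```python
from collections import defaultdict

# Hypothetical table: id -> (Name, CountryCode, Population), placeholder values.
table = {
    1: ("A", "IND", 100),
    2: ("B", "IND", 200),
    3: ("C", "NLD", 50),
}

# Normal index on CountryCode: (CountryCode, primary key).
index_cntry = sorted((row[1], pk) for pk, row in table.items())

# Covering index on (CountryCode, Population): (CountryCode, Population, primary key).
index_cntry_pop = sorted((row[1], row[2], pk) for pk, row in table.items())


def sum_with_plain_index():
    """Population is not in the index, so each entry costs one extra table read."""
    totals, table_reads = defaultdict(int), 0
    for code, pk in index_cntry:
        totals[code] += table[pk][2]
        table_reads += 1
    return dict(totals), table_reads


def sum_with_covering_index():
    """Everything the query needs is in the index, so the table is never read."""
    totals = defaultdict(int)
    for code, population, _pk in index_cntry_pop:
        totals[code] += population
    return dict(totals), 0
```

Both functions compute the same totals; the second does so with zero table reads, which corresponds to the "Using index" note in EXPLAIN output.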
We can use EXPLAIN to understand what role indexes play in the execution of a query. It displays the indexes available for the query and the one selected to execute it. For example,
EXPLAIN SELECT Name, Population FROM City ORDER BY Name
Below are the results before and after creating the following index:
ALTER TABLE City ADD INDEX indexNamePop (Name, Population);
| Result Column | Content (Without Index) | Content (With Index) |
|---|---|---|
| extra | Using filesort | Using index |
Without the index, a filesort is performed to do the ordering. Once the index is created, the filesort is no longer required (the result is served directly from the index indexNamePop), so the query executes faster.
Indexes are particularly useful in databases where read (SELECT) operations are performed more frequently than writes.
Which columns to index is not evident from the database schema alone; it depends on how your application uses the database. Queries that are slow, or that run frequently on a large data set, are good candidates for indexing. You can identify them using your application's log or the MySQL slow query log.
The columns to be indexed should have high cardinality. The cardinality of a column is the number of unique values in that column. A primary key or any other UNIQUE key has a cardinality equal to the number of rows in the table. Columns like Gender, Color or Status have low cardinality. Creating indexes on low-cardinality columns can hurt query performance rather than improve it. An index is useful when you have to select a small set of rows from a large number of values. With low cardinality, e.g., gender = 'Male', even with an index you end up fetching around half of the records (assuming a 1:1 male-to-female ratio), which is no better than a linear search. Moreover, the index incurs storage and update overhead without doing much to improve performance.
The size of the column to be indexed is also important, as a smaller size means less space to store the index and less time to search it. So, if VARCHAR(50) is sufficient for the Name field, do not use VARCHAR(80). Apart from this, as already discussed, MySQL InnoDB secondary indexes also store the primary key along with the index key, so the primary key should be kept small wherever possible.
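A quick way to compare candidate columns is to compute cardinality relative to the row count, often called selectivity (a term not used above, introduced here only for the sketch). The column values below are made up.

```python
def cardinality(values):
    """Number of distinct values in a column."""
    return len(set(values))


def selectivity(values):
    """Cardinality divided by row count: 1.0 means unique, near 0 means few distinct values.

    Assumes a non-empty column.
    """
    return cardinality(values) / len(values)


# Hypothetical column samples from a six-row table.
ids = [1, 2, 3, 4, 5, 6]
genders = ["Male", "Female", "Male", "Female", "Male", "Female"]
```

Here `selectivity(ids)` is 1.0 while `selectivity(genders)` is about 0.33 and keeps shrinking as the table grows, which is why the second column is a poor candidate for an index on its own.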
Look at the WHERE, JOIN, ORDER BY and GROUP BY clauses of the query.
The columns in the WHERE clause are candidates for an index. For example, salary can be indexed for the WHERE clause below.
WHERE salary > 10000
If there are multiple columns in the WHERE clause connected with AND, we can use a composite index too.
WHERE country = 'India' AND state = 'Haryana' AND city = 'Bhiwani'
Here we can create a composite index on (country, state, city). However, if the conditions in the WHERE clause are connected with OR, a composite index cannot be used.
When there is a JOIN between two or more tables, the JOIN column used for lookup in the joined table can be indexed, e.g.,
SELECT * FROM Country c LEFT OUTER JOIN City ct ON ct.CountryCode = c.Code;
An index on the CountryCode column of the City table would help here.
If WHERE and ORDER BY are on different fields, we can use a composite index with the WHERE column first and the ORDER BY column second, e.g.,
EXPLAIN SELECT * FROM City WHERE CountryCode = 'XYZ' ORDER BY Name;
You can observe that "Using filesort" disappears from the Extra column of the result once you create an index on (CountryCode, Name).
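Similarly, if a query has both WHERE and GROUP BY, a composite index containing the WHERE column and the GROUP BY column would help. In the example below the WHERE condition is a range, so the GROUP BY column (Continent) is placed first in the index, e.g.,
EXPLAIN SELECT Continent, COUNT(Code) FROM country WHERE Population > 1000 GROUP BY Continent;
ALTER TABLE country ADD INDEX cp (Continent, Population);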
| Result Column | Content (Without Index) | Content (With Index) |
|---|---|---|
| extra | Using where; Using temporary; Using filesort | Using where; Using index |
As already discussed, the usefulness of indexes depends on how we query the database. It is therefore essential to review them from time to time, adding appropriate indexes and removing unused ones as the application code changes, tables grow or some queries are invoked more frequently.
So the next time you encounter an extremely slow query on a large data set and have already tried rewriting it in different ways to make it fast, try indexing the appropriate columns.