Wednesday, March 25, 2015

Index Scan Optimization for ">" condition

In PostgreSQL 9.5, we can see improved performance for Index Scans with a ">" condition.

To explain this optimization, consider the following schema:
create table tbl2(id1 int, id2 varchar(10), id3 int);
create index idx2 on tbl2(id2, id3);

and the following query:
                select count(*) from tbl2 where id2>'a' and id3>990000;

Prior to this patch, the above query used the following steps to retrieve index tuples:

  • Find the scan start position by searching the B-tree for the first position that satisfies the first key condition, i.e. id2 > 'a'.
  • Fetch each tuple starting from the position found in step 1.
  • For each tuple, check all the scan key conditions; in our example both scan key conditions are checked.
  • If the conditions match, return the tuple; otherwise the scan stops.


The problem here is that the first scan key condition has already been satisfied in order to find the scan start position (step 1), so it is obvious that every subsequent tuple will also satisfy the first scan key condition (since the index is sorted on it).

Therefore, comparing the first scan key condition again in step 3 is redundant.

So we changed the B-tree scan algorithm to avoid the redundant check, i.e. the first key comparison is removed for each tuple, as it is guaranteed to always be true.
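
Here is a minimal C sketch of the idea (this is not the actual PostgreSQL B-tree code; ScanKey and IndexTuple below are simplified stand-ins): once the leading key has been used to position the scan, it can be skipped while checking the remaining keys for every fetched tuple.

    #include <stdbool.h>

    typedef struct ScanKey
    {
        int  attno;                  /* which indexed column the key tests */
        bool (*test)(int value);     /* the condition, e.g. "value > 990000" */
    } ScanKey;

    typedef struct IndexTuple
    {
        int attrs[2];                /* simplified: two integer columns */
    } IndexTuple;

    static bool
    check_keys(const IndexTuple *tup, const ScanKey *keys, int nkeys,
               bool skip_first)
    {
        /* before the patch i always started at 0; now the first key is skipped */
        for (int i = skip_first ? 1 : 0; i < nkeys; i++)
        {
            if (!keys[i].test(tup->attrs[keys[i].attno]))
                return false;        /* a non-leading key can still fail */
        }
        return true;
    }

    /* stand-ins for the two conditions of the example query */
    static bool first_key(int v)  { return v > 0; }        /* plays the role of id2 > 'a' */
    static bool second_key(int v) { return v > 990000; }    /* id3 > 990000 */

    int
    main(void)
    {
        ScanKey    keys[2] = { {0, first_key}, {1, second_key} };
        IndexTuple tup     = { {5, 990001} };

        /* with skip_first = true only the second key is actually re-checked */
        return check_keys(&tup, keys, 2, true) ? 0 : 1;
    }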

Performance result summary:

I would like to thank Simon Riggs for verifying and committing this patch. He also confirmed an improvement of around 5% on both short and long indexes for the least beneficial data type, and considered it a very positive win overall.

Wednesday, March 18, 2015

Overview of PostgreSQL Engine Internals

PostgreSQL is an open-source, full-featured relational database. This blog gives an overview of how the PostgreSQL engine processes queries received from the user.
A typical, simplified flow of the PostgreSQL engine is:

[Figure: SQL Engine Flow]

In this blog, I am going to cover the modules marked in yellow in the figure above.

Parser:
The parser module is responsible for the syntactic analysis of the query. It consists of two sub-modules:
1. Lexical scanner
2. Bison rules/actions

Lexical Scanner:
The lexical scanner reads each character of the given query and returns the appropriate token based on the matching rules (the rules live in src/backend/parser/scan.l). In these rules, a name given inside <> is a state name; for example, <xc> is the state used for comments: once the scanner sees the comment start characters, the comment body is read only in the <xc> state.
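
To illustrate the state idea only (a plain C sketch, not the actual flex rules, and nesting of comments is ignored): after the comment start is seen, characters are consumed in a separate "inside comment" state, and characters are passed through only in the initial state.

    #include <stdio.h>

    typedef enum { INITIAL, XC } ScanState;   /* XC: inside a comment */

    static void
    scan(const char *query)
    {
        ScanState state = INITIAL;

        for (const char *p = query; *p != '\0'; p++)
        {
            if (state == INITIAL && p[0] == '/' && p[1] == '*')
            {
                state = XC;          /* comment start: enter the comment state */
                p++;                 /* also consume the '*' */
            }
            else if (state == XC && p[0] == '*' && p[1] == '/')
            {
                state = INITIAL;     /* comment end: back to the initial state */
                p++;
            }
            else if (state == INITIAL)
                putchar(*p);         /* non-comment text is passed through */
            /* characters inside the comment are simply discarded */
        }
        putchar('\n');
    }

    int
    main(void)
    {
        scan("SELECT 1 /* this part is ignored */ + 2;");
        return 0;
    }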

Bison:
Bison reads the tokens returned by the scanner, matches them against the grammar rules for the particular statement (the rules live in src/backend/parser/gram.y), and performs the associated actions; there is, for example, a rule describing the overall shape of a SELECT statement.

Each returned token is matched against the relevant rule in left-to-right order; if at any point no match is found, bison either moves on to the next possible matching rule or reports a syntax error.

Analyzer:
The analyzer module is responsible for the semantic analysis of the given query. The raw information about the query received from the parser module is transformed into the database's internal object form, so that the corresponding object IDs can be looked up. E.g. the relation name "tbl" gets replaced with its object ID.
The output of the analyzer module is the query tree; its structure can be seen in the "Query" struct in the file src/include/nodes/parsenodes.h.

Optimizer:
The optimizer module, also considered the brain of the SQL engine, is responsible for choosing the best path for executing the query. The best path for a query is selected based on the cost of the path: the path with the least cost is the winner.
Based on the winner path, a plan is created, which the executor uses to execute the query.
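
As a rough illustration (these are not the planner's real data structures, and the path names and costs below are made up), cost-based selection ultimately amounts to keeping the candidate path with the lowest estimated total cost:

    #include <stdio.h>
    #include <stddef.h>

    typedef struct Path
    {
        const char *name;            /* e.g. "SeqScan", "IndexScan" */
        double      total_cost;      /* estimated cost of executing this path */
    } Path;

    static const Path *
    cheapest_path(const Path *paths, size_t npaths)
    {
        const Path *winner = NULL;

        for (size_t i = 0; i < npaths; i++)
            if (winner == NULL || paths[i].total_cost < winner->total_cost)
                winner = &paths[i];
        return winner;
    }

    int
    main(void)
    {
        Path candidates[] = { { "SeqScan on friend", 1.05 },
                              { "IndexScan on a hypothetical index", 8.27 } };

        printf("winner path: %s\n", cheapest_path(candidates, 2)->name);
        return 0;
    }
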
Some of the important decisions are taken in terms of the following methods:

  1. Scan Method
    • Sequential scan: Simply reads the heap file from start to end. Because the whole table is read regardless of how many records match, it is considered very slow when only a small fraction of the records needs to be fetched.
    • Index scan: Uses a secondary data structure to quickly find the records that satisfy a given predicate, and then looks up the remaining part of each record in the heap. It therefore involves the extra cost of random page access.
  2. Join Method
    • Nested Loop Join: In this approach, each record of the outer table is matched with each record of the inner table. A simple algorithm is:

               For a NL join between Outer and Inner on Outer.k = Inner.k:
                   for each tuple r in Outer:
                       for each tuple s in Inner with s.k = r.k:
                           emit output tuple (r, s)

               (Here Outer is the left relation and Inner is the right relation.)


    • Merge Join: This join requires the records of each participating table to be sorted, and it works only for "=" join clauses. A simple algorithm (handling of duplicate join keys is omitted) is:

               While tuples remain in both Outer (current tuple r) and Inner (current tuple s):
                   if r.k = s.k:
                       emit output tuple (r, s)
                       advance Outer and Inner
                   else if r.k < s.k:
                       advance Outer
                   else:
                       advance Inner

    • Hash Join: This join does not require the records to be sorted, but it is also used only for "=" join clauses. A simple algorithm is (a runnable C sketch of the build and probe phases follows this list):

               For a HJ between Inner and Outer on Inner.k = Outer.k:
                   -- build phase
                   for each tuple r in Inner:
                       insert r into hash table T with key r.k
                   -- probe phase
                   for each tuple s in Outer:
                       for each tuple r in bucket T[s.k]:
                           if s.k = r.k:
                               emit output tuple (r, s)

  3. Join Order: The mechanism that decides the order in which the tables are joined.
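
Referring back to the Hash Join method above, here is a minimal, runnable C sketch of the build and probe phases (illustrative only; PostgreSQL's real implementation lives in src/backend/executor/nodeHash.c and nodeHashjoin.c, and the tuples below are made up; allocated memory is not freed, since this is only a sketch):

    #include <stdio.h>
    #include <stdlib.h>

    #define NBUCKETS 16

    typedef struct Tuple { int k; const char *payload; } Tuple;

    typedef struct HashEntry
    {
        Tuple             tup;
        struct HashEntry *next;      /* chain of entries in one bucket */
    } HashEntry;

    static unsigned hash_key(int k) { return (unsigned) k % NBUCKETS; }

    int
    main(void)
    {
        Tuple inner[] = { {1, "i1"}, {2, "i2"}, {2, "i2b"}, {5, "i5"} };
        Tuple outer[] = { {2, "o2"}, {3, "o3"}, {5, "o5"} };
        HashEntry *buckets[NBUCKETS] = { NULL };

        /* build phase: insert every inner tuple into the hash table */
        for (size_t i = 0; i < sizeof(inner) / sizeof(inner[0]); i++)
        {
            HashEntry *e = malloc(sizeof(HashEntry));

            e->tup = inner[i];
            e->next = buckets[hash_key(inner[i].k)];
            buckets[hash_key(inner[i].k)] = e;
        }

        /* probe phase: for each outer tuple, look only at the matching bucket */
        for (size_t i = 0; i < sizeof(outer) / sizeof(outer[0]); i++)
            for (HashEntry *e = buckets[hash_key(outer[i].k)]; e; e = e->next)
                if (e->tup.k == outer[i].k)          /* re-check the join key */
                    printf("emit (%s, %s)\n", e->tup.payload, outer[i].payload);

        return 0;
    }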

Typical output of a plan is:
  postgres=# explain select firstname from friend where age=33 order by firstname;
                          QUERY PLAN
--------------------------------------------------------------
 Sort  (cost=1.06..1.06 rows=1 width=101)
   Sort Key: firstname
   ->  Seq Scan on friend  (cost=0.00..1.05 rows=1 width=101)
         Filter: (age = 33)
(4 rows)
Executor:
The executor module takes the output of the planner as input and transforms each plan node into a state tree node. Each node is then executed to perform the corresponding operation.
Execution of the state tree starts from the root; to get its input, each node keeps descending to its child nodes until a leaf node is reached, and the leaf node is executed to pass its output up to the node above it. When a node has two children, the outer (i.e. left) child is evaluated first. At the leaf level, the executor uses the storage module's interface to retrieve the actual data.
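
Below is a minimal C sketch of this pull-based execution model (illustrative only, not PostgreSQL's executor API): every state node exposes a "next tuple" callback, and a parent obtains its input by calling down into its child; a NULL return means "no more tuples".

    #include <stdio.h>
    #include <stddef.h>

    typedef struct Node
    {
        const char *(*next)(struct Node *self);   /* produce the next tuple */
        struct Node *child;                       /* input node, if any */
        int          pos;                         /* private position counter */
    } Node;

    /* leaf node: a scan over a tiny constant "table" */
    static const char *
    scan_next(Node *self)
    {
        static const char *rows[] = { "row1", "row2", "row3" };

        if (self->pos >= 3)
            return NULL;
        return rows[self->pos++];
    }

    /* inner node: a trivial "filter" that keeps every second tuple it pulls */
    static const char *
    filter_next(Node *self)
    {
        const char *tup;

        while ((tup = self->child->next(self->child)) != NULL)
            if (self->pos++ % 2 == 0)
                return tup;
        return NULL;
    }

    int
    main(void)
    {
        Node scan   = { scan_next,   NULL,  0 };
        Node filter = { filter_next, &scan, 0 };
        const char *tup;

        /* pulling repeatedly on the root drives the whole tree */
        while ((tup = filter.next(&filter)) != NULL)
            printf("emit %s\n", tup);
        return 0;
    }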

Typically, the execution process can be divided into:
  • Executor Start: Prepares the plan for execution. It processes each plan node recursively and generates the corresponding state tree node. It also initializes memory to hold the projection list, the qualification expressions, and the slot for holding the resultant tuple.
  • Executor Run: Recursively processes each state tree node; each resultant tuple is sent to the front end using the registered destination function.
  • Executor Finish: Frees all the allocated resources.
One typical flow of execution is as follows. Consider a plan whose root is a Merge Join, with an index scan as its outer (left) child and a Sort node over a sequential scan as its inner (right) child. Execution starts at the Merge Join, but it needs input to process, so it first descends to the left child and takes one tuple using the index scan; it then requests an input tuple from the right child. The right child is the Sort node, so it requests tuples from its own child, which performs the sequential scan. Once all tuples have been received and sorted at the Sort node, it passes the first tuple up to its parent node.
