Friday, June 26, 2015

Presented paper at PGCon 2015 on Native Compilation Technology

I am back from PGCon 2015, where I was fortunate enough to present my first paper, on "Native Compilation" technology. I also got the opportunity to meet many of the top folks of the PostgreSQL community, and found everyone to be very polite and approachable.

As part of this Native Compilation technology, I mostly focused on the native compilation of relations, which we call Schema Binding.

Some of the details from the presentation are below (for the complete presentation, please visit Go Faster With Native Compilation):

Native Compilation:
Native compilation is a methodology to reduce CPU instructions by executing only the instructions specific to a given query/object, unlike interpreted execution. The steps are as follows (a sketch of the cycle follows the list):

  • Generate C code specific to the objects/query.
  • Compile the C code to generate a DLL and load it with the server executable.
  • Call the specialized function instead of the generalized one.
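To make these steps concrete, here is a minimal, self-contained sketch of the generate/compile/load cycle on Linux (link with -ldl on older glibc). The file name access_tbl.c and the function fetch_id() are illustrative stand-ins, not the actual generated code:

    /* Sketch of: generate C code, compile it to a shared library,
     * load it, and call the specialized function.  Illustrative only. */
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Step 1: generate C code specific to the object. */
        FILE *src = fopen("access_tbl.c", "w");
        if (src == NULL)
            return 1;
        fprintf(src, "int fetch_id(const char *tup)"
                     "{ return *(const int *) tup; }\n");
        fclose(src);

        /* Step 2: compile it into a shared library (the "DLL"). */
        if (system("cc -O2 -fPIC -shared -o access_tbl.so access_tbl.c") != 0)
            return 1;

        /* Step 3: load it and call the specialized function. */
        void *dll = dlopen("./access_tbl.so", RTLD_NOW);
        if (dll == NULL)
            return 1;
        int (*fetch_id)(const char *) =
            (int (*)(const char *)) dlsym(dll, "fetch_id");

        char tuple[4] = { 42, 0, 0, 0 };      /* little-endian int 42 */
        printf("id = %d\n", fetch_id(tuple)); /* prints "id = 42" */
        dlclose(dll);
        return 0;
    }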
Schema Binding:
Native compilation of a relation is called Schema Binding. Most properties of a relation remain the same once it is created, so its data gets stored and accessed in the same pattern irrespective of the data itself. So instead of accessing tuples of a relation in a generic way, we create a specialized access function for the relation at creation time, and that same function is used by every subsequent query involving the table.
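As a rough illustration of what the specialized function buys us, assuming a fixed-width three-int schema; the AttrDesc structure and function names below are simplified stand-ins, not PostgreSQL's tuple-deforming code:

    #include <stdio.h>
    #include <string.h>

    typedef struct { int offset; int len; } AttrDesc;

    /* Generic path: walk the attribute descriptors for every tuple of
     * every relation, paying for the loop and lookups each time. */
    static void deform_generic(const char *tup, const AttrDesc *att,
                               int natts, int *values)
    {
        for (int i = 0; i < natts; i++)
            memcpy(&values[i], tup + att[i].offset, att[i].len);
    }

    /* Specialized path, generated once when the relation is created,
     * for a schema (id1 int, id2 int, id3 int): the offsets become
     * compile-time constants and the loop disappears. */
    static void deform_tbl(const char *tup, int *values)
    {
        memcpy(&values[0], tup + 0, 4);
        memcpy(&values[1], tup + 4, 4);
        memcpy(&values[2], tup + 8, 4);
    }

    int main(void)
    {
        AttrDesc att[3] = { { 0, 4 }, { 4, 4 }, { 8, 4 } };
        /* little-endian ints 7, 8, 9 */
        char tup[12] = { 7, 0, 0, 0, 8, 0, 0, 0, 9, 0, 0, 0 };
        int v1[3], v2[3];

        deform_generic(tup, att, 3, v1);
        deform_tbl(tup, v2);
        printf("generic: %d %d %d, specialized: %d %d %d\n",
               v1[0], v1[1], v1[2], v2[0], v2[1], v2[2]);
        return 0;
    }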


It gives a performance improvement of up to 30% on the standard TPC-H benchmark.


Tuesday, April 14, 2015

Optimizer Hint @ Indian PGDay, Bangalore (11th April 2015)

I recently got back from the Indian PGDay conference 2015. It was interesting and motivating, with a lot of knowledge sharing, both in attending and in speaking at the conference.

I spoke about the various kinds of "Optimizer Hints" provided by many database engines, and also about a new idea for a "Hint" that can be provided to the optimizer. Some of the speakers shared their work on PostgreSQL as users.
It was also interesting to learn that many companies are evaluating migration, or are already in the process of migrating, from other DBs to PostgreSQL. This is really encouraging for all PostgreSQL experts.



Some of the details from the presentation are below (for the complete presentation, please visit Full Optimizer Hint):

Statistics Hint:

A Statistics Hint is used to provide any kind of statistics related to the query, which the optimizer can use to yield an even better plan than it would otherwise have produced.
Most databases store statistics for a particular column or relation, but they do not store statistics for combinations of columns or relations. Instead, they just multiply the statistics of the individual columns/relations to get the combined statistics, which may not always be correct.

Example:
Let's say there is a query such as:
SELECT * FROM EMPLOYEE WHERE GRADE>5 AND SALARY > 10000;

If we calculate independent statistics for GRADE and SALARY,
suppose sel(GRADE) = 0.2 and sel(SALARY) = 0.2;

then sel(GRADE AND SALARY) =
sel(GRADE) * sel(SALARY) = 0.04.
In many practical cases, these two predicates will be highly dependent, i.e. if the first column satisfies its condition, the second will also satisfy its own. In that case sel(GRADE AND SALARY) should be 0.2, not 0.04. The current optimizer will be incorrect in this case and may give a wrong plan.
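The arithmetic above, written out as a tiny C sketch; taking the minimum of the two selectivities for fully dependent predicates is one possible hint semantics, not existing PostgreSQL behaviour:

    #include <stdio.h>

    int main(void)
    {
        double sel_grade = 0.2, sel_salary = 0.2;

        /* Default estimate: multiply, as if the predicates were
         * independent. */
        double independent = sel_grade * sel_salary;          /* 0.04 */

        /* With a hint saying the predicates are fully dependent, the
         * combined selectivity is just the more selective of the two. */
        double hinted = sel_grade < sel_salary ? sel_grade
                                               : sel_salary;  /* 0.20 */

        printf("independent = %.2f, hinted = %.2f\n", independent, hinted);
        return 0;
    }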

Data Hint:

This kind of hint provides information about the relationships/dependencies among relations or columns, to influence the plan, instead of directly hinting at the desired plan or giving a selectivity value. The optimizer can use the dependency information to derive the actual selectivity.

Example:
Let's say there are queries such as:
SELECT * FROM TBL WHERE ID1 = 5 AND ID2 IS NULL;
SELECT * FROM TBL WHERE ID1 = 5 AND ID2 IS NOT NULL;

Now, if we specify the dependency as
“If TBL.ID1 = 5 then TBL.ID2 is NULL”
then the optimizer will always take this dependency pattern into account, and combined statistics for these two columns can be chosen accordingly.

Note: This feature is not yet available in PG.
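As an illustration, here is a sketch of how the estimates for the two queries could change once the dependency is known; sel(ID1 = 5) = 0.1 is an assumed number, and since the feature does not exist in PG, this is purely hypothetical:

    #include <stdio.h>

    int main(void)
    {
        double sel_id1 = 0.1;   /* assumed selectivity of ID1 = 5 */

        /* Query 1: "ID2 IS NULL" is implied by ID1 = 5, so it filters
         * out nothing further. */
        double q1 = sel_id1;    /* not sel_id1 * sel(ID2 IS NULL) */

        /* Query 2: "ID2 IS NOT NULL" contradicts the implication, so
         * no row can satisfy both predicates. */
        double q2 = 0.0;

        printf("q1 = %.2f, q2 = %.2f\n", q1, q2);
        return 0;
    }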

Conclusion:
Unlike other DBs, we can provide actual statistics information to the optimizer so that it comes up with the most optimal plan, instead of directly telling the planner to choose one specific plan.

Wednesday, March 25, 2015

Index Scan Optimization for ">" condition

In PostgreSQL 9.5, we can see improved performance for index scans on ">" conditions.

In order to explain this optimization, consider the below schema:
create table tbl2(id1 int, id2 varchar(10), id3 int);
create index idx2 on tbl2(id2, id3);

And a query such as:
                select count(*) from tbl2 where id2>'a' and id3>990000;

As per the design prior to this patch, the above query used the following steps to retrieve index tuples:

  • Find the scan start position by searching the B-tree for the first position satisfying the first key condition, i.e. id2 > 'a'.
  • Fetch each tuple starting from the position found in step 1.
  • For each tuple, check all the scan key conditions; in our example, both conditions are checked.
  • If the conditions match, return the tuple; otherwise the scan stops.


The problem here is that the first scan key condition was already matched in order to find the scan start position (step 1), so it is obvious that every further tuple will also match the first scan key condition (as the records are sorted).

So comparing the first scan key condition again in step 3 is redundant.

So we changed the B-tree scan algorithm to avoid the redundant check, i.e. we removed the first key comparison for each tuple, since it is guaranteed to always be true.
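A simplified C sketch of the idea; the real change lives in PostgreSQL's B-tree key-checking code and has more preconditions, but the essence is skipping the leading key:

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { const char *id2; int id3; } Tuple;
    typedef bool (*KeyCheck)(const Tuple *t);

    static bool key_id2(const Tuple *t) { return t->id2[0] > 'a'; }
    static bool key_id3(const Tuple *t) { return t->id3 > 990000; }

    /* Before the patch: every scan key is re-checked for every tuple. */
    static bool check_all(const Tuple *t, KeyCheck *keys, int nkeys)
    {
        for (int i = 0; i < nkeys; i++)
            if (!keys[i](t))
                return false;
        return true;
    }

    /* After the patch: the first key already positioned the scan, and
     * the index is ordered on it, so every later tuple satisfies it;
     * start checking from the second key. */
    static bool check_from_second(const Tuple *t, KeyCheck *keys, int nkeys)
    {
        for (int i = 1; i < nkeys; i++)
            if (!keys[i](t))
                return false;
        return true;
    }

    int main(void)
    {
        KeyCheck keys[2] = { key_id2, key_id3 };
        Tuple t = { "b", 990001 };
        printf("%d %d\n", check_all(&t, keys, 2),
                          check_from_second(&t, keys, 2));
        return 0;
    }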

Performance result summary:

I would like to thank Simon Riggs for verifying and committing this patch. Simon also confirmed an improvement of 5% on both short and long indexes with the least beneficial data type, which he considered a very positive win overall.

Wednesday, March 18, 2015

Overview of PostgreSQL Engine Internals

POSTGRESQL is an open-source, full-featured relational database. This blog gives an overview of how POSTGRESQL engine processes queries received from the user.
The typical, simplified flow of the PostgreSQL engine is:

[Figure: SQL Engine Flow]

As part of this blog, I am going to cover all the modules marked in yellow in the flow above.

Parser:
The Parser module is responsible for the syntactic analysis of the query. It consists of two sub-modules:
1. Lexical scanner
2. Bison rules/actions

Lexical Scanner:
The lexical scanner reads each character of the given query and returns the appropriate token based on the matching rules, e.g.:

    [Example scanner rules for comments, as in src/backend/parser/scan.l]

The name given in <> is a state name; in the example above, <xc> is the state entered at a comment start. So once the scanner sees the comment-start characters, the comment-body tokens are read in the <xc> state only.
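As a rough, hand-rolled illustration of that state-based idea in plain C (flex generates this kind of machinery from the rules; the input string and token handling here are invented):

    #include <stdio.h>

    /* Scan the input and print a token per word, skipping C-style
     * comments via an explicit "in comment" state, the hand-written
     * equivalent of the <xc> exclusive state. */
    int main(void)
    {
        const char *in = "select /* a comment */ col";
        int in_comment = 0;

        for (const char *p = in; *p; p++) {
            if (in_comment) {
                if (p[0] == '*' && p[1] == '/') {
                    in_comment = 0;          /* leave <xc> */
                    p++;
                }
            } else if (p[0] == '/' && p[1] == '*') {
                in_comment = 1;              /* enter <xc> */
                p++;
            } else if (*p != ' ') {
                const char *start = p;
                while (*p && *p != ' ' && !(p[0] == '/' && p[1] == '*'))
                    p++;
                printf("token: %.*s\n", (int) (p - start), start);
                p--;                         /* for-loop will advance */
            }
        }
        return 0;
    }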

Bison:
Bison reads the tokens returned by the scanner, matches them against the grammar rule for a particular kind of query, and performs the associated actions. E.g. the bison rule for the SELECT statement:

    [Bison rule for the SELECT statement, as in src/backend/parser/gram.y]

Each returned token is matched against the rule in left-to-right order; if at any point no match is found, the parser either moves on to the next possible matching rule or throws an error.

Analyzer:
The Analyzer module is responsible for the semantic analysis of the given query. Each piece of raw information about the query received from the Parser module is transformed into the database's internal object form, to get the corresponding object ID; e.g. the relation name "tbl" gets replaced with its object ID.
The output of the analyzer module is the query tree; its structure can be seen in the "Query" struct in src/include/nodes/parsenodes.h.

Optimizer:
The Optimizer module, also considered to be the brain of the SQL engine, is responsible for choosing the best path for executing the query. The best path for a query is selected based on the cost of the path: the path with the least cost is the winner.
Based on the winner path, a plan is created, which is used by the executor to execute the query.
Some of the important decisions are taken in terms of the methods below (a toy cost comparison follows the list):

  1. Scan Method
    • Sequential scan: simply reads the heap file from start to end. Since it reads the whole table, it is wasteful when only a small fraction of the records is needed.
    • Index scan: uses a secondary data structure to quickly find the records that satisfy a certain predicate, and then looks up the rest of each record in the heap. So it involves the extra cost of random page accesses.
  2. Join Method
    • Nested Loop Join: each record of the outer table is matched with each record of the inner table. A simple algorithm for an NL join between Outer and Inner on Outer.k = Inner.k:

               for each tuple r in Outer:
                       for each tuple s in Inner with s.k = r.k:
                               emit output tuple (r, s)

               (Equivalently: Outer is the left input, Inner is the right.)


    • Merge Join: this join is suitable only when the records of each participating table are sorted, and only for "=" join clauses. A simple algorithm (ignoring duplicate keys, which require re-scanning part of the inner input):

               while tuples r in Outer and s in Inner remain:
                       if r.k = s.k:
                               emit output tuple (r, s)
                               advance Outer and Inner
                       else if r.k < s.k:
                               advance Outer
                       else:
                               advance Inner


    • Hash Join: this join does not require the records to be sorted, but it too is used only for "=" join clauses. For an HJ between Inner and Outer on Inner.k = Outer.k:

               -- build phase
               for each tuple r in Inner:
                       insert r into hash table T with key r.k
               -- probe phase
               for each tuple s in Outer:
                       for each tuple r in bucket T[s.k]:
                               if s.k = r.k:
                                       emit output tuple (r, s)

  3. Join Order: the mechanism that decides the order in which the tables are to be joined.
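To make the cost-based choice concrete, here is a toy comparison between a sequential scan path and an index scan path. The constants and formulas are invented for the sketch; PostgreSQL's actual cost model lives in src/backend/optimizer/path/costsize.c:

    #include <stdio.h>

    int main(void)
    {
        double pages = 1000, tuples = 100000;
        double selectivity = 0.001;          /* fraction of matching rows */

        /* Sequential scan: read every page once, check every tuple. */
        double seq_cost = pages * 1.0 + tuples * 0.01;

        /* Index scan: pay a random page access per matching row, plus
         * a small cost to descend the index. */
        double idx_cost = tuples * selectivity * 4.0 + 50.0;

        printf("seqscan = %.1f, indexscan = %.1f -> %s wins\n",
               seq_cost, idx_cost,
               idx_cost < seq_cost ? "index scan" : "seq scan");
        return 0;
    }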

 Typical output of a plan is:
  postgres=# explain select firstname from friend where age=33 order by firstname;
                          QUERY PLAN
--------------------------------------------------------------
 Sort  (cost=1.06..1.06 rows=1 width=101)
   Sort Key: firstname
   ->  Seq Scan on friend  (cost=0.00..1.05 rows=1 width=101)
         Filter: (age = 33)
(4 rows)
Executor:
The Executor is the module that takes the output of the planner as input and transforms each plan node into a state tree node. Each node is then executed to perform its corresponding operation.
Execution of the state tree starts from the root; to get its input, each node keeps descending to its children until it reaches a leaf, so the leaf nodes are executed first and pass their output to the upper nodes. Of two child nodes, the outer (i.e. left) node gets evaluated first.
At this point the executor uses the interfaces of the storage module to retrieve the actual data.

Typically, the execution process can be divided into:
  • Executor Start: prepares the plan for execution. It processes each plan node recursively and generates the corresponding state tree node. It also initializes memory to hold the projection list, the qualification expressions, and the slot for the resultant tuple.
  • Executor Run: recursively processes each state tree node; each resultant tuple is sent to the front end using the registered destination function.
  • Executor Finish: frees all the allocated resources.
A typical flow of execution: execution starts from, say, a Merge Join node, but it needs input to process, so control flows to its left (outer) child, which returns one tuple using an index scan; the Merge Join then requests an input tuple from its right (inner) child. The right child is a Sort node, so it requests tuples from its own child, which performs a sequential scan. Once all the tuples have been received and sorted at the Sort node, it passes the first tuple up to its parent.
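A compact C sketch of this pull-based execution: each state node exposes a next() callback and pulls tuples from its child. The node layout is invented for illustration and is not PostgreSQL's actual ExecProcNode machinery; the filter plays the role of the age = 33 qualification from the plan above.

    #include <stdio.h>

    typedef struct Node Node;
    struct Node {
        const char *name;
        Node *child;                        /* outer child, if any */
        int (*next)(Node *self, int *out);  /* returns 0 when exhausted */
        int  pos;                           /* cursor for the leaf scan */
    };

    static int heap[] = { 31, 33, 35 };     /* a tiny in-memory "table" */

    /* Leaf node: sequential scan over the heap. */
    static int seqscan_next(Node *self, int *out)
    {
        if (self->pos >= 3)
            return 0;
        *out = heap[self->pos++];
        return 1;
    }

    /* Inner node: pulls from its child and applies the filter age = 33. */
    static int filter_next(Node *self, int *out)
    {
        int v;
        while (self->child->next(self->child, &v)) {
            if (v == 33) {
                *out = v;
                return 1;
            }
        }
        return 0;
    }

    int main(void)
    {
        Node scan   = { "SeqScan", NULL,  seqscan_next, 0 };
        Node filter = { "Filter",  &scan, filter_next,  0 };
        int v;

        while (filter.next(&filter, &v))  /* the root pulls; leaves produce */
            printf("tuple: %d\n", v);
        return 0;
    }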

Reference:
Older papers from PostgreSQL.