Tuesday, April 14, 2015

Optimizer Hint @ Indian PGDay, Bangalore (11th April 2015)

I recently just got back from Indian PGDay conference 2015. It was an interesting, motivating and lot of knowledge sharing in terms of both attending and speaking at the conference.

I spoke about the various kind of "Optimizer Hint" provided by many database engines and also a new idea of "Hint", which can be provided to Optimizer. Some of the speakers shared their work on PostgreSQL as User.
Also it was interesting to know that many companies are evaluating migration or are in process of migrating from other DBs to PostgreSQL. This is really encouraging for all PostgreSQL experts.



Some of the details from presentation are as below (For complete presentation please visit Full Optimizer Hint)

Statistics Hint:

Statistics Hint is used to provide any kind of possible statistics related to query, which can be used by optimizer to yield the even better plan compare to what it would have done otherwise.
Since most of the databases stores statistics for a particular column or relation but doesn't store statistics related to join of column or relation. Rather these databases just multiply the statistics of individual column/relation to get the statistics of join, which may not be always correct.

Example:
Lets say there is query as
SELECT * FROM EMPLOYEE WHERE GRADE>5 AND SALARY > 10000;

If we calculate independent stats for a and b.
suppose sel(GRADE) = .2 and sel(SALARY) = .2;

then sel (GRADE and SALARY) =
sel(GRADE) * sel (SALARY) = .04.
 
In all practical cases if we see, these two components will be highly dependent i.e. if first column satisfy,second column will also satisfy. Then in that case sel (GRADE and SALARY) should be .2 not .04. But current optimizer will be incorrect in this case and may give wrong plan.

Data Hint:

This kind of hints provides the information about the relationship/ dependency among relations or column to influence the plan instead of directly hinting to provide desired plan or direct selectivity value. Optimizer can consider dependency information to derive the actual selectivity.

Example:
Lets say there is a query as
SELECT * FROM TBL  WHERE ID1 = 5 AND ID2=NULL;
SELECT * FROM TBL  WHERE ID1 = 5 AND ID2!=NULL;

Now here if we specify that the dependency as
“If TBL.ID1 = 5 then TBL.ID2 is NULL”
then the optimizer will always consider this dependency pattern and accordingly combined statistics for these two columns can be choosen.

Note: This feature is not yet available in PG.

Conclusion:
Unlike other DB, we can provide some actual statistics information to optimizer to come out with the most optimal plan instead of directly telling planner to choose one specific plan.