Hey there, I guess you have been having a lot fun with your Vertica databases. I thought I’d share some of the things that I did to improve my query performance.
Tables and Projections in Vertica
In Vertica even though we access data through tables everything is stored in projections which are stored among nodes in various ways depending on the K Safety and the Segmentation options that we choose when creating the table.
We can make modifications to these projections manually or run the Database Designer to generate projections for us.
Sometimes to get the best performance we have to write query specific projections.
How do we write Query Specific projections?
First thing is to check out the query plan of our problematic query. Use EXPLAIN PLAN to check this out. What you spot there is what you have to work on.
That sounds simple.
But how do I deal with what I spot in explain plan?
More about Explain PLAN Output in Vertica can be found at https://my.vertica.com/docs/6.1.x/HTML/index.htm#16384.htm As you take a closer look you come across two prime suspects:
- Hash Joins
- Resegmentation
Hash Joins - What are they?
Hash join is an algorithm used by Vertica for joining two tables when it sees that the columns on which the join are happening aren’t sorted the same way or if there is an inequality condition in the join. It is expensive because a hash is created in the memory using the join keys of the smaller table to the larger table. When the size of the hash gets more than the size of the memory available, disk paging starts and your queries are slowed down considerably. Not just the query that you are currently executing, but the whole cluster could be slowed down by a massive query that does a lot of HASH joins on several tables with lots of data in them. How to stop this? Sort the columns in the JOIN condition. Presorted join columns use a faster join technique called Merge Joins which uses very little memory and speeds up your query several times, especially when your tables are huge. More about the different join algorithms used in Vertica can be found here: https://my.vertica.com/docs/6.1.x/HTML/index.htm#20001.htm
Sorting also helps to speed up GROUP BY, if your columns are sorted in the same way as your GROUP BY condition, the query should perform well. More about optimizing GROUP BY can be found here: https://my.vertica.com/docs/6.1.x/HTML/index.htm#12550.htm
Resegmentation - What is that?
As discussed earlier, Projections may be stored among nodes of a cluster in a segmented or unsegmented way. Vertica uses a particular hash segmentation method to segment data of tables among nodes. The segmentation is generally done on a primary key column. When you have composite keys, this could be tricky. You could segment on the composite key, but every time you access something from your tables you pay the cost for the hash function. Thus as a simple tip from Alan and Matt, keep the segmentation clause simple, probably hash it on one field.
When tables are segmented using different combinations of fields they are distributed in different ways. Therefore creating the need for re-segmenting the data when doing joins on tables. But consider this, if tables are segmented by one field that is generally available in all tables, then that would allow a more even distribution of data among nodes in a cluster. There by reducing the re-segmentation needed during query time.
Identically segmented projections perform best when joined together.
What are identically segmented projections?
Projections that are segmented using the same field.
More about this over here: https://my.vertica.com/docs/6.1.x/HTML/index.htm#10248.htm
Other factors
Encoding types of columns Some encoding type chosen by default by the DBD may not be the most appropriate. DBD may also sort columns according to what it thinks but that may not be the most appropriate for your needs. So always choose encoding wisely depending on the cost it involves in using a certain encoding type and also depending on what sort of data type that column is. More about it here: https://my.vertica.com/docs/6.1.x/HTML/index.htm#9273.htm
- Partitioning tables for query pruning could also improve speed significantly. Partitioning a table is dividing the table data into different parts based on values of a certain column. It is very similar to FDB sharding. But partitioning is not segmentation.
- Pre-join Projections are also a Vertica feature that allows you to create projections that combine columns from more than one table using the foreign key relationship. It could be useful to speed up queries that are often written with the same sort of join condition and are pretty much meaningless without a join. More about this can be found over here: https://my.vertica.com/docs/6.1.x/HTML/index.htm#10391.htmThis would only work as long as you have the primary key and foreign key relationships defined, else Vertica would give you the following error:
ERROR 5600: Invalid predicate in projection-select. Only PK=FK equijoins are allowed
Optimizing JOIN Queries
Refer to this page on Vertica Documentation: https://my.vertica.com/docs/6.1.x/HTML/index.htm#12549.htm. This twiki page highlights the most important factors that slows down your query.
A note about projections
Query specific projections might sound like an awesome idea. The more optmized projections you create might speed up your queries. But creating too many projections for the same table or group of tables would also mean that when you update or insert values into that table, all the underlying projections have to be update. So having too many projections might have a harmful effect on your updates, it could slow them down.