Vectorization: The secret to blazing fast performance

The special sauce

Filed under: Kinetica

Last updated on: July 21, 2021

Length: 3 minute read, 579 words

Everything in Kinetica is geared towards speed and performance. Let’s start with its biggest edge: vectorized parallelism.

Almost all distributed analytic databases offer some level of task parallelization by farming out queries and data to multiple nodes. For instance, suppose a table of daily sales for three stores, A, B, and C, is split into three partitions. Each partition holds the data for just one store and is kept on a separate node in the cluster. Now, if we wanted to calculate total sales for each store, we would map the query to each node and execute it in parallel. The results from these queries would then be combined to arrive at the final table showing the total sales for each store.
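The map-and-combine pattern described above can be sketched in a few lines of Python. This is only an illustration, not how Kinetica or any particular database implements it: the `partitions` data, the node names, and the `total_sales` and `run_query` helpers are all made up for the example, and threads stand in for the separate nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical daily sales, already partitioned so that each "node"
# holds the rows for exactly one store (names and numbers are made up).
partitions = {
    "node_1": [("A", 120.0), ("A", 80.0), ("A", 100.0)],
    "node_2": [("B", 200.0), ("B", 50.0)],
    "node_3": [("C", 75.0), ("C", 25.0), ("C", 150.0)],
}


def total_sales(rows):
    """Run the 'query' on one partition: sum sales per store."""
    totals = {}
    for store, amount in rows:
        totals[store] = totals.get(store, 0.0) + amount
    return totals


def run_query(partitions):
    # Map: send the same query to every partition and run them concurrently.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_results = list(pool.map(total_sales, partitions.values()))

    # Combine: merge the per-node results into one final table.
    final = {}
    for partial in partial_results:
        for store, amount in partial.items():
            final[store] = final.get(store, 0.0) + amount
    return final


result = run_query(partitions)
print(result)  # {'A': 300.0, 'B': 250.0, 'C': 250.0}
```

Each worker sees only its own partition, which is what lets the per-store sums run independently before the final merge.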

This is task-level parallelization. Like a lot of other databases, Kinetica does this really well, but it does not stop there. Parallelization in Kinetica also happens at a much deeper level inside each node, rather than simply at the level of tasks that are mapped to nodes. We do this using vectorization. Vectorization operates at the level of individual instructions sent to a processor within each node. For instance, take an instruction to add 5 to a column of numbers and write the results to a new column B. With vectorization, the data elements in that column are transformed together rather than one at a time, i.e. the instruction to add 5 is applied to multiple pieces of data at the same time. This paradigm is referred to as Single Instruction, Multiple Data (SIMD).

We can think of vectorization as subdividing the work into smaller chunks that can be handled independently by different computational units at the same time.

This can be many times faster than the conventional sequential model, where each piece of data is handled one after the other.
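To make the contrast concrete, here is a toy model of the "add 5 to a column" example. Plain Python cannot issue SIMD instructions directly, so this sketch only counts steps: the sequential version takes one step per element, while the vectorized version takes one step per chunk. The `LANES` constant (how many elements one instruction handles) and both function names are assumptions chosen for illustration.

```python
LANES = 4  # hypothetical vector width: elements handled per instruction


def add_sequential(column, value):
    """One element per step: the conventional sequential model."""
    result, steps = [], 0
    for x in column:
        result.append(x + value)
        steps += 1
    return result, steps


def add_vectorized(column, value, lanes=LANES):
    """One chunk of `lanes` elements per step, modelling a SIMD add."""
    result, steps = [], 0
    for start in range(0, len(column), lanes):
        chunk = column[start:start + lanes]
        # On real hardware the whole chunk is updated by one instruction;
        # the list comprehension here only stands in for that.
        result.extend([x + value for x in chunk])
        steps += 1
    return result, steps


column_a = [1, 2, 3, 4, 5, 6, 7, 8]
column_b_seq, seq_steps = add_sequential(column_a, 5)
column_b_vec, vec_steps = add_vectorized(column_a, 5)

print(column_b_vec)          # [6, 7, 8, 9, 10, 11, 12, 13]
print(seq_steps, vec_steps)  # 8 2
```

Both versions produce the same column B. The only difference is how many instructions it takes to get there, and that ratio grows with the vector width.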

With vectorization, performing the same operation on a modern Intel CPU can be up to 16 times faster than in sequential mode. The performance gains on GPUs with thousands of computational cores are even greater. However, despite these remarkable performance benefits, most analytical code out there is written in the slower sequential mode. This is not a surprise, since until about a decade ago, CPU and GPU hardware could not really support vectorization for data analysis, so most implementations had to be sequential.
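The article does not say where the figure of 16 comes from. One plausible reading, offered here as an assumption, is simple arithmetic on register width: a 512-bit vector register (as in Intel's AVX-512) holds sixteen 32-bit values, so one instruction can touch 16 elements at once. The `lanes` helper below is hypothetical and just does that division.

```python
def lanes(register_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return register_bits // element_bits


# Register widths are those of Intel's vector extensions;
# the element sizes are example choices.
print(lanes(512, 32))  # 16  (AVX-512 with 32-bit values)
print(lanes(256, 32))  # 8   (AVX2 with 32-bit values)
print(lanes(512, 64))  # 8   (AVX-512 with 64-bit values)
```

This is a ceiling, not a promise: real speedups also depend on memory bandwidth and on how much of the workload can actually be expressed as vector operations.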

Over the last 10 years, however, new technologies like CUDA from NVIDIA and Advanced Vector Extensions (AVX) from Intel have dramatically shifted our ability to apply vectorization. Because of the power of vectorization, some traditional vendors now claim to include it in their offerings. But shifting to this new vectorized paradigm is not easy, since code needs to be rewritten from scratch to utilize these capabilities. Unlike Kinetica, these traditional databases incorporate only partial vectorization, limited to specific operations such as filters or some aggregations, and continue to use outdated and less performant modes of computation for other analytical tasks. These solutions also leverage vectorization only on CPUs.

Kinetica, on the other hand, provides native support for vectorization on both CPUs and GPUs. Analytical functions in Kinetica have been written from scratch to take advantage of vectorization. And the great thing is that you don’t have to write a single piece of vectorized code yourself.

You can simply write your queries in data science languages that you are familiar with, and under the hood, Kinetica will leverage vectorization to deliver your results at mind-boggling speeds.