Improve performance by reducing data movement over the network
Filed under: Kinetica
Last updated on: July 21, 2021
All the features that we have discussed in the previous posts would already make Kinetica stand out in terms of performance. But why stop here! 😄
There are two further optimizations that are worth noting.
Data distribution features
Sharding
The first concerns the movement of data across different nodes within the cluster. Unorganized data is a key performance bottleneck because it can lead to unnecessary shuffling of data over the network.
For instance, consider the earlier example of task parallelization. In this example, the data is split based on the store. So all of the data for store A is in the first node, store B in the second and store C in the third node. To find the total sales for each store, each node has all the data it needs to calculate the total.
Now, if this data had not been split on the store, it is likely that each node would not have had all the data it needed. Copies of the data would then have to be sent over the network so that each node had enough information to complete the query.
Sending data over the network is typically the slowest part of a distributed system.
So, a key consideration is how to split the data between the different nodes so that queries from users don’t require too much shuffling of data between different nodes.
This is where sharding comes into play.
Sharding helps to intelligently distribute data across different nodes so that data that are likely to be used together are co-located, reducing the need to shuffle parts of it over the network while executing queries.
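To make the idea concrete, here is a minimal Python sketch of key-based sharding. It is only an illustration of the concept, not Kinetica's implementation: the three-node cluster, the CRC32 hash, and every function name below are assumptions made for this example. It places the same sales rows on nodes in two ways, by hashing the store key and by simple round-robin, and then checks which stores end up spread over more than one node.

```python
import zlib
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size, chosen for illustration


def shard_for(key, num_nodes=NUM_NODES):
    """Map a shard key to a node index using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute_by_key(rows, num_nodes=NUM_NODES):
    """Place each (store, amount) row on the node chosen by its store key."""
    nodes = [[] for _ in range(num_nodes)]
    for store, amount in rows:
        nodes[shard_for(store, num_nodes)].append((store, amount))
    return nodes


def distribute_round_robin(rows, num_nodes=NUM_NODES):
    """Place rows on nodes in arrival order, ignoring their content."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def stores_spanning_nodes(nodes):
    """Return the stores whose rows live on more than one node."""
    placement = defaultdict(set)
    for node_id, node_rows in enumerate(nodes):
        for store, _ in node_rows:
            placement[store].add(node_id)
    return {store for store, ids in placement.items() if len(ids) > 1}


def local_totals(node_rows):
    """Total sales per store using only the rows held by one node."""
    totals = defaultdict(int)
    for store, amount in node_rows:
        totals[store] += amount
    return dict(totals)


sales = [("A", 10), ("A", 7), ("B", 20), ("C", 5), ("B", 3), ("C", 8), ("A", 1)]

sharded = distribute_by_key(sales)
unsharded = distribute_round_robin(sales)

# With key-based sharding no store is split across nodes, so every
# node can compute its stores' totals without receiving any rows.
print(stores_spanning_nodes(sharded))
# With round-robin placement, stores are split across nodes, so partial
# results (or raw rows) must travel over the network to be combined.
print(stores_spanning_nodes(unsharded))
```

Because every row for a given store hashes to the same node, the per-node totals from the sharded layout are already final. In the round-robin layout, each node only holds a partial total for a store and the pieces have to be merged across the network.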
Chunking and partitioning
On top of this, Kinetica provides the capability to partition the data within each node such that it is organized in a way that reduces the number of rows that need to be processed to address a query.
For instance, if a common query is to filter for all the red values, then organizing and grouping the data by colour would allow us to completely skip all the rows with only blue values while searching for rows that match red.
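The colour example can be sketched in a few lines of Python. This is a simplified illustration of partition pruning in general, not a description of how Kinetica stores data internally; the row layout and function names are hypothetical. It compares a filter that scans every row with one that looks only inside the matching partition, and counts the rows each approach touches.

```python
from collections import defaultdict


def build_partitions(rows, column):
    """Group rows into partitions keyed by the value of `column`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def filter_full_scan(rows, column, value):
    """Check every row; return the matches and the number of rows scanned."""
    scanned = 0
    matches = []
    for row in rows:
        scanned += 1
        if row[column] == value:
            matches.append(row)
    return matches, scanned


def filter_with_pruning(partitions, value):
    """Read only the partition for `value`; all other partitions are skipped."""
    matches = list(partitions.get(value, []))
    return matches, len(matches)


rows = [
    {"id": 1, "colour": "red"},
    {"id": 2, "colour": "blue"},
    {"id": 3, "colour": "blue"},
    {"id": 4, "colour": "red"},
    {"id": 5, "colour": "blue"},
]

partitions = build_partitions(rows, "colour")

full_matches, full_scanned = filter_full_scan(rows, "colour", "red")
pruned_matches, pruned_scanned = filter_with_pruning(partitions, "red")

# Both approaches return the same rows, but the pruned lookup
# never touches the rows in the blue partition.
print(full_scanned, pruned_scanned)
```

The saving grows with the share of rows that fall outside the requested partition, which is why it pays to partition on the columns that common queries filter by.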
Integrated analytics
The final and perhaps most important piece in Kinetica’s performance puzzle is its integrated analytics suite.
While a database can have all the bells and whistles of a high-performance architecture, all of this goes out the window the second you need to move your data to an external location to complete your analysis.
As we discussed earlier, moving data over the network is the biggest performance bottleneck.
Complex analytical tasks often require specialized tools that most traditional solutions do not offer. These solutions often have to rely on a patchwork of external tools to complete these tasks.
This requires moving data outside the cluster to these external systems to execute the specialized analytical tasks, and then bringing the intermediate results back into the cluster to complete the analysis. This has a tremendous impact on performance.
Kinetica, however, is a SQL-92 compliant database with an extensive analytical suite that includes graph, geospatial, machine learning and time series capabilities. This allows you to combine and integrate different analytical tools to execute really complex and specialized analytical tasks within the database without having to rely on any external tools.
This means that we can harness all of Kinetica’s performance without losing any to network I/O.
So to summarize, Kinetica at its core is a memory-first platform with tiered storage that leverages vectorization to parallelize tasks at the level of each instruction. Its integrated suite of tools has been built from the ground up to leverage these performance features to deliver high-speed streaming and real-time analytics. This is why enterprises often look to Kinetica when conventional solutions reach their breaking point trying to solve problems Kinetica can handle in a matter of seconds.