Title: A Survey on Data-driven Performance Tuning for Big Data Analytics Platforms
Venue: Big Data Research, vol. 25 (2021)
Authors: Rogério Luís de C.Costa, José Moreira, Paulo Pintor, Veronica dos Santos, Sérgio Lifschitz
Abstract: Many research works deal with big data platforms looking forward to data science and analytics. These are complex and usually distributed environments, composed of several systems and tools. As expected, there is a need for a closer look at performance issues.
In this work, we review performance tuning strategies in the big data environment. We focus on data-driven tuning techniques, discussing the use of database inspired approaches. Concerning big data and NoSQL stores, performance tuning issues are quite different from the so-called conventional systems. Many existing solutions are mostly ad-hoc activities that do not fit for multiple situations. But there are some categories of data-driven solutions that can be taken as guidelines and incorporated into general-purpose auto-tuning modules for big data systems.
We examine typical performance tuning actions, discussing available solutions to support some of the tuning process’s primary activities. We also discuss recent implementations of data-driven performance tuning solutions for big data platforms. We propose an initial classification based on the domain state-of-the-art and present selected tuning actions for large-scale data processing systems. Finally, we organized existing works towards self-tuning big data systems based on this classification and presented general and system-specific tuning recommendations. We found that most of the literature pieces evaluate the use of tuning actions at the physical design perspective, and there is a lack of self-tuning machine-learning-based solutions for big data systems.