The State of the Art of Data Analytics Systems and What is Wrong about it

April 5, 2019

Ingo Mueller



Time:   10:45am
Location:   Meeting room 302 (Mountain View), level 3

Few technological advances have affected as many aspects of science, the economy, and society in general as the ability to collect, analyze, and understand large amounts of data. Data analytics systems play an important role in this development because they translate the exponential performance improvements made by hardware into similar improvements at higher abstraction levels. As one example, I will present a thorough study of a core database primitive, grouping with aggregation, carried out in the context of a commercial system for relational in-memory processing. For this primitive alone, we had to address a number of challenges: (provable) cache efficiency, CPU-friendliness, parallelism within and across processors, robust handling of skewed data, adaptive processing, processing with constrained memory, and integration with modern database architectures.

I argue that this approach is typical of the state of the art in system building: today’s systems usually implement a single analysis/platform combination, which forces data scientists to switch tools constantly and duplicates implementation effort across systems and their applications. Yet these systems are all very similar on a conceptual level, which suggests that we have not yet fundamentally understood what makes up their essence. I will therefore sketch my vision and research theme for the foreseeable future: a common abstraction for a wide range of data analytics types that can run efficiently on a variety of hardware platforms.