The Demo by Lucas Kuhring and Zsolt István on "Bionic Distributed Storage for Parquet Files", accepted at VLDB'19

June 11, 2019

Lucas Kuhring and Zsolt István, researchers at the IMDEA Software Institute, will present a demo “I Can’t Believe It’s Not (Only) Software! Bionic Distributed Storage for Parquet Files” at the 45th International Conference on Very Large Data Bases (VLDB'19) in Los Angeles, USA.

The volume of data stored and processed by data science applications keeps growing, which leads to bottlenecks and inefficiencies in the datacenter. One way to reduce these bottlenecks is to tailor the underlying distributed storage solution to the application domain, so that resources are used more efficiently. Lucas and Zsolt have explored this idea in the context of Apache Parquet, a popular column-oriented storage format used in big data workloads.

Their prototype uses a storage node based on Field Programmable Gate Arrays (FPGAs) that offers high-bandwidth data deduplication and, in the future, near-data processing for machine learning workloads. The hardware is combined with a software library that allows transparent access to Parquet files.

The demonstration shows that, by relying on the FPGA's dataflow processing model, it is possible to implement in-line deduplication without significantly increasing latency or reducing throughput. It also highlights the benefits of implementing the application-specific aspects in a software library instead of in FPGA circuits. This choice enables, for instance, regular data science frameworks running in Python to access the data on the storage nodes.
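
For readers unfamiliar with in-line deduplication, the idea can be illustrated with a minimal software sketch. This is not the authors' FPGA design: the class name `DedupStore`, the fixed block size, the use of SHA-256 as a fingerprint, and the in-memory dictionaries are all assumptions made for illustration only. Each incoming block is fingerprinted as it is written, and a block whose fingerprint is already known is not stored a second time.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store with content-hash deduplication.

    Hypothetical illustration only; block size and hash choice are assumptions.
    """

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # fingerprint -> block bytes (each unique block stored once)
        self.files = {}   # file name -> ordered list of fingerprints

    def put(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        fingerprints = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            fp = hashlib.sha256(block).digest()
            # Deduplication happens on the write path: keep the first copy only.
            self.blocks.setdefault(fp, block)
            fingerprints.append(fp)
        self.files[name] = fingerprints

    def get(self, name):
        """Reassemble a file from its list of block fingerprints."""
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def stored_bytes(self):
        """Bytes physically kept after deduplication."""
        return sum(len(block) for block in self.blocks.values())


store = DedupStore(block_size=4)
store.put("a", b"AAAABBBBAAAA")   # the third block repeats the first
store.put("b", b"BBBBCC")         # the first block is already stored
print(store.stored_bytes())
```

In this sketch the two files total 18 bytes of input, yet only the three distinct blocks are kept. A hardware implementation would perform the fingerprinting and lookup as data streams through the node rather than in a Python loop.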
