Lossless Compression of Textual Data

February 4, 2020

Pedro Valero

Lossless Compression of Textual Data

Time: 10:45am

Location: Meeting room 302 (Mountain View), level 3

This talk is a combination of what I have learned about grammar-based compression while developing our tool -zearch- for searching on compressed text [1] and what I have learned about LZ77-based compression during my internship at Facebook.

We will begin with an introduction to these two paradigms of compression of textual data: Grammar-based and LZ77-based, followed by an explanation of one algorithm from each family: REcursive PAIRing (Grammar) [2] and zstd (LZ77) [3]. After that, we will compare these two classes of algorithms and discuss some of the advantages of each of them with respect to the other.

Finally, we will talk about dictionary-enhanced compression as a mechanism for efficiently compressing large amounts of small but similar files. This technique has proven so efficient that has been deployed at large scale within Facebook creating the so called “Managed Compression”.

[1] https://ieeexplore.ieee.org/document/8712660 [2] https://ieeexplore.ieee.org/abstract/document/755679 [3] https://facebook.github.io/zstd/