Compressed Dictionary Matching on Run-Length Encoded Strings
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Given a set of pattern strings P = {P1, P2, . . . Pk} and a text string S, the classic dictionary matching problem is to report all occurrences of each pattern in S. We study the dictionary problem in the compressed setting, where the pattern strings and the text string are compressed using run-length encoding, and the goal is to solve the problem without decompression and achieve efficient time and space in the size of the compressed strings. Let m and n be the total length of the patterns P and the length of the text string S, respectively, and let m and n be the total number of runs in the run-length encoding of the patterns in P and S, respectively. Our main result is an algorithm that achieves O((m+n) log logm+occ) expected time, and O(m) space, where occ is the total number of occurrences of patterns in S. This is the first non-trivial solution to the problem. Since any solution must read the input, our time bound is optimal within an log logm factor. We introduce several new techniques to achieve our bounds, including a new compressed representation of the classic Aho-Corasick automaton and a new efficient string index that supports fast queries in runlength encoded strings.