Python

Regular expression matching at scale with Hyperscan

Fabian Becker

08 Apr 2021 • 9 min read

Regular expressions are and always have been ubiquitous in Software Engineering. From input matching to complex parsing, their use is wide-spread. There are many different regular expression engines out there. The most popular by far is PCRE, or Perl Compatible Regular Expressions. Go developers will be more familiar with Google's re2, which is also available in Python through pyre2. In most cases the choice of engine rarely matters for the overall performance of your application. What however, if you have a large number of regular expressions that you need to match against a large dataset?

In this article we're going to look at Hyperscan and how it can help scale up the performance of regular expression based classification tasks. Hyperscan is a high-performance regular expression engine which was open sourced in 2015 by Intel. According to the about page, the library is mainly used inside a deep package inspection library stack where it aids package classification through regular expression matching. In 2018, Github adopted Hyperscan for its automatic token scanning with which it detects accidental commits of API tokens. What sets Hyperscan apart from a lot of other engines is that it supports multi-pattern matching and streaming. Streaming is when the engine doesn't keep the full string contents in memory to perform a match.

Thanks to the excellent work from David Gidwani there is a Python extension available which makes using Hyperscan in Python as easy as pip install hyperscan.

Dataset

For the purpose of this post, let's assume that you have a stream of web application logs where you want to annotate each line based on some property. For example, a line with a .php should be annotated with #php and a line with .js should get a #javascript annotation. Over time the number of rules will grow to hundreds maybe thousands of annotations. Since rules aren't necessarily mutually exclusive you will need to check each line against all rules. Let's also assume that after evaluating all alternatives we've concluded that the best way to express our rules is with regular expressions.

An important callout at this point is that we're merely looking for a boolean match - we don't care about capture groups at all, in fact Hyperscan doesn't support them. If those are important for you, you can experiment with a two stage matching setup where Hyperscan forms the first stage to identify matching entities and regular re/re2 form the second stage and pull out individual capture groups.

Sidenote: The example used here might not benefit from the use of regular expressions. Often times a simple combination of string splitting with some indexing will be faster than using regular expressions. It's important to always properly evaluate the performance of alternatives before using regular expressions.

There are a lot of example web application log datasets available on the internet. For this post I've selected a dataset from Kaggle which contains around 16k rows.

The dataset is fairly simple. It contains IP addresses, a timestamp, a URL and an HTTP status. A sample can be seen below:

IP,Time,URL,Staus
10.128.2.1,[29/Nov/2017:06:58:55,GET /login.php HTTP/1.1,200
10.128.2.1,[29/Nov/2017:06:59:02,POST /process.php HTTP/1.1,302
10.128.2.1,[29/Nov/2017:06:59:03,GET /home.php HTTP/1.1,200
10.131.2.1,[29/Nov/2017:06:59:04,GET /js/vendor/moment.min.js HTTP/1.1,200
10.130.2.1,[29/Nov/2017:06:59:06,GET /bootstrap-3.3.7/js/bootstrap.js HTTP/1.1,200
10.130.2.1,[29/Nov/2017:06:59:19,GET /profile.php?user=bala HTTP/1.1,200
10.128.2.1,[29/Nov/2017:06:59:19,GET /js/jquery.min.js HTTP/1.1,200
10.131.2.1,[29/Nov/2017:06:59:19,GET /js/chart.min.js HTTP/1.1,200
10.131.2.1,[29/Nov/2017:06:59:30,GET /edit.php?name=bala HTTP/1.1,200

We're going to ignore for a moment that these lines are comma-separated values and instead treat them just as a long string since logs normally don't come CSV formatted.

Implementation

Let's start this by loading in the data we've obtained from Kaggle. The file we obtained is technically a CSV file, but we don't necessarily care about the individual columns.

from typing import List, Optional, Any

def load_data():
    content = open("weblog.csv", "r").read()
    logs = content.split("\n")[1:]
    return logs

Data loading and pattern definition

For simplicity we'll load the entire file. In a production system you would most likely want to stream the data and not load it in its entirety into memory.

Defining Patterns

As stated above we want to use regular expressions to categorize or tag each line. Naïvely we can define a simple mapping from regular expression to "tag" or "category".

RULES = {
    r"^10\.128.*min\.js": "js",
    r"^10\.13[01].*\.php": "php",
    r"\.woff": "font",
    r"\.min\.": "minimized",
    r"\.css": "stylesheet",
    r"login": "login",
    r"bootstrap": "bootstrap",
    r"\/vendor\/": "vendored-js"
}

PATTERNS = list(RULES.keys())
TAG_LOOKUP = list(RULES.values())

re

Python comes with two major regex libraries, re and regex. The former being the built-in and the latter a popular re compatible replacement. re only supports single matches which means that we'll need to check each regex against each line.

It's good practice to pre-compile patterns when you do a lot of matching. Python has an internal cache when you compile a pattern for the first time, but it's easy to hit the max cache size, causing recompilation.

RE_PATTERNS = [re.compile(pattern) for pattern in PATTERNS]

def re_search(lines: List[str]) -> List[str]:
    result = []
    for line in lines:
        for idx, pattern in enumerate(RE_PATTERNS):
            if pattern.search(line):
                result.append(TAG_LOOKUP[idx])

    return result

re search

Note that we're using search instead of match on the pattern since we care about matching anywhere in the search string and not just at the beginning.

re2

re2 is a C++ library built by Google. It's accessible through a library built by Facebook called fb-re2. re2 is a great alternative to re/regex if you're looking to do multi-matching since it comes with a Set feature. Sets allow for efficient multi-pattern matching, they however come with quite a drastic memory consumption overhead as the number of regexes increases.

As the regex set size increases, RE2::Set requires more memory to build out its DFA state cache as it scans, and as such, its average-case performance begins to degrade. From the peak memory usage chart, we can see that even for fairly small sets (~ 30 patterns), peak memory usage for RE2::Set quickly approaches the maximum value of 2GB, [..]
Source

The Python wrapper implements an additional test_search which tells the underlying re2 that we only care about whether or not our pattern matches the string. In practice this avoids a lot of memory allocations since captures aren't reported back.

The following code first compiles all patterns and then implements two functions, one for search and one for test_search.

RE2_PATTERNS = [re2.compile(pattern) for pattern in PATTERNS]

def re2_search(lines: List[str]) -> List[str]:
    result = []
    for line in lines:
        for idx, pattern in enumerate(RE2_PATTERNS):
            if pattern.search(line):
                result.append(TAG_LOOKUP[idx])

    return result

def re2_test_search(lines: List[str]) -> List[str]:
    result = []
    for line in lines:
        for idx, pattern in enumerate(RE2_PATTERNS):
            if pattern.test_search(line):
                result.append(TAG_LOOKUP[idx])

    return result

re2 search & test_search

The re2.Set version will look very similar. We first create an instance of re2.Set, add all the patterns and compile the full set. Patterns automatically get an auto-incrementing id assigned which we can again use to look up the corresponding tags. re2.Set.match returns a list of matching pattern ids.

RE2_SET = re2.Set()
for pattern in PATTERNS:
    RE2_SET.add(pattern)
RE2_SET.compile()

def re2_set_match(lines: List[str]):
    result = []
    for line in lines:
        if matched_ids := RE2_SET.match(line):
            result += [TAG_LOOKUP[id] for id in matched_ids]
    return result

re2.Set

hyperscan

Now finally, let's come to Hyperscan. Hyperscan works a bit differently in that you need to first build a database of patterns. Each pattern is accompanied by an id and some flags. Those flags work similar to regular flags for the other engines in that you can set your pattern to be case insensitive (HS_FLAG_CASELESS), match across multiple lines (HS_FLAG_MULTILINE) and set it to match at most once (HS_FLAG_SINGLEMATCH). The full list of flags is documented in the Hyperscan docs which I recommend checking out.

A benefit of building this database is that it can be serialised to disk and reused later. In larger projects I have worked on, we compiled up to 10,000 regular expressions into a single Hyperscan database. Compilation could take up to half a minute which isn't terrible if you're doing it only once, but if you have a highly parallelised system where each node has to compile the database it can quickly eat away a lot of resources.

import hyperscan

# Building database for hyperscan
db = hyperscan.Database()
patterns = [
    # expression,  id, flags
    (pattern.encode("utf-8"), id, hyperscan.HS_FLAG_CASELESS | hyperscan.HS_FLAG_SINGLEMATCH) for id, pattern in enumerate(PATTERNS)
]
expressions, ids, flags = zip(*patterns)
db.compile(
    expressions=expressions, ids=ids, elements=len(patterns), flags=flags
)

def hyperscan_match(lines: List[str]) -> List[str]:
    result = []
    def on_match(id: int, froms: int, to: int, flags: int, context: Optional[Any] = None) -> Optional[bool]:
        result.append(TAG_LOOKUP[id])

    for line in lines:
        db.scan(line.encode("utf-8"), match_event_handler=on_match)

    return result

Note that we set the HS_FLAG_SINGLEMATCH flag to ensure hyperscan won't report repeat matches of the same pattern. Matching test test test against the regex test would normally yield three matches.

Benchmarking

Now that we've defined our patterns, multiple methods of matching those patterns against our data it's time to benchmark the performance of the different methods. There are a lot of different ways we could do this in Python. I'm a proponent of using pytest-benchmark since it automatically calibrates benchmarking methods and reports sensible runtime statistics.

We can define a simple pytest fixture to read in the data and make it available in tests.

import pytest

@pytest.fixture
def data():
    content = open("weblog.csv", "r").read()
    logs = content.split("\n")[1:]
    return logs

def test_re_match(benchmark, data):
    benchmark(re_search, lines=data)

def test_re2_search(benchmark, data):
    benchmark(re2_search, lines=data)

def test_re2_test_search(benchmark, data):
    benchmark(re2_test_search, lines=data)

def test_re2_set_match(benchmark, data):
    benchmark(re2_set_match, lines=data)

def test_hyperscan(benchmark, data):
    benchmark(hyperscan_match, lines=data)

Benchmarking using pytest-benchmark

Executing this small suite of benchmarks we can very quickly get an idea on the performance of our different methods. Now bear in mind that we have started with a fairly small set of regular expressions and the benchmark performed here will look drastically different as we add more expressions to the set.

$ pytest test_regex.py --benchmark-columns=min,max,mean,ops


---------------------------------------- benchmark: 5 tests ---------------------------------------
Name (time in ms)            Min                Max               Mean                OPS          
---------------------------------------------------------------------------------------------------
test_re2_set_match       14.6448 (1.0)      22.9035 (1.0)      16.8640 (1.0)      59.2978 (1.0)    
test_hyperscan           18.6088 (1.27)     27.2647 (1.19)     20.9760 (1.24)     47.6734 (0.80)   
test_re2_test_search     34.5846 (2.36)     48.3055 (2.11)     39.0942 (2.32)     25.5792 (0.43)   
test_re2_search          45.1589 (3.08)     68.0620 (2.97)     50.7832 (3.01)     19.6915 (0.33)   
test_re_match            48.1215 (3.29)     67.3883 (2.94)     52.4394 (3.11)     19.0696 (0.32)   
---------------------------------------------------------------------------------------------------

In fact, watch what happens when we add 500 randomly sampled regular expressions to our rule set:

from random import randint, choices
for x in range(500):
    exts = ["php", "rb", "js", "rs", "ex", "jl", "ps", "pdf"]
    rule = f"^10\\.{randint(100, 100+x)}.*\\.{choices(exts)[0]}"
    RULES[rule] = "unused"

Generate 500 random regex patterns and match to an unused tag

$ pytest test_regex.py --benchmark-columns=min,max,mean,ops


-------------------------------------------- benchmark: 5 tests --------------------------------------------
Name (time in ms)               Min                   Max                  Mean                OPS          
------------------------------------------------------------------------------------------------------------
test_re2_set_match          15.4018 (1.0)         22.3990 (1.0)         17.6431 (1.0)      56.6794 (1.0)    
test_hyperscan              28.5287 (1.85)        45.0001 (2.01)        32.8266 (1.86)     30.4631 (0.54)   
test_re2_test_search     1,325.8610 (86.09)    1,384.9993 (61.83)    1,349.8584 (76.51)     0.7408 (0.01)   
test_re2_search          1,960.2369 (127.27)   2,019.0210 (90.14)    1,987.2566 (112.64)    0.5032 (0.01)   
test_re_match            5,604.3447 (363.88)   5,754.0317 (256.89)   5,663.8423 (321.02)    0.1766 (0.00)   
------------------------------------------------------------------------------------------------------------

Benchmark with 500+ regular expressions

re2.Set and hyperscan remain on top while all other methods quickly deteriorate with standard re being almost 400 times as slow as re2.Set match.

It's notable that re2.test_search is a measurable improvement over regular re2.search.

Conclusion

In this article we covered a number of different regular expression engines. re and re2 are great if you run the occasional regular expression. If you end up using a large number of regular expressions to parse thousands if not millions of records and don't need all the features of a backtracking regular expression engine and want to benefit from multi pattern matching you should explore RE2 and Hyperscan.

RE2 comes with a hefty memory overhead and, while faster in the benchmarks here, quickly tops out once you add more complex expressions or go past a few thousand expressions. In the systems I worked on RE2 consistently ran out of memory while compiling the expressions. Hyperscan didn't have those issues and was adopted for its stable performance.