Step 1. Add the JitPack repository to your build file
Add it in your root settings.gradle at the end of repositories:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url 'https://jitpack.io' }
    }
}
Add it in your settings.gradle.kts at the end of repositories:
dependencyResolutionManagement {
    repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS)
    repositories {
        mavenCentral()
        maven { url = uri("https://jitpack.io") }
    }
}
Add it to your pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
Add it in your build.sbt at the end of resolvers:
resolvers += "jitpack" at "https://jitpack.io"
Add it in your project.clj at the end of repositories:
:repositories [["jitpack" "https://jitpack.io"]]
Step 2. Add the dependency
In your build.gradle:
dependencies {
    implementation 'com.github.gliwka:hyperscan-java:v5.4.11-3.1.0'
}
In your build.gradle.kts:
dependencies {
    implementation("com.github.gliwka:hyperscan-java:v5.4.11-3.1.0")
}
In your pom.xml:
<dependency>
    <groupId>com.github.gliwka</groupId>
    <artifactId>hyperscan-java</artifactId>
    <version>v5.4.11-3.1.0</version>
</dependency>
In your build.sbt:
libraryDependencies += "com.github.gliwka" % "hyperscan-java" % "v5.4.11-3.1.0"
In your project.clj:
:dependencies [[com.github.gliwka/hyperscan-java "v5.4.11-3.1.0"]]
Hyperscan-java provides Java bindings for Vectorscan, a fork of Intel's Hyperscan - a high-performance multiple regex matching library.
Vectorscan uses hybrid automata techniques to allow simultaneous matching of large numbers (up to tens of thousands) of regular expressions across streams of data with exceptional performance.
PatternFilter utility for full Java Regex API compatibility
The library is available on Maven Central. The version number consists of two parts (e.g., 5.4.11-3.1.0):
5.4.11 - the Vectorscan version
3.1.0 - the version of this library
<dependency>
    <groupId>com.gliwka.hyperscan</groupId>
    <artifactId>hyperscan</artifactId>
    <version>5.4.11-3.1.0</version>
</dependency>
implementation 'com.gliwka.hyperscan:hyperscan:5.4.11-3.1.0'
libraryDependencies += "com.gliwka.hyperscan" % "hyperscan" % "5.4.11-3.1.0"
The library offers two primary ways to use the regex matching capabilities:
PatternFilter is ideal when you need full Java Regex API compatibility. It uses Vectorscan to quickly identify potential matches, then confirms them with Java's standard Regex engine.
Use the direct API when your patterns fit within Vectorscan's own regex support.
Note: Vectorscan doesn't support backreferences, capture groups, or backtracking verbs.
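The filter-then-confirm idea behind PatternFilter can be sketched with the JDK alone. The sketch below is not the library's implementation: a plain substring check stands in for the Vectorscan prefilter, and the class TwoStageFilterSketch, its Candidate type, and its method names are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStageFilterSketch {
    // Hypothetical pairing of a pattern with a literal that any match must contain.
    private static final class Candidate {
        final String requiredLiteral;
        final Pattern pattern;

        Candidate(String requiredLiteral, Pattern pattern) {
            this.requiredLiteral = requiredLiteral;
            this.pattern = pattern;
        }
    }

    private final List<Candidate> candidates = new ArrayList<>();

    public void add(String requiredLiteral, Pattern pattern) {
        candidates.add(new Candidate(requiredLiteral, pattern));
    }

    // Stage 1: a cheap check keeps only the patterns that might match.
    // (Assumption: a substring test stands in for the real prefilter.)
    public List<Matcher> filter(String text) {
        List<Matcher> potentialMatches = new ArrayList<>();
        for (Candidate candidate : candidates) {
            if (text.contains(candidate.requiredLiteral)) {
                potentialMatches.add(candidate.pattern.matcher(text));
            }
        }
        return potentialMatches;
    }

    // Stage 2: java.util.regex confirms each candidate and extracts group 1.
    public static List<String> confirmedGroups(List<Matcher> potentialMatches) {
        List<String> groups = new ArrayList<>();
        for (Matcher matcher : potentialMatches) {
            while (matcher.find()) {
                if (matcher.groupCount() > 0) {
                    groups.add(matcher.group(1));
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        TwoStageFilterSketch filter = new TwoStageFilterSketch();
        filter.add("number is", Pattern.compile("The number is ([0-9]+)"));
        filter.add("color is", Pattern.compile("The color is (blue|red|orange)"));

        List<Matcher> potential = filter.filter("The number is 42 and the color is green");
        System.out.println("Candidates after stage 1: " + potential.size());
        System.out.println("Confirmed groups: " + confirmedGroups(potential));
    }
}
```

In this sketch the color pattern survives stage 1 but is rejected in stage 2, which is why the confirmation step with the standard Regex engine is needed after any approximate prefilter.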
import com.gliwka.hyperscan.util.PatternFilter;

import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static java.util.Arrays.asList;

public class PatternFilterExample {
    public static void main(String[] args) {
        // Create a list of Java regex patterns (could be thousands)
        List<Pattern> patterns = asList(
            Pattern.compile("The number is ([0-9]+)", Pattern.CASE_INSENSITIVE),
            Pattern.compile("The color is (blue|red|orange)"),
            Pattern.compile("\\w+@\\w+\\.com")
            // imagine thousands more patterns here
        );

        try (PatternFilter filter = new PatternFilter(patterns)) {
            String text = "The number is 42 and the NUMBER is 123. Contact info@example.com";

            // Quickly filter to just the potentially matching patterns
            List<Matcher> potentialMatches = filter.filter(text);

            // Use Java's Regex API to confirm matches and extract groups
            for (Matcher matcher : potentialMatches) {
                while (matcher.find()) {
                    System.out.println("Pattern: " + matcher.pattern());
                    System.out.println("Match: " + matcher.group(0));

                    // Handle capture groups if present
                    if (matcher.groupCount() > 0) {
                        System.out.println("Captured: " + matcher.group(1));
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
import com.gliwka.hyperscan.wrapper.*;

import java.util.EnumSet;
import java.util.LinkedList;
import java.util.List;

public class DirectHyperscanExample {
    public static void main(String[] args) {
        // Define expressions to match
        LinkedList<Expression> expressions = new LinkedList<>();

        // Expression(pattern, flags, id)
        expressions.add(new Expression("[0-9]{5}", EnumSet.of(ExpressionFlag.SOM_LEFTMOST), 0));
        expressions.add(new Expression("test", EnumSet.of(ExpressionFlag.CASELESS), 1));
        expressions.add(new Expression("example\\.(com|org|net)", EnumSet.of(ExpressionFlag.SOM_LEFTMOST), 2));

        try (Database db = Database.compile(expressions)) {
            try (Scanner scanner = new Scanner()) {
                // Allocate scratch space matching the database
                scanner.allocScratch(db);

                // Scan text against all patterns simultaneously
                String text = "12345 is a zip code. Test this at example.com!";
                List<Match> matches = scanner.scan(db, text);

                for (Match match : matches) {
                    Expression matchedExpression = match.getMatchedExpression();
                    System.out.println("Pattern: " + matchedExpression.getExpression());
                    System.out.println("Pattern ID: " + matchedExpression.getId());

                    // Note: start and end positions are character indices (inclusive)
                    System.out.println("Start position: " + match.getStartPosition());
                    System.out.println("End position: " + match.getEndPosition());

                    // The matched string is only available if SOM_LEFTMOST flag was used
                    if (matchedExpression.getFlags().contains(ExpressionFlag.SOM_LEFTMOST)) {
                        System.out.println("Matched text: " + match.getMatchedString());
                    }
                    System.out.println("---");
                }

                // You can also check whether any pattern matches without getting match details
                boolean hasAnyMatch = scanner.hasMatch(db, text);
                System.out.println("Has any match: " + hasAnyMatch);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
import com.gliwka.hyperscan.wrapper.*;

import java.util.EnumSet;

public class ValidationExample {
    public static void main(String[] args) {
        // Create an expression to validate
        Expression expr = new Expression(
            "a++", // This pattern uses a feature not supported by Hyperscan
            EnumSet.of(ExpressionFlag.UTF8)
        );

        // Check if the expression is valid for Hyperscan
        Expression.ValidationResult result = expr.validate();
        if (result.isValid()) {
            System.out.println("Pattern is valid");
        } else {
            System.out.println("Pattern is invalid: " + result.getErrorMessage());
        }

        // Common expression flags:
        // - ExpressionFlag.CASELESS - Case insensitive matching
        // - ExpressionFlag.DOTALL - Dot (.) matches newlines
        // - ExpressionFlag.MULTILINE - ^ and $ match on line boundaries
        // - ExpressionFlag.UTF8 - Pattern and input are UTF-8
        // - ExpressionFlag.SOM_LEFTMOST - Track start of match (enables getMatchedString())
        // - ExpressionFlag.PREFILTER - Optimize for pre-filtering (used by PatternFilter)
    }
}
import com.gliwka.hyperscan.wrapper.*;

import java.io.*;
import java.util.EnumSet;
import java.util.List;

public class DatabaseSerializationExample {
    public static void main(String[] args) {
        try {
            // Define the expression to compile into a database
            Expression expr = new Expression("\\w+@\\w+\\.(com|org|net)",
                EnumSet.of(ExpressionFlag.SOM_LEFTMOST));

            // Compile the database and save it to a file
            try (Database db = Database.compile(expr);
                 OutputStream out = new FileOutputStream("email_patterns.db")) {
                db.save(out);
                System.out.println("Database saved, size: " + db.getSize() + " bytes");
            }

            // Loading from a file later (much faster than recompiling)
            try (InputStream in = new FileInputStream("email_patterns.db");
                 Database loadedDb = Database.load(in);
                 Scanner scanner = new Scanner()) {
                scanner.allocScratch(loadedDb);
                List<Match> matches = scanner.scan(loadedDb, "Contact us at info@example.com");
                System.out.println("Found " + matches.size() + " matches");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Vectorscan operates on bytes (exclusive end index), but Scanner.scan() returns Match objects with inclusive character indices:
getStartPosition() - Character index where the match starts
getEndPosition() - Character index where the match ends (inclusive)
This is especially important when working with UTF-8 text, where byte and Java character positions differ. If you're not interested in Java characters, you might see some performance improvement from using a lower-level API (see below).
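To make the difference between byte offsets and Java character indices concrete, here is a minimal JDK-only sketch. It does not show how the library performs the mapping internally; the class ByteToCharIndexSketch and its methods are hypothetical, and the sketch assumes well-formed text (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

public class ByteToCharIndexSketch {
    // Number of bytes a code point occupies in UTF-8.
    public static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Maps a UTF-8 byte offset to the index of the Java char that starts there.
    // Assumes byteOffset falls on a code point boundary.
    public static int charIndexForByteOffset(String text, int byteOffset) {
        int bytesSeen = 0;
        int charIndex = 0;
        while (charIndex < text.length() && bytesSeen < byteOffset) {
            int codePoint = text.codePointAt(charIndex);
            bytesSeen += utf8Length(codePoint);
            charIndex += Character.charCount(codePoint);
        }
        return charIndex;
    }

    // Converts an exclusive byte end offset to an inclusive character end index.
    public static int inclusiveCharEnd(String text, int byteEndExclusive) {
        return charIndexForByteOffset(text, byteEndExclusive) - 1;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo w\u00f6rld"; // two characters take 2 bytes each in UTF-8
        int byteLength = text.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("chars=" + text.length() + ", bytes=" + byteLength);

        // The last word spans bytes [7, 13) but characters 6..10 (inclusive).
        System.out.println("start char: " + charIndexForByteOffset(text, 7));
        System.out.println("end char (inclusive): " + inclusiveCharEnd(text, 13));
    }
}
```

For pure ASCII input both numbering schemes agree apart from the inclusive versus exclusive end, so the conversion only costs anything once multi-byte characters appear.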
Use a separate Scanner (scratch space) instance for each thread
Database instances are thread-safe for scanning
Call close() on Scanner and Database instances
In addition to the default scanning methods that return Match objects, hyperscan-java provides callback-based scanning methods for more efficient processing:
String-based scanning with callbacks:
void scan(Database db, String input, StringMatchEventHandler eventHandler)
Return false from the callback to stop scanning early
Byte-oriented scanning with callbacks:
void scan(Database db, byte[] input, ByteMatchEventHandler eventHandler)
void scan(Database db, ByteBuffer input, ByteMatchEventHandler eventHandler)
// String-based scanning with callback
scanner.scan(db, inputString, (expression, fromIndex, toIndex) -> {
    System.out.printf("Match for pattern '%s' at positions %d-%d: %s%n",
        expression.getExpression(),
        fromIndex,
        toIndex,
        inputString.substring((int) fromIndex, (int) toIndex + 1)); // +1 because toIndex is inclusive
    return true; // continue scanning
});

// Byte-oriented scanning with callback
byte[] inputBytes = "sample text".getBytes(StandardCharsets.UTF_8);
scanner.scan(db, inputBytes, (expression, fromByteIdx, toByteIdxExclusive) -> {
    System.out.printf("Match for pattern '%s' at byte positions %d-%d%n",
        expression.getExpression(),
        fromByteIdx,
        toByteIdxExclusive - 1); // -1 to convert to inclusive index for display
    return true; // continue scanning
});
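The early-stop contract (return false from the callback to end the scan) can be tried out without the native library. The following sketch uses java.util.regex as a stand-in for the scanner; the MatchHandler interface and the scan and firstN helpers are hypothetical and only mirror the shape of the callbacks described above, not the library's own handler types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EarlyStopCallbackSketch {
    // Hypothetical handler: return true to continue, false to stop scanning.
    @FunctionalInterface
    public interface MatchHandler {
        boolean onMatch(int fromIndex, int toIndexInclusive);
    }

    // Reports each match to the handler and stops as soon as it returns false.
    // Returns the number of matches that were delivered.
    public static int scan(Pattern pattern, String input, MatchHandler handler) {
        Matcher matcher = pattern.matcher(input);
        int delivered = 0;
        while (matcher.find()) {
            delivered++;
            if (!handler.onMatch(matcher.start(), matcher.end() - 1)) {
                break;
            }
        }
        return delivered;
    }

    // Collects at most `limit` matched substrings by stopping the scan early.
    public static List<String> firstN(Pattern pattern, String input, int limit) {
        List<String> found = new ArrayList<>();
        if (limit <= 0) {
            return found;
        }
        scan(pattern, input, (from, to) -> {
            found.add(input.substring(from, to + 1)); // +1 because `to` is inclusive
            return found.size() < limit;
        });
        return found;
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("[0-9]+");
        System.out.println(firstN(digits, "a1 b22 c333 d4444", 2));
    }
}
```

Stopping from inside the callback avoids both the remaining scan work and the allocation of result objects for matches the caller does not need.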
Use byte-oriented methods when you want to avoid creating Match objects for very frequent matches.
Use string-oriented methods when you need match positions as Java character indices.
This library ships with native binaries for:
Windows is no longer supported after version 5.4.0-2.0.0 due to Vectorscan dropping Windows support.
Feel free to raise issues or submit pull requests. Please see the native libraries repository here.
Special thanks to @eliaslevy, @krzysztofzienkiewicz, @swapnilnawale, @mmimica, @Jiar and @apismensky for their contributions.
Thanks to Intel for originally open-sourcing Hyperscan and @VectorCamp for actively maintaining the Vectorscan fork!