Computing Context Triggered Piecewise Hashes in Rust

This post introduces my Rust wrapper for ssdeep by Jesse Kornblum, which is a C library and program for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length. In contrast to standard hashing algorithms, CTPH can be used to identify files that are highly similar but not identical.

Introduction and Motivation

To understand the motivation behind fuzzy hashes, we have to first take a brief look at traditional cryptographic hashing.

Traditional Cryptographic Hashing

Consider a file. By using a cryptographic hashing function, such as one from the SHA family, you can compute a fixed-length value called hash that identifies the file. For example, the SHA-1 hash of the GPLv3 license is

8624bcdae55baeef00cd11d5dfcfa60f68710a02

If you then receive a file, you can easily verify whether this file is the same as the one you already have. Indeed, simply obtain a hash of the received file and compare it with the original hash. If they match, you can be almost certain that it is the same file. Why only almost certain? Because the length of the hash value is fixed. So, theoretically, it may happen that two different files map to the same hash. Nevertheless, in practice, this is not an issue. At least with cryptographic hashing functions, whose purpose is to make it very difficult to find two files with different contents but equal hashes. This is useful in e.g. malware analysis. A malware analyst may simply publish a hash of a malicious sample and others may find it in malware databases, such as VirusTotal.

However, there is a catch related to how cryptographic hashing functions work. They try to ensure that even if you switch a single bit in the input file, the hash will be completely different. For example, if we remove a single word from our GPLv3 license (“either” from line 207), we obtain this SHA-1 hash:

f924f87802e7da1c95ef4693245d2ddad46d7aed

As you can see, the hashes are completely different. From one viewpoint, this is great because now we know that we have a different file, even if its contents look similar. However, from another viewpoint, this can be seen as a disadvantage. What if we would like to detect file similarity based on their hashes? For example, what if the author of a virus changes some unimportant parts of the binary file to random values? This brings us to so-called fuzzy hashing.

Fuzzy Hashing: ssdeep by Jesse Kornblum

In his paper from 2006, Jesse Kornblum introduced an algorithm for computing hashes that can be used to identify almost identical files. This new type of hashes, called fuzzy hashes (or context triggered piecewise hashes), differs from traditional cryptographic hashes in the following two ways:

The length of the hash is no longer fixed and depends on the length of the input file.
Parts of the hash correspond to distinct blocks of the file. When you make a change in a single part of the file, only the corresponding part of the fuzzy hash changes.

Jesse also implemented a C library and program, called ssdeep, which can be used to compute these fuzzy hashes. Let me use it to demonstrate the usefulness of fuzzy hashes. For example, the fuzzy hash of our GPLv3 license above is

768:Mo1acy3LTB2VsrHG/OfvMmnBCtLmJ9A7D:Mhcycsrfrnoue

If we remove a single word from our GPLv3 license (“either” from line 207), we obtain this hash:

768:Mo1acy3LTB2VsrHG/O4vMmnBCtLmJ9A7D:Mhcycsrwrnoue

By putting them together, we can see that they differ only very slightly:

768:Mo1acy3LTB2VsrHG/OfvMmnBCtLmJ9A7D:Mhcycsrfrnoue
768:Mo1acy3LTB2VsrHG/O4vMmnBCtLmJ9A7D:Mhcycsrwrnoue
                      ^

The program can also compare these hashes to tell us how similar the files are. The output is a number between 0 (no match) and 100 (identical files). The match score for our original and modified GPLv3 license is 99, which is expected as the only change we did was to remove a word.

The ssdeep program and its library are written in C. There exist ssdeep bindings for various programming languages, such as for Python, Ruby, or PHP. However, there has been no wrapper for the new kid on the block: Rust. Well, until now.

Introducing ssdeep-rs: Rust Wrapper for ssdeep

I have implemented and published a Rust wrapper for ssdeep as a library that you can use in your programs. Its source code is available on GitHub.

Usage

To compute the fuzzy hash of a given buffer, use the hash() function:

extern crate ssdeep;

let h = ssdeep::hash(b"Hello there!").unwrap();
assert_eq!(h, "3:aNRn:aNRn");

If you want to obtain the fuzzy hash of a file, you can use hash_from_file():

let h = ssdeep::hash_from_file("path/to/file").unwrap();

To compare two fuzzy hashes, use compare(), which returns an integer between 0 (no match) and 100:

let h1 = b"3:AXGBicFlgVNhBGcL6wCrFQEv:AXGHsNhxLsr2C";
let h2 = b"3:AXGBicFlIHBGcL6wCrFQEv:AXGH6xLsr2Cx";
let score = ssdeep::compare(h1, h2).unwrap();
assert_eq!(score, 22);

Each of these functions returns an Option, where None is returned when the underlying C function fails.

Installation

Add the following lines into your Cargo.toml file:

[dependencies]
ssdeep = "0.1.0"

Then, when you run cargo build, it will automatically get the wrapper’s source code from crates.io, compiler the underlying C library, and build the wrapper. The C library is statically linked into the wrapper.

The build process is known to work under Linux with GCC. If you have a different operating system or compiler and the build fails, feel free to submit a pull request or open an issue.

Documentation

An automatically generated API documentation is available here.

License

The code of the wrapper is licensed under the terms of GPLv3. This wrapper includes the unchanged source distribution of ssdeep version 2.13, which is compiled and statically linked into the wrapper during build. It is licensed under GPLv2.

Implementation

Let me close the present post by shortly describing the implementation. It is divided into two parts: low-level native bindings to the C library (called libfuzzy) and a high-level wrapper around them.

The low-level native bindings are represented by an internal libfuzzy-sys crate. The -sys suffix was chosen intentionally to conform to Rust conventions. The crate uses types from the libc crate to match C types. It exports the following three bindings:

extern "C" {
	// int fuzzy_compare(const char *sig1, const char *sig2);
	pub fn fuzzy_compare(sig1: *const libc::c_char,
	                     sig2: *const libc::c_char) -> libc::c_int;

	// int fuzzy_hash_buf(const unsigned char *buf, uint32_t buf_len, char *result);
	pub fn fuzzy_hash_buf(buf: *const libc::c_uchar,
	                      buf_len: libc::uint32_t,
	                      result: *mut libc::c_char) -> libc::c_int;

	// int fuzzy_hash_filename(const char *filename, char *result);
	pub fn fuzzy_hash_filename(filename: *const libc::c_char,
	                           result: *mut libc::c_char) -> libc::c_int;
}

Since the underlying libfuzzy library is written in C, we need to compile it to be able to later link it to the high-level crate. The compilation is performed by a custom build.rs script. It simply runs ./configure and make to build libfuzzy.a.

The implementation of the high-level wrapper is divided into two files: src/lib.rs and tests/tests.rs. As you may have guessed, the former contains the source code while the latter contains tests. The wrapper provides the following three functions that correspond to the three low-level functions above:

pub fn compare(hash1: &[u8], hash2: &[u8]) -> Option<i8>
pub fn hash(buf: &[u8]) -> Option<String>
pub fn hash_from_file<P: AsRef<Path>>(file_path: P) -> Option<String>

Since the low-level bindings expect null-terminated strings, the input hashes, buffer, and file path need to be converted to std::ffi::CString, which provides a handy method: as_bytes_with_nul(). Furthermore, fuzzy_hash_buf() and fuzzy_hash_filename() expect a pre-allocated buffer into which they store the computed hash. We use a Vec with reserved capacity, which is passed via its as_mut_ptr() method. The only quirk is that we have to adjust its length after it is populated with the hash because we are modifying the vector outside of Rust.

You can check out the full source code for more details.

Petr Zemek