RanDS

PE Static Raw Strings Dataset

In this dataset, we use the pefile Python module to extract string data from PE files through static analysis, as explained below.

Search the file’s binary data for readable text:
- ASCII strings: sequences of printable characters (space to tilde) with a minimum length of 3 characters.
- UTF-16 LE strings: sequences in which each printable character is followed by a null byte, also with a minimum length of 3 characters.
Decode the found strings into normal text and remove null bytes from UTF-16 sequences.
Remove punctuation, brackets, quotes, slashes, and other special symbols.
Convert all text to lowercase to standardize it.
Split the text into words and keep only words with at least three characters to filter out short, less meaningful words.
Remove duplicate words so that each word appears only once within the representation of a file.
Join the cleaned words back together into a single line of text.
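
The steps above can be sketched in Python as follows. This is a minimal illustration, not the exact code used to build the dataset: the function name extract_words is hypothetical, punctuation is assumed to be replaced by spaces (rather than deleted), and "special symbols" is assumed to mean the ASCII punctuation set in string.punctuation.

```python
import re
import string

MIN_LEN = 3  # minimum run length and minimum word length, per the steps above

# Runs of printable ASCII characters (space through tilde).
ASCII_RE = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)
# Runs of printable characters that are each followed by a null byte (UTF-16 LE).
UTF16_RE = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN)

# Assumption: punctuation is replaced by spaces rather than deleted.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def extract_words(data: bytes) -> str:
    """Return one line of unique, lowercase words found in raw bytes."""
    found = [m.group().decode("ascii") for m in ASCII_RE.finditer(data)]
    # Decoding as UTF-16 LE drops the interleaved null bytes.
    found += [m.group().decode("utf-16-le") for m in UTF16_RE.finditer(data)]

    text = " ".join(found).translate(PUNCT_TABLE).lower()

    # dict.fromkeys keeps the first occurrence of each word, in order.
    words = dict.fromkeys(w for w in text.split() if len(w) >= MIN_LEN)
    return " ".join(words)
```

For a file on disk, the function could be called as extract_words(open(path, "rb").read()).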

The dataset is provided as a compressed ZIP file containing two CSV files and one folder.

Benign.csv and Ransomware.csv store metadata and references to the corresponding samples.
The folder, named dataset, contains the processed text representations of the executable files.
Inside the dataset folder, the samples are organized in a two-letter directory structure derived from the first two characters of each file’s SHA-256 hash.
Each subfolder contains a text file for an individual sample, which stores the cleaned and normalized strings extracted from the raw PE files.
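
As an illustration of the directory layout described above, the location of a sample can be derived from its hash. This is a sketch under assumptions: the helper name sample_path is hypothetical, and the file name is assumed to be the full SHA-256 hash plus an extension, which should be checked against the extracted archive.

```python
from pathlib import Path


def sample_path(dataset_root: str, sha256: str, suffix: str = ".txt") -> Path:
    """Build the expected location of a sample inside the dataset folder."""
    # The subfolder name is the first two characters of the hash.
    # Assumption: the file itself is named "<sha256><suffix>".
    return Path(dataset_root) / sha256[:2] / f"{sha256}{suffix}"
```

The same helper would apply to the JSON-based datasets by passing suffix=".json".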

Note: The ZIP file may be encrypted with the password "infected" to make sharing and cloud storage easier. The files are safe and do not include any executable content.
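
A password-protected archive can be unpacked with Python's standard zipfile module, as sketched below. The function name unpack is hypothetical. Note that zipfile can only decrypt the legacy ZIP encryption scheme; if the archive uses AES, a tool such as 7-Zip would be needed instead.

```python
import zipfile


def unpack(archive: str, target: str, password: str = "infected") -> list:
    """Extract the archive into target and return the names of its members."""
    with zipfile.ZipFile(archive) as zf:
        # The password is only used for members that are actually encrypted.
        zf.extractall(path=target, pwd=password.encode("utf-8"))
        return zf.namelist()
```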

Dataset size:

Compressed: ~7GB
Uncompressed: ~13GB

Download dataset

PE Static English Strings Dataset

In this dataset, the Enchant Python module is used to extract only meaningful English strings from the PE Static Raw Strings Dataset.

The PE Static Raw Strings Dataset may contain gibberish and meaningless strings, which can still be useful in certain cases. However, when only meaningful strings are required, this dataset provides a filtered version that keeps only English words. The filtering was performed using the Python Enchant module, as described below:

Each string file was processed to extract raw string data.
From the extracted strings, all possible substrings with a minimum length of three characters and up to a maximum of twenty characters were generated.
The generated substrings were compared against an English dictionary using the Enchant library to identify valid English words.
Only valid substrings recognized as dictionary words were retained.
Non-meaningful substrings and words shorter than three characters were discarded.
Redundant words were eliminated so that each word appeared only once within the representation of a file.
The retained words were consolidated into a clean text representation for each sample.
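
The filtering steps above can be sketched as follows. This is an illustration under assumptions, not the original implementation: the function names are hypothetical, substrings are assumed to be generated within each whitespace-separated token, and the en_US dictionary tag is a guess. The dictionary check is passed in as a callable so the logic can be tried without PyEnchant installed.

```python
from typing import Callable, Iterable

MIN_LEN, MAX_LEN = 3, 20  # substring length bounds from the steps above


def substrings(token: str) -> Iterable[str]:
    """Yield every substring of token whose length is within the bounds."""
    for start in range(len(token)):
        longest = min(MAX_LEN, len(token) - start)
        for length in range(MIN_LEN, longest + 1):
            yield token[start:start + length]


def english_words(line: str, is_word: Callable[[str], bool]) -> str:
    """Return the unique dictionary words found inside the tokens of line."""
    kept = {}  # dict keeps insertion order, so the output order is stable
    for token in line.split():
        for candidate in substrings(token):
            if candidate not in kept and is_word(candidate):
                kept[candidate] = None
    return " ".join(kept)


def enchant_checker(tag: str = "en_US") -> Callable[[str], bool]:
    """Build a checker backed by PyEnchant (imported lazily)."""
    import enchant  # third-party: pip install pyenchant
    return enchant.Dict(tag).check
```

With PyEnchant installed, a line from the raw strings dataset could be filtered with english_words(line, enchant_checker()).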

The dataset is provided as a compressed ZIP file containing two CSV files and one folder.

Benign.csv and Ransomware.csv store metadata and references to the corresponding samples.
The folder, named dataset, contains the processed text representations of the executable files.
Inside the dataset folder, the samples are organized in a two-letter directory structure derived from the first two characters of each file’s SHA-256 hash.
Each subfolder contains a text file for an individual sample, which stores the cleaned and normalized English strings extracted from the PE Static Raw Strings Dataset.

Note: The ZIP file may be encrypted with the password "infected" to make sharing and cloud storage easier. The files are safe and do not include any executable content.

Dataset size:

Compressed: ~7GB
Uncompressed: ~13GB

Download dataset

PE Static APIs Dataset

In this dataset, we use the pefile Python module to extract imported and exported API names from PE files using static analysis.

pefile is used as follows:

The PE file headers are parsed while the analysis is limited to the required directories (imports and exports).
Imported functions are grouped by the dynamic-link libraries (DLLs) from which they are called.
Exported functions are grouped under the corresponding module name.
The extracted data for each sample is formatted into a consistent structure that includes both imports and exports.
The structured data is saved as a JSON file, with one file produced per PE sample.
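
The steps above roughly correspond to the following pefile usage. This is a sketch, not the original extraction script: the function names, the "imports" and "exports" JSON keys, and the "ordinal_N" placeholder for functions without a name are all assumptions made for illustration.

```python
import json


def _text(value) -> str:
    """pefile returns names as bytes; decode them for JSON output."""
    if isinstance(value, bytes):
        return value.decode("ascii", errors="replace")
    return str(value)


def collect_apis(pe) -> dict:
    """Group imports by DLL and exports by module name for a parsed PE object."""
    imports = {}
    for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
        names = imports.setdefault(_text(entry.dll), [])
        for imp in entry.imports:
            # Functions imported by ordinal have no name.
            names.append(_text(imp.name) if imp.name else f"ordinal_{imp.ordinal}")

    exports = {}
    export_dir = getattr(pe, "DIRECTORY_ENTRY_EXPORT", None)
    if export_dir is not None:
        # Older pefile releases may not expose the module name here.
        module = _text(getattr(export_dir, "name", None) or "unknown")
        exports[module] = [
            _text(sym.name) if sym.name else f"ordinal_{sym.ordinal}"
            for sym in export_dir.symbols
        ]
    return {"imports": imports, "exports": exports}


def extract_to_json(pe_path: str, json_path: str) -> None:
    """Parse only the import and export directories and write one JSON file."""
    import pefile  # third-party: pip install pefile

    # fast_load skips the data directories; only the two needed are parsed.
    pe = pefile.PE(pe_path, fast_load=True)
    pe.parse_data_directories(directories=[
        pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_IMPORT"],
        pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_EXPORT"],
    ])
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(collect_apis(pe), fh, indent=2)
```

Restricting parsing to the two directories keeps the per-file cost low, which matters when processing a large collection of samples.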

The dataset is provided as a compressed ZIP file containing two CSV files and one folder.

Benign.csv and Ransomware.csv store metadata and references to the corresponding samples.
The folder, named dataset, contains the processed JSON representations of the executable files.
Inside the dataset folder, the samples are organized in a two-letter directory structure derived from the first two characters of each file’s SHA-256 hash.
Each subfolder contains a JSON file for an individual sample, which stores the imported and exported APIs and DLLs extracted from the raw PE files.

Note: The ZIP file may be encrypted with the password "infected" to make sharing and cloud storage easier. The files are safe and do not include any executable content.

Dataset size:

Compressed: ~300MB
Uncompressed: ~850MB

Download dataset

PE Static Demangled APIs Dataset

In this dataset, we use Demumble to demangle API calls from the PE Static APIs Dataset above.

The PE Static APIs Dataset stores API names exactly as they appear in the PE files, so C++ symbols remain mangled. If a demangled version is required, this dataset provides the demangled API strings.
For the demangling process, we used Demumble, which supports both Itanium and Microsoft Visual C++ symbols.
The structure and JSON format of this dataset follow the same organization as the PE Static APIs Dataset.
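
For readers who want to reproduce the demangling step, the sketch below shows one way to call the Demumble command-line tool from Python. It is not the script used to build the dataset: the function name demangle is hypothetical, the tool is assumed to be installed on the PATH as demumble, and the symbols are assumed to be passed one per line on standard input.

```python
import shutil
import subprocess
from typing import List


def demangle(symbols: List[str], tool: str = "demumble") -> List[str]:
    """Demangle symbols with the demumble CLI; return them unchanged if it is missing."""
    exe = shutil.which(tool)
    if exe is None or not symbols:
        return list(symbols)
    # Without arguments, demumble filters stdin and rewrites the mangled
    # names it finds, so one symbol is sent per line.
    result = subprocess.run(
        [exe],
        input="\n".join(symbols) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    # Fall back to the input if the output does not line up one-to-one.
    return lines if len(lines) == len(symbols) else list(symbols)
```

Names that are not mangled, such as plain Win32 API names, pass through Demumble unchanged.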

The dataset is provided as a compressed ZIP file containing two CSV files and one folder.

Benign.csv and Ransomware.csv store metadata and references to the corresponding samples.
The folder, named dataset, contains the processed JSON representations of the executable files.
Inside the dataset folder, the samples are organized in a two-letter directory structure derived from the first two characters of each file’s SHA-256 hash.
Each subfolder contains a JSON file for an individual sample, which stores the demangled import and export APIs and DLLs extracted from the raw PE files.

Note: The ZIP file may be encrypted with the password "infected" to make sharing and cloud storage easier. The files are safe and do not include any executable content.

Dataset size:

Compressed: ~300MB
Uncompressed: ~1GB

Download dataset

PE Behaviour Activities Dataset

This dataset was generated using CAPEv2 and Cuckoo Sandbox to execute PE files and capture their behavioral activities.

We focused on the following behavioral activities:

Accessed registry keys: Records keys in the Windows Registry that were read or queried by the PE file.
Created registry keys: Identifies new registry entries created, which may indicate persistence or configuration changes.
Accessed files: Tracks files opened or read during execution to reveal data gathering or reconnaissance attempts.
Created files: Captures files written to disk, often used for payload drops or log generation.
Deleted files: Lists files removed by the sample, which may indicate anti-forensic behavior.
Changed files: Detects files modified or overwritten during execution.
Network traffic IPs: Logs outbound and inbound IP connections to identify command-and-control servers or data exfiltration.
Network DNS lookups: Collects domain names queried by the sample to reveal external communication attempts.
Created processes: Monitors new processes spawned, often used for privilege escalation or persistence.
Killed processes: Identifies processes terminated by the sample, sometimes used to disable security tools.
Process tree: Shows parent-child process relationships for better understanding of execution flow.
Accessed mutexes: Tracks named synchronization objects opened by the sample, which may be used to detect multiple instances.
Created mutexes: Lists new mutexes created to mark infection or prevent multiple runs of the same malware.
Loaded modules: Captures DLLs and system modules loaded into memory during runtime.
Executed commands: Records command-line instructions run by the sample, which may indicate system manipulation.
Critical API calls: Highlights sensitive Windows API functions invoked, such as those used for process injection or memory manipulation.

Not all PE files could be executed because some were outdated or not supported by the sandbox architecture. The resulting activities of each analyzed PE file are stored in a JSON file, which contains 15 keys, with each key representing a specific activity entry.
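
A quick way to inspect one of these JSON files is to count the entries stored under each top-level key, as sketched below. The function name summarize_report is hypothetical, and no particular key names are assumed; the code simply reports whatever keys the file contains.

```python
import json
from collections import Counter


def summarize_report(path: str) -> Counter:
    """Count the entries stored under each top-level key of one report."""
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)
    counts = Counter()
    for key, value in report.items():
        # Lists and dicts are counted by size; scalars count as one entry.
        counts[key] = len(value) if isinstance(value, (list, dict)) else 1
    return counts
```

Calling summarize_report(path).most_common() lists the activity types of a sample from most to least populated.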

The dataset is provided as a compressed ZIP file containing two CSV files and one folder.

Benign.csv and Ransomware.csv store metadata and references to the corresponding samples.
The folder, named dataset, contains the processed JSON representations of the executable files.
Inside the dataset folder, the samples are organized in a two-letter directory structure derived from the first two characters of each file’s SHA-256 hash.
Each subfolder contains a JSON file for an individual sample, which stores all behavioral activities captured by executing the raw PE files in a sandbox.

Note: The ZIP file may be encrypted with the password "infected" to make sharing and cloud storage easier. The files are safe and do not include any executable content.

Dataset size:

Compressed: ~323MB
Uncompressed: ~2.16GB

Download dataset