Dataset Access
Overview
Assemblage provides datasets covering various build configurations, such as CPU architecture, compiler, and optimization flags. Because tool chains differ, the datasets are distributed mainly by source and tool chain, for example vcpkg on Windows and GitHub on Linux.
| Data source | Compiler | OS | Count | Licensed |
|---|---|---|---|---|
| GitHub | Visual Studio | Windows | 960k | 120k |
| GitHub | GCC/Clang | Linux | 428k | 211k |
| vcpkg | Visual Studio | Windows | 130k | 130k |
Note
Currently, we only release binaries built from repositories that have a license. Please adhere to the licenses of the original repositories when using the dataset.
Distribution Format
The dataset is distributed in two parts:
A compressed file of the binaries
A SQLite database containing the metadata (GitHub URL, function addresses, source code, etc.) of the binaries
The detailed schema is also available in the Datasheet. The database provides detailed information about the binaries, while the compressed file contains the binaries themselves. Each binary is stored at the location given by the path field in the database.
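As a minimal sketch of how the path field connects the two parts, the snippet below resolves each row of the binaries table to a file under the extracted archive and reports the rows whose file is missing. The id and path columns are the ones used in the queries on this page; the function names and the `dataset_root` argument are hypothetical.

```python
import sqlite3
from pathlib import Path


def resolve_binary_paths(conn, dataset_root):
    """Return {binary_id: Path} for every row of the binaries table."""
    root = Path(dataset_root)
    resolved = {}
    for binary_id, rel_path in conn.execute("SELECT id, path FROM binaries"):
        # Stored paths may use Windows separators; normalise them first.
        resolved[binary_id] = root / rel_path.replace("\\", "/")
    return resolved


def missing_binaries(conn, dataset_root):
    """Return the sorted ids whose file is not present under dataset_root."""
    paths = resolve_binary_paths(conn, dataset_root)
    return sorted(bid for bid, p in paths.items() if not p.is_file())
```

A typical call would be `missing_binaries(sqlite3.connect("path/to/sqlite.db"), "path/to/extracted")`, which should return an empty list when the archive was extracted completely.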
Use Assemblage with Python and SQLite
Loading function-related data from the SQLite database into Python can make repeated access faster. The following snippet loads the functions of one binary into a pandas DataFrame:
import sqlite3
import pandas as pd

binary_id = 1  # replace with the id of the binary you are interested in

conn = sqlite3.connect('path/to/sqlite.db')
# List every function of one binary with its start RVA, in address order.
df = pd.read_sql_query(
    """
    SELECT f.name, r.start
    FROM rvas r
    JOIN functions f ON r.function_id = f.id
    JOIN binaries b ON f.binary_id = b.id
    WHERE b.id = ?
    ORDER BY r.start ASC;
    """,
    conn,
    params=(binary_id,),
)
conn.close()
print(df.head())
Some other useful SQL queries:
-- Count the functions in binaries larger than 100 KB
-- (the threshold of 100 assumes the size column is stored in KB)
SELECT COUNT(*) FROM functions
WHERE binary_id IN (SELECT id FROM binaries WHERE size > 100);
-- Select binary information and RVA by function id:
SELECT f.id AS function_id, f.name, r.start,
b.id AS binary_id, b.toolset_version, b.optimization, b.github_url
FROM functions f
JOIN rvas r ON r.function_id = f.id
JOIN binaries b ON b.id = f.binary_id
WHERE f.id = some_id;
-- Dump all function names, RVA start addresses, and binary ids:
SELECT f.name, f.binary_id, r.start
FROM functions f JOIN rvas r ON f.id = r.function_id;
-- List function names and RVA starts for binary some_id, in ascending address order:
SELECT f.name, r.start
FROM rvas r
JOIN functions f ON r.function_id = f.id
JOIN binaries ON f.binary_id = binaries.id
WHERE binaries.id = some_id
ORDER BY r.start ASC;
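The `some_id` placeholder in the queries above has to be replaced with a real id. From Python, binding it as a query parameter avoids building SQL by string formatting. The sketch below runs the last query against a tiny in-memory database; the three tables mirror only the columns referenced on this page (the real schema has more), and the demo rows are made up.

```python
import sqlite3

QUERY = """
SELECT f.name, r.start
FROM rvas r
JOIN functions f ON r.function_id = f.id
JOIN binaries b ON f.binary_id = b.id
WHERE b.id = ?
ORDER BY r.start ASC;
"""


def function_starts(conn, binary_id):
    """Return [(function_name, rva_start), ...] for one binary, sorted by start."""
    return conn.execute(QUERY, (binary_id,)).fetchall()


# Illustrative stand-in schema (assumption): only the columns used above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE binaries (id INTEGER PRIMARY KEY, file_name TEXT);
CREATE TABLE functions (id INTEGER PRIMARY KEY, name TEXT, binary_id INTEGER);
CREATE TABLE rvas (id INTEGER PRIMARY KEY, function_id INTEGER, start INTEGER);
INSERT INTO binaries VALUES (1, 'demo.exe');
INSERT INTO functions VALUES (10, 'main', 1), (11, 'helper', 1);
INSERT INTO rvas VALUES (100, 10, 8192), (101, 11, 4096);
""")
print(function_starts(conn, 1))
```

With the real database, only the `sqlite3.connect(...)` call changes; the `?` placeholder works the same way for the other queries.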
Dump SQL file
If you are not satisfied with SQLite's querying speed (which can be slow compared to database servers), you can dump the database to SQL and load it into your preferred database. In the sqlite3 command-line shell, with the database open:
.output assemblage.sql
.dump
.quit
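If the sqlite3 command-line shell is not available, Python's standard sqlite3 module can produce a comparable SQL dump through `Connection.iterdump()`. A minimal sketch; the function name and the file paths are placeholders.

```python
import sqlite3


def dump_to_sql(db_path, sql_path):
    """Write the whole SQLite database at db_path to sql_path as SQL text."""
    conn = sqlite3.connect(db_path)
    try:
        with open(sql_path, "w", encoding="utf-8") as out:
            # iterdump() yields one SQL statement at a time, so the dump
            # is streamed to disk instead of being held in memory.
            for statement in conn.iterdump():
                out.write(statement + "\n")
    finally:
        conn.close()
```

For example, `dump_to_sql("path/to/sqlite.db", "assemblage.sql")`. The output is in SQLite's SQL dialect, so it may need small adjustments before another database server accepts it.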
License information
We also provide the license information as a JSON file for your convenience (each GitHub URL maps to its license). The file can be found here
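Assuming the JSON file is a flat object that maps each GitHub URL to a license string, as described above, a lookup could be sketched as follows. The file path, the function names, and any license identifiers you pass in are hypothetical; check the actual file for the exact strings it uses.

```python
import json
from collections import Counter


def load_licenses(json_path):
    """Load the URL -> license mapping (assumed to be a flat JSON object)."""
    with open(json_path, encoding="utf-8") as fh:
        return json.load(fh)


def urls_with_license(licenses, wanted):
    """Return the sorted GitHub URLs whose license string is in `wanted`."""
    wanted = set(wanted)
    return sorted(url for url, lic in licenses.items() if lic in wanted)


def license_histogram(licenses):
    """Count how many repositories use each license string."""
    return Counter(licenses.values())
```

The URLs returned by `urls_with_license` can then be matched against the github_url column of the binaries table to restrict a query to repositories under specific licenses.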
Tips on PDB files
If you use PDB files with IDA Pro, place each binary and its PDB files together in one folder. The PDB file name can also matter for IDA to recognize that a PDB belongs to the binary. The following script flattens the dataset into one folder per binary:
import os
import shutil
import sqlite3

from tqdm import tqdm

dataset_path = "path/to/dataset"  # placeholder: root of the extracted archive
flatten_dir = "path/to/flattened"  # placeholder: output root, one folder per binary id

connection = sqlite3.connect("db.sqlite")
cursor = connection.cursor()
infos = cursor.execute('SELECT id, path, file_name, optimization, github_url, toolset_version FROM binaries;')
for binid, path, file_name, opt, github_url, toolset_version in tqdm(infos):
    # Copy the binary into its own folder, named after the binary id.
    full_path = os.path.join(dataset_path, path.replace("\\", "/"))
    target_dir = os.path.join(flatten_dir, str(binid))
    os.makedirs(target_dir, exist_ok=True)
    shutil.copy(full_path, os.path.join(target_dir, file_name))
    # Copy every PDB that belongs to this binary next to it.
    subcursor = connection.cursor()
    pdbs = subcursor.execute('SELECT DISTINCT(pdb_path) FROM pdbs WHERE binary_id = ?', (binid,))
    for (pdb_path,) in pdbs:
        pdb_rel = pdb_path.replace("\\", "/")
        # Keep only the part after the last underscore of the stored PDB file name.
        pdb_name = os.path.basename(pdb_rel).split("_")[-1]
        shutil.copy(os.path.join(dataset_path, pdb_rel), os.path.join(target_dir, pdb_name))
connection.close()
Warning
Linux dataset GCC -Oz optimization (updated November 2025): An issue has been identified with the -Oz flag in the Linux dataset. Binaries labeled as built with -Oz may actually have been compiled with -Os, or with the compiler flags defined in the repository's original Makefile. To address this, the dataset has been updated with newly generated GCC -O2 binaries.
Dataset Access
The dataset is hosted on Hugging Face. Due to file size limits, we are deprecating the Kaggle hosting.
Sample dataset (~600 binaries, 500MB):
https://www.kaggle.com/datasets/changliuh7rfs5/assemblage-sample
Windows GitHub dataset (~100k, last updated 2025 May 27th):
Windows vcpkg dataset (130k, last updated 2024 June 12th):
https://huggingface.co/datasets/changliu8541/Assemblage_vcpkgDLL
Linux GitHub dataset (250k, last updated 2026 Apr 5th):
https://huggingface.co/datasets/changliu8541/Assemblage_LinuxELF
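The Hugging Face repositories can be downloaded in the browser or with git. As one possible alternative, the sketch below uses the `snapshot_download` function from the optional `huggingface_hub` package (installed separately with pip). The helper names and the `local_dir` value are hypothetical, and the file layout inside each repository is not assumed here.

```python
from urllib.parse import urlparse


def repo_id_from_url(url):
    """Extract 'owner/name' from a https://huggingface.co/datasets/owner/name URL."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "huggingface.co" or len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


def download_dataset(url, local_dir):
    """Download every file of the dataset repository into local_dir."""
    # Imported lazily so the URL helper works without the optional dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id_from_url(url),
        repo_type="dataset",
        local_dir=local_dir,
    )
```

For example, `download_dataset("https://huggingface.co/datasets/changliu8541/Assemblage_LinuxELF", "assemblage_linux")`. The datasets are large, so check the available disk space first.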