Deployment on Windows
Intro
Deployment on a Linux system is more convenient. Deployment on Windows is possible but more complicated due to the nature of the Windows system, and Windows workers cannot be packed as images for distribution. To deploy Assemblage and harvest binaries, you need to deploy the coordinator to a server, then set up workers on Windows instances.
Coordinator Setup
Make sure the following are installed or available to you:
Docker
Docker Compose
Git
Ports 5672 and 50052
A GitHub account and a personal access token
Warning
Put your server behind a firewall and limit access to these ports, while making sure the ports remain accessible to your worker instances.
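To confirm from a worker instance that the firewall still lets it reach the coordinator, a quick TCP probe can help. The following is a minimal sketch, not part of Assemblage: the function names are hypothetical, and it only checks that a TCP connection can be opened on ports 5672 and 50052, not that the services behind them are healthy.

```python
import socket

# Ports named in the requirements list above.
COORDINATOR_PORTS = (5672, 50052)


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def check_coordinator(host, ports=COORDINATOR_PORTS):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in ports}
```

Running `check_coordinator("<your coordinator address>")` on a worker returns a dictionary keyed by port. A `False` value suggests the port is closed or blocked between that worker and the server.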
Note
By default, only repositories that have licenses will be used to build binaries
Clone the repository
git clone git@github.com:Assemblage-Dataset/Assemblage.git
cd Assemblage
Install the local dependencies and edit the crawler and coordinator configurations, which are located under the /assemblage/configure/ folder. You can also change the location of these config files; just remember to update the docker-compose.yml file accordingly.
mkdir assemblage/configure
nano assemblage/configure/coordinator_config.json
nano assemblage/configure/crawler_config.json
pip install -r requirements.txt
Build the docker image
sh build.sh
docker compose up -d
Worker Setup
The Windows worker requires many software packages, and installation is not as simple as for the Linux worker. Here is a step-by-step guide to setting up a Windows worker.
Install/build the software:
Python 3.9+
Git
MSVC Build Tools
Microsoft Visual Studio
CMake
7-Zip
Dia2dump
Universal Ctags
Make sure these executables are added to the PATH:
ctags
dia2dump
7z
readtags
msbuild
python
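A short script can check that the executables in this list resolve on the PATH before the worker is started. This is a minimal sketch using only the Python standard library. The helper name `find_missing` is hypothetical, and the check only confirms that each name can be located, not that the installed version is suitable.

```python
import shutil

# Executable names taken from the list above.
REQUIRED_TOOLS = ["ctags", "dia2dump", "7z", "readtags", "msbuild", "python"]


def find_missing(tools=REQUIRED_TOOLS):
    """Return the tools that shutil.which cannot locate on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required executables were found on PATH.")
```

On Windows, `shutil.which` honors `PATHEXT`, so a name such as `7z` matches `7z.exe`.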
Register the DLL file for dia2dump
# Need administrator privileges
regsvr32 "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\DIA SDK\bin\msdia140.dll"
regsvr32 "C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildCommunityTools\DIA SDK\bin\amd64\msdia140.dll"
Clone the repository
git clone git@github.com:Assemblage-Dataset/Assemblage.git
cd Assemblage
git checkout windows_github
Install the dependencies
pip install -r requirements.txt
Edit the worker configuration, which is located under the /assemblage/configure/ folder. Examples are provided in the repository.
mkdir assemblage/configure
nano assemblage/configure/worker_config.json
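Because the configuration is edited by hand, it can be worth confirming that the file still parses as JSON before starting the worker. The sketch below is an illustration only: `load_config` is a hypothetical helper, not part of Assemblage, and it makes no assumption about which keys the worker expects.

```python
import json
from pathlib import Path


def load_config(path):
    """Parse a JSON config file and return its contents.

    Raises ValueError with the line and column of the first syntax error.
    """
    # "utf-8-sig" tolerates the byte-order mark some Windows editors add.
    text = Path(path).read_text(encoding="utf-8-sig")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(
            f"{path}: invalid JSON at line {err.lineno}, "
            f"column {err.colno}: {err.msg}"
        ) from err
```

For example, `load_config("assemblage/configure/worker_config.json")` returns the parsed contents, or raises an error that points at the offending line.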
Run the worker
python start_worker.py --config assemblage/configure
Note
You can create boot-up tasks with Task Scheduler to start the worker automatically when the system starts, which is very useful for scaling up workers on cloud instances. Some scripts are provided under the script folder.
Optional: Recover dataset
Assemblage can recover its state from a previous run and rebuild the binary dataset from that state, which is useful if the binaries themselves cannot be distributed. To reload a previous state, download one of the following recipes (in JSON format), start the CLI, navigate to the loadrepo option, and provide the JSON file. The system will then build the dataset from the provided file.
sept25.json.zip
winpe_recipe.zip
Warning
The dataset restoration process does not guarantee generating exactly the same binaries: a repository may have been hidden or deleted, so some binaries might not be recovered, and compiling the same source code again will generate slightly different binaries. Note that, to accurately recover the dataset from source code, a full git clone with all history is performed, which is extremely slow and consumes a lot of bandwidth and CPU resources.