Deployment on Windows

Intro

Deploying on Linux is more convenient. Deploying on Windows is possible but more complicated due to the nature of the Windows system, and Windows workers cannot be packaged as images for distribution. To deploy Assemblage and harvest binaries, deploy the coordinator to a server, then set up workers on Windows instances.

Coordinator Setup

Make sure the following are installed or available to you:

  1. Docker

  2. Docker Compose

  3. Git

  4. Ports 5672 and 50052

  5. A GitHub account and a personal access token

Warning

Put your server behind a firewall and restrict access to these ports, while making sure the ports remain reachable from your worker instances.
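To confirm from a worker instance that the coordinator ports are reachable through the firewall, a small TCP probe can help. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of Assemblage, and the host is whatever address your coordinator uses.

```python
import socket
import sys


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_coordinator_ports(host, ports=(5672, 50052)):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_reachable(host, port) for port in ports}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_ports.py <coordinator-host>
    for port, ok in check_coordinator_ports(sys.argv[1]).items():
        print(f"port {port}: {'reachable' if ok else 'NOT reachable'}")
```

A successful connection only shows that the port is open from that machine; it does not verify that the service behind it is configured correctly.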

Note

By default, only repositories that have a license are used to build binaries.

  1. Clone the repository

    git clone git@github.com:Assemblage-Dataset/Assemblage.git
    cd Assemblage
    
  2. Install the local dependencies and edit the crawler and coordinator configurations, which are located under the assemblage/configure/ folder of the repository. You can also move these config files elsewhere; just remember to update the docker-compose.yml file accordingly.

    mkdir -p assemblage/configure
    nano assemblage/configure/coordinator_config.json
    nano assemblage/configure/crawler_config.json
    pip install -r requirements.txt
    
  3. Build the Docker image and start the services

    sh build.sh
    docker compose up -d
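Before running the build step, it can be worth confirming that the two configuration files from step 2 exist and contain well-formed JSON. The sketch below checks only that; it makes no assumption about which keys Assemblage expects, and the function name is illustrative.

```python
import json
from pathlib import Path


def check_configs(config_dir,
                  names=("coordinator_config.json", "crawler_config.json")):
    """Map each file name to 'ok', 'missing', or an 'invalid JSON: ...' message."""
    results = {}
    for name in names:
        path = Path(config_dir) / name
        if not path.is_file():
            results[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            results[name] = "ok"
        except ValueError as exc:
            # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
            results[name] = f"invalid JSON: {exc}"
    return results


if __name__ == "__main__":
    for name, status in check_configs("assemblage/configure").items():
        print(f"{name}: {status}")
```

The same function can be pointed at the worker configuration by passing `names=("worker_config.json",)`.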
    

Worker Setup

The Windows worker requires many software packages, and its installation is less straightforward than that of the Linux worker. The following step-by-step guide sets up a Windows worker.

  1. Install/build the software:

    1. Python 3.9+

    2. Git

    3. MSVC Build Tools

    4. Microsoft Visual Studio

    5. CMake

    6. 7-Zip

    7. Dia2dump

    8. Universal Ctags

  2. Make sure these executables are added to the PATH

    1. ctags

    2. dia2dump

    3. 7z

    4. readtags

    5. msbuild

    6. python

  3. Register the msdia140.dll file for dia2dump

    # Requires administrator privileges; adjust the paths to match your Visual Studio installation
    regsvr32 "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\DIA SDK\bin\msdia140.dll"
    regsvr32 "C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildCommunityTools\DIA SDK\bin\amd64\msdia140.dll"
    
  4. Clone the repository

    git clone git@github.com:Assemblage-Dataset/Assemblage.git
    cd Assemblage
    git checkout windows_github
    
  5. Install the dependencies

    pip install -r requirements.txt
    
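  6. Edit the worker configuration, which is located under the assemblage/configure/ folder of the repository; examples are provided in the repository. The commands below assume a Unix-like shell such as Git Bash; in PowerShell or cmd, create the folder and edit the file with any editor.

    mkdir -p assemblage/configure
    nano assemblage/configure/worker_config.json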
    
  7. Run the worker

    python start_worker.py --config assemblage/configure
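If the worker fails to start, a common cause is that one of the executables from step 2 is not on the PATH. A minimal sketch of a check, using the standard-library `shutil.which` (which honors PATHEXT on Windows, so `7z` resolves to `7z.exe`); the function name is illustrative:

```python
import shutil

# Executables listed in step 2 of the worker setup.
REQUIRED_TOOLS = ("ctags", "dia2dump", "7z", "readtags", "msbuild", "python")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be resolved on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found on PATH.")
```

Run it from the same shell and user account that will start the worker, since the PATH can differ between sessions.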
    

Note

You can use Task Scheduler to create startup tasks that launch the worker automatically when the system boots, which is very useful when scaling up workers on cloud instances. Some scripts are provided under the script folder.
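As an illustration of registering such a startup task from a script, the sketch below builds a command line for the Windows `schtasks` utility. It is not one of the scripts shipped with the repository: the task name, the wrapper script path, and the choice to run as the SYSTEM account are all assumptions to adapt to your setup.

```python
def build_startup_task_command(task_name, script_path, run_as="SYSTEM"):
    """Build a schtasks command that runs script_path at system startup.

    script_path is a hypothetical wrapper (for example a .bat file) that
    changes into the repository and starts the worker.
    """
    return [
        "schtasks", "/Create",
        "/TN", task_name,            # task name shown in Task Scheduler
        "/TR", f'"{script_path}"',   # quoted so paths with spaces survive
        "/SC", "ONSTART",            # trigger: when the system starts
        "/RU", run_as,               # account that runs the task (assumption)
        "/F",                        # overwrite an existing task of the same name
    ]


if __name__ == "__main__":
    cmd = build_startup_task_command(
        "AssemblageWorker",                  # hypothetical task name
        r"C:\Assemblage\start_worker.bat",   # hypothetical wrapper script
    )
    print(cmd)
    # To register the task, run from an elevated prompt on Windows:
    #   import subprocess; subprocess.run(cmd, check=True)
```

The command is only printed here, so it can be reviewed before anything is registered on the machine.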

Optional: Recover dataset

Assemblage can restore the state of a previous run and rebuild the binary dataset from it, which is useful when the binaries themselves cannot be distributed. To reload a previous state, download one of the following recipes (in JSON format), start the CLI, navigate to the loadrepo option, and provide the JSON file. The system will then build the dataset from the provided file.

sept25.json.zip winpe_recipe.zip
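The recipes are distributed as zip archives, so they need to be unpacked before the JSON file is passed to the loadrepo option. A minimal sketch, assuming each archive contains one or more `.json` files; it checks only that they parse as JSON and assumes nothing about the recipe schema. The function name is illustrative.

```python
import json
import zipfile
from pathlib import Path


def extract_recipe(zip_path, out_dir):
    """Extract a recipe archive and return the .json files that parse cleanly."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
        names = [n for n in archive.namelist() if n.lower().endswith(".json")]

    valid = []
    for name in names:
        path = out / name
        try:
            with path.open(encoding="utf-8") as fh:
                json.load(fh)
        except ValueError:
            # Skip files that are not well-formed JSON.
            continue
        valid.append(path)
    return valid


if __name__ == "__main__":
    import sys
    if len(sys.argv) == 3:
        # Usage: python extract_recipe.py <recipe.zip> <output-dir>
        for path in extract_recipe(sys.argv[1], sys.argv[2]):
            print(path)
```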

Warning

The dataset restoration process does not guarantee exactly the same binaries: repositories may have been hidden or deleted, so some binaries might not be recovered, and compiling the same source code again will generate slightly different binaries. Please note that, to recover the dataset accurately from source code, a full git clone with all history is performed, which is extremely slow and consumes substantial bandwidth and CPU resources.