Deployment of Linux Worker

Intro

Assemblage provides a range of build and deployment tools to help you deploy it to your own system, either on your local machine or on a server. This section guides you through deploying Assemblage to a Debian-based system.

The system consists of three main components:

  1. Coordinator, where tasks are packed and dispatched

  2. Worker, where tasks are executed and binaries are built

  3. DB, the database that stores records

Coordinator Setup

Make sure the following are installed or available to you:

  1. Docker and Docker Compose

  2. Git

  3. Ports 5672 and 50052, accessible from the worker instances

  4. A GitHub account and a personal access token for it

Warning

Put your server behind a firewall and restrict access to these ports, while making sure they remain accessible from your worker instances.
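Before starting the workers, it can help to confirm from a worker machine that both coordinator ports accept TCP connections. The following is a minimal sketch using only the Python standard library; the function names and the example host are hypothetical and are not part of Assemblage. A successful connection only shows that the port is open through the firewall, not that the service behind it is healthy.

```python
import socket

# Ports listed in the prerequisites above.
COORDINATOR_PORTS = (5672, 50052)


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_coordinator(host: str, ports=COORDINATOR_PORTS) -> dict:
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in ports}


# Example usage (replace the placeholder with your coordinator's address):
#   for port, ok in check_coordinator("coordinator.example.org").items():
#       print(port, "reachable" if ok else "NOT reachable")
```

Run this on a worker instance rather than on the coordinator itself, since a firewall rule can block remote access while local connections still succeed.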

  1. Clone the repository

    git clone git@github.com:Assemblage-Dataset/Assemblage.git
    cd Assemblage
    
  2. Modify the cluster file

    Examples are located under example_workers. We will use example_cluster.py; in this file you can declare the workers, the crawlers, and the database connection.

  3. Install the local dependencies and run the cluster file, which creates the Docker images.

    pip install -r requirements.txt
    PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python3 example_cluster.py
    
  4. Boot up the coordinators

    docker compose up
    

Optional: Recover dataset

Assemblage can restore the state of a previous run and rebuild the binary dataset from it, which is useful when the binaries themselves cannot be distributed. To reload a previous state, download one of the recipes below (in JSON format), start the CLI, navigate to the loadrepo option, and provide the JSON file. The system will then build the dataset from that file. Note that dataset recovery does not use the API version of Assemblage; switch to the linux_github branch before recovering.

linux_recipe.zip
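Since loadrepo expects a JSON file, it may be worth unpacking the archive and checking that each recipe parses before starting a long rebuild. The sketch below uses only the Python standard library and assumes the archive contains files ending in .json, which is not confirmed by this page; the function name and paths are hypothetical. It checks only that each file is well-formed JSON, not that it matches the schema Assemblage expects.

```python
import json
import zipfile
from pathlib import Path


def extract_json_recipes(archive: str, dest: str) -> list:
    """Extract the archive into dest and return the files that parse as JSON."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest_dir)

    valid = []
    # Assumption: recipe files carry a .json extension.
    for path in sorted(dest_dir.rglob("*.json")):
        try:
            with path.open(encoding="utf-8") as fh:
                json.load(fh)
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            print(f"skipping {path}: {exc}")
            continue
        valid.append(path)
    return valid


# Example usage:
#   for recipe in extract_json_recipes("linux_recipe.zip", "recipes"):
#       print(recipe)  # pass one of these paths to the loadrepo option
```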

Warning

The recovered state is not guaranteed to match the original state: repositories may have been hidden or deleted since then, so some binaries might not be recovered. In addition, to accurately recover the dataset from source code, a full git clone with complete history is performed, which is extremely slow and resource-intensive.