Deployment of Linux Worker

Intro

Assemblage provides a range of builder and deployment tools to help you deploy it on your own system, either a local machine or a server. This section guides you through deploying Assemblage on a Debian-based system.

The system consists of three main components:

Coordinator, where tasks are packed and dispatched
Worker, where tasks are executed and binaries are built
DB, the database that stores records

Coordinator Setup

Please make sure the following are installed or available to you:

Docker and Docker Compose
Git
Ports 5672 and 50052, accessible by the worker instances
A GitHub account and its personal access token
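The tool requirements above can be checked before going further. The following is a minimal sketch, not part of Assemblage: the function names missing_tools and has_compose_plugin are made up for illustration, and the check only confirms that the docker and git executables are on PATH and that the `docker compose version` command exits successfully.

```python
import shutil
import subprocess


def missing_tools(tools=("docker", "git")):
    """Return the names from `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def has_compose_plugin():
    """Return True if `docker compose version` exits successfully."""
    if shutil.which("docker") is None:
        return False
    result = subprocess.run(
        ["docker", "compose", "version"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    elif not has_compose_plugin():
        print("Docker is installed, but `docker compose` is not available")
    else:
        print("Docker, Docker Compose and Git are available")
```

The script does not verify the GitHub token or the ports; those need to be checked separately.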

Warning

Put your server behind a firewall and restrict access to these ports, while making sure they remain accessible to your worker instances.
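To confirm that the firewall still lets workers through, a quick TCP check can be run from a worker machine. This is a minimal sketch under stated assumptions: the helper names port_is_open and check_coordinator are hypothetical, the host is passed on the command line (defaulting to 127.0.0.1 as a placeholder), and a successful TCP connection only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket
import sys

COORDINATOR_PORTS = (5672, 50052)  # the two ports listed in the requirements


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_coordinator(host, ports=COORDINATOR_PORTS):
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port) for port in ports}


if __name__ == "__main__":
    # Placeholder default; pass your coordinator's address as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    for port, ok in check_coordinator(host).items():
        print(f"{host}:{port} is {'reachable' if ok else 'NOT reachable'}")
```

Running the same script from a machine that is not a worker should report both ports as not reachable if the firewall is restrictive enough.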

Clone the repository

git clone git@github.com:Assemblage-Dataset/Assemblage.git
cd Assemblage

Modify the cluster file

Examples are located under example_workers. We will use example_cluster.py; in this file you can declare the workers, the crawlers, and the database connection.

Install the local dependencies and run the cluster file, which will create the Docker images.

pip install -r requirements.txt
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python3 example_cluster.py

Boot up the coordinators

docker compose up

Optional: Recover dataset

Assemblage can restore the state of a previous run and rebuild the binary dataset from it, which is useful when the binaries themselves cannot be distributed. To reload a previous state, download one of the following recipes (in JSON format), boot up the CLI, navigate to the loadrepo option, and provide the JSON file; the system will build the dataset from that file. Note that dataset recovery does not use the API version of Assemblage; switch to the linux_github branch instead.

linux_recipe.zip
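Before handing a recipe to the loadrepo option, it can help to confirm that the downloaded archive unpacks and that its contents parse as JSON. The sketch below is not part of Assemblage: it assumes that linux_recipe.zip contains one or more files ending in .json, it makes no assumption about the recipe schema, and load_recipes is a hypothetical helper name.

```python
import json
import os
import sys
import zipfile


def load_recipes(zip_path):
    """Parse every .json member of the archive; return {member_name: parsed_data}."""
    recipes = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".json"):
                with archive.open(name) as member:
                    # json.load raises json.JSONDecodeError on malformed input.
                    recipes[name] = json.load(member)
    return recipes


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "linux_recipe.zip"
    if not os.path.exists(path):
        print(f"{path} not found; pass the archive path as the first argument")
    else:
        for name, data in load_recipes(path).items():
            size = len(data) if isinstance(data, (list, dict)) else 1
            print(f"{name}: valid JSON with {size} top-level entries")
```

A parse error here points to a corrupted or incomplete download, which is cheaper to catch than a failed rebuild.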

Warning

The recovered dataset is not guaranteed to match the previous state: repositories may have been hidden or deleted in the meantime, so some binaries might not be recovered. In addition, to recover the dataset accurately from source code, a full git clone with all history is performed, which is extremely slow and resource-intensive.