Deployment of Linux Worker
Intro
Assemblage provides a range of build and deployment tools to help you deploy it on your own system, whether that is a local machine or a server. This section walks through deploying Assemblage on a Debian-based system.
The system consists of three main components:
Coordinator, where tasks are packed and dispatched
Worker, where tasks are executed and binaries are built
DB, the database that stores records
Coordinator Setup
Make sure the following are installed or available to you:
Docker and Docker Compose
Git
Ports 5672 and 50052, accessible by the worker instances
A GitHub account, and its personal access token
Warning
Put your server behind a firewall and restrict access to these ports, while making sure they remain accessible to your worker instances.
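The prerequisites above can be checked with a small pre-flight script before continuing. The following is a minimal sketch, not part of Assemblage: the function names missing_tools and check_ports are hypothetical, and it only verifies that the docker and git executables are on PATH and that a TCP connection to each port succeeds. It does not check the Docker Compose plugin or the GitHub token.

```python
import shutil
import socket


def missing_tools(tools=("docker", "git")):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def check_ports(host, ports=(5672, 50052), timeout=3.0):
    """Map each port to True if a TCP connection to host:port succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Run missing_tools() on the coordinator host, and run check_ports("<coordinator address>") from a worker instance once the coordinator is up. A port reported as False may be blocked by the firewall, or the service may simply not be listening yet.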
Clone the repository
git clone git@github.com:Assemblage-Dataset/Assemblage.git
cd Assemblage
Modify the cluster file
Examples are located under example_workers. We will use example_cluster.py; in this file you can declare the workers, the crawlers, and the database connection.
Install the local dependencies and run the cluster file, which will create the Docker images.
pip install -r requirements.txt
PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python3 example_cluster.py
Boot up the coordinators
docker compose up
Optional: Recover dataset
Assemblage can recover the state of a previous run and rebuild the binary dataset from it, which is useful when the binaries themselves cannot be distributed. To reload a previous state, take one of the following recipes (in JSON format), boot up the CLI, navigate to the loadrepo option, and provide the JSON file; the system will build the dataset from it. Note that dataset recovery does not use the API version, so switch to the linux_github branch first.
Warning
The recovered state is not guaranteed to match the original state: repositories may have been hidden or deleted, so some binaries might not be recovered. In addition, to accurately recover the dataset from source code, a full git clone with all history is performed, which is extremely slow and resource-intensive.
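Because a recovery run is slow, it may be worth confirming that a recipe file is well-formed JSON before handing it to the loadrepo option. The sketch below assumes nothing about the recipe schema, which is not described here; it only parses the file and reports its top-level type and entry count. The function name summarize_recipe is hypothetical.

```python
import json
from pathlib import Path


def summarize_recipe(path):
    """Parse a JSON file and return (top-level type name, number of entries)."""
    with Path(path).open(encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError if the file is malformed
        data = json.load(handle)
    # Lists and objects report their length; a scalar counts as one entry
    size = len(data) if isinstance(data, (list, dict)) else 1
    return type(data).__name__, size
```

A file that raises an error here would also fail to load in the CLI, so this catches truncated or corrupted downloads early. It says nothing about whether the repositories listed in the recipe still exist.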