Browsertrix Cloud

Browsertrix Cloud is a cloud-native crawling system that supports multiple users and multiple archives, and can run natively in the cloud via Kubernetes or locally via Docker.

The system currently includes support for the following:

  • Fully API-driven, with an OpenAPI specification for all APIs.
  • Multiple users, registered via email and/or invited to join Archives.
  • Crawling centered around Archives, which are associated with an S3-compatible storage bucket.
  • Users may be part of multiple Archives and have different roles in each.
  • Archives contain crawler configs, which are passed to the crawler.
  • Crawls launched via a crontab-based schedule or manually on demand.
  • Crawls performed using Browsertrix Crawler.
  • Crawl configs include an optional timeout, after which the crawl is stopped gracefully.
  • Crawl status is tracked in the DB (possible crawl states include: Complete, Partially Complete (due to timeout or cancelation), Canceled, and Failed).
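The crawl states listed above can be sketched as a simple enum. The state names and string values below are illustrative only, not the exact identifiers used in the backend's data model:

```python
from enum import Enum


class CrawlState(str, Enum):
    """Illustrative crawl states; the backend's actual names may differ."""

    RUNNING = "running"
    STOPPING = "stopping"  # graceful stop in progress
    COMPLETE = "complete"
    PARTIAL_COMPLETE = "partial_complete"  # ended early due to timeout or cancelation
    CANCELED = "canceled"
    FAILED = "failed"


def is_finished(state: CrawlState) -> bool:
    # A crawl is finished once it is no longer running or stopping.
    return state not in (CrawlState.RUNNING, CrawlState.STOPPING)
```

Using a `str`-backed enum keeps the states easy to store and query as plain strings in the DB while still giving type safety in the API layer.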

Deploying to Docker

To deploy with a local Docker instance, copy config.sample.env to config.env.

Docker Compose is required.

Then, run docker-compose build; docker-compose up -d to launch.

To update/relaunch, use ./docker-restart.sh.

The API should be available at: http://localhost:8000/docs
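Taken together, the steps above can be run as a short shell session (assuming the commands are run from the repository root and config.env has been reviewed first):

```shell
# Copy the sample config; edit config.env as needed before launching.
cp config.sample.env config.env

# Build the images and launch all services in the background.
docker-compose build
docker-compose up -d

# The API docs should now be served at http://localhost:8000/docs
```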

Note: When deployed in local Docker, failed crawls are currently not retried. Scheduling is handled by a subprocess, which stores the active schedule in the DB.

Deploying to Kubernetes

To deploy to Kubernetes, Helm is required. Browsertrix Cloud comes with a Helm chart, which can be installed as follows:

helm install -f ./chart/values.yaml btrix ./chart/

This will create a browsertrix-cloud service in the default namespace.

For a quick update, the following is recommended:

helm upgrade -f ./chart/values.yaml btrix ./chart/

Note: When deployed in Kubernetes, failed crawls are automatically retried. Scheduling is handled via Kubernetes Cronjobs, and crawl jobs are run in the crawlers namespace.
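The install and upgrade steps above, with a basic status check added, might look like the following. The `kubectl get pods` checks are an assumption about how you would verify the deployment, not part of the chart itself:

```shell
# Install the chart into the default namespace under the release name "btrix".
helm install -f ./chart/values.yaml btrix ./chart/

# Check that the browsertrix-cloud pods have started.
kubectl get pods

# After pulling changes, upgrade the release in place.
helm upgrade -f ./chart/values.yaml btrix ./chart/

# Crawl jobs run in the crawlers namespace.
kubectl get pods -n crawlers
```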

Browsertrix Cloud is currently in a pre-alpha stage and not ready for production use.