From ee20c659e9bbc58ed3b7b868f5a6a7f18570c27b Mon Sep 17 00:00:00 2001
From: Ilya Kreymer
Date: Wed, 25 Aug 2021 16:13:06 -0700
Subject: [PATCH] add basic README

---
 README.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 00000000..8421c4b7
--- /dev/null
+++ b/README.md
@@ -0,0 +1,22 @@
+# Browsertrix Cloud
+
+Browsertrix Cloud is a cloud-native, multi-user, multi-archive crawling system that runs in the cloud via Kubernetes or locally via Docker.
+
+The system currently includes support for the following:
+
+- Multiple users, registered via email and/or invited to join Archives.
+- Crawling centered around Archives, each associated with an S3-compatible storage bucket.
+- Users may belong to multiple Archives and have different roles in each.
+- Archives contain crawl configs, which are passed to the crawler.
+- Crawls launched via a crontab-based schedule or manually on demand.
+- Crawls performed using [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler).
+- Crawl configs include an optional timeout, after which the crawl is stopped gracefully.
+- Crawl status tracked in the DB (possible crawl states include: Completed, Partially Complete (due to timeout or cancellation), Canceled, and Failed).
+
+
+When deployed in Kubernetes, failed crawls are automatically retried. Scheduling is handled via Kubernetes CronJobs.
+
+When deployed in local Docker, failed crawls are currently not retried. Scheduling is handled by a subprocess, which stores the active schedule in the DB.
+
+Browsertrix Cloud is currently in pre-alpha and not ready for production use.
+