# Browsertrix Cloud
Browsertrix Cloud is a cloud-native, multi-user, multi-archive crawling system that can run natively in the cloud via Kubernetes or locally via Docker.
The system currently includes support for the following:
- Fully API-driven, with OpenAPI specification for all APIs.
- Multiple users, registered via email and/or invited to join Archives.
- Crawling centered around Archives which are associated with an S3-compatible storage bucket.
- Users may be part of multiple archives and have different roles in different archives.
- Archives contain crawler configs, which are passed to the crawler (see the example request after this list).
- Crawls can be launched on a crontab-based schedule or manually on-demand.
- Crawls performed using Browsertrix Crawler.
- Crawl configs include an optional timeout, after which the crawl is stopped gracefully.
- Crawl status is tracked in the DB (possible crawl states include: Completed, Partially-Complete (due to timeout or cancellation), Canceled, Failure)
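Since the system is fully API-driven, a crawler config can be created with a plain HTTP request. The sketch below is illustrative only: the endpoint path and JSON field names are assumptions rather than the documented API, so consult the OpenAPI docs served by the backend for the actual schema.

```sh
# Illustrative sketch only -- the endpoint path and JSON field names are assumptions;
# see the OpenAPI docs (e.g. http://localhost:8000/docs) for the real schema.
curl -X POST "http://localhost:8000/api/archives/<archive-id>/crawlconfigs/" \
  -H "Authorization: Bearer <access-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "example-crawl",
        "schedule": "0 2 * * *",
        "crawlTimeout": 3600,
        "config": {
          "seeds": ["https://example.com/"],
          "scopeType": "prefix"
        }
      }'
```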
## Deploying to Docker
To deploy via a local Docker instance, copy `config.sample.env` to `config.env`. Docker Compose is required.

Then, run `docker-compose build; docker-compose up -d` to launch.

To update/relaunch, use `./docker-restart.sh`.
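Collected into one copy-pasteable sequence (assuming the repository root as the working directory):

```sh
# Prepare the environment file, then build and start the stack.
cp config.sample.env config.env   # edit config.env as needed
docker-compose build
docker-compose up -d

# Later, to pick up changes and relaunch:
./docker-restart.sh
```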
The API should be available at: http://localhost:8000/docs
Note: When deployed in local Docker, failed crawls are currently not retried. Scheduling is handled by a subprocess, which stores the active schedule in the DB.
## Deploying to Kubernetes
To deploy to K8s, `helm` is required. Browsertrix Cloud comes with a Helm chart, which can be installed as follows:
`helm install -f ./chart/values.yaml btrix ./chart/`
This will create a `browsertrix-cloud` service in the default namespace.
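Chart settings can be overridden at install time using Helm's standard `-f`/`--set` flags. The override file name below is hypothetical, and any keys it sets should match those defined in `./chart/values.yaml`:

```sh
# Layer a local overrides file on top of the chart defaults.
# my-overrides.yaml is a hypothetical file -- use keys defined in ./chart/values.yaml.
helm install -f ./chart/values.yaml -f my-overrides.yaml btrix ./chart/
```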
For a quick update, the following is recommended:
`helm upgrade -f ./chart/values.yaml btrix ./chart/`
Note: When deployed in Kubernetes, failed crawls are automatically retried. Scheduling is handled via Kubernetes CronJobs, and crawl jobs are run in the `crawlers` namespace.
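To check that scheduled crawls have been registered and that crawl jobs are running, the standard `kubectl` commands below can be used (they assume your current kubeconfig context points at the cluster where the chart was installed):

```sh
# List the CronJobs created for scheduled crawls
kubectl get cronjobs -n crawlers

# List crawl jobs and their pods
kubectl get jobs -n crawlers
kubectl get pods -n crawlers
```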
Browsertrix Cloud is currently in pre-alpha and is not ready for production use.