Ilya Kreymer adb5c835f2
Presign and replay (#127)
* support for replay via replayweb.page embed, fixes #124

backend:
- pre-sign all files urls
- cache pre-signed urls in redis, presign again when expired (default duration 3600, settable via PRESIGN_DURATION_SECONDS env var)
- change files output -> resources to conform to Data Package spec supported by replayweb.page
- add CrawlFileOut which contains 'name' (file id), 'path' (presigned url), 'hash', and 'size'
- add /replay/sw.js endpoint to import sw.js from latest replay-web-page release
- update to fastapi-users 9.2.2
- customize backend auth to allow authentication to check 'auth_bearer' query arg if 'Authorization' header not set
- remove sw.js endpoint, handling in frontend

frontend:
- add <replay-web-page> to frontend, include rwp ui.js from latest release in index.html for now
- update crawl api endpoint to end in json
- replay-web-page loads the api endpoint directly!
- update Crawl type to use new format, 'resources' instead of 'files', each file has 'name' and 'path'

- nginx: add endpoint to serve the replay sw.js endpoint
- add defer attr to ui.js
- move 'Download' to 'Download Files'

* frontend: support customizing replayweb.page loading url via RWP_BASE_URL env var in Dockerfile
- default prod value set in frontend Dockerfile (set to upcoming 1.5.8 release needed for multi-wacz-file support) (can be overridden during image build via --build-arg)
- rename index.html -> index.ejs to allow interpolation
- RWP_BASE_URL defaults to latest https://replayweb.page/ for testing
- for local testing, add sw.js loading via devServer, also using RWP_BASE_URL (#131)

Co-authored-by: sua yoo <sua@suayoo.com>
2022-01-31 17:02:15 -08:00
backend Presign and replay (#127) 2022-01-31 17:02:15 -08:00
chart replay route: (prepare for replay, #124) 2022-01-31 11:18:10 -08:00
configs Config superuser (#59) 2021-12-05 14:12:42 -08:00
frontend Presign and replay (#127) 2022-01-31 17:02:15 -08:00
.gitignore Remove unused files (#69) 2021-12-26 18:01:10 -08:00
docker-compose.yml Refactor backend data model to support UUID (fixes #118) (#119) 2022-01-29 19:00:11 -08:00
docker-restart.sh README + docker-restart.sh add 2021-08-25 16:27:22 -07:00
pylintrc misc tweaks: 2021-08-25 18:34:49 -07:00
README.md Storage + Data Model Refactor (fixes #3): 2021-10-09 18:58:40 -07:00

Browsertrix Cloud

Browsertrix Cloud is a cloud-native, multi-user, multi-archive crawling system that runs natively in the cloud via Kubernetes or locally via Docker.

The system currently includes support for the following:

  • Fully API-driven, with OpenAPI specification for all APIs.
  • Multiple users, registered via email and/or invited to join Archives.
  • Crawling centered around Archives which are associated with an S3-compatible storage bucket.
  • Users may be part of multiple archives and have different roles in different archives.
  • Archives contain crawler configs, which are passed to the crawler.
  • Crawls are launched via a crontab-based schedule or manually on demand.
  • Crawls performed using Browsertrix Crawler.
  • Crawl config includes an optional timeout, after which crawl is stopped gracefully.
  • Crawl status is tracked in the DB. Possible crawl states include: Completed, Partially-Complete (due to timeout or cancelation), Cancelation, Failure.
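
As an illustration of the last two points, the crawl states and the optional timeout could be modeled roughly as follows. This is a minimal sketch, not the project's actual data model: the `CrawlState` enum, the `timeout_reached` helper, and the convention that `None` or `0` disables the timeout are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class CrawlState(Enum):
    # Values mirror the state names listed above; the enum itself is hypothetical.
    COMPLETED = "Completed"
    PARTIALLY_COMPLETE = "Partially-Complete"
    CANCELATION = "Cancelation"
    FAILURE = "Failure"


def timeout_reached(
    started: datetime, now: datetime, timeout_seconds: Optional[int]
) -> bool:
    """Return True once the optional crawl timeout has elapsed.

    Assumption: a timeout of None or 0 means "no timeout".
    """
    if not timeout_seconds:
        return False
    return now - started >= timedelta(seconds=timeout_seconds)
```

A scheduler loop could call `timeout_reached` periodically and, once it returns `True`, ask the crawler to stop gracefully and record the crawl as `CrawlState.PARTIALLY_COMPLETE`.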

Deploying to Docker

To deploy to a local Docker instance, copy config.sample.env to config.env.

Docker Compose is required.

Then, run docker-compose build; docker-compose up -d to launch.
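
The steps above can also be scripted. The following is a hedged sketch only: `ensure_config` and `launch` are hypothetical helpers, and the directory that holds config.sample.env is passed in as a parameter because its exact location is an assumption. The `docker-compose` commands are the ones given above.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# The two commands from the instructions above, run in order.
COMPOSE_COMMANDS: List[List[str]] = [
    ["docker-compose", "build"],
    ["docker-compose", "up", "-d"],
]


def ensure_config(config_dir: Path) -> bool:
    """Copy config.sample.env to config.env unless config.env already exists.

    Returns True if the file was copied, False if an existing config.env
    was left untouched (so local edits are never overwritten).
    """
    sample = config_dir / "config.sample.env"
    target = config_dir / "config.env"
    if target.exists():
        return False
    shutil.copyfile(sample, target)
    return True


def launch(repo_dir: Path, config_dir: Path) -> None:
    """Prepare config.env, then build and start the containers."""
    ensure_config(config_dir)
    for cmd in COMPOSE_COMMANDS:
        # check=True aborts on the first failing command.
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Calling `launch(Path("."), Path("."))` from the repository root would run both commands, assuming the sample config sits in the root; pass a different `config_dir` if it lives elsewhere.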

To update/relaunch, use ./docker-restart.sh.

The API should be available at: http://localhost:8000/docs

Note: When deployed in local Docker, failed crawls are currently not retried. Scheduling is handled by a subprocess, which stores the active schedule in the DB.

Deploying to Kubernetes

To deploy to Kubernetes (K8s), Helm is required. Browsertrix Cloud comes with a Helm chart, which can be installed as follows:

helm install -f ./chart/values.yaml btrix ./chart/

This will create a browsertrix-cloud service in the default namespace.

For a quick update, the following is recommended:

helm upgrade -f ./chart/values.yaml btrix ./chart/

Note: When deployed in Kubernetes, failed crawls are automatically retried. Scheduling is handled via Kubernetes CronJobs, and crawl jobs are run in the crawlers namespace.

Browsertrix Cloud is currently in a pre-alpha stage and is not ready for production.