- Ansible playbook for deploying on DigitalOcean, configuring space, k8s cluster, mongodb, domain / subdomain, signing subdomain, container registry, and cors
- Generates helm chat in ./deploys/ directory for future use with helm directly
- Initial support for deletion of created resources as well.
- add documentation on how to use playbook
default helm values: update to latest authsign, set default timeout to 120 seconds
* k8s chart fixes:
mongo: pin to 5.0.11 version for now
minio: create empty dir for local storage for now instead of using mc, use 'btrix-data' as bucket name
* k8s: add tolerations for 'nodeType=crawling:NoSchedule' to allow scheduling crawling on designated nodes for crawler and profiles jobs and statefulsets
* add affinity for 'nodeType=crawling' on crawling and profile browser statefulsets
* refactor crawljob: combine crawl_updater logic into base crawl_job
* increment new 'crawlAttemptCount' counter crawlconfig when crawl is started, not necessarily finished, to avoid deleting configs that had attempted but not finished crawls.
* better external mongodb support: use MONGO_DB_URL to set custom url directly, otherwise build from username, password and mongo host
- use statefulsets instead of deployments for mongo, redis, signer
- use k8s job + statefulset for running crawls
- use separate statefulset for crawl (scaled) and single-replica redis stateful set
- move crawl job update login to crawl_updater
- remove shared redis chart
package refactor:
- move to shared code to 'btrixcloud'
- move k8s to 'btrixcloud.k8s'
- move docker to 'btrixcloud.docker'
* backend: k8s:
- support crawls with multiple wacz files, don't assume crawl complete after first wacz uploaded
- if crawl is running and has wacz file, still show as running
- k8s: allow configuring node selector for main pods (eg. nodeType=main) and for crawlers (eg. nodeType=crawling)
- profiles: support uploading to alternate storage specified via 'shared_profile_storage' value is set
- misc fixes for profiles
* backend: ensure docker run_profile api matches k8s
k8s chart: don't delete pvc and pv in helm chart
* dependency: bump authsign to 0.4.0
docker: disable public redis port
* profiles: fix path, profile browser return value
* fix typo in presigned url cacheing
- set WEB_CONCURRENCY env var to configure number of backend api workers for both docker and k8s
- set via 'backend_workers' in values.yaml
- also add 'rwp_base_url' to values.yaml
- update containers to use public webrecorder/browsertrix-backend and webrecorder/browsertrix-frontend containers
- make liveness, readiness and startup health checks more tolerant
* backend support for new watch system (#134):
- support for watch via redis pubsub and websocket connection to backend
- can support watch from any number of crawler instances to support scaled crawls
- use /archives/{aid}/crawls/{crawl_id}/watch/ws websocket endpoint
- ws: ignore graceful connectionclosedok exception, log other exceptions
- set logging to info to instead of debug for now (debug logs all ws traffic)
- remove old watch apis in backend
- remove old websocket routing to crawler instance for old watch system
- oauth bearer check: support websockets, use websocket object if no request object
- crawler args: replace --screencastPort with --screencastRedis
- add k8s deployment of signing server, if 'signer.enabled' chart value if set
- update ingress to provide access for 'signer.host' if signing server enabled to verify domain, run signing server itself on different port (also turn off ssl redirects to support signing server)
- set WACZ_SIGN_URL and WACZ_SIGN_TOKEN (supported in browesertrix-crawler 0.5.0)
- authsign deployment uses a volume to store current certs
- add sample signer block, with signing disabled by default
use PersistentVolumeClaim to create a persistent volume for each local service (mongo, minio, redis) when running in a cloud setup
if cloud-specified volume storage class not specified, create default hostPath volume (eg. for minikube)
lint: add default icon for chart
* backend: automatically create super user, fixes#57
- if SUPERUSER_EMAIL is set, superuser is created with `is_superuser` and `is_verified` settings, if user doesn't already exist.
- if SUPERUSER_PASSWORD if set, the password for superuser is set, otherwise a random password is generated
update sample SUPERUSER_EMAIL and SUPERUSER_PASSWORD in config file and chart.
- ensure verification email is not sent if user already verified
* backend:
- add /api/settings endpoint for misc system-wide settings
- setting 'registrationEnabled' if open registration should be enabled, set via REGISTRATION_ENABLED=1 env var
- setting 'jwtTokenLifetimeMinutes' returns the jwt token expiry in seconds, configured in minutes via JWT_TOKEN_LIFETIME_MINUTES env var (default: 60)
* k8s: support email configuration
support sending reset password email
fix for #32
* fastapi users: update to latest (8.1.2)
send verification email upon registration
* update to latest fastapi-users(8.1.2), refactor to use UserManager class
ensure verification e-mail sent upon registration, w/o requiring separate apicall
fixes#32
* add email options to default chart/values.yaml
* separate usermanager init from fastapi users init, fix for sending invite emails
* misc backend fixes:
- fix running w/o local minio
- ensure crawler image pull policy is configurable, loaded via chart value
- use digitalocean repo for main backend image (for now)
- add bucket_name to config only if using default bucket
* enable all behaviors, support 'access_endpoint_url' for default storages
* debugging: add 'no_delete_jobs' setting for k8s and docker to disable deletion of completed jobs
- collections defined by name per archive
- can update collections with additional metadata (currently just description)
- crawl config api accepts a list of collections by name, resolved to collection uids and stored in config
- finished crawls also associated with collection list
- /archives/{aid}/collections/{name} can list all crawl artifacts (wacz files) from a named collection (in frictionless data package-ish format)
- /archives/{aid}/collections/$all lists all crawled artifacts for the archive
readiness check: add /healthz endpoints for app and nginx
ingress: add /data/ route to local bucket
storage improvements:
- for default storages, store path only, and prepend default storage access endpoint
- collections api returns the paths using the storage access endpoint
- define default storages as secrets in k8s (can support multiple), hard-coded in docker (only one for now)
support screencasting to dynamically created service via nginx (k8s only thus far)
add crawl /watch endpoint to enable watching, creates service if doesn't exist
add crawl /running endpoint to check if crawl is running
nginx auth check in place, but not yet enabled
add k8s nginx.conf
add missing chart files
file reorg: move docker config to configs/
k8s: add readiness check for nginx and api containers for smoother reloading
ensure service deleted along with job
todo: update dockerman with screencast support
- Add default vs custom (s3) storage
- K8S: All storages correspond to secrets
- K8S: Default storages inited via helm
- K8S: Custom storage results in custom secret (per archive)
- K8S: Don't add secret per crawl config
- API for changing storage per archive
- Docker: default storage just hard-coded from env vars (only one for now)
- Validate custom storage via aiobotocore before confirming
- Data Model: remove usage from users
- Data Model: support adding multiple files per crawl for parallel crawls
- Data Model: track completions for parallel crawls
- Data Model: initial support for tags per crawl, add collection as 'coll' tag
README fixes
- supported in both docker and k8s
- additional pods with same job id automatically use same crawl state in redis
- support dynamic scaling (#2) via /scale endpoint - k8s job parallelism adjusted dynamically for running job (only supported in k8s so far)
allow crawl complete/partial complete to update existing crawl state, eg. timeout
enable handling backofflimitexceeded / deadlineexceeded failure, with possible success able to override the failure state
filter out only active jobs in running crawls listing
- job watch: add watch loop for job failure (backofflimitexceeded)
- set job retries + job timeout via chart values
- sigterm starts graceful shutdown by default, including for timeout
- use sigusr1 to switch to instant shutdown
- update stop_crawl() to use new semantics
- replace storages with archives, which have a single storage (for now)
- crawls associated with archives
- users below to archive, with one admin user (if archive created by default)
- update crawlconfig for latest browsertrix-crawler (0.4.4)
- k8s: fix permissions for crawler role
- k8s: fix minio service (now requiring two ports)
move mongo into separate optional deployment along with minio
support for configuring storages
support for deleting crawls, associated config and secrets
- working apis for adding crawls, removing crawls in mongo, mapped to k8s cronjobs
- more complete crawl spec
- option to start on-demand job from cronjobs
- optional minio in separate deployment/service