Use V4 ('s3v4') signature version for for all presigning URLs to support
backblaze, fixes#2472
- add 'access_addressing_style' to be able to choose virtual/path
addressing for access endpoint (default to 'virtual' as before)
- fix minio presigning with v4 by using 'path' addressing style for
minio
- if path matches '/data/' for internal minio bucket, then always use
'path'
- also make minio access path '/data/' configurable
also simplify running in any namespace with default settings:
- don't hardcode 'local-minio.default'
- in crawlers namespace, add a 'local-minio' externalName service which
maps to the main namespace service.
Fixes#1252
Supports a generic background job system, with two background jobs,
CreateReplicaJob and DeleteReplicaJob.
- CreateReplicaJob runs on new crawls, uploads, profiles and updates the
`replicas` array with the info about the replica after the job succeeds.
- DeleteReplicaJob deletes the replica.
- Both jobs are created from the new `replica_job.yaml` template. The
CreateReplicaJob sets secrets for primary storage + replica storage,
while DeleteReplicaJob only needs the replica storage.
- The job is processed in the operator when the job is finalized
(deleted), which should happen immediately when the job is done, either
because it succeeds or because the backoffLimit is reached (currently
set to 3).
- /jobs/ api lists all jobs using a paginated response, including filtering and sorting
- /jobs/<job id> returns details for a particular job
- tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints
- tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* resource constraints: (fixes#895)
- for cpu, only set cpu requests
- for memory, set mem requests == mem limits
- add missing resource constraints for minio and scheduled job
- for crawler, set mem and cpu constraints per browser, scale based on browser instances per crawler
- add comments in values.yaml for crawler values being multiplied
- default values: bump crawler to 650 millicpu per browser instance just in case
cleanup: remove unused entries from main backend configmap
- ingress: fix proxying /data to minio, use another ingress which proxies correct host to ensure presigned urls work
- presigning: determine if signing endpoint url (minio) or access endpoint (cloud bucket) based on if access endpoint is provided, set bool on storage object
- chart: fix indent on incorrect storageClassName configs
- ingress: make 'ingress_class' configurable (set to 'public' for microk8s, default to 'nginx')
- minio: use older minio image which supports legacy fs based setup (for now)
- nginx service: add 'nginx_service_use_node_port' config setting: if true, will use NodePort for frontend,
other will use default (ClusterIP) and only for the frontend / nginx
- chart: remove changing service type for other services
* k8s chart fixes:
mongo: pin to 5.0.11 version for now
minio: create empty dir for local storage for now instead of using mc, use 'btrix-data' as bucket name
- use python-on-whale to use docker cli api directly, creating docker stack for each crawl or profile browser
- configure storages via storages.yaml secret
- add crawl_job, profile_job, splitting into base and k8s/swarm implementations
- split manager into base crawlmanager and k8s/swarm implementations
- swarm: load initial scale from db to avoid modifying fixed configs, in k8s, load from configmap
- swarm: support scheduled jobs via swarm-cronjob service
- remove docker dependencies (aiodocker, apscheduler, scheduling)
- swarm: when using local minio, expose via /data/ route in nginx via extra include (in k8s, include dir is empty and routing handled via ingress)
- k8s: cleanup minio chart: move init containers to minio.yaml
- swarm: stateful set implementation to be consistent with k8s scaling:
- don't use service replicas,
- create a unique service with '-N' appended and allocate unique volume for each replica
- allows crawl containers to be restarted w/o losing data
- add volume pruning background service, as volumes can be deleted only after service shuts down fully
- watch: fully simplify routing, route via replica index instead of ip for both k8s and swarm
- rename network btrix-cloud-net -> btrix-net to avoid conflict with compose network
* backend: k8s:
- support crawls with multiple wacz files, don't assume crawl complete after first wacz uploaded
- if crawl is running and has wacz file, still show as running
- k8s: allow configuring node selector for main pods (eg. nodeType=main) and for crawlers (eg. nodeType=crawling)
- profiles: support uploading to alternate storage specified via 'shared_profile_storage' value is set
- misc fixes for profiles
* backend: ensure docker run_profile api matches k8s
k8s chart: don't delete pvc and pv in helm chart
* dependency: bump authsign to 0.4.0
docker: disable public redis port
* profiles: fix path, profile browser return value
* fix typo in presigned url cacheing
- add 'emptyDir' volume for crawl directory (to allow any pod restarts to have access to the data)
- rename minio and redis volumes to avoid any confusion
- add pod termination grace-period (default to 600 secs)
use PersistentVolumeClaim to create a persistent volume for each local service (mongo, minio, redis) when running in a cloud setup
if cloud-specified volume storage class not specified, create default hostPath volume (eg. for minikube)
lint: add default icon for chart
- Add default vs custom (s3) storage
- K8S: All storages correspond to secrets
- K8S: Default storages inited via helm
- K8S: Custom storage results in custom secret (per archive)
- K8S: Don't add secret per crawl config
- API for changing storage per archive
- Docker: default storage just hard-coded from env vars (only one for now)
- Validate custom storage via aiobotocore before confirming
- Data Model: remove usage from users
- Data Model: support adding multiple files per crawl for parallel crawls
- Data Model: track completions for parallel crawls
- Data Model: initial support for tags per crawl, add collection as 'coll' tag
README fixes
- support listing existing crawls
- add 'schedule' and 'manual' annotations to jobs, store in Crawl obj
- ensure manual jobs are deleted when completed
- support deleting crawls by id (but not data)
- rename running crawl delete to '/cancel'
change paths for local minio/mongo to /tmp
- replace storages with archives, which have a single storage (for now)
- crawls associated with archives
- users below to archive, with one admin user (if archive created by default)
- update crawlconfig for latest browsertrix-crawler (0.4.4)
- k8s: fix permissions for crawler role
- k8s: fix minio service (now requiring two ports)
move mongo into separate optional deployment along with minio
support for configuring storages
support for deleting crawls, associated config and secrets
- working apis for adding crawls, removing crawls in mongo, mapped to k8s cronjobs
- more complete crawl spec
- option to start on-demand job from cronjobs
- optional minio in separate deployment/service