- collections defined by name per archive
- can update collections with additional metadata (currently just description)
- crawl config api accepts a list of collections by name, resolved to collection uids and stored in config
- finished crawls also associated with collection list
- /archives/{aid}/collections/{name} can list all crawl artifacts (WACZ files) from a named collection (in frictionless data package-ish format)
- /archives/{aid}/collections/$all lists all crawled artifacts for the archive (see the sketch below)
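A minimal sketch of how the listing route could behave, using FastAPI; the in-memory `CRAWLS` stand-in and the resource field names are illustrative assumptions, not the actual schema:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical in-memory stand-in for finished-crawl records; real code
# would query the database for crawls belonging to this archive.
CRAWLS: list[dict] = [
    {
        "aid": "a1",
        "colls": ["news"],
        "files": [{"name": "crawl-1.wacz", "path": "a1/crawl-1.wacz",
                   "hash": "sha256:abc123", "size": 1024}],
    },
]

@app.get("/archives/{aid}/collections/{name}")
async def list_collection(aid: str, name: str):
    # "$all" skips the per-collection filter and returns every artifact
    matched = [
        c for c in CRAWLS
        if c["aid"] == aid and (name == "$all" or name in c["colls"])
    ]
    if not matched and name != "$all":
        raise HTTPException(status_code=404, detail="collection_not_found")

    # frictionless data package-ish shape: one "resource" per WACZ file
    return {
        "name": name,
        "resources": [
            {"name": f["name"], "path": f["path"],
             "hash": f["hash"], "bytes": f["size"]}
            for crawl in matched
            for f in crawl["files"]
        ],
    }
```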
readiness check: add /healthz endpoints for app and nginx
ingress: add /data/ route to local bucket
storage improvements:
- for default storages, store path only, and prepend default storage access endpoint
- collections api returns the paths using the storage access endpoint
- define default storages as secrets in k8s (can support multiple), hard-coded in docker (only one for now); path resolution sketched below
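A minimal sketch of the path rule above; the `DEFAULT_STORAGES` mapping, the `storage_name` field, and the endpoint URL are assumptions:

```python
# Assumed shape: default storages keyed by name, each carrying the public
# access endpoint that gets prepended to stored relative paths.
DEFAULT_STORAGES = {
    "default": {"access_endpoint_url": "https://cloud.example.com/data/"},
}

def resolve_path(file_entry: dict) -> str:
    """Return a fetchable URL for a crawl file. Default storages store
    only the relative path; custom (s3) storages keep the full URL."""
    storage = DEFAULT_STORAGES.get(file_entry.get("storage_name", ""))
    if storage:
        return storage["access_endpoint_url"] + file_entry["path"]
    return file_entry["path"]

assert resolve_path(
    {"storage_name": "default", "path": "a1/crawl-1.wacz"}
).startswith("https://")
```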
support screencasting to dynamically created service via nginx (k8s only thus far)
add crawl /watch endpoint to enable watching; creates the service if it doesn't exist (see the sketch after this list)
add crawl /running endpoint to check if crawl is running
nginx auth check in place, but not yet enabled
add k8s nginx.conf
add missing chart files
file reorg: move docker config to configs/
k8s: add readiness check for nginx and api containers for smoother reloading
ensure service deleted along with job
todo: update dockerman with screencast support
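A sketch of what the /watch flow might do on the k8s side, using kubernetes_asyncio; the service naming scheme, label selector, and screencast port are assumptions, not the actual chart values:

```python
from kubernetes_asyncio import client, config
from kubernetes_asyncio.client.rest import ApiException

async def ensure_watch_service(crawl_id: str, namespace: str = "crawlers"):
    """Create the per-crawl screencast Service if it doesn't already exist."""
    config.load_incluster_config()  # sync call in kubernetes_asyncio
    async with client.ApiClient() as api_client:
        core = client.CoreV1Api(api_client)
        name = f"screencast-{crawl_id}"
        try:
            await core.read_namespaced_service(name, namespace)
            return name  # already exists, nothing to do
        except ApiException as exc:
            if exc.status != 404:
                raise

        svc = client.V1Service(
            metadata=client.V1ObjectMeta(name=name, labels={"crawl": crawl_id}),
            spec=client.V1ServiceSpec(
                selector={"crawl": crawl_id},  # matches the crawl job's pods
                ports=[client.V1ServicePort(port=9037)],  # assumed screencast port
            ),
        )
        await core.create_namespaced_service(namespace, svc)
        return name
```

nginx can then proxy the watch traffic to this dynamically created service by name.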
- Add default vs custom (s3) storage
- K8S: All storages correspond to secrets
- K8S: Default storages inited via helm
- K8S: Custom storage results in custom secret (per archive)
- K8S: Don't add secret per crawl config
- API for changing storage per archive
- Docker: default storage just hard-coded from env vars (only one for now)
- Validate custom storage via aiobotocore before confirming (see the sketch after this list)
- Data Model: remove usage from users
- Data Model: support adding multiple files per crawl for parallel crawls
- Data Model: track completions for parallel crawls
- Data Model: initial support for tags per crawl, add collection as 'coll' tag
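The validation step could look roughly like this with aiobotocore; a `head_bucket` call (a small test upload would also work) fails fast on bad credentials or endpoint:

```python
from aiobotocore.session import get_session

async def verify_storage(endpoint_url: str, access_key: str,
                         secret_key: str, bucket: str) -> bool:
    """Check that the supplied s3 credentials can reach the bucket
    before saving the custom storage for an archive."""
    session = get_session()
    try:
        async with session.create_client(
            "s3",
            endpoint_url=endpoint_url,
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        ) as s3:
            await s3.head_bucket(Bucket=bucket)
        return True
    except Exception:
        return False
```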
README fixes
- supported in both docker and k8s
- additional pods with the same job id automatically use the same crawl state in redis
- support dynamic scaling (#2) via /scale endpoint: k8s job parallelism adjusted dynamically for a running job (only supported in k8s so far; sketch below)
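A sketch of the /scale endpoint's k8s side with kubernetes_asyncio; the namespace and job naming are assumptions:

```python
from kubernetes_asyncio import client, config

async def scale_crawl(job_name: str, scale: int, namespace: str = "crawlers"):
    """Adjust parallelism on a running crawl job.

    A merge patch on spec.parallelism is enough: k8s adds or removes pods,
    and new pods pick up the shared crawl state from redis.
    """
    config.load_incluster_config()
    async with client.ApiClient() as api_client:
        batch = client.BatchV1Api(api_client)
        await batch.patch_namespaced_job(
            job_name, namespace, {"spec": {"parallelism": scale}}
        )
```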
- run as child process using aioprocessing
- cleanup: support cleanup of orphaned containers
- timeout: support crawlTimeout via check in cleanup loop
- support crawl listing + crawl stopping
- support for creating and deleting crawlconfigs, and running crawls on-demand
- config stored in volume
- listen to docker events and clean up containers when they exit (sketch below)
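A sketch of that event loop with aiodocker; the `btrix.crawl` label used to recognize crawl containers is an assumption:

```python
import asyncio
import aiodocker

async def cleanup_loop():
    """Listen to docker events and remove crawl containers once they exit."""
    docker = aiodocker.Docker()
    subscriber = docker.events.subscribe()
    try:
        while True:
            event = await subscriber.get()
            if event is None:
                break  # event stream closed
            if event["Type"] == "container" and event["Action"] == "die":
                attrs = event["Actor"]["Attributes"]
                if "btrix.crawl" in attrs:  # assumed crawl-container label
                    container = await docker.containers.get(event["Actor"]["ID"])
                    await container.delete()
    finally:
        await docker.close()

asyncio.run(cleanup_loop())
```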
allow crawl complete/partial complete to update an existing crawl state, e.g. after a timeout
enable handling of BackoffLimitExceeded / DeadlineExceeded failures, with a possible success able to override the failure state (state-precedence sketch after this list)
filter the running-crawls listing to include only active jobs
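The override rule reduces to a small precedence check; the state names here are illustrative, not the actual values:

```python
# A (partial) completion reported by the crawler outranks a failure
# inferred from the k8s job status, but never the other way around.
SUCCESS_STATES = {"complete", "partial_complete"}
FAILURE_STATES = {"failed", "timed_out", "canceled"}

def resolve_state(existing: str, incoming: str) -> str:
    # e.g. a timeout where the crawler still uploaded a usable WACZ
    if existing in FAILURE_STATES and incoming in SUCCESS_STATES:
        return incoming
    if existing in SUCCESS_STATES and incoming in FAILURE_STATES:
        return existing
    return incoming

assert resolve_state("timed_out", "partial_complete") == "partial_complete"
```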
- job watch: add watch loop for job failure (BackoffLimitExceeded)
- set job retries + job timeout via chart values
- SIGTERM starts a graceful shutdown by default, including on timeout
- use SIGUSR1 to switch to instant shutdown (see the sketch after this list)
- update stop_crawl() to use the new semantics
- support listing existing crawls
- add 'schedule' and 'manual' annotations to jobs, store in Crawl obj
- ensure manual jobs are deleted when completed
- support deleting crawls by id (but not data)
- rename running crawl delete to '/cancel'
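A sketch of the signal semantics, assuming an asyncio supervisor process; `ShutdownHandler` is illustrative, not the actual class:

```python
import asyncio
import signal

class ShutdownHandler:
    """SIGTERM starts a shutdown; a prior SIGUSR1 switches the mode so
    that the shutdown happens instantly instead of gracefully."""

    def __init__(self):
        self.instant_mode = False
        self.stopping = asyncio.Event()

    def install(self):
        loop = asyncio.get_running_loop()
        loop.add_signal_handler(signal.SIGUSR1, self._set_instant)
        loop.add_signal_handler(signal.SIGTERM, self.stopping.set)

    def _set_instant(self):
        self.instant_mode = True

async def main():
    handler = ShutdownHandler()
    handler.install()
    await handler.stopping.wait()
    if handler.instant_mode:
        print("instant shutdown: exit without finishing current pages")
    else:
        print("graceful shutdown: finish current work, then upload")

asyncio.run(main())
```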
change paths for local minio/mongo to /tmp
- sending email for validation + invites, configured via env vars
- inviting new users to join an existing archive
- /crawldone webhook to track and verify the crawl id (next: store a crawl-complete entry); sketch below
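A minimal sketch of the webhook route with FastAPI; the payload fields are assumptions about what the crawler posts:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Assumed payload shape; the real POST may carry more fields.
class CrawlDoneMsg(BaseModel):
    id: str
    filename: str
    size: int

@app.post("/crawldone")
async def crawl_done(msg: CrawlDoneMsg):
    # verify the crawl id maps to a known running crawl before accepting;
    # next step per the note above: store a crawl-complete entry
    print("crawl finished:", msg.id)
    return {"success": True}
```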
- replace storages with archives, which have a single storage (for now)
- crawls associated with archives
- users belong to an archive, with one admin user (the creator, if the archive is created by default); data model sketched after this list
- update crawlconfig for latest browsertrix-crawler (0.4.4)
- k8s: fix permissions for crawler role
- k8s: fix minio service (now requiring two ports)
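An illustrative pydantic sketch of the data model implied by the bullets above; the field names are assumptions, not the actual schema:

```python
from typing import Dict
from pydantic import BaseModel

class S3Storage(BaseModel):
    endpoint_url: str
    access_key: str
    secret_key: str

class Archive(BaseModel):
    id: str
    name: str
    storage: S3Storage          # a single storage per archive, for now
    users: Dict[str, int] = {}  # user id -> role; one admin (the creator)

class Crawl(BaseModel):
    id: str
    aid: str                    # crawls are associated with an archive
    state: str
```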
move mongo into separate optional deployment along with minio
support for configuring storages
support for deleting crawls, associated config and secrets
- working APIs for adding and removing crawls in mongo, mapped to k8s cronjobs
- more complete crawl spec
- option to start an on-demand job from a cronjob (see the sketch after this list)
- optional minio in separate deployment/service
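A sketch of the on-demand trigger with kubernetes_asyncio, assuming a cluster where CronJob is served from batch/v1: read the stored CronJob and create a one-off Job from its job template. The namespace and naming are assumptions:

```python
from kubernetes_asyncio import client, config

async def run_now(cron_name: str, namespace: str = "crawlers"):
    """Start an on-demand crawl from an existing scheduled cronjob."""
    config.load_incluster_config()
    async with client.ApiClient() as api_client:
        batch = client.BatchV1Api(api_client)
        cron = await batch.read_namespaced_cron_job(cron_name, namespace)
        job = client.V1Job(
            metadata=client.V1ObjectMeta(generate_name=f"{cron_name}-manual-"),
            spec=cron.spec.job_template.spec,  # reuse the crawl's job spec
        )
        created = await batch.create_namespaced_job(namespace, job)
        return created.metadata.name
```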