* backend support for new watch system (#134):
- support for watch via redis pubsub and websocket connection to backend
- can support watch from any number of crawler instances to support scaled crawls
- use /archives/{aid}/crawls/{crawl_id}/watch/ws websocket endpoint
- ws: ignore graceful connectionclosedok exception, log other exceptions
- set logging to info to instead of debug for now (debug logs all ws traffic)
- remove old watch apis in backend
- remove old websocket routing to crawler instance for old watch system
- oauth bearer check: support websockets, use websocket object if no request object
- crawler args: replace --screencastPort with --screencastRedis
backed: crawlconfig:
- ensure newId is saved on old config being replaced
- if old config replaced is being deleted, ensure newId link is set on its old config (if any),
and the oldId points to the oldId of config being replaced (if any)
* backend: scale support:
- add 'scale' field to crawlconfig
- support updating 'scale' field in crawlconfig patch
- add constraint for crawlconfig and crawl scale (currently 1-3)
* support inactive configs in same collection, configs with `inactive` set to true (#137)
- add `inactive`, `newId`, `oldId` to crawlconfigs
- filter out inactive configs by default for most operations
- add index for aid + inactive field for faster querying
- delete returns status: 'deactivated' or 'deleted'
- if no crawls ran, config can be deleted, otherwise it is deactivated
* update crawl endpoint: add general PATCH crawl config endpoint, support updating schedule and name
- set resource mem and cpu requests/limits for all used services (not minio for now)
- add readiness proble to redis, mongo
- adjust crawler limits, set via configmap
- add 'emptyDir' volume for crawl directory (to allow any pod restarts to have access to the data)
- rename minio and redis volumes to avoid any confusion
- add pod termination grace-period (default to 600 secs)
* misc backend fixes:
- fix uuid typing: roles list, user invites
- crawlconfig: fix created date setting, fix userName lookup
- docker: fix timezone for scheduler, fix running check
- remove prints
- fix get crawl stuck in 'stopping' - check finished list first, then run list (in case k8s job has not been deleted)
* support for replay via replayweb.page embed, fixes#124
backend:
- pre-sign all files urls
- cache pre-signed urls in redis, presign again when expired (default duration 3600, settable via PRESIGN_DURATION_SECONDS env var)
- change files output -> resources to confirm to Data Package spec supported by replayweb.page
- add CrawlFileOut which contains 'name' (file id), 'path' (presigned url), 'hash', and 'size'
- add /replay/sw.js endpoint to import sw.js from latest replay-web-page release
- update to fastapi-users 9.2.2
- customize backend auth to allow authentication to check 'auth_bearer' query arg if 'Authorization' header not set
- remove sw.js endpoint, handling in frontend
frontend:
- add <replay-web-page> to frontend, include rwp ui.js from latest release in index.html for now
- update crawl api endpoint to end in json
- replay-web-page loads the api endpoint directly!
- update Crawl type to use new format, 'resources' -> instead of 'files', each file has 'name' and 'path'
- nginx: add endpoint to serve the replay sw.js endpoint
- add defer attr to ui.js
- move 'Download' to 'Download Files'
* frontend: support customizing replayweb.page loading url via RWP_BASE_URL env var in Dockerfile
- default prod value set in frontend Dockerfile (set to upcoming 1.5.8 release needed for multi-wacz-file support) (can be overridden during image build via --build-arg)
- rename index.html -> index.ejs to allow interpolation
- RWP_BASE_URL defaults to latest https://replayweb.page/ for testing
- for local testing, add sw.js loading via devServer, also using RWP_BASE_URL (#131)
Co-authored-by: sua yoo <sua@suayoo.com>
- k8s: don't use redis, set to 'stopping' if status.active is not set, toggled immediately on delete_job
- docker: set custom redis key to indicate 'stopping' state (container still running)
- api: remove crawl is_running endpoint, redundant with general get crawl api
* backend fixes: fix graceful stop + stats
- use redis to track stopping state, to be overwritten when finished
- also include stats in completed crawls
- docker: use short container id for crawl id
- graceful stop returns 'stopping_gracefully' instead of 'stopped_gracefully'
- don't set stopping state when complete!
- beginning files support: resolve absolute urls for crawl detail (not pre-signing yet)
* uuid fix: (fixes#118)
- update all mongo models to use UUID type as main '_id' (users continue to use 'id' as defined by fastapi-users)
- update all foreign doc references to use UUID instead of string
- api handlers convert str->uuid as needed
api fix:
- fix single crawl api, add CrawlOut response model
- fix collections api
- fix standalone-docker apis
- for manual job, set user to current user, overriding the setting from crawlconfig
* additional fixes:
- rename username -> userName to indicate not the login 'username'
- rename user -> userid, archive -> aid for crawlconfig + crawls
- ensure invites correctly convert str -> uuid as needed
- filter out unset values from browsertrix-crawler config
* convert remaining user -> userid variables
ensure archive id is passed to crawl_manager as str (via archive.id_str)
* remove bulk crawlconfig delete
* add support for `stopping` state when gracefully stopping crawl
* for get crawl endpoint, check stopped crawls first, then running
* crawls api improvements (fixes#110)
- add GET /crawls/{crawlid} api to return single crawl
- resolve crawlconfig name, add as `configName` to crawl model
- add 'created' date for crawlconfigs
- flatten list to single 'crawls' list, instead of separate 'finished' and 'running' (running crawls added first)
- include 'fileCount' and 'fileSize', remove files
- remove `files` from crawl list response, also remove `aid`
- remove `schedule` from crawl data altogether, (available in crawl config)
- add ListCrawls response model
- Cancel and stop crawl
- Sorts crawls by start time, status and crawl template ID
- Filters crawls by crawl template ID
- Adds shortcut to copy template ID
- add k8s deployment of signing server, if 'signer.enabled' chart value if set
- update ingress to provide access for 'signer.host' if signing server enabled to verify domain, run signing server itself on different port (also turn off ssl redirects to support signing server)
- set WACZ_SIGN_URL and WACZ_SIGN_TOKEN (supported in browesertrix-crawler 0.5.0)
- authsign deployment uses a volume to store current certs
- add sample signer block, with signing disabled by default
frontend:
- add checkbox to basic crawl config component which sets 'extraHops' to 1, otherwise to 0
- text tweaks: rename Scope Type -> Crawl Scope, capitalization
backend: add 'extraHops' to CrawlConfig
fixes#102
* backend api: add current crawl id to crawlconfig listing
- model: add 'currCrawlId' to CrawlConfig model
- output: add response model to /crawlconfigs api response to show correct openapi model
- rename crawl_configs -> crawlConfigs for consistency