browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	e6467c3374	backend work: - support {configname}-{username}-@ts-@hostsuffix.wacz as output filename, sanitize username and config name - support returning 'starting' for crawl status if no ips or 0/0 pages found. - fix updating scale via POST crawlconfig update - fix duplicate user error on superuser init	2022-03-15 18:20:25 -07:00
Ilya Kreymer	cdd0ab34a3	Watch Stream Directly from Browsertrix Crawler (#189) * watch work: proxy directly to crawls instead of redis pubsub - add 'watchIPs' to crawl detail output - cache crawl ips for quick access for auth - add '/ipaccess/{ip}' endpoint for watch ws connection to ensure ws has access to the specified container ip - enable 'auth_request' in nginx frontend - requirements: update to latest redis-py remaining fixes for #134	2022-03-04 14:55:11 -08:00
Ilya Kreymer	9bd402fa17	New WS Endpoint for Watching Crawl (#152) * backend support for new watch system (#134): - support for watch via redis pubsub and websocket connection to backend - can support watch from any number of crawler instances to support scaled crawls - use /archives/{aid}/crawls/{crawl_id}/watch/ws websocket endpoint - ws: ignore graceful connectionclosedok exception, log other exceptions - set logging to info instead of debug for now (debug logs all ws traffic) - remove old watch apis in backend - remove old websocket routing to crawler instance for old watch system - oauth bearer check: support websockets, use websocket object if no request object - crawler args: replace --screencastPort with --screencastRedis	2022-02-22 10:33:10 -08:00
Ilya Kreymer	ee68a2f64e	Support for setting scale in crawlconfig (#148) * backend: scale support: - add 'scale' field to crawlconfig - support updating 'scale' field in crawlconfig patch - add constraint for crawlconfig and crawl scale (currently 1-3)	2022-02-20 11:27:47 -08:00
Ilya Kreymer	d05f04be9f	Crawl Config Editing Support (#141) * support inactive configs in same collection, configs with `inactive` set to true (#137) - add `inactive`, `newId`, `oldId` to crawlconfigs - filter out inactive configs by default for most operations - add index for aid + inactive field for faster querying - delete returns status: 'deactivated' or 'deleted' - if no crawls ran, config can be deleted, otherwise it is deactivated * update crawl endpoint: add general PATCH crawl config endpoint, support updating schedule and name	2022-02-17 16:04:07 -08:00
Ilya Kreymer	8acb43b171	backend: use redis to mark crawls as canceled immediately, avoid dupes in crawl list (even if paging is added for db results)	2022-02-01 15:58:56 -08:00
Ilya Kreymer	2b2e6fedfa	Misc backend fixes (#133) * misc backend fixes: - fix uuid typing: roles list, user invites - crawlconfig: fix created date setting, fix userName lookup - docker: fix timezone for scheduler, fix running check - remove prints - fix get crawl stuck in 'stopping' - check finished list first, then run list (in case k8s job has not been deleted)	2022-01-31 19:41:04 -08:00
Ilya Kreymer	adb5c835f2	Presign and replay (#127) * support for replay via replayweb.page embed, fixes #124 backend: - pre-sign all files urls - cache pre-signed urls in redis, presign again when expired (default duration 3600, settable via PRESIGN_DURATION_SECONDS env var) - change files output -> resources to conform to Data Package spec supported by replayweb.page - add CrawlFileOut which contains 'name' (file id), 'path' (presigned url), 'hash', and 'size' - add /replay/sw.js endpoint to import sw.js from latest replay-web-page release - update to fastapi-users 9.2.2 - customize backend auth to allow authentication to check 'auth_bearer' query arg if 'Authorization' header not set - remove sw.js endpoint, handling in frontend frontend: - add <replay-web-page> to frontend, include rwp ui.js from latest release in index.html for now - update crawl api endpoint to end in json - replay-web-page loads the api endpoint directly! - update Crawl type to use new format, 'resources' -> instead of 'files', each file has 'name' and 'path' - nginx: add endpoint to serve the replay sw.js endpoint - add defer attr to ui.js - move 'Download' to 'Download Files' * frontend: support customizing replayweb.page loading url via RWP_BASE_URL env var in Dockerfile - default prod value set in frontend Dockerfile (set to upcoming 1.5.8 release needed for multi-wacz-file support) (can be overridden during image build via --build-arg) - rename index.html -> index.ejs to allow interpolation - RWP_BASE_URL defaults to latest https://replayweb.page/ for testing - for local testing, add sw.js loading via devServer, also using RWP_BASE_URL (#131) Co-authored-by: sua yoo <sua@suayoo.com>	2022-01-31 17:02:15 -08:00
Ilya Kreymer	be86505347	backend: crawls api: better fix for graceful stop - k8s: don't use redis, set to 'stopping' if status.active is not set, toggled immediately on delete_job - docker: set custom redis key to indicate 'stopping' state (container still running) - api: remove crawl is_running endpoint, redundant with general get crawl api	2022-01-30 22:01:00 -08:00
Ilya Kreymer	542680daf7	backend fixes: fix graceful stop + stats (#122) * backend fixes: fix graceful stop + stats - use redis to track stopping state, to be overwritten when finished - also include stats in completed crawls - docker: use short container id for crawl id - graceful stop returns 'stopping_gracefully' instead of 'stopped_gracefully' - don't set stopping state when complete! - beginning files support: resolve absolute urls for crawl detail (not pre-signing yet)	2022-01-30 18:58:47 -08:00
Ilya Kreymer	bcbc40059e	Refactor backend data model to support UUID (fixes #118) (#119) * uuid fix: (fixes #118) - update all mongo models to use UUID type as main '_id' (users continue to use 'id' as defined by fastapi-users) - update all foreign doc references to use UUID instead of string - api handlers convert str->uuid as needed api fix: - fix single crawl api, add CrawlOut response model - fix collections api - fix standalone-docker apis - for manual job, set user to current user, overriding the setting from crawlconfig * additional fixes: - rename username -> userName to indicate not the login 'username' - rename user -> userid, archive -> aid for crawlconfig + crawls - ensure invites correctly convert str -> uuid as needed - filter out unset values from browsertrix-crawler config * convert remaining user -> userid variables ensure archive id is passed to crawl_manager as str (via archive.id_str) * remove bulk crawlconfig delete * add support for `stopping` state when gracefully stopping crawl * for get crawl endpoint, check stopped crawls first, then running	2022-01-29 19:00:11 -08:00
Ilya Kreymer	9499ebfbba	Crawls API improvements (#117) * crawls api improvements (fixes #110) - add GET /crawls/{crawlid} api to return single crawl - resolve crawlconfig name, add as `configName` to crawl model - add 'created' date for crawlconfigs - flatten list to single 'crawls' list, instead of separate 'finished' and 'running' (running crawls added first) - include 'fileCount' and 'fileSize', remove files - remove `files` from crawl list response, also remove `aid` - remove `schedule` from crawl data altogether, (available in crawl config) - add ListCrawls response model	2022-01-29 12:08:02 -08:00
Ilya Kreymer	57a4b6b46f	add collections api: - collections defined by name per archive - can update collections with additional metadata (currently just description) - crawl config api accepts a list of collections by name, resolved to collection uids and stored in config - finished crawls also associated with collection list - /archives/{aid}/collections/{name} can list all crawl artifacts (wacz files) from a named collection (in frictionless data package-ish format) - /archives/{aid}/collections/$all lists all crawled artifacts for the archive readiness check: add /healthz endpoints for app and nginx ingress: add /data/ route to local bucket storage improvements: - for default storages, store path only, and prepend default storage access endpoint - collections api returns the paths using the storage access endpoint - define default storages as secrets in k8s (can support multiple), hard-coded in docker (only one for now)	2021-10-27 09:39:14 -07:00
Ilya Kreymer	c38e0b7bf7	use redis based queue instead of url for crawl done webhook update docker setup to support redis webhook, add consistent CRAWL_ARGS, additional fixes	2021-10-10 12:18:28 -07:00
Ilya Kreymer	4ae4005d74	add ingress + nginx container for better routing support screencasting to dynamically created service via nginx (k8s only thus far) add crawl /watch endpoint to enable watching, creates service if doesn't exist add crawl /running endpoint to check if crawl is running nginx auth check in place, but not yet enabled add k8s nginx.conf add missing chart files file reorg: move docker config to configs/ k8s: add readiness check for nginx and api containers for smoother reloading ensure service deleted along with job todo: update dockerman with screencast support	2021-10-09 23:47:29 -07:00
Ilya Kreymer	19879fe349	Storage + Data Model Refactor (fixes #3): - Add default vs custom (s3) storage - K8S: All storages correspond to secrets - K8S: Default storages inited via helm - K8S: Custom storage results in custom secret (per archive) - K8S: Don't add secret per crawl config - API for changing storage per archive - Docker: default storage just hard-coded from env vars (only one for now) - Validate custom storage via aiobotocore before confirming - Data Model: remove usage from users - Data Model: support adding multiple files per crawl for parallel crawls - Data Model: track completions for parallel crawls - Data Model: initial support for tags per crawl, add collection as 'coll' tag README fixes	2021-10-09 18:58:40 -07:00
Ilya Kreymer	b6d1e492d7	add redis for storing crawl state data! - supported in both docker and k8s - additional pods with same job id automatically use same crawl state in redis - support dynamic scaling (#2) via /scale endpoint - k8s job parallelism adjusted dynamically for running job (only supported in k8s so far)	2021-09-17 15:02:11 -07:00
Ilya Kreymer	223658cfa2	misc tweaks: - better error handling for not found resources, ensure 404 - typo in k8smanager - add pylintrc - ensure manual jobs are deleted when complete - fix typos, reformat	2021-08-25 18:34:49 -07:00
Ilya Kreymer	60b48ee8a6	dockermanager + scheduler: - run as child process using aioprocessing - cleanup: support cleanup of orphaned containers - timeout: support crawlTimeout via check in cleanup loop - support crawl listing + crawl stopping	2021-08-25 15:28:57 -07:00
Ilya Kreymer	b417d7c185	docker manager: support scheduling with apscheduler and separate 'scheduler' process	2021-08-25 12:21:03 -07:00
Ilya Kreymer	91e9fc8699	dockerman: initial pass - support for creating, deleting crawlconfigs, running crawls on-demand - config stored in volume - listen to docker events and clean up containers when they exit	2021-08-24 22:49:06 -07:00
Ilya Kreymer	20b19f932f	make crawlTimeout a per-crawlconfig property allow crawl complete/partial complete to update existing crawl state, eg. timeout enable handling backofflimitexceeded / deadlineexceeded failure, with possible success able to override the failure state filter out only active jobs in running crawls listing	2021-08-24 11:29:15 -07:00
Ilya Kreymer	ed27f3e3ee	job handling: - job watch: add watch loop for job failure (backofflimitexceeded) - set job retries + job timeout via chart values - sigterm starts graceful shutdown by default, including for timeout - use sigusr1 to switch to instant shutdown - update stop_crawl() to use new semantics	2021-08-23 21:22:01 -07:00
Ilya Kreymer	7146e054a4	crawls work (#1): - support listing existing crawls - add 'schedule' and 'manual' annotations to jobs, store in Crawl obj - ensure manual jobs are deleted when completed - support deleting crawls by id (but not data) - rename running crawl delete to '/cancel' change paths for local minio/mongo to /tmp	2021-08-23 18:01:29 -07:00
Ilya Kreymer	66c4e618eb	crawls work (#1), support for: - canceling a crawl (via sigterm) - stopping a crawl gracefully (via custom exec sigint)	2021-08-23 12:25:04 -07:00
Ilya Kreymer	ea9010bf9a	add completed crawls to crawls table	2021-08-20 23:53:06 -07:00
Ilya Kreymer	4b08163ead	support usage counters per archive, per user -- handle crawl completion	2021-08-20 23:05:42 -07:00
Ilya Kreymer	170958be37	rename crawls -> crawlconfigs.py add crawls for crawl api management	2021-08-20 15:15:51 -07:00
Ilya Kreymer	f2d9d7ba6a	new features: - sending email for validation + invites, configured via env vars - inviting new users to join an existing archive - /crawldone webhook to track verify crawl id (next: store crawl complete entry)	2021-08-20 11:02:29 -07:00
Ilya Kreymer	627e9a6f14	cleanup crawl config, add separate 'runNow' field crawler: add cpu/memory limits minio: auto-create bucket for local minio	2021-08-19 14:15:21 -07:00
Ilya Kreymer	eaa87c8b43	support for user roles (owner, crawler, viewer), owner users can issue invites to other existing users by email to join existing archives	2021-08-18 20:35:51 -07:00
Ilya Kreymer	61a608bfbe	update models: - replace storages with archives, which have a single storage (for now) - crawls associated with archives - users belong to archive, with one admin user (if archive created by default) - update crawlconfig for latest browsertrix-crawler (0.4.4) - k8s: fix permissions for crawler role - k8s: fix minio service (now requiring two ports)	2021-08-18 16:53:49 -07:00
Ilya Kreymer	f77eaccf41	support committing to s3 storage move mongo into separate optional deployment along with minio support for configuring storages support for deleting crawls, associated config and secrets	2021-07-02 15:56:24 -07:00
Ilya Kreymer	a111bacfb5	add k8s support - working apis for adding crawls, removing crawls in mongo, mapped to k8s cronjobs - more complete crawl spec - option to start on-demand job from cronjobs - optional minio in separate deployment/service	2021-06-30 21:48:44 -07:00
Ilya Kreymer	c3143df0a2	rename archives -> storages add crawlconfig apis run lint pass, prep for k8s / docker crawl manager support	2021-06-29 20:30:33 -07:00
Ilya Kreymer	b08a188fea	initial commit!	2021-06-28 15:48:59 -07:00

36 Commits