Commit Graph

21 Commits

Author SHA1 Message Date
Ilya Kreymer
bcbc40059e
Refactor backend data model to support UUID (fixes #118) (#119)
* uuid fix: (fixes #118)
- update all mongo models to use UUID type as main '_id' (users continue to use 'id' as defined by fastapi-users)
- update all foreign doc references to use UUID instead of string
- api handlers convert str->uuid as needed
api fix:
- fix single crawl api, add CrawlOut response model
- fix collections api
- fix standalone-docker apis
- for manual job, set user to current user, overriding the setting from crawlconfig

* additional fixes:
- rename username -> userName to indicate it is not the login 'username'
- rename user -> userid, archive -> aid for crawlconfig + crawls
- ensure invites correctly convert str -> uuid as needed
- filter out unset values from browsertrix-crawler config

* convert remaining user -> userid variables
ensure archive id is passed to crawl_manager as str (via archive.id_str)

* remove bulk crawlconfig delete
* add support for `stopping` state when gracefully stopping crawl
* for get crawl endpoint, check stopped crawls first, then running
2022-01-29 19:00:11 -08:00
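
A minimal sketch of the UUID-as-`_id` pattern this commit describes, assuming pydantic-style mongo models; the class, fields, and handler here are illustrative, not the actual backend code:

```python
from uuid import UUID, uuid4
from pydantic import BaseModel, Field

class Crawl(BaseModel):
    # UUID primary key, serialized to mongo as '_id'
    id: UUID = Field(default_factory=uuid4, alias="_id")
    aid: UUID            # archive id: foreign refs are UUIDs, not strings
    userid: UUID         # renamed from 'user'
    userName: str = ""   # display name, not the login 'username'

def get_crawl(crawl_id: str) -> dict:
    # handlers receive ids as strings and convert str -> UUID as needed
    return {"_id": UUID(crawl_id)}
```

Converting at the handler boundary keeps route signatures string-typed while the database layer stays strictly UUID.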
Ilya Kreymer
9499ebfbba
Crawls API improvements (#117)
* crawls api improvements (fixes #110)
- add GET /crawls/{crawlid} api to return single crawl
- resolve crawlconfig name, add as `configName` to crawl model
- add 'created' date for crawlconfigs 
- flatten list to single 'crawls' list, instead of separate 'finished' and 'running' (running crawls added first)
- include 'fileCount' and 'fileSize', remove files
- remove `files` from crawl list response, also remove `aid`
- remove `schedule` from crawl data altogether (available in the crawl config)
- add ListCrawls response model
2022-01-29 12:08:02 -08:00
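
A hedged sketch of the flattened response shape the commit describes; the field names follow the message, while the types and optional fields are assumptions:

```python
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel

class CrawlOut(BaseModel):
    id: str
    configName: Optional[str] = None   # resolved from the crawl config
    started: Optional[datetime] = None
    fileCount: int = 0
    fileSize: int = 0                  # totals replace the `files` list

class ListCrawls(BaseModel):
    crawls: List[CrawlOut]             # running crawls first, then finished
```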
Ilya Kreymer
0bea0cfff2
crawl config new template: add support for 'extraHops' config option (available in browsertrix-crawler 0.5.0) (#104)
frontend:
- add checkbox to basic crawl config component which sets 'extraHops' to 1 when checked, otherwise 0
- text tweaks: rename Scope Type -> Crawl Scope, capitalization

backend: add 'extraHops' to CrawlConfig
fixes #102
2022-01-26 21:18:22 -08:00
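
For illustration, a crawl config payload using the new option; the surrounding fields are assumptions, only 'extraHops' comes from the commit above:

```python
# hypothetical request body for creating a crawl config
new_config = {
    "name": "docs-crawl",
    "config": {
        "seeds": ["https://example.com/"],
        "scopeType": "prefix",
        "extraHops": 1,  # 0 when the frontend checkbox is unchecked
    },
}
```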
Ilya Kreymer
f55f84c60b backend:
- crawlconfigs cleanup: simplify get_crawl_configs api
- return CrawlConfigOut for single crawlconfig api endpoint, include currCrawlId
2022-01-22 17:41:37 -08:00
Ilya Kreymer
77aa5213f2 quickfix: typo fix, return config, not archive, fixes #96 2022-01-22 17:21:29 -08:00
Ilya Kreymer
b506442b21
backend api: add curr crawl to crawlconfig listing (#95)
* backend api: add current crawl id to crawlconfig listing
- model: add 'currCrawlId' to CrawlConfig model
- output: add response model to /crawlconfigs api response to show correct openapi model
- rename crawl_configs -> crawlConfigs for consistency
2022-01-22 13:52:46 -08:00
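
A sketch of the listing shape implied by this commit and the cleanup commit above it; the actual backend models may differ:

```python
from typing import List, Optional
from pydantic import BaseModel

class CrawlConfigOut(BaseModel):
    id: str
    name: Optional[str] = None
    currCrawlId: Optional[str] = None   # id of the currently running crawl

class CrawlConfigsResponse(BaseModel):
    crawlConfigs: List[CrawlConfigOut]  # renamed from crawl_configs
```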
Ilya Kreymer
88f1689e0e crawlconfig: add 'name' property to crawl config
superuser init: don't check invite token for verified superuser (automatic init)
fix formatting
2022-01-15 19:06:48 -08:00
Ilya Kreymer
3d4d7049a2
Misc backend fixes for cloud deployment (#26)
* misc backend fixes:
- fix running w/o local minio
- ensure crawler image pull policy is configurable, loaded via chart value
- use digitalocean repo for main backend image (for now)
- add bucket_name to config only if using default bucket

* enable all behaviors, support 'access_endpoint_url' for default storages

* debugging: add 'no_delete_jobs' setting for k8s and docker to disable deletion of completed jobs
2021-11-25 11:58:26 -08:00
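
One plausible way a chart value reaches the app, shown as a sketch; the env var name here is a guess for illustration, not the chart's actual key:

```python
import os

# chart value -> container env -> app setting
crawler_pull_policy = os.environ.get("CRAWLER_IMAGE_PULL_POLICY", "IfNotPresent")
```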
Ilya Kreymer
57a4b6b46f add collections api:
- collections defined by name per archive
- can update collections with additional metadata (currently just description)
- crawl config api accepts a list of collections by name, resolved to collection uids and stored in config
- finished crawls also associated with collection list
- /archives/{aid}/collections/{name} can list all crawl artifacts (wacz files) from a named collection (in frictionless data package-ish format)
- /archives/{aid}/collections/$all lists all crawled artifacts for the archive

readiness check: add /healthz endpoints for app and nginx
ingress: add /data/ route to local bucket

storage improvements:
- for default storages, store path only, and prepend default storage access endpoint
- collections api returns the paths using the storage access endpoint
- define default storages as secrets in k8s (can support multiple), hard-coded in docker (only one for now)
2021-10-27 09:39:14 -07:00
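
A minimal FastAPI sketch of the collection listing route named above; the handler body and response shape are placeholders:

```python
from fastapi import APIRouter

router = APIRouter()

@router.get("/archives/{aid}/collections/{name}")
async def list_collection(aid: str, name: str):
    # '$all' lists every crawled artifact for the archive; otherwise,
    # return wacz files for the named collection, each path prefixed
    # with the default storage access endpoint
    return {"resources": []}  # placeholder body
```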
Ilya Kreymer
c38e0b7bf7 use a redis-based queue instead of a URL for the crawl-done webhook
update docker setup to support the redis webhook, add consistent CRAWL_ARGS, additional fixes
2021-10-10 12:18:28 -07:00
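
A sketch of the redis-based queue idea, assuming a plain redis list; the key name and payload shape are assumptions:

```python
import json
import redis

r = redis.Redis(host="local-redis", port=6379)

def crawl_done(crawl_id: str, state: str) -> None:
    # the crawler pushes a message; the backend pops from the same key,
    # replacing the HTTP webhook callback
    r.lpush("crawls-done", json.dumps({"id": crawl_id, "state": state}))
```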
Ilya Kreymer
4ae4005d74 add ingress + nginx container for better routing
support screencasting to dynamically created service via nginx (k8s only thus far)
add crawl /watch endpoint to enable watching; creates the service if it doesn't exist
add crawl /running endpoint to check if crawl is running
nginx auth check in place, but not yet enabled
add k8s nginx.conf
add missing chart files
file reorg: move docker config to configs/
k8s: add readiness check for nginx and api containers for smoother reloading
ensure service deleted along with job
todo: update dockerman with screencast support
2021-10-09 23:47:29 -07:00
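
The /healthz endpoints are the simplest piece here to sketch; the path comes from the commit, the handler is assumed:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/healthz")
async def healthz():
    # readiness target for both the app and nginx containers
    return {}
```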
Ilya Kreymer
19879fe349 Storage + Data Model Refactor (fixes #3):
- Add default vs custom (s3) storage
 - K8S: All storages correspond to secrets
 - K8S: Default storages initialized via helm
 - K8S: Custom storage results in custom secret (per archive)
 - K8S: Don't add secret per crawl config
 - API for changing storage per archive
 - Docker: default storage just hard-coded from env vars (only one for now)
 - Validate custom storage via aiobotocore before confirming
 - Data Model: remove usage from users
 - Data Model: support adding multiple files per crawl for parallel crawls
 - Data Model: track completions for parallel crawls
 - Data Model: initial support for tags per crawl, add collection as 'coll' tag

README fixes
2021-10-09 18:58:40 -07:00
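
A sketch of the default-vs-custom storage split described above; the field names are illustrative:

```python
from typing import Union
from pydantic import BaseModel

class DefaultStorage(BaseModel):
    type: str = "default"
    name: str                # which helm-initialized storage secret to use

class S3Storage(BaseModel):
    type: str = "s3"         # custom storage, becomes a per-archive secret
    endpoint_url: str
    access_key: str
    secret_key: str          # validated via aiobotocore before confirming

Storage = Union[DefaultStorage, S3Storage]
```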
Ilya Kreymer
b6d1e492d7 add redis for storing crawl state data!
- supported in both docker and k8s
- additional pods with same job id automatically use same crawl state in redis
- support dynamic scaling (#2) via /scale endpoint - k8s job parallelism adjusted dynamically for running job (only supported in k8s so far)
2021-09-17 15:02:11 -07:00
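
A sketch of what the /scale endpoint plausibly does under the hood on k8s, assuming the kubernetes_asyncio client; the helper name is illustrative:

```python
from kubernetes_asyncio import client

async def scale_crawl(batch: client.BatchV1Api, name: str, ns: str, scale: int):
    # pods with the same job id share crawl state in redis, so raising
    # parallelism simply adds workers to the running crawl
    await batch.patch_namespaced_job(name, ns, {"spec": {"parallelism": scale}})
```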
Ilya Kreymer
223658cfa2 misc tweaks:
- better error handling for not found resources, ensure 404
- typo in k8smanager
- add pylintrc
- ensure manual jobs are deleted when complete
- fix typos, reformat
2021-08-25 18:34:49 -07:00
Ilya Kreymer
60b48ee8a6 dockermanager + scheduler:
- run as child process using aioprocessing
- cleanup: support cleanup of orphaned containers
- timeout: support crawlTimeout via check in cleanup loop
- support crawl listing + crawl stopping
2021-08-25 15:28:57 -07:00
Ilya Kreymer
b417d7c185 docker manager: support scheduling with apscheduler and separate 'scheduler' process 2021-08-25 12:21:03 -07:00
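
A sketch of the separate scheduler process these two commits describe, combining aioprocessing and apscheduler as the messages mention; the target function is a placeholder:

```python
import aioprocessing

def run_scheduler() -> None:
    # apscheduler loop that launches crawls on their cron schedules
    ...

proc = aioprocessing.AioProcess(target=run_scheduler)
proc.start()
```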
Ilya Kreymer
91e9fc8699 dockerman: initial pass
- support for creating, deleting crawlconfigs, running crawls on-demand
- config stored in volume
- listen to docker events and clean up containers when they exit
2021-08-24 22:49:06 -07:00
Ilya Kreymer
20b19f932f make crawlTimeout a per-crawlconfig property
allow crawl complete/partial complete to update existing crawl state, e.g. on timeout
enable handling of BackoffLimitExceeded / DeadlineExceeded failures, with a possible success able to override the failure state
filter out only active jobs in running crawls listing
2021-08-24 11:29:15 -07:00
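
A sketch of how a per-crawlconfig timeout plausibly maps onto the k8s Job spec; the helper and exact fields are assumptions, though activeDeadlineSeconds is what produces the DeadlineExceeded condition handled above:

```python
def make_job_spec(crawl_timeout: int) -> dict:
    # DeadlineExceeded / BackoffLimitExceeded raised from this spec are
    # the failure conditions the commit above handles
    return {
        "spec": {
            "activeDeadlineSeconds": crawl_timeout,  # per-crawlconfig value
            "backoffLimit": 1,
        }
    }
```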
Ilya Kreymer
66c4e618eb crawls work (#1), support for:
- canceling a crawl (via sigterm)
- stopping a crawl gracefully (via custom exec sigint)
2021-08-23 12:25:04 -07:00
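
A hypothetical docker-SDK rendering of the two stop modes; the signals come from the commit, but the helper names and exec command here are assumptions:

```python
import docker

client = docker.from_env()

def cancel_crawl(container_id: str) -> None:
    # SIGTERM cancels the crawl outright
    client.containers.get(container_id).kill(signal="SIGTERM")

def stop_crawl_gracefully(container_id: str) -> None:
    # exec a SIGINT inside the container so the crawler finishes writing
    client.containers.get(container_id).exec_run("kill -INT 1")
```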
Ilya Kreymer
a8255a76b2 crawljob:
- support run once on existing crawl job
- support updating/patching existing crawl job with new crawl config, new schedule and run once
2021-08-21 22:10:31 -07:00
Ilya Kreymer
170958be37 rename crawls -> crawlconfigs.py
add crawls for crawl api management
2021-08-20 15:15:51 -07:00