browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	60ba9e366f	Refactor to use new operator on backend (#789) * Btrixjobs Operator - Phase 1 (#679) - add metacontroller and custom crds - add main_op entrypoint for operator * Btrix Operator Crawl Management (#767) * operator backend: - run operator api in separate container but in same pod, with WEB_CONCURRENCY=1 - operator creates statefulsets and services for CrawlJob and ProfileJob - operator: use service hook endpoint, set port in values.yaml * crawls working with CrawlJob - jobs start with 'crawljob-' prefix - update status to reflect current crawl state - set sync time to 10 seconds by default, overridable with 'operator_resync_seconds' - mark crawl as running, failed, complete when finished - store finished status when crawl is complete - support updating scale, forcing rollover, stop via patching CrawlJob - support cancel via deletion - requires hack to content-length for patching custom resources - auto-delete of CrawlJob via 'ttlSecondsAfterFinished' - also delete pvcs until autodelete supported via statefulset (k8s >1.27) - ensure filesAdded always set correctly, keep counter in redis, add to status display - optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added - add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled - add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291) - support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284) * support for scheduled jobs! - add main_scheduled_job entrypoint to run scheduled jobs - add crawl_cron_job.yaml template for declaring CronJob - CronJobs moved to default namespace * operator manages ProfileJobs: - jobs start with 'profilejob-' - update expiry time by updating ProfileJob object 'expireTime' while profile is active * refactor/cleanup: - remove k8s package - merge k8sman and basecrawlmanager into crawlmanager - move templates, k8sapi, utils into root package - delete all _job.py files - remove dt_now, ts_now from crawls, now in utils - all db operations happen in crawl/crawlconfig/org files - move shared crawl/crawlconfig/org functions that use the db to be importable directly, including get_crawl_config, add_new_crawl, inc_crawl_stats role binding: more secure setup, don't allow crawler namespace any k8s permissions - move cronjobs to be created in default namespace - grant default namespace access to create cronjobs in default namespace - remove role binding from crawler namespace * additional tweaks to templates: - templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately) * stats / redis optimization: - don't update stats in mongodb on every operator sync, only when crawl is finished - for api access, read stats directly from redis to get up-to-date stats - move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access * Add migration for operator changes - Update configmap for crawl configs with scale > 1 or crawlTimeout > 0 and schedule exists to recreate CronJobs - add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1 * subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart - metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart * backend api fixes - ensure changing scale of crawl also updates it in the db - crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api --------- Co-authored-by: D. Lee <leepro@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-24 18:30:52 -07:00
Tessa Walsh	a2435a013b	Add totalSize to workflow API endpoints (#783)	2023-04-20 17:23:59 -04:00
Tessa Walsh	6b19f72a89	Add crawl errors endpoint (#757) * Add crawl errors endpoint If this endpoint is called while the crawl is running, errors are pulled directly from redis. If this endpoint is called when the crawl is finished, errors are pulled from mongodb, where they're written when crawls complete. * Add nightly backend test for errors endpoint * Add errors for failed and cancelled crawls to mongo Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-04-17 12:59:25 -04:00
Ilya Kreymer	4a46f894a2	backend: add 'lastCrawlStartTime' and 'lastStartedByName' fields to crawlconfigs apis (#753)	2023-04-17 08:34:29 -07:00
Tessa Walsh	59e49eacd5	Update collections backend API (#759) * Re-implement collections, storing crawlIds in collection * Return collections for crawl endpoints and filter on coll name * Remove crawl from all collections when deleted * Revert get_collection_crawls to flat array of resources * Fix tests	2023-04-14 12:17:18 -04:00
Tessa Walsh	fb80a04f18	Add crawl /log API endpoint If a crawl is completed, the endpoint streams the logs from the log files in all of the created WACZ files, sorted by timestamp. The API endpoint supports filtering by log_level and context whether the crawl is still running or not. This is not yet proper streaming because the entire log file is read into memory before being streamed to the client. We will want to switch to proper streaming eventually, but are currently blocked by an aiobotocore bug - see: https://github.com/aio-libs/aiobotocore/issues/991?#issuecomment-1490737762	2023-04-11 11:51:17 -04:00
Tessa Walsh	e9b61c632d	Add pageSize to pagination format (#736)	2023-04-03 15:57:47 -04:00
Ilya Kreymer	887cb16146	Allow configurable max pages per crawl in deployment settings (#717) * backend: max pages per crawl limit, part of fix for #716: - set 'max_pages_crawl_limit' in values.yaml, default to 100,000 - if set/non-0, automatically set limit if none provided - if set/non-0, return 400 if adding config with limit exceeding max limit - return limit as 'maxPagesPerCrawl' in /api/settings - api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing) tests: add test for 'max_pages_per_crawl' setting - ensure 'limit' can not be set higher than max_pages_per_crawl - ensure pages crawled is at the limit - set test limit to max 2 pages - add settings test - check for pages.jsonl and extraPages.jsonl when crawling 2 pages	2023-03-28 16:26:29 -07:00
Tessa Walsh	4724754efc	Filter and sort crawl and workflow list API endpoints in backend (#724) * Re-implement pagination and paginate crawlconfig revs First step toward simplifying pagination to set us up for sorting and filtering of list endpoints. This commit removes fastapi-pagination as a dependency. * Migrate all HttpUrl seeds to Seeds This commit also updates the frontend to always use Seeds and to fix display issues resulting from the change. * Filter and sort crawls and workflows Crawls: - Filter by createdBy (via userid param) - Filter by state (comma-separated string for multiple values) - Filter by first_seed, name, description - Sort by started, finished, fileSize, firstSeed - Sort descending by default to match frontend Workflows: - Filter by createdBy (formerly userid) and modifiedBy - Filter by first_seed, name, description - Sort by created, modified, firstSeed, lastCrawlTime * Add crawlconfigs search-values API endpoint and test	2023-03-28 17:55:40 -04:00
Tessa Walsh	4136bdad2e	Add optional description to crawl configs and return in crawl endpoints (#707)	2023-03-21 15:39:09 -04:00
Ilya Kreymer	07e9f51292	backend: update queue apis to work with new sorted queue apis (also b… (#712) * backend: update queue apis to work with new sorted queue apis (also backwards compatible to existing apis) designed for browsertrix-crawler 0.9.0-beta.1 but also backwards compatible with older list-based queue as well	2023-03-17 21:11:17 -07:00
Ilya Kreymer	de9212eec7	exclusions editor fix: (#692) - backend: fix updating model after exclusions change - frontend: don't check for new_cid, just success - fixes #691	2023-03-10 22:36:10 -08:00
Ilya Kreymer	544346d1d4	backend: make crawlconfigs mutable! (#656) (#662) * backend: make crawlconfigs mutable! (#656) - crawlconfig PATCH /{id} can now receive a new JSON config to replace the old one (in addition to scale, schedule, tags) - exclusions: add / remove APIs mutate the current crawlconfig, do not result in a new crawlconfig created - exclusions: ensure crawl job 'config' is updated when exclusions are added/removed, unify add/remove exclusions on crawl - k8s: crawlconfig json is updated along with scale - k8s: stateful set is restarted by updating annotation, instead of changing template - crawl object: now has 'config', as well as 'profileid', 'schedule', 'crawlTimeout', 'jobType' properties to ensure anything that is changeable is stored on the crawl - crawlconfigcore: store shared properties between crawl and crawlconfig in new crawlconfigcore (includes 'schedule', 'jobType', 'config', 'profileid', 'crawlTimeout', 'tags', 'oid') - crawlconfig object: remove 'oldId', 'newId', disallow deactivating/deleting while crawl is running - rename 'userid' -> 'createdBy' - remove unused 'completions' field - add missing return to fix /run response - crawlout: ensure 'profileName' is resolved on CrawlOut from profileid - crawlout: return 'name' instead of 'configName' for consistent response - update: 'modified', 'modifiedBy' fields to set modification date and user modifying config - update: ensure PROFILE_FILENAME is updated in configmap if profileid provided, clear if profileid=="" - update: return 'settings_changed' and 'metadata_changed' if either crawl settings or metadata changed - tests: update tests to check settings_changed/metadata_changed return values add revision tracking to crawlconfig: - store each revision in separate mongo db collection - revisions accessible via /crawlconfigs/{cid}/revs - store 'rev' int in crawlconfig and in crawljob - only add revision history if crawl config changed migration: - update to db v3 - copy fields from crawlconfig -> crawl - rename userid -> createdBy - copy userid -> modifiedBy, created -> modified - skip invalid crawls (missing config), make createdBy optional (just in case) frontend: Update crawl config keys with new API (#681), update frontend to use new PATCH endpoint, load config from crawl object in details view --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: sua yoo <sua@suayoo.com>	2023-03-07 20:36:50 -08:00
Tessa Walsh	e98c7172a9	Paginate API list endpoints (#659) * Paginate API list endpoints fastapi-pagination is pinned to 0.9.3, the latest release that plays nicely with pinned versions of fastapi and fastapi-users. * Increase page size via overridden Params and Page classes * update api resource list keys --------- Co-authored-by: sua yoo <sua@suayoo.com>	2023-03-06 14:41:25 -05:00
Tessa Walsh	e2f359c352	CrawlConfig migration and crawl stats query optimization (#633) * Drop crawl stats fields from CrawlConfig and add migration * Remove migrate_down from BaseMigration * Get crawl stats from optimized mongo query	2023-02-24 18:01:15 -08:00
Tessa Walsh	ed94dde7e6	Include firstSeed and seedCount in crawl endpoints (#618)	2023-02-22 10:27:31 -05:00
Tessa Walsh	bd4fba7af7	Fix POST /orgs/{oid}/crawls/delete (#591) * Fix POST /orgs/{oid}/crawls/delete - Add permissions check to ensure crawler users can only delete their own crawls - Fix broken delete_crawls endpoint - Delete files from storage as well as deleting crawl from db - Add tests, including nightly test that ensures crawl files are no longer accessible after the crawl is deleted	2023-02-15 21:06:12 -05:00
Tessa Walsh	ce8f426978	Add notes to crawl and crawl updates (#587)	2023-02-08 18:36:22 -08:00
Tessa Walsh	2e3b3cb228	Add API endpoint to update crawl tags (#545) * Add API endpoint to update crawls (tags only for now) * Allow setting tags to empty list in crawlconfig updates	2023-02-01 22:24:36 -05:00
Tessa Walsh	23022193fb	Reformat backend for black 23.1.0 (#548)	2023-02-01 20:01:09 -05:00
Tessa Walsh	0fa60ebc45	Rename archives/teams -> orgs in codebase + add db migration (#486) * Rename archives to orgs and aid to oid on backend * Rename archive to org and aid to oid in frontend * Remove translation artifact * Rename team -> organization * Add database migrations and run once on startup * This commit also applies the new by_one_worker decorator to other asyncio tasks to prevent heavy tasks from being run in each worker. * Run black, pylint, and husky via pre-commit * Set db version and use in migrations * Update and prepare database in single task * Migrate k8s configmaps	2023-01-18 14:51:04 -08:00
Ilya Kreymer	2daa742585	Copy tags from crawlconfig to crawl (#467), fixes #466 - add tags to crawl object - ensure tags are copied from crawlconfig to crawl when crawl is created (both manually and scheduled) - tests: add test to ensure tags added to crawl, remove redundant wait replaced with fixtures	2023-01-12 17:46:19 -08:00
Ilya Kreymer	5efeaa58b1	API filters by user + crawl collection ids (#462) backend: object filtering: - add filtering crawls, crawlconfigs and profiles by userid= query arg, fixes #460 - add filtering crawls by crawlconfig via cid= query arg, fixes #400 - tests: add test_filter_results test suite to test filtering crawls and crawlconfigs by user, also create user with 'crawler' permissions, run second crawl with that user.	2023-01-11 16:50:38 -08:00
Tessa Walsh	d1b59c9bd0	Use archive_viewer_dep permissions to GET crawls (#443) * Use archive_viewer_dep permissions to GET crawls * Add is_viewer check to archive_dep * Add API endpoint to add new user to archive directly (/archive/<id>/add-user) * Add tests * Refactor tests to use fixtures * And remove login test that duplicates fixtures	2023-01-09 19:11:53 -08:00
Ilya Kreymer	dfca09fc9c	Add single crawl info api at /crawls/{crawl_id} (#418) * backend: crawl info apis: - add /crawls/{crawl_id} api endpoint which just lists the crawl info, without resolving the individual files - move /crawls/{crawl_id}.json -> /crawls/{crawl_id}/replay.json for clarity that it's used for replay * frontend: update api for new replay.json endpoint	2022-12-19 14:54:48 -08:00
Ilya Kreymer	793611e5bb	add exclusion api, fixes #311 (#349) * add exclusion api, fixes #311 add new apis: `POST crawls/{crawl_id}/exclusion?regex=...` and `DELETE crawls/{crawl_id}/exclusion?regex=...` which will: - create new config with 'regex' added as exclusion (deleting or making inactive previous config) OR removed as exclusion. - update crawl to point to new config - update statefulset to point to new config, causing crawler pods to restart - filter out urls matching 'regex' from both queue and seen list (currently a bit slow) (when adding only) - return 400 if exclusion already exists when adding, or doesn't exist when removing - api reads redis list in reverse to match how exclusion queue is used	2022-11-12 17:24:30 -08:00
Ilya Kreymer	d340bceb39	style pass: normalize docstring spacing	2022-10-19 21:47:34 -07:00
Ilya Kreymer	f7836c345d	Crawl Queue API (#342) * crawl queue api work: (#329) - add /crawls/{crawl_id}/queue api to get crawl queue, with offset, count, and optional regex. returns results and regex matches within the results, along with total urls in queue. - add api to match entire crawl queue, /crawls/{crawl_id}/queueMatch with query 'regex' arg, which processes entire crawl queue on backend and returns a list of matches (more experimental) - if crawl not yet started / redis not available, return empty queue - only supported for k8s deployment at the moment	2022-10-12 19:56:13 -07:00
Ilya Kreymer	df905682a5	backend: fix scaling api response, return error details if available	2022-06-29 18:37:04 -07:00
Ilya Kreymer	2717a60763	improvements / bug fixes for stop/cancel handling: (#279) - only send signal if stopping, no need for canceling as pods/containers will be removed - refactor stop/cancel handling to be unified in manager, separate in job - when stopping / graceful shutdown, return false if sending signal fails - return success=true in json response if and only if stop/cancel actually succeeds, return 'error' message in error, should fix #270 - allow canceling after stopping / if stopping fails - ensure finished time is set in case of cancelation before crawl starts, should fix #273	2022-06-29 17:47:25 -07:00
Ilya Kreymer	418c07bf0d	Local swarm + podman support (#261) * backend: refactor swarm support to also support podman (#260) - implement podman support as subclass of swarm deployment - podman is used when 'RUNTIME=podman' env var is set - podman socket is mapped instead of docker socket - podman-compose is used instead of docker-compose (though docker-compose works with podman, it does not support secrets, but podman-compose does) - separate cli utils into SwarmRunner and PodmanRunner which extends it - using config.yaml and config.env, both copied from sample versions - work on simplifying config: add docker-compose.podman.yml and docker-compose.swarm.yml and signing and debug configs in ./configs - add {build,run,stop}-{swarm,podman}.sh in scripts dir - add init-configs, only copy if configs don't exist - build local image use current version of podman, to support both podman 3.x and 4.x - additional fixes for after testing podman on centos - docs: update Deployment.md to cover swarm, podman, k8s deployment	2022-06-14 00:13:49 -07:00
Ilya Kreymer	0c8a5a49b4	refactor to use docker swarm for local alternative to k8s instead of docker compose (#247): - use python-on-whale to use docker cli api directly, creating docker stack for each crawl or profile browser - configure storages via storages.yaml secret - add crawl_job, profile_job, splitting into base and k8s/swarm implementations - split manager into base crawlmanager and k8s/swarm implementations - swarm: load initial scale from db to avoid modifying fixed configs, in k8s, load from configmap - swarm: support scheduled jobs via swarm-cronjob service - remove docker dependencies (aiodocker, apscheduler, scheduling) - swarm: when using local minio, expose via /data/ route in nginx via extra include (in k8s, include dir is empty and routing handled via ingress) - k8s: cleanup minio chart: move init containers to minio.yaml - swarm: stateful set implementation to be consistent with k8s scaling: - don't use service replicas, - create a unique service with '-N' appended and allocate unique volume for each replica - allows crawl containers to be restarted w/o losing data - add volume pruning background service, as volumes can be deleted only after service shuts down fully - watch: fully simplify routing, route via replica index instead of ip for both k8s and swarm - rename network btrix-cloud-net -> btrix-net to avoid conflict with compose network	2022-06-05 10:37:17 -07:00
Ilya Kreymer	bf79959a5a	refactoring to use statefulsets + job (#245) - use statefulsets instead of deployments for mongo, redis, signer - use k8s job + statefulset for running crawls - use separate statefulset for crawl (scaled) and single-replica redis stateful set - move crawl job update logic to crawl_updater - remove shared redis chart package refactor: - move shared code to 'btrixcloud' - move k8s to 'btrixcloud.k8s' - move docker to 'btrixcloud.docker'	2022-06-05 10:37:17 -07:00

33 Commits