browsertrix

Author	SHA1	Message	Date
Tessa Walsh	9c7a312a4c	Rework collections to track collections in Crawl (#878 ) * Track collections in Crawl rather than crawls in Collection * Add delete collection API endpoint and tests * Precompute collection crawlCount, pageCount, and tags and add them to GET collection responses * Add modified field to Collection * Update collection replay.json method * Make add and remove crawls accept list of crawl ids * Auto-add new workflow crawls to collections when they successfully complete via CrawlConfig.autoAddCollections field * Move long-running post-crawl operator tasks into asyncio task * Make CrawlConfig.autoAddCollections updatable via /update API endpoint	2023-05-25 15:41:50 -04:00
Ilya Kreymer	d7c19c7613	Wait for DB init for healthcheck + settings (#885 ) * init check: (backend fix for #794) - wait until db is inited before settings /api/settings to return 200 - also return 503 from healthcheck endpoint, until db is available	2023-05-25 09:58:30 -07:00
Tessa Walsh	5c944d4626	Remove uniqueness constraint on collection descriptions Fix for copy-paste error	2023-05-23 11:03:13 -04:00
Ilya Kreymer	12f7db3ae2	tests: fixes for crawl cancel + crawl stopped (#864 ) * tests: - fix cancel crawl test by ensuring state is not running or waiting - fix stop crawl test by ensuring stop is only initiated after at least one page has been crawled, otherwise result may be failed, as no crawl data has been crawled yet (separate fix in crawler to avoid loop if stopped before any data written webrecorder/browsertrix-crawler#314) - bump page limit to 4 for tests to ensure crawl is partially complete, not fully complete when stopping - allow canceled or partial_complete due to race condition * chart: bump frontend limits in default, not just for tests (addresses #780) * crawl stop before starting: - if crawl stopped before it started, mark as canceled - add test for stopping immediately, which should result in 'canceled' crawl - attempt to increase resync interval for immediate failure - nightly tests: increase page limit to test timeout * backend: - detect stopped-before-start crawl as 'failed' instead of 'done' - stats: return stats counters as int instead of string	2023-05-22 20:17:29 -07:00
Tessa Walsh	28f1c815d0	Add crawlSuccessfulCount to workflows (#871 )	2023-05-22 19:06:37 -04:00
Tessa Walsh	bd8b306fbd	Improve sorting workflows by lastUpdated (#826 ) * Precompute config crawl stats Includes a database migration to move previously dynamically computed crawl stats for workflows into the CrawlConfig model. * Add lastRun sorting option and enable it by default * Add modified as final sort key to order non-run workflows * Remove currCrawl* fields and update frontend accordingly * Add isCrawlRunning field to backend and use in frontend	2023-05-22 18:42:30 -04:00
Tessa Walsh	60fac2b677	Add collection sorting and filtering (#863 ) * Sort by name and description (ascending by default) * Filter by name * Add endpoint to fetch collection names for search * Add collation so that utf-8 chars sort as expected	2023-05-22 16:53:49 -04:00
Ilya Kreymer	826c2e8298	version: bump to 1.6.0-beta.0	2023-05-19 11:29:31 -07:00
Tessa Walsh	f482831d53	Use collection uuid as id (instead of name) (#855 ) Also ensure name is not empty by adding minimum length of 1 Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-05-19 09:03:48 -04:00
Ilya Kreymer	d07204e59d	version: bump to 1.5.1	2023-05-18 17:28:42 -07:00
Ilya Kreymer	a1ef93a46a	version: bump to 1.5.0 for release!	2023-05-16 17:36:58 +02:00
Ilya Kreymer	ebee5e1788	version: bump to 1.5.0-beta.4	2023-05-12 07:34:50 +02:00
Ilya Kreymer	d8b36c0ae2	version: bump to 1.5.0-beta.3	2023-05-11 03:05:46 +02:00
Ilya Kreymer	d1e5b0a021	version: bump to 1.5.0-beta.2	2023-05-10 14:55:35 +02:00
Ilya Kreymer	a6ddde496d	backend: fixes to 0005 migration: (#843 ) - catch any errors on updating config (likely due to missing configmap), fix formatting	2023-05-10 12:00:41 +02:00
Ilya Kreymer	cf15d9c873	backend: ensure cid is a UUID, remove unneeded inactive check on crawls (#842 ) * backend: ensure cid is a UUID, remove unneeded inactive check on crawls * add UUID cast to cancel only	2023-05-10 11:59:44 +02:00
Ilya Kreymer	2cae065c46	Add Waiting state on the backend and frontend (#839 ) * operator: add waiting state - add pods as related objects - inspect pod status, set crawl status to 'waiting' if no pods are running frontend: - frontend support for 'waiting' state - show waiting icon from mocks --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2023-05-08 17:05:01 -07:00
Ilya Kreymer	70319594c2	crawlconfig: fix default filename template, make configurable (#835 ) * crawlconfig: fix default filename template, make configurable - make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml - set to '@ts-@hostsuffix.wacz' by default - allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap - tests: add test for custom 'default_crawl_filename_template'	2023-05-08 14:03:27 -07:00
Ilya Kreymer	fd7e81b8b7	stopping fix: backend fixes for #836 + prep for additional status fields (#837 ) * stopping fix: backend fixes for #836 - sets 'stopping' field on crawl when crawl is being stopped (both via db and on k8s object) - k8s: show 'stopping' as part of crawljob object, update subchart - set 'currCrawlStopping' on workflow - support old and new browsertrix-crawler stopping keys - tests: add tests for new stopping state, also test canceling crawl (disable test for stopping crawl, currently failing) - catch redis error when getting stats operator: additional optimizations: - run pvc removal as background task - catch any exceptions in finalizer stage (eg. if db is down), return false until finalizer completes	2023-05-08 14:02:20 -07:00
Ilya Kreymer	064cd7e08a	quickfix: fix stopping crawls with current browsertrix-crawler beta	2023-05-06 23:35:25 -07:00
Ilya Kreymer	b40d599e17	operator fixes: (#834 ) - just pass cid from operator for consistency, don't load crawl from update_crawl (different object) - don't throw in update_config_crawl_stats() to avoid exception in operator, only throw in crawlconfigs api	2023-05-06 13:02:33 -07:00
Ilya Kreymer	f992704491	version: bump version to 1.5.0-beta.1	2023-05-06 00:31:03 -07:00
Tessa Walsh	4f121fb868	Update precompute migration to only update active workflows (#833 )	2023-05-05 21:35:03 -07:00
Tessa Walsh	8281ba723e	Pre-compute workflow last crawl information (#812 ) * Precompute config crawl stats * Includes a database migration to move previously dynamically computed crawl stats for workflows into the CrawlConfig model. * Add crawls.finished descending index * Add last crawl fields to workflow tests	2023-05-05 15:12:52 -07:00
Ilya Kreymer	aae0e6590e	Ensure Volumes are deleted when crawl is canceled (#828 ) * operator: - ensures crawler pvcs are always deleted before crawl object is finalized (fixes #827) - refactor to ensure finalizer handler always run when finalizing - remove obsolete config entries	2023-05-05 12:05:54 -07:00
Tessa Walsh	48d34bc3c4	Add option to list workflows API endpoint to filter by schedule (#822 ) * Add option to filter workflows by empty or non-empty schedule * Add tests	2023-05-05 12:05:19 -07:00
Tessa Walsh	542ad7a24a	Update scale in workflow when crawl scale is updated (#820 )	2023-05-05 11:59:57 -07:00
Tessa Walsh	774ae518f4	Set crawl-stop in redis from operator when crawl is stopped (#815 ) Change redis to <crawl-id>:crawl-stop to match webrecorder/browsertrix-crawler#303	2023-05-05 11:34:24 -07:00
Tessa Walsh	b2005fe389	Fix crawl /errors API endpoint (#813 ) * Fix crawl error slicing to ensure a consistent number of errors per page * Fix total count in paginated API response	2023-05-03 13:58:38 -04:00
Tessa Walsh	1a63c31b71	backend: errors endpoint: Parse JSON-l errors before returning (#799 )	2023-04-26 14:36:48 -07:00
Ilya Kreymer	7aefe09581	startup fixes: (#793 ) - don't run migrations on first init, just set to CURR_DB_VERSION - implement 'run once lock' with mkdir/rmdir - move register_exit_handler() to utils - remove old run once handler	2023-04-24 18:32:52 -07:00
Ilya Kreymer	60ba9e366f	Refactor to use new operator on backend (#789 ) * Btrixjobs Operator - Phase 1 (#679) - add metacontroller and custom crds - add main_op entrypoint for operator * Btrix Operator Crawl Management (#767) * operator backend: - run operator api in separate container but in same pod, with WEB_CONCURRENCY=1 - operator creates statefulsets and services for CrawlJob and ProfileJob - operator: use service hook endpoint, set port in values.yaml * crawls working with CrawlJob - jobs start with 'crawljob-' prefix - update status to reflect current crawl state - set sync time to 10 seconds by default, overridable with 'operator_resync_seconds' - mark crawl as running, failed, complete when finished - store finished status when crawl is complete - support updating scale, forcing rollover, stop via patching CrawlJob - support cancel via deletion - requires hack to content-length for patching custom resources - auto-delete of CrawlJob via 'ttlSecondsAfterFinished' - also delete pvcs until autodelete supported via statefulset (k8s >1.27) - ensure filesAdded always set correctly, keep counter in redis, add to status display - optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added - add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled - add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291) - support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284) * support for scheduled jobs! - add main_scheduled_job entrypoint to run scheduled jobs - add crawl_cron_job.yaml template for declaring CronJob - CronJobs moved to default namespace * operator manages ProfileJobs: - jobs start with 'profilejob-' - update expiry time by updating ProfileJob object 'expireTime' while profile is active * refactor/cleanup: - remove k8s package - merge k8sman and basecrawlmanager into crawlmanager - move templates, k8sapi, utils into root package - delete all _job.py files - remove dt_now, ts_now from crawls, now in utils - all db operations happen in crawl/crawlconfig/org files - move shared crawl/crawlconfig/org functions that use the db to be importable directly, including get_crawl_config, add_new_crawl, inc_crawl_stats role binding: more secure setup, don't allow crawler namespace any k8s permissions - move cronjobs to be created in default namespace - grant default namespace access to create cronjobs in default namespace - remove role binding from crawler namespace * additional tweaks to templates: - templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately) * stats / redis optimization: - don't update stats in mongodb on every operator sync, only when crawl is finished - for api access, read stats directly from redis to get up-to-date stats - move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access * Add migration for operator changes - Update configmap for crawl configs with scale > 1 or crawlTimeout > 0 and schedule exists to recreate CronJobs - add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1 * subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart - metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart * backend api fixes - ensure changing scale of crawl also updates it in the db - crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api --------- Co-authored-by: D. Lee <leepro@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-24 18:30:52 -07:00
Tessa Walsh	a2435a013b	Add totalSize to workflow API endpoints (#783 )	2023-04-20 17:23:59 -04:00
Ilya Kreymer	3f41498c5c	quickfix: fix typo, remove unnecessary async	2023-04-18 16:14:15 -07:00
Ilya Kreymer	821d29bd2a	crawlconfig api: add 'currCrawlState' and 'currCrawlTimeStart' to crawlconfig list api (already queried on backend) (#770 ) * crawlconfig api: add 'currCrawlState' and 'currCrawlTimeStart' to crawlconfig list api (already queried on backend)	2023-04-17 23:13:13 -07:00
Tessa Walsh	6b19f72a89	Add crawl errors endpoint (#757 ) * Add crawl errors endpoint If this endpoint is called while the crawl is running, errors are pulled directly from redis. If this endpoint is called when the crawl is finished, errors are pulled from mongodb, where they're written when crawls complete. * Add nightly backend test for errors endpoint * Add errors for failed and cancelled crawls to mongo Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-04-17 12:59:25 -04:00
Ilya Kreymer	4a46f894a2	backend: add 'lastCrawlStartTime' and 'lastStartedByName' fields to crawlconfigs apis (#753 )	2023-04-17 08:34:29 -07:00
Tessa Walsh	59e49eacd5	Update collections backend API (#759 ) * Re-implement collections, storing crawlIds in collection * Return collections for crawl endpoints and filter on coll name * Remove crawl from all collections when deleted * Revert get_collection_crawls to flat array of resources * Fix tests	2023-04-14 12:17:18 -04:00
Ilya Kreymer	85b6a05419	Upgrade to mongo 6 and use sortArray for workflow crawls (#764 ) (#765 ) fixes from 1.4.1: * Upgrade to mongo 6 and use for workflow crawls * update readiness probe with timeouts doubled, and failure threshold increased for slower 'mongosh' readiness check update versions to 1.5.0-beta.0 in backend and frontend Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-11 18:22:07 -07:00
Tessa Walsh	fb80a04f18	Add crawl /log API endpoint If a crawl is completed, the endpoint streams the logs from the log files in all of the created WACZ files, sorted by timestamp. The API endpoint supports filtering by log_level and context whether the crawl is still running or not. This is not yet proper streaming because the entire log file is read into memory before being streamed to the client. We will want to switch to proper streaming eventually, but are currently blocked by an aiobotocore bug - see: https://github.com/aio-libs/aiobotocore/issues/991?#issuecomment-1490737762	2023-04-11 11:51:17 -04:00
Ilya Kreymer	631c84e488	version: bump to 1.4.0!	2023-04-06 10:12:43 -07:00
Ilya Kreymer	3ab62547a9	version: bump to 1.4.0-beta.2	2023-04-06 02:45:20 -07:00
Ilya Kreymer	7f757d396a	config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend… (#742 ) * config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config - add 'default_page_load_timeout_seconds' to values.yaml, defaulting to 120, for pageLoadTimeout - add 'defaultPageLoadTimeSeconds ' to /api/settings, update tests for /api/settings addresses issue in #636	2023-04-04 19:52:23 -07:00
Ilya Kreymer	67172ca1e2	fix: only include finished crawls in crawlCount value for /api/crawlconfigs (#746 )	2023-04-04 19:50:14 -07:00
Ilya Kreymer	1c47a648a9	Max page limit override (#737 ) * more page limit: update to #717, instead of setting --limit in each crawlconfig, apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit * update tests, no longer returning 'crawl_page_limit_exceeds_allowed'	2023-04-03 14:01:32 -07:00
Tessa Walsh	e9b61c632d	Add pageSize to pagination format (#736 )	2023-04-03 15:57:47 -04:00
Ilya Kreymer	887cb16146	Allow configurable max pages per crawl in deployment settings (#717 ) * backend: max pages per crawl limit, part of fix for #716: - set 'max_pages_crawl_limit' in values.yaml, default to 100,000 - if set/non-0, automatically set limit if none provided - if set/non-0, return 400 if adding config with limit exceeding max limit - return limit as 'maxPagesPerCrawl' in /api/settings - api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing) tests: add test for 'max_pages_per_crawl' setting - ensure 'limit' can not be set higher than max_pages_per_crawl - ensure pages crawled is at the limit - set test limit to max 2 pages - add settings test - check for pages.jsonl and extraPages.jsonl when crawling 2 pages	2023-03-28 16:26:29 -07:00
Tessa Walsh	4724754efc	Filter and sort crawl and workflow list API endpoints in backend (#724 ) * Re-implement pagination and paginate crawlconfig revs First step toward simplifying pagination to set us up for sorting and filtering of list endpoints. This commit removes fastapi-pagination as a dependency. * Migrate all HttpUrl seeds to Seeds This commit also updates the frontend to always use Seeds and to fix display issues resulting from the change. * Filter and sort crawls and workflows Crawls: - Filter by createdBy (via userid param) - Filter by state (comma-separated string for multiple values) - Filter by first_seed, name, description - Sort by started, finished, fileSize, firstSeed - Sort descending by default to match frontend Workflows: - Filter by createdBy (formerly userid) and modifiedBy - Filter by first_seed, name, description - Sort by created, modified, firstSeed, lastCrawlTime * Add crawlconfigs search-values API endpoint and test	2023-03-28 17:55:40 -04:00
Tessa Walsh	e293e98ac3	Fix migration to avoid jobType KeyError (#727 ) * Fix migration to avoid KeyError * Use .get() for other optional fields	2023-03-27 13:52:05 -07:00
Tessa Walsh	4136bdad2e	Add optional description to crawl configs and return in crawl endpoints (#707 )	2023-03-21 15:39:09 -04:00

139 Commits