Commit Graph

127 Commits

Author SHA1 Message Date
Ilya Kreymer
d8b36c0ae2 version: bump to 1.5.0-beta.3 2023-05-11 03:05:46 +02:00
Ilya Kreymer
d1e5b0a021 version: bump to 1.5.0-beta.2 2023-05-10 14:55:35 +02:00
Ilya Kreymer
a6ddde496d
backend: fixes to 0005 migration: (#843)
- catch any errors on updating config (likely due to missing configmap), fix formatting
2023-05-10 12:00:41 +02:00
Ilya Kreymer
cf15d9c873
backend: ensure cid is a UUID, remove unneeded inactive check on crawls (#842)
* backend: ensure cid is a UUID, remove unneeded inactive check on crawls

* add UUID cast to cancel only
2023-05-10 11:59:44 +02:00
Ilya Kreymer
2cae065c46
Add Waiting state on the backend and frontend (#839)
* operator: add waiting state
- add pods as related objects
- inspect pod status, set crawl status to 'waiting' if no pods are running

frontend:
- frontend support for 'waiting' state
- show waiting icon from mocks

---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-05-08 17:05:01 -07:00
Ilya Kreymer
70319594c2
crawlconfig: fix default filename template, make configurable (#835)
* crawlconfig: fix default filename template, make configurable
- make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml
- set to '@ts-@hostsuffix.wacz' by default
- allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap
- tests: add test for custom 'default_crawl_filename_template'
2023-05-08 14:03:27 -07:00
Ilya Kreymer
fd7e81b8b7
stopping fix: backend fixes for #836 + prep for additional status fields (#837)
* stopping fix: backend fixes for #836
- sets 'stopping' field on crawl when crawl is being stopped (both via db and on k8s object)
- k8s: show 'stopping' as part of crawljob object, update subchart
- set 'currCrawlStopping' on workflow
- support old and new browsertrix-crawler stopping keys
- tests: add tests for new stopping state, also test canceling crawl (disable test for stopping crawl, currently failing)
- catch redis error when getting stats

operator: additional optimizations:
- run pvc removal as background task
- catch any exceptions in finalizer stage (eg. if db is down), return false until finalizer completes
2023-05-08 14:02:20 -07:00
Ilya Kreymer
064cd7e08a quickfix: fix stopping crawls with current browsertrix-crawler beta 2023-05-06 23:35:25 -07:00
Ilya Kreymer
b40d599e17
operator fixes: (#834)
- just pass cid from operator for consistency, don't load crawl from update_crawl (different object)
- don't throw in update_config_crawl_stats() to avoid exception in operator, only throw in crawlconfigs api
2023-05-06 13:02:33 -07:00
Ilya Kreymer
f992704491 version: bump version to 1.5.0-beta.1 2023-05-06 00:31:03 -07:00
Tessa Walsh
4f121fb868
Update precompute migration to only update active workflows (#833) 2023-05-05 21:35:03 -07:00
Tessa Walsh
8281ba723e
Pre-compute workflow last crawl information (#812)
* Precompute config crawl stats

* Includes a database migration to move preciously dynamically computed crawl stats for workflows into the CrawlConfig model.

* Add crawls.finished descending index

* Add last crawl fields to workflow tests
2023-05-05 15:12:52 -07:00
Ilya Kreymer
aae0e6590e
Ensure Volumes are deleted when crawl is canceled (#828)
* operator:
- ensures crawler pvcs are always deleted before crawl object is finalized (fixes #827)
- refactor to ensure finalizer handler always run when finalizing
- remove obsolete config entries
2023-05-05 12:05:54 -07:00
Tessa Walsh
48d34bc3c4
Add option to list workflows API endpoint to filter by schedule (#822)
* Add option to filter workflows by empty or non-empty schedule

* Add tests
2023-05-05 12:05:19 -07:00
Tessa Walsh
542ad7a24a
Update scale in workflow when crawl scale is updated (#820) 2023-05-05 11:59:57 -07:00
Tessa Walsh
774ae518f4
Set crawl-stop in redis from operator when crawl is stopped (#815)
Change redis to <crawl-id>:crawl-stop to match webrecorder/browsertrix-crawler#303
2023-05-05 11:34:24 -07:00
Tessa Walsh
b2005fe389
Fix crawl /errors API endpoint (#813)
* Fix crawl error slicing to ensure a consistent number of errors per page
* Fix total count in paginated API response
2023-05-03 13:58:38 -04:00
Tessa Walsh
1a63c31b71
backend: errors endpoint: Parse JSON-l errors before returning (#799) 2023-04-26 14:36:48 -07:00
Ilya Kreymer
7aefe09581
startup fixes: (#793)
- don't run migrations on first init, just set to CURR_DB_VERSION
- implement 'run once lock' with mkdir/rmdir
- move register_exit_handler() to utils
- remove old run once handler
2023-04-24 18:32:52 -07:00
Ilya Kreymer
60ba9e366f
Refactor to use new operator on backend (#789)
* Btrixjobs Operator - Phase 1 (#679)

- add metacontroller and custom crds
- add main_op entrypoint for operator

* Btrix Operator Crawl Management (#767)

* operator backend:
- run operator api in separate container but in same pod, with WEB_CONCURRENCY=1
- operator creates statefulsets and services for CrawlJob and ProfileJob
- operator: use service hook endpoint, set port in values.yaml

* crawls working with CrawlJob
- jobs start with 'crawljob-' prefix
- update status to reflect current crawl state
- set sync time to 10 seconds by default, overridable with 'operator_resync_seconds'
- mark crawl as running, failed, complete when finished
- store finished status when crawl is complete
- support updating scale, forcing rollover, stop via patching CrawlJob
- support cancel via deletion
- requires hack to content-length for patching custom resources
- auto-delete of CrawlJob via 'ttlSecondsAfterFinished'
- also delete pvcs until autodelete supported via statefulset (k8s >1.27)
- ensure filesAdded always set correctly, keep counter in redis, add to status display
- optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added
- add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled
- add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291)
- support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284)

* support for scheduled jobs!
- add main_scheduled_job entrypoint to run scheduled jobs
- add crawl_cron_job.yaml template for declaring CronJob
- CronJobs moved to default namespace

* operator manages ProfileJobs:
- jobs start with 'profilejob-'
- update expiry time by updating ProfileJob object 'expireTime' while profile is active

* refactor/cleanup:
- remove k8s package
- merge k8sman and basecrawlmanager into crawlmanager
- move templates, k8sapi, utils into root package
- delete all *_job.py files
- remove dt_now, ts_now from crawls, now in utils
- all db operations happen in crawl/crawlconfig/org files
- move shared crawl/crawlconfig/org functions that use the db to be importable directly,
including get_crawl_config, add_new_crawl, inc_crawl_stats

* role binding: more secure setup, don't allow crawler namespace any k8s permissions
- move cronjobs to be created in default namespace
- grant default namespace access to create cronjobs in default namespace
- remove role binding from crawler namespace

* additional tweaks to templates:
- templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately)

* stats / redis optimization:
- don't update stats in mongodb on every operator sync, only when crawl is finished
- for api access, read stats directly from redis to get up-to-date stats
- move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access

* Add migration for operator changes
- Update configmap for crawl configs with scale > 1 or
crawlTimeout > 0 and schedule exists to recreate CronJobs
- add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1

* subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart
- metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart

* backend api fixes
- ensure changing scale of crawl also updates it in the db
- crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api

---------

Co-authored-by: D. Lee <leepro@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-24 18:30:52 -07:00
Tessa Walsh
a2435a013b
Add totalSize to workflow API endpoints (#783) 2023-04-20 17:23:59 -04:00
Ilya Kreymer
3f41498c5c quickfix: fix typo, remove unnecessary async 2023-04-18 16:14:15 -07:00
Ilya Kreymer
821d29bd2a
crawlconfig api: add 'currCrawlState' and 'currCrawlTimeStart' to crawlconfig list api (already queried on backend) (#770)
* crawlconfig api: add 'currCrawlState' and 'currCrawlTimeStart' to crawlconfig list api (already queried on backend)
2023-04-17 23:13:13 -07:00
Tessa Walsh
6b19f72a89
Add crawl errors endpoint (#757)
* Add crawl errors endpoint

If this endpoint is called while the crawl is running, errors are
pulled directly from redis.

If this endpoint is called when the crawl is finished, errors are
pulled from mongodb, where they're written when crawls complete.

* Add nightly backend test for errors endpoint

* Add errors for failed and cancelled crawls to mongo

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-04-17 12:59:25 -04:00
Ilya Kreymer
4a46f894a2
backend: add 'lastCrawlStartTime' and 'lastStartedByName' fields to crawlconfigs apis (#753) 2023-04-17 08:34:29 -07:00
Tessa Walsh
59e49eacd5
Update collections backend API (#759)
* Re-implement collections, storing crawlIds in collection

* Return collections for crawl endpoints and filter on coll name

* Remove crawl from all collections when deleted

* Revert get_collection_crawls to flat array of resources

* Fix tests
2023-04-14 12:17:18 -04:00
Ilya Kreymer
85b6a05419
Upgrade to mongo 6 and use sortArray for workflow crawls (#764) (#765)
fixes from 1.4.1:
* Upgrade to mongo 6 and use  for workflow crawls

* update readiness probe with timeouts doubled, and failure threshold increased for slower 'mongosh' readiness check

update versions to 1.5.0-beta.0 in backend and frontend

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-11 18:22:07 -07:00
Tessa Walsh
fb80a04f18 Add crawl /log API endpoint
If a crawl is completed, the endpoint streams the logs from the log
files in all of the created WACZ files, sorted by timestamp.

The API endpoint supports filtering by log_level and context whether
the crawl is still running or not.

This is not yet proper streaming because the entire log file is read
into memory before being streamed to the client. We will want to
switch to proper streaming eventually, but are currently blocked by
an aiobotocore bug - see:

https://github.com/aio-libs/aiobotocore/issues/991?#issuecomment-1490737762
2023-04-11 11:51:17 -04:00
Ilya Kreymer
631c84e488 version: bump to 1.4.0! 2023-04-06 10:12:43 -07:00
Ilya Kreymer
3ab62547a9 version: bump to 1.4.0-beta.2 2023-04-06 02:45:20 -07:00
Ilya Kreymer
7f757d396a
config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend… (#742)
* config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config
- add 'default_page_load_timeout_seconds' to values.yaml, defaulting to 120, for pageLoadTimeout
- add 'defaultPageLoadTimeSeconds ' to /api/settings, update tests for /api/settings
addresses issue in #636
2023-04-04 19:52:23 -07:00
Ilya Kreymer
67172ca1e2
fix: only include finished crawls in crawlCount value for /api/crawlconfigs (#746) 2023-04-04 19:50:14 -07:00
Ilya Kreymer
1c47a648a9
Max page limit override (#737)
* more page limit: update to #717, instead of setting --limit in each crawlconfig,
apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit

* update tests, no longer returning 'crawl_page_limit_exceeds_allowed'
2023-04-03 14:01:32 -07:00
Tessa Walsh
e9b61c632d
Add pageSize to pagination format (#736) 2023-04-03 15:57:47 -04:00
Ilya Kreymer
887cb16146
Allow configurable max pages per crawl in deployment settings (#717)
* backend: max pages per crawl limit, part of fix for #716:
- set 'max_pages_crawl_limit' in values.yaml, default to 100,000
- if set/non-0, automatically set limit if none provided
- if set/non-0, return 400 if adding config with limit exceeding max limit
- return limit as 'maxPagesPerCrawl' in /api/settings
- api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing)

tests: add test for 'max_pages_per_crawl' setting
- ensure 'limit' can not be set higher than max_pages_per_crawl
- ensure pages crawled is at the limit
- set test limit to max 2 pages
- add settings test
- check for pages.jsonl and extraPages.jsonl when crawling 2 pages
2023-03-28 16:26:29 -07:00
Tessa Walsh
4724754efc
Filter and sort crawl and workflow list API endpoints in backend (#724)
* Re-implement pagination and paginate crawlconfig revs

First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.

* Migrate all HttpUrl seeds to Seeds

This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.

* Filter and sort crawls and workflows

Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend

Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime

* Add crawlconfigs search-values API endpoint and test
2023-03-28 17:55:40 -04:00
Tessa Walsh
e293e98ac3
Fix migration to avoid jobType KeyError (#727)
* Fix migration to avoid KeyError

* Use .get() for other optional fields
2023-03-27 13:52:05 -07:00
Tessa Walsh
4136bdad2e
Add optional description to crawl configs and return in crawl endpoints (#707) 2023-03-21 15:39:09 -04:00
Ilya Kreymer
ba70d3227e version: update to 1.4.0-beta.1 2023-03-17 21:14:42 -07:00
Ilya Kreymer
07e9f51292
backend: update queue apis to work with new sorted queue apis (also b… (#712)
* backend: update queue apis to work with new sorted queue apis (also backwards compatible to existing apis)
designed for browsertrix-crawler 0.9.0-beta.1 but also backwards compatible with older list-based queue as well
2023-03-17 21:11:17 -07:00
Ilya Kreymer
de9212eec7
exclusions editor fix: (#692)
- backend: fix updating model after exclusions change
- frontend: don't check for new_cid, just success
- fixes #691
2023-03-10 22:36:10 -08:00
Ilya Kreymer
86ca9c4bac
backend: Fix for total crawl time limit. (#665)
* backend: fix for total crawl timelimit:
- time limit is computed for total job run time
- when limit is exceeded, job starts to stop crawls gracefully, equivalent to 'stop crawl' operation
- fix for #664

* rename crawl-timeout -> crawl_expire_time

* fix lint
2023-03-10 11:43:16 -08:00
Ilya Kreymer
c2fa78859b
permissions: allow user with 'viewer' permissions to access read-only crawlconfig apis (#687)
addresses issue in #653, fixes #685
2023-03-08 09:29:25 -08:00
Ilya Kreymer
544346d1d4
backend: make crawlconfigs mutable! (#656) (#662)
* backend: make crawlconfigs mutable! (#656)
- crawlconfig PATCH /{id} can now receive a new JSON config to replace the old one (in addition to scale, schedule, tags)
- exclusions: add / remove APIs mutate the current crawlconfig, do not result in a new crawlconfig created
- exclusions: ensure crawl job 'config' is updated when exclusions are added/removed, unify add/remove exclusions on crawl
- k8s: crawlconfig json is updated along with scale
- k8s: stateful set is restarted by updating annotation, instead of changing template
- crawl object: now has 'config', as well as 'profileid', 'schedule', 'crawlTimeout', 'jobType' properties to ensure anything that is changeable is stored on the crawl
- crawlconfigcore: store share properties between crawl and crawlconfig in new crawlconfigcore (includes 'schedule', 'jobType', 'config', 'profileid', 'schedule', 'crawlTimeout', 'tags', 'oid')
- crawlconfig object: remove 'oldId', 'newId', disallow deactivating/deleting while crawl is running
- rename 'userid' -> 'createdBy'
- remove unused 'completions' field
- add missing return to fix /run response
- crawlout: ensure 'profileName' is resolved on CrawlOut from profileid
- crawlout: return 'name' instead of 'configName' for consistent response
- update: 'modified', 'modifiedBy' fields to set modification date and user modifying config
- update: ensure PROFILE_FILENAME is updated in configmap is profileid provided, clear if profileid==""
- update: return 'settings_changed' and 'metadata_changed' if either crawl settings or metadata changed
- tests: update tests to check settings_changed/metadata_changed return values

add revision tracking to crawlconfig:
- store each revision separate mongo db collection
- revisions accessible via /crawlconfigs/{cid}/revs
- store 'rev' int in crawlconfig and in crawljob
- only add revision history if crawl config changed

migration:
- update to db v3
- copy fields from crawlconfig -> crawl
- rename userid -> createdBy
- copy userid -> modifiedBy, created -> modified
- skip invalid crawls (missing config), make createdBy optional (just in case)

frontend: Update crawl config keys with new API (#681), update frontend to use new PATCH endpoint, load config from crawl object in details view

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
2023-03-07 20:36:50 -08:00
Tessa Walsh
e98c7172a9
Paginate API list endpoints (#659)
* Paginate API list endpoints

fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.

* Increase page size via overriden Params and Page classes

* update api resource list keys

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-03-06 14:41:25 -05:00
Ilya Kreymer
ace4e79e3f version: bump version to 1.4.0-beta.0 2023-03-06 10:20:56 -08:00
Ilya Kreymer
df9a7eccf3 version: bump to 1.3.1 2023-02-28 18:40:15 -08:00
Ilya Kreymer
4901fc2fe9 version: bump to 1.3.0 2023-02-24 18:07:56 -08:00
Tessa Walsh
e2f359c352
CrawlConfig migration and crawl stats query optimization (#633)
* Drop crawl stats fields from CrawlConfig and add migration

* Remove migrate_down from BaseMigration

* Get crawl stats from optimized mongo query
2023-02-24 18:01:15 -08:00
Sara Tavares
8167d7da8d
fix typos (#640) 2023-02-24 11:10:49 -08:00