Commit Graph

10 Commits

Author SHA1 Message Date
Ilya Kreymer
9159c7c914
ensure max crawl size and max crawl timeout values are set to 0 when unused, instead of null (#1167)
- convert None->0 when creating CrawlJob
- ensure frontend sends 0 not null
- make input model require 'int = 0' instead of 'Optional[int] = 0'
2023-09-13 09:51:26 -07:00
Ilya Kreymer
c9c39d47b7
Scheduled Crawl Refactor: Handle via Operator + Add Skipped Crawls on Quota Reached (#1162)
* use metacontroller's decoratorcontroller to create CrawlJob from Job
* scheduled job work:
- use existing job name for scheduled crawljob
- use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done
- simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls,
placeholder job image for cronjobs

* move storage quota check to crawljob handler:
- add 'skipped_quota_reached' as new failed status type
- check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created)

* frontend:
- show all crawls in crawl workflow, no need to filter by status
- add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed

* migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template
2023-09-12 13:05:43 -07:00
Ilya Kreymer
ad9bca2e92
Operator refactor to control pods + pvcs directly instead of statefulsets (#1149)
- Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state.
- Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion.
- Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl
- Create priority classes upto 'max_crawl_scale, configured in values.yaml
- Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale,
graceful stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state
before adjust status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted.
- Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds
- Configurable Redis storage with 'redis_storage' value, set to 3Gi by default
- CrawlJob deletion starts as soon as post-finish crawl operations are run
- Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer
- Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished)
- Current resource usage added to status
- Profile browser: also manage single pod directly without statefulset for consistency.
- Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0)
- Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0)
- Failed crawl logging: dd 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status
- tests: check other finished states to avoid stuck in infinite loop if crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
fixes #1147 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-11 10:38:04 -07:00
Ilya Kreymer
68bc053ba0
Print crawl log to operator log (mostly for testing) (#1148)
* log only if 'log_failed_crawl_lines' value is set to number of last lines to log
from failed container

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-06 17:53:02 -07:00
Tessa Walsh
e667fe2e97
Add max crawl size option to backend and frontend (#1045)
Backend:
- add 'maxCrawlSize' to models and crawljob spec
- add 'MAX_CRAWL_SIZE' to configmap
- add maxCrawlSize to new crawlconfig + update APIs
- operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize
- tests: add max crawl size tests

Frontend:
- Add Max Crawl Size text box Limits tab
- Users enter max crawl size in GB, convert to bytes
- Add BYTES_PER_GB as constant for converting to bytes
- docs: Crawl Size Limit to user guide workflow setup section

Operator Refactor:
- use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing later has no effect in operator
- add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached
- crawlerjob status: store byte size under 'size', human readable size under 'sizeHuman' for clarity
- size stat always exists so remove unneeded conditional (defaults to 0)
- store raw byte size in 'size', human readable size in 'sizeHuman'

Charts:
- subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping
- subchart: show 'sizeHuman' property instead of 'size'
- bump subchart version to 0.1.1

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-08-26 22:00:37 -07:00
Ilya Kreymer
989ed2a8da
Use Shared Services for Crawling, Redis, Profile Browsers (#1088)
* refactor to use shared role-based service shared across pods:
- 'crawler' service for all crawler screencasting, scales 0 .. N with crawler-<ID>-N.crawl
- 'redis' service for all redis access, redis-<ID>-0.redis
- 'browser' service for all browser access (profile browsers), browser-<ID>-0.browser
- don't create a new service per crawl/profile at all
- enable 'publishNotReadyAddresses' for potentially faster resolving, esp for redis
- remove service as type managed by operator as no longer creating services dynamically
- remove frontend var CRAWLER_SVC_SUFFIX, suffix always '.crawler' to match crawler service name
2023-08-24 20:08:53 -07:00
Ilya Kreymer
90b2f94aef
follow-up to #1066: update redis to 5.0.0 which includes full fix for connection leak in from_url(), (#1081)
simplifies previous workaround addressed in 5.0.0
2023-08-15 20:34:47 -07:00
Ilya Kreymer
2e73148bea
fix redis connection leaks + exclusions error: (fixes #1065) (#1066)
* fix redis connection leaks + exclusions error: (fixes #1065)
- use contextmanager for accessing redis to ensure redis.close() is always called
- add get_redis_client() to k8sapi to ensure unified place to get redis client
- use connectionpool.from_url() until redis 5.0.0 is released to ensure auto close and single client settings are applied
- also: catch invalid regex passed to re.compile() in queue regex check, return 400 instead of 500 for invalid regex
- redis requirements: bump to 5.0.0rc2
2023-08-14 18:29:28 -07:00
Ilya Kreymer
00fb8ac048
Concurrent Crawl Limit (#874)
concurrent crawl limits: (addresses #866)
- support limits on concurrent crawls that can be run within a single org
- change 'waiting' state to 'waiting_org_limit' for concurrent crawl limit and 'waiting_capacity' for capacity-based
limits

orgs:
- add 'maxConcurrentCrawl' to new 'quotas' object on orgs
- add /quotas endpoint for updating quotas object

operator:
- add all crawljobs as related, appear to be returned in creation order
- operator: if concurrent crawl limit set, ensures current job is in the first N set of crawljobs (as provided via 'related' list of crawljob objects) before it can proceed to 'starting', otherwise set to 'waiting_org_limit'
- api: add org /quotas endpoint for configuring quotas
- remove 'new' state, always start with 'starting'
- crawljob: add 'oid' to crawljob spec and label for easier querying
- more stringent state transitions: add allowed_from to set_state()
- ensure state transitions only happened from allowed states, while failed/canceled can happen from any state
- ensure finished and state synched from db if transition not allowed
- add crawl indices by oid and cid

frontend: 
- show different waiting states on frontend: 'Waiting (Crawl Limit) and 'Waiting (At Capacity)'
- add gear icon on orgs admin page
- and initial popup for setting org quotas, showing all properties from org 'quotas' object

tests:
- add concurrent crawl limit nightly tests
- fix state waiting -> waiting_capacity
- ci: add logging of operator output on test failure
2023-05-30 15:38:03 -07:00
Ilya Kreymer
60ba9e366f
Refactor to use new operator on backend (#789)
* Btrixjobs Operator - Phase 1 (#679)

- add metacontroller and custom crds
- add main_op entrypoint for operator

* Btrix Operator Crawl Management (#767)

* operator backend:
- run operator api in separate container but in same pod, with WEB_CONCURRENCY=1
- operator creates statefulsets and services for CrawlJob and ProfileJob
- operator: use service hook endpoint, set port in values.yaml

* crawls working with CrawlJob
- jobs start with 'crawljob-' prefix
- update status to reflect current crawl state
- set sync time to 10 seconds by default, overridable with 'operator_resync_seconds'
- mark crawl as running, failed, complete when finished
- store finished status when crawl is complete
- support updating scale, forcing rollover, stop via patching CrawlJob
- support cancel via deletion
- requires hack to content-length for patching custom resources
- auto-delete of CrawlJob via 'ttlSecondsAfterFinished'
- also delete pvcs until autodelete supported via statefulset (k8s >1.27)
- ensure filesAdded always set correctly, keep counter in redis, add to status display
- optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added
- add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled
- add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291)
- support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284)

* support for scheduled jobs!
- add main_scheduled_job entrypoint to run scheduled jobs
- add crawl_cron_job.yaml template for declaring CronJob
- CronJobs moved to default namespace

* operator manages ProfileJobs:
- jobs start with 'profilejob-'
- update expiry time by updating ProfileJob object 'expireTime' while profile is active

* refactor/cleanup:
- remove k8s package
- merge k8sman and basecrawlmanager into crawlmanager
- move templates, k8sapi, utils into root package
- delete all *_job.py files
- remove dt_now, ts_now from crawls, now in utils
- all db operations happen in crawl/crawlconfig/org files
- move shared crawl/crawlconfig/org functions that use the db to be importable directly,
including get_crawl_config, add_new_crawl, inc_crawl_stats

* role binding: more secure setup, don't allow crawler namespace any k8s permissions
- move cronjobs to be created in default namespace
- grant default namespace access to create cronjobs in default namespace
- remove role binding from crawler namespace

* additional tweaks to templates:
- templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately)

* stats / redis optimization:
- don't update stats in mongodb on every operator sync, only when crawl is finished
- for api access, read stats directly from redis to get up-to-date stats
- move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access

* Add migration for operator changes
- Update configmap for crawl configs with scale > 1 or
crawlTimeout > 0 and schedule exists to recreate CronJobs
- add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1

* subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart
- metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart

* backend api fixes
- ensure changing scale of crawl also updates it in the db
- crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api

---------

Co-authored-by: D. Lee <leepro@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-24 18:30:52 -07:00