- Remove config.seeds from workflow and crawl detail endpoints
- Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow
- Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before)
- Modify frontend to fetch seeds from new /seeds endpoints with loading indicator
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Fix(backend): Stream logs without causing OOM
Also be smarter about when to use `heapq.merge` and when to use
`itertools.chain`: If all the logs are coming from the same instance we
`chain` them, otherwise we'll `merge` them
iterator fixes:
- group wacz files by instance by suffix, eg. -0.wacz, -1.wacz, -2.wacz
- sort wacz files, and all logs within each wacz file
- chain log iterators for all log files within wacz group
- merge log iterators across wacz files in different groups
- add type hints to help keep track of iterator helper functions
- add iter_lines() from botocore, use that for line parsing for simplicity
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- add liveness check/fix readiness check - ensure 'redis-cli ping' actually returns 'PONG', as exit code is 0 even if errors
will detect situations where redis is not available, such as due to to max clients being reached
- bump redis memory/cpu for now (until autoscaling/automatic adjustment is available)
- optimization: convert all uses of 'async for' to use iterator directly instead of converting to list to avoid
unbounded size lists
- additional cursor.to_list() to async for conversions for stats computation, simply crawlconfigs stats computation
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
`smtp_use_tls = bool(os.environ.get("EMAIL_SMTP_USE_TLS", True))` would only disable tls when `EMAIL_SMTP_USE_TLS` is set to an empty string which is not intuitive
* migration improvements + rerunning migrations: (fixes#1227)
- avoid starting some workers while migration is still running
- ensure workers that aren't performing migration await for migration to complete
- backend will not be valid until migration is run
* allow rerunning migration from specified version via --set rerun_from_migration=<VERSION> (replaces rerun_last_migration)
* store usernames (createdByName, modifiedByName, startedByName) in db for workflows
* store userName for userid for crawls in db
* update output models to return usernames
* add migration 0018 to add usernames to existing crawls and crawlconfigs
* updated tests for crawl and config usernames
* use async for to iterate over crawls and crawlconfigs
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- limit how many exclusion matches are returned at once
- option to specify 'offset', 'limit' and return 'nextOffset' for further pagination
- set page limit to 1000 by default
* Add bytes stored per type to org and metrics
The org now tracks bytesStored by type of crawl, uploads, and browser profiles
in addition to the total, and returns these values in the org metrics endpoint.
A migration is added to precompute these values in existing deployments.
In addition, all /metrics storage values are now returned solely as bytes, as
the GB form wasn't being used in the frontend and is unnecessary.
* Improve deletion of multiple archived item types via `/all-crawls` delete endpoint
- Update `/all-crawls` delete test to check that org and workflow size values
are correct following deletion.
- Fix bug where it was always assumed only one crawl was deleted per cid
and size was not tracked per cid
- Add type check within delete_crawls
* Initial implementation of org metrics
(This can eventually be sped up significantly by precomputing the
values and storing them in the db.)
* Rename storageQuota to storageQuotaBytes to be consistent
* Update tests to include metrics
* Remove config from list endpoints
- Remove config field from workflow and crawl list endpoints
- Add seedCount to CrawlConfigOut on backend and Workflow on frontend
- Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional
- Refactor workflow list in frontend to use firstSeed and seedCount
- Frontend uses ListWorkflow type which is Omit<Workflow, "config">
* operator state changes: (fixes#1178)
- if at least one crawler is 'running' ensure state is reset back to running
- for multiple instances, set status to earliest state (not latest) to be consistent,
eg. if at least one crawl is running, set to running, if at least one is generating wacz, set to that
* remove almost all standalone functions and move them back into ops member functions
* operator now has access to all the ops classes as well
* keep two standalone functions used only in migrations
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Add success filter to webhook list GET endpoint
* Add sorting to webhooks list API and add event filter
* Test webhooks via echo server
* Set address to echo server on host from CI env var for k3d and microk8s
* Add -s back to pytest command for k3d ci
* Change pytest test path to avoid hanging on collecting tests
* Revert microk8s to only run on push to main
* use metacontroller's decoratorcontroller to create CrawlJob from Job
* scheduled job work:
- use existing job name for scheduled crawljob
- use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done
- simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls,
placeholder job image for cronjobs
* move storage quota check to crawljob handler:
- add 'skipped_quota_reached' as new failed status type
- check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created)
* frontend:
- show all crawls in crawl workflow, no need to filter by status
- add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed
* migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template
- More specific toast notification error messages to the action being attempted
- Single dismissable global banner shown when org storage is reached
- Removed check for storage quota reached in `runNow`, since buttons are disabled in UI, and errors handled if request fails.
- Allow creating new workflow when storage quota reached
- More responsive storage quota updates: add storageQuotaReached to archived item replay.json, updates w/o reload when crawl pushes quota over limit
- Modify LiteElement to check for storageQuotaReached on GET requests
---------
Co-authored-by: sua yoo <sua@suayoo.com>
- Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state.
- Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion.
- Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl
- Create priority classes upto 'max_crawl_scale, configured in values.yaml
- Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale,
graceful stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state
before adjust status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted.
- Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds
- Configurable Redis storage with 'redis_storage' value, set to 3Gi by default
- CrawlJob deletion starts as soon as post-finish crawl operations are run
- Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer
- Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished)
- Current resource usage added to status
- Profile browser: also manage single pod directly without statefulset for consistency.
- Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0)
- Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0)
- Failed crawl logging: dd 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status
- tests: check other finished states to avoid stuck in infinite loop if crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
fixes#1147
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Implement in backend
- Track bytesStored in org
- Add migration to pre-calculate based on size of crawlfiles and profilefiles
- Add methods to increase or decrease org storage when crawl or profile files
are added or deleted
- Include storageQuotaReached boolean in API responses that alter storage
- Don't start new crawls and fail uploads if storage quota reached
* Implement in frontend
- Add to orgs-list quotas
- Update org's storageQuotaReached based on backend endpoint responses
- Disable buttons when storage quota is met
- Show toast notification when attempting to run a crawl when org
storage quota is met
* log only if 'log_failed_crawl_lines' value is set to number of last lines to log
from failed container
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- set crawler cpu / memory with fixed base + incremental bumps based on number of browsers
- allow parsing k8s quantities with parse_quantity, compute in operator
- set 'crawler_cpu = crawler_cpu_base + crawler_extra_cpu_per_browser * (num_browsers - 1)'
and same for memory
Initial set of backend API for event webhook notifications for the following events:
* Crawl started (including boolean indicating if crawl was scheduled)
* Crawl finished
* Upload finished
* Archived item added to collection
* Archived item removed from collection
Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and then attempted to be sent (up to a total of 5 tries per overall attempt, with an increasing backoff between, implemented via use of the backoff library, which supports async).
webhook status available via /api/orgs/<oid>/webhooks
(Additional testing + potential fastapi integration left in separate follow-ups
Fixes#1041
* Add support for collectionIds to patch endpoints
* Make update available via all-crawls/ and add test
* Fix tests
* Always remove collectionIds from udpate
* Remove unnecessary fallback
* One more pass on expected values before update
Backend:
- add 'maxCrawlSize' to models and crawljob spec
- add 'MAX_CRAWL_SIZE' to configmap
- add maxCrawlSize to new crawlconfig + update APIs
- operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize
- tests: add max crawl size tests
Frontend:
- Add Max Crawl Size text box Limits tab
- Users enter max crawl size in GB, convert to bytes
- Add BYTES_PER_GB as constant for converting to bytes
- docs: Crawl Size Limit to user guide workflow setup section
Operator Refactor:
- use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing later has no effect in operator
- add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached
- crawlerjob status: store byte size under 'size', human readable size under 'sizeHuman' for clarity
- size stat always exists so remove unneeded conditional (defaults to 0)
- store raw byte size in 'size', human readable size in 'sizeHuman'
Charts:
- subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping
- subchart: show 'sizeHuman' property instead of 'size'
- bump subchart version to 0.1.1
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* fix latest crawl (lastRun) sort:
- don't cast 'started' value to string when setting as starting crawl time (regression from #937)
- caused incorrect sorting as finished crawl time was a datetime, while starting crawl time was a string
- move updated config crawl info in one place, simplify to avoid returning started time altogether, just set directly
- pass mdb crawlconfigs and crawls collections directly to add_new_crawl() function
- fixes#1108
* Add dropdown menu containing 'Remove from Collection' to archived items in collection view (#1110)
- Enables users to remove an item from a collection from the collection detail view - menu was previously missing
- Fixes: #1102 (missing dropdown menu) by making use of the inactive menu trigger button.
- Updates collection items page size to match "Archived Items" page size (20 items per page)
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* refactor to use shared role-based service shared across pods:
- 'crawler' service for all crawler screencasting, scales 0 .. N with crawler-<ID>-N.crawl
- 'redis' service for all redis access, redis-<ID>-0.redis
- 'browser' service for all browser access (profile browsers), browser-<ID>-0.browser
- don't create a new service per crawl/profile at all
- enable 'publishNotReadyAddresses' for potentially faster resolving, esp for redis
- remove service as type managed by operator as no longer creating services dynamically
- remove frontend var CRAWLER_SVC_SUFFIX, suffix always '.crawler' to match crawler service name
- Paginates Crawl Workflows when there are more than 10 workflows
- Refactors workflow search and crawl search to use the same component
- Adds sort by first seed, workflow creation date, and workflow modified date
- Separates "last run" date from "modified" date
- Update column layout into Name & Schedule (or Manual Ru'ri=), Latest Crawl (<finish time> in <duration>), total size, and last modified (modified by and modified time)
* fix redis connection leaks + exclusions error: (fixes#1065)
- use contextmanager for accessing redis to ensure redis.close() is always called
- add get_redis_client() to k8sapi to ensure unified place to get redis client
- use connectionpool.from_url() until redis 5.0.0 is released to ensure auto close and single client settings are applied
- also: catch invalid regex passed to re.compile() in queue regex check, return 400 instead of 500 for invalid regex
- redis requirements: bump to 5.0.0rc2
* helm chart tweaks:
- lower mem requirements for backend and crawler
- disable cors in ingress to pass through cors headers from backend
- crawler statefulset: use ordered instead of parallel scaling policy to avoid single crawl taking up all crawling capacity quickly
Frontend:
- Renames list view to "All Archived Items"
- Refactors fetches to use single all-crawls endpoints
- Removes search by config ID for more search parity with uploads
- Adds sort by size
- Refactors property and method names to replace crawl*
- Replaces remaining references to "crawl" in copy with "item"'
- Rename Upload Archive button to Upload WACZ
- Fix focusout in item menu so menus close
Backend:
- Filter search values by type as well
- Only get list of cids for crawls in search values
- Don't list crawl/workflow ids in search values
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>