concurrent crawl limits: (addresses #866)
- support limits on concurrent crawls that can be run within a single org
- change 'waiting' state to 'waiting_org_limit' for concurrent crawl limit and 'waiting_capacity' for capacity-based
limits
orgs:
- add 'maxConcurrentCrawl' to new 'quotas' object on orgs
- add /quotas endpoint for updating quotas object
operator:
- add all crawljobs as related, appear to be returned in creation order
- operator: if concurrent crawl limit set, ensures current job is in the first N set of crawljobs (as provided via 'related' list of crawljob objects) before it can proceed to 'starting', otherwise set to 'waiting_org_limit'
- api: add org /quotas endpoint for configuring quotas
- remove 'new' state, always start with 'starting'
- crawljob: add 'oid' to crawljob spec and label for easier querying
- more stringent state transitions: add allowed_from to set_state()
- ensure state transitions only happened from allowed states, while failed/canceled can happen from any state
- ensure finished and state synched from db if transition not allowed
- add crawl indices by oid and cid
frontend:
- show different waiting states on frontend: 'Waiting (Crawl Limit) and 'Waiting (At Capacity)'
- add gear icon on orgs admin page
- and initial popup for setting org quotas, showing all properties from org 'quotas' object
tests:
- add concurrent crawl limit nightly tests
- fix state waiting -> waiting_capacity
- ci: add logging of operator output on test failure
* optimizations:
- rename update_crawl_config_stats to stats_recompute_all, only used in migration to fetch all crawls
and do a full recompute of all file sizes
- add stats_recompute_last to only get last crawl by size, increment total size by specified amount, and incr/decr number of crawls
- Update migration 0007 to use stats_recompute_all
- Add isCrawlRunning, lastCrawlStopping, and lastRun to
stats_recompute_last
- Increment crawlSuccessfulCount in stats_recompute_last
* operator/crawls:
- operator: keep track of filesAddedSize in redis as well
- rename update_crawl to update_crawl_state_if_changed() and only update
if state is different, otherwise return false
- ensure mark_finished() operations only occur if crawl is state has changed
- don't clear 'stopping' flag, can track if crawl was stopped
- state always starts with "starting", don't reset to starting
tests:
- Add test for incremental workflow stats updating
- don't clear stopping==true, indicates crawl was manually stopped
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* Track collections in Crawl rather than crawls in Collection
* Add delete collection API endpoint and tests
* Precompute collection crawlCount, pageCount, and tags and add them to
GET collection responses
* Add modified field to Collection
* Update collection replay.json method
* Make add and remove crawls accept list of crawl ids
* Auto-add new workflow crawls to collections when they successfully
complete via CrawlConfig.autoAddCollections field
* Move long-running post-crawl operator tasks into asyncio task
* Make CrawlConfig.autoAddCollections updatable via /update API endpoint
* init check: (backend fix for #794)
- wait until db is inited before settings /api/settings to return 200
- also return 503 from healthcheck endpoint, until db is available
* docs refactor:
- add local deployment guide local-dev-setup.md
- deploy/local.md focuses only on deployment with latest release, links to local-dev-setup.md
for local image deployment
- add nav to mkdocs.yml to ensure correct order of pages
- update microk8s specific info
- update minikube specific info
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* tests:
- fix cancel crawl test by ensuring state is not running or waiting
- fix stop crawl test by ensuring stop is only initiated after at least one page has been crawled,
otherwise result may be failed, as no crawl data has been crawled yet (separate fix in crawler to avoid loop if stopped
before any data written webrecorder/browsertrix-crawler#314)
- bump page limit to 4 for tests to ensure crawl is partially complete, not fully complete when stopping
- allow canceled or partial_complete due to race condition
* chart: bump frontend limits in default, not just for tests (addresses #780)
* crawl stop before starting:
- if crawl stopped before it started, mark as canceled
- add test for stopping immediately, which should result in 'canceled' crawl
- attempt to increase resync interval for immediate failure
- nightly tests: increase page limit to test timeout
* backend:
- detect stopped-before-start crawl as 'failed' instead of 'done'
- stats: return stats counters as int instead of string
* Precompute config crawl stats
Includes a database migration to move preciously dynamically computed
crawl stats for workflows into the CrawlConfig model.
* Add lastRun sorting option and enable it by default
* Add modified as final sort key to order non-run workflows
* Remove currCrawl* fields and update frontend accordingly
* Add isCrawlRunning field to backend and use in frontend
* Sort by name and description (ascending by default)
* Filter by name
* Add endpoint to fetch collection names for search
* Add collation so that utf-8 chars sort as expected
- Updates superadmin "Running Crawls" to show active crawls (starting, waiting, running, stopping) and sort by start by default
- Navigates to crawl workflow watch view on clicking crawl item
- Adds "Copy Crawl ID" to crawl actions for easy paste into "Jump to crawl"
- Navigates to crawl workflow watch when jumping to crawl
- Adds primary action button next to "Actions" dropdown
- Switches "Edit Workflow Settings" button to icon button
- Redirects user to "Watch Crawl" tab when starting crawl
- Now uses crawl ID from `data.started` in API `/run` response for more responsive UI
- Keeps "Watch Crawl" tab navigation button in list but disable when crawl is not running
- Also handles watch view when workflow is not running to cover navigational edge cases
- Adds banner in "Crawls" list to direct users to the Watch Crawl when workflow is running
- Shows notification when crawl is done to make redirect to Crawls tab smoother
- Uses workflow scale when updating crawl scale
- Removes "All" from "View: All Finished Crawls" on Finished Crawl page for wording consistency
* frontend crawl stopping improvements (#836)
- support new backend 'stopping' property
- for now, keep 'stopping' indicator state when crawl is running but stopping set to true
* operator: add waiting state
- add pods as related objects
- inspect pod status, set crawl status to 'waiting' if no pods are running
frontend:
- frontend support for 'waiting' state
- show waiting icon from mocks
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
* crawlconfig: fix default filename template, make configurable
- make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml
- set to '@ts-@hostsuffix.wacz' by default
- allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap
- tests: add test for custom 'default_crawl_filename_template'
* stopping fix: backend fixes for #836
- sets 'stopping' field on crawl when crawl is being stopped (both via db and on k8s object)
- k8s: show 'stopping' as part of crawljob object, update subchart
- set 'currCrawlStopping' on workflow
- support old and new browsertrix-crawler stopping keys
- tests: add tests for new stopping state, also test canceling crawl (disable test for stopping crawl, currently failing)
- catch redis error when getting stats
operator: additional optimizations:
- run pvc removal as background task
- catch any exceptions in finalizer stage (eg. if db is down), return false until finalizer completes
- just pass cid from operator for consistency, don't load crawl from update_crawl (different object)
- don't throw in update_config_crawl_stats() to avoid exception in operator, only throw in crawlconfigs api
* Precompute config crawl stats
* Includes a database migration to move preciously dynamically computed crawl stats for workflows into the CrawlConfig model.
* Add crawls.finished descending index
* Add last crawl fields to workflow tests
- Adds an input to the Workflow creation and edit form for specifying crawl depth. This input is conditionally shown for seeded crawls when the scope is set to "Pages on this domain", "Pages on this domain & subdomains" or "Custom page prefix". The "any" scope is also supported for backwards compatibility but is not shown by default or in new configs.
- API implementation: The depth value is set in the primary seed config, i.e. the first seed in seeds: [], not in the outer .config.depth property.
Resolves#817
- Adds relevant action buttons to each Workflow detail tab header
- Adds "Delete" action menu item to crawls in Crawls tab
- Prevent automatically switching to "Watch" tab after running crawl from detail page
- Removes "Stop" confirmation prompt and only shows "Cancel" confirmation prompt if there are one or more pages crawled
- Replaces "Cancel" confirmation prompt with web component dialog (partially addresses Switch to in-page dialogue boxes #619)
- Fixes hash routing to fix going back with browser back button
* operator:
- ensures crawler pvcs are always deleted before crawl object is finalized (fixes#827)
- refactor to ensure finalizer handler always run when finalizing
- remove obsolete config entries