Fixes#1261Closes#1092
The quota for monthly execution minutes is treated as a hard cap. Once
it is exceeded, an alert indicating that an org has exceeded its monthly
execution minutes will display and the user will be unable to start new
crawls. Any running crawls will be stopped once the quota is exceeded.
An execution minutes meter bar is also added in the Org Dashboard and
displayed if a quota is set. More detail in #1305 which was
merged into this branch.
## Changes
- Enable setting 'maxExecMinutesPerMonth' in orgs list quotas by superadmin
- Enforce quota by stopping crawls in operator once quota is reached
- Show alert banner once execution time quota is hit:
- Once quota is hit, disable Run Crawl buttons in frontend, return 403
message with `exec_minutes_quota_reached` detail in backend from
crawl config `/run` endpoint, and don't run new workflows on creation
(similar to storage quota)
- Display execution time for crawls in the crawl details overview,
immediately below
- Show execution minutes meter on dashboard (from #1305)
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Partially resolves#1223, fixes#1298
- Adds crawl usage table in dashboard under metrics
- Shows skeleton loading indicator when metrics are loading (@Shrinks99
feel free to adjust how this looks)
- Shows max number of concurrent crawls running if any are running ("`running` / `max` Crawls Running")
- Allows editing of org slugs (actual URL updates will be handled in
https://github.com/webrecorder/browsertrix-cloud/issues/1258.)
- Converts user input to slug using slugify
- Adds help text to org name and slug
- Renames tab from "information" to "general" settings
- If set, and any of the seeds fails, the entire crawl is marked as a failure.
- Add checkbox which adds --failOnFailedSeed checkbox to URL list workflows
- Add 'Fail Crawl On Failed URL' to crawl workflow setup docs
- Remove config.seeds from workflow and crawl detail endpoints
- Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow
- Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before)
- Modify frontend to fetch seeds from new /seeds endpoints with loading indicator
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Remove config from list endpoints
- Remove config field from workflow and crawl list endpoints
- Add seedCount to CrawlConfigOut on backend and Workflow on frontend
- Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional
- Refactor workflow list in frontend to use firstSeed and seedCount
- Frontend uses ListWorkflow type which is Omit<Workflow, "config">
* use metacontroller's decoratorcontroller to create CrawlJob from Job
* scheduled job work:
- use existing job name for scheduled crawljob
- use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done
- simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls,
placeholder job image for cronjobs
* move storage quota check to crawljob handler:
- add 'skipped_quota_reached' as new failed status type
- check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created)
* frontend:
- show all crawls in crawl workflow, no need to filter by status
- add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed
* migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template
* Implement in backend
- Track bytesStored in org
- Add migration to pre-calculate based on size of crawlfiles and profilefiles
- Add methods to increase or decrease org storage when crawl or profile files
are added or deleted
- Include storageQuotaReached boolean in API responses that alter storage
- Don't start new crawls and fail uploads if storage quota reached
* Implement in frontend
- Add to orgs-list quotas
- Update org's storageQuotaReached based on backend endpoint responses
- Disable buttons when storage quota is met
- Show toast notification when attempting to run a crawl when org
storage quota is met
- Lists collections that an archived item belongs to in item detail view
- Improves performance of collection add component
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Backend:
- add 'maxCrawlSize' to models and crawljob spec
- add 'MAX_CRAWL_SIZE' to configmap
- add maxCrawlSize to new crawlconfig + update APIs
- operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize
- tests: add max crawl size tests
Frontend:
- Add Max Crawl Size text box Limits tab
- Users enter max crawl size in GB, convert to bytes
- Add BYTES_PER_GB as constant for converting to bytes
- docs: Crawl Size Limit to user guide workflow setup section
Operator Refactor:
- use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing later has no effect in operator
- add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached
- crawlerjob status: store byte size under 'size', human readable size under 'sizeHuman' for clarity
- size stat always exists so remove unneeded conditional (defaults to 0)
- store raw byte size in 'size', human readable size in 'sizeHuman'
Charts:
- subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping
- subchart: show 'sizeHuman' property instead of 'size'
- bump subchart version to 0.1.1
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Paginates Crawl Workflows when there are more than 10 workflows
- Refactors workflow search and crawl search to use the same component
- Adds sort by first seed, workflow creation date, and workflow modified date
- Separates "last run" date from "modified" date
- Update column layout into Name & Schedule (or Manual Ru'ri=), Latest Crawl (<finish time> in <duration>), total size, and last modified (modified by and modified time)
* collections: support toggling collections public/private, viewable via RWP
- backend: add 'public' to collection model, support patching to update
- backend: add .../collections/<id>/public/replay.json for public access
- backend: add CORS handling for public endpoint
- frontend: support 'make shareable / make private' dropdown actions on collection detail + collection list views
- frontend: show shareable / private icons by collection name on detail + list views
- frontend: link to replayweb.page for standalone browsing
- frontend: add embed code popup when a collection is shareable
- refer to public collections as 'shareable' for now
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
operator: ensure transitions from each of these states is supported, including to 'waiting_capacity'
add extra check on stopping to avoid transitioning back to a running state after crawl is finished
ui: add states to UI display, localization, add as active states
fixes#263
* support streaming download of collections (part of #927)
- WACZ zip created on the fly using stream-zip
- add 'Download Collection' option to collection detail and list
- after editing collection, return to collection view
- tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used
---------
Co-authored-by: sua yoo <sua@suayoo.com>
- Adds top-level "Archived Data" view, replacing "Finished Crawls" and moving it as "Crawls" into view
- Adds list for viewing all artifacts/data
- Adds list for viewing all uploaded crawls
- Updates crawl detail view to show upload details
- Edit upload metadata, including 'name'
- Delete uploads
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- Adds collections search and list to workflow editor
- Adds collections to workflow details component
- Adds namePrefix filter to backend GET /orgs/{oid}/collections endpoint to support case-insensitive searching of collections
- Adds documentation for new setting
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- Support for creating new collections and editing existing collections
- Can select crawling workflows which adds entire workflow, and then deselect individual crawls
- Can edit existing collections and add more crawls
- Can view, create and delete collections via new Collections top-level nav entry
concurrent crawl limits: (addresses #866)
- support limits on concurrent crawls that can be run within a single org
- change 'waiting' state to 'waiting_org_limit' for concurrent crawl limit and 'waiting_capacity' for capacity-based
limits
orgs:
- add 'maxConcurrentCrawl' to new 'quotas' object on orgs
- add /quotas endpoint for updating quotas object
operator:
- add all crawljobs as related, appear to be returned in creation order
- operator: if concurrent crawl limit set, ensures current job is in the first N set of crawljobs (as provided via 'related' list of crawljob objects) before it can proceed to 'starting', otherwise set to 'waiting_org_limit'
- api: add org /quotas endpoint for configuring quotas
- remove 'new' state, always start with 'starting'
- crawljob: add 'oid' to crawljob spec and label for easier querying
- more stringent state transitions: add allowed_from to set_state()
- ensure state transitions only happened from allowed states, while failed/canceled can happen from any state
- ensure finished and state synched from db if transition not allowed
- add crawl indices by oid and cid
frontend:
- show different waiting states on frontend: 'Waiting (Crawl Limit) and 'Waiting (At Capacity)'
- add gear icon on orgs admin page
- and initial popup for setting org quotas, showing all properties from org 'quotas' object
tests:
- add concurrent crawl limit nightly tests
- fix state waiting -> waiting_capacity
- ci: add logging of operator output on test failure
* Precompute config crawl stats
Includes a database migration to move preciously dynamically computed
crawl stats for workflows into the CrawlConfig model.
* Add lastRun sorting option and enable it by default
* Add modified as final sort key to order non-run workflows
* Remove currCrawl* fields and update frontend accordingly
* Add isCrawlRunning field to backend and use in frontend
* frontend crawl stopping improvements (#836)
- support new backend 'stopping' property
- for now, keep 'stopping' indicator state when crawl is running but stopping set to true
* operator: add waiting state
- add pods as related objects
- inspect pod status, set crawl status to 'waiting' if no pods are running
frontend:
- frontend support for 'waiting' state
- show waiting icon from mocks
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- Adds an input to the Workflow creation and edit form for specifying crawl depth. This input is conditionally shown for seeded crawls when the scope is set to "Pages on this domain", "Pages on this domain & subdomains" or "Custom page prefix". The "any" scope is also supported for backwards compatibility but is not shown by default or in new configs.
- API implementation: The depth value is set in the primary seed config, i.e. the first seed in seeds: [], not in the outer .config.depth property.
* Re-implement pagination and paginate crawlconfig revs
First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.
* Migrate all HttpUrl seeds to Seeds
This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.
* Filter and sort crawls and workflows
Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend
Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime
* Add crawlconfigs search-values API endpoint and test
* backend: make crawlconfigs mutable! (#656)
- crawlconfig PATCH /{id} can now receive a new JSON config to replace the old one (in addition to scale, schedule, tags)
- exclusions: add / remove APIs mutate the current crawlconfig, do not result in a new crawlconfig created
- exclusions: ensure crawl job 'config' is updated when exclusions are added/removed, unify add/remove exclusions on crawl
- k8s: crawlconfig json is updated along with scale
- k8s: stateful set is restarted by updating annotation, instead of changing template
- crawl object: now has 'config', as well as 'profileid', 'schedule', 'crawlTimeout', 'jobType' properties to ensure anything that is changeable is stored on the crawl
- crawlconfigcore: store share properties between crawl and crawlconfig in new crawlconfigcore (includes 'schedule', 'jobType', 'config', 'profileid', 'schedule', 'crawlTimeout', 'tags', 'oid')
- crawlconfig object: remove 'oldId', 'newId', disallow deactivating/deleting while crawl is running
- rename 'userid' -> 'createdBy'
- remove unused 'completions' field
- add missing return to fix /run response
- crawlout: ensure 'profileName' is resolved on CrawlOut from profileid
- crawlout: return 'name' instead of 'configName' for consistent response
- update: 'modified', 'modifiedBy' fields to set modification date and user modifying config
- update: ensure PROFILE_FILENAME is updated in configmap is profileid provided, clear if profileid==""
- update: return 'settings_changed' and 'metadata_changed' if either crawl settings or metadata changed
- tests: update tests to check settings_changed/metadata_changed return values
add revision tracking to crawlconfig:
- store each revision separate mongo db collection
- revisions accessible via /crawlconfigs/{cid}/revs
- store 'rev' int in crawlconfig and in crawljob
- only add revision history if crawl config changed
migration:
- update to db v3
- copy fields from crawlconfig -> crawl
- rename userid -> createdBy
- copy userid -> modifiedBy, created -> modified
- skip invalid crawls (missing config), make createdBy optional (just in case)
frontend: Update crawl config keys with new API (#681), update frontend to use new PATCH endpoint, load config from crawl object in details view
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
* Paginate API list endpoints
fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.
* Increase page size via overriden Params and Page classes
* update api resource list keys
---------
Co-authored-by: sua yoo <sua@suayoo.com>