* Remove config from list endpoints
- Remove config field from workflow and crawl list endpoints
- Add seedCount to CrawlConfigOut on backend and Workflow on frontend
- Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional
- Refactor workflow list in frontend to use firstSeed and seedCount
- Frontend uses ListWorkflow type which is Omit<Workflow, "config">
- Moves metadata tab to first position
- Adds save button to each section, stays in edit view on saving
- Validates name exists before moving to next section or saving
- Changes save button text to "Create Collection without Items" if crawl/uploads aren't selected in new collection
- Fix server error not showing in UI
* update text
* remove trailing slash removal
* make scope help text responsive as user types
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* use metacontroller's decoratorcontroller to create CrawlJob from Job
* scheduled job work:
- use existing job name for scheduled crawljob
- use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done
- simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls,
placeholder job image for cronjobs
* move storage quota check to crawljob handler:
- add 'skipped_quota_reached' as new failed status type
- check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created)
* frontend:
- show all crawls in crawl workflow, no need to filter by status
- add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed
* migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template
- More specific toast notification error messages to the action being attempted
- Single dismissable global banner shown when org storage is reached
- Removed check for storage quota reached in `runNow`, since buttons are disabled in UI, and errors handled if request fails.
- Allow creating new workflow when storage quota reached
- More responsive storage quota updates: add storageQuotaReached to archived item replay.json, updates w/o reload when crawl pushes quota over limit
- Modify LiteElement to check for storageQuotaReached on GET requests
---------
Co-authored-by: sua yoo <sua@suayoo.com>
- Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state.
- Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion.
- Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl
- Create priority classes upto 'max_crawl_scale, configured in values.yaml
- Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale,
graceful stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state
before adjust status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted.
- Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds
- Configurable Redis storage with 'redis_storage' value, set to 3Gi by default
- CrawlJob deletion starts as soon as post-finish crawl operations are run
- Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer
- Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished)
- Current resource usage added to status
- Profile browser: also manage single pod directly without statefulset for consistency.
- Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0)
- Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0)
- Failed crawl logging: dd 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status
- tests: check other finished states to avoid stuck in infinite loop if crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
fixes#1147
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Implement in backend
- Track bytesStored in org
- Add migration to pre-calculate based on size of crawlfiles and profilefiles
- Add methods to increase or decrease org storage when crawl or profile files
are added or deleted
- Include storageQuotaReached boolean in API responses that alter storage
- Don't start new crawls and fail uploads if storage quota reached
* Implement in frontend
- Add to orgs-list quotas
- Update org's storageQuotaReached based on backend endpoint responses
- Disable buttons when storage quota is met
- Show toast notification when attempting to run a crawl when org
storage quota is met
* supports overriding the replayweb.page version without having to be rebuild frontend image:
- ensures 'rwp_base_url' from helm chart is passed to nginx
- ensures both ui.js and sw.js are loaded based on nginx environment variable, not hard-coded
- ui.js loaded via redirect from new /replay/ui.js path
- pin RWP to known working release in default values.yaml
- remove RWP_BASE_URL from Dockerfile, no longer needed, set via chart env var
- set default RWP_BASE_URL for devserver to use CDN
- set RWP version to 1.8.11
- Lists collections that an archived item belongs to in item detail view
- Improves performance of collection add component
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Backend:
- add 'maxCrawlSize' to models and crawljob spec
- add 'MAX_CRAWL_SIZE' to configmap
- add maxCrawlSize to new crawlconfig + update APIs
- operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize
- tests: add max crawl size tests
Frontend:
- Add Max Crawl Size text box Limits tab
- Users enter max crawl size in GB, convert to bytes
- Add BYTES_PER_GB as constant for converting to bytes
- docs: Crawl Size Limit to user guide workflow setup section
Operator Refactor:
- use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing later has no effect in operator
- add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached
- crawlerjob status: store byte size under 'size', human readable size under 'sizeHuman' for clarity
- size stat always exists so remove unneeded conditional (defaults to 0)
- store raw byte size in 'size', human readable size in 'sizeHuman'
Charts:
- subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping
- subchart: show 'sizeHuman' property instead of 'size'
- bump subchart version to 0.1.1
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* fix latest crawl (lastRun) sort:
- don't cast 'started' value to string when setting as starting crawl time (regression from #937)
- caused incorrect sorting as finished crawl time was a datetime, while starting crawl time was a string
- move updated config crawl info in one place, simplify to avoid returning started time altogether, just set directly
- pass mdb crawlconfigs and crawls collections directly to add_new_crawl() function
- fixes#1108
* Add dropdown menu containing 'Remove from Collection' to archived items in collection view (#1110)
- Enables users to remove an item from a collection from the collection detail view - menu was previously missing
- Fixes: #1102 (missing dropdown menu) by making use of the inactive menu trigger button.
- Updates collection items page size to match "Archived Items" page size (20 items per page)
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* refactor to use shared role-based service shared across pods:
- 'crawler' service for all crawler screencasting, scales 0 .. N with crawler-<ID>-N.crawl
- 'redis' service for all redis access, redis-<ID>-0.redis
- 'browser' service for all browser access (profile browsers), browser-<ID>-0.browser
- don't create a new service per crawl/profile at all
- enable 'publishNotReadyAddresses' for potentially faster resolving, esp for redis
- remove service as type managed by operator as no longer creating services dynamically
- remove frontend var CRAWLER_SVC_SUFFIX, suffix always '.crawler' to match crawler service name
- Paginates Crawl Workflows when there are more than 10 workflows
- Refactors workflow search and crawl search to use the same component
- Adds sort by first seed, workflow creation date, and workflow modified date
- Separates "last run" date from "modified" date
- Update column layout into Name & Schedule (or Manual Ru'ri=), Latest Crawl (<finish time> in <duration>), total size, and last modified (modified by and modified time)
Shows crawl error log details in a dialog. Since the detail object does not always follow a specific format, this iteration uses the detail key in uppercase as the label.
- Clarifies use case for frontend development server
- Fixes incorrect sample API URLs
- Adds additional detail around requirements and quickstart
- Links back to docs from frontend README
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- Changes to URLs in "Crawling", "All Archived Items", and "Collections":
- Rename Artifacts -> Items
- Unifies view crawl view as loaded from All Archived Items and from Workflows
- Includes redirect for /artifacts/uploads -> /items/uploads to support archiveweb.page usage
- Upgrades webpack and webpack-dev-server for bugfixes and performance updates
- Removes unnecessary file watching
- Enables persistent build cache in dev
- Switches to faster dev source map
* terminology tweaks in frontend: (part of #922)
- use 'crawl workflow' instead of 'workflow' where possible
- use 'replay' instead of 'replay crawl'
- localization: rerun string extraction / processing
- "Review Config" → "Review Settings"
- "Workflow" → "Crawl Workflow" in error message
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Frontend:
- Renames list view to "All Archived Items"
- Refactors fetches to use single all-crawls endpoints
- Removes search by config ID for more search parity with uploads
- Adds sort by size
- Refactors property and method names to replace crawl*
- Replaces remaining references to "crawl" in copy with "item"'
- Rename Upload Archive button to Upload WACZ
- Fix focusout in item menu so menus close
Backend:
- Filter search values by type as well
- Only get list of cids for crawls in search values
- Don't list crawl/workflow ids in search values
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Always run yarn only on build platform with --platform=$BUILDPLATFORM
* Remove optional dependencies (playwright + chromium) from build with --ignore-optional and move some devDependencies to be optional
* Disable husky pre-commit hook checks on frontend
Co-authored-by: sua yoo <sua@suayoo.com>
- Always shows primary "Share" action button in Collection detail page.
- Enables toggling shareable status and share info from dialog. Difference from mockups: I made the "Done" button neutral do differentiate from our submit action buttons in the dialog, since toggling will apply changes immediately.
- Menu item: "Go to Public View"/"Go to Shareable View" -> "Visit Shareable URL".
- Toggle label: "Make Collection Shareable" -> "Collection is Shareable".
- Additional dialog copy: adds "This collection can be viewed by anyone with the link." under "Link to Share" and "Share this collection by embedding it into an existing webpage." under "Embed Collection".
- Moves share status icon to its own column in list view.
- Adds new syntax-highlighted code component that supports js and html.
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
* collections: support toggling collections public/private, viewable via RWP
- backend: add 'public' to collection model, support patching to update
- backend: add .../collections/<id>/public/replay.json for public access
- backend: add CORS handling for public endpoint
- frontend: support 'make shareable / make private' dropdown actions on collection detail + collection list views
- frontend: show shareable / private icons by collection name on detail + list views
- frontend: link to replayweb.page for standalone browsing
- frontend: add embed code popup when a collection is shareable
- refer to public collections as 'shareable' for now
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- Adds Collection info bar to detail view
- Update "Web Captures" -> "Archived Items"
- Updates Collection list columns to match
- Refactors `btrix-desc-list` and usage in `workflow-details` to reuse horizontal info bar component