Fixes#1648
- Tracks failed QA runs in database, not only successful ones
- Includes failed QA runs in list endpoint by default
- Adds `skipFailed` param to list endpoint to return only successful
runs
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#1617
Filters added:
- reviewed: filter by page has approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts list of strings,
comma-separated, each of which are coerced into True, False, or None, or
ignored if they are invalid values)
- hasNotes: filter by has at least one note (true) or not (false)
Tests have also been added to ensure that results are as expected.
Supports running QA Runs via the QA API!
Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498
Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)
Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.
- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch)
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1597
New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.
After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.
Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.
StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
Allow maximum scale option to be fully configurable via
`max_crawl_scale`. Already configurable on the backend, and now exposed
to the frontend via API `/api/settings` `maxCrawlScale` value.
The workflow editor and workflow details are updated to allow selecting
the scale up to the maxCrawlScale setting (which defaults to 3 if not
set).
Fixes#1539
Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl
update API endpoint. Acceptable values are "good", "acceptable" or
"failure", enforced by an Enum.
Added to `BaseCrawl` so that we can extend support to uploads more
easily later on, but for now we'll only display this for crawls in the
frontend.
Fixes#1502
- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted
Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.
Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#1385
## Changes
Supports multiple crawler 'channels' which can be configured to
different browsertrix-crawler versions
- Replaces `crawler_image` in helm chart with `crawler_channels` array
similar to how storages are handled
- The `default` crawler channel must always be provided and specifies
the default crawler image
- Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint
to fetch information about available crawler versions (name, image, and
label) and test
- Adds crawler channel select to workflow creation/edit screens and
profile creation dialog, and updates related API endpoints and
configmaps accordingly. The select dropdown is shown only if more than
one channel is configured.
- Adds `crawlerChannel` to workflow and crawl details.
- Add `image` to crawler image, used to display actual image used as
part of the crawl.
- Modifies `crawler_crawl_id` backend test fixture to use `test` crawler
version to ensure crawler versions other than latest work
- Adds migration to add `crawlerChannel` set to `default` to existing
workflow and profile objects and workflow configmaps
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Fixes#1158
Introduces two new API endpoints that stream crawling statistics CSVs
(with a suggested attachment filename header):
- `GET /api/orgs/all/crawls/stats` - crawls from all orgs (superuser
only)
- `GET /api/orgs/{oid}/crawls/stats` - crawls from just one org
(available to org crawler/admin users as well as superusers)
Also includes tests for both endpoints.
- Adds two new crawl finished state, stopped_by_user and
stopped_quota_reached
- Tracking other possible 'stop reasons' in operator, though not making
them distinct states for now.
- Updated frontend with 'Stopped by User' and 'Stopped: Time Quota
Reached', shown with same icon as current partial_complete
- Added migration of partial_complete to either stopped_by_user or
complete (no historical quota data available)
- Addresses edge case in scaling: if crawl never scaled (no redis entry,
no pod), automatically scale down
- Edge case in status: if crawl is somehow 'canceled' but not deleted,
immediately delete crawl object and begin finalizing.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1307Fixes#1132
Related to #1306
Deleted webhook notifications include the org id and item/collection id.
This PR also includes API docs for the new webhooks and extends the
existing tests to account for the new webhooks.
This PR also does some additional cleanup for existing webhooks:
- Remove `downloadUrls` from item finished webhook bodies
- Rename collection webhook body `downloadUrls` to `downloadUrl`, since
we only ever have one per collection
- Fix API docs for existing webhooks, one of which had the wrong
response body
Fixes#1328
- Adds /retry endpoint for retrying failed jobs.
- Returns 400 error if previous job still running or has succeeded
- Keeps track of previous failed attempts in previousAttempts array on failed job.
- Also amends the similar webhook /retry endpoint to use `POST` for consistency.
- Remove duplicate api tag for backgroundjobs
- Refactors storage to support replicas + custom storages on the Org.
- There is a default primary + replica storage, while an Org can also have
primary and replica storages.
- StorageRef object is used to store references to default and custom
storage.
- CrawlFile has been updated to contain a StorageRef instead of a
def_storage_name, which references
either a default storage (in StorageOps) or custom storage (in
Organization)
- There is also a 'replicas' Optional[List[StorageRef]] which contains
replicas, if any.
- CrawlFileOut contain a numReplicas for how many replicas exist for
a given file.
- Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile)
Part of #1262
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1306
- Include full `resources` with expireAt (as string) in crawlFinished
and uploadFinished webhook notifications rather than using the
`downloadUrls` field (this is retained for collections).
- Set default presigned duration to one minute short of 1 week and enforce
maximum supported by S3
- Add 'storage_presign_duration_minutes' commented out to helm values.yaml
- Update tests
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#1270
After 5 consecutive failed logins from the same user, we now prevent the
user from logging in even with the correct password until they reset it
via their email, or wait an hour.
- After failure threshold is reached, all further login attempts are rejected
- Attempts for invalid email addresses are also tracked
- On 6th try, a reset password email is automatically sent, only once
- Failed login counter resets after an hour of no further logins after last attempted login.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- avoid exception if 'errors' (or 'files' keys) don't exist (part of
#1297)
- ensure 'errors' list always set on output model for consistency,
defaulting to empty list
- fix tests for 'errors' being an empty empty list
follow-up to #1300 (merging 1.7.1 release into main)
Fixes#1050
Major refactor of the user/auth system to remove fastapi_users
dependency. Refactors users.py to be standalone
and adds new auth.py module for handling auth. UserManager now works
similar to other ops classes.
The auth should be fully backwards compatible with fastapi_users auth,
including accepting previous JWT tokens w/o having to re-login. The User
data model in mongodb is also unchanged.
Additional fixes:
- allows updating fastapi to latest
- add webhook docs to openapi (follow up to #1041)
API changes:
- Removing the`GET, PATCH, DELETE /users/<id>` endpoints, which were not
in used before, as users are scoped to orgs. For deletion, probably
auto-delete when user is removed from last org (to be implemented).
- Rename `/users/me-with-orgs` is renamed to just `/users/me/`
- New `PUT /users/me/change-password` endpoint with password required to update password, fixes #1269, supersedes #1272
Frontend changes:
- Fixes from #1272 to support new change password endpoint.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
Fixes#1278
- Adds `GET /orgs/slug-lookup` endpoint returning `{id: slug}` for all
orgs
- Restricts new endpoint and existing `GET /orgs/slugs` to superadmins
- Add slug field with uniqueness constraint to Organization
- Use python-slugify to generate slug from name and import that in migration
- Require name in all /rename and org creation requests
- Auto-generate slug for new org with no slug or when /rename is called w/o a slug
- Auto-generate slug for 'default-org' based on name
- Add /api/orgs/slugs GET endpoint to return all slugs in use
- tests: extend backend test-requirements.txt from requirements to allow testing slugify
- tests: move get_redis_crawl_stats() to avoid extra dependency in utils
- Require that all passwords are between 8 and 64 characters
- Fixes account settings password reset form to only trigger
logged-in event after successful password change.
- Password validation can be extended within the UserManager's
validate_password method to add or modify requirements.
- Add tests for password validation
- Applies user permissions check before deleting anything in all /delete endpoints
- Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete
- Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete
- Updates frontend notification message to Only org owners can delete other users' archived items. when a crawler user attempts to delete another users' archived items
- Remove config.seeds from workflow and crawl detail endpoints
- Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow
- Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before)
- Modify frontend to fetch seeds from new /seeds endpoints with loading indicator
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* store usernames (createdByName, modifiedByName, startedByName) in db for workflows
* store userName for userid for crawls in db
* update output models to return usernames
* add migration 0018 to add usernames to existing crawls and crawlconfigs
* updated tests for crawl and config usernames
* use async for to iterate over crawls and crawlconfigs
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* Add bytes stored per type to org and metrics
The org now tracks bytesStored by type of crawl, uploads, and browser profiles
in addition to the total, and returns these values in the org metrics endpoint.
A migration is added to precompute these values in existing deployments.
In addition, all /metrics storage values are now returned solely as bytes, as
the GB form wasn't being used in the frontend and is unnecessary.
* Improve deletion of multiple archived item types via `/all-crawls` delete endpoint
- Update `/all-crawls` delete test to check that org and workflow size values
are correct following deletion.
- Fix bug where it was always assumed only one crawl was deleted per cid
and size was not tracked per cid
- Add type check within delete_crawls
* Initial implementation of org metrics
(This can eventually be sped up significantly by precomputing the
values and storing them in the db.)
* Rename storageQuota to storageQuotaBytes to be consistent
* Update tests to include metrics
* Remove config from list endpoints
- Remove config field from workflow and crawl list endpoints
- Add seedCount to CrawlConfigOut on backend and Workflow on frontend
- Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional
- Refactor workflow list in frontend to use firstSeed and seedCount
- Frontend uses ListWorkflow type which is Omit<Workflow, "config">
* Add success filter to webhook list GET endpoint
* Add sorting to webhooks list API and add event filter
* Test webhooks via echo server
* Set address to echo server on host from CI env var for k3d and microk8s
* Add -s back to pytest command for k3d ci
* Change pytest test path to avoid hanging on collecting tests
* Revert microk8s to only run on push to main
- Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state.
- Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion.
- Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl
- Create priority classes upto 'max_crawl_scale, configured in values.yaml
- Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale,
graceful stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state
before adjust status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted.
- Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds
- Configurable Redis storage with 'redis_storage' value, set to 3Gi by default
- CrawlJob deletion starts as soon as post-finish crawl operations are run
- Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer
- Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished)
- Current resource usage added to status
- Profile browser: also manage single pod directly without statefulset for consistency.
- Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0)
- Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0)
- Failed crawl logging: dd 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status
- tests: check other finished states to avoid stuck in infinite loop if crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
fixes#1147
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Implement in backend
- Track bytesStored in org
- Add migration to pre-calculate based on size of crawlfiles and profilefiles
- Add methods to increase or decrease org storage when crawl or profile files
are added or deleted
- Include storageQuotaReached boolean in API responses that alter storage
- Don't start new crawls and fail uploads if storage quota reached
* Implement in frontend
- Add to orgs-list quotas
- Update org's storageQuotaReached based on backend endpoint responses
- Disable buttons when storage quota is met
- Show toast notification when attempting to run a crawl when org
storage quota is met
* log only if 'log_failed_crawl_lines' value is set to number of last lines to log
from failed container
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Initial set of backend API for event webhook notifications for the following events:
* Crawl started (including boolean indicating if crawl was scheduled)
* Crawl finished
* Upload finished
* Archived item added to collection
* Archived item removed from collection
Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and then attempted to be sent (up to a total of 5 tries per overall attempt, with an increasing backoff between, implemented via use of the backoff library, which supports async).
webhook status available via /api/orgs/<oid>/webhooks
(Additional testing + potential fastapi integration left in separate follow-ups
Fixes#1041
* Add support for collectionIds to patch endpoints
* Make update available via all-crawls/ and add test
* Fix tests
* Always remove collectionIds from udpate
* Remove unnecessary fallback
* One more pass on expected values before update
Backend:
- add 'maxCrawlSize' to models and crawljob spec
- add 'MAX_CRAWL_SIZE' to configmap
- add maxCrawlSize to new crawlconfig + update APIs
- operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize
- tests: add max crawl size tests
Frontend:
- Add Max Crawl Size text box Limits tab
- Users enter max crawl size in GB, convert to bytes
- Add BYTES_PER_GB as constant for converting to bytes
- docs: Crawl Size Limit to user guide workflow setup section
Operator Refactor:
- use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing later has no effect in operator
- add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached
- crawlerjob status: store byte size under 'size', human readable size under 'sizeHuman' for clarity
- size stat always exists so remove unneeded conditional (defaults to 0)
- store raw byte size in 'size', human readable size in 'sizeHuman'
Charts:
- subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping
- subchart: show 'sizeHuman' property instead of 'size'
- bump subchart version to 0.1.1
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Frontend:
- Renames list view to "All Archived Items"
- Refactors fetches to use single all-crawls endpoints
- Removes search by config ID for more search parity with uploads
- Adds sort by size
- Refactors property and method names to replace crawl*
- Replaces remaining references to "crawl" in copy with "item"'
- Rename Upload Archive button to Upload WACZ
- Fix focusout in item menu so menus close
Backend:
- Filter search values by type as well
- Only get list of cids for crawls in search values
- Don't list crawl/workflow ids in search values
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in
- Uploads list endpoint now includes all all-crawls filters relevant to uploads
- An all-crawls/search-values endpoint is added to support searching across all archived item types
- Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI)
- Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment
- New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass
- Tests coverage added for all new all-crawls endpoints, filters, and sort values
* support streaming download of collections (part of #927)
- WACZ zip created on the fly using stream-zip
- add 'Download Collection' option to collection detail and list
- after editing collection, return to collection view
- tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used
---------
Co-authored-by: sua yoo <sua@suayoo.com>
* Move all pydantic models to models.py to avoid circular dependencies
* Include automated crawl details in all-crawls GET endpoints
- ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated
crawls only:
- cid
- name
- description
- firstSeed
- seedCount
- profileName
* Add automated crawl fields to list all-crawls test
* Uncomment mongo readinessProbe
* cleanup CrawlOutWithResources:
- remove 'files' from output model, only resources should be returned
- add _files_to_resources() to simplify computing presigned 'resources' from raw 'files'
- update upload tests to be more consistent, 'files' never present, 'errors' always none
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* tests: add sleep() between all looping get_crawl() calls to avoid tight request loop, also remove unneeded loop
will likely fix occasional '504 timeout' test failures where frontend is overwhelmed with /replay.json requests
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove
* uploads api: (part of #929)
- new UploadCrawl object which extends BaseCrawl, has name and description
- support multipart form data data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to support for replay
- add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')
* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects / missing type are set to 'type: crawl'
- indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state
* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'
* collections: support adding and remove both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads
bump version to 1.6.0-beta.2
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Make API add and update method returns consistent
- Updates return {"updated": True}
- Adds return {"added": True}
- Both can additionally have other fields as needed, e.g. id or name
- remove Profile response model, as returning added / id only
- reformat
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Adds collections search and list to workflow editor
- Adds collections to workflow details component
- Adds namePrefix filter to backend GET /orgs/{oid}/collections endpoint to support case-insensitive searching of collections
- Adds documentation for new setting
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
This fixes#917, where crawls added to a collection via the workflow
autoAddCollections were not successfully represented in the crawl
and page count stats in the collection after completing.
* crawls list: unset errors in crawls list response to avoid very large responses #872
* Remove errors from crawl replay.json
* Add tests to ensure errors are excluded from crawl GET endpoints
* Update tests to accept None for errors
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
concurrent crawl limits: (addresses #866)
- support limits on concurrent crawls that can be run within a single org
- change 'waiting' state to 'waiting_org_limit' for concurrent crawl limit and 'waiting_capacity' for capacity-based
limits
orgs:
- add 'maxConcurrentCrawl' to new 'quotas' object on orgs
- add /quotas endpoint for updating quotas object
operator:
- add all crawljobs as related, appear to be returned in creation order
- operator: if concurrent crawl limit set, ensures current job is in the first N set of crawljobs (as provided via 'related' list of crawljob objects) before it can proceed to 'starting', otherwise set to 'waiting_org_limit'
- api: add org /quotas endpoint for configuring quotas
- remove 'new' state, always start with 'starting'
- crawljob: add 'oid' to crawljob spec and label for easier querying
- more stringent state transitions: add allowed_from to set_state()
- ensure state transitions only happened from allowed states, while failed/canceled can happen from any state
- ensure finished and state synched from db if transition not allowed
- add crawl indices by oid and cid
frontend:
- show different waiting states on frontend: 'Waiting (Crawl Limit) and 'Waiting (At Capacity)'
- add gear icon on orgs admin page
- and initial popup for setting org quotas, showing all properties from org 'quotas' object
tests:
- add concurrent crawl limit nightly tests
- fix state waiting -> waiting_capacity
- ci: add logging of operator output on test failure
* optimizations:
- rename update_crawl_config_stats to stats_recompute_all, only used in migration to fetch all crawls
and do a full recompute of all file sizes
- add stats_recompute_last to only get last crawl by size, increment total size by specified amount, and incr/decr number of crawls
- Update migration 0007 to use stats_recompute_all
- Add isCrawlRunning, lastCrawlStopping, and lastRun to
stats_recompute_last
- Increment crawlSuccessfulCount in stats_recompute_last
* operator/crawls:
- operator: keep track of filesAddedSize in redis as well
- rename update_crawl to update_crawl_state_if_changed() and only update
if state is different, otherwise return false
- ensure mark_finished() operations only occur if crawl is state has changed
- don't clear 'stopping' flag, can track if crawl was stopped
- state always starts with "starting", don't reset to starting
tests:
- Add test for incremental workflow stats updating
- don't clear stopping==true, indicates crawl was manually stopped
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* Track collections in Crawl rather than crawls in Collection
* Add delete collection API endpoint and tests
* Precompute collection crawlCount, pageCount, and tags and add them to
GET collection responses
* Add modified field to Collection
* Update collection replay.json method
* Make add and remove crawls accept list of crawl ids
* Auto-add new workflow crawls to collections when they successfully
complete via CrawlConfig.autoAddCollections field
* Move long-running post-crawl operator tasks into asyncio task
* Make CrawlConfig.autoAddCollections updatable via /update API endpoint
* tests:
- fix cancel crawl test by ensuring state is not running or waiting
- fix stop crawl test by ensuring stop is only initiated after at least one page has been crawled,
otherwise result may be failed, as no crawl data has been crawled yet (separate fix in crawler to avoid loop if stopped
before any data written webrecorder/browsertrix-crawler#314)
- bump page limit to 4 for tests to ensure crawl is partially complete, not fully complete when stopping
- allow canceled or partial_complete due to race condition
* chart: bump frontend limits in default, not just for tests (addresses #780)
* crawl stop before starting:
- if crawl stopped before it started, mark as canceled
- add test for stopping immediately, which should result in 'canceled' crawl
- attempt to increase resync interval for immediate failure
- nightly tests: increase page limit to test timeout
* backend:
- detect stopped-before-start crawl as 'failed' instead of 'done'
- stats: return stats counters as int instead of string
* Precompute config crawl stats
Includes a database migration to move preciously dynamically computed
crawl stats for workflows into the CrawlConfig model.
* Add lastRun sorting option and enable it by default
* Add modified as final sort key to order non-run workflows
* Remove currCrawl* fields and update frontend accordingly
* Add isCrawlRunning field to backend and use in frontend
* Sort by name and description (ascending by default)
* Filter by name
* Add endpoint to fetch collection names for search
* Add collation so that utf-8 chars sort as expected
* operator: add waiting state
- add pods as related objects
- inspect pod status, set crawl status to 'waiting' if no pods are running
frontend:
- frontend support for 'waiting' state
- show waiting icon from mocks
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
* crawlconfig: fix default filename template, make configurable
- make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml
- set to '@ts-@hostsuffix.wacz' by default
- allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap
- tests: add test for custom 'default_crawl_filename_template'
* stopping fix: backend fixes for #836
- sets 'stopping' field on crawl when crawl is being stopped (both via db and on k8s object)
- k8s: show 'stopping' as part of crawljob object, update subchart
- set 'currCrawlStopping' on workflow
- support old and new browsertrix-crawler stopping keys
- tests: add tests for new stopping state, also test canceling crawl (disable test for stopping crawl, currently failing)
- catch redis error when getting stats
operator: additional optimizations:
- run pvc removal as background task
- catch any exceptions in finalizer stage (eg. if db is down), return false until finalizer completes
- just pass cid from operator for consistency, don't load crawl from update_crawl (different object)
- don't throw in update_config_crawl_stats() to avoid exception in operator, only throw in crawlconfigs api
* Precompute config crawl stats
* Includes a database migration to move preciously dynamically computed crawl stats for workflows into the CrawlConfig model.
* Add crawls.finished descending index
* Add last crawl fields to workflow tests
* Btrixjobs Operator - Phase 1 (#679)
- add metacontroller and custom crds
- add main_op entrypoint for operator
* Btrix Operator Crawl Management (#767)
* operator backend:
- run operator api in separate container but in same pod, with WEB_CONCURRENCY=1
- operator creates statefulsets and services for CrawlJob and ProfileJob
- operator: use service hook endpoint, set port in values.yaml
* crawls working with CrawlJob
- jobs start with 'crawljob-' prefix
- update status to reflect current crawl state
- set sync time to 10 seconds by default, overridable with 'operator_resync_seconds'
- mark crawl as running, failed, complete when finished
- store finished status when crawl is complete
- support updating scale, forcing rollover, stop via patching CrawlJob
- support cancel via deletion
- requires hack to content-length for patching custom resources
- auto-delete of CrawlJob via 'ttlSecondsAfterFinished'
- also delete pvcs until autodelete supported via statefulset (k8s >1.27)
- ensure filesAdded always set correctly, keep counter in redis, add to status display
- optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added
- add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled
- add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291)
- support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284)
* support for scheduled jobs!
- add main_scheduled_job entrypoint to run scheduled jobs
- add crawl_cron_job.yaml template for declaring CronJob
- CronJobs moved to default namespace
* operator manages ProfileJobs:
- jobs start with 'profilejob-'
- update expiry time by updating ProfileJob object 'expireTime' while profile is active
* refactor/cleanup:
- remove k8s package
- merge k8sman and basecrawlmanager into crawlmanager
- move templates, k8sapi, utils into root package
- delete all *_job.py files
- remove dt_now, ts_now from crawls, now in utils
- all db operations happen in crawl/crawlconfig/org files
- move shared crawl/crawlconfig/org functions that use the db to be importable directly,
including get_crawl_config, add_new_crawl, inc_crawl_stats
* role binding: more secure setup, don't allow crawler namespace any k8s permissions
- move cronjobs to be created in default namespace
- grant default namespace access to create cronjobs in default namespace
- remove role binding from crawler namespace
* additional tweaks to templates:
- templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately)
* stats / redis optimization:
- don't update stats in mongodb on every operator sync, only when crawl is finished
- for api access, read stats directly from redis to get up-to-date stats
- move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access
* Add migration for operator changes
- Update configmap for crawl configs with scale > 1 or
crawlTimeout > 0 and schedule exists to recreate CronJobs
- add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1
* subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart
- metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart
* backend api fixes
- ensure changing scale of crawl also updates it in the db
- crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api
---------
Co-authored-by: D. Lee <leepro@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Re-implement collections, storing crawlIds in collection
* Return collections for crawl endpoints and filter on coll name
* Remove crawl from all collections when deleted
* Revert get_collection_crawls to flat array of resources
* Fix tests
* config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config
- add 'default_page_load_timeout_seconds' to values.yaml, defaulting to 120, for pageLoadTimeout
- add 'defaultPageLoadTimeSeconds ' to /api/settings, update tests for /api/settings
addresses issue in #636
* more page limit: update to #717, instead of setting --limit in each crawlconfig,
apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit
* update tests, no longer returning 'crawl_page_limit_exceeds_allowed'
* backend: max pages per crawl limit, part of fix for #716:
- set 'max_pages_crawl_limit' in values.yaml, default to 100,000
- if set/non-0, automatically set limit if none provided
- if set/non-0, return 400 if adding config with limit exceeding max limit
- return limit as 'maxPagesPerCrawl' in /api/settings
- api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing)
tests: add test for 'max_pages_per_crawl' setting
- ensure 'limit' can not be set higher than max_pages_per_crawl
- ensure pages crawled is at the limit
- set test limit to max 2 pages
- add settings test
- check for pages.jsonl and extraPages.jsonl when crawling 2 pages
* Re-implement pagination and paginate crawlconfig revs
First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.
* Migrate all HttpUrl seeds to Seeds
This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.
* Filter and sort crawls and workflows
Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend
Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime
* Add crawlconfigs search-values API endpoint and test
* backend: make crawlconfigs mutable! (#656)
- crawlconfig PATCH /{id} can now receive a new JSON config to replace the old one (in addition to scale, schedule, tags)
- exclusions: add / remove APIs mutate the current crawlconfig, do not result in a new crawlconfig created
- exclusions: ensure crawl job 'config' is updated when exclusions are added/removed, unify add/remove exclusions on crawl
- k8s: crawlconfig json is updated along with scale
- k8s: stateful set is restarted by updating annotation, instead of changing template
- crawl object: now has 'config', as well as 'profileid', 'schedule', 'crawlTimeout', 'jobType' properties to ensure anything that is changeable is stored on the crawl
- crawlconfigcore: store share properties between crawl and crawlconfig in new crawlconfigcore (includes 'schedule', 'jobType', 'config', 'profileid', 'schedule', 'crawlTimeout', 'tags', 'oid')
- crawlconfig object: remove 'oldId', 'newId', disallow deactivating/deleting while crawl is running
- rename 'userid' -> 'createdBy'
- remove unused 'completions' field
- add missing return to fix /run response
- crawlout: ensure 'profileName' is resolved on CrawlOut from profileid
- crawlout: return 'name' instead of 'configName' for consistent response
- update: 'modified', 'modifiedBy' fields to set modification date and user modifying config
- update: ensure PROFILE_FILENAME is updated in configmap is profileid provided, clear if profileid==""
- update: return 'settings_changed' and 'metadata_changed' if either crawl settings or metadata changed
- tests: update tests to check settings_changed/metadata_changed return values
add revision tracking to crawlconfig:
- store each revision separate mongo db collection
- revisions accessible via /crawlconfigs/{cid}/revs
- store 'rev' int in crawlconfig and in crawljob
- only add revision history if crawl config changed
migration:
- update to db v3
- copy fields from crawlconfig -> crawl
- rename userid -> createdBy
- copy userid -> modifiedBy, created -> modified
- skip invalid crawls (missing config), make createdBy optional (just in case)
frontend: Update crawl config keys with new API (#681), update frontend to use new PATCH endpoint, load config from crawl object in details view
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
* Paginate API list endpoints
fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.
* Increase page size via overriden Params and Page classes
* update api resource list keys
---------
Co-authored-by: sua yoo <sua@suayoo.com>
* Fix POST /orgs/{oid}/crawls/delete
- Add permissions check to ensure crawler users can only delete
their own crawls
- Fix broken delete_crawls endpoint
- Delete files from storage as well as deleting crawl from db
- Add tests, including nightly test that ensures crawl files are
no longer accessible after the crawl is deleted
* Make invites expire after configurable window
The value can be set in EXPIRE_AFTER_SECONDS env var and via
helm chart values, and defaults to 7 days.
* Create nightly test CI and add invite expiration test to it
* Update 404 error message for missing or expired invite
---------
Co-authored-by: sua yoo <sua@suayoo.com>
Adds POST /orgs/{oid}/invites/delete, which expects the invited
email address in the POST body.
This endpoint will also delete duplicate invites with the same
email/oid combination if env var ALLOW_DUPE_INVITES allows dupes.