browsertrix

Author	SHA1	Message	Date
Tessa Walsh	83f80d4103	Add org metrics API endpoint (#1196 ) * Initial implementation of org metrics (This can eventually be sped up significantly by precomputing the values and storing them in the db.) * Rename storageQuota to storageQuotaBytes to be consistent * Update tests to include metrics	2023-09-19 16:24:27 -05:00
Tessa Walsh	859f2271da	fix(backend): call run now when updating crawlConfig #1194 Update backend/btrixcloud/crawlconfigs.py Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-19 11:57:41 -07:00
Tessa Walsh	9224f52f51	Remove config from list endpoints to speed up responses (#1193 ) * Remove config from list endpoints - Remove config field from workflow and crawl list endpoints - Add seedCount to CrawlConfigOut on backend and Workflow on frontend - Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional - Refactor workflow list in frontend to use firstSeed and seedCount - Frontend uses ListWorkflow type which is Omit<Workflow, "config">	2023-09-19 11:05:48 -05:00
Ilya Kreymer	feb7ab7652	Improved type checking for backend with mypy (#1174 ) * add mypy type check - run type check on backend fix ambiguous typing issues - add mypy to lint gh action + precommit hook - add mypy.ini	2023-09-13 19:40:26 -07:00
Ilya Kreymer	9159c7c914	ensure max crawl size and max crawl timeout values are set to 0 when unused, instead of null (#1167 ) - convert None->0 when creating CrawlJob - ensure frontend sends 0 not null - make input model require 'int = 0' instead of 'Optional[int] = 0'	2023-09-13 09:51:26 -07:00
Tessa Walsh	7cf2b11eb7	Add event webhook tests (#1155 ) * Add success filter to webhook list GET endpoint * Add sorting to webhooks list API and add event filter * Test webhooks via echo server * Set address to echo server on host from CI env var for k3d and microk8s * Add -s back to pytest command for k3d ci * Change pytest test path to avoid hanging on collecting tests * Revert microk8s to only run on push to main	2023-09-12 22:08:40 -07:00
Tessa Walsh	9377a6f456	Issue all non-upload storage-quota-update events from LiteElement (#1151 ) - More specific toast notification error messages to the action being attempted - Single dismissable global banner shown when org storage is reached - Removed check for storage quota reached in `runNow`, since buttons are disabled in UI, and errors handled if request fails. - Allow creating new workflow when storage quota reached - More responsive storage quota updates: add storageQuotaReached to archived item replay.json, updates w/o reload when crawl pushes quota over limit - Modify LiteElement to check for storageQuotaReached on GET requests --------- Co-authored-by: sua yoo <sua@suayoo.com>	2023-09-11 18:17:48 -07:00
Ilya Kreymer	ad9bca2e92	Operator refactor to control pods + pvcs directly instead of statefulsets (#1149 ) - Ability for pod to be Completed, unlike in Statefulset - e.g. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state. - Fixed shutdown order - crawler pods now correctly shut down first before redis pods, by switching to background deletion. - Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl - Create priority classes up to 'max_crawl_scale', configured in values.yaml - Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale, gracefully stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state before adjusting status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted. - Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds - Configurable Redis storage with 'redis_storage' value, set to 3Gi by default - CrawlJob deletion starts as soon as post-finish crawl operations are run - Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer - Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished) - Current resource usage added to status - Profile browser: also manage single pod directly without statefulset for consistency. - Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases). - Update to latest metacontroller (v4.11.0) - Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0) - Failed crawl logging: add 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status - tests: check other finished states to avoid stuck in infinite loop if crawl fails - tests: disable disk utilization check, which adds unpredictability to crawl testing! fixes #1147 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-11 10:38:04 -07:00
Tessa Walsh	d2ededc895	Add and enforce org storage quota (#1106 ) * Implement in backend - Track bytesStored in org - Add migration to pre-calculate based on size of crawlfiles and profilefiles - Add methods to increase or decrease org storage when crawl or profile files are added or deleted - Include storageQuotaReached boolean in API responses that alter storage - Don't start new crawls and fail uploads if storage quota reached * Implement in frontend - Add to orgs-list quotas - Update org's storageQuotaReached based on backend endpoint responses - Disable buttons when storage quota is met - Show toast notification when attempting to run a crawl when org storage quota is met	2023-09-07 12:45:43 -04:00
Tessa Walsh	147bfd9d44	Add event webhook notifications system to backend (#1061 ) Initial set of backend API for event webhook notifications for the following events: * Crawl started (including boolean indicating if crawl was scheduled) * Crawl finished * Upload finished * Archived item added to collection * Archived item removed from collection Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and then attempted to be sent (up to a total of 5 tries per overall attempt, with an increasing backoff between tries, implemented via use of the backoff library, which supports async). Webhook status is available via /api/orgs/<oid>/webhooks. (Additional testing + potential fastapi integration left in separate follow-ups.) Fixes #1041	2023-08-31 19:52:37 -07:00
Tessa Walsh	f6369ee01e	Add support for collectionIds to archived item PATCH endpoints (#1121 ) * Add support for collectionIds to patch endpoints * Make update available via all-crawls/ and add test * Fix tests * Always remove collectionIds from update * Remove unnecessary fallback * One more pass on expected values before update	2023-08-30 10:41:30 -04:00
Tessa Walsh	e667fe2e97	Add max crawl size option to backend and frontend (#1045 ) Backend: - add 'maxCrawlSize' to models and crawljob spec - add 'MAX_CRAWL_SIZE' to configmap - add maxCrawlSize to new crawlconfig + update APIs - operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize - tests: add max crawl size tests Frontend: - Add Max Crawl Size text box to Limits tab - Users enter max crawl size in GB, convert to bytes - Add BYTES_PER_GB as constant for converting to bytes - docs: add Crawl Size Limit to user guide workflow setup section Operator Refactor: - use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing later has no effect in operator - add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached - crawlerjob status: store byte size under 'size', human readable size under 'sizeHuman' for clarity - size stat always exists so remove unneeded conditional (defaults to 0) - store raw byte size in 'size', human readable size in 'sizeHuman' Charts: - subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping - subchart: show 'sizeHuman' property instead of 'size' - bump subchart version to 0.1.1 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-08-26 22:00:37 -07:00
Anish Lakhwara	8b16124675	feat: implement 'collections' array with {name, id} for archived item details (#1098 ) - rename 'collections' -> 'collectionIds', adding migration 0014 - only populate 'collections' array with {name, id} pair for get_crawl() / single archived item path, but not for aggregate/list methods - remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version - ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl()) - tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test - frontend: change Crawl object to have collectionIds instead of collections --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-08-25 00:26:46 -07:00
Tessa Walsh	ce5b52f8af	Add and enforce org maxPagesPerCrawl quota (#1044 )	2023-08-23 10:38:36 -04:00
Ilya Kreymer	de3e5907a7	backend: crawlout: include raw crawlconfig in api details, fixes #1030 (#1055 )	2023-08-09 08:46:42 -07:00
Tessa Walsh	7ff57ce6b5	Backend: standardize search values, filters, and sorting for archived items (#1039 ) - all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in - Uploads list endpoint now includes all all-crawls filters relevant to uploads - An all-crawls/search-values endpoint is added to support searching across all archived item types - Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI) - Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment - New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass - Tests coverage added for all new all-crawls endpoints, filters, and sort values	2023-08-04 09:56:52 -07:00
Ilya Kreymer	362afa47bd	Support for Public / Shareable Collections (#1038 ) * collections: support toggling collections public/private, viewable via RWP - backend: add 'public' to collection model, support patching to update - backend: add .../collections/<id>/public/replay.json for public access - backend: add CORS handling for public endpoint - frontend: support 'make shareable / make private' dropdown actions on collection detail + collection list views - frontend: show shareable / private icons by collection name on detail + list views - frontend: link to replayweb.page for standalone browsing - frontend: add embed code popup when a collection is shareable - refer to public collections as 'shareable' for now --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2023-08-03 19:11:01 -07:00
Tessa Walsh	c21153255a	Rename notes to description in frontend and backend (#1011 ) - Rename crawl notes to description - Add migration renaming notes -> description - Stop inheriting workflow description in crawl - Update frontend to replace crawl/upload notes with description - Remove setting of config description from crawl list - Adjust tests for changes	2023-07-26 13:00:04 -07:00
Tessa Walsh	fcd48b1831	Add totalSize to collections and make it sortable in list endpoint (#1001 ) * Precompute collection.totalSize and make sortable * Add migration to recompute collection data with totalSize	2023-07-24 13:12:23 -04:00
Tessa Walsh	9f32aa697b	Add collections and tags to upload API endpoints (#993 ) * Add collections and tags to uploads * Fix order of deletion check test * Re-add tags to UploadedCrawl model after rebase * Fix Users model heading	2023-07-21 16:44:56 +02:00
Tessa Walsh	4014d98243	Move pydantic models to separate module + refactor crawl response endpoints to be consistent (#983 ) * Move all pydantic models to models.py to avoid circular dependencies * Include automated crawl details in all-crawls GET endpoints - ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated crawls only: - cid - name - description - firstSeed - seedCount - profileName * Add automated crawl fields to list all-crawls test * Uncomment mongo readinessProbe * cleanup CrawlOutWithResources: - remove 'files' from output model, only resources should be returned - add _files_to_resources() to simplify computing presigned 'resources' from raw 'files' - update upload tests to be more consistent, 'files' never present, 'errors' always none --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-07-20 13:05:33 +02:00

21 Commits
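
Commit 147bfd9d44 above describes webhook delivery as up to 5 tries with an increasing backoff between them. A minimal sketch of that retry pattern follows. It is not the code from the commit: the commit says the real implementation uses the backoff library, while this sketch uses only asyncio from the standard library. The names send_with_retries, send and base_delay are hypothetical, and the delay-doubling schedule is an assumption (the commit message does not state the schedule).

```python
import asyncio

MAX_TRIES = 5  # "up to a total of 5 tries" per the commit message


async def send_with_retries(send, max_tries=MAX_TRIES, base_delay=1.0):
    """Await the hypothetical coroutine function `send` until it succeeds.

    After each failure the wait doubles (an assumed schedule), and the
    last exception is re-raised once `max_tries` calls have failed.
    """
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        try:
            return await send()
        except Exception:
            if attempt == max_tries:
                raise  # out of tries: surface the final error
            await asyncio.sleep(delay)
            delay *= 2  # increasing backoff between tries
```

A caller would pass a coroutine function that performs the HTTP POST and raises on a failed delivery. Recording each failed try in the database, as the commit describes, is left out of the sketch.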