browsertrix

Author	SHA1	Message	Date
Tessa Walsh	a031fab313	Backend work for public collections (#2198 ) Fixes #2182 This rather large PR adds the rest of what should be needed for public collections work in the frontend. New API endpoints include: - Public collections endpoints: GET, streaming download - Paginated list of URLs in collection with snapshot (page) info for each - Collection endpoint to set home URL - Collection endpoint to upload thumbnail as stream - DELETE endpoint to remove collection thumbnail Changes to existing API endpoints include: - Paginating public collection list results - Several `pages` endpoints that previously only supported `/crawls/` in their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support `/uploads/` and `/all-crawls/` namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For `/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent). - `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added. Other big changes: - New uploads will now have their pages read into the database! Collection page counts now also include uploads - A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly - Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints	2025-01-13 15:15:48 -08:00
Ilya Kreymer	94e985ae13	optimize org quota lookups (#1973 ) - instead of looking up storage and exec min quotas from oid, and loading an org each time, load org once and then check quotas on the org object - many times the org was already available, and was looked up again - storage and exec quota checks become sync - rename can_run_crawl() to more generic can_write_data(), optionally also checks exec minutes - typing: get_org_by_id() always returns org, or throws, adjust methods accordingly (don't check for none, catch exception) - typing: fix typo in BaseOperator, catch type errors in operator 'org_ops' - operator quota check: use up-to-date 'status.size' for current job, ignore current job in all jobs list to avoid double-counting - follow up to #1969	2024-07-25 14:00:16 -07:00
Tessa Walsh	27ee16d308	Implement downloading archived item + QA runs as multi-WACZ (#1933 ) Fixes #1412 ## Changes ### Backend - Adds `all-crawls`, `crawls`, and `uploads` API endpoints to download archived item as multi-WACZ - Download QA runs as multi-WACZ - Adds backend tests for new endpoints - Update to new version of stream-zip library which does not require crc-32 to be present for ZIP members, computes after streaming, fixing invalid crc-32 issues as previously computed crc-32s from crawler may be invalid. ### Frontend Adds ability to download archived item from: - Button in archived item detail Files tab - Archived item details actions menu - Archived items list menu --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 10:28:57 -07:00
Ilya Kreymer	8c0321bdea	Pydantic 2.x update + type fixes + python 3.12 (#1947 ) * updates pydantic to 2.x * also update to python 3.12 * additional type fixes: - all Optional[] types must have a default value - update to constrained types - URL types converted from str - test updates Fixes #1940	2024-07-22 17:23:03 -07:00
Tessa Walsh	d41647e6c2	Document all API endpoints with response models (#1928 ) Fixes #1920 Adds response models to all API endpoints that were missing them, documenting current behavior without making any changes at this stage to standardize responses. Follow-up work will involve adding generics to some of the response models	2024-07-16 12:48:38 -07:00
Tessa Walsh	bdfc0948d3	Disable uploading and creating browser profiles when org is read-only (#1907 ) Fixes #1904 Follow-up to read-only enforcement, with improved tests.	2024-07-01 23:15:38 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586 ) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running,`QARun` data stored in a single `qa` object, while finished qa runs are added to `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Ilya Kreymer	fb3d88291f	Background Jobs Work (#1321 ) Fixes #1252 Supports a generic background job system, with two background jobs, CreateReplicaJob and DeleteReplicaJob. - CreateReplicaJob runs on new crawls, uploads, profiles and updates the `replicas` array with the info about the replica after the job succeeds. - DeleteReplicaJob deletes the replica. - Both jobs are created from the new `replica_job.yaml` template. The CreateReplicaJob sets secrets for primary storage + replica storage, while DeleteReplicaJob only needs the replica storage. - The job is processed in the operator when the job is finalized (deleted), which should happen immediately when the job is done, either because it succeeds or because the backoffLimit is reached (currently set to 3). - /jobs/ api lists all jobs using a paginated response, including filtering and sorting - /jobs/<job id> returns details for a particular job - tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints - tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 13:02:17 -07:00
Ilya Kreymer	6384d8b5f1	Additional Type Hints / Type Fix Pass (#1320 ) This PR adds more type safety to the backend codebase: - All ops classes calls should be type checked - Avoiding circular references with TYPE_CHECKING conditional - Consistent UUID usage: uuid.UUID / UUID4 with just UUID - Crawl states moved to models, made into lists - Additional typing added as needed, fixed a few type related errors - CrawlOps / UploadOps / BaseCrawlOps now all have same param init order to simplify changes	2023-10-30 12:59:24 -04:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296 ) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contain a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Ilya Kreymer	4591db1afe	More stringent UUID types for user input / avoid 500 errors (#1309 ) Fixes #1297 Ensures proper typing for UUIDs in FastAPI input models, to avoid explicit conversions, which may throw errors. This avoids possible 500 errors (due to ValueError exceptions) when converting UUIDs from user input. Instead, will get more 422 errors from FastAPI. UUID conversions remaining are in operator / profile handling where UUIDs are retrieved from previously set fields, remaining user input conversions in user auth and collection list are wrapped in exceptions. For `profileid`, update fastapi models to support union of UUID, null, and EmptyStr (new empty string only type), to differentiate removing profile (empty string) vs not changing at all (null) for config updates	2023-10-25 15:15:53 -04:00
Ilya Kreymer	dc8d510b11	webhook tweak: pass oid to crawl finished and upload finished webhooks (#1287 ) Optimizes webhooks by passing oid directly to webhooks: - avoids extra crawl lookup - possible for crawl to be deleted before webhook is processed via operator (resulting in crawl lookup to fail) - add more typing to operator and webhooks	2023-10-16 10:51:36 -07:00
Ilya Kreymer	16e7a1d0a2	Storage Ops Refactor (#1257 ) * storage ops refactor: - create StorageOps class similar to other ops classes - init storages list in StorageOps, no longer require lookup up default storages via CrawlManager - convert all storage functions to members, add storageops to operator - remove unused params, ensure crawl exists for rollover restart - add env var to determine if using local minio to use correct endpoint URL * crawls /seeds endpoint: just return empty list if not a crawl (eg. upload) * crawlmanager: remove unused code, rename check_storage -> has_storage	2023-10-10 15:04:23 -07:00
Tessa Walsh	e9bac4c088	API delete endpoint improvements (#1232 ) - Applies user permissions check before deleting anything in all /delete endpoints - Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete - Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete - Updates frontend notification message to Only org owners can delete other users' archived items. when a crawler user attempts to delete another users' archived items	2023-10-03 13:05:00 -07:00
Tessa Walsh	7a56fa23f5	Remove username lookups for crawls and workflows by storing usernames in db (#1199 ) * store usernames (createdByName, modifiedByName, startedByName) in db for workflows * store userName for userid for crawls in db * update output models to return usernames * add migration 0018 to add usernames to existing crawls and crawlconfigs * updated tests for crawl and config usernames * use async for to iterate over crawls and crawlconfigs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-28 09:37:23 -07:00
Tessa Walsh	094f27bcff	Track bytes stored per file type and include in org metrics (#1207 ) * Add bytes stored per type to org and metrics The org now tracks bytesStored by type of crawl, uploads, and browser profiles in addition to the total, and returns these values in the org metrics endpoint. A migration is added to precompute these values in existing deployments. In addition, all /metrics storage values are now returned solely as bytes, as the GB form wasn't being used in the frontend and is unnecessary. * Improve deletion of multiple archived item types via `/all-crawls` delete endpoint - Update `/all-crawls` delete test to check that org and workflow size values are correct following deletion. - Fix bug where it was always assumed only one crawl was deleted per cid and size was not tracked per cid - Add type check within delete_crawls	2023-09-22 12:55:21 -04:00
Ilya Kreymer	feb7ab7652	Improved type checking for backend with mypy (#1174 ) * add mypy type check - run type check on backend fix ambiguous typing issues - add mypy to lint gh action + precommit hook - add mypy.ini	2023-09-13 19:40:26 -07:00
Ilya Kreymer	4b34da033a	Refactor / Cleanup: move ops functions back into classes (#1171 ) * remove almost all standalone functions and move them back into ops member functions * operator now has access to all the ops classes as well * keep two standalone functions used only in migrations --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-13 11:56:09 -07:00
Tessa Walsh	d2ededc895	Add and enforce org storage quota (#1106 ) * Implement in backend - Track bytesStored in org - Add migration to pre-calculate based on size of crawlfiles and profilefiles - Add methods to increase or decrease org storage when crawl or profile files are added or deleted - Include storageQuotaReached boolean in API responses that alter storage - Don't start new crawls and fail uploads if storage quota reached * Implement in frontend - Add to orgs-list quotas - Update org's storageQuotaReached based on backend endpoint responses - Disable buttons when storage quota is met - Show toast notification when attempting to run a crawl when org storage quota is met	2023-09-07 12:45:43 -04:00
Tessa Walsh	147bfd9d44	Add event webhook notifications system to backend (#1061 ) Initial set of backend API for event webhook notifications for the following events: * Crawl started (including boolean indicating if crawl was scheduled) * Crawl finished * Upload finished * Archived item added to collection * Archived item removed from collection Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and then attempted to be sent (up to a total of 5 tries per overall attempt, with an increasing backoff between, implemented via use of the backoff library, which supports async). webhook status available via /api/orgs/<oid>/webhooks (Additional testing + potential fastapi integration left in separate follow-ups Fixes #1041	2023-08-31 19:52:37 -07:00
Anish Lakhwara	8b16124675	feat: implement 'collections' array with {name, id} for archived item details (#1098 ) - rename 'collections' -> 'collectionIds', adding migration 0014 - only populate 'collections' array with {name, id} pair for get_crawl() / single archived item path, but not for aggregate/list methods - remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version - ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl()) - tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test - frontend: change Crawl object to have collectionIds instead of collections --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-08-25 00:26:46 -07:00
Tessa Walsh	7ff57ce6b5	Backend: standardize search values, filters, and sorting for archived items (#1039 ) - all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in - Uploads list endpoint now includes all all-crawls filters relevant to uploads - An all-crawls/search-values endpoint is added to support searching across all archived item types - Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI) - Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment - New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass - Tests coverage added for all new all-crawls endpoints, filters, and sort values	2023-08-04 09:56:52 -07:00
Tessa Walsh	c21153255a	Rename notes to description in frontend and backend (#1011 ) - Rename crawl notes to description - Add migration renaming notes -> description - Stop inheriting workflow description in crawl - Update frontend to replace crawl/upload notes with description - Remove setting of config description from crawl list - Adjust tests for changes	2023-07-26 13:00:04 -07:00
Tessa Walsh	9f32aa697b	Add collections and tags to upload API endpoints (#993 ) * Add collections and tags to uploads * Fix order of deletion check test * Re-add tags to UploadedCrawl model after rebase * Fix Users model heading	2023-07-21 16:44:56 +02:00
Tessa Walsh	4014d98243	Move pydantic models to separate module + refactor crawl response endpoints to be consistent (#983 ) * Move all pydantic models to models.py to avoid circular dependencies * Include automated crawl details in all-crawls GET endpoints - ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated crawls only: - cid - name - description - firstSeed - seedCount - profileName * Add automated crawl fields to list all-crawls test * Uncomment mongo readinessProbe * cleanup CrawlOutWithResources: - remove 'files' from output model, only resources should be returned - add _files_to_resources() to simplify computing presigned 'resources' from raw 'files' - update upload tests to be more consistent, 'files' never present, 'errors' always none --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-07-20 13:05:33 +02:00
Ilya Kreymer	f1bce310d0	uploads api: support filtering uploads by collectionId (#969 ) tests: add collection filter test	2023-07-09 10:54:30 -07:00
Ilya Kreymer	00eb62214d	Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937 ) * basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives - move shared functionality to basecrawl.py - create a base BaseCrawl object, which contains start / finish time, metadata and files array - create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove * uploads api: (part of #929) - new UploadCrawl object which extends BaseCrawl, has name and description - support multipart form data data upload to /uploads/formdata - support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts - require 'filename' param to set upload filename for streaming uploads (otherwise use form data names) - sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz - uploads have internal id 'upload-<uuid>' - create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete' - handle upload failures, abort multipart upload - ensure uploads added within org bucket path - return id / added when adding new UploadedCrawl - support listing, deleting, and patch /uploads - support upload details via /replay.json to support for replay - add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far). - support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name') * base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources) - support all-crawls/<id>/replay.json with resources - Use ListCrawlOut model for /all-crawls list endpoint - Extend BaseCrawlOut from ListCrawlOut, add type - use 'type: crawl' for crawls and 'type: upload' for uploads - migration: ensure all previous crawl objects / missing type are set to 'type: crawl' - indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state * tests: add test for multipart and streaming upload, listing uploads, deleting upload - add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz' * collections: support adding and remove both crawls and uploads via base crawl - include collection_ids in /all-crawls list - collections replay.json can include both crawls and uploads bump version to 1.6.0-beta.2 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-07-07 09:13:26 -07:00

27 Commits