browsertrix

Author	SHA1	Message	Date
Tessa Walsh	c800da1732	Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686 ) Backend work for #1672 Adds new sort options to /crawls and /all-crawls GET list endpoints: - `reviewStatus` - `qaRunCount`: number of completed QA runs for crawl (also added to CrawlOut) - `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of which are added to CrawlOut)	2024-04-16 23:54:09 -07:00
Tessa Walsh	87e0873f1a	Add mime field to Page model (#1678)	2024-04-17 00:57:49 -04:00
Tessa Walsh	172a9bf0cd	Change crawl.reviewStatus to 1-5 scale int (#1664)	2024-04-09 17:51:06 -07:00
Tessa Walsh	e9895e78a2	Add additional filters to page list endpoints (#1622) Fixes #1617 Filters added: - reviewed: filter by whether the page has an approval or at least one note (true) or neither (false) - approved: filter by approval value (accepts a comma-separated list of strings, each of which is coerced into True, False, or None, or ignored if it is an invalid value) - hasNotes: filter by whether the page has at least one note (true) or not (false) Tests have also been added to ensure that results are as expected.	2024-03-21 21:33:07 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running, `QARun` data is stored in a single `qa` object, while finished QA runs are added to the `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Tessa Walsh	21ae38362e	Add endpoints to read pages from older crawl WACZs into database (#1562 ) Fixes #1597 New endpoints (replacing old migration) to re-add crawl pages to db from WACZs. After a few implementation attempts, we settled on using [remotezip](https://github.com/gtsystem/python-remotezip) to handle parsing of the zip files and streaming their contents line-by-line for pages. I've also modified the sync log streaming to use remotezip as well, which allows us to remove our own zip module and let remotezip handle the complexity of parsing zip files. Database inserts for pages from WACZs are batched 100 at a time to help speed up the endpoint, and the task is kicked off using asyncio.create_task so as not to block before giving a response. StorageOps now contains a method for streaming the bytes of any file in a remote WACZ, requiring only the presigned URL for the WACZ and the name of the file to stream.	2024-03-19 14:14:21 -07:00
Tessa Walsh	c20e754269	Add updatable QA reviewStatus field to crawls (#1575 ) Fixes #1539 Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl update API endpoint. Acceptable values are "good", "acceptable" or "failure", enforced by an Enum. Added to `BaseCrawl` so that we can extend support to uploads more easily later on, but for now we'll only display this for crawls in the frontend.	2024-03-05 16:49:23 -08:00
Tessa Walsh	14189b7cfb	Add crawl pages and related API endpoints (#1516 ) Fixes #1502 - Adds pages to database as they get added to Redis during crawl - Adds migration to add pages to database for older crawls from pages.jsonl and extraPages.jsonl files in WACZ - Adds GET, list GET, and PATCH update endpoints for pages - Adds POST (add), PATCH, and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text - Adds page_ops methods for 1. adding resources/urls to page, and 2. adding automated heuristics and supplemental info (mime, type, etc.) to page (for use in crawl QA job) - Modifies `Migration` class to accept kwargs so that we can pass in ops classes as needed for migrations - Deletes WACZ files and pages from database for failed crawls during crawl_finished process - Deletes crawl pages when a crawl is deleted Note: Requires a crawler version 1.0.0 beta3 or later, with support for `--writePagesToRedis` to populate pages at crawl completion. Beta 4 is configured in the test chart, which should be upgraded to stable 1.0.0 when it's released. Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-28 12:11:35 -05:00
Tessa Walsh	38a01860b8	Add API endpoints for crawl statistics (#1461 ) Fixes #1158 Introduces two new API endpoints that stream crawling statistics CSVs (with a suggested attachment filename header): - `GET /api/orgs/all/crawls/stats` - crawls from all orgs (superuser only) - `GET /api/orgs/{oid}/crawls/stats` - crawls from just one org (available to org crawler/admin users as well as superusers) Also includes tests for both endpoints.	2024-01-10 13:30:47 -08:00
Ilya Kreymer	63291e95a5	avoid exception if 'errors' key doesn't exist (#1301) - avoid exception if 'errors' (or 'files') keys don't exist (part of #1297) - ensure 'errors' list always set on output model for consistency, defaulting to empty list - fix tests for 'errors' being an empty list follow-up to #1300 (merging 1.7.1 release into main)	2023-10-19 14:39:54 -07:00
Tessa Walsh	e9bac4c088	API delete endpoint improvements (#1232) - Applies user permissions check before deleting anything in all /delete endpoints - Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete - Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete - Updates frontend notification message to "Only org owners can delete other users' archived items." when a crawler user attempts to delete another user's archived items	2023-10-03 13:05:00 -07:00
sua yoo	941a75ef12	Separate seeds into new endpoints (#1217) - Remove config.seeds from workflow and crawl detail endpoints - Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow - Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before) - Modify frontend to fetch seeds from new /seeds endpoints with loading indicator --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-02 10:56:12 -07:00
Tessa Walsh	7a56fa23f5	Remove username lookups for crawls and workflows by storing usernames in db (#1199 ) * store usernames (createdByName, modifiedByName, startedByName) in db for workflows * store userName for userid for crawls in db * update output models to return usernames * add migration 0018 to add usernames to existing crawls and crawlconfigs * updated tests for crawl and config usernames * use async for to iterate over crawls and crawlconfigs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-28 09:37:23 -07:00
Tessa Walsh	d2ededc895	Add and enforce org storage quota (#1106 ) * Implement in backend - Track bytesStored in org - Add migration to pre-calculate based on size of crawlfiles and profilefiles - Add methods to increase or decrease org storage when crawl or profile files are added or deleted - Include storageQuotaReached boolean in API responses that alter storage - Don't start new crawls and fail uploads if storage quota reached * Implement in frontend - Add to orgs-list quotas - Update org's storageQuotaReached based on backend endpoint responses - Disable buttons when storage quota is met - Show toast notification when attempting to run a crawl when org storage quota is met	2023-09-07 12:45:43 -04:00
Tessa Walsh	f6369ee01e	Add support for collectionIds to archived item PATCH endpoints (#1121) * Add support for collectionIds to patch endpoints * Make update available via all-crawls/ and add test * Fix tests * Always remove collectionIds from update * Remove unnecessary fallback * One more pass on expected values before update	2023-08-30 10:41:30 -04:00
Tessa Walsh	7ff57ce6b5	Backend: standardize search values, filters, and sorting for archived items (#1039 ) - all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in - Uploads list endpoint now includes all all-crawls filters relevant to uploads - An all-crawls/search-values endpoint is added to support searching across all archived item types - Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI) - Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment - New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass - Tests coverage added for all new all-crawls endpoints, filters, and sort values	2023-08-04 09:56:52 -07:00
Tessa Walsh	c21153255a	Rename notes to description in frontend and backend (#1011 ) - Rename crawl notes to description - Add migration renaming notes -> description - Stop inheriting workflow description in crawl - Update frontend to replace crawl/upload notes with description - Remove setting of config description from crawl list - Adjust tests for changes	2023-07-26 13:00:04 -07:00
Tessa Walsh	c7051d5fbf	Backend API consistency pass (#921 ) * Make API add and update method returns consistent - Updates return {"updated": True} - Adds return {"added": True} - Both can additionally have other fields as needed, e.g. id or name - remove Profile response model, as returning added / id only - reformat --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-06-16 18:52:46 -07:00
Tessa Walsh	120f7ca158	Precompute crawl file stats (#906)	2023-06-07 16:39:49 -07:00
Ilya Kreymer	3f42515914	crawls list: unset errors in crawls list response to avoid very large… (#904 ) * crawls list: unset errors in crawls list response to avoid very large responses #872 * Remove errors from crawl replay.json * Add tests to ensure errors are excluded from crawl GET endpoints * Update tests to accept None for errors --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-06-02 18:52:59 -07:00
Tessa Walsh	9c7a312a4c	Rework collections to track collections in Crawl (#878 ) * Track collections in Crawl rather than crawls in Collection * Add delete collection API endpoint and tests * Precompute collection crawlCount, pageCount, and tags and add them to GET collection responses * Add modified field to Collection * Update collection replay.json method * Make add and remove crawls accept list of crawl ids * Auto-add new workflow crawls to collections when they successfully complete via CrawlConfig.autoAddCollections field * Move long-running post-crawl operator tasks into asyncio task * Make CrawlConfig.autoAddCollections updatable via /update API endpoint	2023-05-25 15:41:50 -04:00
Ilya Kreymer	12f7db3ae2	tests: fixes for crawl cancel + crawl stopped (#864 ) * tests: - fix cancel crawl test by ensuring state is not running or waiting - fix stop crawl test by ensuring stop is only initiated after at least one page has been crawled, otherwise result may be failed, as no crawl data has been crawled yet (separate fix in crawler to avoid loop if stopped before any data written webrecorder/browsertrix-crawler#314) - bump page limit to 4 for tests to ensure crawl is partially complete, not fully complete when stopping - allow canceled or partial_complete due to race condition * chart: bump frontend limits in default, not just for tests (addresses #780) * crawl stop before starting: - if crawl stopped before it started, mark as canceled - add test for stopping immediately, which should result in 'canceled' crawl - attempt to increase resync interval for immediate failure - nightly tests: increase page limit to test timeout * backend: - detect stopped-before-start crawl as 'failed' instead of 'done' - stats: return stats counters as int instead of string	2023-05-22 20:17:29 -07:00
Ilya Kreymer	2cae065c46	Add Waiting state on the backend and frontend (#839 ) * operator: add waiting state - add pods as related objects - inspect pod status, set crawl status to 'waiting' if no pods are running frontend: - frontend support for 'waiting' state - show waiting icon from mocks --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2023-05-08 17:05:01 -07:00
Ilya Kreymer	70319594c2	crawlconfig: fix default filename template, make configurable (#835 ) * crawlconfig: fix default filename template, make configurable - make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml - set to '@ts-@hostsuffix.wacz' by default - allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap - tests: add test for custom 'default_crawl_filename_template'	2023-05-08 14:03:27 -07:00
Tessa Walsh	59e49eacd5	Update collections backend API (#759 ) * Re-implement collections, storing crawlIds in collection * Return collections for crawl endpoints and filter on coll name * Remove crawl from all collections when deleted * Revert get_collection_crawls to flat array of resources * Fix tests	2023-04-14 12:17:18 -04:00
Ilya Kreymer	1c47a648a9	Max page limit override (#737 ) * more page limit: update to #717, instead of setting --limit in each crawlconfig, apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit * update tests, no longer returning 'crawl_page_limit_exceeds_allowed'	2023-04-03 14:01:32 -07:00
Ilya Kreymer	887cb16146	Allow configurable max pages per crawl in deployment settings (#717 ) * backend: max pages per crawl limit, part of fix for #716: - set 'max_pages_crawl_limit' in values.yaml, default to 100,000 - if set/non-0, automatically set limit if none provided - if set/non-0, return 400 if adding config with limit exceeding max limit - return limit as 'maxPagesPerCrawl' in /api/settings - api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing) tests: add test for 'max_pages_per_crawl' setting - ensure 'limit' can not be set higher than max_pages_per_crawl - ensure pages crawled is at the limit - set test limit to max 2 pages - add settings test - check for pages.jsonl and extraPages.jsonl when crawling 2 pages	2023-03-28 16:26:29 -07:00
Tessa Walsh	4724754efc	Filter and sort crawl and workflow list API endpoints in backend (#724 ) * Re-implement pagination and paginate crawlconfig revs First step toward simplifying pagination to set us up for sorting and filtering of list endpoints. This commit removes fastapi-pagination as a dependency. * Migrate all HttpUrl seeds to Seeds This commit also updates the frontend to always use Seeds and to fix display issues resulting from the change. * Filter and sort crawls and workflows Crawls: - Filter by createdBy (via userid param) - Filter by state (comma-separated string for multiple values) - Filter by first_seed, name, description - Sort by started, finished, fileSize, firstSeed - Sort descending by default to match frontend Workflows: - Filter by createdBy (formerly userid) and modifiedBy - Filter by first_seed, name, description - Sort by created, modified, firstSeed, lastCrawlTime * Add crawlconfigs search-values API endpoint and test	2023-03-28 17:55:40 -04:00
Tessa Walsh	4136bdad2e	Add optional description to crawl configs and return in crawl endpoints (#707)	2023-03-21 15:39:09 -04:00
Tessa Walsh	e98c7172a9	Paginate API list endpoints (#659) * Paginate API list endpoints fastapi-pagination is pinned to 0.9.3, the latest release that plays nicely with pinned versions of fastapi and fastapi-users. * Increase page size via overridden Params and Page classes * update api resource list keys --------- Co-authored-by: sua yoo <sua@suayoo.com>	2023-03-06 14:41:25 -05:00
Tessa Walsh	ed94dde7e6	Include firstSeed and seedCount in crawl endpoints (#618)	2023-02-22 10:27:31 -05:00
Tessa Walsh	bd4fba7af7	Fix POST /orgs/{oid}/crawls/delete (#591 ) * Fix POST /orgs/{oid}/crawls/delete - Add permissions check to ensure crawler users can only delete their own crawls - Fix broken delete_crawls endpoint - Delete files from storage as well as deleting crawl from db - Add tests, including nightly test that ensures crawl files are no longer accessible after the crawl is deleted	2023-02-15 21:06:12 -05:00
Tessa Walsh	ce8f426978	Add notes to crawl and crawl updates (#587)	2023-02-08 18:36:22 -08:00
Tessa Walsh	2e3b3cb228	Add API endpoint to update crawl tags (#545 ) * Add API endpoint to update crawls (tags only for now) * Allow setting tags to empty list in crawlconfig updates	2023-02-01 22:24:36 -05:00
Tessa Walsh	0fa60ebc45	Rename archives/teams -> orgs in codebase + add db migration (#486 ) * Rename archives to orgs and aid to oid on backend * Rename archive to org and aid to oid in frontend * Remove translation artifact * Rename team -> organization * Add database migrations and run once on startup * This commit also applies the new by_one_worker decorator to other asyncio tasks to prevent heavy tasks from being run in each worker. * Run black, pylint, and husky via pre-commit * Set db version and use in migrations * Update and prepare database in single task * Migrate k8s configmaps	2023-01-18 14:51:04 -08:00
Ilya Kreymer	2daa742585	Copy tags from crawlconfig to crawl (#467 ), fixes #466 - add tags to crawl object - ensure tags are copied from crawlconfig to crawl when crawl is created (both manually and scheduled) - tests: add test to ensure tags added to crawl, remove redundant wait replaced with fixtures	2023-01-12 17:46:19 -08:00
Tessa Walsh	49460bb070	Add default organization + invite to default org (#465 ), #455 - Add default switch to Archive (org) model - Set default org name via values.yaml - Add check to ensure only one org with default org name exists - Stop creating new orgs for new users - Add new API endpoints for creating and renaming orgs (part of #457) - Make Archive.name unique via index - Wait for db connection on init, log if waiting - Make archive-less invites invite user to default org with Owner role - Rename default org from chart value if changed - Don't create new org for invited users	2023-01-12 16:44:18 -08:00
Ilya Kreymer	7b5d82936d	backend: initial tags api support (addresses #365): (#434) * backend: initial tags api support (addresses #365): - add 'tags' field to crawlconfig (array of strings) - allow querying crawlconfigs to specify multiple 'tag' query args, e.g. tag=A&tag=B - add /archives/<aid>/crawlconfigs/tags api to query by distinct tag, include index on aid + tag tests: add tests for adding configs, querying by tags tests: fix fixtures to retry login if initial attempt fails, use test seed of https://webrecorder.net instead of https://example.com/	2023-01-11 13:29:35 -08:00
Ilya Kreymer	56a6d7a5d8	Backend lint check (#451 ) - apply lint + format fixes to backend - add ci for lint + format fixes for backend - use fixed version of pydantic	2023-01-10 16:17:06 -08:00
Tessa Walsh	d1b59c9bd0	Use archive_viewer_dep permissions to GET crawls (#443 ) * Use archive_viewer_dep permissions to GET crawls * Add is_viewer check to archive_dep * Add API endpoint to add new user to archive directly (/archive/<id>/add-user) * Add tests * Refactor tests to use fixtures * And remove login test that duplicates fixtures	2023-01-09 19:11:53 -08:00
Ilya Kreymer	dfca09fc9c	Add single crawl info api at /crawls/{crawl_id} (#418 ) * backend: crawl info apis: - add /crawls/{crawl_id} api endpoint which just lists the crawl info, without resolving the individual files - move /crawls/{crawl_id}.json -> /crawls/{crawl_id}/replay.json for clarity that it's used for replay * frontend: update api for new replay.json endpoint	2022-12-19 14:54:48 -08:00
Ilya Kreymer	82ffc0dfbc	Local Deployment Work: Support running locally + test cluster on CI (#396 ) * k8s local deployment work: - make it easier to deploy w/o ingress by setting 'local_service_port' (suggested port 30870) - if using local minio, ensure file endpoints set to /data/ and /data/ proxies correctly to local bucket - if not using minio, ensure file endpoints point to correct access / endpoint url. - setup should work with docker desktop, minikube, microk8s and k3s! - nginx chart: bump nginx memory limit to 20Mi - nginx image: 00-default-override-resolver-config -> 00-browsertrix-nginx-init for clarity - nginx image: use default nginx.conf, pin to nginx 1.23.2 - mongo: readd readiness probe, bump connect wait timeout (needed for ci) - config: set superadmin username to 'admin' - config schema: set 'name' as required - add sample chart values overrides: - chart values: local-config.yaml for running locally with 'local_service_port' - chart values: add microk8s-hosted.yaml for configuring a hosted microk8s setup - chart values: add microk8s-ci.yaml for ci tests - ci: remove docker swarm tests - ci: add microk8s integration tests: launching cluster, logging in, running a crawl of example.com, downloading/checking WACZ - bump to 1.1.0-beta.2	2022-12-02 19:58:34 -08:00
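
Commit 21ae38362e describes batching page inserts 100 at a time and starting the work with asyncio.create_task so the endpoint can respond without blocking. The following is a minimal sketch of that pattern only, not the project's code: `add_pages_in_batches`, `handle_request`, the `insert_many` callback, and the `{"started": True}` response shape are all hypothetical names chosen for illustration.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Dict, List, Tuple

BATCH_SIZE = 100  # batch size stated in the commit message


async def add_pages_in_batches(
    pages: AsyncIterator[Dict],
    insert_many: Callable[[List[Dict]], Awaitable[None]],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Consume pages one at a time and write them in fixed-size batches.

    `insert_many` is a hypothetical stand-in for a bulk database insert.
    Returns the number of pages written.
    """
    batch: List[Dict] = []
    total = 0
    async for page in pages:
        batch.append(page)
        if len(batch) >= batch_size:
            await insert_many(batch)
            total += len(batch)
            batch = []
    # Flush the final partial batch, if any.
    if batch:
        await insert_many(batch)
        total += len(batch)
    return total


async def handle_request(
    pages: AsyncIterator[Dict],
    insert_many: Callable[[List[Dict]], Awaitable[None]],
) -> Tuple[Dict, "asyncio.Task[int]"]:
    """Schedule the batched insert and return a response right away.

    The task is returned so the caller keeps a reference to it;
    the response shape here is an assumption.
    """
    task = asyncio.create_task(add_pages_in_batches(pages, insert_many))
    return {"started": True}, task
```

Keeping a reference to the task matters in practice, since the event loop holds only weak references to scheduled tasks.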

42 Commits
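
Commit e9895e78a2 says the `approved` page filter accepts a comma-separated list of strings, each coerced into True, False, or None, with invalid values ignored. A minimal sketch of that coercion might look like the following; the function name `parse_approved` and the exact accepted spellings are assumptions, since the commit message does not list them.

```python
from typing import List, Optional

# Hypothetical spellings; the commit message does not say which strings are accepted.
_TRUE = {"true", "1"}
_FALSE = {"false", "0"}
_NONE = {"none", "null"}


def parse_approved(raw: str) -> List[Optional[bool]]:
    """Coerce a comma-separated query value into True / False / None entries.

    Tokens that match none of the known spellings are silently dropped.
    """
    values: List[Optional[bool]] = []
    for token in raw.split(","):
        token = token.strip().lower()
        if token in _TRUE:
            values.append(True)
        elif token in _FALSE:
            values.append(False)
        elif token in _NONE:
            values.append(None)
        # Any other token is invalid and is ignored.
    return values
```

With these assumed spellings, a query value such as `true,False,none,bogus` would select pages whose approval is True, False, or unset, and the unrecognized `bogus` token would have no effect.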