browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	d4a2a66d6d	additional scale / browser window cleanup to properly support QA: (#2663) - follow up to #2627 - use qa_num_browser_windows to set exact number of QA browsers, fallback to qa_scale - set num_browser_windows and num_browsers_per_pod using crawler / qa values depending if QA crawl - scale_from_browser_windows() accepts optional browsers_per_pod if dealing with possible QA override - store 'desiredScale' in CrawlStatus to avoid recomputing for later scale resolving - ensure status.scale is always the actual scale observed	2025-06-12 13:09:04 -04:00
Tessa Walsh	45aa0a32b6	Calculate total for crawl QA page endpoint (#2435) Fixes #2434 Patch fix for a regression in Browsertrix 1.4.0-1.4.1 where total was not being calculated for QA page list endpoint but still being included in response, which led to total always being 0 and pages not loading in the frontend review screen as a result.	2025-02-27 11:46:35 -08:00
Tessa Walsh	ba5ca3fdd9	Move org storage recalculation into background job (#2138) Fixes #2112 - Moves org storage recalculation to background job, modify endpoint to return job id as part of response - Updates crawl + QA backend tests that broke due to https://webrecorder.net website changes --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-11-19 17:32:57 -05:00
Tessa Walsh	27ee16d308	Implement downloading archived item + QA runs as multi-WACZ (#1933) Fixes #1412 ## Changes ### Backend - Adds `all-crawls`, `crawls`, and `uploads` API endpoints to download archived item as multi-WACZ - Download QA runs as multi-WACZ - Adds backend tests for new endpoints - Update to new version of stream-zip library which does not require crc-32 to be present for ZIP members, computes after streaming, fixing invalid crc-32 issues as previously computed crc-32s from crawler may be invalid. ### Frontend Adds ability to download archived item from: - Button in archived item detail Files tab - Archived item details actions menu - Archived items list menu --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 10:28:57 -07:00
Ilya Kreymer	8c0321bdea	Pydantic 2.x update + type fixes + python 3.12 (#1947) * updates pydantic to 2.x * also update to python 3.12 * additional type fixes: - all Optional[] types must have a default value - update to constrained types - URL types converted from str - test updates Fixes #1940	2024-07-22 17:23:03 -07:00
Tessa Walsh	8b0d1432af	Show QA meter while analysis is running (#1854) Fixes #1846 - Ensure meter auto-updates as new stats are ready - Switch meter to new QA run when new analysis run is started - Remove Files from QA meter (files and errors will be reported separately) Co-authored-by: emma <hi@emma.cafe> Co-authored-by: sua yoo <sua@webrecorder.org>	2024-06-12 12:32:01 -04:00
Ilya Kreymer	2ffb37bd14	tests: fix typo in waiting for qa run to stop test! (#1857) Fixes not properly testing if activeQA is null, hopefully fixes intermittent test failures!	2024-06-11 11:07:55 -04:00
Tessa Walsh	a85f9496b0	Include number of Identical Files in QA stats and meter (#1848) This PR adds Identical Files to the QA Page Match Analysis meter bars. To do this, the backend calculates the number of non-HTML pages once and includes it under the key `Files` in each of the `screenshotMatch` and `textMatch` QA stats return arrays. The backend additionally removes the file count from "No Data" to prevent these from being counted twice. --------- Co-authored-by: emma <hi@emma.cafe>	2024-06-06 13:15:19 -04:00
Tessa Walsh	b8caeb88e9	Ensure QA run WACZs are deleted (#1715) - When qa run is deleted - When crawl is deleted And adds tests for WACZ deletion. Fixes #1713	2024-04-22 18:04:09 -04:00
Ilya Kreymer	1844e761dc	Support sorting by last QA started time (#1712) To support #1683, it would be useful to be able to sort by 'last QA start time' in addition to/instead of last QA state. - make sorting consistent with workflow sorting - sortBy fields renamed to lastQAState and lastQAStarted - Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-22 13:00:52 -07:00
Ilya Kreymer	4360e0c1b5	Update tests with latest crawler (#1711) tests: use 'latest' crawler release for testing, now that 1.1.x is released.	2024-04-20 15:56:26 -07:00
Ilya Kreymer	9609ff4194	Add 'activeQAStats' field (#1694) As additional support for #1683, include the active QA stats in the crawl response, along with active QA state. This will allow showing progress of QA run in the archived items list.	2024-04-18 10:05:39 -04:00
Tessa Walsh	30ab139ff2	Add QA run aggregate stats API endpoint (#1682) Fixes #1659 Takes an arbitrary set of thresholds for text and screenshot matches as a comma-separated list of floats. Returns a list of groupings for each that include the lower boundary and count for all thresholds passed in.	2024-04-17 13:24:18 -04:00
Tessa Walsh	c800da1732	Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686) Backend work for #1672 Adds new sort options to /crawls and /all-crawls GET list endpoints: - `reviewStatus` - `qaRunCount`: number of completed QA runs for crawl (also added to CrawlOut) - `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of which are added to CrawlOut)	2024-04-16 23:54:09 -07:00
Tessa Walsh	87e0873f1a	Add mime field to Page model (#1678)	2024-04-17 00:57:49 -04:00
Vinzenz Sinapius	1b034957ff	Improve reliability of backend tests (#1675) - Remove globals from profile, uploads, and qa test modules in favor of fixtures - Add retries to fix intermittent test failures due to timing --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-16 14:22:41 -04:00
Tessa Walsh	4229b94736	Track failed QA runs and include in list endpoint (#1650) Fixes #1648 - Tracks failed QA runs in database, not only successful ones - Includes failed QA runs in list endpoint by default - Adds `skipFailed` param to list endpoint to return only successful runs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-04 18:51:06 -07:00
Tessa Walsh	00ced6dd6b	Add single page QA GET endpoint (#1635) Fixes #1634 Also make sure other get page endpoint without qa uses PageOut model	2024-03-27 14:57:59 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running, `QARun` data is stored in a single `qa` object, while finished qa runs are added to `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
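
The page filtering described in 4f676e4e82 (`qaFilterBy` combined with `gt`, `lt`, `gte`, `lte`) can be illustrated with a small in-memory sketch. This is not the backend's implementation: the function name `filter_pages`, the flat page dictionaries, and the choice to skip pages without a score are all assumptions made for illustration. Only the parameter names and the two filterable fields come from the commit message.

```python
import operator
from typing import Any, Optional

# Fields named in the commit message as valid qaFilterBy values.
QA_FILTER_FIELDS = ("screenshotMatch", "textMatch")


def filter_pages(
    pages: list[dict[str, Any]],
    qa_filter_by: str,
    gt: Optional[float] = None,
    gte: Optional[float] = None,
    lt: Optional[float] = None,
    lte: Optional[float] = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: keep pages whose QA score satisfies every bound given."""
    if qa_filter_by not in QA_FILTER_FIELDS:
        raise ValueError(f"unsupported qaFilterBy value: {qa_filter_by}")

    # Pair each supplied bound with its comparison operator.
    checks = [
        (compare, bound)
        for compare, bound in (
            (operator.gt, gt),
            (operator.ge, gte),
            (operator.lt, lt),
            (operator.le, lte),
        )
        if bound is not None
    ]

    matching = []
    for page in pages:
        score = page.get(qa_filter_by)
        if score is None:
            # Assumption: pages without a QA score never match a bound.
            continue
        if all(compare(score, bound) for compare, bound in checks):
            matching.append(page)
    return matching
```

For example, `filter_pages(pages, "textMatch", gte=0.5, lt=0.9)` would keep only pages whose text match score falls in the half-open range from 0.5 up to, but not including, 0.9.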

19 Commits
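
As an illustration of the aggregate stats behaviour described in 30ab139ff2 (thresholds passed as a comma-separated list of floats, groupings returned with a lower boundary and a count), the following is a minimal sketch rather than the endpoint's actual code. The function names, the `lowerBoundary` and `count` key names, and the assumption that an implicit bucket starts at 0.0 are hypothetical.

```python
from bisect import bisect_right


def parse_thresholds(raw: str) -> list[float]:
    """Parse a comma-separated string of floats into a sorted, de-duplicated list."""
    return sorted({float(part) for part in raw.split(",") if part.strip()})


def group_by_lower_boundary(scores: list[float], raw_thresholds: str) -> list[dict]:
    """Hypothetical helper: count scores per bucket, keyed by each bucket's lower boundary."""
    # Assumption: the lowest bucket always starts at 0.0.
    boundaries = [0.0] + [t for t in parse_thresholds(raw_thresholds) if t > 0.0]
    counts = [0] * len(boundaries)
    for score in scores:
        # Index of the last boundary that is <= score; negative scores go to bucket 0.
        index = max(bisect_right(boundaries, score) - 1, 0)
        counts[index] += 1
    return [
        {"lowerBoundary": boundary, "count": count}
        for boundary, count in zip(boundaries, counts)
    ]
```

With thresholds `"0.5,0.9"`, scores are split into three groups with lower boundaries 0.0, 0.5, and 0.9. A score exactly equal to a threshold lands in the group that threshold opens.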