browsertrix

Author	SHA1	Message	Date
Vinzenz Sinapius	1b034957ff	Improve reliability of backend tests (#1675 ) - Remove globals from profile, uploads, and qa test modules in favor of fixtures - Add retries to fix intermittent test failures due to timing --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-16 14:22:41 -04:00
Ilya Kreymer	95f5605af7	renumber crawl priority classes: (#1673 ) - priority classes <-10 are ignored by cluster-autoscaler so QA jobs with too low priorities never run - start crawl priorities at 0 going down (same as before) - start qa run priorities at -2 going down (instead of -100) - this means a crawl with scale of 3 can be preempted by 1st qa pod, but otherwise crawls have higher priority - rename priority classes as they are otherwise immutable and error on helm upgrade This allows for more room in lower pri classes for other types of objects, while keeping in mind the -10 and below threshold: (see: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)	2024-04-13 12:24:43 -07:00
Ilya Kreymer	f243d34395	Remove pages from QA Configmap (#1671 ) Fixes #1670 No longer need to pass pages to the ConfigMap. The ConfigMap has a size limit and will fail if there are too many pages. With this change, the page list for QA will be read directly from the WACZ files pages.jsonl / extraPages.jsonl entries.	2024-04-12 16:04:33 -07:00
Tessa Walsh	172a9bf0cd	Change crawl.reviewStatus to 1-5 scale int (#1664 )	2024-04-09 17:51:06 -07:00
Ilya Kreymer	a7cda3b11b	version: bump to 1.10.0-beta.1	2024-04-05 18:24:14 -07:00
Tessa Walsh	4229b94736	Track failed QA runs and include in list endpoint (#1650 ) Fixes #1648 - Tracks failed QA runs in database, not only successful ones - Includes failed QA runs in list endpoint by default - Adds `skipFailed` param to list endpoint to return only successful runs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-04 18:51:06 -07:00
Ilya Kreymer	5c08c9679c	fix issue with incorrect number of total pages if any of the seeds is a redirect (#1649 ) Following changes in webrecorder/browsertrix-crawler#475, webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed to the seen list. To account for this, it needs to be subtracted to get the total page count.	2024-04-04 15:55:44 -07:00
sua yoo	83c9203a11	Initial QA Review UI! (#1624 ) QA Details page: - Enables QA tab with ability to start automated analysis QA Run + view a and manual review status - Pages listed with review status + overall crawl review status shown on QA details (relates to #1508) - Initial placeholder for QA run analytics (part of #1589) - Addresses a good deal of #1477 Automated Analysis QA in Review Mode: - Ability to select from multiple analysis QA runs / view QA runs in QA details - Shows analysis screenshot, text and resources compare and replay tabs (fixes #1496) - Sorting by worst screenshot / worst text score for each QA run - Includes pages sidebar with screenshot/text/resource compare results (fixes #1497) Manual Review QA in Review Mode: - Per-page replay available as separate tab (fixes #1499) - Supports thumbs up, thumbs down, notes for each page - Supports entering review status approval (good/acceptable/bad can be entered when finishing review) --------- Co-authored-by: Emma Segal-Grossman <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-04-04 15:09:52 -07:00
Ilya Kreymer	ffc4b5b58f	operator state fixes (follow up from #1639 ) (#1640 ) - increase time for going to waiting_capacity from starting to 150 seconds - relax requirement for state transitions, allow complete from waiting - additional type safety for different states, ensure mark_finished() only called with non-running states, add `Literal` types for all the state types.	2024-03-29 15:12:16 -07:00
Ilya Kreymer	3438133fcb	Crawler pod memory padding + auto scaling (#1631 ) - set memory limit to 1.2x memory request to provide extra padding and avoid OOM - attempt to resize crawler pods by 1.2x when exceeding 90% of available memory - do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of requested memory, resulting in faster graceful restart, but avoiding a system-instant OOM Kill - Fixes #1632 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-28 16:39:00 -07:00
Tessa Walsh	00ced6dd6b	Add single page QA GET endpoint (#1635 ) Fixes #1634 Also make sure other get page endpoint without qa uses PageOut model	2024-03-27 14:57:59 -07:00
Tessa Walsh	66b4532321	Give test_crawl_timeout 10 mins to finish (#1627 ) Related to https://github.com/webrecorder/browsertrix-cloud/issues/1620 Follow-up to https://github.com/webrecorder/browsertrix-cloud/pull/1621, which didn't seem to fix the problem. I'm giving it much more time here in the hopes that it solves it (since it's a nightly test, time shouldn't be such a pressing issue).	2024-03-26 18:33:30 -07:00
Tessa Walsh	e9895e78a2	Add additional filters to page list endpoints (#1622 ) Fixes #1617 Filters added: - reviewed: filter by page has approval or at least one note (true) or neither (false) - approved: filter by approval value (accepts list of strings, comma-separated, each of which are coerced into True, False, or None, or ignored if they are invalid values) - hasNotes: filter by has at least one note (true) or not (false) Tests have also been added to ensure that results are as expected.	2024-03-21 21:33:07 -07:00
Tessa Walsh	b3b1e0d7d8	Fix intermittent crawl timeout test failure (#1621 ) Fixes #1620 This increases the total timeout from 60 seconds to 120 seconds for crawl to complete, which should be sufficient given how intermittently the failure has been happening. Can increase it further if needed.	2024-03-21 17:18:27 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586 ) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running, `QARun` data is stored in a single `qa` object, while finished qa runs are added to `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Tessa Walsh	21ae38362e	Add endpoints to read pages from older crawl WACZs into database (#1562 ) Fixes #1597 New endpoints (replacing old migration) to re-add crawl pages to db from WACZs. After a few implementation attempts, we settled on using [remotezip](https://github.com/gtsystem/python-remotezip) to handle parsing of the zip files and streaming their contents line-by-line for pages. I've also modified the sync log streaming to use remotezip as well, which allows us to remove our own zip module and let remotezip handle the complexity of parsing zip files. Database inserts for pages from WACZs are batched 100 at a time to help speed up the endpoint, and the task is kicked off using asyncio.create_task so as not to block before giving a response. StorageOps now contains a method for streaming the bytes of any file in a remote WACZ, requiring only the presigned URL for the WACZ and the name of the file to stream.	2024-03-19 14:14:21 -07:00
Ilya Kreymer	5a4902b6d4	kubernetes api: avoid overriding content-type header in kubernetes-asyncio, pass in via arg instead (main) (#1605 ) - instead of overriding the content-type header globally, pass 'application/merge-patch+json' to self.custom_api.patch_namespaced_custom_object() directly - bump kubernetes-asyncio to 29.0.0 - fixes potential issues with global override of the header in kubernetes-asyncio - copy of #1602 for main	2024-03-18 11:17:54 -07:00
Ilya Kreymer	e7af081af1	profile browser fixes: better resource usage + load retry (main) (#1604 ) - Backend: Use separate resource constraints for profiles: default profile browser resources to either 'profile_browser_cpu' / 'profile_browser_memory' or single browser 'crawler_memory_base' / 'crawler_cpu_base', instead of scaled to the number of browser workers - Frontend: check that profile html page is loading, keep retrying if still getting nginx error instead of loading an iframe with the error. Fixes #1598 (Copy of #1599 from 1.9.4)	2024-03-16 15:07:04 -07:00
wvengen	6278157f40	Make storage deletion work on more S3 providers, don't use access URL for deletion (#1600 ) I came across [this problem](https://forum.webrecorder.net/t/deleting-crawl-failure/512) and noticed that the access URL is used when deleting files, causing my file deletions to fail on OpenStack SWIFT S3 (relates to #1090). This trivial change makes it work there.	2024-03-16 04:17:23 -04:00
Ilya Kreymer	08f6847194	Configurable Max Scale for frontend (#1557 ) Allow maximum scale option to be fully configurable via `max_crawl_scale`. Already configurable on the backend, and now exposed to the frontend via API `/api/settings` `maxCrawlScale` value. The workflow editor and workflow details are updated to allow selecting the scale up to the maxCrawlScale setting (which defaults to 3 if not set).	2024-03-11 16:21:20 -07:00
Ilya Kreymer	ea494fa6e6	Merge V1.9.3 changes into main (#1583 ) - Fix execution time checking by keeping lastUpdatedTime in db by @ikreymer in https://github.com/webrecorder/browsertrix-cloud/pull/1573 - disable postcss-lit for var css - Prevent closing tooltips from closing collection share dialog by @SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1579 - Fix pending exclusion pagination by @SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1578 - Fix regex escape in exclusion editor text match by @SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1577 --------- Co-authored-by: emma <hi@emma.cafe> Co-authored-by: sua yoo <sua@webrecorder.org>	2024-03-06 15:38:22 -08:00
Tessa Walsh	c20e754269	Add updatable QA reviewStatus field to crawls (#1575 ) Fixes #1539 Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl update API endpoint. Acceptable values are "good", "acceptable" or "failure", enforced by an Enum. Added to `BaseCrawl` so that we can extend support to uploads more easily later on, but for now we'll only display this for crawls in the frontend.	2024-03-05 16:49:23 -08:00
Tessa Walsh	ec0db1c323	Temporarily remove pages migration (#1572 ) Removing until we have a better tested solution, including to avoid testing of QA runs for new crawls in beta.	2024-03-04 10:30:04 -08:00
Ilya Kreymer	09a0d51843	pages: set page status to 200 if unset and loadState != 0 (#1563 ) Follow up to #1516, ensure page status is set to 200 if no status is provided, if loadState is not 0	2024-02-29 15:15:17 -08:00
Ilya Kreymer	2ac6584942	Refactor operator class into module (#1564 ) The operator class has gotten fairly large, this is a first pass in refactoring operator.py into a submodule instead, with multiple operator instances which handle different types of objects. - The main k8s interface has been split into K8sOpApi which extends K8sApi and is shared across all operators. - Each operator extends BaseOperator which also has an instance of K8sOpApi - The CrawlOperator is still the bulk of the functionality, but will likely be further refactored to support QA jobs --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-02-29 14:40:12 -08:00
Tessa Walsh	da19691184	Add crawl errors incrementally during crawl (#1561 ) Fixes #1558 - Adds crawl errors to database incrementally during crawl rather than after crawl completes - Simplifies crawl /errors API endpoint to always return errors from database	2024-02-29 09:16:34 -08:00
Ilya Kreymer	804f755787	Increase startup probe time to account for long-running migrations (#1560 ) - increases the failureThreshold for startupProbe for the api backend container to account for long running migrations, up to 300 seconds - add `/healthzStartup` which checks if db is ready - bump - keeps `/healthz` to always return 200 when running - increases livenessProbe failureThreshold to be higher than readiness probe, following recommended best practice of liveness probe > readiness probe - fixes #1559	2024-02-28 14:22:33 -08:00
Tessa Walsh	14189b7cfb	Add crawl pages and related API endpoints (#1516 ) Fixes #1502 - Adds pages to database as they get added to Redis during crawl - Adds migration to add pages to database for older crawls from pages.jsonl and extraPages.jsonl files in WACZ - Adds GET, list GET, and PATCH update endpoints for pages - Adds POST (add), PATCH, and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text - Adds page_ops methods for 1. adding resources/urls to page, and 2. adding automated heuristics and supplemental info (mime, type, etc.) to page (for use in crawl QA job) - Modifies `Migration` class to accept kwargs so that we can pass in ops classes as needed for migrations - Deletes WACZ files and pages from database for failed crawls during crawl_finished process - Deletes crawl pages when a crawl is deleted Note: Requires a crawler version 1.0.0 beta3 or later, with support for `--writePagesToRedis` to populate pages at crawl completion. Beta 4 is configured in the test chart, which should be upgraded to stable 1.0.0 when it's released. Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-28 12:11:35 -05:00
Ilya Kreymer	8ae032ff88	More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. (#1537 ) Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug [crawl name | first seed host]>`. - Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler 1.0.0-beta.4 or higher If crawl name is provided, uses crawl name, otherwise the hostname of the first seed. The name is 'sluggified', using lowercase alphanum characters separated by dashes. Ex: in an organization called `Default Org`, a crawl of `https://specs.webrecorder.net/` and no name will have WARCs named: `default-org-specs-webrecorder-net-....warc.gz` If the crawl is given the name `SPECS`, the WARCs will be named `default-org-specs-manual-....warc.gz` Fixes #412 in a default way.	2024-02-22 23:54:23 -08:00
Ilya Kreymer	a8e3ff1141	version: bump to 1.10.0-beta.0	2024-02-20 00:22:29 -08:00
Ilya Kreymer	c1cffe9ecd	version: bump to 1.9.1	2024-02-16 09:44:18 -08:00
Ilya Kreymer	64bf21311d	version: bump to 1.9.0!	2024-02-14 13:30:46 -08:00
Ilya Kreymer	1d266e3cea	bump to 1.9.0.beta.5	2024-02-12 18:29:39 -08:00
Ilya Kreymer	4bc8152640	version: bump to 1.9.0-beta.4	2024-02-09 16:17:13 -08:00
Ilya Kreymer	0653657637	better handling of failed redis connection + exec time updates (#1520 ) This PR addresses a possible failure when Redis pod was inaccessible from Crawler pod. - Ensure crawl is set to 'waiting_for_capacity' if either no crawler pods are available or no redis pod. previously, missing/inaccessible redis would not result in 'waiting_for_capacity' if crawler pods are available - Rework logic: if no crawler and redis after >60 seconds, shutdown redis. if crawler and no redis, init (or reinit) redis - track 'lastUpdatedTime' in db when incrementing exec time to avoid double counting if lastUpdatedTime has not changed, eg. if operator sync fails. - add redis timeout of 20 seconds to avoid timing out operator responses if redis conn takes too long, assume unavailable	2024-02-09 16:14:29 -08:00
Ilya Kreymer	65fec64197	storages: use asynccontextmanager instead of sync to close client (#1521 ) Follow-up to #1481, use the asynccontextmanager with `async with` as only used in async functions (which call run_in_executor)	2024-02-08 08:28:53 -08:00
Ilya Kreymer	b2a5dbf2cd	enable screenshots by default + fix py version formatting (#1518 ) configmap: add --screenshot thumbnail,view as default screenshots version: update update-version.sh to add newline in version.py to match new black formatting (from changes in #1507) Fixes #1519	2024-02-07 17:07:28 -08:00
Ilya Kreymer	7aebce66f6	version: bump to 1.9.0-beta.3	2024-02-07 15:21:10 -08:00
Tessa Walsh	a898c2b456	Format backend with Black 24 (#1507 ) Fixes #1506	2024-02-07 11:35:34 -08:00
Tessa Walsh	b252931c71	Add scale to CrawlOut (#1487 ) Fixes #1486 `scale` is already saved in the crawl but needed to be added to `CrawlOut`.	2024-01-23 14:10:37 -08:00
Ilya Kreymer	bf38063e0a	Close sync S3 client (#1481 ) Cleanup of boto3 sync client, ensure that it is used as a context manager like async client.	2024-01-18 18:18:41 -05:00
Tessa Walsh	950844dc92	Add migration to fix issues with previous migrations (#1480 ) Fixes #1479 - Update null crawlTimeouts in db from null to 0 - Update crawlerChannel in configmaps	2024-01-18 16:59:40 -05:00
Ilya Kreymer	ad19941318	operator: use 'default' CRAWLER_CHANNEL if none is set (#1478 ) Use default channel if CRAWLER_CHANNEL not set in crawlconfig configmap, consistent with how other configmap settings for cronjobs are used.	2024-01-18 11:13:03 -08:00
Ilya Kreymer	e43feedc43	version: bump to 1.9.0-beta.2	2024-01-18 10:01:38 -08:00
Ilya Kreymer	370590b14f	version: bump to 1.9.0-beta.1	2024-01-17 14:58:25 -08:00
Tessa Walsh	07fa46d9aa	Add custom user agent to workflows (#1465 ) Fixes #1341 Adds "User Agent" field to workflow editor under the Browser Settings tab. If not set, the crawler will use the browser's default user agent. Also added to docs and to the workflow details page (if set). --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-01-17 17:33:50 -05:00
Ilya Kreymer	90197b2a85	Backend mem usage fix - use fixed MOTOR_MAX_WORKERS + switch to gunicorn (#1468 ) Refactors backend deployment to: - Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by mongodb connections - Also sets backend workers to 1 by default to reduce default memory usage - Switches to gunicorn with uvloop worker for production use instead of uvicorn (as recommended by uvicorn) Lower thread count should address memory leak/increased usage, which resulted in 5x thread x cpus x workers, eg. potentially 20 or 40 threads just for mongodb connections. Lower default number of workers should make it easier to scale backend with HPA if additional capacity is needed. Fixes #1467	2024-01-16 15:32:42 -08:00
Tessa Walsh	032859f361	Support multiple crawler versions (#1420 ) Fixes #1385 ## Changes Supports multiple crawler 'channels' which can be configured to different browsertrix-crawler versions - Replaces `crawler_image` in helm chart with `crawler_channels` array similar to how storages are handled - The `default` crawler channel must always be provided and specifies the default crawler image - Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint to fetch information about available crawler versions (name, image, and label) and test - Adds crawler channel select to workflow creation/edit screens and profile creation dialog, and updates related API endpoints and configmaps accordingly. The select dropdown is shown only if more than one channel is configured. - Adds `crawlerChannel` to workflow and crawl details. - Add `image` to crawler image, used to display actual image used as part of the crawl. - Modifies `crawler_crawl_id` backend test fixture to use `test` crawler version to ensure crawler versions other than latest work - Adds migration to add `crawlerChannel` set to `default` to existing workflow and profile objects and workflow configmaps --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-01-16 15:32:12 -08:00
Tessa Walsh	9f73bafd37	Fix browser profile name in crawl endpoints (#1464 ) Fixes #1388 Fixes browser profile name lookup by ensuring profileid is in CrawlOut model.	2024-01-14 16:30:27 -08:00
Tessa Walsh	38a01860b8	Add API endpoints for crawl statistics (#1461 ) Fixes #1158 Introduces two new API endpoints that stream crawling statistics CSVs (with a suggested attachment filename header): - `GET /api/orgs/all/crawls/stats` - crawls from all orgs (superuser only) - `GET /api/orgs/{oid}/crawls/stats` - crawls from just one org (available to org crawler/admin users as well as superusers) Also includes tests for both endpoints.	2024-01-10 13:30:47 -08:00
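
The memory thresholds described in commit 3438133fcb (#1631) can be illustrated with a small sketch. Only the 1.2x, 90%, and 100% figures come from the commit message. The function names are hypothetical, and the sketch assumes that "available memory" means the pod's memory request; the actual operator code may differ.

```python
# Sketch of the memory rules from #1631. All function names are hypothetical;
# only the 1.2x, 90% and 100% figures are taken from the commit message.
# Integer arithmetic is used so results do not depend on float rounding.


def memory_limit(request_bytes: int) -> int:
    """Memory limit is 1.2x the memory request, leaving padding before an OOM kill."""
    return request_bytes * 6 // 5


def resized_request(request_bytes: int) -> int:
    """New memory request after a 1.2x resize."""
    return request_bytes * 6 // 5


def memory_action(used_bytes: int, request_bytes: int) -> str:
    """Decide what to do for a crawler pod given its current memory usage.

    Assumption: usage is compared against the requested memory.
    """
    if used_bytes >= request_bytes:
        # 100% of the request reached: ask for a graceful restart (extra SIGTERM)
        return "soft_oom"
    if used_bytes * 10 > request_bytes * 9:
        # more than 90% used: try to resize the pod by 1.2x
        return "resize"
    return "ok"
```

With a 1000-byte request, for example, usage of 950 bytes would trigger a resize attempt, while usage of 1000 bytes would trigger the soft OOM path before the 1200-byte limit is reached.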


418 Commits
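
The WARC prefix naming from commit 8ae032ff88 (#1537) can be sketched as follows. This is a minimal illustration, not the project's implementation: `slugify` and `warc_prefix` are hypothetical names, and the slug rule here (lowercase, runs of non-alphanumeric characters collapsed to a single dash) is an assumption based on the commit's description of "lowercase alphanum characters separated by dashes".

```python
import re
from urllib.parse import urlsplit


def slugify(text: str) -> str:
    """Lowercase the text and join alphanumeric runs with single dashes (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def warc_prefix(org_slug: str, first_seed_url: str, crawl_name: str = "") -> str:
    """Build '<org slug>-<slug>' from the crawl name, or the first seed's host if unnamed."""
    source = crawl_name or (urlsplit(first_seed_url).hostname or "")
    return f"{org_slug}-{slugify(source)}"


# The unnamed-crawl example from the commit message:
print(warc_prefix(slugify("Default Org"), "https://specs.webrecorder.net/"))
# prints: default-org-specs-webrecorder-net
```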