browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	f89027ac89	version: 1.10.0-beta.3	2024-04-24 15:45:17 +02:00
Ilya Kreymer	ec74eb4242	operator: add 'max_crawler_memory' to limit autosizing of crawler pods (#1746 ) Adds a `max_crawler_memory` chart setting, which, if set, defines the upper crawler memory limit that crawler pods can be resized up to. If not set, auto resizing is disabled and pods are always set to 'crawler_memory' memory	2024-04-24 15:16:32 +02:00
Ilya Kreymer	41655ef829	version: bump to 1.10.0-beta.2	2024-04-23 23:19:16 +02:00
Ilya Kreymer	b94070160b	allow configuring designated registration org to which new users can register (#1735 ) if 'registration_enabled' is set, check 'registration_org_id' for org id of an existing org that new users should be added to when they register. if omitted, default to the default org Fixes #1729	2024-04-23 17:11:37 -04:00
sua yoo	1915274e26	Fix QA review comments (#1723 ) Fixes https://github.com/webrecorder/browsertrix/issues/1710 Fixes date and deletion for newly added comments.	2024-04-23 16:31:52 -04:00
Tessa Walsh	b8caeb88e9	Ensure QA run WACZs are deleted (#1715 ) - When qa run is deleted - When crawl is deleted And adds tests for WACZ deletion. Fixes #1713	2024-04-22 18:04:09 -04:00
Ilya Kreymer	1844e761dc	Support sorting by last QA started time (#1712 ) To support #1683, it would be useful to be able to sort by 'last QA start time' in addition to/instead of last QA state. - make sorting consistent with workflow sorting - sortBy fields renamed to lastQAState and lastQAStarted - Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-22 13:00:52 -07:00
Ilya Kreymer	b574f00d2b	Add Repository Index + Chart Rename + Docs Rename (#1708 ) Repository Index: Generate an index.yaml in ./docs/helm-repo/index.yaml to allow for browsertrix to be a helm repository. docs: rename docs.browsertrix.cloud -> docs.browsertrix.com docs: update deployment doc to mention helm repo as preferred way to install docs build action: generate repository index in GH action publish action: update auto-generated message to mention installing from the repo. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-21 09:42:25 -07:00
Ilya Kreymer	4360e0c1b5	Update tests with latest crawler (#1711 ) tests: use 'latest' crawler release for testing, now that 1.1.x is released.	2024-04-20 15:56:26 -07:00
Tessa Walsh	80008a2853	Add post load delay to Browsertrix (#1700 ) Fixes #1699 Adds post load delay to: - Backend `RawCrawlConfig` model - Frontend (workflow editor and config details component) - Workflow setup docs	2024-04-18 20:03:47 -07:00
Ilya Kreymer	9609ff4194	Add 'activeQAStats' field (#1694 ) As additional support for #1683, include the active QA stats in the crawl response, along with active QA state. This will allow showing progress of QA run in the archived items list.	2024-04-18 10:05:39 -04:00
Tessa Walsh	b87860c68a	Ensure /all-crawls?sortBy=qaState always sorts crawls above uploads (#1691 ) Follow-up to #1686	2024-04-17 19:14:29 -07:00
Tessa Walsh	30ab139ff2	Add QA run aggregate stats API endpoint (#1682 ) Fixes #1659 Takes an arbitrary set of thresholds for text and screenshot matches as a comma-separated list of floats. Returns a list of groupings for each that include the lower boundary and count for all thresholds passed in.	2024-04-17 13:24:18 -04:00
Ilya Kreymer	835014d829	restrict qa runs to a 'min_qa_crawler_image' if set in the chart (#1685 ) - fixes #1684 - can be used to optionally restrict QA to only some crawls (eg. with browsertrix-crawler>=1.0.0) - enforce error on backend (return 400) and handle special error on the frontend	2024-04-17 08:48:33 -07:00
Tessa Walsh	c800da1732	Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686 ) Backend work for #1672 Adds new sort options to /crawls and /all-crawls GET list endpoints: - `reviewStatus` - `qaRunCount`: number of completed QA runs for crawl (also added to CrawlOut) - `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of which are added to CrawlOut)	2024-04-16 23:54:09 -07:00
Tessa Walsh	87e0873f1a	Add mime field to Page model (#1678 )	2024-04-17 00:57:49 -04:00
Vinzenz Sinapius	1b034957ff	Improve reliability of backend tests (#1675 ) - Remove globals from profile, uploads, and qa test modules in favor of fixtures - Add retries to fix intermittent test failures due to timing --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-16 14:22:41 -04:00
Ilya Kreymer	95f5605af7	renumber crawl priority classes: (#1673 ) - priority classes <-10 are ignored by cluster-autoscaler so QA jobs with too low priorities never run - start crawl priorities at 0 going down (same as before) - start qa run priorities at -2 going down (instead of -100) - this means a crawl with a scale of 3 can be preempted by 1st qa pod, but otherwise crawls have higher priority - rename priority classes as they are otherwise immutable and error on helm upgrade This allows for more room in lower pri classes for other types of objects, while keeping in mind the -10 and below threshold: (see: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)	2024-04-13 12:24:43 -07:00
Ilya Kreymer	f243d34395	Remove pages from QA Configmap (#1671 ) Fixes #1670 No longer need to pass pages to the ConfigMap. The ConfigMap has a size limit and will fail if there are too many pages. With this change, the page list for QA will be read directly from the WACZ files pages.jsonl / extraPages.jsonl entries.	2024-04-12 16:04:33 -07:00
Tessa Walsh	172a9bf0cd	Change crawl.reviewStatus to 1-5 scale int (#1664 )	2024-04-09 17:51:06 -07:00
Ilya Kreymer	a7cda3b11b	version: bump to 1.10.0-beta.1	2024-04-05 18:24:14 -07:00
Tessa Walsh	4229b94736	Track failed QA runs and include in list endpoint (#1650 ) Fixes #1648 - Tracks failed QA runs in database, not only successful ones - Includes failed QA runs in list endpoint by default - Adds `skipFailed` param to list endpoint to return only successful runs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-04 18:51:06 -07:00
Ilya Kreymer	5c08c9679c	fix issue with incorrect number of total pages if any of the seeds is a redirect (#1649 ) Following changes in webrecorder/browsertrix-crawler#475, webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed to the seen list. To account for this, it needs to be subtracted to get the total page count.	2024-04-04 15:55:44 -07:00
sua yoo	83c9203a11	Initial QA Review UI! (#1624 ) QA Details page: - Enables QA tab with ability to start automated analysis QA Run + view a and manual review status - Pages listed with review status + overall crawl review status shown on QA details (relates to #1508) - Initial placeholder for QA run analytics (part of #1589) - Addresses a good deal of #1477 Automated Analysis QA in Review Mode: - Ability to select from multiple analysis QA runs / view QA runs in QA details - Shows analysis screenshot, text and resources compare and replay tabs (fixes #1496) - Sorting by worst screenshot / worst text score for each QA run - Includes pages sidebar with screenshot/text/resource compare results (fixes #1497) Manual Review QA in Review Mode: - Per-page replay available as separate tab (fixes #1499) - Supports thumbs up, thumbs down, notes for each page - Supports entering review status approval (good/acceptable/bad can be entered when finishing review) --------- Co-authored-by: Emma Segal-Grossman <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-04-04 15:09:52 -07:00
Ilya Kreymer	ffc4b5b58f	operator state fixes (follow up from #1639 ) (#1640 ) - increase time for going to waiting_capacity from starting to 150 seconds - relax requirement for state transitions, allow complete from waiting - additional type safety for different states, ensure mark_finished() only called with non-running states, add `Literal` types for all the state types.	2024-03-29 15:12:16 -07:00
Ilya Kreymer	3438133fcb	Crawler pod memory padding + auto scaling (#1631 ) - set memory limit to 1.2x memory request to provide extra padding and avoid OOM - attempt to resize crawler pods by 1.2x when exceeding 90% of available memory - do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of requested memory, resulting in faster graceful restart, but avoiding a system-instant OOM Kill - Fixes #1632 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-28 16:39:00 -07:00
Tessa Walsh	00ced6dd6b	Add single page QA GET endpoint (#1635 ) Fixes #1634 Also make sure other get page endpoint without qa uses PageOut model	2024-03-27 14:57:59 -07:00
Tessa Walsh	66b4532321	Give test_crawl_timeout 10 mins to finish (#1627 ) Related to https://github.com/webrecorder/browsertrix-cloud/issues/1620 Follow-up to https://github.com/webrecorder/browsertrix-cloud/pull/1621, which didn't seem to fix the problem. I'm giving it much more time here in the hopes that it solves it (since it's a nightly test, time shouldn't be such a pressing issue).	2024-03-26 18:33:30 -07:00
Tessa Walsh	e9895e78a2	Add additional filters to page list endpoints (#1622 ) Fixes #1617 Filters added: - reviewed: filter by page has approval or at least one note (true) or neither (false) - approved: filter by approval value (accepts list of strings, comma-separated, each of which are coerced into True, False, or None, or ignored if they are invalid values) - hasNotes: filter by has at least one note (true) or not (false) Tests have also been added to ensure that results are as expected.	2024-03-21 21:33:07 -07:00
Tessa Walsh	b3b1e0d7d8	Fix intermittent crawl timeout test failure (#1621 ) Fixes #1620 This increases the total timeout from 60 seconds to 120 seconds for crawl to complete, which should be sufficient given how intermittently the failure has been happening. Can increase it further if needed.	2024-03-21 17:18:27 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586 ) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running, `QARun` data is stored in a single `qa` object, while finished QA runs are added to the `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Tessa Walsh	21ae38362e	Add endpoints to read pages from older crawl WACZs into database (#1562 ) Fixes #1597 New endpoints (replacing old migration) to re-add crawl pages to db from WACZs. After a few implementation attempts, we settled on using [remotezip](https://github.com/gtsystem/python-remotezip) to handle parsing of the zip files and streaming their contents line-by-line for pages. I've also modified the sync log streaming to use remotezip as well, which allows us to remove our own zip module and let remotezip handle the complexity of parsing zip files. Database inserts for pages from WACZs are batched 100 at a time to help speed up the endpoint, and the task is kicked off using asyncio.create_task so as not to block before giving a response. StorageOps now contains a method for streaming the bytes of any file in a remote WACZ, requiring only the presigned URL for the WACZ and the name of the file to stream.	2024-03-19 14:14:21 -07:00
Ilya Kreymer	5a4902b6d4	kubernetes api: avoid overriding content-type header in kubernetes-asyncio, pass in via arg instead (main) (#1605 ) - instead of overriding the content-type header globally, pass 'application/merge-patch+json' to self.custom_api.patch_namespaced_custom_object() directly - bump kubernetes-asyncio to 29.0.0 - fixes potential issues with global override of the header in kubernetes-asyncio - copy of #1602 for main	2024-03-18 11:17:54 -07:00
Ilya Kreymer	e7af081af1	profile browser fixes: better resource usage + load retry (main) (#1604 ) - Backend: Use separate resource constraints for profiles: default profile browser resources to either 'profile_browser_cpu' / 'profile_browser_memory' or single browser 'crawler_memory_base' / 'crawler_cpu_base', instead of scaled to the number of browser workers - Frontend: check that profile html page is loading, keep retrying if still getting nginx error instead of loading an iframe with the error. Fixes #1598 (Copy of #1599 from 1.9.4)	2024-03-16 15:07:04 -07:00
wvengen	6278157f40	Make storage deletion work on more S3 providers, don't use access URL for deletion (#1600 ) I came across [this problem](https://forum.webrecorder.net/t/deleting-crawl-failure/512) and noticed that the access URL is used when deleting files, causing my file deletions to fail on OpenStack SWIFT S3 (relates to #1090). This trivial change makes it work there.	2024-03-16 04:17:23 -04:00
Ilya Kreymer	08f6847194	Configurable Max Scale for frontend (#1557 ) Allow maximum scale option to be fully configurable via `max_crawl_scale`. Already configurable on the backend, and now exposed to the frontend via API `/api/settings` `maxCrawlScale` value. The workflow editor and workflow details are updated to allow selecting the scale up to the maxCrawlScale setting (which defaults to 3 if not set).	2024-03-11 16:21:20 -07:00
Ilya Kreymer	ea494fa6e6	Merge V1.9.3 changes into main (#1583 ) - Fix execution time checking by keeping lastUpdatedTime in db by @ikreymer in https://github.com/webrecorder/browsertrix-cloud/pull/1573 - disable postcss-lit for var css - Prevent closing tooltips from closing collection share dialog by @SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1579 - Fix pending exclusion pagination by @SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1578 - Fix regex escape in exclusion editor text match by @SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1577 --------- Co-authored-by: emma <hi@emma.cafe> Co-authored-by: sua yoo <sua@webrecorder.org>	2024-03-06 15:38:22 -08:00
Tessa Walsh	c20e754269	Add updatable QA reviewStatus field to crawls (#1575 ) Fixes #1539 Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl update API endpoint. Acceptable values are "good", "acceptable" or "failure", enforced by an Enum. Added to `BaseCrawl` so that we can extend support to uploads more easily later on, but for now we'll only display this for crawls in the frontend.	2024-03-05 16:49:23 -08:00
Tessa Walsh	ec0db1c323	Temporarily remove pages migration (#1572 ) Removing until we have a better tested solution, including to avoid testing of QA runs for new crawls in beta.	2024-03-04 10:30:04 -08:00
Ilya Kreymer	09a0d51843	pages: set page status to 200 if unset and loadState != 0 (#1563 ) Follow up to #1516, ensure page status is set to 200 if no status is provided, if loadState is not 0	2024-02-29 15:15:17 -08:00
Ilya Kreymer	2ac6584942	Refactor operator class into module (#1564 ) The operator class has gotten fairly large, this is a first pass in refactoring operator.py into a submodule instead, with multiple operator instances which handle different types of objects. - The main k8s interface has been split into K8sOpApi which extends K8sApi and is shared across all operators. - Each operator extends BaseOperator which also has an instance of K8sOpApi - The CrawlOperator is still the bulk of the functionality, but will likely be further refactored to support QA jobs --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-02-29 14:40:12 -08:00
Tessa Walsh	da19691184	Add crawl errors incrementally during crawl (#1561 ) Fixes #1558 - Adds crawl errors to database incrementally during crawl rather than after crawl completes - Simplifies crawl /errors API endpoint to always return errors from database	2024-02-29 09:16:34 -08:00
Ilya Kreymer	804f755787	Increase startup probe time to account for long-running migrations (#1560 ) - increases the failureThreshold for startupProbe for the api backend container to account for long running migrations, up to 300 seconds - add `/healthzStartup` which checks if db is ready - bump - keeps `/healthz` to always return 200 when running - increases livenessProbe failureThreshold to be higher than readiness probe, following recommended best practice of liveness probe > readiness probe - fixes #1559	2024-02-28 14:22:33 -08:00
Tessa Walsh	14189b7cfb	Add crawl pages and related API endpoints (#1516 ) Fixes #1502 - Adds pages to database as they get added to Redis during crawl - Adds migration to add pages to database for older crawls from pages.jsonl and extraPages.jsonl files in WACZ - Adds GET, list GET, and PATCH update endpoints for pages - Adds POST (add), PATCH, and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text - Adds page_ops methods for 1. adding resources/urls to page, and 2. adding automated heuristics and supplemental info (mime, type, etc.) to page (for use in crawl QA job) - Modifies `Migration` class to accept kwargs so that we can pass in ops classes as needed for migrations - Deletes WACZ files and pages from database for failed crawls during crawl_finished process - Deletes crawl pages when a crawl is deleted Note: Requires a crawler version 1.0.0 beta3 or later, with support for `--writePagesToRedis` to populate pages at crawl completion. Beta 4 is configured in the test chart, which should be upgraded to stable 1.0.0 when it's released. Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-28 12:11:35 -05:00
Ilya Kreymer	8ae032ff88	More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. (#1537 ) Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug [crawl name | first seed host]>`. - Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler 1.0.0-beta.4 or higher. If crawl name is provided, uses crawl name, otherwise the hostname of the first seed. The name is 'sluggified', using lowercase alphanum characters separated by dashes. Ex: in an organization called `Default Org`, a crawl of `https://specs.webrecorder.net/` with no name will have WARCs named: `default-org-specs-webrecorder-net-....warc.gz` If the crawl is given the name `SPECS`, the WARCs will be named `default-org-specs-manual-....warc.gz` Fixes #412 in a default way.	2024-02-22 23:54:23 -08:00
Ilya Kreymer	a8e3ff1141	version: bump to 1.10.0-beta.0	2024-02-20 00:22:29 -08:00
Ilya Kreymer	c1cffe9ecd	version: bump to 1.9.1	2024-02-16 09:44:18 -08:00
Ilya Kreymer	64bf21311d	version: bump to 1.9.0!	2024-02-14 13:30:46 -08:00
Ilya Kreymer	1d266e3cea	bump to 1.9.0.beta.5	2024-02-12 18:29:39 -08:00
Ilya Kreymer	4bc8152640	version: bump to 1.9.0-beta.4	2024-02-09 16:17:13 -08:00
Ilya Kreymer	0653657637	better handling of failed redis connection + exec time updates (#1520 ) This PR addresses a possible failure when Redis pod was inaccessible from Crawler pod. - Ensure crawl is set to 'waiting_for_capacity' if either no crawler pods are available or no redis pod. previously, missing/inaccessible redis would not result in 'waiting_for_capacity' if crawler pods are available - Rework logic: if no crawler and redis after >60 seconds, shutdown redis. if crawler and no redis, init (or reinit) redis - track 'lastUpdatedTime' in db when incrementing exec time to avoid double counting if lastUpdatedTime has not changed, eg. if operator sync fails. - add redis timeout of 20 seconds to avoid timing out operator responses if redis conn takes too long, assume unavailable	2024-02-09 16:14:29 -08:00
Ilya Kreymer	65fec64197	storages: use asynccontextmanager instead of sync to close client (#1521 ) Follow-up to #1481, use the asynccontextmanager with `async with`, as it is only used in async functions (which call run_in_executor)	2024-02-08 08:28:53 -08:00
Ilya Kreymer	b2a5dbf2cd	enable screenshots by default + fix py version formatting (#1518 ) configmap: add --screenshot thumbnail,view as default screenshots version: update update-version.sh to add newline in version.py to match new black formatting (from changes in #1507) Fixes #1519	2024-02-07 17:07:28 -08:00
Ilya Kreymer	7aebce66f6	version: bump to 1.9.0-beta.3	2024-02-07 15:21:10 -08:00
Tessa Walsh	a898c2b456	Format backend with Black 24 (#1507 ) Fixes #1506	2024-02-07 11:35:34 -08:00
Tessa Walsh	b252931c71	Add scale to CrawlOut (#1487 ) Fixes #1486 `scale` is already saved in the crawl but needed to be added to `CrawlOut`.	2024-01-23 14:10:37 -08:00
Ilya Kreymer	bf38063e0a	Close sync S3 client (#1481 ) Cleanup of boto3 sync client, ensure that it is used as a context manager like async client.	2024-01-18 18:18:41 -05:00
Tessa Walsh	950844dc92	Add migration to fix issues with previous migrations (#1480 ) Fixes #1479 - Update null crawlTimeouts in db from null to 0 - Update crawlerChannel in configmaps	2024-01-18 16:59:40 -05:00
Ilya Kreymer	ad19941318	operator: use 'default' CRAWLER_CHANNEL if none is set (#1478 ) Use default channel if CRAWLER_CHANNEL not set in crawlconfig configmap, consistent with how other configmap settings for cronjobs are used.	2024-01-18 11:13:03 -08:00
Ilya Kreymer	e43feedc43	version: bump to 1.9.0-beta.2	2024-01-18 10:01:38 -08:00
Ilya Kreymer	370590b14f	version: bump to 1.9.0-beta.1	2024-01-17 14:58:25 -08:00
Tessa Walsh	07fa46d9aa	Add custom user agent to workflows (#1465 ) Fixes #1341 Adds "User Agent" field to workflow editor under the Browser Settings tab. If not set, the crawler will use the browser's default user agent. Also added to docs and to the workflow details page (if set). --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-01-17 17:33:50 -05:00
Ilya Kreymer	90197b2a85	Backend mem usage fix - use fixed MOTOR_MAX_WORKERS + switch to gunicorn (#1468 ) Refactors backend deployment to: - Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by mongodb connections - Also sets backend workers to 1 by default to reduce default memory usage - Switches to gunicorn with uvloop worker for production use instead of uvicorn (as recommended by uvicorn) Lower thread count should address memory leak/increased usage, which resulted in 5x thread x cpus x workers, eg. potentially 20 or 40 threads just for mongodb connections. Lower default number of workers should make it easier to scale backend with HPA if additional capacity is needed. Fixes #1467	2024-01-16 15:32:42 -08:00
Tessa Walsh	032859f361	Support multiple crawler versions (#1420 ) Fixes #1385 ## Changes Supports multiple crawler 'channels' which can be configured to different browsertrix-crawler versions - Replaces `crawler_image` in helm chart with `crawler_channels` array similar to how storages are handled - The `default` crawler channel must always be provided and specifies the default crawler image - Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint to fetch information about available crawler versions (name, image, and label) and test - Adds crawler channel select to workflow creation/edit screens and profile creation dialog, and updates related API endpoints and configmaps accordingly. The select dropdown is shown only if more than one channel is configured. - Adds `crawlerChannel` to workflow and crawl details. - Add `image` to crawler image, used to display actual image used as part of the crawl. - Modifies `crawler_crawl_id` backend test fixture to use `test` crawler version to ensure crawler versions other than latest work - Adds migration to add `crawlerChannel` set to `default` to existing workflow and profile objects and workflow configmaps --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-01-16 15:32:12 -08:00
Tessa Walsh	9f73bafd37	Fix browser profile name in crawl endpoints (#1464 ) Fixes #1388 Fixes browser profile name lookup by ensuring profileid is in CrawlOut model.	2024-01-14 16:30:27 -08:00
Tessa Walsh	38a01860b8	Add API endpoints for crawl statistics (#1461 ) Fixes #1158 Introduces two new API endpoints that stream crawling statistics CSVs (with a suggested attachment filename header): - `GET /api/orgs/all/crawls/stats` - crawls from all orgs (superuser only) - `GET /api/orgs/{oid}/crawls/stats` - crawls from just one org (available to org crawler/admin users as well as superusers) Also includes tests for both endpoints.	2024-01-10 13:30:47 -08:00
Ilya Kreymer	a6936299d3	version: bump to 1.9.0-beta.0	2023-12-20 00:08:16 -08:00
Ilya Kreymer	d74d9ac09d	Recreate configmaps if missing (#1444 ) If configmap is missing (eg. was accidentally deleted from k8s) recreate the configmap when updating the crawl workflow or running a crawl. Previously, this would result in an error, but now the configmap should be correctly recreated.	2023-12-12 17:48:27 -05:00
Ilya Kreymer	d902cf5338	version: bump to 1.8.2	2023-12-07 13:34:37 -08:00
Tessa Walsh	be41c48c27	Add extra and gifted execution minutes (#1361 ) Fixes #1358 - Adds `extraExecMinutes` and `giftedExecMinutes` org quotas, which are not reset monthly but are updateable amounts that carry across months - Adds `quotaUpdate` field to `Organization` to track when quotas were updated with timestamp - Adds `extraExecMinutesAvailable` and `giftedExecMinutesAvailable` fields to `Organization` to help with tracking available time left (includes tested migration to initialize these to 0) - Modifies org backend to track time across multiple categories, using monthlyExecSeconds, then giftedExecSeconds, then extraExecSeconds. All time is also written into crawlExecSeconds, which is now the monthly total and also contains any overage time above the quotas - Updates Dashboard crawling meter to include all types of execution time if `extraExecMinutes` and/or `giftedExecMinutes` are set above 0 - Updates Dashboard Usage History table to include all types of execution time (only displaying columns that have data) - Adds backend nightly test to check handling of quotas and execution time - Includes migration to add new fields and copy crawlExecSeconds to monthlyExecSeconds for previous months Co-authored-by: emma <hi@emma.cafe>	2023-12-07 14:34:37 -05:00
Tessa Walsh	478b794f9b	Add API endpoint to retry all failed bg jobs (#1396 ) Fixes #1395 - Adds new `POST /orgs/<orgid>/jobs/retryFailed` API endpoint to retry all failed background jobs for a specific org. - Also adds `POST /orgs/all/jobs/retryFailed` for superadmin to retry all failed background jobs for all orgs	2023-12-05 13:00:45 -08:00
Tessa Walsh	3d93d0a0d0	Add API tests for browser profiles (#1392 ) Fixes #1330	2023-11-28 10:40:58 -05:00
Henry Wilkinson	f507f1d2ec	Fixes allowed actions for viewers and crawlers throughout the app (#1326 ) Closes #1294 ### Changes - `crawl-list` component - Adds a check if there are any items in the actions menu. If not, skip rendering the actions menu. - This allows us to give the component no actions! Currently required to remove them for viewers! - Collection Details - Hides "Remove from Collection" option for viewers - Crawls List - Removes the single "View Crawl Details" option from archived items for viewers - All the other actions were already set up correctly to be used by all roles! - Dashboard - Hides org settings gear icon button unless the user is an admin - Hides "Create New" dropdown for viewers - Workflow Details - Hides workflow edit icon button for viewers - Hides the "Delete Crawl" option in archived items for viewers - Hides the "Run Crawl" option for viewers - Workflow List - Hides all edit-related options for viewers, the only option now is copying tags - Removes the deactivate / delete options (were only visible when running a crawl) in the workflow list actions --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@suayoo.com>	2023-11-17 14:41:21 -08:00
Ilya Kreymer	1218d6e767	version: bump to 1.8.1	2023-11-17 14:39:52 -08:00
Ilya Kreymer	b6f8c968e9	version: bump to 1.8.0	2023-11-15 17:57:43 -08:00
Ilya Kreymer	b23eed5003	Email Templates (#1375 ) - Emails are now processed from Jinja2 templates found in `charts/email-templates`, to support easier updates via helm chart in the future. - The available templates are: `invite`, `password_reset`, `validate` and `failed_bg_job`. - Each template can be text only or also include HTML. The format of the template is: ``` subject ~~~ <html content> ~~~ text ``` - A new `support_email` field is also added to the email block in values.yaml Invite Template: - Currently, only the invite template includes an HTML version, other templates are text only. - The same template is used for new and existing users, with slightly different text if adding user to an existing org. - If user is invited by the superadmin, the invited by field is not included, otherwise it also includes 'You have been invited by X to join Y'	2023-11-15 15:22:12 -08:00
Ilya Kreymer	7d985a9688	version: bump to 1.8.0-beta.4	2023-11-14 11:59:04 -08:00
Ilya Kreymer	dfba4b3940	Replace partial_complete -> stopped_by_user or stopped_quota_reached + operator edge cases (#1368 ) - Adds two new crawl finished states, stopped_by_user and stopped_quota_reached - Tracking other possible 'stop reasons' in operator, though not making them distinct states for now. - Updated frontend with 'Stopped by User' and 'Stopped: Time Quota Reached', shown with same icon as current partial_complete - Added migration of partial_complete to either stopped_by_user or complete (no historical quota data available) - Addresses edge case in scaling: if crawl never scaled (no redis entry, no pod), automatically scale down - Edge case in status: if crawl is somehow 'canceled' but not deleted, immediately delete crawl object and begin finalizing. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-14 11:17:16 -08:00
Ilya Kreymer	67892994a6	version: bump to 1.8.0-beta.3	2023-11-09 18:20:04 -08:00
Tessa Walsh	f3cbd9e179	Add crawl, upload, and collection delete webhook event notifications (#1363 ) Fixes #1307 Fixes #1132 Related to #1306 Deleted webhook notifications include the org id and item/collection id. This PR also includes API docs for the new webhooks and extends the existing tests to account for the new webhooks. This PR also does some additional cleanup for existing webhooks: - Remove `downloadUrls` from item finished webhook bodies - Rename collection webhook body `downloadUrls` to `downloadUrl`, since we only ever have one per collection - Fix API docs for existing webhooks, one of which had the wrong response body	2023-11-09 18:19:08 -08:00
Tessa Walsh	1afc411114	Implement retry API endpoint for failed background jobs (#1356 ) Fixes #1328 - Adds /retry endpoint for retrying failed jobs. - Returns 400 error if previous job still running or has succeeded - Keeps track of previous failed attempts in previousAttempts array on failed job. - Also amends the similar webhook /retry endpoint to use `POST` for consistency. - Remove duplicate api tag for backgroundjobs	2023-11-09 18:09:37 -08:00
Tessa Walsh	82a5d1e4e4	Regression fix: add profiles/ prefix to profile filenames (#1365 ) Fixes #1364 Regression fix for issue introduced in storage refactoring (see issue for more details). Changes: 1. Add `profiles/` prefix to profile filename passed in to crawler for profile creation and written into db 2. Remove hardcoded `profiles/` prefix from crawler YAML 3. Add migration to add `profiles/` prefix to profile filenames that don't already have it, including updating PROFILE_FILENAME in ConfigMaps This way between the related storage document and the profile filename, we have the full path to the object in the database rather than relying on additional prefixes hardcoded into k8s job YAML files. Note that as a follow-up it'll be necessary to manually move any profiles that had been written into the `<oid>` "directory" in object storage rather than `<oid>/profiles` to the latter. This should only affect profiles created very recently in a 1.8.0-beta release.	2023-11-09 17:44:16 -08:00
Tessa Walsh	30bbefbeaa	Send email to superuser when background job fails (#1355 ) Fixes #1344 Sends email to superadmin when a background job fails.	2023-11-08 19:55:59 -08:00
Ilya Kreymer	ff10124d01	charts cleanup: (#1360 ) - move authsign secret to signer and make port configurable - rename storages to more general ops-configs - put 'storages.json' path into env var - rename backend secret to backend-auth - cronjobs: don't keep succeeded jobs around, triggers operator update	2023-11-08 19:24:00 -08:00
Ilya Kreymer	d2d7240455	background jobs fix: ensure bucket is parsed correctly (#1359 ) Follow-up to #1321 - correctly parse the endpoint_url into prefix and bucket path - also add region and s3 provider type to storage secrets	2023-11-08 15:08:23 -08:00
Ilya Kreymer	3aebf2e37f	version: bump to 1.8.0-beta.2	2023-11-06 16:35:15 -08:00
Ilya Kreymer	b4fd5e6e94	Crawl Timeout via elapsed time (#1338 ) Fixes #1337 Crawl timeout is tracked via `elapsedCrawlTime` field on the crawl status, which is similar to regular crawl execution time, but only counts one pod if scale > 1. If scale == 1, this time is equivalent. Crawl is gracefully stopped when the elapsed execution time exceeds the timeout. For more responsiveness, also adding current crawl time since last update interval. Details: - handle crawl timeout via elapsed crawl time - longest running time of a single pod, instead of expire time. - include current running from last update for best precision - more accurately count elapsed time crawl is actually running - store elapsedCrawlTime in addition to crawlExecTime, storing the longest duration of each pod since last test interval --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-06 16:32:58 -08:00
Ilya Kreymer	5530ca92e1	Move backend app templates to be installed from configmap volume (#1331 ) Instead of adding the app templates launched from the backend via `backend/btrixcloud/templates`, add them to a configmap and mount the configmap in the same location. This allows these templates to be updated, like other values in charts/... without having to rebuild any of the images, speeding up dev and maintenance time. Changes include: - move backend/btrixcloud/templates -> chart/app-templates/ - add app-templates/*.yaml to app-templates configmap - mount app-templates configmap to /app/btrixcloud/templates/ in api and op containers	2023-11-06 09:37:48 -08:00
Ilya Kreymer	0935d43a97	exclusion optimizations: dynamic exclusions (part of #1216 ): (#1268 ) - instead of restarting crawler when exclusion added/removed, add a message to a redis list (per crawler instance) - no longer filtering existing queue on backend, now handled via crawler (implemented in 0.12.0 via webrecorder/browsertrix-crawler#408) - match response optimization: instead of returning first 1000 matches, limits response to 500K and returns however many matches fit in that response size (for optional pagination on frontend)	2023-11-06 09:36:25 -08:00
Ilya Kreymer	fb3d88291f	Background Jobs Work (#1321 ) Fixes #1252 Supports a generic background job system, with two background jobs, CreateReplicaJob and DeleteReplicaJob. - CreateReplicaJob runs on new crawls, uploads, profiles and updates the `replicas` array with the info about the replica after the job succeeds. - DeleteReplicaJob deletes the replica. - Both jobs are created from the new `replica_job.yaml` template. The CreateReplicaJob sets secrets for primary storage + replica storage, while DeleteReplicaJob only needs the replica storage. - The job is processed in the operator when the job is finalized (deleted), which should happen immediately when the job is done, either because it succeeds or because the backoffLimit is reached (currently set to 3). - /jobs/ api lists all jobs using a paginated response, including filtering and sorting - /jobs/<job id> returns details for a particular job - tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints - tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 13:02:17 -07:00
Ilya Kreymer	6384d8b5f1	Additional Type Hints / Type Fix Pass (#1320 ) This PR adds more type safety to the backend codebase: - All ops classes calls should be type checked - Avoiding circular references with TYPE_CHECKING conditional - Consistent UUID usage: uuid.UUID / UUID4 with just UUID - Crawl states moved to models, made into lists - Additional typing added as needed, fixed a few type related errors - CrawlOps / UploadOps / BaseCrawlOps now all have same param init order to simplify changes	2023-10-30 12:59:24 -04:00
Ilya Kreymer	72f1840ae7	fix regression in concurrent crawls: (#1324 ) - check the 'btrix.org' instead of 'oid' labels in getting related crawls - fixes regression introduced in #1296 where all org id labels were switched to 'btrix.org' for consistency	2023-10-30 12:58:07 -04:00
Ilya Kreymer	8c09934298	version: bump to 1.8.0-beta.1	2023-10-27 14:35:24 -07:00
Ilya Kreymer	c1d3beda9c	users: add case-insensitive index to maintain backwards compatibility with fastapi-users (#1319 ) follow up to #1290 Based on implementation in: https://github.com/fastapi-users/fastapi-users-db-mongodb/blob/main/fastapi_users_db_mongodb/__init__.py	2023-10-27 14:31:29 -07:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296 ) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contain a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Tessa Walsh	38f32f11ea	Enforce quota and hard cap for monthly execution minutes (#1284 ) Fixes #1261 Closes #1092 The quota for monthly execution minutes is treated as a hard cap. Once it is exceeded, an alert indicating that an org has exceeded its monthly execution minutes will display and the user will be unable to start new crawls. Any running crawls will be stopped once the quota is exceeded. An execution minutes meter bar is also added in the Org Dashboard and displayed if a quota is set. More detail in #1305 which was merged into this branch. ## Changes - Enable setting 'maxExecMinutesPerMonth' in orgs list quotas by superadmin - Enforce quota by stopping crawls in operator once quota is reached - Show alert banner once execution time quota is hit: - Once quota is hit, disable Run Crawl buttons in frontend, return 403 message with `exec_minutes_quota_reached` detail in backend from crawl config `/run` endpoint, and don't run new workflows on creation (similar to storage quota) - Display execution time for crawls in the crawl details overview, immediately below - Show execution minutes meter on dashboard (from #1305) --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@webrecorder.org>	2023-10-26 15:38:51 -07:00
Tessa Walsh	5fadc630ce	Check for empty string for SMTP password (#1317 ) Follow-up fix for #1136 based on this comment: https://github.com/webrecorder/browsertrix-cloud/issues/1136#issuecomment-1777119534	2023-10-26 09:44:55 -07:00
Ilya Kreymer	4591db1afe	More stringent UUID types for user input / avoid 500 errors (#1309 ) Fixes #1297 Ensures proper typing for UUIDs in FastAPI input models, to avoid explicit conversions, which may throw errors. This avoids possible 500 errors (due to ValueError exceptions) when converting UUIDs from user input. Instead, will get more 422 errors from FastAPI. UUID conversions remaining are in operator / profile handling where UUIDs are retrieved from previously set fields, remaining user input conversions in user auth and collection list are wrapped in exceptions. For `profileid`, update fastapi models to support union of UUID, null, and EmptyStr (new empty string only type), to differentiate removing profile (empty string) vs not changing at all (null) for config updates	2023-10-25 15:15:53 -04:00
Tessa Walsh	d58747dfa2	Provide full resources in archived items finished webhooks (#1308 ) Fixes #1306 - Include full `resources` with expireAt (as string) in crawlFinished and uploadFinished webhook notifications rather than using the `downloadUrls` field (this is retained for collections). - Set default presigned duration to one minute short of 1 week and enforce maximum supported by S3 - Add 'storage_presign_duration_minutes' commented out to helm values.yaml - Update tests --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-23 19:01:58 -07:00
Tessa Walsh	5c5ef68a8a	Prevent user from logging in after 5 consecutive failed login attempts until pw is reset (#1281 ) Fixes #1270 After 5 consecutive failed logins from the same user, we now prevent the user from logging in even with the correct password until they reset it via their email, or wait an hour. - After failure threshold is reached, all further login attempts are rejected - Attempts for invalid email addresses are also tracked - On 6th try, a reset password email is automatically sent, only once - Failed login counter resets after an hour of no further logins after last attempted login. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-20 14:10:56 -07:00
Tessa Walsh	733809b5a8	Update user names in crawls and workflows after username update (#1299 ) Fixes #1275	2023-10-19 23:34:49 -07:00
Ilya Kreymer	63291e95a5	avoid exception if 'errors' key doesn't exist (#1301 ) - avoid exception if 'errors' (or 'files' keys) don't exist (part of #1297) - ensure 'errors' list always set on output model for consistency, defaulting to empty list - fix tests for 'errors' being an empty list follow-up to #1300 (merging 1.7.1 release into main)	2023-10-19 14:39:54 -07:00
Ilya Kreymer	9a2787f9c4	User refactor + remove fastapi_users dependency + update fastapi (#1290 ) Fixes #1050 Major refactor of the user/auth system to remove fastapi_users dependency. Refactors users.py to be standalone and adds new auth.py module for handling auth. UserManager now works similar to other ops classes. The auth should be fully backwards compatible with fastapi_users auth, including accepting previous JWT tokens w/o having to re-login. The User data model in mongodb is also unchanged. Additional fixes: - allows updating fastapi to latest - add webhook docs to openapi (follow up to #1041) API changes: - Removing the `GET, PATCH, DELETE /users/<id>` endpoints, which were not in use before, as users are scoped to orgs. For deletion, probably auto-delete when user is removed from last org (to be implemented). - `/users/me-with-orgs` is renamed to just `/users/me/` - New `PUT /users/me/change-password` endpoint with password required to update password, fixes #1269, supersedes #1272 Frontend changes: - Fixes from #1272 to support new change password endpoint. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: sua yoo <sua@suayoo.com>	2023-10-18 10:49:23 -07:00
sua yoo	4610d95cd7	Use org slug in place of UUIDs in app URLs (#1277 ) - Replaces org UUID in URL/browser location bar with org slug. - Refactor: Adds shared app state utility using https://sijakret.github.io/lit-shared-state/ to access org data from deep descendants. - Backwards compatible: org UUID URLs should auto-redirect to org slug URLs. - Show the org UUID in org settings general tab for use with APIs (Resolves #1258, Follows #1279)	2023-10-18 09:28:30 -07:00
Ilya Kreymer	36bd228115	version: update to 1.8.0-beta.0	2023-10-17 18:06:55 -07:00
Ilya Kreymer	b3f530f8e6	version: bump to 1.7.0	2023-10-16 18:39:20 -07:00
Ilya Kreymer	ddc4e03422	operator status typo fix: (#1293 ) - don't log normal exits as crashes! - set pod_status.exitCode to the exitCode - count exit code 13 as not-a-crash also (force interrupt)	2023-10-16 15:01:46 -07:00
Ilya Kreymer	1bc4697995	optimization: avoid updating whole org when only need to set one field (#1288 ) - add update_users and update_slug_and_name - rename update to update_full	2023-10-16 10:54:04 -07:00
Ilya Kreymer	dc8d510b11	webhook tweak: pass oid to crawl finished and upload finished webhooks (#1287 ) Optimizes webhooks by passing oid directly to webhooks: - avoids extra crawl lookup - possible for crawl to be deleted before webhook is processed via operator (resulting in crawl lookup to fail) - add more typing to operator and webhooks	2023-10-16 10:51:36 -07:00
Ilya Kreymer	a295f5d05d	version: bump to 1.7.0-beta.3	2023-10-15 18:31:03 -07:00
Tessa Walsh	2383b0d616	Set log download attachment name to crawl_id.log (#1280 ) Fixes #1271 Using .log for now due to broader support for opening with default viewers --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-10-13 20:00:37 -07:00
Tessa Walsh	c5ca250f37	Add id-slug lookup and restrict slugs endpoints to superadmins (#1279 ) Fixes #1278 - Adds `GET /orgs/slug-lookup` endpoint returning `{id: slug}` for all orgs - Restricts new endpoint and existing `GET /orgs/slugs` to superadmins	2023-10-13 17:02:19 -07:00
Ilya Kreymer	41c054d209	Storage ops followup type checking (#1274 ) * storage ops: follow up to #1257: - fix refactor typo - add type hints for all storageops apis (add mypy_boto3_s3 and types_aiobotocore_s3 for type hints)	2023-10-11 14:03:00 -07:00
Tessa Walsh	266afdf8d9	Add slugs to org backend (#1250 ) - Add slug field with uniqueness constraint to Organization - Use python-slugify to generate slug from name and import that in migration - Require name in all /rename and org creation requests - Auto-generate slug for new org with no slug or when /rename is called w/o a slug - Auto-generate slug for 'default-org' based on name - Add /api/orgs/slugs GET endpoint to return all slugs in use - tests: extend backend test-requirements.txt from requirements to allow testing slugify - tests: move get_redis_crawl_stats() to avoid extra dependency in utils	2023-10-10 18:30:09 -07:00
Ilya Kreymer	16e7a1d0a2	Storage Ops Refactor (#1257 ) * storage ops refactor: - create StorageOps class similar to other ops classes - init storages list in StorageOps, no longer require lookup up default storages via CrawlManager - convert all storage functions to members, add storageops to operator - remove unused params, ensure crawl exists for rollover restart - add env var to determine if using local minio to use correct endpoint URL * crawls /seeds endpoint: just return empty list if not a crawl (eg. upload) * crawlmanager: remove unused code, rename check_storage -> has_storage	2023-10-10 15:04:23 -07:00
Ilya Kreymer	5cad9acee9	Compute crawl execution time in operator (#1256 ) * store execution time in operator: - rename isNewCrash -> isNewExit, crashTime -> exitTime - keep track of exitCode - add execTime counter, increment when state has a 'finishedAt' and 'startedAt' state - ensure pods are complete before deleting - store 'crawlExecSeconds' on crawl and org levels, add to Crawl, CrawlOut, Organization models * support for fast cancel: - set redis ':canceled' key to immediately cancel crawl - delete crawl pods to ensure pod exits immediately - in finalizer, don't wait for pods to complete when canceling (but still check if terminated) - add currentTime in pod.status.running.startedAt times for all existing pods - logging: log exec time, missing finishedAt - logging: don't log exit code 11 (interrupt due to time/size limits) as a crash * don't wait for pods completed on failed with existing browsertrix-crawler image --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-09 17:45:00 -07:00
Tessa Walsh	748c86700d	fix: lookup user object operator to pass to CrawlConfig.add_new_crawl (#1254 ) fixes #1253 Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-05 21:30:10 -07:00
Ilya Kreymer	fa86555eed	Track pod resource usage, detect OOM crashes, handle auto-scaling (#1235 ) * keep track of per pod status on crawljob: - crashes time, and reason - 'used' vs 'allocated' resources - 'percent' used / allocated * crawl log errors: log error when crawler crashes via OOM, either via redis error log or to console * add initial autoscaling support! - detect if metrics server is available via K8SApi.is_pod_metrics_available() - if available, use metrics for 'used' fields - if no metrics, set memory used for redis only (using redis apis) - allow overriding memory and cpu via newMemory and newCpu settings on pod status - scale memory / cpu based on newMemory and newCpu setting - templates: update jinja templates to allow restarting crawler and redis with new resources - ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs * roles: cleanup unused roles, add permissions for listing metrics * stats for running crawls: - update in db via operator - avoids losing stats if redis pod happens to be done - tradeoff is more db access in operator, but less extra connections to redis + already loading from db in backend - size stat: ensure size of previous files is added to the stats * crawler deployment tweaks: - adjust cpu/mem per browser - add --headless flag to configmap to use new headless mode by default!	2023-10-05 20:41:18 -07:00
Ilya Kreymer	20560abb81	version: bump to 1.7.0-beta.2	2023-10-05 20:33:38 -07:00
Tessa Walsh	bbdb7f8ce5	Require that all passwords are between 8 and 64 characters (#1239 ) - Require that all passwords are between 8 and 64 characters - Fixes account settings password reset form to only trigger logged-in event after successful password change. - Password validation can be extended within the UserManager's validate_password method to add or modify requirements. - Add tests for password validation	2023-10-03 18:57:46 -07:00
Tessa Walsh	b1ead614ee	Add --failOnFailedSeed checkbox to URL list workflows (#1236 ) - If set, and any of the seeds fails, the entire crawl is marked as a failure. - Add checkbox which adds --failOnFailedSeed to URL list workflows - Add 'Fail Crawl On Failed URL' to crawl workflow setup docs	2023-10-03 18:46:09 -07:00
Tessa Walsh	e9bac4c088	API delete endpoint improvements (#1232 ) - Applies user permissions check before deleting anything in all /delete endpoints - Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete - Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete - Updates frontend notification message to Only org owners can delete other users' archived items. when a crawler user attempts to delete another users' archived items	2023-10-03 13:05:00 -07:00
sua yoo	df190e12b9	Show running workflow error logs (#1224 ) - Adds "Logs" tab to workflow detail - Shows error logs in expandable section in "Watch" tab - Show corresponding message (no logs yet or logs temporarily unavailable) when `/errors` returns 503 based on crawl state - text tweaks: use error logs instead of logs, change 'crawl start' -> 'crawl continue' in log message --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-03 00:03:21 -07:00
Anish Lakhwara	a2dbad35c3	feat: use is_bool to check EMAIL_SMTP_USE_TLS (#1231 ) - use is_bool to check EMAIL_SMTP_USE_TLS - use is_bool for yaml values that are boolean	2023-10-02 21:29:36 -07:00
sua yoo	941a75ef12	Separate seeds into new endpoints (#1217 ) - Remove config.seeds from workflow and crawl detail endpoints - Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow - Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before) - Modify frontend to fetch seeds from new /seeds endpoints with loading indicator --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-02 10:56:12 -07:00
Anish Lakhwara	1bf531e1ec	Fix: Make Collections Public on Creation (#1213 ) - Add isPublic to Add Collection endpoint, send isPublic from frontend - Fixes #1212	2023-09-29 12:08:10 -07:00
Anish Lakhwara	037396f3d9	Fix: Stream log downloading from WACZ (#1225 ) * Fix(backend): Stream logs without causing OOM Also be smarter about when to use `heapq.merge` and when to use `itertools.chain`: If all the logs are coming from the same instance we `chain` them, otherwise we'll `merge` them iterator fixes: - group wacz files by instance by suffix, eg. -0.wacz, -1.wacz, -2.wacz - sort wacz files, and all logs within each wacz file - chain log iterators for all log files within wacz group - merge log iterators across wacz files in different groups - add type hints to help keep track of iterator helper functions - add iter_lines() from botocore, use that for line parsing for simplicity --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-28 18:54:52 -07:00
Ilya Kreymer	d6bc467c54	improvements to redis pod: (#1219 ) - add liveness check/fix readiness check - ensure 'redis-cli ping' actually returns 'PONG', as exit code is 0 even on errors; this will detect situations where redis is not available, such as due to max clients being reached - bump redis memory/cpu for now (until autoscaling/automatic adjustment is available)	2023-09-28 13:00:31 -07:00
Ilya Kreymer	7eac0fdf95	optimization: convert all uses of 'async for' to use iterator directly (#1229 ) - optimization: convert all uses of 'async for' to use iterator directly instead of converting to list to avoid unbounded size lists - additional cursor.to_list() to async for conversions for stats computation, simplify crawlconfigs stats computation --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-28 12:31:08 -07:00
Vinzenz Sinapius	cabf4ccc21	Disable `smtp_use_tls` with `false` instead of empty string (#1184 ) `smtp_use_tls = bool(os.environ.get("EMAIL_SMTP_USE_TLS", True))` would only disable tls when `EMAIL_SMTP_USE_TLS` is set to an empty string which is not intuitive	2023-09-28 12:10:20 -07:00
Ilya Kreymer	86a424af93	migration improvements: (#1228 ) * migration improvements + rerunning migrations: (fixes #1227) - avoid starting some workers while migration is still running - ensure workers that aren't performing migration await for migration to complete - backend will not be valid until migration is run * allow rerunning migration from specified version via --set rerun_from_migration=<VERSION> (replaces rerun_last_migration)	2023-09-28 12:04:19 -07:00
Tessa Walsh	1f74f03447	Recalculate Organization.storedBytes in migration 0017 (#1220 )	2023-09-28 11:22:10 -07:00
Tessa Walsh	7a56fa23f5	Remove username lookups for crawls and workflows by storing usernames in db (#1199 ) * store usernames (createdByName, modifiedByName, startedByName) in db for workflows * store userName for userid for crawls in db * update output models to return usernames * add migration 0018 to add usernames to existing crawls and crawlconfigs * updated tests for crawl and config usernames * use async for to iterate over crawls and crawlconfigs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-28 09:37:23 -07:00
Ilya Kreymer	e6bccac953	exclude match api pagination: (#1214 ) - limit how many exclusion matches are returned at once - option to specify 'offset', 'limit' and return 'nextOffset' for further pagination - set page limit to 1000 by default	2023-09-26 13:45:54 -07:00
Tessa Walsh	094f27bcff	Track bytes stored per file type and include in org metrics (#1207 ) * Add bytes stored per type to org and metrics The org now tracks bytesStored by type of crawl, uploads, and browser profiles in addition to the total, and returns these values in the org metrics endpoint. A migration is added to precompute these values in existing deployments. In addition, all /metrics storage values are now returned solely as bytes, as the GB form wasn't being used in the frontend and is unnecessary. * Improve deletion of multiple archived item types via `/all-crawls` delete endpoint - Update `/all-crawls` delete test to check that org and workflow size values are correct following deletion. - Fix bug where it was always assumed only one crawl was deleted per cid and size was not tracked per cid - Add type check within delete_crawls	2023-09-22 12:55:21 -04:00
Tessa Walsh	83f80d4103	Add org metrics API endpoint (#1196 ) * Initial implementation of org metrics (This can eventually be sped up significantly by precomputing the values and storing them in the db.) * Rename storageQuota to storageQuotaBytes to be consistent * Update tests to include metrics	2023-09-19 16:24:27 -05:00
Tessa Walsh	859f2271da	fix(backend): call run now when updating crawlConfig #1194 Update backend/btrixcloud/crawlconfigs.py Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-19 11:57:41 -07:00
Tessa Walsh	9224f52f51	Remove config from list endpoints to speed up responses (#1193 ) * Remove config from list endpoints - Remove config field from workflow and crawl list endpoints - Add seedCount to CrawlConfigOut on backend and Workflow on frontend - Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional - Refactor workflow list in frontend to use firstSeed and seedCount - Frontend uses ListWorkflow type which is Omit<Workflow, "config">	2023-09-19 11:05:48 -05:00
Ilya Kreymer	65b7c10ba1	bump version to 1.7.0-beta.1	2023-09-18 14:33:03 -07:00
Ilya Kreymer	ff327c0b8b	Reset crawl state to running when any crawlers are running (after post-process states) (#1179 ) * operator state changes: (fixes #1178) - if at least one crawler is 'running' ensure state is reset back to running - for multiple instances, set status to earliest state (not latest) to be consistent, eg. if at least one crawl is running, set to running, if at least one is generating wacz, set to that	2023-09-15 09:16:46 -07:00
Tessa Walsh	2efc461b9b	Implement sync streaming for finished crawl logs (#1168 ) - Crawl logs streamed from WACZs using the sync boto client	2023-09-14 17:05:19 -07:00
Tessa Walsh	c7cd4e61fd	Increase wait to 30 seconds to ensure webhooks are sent (#1173 )	2023-09-13 20:20:47 -07:00
Ilya Kreymer	feb7ab7652	Improved type checking for backend with mypy (#1174 ) * add mypy type check - run type check on backend fix ambiguous typing issues - add mypy to lint gh action + precommit hook - add mypy.ini	2023-09-13 19:40:26 -07:00
Ilya Kreymer	4b34da033a	Refactor / Cleanup: move ops functions back into classes (#1171 ) * remove almost all standalone functions and move them back into ops member functions * operator now has access to all the ops classes as well * keep two standalone functions used only in migrations --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-13 11:56:09 -07:00
Ilya Kreymer	9159c7c914	ensure max crawl size and max crawl timeout values are set to 0 when unused, instead of null (#1167 ) - convert None->0 when creating CrawlJob - ensure frontend sends 0 not null - make input model require 'int = 0' instead of 'Optional[int] = 0'	2023-09-13 09:51:26 -07:00
Tessa Walsh	7cf2b11eb7	Add event webhook tests (#1155 ) * Add success filter to webhook list GET endpoint * Add sorting to webhooks list API and add event filter * Test webhooks via echo server * Set address to echo server on host from CI env var for k3d and microk8s * Add -s back to pytest command for k3d ci * Change pytest test path to avoid hanging on collecting tests * Revert microk8s to only run on push to main	2023-09-12 22:08:40 -07:00
Tessa Walsh	f980c3c509	Expect that crawl deleted response is bool, not int (#1170 )	2023-09-12 15:03:17 -07:00
Ilya Kreymer	c9c39d47b7	Scheduled Crawl Refactor: Handle via Operator + Add Skipped Crawls on Quota Reached (#1162 ) * use metacontroller's decoratorcontroller to create CrawlJob from Job * scheduled job work: - use existing job name for scheduled crawljob - use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done - simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls, placeholder job image for cronjobs * move storage quota check to crawljob handler: - add 'skipped_quota_reached' as new failed status type - check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created) * frontend: - show all crawls in crawl workflow, no need to filter by status - add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed * migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template	2023-09-12 13:05:43 -07:00
Tessa Walsh	9377a6f456	Issue all non-upload storage-quota-update events from LiteElement (#1151 ) - More specific toast notification error messages to the action being attempted - Single dismissable global banner shown when org storage is reached - Removed check for storage quota reached in `runNow`, since buttons are disabled in UI, and errors handled if request fails. - Allow creating new workflow when storage quota reached - More responsive storage quota updates: add storageQuotaReached to archived item replay.json, updates w/o reload when crawl pushes quota over limit - Modify LiteElement to check for storageQuotaReached on GET requests --------- Co-authored-by: sua yoo <sua@suayoo.com>	2023-09-11 18:17:48 -07:00
Ilya Kreymer	ad9bca2e92	Operator refactor to control pods + pvcs directly instead of statefulsets (#1149 ) - Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state. - Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion. - Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl - Create priority classes up to 'max_crawl_scale', configured in values.yaml - Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale, graceful stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state before adjust status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted. - Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds - Configurable Redis storage with 'redis_storage' value, set to 3Gi by default - CrawlJob deletion starts as soon as post-finish crawl operations are run - Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer - Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished) - Current resource usage added to status - Profile browser: also manage single pod directly without statefulset for consistency. - Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases). 
- Update to latest metacontroller (v4.11.0) - Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0) - Failed crawl logging: add 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status - tests: check other finished states to avoid stuck in infinite loop if crawl fails - tests: disable disk utilization check, which adds unpredictability to crawl testing! fixes #1147 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-11 10:38:04 -07:00

534 Commits