browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	5f53db75ee	fix resetting of invalid logins: (#2002 ) * Fixes issue in FailedLogin model: - fix data-model to remove nested 'attempted.attempted' - migrate existing data to remove nested field * Also, avoid setting dt_now() in model as that results in fixed date for all objects: - update FailedLogin to update 'attempted' date on every attempt - also update PageNote object to set date in constructor * Update text for too many logins to make it clear it is set only if its a valid email * fixes #2001	2024-08-07 12:36:06 -07:00
Ilya Kreymer	7fa2b61b29	Execution time tracking tweaks (#1994 ) Tweaks to how execution time is tracked for more accuracy + excluding waiting states: - don't update if crawl state is in a 'waiting state' (waiting for capacity or waiting for org limit) - rename start states -> waiting states for clarity - reset lastUpdatedTime if two consecutive updates of non-running state, to ensure non-running states don't count, but also account for occasional hiccups -- if only one update detects non-running state, don't reset - webhooks: move start webhook to when crawl actually starts for first time (db lastUpdatedTime is not yet + crawl is running) - don't set lastUpdatedTime until pods actually running - set crawljob update interval to every 10 seconds for more accurate execution time tracking - frontend: show seconds in 'Execution Time' display	2024-08-06 09:44:44 -07:00
Ilya Kreymer	1c153dfd3c	Subscription Update Quotas (#1988 ) - Follow-up to #1914, allows SubscriptionUpdate event to also update quotas. - Passes current usage info + current billing page URL to portalUrl request for external app to be able to respond with best portalUrl - get_origin() moved to utils to be available more generally. - Updates billing tab to show current plans, switches order of quotas to list execution time, storage first	2024-08-05 15:59:47 -07:00
Ilya Kreymer	894aa29d4b	remove crc32 from CrawlFile (#1980 ) - no longer being used with latest stream-zip - was not computed correctly in the crawler - counterpart to webrecorder/browsertrix-crawler#657 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-30 11:23:15 -07:00
Ilya Kreymer	e9aeff1836	add a 'stopped_org_readonly' state for crawls that are running while org is made read-only (#1977 ) an org is made read-only while crawls are running: - treat similar to other stopped_* states, do a graceful stop - update UI to display "Stopped: Crawling Disabled" for this status - don't add corresponding skipped status - just skip running crawls if org is read-only	2024-07-29 12:24:40 -07:00
Tessa Walsh	551660bb62	Add webhooks for qaAnalysisStarted, qaAnalysisFinished, and crawlReviewed (#1974 ) Fixes #1957 Adds three new webhook events related to QA: analysis started, analysis ended, and crawl reviewed. Tests have been updated accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 16:53:49 -07:00
Tessa Walsh	d38abbca7f	Standardize handling of storage and execution time quotas (#1969 ) Fixes #1968 Changes: - `stopped_quota_reached` and `skipped_quota_reached` migrated to new values that indicate which quota was reached - Before crawls are run, the operator checks if storage or exec mins quotas are reached and if so fails the crawl with the appropriate state of `skipped_storage_quota_reached` or `skipped_time_quota_reached` - While crawls are running, the operator checks if the exec mins quota is reached or if the size of all running crawls will mean the storage quota is reached once uploaded; if so, the crawl is stopped gracefully and given `stopped_storage_quota_needed` or `stopped_time_quota_reached` state as appropriate - Adds new nightly tests for enforcing storage quota	2024-07-25 12:49:11 -07:00
Ilya Kreymer	8c0321bdea	Pydantic 2.x update + type fixes + python 3.12 (#1947 ) * updates pydantic to 2.x * also update to python 3.12 * additional type fixes: - all Optional[] types must have a default value - update to constrained types - URL types converted from str - test updates Fixes #1940	2024-07-22 17:23:03 -07:00
Ilya Kreymer	cd00f52cca	Fix queue response models + additional testing for queue + exclusions (#1948 ) Follow-up to regressions from #1928, this PR: - Fixes response models for queue endpoints, which had incorrect model - Adds tests for queue get, queue match, and exclusions add / remove to ensure regressions like this can be caught via tests. This involves starting a new crawl in test_run_crawls() instead of relying on implicit running via fixtures, make it easier to test crawl while it's running. - Adds additional typing for crawls apis, including making delete_crawls() have correct typing, consistent derived class override - Adds check to ensure queue + exclusion operations can not be called when crawl is not running	2024-07-22 09:00:23 -07:00
Tessa Walsh	3bf7967754	Fix regression with saving new workflow due to profileid type error (#1946 ) Fixes #1945	2024-07-18 09:35:52 -07:00
Tessa Walsh	c772ee2362	Fix response model for crawl errors API endpoint (#1939 ) Follow-up fix for #1920 for crawl errors endpoint, which returns a 500 following #1928, caught in nightly tests.	2024-07-17 10:52:14 -07:00
Ilya Kreymer	4db3053a9f	fix crawlFilenameTemplate + add_crawl_config cleanup (fixes #1932 ) (#1935 ) - ensure crawlFilenameTemplate is part of the CrawlConfig model - change CrawlConfig init to use type-safe construction - add a run_now_internal() that is shared for starting crawl, either on demand or from new config - add OrgOps.can_run_crawls() to check against org quotas for crawling - cleanup profile updates, remove _lookup_profile, only check for EmptyStr in update --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-17 10:48:25 -07:00
Tessa Walsh	60afb19472	Add API endpoint to import subscription for existing org (#1930 ) Fixes #1926 - adds /subscriptions/import endpoint for importing an existing subscription to an existing org - add SubscriptionImport object and log as 'import' event in subscription events collection --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-16 16:17:02 -07:00
Tessa Walsh	d41647e6c2	Document all API endpoints with response models (#1928 ) Fixes #1920 Adds response models to all API endpoints that were missing them, documenting current behavior without making any changes at this stage to standardize responses. Follow-up work will involve adding generics to some of the response models	2024-07-16 12:48:38 -07:00
Tessa Walsh	aaf18e70a0	Add created date to Organization and fix datetimes across backend (#1921 ) Fixes #1916 - Add `created` field to Organization and OrgOut, set on org creation - Add migration to backfill `created` dates from first workflow `created` - Replace `datetime.now()` and `datetime.utcnow()` across app with consistent timezone-aware `utils.dt_now` helper function, which now uses `datetime.now(timezone.utc)`. This is in part to ensure consistency in how we handle datetimes, and also to get ahead of timezone naive datetime creation methods like `datetime.utcnow()` being deprecated in Python 3.12. For more, see: https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated	2024-07-15 19:46:32 -07:00
Ilya Kreymer	9a67e28f13	Adds Subscription API (#1914 ) Fixes https://github.com/webrecorder/browsertrix/issues/1905 - adds a new top-level `/api/subscriptions` endpoint and SubOps handler on the backend. - enable subscriptions API endpoints available only if `billing_enabled` is set in helm chart - new POST /subscriptions/create, /subscriptions/update, /subscriptions/cancel API endpoints - Subscriptions mongo collection storing timestamped /subscription API events - GET /subscriptions/events API to get subscription events, support for filtering and sorting - Subscription data model - Support for setting and handling readOnlyOnCancel on org - /orgs/<id>/billing-portal to lookup portalUrl using external API - subscription in org getter and list views - mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active` --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-10 17:41:16 -07:00
Ilya Kreymer	1c42e21b8a	Refactor Invites and Registration, Flatten Per-User Invites (#1902 ) Fixes #1432 Refactors the invite + registration system to be simpler and more consistent with regards to existing user invites. Previously, per-user invites are stored in the user.invites dict instead of in the invites collection, which creates a few issues: - Existing user do not show up in Org Invites list: #1432 - Existing user invites also do not expire, unlike new user invites, creating potential security issue. Instead, existing user invites should be treated like new user invites. This PR moves them into the same collection, adding a `userid` field to InvitePending to match with an existing user. If a user already exists, it will be matched by userid, instead of by email. This allows for user to update their email while still being invited. Note that the email of the invited existing user will not change in the invite email. This is also by design: an admin of one org should not be given any hint that an invited user already has an account, such as by having their email automatically update. For an org admin, the invite to a new or existing user should be indistinguishable. The sha256 of invite token is stored instead of actual token for better security. The registration system has also been refactored with the following changes: - Auto-creation of new orgs for new users has been removed - User.create_user() replaces the old User._create() and just creates the user with additional complex logic around org auto-add - Users are added to org in org add_user_to_org() - Users are added to org through invites with add_user_with_invite() Tests: - Additional tests include verifying that existing and new pending invites appear in the pending invites list - Tests for `/users/invite/<token>?email=` and `/users/me/invite/<token>` endpoints - Deleting pending invites - Additional tests added for user self-registration, including existing user self-registration to default org of existing user (in nightly tests)	2024-07-02 15:13:27 -07:00
Tessa Walsh	f076e7d9e3	Add superuser API endpoints to export and import org data (#1394 ) Fixes #890 This PR introduces new streaming superuser-only API endpoints to export and import database information for an organization. New Adminstrator deployment documentation on how to manage the process and copy files between S3 buckets as needed is also included. --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-02 17:14:34 -04:00
Ilya Kreymer	e1ef894275	Extends Org Create endpont + shared secret auth (#1897 ) Updates the /api/orgs/create endpoint to: - not have name / slug be required, will be renamed on first user via #1870 - support optional quotas - support optional first admin user email, who will receive an invite to join the org. Also supports a new shared secret mechanism, to allow an external automation to access the /api/orgs/create endpoint (and only that endpoint thus far) via a shared secret instead of normal login.	2024-07-01 09:37:02 -07:00
Tessa Walsh	9140dd75bc	Add and enforce readOnly field in Organization (#1886 ) Fixes https://github.com/webrecorder/browsertrix/issues/1883 Backend work for https://github.com/webrecorder/browsertrix/issues/1876 - If readOnly is set true, disallow crawls and QA analysis runs - If readOnly is set to true, skip scheduled crawls - Add endpoint to set `readOnly` with optional `readOnlyReason` (which is automatically set back to an empty string when `readOnly` is being set to false), which can be displayed in banner - Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-06-25 19:30:53 -07:00
Tessa Walsh	879e509b39	Backend: Move page file and error counts to crawl replay.json endpoint (#1868 ) Backend work for #1859 - Remove file count from qa stats endpoint - Compute isFile or isError per page when page is added - Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages - Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount) - Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl - Determine if page is a file based on loadState == 2, mime type or status code and lack of title	2024-06-20 19:02:57 -07:00
Tessa Walsh	4edc05d503	Use standard firstSeed/seedCount fallback for workflows with no name in profile details (#1852 ) Fixes #1833 - Add firstSeed and seedCount to workflow information in profile detail API endpoint (tests updated accordingly), update name of model used for limited workflow information to be more accurate - Fix name display in Crawl Workflows list at bottom of Profile detail page to be consistent with rest of application --------- Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>	2024-06-06 14:28:19 -04:00
Tessa Walsh	7e5d742fd1	Backend: Add modified field and track created/modifier users for profiles (#1820 ) This PR introduces backend changes that add the following fields to the Profile model: - `modified` - `modifiedBy` - `modifiedByName` - `createdBy` - `createdByName` Modified fields are set to the same as the created fields when the resource is created, and changed when the profile is updated (profile itself or metadata). The list profiles endpoint now also supports `sortBy` and `sortDirection` options. The endpoint defaults to sorting by `modified` in descending order, but can also sort on `created` and `name`. Tests have also been updated to reflect all new behavior.	2024-05-28 17:25:22 -04:00
Ilya Kreymer	b061a39d5f	Handle registration when user already exists (#1744 ) Fixes https://github.com/webrecorder/browsertrix/issues/1743 On the backend, support adding user to new org and improved error messaging: - if user exists is not part of the org to be added to, add user to the registration org, return 201 - if user is already part of the org, return 400 with 'user_already_is_org_member' error - if user is not being added to a new org but already exists, return 'user_already_exists' frontend: - if user user same password, they will just be logged in and added to registration org (same as before) - if user uses wrong password, show alert message - handle both user_already_is_org_member and user_already_registered with alert message and link to log-in page. note: this means existing user is added to the registration org even if they provide wrong password, but they won't be able to login until they use correct password/reset/etc...	2024-04-24 16:40:25 +02:00
Ilya Kreymer	1844e761dc	Support sorting by last QA started time (#1712 ) To support #1683, it would be useful to be able to sort by 'last QA start time' in addition to/instead of last QA state. - make sorting consistent with workflow sorting - sortBy fields renamed to lastQAState and lastQAStarted - Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-22 13:00:52 -07:00
Tessa Walsh	80008a2853	Add post load delay to Browsertrix (#1700 ) Fixes #1699 Adds post load delay to: - Backend `RawCrawlConfig` model - Frontend (workflow editor and config details component) - Workflow setup docs	2024-04-18 20:03:47 -07:00
Ilya Kreymer	9609ff4194	Add 'activeQAStats' field (#1694 ) As additional support for #1683, include the active QA stats in the crawl response, along with active QA state. This will allow showing progress of QA run in the archived items list.	2024-04-18 10:05:39 -04:00
Tessa Walsh	30ab139ff2	Add QA run aggregate stats API endpoint (#1682 ) Fixes #1659 Takes an arbitrary set of thresholds for text and screenshot matches as a comma-separated list of floats. Returns a list of groupings for each that include the lower boundary and count for all thresholds passed in.	2024-04-17 13:24:18 -04:00
Tessa Walsh	c800da1732	Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686 ) Backend work for #1672 Adds new sort options to /crawls and /all-crawls GET list endpoints: - `reviewStatus` - `qaRunCount`: number of completed QA runs for crawl (also added to CrawlOut) - `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of which are added to CrawlOut)	2024-04-16 23:54:09 -07:00
Tessa Walsh	87e0873f1a	Add mime field to Page model (#1678 )	2024-04-17 00:57:49 -04:00
Ilya Kreymer	f243d34395	Remove pages from QA Configmap (#1671 ) Fixes #1670 No longer need to pass pages to the ConfigMap. The ConfigMap has a size limit and will fail if there are too many pages. With this change, the page list for QA will be read directly from the WACZ files pages.jsonl / extraPages.jsonl entries.	2024-04-12 16:04:33 -07:00
Tessa Walsh	172a9bf0cd	Change crawl.reviewStatus to 1-5 scale int (#1664 )	2024-04-09 17:51:06 -07:00
Ilya Kreymer	ffc4b5b58f	operator state fixes (follow up fomr #1639 ) (#1640 ) - increase time for going to waiting_capacity from starting to 150 seconds - relax requirement for state transitions, allow complete from waiting - additional type safety for different states, ensure mark_finished() only called with non-running states, add `Literal` types for all the state types.	2024-03-29 15:12:16 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586 ) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running,`QARun` data stored in a single `qa` object, while finished qa runs are added to `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Tessa Walsh	c20e754269	Add updatable QA reviewStatus field to crawls (#1575 ) Fixes #1539 Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl update API endpoint. Acceptable values are "good", "acceptable" or "failure", enforced by an Enum. Added to `BaseCrawl` so that we can extend support to uploads more easily later on, but for now we'll only display this for crawls in the frontend.	2024-03-05 16:49:23 -08:00
Tessa Walsh	14189b7cfb	Add crawl pages and related API endpoints (#1516 ) Fixes #1502 - Adds pages to database as they get added to Redis during crawl - Adds migration to add pages to database for older crawls from pages.jsonl and extraPages.jsonl files in WACZ - Adds GET, list GET, and PATCH update endpoints for pages - Adds POST (add), PATCH, and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text - Adds page_ops methods for 1. adding resources/urls to page, and 2. adding automated heuristics and supplemental info (mime, type, etc.) to page (for use in crawl QA job) - Modifies `Migration` class to accept kwargs so that we can pass in ops classes as needed for migrations - Deletes WACZ files and pages from database for failed crawls during crawl_finished process - Deletes crawl pages when a crawl is deleted Note: Requires a crawler version 1.0.0 beta3 or later, with support for `--writePagesToRedis` to populate pages at crawl completion. Beta 4 is configured in the test chart, which should be upgraded to stable 1.0.0 when it's released. Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-28 12:11:35 -05:00
Tessa Walsh	a898c2b456	Format backend with Black 24 (#1507 ) Fixes #1506	2024-02-07 11:35:34 -08:00
Tessa Walsh	b252931c71	Add scale to CrawlOut (#1487 ) Fixes #1486 `scale` is already saved in the crawl but needed to be added to `CrawlOut`.	2024-01-23 14:10:37 -08:00
Tessa Walsh	07fa46d9aa	Add custom user agent to workflows (#1465 ) Fixes #1341 Adds "User Agent" field to workflow editor under the Browser Settings tab. If not set, the crawler will use the browser's default user agent. Also added to docs and to the workflow details page (if set). --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-01-17 17:33:50 -05:00
Tessa Walsh	032859f361	Support multiple crawler versions (#1420 ) Fixes #1385 ## Changes Supports multiple crawler 'channels' which can be configured to different browsertrix-crawler versions - Replaces `crawler_image` in helm chart with `crawler_channels` array similar to how storages are handled - The `default` crawler channel must always be provided and specifies the default crawler image - Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint to fetch information about available crawler versions (name, image, and label) and test - Adds crawler channel select to workflow creation/edit screens and profile creation dialog, and updates related API endpoints and configmaps accordingly. The select dropdown is shown only if more than one channel is configured. - Adds `crawlerChannel` to workflow and crawl details. - Add `image` to crawler image, used to display actual image used as part of the crawl. - Modifies `crawler_crawl_id` backend test fixture to use `test` crawler version to ensure crawler versions other than latest work - Adds migration to add `crawlerChannel` set to `default` to existing workflow and profile objects and workflow configmaps --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-01-16 15:32:12 -08:00
Tessa Walsh	9f73bafd37	Fix browser profile name in crawl endpoints (#1464 ) Fixes #1388 Fixes browser profile name lookup by ensuring profileid is in CrawlOut model.	2024-01-14 16:30:27 -08:00
Tessa Walsh	be41c48c27	Add extra and gifted execution minutes (#1361 ) Fixes #1358 - Adds `extraExecMinutes` and `giftedExecMinutes` org quotas, which are not reset monthly but are updateable amounts that carry across months - Adds `quotaUpdate` field to `Organization` to track when quotas were updated with timestamp - Adds `extraExecMinutesAvailable` and `giftedExecMinutesAvailable` fields to `Organization` to help with tracking available time left (includes tested migration to initialize these to 0) - Modifies org backend to track time across multiple categories, using monthlyExecSeconds, then giftedExecSeconds, then extraExecSeconds. All time is also written into crawlExecSeconds, which is now the monthly total and also contains any overage time above the quotas - Updates Dashboard crawling meter to include all types of execution time if `extraExecMinutes` and/or `giftedExecMinutes` are set above 0 - Updates Dashboard Usage History table to include all types of execution time (only displaying columns that have data) - Adds backend nightly test to check handling of quotas and execution time - Includes migration to add new fields and copy crawlExecSeconds to monthlyExecSeconds for previous months Co-authored-by: emma <hi@emma.cafe>	2023-12-07 14:34:37 -05:00
Ilya Kreymer	b23eed5003	Email Templates (#1375 ) - Emails are now processed from Jinja2 templates found in `charts/email-templates`, to support easier updates via helm chart in the future. - The available templates are: `invite`, `password_reset`, `validate` and `failed_bg_job`. - Each template can be text only or also include HTML. The format of the template is: ``` subject ~~~ <html content> ~~~ text ``` - A new `support_email` field is also added to the email block in values.yaml Invite Template: - Currently, only the invite template includes an HTML version, other templates are text only. - The same template is used for new and existing users, with slightly different text if adding user to an existing org. - If user is invited by the superadmin, the invited by field is not included, otherwise it also includes 'You have been invited by X to join Y'	2023-11-15 15:22:12 -08:00
Ilya Kreymer	dfba4b3940	Replace partial_complete -> stopped_by_user or stopped_quota_reached + operator edge cases (#1368 ) - Adds two new crawl finished state, stopped_by_user and stopped_quota_reached - Tracking other possible 'stop reasons' in operator, though not making them distinct states for now. - Updated frontend with 'Stopped by User' and 'Stopped: Time Quota Reached', shown with same icon as current partial_complete - Added migration of partial_complete to either stopped_by_user or complete (no historical quota data available) - Addresses edge case in scaling: if crawl never scaled (no redis entry, no pod), automatically scale down - Edge case in status: if crawl is somehow 'canceled' but not deleted, immediately delete crawl object and begin finalizing. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-14 11:17:16 -08:00
Tessa Walsh	f3cbd9e179	Add crawl, upload, and collection delete webhook event notifications (#1363 ) Fixes #1307 Fixes #1132 Related to #1306 Deleted webhook notifications include the org id and item/collection id. This PR also includes API docs for the new webhooks and extends the existing tests to account for the new webhooks. This PR also does some additional cleanup for existing webhooks: - Remove `downloadUrls` from item finished webhook bodies - Rename collection webhook body `downloadUrls` to `downloadUrl`, since we only ever have one per collection - Fix API docs for existing webhooks, one of which had the wrong response body	2023-11-09 18:19:08 -08:00
Tessa Walsh	1afc411114	Implement retry API endpoint for failed background jobs (#1356 ) Fixes #1328 - Adds /retry endpoint for retrying failed jobs. - Returns 400 error if previous job still running or has succeeded - Keeps track of previous failed attempts in previousAttempts array on failed job. - Also amends the similar webhook /retry endpoint to use `POST` for consistency. - Remove duplicate api tag for backgroundjobs	2023-11-09 18:09:37 -08:00
Ilya Kreymer	fb3d88291f	Background Jobs Work (#1321 ) Fixes #1252 Supports a generic background job system, with two background jobs, CreateReplicaJob and DeleteReplicaJob. - CreateReplicaJob runs on new crawls, uploads, profiles and updates the `replicas` array with the info about the replica after the job succeeds. - DeleteReplicaJob deletes the replica. - Both jobs are created from the new `replica_job.yaml` template. The CreateReplicaJob sets secrets for primary storage + replica storage, while DeleteReplicaJob only needs the replica storage. - The job is processed in the operator when the job is finalized (deleted), which should happen immediately when the job is done, either because it succeeds or because the backoffLimit is reached (currently set to 3). - /jobs/ api lists all jobs using a paginated response, including filtering and sorting - /jobs/<job id> returns details for a particular job - tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints - tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 13:02:17 -07:00
Ilya Kreymer	6384d8b5f1	Additional Type Hints / Type Fix Pass (#1320 ) This PR adds more type safety to the backend codebase: - All ops classes calls should be type checked - Avoiding circular references with TYPE_CHECKING conditional - Consistent UUID usage: uuid.UUID / UUID4 with just UUID - Crawl states moved to models, made into lists - Additional typing added as needed, fixed a few type related errors - CrawlOps / UploadOps / BaseCrawlOps now all have same param init order to simplify changes	2023-10-30 12:59:24 -04:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296 ) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contain a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Tessa Walsh	38f32f11ea	Enforce quota and hard cap for monthly execution minutes (#1284 ) Fixes #1261 Closes #1092 The quota for monthly execution minutes is treated as a hard cap. Once it is exceeded, an alert indicating that an org has exceeded its monthly execution minutes will display and the user will be unable to start new crawls. Any running crawls will be stopped once the quota is exceeded. An execution minutes meter bar is also added in the Org Dashboard and displayed if a quota is set. More detail in #1305 which was merged into this branch. ## Changes - Enable setting 'maxExecMinutesPerMonth' in orgs list quotas by superadmin - Enforce quota by stopping crawls in operator once quota is reached - Show alert banner once execution time quota is hit: - Once quota is hit, disable Run Crawl buttons in frontend, return 403 message with `exec_minutes_quota_reached` detail in backend from crawl config `/run` endpoint, and don't run new workflows on creation (similar to storage quota) - Display execution time for crawls in the crawl details overview, immediately below - Show execution minutes meter on dashboard (from #1305) --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@webrecorder.org>	2023-10-26 15:38:51 -07:00

1 2

83 Commits