browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	62da0fbd6c	security: tweak get /invite endpoints / InviteOut to: (#2087 ) don't set inviterEmail / inviterName if the inviter is the superuser: - return fromSuperuser true/false - if fromSuperuser, don't set inviterEmail / inviterName - tests: add tests for non-superuser admin invites	2024-09-20 11:52:56 -07:00
Ilya Kreymer	feb6b1f26c	Ensure email comparisons are case-insensitive, emails stored as lowercase (#2084 ) (#2086 ) (fixes from 1.11.7) - Add a custom EmailStr type which lowercases the full e-mail, not just the domain. - Ensure EmailStr is used throughout wherever e-mails are used, both for invites and user models - Tests: update to check for lowercase email responses, e-mails returned from APIs are always lowercase - Tests: remove tests where '@' was ur-lencoded, should not be possible since POSTing JSON and no url-decoding is done/expected. E-mails should have '@' present. - Fixes #2083 where invites were rejected due to case differences - CI: pin pymongo dependency due to latest releases update, update python used for CI	2024-09-19 12:20:34 -07:00
Tessa Walsh	123705c53f	Serialize datetimes with Z suffix (#2058 ) Use timezone aware datetimes instead of timezone naive datetimes: - Update mongodb client to use tz-aware conversion - Convert dt_now() to return timezone aware UTC date - Rename to_k8s_date -> date_to_str, just returns ISO UTC date with 'Z' (instead of '+00:00' suffix) - Rename from_k8s_date -> str_to_date, returns timezone aware date from str - Standardize all string<->date conversion to use either date_to_str or str_to_date - Update frontend to assume iso date, not append 'Z' directly - Update tests to check for 'Z' suffix on some dates --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-09-12 16:16:13 -07:00
sua yoo	4c36c80351	feat: Display scale as number of browser windows (#2057 ) Resolves https://github.com/webrecorder/browsertrix/issues/2048 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-09-05 17:32:40 -07:00
sua yoo	337454f8c9	feat: Add link to hosted sign-up page (#2045 ) Resolves https://github.com/webrecorder/browsertrix/issues/2043 <!-- Fixes #issue_number --> ### Changes - Shows link to sign up in UI if `sign_up_url` is configured. - Expires settings in session storage (for now)	2024-08-26 17:26:25 -07:00
Ilya Kreymer	04c8b50423	add a crawling defaults on the Org to allow setting certain crawl workflow fields as defaults: (#2031 ) - add POST /orgs/<id>/defaults/crawling API to update all defaults (defaults unset are cleared) - defaults returned as 'crawlingDefaults' object on Org, if set - fixes #2016 --------- Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>	2024-08-22 10:36:04 -07:00
Ilya Kreymer	8c9a14b6a2	Ensure Subscription Update doesn't update the gifted quotas (#2012 ) - add a separate OrgQuotasIn where all quota updates are optional - ensure gifted quotas are never updated as part of org update - update tests	2024-08-20 13:15:03 -07:00
Tessa Walsh	916813af2d	Include user and user org info in login response (#2014 ) Fixes #2013 Adds the `/users/me` response data to the API login endpoint response under the key `user_info` and adds a test.	2024-08-12 18:51:42 -07:00
Ilya Kreymer	5f53db75ee	fix resetting of invalid logins: (#2002 ) * Fixes issue in FailedLogin model: - fix data-model to remove nested 'attempted.attempted' - migrate existing data to remove nested field * Also, avoid setting dt_now() in model as that results in fixed date for all objects: - update FailedLogin to update 'attempted' date on every attempt - also update PageNote object to set date in constructor * Update text for too many logins to make it clear it is set only if its a valid email * fixes #2001	2024-08-07 12:36:06 -07:00
Ilya Kreymer	41d43ae249	Fix forgot password for invalid user (#1999 ) - fix validation error if user doesn'r exist - always return success even if user doesn't exist for security reasons - add test for forgot password endpoint	2024-08-07 11:02:40 -07:00
Ilya Kreymer	1c153dfd3c	Subscription Update Quotas (#1988 ) - Follow-up to #1914, allows SubscriptionUpdate event to also update quotas. - Passes current usage info + current billing page URL to portalUrl request for external app to be able to respond with best portalUrl - get_origin() moved to utils to be available more generally. - Updates billing tab to show current plans, switches order of quotas to list execution time, storage first	2024-08-05 15:59:47 -07:00
Tessa Walsh	551660bb62	Add webhooks for qaAnalysisStarted, qaAnalysisFinished, and crawlReviewed (#1974 ) Fixes #1957 Adds three new webhook events related to QA: analysis started, analysis ended, and crawl reviewed. Tests have been updated accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 16:53:49 -07:00
Tessa Walsh	27ee16d308	Implement downloading archived item + QA runs as multi-WACZ (#1933 ) Fixes #1412 ## Changes ### Backend - Adds `all-crawls`, `crawls`, and `uploads` API endpoints to download archived item as multi-WACZ - Download QA runs as multi-WACZ - Adds backend tests for new endpoints - Update to new version of stream-zip library which does not require crc-32 to be present for ZIP members, computes after streaming, fixing invalid crc-32 issues as previously computed crc-32s from crawler may be invalid. ### Frontend Adds ability to download archived item from: - Button in archived item detail Files tab - Archived item details actions menu - Archived items list menu --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 10:28:57 -07:00
Ilya Kreymer	a8c5f07b7c	Add support e-mail to settings (#1960 ) Adds support email to /api/settings Also adds a response model for this endpoint and consolidates api tests Addresses request in #1912	2024-07-23 20:58:12 -04:00
Tessa Walsh	a02f7a6826	Ensure lexical sort for org names (#1958 ) Fixes #1955 Orgs list endpoint sorting now works as follows: - Default org is always sorted first - Name sorting now works on a lowercased version of the org names to ensure lexical sorting The lodash `sortBy` resorting of orgs in the "All Organizations" dropdown list in the nav bar has also been removed so that the backend sorting is applied instead. Tests have been updated accordingly.	2024-07-23 13:13:04 -07:00
Ilya Kreymer	8c0321bdea	Pydantic 2.x update + type fixes + python 3.12 (#1947 ) * updates pydantic to 2.x * also update to python 3.12 * additional type fixes: - all Optional[] types must have a default value - update to constrained types - URL types converted from str - test updates Fixes #1940	2024-07-22 17:23:03 -07:00
Ilya Kreymer	cd00f52cca	Fix queue response models + additional testing for queue + exclusions (#1948 ) Follow-up to regressions from #1928, this PR: - Fixes response models for queue endpoints, which had incorrect model - Adds tests for queue get, queue match, and exclusions add / remove to ensure regressions like this can be caught via tests. This involves starting a new crawl in test_run_crawls() instead of relying on implicit running via fixtures, make it easier to test crawl while it's running. - Adds additional typing for crawls apis, including making delete_crawls() have correct typing, consistent derived class override - Adds check to ensure queue + exclusion operations can not be called when crawl is not running	2024-07-22 09:00:23 -07:00
Tessa Walsh	2237120cd5	Add API endpoint to recalculate org storage (#1943 ) Fixes #1942 This process might be a bit slow for large orgs, may consider moving it to background job in #1898.	2024-07-19 18:29:20 -07:00
Tessa Walsh	6ccaad26d8	Ensure org name and slug uniqueness is case-insensitive (#1929 ) Fixes #1927 Also adds tests to ensure index is working as expected, and migration to rename orgs that have names or slugs identical to other orgs except for case before the new case-insensitive index is built.	2024-07-18 15:30:12 -07:00
Ilya Kreymer	b1ccdc4d16	OpenAPI Metadata for API Endpoints (#1941 ) - Updates the `/docs` and `/redoc` API endpoints to have better metadata, including using Browsertrix favicon and our logo for the `/redoc` endpoint. - add new logo file 'docs-logo.svg' to root Based on info at: https://fastapi.tiangolo.com/how-to/extending-openapi/ https://fastapi.tiangolo.com/tutorial/metadata/ --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-07-18 11:11:38 -07:00
Tessa Walsh	3bf7967754	Fix regression with saving new workflow due to profileid type error (#1946 ) Fixes #1945	2024-07-18 09:35:52 -07:00
Tessa Walsh	60afb19472	Add API endpoint to import subscription for existing org (#1930 ) Fixes #1926 - adds /subscriptions/import endpoint for importing an existing subscription to an existing org - add SubscriptionImport object and log as 'import' event in subscription events collection --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-16 16:17:02 -07:00
Tessa Walsh	aaf18e70a0	Add created date to Organization and fix datetimes across backend (#1921 ) Fixes #1916 - Add `created` field to Organization and OrgOut, set on org creation - Add migration to backfill `created` dates from first workflow `created` - Replace `datetime.now()` and `datetime.utcnow()` across app with consistent timezone-aware `utils.dt_now` helper function, which now uses `datetime.now(timezone.utc)`. This is in part to ensure consistency in how we handle datetimes, and also to get ahead of timezone naive datetime creation methods like `datetime.utcnow()` being deprecated in Python 3.12. For more, see: https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated	2024-07-15 19:46:32 -07:00
Tessa Walsh	a546fb6fe0	Improve handling of duplicate org name/slug (#1917 ) Initial implementation of #1892 - Modifies the backend to return `duplicate_org_name` or `duplicate_org_slug` as appropriate on a pymongo `DuplicateKeyError` - Updates frontend to handle `duplicate_org_name`, `duplicate_org_slug`, and `invalid_slug` error details - Update errors to be more consistent, also return `duplicate_org_subscription.subId` for duplicate subscription instead of the more generic `already_exists` --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-10 19:24:50 -07:00
Ilya Kreymer	9a67e28f13	Adds Subscription API (#1914 ) Fixes https://github.com/webrecorder/browsertrix/issues/1905 - adds a new top-level `/api/subscriptions` endpoint and SubOps handler on the backend. - enable subscriptions API endpoints available only if `billing_enabled` is set in helm chart - new POST /subscriptions/create, /subscriptions/update, /subscriptions/cancel API endpoints - Subscriptions mongo collection storing timestamped /subscription API events - GET /subscriptions/events API to get subscription events, support for filtering and sorting - Subscription data model - Support for setting and handling readOnlyOnCancel on org - /orgs/<id>/billing-portal to lookup portalUrl using external API - subscription in org getter and list views - mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active` --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-10 17:41:16 -07:00
sua yoo	c97900ec2b	Merge branch 'main' into frontend-org-manage-readonly	2024-07-08 11:20:30 -07:00
Tessa Walsh	192737ea99	Add API endpoint to delete org (#1448 ) Fixes #903 Adds superuser-only API endpoint to delete an org and all of its data --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-03 16:00:11 -04:00
Tessa Walsh	497cfdc561	Merge branch 'main' into frontend-org-manage-readonly	2024-07-03 11:15:47 -04:00
Tessa Walsh	d3fb33a78a	Add and apply backend sorting for org list The default org will always be sorted first, regardless of sort options. Orgs after the first will be sorted by name ascending by default. Sorting currently supported on name, slug, and readOnly.	2024-07-03 11:01:01 -04:00
Ilya Kreymer	1c42e21b8a	Refactor Invites and Registration, Flatten Per-User Invites (#1902 ) Fixes #1432 Refactors the invite + registration system to be simpler and more consistent with regards to existing user invites. Previously, per-user invites are stored in the user.invites dict instead of in the invites collection, which creates a few issues: - Existing user do not show up in Org Invites list: #1432 - Existing user invites also do not expire, unlike new user invites, creating potential security issue. Instead, existing user invites should be treated like new user invites. This PR moves them into the same collection, adding a `userid` field to InvitePending to match with an existing user. If a user already exists, it will be matched by userid, instead of by email. This allows for user to update their email while still being invited. Note that the email of the invited existing user will not change in the invite email. This is also by design: an admin of one org should not be given any hint that an invited user already has an account, such as by having their email automatically update. For an org admin, the invite to a new or existing user should be indistinguishable. The sha256 of invite token is stored instead of actual token for better security. The registration system has also been refactored with the following changes: - Auto-creation of new orgs for new users has been removed - User.create_user() replaces the old User._create() and just creates the user with additional complex logic around org auto-add - Users are added to org in org add_user_to_org() - Users are added to org through invites with add_user_with_invite() Tests: - Additional tests include verifying that existing and new pending invites appear in the pending invites list - Tests for `/users/invite/<token>?email=` and `/users/me/invite/<token>` endpoints - Deleting pending invites - Additional tests added for user self-registration, including existing user self-registration to default org of existing user (in nightly tests)	2024-07-02 15:13:27 -07:00
Tessa Walsh	f076e7d9e3	Add superuser API endpoints to export and import org data (#1394 ) Fixes #890 This PR introduces new streaming superuser-only API endpoints to export and import database information for an organization. New Adminstrator deployment documentation on how to manage the process and copy files between S3 buckets as needed is also included. --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-02 17:14:34 -04:00
Tessa Walsh	bdfc0948d3	Disable uploading and creating browser profiles when org is read-only (#1907 ) Fixes #1904 Follow-up to read-only enforcement, with improved tests.	2024-07-01 23:15:38 -07:00
Ilya Kreymer	e1ef894275	Extends Org Create endpont + shared secret auth (#1897 ) Updates the /api/orgs/create endpoint to: - not have name / slug be required, will be renamed on first user via #1870 - support optional quotas - support optional first admin user email, who will receive an invite to join the org. Also supports a new shared secret mechanism, to allow an external automation to access the /api/orgs/create endpoint (and only that endpoint thus far) via a shared secret instead of normal login.	2024-07-01 09:37:02 -07:00
Tessa Walsh	b7631d1b91	Add slug validation and test (#1891 ) Fixes #1890 Adds validation for org slugs, ensuring that they contain only ASCII alphanumeric characters and dashes (`-`). If an invalid slug is provided, an HTTPException is returned with status code 400 and detail `invalid_slug`.	2024-06-26 15:04:54 -04:00
Tessa Walsh	9140dd75bc	Add and enforce readOnly field in Organization (#1886 ) Fixes https://github.com/webrecorder/browsertrix/issues/1883 Backend work for https://github.com/webrecorder/browsertrix/issues/1876 - If readOnly is set true, disallow crawls and QA analysis runs - If readOnly is set to true, skip scheduled crawls - Add endpoint to set `readOnly` with optional `readOnlyReason` (which is automatically set back to an empty string when `readOnly` is being set to false), which can be displayed in banner - Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-06-25 19:30:53 -07:00
Tessa Walsh	7af3980323	Add billing enabled and sales email to Helm chart and /settings API endpoint (#1873 ) Backend work for first two tasks of https://github.com/webrecorder/browsertrix/issues/1875 New /billing API endpoint to be added separately once we have a better idea of what data we can get from the payment processor.	2024-06-25 10:55:29 -04:00
Tessa Walsh	879e509b39	Backend: Move page file and error counts to crawl replay.json endpoint (#1868 ) Backend work for #1859 - Remove file count from qa stats endpoint - Compute isFile or isError per page when page is added - Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages - Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount) - Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl - Determine if page is a file based on loadState == 2, mime type or status code and lack of title	2024-06-20 19:02:57 -07:00
Tessa Walsh	8b0d1432af	Show QA meter while analysis is running (#1854 ) Fixes #1846 - Ensure meter auto-updates as new stats are ready - Switch meter to new QA run when new analysis run is started - Remove Files from QA meter (files and errors will be reported separately) Co-authored-by: emma <hi@emma.cafe> Co-authored-by: sua yoo <sua@webrecorder.org>	2024-06-12 12:32:01 -04:00
Ilya Kreymer	2ffb37bd14	tests: fix typo in waiting for qa run to stop test! (#1857 ) Fixes not properly testing if activeQA is null, hopefully fixes intermittent test failures!	2024-06-11 11:07:55 -04:00
Tessa Walsh	4edc05d503	Use standard firstSeed/seedCount fallback for workflows with no name in profile details (#1852 ) Fixes #1833 - Add firstSeed and seedCount to workflow information in profile detail API endpoint (tests updated accordingly), update name of model used for limited workflow information to be more accurate - Fix name display in Crawl Workflows list at bottom of Profile detail page to be consistent with rest of application --------- Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>	2024-06-06 14:28:19 -04:00
Tessa Walsh	a85f9496b0	Include number of Identical Files in QA stats and meter (#1848 ) This PR adds Identical Files to the QA Page Match Analysis meter bars. To do this, the backend calculates the number of non-HTML pages once and includes it under the key `Files` in each of the `screenshotMatch` and `textMatch` QA stats return arrays. The backend additionally removes the file count from "No Data" to prevent these from being counted twice. --------- Co-authored-by: emma <hi@emma.cafe>	2024-06-06 13:15:19 -04:00
sua yoo	4d4c8a04d4	feat: User-sort browser profiles list (#1839 ) Resolves https://github.com/webrecorder/browsertrix/issues/1409 ### Changes - Enables clicking on Browser Profiles column header to sort the table, including by starting URL - More consistent column widths throughout app --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: emma <hi@emma.cafe> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-06-04 13:57:03 -04:00
Tessa Walsh	7e5d742fd1	Backend: Add modified field and track created/modifier users for profiles (#1820 ) This PR introduces backend changes that add the following fields to the Profile model: - `modified` - `modifiedBy` - `modifiedByName` - `createdBy` - `createdByName` Modified fields are set to the same as the created fields when the resource is created, and changed when the profile is updated (profile itself or metadata). The list profiles endpoint now also supports `sortBy` and `sortDirection` options. The endpoint defaults to sorting by `modified` in descending order, but can also sort on `created` and `name`. Tests have also been updated to reflect all new behavior.	2024-05-28 17:25:22 -04:00
Ilya Kreymer	85cd214101	Fix regression to changing user roles via PATCH /user-role API (#1824 ) clean up adding user vs changing role logic: - when adding user, ensure user doesn't exist - when changing roles, ensure user does exist add test for changing roles of existing user Fixes #1821	2024-05-24 10:41:05 -07:00
Tessa Walsh	b8caeb88e9	Ensure QA run WACZs are deleted (#1715 ) - When qa run is deleted - When crawl is deleted And adds tests for WACZ deletion. Fixes #1713	2024-04-22 18:04:09 -04:00
Ilya Kreymer	1844e761dc	Support sorting by last QA started time (#1712 ) To support #1683, it would be useful to be able to sort by 'last QA start time' in addition to/instead of last QA state. - make sorting consistent with workflow sorting - sortBy fields renamed to lastQAState and lastQAStarted - Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-22 13:00:52 -07:00
Ilya Kreymer	4360e0c1b5	Update tests with latest crawler (#1711 ) tests: use 'latest' crawler release for testing, now that 1.1.x is released.	2024-04-20 15:56:26 -07:00
Ilya Kreymer	9609ff4194	Add 'activeQAStats' field (#1694 ) As additional support for #1683, include the active QA stats in the crawl response, along with active QA state. This will allow showing progress of QA run in the archived items list.	2024-04-18 10:05:39 -04:00
Tessa Walsh	b87860c68a	Ensure /all-crawls?sortBy=qaState always sorts crawls above uploads (#1691 ) Follow-up to #1686	2024-04-17 19:14:29 -07:00
Tessa Walsh	30ab139ff2	Add QA run aggregate stats API endpoint (#1682 ) Fixes #1659 Takes an arbitrary set of thresholds for text and screenshot matches as a comma-separated list of floats. Returns a list of groupings for each that include the lower boundary and count for all thresholds passed in.	2024-04-17 13:24:18 -04:00
Tessa Walsh	c800da1732	Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686 ) Backend work for #1672 Adds new sort options to /crawls and /all-crawls GET list endpoints: - `reviewStatus` - `qaRunCount`: number of completed QA runs for crawl (also added to CrawlOut) - `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of which are added to CrawlOut)	2024-04-16 23:54:09 -07:00
Tessa Walsh	87e0873f1a	Add mime field to Page model (#1678 )	2024-04-17 00:57:49 -04:00
Vinzenz Sinapius	1b034957ff	Improve reliability of backend tests (#1675 ) - Remove globals from profile, uploads, and qa test modules in favor of fixtures - Add retries to fix intermittent test failures due to timing --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-16 14:22:41 -04:00
Tessa Walsh	172a9bf0cd	Change crawl.reviewStatus to 1-5 scale int (#1664 )	2024-04-09 17:51:06 -07:00
Tessa Walsh	4229b94736	Track failed QA runs and include in list endpoint (#1650 ) Fixes #1648 - Tracks failed QA runs in database, not only successful ones - Includes failed QA runs in list endpoint by default - Adds `skipFailed` param to list endpoint to return only successful runs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-04 18:51:06 -07:00
Tessa Walsh	00ced6dd6b	Add single page QA GET endpoint (#1635 ) Fixes #1634 Also make sure other get page endpoint without qa uses PageOut model	2024-03-27 14:57:59 -07:00
Tessa Walsh	e9895e78a2	Add additional filters to page list endpoints (#1622 ) Fixes #1617 Filters added: - reviewed: filter by page has approval or at least one note (true) or neither (false) - approved: filter by approval value (accepts list of strings, comma-separated, each of which are coerced into True, False, or None, or ignored if they are invalid values) - hasNotes: filter by has at least one note (true) or not (false) Tests have also been added to ensure that results are as expected.	2024-03-21 21:33:07 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586 ) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running,`QARun` data stored in a single `qa` object, while finished qa runs are added to `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Tessa Walsh	21ae38362e	Add endpoints to read pages from older crawl WACZs into database (#1562 ) Fixes #1597 New endpoints (replacing old migration) to re-add crawl pages to db from WACZs. After a few implementation attempts, we settled on using [remotezip](https://github.com/gtsystem/python-remotezip) to handle parsing of the zip files and streaming their contents line-by-line for pages. I've also modified the sync log streaming to use remotezip as well, which allows us to remove our own zip module and let remotezip handle the complexity of parsing zip files. Database inserts for pages from WACZs are batched 100 at a time to help speed up the endpoint, and the task is kicked off using asyncio.create_task so as not to block before giving a response. StorageOps now contains a method for streaming the bytes of any file in a remote WACZ, requiring only the presigned URL for the WACZ and the name of the file to stream.	2024-03-19 14:14:21 -07:00
Ilya Kreymer	08f6847194	Configurable Max Scale for frontend (#1557 ) Allow maximum scale option to be fully configurable via `max_crawl_scale`. Already configurable on the backend, and now exposed to the frontend via API `/api/settings` `maxCrawlScale` value. The workflow editor and workflow details are updated to allow selecting the scale up to the maxCrawlScale setting (which defaults to 3 if not set).	2024-03-11 16:21:20 -07:00
Tessa Walsh	c20e754269	Add updatable QA reviewStatus field to crawls (#1575 ) Fixes #1539 Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl update API endpoint. Acceptable values are "good", "acceptable" or "failure", enforced by an Enum. Added to `BaseCrawl` so that we can extend support to uploads more easily later on, but for now we'll only display this for crawls in the frontend.	2024-03-05 16:49:23 -08:00
Tessa Walsh	14189b7cfb	Add crawl pages and related API endpoints (#1516 ) Fixes #1502 - Adds pages to database as they get added to Redis during crawl - Adds migration to add pages to database for older crawls from pages.jsonl and extraPages.jsonl files in WACZ - Adds GET, list GET, and PATCH update endpoints for pages - Adds POST (add), PATCH, and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text - Adds page_ops methods for 1. adding resources/urls to page, and 2. adding automated heuristics and supplemental info (mime, type, etc.) to page (for use in crawl QA job) - Modifies `Migration` class to accept kwargs so that we can pass in ops classes as needed for migrations - Deletes WACZ files and pages from database for failed crawls during crawl_finished process - Deletes crawl pages when a crawl is deleted Note: Requires a crawler version 1.0.0 beta3 or later, with support for `--writePagesToRedis` to populate pages at crawl completion. Beta 4 is configured in the test chart, which should be upgraded to stable 1.0.0 when it's released. Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-28 12:11:35 -05:00
Tessa Walsh	a898c2b456	Format backend with Black 24 (#1507 ) Fixes #1506	2024-02-07 11:35:34 -08:00
Tessa Walsh	032859f361	Support multiple crawler versions (#1420 ) Fixes #1385 ## Changes Supports multiple crawler 'channels' which can be configured to different browsertrix-crawler versions - Replaces `crawler_image` in helm chart with `crawler_channels` array similar to how storages are handled - The `default` crawler channel must always be provided and specifies the default crawler image - Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint to fetch information about available crawler versions (name, image, and label) and test - Adds crawler channel select to workflow creation/edit screens and profile creation dialog, and updates related API endpoints and configmaps accordingly. The select dropdown is shown only if more than one channel is configured. - Adds `crawlerChannel` to workflow and crawl details. - Add `image` to crawler image, used to display actual image used as part of the crawl. - Modifies `crawler_crawl_id` backend test fixture to use `test` crawler version to ensure crawler versions other than latest work - Adds migration to add `crawlerChannel` set to `default` to existing workflow and profile objects and workflow configmaps --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-01-16 15:32:12 -08:00
Tessa Walsh	38a01860b8	Add API endpoints for crawl statistics (#1461 ) Fixes #1158 Introduces two new API endpoints that stream crawling statistics CSVs (with a suggested attachment filename header): - `GET /api/orgs/all/crawls/stats` - crawls from all orgs (superuser only) - `GET /api/orgs/{oid}/crawls/stats` - crawls from just one org (available to org crawler/admin users as well as superusers) Also includes tests for both endpoints.	2024-01-10 13:30:47 -08:00
Tessa Walsh	3d93d0a0d0	Add API tests for browser profiles (#1392 ) Fixes #1330	2023-11-28 10:40:58 -05:00
Ilya Kreymer	dfba4b3940	Replace partial_complete -> stopped_by_user or stopped_quota_reached + operator edge cases (#1368 ) - Adds two new crawl finished state, stopped_by_user and stopped_quota_reached - Tracking other possible 'stop reasons' in operator, though not making them distinct states for now. - Updated frontend with 'Stopped by User' and 'Stopped: Time Quota Reached', shown with same icon as current partial_complete - Added migration of partial_complete to either stopped_by_user or complete (no historical quota data available) - Addresses edge case in scaling: if crawl never scaled (no redis entry, no pod), automatically scale down - Edge case in status: if crawl is somehow 'canceled' but not deleted, immediately delete crawl object and begin finalizing. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-14 11:17:16 -08:00
Tessa Walsh	f3cbd9e179	Add crawl, upload, and collection delete webhook event notifications (#1363 ) Fixes #1307 Fixes #1132 Related to #1306 Deleted webhook notifications include the org id and item/collection id. This PR also includes API docs for the new webhooks and extends the existing tests to account for the new webhooks. This PR also does some additional cleanup for existing webhooks: - Remove `downloadUrls` from item finished webhook bodies - Rename collection webhook body `downloadUrls` to `downloadUrl`, since we only ever have one per collection - Fix API docs for existing webhooks, one of which had the wrong response body	2023-11-09 18:19:08 -08:00
Tessa Walsh	1afc411114	Implement retry API endpoint for failed background jobs (#1356 ) Fixes #1328 - Adds /retry endpoint for retrying failed jobs. - Returns 400 error if previous job still running or has succeeded - Keeps track of previous failed attempts in previousAttempts array on failed job. - Also amends the similar webhook /retry endpoint to use `POST` for consistency. - Remove duplicate api tag for backgroundjobs	2023-11-09 18:09:37 -08:00
Ilya Kreymer	c1d3beda9c	users: add case-insensitive index to maintain backwards compatibility with fastapi-users (#1319 ) follow up to #1290 Based on implementation in: https://github.com/fastapi-users/fastapi-users-db-mongodb/blob/main/fastapi_users_db_mongodb/__init__.py	2023-10-27 14:31:29 -07:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296 ) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contain a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Tessa Walsh	d58747dfa2	Provide full resources in archived items finished webhooks (#1308 ) Fixes #1306 - Include full `resources` with expireAt (as string) in crawlFinished and uploadFinished webhook notifications rather than using the `downloadUrls` field (this is retained for collections). - Set default presigned duration to one minute short of 1 week and enforce maximum supported by S3 - Add 'storage_presign_duration_minutes' commented out to helm values.yaml - Update tests --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-23 19:01:58 -07:00
Tessa Walsh	5c5ef68a8a	Prevent user from logging in after 5 consecutive failed login attempts until pw is reset (#1281 ) Fixes #1270 After 5 consecutive failed logins from the same user, we now prevent the user from logging in even with the correct password until they reset it via their email, or wait an hour. - After failure threshold is reached, all further login attempts are rejected - Attempts for invalid email addresses are also tracked - On 6th try, a reset password email is automatically sent, only once - Failed login counter resets after an hour of no further logins after last attempted login. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-20 14:10:56 -07:00
Tessa Walsh	733809b5a8	Update user names in crawls and workflows after username update (#1299 ) Fixes #1275	2023-10-19 23:34:49 -07:00
Ilya Kreymer	63291e95a5	avoid exception if 'errors' key doesn't exist (#1301 ) - avoid exception if 'errors' (or 'files' keys) don't exist (part of #1297) - ensure 'errors' list always set on output model for consistency, defaulting to empty list - fix tests for 'errors' being an empty empty list follow-up to #1300 (merging 1.7.1 release into main)	2023-10-19 14:39:54 -07:00
Ilya Kreymer	9a2787f9c4	User refactor + remove fastapi_users dependency + update fastapi (#1290 ) Fixes #1050 Major refactor of the user/auth system to remove fastapi_users dependency. Refactors users.py to be standalone and adds new auth.py module for handling auth. UserManager now works similar to other ops classes. The auth should be fully backwards compatible with fastapi_users auth, including accepting previous JWT tokens w/o having to re-login. The User data model in mongodb is also unchanged. Additional fixes: - allows updating fastapi to latest - add webhook docs to openapi (follow up to #1041) API changes: - Removing the`GET, PATCH, DELETE /users/<id>` endpoints, which were not in used before, as users are scoped to orgs. For deletion, probably auto-delete when user is removed from last org (to be implemented). - Rename `/users/me-with-orgs` is renamed to just `/users/me/` - New `PUT /users/me/change-password` endpoint with password required to update password, fixes #1269, supersedes #1272 Frontend changes: - Fixes from #1272 to support new change password endpoint. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: sua yoo <sua@suayoo.com>	2023-10-18 10:49:23 -07:00
Tessa Walsh	c5ca250f37	Add id-slug lookup and restrict slugs endpoints to superadmins (#1279 ) Fixes #1278 - Adds `GET /orgs/slug-lookup` endpoint returning `{id: slug}` for all orgs - Restricts new endpoint and existing `GET /orgs/slugs` to superadmins	2023-10-13 17:02:19 -07:00
Tessa Walsh	266afdf8d9	Add slugs to org backend (#1250 ) - Add slug field with uniqueness constraint to Organization - Use python-slugify to generate slug from name and import that in migration - Require name in all /rename and org creation requests - Auto-generate slug for new org with no slug or when /rename is called w/o a slug - Auto-generate slug for 'default-org' based on name - Add /api/orgs/slugs GET endpoint to return all slugs in use - tests: extend backend test-requirements.txt from requirements to allow testing slugify - tests: move get_redis_crawl_stats() to avoid extra dependency in utils	2023-10-10 18:30:09 -07:00
Tessa Walsh	bbdb7f8ce5	Require that all passwords are between 8 and 64 characters (#1239 ) - Require that all passwords are between 8 and 64 characters - Fixes account settings password reset form to only trigger logged-in event after successful password change. - Password validation can be extended within the UserManager's validate_password method to add or modify requirements. - Add tests for password validation	2023-10-03 18:57:46 -07:00
Tessa Walsh	e9bac4c088	API delete endpoint improvements (#1232 ) - Applies user permissions check before deleting anything in all /delete endpoints - Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete - Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete - Updates frontend notification message to Only org owners can delete other users' archived items. when a crawler user attempts to delete another users' archived items	2023-10-03 13:05:00 -07:00
sua yoo	941a75ef12	Separate seeds into a new endpoints (#1217 ) - Remove config.seeds from workflow and crawl detail endpoints - Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow - Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before) - Modify frontend to fetch seeds from new /seeds endpoints with loading indicator --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-02 10:56:12 -07:00
Anish Lakhwara	1bf531e1ec	Fix: Make Collections Public on Creation (#1213 ) - Add isPublic to Add Collection endpoint, send isPublic from frontend - Fixes #1212	2023-09-29 12:08:10 -07:00
Tessa Walsh	7a56fa23f5	Remove username lookups for crawls and workflows by storing usernames in db (#1199 ) * store usernames (createdByName, modifiedByName, startedByName) in db for workflows * store userName for userid for crawls in db * update output models to return usernames * add migration 0018 to add usernames to existing crawls and crawlconfigs * updated tests for crawl and config usernames * use async for to iterate over crawls and crawlconfigs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-28 09:37:23 -07:00
Tessa Walsh	094f27bcff	Track bytes stored per file type and include in org metrics (#1207 ) * Add bytes stored per type to org and metrics The org now tracks bytesStored by type of crawl, uploads, and browser profiles in addition to the total, and returns these values in the org metrics endpoint. A migration is added to precompute these values in existing deployments. In addition, all /metrics storage values are now returned solely as bytes, as the GB form wasn't being used in the frontend and is unnecessary. * Improve deletion of multiple archived item types via `/all-crawls` delete endpoint - Update `/all-crawls` delete test to check that org and workflow size values are correct following deletion. - Fix bug where it was always assumed only one crawl was deleted per cid and size was not tracked per cid - Add type check within delete_crawls	2023-09-22 12:55:21 -04:00
Tessa Walsh	83f80d4103	Add org metrics API endpoint (#1196 ) * Initial implementation of org metrics (This can eventually be sped up significantly by precomputing the values and storing them in the db.) * Rename storageQuota to storageQuotaBytes to be consistent * Update tests to include metrics	2023-09-19 16:24:27 -05:00
Tessa Walsh	9224f52f51	Remove config from list endpoints to speed up responses (#1193 ) * Remove config from list endpoints - Remove config field from workflow and crawl list endpoints - Add seedCount to CrawlConfigOut on backend and Workflow on frontend - Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional - Refactor workflow list in frontend to use firstSeed and seedCount - Frontend uses ListWorkflow type which is Omit<Workflow, "config">	2023-09-19 11:05:48 -05:00
Tessa Walsh	c7cd4e61fd	Increase wait to 30 seconds to ensure webhooks are sent (#1173 )	2023-09-13 20:20:47 -07:00
Tessa Walsh	7cf2b11eb7	Add event webhook tests (#1155 ) * Add success filter to webhook list GET endpoint * Add sorting to webhooks list API and add event filter * Test webhooks via echo server * Set address to echo server on host from CI env var for k3d and microk8s * Add -s back to pytest command for k3d ci * Change pytest test path to avoid hanging on collecting tests * Revert microk8s to only run on push to main	2023-09-12 22:08:40 -07:00
Ilya Kreymer	ad9bca2e92	Operator refactor to control pods + pvcs directly instead of statefulsets (#1149 ) - Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state. - Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion. - Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl - Create priority classes upto 'max_crawl_scale, configured in values.yaml - Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale, graceful stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state before adjust status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted. - Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds - Configurable Redis storage with 'redis_storage' value, set to 3Gi by default - CrawlJob deletion starts as soon as post-finish crawl operations are run - Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer - Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished) - Current resource usage added to status - Profile browser: also manage single pod directly without statefulset for consistency. - Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases). - Update to latest metacontroller (v4.11.0) - Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0) - Failed crawl logging: dd 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status - tests: check other finished states to avoid stuck in infinite loop if crawl fails - tests: disable disk utilization check, which adds unpredictability to crawl testing! fixes #1147 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-11 10:38:04 -07:00
Tessa Walsh	d2ededc895	Add and enforce org storage quota (#1106 ) * Implement in backend - Track bytesStored in org - Add migration to pre-calculate based on size of crawlfiles and profilefiles - Add methods to increase or decrease org storage when crawl or profile files are added or deleted - Include storageQuotaReached boolean in API responses that alter storage - Don't start new crawls and fail uploads if storage quota reached * Implement in frontend - Add to orgs-list quotas - Update org's storageQuotaReached based on backend endpoint responses - Disable buttons when storage quota is met - Show toast notification when attempting to run a crawl when org storage quota is met	2023-09-07 12:45:43 -04:00
Ilya Kreymer	68bc053ba0	Print crawl log to operator log (mostly for testing) (#1148 ) * log only if 'log_failed_crawl_lines' value is set to number of last lines to log from failed container --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-06 17:53:02 -07:00
Tessa Walsh	147bfd9d44	Add event webhook notifications system to backend (#1061 ) Initial set of backend API for event webhook notifications for the following events: * Crawl started (including boolean indicating if crawl was scheduled) * Crawl finished * Upload finished * Archived item added to collection * Archived item removed from collection Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and then attempted to be sent (up to a total of 5 tries per overall attempt, with an increasing backoff between, implemented via use of the backoff library, which supports async). webhook status available via /api/orgs/<oid>/webhooks (Additional testing + potential fastapi integration left in separate follow-ups Fixes #1041	2023-08-31 19:52:37 -07:00
Tessa Walsh	1aa951132c	Fix unsetting all collections via PATCH update (#1126 )	2023-08-30 18:16:21 -04:00
Tessa Walsh	f6369ee01e	Add support for collectionIds to archived item PATCH endpoints (#1121 ) * Add support for collectionIds to patch endpoints * Make update available via all-crawls/ and add test * Fix tests * Always remove collectionIds from udpate * Remove unnecessary fallback * One more pass on expected values before update	2023-08-30 10:41:30 -04:00
Tessa Walsh	e667fe2e97	Add max crawl size option to backend and frontend (#1045 ) Backend: - add 'maxCrawlSize' to models and crawljob spec - add 'MAX_CRAWL_SIZE' to configmap - add maxCrawlSize to new crawlconfig + update APIs - operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize - tests: add max crawl size tests Frontend: - Add Max Crawl Size text box Limits tab - Users enter max crawl size in GB, convert to bytes - Add BYTES_PER_GB as constant for converting to bytes - docs: Crawl Size Limit to user guide workflow setup section Operator Refactor: - use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing later has no effect in operator - add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached - crawlerjob status: store byte size under 'size', human readable size under 'sizeHuman' for clarity - size stat always exists so remove unneeded conditional (defaults to 0) - store raw byte size in 'size', human readable size in 'sizeHuman' Charts: - subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping - subchart: show 'sizeHuman' property instead of 'size' - bump subchart version to 0.1.1 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-08-26 22:00:37 -07:00
Anish Lakhwara	8b16124675	feat: implement 'collections' array with {name, id} for archived item details (#1098 ) - rename 'collections' -> 'collectionIds', adding migration 0014 - only populate 'collections' array with {name, id} pair for get_crawl() / single archived item path, but not for aggregate/list methods - remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version - ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl()) - tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test - frontend: change Crawl object to have collectionIds instead of collections --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-08-25 00:26:46 -07:00
sua yoo	37733483d5	Standardize archived item filtering, sorting and labels (#1054 ) Frontend: - Renames list view to "All Archived Items" - Refactors fetches to use single all-crawls endpoints - Removes search by config ID for more search parity with uploads - Adds sort by size - Refactors property and method names to replace crawl* - Replaces remaining references to "crawl" in copy with "item"' - Rename Upload Archive button to Upload WACZ - Fix focusout in item menu so menus close Backend: - Filter search values by type as well - Only get list of cids for crawls in search values - Don't list crawl/workflow ids in search values --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-08-09 12:13:55 -07:00
Ilya Kreymer	8d0a4f2ca9	fix public collections endpoint returning 404 when not public (#1052 ) tests: add tests for public collections endpoint when collection is public and when not	2023-08-04 13:29:13 -04:00
Tessa Walsh	7ff57ce6b5	Backend: standardize search values, filters, and sorting for archived items (#1039 ) - all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in - Uploads list endpoint now includes all all-crawls filters relevant to uploads - An all-crawls/search-values endpoint is added to support searching across all archived item types - Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI) - Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment - New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass - Tests coverage added for all new all-crawls endpoints, filters, and sort values	2023-08-04 09:56:52 -07:00
Ilya Kreymer	6506965d98	Streaming Download for Collections (#1012 ) * support streaming download of collections (part of #927) - WACZ zip created on the fly using stream-zip - add 'Download Collection' option to collection detail and list - after editing collection, return to collection view - tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used --------- Co-authored-by: sua yoo <sua@suayoo.com>	2023-07-26 15:42:17 -07:00

1 2 3 4 5

218 Commits