browsertrix

Author	SHA1	Message	Date
Tessa Walsh	190bdeb868	Add public API endpoint for public collections (#2174 ) Fixes #1051 If org with provided slug doesn't exist or no public collections exist for that org, return same 404 response with a detail of "public_profile_not_found" to prevent people from using public endpoint to determine whether an org exists. Endpoint is `GET /api/public-collections/<org-slug>` (no auth needed) to avoid collisions with existing org and collection endpoints.	2025-01-13 15:15:48 -08:00
Tessa Walsh	42ebfd303d	Make changes to collections to support publicly listed collections (#2164 ) Fixes #2158 - Adds `Organization.listPublicCollections` field and API endpoint to update it - Replaces `Collection.isPublic` boolean with `Collection.access` (values: `private`, `unlisted`, `public`) and add database migration - Update frontend to use `Collection.access` instead of `isPublic`, otherwise not changing current behavior --------- Co-authored-by: sua yoo <sua@suayoo.com>	2025-01-13 15:15:47 -08:00
Ilya Kreymer	a21b2ff0df	version: bump to 1.13.2	2025-01-08 22:58:33 -08:00
Tessa Walsh	589819682e	Optionally delay replica deletion (#2252 ) Fixes #2170 The number of days to delay file replication deletion by is configurable in the Helm chart with `replica_deletion_delay_days` (set by default to 7 days in `values.yaml` to encourage good practice, though we could change this). When `replica_deletion_delay_days` is set to an int above 0, when a delete replica job would otherwise be started as a Kubernetes Job, a CronJob is created instead with a cron schedule set to run yearly, starting x days from the current moment. This cronjob is then deleted by the operator after the job successfully completes. If a failed background job is retried, it is re-run immediately as a Job rather than being scheduled out into the future again. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-12-19 18:50:28 -08:00
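The delayed schedule described in the commit above can be sketched as follows. This is a minimal illustration, not the operator's actual code: the function name `delayed_yearly_schedule` and its parameters are hypothetical, and it only shows how a yearly five-field cron expression can be pinned to "now plus N days".

```python
from datetime import datetime, timedelta, timezone


def delayed_yearly_schedule(now: datetime, delay_days: int) -> str:
    """Return a five-field cron expression that first fires delay_days after now.

    Hypothetical helper: fixing minute, hour, day-of-month and month makes the
    schedule fire once a year, so the first run lands delay_days from now.
    """
    run_at = now + timedelta(days=delay_days)
    # cron field order: minute hour day-of-month month day-of-week
    return f"{run_at.minute} {run_at.hour} {run_at.day} {run_at.month} *"


now = datetime(2024, 12, 19, 18, 50, tzinfo=timezone.utc)
print(delayed_yearly_schedule(now, 7))  # 50 18 26 12 *
```

Because the CronJob is deleted after the job completes, the yearly repetition never actually happens; the schedule only serves to push the first run into the future.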
Ilya Kreymer	2060ee78b4	Support Presigning for use with custom domain (#2249 ) If access_endpoint_url is provided: - Use virtual host addressing style, so presigned URLs are of the form `https://bucket.s3-host.example.com/path/` instead of `https://s3-host.example.com/bucket/path/` - Allow for replacing `https://bucket.s3-host.example.com/path/` -> `https://my-custom-domain.example.com/path/`, where `https://my-custom-domain.example.com/path/` is the access_endpoint_url - Remove old `use_access_for_presign` which is no longer used - Fixes #2248 - docs: update deployment docs storages section to mention custom storages, access_endpoint_url --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-12-19 18:41:47 -08:00
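The custom-domain replacement described in the commit above amounts to a prefix swap on the presigned URL. The sketch below assumes a hypothetical helper `rewrite_presigned_url` and reuses the example hosts from the commit message; the real backend may structure this differently.

```python
def rewrite_presigned_url(
    presigned_url: str, virtual_host_prefix: str, access_endpoint_url: str
) -> str:
    """Swap the virtual-host prefix for the custom access endpoint.

    Hypothetical helper: the path remainder and the signature query string
    are left untouched, only the leading prefix changes.
    """
    if access_endpoint_url and presigned_url.startswith(virtual_host_prefix):
        return access_endpoint_url + presigned_url[len(virtual_host_prefix):]
    # No custom endpoint configured, or the URL has an unexpected prefix
    return presigned_url


url = "https://bucket.s3-host.example.com/path/file.wacz?X-Amz-Signature=abc"
print(
    rewrite_presigned_url(
        url,
        "https://bucket.s3-host.example.com/path/",
        "https://my-custom-domain.example.com/path/",
    )
)
```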
Ilya Kreymer	8e375335cd	Related crawljob filtering by role (#2262 ) add filtering by role to related crawljobs query: - for regular crawls (role 'job'), only count other regular crawls - for qa runs (role 'qa-job') only count other qa jobs - ensures that concurrent crawl limits apply separately to regular crawls and qa runs - fixes #2261	2024-12-19 17:20:15 -08:00
Ilya Kreymer	60d07762be	version: bump to 1.13.1	2024-12-19 12:01:47 -08:00
Ilya Kreymer	cf60c43df2	version: bump to 1.13.0! (#2242 )	2024-12-13 20:32:38 -08:00
Emma Segal-Grossman	b650762a45	Allow configuring available languages from helm chart (#2230 ) Closes #2223 - [x] Adds `localesAvailable` to `/api/settings` endpoint, and uses that list if available, rather than the full list of translated locales, to determine which options to display to users - [x] ~~Uses the user's browser locales, filtered to the current language setting, for formatting numbers, dates, and durations~~ - [x] Adds & persists checkbox for "use same language for formatting dates and numbers" in user settings - [x] Replaces uses of `sl-format-bytes` with `localize.bytes(...)`, and `sl-format-date` with replacement `btrix-format-date` that properly handles fallback locales - [x] Caches all number/duration/datetime formatters by a combined key consisting of app language, browser language, browser setting, and formatter options so that all formatters can be reused if needed (previously any formatter with non-default options would be recreated every render) - [x] Splits out ordinal formatting from number formatter, as it didn't make much sense in some non-English locales - [x] Adds a little demo of date/time/duration/number formatting so you can see what effect your language settings have https://github.com/user-attachments/assets/724858cb-b140-4d72-a38d-83f602c71bc7 --------- Signed-off-by: emma <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-12-13 22:31:26 -05:00
Ilya Kreymer	db39333ef4	Send subscription cancelation email (#2234 ) Adds sending a cancellation email when a subscription is cancelled. - The email may also include an optional survey URL, if configured in helm chart `survey_url` setting. - Cancellation e-mail configured in `sub_cancel` e-mail template - E-mails are sent to all org admins. - Also adds `trialing_canceled` subscription state to differentiate from a default `trialing` which will automatically rollover into `active`. - The email is sent when: a new cancellation date is added for an `active` subscription, or a `trialing` subscription is changed to `trialing_canceled`. (A subscription can be canceled/uncanceled several times before actual date, and e-mail is sent every time it is canceled.) - The 'You have X days left of your trial' is also always displayed when state is in trialing_canceled. Fixes #2229 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-12-12 11:52:38 -08:00
Tessa Walsh	b7604ee61d	Add superuser endpoint to get user emails with org info (#2211 ) Fixes #2203 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-12-09 16:38:01 -08:00
Tessa Walsh	661e5d9fae	Fix issue with failed background job emails not being sent (#2187 ) Fixes #2186 Background job emails will no longer fail to send for jobs unrelated to file replication or replica deletion. Also uses `AnyJob` for paginated background job response model, to fix typing being out of date following addition of other types of background jobs and lower overhead for adding new ones moving forward.	2024-11-27 17:00:35 -08:00
Ilya Kreymer	50dac7dc50	1.12.2 release -> main (#2181 ) Merge 1.12.2 release changes into main, includes: - Collection replay full refresh on metadata / archived items (#2176) - Fix for self-registration default org (#2178) - Prepend missing https in start URL (#2177) - Updated billing to support free trial messaging (#2179) --------- Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: SuaYoo <SuaYoo@users.noreply.github.com>	2024-11-26 11:17:07 -08:00
Tessa Walsh	ba5ca3fdd9	Move org storage recalculation into background job (#2138 ) Fixes #2112 - Moves org storage recalculation to background job, modify endpoint to return job id as part of response - Updates crawl + QA backend tests that broke due to https://webrecorder.net website changes --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-11-19 17:32:57 -05:00
Henry Wilkinson	74161f8477	Update Webrecorder.net links (#2120 ) - Updates documentation links to point to new Browsertrix landing page - Updates redoc links	2024-10-31 16:33:54 -04:00
Tessa Walsh	55a758f342	Consolidate ops class initialization (#2117 ) Fixes #2111 The background job and operator entrypoints now use a shared function that initializes and returns the ops classes. This is not applied in the main entrypoint as that also initializes the backend API, which we don't want in the other entrypoints.	2024-10-30 15:33:22 -04:00
Tessa Walsh	f7426cc46a	Fix nightly tests: modify kubectl exec syntax for creating new minio bucket (#2097 ) Fixes #2096 For example failing test run, see: https://github.com/webrecorder/browsertrix/actions/runs/11121185534/job/30899729448 --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-10-21 17:41:19 -07:00
Tessa Walsh	1b1819ba5a	Move org deletion to background job with access to backend ops classes (#2098 ) This PR introduces background jobs that have full access to the backend ops classes and moves the delete org job to a background job.	2024-10-10 14:41:05 -04:00
Ilya Kreymer	84a74c43a4	version: bump to 1.13.0-beta.0	2024-10-10 11:38:13 -07:00
Ilya Kreymer	6032e28231	fix: firstOrgAdmin being set to true even if invite was not for an admin (#2110 ) Non-admin users should not be given option to rename org when invited to a new org: - set firstOrgAdmin to true only when invite is for an admin - default to false instead of null - update tests to check	2024-10-08 16:42:30 -07:00
Ilya Kreymer	8192e5bed6	version: bump to 1.12.0	2024-10-03 16:45:54 -07:00
Ilya Kreymer	104ea097c4	switch to simpler streaming download + multiwacz metadata improvements: (#1982 ) - download via presigned URLs via requests instead of boto APIs, remove boto - follow-up to #1933 for streaming download improvements - fixes datapackage.json in multi-wacz to contain the same resources objects with: `name`, `path`, `hash`, `bytes` to match single WACZ. - Add additional metadata to multi-wacz datapackage.json, including `type` (`crawl`, `upload`, `collection`, `qaRun`), `id` (unique id for the object), `title` / `description` if available (for crawl/upload/collection), and `crawlId` for `qaRun`	2024-10-03 16:13:31 -07:00
Vinzenz Sinapius	bb6e703f6a	Configure browsertrix proxies (#1847 ) Resolves #1354 Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires browsertrix crawler 1.3+) Config: - proxies defined in btrix-proxies subchart - can be configured via btrix-proxies key or separate proxies.yaml file via separate subchart - proxies list refreshed automatically if crawler_proxies.json changes if subchart is deployed - support for ssh and socks5 proxies - proxy keys added to secrets in subchart - support for default proxy to be always used if no other proxy configured, prevent starting cluster if default proxy not available - prevent starting manual crawl if previously configured proxy is no longer available, return error - force 'btrix' username and group name on browsertrix-crawler non-root user to support ssh Operator: - support crawling through proxies, pass proxyId in CrawlJob - support running profile browsers with designated proxy, pass proxyId to ProfileJob - prevent starting scheduled crawl if previously configured proxy is no longer available API / Access: - /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only) - /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org - /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only) - superadmin can configure which orgs can use which proxies, stored on the org - superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org. UI: - Superadmin has 'Edit Proxies' dialog to configure for each org if it has: dedicated proxies, has access to shared proxies. - User can select a proxy in Crawl Workflow browser settings - Users can choose to launch a browser profile with a particular proxy - Display which proxy is used to create profile in profile selector - Users can choose which default proxy to use for new workflows in Crawling Defaults --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-10-02 18:35:45 -07:00
Ilya Kreymer	62da0fbd6c	security: tweak get /invite endpoints / InviteOut to: (#2087 ) don't set inviterEmail / inviterName if the inviter is the superuser: - return fromSuperuser true/false - if fromSuperuser, don't set inviterEmail / inviterName - tests: add tests for non-superuser admin invites	2024-09-20 11:52:56 -07:00
Ilya Kreymer	feb6b1f26c	Ensure email comparisons are case-insensitive, emails stored as lowercase (#2084 ) (#2086 ) (fixes from 1.11.7) - Add a custom EmailStr type which lowercases the full e-mail, not just the domain. - Ensure EmailStr is used throughout wherever e-mails are used, both for invites and user models - Tests: update to check for lowercase email responses, e-mails returned from APIs are always lowercase - Tests: remove tests where '@' was url-encoded, should not be possible since POSTing JSON and no url-decoding is done/expected. E-mails should have '@' present. - Fixes #2083 where invites were rejected due to case differences - CI: pin pymongo dependency due to latest releases update, update python used for CI	2024-09-19 12:20:34 -07:00
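The difference between lowercasing only the domain and lowercasing the full address, as the commit above describes, can be illustrated with plain string handling. This is a sketch with hypothetical helper names; it does not reproduce the custom `EmailStr` type itself.

```python
def normalize_email(email: str) -> str:
    """Lowercase the whole address (local part and domain) and trim whitespace."""
    return email.strip().lower()


def domain_only_lowercase(email: str) -> str:
    """For contrast: lowercase only the domain, leaving the local part as typed."""
    local, _, domain = email.strip().rpartition("@")
    return f"{local}@{domain.lower()}"


invited = "Tessa.Walsh@Example.COM"
typed = "tessa.walsh@example.com"

# Domain-only normalization still treats the two addresses as different
print(domain_only_lowercase(invited) == domain_only_lowercase(typed))  # False
# Full lowercasing makes the comparison case-insensitive
print(normalize_email(invited) == normalize_email(typed))  # True
```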
Tessa Walsh	123705c53f	Serialize datetimes with Z suffix (#2058 ) Use timezone aware datetimes instead of timezone naive datetimes: - Update mongodb client to use tz-aware conversion - Convert dt_now() to return timezone aware UTC date - Rename to_k8s_date -> date_to_str, just returns ISO UTC date with 'Z' (instead of '+00:00' suffix) - Rename from_k8s_date -> str_to_date, returns timezone aware date from str - Standardize all string<->date conversion to use either date_to_str or str_to_date - Update frontend to assume iso date, not append 'Z' directly - Update tests to check for 'Z' suffix on some dates --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-09-12 16:16:13 -07:00
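A minimal sketch of the two conversions named in the commit above, assuming timezone-aware inputs. The function names come from the commit message, but the bodies here are an illustration and may differ from the backend's implementation.

```python
from datetime import datetime, timezone


def date_to_str(dt: datetime) -> str:
    """Serialize a timezone-aware datetime as ISO 8601 UTC with a 'Z' suffix."""
    # Assumes dt is timezone-aware; a naive datetime would be read as local time
    return dt.astimezone(timezone.utc).isoformat().replace("+00:00", "Z")


def str_to_date(value: str) -> datetime:
    """Parse an ISO 8601 string (with 'Z' or an explicit offset) into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


dt = datetime(2024, 9, 12, 23, 16, 13, tzinfo=timezone.utc)
print(date_to_str(dt))  # 2024-09-12T23:16:13Z
print(str_to_date("2024-09-12T23:16:13Z") == dt)  # True
```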
Ilya Kreymer	c242bb96d2	version: bump to 1.12.0-beta.0	2024-09-12 14:30:15 -07:00
Ilya Kreymer	1f919de294	Allow custom auto-resize crawler volume ratio adjustable (#2076 ) Make the avail / used storage ratio (for crawler volumes) adjustable. Disable auto-resize if set to 0. Follow-up to #2023	2024-09-12 09:28:19 -07:00
sua yoo	4c36c80351	feat: Display scale as number of browser windows (#2057 ) Resolves https://github.com/webrecorder/browsertrix/issues/2048 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-09-05 17:32:40 -07:00
Ilya Kreymer	b3c1195878	version: bump to 1.11.6	2024-09-05 17:31:10 -07:00
Ilya Kreymer	ea252e8da9	version: bump to 1.11.5	2024-08-27 10:00:53 -07:00
sua yoo	337454f8c9	feat: Add link to hosted sign-up page (#2045 ) Resolves https://github.com/webrecorder/browsertrix/issues/2043 ### Changes - Shows link to sign up in UI if `sign_up_url` is configured. - Expires settings in session storage (for now)	2024-08-26 17:26:25 -07:00
Ilya Kreymer	95969ec747	Attempt to auto-adjust storage if usage is running out while crawl is running (#2023 ) Attempt to auto-adjust PVC storage if: - used storage (as reported in redis by the crawler) * 2.5 > total_storage - will cause PVC to resize, if possible (not supported by all drivers) - uses multiples of 1Gi, rounding up to next GB - AVAIL_STORAGE_RATIO hard-coded to 2.5 for now, to account for 2x space for WACZ plus change for fast updating crawls Some caveats: - only works if the storageClass used for PVCs has `allowVolumeExpansion: true`, if not, it will have no effect - designed as a last resort option: the `crawl_storage` in values and `--sizeLimit` and `--diskUtilization` should generally result in this not being needed. - can be useful in cases where a crawl is rapidly capturing a lot of content in one page, and there's no time to interrupt / restart, since the other limits apply only at page end. - May want to have crawler update the disk usage more frequently, not just at page end to make this more effective.	2024-08-26 14:19:20 -07:00
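The resize rule in the commit above can be sketched as a small calculation. The constant name `AVAIL_STORAGE_RATIO` and the 2.5 value come from the commit message; the function name, its signature, and the choice of binary gibibytes for rounding are assumptions made for illustration.

```python
import math
from typing import Optional

GIB = 1024**3
AVAIL_STORAGE_RATIO = 2.5  # value stated in the commit message


def desired_storage_bytes(used_bytes: int, total_bytes: int) -> Optional[int]:
    """Return a new PVC size in bytes, or None if no resize is needed.

    Hypothetical helper: a resize is requested only when used * ratio exceeds
    the current total, and the new size is rounded up to a whole multiple of 1Gi.
    """
    needed = used_bytes * AVAIL_STORAGE_RATIO
    if needed <= total_bytes:
        return None
    return math.ceil(needed / GIB) * GIB


print(desired_storage_bytes(4 * GIB, 20 * GIB))   # None, 10Gi needed fits in 20Gi
print(desired_storage_bytes(10 * GIB, 20 * GIB) // GIB)  # 25
```

As the caveats in the commit note, a resize request like this has an effect only when the storage class allows volume expansion.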
Ilya Kreymer	a1df689729	stats recompute fixes: (#2022 ) - fix stats_recompute_last() and stats_recompute_all() to not update the lastCrawl* properties of a crawl workflow if a crawl is running, as those stats now point to the running crawl - refactor _add_running_curr_crawl_stats() to make it clear stats only updated if crawl is running - stats_recompute_all() change order to ascending to actually get last crawl, not first!	2024-08-26 14:18:59 -07:00
Ilya Kreymer	135c97419d	version: update to 1.11.4	2024-08-26 12:31:56 -07:00
Ilya Kreymer	96e393e80d	update crawler channel fix: add crawlerChannel to update check (#2046 ) Add missing check for crawlerChannel update	2024-08-26 10:41:54 -04:00
Ilya Kreymer	04c8b50423	add a crawling defaults on the Org to allow setting certain crawl workflow fields as defaults: (#2031 ) - add POST /orgs/<id>/defaults/crawling API to update all defaults (defaults unset are cleared) - defaults returned as 'crawlingDefaults' object on Org, if set - fixes #2016 --------- Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>	2024-08-22 10:36:04 -07:00
Ilya Kreymer	86c9e538c1	quickfix: webhooks: ensure the 'crawl_reviewed' webhook is sent async, doesn't delay submitting a review (#2033 ) make the call to `create_crawl_reviewed_notification` be called with create_task (similar to other user-initiated webhook events), to avoid extra wait for webhook to complete	2024-08-20 17:50:18 -07:00
Ilya Kreymer	8c9a14b6a2	Ensure Subscription Update doesn't update the gifted quotas (#2012 ) - add a separate OrgQuotasIn where all quota updates are optional - ensure gifted quotas are never updated as part of org update - update tests	2024-08-20 13:15:03 -07:00
Tessa Walsh	916813af2d	Include user and user org info in login response (#2014 ) Fixes #2013 Adds the `/users/me` response data to the API login endpoint response under the key `user_info` and adds a test.	2024-08-12 18:51:42 -07:00
Ilya Kreymer	d9f49afcc5	type fixes on util functions (#2009 ) Some additional typing for util.py functions and resultant changes	2024-08-12 10:54:45 -07:00
Ilya Kreymer	12f994b864	QA: Count QA execution minutes separately for now (#2011 ) For now, keep QA exec time separate, as it may be scaled differently and currently still in beta.	2024-08-09 13:13:21 -07:00
Ilya Kreymer	4ec7cf8adc	Additional operator edge case fixes (#2007 ) Fix a few edge-case situations: - Restart evicted pods that have reached the terminal `Failed` state with reason `Evicted`, by just recreating them. These pods will not be automatically retried, so need to be recreated (usually happens due to memory pressure from the node) - Don't treat containers in ContainerCreating as running: even though this state is usually quick, it's possible for containers to get stuck there, and excluding it will improve accuracy of exec seconds tracking. - Consolidate state transition for running states, either sets to running or to pending-wait/generate-wacz/upload-wacz and allows changing to any of these states from each other or from waiting_capacity	2024-08-09 13:12:25 -07:00
Ilya Kreymer	8ff1ad39a7	version: bump to 1.11.3	2024-08-08 15:16:18 -07:00
Ilya Kreymer	ed9038fbdb	version: bump to 1.11.2	2024-08-07 12:37:26 -07:00
Ilya Kreymer	5f53db75ee	fix resetting of invalid logins: (#2002 ) * Fixes issue in FailedLogin model: - fix data-model to remove nested 'attempted.attempted' - migrate existing data to remove nested field * Also, avoid setting dt_now() in model as that results in fixed date for all objects: - update FailedLogin to update 'attempted' date on every attempt - also update PageNote object to set date in constructor * Update text for too many logins to make it clear it is set only if it's a valid email * fixes #2001	2024-08-07 12:36:06 -07:00
Ilya Kreymer	41d43ae249	Fix forgot password for invalid user (#1999 ) - fix validation error if user doesn't exist - always return success even if user doesn't exist for security reasons - add test for forgot password endpoint	2024-08-07 11:02:40 -07:00
Ilya Kreymer	7fa2b61b29	Execution time tracking tweaks (#1994 ) Tweaks to how execution time is tracked for more accuracy + excluding waiting states: - don't update if crawl state is in a 'waiting state' (waiting for capacity or waiting for org limit) - rename start states -> waiting states for clarity - reset lastUpdatedTime if two consecutive updates of non-running state, to ensure non-running states don't count, but also account for occasional hiccups -- if only one update detects non-running state, don't reset - webhooks: move start webhook to when crawl actually starts for first time (db lastUpdatedTime is not yet + crawl is running) - don't set lastUpdatedTime until pods actually running - set crawljob update interval to every 10 seconds for more accurate execution time tracking - frontend: show seconds in 'Execution Time' display	2024-08-06 09:44:44 -07:00
Ilya Kreymer	4a2725aaa6	operator: adjust state transition rules to ensure 'running' state always accounted for in db (#1989 ) don't rely on current status, always set state to running when running to ensure idempotency in case of multiple calls	2024-08-05 16:00:21 -07:00
Ilya Kreymer	1c153dfd3c	Subscription Update Quotas (#1988 ) - Follow-up to #1914, allows SubscriptionUpdate event to also update quotas. - Passes current usage info + current billing page URL to portalUrl request for external app to be able to respond with best portalUrl - get_origin() moved to utils to be available more generally. - Updates billing tab to show current plans, switches order of quotas to list execution time, storage first	2024-08-05 15:59:47 -07:00
Ilya Kreymer	0c29008b7d	version: bump to 1.11.1	2024-07-30 11:23:41 -07:00
Ilya Kreymer	894aa29d4b	remove crc32 from CrawlFile (#1980 ) - no longer being used with latest stream-zip - was not computed correctly in the crawler - counterpart to webrecorder/browsertrix-crawler#657 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-30 11:23:15 -07:00
Ilya Kreymer	4aca107710	version: bump to 1.11.0	2024-07-29 12:52:39 -07:00
Ilya Kreymer	e9aeff1836	add a 'stopped_org_readonly' state for crawls that are running while org is made read-only (#1977 ) an org is made read-only while crawls are running: - treat similar to other stopped_* states, do a graceful stop - update UI to display "Stopped: Crawling Disabled" for this status - don't add corresponding skipped status - just skip running crawls if org is read-only	2024-07-29 12:24:40 -07:00
Ilya Kreymer	96691a33fa	Fix for cronjob skipping response (#1976 ) If a cronjob is disabled, the operator should quickly return a success value so that the job can be terminated. Was previously returning an incorrect response, causing disabled cronjobs to not be cleaned up. Add proper typing to always return correct response	2024-07-29 12:24:18 -07:00
Tessa Walsh	551660bb62	Add webhooks for qaAnalysisStarted, qaAnalysisFinished, and crawlReviewed (#1974 ) Fixes #1957 Adds three new webhook events related to QA: analysis started, analysis ended, and crawl reviewed. Tests have been updated accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 16:53:49 -07:00
Ilya Kreymer	94e985ae13	optimize org quota lookups (#1973 ) - instead of looking up storage and exec min quotas from oid, and loading an org each time, load org once and then check quotas on the org object - many times the org was already available, and was looked up again - storage and exec quota checks become sync - rename can_run_crawl() to more generic can_write_data(), optionally also checks exec minutes - typing: get_org_by_id() always returns org, or throws, adjust methods accordingly (don't check for none, catch exception) - typing: fix typo in BaseOperator, catch type errors in operator 'org_ops' - operator quota check: use up-to-date 'status.size' for current job, ignore current job in all jobs list to avoid double-counting - follow up to #1969	2024-07-25 14:00:16 -07:00
Tessa Walsh	d38abbca7f	Standardize handling of storage and execution time quotas (#1969 ) Fixes #1968 Changes: - `stopped_quota_reached` and `skipped_quota_reached` migrated to new values that indicate which quota was reached - Before crawls are run, the operator checks if storage or exec mins quotas are reached and if so fails the crawl with the appropriate state of `skipped_storage_quota_reached` or `skipped_time_quota_reached` - While crawls are running, the operator checks if the exec mins quota is reached or if the size of all running crawls will mean the storage quota is reached once uploaded; if so, the crawl is stopped gracefully and given `stopped_storage_quota_reached` or `stopped_time_quota_reached` state as appropriate - Adds new nightly tests for enforcing storage quota	2024-07-25 12:49:11 -07:00
Tessa Walsh	27ee16d308	Implement downloading archived item + QA runs as multi-WACZ (#1933 ) Fixes #1412 ## Changes ### Backend - Adds `all-crawls`, `crawls`, and `uploads` API endpoints to download archived item as multi-WACZ - Download QA runs as multi-WACZ - Adds backend tests for new endpoints - Update to new version of stream-zip library which does not require crc-32 to be present for ZIP members, computes after streaming, fixing invalid crc-32 issues as previously computed crc-32s from crawler may be invalid. ### Frontend Adds ability to download archived item from: - Button in archived item detail Files tab - Archived item details actions menu - Archived items list menu --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 10:28:57 -07:00
Ilya Kreymer	01ddf95a56	allow disabling of auto-resize of crawler pods (#1964 ) - only enable if 'enable_auto_resize' is true, default to false - if true, set memory limit to 1.2 of memory requests, resize when hitting 'soft oom' of initial request, adjust by 1.2 (current behavior) up to max_crawler_memory - if false, set memory limit to max_crawler_memory and never adjust memory requests or memory limits - part of #1959	2024-07-23 21:00:40 -04:00
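The two memory-limit modes in the commit above can be sketched as one function. The setting names `enable_auto_resize` and `max_crawler_memory` and the 1.2 factor come from the commit message; the function itself, and capping the scaled limit at the maximum, are assumptions for illustration.

```python
def crawler_memory_limit(
    memory_request: int, max_crawler_memory: int, enable_auto_resize: bool
) -> int:
    """Return the memory limit (same unit as the inputs) for a crawler pod.

    Hypothetical helper: with auto-resize on, the limit is 1.2x the request
    (capped at the maximum here, which is an assumption); with auto-resize
    off, the limit is simply the maximum and is never adjusted.
    """
    if enable_auto_resize:
        # Integer arithmetic for the 1.2 factor avoids float rounding
        return min(memory_request * 12 // 10, max_crawler_memory)
    return max_crawler_memory


print(crawler_memory_limit(1000, 4000, enable_auto_resize=True))   # 1200
print(crawler_memory_limit(1000, 4000, enable_auto_resize=False))  # 4000
```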
Ilya Kreymer	a8c5f07b7c	Add support e-mail to settings (#1960 ) Adds support email to /api/settings Also adds a response model for this endpoint and consolidates api tests Addresses request in #1912	2024-07-23 20:58:12 -04:00
Tessa Walsh	a02f7a6826	Ensure lexical sort for org names (#1958 ) Fixes #1955 Orgs list endpoint sorting now works as follows: - Default org is always sorted first - Name sorting now works on a lowercased version of the org names to ensure lexical sorting The lodash `sortBy` resorting of orgs in the "All Organizations" dropdown list in the nav bar has also been removed so that the backend sorting is applied instead. Tests have been updated accordingly.	2024-07-23 13:13:04 -07:00
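The ordering rule in the commit above (default org first, then names compared in lowercase) maps onto a two-part sort key. The `Org` dataclass and `sort_orgs` function below are hypothetical stand-ins for the backend's models and query logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Org:
    """Hypothetical stand-in for the Organization model."""

    name: str
    default: bool = False


def sort_orgs(orgs: List[Org]) -> List[Org]:
    """Sort with the default org first, then by lowercased name."""
    # False sorts before True, so "not default" puts the default org on top
    return sorted(orgs, key=lambda org: (not org.default, org.name.lower()))


orgs = [Org("beta"), Org("Alpha"), Org("Zeta", default=True), Org("apple")]
print([org.name for org in sort_orgs(orgs)])  # ['Zeta', 'Alpha', 'apple', 'beta']
```

Without the lowercasing step, a plain string sort would place every capitalized name ahead of every lowercase one.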
Ilya Kreymer	8c0321bdea	Pydantic 2.x update + type fixes + python 3.12 (#1947 ) * updates pydantic to 2.x * also update to python 3.12 * additional type fixes: - all Optional[] types must have a default value - update to constrained types - URL types converted from str - test updates Fixes #1940	2024-07-22 17:23:03 -07:00
Ilya Kreymer	cb909ffc95	api docs cleanup + readd webhooks: (#1949 ) - readd webhooks (regression from #1941) - set order of tags in docs - add missing tag to route	2024-07-22 09:00:59 -07:00
Ilya Kreymer	cd00f52cca	Fix queue response models + additional testing for queue + exclusions (#1948 ) Follow-up to regressions from #1928, this PR: - Fixes response models for queue endpoints, which had incorrect model - Adds tests for queue get, queue match, and exclusions add / remove to ensure regressions like this can be caught via tests. This involves starting a new crawl in test_run_crawls() instead of relying on implicit running via fixtures, making it easier to test a crawl while it's running. - Adds additional typing for crawls apis, including making delete_crawls() have correct typing, consistent derived class override - Adds check to ensure queue + exclusion operations can not be called when crawl is not running	2024-07-22 09:00:23 -07:00
Tessa Walsh	2237120cd5	Add API endpoint to recalculate org storage (#1943 ) Fixes #1942 This process might be a bit slow for large orgs, may consider moving it to background job in #1898.	2024-07-19 18:29:20 -07:00
Tessa Walsh	6ccaad26d8	Ensure org name and slug uniqueness is case-insensitive (#1929 ) Fixes #1927 Also adds tests to ensure index is working as expected, and migration to rename orgs that have names or slugs identical to other orgs except for case before the new case-insensitive index is built.	2024-07-18 15:30:12 -07:00
Ilya Kreymer	b1ccdc4d16	OpenAPI Metadata for API Endpoints (#1941 ) - Updates the `/docs` and `/redoc` API endpoints to have better metadata, including using Browsertrix favicon and our logo for the `/redoc` endpoint. - add new logo file 'docs-logo.svg' to root Based on info at: https://fastapi.tiangolo.com/how-to/extending-openapi/ https://fastapi.tiangolo.com/tutorial/metadata/ --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-07-18 11:11:38 -07:00
Tessa Walsh	3bf7967754	Fix regression with saving new workflow due to profileid type error (#1946 ) Fixes #1945	2024-07-18 09:35:52 -07:00
Tessa Walsh	c772ee2362	Fix response model for crawl errors API endpoint (#1939 ) Follow-up fix for #1920 for crawl errors endpoint, which returns a 500 following #1928, caught in nightly tests.	2024-07-17 10:52:14 -07:00
Ilya Kreymer	335700e683	Additional typing cleanup (#1938 ) Misc typing fixes, including in profiles and time functions --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-17 10:49:22 -07:00
Ilya Kreymer	4db3053a9f	fix crawlFilenameTemplate + add_crawl_config cleanup (fixes #1932 ) (#1935 ) - ensure crawlFilenameTemplate is part of the CrawlConfig model - change CrawlConfig init to use type-safe construction - add a run_now_internal() that is shared for starting crawl, either on demand or from new config - add OrgOps.can_run_crawls() to check against org quotas for crawling - cleanup profile updates, remove _lookup_profile, only check for EmptyStr in update --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-17 10:48:25 -07:00
Ilya Kreymer	27059c91a5	version: bump to 1.11.0-beta.1	2024-07-17 10:06:49 -07:00
Tessa Walsh	60afb19472	Add API endpoint to import subscription for existing org (#1930 ) Fixes #1926 - adds /subscriptions/import endpoint for importing an existing subscription to an existing org - add SubscriptionImport object and log as 'import' event in subscription events collection --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-16 16:17:02 -07:00
Tessa Walsh	d41647e6c2	Document all API endpoints with response models (#1928 ) Fixes #1920 Adds response models to all API endpoints that were missing them, documenting current behavior without making any changes at this stage to standardize responses. Follow-up work will involve adding generics to some of the response models	2024-07-16 12:48:38 -07:00
Tessa Walsh	aaf18e70a0	Add created date to Organization and fix datetimes across backend (#1921 ) Fixes #1916 - Add `created` field to Organization and OrgOut, set on org creation - Add migration to backfill `created` dates from first workflow `created` - Replace `datetime.now()` and `datetime.utcnow()` across app with consistent timezone-aware `utils.dt_now` helper function, which now uses `datetime.now(timezone.utc)`. This is in part to ensure consistency in how we handle datetimes, and also to get ahead of timezone naive datetime creation methods like `datetime.utcnow()` being deprecated in Python 3.12. For more, see: https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated	2024-07-15 19:46:32 -07:00
Tessa Walsh	a546fb6fe0	Improve handling of duplicate org name/slug (#1917 ) Initial implementation of #1892 - Modifies the backend to return `duplicate_org_name` or `duplicate_org_slug` as appropriate on a pymongo `DuplicateKeyError` - Updates frontend to handle `duplicate_org_name`, `duplicate_org_slug`, and `invalid_slug` error details - Update errors to be more consistent, also return `duplicate_org_subscription.subId` for duplicate subscription instead of the more generic `already_exists` --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-10 19:24:50 -07:00
Ilya Kreymer	9a67e28f13	Adds Subscription API (#1914 ) Fixes https://github.com/webrecorder/browsertrix/issues/1905 - adds a new top-level `/api/subscriptions` endpoint and SubOps handler on the backend. - subscriptions API endpoints are enabled only if `billing_enabled` is set in helm chart - new POST /subscriptions/create, /subscriptions/update, /subscriptions/cancel API endpoints - Subscriptions mongo collection storing timestamped /subscription API events - GET /subscriptions/events API to get subscription events, support for filtering and sorting - Subscription data model - Support for setting and handling readOnlyOnCancel on org - /orgs/<id>/billing-portal to lookup portalUrl using external API - subscription in org getter and list views - mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active` --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-10 17:41:16 -07:00
sua yoo	c97900ec2b	Merge branch 'main' into frontend-org-manage-readonly	2024-07-08 11:20:30 -07:00
Tessa Walsh	192737ea99	Add API endpoint to delete org (#1448 ) Fixes #903 Adds superuser-only API endpoint to delete an org and all of its data --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-03 16:00:11 -04:00
Tessa Walsh	bca05ac185	Fix typing	2024-07-03 11:25:01 -04:00
Tessa Walsh	497cfdc561	Merge branch 'main' into frontend-org-manage-readonly	2024-07-03 11:15:47 -04:00
Tessa Walsh	787ebc8738	Add one more pylint disable comment	2024-07-03 11:14:46 -04:00
Tessa Walsh	5a563d20d9	Fix linting issues	2024-07-03 11:10:10 -04:00
Tessa Walsh	d3fb33a78a	Add and apply backend sorting for org list The default org will always be sorted first, regardless of sort options. Orgs after the first will be sorted by name ascending by default. Sorting currently supported on name, slug, and readOnly.	2024-07-03 11:01:01 -04:00
Ilya Kreymer	1c42e21b8a	Refactor Invites and Registration, Flatten Per-User Invites (#1902 ) Fixes #1432 Refactors the invite + registration system to be simpler and more consistent with regards to existing user invites. Previously, per-user invites were stored in the user.invites dict instead of in the invites collection, which creates a few issues: - Existing users do not show up in Org Invites list: #1432 - Existing user invites also do not expire, unlike new user invites, creating a potential security issue. Instead, existing user invites should be treated like new user invites. This PR moves them into the same collection, adding a `userid` field to InvitePending to match with an existing user. If a user already exists, it will be matched by userid, instead of by email. This allows a user to update their email while still being invited. Note that the email of the invited existing user will not change in the invite email. This is also by design: an admin of one org should not be given any hint that an invited user already has an account, such as by having their email automatically update. For an org admin, the invite to a new or existing user should be indistinguishable. The sha256 of the invite token is stored instead of the actual token for better security.
The registration system has also been refactored with the following changes: - Auto-creation of new orgs for new users has been removed - User.create_user() replaces the old User._create() and just creates the user with additional complex logic around org auto-add - Users are added to org in org add_user_to_org() - Users are added to org through invites with add_user_with_invite() Tests: - Additional tests include verifying that existing and new pending invites appear in the pending invites list - Tests for `/users/invite/<token>?email=` and `/users/me/invite/<token>` endpoints - Deleting pending invites - Additional tests added for user self-registration, including existing user self-registration to default org of existing user (in nightly tests)	2024-07-02 15:13:27 -07:00
Tessa Walsh	f076e7d9e3	Add superuser API endpoints to export and import org data (#1394 ) Fixes #890 This PR introduces new streaming superuser-only API endpoints to export and import database information for an organization. New Administrator deployment documentation on how to manage the process and copy files between S3 buckets as needed is also included. --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-02 17:14:34 -04:00
Tessa Walsh	bdfc0948d3	Disable uploading and creating browser profiles when org is read-only (#1907 ) Fixes #1904 Follow-up to read-only enforcement, with improved tests.	2024-07-01 23:15:38 -07:00
Ilya Kreymer	e1ef894275	Extends Org Create endpoint + shared secret auth (#1897 ) Updates the /api/orgs/create endpoint to: - not have name / slug be required, will be renamed on first user via #1870 - support optional quotas - support optional first admin user email, who will receive an invite to join the org. Also supports a new shared secret mechanism, to allow an external automation to access the /api/orgs/create endpoint (and only that endpoint thus far) via a shared secret instead of normal login.	2024-07-01 09:37:02 -07:00
Ilya Kreymer	3cd52342a7	Remove Crawl Workflow Configmaps (#1894 ) Fixes #1893 - Removes crawl workflow-scoped configmaps, and replaces with operator-controlled per-crawl configmaps that only contain the json config passed to Browsertrix Crawler (as a volume). - Other configmap settings are replaced by the custom CrawlJob options (mostly already were, just added profile_filename and storage_filename) - Cron jobs also updated to create CrawlJob without relying on configmaps, querying the db for additional settings. - The `userid` associated with cron jobs is set to the user that last modified the schedule of the crawl, rather than whomever last modified the workflow - Various functions that deal with updating configmaps have been removed, including in migrations. - New migration 0029 added to remove all crawl workflow configmaps	2024-06-28 15:25:23 -07:00
Tessa Walsh	8a904c9031	feat: Rename org when accepting org invite for first admin (#1870 ) Resolves https://github.com/webrecorder/browsertrix/issues/1874 Support for new two-part sign up flow if first admin user is added to org - If new user, user registers first, then is able to change the org name / slug on following screen - If existing user, user accepts invite, then is able to change the org name / slug on following screen - After confirming org slug name, user is taken to dashboard, or error is shown if org name or slug already taken. - If org name == org id, org name and slug is automatically set to `{Your Name}'s Archive` when first user is registered / accepts invite - Email templates updated to better reflect new / existing users and not show org name if it is 'unset' (org name == org id internally) - tests: frontend unit testing for accept + invite screens. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>	2024-06-27 16:08:31 -07:00
Tessa Walsh	b7631d1b91	Add slug validation and test (#1891 ) Fixes #1890 Adds validation for org slugs, ensuring that they contain only ASCII alphanumeric characters and dashes (`-`). If an invalid slug is provided, an HTTPException is returned with status code 400 and detail `invalid_slug`.	2024-06-26 15:04:54 -04:00
Ilya Kreymer	6df10d5fb0	Improved Scale Handling (#1889 ) Fixes #1888 Refactors scale handling: - Ensures number of scaled instances does not exceed number of pages, but is also at minimum 1 - Checks for finish condition to be numFailed + numDone >= desired scale - If at least one instance succeeds, crawl is considered successful / done. - If all instances fail, crawl is considered failed - Ensures that pod done count >= redis done count --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-06-26 10:24:45 -07:00
Tessa Walsh	9140dd75bc	Add and enforce readOnly field in Organization (#1886 ) Fixes https://github.com/webrecorder/browsertrix/issues/1883 Backend work for https://github.com/webrecorder/browsertrix/issues/1876 - If readOnly is set true, disallow crawls and QA analysis runs - If readOnly is set to true, skip scheduled crawls - Add endpoint to set `readOnly` with optional `readOnlyReason` (which is automatically set back to an empty string when `readOnly` is being set to false), which can be displayed in banner - Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-06-25 19:30:53 -07:00
Ilya Kreymer	3bd714ea9d	QA stats aggregation: exclude isFile / isError pages from stats (#1879 ) Follow-up to: #1868, exclude pages that have isFile or isError set to true from the stats aggregation.	2024-06-25 08:54:42 -07:00
Tessa Walsh	7af3980323	Add billing enabled and sales email to Helm chart and /settings API endpoint (#1873 ) Backend work for first two tasks of https://github.com/webrecorder/browsertrix/issues/1875 New /billing API endpoint to be added separately once we have a better idea of what data we can get from the payment processor.	2024-06-25 10:55:29 -04:00
Tessa Walsh	879e509b39	Backend: Move page file and error counts to crawl replay.json endpoint (#1868 ) Backend work for #1859 - Remove file count from qa stats endpoint - Compute isFile or isError per page when page is added - Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages - Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount) - Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl - Determine if page is a file based on loadState == 2, mime type or status code and lack of title	2024-06-20 19:02:57 -07:00
Ilya Kreymer	553e2e352b	Merge branch 'main' into 1.10.2-release	2024-06-12 23:59:56 -07:00
Ilya Kreymer	fa6627ce70	ensure QA configmap is updated for long running QA runs: (#1865 ) - add an 'expire_at_duration_seconds' which is 75% of actual presign duration time, or <25% remaining until presigned URL actually expires to ensure presigned URLs are updated earlier than when they actually expire - set cached expireAt time to the renew at time for more frequent updates - update QA configmap in place with updated presigned URLs when expireAt time is reached - mount qa config volume under /tmp/qa/ without subPath to get automatic updates, which crawler will handle - tests: fix qa test typo (from main) - fixes #1864	2024-06-12 10:51:35 -07:00
Tessa Walsh	8b0d1432af	Show QA meter while analysis is running (#1854 ) Fixes #1846 - Ensure meter auto-updates as new stats are ready - Switch meter to new QA run when new analysis run is started - Remove Files from QA meter (files and errors will be reported separately) Co-authored-by: emma <hi@emma.cafe> Co-authored-by: sua yoo <sua@webrecorder.org>	2024-06-12 12:32:01 -04:00
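
Two of the commits listed above describe small, self-contained behaviors: the slug rule from b7631d1b91 (only ASCII alphanumeric characters and dashes) and the timezone-aware `dt_now` helper from aaf18e70a0 (built on `datetime.now(timezone.utc)`). The sketch below illustrates both under stated assumptions; it is not the repository's code. `SLUG_RE` and `is_valid_slug` are hypothetical names, and the HTTP 400 `invalid_slug` response mentioned in the commit is left out.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: one or more ASCII letters, digits, or dashes.
SLUG_RE = re.compile(r"[A-Za-z0-9-]+")


def is_valid_slug(slug: str) -> bool:
    """Return True if the whole slug matches the allowed character set."""
    return SLUG_RE.fullmatch(slug) is not None


def dt_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)
```

For example, `is_valid_slug("my-org-1")` is true, while `is_valid_slug("my org")` is false because of the space. Unlike the naive value produced by `datetime.utcnow()`, the result of `dt_now()` carries `tzinfo`.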

513 Commits