browsertrix

Author	SHA1	Message	Date
Dmitriy Pertsev	246bcc73c5	Use new ingressClassName only by default (#2268 ) - By default, use only `ingressClassName` for ingress class name and corresponding field in cert-manager - Only use old 'kubernetes.io/ingress.class' if ingress.useOldClassAnnotation is set - Allow for using old annotation only for backwards compatibility, eg. for GCP - Closes #2267 and #1570 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-01-15 23:23:50 -08:00
Ilya Kreymer	12f358b826	Merge pull request #2271 from webrecorder/public-collections-feature feat: Public collections, includes: - feat: Public org profile page #2172 - feat: Collection thumbnails, start page, and public view updates #2209 - feat: Track collection events #2256	2025-01-13 19:32:45 -08:00
Ilya Kreymer	bab5345ad5	version: bump to 1.14.0-beta.0 for public collections!	2025-01-13 19:29:54 -08:00
sua yoo	b36ed9f730	feat: Track collection events (#2256 ) - Renames `inject_analytics` to `inject_extra` and updates docs - Manually tracks page views to enable passing custom props - Tracks copying collection share link and downloading a public collection --------- Co-authored-by: emma <hi@emma.cafe>	2025-01-13 15:15:49 -08:00
Tessa Walsh	a031fab313	Backend work for public collections (#2198 ) Fixes #2182 This rather large PR adds the rest of what should be needed for public collections work in the frontend. New API endpoints include: - Public collections endpoints: GET, streaming download - Paginated list of URLs in collection with snapshot (page) info for each - Collection endpoint to set home URL - Collection endpoint to upload thumbnail as stream - DELETE endpoint to remove collection thumbnail Changes to existing API endpoints include: - Paginating public collection list results - Several `pages` endpoints that previously only supported `/crawls/` in their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support `/uploads/` and `/all-crawls/` namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For `/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent). - `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added. Other big changes: - New uploads will now have their pages read into the database! Collection page counts now also include uploads - A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly - Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints	2025-01-13 15:15:48 -08:00
Ilya Kreymer	a21b2ff0df	version: bump to 1.13.2	2025-01-08 22:58:33 -08:00
Tessa Walsh	589819682e	Optionally delay replica deletion (#2252 ) Fixes #2170 The number of days to delay file replication deletion by is configurable in the Helm chart with `replica_deletion_delay_days` (set by default to 7 days in `values.yaml` to encourage good practice, though we could change this). When `replica_deletion_delay_days` is set to an int above 0, when a delete replica job would otherwise be started as a Kubernetes Job, a CronJob is created instead with a cron schedule set to run yearly, starting x days from the current moment. This cronjob is then deleted by the operator after the job successfully completes. If a failed background job is retried, it is re-run immediately as a Job rather than being scheduled out into the future again. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-12-19 18:50:28 -08:00
Ilya Kreymer	2060ee78b4	Support Presigning for use with custom domain (#2249 ) If access_endpoint_url is provided: - Use virtual host addressing style, so presigned URLs are of the form `https://bucket.s3-host.example.com/path/` instead of `https://s3-host.example.com/bucket/path/` - Allow for replacing `https://bucket.s3-host.example.com/path/` -> `https://my-custom-domain.example.com/path/`, where `https://my-custom-domain.example.com/path/` is the access_endpoint_url - Remove old `use_access_for_presign` which is no longer used - Fixes #2248 - docs: update deployment docs storages section to mention custom storages, access_endpoint_url --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-12-19 18:41:47 -08:00
Ilya Kreymer	60d07762be	version: bump to 1.13.1	2024-12-19 12:01:47 -08:00
Ilya Kreymer	cf60c43df2	version: bump to 1.13.0! (#2242 )	2024-12-13 20:32:38 -08:00
Ilya Kreymer	74ae3b0f8d	Add new locales (#2240 ) - By default, all locales are enabled to make it easy for local deployments to test new locales - Adds DE, FR, PT locales to make way for translation in Weblate	2024-12-13 19:59:09 -08:00
Emma Segal-Grossman	b650762a45	Allow configuring available languages from helm chart (#2230 ) Closes #2223 - [x] Adds `localesAvailable` to `/api/settings` endpoint, and uses that list if available, rather than the full list of translated locales, to determine which options to display to users - [x] ~~Uses the user's browser locales, filtered to the current language setting, for formatting numbers, dates, and durations~~ - [x] Adds & persists checkbox for "use same language for formatting dates and numbers" in user settings - [x] Replaces uses of `sl-format-bytes` with `localize.bytes(...)`, and `sl-format-date` with replacement `btrix-format-date` that properly handles fallback locales - [x] Caches all number/duration/datetime formatters by a combined key consisting of app language, browser language, browser setting, and formatter options so that all formatters can be reused if needed (previously any formatter with non-default options would be recreated every render) - [x] Splits out ordinal formatting from number formatter, as it didn't make much sense in some non-English locales - [x] Adds a little demo of date/time/duration/number formatting so you can see what effect your language settings have https://github.com/user-attachments/assets/724858cb-b140-4d72-a38d-83f602c71bc7 --------- Signed-off-by: emma <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-12-13 22:31:26 -05:00
Ilya Kreymer	db39333ef4	Send subscription cancelation email (#2234 ) Adds sending a cancellation email when a subscription is cancelled. - The email may also include an option survey optional survey URL, if configured in helm chart `survey_url` setting. - Cancellation e-mail configured in `sub_cancel` e-mail template - E-mails are sent to all org admins. - Also adds `trialing_canceled` subscription state to differentiate from a default `trialing` which will automatically rollover into `active`. - The email is sent when: a new cancellation date is added for an `active` subscription, or a `trialing` subscription is changed to to `trialing_canceled`. (A subscription can be canceled/uncanceled several times before actual date, and e-mail is sent every time it is canceled.) - The 'You have X days left of your trial' is also always displayed when state is in trialing_canceled. Fixes #2229 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-12-12 11:52:38 -08:00
Emma Segal-Grossman	a65ca49ddd	Plausible analytics (#2226 ) Closes #2222 Adds a runtime script that gets set to either inject the plausible script tags, or do nothing, that runs at initialization of the frontend container.	2024-12-10 16:30:22 -08:00
Tessa Walsh	661e5d9fae	Fix issue with failed background job emails not being sent (#2187 ) Fixes #2186 Background job emails will no longer fail to send for jobs unrelated to file replication or replica deletion. Also uses `AnyJob` for paginated background job response model, to fix typing being out of data following addition of other types of background jobs and lower overhead for adding new ones moving forward.	2024-11-27 17:00:35 -08:00
Ilya Kreymer	50dac7dc50	1.12.2 release -> main (#2181 ) Merge 1.12.2 release changes into main, includes: - Collection replay full refresh on metadata / archived items (#2176) - Fix for self-registration default org (#2178) - Prepend missing https in start URL (#2177) - Updated billing to support free trial messaging (#2179) --------- Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: SuaYoo <SuaYoo@users.noreply.github.com>	2024-11-26 11:17:07 -08:00
Tessa Walsh	1b1819ba5a	Move org deletion to background job with access to backend ops classes (#2098 ) This PR introduces background jobs that have full access to the backend ops classes and moves the delete org job to a background job.	2024-10-10 14:41:05 -04:00
Ilya Kreymer	84a74c43a4	version: bump to 1.13.0-beta.0	2024-10-10 11:38:13 -07:00
Ilya Kreymer	c33f749515	Frontend hosted-docs (#2107 ) Fixes #2106 Docs are now hosted as part of the frontend at `/docs` by default. - If `docs_url` is set in the helm chart, the `/docs` endpoint will redirect to that endpoint instead - Use multi-stage python image to build mkdocs as part of frontend, then copy static output - Dir layout: mkdocs.yml and docs into frontend/docs - CI: Update docs build GH action to use new path - Update all frontend paths to use `/docs/` instead of `https://docs.browsertrix.com/` --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-10-08 14:56:34 -07:00
Ilya Kreymer	8192e5bed6	version: bump to 1.12.0	2024-10-03 16:45:54 -07:00
Vinzenz Sinapius	bb6e703f6a	Configure browsertrix proxies (#1847 ) Resolves #1354 Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires browsertrix crawler 1.3+) Config: - proxies defined in btrix-proxies subchart - can be configured via btrix-proxies key or separate proxies.yaml file via separate subchart - proxies list refreshed automatically if crawler_proxies.json changes if subchart is deployed - support for ssh and socks5 proxies - proxy keys added to secrets in subchart - support for default proxy to be always used if no other proxy configured, prevent starting cluster if default proxy not available - prevent starting manual crawl if previously configured proxy is no longer available, return error - force 'btrix' username and group name on browsertrix-crawler non-root user to support ssh Operator: - support crawling through proxies, pass proxyId in CrawlJob - support running profile browsers which designated proxy, pass proxyId to ProfileJob - prevent starting scheduled crawl if previously configured proxy is no longer available API / Access: - /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only) - /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org - /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only) - superadmin can configure which orgs can use which proxies, stored on the org - superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org. UI: - Superadmin has 'Edit Proxies' dialog to configure for each org if it has: dedicated proxies, has access to shared proxies. - User can select a proxy in Crawl Workflow browser settings - Users can choose to launch a browser profile with a particular proxy - Display which proxy is used to create profile in profile selector - Users can choose with default proxy to use for new workflows in Crawling Defaults --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-10-02 18:35:45 -07:00
Ilya Kreymer	c242bb96d2	version: bump to 1.12.0-beta.0	2024-09-12 14:30:15 -07:00
Ilya Kreymer	1f919de294	Allow custom auto-resize crawler volume ratio adjustable (#2076 ) Make the avail / used storage ratio (for crawler volumes) adjustable. Disable auto-resize if set to 0. Follow-up to #2023	2024-09-12 09:28:19 -07:00
sua yoo	4c36c80351	feat: Display scale as number of browser windows (#2057 ) Resolves https://github.com/webrecorder/browsertrix/issues/2048 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-09-05 17:32:40 -07:00
Ilya Kreymer	b3c1195878	version: bump to 1.11.6	2024-09-05 17:31:10 -07:00
Ilya Kreymer	ea252e8da9	version: bump to 1.11.5	2024-08-27 10:00:53 -07:00
sua yoo	337454f8c9	feat: Add link to hosted sign-up page (#2045 ) Resolves https://github.com/webrecorder/browsertrix/issues/2043 <!-- Fixes #issue_number --> ### Changes - Shows link to sign up in UI if `sign_up_url` is configured. - Expires settings in session storage (for now)	2024-08-26 17:26:25 -07:00
Ilya Kreymer	95969ec747	Attempt to auto-adjust storage if usage is running out while crawl is running (#2023 ) Attempt to auto-adjust PVC storage if: - used storage (as reported in redis by the crawler) * 2.5 > total_storage - will cause PVC to resize, if possible (not supported by all drivers) - uses multiples of 1Gi, rounding up to next GB - AVAIL_STORAGE_RATIO hard-coded to 2.5 for now, to account for 2x space for WACZ plus change for fast updating crawls Some caveats: - only works if the storageClass used for PVCs has `allowVolumeExpansion: true`, if not, it will have no effect - designed as a last resort option: the `crawl_storage` in values and `--sizeLimit` and `--diskUtilization` should generally result in this not being needed. - can be useful in cases where a crawl is rapidly capturing a lot of content in one page, and there's no time to interrupt / restart, since the other limits apply only at page end. - May want to have crawler update the disk usage more frequently, not just at page end to make this more effective.	2024-08-26 14:19:20 -07:00
Ilya Kreymer	135c97419d	version: update to 1.11.4	2024-08-26 12:31:56 -07:00
Ilya Kreymer	8ff1ad39a7	version: bump to 1.11.3	2024-08-08 15:16:18 -07:00
Ilya Kreymer	ed9038fbdb	version: bump to 1.11.2	2024-08-07 12:37:26 -07:00
Ilya Kreymer	7fa2b61b29	Execution time tracking tweaks (#1994 ) Tweaks to how execution time is tracked for more accuracy + excluding waiting states: - don't update if crawl state is in a 'waiting state' (waiting for capacity or waiting for org limit) - rename start states -> waiting states for clarity - reset lastUpdatedTime if two consecutive updates of non-running state, to ensure non-running states don't count, but also account for occasional hiccups -- if only one update detects non-running state, don't reset - webhooks: move start webhook to when crawl actually starts for first time (db lastUpdatedTime is not yet + crawl is running) - don't set lastUpdatedTime until pods actually running - set crawljob update interval to every 10 seconds for more accurate execution time tracking - frontend: show seconds in 'Execution Time' display	2024-08-06 09:44:44 -07:00
Ilya Kreymer	0c29008b7d	version: bump to 1.11.1	2024-07-30 11:23:41 -07:00
Ilya Kreymer	4aca107710	version: bump to 1.11.0	2024-07-29 12:52:39 -07:00
Ilya Kreymer	96691a33fa	Fix for cronjob skipping response (#1976 ) If a cronjob is disabled, the operator should quickly return a success value so that the job can be terminated. Was previously returning an incorrect response, causing disabled cronjobs to not be cleaned up. Add proper typing to always return correct response	2024-07-29 12:24:18 -07:00
Ilya Kreymer	b35669af8d	disable behaviors for QA runs via configmap (#1963 ) - make crawl args a reusable template - adds QA_ARGS to configmap, setting to same value as CRAWL_ARGS but with --behaviors= prepended to disable behaviors for QA, to improve performance of QA runs. fixes #1962	2024-07-23 19:54:21 -07:00
Ilya Kreymer	01ddf95a56	allow disabling of auto-resize of crawler pods (#1964 ) - only enable if 'enable_auto_resize' is true, default to false - if true, set memory limit to 1.2 of memory requests, resize when hitting 'soft oom' of initial request, adjust by 1.2 (current behavior) up to max_crawler_memory - if false, set memory limit to max_crawler_memory and never adjust memory requests or memory limits - part of #1959	2024-07-23 21:00:40 -04:00
Ilya Kreymer	27059c91a5	version: bump to 1.11.0-beta.1	2024-07-17 10:06:49 -07:00
Ilya Kreymer	9a67e28f13	Adds Subscription API (#1914 ) Fixes https://github.com/webrecorder/browsertrix/issues/1905 - adds a new top-level `/api/subscriptions` endpoint and SubOps handler on the backend. - enable subscriptions API endpoints available only if `billing_enabled` is set in helm chart - new POST /subscriptions/create, /subscriptions/update, /subscriptions/cancel API endpoints - Subscriptions mongo collection storing timestamped /subscription API events - GET /subscriptions/events API to get subscription events, support for filtering and sorting - Subscription data model - Support for setting and handling readOnlyOnCancel on org - /orgs/<id>/billing-portal to lookup portalUrl using external API - subscription in org getter and list views - mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active` --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-10 17:41:16 -07:00
Vinzenz Sinapius	01d8bdc5e6	Crawler network policy (#1727 ) Limit egress traffic from crawler/profilebrowser pods to the internet and limited internal services like dns, redis, frontend, auth-signer on certain ports --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-03 10:55:03 -07:00
Ilya Kreymer	1c42e21b8a	Refactor Invites and Registration, Flatten Per-User Invites (#1902 ) Fixes #1432 Refactors the invite + registration system to be simpler and more consistent with regards to existing user invites. Previously, per-user invites are stored in the user.invites dict instead of in the invites collection, which creates a few issues: - Existing user do not show up in Org Invites list: #1432 - Existing user invites also do not expire, unlike new user invites, creating potential security issue. Instead, existing user invites should be treated like new user invites. This PR moves them into the same collection, adding a `userid` field to InvitePending to match with an existing user. If a user already exists, it will be matched by userid, instead of by email. This allows for user to update their email while still being invited. Note that the email of the invited existing user will not change in the invite email. This is also by design: an admin of one org should not be given any hint that an invited user already has an account, such as by having their email automatically update. For an org admin, the invite to a new or existing user should be indistinguishable. The sha256 of invite token is stored instead of actual token for better security. The registration system has also been refactored with the following changes: - Auto-creation of new orgs for new users has been removed - User.create_user() replaces the old User._create() and just creates the user with additional complex logic around org auto-add - Users are added to org in org add_user_to_org() - Users are added to org through invites with add_user_with_invite() Tests: - Additional tests include verifying that existing and new pending invites appear in the pending invites list - Tests for `/users/invite/<token>?email=` and `/users/me/invite/<token>` endpoints - Deleting pending invites - Additional tests added for user self-registration, including existing user self-registration to default org of existing user (in nightly tests)	2024-07-02 15:13:27 -07:00
Tessa Walsh	f076e7d9e3	Add superuser API endpoints to export and import org data (#1394 ) Fixes #890 This PR introduces new streaming superuser-only API endpoints to export and import database information for an organization. New Adminstrator deployment documentation on how to manage the process and copy files between S3 buckets as needed is also included. --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-02 17:14:34 -04:00
Ilya Kreymer	e1ef894275	Extends Org Create endpont + shared secret auth (#1897 ) Updates the /api/orgs/create endpoint to: - not have name / slug be required, will be renamed on first user via #1870 - support optional quotas - support optional first admin user email, who will receive an invite to join the org. Also supports a new shared secret mechanism, to allow an external automation to access the /api/orgs/create endpoint (and only that endpoint thus far) via a shared secret instead of normal login.	2024-07-01 09:37:02 -07:00
Ilya Kreymer	3cd52342a7	Remove Crawl Workflow Configmaps (#1894 ) Fixes #1893 - Removes crawl workflow-scoped configmaps, and replaces with operator-controlled per-crawl configmaps that only contain the json config passed to Browsertrix Crawler (as a volume). - Other configmap settings replaced are replaced the custom CrawlJob options (mostly already were, just added profile_filename and storage_filename) - Cron jobs also updated to create CrawlJob without relying on configmaps, querying the db for additional settings. - The `userid` associated with cron jobs is set to the user that last modified the schedule of the crawl, rather than whomever last modified the workflow - Various functions that deal with updating configmaps have been removed, including in migrations. - New migration 0029 added to remove all crawl workflow configmaps	2024-06-28 15:25:23 -07:00
Ilya Kreymer	946739b08b	Update authsigner to 0.5.2 (#1899 ) - needed to support js-wacz signing requests in upcoming crawler versions - Also has slightly increased memory requirements due to new versions of some libraries. - 0.5.2 adds a fix to dropping the fractional part of the second, to make it work with ISO date strings that have microseconds, such as those from js-wacz.	2024-06-28 13:38:24 -07:00
Tessa Walsh	8a904c9031	feat: Rename org when accepting org invite for first admin (#1870 ) Resolves https://github.com/webrecorder/browsertrix/issues/1874 Support for new two-part sign up flow if first admin user is added to org - If new user, user registers first, then is able to change the org name / slug on following screen - If existing user, user accepts invite, then is able to change the org name / slug on following screen - After confirming org slug name, user is taken to dashboard, or error is shown if org name or slug already taken. - If org name == org id, org name and slug is automatically set to `{Your Name}'s Archive` when first user is registered / accepts invite - Email templates updated to better reflect new / existing users and not show org name if it is 'unset' (org name == org id internally) - tests: frontend unit testing for accept + invite screens. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>	2024-06-27 16:08:31 -07:00
Ilya Kreymer	6df10d5fb0	Improved Scale Handling (#1889 ) Fixes #1888 Refactors scale handling: - Ensures number of scaled instances does not exceed number of pages, but is also at minimum 1 - Checks for finish condition to be numFailed + numDone >= desired scale - If at least one instance succeeds, crawl considers successful / done. - If all instances fail, crawl considered failed - Ensures that pod done count >= redis done count --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-06-26 10:24:45 -07:00
Henry Wilkinson	48dfa485e5	Adds GitHub and Forum to the email invite template (#1887 ) - Adds Browsertrix GitHub repo and Webrecorder forum to the bottom of the support email. - Adds note about having an applicable plan to contact support --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-06-25 19:27:03 -04:00
Tessa Walsh	7af3980323	Add billing enabled and sales email to Helm chart and /settings API endpoint (#1873 ) Backend work for first two tasks of https://github.com/webrecorder/browsertrix/issues/1875 New /billing API endpoint to be added separately once we have a better idea of what data we can get from the payment processor.	2024-06-25 10:55:29 -04:00
Ilya Kreymer	553e2e352b	Merge branch 'main' into 1.10.2-release	2024-06-12 23:59:56 -07:00
Ilya Kreymer	fa6627ce70	ensure QA configmap is updated for long running QA runs: (#1865 ) - add a 'expire_at_duration_seconds' which is 75% of actual presign duration time, or <25% remaining until presigned URL actually expires to ensure presigned URLs are updated early than when they actually expire - set cached expireAt time to the renew at time for more frequent updates - update QA configmap in place with updated presigned URLs when expireAt time is reached - mount qa config volume under /tmp/qa/ without subPath to get automatic updates, which crawler will handle - tests: fix qa test typo (from main) - fixes #1864	2024-06-12 10:51:35 -07:00
Ilya Kreymer	1ab6ec325b	version: bump to 1.10.2	2024-06-11 16:28:40 -07:00
Ilya Kreymer	e3ee63f9b0	version: bump to 1.11.0-beta.0	2024-06-04 13:37:44 -07:00
Ilya Kreymer	d42de92d75	QA analysis scale configurable in helm chart (#1843 ) - allow configuring QA run scale via 'qa_scale' setting in helm values (overriding any setting on the qa crawljob) - adds additional comments to browser instances helm values settings for clarity - fixes #1842	2024-05-30 12:59:21 -07:00
Ilya Kreymer	61239a40ed	include workflow config in QA runs + different browser instances for QA (#1829 ) Currently, the workflow crawl settings were not being included at all in QA runs. This mounts the crawl workflow config, as well as QA configmap, into QA run crawls, allowing for page limits from crawl workflow to be applied to QA runs. It also allows a different number of browser instances to be used for QA runs, as QA runs might work better with less browsers, (eg. 2 instead of 4). This can be set with `qa_browser_instances` in helm chart. Default qa browser workers to 1 if unset (for now, for best results) Fixes #1828	2024-05-29 13:32:25 -07:00
Ilya Kreymer	4b6dd97c11	version: bump to 1.10.1	2024-05-23 22:24:58 -07:00
Ilya Kreymer	e853b62401	version: update to 1.10.0!	2024-05-20 19:30:22 -07:00
Ilya Kreymer	94d57b98ce	version bump to 1.10.0-beta.7	2024-05-15 11:30:05 -07:00
Ilya Kreymer	e022994f4e	version: update to 1.10.0-beta.6	2024-04-30 20:34:11 +02:00
Ilya Kreymer	a3911f6a8a	version: bump to 1.10.0-beta.5	2024-04-25 09:00:54 +02:00
Ilya Kreymer	f6c0791dc1	fix missing settings / typos: (#1748 ) - ensure max_crawler_memory_size is inited before it is set! - pass profile_browser_memory / profile_browser_cpu from chart values - map volume to /tmp/home to avoid persisting /tmp for profiles	2024-04-25 09:00:17 +02:00
Ilya Kreymer	a09f565ce5	version: bump to 1.10.0-beta.4	2024-04-24 16:53:39 +02:00
Ilya Kreymer	f89027ac89	version: 1.10.0-beta.3	2024-04-24 15:45:17 +02:00
Ilya Kreymer	ec74eb4242	operator: add 'max_crawler_memory' to limit autosizing of crawler pods (#1746 ) Adds a `max_crawler_memory` chart setting, which, if set, will defines the upper crawler memory limit that crawler pods can be resized up to. If not set, auto resizing is disabled and pods are always set to 'crawler_memory' memory	2024-04-24 15:16:32 +02:00
Ilya Kreymer	41655ef829	version: bump to 1.10.0-beta.2	2024-04-23 23:19:16 +02:00
Ilya Kreymer	b94070160b	allow configuring designated registration org to which new users can register (#1735 ) if 'registration_enabled' is set, check 'registration_org_id' for org id of an existing org that new users should be added to when they register. if omitted, default to the default org Fixes #1729	2024-04-23 17:11:37 -04:00
Ilya Kreymer	b574f00d2b	Add Repository Index + Chart Rename + Docs Rename (#1708 ) Repository Index: Generate an index.yaml in ./docx/helm-repo/index.yaml to allow for browsertrix to be a helm repository. docs: rename docs.browsertrix.cloud -> docs.browsertrix.com docs: update deployment doc to mention helm repo as preferred way to install docs build action: generate repository index in GH action publish action: update auto-generated message to mention installing from the repo. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-21 09:42:25 -07:00
Ilya Kreymer	4360e0c1b5	Update tests with latest crawler (#1711 ) tests: use 'latest' crawler release for testing, now that 1.1.x is released.	2024-04-20 15:56:26 -07:00
Vinzenz Sinapius	a8336925b6	Run crawler and profilebrowser with non-root user (#1625 ) With these changes, crawler and profilebrowser jobs run as a non-root user.	2024-04-17 12:03:33 -07:00
Ilya Kreymer	835014d829	restrict qa runs to a 'min_qa_crawler_image' if set in the chart (#1685 ) - fixes #1684 - can be used to optionally restrict QA to only some crawls (eg. with browsertrix-crawler>=1.0.0) - enforce error on backend (return 400) and handle special error on the frontend	2024-04-17 08:48:33 -07:00
Vinzenz Sinapius	1b034957ff	Improve reliability of backend tests (#1675 ) - Remove globals from profile, uploads, and qa test modules in favor of fixtures - Add retries to fix intermittent test failures due to timing --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-04-16 14:22:41 -04:00
Ilya Kreymer	95f5605af7	renumber crawl priority classes: (#1673 ) - priority classes <-10 are ignored by cluster-autoscaler so QA jobs with too low priorities never run - start crawl priorities at 0 going down (same as before) - start qa run priorities at -2 going down (instead of -100) - this means a crawl of with scale of 3 can be preempted by 1st qa pod, but otherwise crawls have higher priority - rename priority classes as they are otherwise immutable and error on helm upgrade This allows for more room in lower pri classes for other type of objects, while keeping in mind the -10 and below threshold: (see: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)	2024-04-13 12:24:43 -07:00
Ilya Kreymer	17f49a52de	email templates update + customization + doc update (fixes #1652 ) (#1653 ) - modify invite email template to answer common questions - email templates: make each email template overridable with --set-file - docs: update customization doc to document how to customize email templates --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-08 12:27:47 -07:00
Ilya Kreymer	a7cda3b11b	version: bump to 1.10.0-beta.1	2024-04-05 18:24:14 -07:00
Ilya Kreymer	c1817cbe04	add horizontal pod autoscaler for backend and frontend via helm charts (#1633 ) Supports horizontal pod autoscaling (hpa) for backend and frontend pods: - use cpu and memory averages - adjust base memory + cpu for backend - threshold set to 80% cpu and 95% memory utilization by default (configurable in values.yaml) - instead of backend and frontend replicas, set max replicas in values.yaml - only enable hpa if backend_max_replicas or frontend_max_replicas is >1, default to 1 for now	2024-03-28 16:39:27 -07:00
Ilya Kreymer	3438133fcb	Crawler pod memory padding + auto scaling (#1631 ) - set memory limit to 1.2x memory request to provide extra padding and avoid OOM - attempt to resize crawler pods by 1.2x when exceeding 90% of available memory - do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of requested memory, resulting in faster graceful restart, but avoiding a system-instant OOM Kill - Fixes #1632 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-28 16:39:00 -07:00
Ilya Kreymer	86311ab4ea	merge 1.9.5 fixes (#1637 ) retry loading profile if initial load fails, follow-up to #1604 - Add missing setTimeout to retry profile loading bump RWP to 1.8.15	2024-03-27 21:49:19 -07:00
Ilya Kreymer	412eb2ef32	MetaController update (#1630 ) Bump metacontroller to latest (4.11)	2024-03-27 08:49:56 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586 ) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running,`QARun` data stored in a single `qa` object, while finished qa runs are added to `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Tessa Walsh	21ae38362e	Add endpoints to read pages from older crawl WACZs into database (#1562 ) Fixes #1597 New endpoints (replacing old migration) to re-add crawl pages to db from WACZs. After a few implementation attempts, we settled on using [remotezip](https://github.com/gtsystem/python-remotezip) to handle parsing of the zip files and streaming their contents line-by-line for pages. I've also modified the sync log streaming to use remotezip as well, which allows us to remove our own zip module and let remotezip handle the complexity of parsing zip files. Database inserts for pages from WACZs are batched 100 at a time to help speed up the endpoint, and the task is kicked off using asyncio.create_task so as not to block before giving a response. StorageOps now contains a method for streaming the bytes of any file in a remote WACZ, requiring only the presigned URL for the WACZ and the name of the file to stream.	2024-03-19 14:14:21 -07:00
Ilya Kreymer	e7af081af1	profile browser fixes: better resource usage + load retry (main) (#1604 ) - Backend: Use separate resource constraints for profiles: default profile browser resources to either 'profile_browser_cpu' / 'profile_browser_memory' or single browser 'crawler_memory_base' / 'crawler_cpu_base', instead of scaled to the number of browser workers - Frontend: check that profile html page is loading, keep retrying if still getting nginx error instead of loading an iframe with the error. Fixes #1598 (Copy of #1599 from 1.9.4)	2024-03-16 15:07:04 -07:00
Henry Wilkinson	8ba29ca776	Browsertrix Cloud → Browsertrix text rename (#1466 ) Part of #1241 ### Changes - Renames all instances of "Browsertrix Cloud" to "Browsertrix" on the front end, emails, and documentation --------- Co-authored-by: emma <hi@emma.cafe>	2024-03-12 11:30:05 -04:00
Ilya Kreymer	804f755787	Increase startup probe time to account for long-running migrations (#1560 ) - increases the failureThreshold for startupProbe for the api backend container to account for long running migrations, upto 300 seconds - add `/healthzStartup` which checks if db is ready - bump - keeps `/healthz` to always return 200 when running - increases livenessProbe failureThreshold to be higher than readiness probe, following recommended best practice of liveness probe > readiness probe - fixes #1559	2024-02-28 14:22:33 -08:00
Tessa Walsh	14189b7cfb	Add crawl pages and related API endpoints (#1516 ) Fixes #1502 - Adds pages to database as they get added to Redis during crawl - Adds migration to add pages to database for older crawls from pages.jsonl and extraPages.jsonl files in WACZ - Adds GET, list GET, and PATCH update endpoints for pages - Adds POST (add), PATCH, and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text - Adds page_ops methods for 1. adding resources/urls to page, and 2. adding automated heuristics and supplemental info (mime, type, etc.) to page (for use in crawl QA job) - Modifies `Migration` class to accept kwargs so that we can pass in ops classes as needed for migrations - Deletes WACZ files and pages from database for failed crawls during crawl_finished process - Deletes crawl pages when a crawl is deleted Note: Requires a crawler version 1.0.0 beta3 or later, with support for `--writePagesToRedis` to populate pages at crawl completion. Beta 4 is configured in the test chart, which should be upgraded to stable 1.0.0 when it's released. Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-28 12:11:35 -05:00
Ilya Kreymer	8ae032ff88	More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. (#1537 ) Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug [crawl name \| first seed host]>`. - Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler 1.0.0-beta.4 or higher If crawl name is provided, uses crawl name, other hostname of first seed. The name is 'sluggified', using lowercase alphanum characters separated by dashes. Ex: in an organization called `Default Org`, a crawl of `https://specs.webrecorder.net/` and no name will have WARCs named: `default-org-specs-webrecorder-net-....warc.gz` If the crawl is given the name `SPECS`, the WARCs will be named `default-org-specs-manual-....warc.gz` Fixes #412 in a default way.	2024-02-22 23:54:23 -08:00
Ilya Kreymer	a8e3ff1141	version: bump to 1.10.0-beta.0	2024-02-20 00:22:29 -08:00
Ilya Kreymer	c1cffe9ecd	version: bump to 1.9.1	2024-02-16 09:44:18 -08:00
Ilya Kreymer	64bf21311d	version: bump to 1.9.0!	2024-02-14 13:30:46 -08:00
Ilya Kreymer	1d266e3cea	bump to 1.9.0.beta.5	2024-02-12 18:29:39 -08:00
Ilya Kreymer	4bc8152640	version: bump to 1.9.0-beta.4	2024-02-09 16:17:13 -08:00
Ilya Kreymer	b2a5dbf2cd	enable screenshots by default + fix py version formatting (#1518 ) configmap: add --screenshot thumbnail,view as default screenshots version: update update-version.sh to add newline in version.py to match new black formatting (from changes in #1507) Fixes #1519	2024-02-07 17:07:28 -08:00
Ilya Kreymer	7aebce66f6	version: bump to 1.9.0-beta.3	2024-02-07 15:21:10 -08:00
Ilya Kreymer	e43feedc43	version: bump to 1.9.0-beta.2	2024-01-18 10:01:38 -08:00
Ilya Kreymer	370590b14f	version: bump to 1.9.0-beta.1	2024-01-17 14:58:25 -08:00
Tessa Walsh	07fa46d9aa	Add custom user agent to workflows (#1465 ) Fixes #1341 Adds "User Agent" field to workflow editor under the Browser Settings tab. If not set, the crawler will use the browser's default user agent. Also added to docs and to the workflow details page (if set). --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-01-17 17:33:50 -05:00
Ilya Kreymer	90197b2a85	Backend mem usage fix - use fixed MOTOR_MAX_WORKERS + switch to gunicorn (#1468 ) Refactors backend deployment to: - Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by mongodb connections - Also sets backend workers to 1 by default to reduce default memory usage - Switches to gunicorn with uvloop worker for production use instead of uvicorn (as recommended by uvicorn) Lower thread count should address memory leak/increased usage, which resulted in 5x thread x cpus x workers, eg. potentially 20 or 40 threads just for mongodb connections. Lower default number of workers should make it easier to scale backend with HPA if additional capacity. Fixes #1467	2024-01-16 15:32:42 -08:00
Tessa Walsh	032859f361	Support multiple crawler versions (#1420 ) Fixes #1385 ## Changes Supports multiple crawler 'channels' which can be configured to different browsertrix-crawler versions - Replaces `crawler_image` in helm chart with `crawler_channels` array similar to how storages are handled - The `default` crawler channel must always be provided and specifies the default crawler image - Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint to fetch information about available crawler versions (name, image, and label) and test - Adds crawler channel select to workflow creation/edit screens and profile creation dialog, and updates related API endpoints and configmaps accordingly. The select dropdown is shown only if more than one channel is configured. - Adds `crawlerChannel` to workflow and crawl details. - Add `image` to crawler image, used to display actual image used as part of the crawl. - Modifies `crawler_crawl_id` backend test fixture to use `test` crawler version to ensure crawler versions other than latest work - Adds migration to add `crawlerChannel` set to `default` to existing workflow and profile objects and workflow configmaps --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-01-16 15:32:12 -08:00
Ilya Kreymer	a6936299d3	version: bump to 1.9.0-beta.0	2023-12-20 00:08:16 -08:00
Ilya Kreymer	d902cf5338	version: bump to 1.8.2	2023-12-07 13:34:37 -08:00
Ilya Kreymer	1218d6e767	version: bump to 1.8.1	2023-11-17 14:39:52 -08:00

1 2 3 4 5 ...

298 Commits