Commit Graph

155 Commits

Author SHA1 Message Date
Ilya Kreymer
4e0e9c87c2
qa: run siteSpecific behaviors on QA (#2739)
- allow the page-load wait time to be applied to sites with
site-specific behaviors (e.g. social media)
2025-07-15 16:17:55 -07:00
Ilya Kreymer
d4a2a66d6d
additional scale / browser window cleanup to properly support QA: (#2663)
- follow up to #2627 
- use qa_num_browser_windows to set the exact number of QA browsers,
falling back to qa_scale (see the values sketch below)
- set num_browser_windows and num_browsers_per_pod using crawler / qa
values, depending on whether it is a QA crawl
- scale_from_browser_windows() accepts an optional browsers_per_pod when
dealing with a possible QA override
- store 'desiredScale' in CrawlStatus to avoid recomputing it during
later scale resolution
- ensure status.scale is always the actual observed scale
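
A minimal helm values sketch of the QA override described above; the key names come from this PR, the values are placeholders:

```yaml
# If qa_num_browser_windows is set, it is used as the exact number of QA
# browser windows; otherwise the operator falls back to qa_scale.
qa_num_browser_windows: 2
qa_scale: 1
```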
2025-06-12 13:09:04 -04:00
Vinzenz Sinapius
0e0e663363
helm: add crawler_network_policy_additional_egress (#2641)
- Adds `crawler_network_policy_additional_egress` setting to add egress
rules to the existing crawler network policy. Useful when you want to
allow-list individual IPs without replacing the whole network policy (see
the sketch below).

- Adds docs about `crawler_network_policy_additional_egress` to the customization page.

- Resolves #2121
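
A sketch of such an override in local helm values, assuming the setting takes standard Kubernetes NetworkPolicy egress rules; the IP and port are placeholders:

```yaml
crawler_network_policy_additional_egress:
  # allow-list a single external host without replacing the default policy
  - to:
      - ipBlock:
          cidr: 203.0.113.7/32
    ports:
      - port: 8080
        protocol: TCP
```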

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-06-10 16:19:42 -07:00
Tessa Walsh
dc41468daf
Allow users to run crawls with 1 or 2 browser windows (#2627)
Fixes #2425 

## Changed

- Switch backend to primarily using number of browser windows rather
than scale multiplier (including migration to calculate `browserWindows`
from `scale` for existing workflows and crawls)
- Still support `scale` in addition to `browserWindows` in input models
for creating and updating workflows and re-adjusting live crawl scale
for backwards compatibility
- Adds a new `max_browser_windows` value to the Helm chart, but calculates
the value from `max_crawl_scale` as a fallback for users who already have
that value set in local charts (see the sketch after this list)
- Rework frontend to allow users to select multiples of
`crawler_browser_instances` or any value below
`crawler_browser_instances` for browser windows. For instance, with
`crawler_browser_instances=4` and `max_browser_windows=8`, the user
would be presented with the following options: 1, 2, 3, 4, 8
- Sets maximum width of screencast to image width returned by `message`
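
A hedged sketch of the relevant helm values; with these settings the UI would offer the window counts 1, 2, 3, 4, 8 described above (the exact fallback formula is an assumption):

```yaml
crawler_browser_instances: 4  # browsers per crawler pod
max_browser_windows: 8        # falls back to a value derived from
                              # max_crawl_scale if unset
```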

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-06-03 13:37:30 -07:00
Ilya Kreymer
cb50c7c2c2
Pause / Resume Crawls Initial Implementation (#2572)
- add 'pause' crawl state (fixes #2567)
- gracefully shut down crawler pods, and then the redis pod, when paused
- crawler uploads WACZ before shutting down (dependent on
webrecorder/browsertrix-crawler#824, supported in 1.6.1+)
- add 'paused_at' on crawl spec to indicate when crawl is paused
- support a max pause time limit, after which the crawl is automatically
stopped
- add 'stopped_pause_expired' state for when the pause expires and the
crawl is stopped
- /crawl/<id>/{pause,resume} apis to toggle 'paused' on crawl spec
- ui: add pause/resume button, paused state (partially addresses #2568)
- ui: add pausing/resuming derivative states when crawl is running and
pausing, or paused and not pausing (partially addresses #2569)
- Designed to work with crawler 1.6.1+, which supports pausing + uploading on pause

Work on #2566, Fixes #2576 

---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
2025-05-21 14:05:16 -07:00
Ilya Kreymer
c134b576ae
Optimize presigning for replay.json (#2516)
Fixes #2515.

This PR introduces significantly optimized logic for presigning URLs
for crawls and collections.
- For collections, the files needed from all crawls are looked up, and
then the 'presign_urls' table is merged in one pass, resulting in a
unified iterator over files and the presign URLs for those files.
- For crawls, the presign URLs are also looked up once, and the same
iterator is used for a single crawl with a passed-in list of CrawlFiles.
- URLs that are already signed are added to the return list.
- For any remaining URLs to be signed, a bulk presigning function is
added, which shares an HTTP connection and signs 8 files in parallel
(customizable via the helm chart, though this may not be needed). This
function is used to call the presigning API in parallel.
2025-05-20 12:09:35 -07:00
Ilya Kreymer
f1fd11c031
storage: use s3v4 signature for presigning urls (#2611)
Use the V4 ('s3v4') signature version for all presigned URLs to support
Backblaze, fixes #2472
- add 'access_addressing_style' to be able to choose virtual/path
addressing for the access endpoint (defaults to 'virtual' as before; see
the sketch below)
- fix minio presigning with v4 by using the 'path' addressing style for
minio
- if the path matches '/data/' for the internal minio bucket, always use
'path'
- also make the minio access path '/data/' configurable

also simplify running in any namespace with default settings:
- don't hardcode 'local-minio.default'
- in the crawlers namespace, add a 'local-minio' ExternalName service
which maps to the main namespace service.
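
A sketch of where the new setting might sit in helm values; its placement under a storage definition is an assumption:

```yaml
storages:
  - name: default
    access_endpoint_url: "https://s3.example.com/bucket/"
    access_addressing_style: virtual  # use "path" for endpoints like minio
```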
2025-05-19 15:44:36 -07:00
Tessa Walsh
a51f7c635e
Add behavior logs from Redis to database and add endpoint to serve (#2526)
Backend work for #2524

This PR adds a second dedicated endpoint similar to `/errors`, as a
combined log endpoint would give the false impression of being the
complete crawl logs (which is far from what we're serving in Browsertrix
at this point).

Eventually, when we have support for streaming live crawl logs in
`crawls/<id>/logs`, I'd like to deprecate these two dedicated endpoints
in favor of that, but for now this seems like the best solution.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-08 02:16:10 +02:00
Ilya Kreymer
62e47a8817
support overriding crawler image pull policy per channel (#2523)
- add 'imagePullPolicy' field to each crawler channel declaration (see
the sketch below)
- if unset, defaults to the existing 'crawler_image_pull_policy' setting.

fixes #2522
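
A hedged sketch of a per-channel override; the channel fields other than `imagePullPolicy` are assumptions:

```yaml
crawler_channels:
  - id: default
    image: "docker.io/webrecorder/browsertrix-crawler:latest"
    imagePullPolicy: Always  # falls back to crawler_image_pull_policy if unset
```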

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-31 14:11:41 -07:00
Ilya Kreymer
9250befea4
ingress: remove X-Forwarded-Proto snippet, no longer needed (and now possibly considered unsafe) (#2519)
X-Forwarded-Proto is now provided by the standard ingress-nginx config
2025-03-25 17:24:55 -07:00
Ilya Kreymer
e13c3bfb48
move db migrations to initContainers: (#2449)
- should avoid gunicorn worker timeouts for long-running migrations,
also fixes #2439
- add main_migrations as an entrypoint that just runs db migrations,
using the existing init_ops() call
- first run a 'migrations' container with the same resources as 'app' and 'op'
- additional typing for db initialization
- clean up unused code related to running only once and waiting for the db to be ready
- fixes #2447
2025-03-03 13:13:15 -08:00
Ilya Kreymer
67668438c0
ingress: only set ssl-redirect if using tls (#2432)
otherwise, the http path should be accessible. Useful when TLS
termination is handled outside of the ingress.
2025-02-26 23:12:07 -08:00
Tessa Walsh
f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group; instead sort with an
index, either by seed + ts or by url, to get top matches
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when the dialog is opened.


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
Tessa Walsh
0e9e70f3a3
Add WACZ filename, depth, favIconUrl, isSeed to pages (#2352)
Adds `filename` to pages, pointing to the WACZ file those pages come
from, as well as depth, favIconUrl, and isSeed. Also adds an idempotent
migration to backfill this information for existing pages, and increases
the backend container's startupProbe time to 24 hours to give it
sufficient time to finish the migration.
---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-05 15:50:04 -05:00
Tessa Walsh
5684e896af
Add support for autoclick (#2313)
Fixes #2259 

This PR brings backend and frontend support for the new autoclick
behavior in Browsertrix, introduced in Browsertrix Crawler 1.5.0+

On the backend, we introduce `min_autoclick_crawler_image` to
`values.yaml`, with a default value of
`"docker.io/webrecorder/browsertrix-crawler:1.5.0"`. If this is set and
the crawler version for a new crawl is less than this value, the
autoclick behavior is removed from the behaviors list in the configmap
created for the crawl.
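
The default from this PR, as a values snippet:

```yaml
# Crawls pinned to an older crawler image have the autoclick behavior
# stripped from the behaviors list in their per-crawl configmap.
min_autoclick_crawler_image: "docker.io/webrecorder/browsertrix-crawler:1.5.0"
```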

The one caveat is that a crawler image tag like "latest" will always be
parsed as greater than `min_autoclick_crawler_image`, so there is the
potential for the crawler to run into issues if a non-numeric image tag
is used with an older version of the crawler. For production we use
hardcoded specific versions of the crawler except for the dev channel,
which from here on out will include autoclick support, so I think this
should be okay (this is also true of the existing check for
`min_qa_crawler_image`).

On the frontend, I've added a checkbox (unchecked by default) in the
"Limits" section just below the current checkbox for autoscroll. We
might want to move these to a different section eventually - I'm not
sure Limits is the right place for them - but I wanted to be consistent
with things as they are.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-01-16 12:44:00 -08:00
Dmitriy Pertsev
246bcc73c5
Use new ingressClassName only by default (#2268)
- By default, use only `ingressClassName` for ingress class name and
corresponding field in cert-manager
- Only use the old 'kubernetes.io/ingress.class' annotation if
ingress.useOldClassAnnotation is set (see the sketch below)
- Allow using only the old annotation for backwards compatibility, e.g.
for GCP
- Closes #2267 and #1570
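
A minimal values sketch of the opt-out:

```yaml
ingress:
  useOldClassAnnotation: true  # keep the legacy annotation, e.g. for GCP
```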

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-01-15 23:23:50 -08:00
sua yoo
b36ed9f730
feat: Track collection events (#2256)
- Renames `inject_analytics` to `inject_extra` and updates docs (see the
sketch below)
- Manually tracks page views to enable passing custom props
- Tracks copying collection share link and downloading a public
collection
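
A hedged sketch of the renamed value, assuming it takes arbitrary markup to inject like the analytics snippet it generalizes; the script tag is a placeholder:

```yaml
inject_extra: >-
  <script defer data-domain="app.example.org"
          src="https://plausible.io/js/script.js"></script>
```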

---------

Co-authored-by: emma <hi@emma.cafe>
2025-01-13 15:15:49 -08:00
Tessa Walsh
589819682e
Optionally delay replica deletion (#2252)
Fixes #2170

The number of days to delay file replica deletion is configurable in the
Helm chart with `replica_deletion_delay_days` (set by default to 7 days
in `values.yaml` to encourage good practice, though we could change
this; see the snippet below).

When `replica_deletion_delay_days` is set to an int above 0, where a
delete-replica job would otherwise be started immediately as a Kubernetes
Job, a CronJob is created instead, with a cron schedule set to run yearly
starting x days from the current moment. This CronJob is then deleted by
the operator after the job successfully completes. If a failed background
job is retried, it is re-run immediately as a Job rather than being
scheduled out into the future again.
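
The default from this PR:

```yaml
# Days to wait before actually deleting a replica; 0 deletes immediately
# as a regular Job instead of a delayed CronJob.
replica_deletion_delay_days: 7
```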

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-12-19 18:50:28 -08:00
Emma Segal-Grossman
b650762a45
Allow configuring available languages from helm chart (#2230)
Closes #2223 

- [x] Adds `localesAvailable` to `/api/settings` endpoint, and uses that
list if available, rather than the full list of translated locales, to
determine which options to display to users
- [x] ~~Uses the user's browser locales, filtered to the current
language setting, for formatting numbers, dates, and durations~~
- [x] Adds & persists checkbox for "use same language for formatting
dates and numbers" in user settings
- [x] Replaces uses of `sl-format-bytes` with `localize.bytes(...)`, and
`sl-format-date` with replacement `btrix-format-date` that properly
handles fallback locales
- [x] Caches all number/duration/datetime formatters by a combined key
consisting of app language, browser language, browser setting, and
formatter options so that all formatters can be reused if needed
(previously any formatter with non-default options would be recreated
every render)
- [x] Splits out ordinal formatting from number formatter, as it didn't
make much sense in some non-English locales
- [x] Adds a little demo of date/time/duration/number formatting so you
can see what effect your language settings have
  


https://github.com/user-attachments/assets/724858cb-b140-4d72-a38d-83f602c71bc7

---------

Signed-off-by: emma <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-12-13 22:31:26 -05:00
Ilya Kreymer
db39333ef4
Send subscription cancelation email (#2234)
Adds sending a cancellation email when a subscription is cancelled.
- The email may also include an optional survey URL, if configured in
the helm chart `survey_url` setting (see the snippet below).
- Cancellation e-mail is configured in the `sub_cancel` e-mail template.
- E-mails are sent to all org admins.
- Also adds a `trialing_canceled` subscription state to differentiate
from the default `trialing`, which automatically rolls over into `active`.
- The email is sent when: a new cancellation date is added for an
`active` subscription, or a `trialing` subscription is changed to
`trialing_canceled`. (A subscription can be canceled/uncanceled several
times before the actual date, and an e-mail is sent every time it is
canceled.)
- The 'You have X days left of your trial' message is also always
displayed when the state is `trialing_canceled`.

Fixes #2229
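
A minimal snippet; the URL is a placeholder:

```yaml
survey_url: "https://example.org/cancellation-survey"
```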
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-12-12 11:52:38 -08:00
Emma Segal-Grossman
a65ca49ddd
Plausible analytics (#2226)
Closes #2222 

Adds a runtime script that is set either to inject the Plausible script
tags or to do nothing, and runs at initialization of the frontend
container.
2024-12-10 16:30:22 -08:00
Tessa Walsh
1b1819ba5a
Move org deletion to background job with access to backend ops classes (#2098)
This PR introduces background jobs that have full access to the backend
ops classes and moves the delete org job to a background job.
2024-10-10 14:41:05 -04:00
Ilya Kreymer
c33f749515
Frontend hosted-docs (#2107)
Fixes #2106 

Docs are now hosted as part of the frontend at `/docs` by default.

- If `docs_url` is set in the helm chart, the `/docs` endpoint will
redirect to that URL instead (see the snippet below)
- Use multi-stage python image to build mkdocs as part of frontend, then
copy static output
- Dir layout: move mkdocs.yml and docs into frontend/docs
- CI: Update docs build GH action to use new path
- Update all frontend paths to use `/docs/` instead of
`https://docs.browsertrix.com/`
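
A minimal snippet for the redirect case:

```yaml
docs_url: "https://docs.browsertrix.com/"
```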

---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-10-08 14:56:34 -07:00
Vinzenz Sinapius
bb6e703f6a
Configure browsertrix proxies (#1847)
Resolves #1354

Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires Browsertrix Crawler 1.3+)

Config:
- proxies defined in the btrix-proxies subchart
- can be configured via the btrix-proxies key or a separate proxies.yaml
file via the separate subchart (sketched below)
- proxies list is refreshed automatically if crawler_proxies.json
changes, if the subchart is deployed
- support for ssh and socks5 proxies
- proxy keys added to secrets in the subchart
- support for a default proxy that is always used if no other proxy is
configured; prevent starting the cluster if the default proxy is not
available
- prevent starting a manual crawl if a previously configured proxy is no
longer available, returning an error
- force the 'btrix' username and group name on the browsertrix-crawler
non-root user to support ssh

Operator:
- support crawling through proxies, passing proxyId in the CrawlJob
- support running profile browsers with a designated proxy, passing
proxyId to the ProfileJob
- prevent starting a scheduled crawl if a previously configured proxy is
no longer available

API / Access:
- /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only)
- /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org
- /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only)
- superadmin can configure which orgs can use which proxies, stored on the org
- superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org.

UI:
- Superadmin has an 'Edit Proxies' dialog to configure, for each org,
whether it has dedicated proxies and access to shared proxies.
- Users can select a proxy in Crawl Workflow browser settings
- Users can choose to launch a browser profile with a particular proxy
- Display which proxy was used to create a profile in the profile selector
- Users can choose which default proxy to use for new workflows in
Crawling Defaults
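
A hypothetical proxies.yaml sketch; the field names are assumptions, only the ssh and socks5 support comes from this PR:

```yaml
proxies:
  - id: us-east-ssh
    label: "US East (SSH)"
    url: "ssh://proxy-user@proxy.example.org"
  - id: eu-socks
    label: "EU (SOCKS5)"
    url: "socks5://proxy.example.org:1080"
```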

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-02 18:35:45 -07:00
Ilya Kreymer
1f919de294
Make custom auto-resize crawler volume ratio adjustable (#2076)
Make the avail / used storage ratio (for crawler volumes) adjustable.
Disable auto-resize if set to 0.
Follow-up to #2023
2024-09-12 09:28:19 -07:00
sua yoo
4c36c80351
feat: Display scale as number of browser windows (#2057)
Resolves https://github.com/webrecorder/browsertrix/issues/2048

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-09-05 17:32:40 -07:00
sua yoo
337454f8c9
feat: Add link to hosted sign-up page (#2045)
Resolves https://github.com/webrecorder/browsertrix/issues/2043

### Changes

- Shows link to sign up in UI if `sign_up_url` is configured.
- Expires settings in session storage (for now)
2024-08-26 17:26:25 -07:00
Ilya Kreymer
7fa2b61b29
Execution time tracking tweaks (#1994)
Tweaks to how execution time is tracked for more accuracy + excluding
waiting states:
- don't update if the crawl state is a 'waiting state' (waiting for
capacity or waiting for an org limit)
- rename start states -> waiting states for clarity
- reset lastUpdatedTime if two consecutive updates see a non-running
state, to ensure non-running states don't count while still accounting
for occasional hiccups -- if only one update detects a non-running
state, don't reset
- webhooks: move the start webhook to when the crawl actually starts for
the first time (db lastUpdatedTime not yet set + crawl is running)
- don't set lastUpdatedTime until pods actually running
- set crawljob update interval to every 10 seconds for more accurate
execution time tracking
- frontend: show seconds in 'Execution Time' display
2024-08-06 09:44:44 -07:00
Ilya Kreymer
96691a33fa
Fix for cronjob skipping response (#1976)
If a cronjob is disabled, the operator should quickly return a success
value so that the job can be terminated.
It was previously returning an incorrect response, causing disabled
cronjobs to not be cleaned up. Adds proper typing to always return the
correct response
2024-07-29 12:24:18 -07:00
Ilya Kreymer
b35669af8d
disable behaviors for QA runs via configmap (#1963)
- make the crawl args a reusable template
- adds QA_ARGS to the configmap, set to the same value as CRAWL_ARGS but
with --behaviors= prepended to disable behaviors for QA, improving the
performance of QA runs (see the sketch below)

fixes #1962
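
A sketch of the resulting configmap data; the variable names come from this PR, the argument strings are placeholders:

```yaml
data:
  CRAWL_ARGS: "--workers 4 --sizeLimit 0"
  QA_ARGS: "--behaviors= --workers 4 --sizeLimit 0"
```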
2024-07-23 19:54:21 -07:00
Ilya Kreymer
01ddf95a56
allow disabling of auto-resize of crawler pods (#1964)
- only enable if 'enable_auto_resize' is true; defaults to false (see
the sketch below)
- if true, set the memory limit to 1.2x the memory request, resize when
hitting a 'soft oom' at the initial request, and adjust by 1.2x (current
behavior) up to max_crawler_memory
- if false, set the memory limit to max_crawler_memory and never adjust
memory requests or memory limits
- part of #1959
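
A values sketch; `crawler_memory` and `max_crawler_memory` are existing chart settings, the sizes are placeholders:

```yaml
enable_auto_resize: false   # default: pods are never resized
crawler_memory: 1Gi         # per-pod memory request
max_crawler_memory: 4Gi     # with auto-resize off, also the hard limit
```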
2024-07-23 21:00:40 -04:00
Ilya Kreymer
9a67e28f13
Adds Subscription API (#1914)
Fixes https://github.com/webrecorder/browsertrix/issues/1905

- adds a new top-level `/api/subscriptions` endpoint and a SubOps
handler on the backend
- subscriptions API endpoints are enabled only if `billing_enabled` is
set in the helm chart
- new POST /subscriptions/create, /subscriptions/update,
/subscriptions/cancel API endpoints
- Subscriptions mongo collection storing timestamped /subscription
API events
- GET /subscriptions/events API to get subscription events, with support
for filtering and sorting
- Subscription data model
- Support for setting and handling readOnlyOnCancel on the org
- /orgs/<id>/billing-portal to look up portalUrl using an external API
- subscription included in org getter and list views
- mark org as readOnly for subscription status `paused_payment_failed`;
clear it on status `active`

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-07-10 17:41:16 -07:00
Vinzenz Sinapius
01d8bdc5e6
Crawler network policy (#1727)
Limits egress traffic from crawler/profilebrowser pods to the internet
and a limited set of internal services (dns, redis, frontend,
auth-signer) on certain ports

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-03 10:55:03 -07:00
Tessa Walsh
f076e7d9e3
Add superuser API endpoints to export and import org data (#1394)
Fixes #890 

This PR introduces new streaming superuser-only API endpoints to export
and import database information for an organization. New administrator
deployment documentation on how to manage the process and copy files
between S3 buckets as needed is also included.

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-02 17:14:34 -04:00
Ilya Kreymer
e1ef894275
Extends Org Create endpoint + shared secret auth (#1897)
Updates the /api/orgs/create endpoint to:
- not require name / slug, as the org will be renamed by the first user
via #1870
- support optional quotas
- support an optional first admin user email, who will receive an invite
to join the org.

Also supports a new shared secret mechanism, allowing an external
automation to access the /api/orgs/create endpoint (and, thus far, only
that endpoint) via a shared secret instead of a normal login.
2024-07-01 09:37:02 -07:00
Ilya Kreymer
3cd52342a7
Remove Crawl Workflow Configmaps (#1894)
Fixes #1893 

- Removes crawl-workflow-scoped configmaps and replaces them with
operator-controlled per-crawl configmaps that contain only the json
config passed to Browsertrix Crawler (as a volume).
- Other configmap settings are replaced by custom CrawlJob options
(mostly they already were; just added profile_filename and
storage_filename)
- Cron jobs are also updated to create CrawlJobs without relying on
configmaps, querying the db for additional settings.
- The `userid` associated with cron jobs is set to the user that last
modified the schedule of the crawl, rather than whoever last modified
the workflow
- Various functions that deal with updating configmaps have been removed,
including in migrations.
- New migration 0029 added to remove all crawl workflow configmaps
2024-06-28 15:25:23 -07:00
Tessa Walsh
7af3980323
Add billing enabled and sales email to Helm chart and /settings API endpoint (#1873)
Backend work for first two tasks of
https://github.com/webrecorder/browsertrix/issues/1875

New /billing API endpoint to be added separately once we have a better
idea of what data we can get from the payment processor.
2024-06-25 10:55:29 -04:00
Ilya Kreymer
fa6627ce70
ensure QA configmap is updated for long running QA runs: (#1865)
- add an 'expire_at_duration_seconds' set to 75% of the actual presign
duration, i.e. <25% remaining until the presigned URL actually expires,
to ensure presigned URLs are renewed earlier than their actual expiry
- set the cached expireAt time to the renew-at time for more frequent
updates
- update the QA configmap in place with updated presigned URLs when the
expireAt time is reached
- mount the qa config volume under /tmp/qa/ without subPath to get
automatic updates, which the crawler will handle
- tests: fix qa test typo (from main)
- fixes #1864
2024-06-12 10:51:35 -07:00
Ilya Kreymer
d42de92d75
QA analysis scale configurable in helm chart (#1843)
- allow configuring the QA run scale via the 'qa_scale' setting in helm
values (overriding any setting on the QA crawljob)
- adds additional comments to the browser-instances helm values settings
for clarity
- fixes #1842
2024-05-30 12:59:21 -07:00
Ilya Kreymer
61239a40ed
include workflow config in QA runs + different browser instances for QA (#1829)
Previously, the workflow's crawl settings were not included at all in
QA runs.
This mounts the crawl workflow config, as well as the QA configmap, into
QA run crawls, allowing page limits from the crawl workflow to be
applied to QA runs.

It also allows a different number of browser instances to be used for QA
runs, as QA runs might work better with fewer browsers (e.g. 2 instead
of 4). This can be set with `qa_browser_instances` in the helm chart
(see the snippet below).

QA browser workers default to 1 if unset (for now, for best results)

Fixes #1828
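
A minimal snippet:

```yaml
qa_browser_instances: 2  # browsers per pod for QA runs; defaults to 1
```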
2024-05-29 13:32:25 -07:00
Ilya Kreymer
f6c0791dc1
fix missing settings / typos: (#1748)
- ensure max_crawler_memory_size is initialized before it is set!
- pass profile_browser_memory / profile_browser_cpu from chart values
- map volume to /tmp/home to avoid persisting /tmp for profiles
2024-04-25 09:00:17 +02:00
Ilya Kreymer
ec74eb4242
operator: add 'max_crawler_memory' to limit autosizing of crawler pods (#1746)
Adds a `max_crawler_memory` chart setting which, if set, defines the
upper memory limit that crawler pods can be resized up to.
If not set, auto-resizing is disabled and pods are always set to
'crawler_memory' memory
2024-04-24 15:16:32 +02:00
Ilya Kreymer
b94070160b
allow configuring designated registration org to which new users can register (#1735)
if 'registration_enabled' is set, check 'registration_org_id' for the
org id of an existing org that new users should be added to when they
register; if omitted, default to the default org (see the snippet below)

Fixes #1729
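
A values sketch; the exact value formats are assumptions:

```yaml
registration_enabled: "1"                 # enable self-registration
registration_org_id: "<existing-org-id>"  # unset = use the default org
```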
2024-04-23 17:11:37 -04:00
Vinzenz Sinapius
a8336925b6
Run crawler and profilebrowser with non-root user (#1625)
With these changes, crawler and profilebrowser jobs run as a
non-root user.
2024-04-17 12:03:33 -07:00
Ilya Kreymer
835014d829
restrict qa runs to a 'min_qa_crawler_image' if set in the chart (#1685)
- fixes #1684
- can be used to optionally restrict QA to only some crawls (e.g. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
2024-04-17 08:48:33 -07:00
Ilya Kreymer
95f5605af7
renumber crawl priority classes: (#1673)
- priority classes below -10 are ignored by the cluster-autoscaler, so
QA jobs with too-low priorities never run
- start crawl priorities at 0 going down (same as before)
- start qa run priorities at -2 going down (instead of -100)
- this means a crawl with a scale of 3 can be preempted by the 1st qa
pod, but otherwise crawls have higher priority
- rename the priority classes, as they are otherwise immutable and error
on helm upgrade

This allows more room in the lower priority classes for other types of
objects, while keeping in mind the -10-and-below threshold (see:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
2024-04-13 12:24:43 -07:00
Ilya Kreymer
17f49a52de
email templates update + customization + doc update (fixes #1652) (#1653)
- modify invite email template to answer common questions
- email templates: make each email template overridable with --set-file
- docs: update customization doc to document how to customize email
templates

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-08 12:27:47 -07:00
Ilya Kreymer
c1817cbe04
add horizontal pod autoscaler for backend and frontend via helm charts (#1633)
Supports horizontal pod autoscaling (hpa) for backend and frontend pods:
- use cpu and memory averages
- adjust base memory + cpu for backend
- thresholds set to 80% cpu and 95% memory utilization by default
(configurable in values.yaml)
- instead of backend and frontend replica counts, set max replicas in
values.yaml (see the snippet below)
- only enable hpa if backend_max_replicas or frontend_max_replicas is
>1; default to 1 for now
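
A minimal snippet; hpa stays off unless a max value exceeds 1:

```yaml
backend_max_replicas: 3
frontend_max_replicas: 3
```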
2024-03-28 16:39:27 -07:00
Ilya Kreymer
3438133fcb
Crawler pod memory padding + auto scaling (#1631)
- set the memory limit to 1.2x the memory request to provide extra
padding and avoid OOM
- attempt to resize crawler pods by 1.2x when they exceed 90% of
available memory
- do a 'soft OOM' (send an extra SIGTERM) to the pod when it reaches
100% of requested memory, resulting in a faster graceful restart while
avoiding an instant OOM kill from the system
- Fixes #1632

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-28 16:39:00 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl.
The QA list API returns data from the finished list, sorted by most
recent first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating dedicated
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for the API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00