browsertrix

Author	SHA1	Message	Date
Emma Segal-Grossman	b0f2d87ce2	hotfix: workflow list - rewrite arrays in url search params to remove items (#2734 ) ## Changes - Deletes and rewrites arrays in URL search params in workflow list when editing array filters (i.e. tags & profiles) - Removes a missed `console.log` - bump to 1.17.3 cc @SuaYoo --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-07-14 14:30:18 -07:00
Ilya Kreymer	6f1ced0b01	chart: use latest replayweb.page release by default (#2724 ) Avoid having Browsertrix be stuck on old RWP releases, probably better to just use latest as RWP releases now have additional testing.	2025-07-10 13:35:20 -07:00
Ilya Kreymer	b915e734d1	version: bump to 1.17.2	2025-06-30 14:20:43 -07:00
Tessa Walsh	db4621602e	Bump version to 1.17.1 (#2678 )	2025-06-18 13:09:49 -04:00
Ilya Kreymer	dde23426b2	version: bump to 1.17.0!	2025-06-12 17:37:07 -04:00
Ilya Kreymer	d4a2a66d6d	additional scale / browser window cleanup to properly support QA: (#2663 ) - follow up to #2627 - use qa_num_browser_windows to set exact number of QA browsers, fallback to qa_scale - set num_browser_windows and num_browsers_per_pod using crawler / qa values depending if QA crawl - scale_from_browser_windows() accepts optional browsers_per_pod if dealing with possible QA override - store 'desiredScale' in CrawlStatus to avoid recomputing for later scale resolving - ensure status.scale is always the actual scale observed	2025-06-12 13:09:04 -04:00
Ilya Kreymer	001277ac9d	docs: add docs for path / virtual addressing (#2669 ) Add docs about path / virtual 'access_addressing_style' that is available for each storage option. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-06-12 13:08:27 -04:00
Vinzenz Sinapius	0e0e663363	helm: add crawler_network_policy_additional_egress (#2641 ) - Adds `crawler_network_policy_additional_egress` setting, to add egress rules to the existing crawler network policy. Useful for when you want to allow-list a single IP without replacing the whole network policy. - Adds docs about `crawler_network_policy_additional_egress` to the customization page. - Resolves #2121 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-06-10 16:19:42 -07:00
Ilya Kreymer	8ea16393c5	Optimize single-page crawl workflows (#2656 ) For single page crawls: - Always force 1 browser to be used, ignoring browser windows/scale setting - Don't use custom PVC volumes in crawler / redis, just use emptyDir - no chance of crawler being interrupted and restarted on different machine for a single page. Adds a 'is_single_page' check to CrawlConfig, checking for either limit or scopeType / no extra hops. Fixes #2655	2025-06-10 12:13:57 -07:00
Tessa Walsh	dc41468daf	Allow users to run crawls with 1 or 2 browser windows (#2627 ) Fixes #2425 ## Changed - Switch backend to primarily using number of browser windows rather than scale multiplier (including migration to calculate `browserWindows` from `scale` for existing workflows and crawls) - Still support `scale` in addition to `browserWindows` in input models for creating and updating workflows and re-adjusting live crawl scale for backwards compatibility - Adds new `max_browser_windows` value to Helm chart, but calculates the value from `max_crawl_scale` as fallback for users with that value already set in local charts - Rework frontend to allow users to select multiples of `crawler_browser_instances` or any value below `crawler_browser_instances` for browser windows. For instance, with `crawler_browser_instances=4` and `max_browser_windows=8`, the user would be presented with the following options: 1, 2, 3, 4, 8 - Sets maximum width of screencast to image width returned by `message` --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-06-03 13:37:30 -07:00
Ilya Kreymer	0e06ccd746	version: bump to 1.17.0-beta.0	2025-06-02 14:46:32 -07:00
Ilya Kreymer	cb50c7c2c2	Pause / Resume Crawls Initial Implementation. (#2572 ) - add 'pause' crawl state (fixes #2567) - gracefully shut down crawler pods, and then redis pod when paused - crawler uploads WACZ before shutting down (dependent on webrecorder/browsertrix-crawler#824, supported in 1.6.1+) - add 'paused_at' on crawl spec to indicate when crawl is paused - support max pause time limit, after which crawl becomes automatically stopped. - add 'stopped_pause_expired' when pause automatically expires and crawl is stopped - /crawl/<id>/{pause,resume} apis to toggle 'paused' on crawl spec - ui: add pause/resume button, paused state (partially addresses #2568) - ui: add pausing/resuming derivative states when crawl is running and pausing, or paused and not pausing (partially addresses #2569) - Designed to work with crawler 1.6.1+ which supports pausing + uploading on pause Work on #2566, Fixes #2576 --------- Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: sua yoo <sua@suayoo.com>	2025-05-21 14:05:16 -07:00
Ilya Kreymer	e995811dd4	version: bump to 1.16.2	2025-05-20 18:43:22 -07:00
Ilya Kreymer	f1fd11c031	storage: use s3v4 signature for presigning urls (#2611 ) Use V4 ('s3v4') signature version for all presigning URLs to support backblaze, fixes #2472 - add 'access_addressing_style' to be able to choose virtual/path addressing for access endpoint (default to 'virtual' as before) - fix minio presigning with v4 by using 'path' addressing style for minio - if path matches '/data/' for internal minio bucket, then always use 'path' - also make minio access path '/data/' configurable also simplify running in any namespace with default settings: - don't hardcode 'local-minio.default' - in crawlers namespace, add a 'local-minio' externalName service which maps to the main namespace service.	2025-05-19 15:44:36 -07:00
Tessa Walsh	c73512dbd4	Bump version to 1.16.1 (#2606 )	2025-05-13 17:29:49 -04:00
Ilya Kreymer	652e8a6085	version: bump to 1.16.0	2025-05-08 14:30:00 -07:00
Ilya Kreymer	0cb3bd19f6	version: update to 1.15.0	2025-04-09 12:28:01 +02:00
Ilya Kreymer	b5b4c4da15	version: update to 1.14.8	2025-03-31 14:17:53 -07:00
Ilya Kreymer	62e47a8817	support overriding crawler image pull policy per channel (#2523 ) - add 'imagePullPolicy' field to each crawler channel declaration - if unset, defaults to the setting in the existing 'crawler_image_pull_policy' field. fixes #2522 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-03-31 14:11:41 -07:00
Ilya Kreymer	b3950dd03f	version: update to 1.14.7	2025-03-25 17:25:24 -07:00
Ilya Kreymer	46be6a0cf6	version: bump to 1.14.6	2025-03-20 16:52:20 -07:00
Ilya Kreymer	b63caf74ad	cleanup unused chart values + change mongo default (#2484 ) - Removes chart values that are unused - Also change `local-mongo.default` -> `local-mongo`, `local-minio.default` -> `local-minio` as some users have reported issues with `.default` and it will certainly break if not deploying Browsertrix in the `default` namespace.	2025-03-20 08:30:45 -07:00
Ilya Kreymer	eb300815a7	Fixes #2488 (#2493 ) - Fixes #2488 - Adds a k8s api call to set `suspend=false` on Job when associated CrawlJob is finished. - bump version - released as 1.14.5	2025-03-19 10:06:25 -07:00
Ilya Kreymer	d8365c734f	version: bump to 1.14.4	2025-03-08 15:58:18 -08:00
Ilya Kreymer	03fa00df45	set default crawler channel if not set, possible fix for #2458 (#2469 ) update default RWP version	2025-03-07 12:32:19 -08:00
Ilya Kreymer	9466e83d18	version: bump to 1.14.3	2025-03-03 15:20:40 -08:00
Ilya Kreymer	cb52da66dc	version: bump to 1.14.2	2025-02-27 14:13:03 -08:00
Ilya Kreymer	376c9981dc	version: bump to 1.14.1	2025-02-26 23:15:01 -08:00
Ilya Kreymer	e67708bd4f	version: update to 1.14.0	2025-02-24 14:49:46 -08:00
Ilya Kreymer	8a507f0473	Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417 ) - consolidate list_pages() and list_replay_query_pages() into list_pages() - to keep backwards compatibility, add <crawl>/pagesSearch that does not include page totals, keep <crawl>/pages with page total (slower) - qa frontend: add default 'Crawl Order' sort order, to better show pages in QA view - bgjob: account for parallelism in bgjobs, add logging if succeeded mismatches parallelism - QA sorting: default to 'crawl order' by default to get better results. - Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats - Bgjobs: give custom op jobs more memory	2025-02-21 13:47:20 -08:00
Ilya Kreymer	3ca68bf1d2	version: 1.14.0-beta.6	2025-02-20 15:37:33 -08:00
Tessa Walsh	f8fb2d2c8d	Rework crawl page migration + MongoDB Query Optimizations (#2412 ) Fixes #2406 Converts migration 0042 to launch a background job (parallelized across several pods) to migrate all crawls by optimizing their pages and setting `version: 2` on the crawl when complete. Also Optimizes MongoDB queries for better performance. Migration Improvements: - Add `isMigrating` and `version` fields to `BaseCrawl` - Add new background job type to use in migration with accompanying `migration_job.yaml` template that allows for parallelization - Add new API endpoint to launch this crawl migration job, and ensure that we have list and retry endpoints for superusers that work with background jobs that aren't tied to a specific org - Rework background job models and methods now that not all background jobs are tied to a single org - Ensure new crawls and uploads have `version` set to `2` - Modify crawl and collection replay.json endpoints to only include fields for replay optimization (`initialPages`, `pageQueryUrl`, `preloadResources`) if all relevant crawls/uploads have `version` set to `2` - Remove `distinct` calls from migration pathways - Consolidate collection recompute stats Query Optimizations: - Remove all uses of $group and $facet - Optimize /replay.json endpoints to precompute preload_resources, avoid fetching crawl list twice - Optimize /collections endpoint by not fetching resources - Rename /urls -> /pageUrlCounts and avoid $group, instead sort with index, either by seed + ts or by url to get top matches. - Use $gte instead of $regex to get prefix matches on URL - Use $text instead of $regex to get text search on title - Remove total from /pages and /pageUrlCounts queries by not using $facet - frontend: only call /pageUrlCounts when dialog is opened. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Emma Segal-Grossman <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-02-20 15:26:11 -08:00
Ilya Kreymer	a7c8ca4028	version: bump to 1.14.0-beta.1	2025-02-17 16:48:27 -08:00
Tessa Walsh	5684e896af	Add support for autoclick (#2313 ) Fixes #2259 This PR brings backend and frontend support for the new autoclick behavior in Browsertrix, introduced in Browsertrix 1.5.0+ On the backend, we introduce `min_autoclick_crawler_image` to `values.yaml`, with a default value of `"docker.io/webrecorder/browsertrix-crawler:1.5.0"`. If this is set and the crawler version for a new crawl is less than this value, the autoclick behavior is removed from the behaviors list in the configmap created for the crawl. The one caveat for this is that a crawler image tag like "latest" will always be parsed as greater than `min_autoclick_crawler_image`, so there is the potential for the crawler to run into issues if using a non-numeric image tag with an older version of the crawler. For production we use hardcoded specific versions of the crawler except for the dev channel, which from here on out will include autoclick support, so I think this should be okay (and is also true of the existing implementation for checking `min_qa_crawler_image`). On the frontend, I've added a checkbox (unchecked by default) in the "Limits" section just below the current checkbox for autoscroll. We might want to move these to a different section eventually - I'm not sure Limits is the right place for them - but I wanted to be consistent with things as they are. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-01-16 12:44:00 -08:00
Dmitriy Pertsev	246bcc73c5	Use new ingressClassName only by default (#2268 ) - By default, use only `ingressClassName` for ingress class name and corresponding field in cert-manager - Only use old 'kubernetes.io/ingress.class' if ingress.useOldClassAnnotation is set - Allow for using old annotation only for backwards compatibility, eg. for GCP - Closes #2267 and #1570 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-01-15 23:23:50 -08:00
Ilya Kreymer	12f358b826	Merge pull request #2271 from webrecorder/public-collections-feature feat: Public collections, includes: - feat: Public org profile page #2172 - feat: Collection thumbnails, start page, and public view updates #2209 - feat: Track collection events #2256	2025-01-13 19:32:45 -08:00
Ilya Kreymer	bab5345ad5	version: bump to 1.14.0-beta.0 for public collections!	2025-01-13 19:29:54 -08:00
sua yoo	b36ed9f730	feat: Track collection events (#2256 ) - Renames `inject_analytics` to `inject_extra` and updates docs - Manually tracks page views to enable passing custom props - Tracks copying collection share link and downloading a public collection --------- Co-authored-by: emma <hi@emma.cafe>	2025-01-13 15:15:49 -08:00
Ilya Kreymer	a21b2ff0df	version: bump to 1.13.2	2025-01-08 22:58:33 -08:00
Tessa Walsh	589819682e	Optionally delay replica deletion (#2252 ) Fixes #2170 The number of days to delay file replication deletion by is configurable in the Helm chart with `replica_deletion_delay_days` (set by default to 7 days in `values.yaml` to encourage good practice, though we could change this). When `replica_deletion_delay_days` is set to an int above 0, when a delete replica job would otherwise be started as a Kubernetes Job, a CronJob is created instead with a cron schedule set to run yearly, starting x days from the current moment. This cronjob is then deleted by the operator after the job successfully completes. If a failed background job is retried, it is re-run immediately as a Job rather than being scheduled out into the future again. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-12-19 18:50:28 -08:00
Ilya Kreymer	2060ee78b4	Support Presigning for use with custom domain (#2249 ) If access_endpoint_url is provided: - Use virtual host addressing style, so presigned URLs are of the form `https://bucket.s3-host.example.com/path/` instead of `https://s3-host.example.com/bucket/path/` - Allow for replacing `https://bucket.s3-host.example.com/path/` -> `https://my-custom-domain.example.com/path/`, where `https://my-custom-domain.example.com/path/` is the access_endpoint_url - Remove old `use_access_for_presign` which is no longer used - Fixes #2248 - docs: update deployment docs storages section to mention custom storages, access_endpoint_url --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-12-19 18:41:47 -08:00
Ilya Kreymer	60d07762be	version: bump to 1.13.1	2024-12-19 12:01:47 -08:00
Ilya Kreymer	cf60c43df2	version: bump to 1.13.0! (#2242 )	2024-12-13 20:32:38 -08:00
Ilya Kreymer	74ae3b0f8d	Add new locales (#2240 ) - By default, all locales are enabled to make it easy for local deployments to test new locales - Adds DE, FR, PT locales to make way for translation in Weblate	2024-12-13 19:59:09 -08:00
Emma Segal-Grossman	b650762a45	Allow configuring available languages from helm chart (#2230 ) Closes #2223 - [x] Adds `localesAvailable` to `/api/settings` endpoint, and uses that list if available, rather than the full list of translated locales, to determine which options to display to users - [x] ~~Uses the user's browser locales, filtered to the current language setting, for formatting numbers, dates, and durations~~ - [x] Adds & persists checkbox for "use same language for formatting dates and numbers" in user settings - [x] Replaces uses of `sl-format-bytes` with `localize.bytes(...)`, and `sl-format-date` with replacement `btrix-format-date` that properly handles fallback locales - [x] Caches all number/duration/datetime formatters by a combined key consisting of app language, browser language, browser setting, and formatter options so that all formatters can be reused if needed (previously any formatter with non-default options would be recreated every render) - [x] Splits out ordinal formatting from number formatter, as it didn't make much sense in some non-English locales - [x] Adds a little demo of date/time/duration/number formatting so you can see what effect your language settings have https://github.com/user-attachments/assets/724858cb-b140-4d72-a38d-83f602c71bc7 --------- Signed-off-by: emma <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-12-13 22:31:26 -05:00
Ilya Kreymer	db39333ef4	Send subscription cancelation email (#2234 ) Adds sending a cancellation email when a subscription is cancelled. - The email may also include an optional survey URL, if configured in helm chart `survey_url` setting. - Cancellation e-mail configured in `sub_cancel` e-mail template - E-mails are sent to all org admins. - Also adds `trialing_canceled` subscription state to differentiate from a default `trialing` which will automatically rollover into `active`. - The email is sent when: a new cancellation date is added for an `active` subscription, or a `trialing` subscription is changed to `trialing_canceled`. (A subscription can be canceled/uncanceled several times before actual date, and e-mail is sent every time it is canceled.) - The 'You have X days left of your trial' is also always displayed when state is in trialing_canceled. Fixes #2229 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-12-12 11:52:38 -08:00
Emma Segal-Grossman	a65ca49ddd	Plausible analytics (#2226 ) Closes #2222 Adds a runtime script that gets set to either inject the plausible script tags, or do nothing, that runs at initialization of the frontend container.	2024-12-10 16:30:22 -08:00
Ilya Kreymer	50dac7dc50	1.12.2 release -> main (#2181 ) Merge 1.12.2 release changes into main, includes: - Collection replay full refresh on metadata / archived items (#2176) - Fix for self-registration default org (#2178) - Prepend missing https in start URL (#2177) - Updated billing to support free trial messaging (#2179) --------- Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: SuaYoo <SuaYoo@users.noreply.github.com>	2024-11-26 11:17:07 -08:00
Ilya Kreymer	84a74c43a4	version: bump to 1.13.0-beta.0	2024-10-10 11:38:13 -07:00
Ilya Kreymer	c33f749515	Frontend hosted-docs (#2107 ) Fixes #2106 Docs are now hosted as part of the frontend at `/docs` by default. - If `docs_url` is set in the helm chart, the `/docs` endpoint will redirect to that endpoint instead - Use multi-stage python image to build mkdocs as part of frontend, then copy static output - Dir layout: mkdocs.yml and docs into frontend/docs - CI: Update docs build GH action to use new path - Update all frontend paths to use `/docs/` instead of `https://docs.browsertrix.com/` --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-10-08 14:56:34 -07:00

190 Commits
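
The browser-window selection rule described in commit dc41468daf (#2627) can be illustrated with a short sketch. This is not the Browsertrix implementation: the function name `browser_window_options` and its parameter names are hypothetical, and it only assumes the rule as worded in the commit message (any value below `crawler_browser_instances`, then multiples of it up to `max_browser_windows`).

```python
def browser_window_options(browsers_per_pod: int, max_browser_windows: int) -> list[int]:
    """Return the browser-window counts a user could select.

    Hypothetical helper: every value below browsers_per_pod is allowed,
    followed by each multiple of browsers_per_pod that does not exceed
    max_browser_windows.
    """
    if browsers_per_pod < 1 or max_browser_windows < 1:
        raise ValueError("both values must be positive")

    # Values strictly below one full pod (also capped at the maximum).
    below = list(range(1, min(browsers_per_pod, max_browser_windows + 1)))

    # Whole-pod multiples: browsers_per_pod, 2 * browsers_per_pod, ...
    multiples = list(range(browsers_per_pod, max_browser_windows + 1, browsers_per_pod))

    return below + multiples


# Example from the commit message:
# crawler_browser_instances=4 and max_browser_windows=8
print(browser_window_options(4, 8))  # [1, 2, 3, 4, 8]
```

With these inputs the sketch reproduces the options listed in the commit message (1, 2, 3, 4, 8). How the real frontend treats a maximum that is not a multiple of the per-pod value is not stated in the log, so that case is left to the simple cap shown here.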