Fixes #2780
This PR adds additional backend validation for seed file uploads, failing the upload if no valid seeds are found. It adds two new test cases to ensure seed uploads fail for binary files and for text files that do not contain any valid URLs.
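A minimal sketch of the kind of check this implies (the function name is hypothetical, not the actual backend helper): reject an upload unless at least one line parses as a valid URL.

```python
from urllib.parse import urlparse


def contains_valid_seed(contents: bytes) -> bool:
    """Return True if the uploaded file contains at least one valid seed URL."""
    try:
        text = contents.decode("utf-8")  # binary files fail to decode here
    except UnicodeDecodeError:
        return False
    for line in text.splitlines():
        parsed = urlparse(line.strip())
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return True
    return False
```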
Fixes #2753
- Adds `saveStorage` to `RawCrawlConfig` model in backend (see the sketch after this list)
- Adds option to Browser Settings pane of workflow editor
- Adds option to config details component
- Adds setting to docs
- Use latest crawler image for tests
- Due to the webrecorder/browsertrix-crawler#861 change, a crawl with no
successful pages should be treated as failed. Updates the fixture to allow
either the failed or complete state for backwards compatibility for now.
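A minimal sketch of the model change, assuming a Pydantic model with the surrounding settings elided; the field is presumably forwarded to the crawler when set.

```python
from typing import Optional

from pydantic import BaseModel


class RawCrawlConfig(BaseModel):
    # ... existing crawl settings elided ...
    saveStorage: Optional[bool] = None  # new in this PR; passed through to the crawler
```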
This PR adds a new checkbox to both page and seed crawl workflow types,
which will fail the crawl if behaviors detect the browser is not logged
in for supported sites.
Changes include:
- Backend support for the new crawler flag
- A new `failed_not_logged_in` crawl state (see the sketch after this list)
- Checkbox in the workflow editor and config details in the frontend (currently
in the Scope section - I think it makes sense to have this option up
front, but worth considering)
- User Guide documentation of new option
- A new nightly test for the new workflow option and
`failed_not_logged_in` state
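A hedged sketch of how the new state might be registered; only `failed_not_logged_in` is confirmed by this PR, and the other failure states shown are illustrative.

```python
from typing import Literal

# terminal failure states; "failed_not_logged_in" is the new addition
TYPE_FAILED_STATES = Literal["canceled", "failed", "failed_not_logged_in"]
FAILED_STATES = ["canceled", "failed", "failed_not_logged_in"]
```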
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Since seedfile deletion checks that the seedfile is not used in any
workflow, the seedfile should be deleted after the workflow is removed.
Noticed while checking #2744.
Fixes #2673
Changes in this PR:
- Adds a new `file_uploads.py` module and corresponding `/files` API
prefix with methods/endpoints for uploading, GETing, and deleting seed
files (can be extended to other types of files moving forward)
- Seed files are supported via `CrawlConfig.config.seedFileId` on POST
and PATCH endpoints. This `seedFileId` is replaced by a presigned URL when
passed to the crawler by the operator
- Seed files are read when first uploaded to calculate `firstSeed` and
`seedCount`, which are stored in the database and copied into the
workflow and crawl documents when they are created.
- Logic is added to store `firstSeed` and `seedCount` for other
workflows as well, and a migration is added to backfill the data, to maintain
consistency and fix some of the pymongo aggregations that previously
assumed all workflows would have at least one `Seed` object in
`CrawlConfig.seeds`
- Seed file and thumbnail storage stats are added to org stats
- Seed file and thumbnail uploads first check that the org's storage
quota has not been exceeded and return a 400 if so
- A cron background job (run weekly each Sunday at midnight by default,
but configurable) is added to look for seed files at least x minutes old
(1440 minutes, or 1 day, by default, but configurable) that are not in
use in any workflows, and to delete them when found (see the sketch after
this list). The backend pods ensure this k8s batch job exists on startup
and create it if it does not already exist. A database entry for each run of
the job is created in the operator on job completion so that it will
appear in the `/jobs` API endpoints, but retrying this type of
regularly scheduled background job is not supported, as we don't want to
accidentally create multiple competing scheduled jobs.
- Adds a `min_seed_file_crawler_image` value to the Helm chart that, if
set, is checked before creating a crawl from a workflow. If a workflow
cannot be run, the detail of the exception is returned in
`CrawlConfigAddedResponse.errorDetail` so that we can display the reason
in the frontend
- Adds a `SeedFile` model based on `UserFile` (formerly `ImageFile`), and ensures all APIs
returning uploaded files return an absolute presigned URL (either with the external origin or the internal service origin)
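A minimal sketch (collection and field names are assumptions, not the actual backend code) of the cleanup pass the cron job performs: delete seed files older than the configured minimum age that no workflow references.

```python
from datetime import datetime, timedelta, timezone

from pymongo import MongoClient

MIN_SEED_FILE_AGE_MINUTES = 1440  # 1 day by default, configurable


def delete_unused_seed_files(client: MongoClient) -> int:
    """Delete seed files old enough and no longer referenced by any workflow."""
    db = client["browsertrix"]
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=MIN_SEED_FILE_AGE_MINUTES)
    deleted = 0
    for file_doc in db.files.find({"type": "seedFile", "created": {"$lt": cutoff}}):
        # only delete if no workflow still points at this file
        in_use = db.crawl_configs.find_one(
            {"config.seedFileId": file_doc["_id"]}, projection={"_id": 1}
        )
        if not in_use:
            db.files.delete_one({"_id": file_doc["_id"]})
            deleted += 1
    return deleted
```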
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Fix race condition related to browser commit time
- The profile commit request waited for the browser to actually finish and
the profile to be saved. This could cause the request to time out, resulting
in a retry by which time the browser had already been closed.
- With these changes, the commit is now idempotent and returns
`waiting_for_browser` until the profile is actually committed.
- On the frontend, keep pinging the commit endpoint (with a timeout) while `waiting_for_browser` is returned; the profile is actually committed once the endpoint returns a profile id.
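A sketch of the polling pattern described above, transcribed to Python for illustration; the endpoint path and response fields are assumptions.

```python
import time

import requests


def commit_profile(api_base: str, browser_id: str, timeout: float = 120.0) -> str:
    """Keep asking the commit endpoint until it returns a profile id."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.post(f"{api_base}/profilebrowser/{browser_id}/commit", timeout=30)
        data = resp.json()
        if data.get("detail") == "waiting_for_browser":
            time.sleep(2)  # safe to ask again: the commit is idempotent
            continue
        return data["id"]  # endpoint returned a profile id: commit completed
    raise TimeoutError("profile commit did not complete in time")
```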
---------
Co-authored-by: sua yoo <sua@suayoo.com>
Fixes #2737
- Moves webhook-related tests to run nightly, to speed up CI runs and
avoid the periodic failures we've been getting lately.
- Also ensures all try/except blocks that have a `time.sleep` in the `try` also have a `time.sleep` in the `except`,
to avoid fast-looping retries
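The enforced pattern, sketched: a `time.sleep` in the `except` branch mirrors the one in the `try`, so a failing check waits instead of retrying in a tight loop (helper name is illustrative).

```python
import time


def wait_for(condition, attempts: int = 30, delay: float = 5.0) -> bool:
    """Poll a condition, sleeping between attempts whether it succeeds or raises."""
    for _ in range(attempts):
        try:
            if condition():
                return True
            time.sleep(delay)
        except Exception:
            time.sleep(delay)  # without this, exceptions cause fast-looping retries
    return False
```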
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
## Changes
- Deletes and rewrites arrays in URL search params in the workflow list when
editing array filters (i.e. tags & profiles)
- Removes a missed `console.log`
- Bumps to 1.17.3
cc @SuaYoo
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Connected to #2661
- Removes crawl workflows from being returned as part of the profile
response.
- Frontend: removes display of workflows in profile details.
- Adds an 'inUse' flag to all profile responses to indicate the profile is in
use by at least one workflow (see the sketch after this list)
- Adds 'profileid' as a possible filter for workflows search in
preparation for filtering by profile id (#2708)
- Makes 'profile_in_use' a proper error (returning 400) on profile
delete.
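A minimal sketch (collection and field names are assumptions) of how the 'inUse' flag can be computed when serializing a profile response.

```python
def profile_in_use(db, profileid) -> bool:
    """True if at least one workflow references this profile."""
    return db.crawl_configs.find_one({"profileid": profileid}, {"_id": 1}) is not None
```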
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- follow up to #2627
- use qa_num_browser_windows to set the exact number of QA browsers,
falling back to qa_scale
- set num_browser_windows and num_browsers_per_pod using crawler / qa
values depending on whether it is a QA crawl
- scale_from_browser_windows() accepts an optional browsers_per_pod when
dealing with a possible QA override (see the sketch after this list)
- store 'desiredScale' in CrawlStatus to avoid recomputing for later
scale resolving
- ensure status.scale is always the actual scale observed
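A sketch of the helper named above (the exact signature may differ): browser windows map to pods based on how many browsers each pod runs, with the optional per-pod override for QA.

```python
import math


def scale_from_browser_windows(browser_windows: int, browsers_per_pod: int = 2) -> int:
    """Number of crawler pods needed to host the requested browser windows."""
    return math.ceil(browser_windows / browsers_per_pod)
```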
For single page crawls:
- Always force 1 browser to be used, ignoring the browser windows/scale
setting
- Don't use custom PVC volumes in crawler / redis, just use emptyDir -
there is no chance of the crawler being interrupted and restarted on a
different machine for a single page.
Adds an 'is_single_page' check to CrawlConfig, checking for either a limit
of 1 or a single-page scopeType with no extra hops (see the sketch below).
Fixes #2655
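A sketch of that check; field names follow the crawl config model, but the exact logic in the PR may differ.

```python
def is_single_page(config) -> bool:
    """True if this workflow can only ever crawl one page."""
    if config.limit == 1:
        return True  # explicit one-page limit
    return config.scopeType == "page" and not config.extraHops
```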
Fixes #2425
## Changed
- Switch backend to primarily using number of browser windows rather
than scale multiplier (including migration to calculate `browserWindows`
from `scale` for existing workflows and crawls)
- Still support `scale` in addition to `browserWindows` in input models
for creating and updating workflows and re-adjusting live crawl scale
for backwards compatibility
- Adds a new `max_browser_windows` value to the Helm chart, but calculates the
value from `max_crawl_scale` as a fallback for users with that value
already set in local charts
- Rework frontend to allow users to select multiples of
`crawler_browser_instances` or any value below
`crawler_browser_instances` for browser windows. For instance, with
`crawler_browser_instances=4` and `max_browser_windows=8`, the user
would be presented with the following options: 1, 2, 3, 4, 8 (see the
sketch after this list)
- Sets the maximum width of the screencast to the image width returned by `message`
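A sketch of the option list described above: every value up to one full instance, then whole multiples of an instance up to the maximum. With `crawler_browser_instances=4` and `max_browser_windows=8` this yields [1, 2, 3, 4, 8].

```python
def window_options(browsers_per_instance: int, max_windows: int) -> list[int]:
    """All values up to one instance, then whole multiples of an instance."""
    opts = list(range(1, browsers_per_instance + 1))
    opts += list(range(browsers_per_instance * 2, max_windows + 1, browsers_per_instance))
    return opts
```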
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- Handles `paused` workflow state.
- Adds "Copy Crawl ID" and "View Archived Item" buttons to workflow
detail
- Fixes file size not updating in workflow crawls list
- Fixes superadmin banner showing over workflow tabs
- Refactors workflow detail API calls to use `Task` to improve poll
performance.
- Fixes execution time rendering when less than a minute
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- add 'pause' crawl state (fixes #2567)
- gracefully shut down crawler pods, and then redis pod when paused
- crawler uploads WACZ before shutting down (dependent on
webrecorder/browsertrix-crawler#824, supported in 1.6.1+)
- add 'paused_at' on crawl spec to indicate when crawl is paused
- support a max pause time limit, after which the crawl is automatically
stopped (see the sketch after this list)
- add 'stopped_pause_expired' state for when the pause expires and the crawl
is stopped
- /crawl/<id>/{pause,resume} APIs to toggle 'paused' on the crawl spec
- ui: add pause/resume button, paused state (partially addresses #2568)
- ui: add pausing/resuming derivative states when crawl is running and
pausing, or paused and not pausing (partially addresses #2569)
- Designed to work with crawler 1.6.1+, which supports pausing + uploading on pause
Work on #2566, fixes #2576
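A sketch of the expiry check (names are assumptions; the real check lives in the operator): once the pause exceeds the limit, the crawl moves to 'stopped_pause_expired'.

```python
from datetime import datetime, timedelta, timezone

MAX_PAUSE_TIME = timedelta(hours=24)  # illustrative limit; configurable in practice


def pause_expired(paused_at: datetime) -> bool:
    """True once the crawl has been paused longer than the allowed limit."""
    return datetime.now(timezone.utc) - paused_at >= MAX_PAUSE_TIME
```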
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
Fixes #2515.
This PR introduces a significantly optimized logic for presigning URLs
for crawls and collections.
- For collections, the files needed from all crawls are looked up, and
then the 'presign_urls' table is merged in one pass, resulting in a
unified iterator containing files and presigned URLs for those files.
- For crawls, the presigned URLs are also looked up once, and the same
iterator is used for a single crawl with a passed-in list of CrawlFiles
- URLs that are already signed are added to the return list.
- For any remaining URLs to be signed, a bulk presigning function is
added, which shares an HTTP connection and signs 8 files in parallel
(customizable via the helm chart, though this may not be needed). This
function is used to call the presigning API in parallel (see the sketch below).
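A sketch of the bulk-presigning idea under stated assumptions: boto3 stands in for the actual client, and one shared client signs up to 8 keys concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3
from botocore.config import Config


def bulk_presign(bucket: str, keys: list[str], expires: int = 3600) -> list[str]:
    """Presign many object URLs using one shared client, 8 at a time."""
    client = boto3.client("s3", config=Config(signature_version="s3v4"))

    def sign(key: str) -> str:
        return client.generate_presigned_url(
            "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=expires
        )

    # parallelism of 8, matching the default described above
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(sign, keys))
```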
Use V4 ('s3v4') signature version for all presigned URLs to support
Backblaze, fixes #2472
- add 'access_addressing_style' to be able to choose virtual/path
addressing for the access endpoint (defaults to 'virtual' as before; see
the sketch after this list)
- fix minio presigning with v4 by using 'path' addressing style for
minio
- if the path matches '/data/' for the internal minio bucket, then always use
'path'
- also make minio access path '/data/' configurable
also simplify running in any namespace with default settings:
- don't hardcode 'local-minio.default'
- in the crawlers namespace, add a 'local-minio' ExternalName service which
maps to the main namespace service.
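A sketch of the resulting client configuration (the endpoint URL is an example, not the actual chart value): v4 signatures everywhere, 'path' addressing for MinIO.

```python
import boto3
from botocore.config import Config

minio_client = boto3.client(
    "s3",
    endpoint_url="http://local-minio:9000",  # resolved via the ExternalName service
    config=Config(
        signature_version="s3v4",
        s3={"addressing_style": "path"},  # MinIO needs 'path' with v4 presigning
    ),
)
```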
- Add backend validation for language codes
- Add a migration to look for invalid ISO-639-1 language codes in
workflows, crawls, and org crawling defaults, and fix any found
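A sketch of the validation under one loud assumption: pycountry (or any ISO-639-1 code table) stands in for whatever the backend actually uses.

```python
import pycountry  # assumption: any ISO-639-1 lookup table would do


def is_valid_iso639_1(code: str) -> bool:
    """True if code is a registered two-letter ISO-639-1 language code."""
    return pycountry.languages.get(alpha_2=code) is not None
```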
Fixes #2600
This PR fixes the issue by ensuring that crawl page counts (total,
unique, files, errors) are reset to 0 when crawl pages are deleted, such
as right before being re-added.
It also adds a migration that recalculates file and error page counts
for each crawl without re-adding pages from the WACZ files.
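A sketch of the reset (field names are assumptions): zero the cached counts whenever a crawl's pages are deleted, e.g. right before re-adding them.

```python
def delete_crawl_pages(db, crawl_id: str) -> None:
    """Remove a crawl's pages and zero its cached page counts in one pass."""
    db.pages.delete_many({"crawl_id": crawl_id})
    db.crawls.update_one(
        {"_id": crawl_id},
        {"$set": {
            "pageCount": 0,
            "uniquePageCount": 0,
            "filePageCount": 0,
            "errorPageCount": 0,
        }},
    )
```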
A quick PR to fix #2482:
- compute topPageHosts as part of the existing collection stats computation
(see the sketch after this list)
- store the top 10 results in the collection for now
- display in the collection About sidebar
- fixes #2482
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
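A sketch of the host computation (collection and field names are assumptions): group pages by URL host and keep the top 10.

```python
def top_page_hosts(db, crawl_ids: list[str]) -> list[dict]:
    """Top 10 page hosts across the collection's crawls."""
    return list(db.pages.aggregate([
        {"$match": {"crawl_id": {"$in": crawl_ids}}},
        # index 2 of "https://host/path".split("/") is the host
        {"$group": {
            "_id": {"$arrayElemAt": [{"$split": ["$url", "/"]}, 2]},
            "count": {"$sum": 1},
        }},
        {"$sort": {"count": -1}},
        {"$limit": 10},
    ]))
```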
Subscription Management: uses a check to ensure the subscription can be auto-canceled if
not activated.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Resolves https://github.com/webrecorder/browsertrix/issues/2366
## Changes
Allows users to update the current crawl with newly saved workflow settings.
## Manual testing
1. Log in as crawler
2. Start a crawl
3. Go to edit workflow. Verify "Update Crawl" button is shown
4. Click "Update Crawl". Verify crawl is updated with new settings
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Resolves#2504
## Changes
- Allows users to customize autoclick selector in workflows
- Refactors `btrix-syntax-input` to support rendering a label and help
text like `sl-input`
- Show autoclick selector in workflow / crawl settings
- Adds 'clickSelector' with a default of 'a' to the backend crawl config.
---------
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>