browsertrix

Author	SHA1	Message	Date
Tessa Walsh	f7ba712646	Add seed file support to Browsertrix backend (#2710 ) Fixes #2673 Changes in this PR: - Adds a new `file_uploads.py` module and corresponding `/files` API prefix with methods/endpoints for uploading, GETing, and deleting seed files (can be extended to other types of files moving forward) - Seed files are supported via `CrawlConfig.config.seedFileId` on POST and PATCH endpoints. This seedFileId is replaced by a presigned url when passed to the crawler by the operator - Seed files are read when first uploaded to calculate `firstSeed` and `seedCount` and store them in the database, and this is copied into the workflow and crawl documents when they are created. - Logic is added to store `firstSeed` and `seedCount` for other workflows as well, and a migration added to backfill data, to maintain consistency and fix some of the pymongo aggregations that previously assumed all workflows would have at least one `Seed` object in `CrawlConfig.seeds` - Seed file and thumbnail storage stats are added to org stats - Seed file and thumbnail uploads first check that the org's storage quota has not been exceeded and return a 400 if so - A cron background job (run weekly each Sunday at midnight by default, but configurable) is added to look for seed files at least x minutes old (1440 minutes, or 1 day, by default, but configurable) that are not in use in any workflows, and to delete them when they are found. The backend pods will ensure this k8s batch job exists when starting up and create it if it does not already exist. A database entry for each run of the job is created in the operator on job completion so that it'll appear in the `/jobs` API endpoints, but retrying of this type of regularly scheduled background job is not supported as we don't want to accidentally create multiple competing scheduled jobs. - Adds a `min_seed_file_crawler_image` value to the Helm chart that is checked before creating a crawl from a workflow if set. If a workflow cannot be run, return the detail of the exception in `CrawlConfigAddedResponse.errorDetail` so that we can display the reason in the frontend - Add SeedFile model from base UserFile (formerly ImageFile), ensure all APIs returning uploaded files return an absolute pre-signed URL (either with external origin or internal service origin) --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-07-22 19:11:02 -07:00
Ilya Kreymer	8107b054f6	Profiles: Make browser commit API call idempotent (#2728 ) - Fix race condition related to browser commit time - The profile commit request waits for the browser to actually finish and the profile to be saved. This can cause the request to time out, resulting in a retry, by which point the browser has already been closed. - With these changes, the commit is now 'idempotent' and returns a waiting_for_browser until the profile is actually committed. - On frontend, keep pinging commit endpoint with a timeout while 'waiting_for_browser' is returned; the profile is actually committed when the endpoint returns the profile id. --------- Co-authored-by: sua yoo <sua@suayoo.com>	2025-07-22 17:59:49 -07:00
Ilya Kreymer	0402f14b5e	version: bump to 1.17.4	2025-07-16 10:22:10 -07:00
Ilya Kreymer	4e0e9c87c2	qa: run siteSpecific behaviors on QA (#2739 ) - allow the page loading waiting time to be applied for sites with site-specific behaviors (eg. social media)	2025-07-15 16:17:55 -07:00
Emma Segal-Grossman	b0f2d87ce2	hotfix: workflow list - rewrite arrays in url search params to remove items (#2734 ) ## Changes - Deletes and rewrites arrays in URL search params in workflow list when editing array filters (i.e. tags & profiles) - Removes a missed `console.log` - bump to 1.17.3 cc @SuaYoo --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-07-14 14:30:18 -07:00
Ilya Kreymer	6f1ced0b01	chart: default to latest replayweb.page release (#2724 ) Avoid having Browsertrix be stuck on old RWP releases; probably better to just use latest, as RWP releases now have additional testing.	2025-07-10 13:35:20 -07:00
Ilya Kreymer	841c45fe59	volumes: use emptyDir for tmp dir volume (#2713 ) - don't use a persistent volume for /tmp, instead use a temporary emptyDir - use volume to avoid permission issues with default /tmp dir - follow-up to #2623	2025-07-08 13:10:02 -07:00
Ilya Kreymer	b915e734d1	version: bump to 1.17.2	2025-06-30 14:20:43 -07:00
Tessa Walsh	db4621602e	Bump version to 1.17.1 (#2678 )	2025-06-18 13:09:49 -04:00
Ilya Kreymer	dde23426b2	version: bump to 1.17.0!	2025-06-12 17:37:07 -04:00
Ilya Kreymer	d4a2a66d6d	additional scale / browser window cleanup to properly support QA: (#2663 ) - follow up to #2627 - use qa_num_browser_windows to set exact number of QA browsers, fallback to qa_scale - set num_browser_windows and num_browsers_per_pod using crawler / qa values depending if QA crawl - scale_from_browser_windows() accepts optional browsers_per_pod if dealing with possible QA override - store 'desiredScale' in CrawlStatus to avoid recomputing for later scale resolving - ensure status.scale is always the actual scale observed	2025-06-12 13:09:04 -04:00
Ilya Kreymer	001277ac9d	docs: add docs for path / virtual addressing (#2669 ) Add docs about path / virtual 'access_addressing_style' that is available for each storage option. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-06-12 13:08:27 -04:00
Vinzenz Sinapius	0e0e663363	helm: add crawler_network_policy_additional_egress (#2641 ) - Adds `crawler_network_policy_additional_egress` setting, to add egress rules to the existing crawler network policy. Useful for when you want to allow-list a single IP without replacing the whole network policy. - Adds docs about `crawler_network_policy_additional_egress` to the customization page. - Resolves #2121 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-06-10 16:19:42 -07:00
Ilya Kreymer	223221c31e	Add securityContext for Redis pod (#2640 ) It seems the latest redis image changed security settings so root-mounted volumes no longer work. This change: - mount redis volumes as redis user/group 999 - needed to run with redis >=8.0.2	2025-06-10 15:20:18 -07:00
Ilya Kreymer	8ea16393c5	Optimize single-page crawl workflows (#2656 ) For single page crawls: - Always force 1 browser to be used, ignoring browser windows/scale setting - Don't use custom PVC volumes in crawler / redis, just use emptyDir - no chance of crawler being interrupted and restarted on different machine for a single page. Adds a 'is_single_page' check to CrawlConfig, checking for either limit or scopeType / no extra hops. Fixes #2655	2025-06-10 12:13:57 -07:00
Tessa Walsh	dc41468daf	Allow users to run crawls with 1 or 2 browser windows (#2627 ) Fixes #2425 ## Changed - Switch backend to primarily using number of browser windows rather than scale multiplier (including migration to calculate `browserWindows` from `scale` for existing workflows and crawls) - Still support `scale` in addition to `browserWindows` in input models for creating and updating workflows and re-adjusting live crawl scale for backwards compatibility - Adds new `max_browser_windows` value to Helm chart, but calculates the value from `max_crawl_scale` as fallback for users with that value already set in local charts - Rework frontend to allow users to select multiples of `crawler_browser_instances` or any value below `crawler_browser_instances` for browser windows. For instance, with `crawler_browser_instances=4` and `max_browser_windows=8`, the user would be presented with the following options: 1, 2, 3, 4, 8 - Sets maximum width of screencast to image width returned by `message` --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-06-03 13:37:30 -07:00
Ilya Kreymer	0e06ccd746	version: bump to 1.17.0-beta.0	2025-06-02 14:46:32 -07:00
Pierre	8b54444b7e	docs: update remote deployment docs with working nginx-install example (#2625 ) - Update the docs on k3s deployment for installing `ingress-nginx`, fixes #2619. - Also fix the indentation on the code blocks so markdown carries on list numbering. At the moment the numbering confusingly resets after point 3. - Update indentation on all code blocks so they show up as part of list + wrap long commands. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-05-28 20:07:02 -07:00
Ilya Kreymer	5b0f851857	Fix securityContext for pod (#2623 ) Some of the `securityContext` settings need to be on the container, not on the pod, including the read-only file system, which was not previously enabled. This now enables the read-only file system. Also map the crawler /tmp directory to use the same volume as crawls (as crawler currently uses /tmp dir) as /tmp becomes read-only otherwise.	2025-05-27 10:59:50 -07:00
Ilya Kreymer	cb50c7c2c2	Pause / Resume Crawls Initial Implementation. (#2572 ) - add 'pause' crawl state (fixes #2567) - gracefully shut down crawler pods, and then redis pod when paused - crawler uploads WACZ before shutting down (dependent on webrecorder/browsertrix-crawler#824, supported in 1.6.1+) - add 'paused_at' on crawl spec to indicate when crawl is paused - support max pause time limit, after which crawl becomes automatically stopped. - add 'stopped_pause_expired' when pause automatically expires and crawl is stopped - /crawl/<id>/{pause,resume} apis to toggle 'paused' on crawl spec - ui: add pause/resume button, paused state (partially addresses #2568) - ui: add pausing/resuming derivative states when crawl is running and pausing, or paused and not pausing (partially addresses #2569) - Designed to work with crawler 1.6.1+ which supports pausing + uploading on pause Work on #2566, Fixes #2576 --------- Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: sua yoo <sua@suayoo.com>	2025-05-21 14:05:16 -07:00
Ilya Kreymer	e995811dd4	version: bump to 1.16.2	2025-05-20 18:43:22 -07:00
Ilya Kreymer	e29db33629	tests: fix nightly test config after #2611 (#2614 ) remove namespace from minio config to match settings	2025-05-20 12:25:15 -07:00
Ilya Kreymer	c134b576ae	Optimize presigning for replay.json (#2516 ) Fixes #2515. This PR introduces significantly optimized logic for presigning URLs for crawls and collections. - For collections, the files needed from all crawls are looked up, and then the 'presign_urls' table is merged in one pass, resulting in a unified iterator containing files and presign urls for those files. - For crawls, the presign URLs are also looked up once, and the same iterator is used for a single crawl with a passed-in list of CrawlFiles - URLs that are already signed are added to the return list. - For any remaining URLs to be signed, a bulk presigning function is added, which shares an HTTP connection and signs 8 files in parallel (customizable via helm chart, though may not be needed). This function is used to call the presigning API in parallel.	2025-05-20 12:09:35 -07:00
Ilya Kreymer	f1fd11c031	storage: use s3v4 signature for presigning urls (#2611 ) Use V4 ('s3v4') signature version for all presigning URLs to support backblaze, fixes #2472 - add 'access_addressing_style' to be able to choose virtual/path addressing for access endpoint (default to 'virtual' as before) - fix minio presigning with v4 by using 'path' addressing style for minio - if path matches '/data/' for internal minio bucket, then always use 'path' - also make minio access path '/data/' configurable also simplify running in any namespace with default settings: - don't hardcode 'local-minio.default' - in crawlers namespace, add a 'local-minio' externalName service which maps to the main namespace service.	2025-05-19 15:44:36 -07:00
Tessa Walsh	c73512dbd4	Bump version to 1.16.1 (#2606 )	2025-05-13 17:29:49 -04:00
Ilya Kreymer	652e8a6085	version: bump to 1.16.0	2025-05-08 14:30:00 -07:00
Ilya Kreymer	0cb3bd19f6	version: update to 1.15.0	2025-04-09 12:28:01 +02:00
Tessa Walsh	a51f7c635e	Add behavior logs from Redis to database and add endpoint to serve (#2526 ) Backend work for #2524 This PR adds a second dedicated endpoint similar to `/errors`, as a combined log endpoint would give a false impression of being the complete crawl logs (which is far from what we're serving in Browsertrix at this point). Eventually when we have support for streaming live crawl logs in `crawls/<id>/logs` I'd ideally like to deprecate these two dedicated endpoints in favor of using that, but for now this seems like the best solution. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-08 02:16:10 +02:00
Ilya Kreymer	b5b4c4da15	version: update to 1.14.8	2025-03-31 14:17:53 -07:00
Ilya Kreymer	62e47a8817	support overriding crawler image pull policy per channel (#2523 ) - add 'imagePullPolicy' field to each crawler channel declaration - if unset, defaults to the setting in the existing 'crawler_image_pull_policy' field. fixes #2522 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-03-31 14:11:41 -07:00
Ilya Kreymer	b3950dd03f	version: update to 1.14.7	2025-03-25 17:25:24 -07:00
Ilya Kreymer	9250befea4	ingress: remove X-Forward-Proto snippet, no longer needed (and now possibly considered unsafe) (#2519 ) X-Forward-Proto is now already provided by the standard ingress-nginx config	2025-03-25 17:24:55 -07:00
Ilya Kreymer	46be6a0cf6	version: bump to 1.14.6	2025-03-20 16:52:20 -07:00
Ilya Kreymer	b63caf74ad	cleanup unused chart values + change mongo default (#2484 ) - Removes chart values that are unused - Also change `local-mongo.default` -> `local-mongo`, `local-minio.default` -> `local-minio` as some users have reported issues with `.default` and it will certainly break if not deploying Browsertrix in the `default` namespace.	2025-03-20 08:30:45 -07:00
Ilya Kreymer	eb300815a7	Fixes #2488 (#2493 ) - Fixes #2488 - Adds a k8s api call to set `suspend=false` on Job when associated CrawlJob is finished. - bump version - released as 1.14.5	2025-03-19 10:06:25 -07:00
Ilya Kreymer	d8365c734f	version: bump to 1.14.4	2025-03-08 15:58:18 -08:00
Ilya Kreymer	03fa00df45	set default crawler channel if not set, possible fix for #2458 (#2469 ) update default RWP version	2025-03-07 12:32:19 -08:00
Tessa Walsh	13bf818914	Fix nightly tests (#2460 ) Fixes #2459 - Set `/data/` as primary storage `access_endpoint_url` in nightly test chart - Modify nightly test GH Actions workflow to spawn a separate job per nightly test module using dynamic matrix - Set configuration not to fail other jobs if one job fails - Modify failing tests: - Add fixture to background job nightly test module so it can run alone - Add retry loop to crawlconfig stats nightly test so it's less dependent on timing GitHub limits each workflow to 256 jobs, so this should continue to be able to scale up for us without issue. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-03-06 16:23:30 -08:00
Ilya Kreymer	9466e83d18	version: bump to 1.14.3	2025-03-03 15:20:40 -08:00
Ilya Kreymer	e13c3bfb48	move db migrations to initContainers: (#2449 ) - should avoid gunicorn worker timeouts for long running migrations, also fixes #2439 - add main_migrations as entrypoint to just run db migrations, using existing init_ops() call - first run 'migrations' container with same resources as 'app' and 'op' - additional typing for initializing db - cleanup unused code related to running only once, waiting for db to be ready - fixes #2447	2025-03-03 13:13:15 -08:00
Ilya Kreymer	cb52da66dc	version: bump to 1.14.2	2025-02-27 14:13:03 -08:00
Ilya Kreymer	376c9981dc	version: bump to 1.14.1	2025-02-26 23:15:01 -08:00
Ilya Kreymer	67668438c0	ingress: only set ssl-redirect if using tls (#2432 ) otherwise, http path should be accessible. Can be used when TLS termination handled outside of ingress.	2025-02-26 23:12:07 -08:00
Ilya Kreymer	e67708bd4f	version: update to 1.14.0	2025-02-24 14:49:46 -08:00
Ilya Kreymer	8a507f0473	Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417 ) - consolidate list_pages() and list_replay_query_pages() into list_pages() - to keep backwards compatibility, add <crawl>/pagesSearch that does not include page totals, keep <crawl>/pages with page total (slower) - qa frontend: add default 'Crawl Order' sort order, to better show pages in QA view - bgjob: account for parallelism in bgjobs, add logging if succeeded mismatches parallelism - QA sorting: default to 'crawl order' to get better results. - Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats - Bgjobs: give custom op jobs more memory	2025-02-21 13:47:20 -08:00
Ilya Kreymer	3ca68bf1d2	version: 1.14.0-beta.6	2025-02-20 15:37:33 -08:00
Tessa Walsh	f8fb2d2c8d	Rework crawl page migration + MongoDB Query Optimizations (#2412 ) Fixes #2406 Converts migration 0042 to launch a background job (parallelized across several pods) to migrate all crawls by optimizing their pages and setting `version: 2` on the crawl when complete. Also optimizes MongoDB queries for better performance. Migration Improvements: - Add `isMigrating` and `version` fields to `BaseCrawl` - Add new background job type to use in migration with accompanying `migration_job.yaml` template that allows for parallelization - Add new API endpoint to launch this crawl migration job, and ensure that we have list and retry endpoints for superusers that work with background jobs that aren't tied to a specific org - Rework background job models and methods now that not all background jobs are tied to a single org - Ensure new crawls and uploads have `version` set to `2` - Modify crawl and collection replay.json endpoints to only include fields for replay optimization (`initialPages`, `pageQueryUrl`, `preloadResources`) if all relevant crawls/uploads have `version` set to `2` - Remove `distinct` calls from migration pathways - Consolidate collection recompute stats Query Optimizations: - Remove all uses of $group and $facet - Optimize /replay.json endpoints to precompute preload_resources, avoid fetching crawl list twice - Optimize /collections endpoint by not fetching resources - Rename /urls -> /pageUrlCounts and avoid $group, instead sort with index, either by seed + ts or by url to get top matches. - Use $gte instead of $regex to get prefix matches on URL - Use $text instead of $regex to get text search on title - Remove total from /pages and /pageUrlCounts queries by not using $facet - frontend: only call /pageUrlCounts when dialog is opened. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Emma Segal-Grossman <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-02-20 15:26:11 -08:00
Ilya Kreymer	88a9f3baf7	ensure running crawl configmap is updated when exclusions are added/removed (#2409 ) exclusions are already updated dynamically if crawler pod is running, but when crawler pod is restarted, this ensures new exclusions are also picked up: - mount configmap in separate path, avoiding subPath, to allow dynamic updates of mounted volume - adds a lastConfigUpdate timestamp to CrawlJob - if lastConfigUpdate in spec is different from current, the configmap is recreated by operator - operator: also update image from channel to avoid any issues with updating crawler in channel - only updates for exclusion add/remove so far, can later be expanded to other crawler settings (see: #2355 for broader running crawl config updates) - fixes #2408	2025-02-19 11:42:19 -08:00
Ilya Kreymer	a7c8ca4028	version: bump to 1.14.0-beta.1	2025-02-17 16:48:27 -08:00
Ilya Kreymer	5bebb6161a	Issue 2396 readd pages fixes (#2398 ) readd pages fixes: - add additional mem to background job - copy page qa data to separate temp coll when re-adding pages, then merge back in	2025-02-17 13:52:11 -08:00
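
The browser-window selection rule described in the #2627 entry above (any value below `crawler_browser_instances`, plus multiples of it up to `max_browser_windows`) can be illustrated with a small sketch. The function name `browser_window_options` is hypothetical and is not taken from the Browsertrix codebase; only the two setting names and the worked example (4 and 8 giving 1, 2, 3, 4, 8) come from the commit message.

```python
def browser_window_options(crawler_browser_instances: int,
                           max_browser_windows: int) -> list[int]:
    """Return the selectable browser-window counts.

    Hypothetical helper: a sketch of the rule stated in the commit
    message, not the project's actual implementation.
    """
    if crawler_browser_instances < 1 or max_browser_windows < 1:
        raise ValueError("both settings must be positive integers")

    # Any value below the per-pod browser count is selectable,
    # capped at the overall maximum.
    below = range(1, min(crawler_browser_instances, max_browser_windows + 1))

    # Every multiple of the per-pod browser count, up to the maximum.
    multiples = range(crawler_browser_instances,
                      max_browser_windows + 1,
                      crawler_browser_instances)

    return [*below, *multiples]


if __name__ == "__main__":
    # The example from the commit message.
    print(browser_window_options(4, 8))
```

With `crawler_browser_instances=4` and `max_browser_windows=8` this yields the options listed in the commit message. How the real frontend handles a maximum that is not a multiple of the per-pod count is not stated in the log, so the sketch simply omits values above the last full multiple.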

300 Commits