Commit Graph

637 Commits

Ilya Kreymer
652e8a6085 version: bump to 1.16.0 2025-05-08 14:30:00 -07:00
Ilya Kreymer
1570011ec7
compute top page origins for each collection (#2483)
A quick PR to fix #2482:
- compute topPageHosts as part of the existing collection stats computation
- store the top 10 results in the collection for now
- display them in the collection About sidebar

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-05-08 14:22:40 -07:00
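A minimal sketch of how a top-hosts stat like this can be computed (collection and field names here are assumptions, not the actual Browsertrix schema):

```python
from urllib.parse import urlparse

from pymongo import MongoClient

client = MongoClient()
pages = client["browsertrix"]["pages"]  # hypothetical pages collection

def top_page_hosts(crawl_ids: list[str], limit: int = 10) -> list[dict]:
    """Count pages per host for the given crawls and return the top `limit`."""
    counts: dict[str, int] = {}
    for page in pages.find({"crawl_id": {"$in": crawl_ids}}, {"url": 1}):
        host = urlparse(page["url"]).netloc
        counts[host] = counts.get(host, 0) + 1
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    return [{"host": host, "count": count} for host, count in top]
```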
Emma Segal-Grossman
0691f43be6
Sort running crawls first by default (#2587) 2025-05-08 17:21:17 -04:00
Tessa Walsh
3e169ebc15
Add API endpoint to check if subscription is activated (#2582)
Subscription management: this check is used to ensure a subscription can be
auto-canceled if it was never activated.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-05-06 17:36:58 -07:00
sua yoo
1fa43335c0
feat: Apply saved workflow settings to current crawl (#2514)
Resolves https://github.com/webrecorder/browsertrix/issues/2366

## Changes

Allows users to update current crawl with newly saved workflow settings.

## Manual testing

1. Log in as crawler
2. Start a crawl
3. Go to edit workflow. Verify "Update Crawl" button is shown
4. Click "Update Crawl". Verify crawl is updated with new settings

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-04-29 11:43:14 -07:00
Ilya Kreymer
0cb3bd19f6 version: update to 1.15.0 2025-04-09 12:28:01 +02:00
Tessa Walsh
785fd85105
Ensure error and behavior logs are written to database in order (#2540)
Fixes #2539
2025-04-08 09:35:50 -04:00
Tessa Walsh
55bedcb0b7
feat: Custom autoclick selector (#2517)
Resolves #2504

## Changes

- Allows users to customize the autoclick selector in workflows
- Refactors `btrix-syntax-input` to support rendering label and help
text like `sl-input`
- Shows the autoclick selector in workflow / crawl settings
- Adds `clickSelector` with a default of `a` to the backend crawl config.

---------

Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
2025-04-08 05:53:40 +02:00
Tessa Walsh
a51f7c635e
Add behavior logs from Redis to database and add endpoint to serve (#2526)
Backend work for #2524

This PR adds a second dedicated endpoint similar to `/errors`, as a
combined log endpoint would give a false impression of being the
complete crawl logs (which is far from what we're serving in Browsertrix
at this point).

Eventually when we have support for streaming live crawl logs in
`crawls/<id>/logs` I'd ideally like to deprecate these two dedicated
endpoints in favor of using that, but for now this seems like the best
solution.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-08 02:16:10 +02:00
Tessa Walsh
f84f6f55e0
Add basic backend validation for selectLinks (#2510)
Follow-up to #2152 

Related to https://github.com/webrecorder/browsertrix/pull/2487

This PR provides very basic validation of the `config.selectLinks`
argument on workflow creation and update. Namely, it checks that:
- `config.selectLinks` is not an empty array
- Each entry consists of two non-empty text sequences separated by `->`

At this point we're not validating the actual CSS selector on the
backend, though we could add that down the road.

Tests have been added accordingly.

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-07 21:36:05 +02:00
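A sketch of the checks described above (the function name and error handling are illustrative, not the actual backend code):

```python
def validate_select_links(select_links: list[str]) -> None:
    """Basic validation: non-empty list, each entry two non-empty
    text sequences separated by '->'."""
    if not select_links:
        raise ValueError("selectLinks cannot be an empty array")
    for entry in select_links:
        left, sep, right = entry.partition("->")
        if not sep or not left.strip() or not right.strip():
            raise ValueError(f"invalid selectLinks entry: {entry!r}")
```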
Tessa Walsh
cd7b695520
Add backend support for custom behaviors + validation endpoint (#2505)
Backend support for #2151 

Adds support for specifying custom behaviors via a list of strings.

When workflows are added or modified, minimal backend validation is done
to ensure that all custom behavior URLs are valid URLs (after removing
the git prefix and custom query arguments).

A separate `POST /crawlconfigs/validate/custom-behavior` endpoint is
also added, which can be used to validate a custom behavior URL. It
performs the same syntax check as above and then:
- For URLs pointing directly to a behavior file, ensures the URL resolves
and returns a 2xx/3xx status code
- For Git repositories, uses `git ls-remote` to ensure they exist (and
that branch exists if specified)

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-04-02 16:20:51 -07:00
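A hedged sketch of that validation flow (the `git+` prefix, the `branch` query parameter, and the timeouts are assumptions, not confirmed details of the endpoint):

```python
import subprocess
from urllib.parse import parse_qs, urlparse

import requests

def validate_custom_behavior(url: str) -> bool:
    """Direct file URLs are checked with a HEAD request; git repos
    (and an optional branch) are checked with `git ls-remote`."""
    if url.startswith("git+"):
        parsed = urlparse(url.removeprefix("git+"))
        branch = parse_qs(parsed.query).get("branch", [None])[0]
        repo_url = parsed._replace(query="").geturl()
        args = ["git", "ls-remote", repo_url]
        if branch:
            args.append(branch)
        result = subprocess.run(args, capture_output=True, timeout=30)
        if result.returncode != 0:
            return False
        # when a branch was requested, ls-remote must actually list a ref
        return not branch or bool(result.stdout.strip())
    resp = requests.head(url, allow_redirects=False, timeout=30)
    return 200 <= resp.status_code < 400
```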
Ilya Kreymer
c067a0fe7c
fix qa page sorting: (#2530)
was sorting on `qa.{qa_run_id}` after the value had already been replaced
with 'qa', and thus was sorting on a non-existent field
fixes #2529
2025-04-02 09:25:38 -07:00
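Illustrating the class of bug (field names are guesses based on the description above, not the actual code):

```python
qa_run_id = "qa-20250402"     # illustrative run id
sort_by = "screenshotMatch"   # illustrative QA score field

# buggy: the sort key still used the run id even though the per-run data
# had already been re-keyed under the literal "qa" field in each document
buggy_sort_field = f"qa.{qa_run_id}.{sort_by}"  # matches no existing field
fixed_sort_field = f"qa.{sort_by}"              # sorts on the real value
```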
Ilya Kreymer
b5b4c4da15 version: update to 1.14.8 2025-03-31 14:17:53 -07:00
Ilya Kreymer
62e47a8817
support overriding crawler image pull policy per channel (#2523)
- add 'imagePullPolicy' field to each crawler channel declaration
- if unset, defaults to the setting in the existing
'crawler_image_pull_policy' field.

fixes #2522

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-31 14:11:41 -07:00
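A minimal sketch of the fallback logic, with illustrative types:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrawlerChannel:
    name: str
    image: str
    imagePullPolicy: Optional[str] = None  # per-channel override, may be unset

def resolve_pull_policy(channel: CrawlerChannel, default_policy: str) -> str:
    """The channel's own imagePullPolicy wins; otherwise fall back to the
    chart-wide crawler_image_pull_policy value."""
    return channel.imagePullPolicy or default_policy
```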
Ilya Kreymer
b3950dd03f version: update to 1.14.7 2025-03-25 17:25:24 -07:00
Ilya Kreymer
21a372057b
Fix user emails use userout (#2511)
Follow-up to #2495: actually ensure org subscription data is included
in the admin email response

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-24 12:04:39 -07:00
Ilya Kreymer
46be6a0cf6 version: bump to 1.14.6 2025-03-20 16:52:20 -07:00
Ilya Kreymer
4c0ddd0fe3
crawl replay: remove isSeed=true from initialPages query (#2509)
- matches initial query for collections
- fixes 'Show Non-Seed Pages' not appearing for crawl replay
2025-03-20 15:03:41 -07:00
Ilya Kreymer
cb14ac3a00
add org subs info to /api/users/emails endpoint (#2495)
Include additional info in this superadmin-only endpoint.
2025-03-20 08:31:23 -07:00
Ilya Kreymer
6be1f6674c
fixes token lifetime bug / improve security (#2490)
- fix jwt_token_lifetime being in hours, not minutes; remove the extra * 60
- don't return userids in the user list for org admins; instead just key
users by email, which is already unique
2025-03-19 10:07:09 -07:00
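A sketch of the unit-conversion class of bug being fixed (the constant name and value are illustrative):

```python
from datetime import datetime, timedelta, timezone

JWT_TOKEN_LIFETIME = 24  # hours; name and value are illustrative

def token_expiry() -> datetime:
    # the buggy version treated the configured hours as minutes and then
    # applied an extra * 60, inflating the lifetime; the value is in hours,
    # so convert exactly once
    return datetime.now(timezone.utc) + timedelta(hours=JWT_TOKEN_LIFETIME)
```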
Ilya Kreymer
eb300815a7
Fixes #2488 (#2493)
- Adds a k8s api call to set `suspend=false` on the Job when the associated
CrawlJob is finished.
- bump version - released as 1.14.5
2025-03-19 10:06:25 -07:00
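A sketch of such a call with the official Kubernetes Python client (the operator's actual client may differ; this is illustrative):

```python
from kubernetes import client, config

def unsuspend_job(name: str, namespace: str) -> None:
    """Patch the k8s Job behind a finished CrawlJob so suspend=false."""
    config.load_incluster_config()
    client.BatchV1Api().patch_namespaced_job(
        name=name,
        namespace=namespace,
        body={"spec": {"suspend": False}},
    )
```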
Emma Segal-Grossman
b2c5b9bc59
Hide breadcrumbs for private orgs (#2477)
Hides "Back to [org name]" breadcrumb when viewing a public/unlisted
collection when the public gallery isn't enabled for the org (except
when logged into that org).
2025-03-11 15:05:35 -04:00
emma
a42d83c9f6
add content-length and etag headers to thumbnail endpoint 2025-03-10 13:58:41 -04:00
Ilya Kreymer
d8365c734f version: bump to 1.14.4 2025-03-08 15:58:18 -08:00
Ilya Kreymer
03fa00df45
set default crawler channel if not set, possible fix for #2458 (#2469)
also update the default RWP version
2025-03-07 12:32:19 -08:00
Ilya Kreymer
6c192df49d
Add thumbnail endpoint (#2468)
- Add /thumbnail collections endpoint to serve the thumbnail as an image for public
collections.
- Also fix uploading thumbnail images to use the correct MIME type, if available.
2025-03-07 12:29:36 -08:00
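A hedged FastAPI sketch of serving a stored thumbnail with its correct MIME type (the route path and storage helper are assumptions):

```python
from fastapi import FastAPI, HTTPException, Response

app = FastAPI()

async def get_thumbnail(coll_id: str):
    """Hypothetical storage lookup returning (bytes, mime) or None."""
    ...

@app.get("/orgs/{org_slug}/collections/{coll_id}/thumbnail")
async def serve_thumbnail(org_slug: str, coll_id: str) -> Response:
    found = await get_thumbnail(coll_id)
    if not found:
        raise HTTPException(status_code=404, detail="thumbnail_not_found")
    data, mime = found
    # fall back to a generic image type if no MIME was stored with the upload
    return Response(content=data, media_type=mime or "image/jpeg")
```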
Tessa Walsh
13bf818914
Fix nightly tests (#2460)
Fixes #2459 

- Set `/data/` as primary storage `access_endpoint_url` in nightly test
chart
- Modify nightly test GH Actions workflow to spawn a separate job per
nightly test module using dynamic matrix
- Set configuration not to fail other jobs if one job fails
- Modify failing tests:
- Add fixture to background job nightly test module so it can run alone
- Add retry loop to crawlconfig stats nightly test so it's less
dependent on timing

GitHub limits each workflow to 256 jobs, so this should continue to be
able to scale up for us without issue.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-03-06 16:23:30 -08:00
Ilya Kreymer
9466e83d18 version: bump to 1.14.3 2025-03-03 15:20:40 -08:00
Ilya Kreymer
afa892000b
replay api: add downloadUrl to replay endpoints to be used by RWP (#2456)
RWP (2.3.3+) can determine whether the 'Download Archive' menu item should
be shown based on the value of downloadUrl.
If set to 'null', it will hide the menu item:
- set downloadUrl to the public collection download endpoint for public
collection replay
- set downloadUrl to null for private collection and crawl replay to
hide the download menu item in RWP (otherwise we'd have to add the
auth_header query with a bearer token, and should assess security before
doing that)

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 14:11:28 -08:00
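A minimal sketch of the downloadUrl decision (the public route shown is illustrative, not the exact Browsertrix path):

```python
def replay_download_url(org_slug: str, coll_slug: str, is_public: bool):
    """downloadUrl for replay.json: a real URL for public collections,
    None (serialized as null) to hide RWP's 'Download Archive' item."""
    if is_public:
        # route shown here is illustrative
        return f"/api/public/orgs/{org_slug}/collections/{coll_slug}/download"
    return None
```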
Ilya Kreymer
e13c3bfb48
move db migrations to initContainers: (#2449)
- should avoid gunicorn worker timeouts for long running migrations,
also fixes #2439
- add main_migrations as entrypoint to just run db migrations, using
existing init_ops() call
- first run 'migrations' container with same resources as 'app' and 'op'
- additional typing for initializing db
- clean up unused code related to running only once and waiting for the db to be ready
- fixes #2447
2025-03-03 13:13:15 -08:00
Ilya Kreymer
702c9ab3b7
Better caching of presigned URLs + support for thumbnails (#2446)
Overhauls URL presigning by:
- cache the presigned urls in a flat, separate mongodb collection which
has an expiring index
- update presigned urls automatically in the index if not found / expired
- remove logic for storing presignedUrl in files
- support caching presigned URLs for thumbnails
- add endpoints to clear presigned urls for an org or for all files in all
orgs (superadmin only)
- supersedes #2438, fix for #2437
- removes previous presignedUrl and expireAt data from crawls and QA
runs

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 12:05:23 -08:00
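A sketch of the expiring-index cache pattern with pymongo (collection and field names are assumptions):

```python
from datetime import datetime
from typing import Callable, Tuple

from pymongo import MongoClient

client = MongoClient()
presigned = client["browsertrix"]["presigned_urls"]

# TTL index: MongoDB removes documents once their "expireAt" passes, so a
# cache miss simply means "sign again" (names here are illustrative)
presigned.create_index("expireAt", expireAfterSeconds=0)

def get_presigned_url(
    filename: str, sign: Callable[[str], Tuple[str, datetime]]
) -> str:
    cached = presigned.find_one({"_id": filename})
    if cached:
        return cached["url"]
    url, expire_at = sign(filename)
    presigned.replace_one(
        {"_id": filename}, {"url": url, "expireAt": expire_at}, upsert=True
    )
    return url
```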
Ilya Kreymer
631b019baf
optimize public collection loading: (#2444)
- remove query for /collections endpoint just to get the org name
- add orgName to single /collection endpoint, where it is already
available on the backend
2025-03-03 10:13:30 -08:00
Ilya Kreymer
2263745df3
Fix replay.json 400 response for empty collection (#2445)
- fix #2443 
- don't throw error in list_pages() if no crawls provided, just return
empty list
- ensure an empty collection returns 200 on replay.json, add tests
2025-03-03 09:38:19 -08:00
Ilya Kreymer
cb52da66dc version: bump to 1.14.2 2025-02-27 14:13:03 -08:00
Tessa Walsh
45aa0a32b6
Calculate total for crawl QA page endpoint (#2435)
Fixes #2434 

Patch fix for a regression in Browsertrix 1.14.0-1.14.1 where the total was
not being calculated for the QA page list endpoint but was still included
in the response, which led to the total always being 0 and pages not
loading in the frontend review screen as a result.
2025-02-27 11:46:35 -08:00
Ilya Kreymer
376c9981dc version: bump to 1.14.1 2025-02-26 23:15:01 -08:00
Tessa Walsh
3dc8c825c6
Add superadmin endpoint to readd scheduled workflow cronjobs (#2430)
Adds new superadmin-only `POST /orgs/all/crawlconfigs/reAddCronjobs`
endpoint to update/recreate scheduled workflow cronjobs across all orgs.
2025-02-26 23:13:53 -08:00
Ilya Kreymer
e67708bd4f version: update to 1.14.0 2025-02-24 14:49:46 -08:00
Ilya Kreymer
83180efac9
remove dropping page index on migrations (#2418)
Don't need it for now, and this will now be slow due to the number of pages.
Can re-add it in a future migration if we need it.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-24 12:29:02 -08:00
Ilya Kreymer
8a507f0473
Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417)
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in the QA view and get better results
- bgjob: account for parallelism in bgjobs, add logging if succeeded
mismatches parallelism
- optimize pages job: also cover crawls that may not have any pages but
have pages listed in done stats
- bgjobs: give custom op jobs more memory
2025-02-21 13:47:20 -08:00
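A sketch of why the two endpoints differ in cost (illustrative motor code, not the actual handlers):

```python
from motor.motor_asyncio import AsyncIOMotorCollection

async def list_pages(pages: AsyncIOMotorCollection, crawl_id: str,
                     skip: int, limit: int) -> dict:
    """<crawl>/pages: includes a total, which costs an extra count query."""
    query = {"crawl_id": crawl_id}
    items = await pages.find(query).skip(skip).limit(limit).to_list(limit)
    total = await pages.count_documents(query)
    return {"items": items, "total": total}

async def pages_search(pages: AsyncIOMotorCollection, crawl_id: str,
                       skip: int, limit: int) -> dict:
    """<crawl>/pagesSearch: same listing without the total, so it stays fast."""
    query = {"crawl_id": crawl_id}
    items = await pages.find(query).skip(skip).limit(limit).to_list(limit)
    return {"items": items}
```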
Ilya Kreymer
3ca68bf1d2 version: 1.14.0-beta.6 2025-02-20 15:37:33 -08:00
Tessa Walsh
f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
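Sketches of the two query rewrites, assuming an index on url and a text index on title (one common way to bound a $gte prefix scan is shown; the actual queries may differ):

```python
def url_prefix_query(prefix: str) -> dict:
    """Index-friendly prefix match: a bounded range scan instead of $regex.
    Assumes a non-empty prefix; the real query may differ."""
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return {"url": {"$gte": prefix, "$lt": upper}}

def title_text_query(search: str) -> dict:
    """Text search on title via a text index instead of $regex."""
    return {"$text": {"$search": search}}
```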
Ilya Kreymer
36e723cc51
Adjust crawler pvc on exit code 3 (out of storage) (#2375)
crawler 1.5.0 now has an exit code 3 for when the crawler is actually out
of disk space. The operator should handle this by immediately adjusting the
PVC size.

Ideally, crawler will be improved to avoid this, but since this can
still happen, operator should be able to respond and fix the issue.
2025-02-20 11:03:28 -08:00
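A sketch of the PVC expansion with the official Kubernetes Python client (illustrative; the operator's actual client and sizing logic may differ):

```python
from kubernetes import client, config

def expand_crawler_pvc(name: str, namespace: str, new_size: str) -> None:
    """Grow the crawler's PVC (e.g. new_size="40Gi") after an exit code 3."""
    config.load_incluster_config()
    client.CoreV1Api().patch_namespaced_persistent_volume_claim(
        name=name,
        namespace=namespace,
        body={"spec": {"resources": {"requests": {"storage": new_size}}}},
    )
```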
Ilya Kreymer
88a9f3baf7
ensure running crawl configmap is updated when exclusions are added/removed (#2409)
exclusions are already updated dynamically if crawler pod is running,
but when crawler pod is restarted, this ensures new exclusions are also
picked up:
- mount configmap in separate path, avoiding subPath, to allow dynamic
updates of mounted volume
- adds a lastConfigUpdate timestamp to CrawlJob - if lastConfigUpdate in
spec is different from current, the configmap is recreated by operator
- operator: also update the image from the channel, to avoid any issues
with updating the crawler in a channel
- only updates for exclusion add/remove so far, can later be expanded to
other crawler settings (see: #2355 for broader running crawl config
updates)
- fixes #2408
2025-02-19 11:42:19 -08:00
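A minimal sketch of the timestamp comparison (where the "current" value is recorded is an assumption):

```python
def configmap_needs_update(crawljob_spec: dict, last_applied: dict) -> bool:
    """Recreate the crawler configmap when the CrawlJob spec carries a
    lastConfigUpdate timestamp different from the one last applied."""
    return crawljob_spec.get("lastConfigUpdate") != last_applied.get(
        "lastConfigUpdate"
    )
```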
Ilya Kreymer
d23bca1f73 style change: remove spaces from python version docstring 2025-02-17 16:52:49 -08:00
Ilya Kreymer
a7c8ca4028 version: bump to 1.14.0-beta.1 2025-02-17 16:48:27 -08:00
Tessa Walsh
6c2d8c88c8
Modify page upload migration (#2400)
Related to #2396 

Changes to migration 0037:
- Re-adds pages in migration rather than in background job to avoid race
condition with later migrations
- Re-adds pages for all uploads in all orgs

Fixes for re-adding pages for an org:
- Ensure the org filter is applied!
- Fix a wrong type
- Remove distinct; use an iterator to iterate over crawls faster.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-17 16:47:58 -08:00
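A sketch of the distinct-to-cursor change, with the org filter applied (names are illustrative):

```python
from motor.motor_asyncio import AsyncIOMotorCollection

async def readd_pages(crawl_id: str) -> None:
    ...  # hypothetical per-crawl helper

async def readd_pages_for_org(
    crawls: AsyncIOMotorCollection, oid: str
) -> None:
    # before (slow, memory-heavy): ids = await crawls.distinct("_id", {"oid": oid})
    # after: stream the matching crawls with a cursor, and note the org filter
    async for crawl in crawls.find({"oid": oid}, projection=["_id"]):
        await readd_pages(crawl["_id"])
```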
Ilya Kreymer
5bebb6161a
Issue 2396 readd pages fixes (#2398)
readd pages fixes:
- add additional memory to the background job
- copy page QA data to a separate temp collection when re-adding pages,
then merge it back in
2025-02-17 13:52:11 -08:00
Ilya Kreymer
e112f96614
Upload Fixes: (#2397)
- ensure upload pages are always added with a new uuid, to avoid any
duplicates with existing uploads, even if the uploaded WACZ is actually a
crawl from a different Browsertrix instance, etc.
- clean up upload names with slugify, which also replaces spaces, fixing
uploads of WACZ filenames that contain spaces
- part of fix for #2396

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-17 13:05:33 -08:00
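A sketch of the name cleanup using python-slugify (the extension handling shown is an assumption):

```python
from slugify import slugify  # python-slugify

def clean_upload_name(filename: str) -> str:
    """Slugify an uploaded WACZ name while keeping its extension, so names
    with spaces become safe (extension handling here is an assumption)."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        return slugify(filename)
    return f"{slugify(stem)}.{ext}"
```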
Tessa Walsh
39d99e7c5d
Add support for custom link selectors to backend (#2346)
Related to #2152 

This PR adds backend support for custom link selectors via `selectLinks`
on the crawl workflow config. Tests have been updated as well.

It also adds `selectLinks` to the frontend in a minimal and for now
hardcoded way that we can use as a basis for proper frontend support
moving forward.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-13 22:22:27 -08:00
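A sketch of how such entries can be split into (selector, attribute) pairs, e.g. `a[href]->href` (parsing details are illustrative, not the actual backend code):

```python
def parse_select_links(select_links: list[str]) -> list[tuple[str, str]]:
    """Split entries like 'a[href]->href' into (css_selector, attribute)
    pairs for the crawler config."""
    pairs = []
    for entry in select_links:
        selector, _, attribute = entry.partition("->")
        pairs.append((selector.strip(), attribute.strip()))
    return pairs
```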