browsertrix

Author	SHA1	Message	Date
Tessa Walsh	13bf818914	Fix nightly tests (#2460 ) Fixes #2459 - Set `/data/` as primary storage `access_endpoint_url` in nightly test chart - Modify nightly test GH Actions workflow to spawn a separate job per nightly test module using dynamic matrix - Set configuration not to fail other jobs if one job fails - Modify failing tests: - Add fixture to background job nightly test module so it can run alone - Add retry loop to crawlconfig stats nightly test so it's less dependent on timing GitHub limits each workflow to 256 jobs, so this should continue to be able to scale up for us without issue. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-03-06 16:23:30 -08:00
Ilya Kreymer	9466e83d18	version: bump to 1.14.3	2025-03-03 15:20:40 -08:00
Ilya Kreymer	afa892000b	replay api: add downloadUrl to replay endpoints to be used by RWP (#2456 ) RWP (2.3.3+) can determine if the 'Download Archive' menu item should be shown based on the value of downloadUrl. If set to 'null', will hide the menu item: - set downloadUrl to public collection download for public collections replay - set downloadUrl to null for private collection and crawl replay to hide the download menu item in RWP (otherwise have to add the auth_header query with bearer token and should assess security before doing that..) --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-03-03 14:11:28 -08:00
Ilya Kreymer	e13c3bfb48	move db migrations to initContainers: (#2449 ) - should avoid gunicorn worker timeouts for long running migrations, also fixes #2439 - add main_migrations as entrypoint to just run db migrations, using existing init_ops() call - first run 'migrations' container with same resources as 'app' and 'op' - additional typing for initializing db - cleanup unused code related to running only once, waiting for db to be ready - fixes #2447	2025-03-03 13:13:15 -08:00
Ilya Kreymer	702c9ab3b7	Better caching of presigned URLs + support for thumbnails (#2446 ) Overhauls URL presigning by: - cache the presigned urls in a flat, separate mongodb collection which has an expiring index - update presigned urls if not found / expired automatically in index - remove logic on storing presignedUrl in files - support caching presigned URL for thumbnails. - add endpoints to clear presigned urls for org or for all files in all orgs (superadmin only) - supersedes #2438, fix for #2437 - removes previous presignedUrl and expireAt data from crawls and QA runs --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-03-03 12:05:23 -08:00
Ilya Kreymer	631b019baf	optimize public collection loading: (#2444 ) - remove query for /collections endpoint just to get the org name - add orgName to single /collection endpoint, where it is already available on the backend	2025-03-03 10:13:30 -08:00
Ilya Kreymer	2263745df3	Fix replay.json 400 response for empty collection (#2445 ) - fix #2443 - don't throw error in list_pages() if no crawls provided, just return empty list - ensure an empty collection returns 200 on replay.json, add tests	2025-03-03 09:38:19 -08:00
Ilya Kreymer	cb52da66dc	version: bump to 1.14.2	2025-02-27 14:13:03 -08:00
Tessa Walsh	45aa0a32b6	Calculate total for crawl QA page endpoint (#2435 ) Fixes #2434 Patch fix for a regression in Browsertrix 1.14.0-1.14.1 where total was not being calculated for QA page list endpoint but still being included in response, which led to total always being 0 and pages not loading in the frontend review screen as a result.	2025-02-27 11:46:35 -08:00
Ilya Kreymer	376c9981dc	version: bump to 1.14.1	2025-02-26 23:15:01 -08:00
Tessa Walsh	3dc8c825c6	Add superadmin endpoint to readd scheduled workflow cronjobs (#2430 ) Adds new superadmin-only `POST /orgs/all/crawlconfigs/reAddCronjobs` endpoint to update/recreate scheduled workflow cronjobs across all orgs.	2025-02-26 23:13:53 -08:00
Ilya Kreymer	e67708bd4f	version: update to 1.14.0	2025-02-24 14:49:46 -08:00
Ilya Kreymer	83180efac9	remove dropping page index on migrations (#2418 ) Don't need it for now, and this will now be slow due to amount of pages. Can readd in future migrations if we need it.. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-02-24 12:29:02 -08:00
Ilya Kreymer	8a507f0473	Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417 ) - consolidate list_pages() and list_replay_query_pages() into list_pages() - to keep backwards compatibility, add <crawl>/pagesSearch that does not include page totals, keep <crawl>/pages with page total (slower) - qa frontend: add default 'Crawl Order' sort order, to better show pages in QA view - bgjob: account for parallelism in bgjobs, add logging if succeeded mismatches parallelism - QA sorting: default to 'crawl order' to get better results. - Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats - Bgjobs: give custom op jobs more memory	2025-02-21 13:47:20 -08:00
Ilya Kreymer	3ca68bf1d2	version: 1.14.0-beta.6	2025-02-20 15:37:33 -08:00
Tessa Walsh	f8fb2d2c8d	Rework crawl page migration + MongoDB Query Optimizations (#2412 ) Fixes #2406 Converts migration 0042 to launch a background job (parallelized across several pods) to migrate all crawls by optimizing their pages and setting `version: 2` on the crawl when complete. Also Optimizes MongoDB queries for better performance. Migration Improvements: - Add `isMigrating` and `version` fields to `BaseCrawl` - Add new background job type to use in migration with accompanying `migration_job.yaml` template that allows for parallelization - Add new API endpoint to launch this crawl migration job, and ensure that we have list and retry endpoints for superusers that work with background jobs that aren't tied to a specific org - Rework background job models and methods now that not all background jobs are tied to a single org - Ensure new crawls and uploads have `version` set to `2` - Modify crawl and collection replay.json endpoints to only include fields for replay optimization (`initialPages`, `pageQueryUrl`, `preloadResources`) if all relevant crawls/uploads have `version` set to `2` - Remove `distinct` calls from migration pathways - Consolidate collection recompute stats Query Optimizations: - Remove all uses of $group and $facet - Optimize /replay.json endpoints to precompute preload_resources, avoid fetching crawl list twice - Optimize /collections endpoint by not fetching resources - Rename /urls -> /pageUrlCounts and avoid $group, instead sort with index, either by seed + ts or by url to get top matches. - Use $gte instead of $regex to get prefix matches on URL - Use $text instead of $regex to get text search on title - Remove total from /pages and /pageUrlCounts queries by not using $facet - frontend: only call /pageUrlCounts when dialog is opened. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Emma Segal-Grossman <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-02-20 15:26:11 -08:00
Ilya Kreymer	36e723cc51	Adjust crawler pvc on exit code 3 (out of storage) (#2375 ) crawler 1.5.0 now has an exit code 3 for when crawler is actually out of disk space. The operator should handle this by immediately adjusting the PVC size. Ideally, crawler will be improved to avoid this, but since this can still happen, operator should be able to respond and fix the issue.	2025-02-20 11:03:28 -08:00
Ilya Kreymer	88a9f3baf7	ensure running crawl configmap is updated when exclusions are added/removed (#2409 ) exclusions are already updated dynamically if crawler pod is running, but when crawler pod is restarted, this ensures new exclusions are also picked up: - mount configmap in separate path, avoiding subPath, to allow dynamic updates of mounted volume - adds a lastConfigUpdate timestamp to CrawlJob - if lastConfigUpdate in spec is different from current, the configmap is recreated by operator - operator: also update image from channel to avoid any issues with updating crawler in channel - only updates for exclusion add/remove so far, can later be expanded to other crawler settings (see: #2355 for broader running crawl config updates) - fixes #2408	2025-02-19 11:42:19 -08:00
Ilya Kreymer	d23bca1f73	style change: remove spaces from python version docstring	2025-02-17 16:52:49 -08:00
Ilya Kreymer	a7c8ca4028	version: bump to 1.14.0-beta.1	2025-02-17 16:48:27 -08:00
Tessa Walsh	6c2d8c88c8	Modify page upload migration (#2400 ) Related to #2396 Changes to migration 0037: - Re-adds pages in migration rather than in background job to avoid race condition with later migrations - Re-adds pages for all uploads in all orgs Fix for readd pages for org: - Ensure org filter is applied! - Fix wrong type - Remove distinct, use iterator to iterate over crawls faster. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-02-17 16:47:58 -08:00
Ilya Kreymer	5bebb6161a	Issue 2396 readd pages fixes (#2398 ) readd pages fixes: - add additional mem to background job - copy page qa data to separate temp coll when re-adding pages, then merge back in	2025-02-17 13:52:11 -08:00
Ilya Kreymer	e112f96614	Upload Fixes: (#2397 ) - ensure upload pages are always added with a new uuid, to avoid any duplicates with existing uploads, even if upload wacz is actually a crawl from different browsertrix instance, etc.. - cleanup upload names with slugify, which also replaces spaces, fixes uploading wacz filenames with spaces in them - part of fix for #2396 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-02-17 13:05:33 -08:00
Tessa Walsh	39d99e7c5d	Add support for custom link selectors to backend (#2346 ) Related to #2152 This PR adds backend support for custom link selectors via `selectLinks` on the crawl workflow config. Tests have been updated as well. It also adds `selectLinks` to the frontend in a minimal and for now hardcoded way that we can use as a basis for proper frontend support moving forward. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-02-13 22:22:27 -08:00
Ilya Kreymer	4516268a70	misc fixes: cors + disable buffering for uploads (#2395 ) - ensure pages endpoint support CORS for local dev - disable proxy request buffering to support large uploads	2025-02-13 19:38:20 -08:00
Tessa Walsh	7f1af9bb31	Mark all pages from pages.jsonl as seeds (#2390 ) Fixes #2389 All pages from `pages/pages.jsonl` files now have `isSeed: True` in the database, in addition to any pages that explicitly have `seed` set to true in the actual JSONL. Tests have been added to ensure that all pages from our fixture uploads have `isSeed: True`. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-02-13 16:54:30 -08:00
Ilya Kreymer	7b2932c582	Add initial pages + pagesQuery endpoint to /replay.json APIs (#2380 ) Fixes #2360 - Adds `initialPages` to /replay.json response for collections, returning up to 25 pages (seed pages first, then sorted by capture time). - Adds `pagesQueryUrl` to /replay.json - Adds a public pages search endpoint to support public collections. - Adds `preloadResources`, including list of WACZ files that should always be loaded, to /replay.json --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-02-13 16:53:47 -08:00
sua yoo	f7b9b73a68	fix: Sort filtered collection page URLs (#2384 ) Fixes https://github.com/webrecorder/browsertrix/issues/2383 - Fixes unpredictable sort order when typing in collection page URL - Fixes page URL results flickering in and out while typing --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-02-12 11:59:20 -05:00
Emma Segal-Grossman	f8a44258d8	Merge pull request #2332 from webrecorder/frontend-collection-editing-dialog Collection editing and sharing revamp	2025-02-11 18:27:35 -05:00
Tessa Walsh	98a45b0d85	Add collection page list/search endpoint (#2354 ) Fixes #2353 Adds a new endpoint to list pages in a collection, with filtering available on `url` (exact match), `ts`, `urlPrefix`, `isSeed`, and `depth`, as well as accompanying tests. Additional sort options have been added as well. These same filters and sort options have also been added to the crawl pages endpoint. Also fixes an issue where `isSeed` wasn't being set in the database when false but only added on serialization, which was preventing filtering from working as expected.	2025-02-10 16:44:37 -08:00
Ilya Kreymer	001839a521	Fix max pages quota setting and display (#2370 ) - add ensure_page_limit_quotas() which sets the config limit to the max pages quota, if any - set the page limit on the config when: creating new crawl, creating configmap - don't set the quota page limit on new or existing crawl workflows (remove setting it on new workflows) to allow updated quotas to take effect for next crawl - frontend: correctly display page limit on workflow settings page from org quotas, if any. - operator: get org on each sync in one place - fixes #2369 --------- Co-authored-by: sua yoo <sua@webrecorder.org>	2025-02-10 16:15:21 -08:00
Tessa Walsh	0e9e70f3a3	Add WACZ filename, depth, favIconUrl, isSeed to pages (#2352 ) Adds `filename` to pages, pointed to the WACZ file those files come from, as well as depth, favIconUrl, and isSeed. Also adds an idempotent migration to backfill this information for existing pages, and increases the backend container's startupProbe time to 24 hours to give it sufficient time to finish the migration. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-02-05 15:50:04 -05:00
Ilya Kreymer	ea3b5e7322	quickfix: fix typo (missing self) that did not make it into #2351	2025-01-30 13:11:42 -08:00
Tessa Walsh	0a8df62ab4	Ensure collection stats are updated when WACZ is added on upload (#2351 ) Fixes #2350 Collection earliest/latest dates and the collection modified date are also now updated when crawls or uploads are added to a collection via the collection auto-add feature.	2025-01-30 13:05:56 -08:00
Tessa Walsh	b0aebb599a	Reformat with Black for 2025 ruleset (#2349 )	2025-01-29 16:57:06 -05:00
Tessa Walsh	9363095d62	Validate exclusion regexes on backend (#2316 )	2025-01-23 13:32:54 -05:00
Tessa Walsh	763c654484	feat: Update collection sorting, metadata, stats (#2327 ) - Refactors dashboard and org profile preview to use private API endpoint, to fix public collections not showing when the org visibility is hidden - Adds additional sorting options for collections - Adds unique page url counts for archived items, collections, and organizations to backend and exposes this in collections - Shows collection period (i.e. `dateEarliest` to `dateLatest`) in collections list - Shows same collection metadata in private and public views, updates private view info bar - Fixes "Update Org Profile" action item showing for crawler roles --------- Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-01-23 13:32:23 -05:00
Ilya Kreymer	28d39d8c4d	Fix migration to avoid duplicate collection slugs and names (#2318 ) Follow-up to #2301 Updates the 0039 migration to ensure collection slugs and names are unique by: - Removing all indexes - Setting `slug` to random value - Adding unique index to `slug` field. - Attempting to set slug from name using `slug_from_name()` - If rejected due to duplicate, append `-<counter>` at end of slug. Also update name with ` <counter>`. - Now that names should also be unique, add unique index on name field. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-01-21 14:23:32 -08:00
Tessa Walsh	6797b41de0	Add pageCount to crawls and uploads and use in frontend for page counts (#2315 ) Fixes #2257 This is a follow-up to the public collections work, which adds pages to the database for uploads. All crawls and uploads now have a `pageCount` field which is populated when the item is successfully added. A new migration is also added to populate the field for existing archived items that don't have it set yet. OrgMetrics have also been modified to include `crawlPageCount` and `uploadPageCount`, and to include the total of both in `pageCount`, and all three included in the frontend org dashboard. The frontend has been updated to use `pageCount` rather than `stats.done` wherever appropriate, meaning that in archived item lists and details we now have a consistent page count for both crawls and uploads. ### New functionality - Deploy this branch - Create new crawls and uploads and verify that page count appears correctly throughout the frontend for all new crawls and uploads ### Migration - Deploy from latest main - Create some crawls and uploads - Change to this branch and re-deploy - Verify migration ran without errors in backend logs - Verify that page count has been populated successfully by checking archived items lists, crawl and upload detail pages, and dashboard to ensure there are no longer any missing page counts. --------- Co-authored-by: emma <hi@emma.cafe>	2025-01-16 14:41:14 -08:00
Tessa Walsh	5684e896af	Add support for autoclick (#2313 ) Fixes #2259 This PR brings backend and frontend support for the new autoclick behavior in Browsertrix, introduced in Browsertrix 1.5.0+ On the backend, we introduce `min_autoclick_crawler_image` to `values.yaml`, with a default value of `"docker.io/webrecorder/browsertrix-crawler:1.5.0"`. If this is set and the crawler version for a new crawl is less than this value, the autoclick behavior is removed from the behaviors list in the configmap created for the crawl. The one caveat for this is that a crawler image tag like "latest" will always be parsed as greater than `min_autoclick_crawler_image`, so there is the potential for the crawler to run into issues if using a non-numeric image tag with an older version of the crawler. For production we use hardcoded specific versions of the crawler except for the dev channel, which from here on out will include autoclick support, so I think this should be okay (and is also true of the existing implementation for checking `min_qa_crawler_image`). On the frontend, I've added a checkbox (unchecked by default) in the "Limits" section just below the current checkbox for autoscroll. We might want to move these to a different section eventually - I'm not sure Limits is the right place for them - but I wanted to be consistent with things as they are. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-01-16 12:44:00 -08:00
Tessa Walsh	4583babecb	feat: Add slug to collections and use it in public collection URLs (#2301 ) Resolves https://github.com/webrecorder/browsertrix/issues/2298 ## Changes - Slugs added to collections, can be specified separately when creating or updating collections or else is based off of supplied collection name - Migration added to backfill slugs for existing collections - Redirect collection to newest slug if changed - Adds option to copy public profile link to "Public Collections" action menu - Show "Back to <Org>" link instead of breadcrumbs --------- Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-01-15 22:44:32 -08:00
sua yoo	4347fcdba5	feat: Show collection created date (#2302 ) - Shows collection created date in detail view (if present) - Adds `black` formatter to vscode extension recommendations --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-01-14 11:22:00 -05:00
Tessa Walsh	cbcf087a48	Add last crawl and subscription status indicators to org list (#2273 ) Fixes #2260 - Adds `lastCrawlFinished` to Organization model, updated after crawls are added/deleted and with an idempotent migration to backfill existing orgs - Adds Last Crawl column to end of admin orgs list table - Adds subscription icon next to existing status icon in orgs list - Adds "lastCrawlFinished", "subscriptionStatus", and "subscriptionPlan" sort options to orgs list backend endpoint in anticipation of future sorting/filtering of orgs list --------- Co-authored-by: emma <hi@emma.cafe> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-01-14 10:57:06 -05:00
Ilya Kreymer	12f358b826	Merge pull request #2271 from webrecorder/public-collections-feature feat: Public collections, includes: - feat: Public org profile page #2172 - feat: Collection thumbnails, start page, and public view updates #2209 - feat: Track collection events #2256	2025-01-13 19:32:45 -08:00
Ilya Kreymer	bab5345ad5	version: bump to 1.14.0-beta.0 for public collections!	2025-01-13 19:29:54 -08:00
Tessa Walsh	d8655d3bc6	Use id for thumbnail size error detail	2025-01-13 15:15:49 -08:00
Tessa Walsh	be9ff04ee8	Make more explicit error message for large thumbnails	2025-01-13 15:15:49 -08:00
Tessa Walsh	eb88e9f90c	Add missing os import	2025-01-13 15:15:48 -08:00
Tessa Walsh	a031fab313	Backend work for public collections (#2198 ) Fixes #2182 This rather large PR adds the rest of what should be needed for public collections work in the frontend. New API endpoints include: - Public collections endpoints: GET, streaming download - Paginated list of URLs in collection with snapshot (page) info for each - Collection endpoint to set home URL - Collection endpoint to upload thumbnail as stream - DELETE endpoint to remove collection thumbnail Changes to existing API endpoints include: - Paginating public collection list results - Several `pages` endpoints that previously only supported `/crawls/` in their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support `/uploads/` and `/all-crawls/` namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For `/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent). - `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added. Other big changes: - New uploads will now have their pages read into the database! Collection page counts now also include uploads - A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly - Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints	2025-01-13 15:15:48 -08:00
Tessa Walsh	190bdeb868	Add public API endpoint for public collections (#2174 ) Fixes #1051 If org with provided slug doesn't exist or no public collections exist for that org, return same 404 response with a detail of "public_profile_not_found" to prevent people from using public endpoint to determine whether an org exists. Endpoint is `GET /api/public-collections/<org-slug>` (no auth needed) to avoid collisions with existing org and collection endpoints.	2025-01-13 15:15:48 -08:00
Tessa Walsh	42ebfd303d	Make changes to collections to support publicly listed collections (#2164 ) Fixes #2158 - Adds `Organization.listPublicCollections` field and API endpoint to update it - Replaces `Collection.isPublic` boolean with `Collection.access` (values: `private`, `unlisted`, `public`) and add database migration - Update frontend to use `Collection.access` instead of `isPublic`, otherwise not changing current behavior --------- Co-authored-by: sua yoo <sua@suayoo.com>	2025-01-13 15:15:47 -08:00
Ilya Kreymer	a21b2ff0df	version: bump to 1.13.2	2025-01-08 22:58:33 -08:00
Tessa Walsh	589819682e	Optionally delay replica deletion (#2252 ) Fixes #2170 The number of days to delay file replication deletion by is configurable in the Helm chart with `replica_deletion_delay_days` (set by default to 7 days in `values.yaml` to encourage good practice, though we could change this). When `replica_deletion_delay_days` is set to an int above 0, when a delete replica job would otherwise be started as a Kubernetes Job, a CronJob is created instead with a cron schedule set to run yearly, starting x days from the current moment. This cronjob is then deleted by the operator after the job successfully completes. If a failed background job is retried, it is re-run immediately as a Job rather than being scheduled out into the future again. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-12-19 18:50:28 -08:00
Ilya Kreymer	2060ee78b4	Support Presigning for use with custom domain (#2249 ) If access_endpoint_url is provided: - Use virtual host addressing style, so presigned URLs are of the form `https://bucket.s3-host.example.com/path/` instead of `https://s3-host.example.com/bucket/path/` - Allow for replacing `https://bucket.s3-host.example.com/path/` -> `https://my-custom-domain.example.com/path/`, where `https://my-custom-domain.example.com/path/` is the access_endpoint_url - Remove old `use_access_for_presign` which is no longer used - Fixes #2248 - docs: update deployment docs storages section to mention custom storages, access_endpoint_url --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-12-19 18:41:47 -08:00
Ilya Kreymer	8e375335cd	Related crawljob filtering by role (#2262 ) add filtering by role to related crawljobs query: - for regular crawls (role 'job'), only count other regular crawls - for qa runs (role 'qa-job') only count other qa jobs - ensures that concurrent crawl limits apply separately to regular crawls and qa runs - fixes #2261	2024-12-19 17:20:15 -08:00
Ilya Kreymer	60d07762be	version: bump to 1.13.1	2024-12-19 12:01:47 -08:00
Ilya Kreymer	cf60c43df2	version: bump to 1.13.0! (#2242 )	2024-12-13 20:32:38 -08:00
Ilya Kreymer	c27758a0f6	quickfix: update test_api.py to match all locales enabled by default (#2241 )	2024-12-13 20:30:06 -08:00
Emma Segal-Grossman	b650762a45	Allow configuring available languages from helm chart (#2230 ) Closes #2223 - [x] Adds `localesAvailable` to `/api/settings` endpoint, and uses that list if available, rather than the full list of translated locales, to determine which options to display to users - [x] ~~Uses the user's browser locales, filtered to the current language setting, for formatting numbers, dates, and durations~~ - [x] Adds & persists checkbox for "use same language for formatting dates and numbers" in user settings - [x] Replaces uses of `sl-format-bytes` with `localize.bytes(...)`, and `sl-format-date` with replacement `btrix-format-date` that properly handles fallback locales - [x] Caches all number/duration/datetime formatters by a combined key consisting of app language, browser language, browser setting, and formatter options so that all formatters can be reused if needed (previously any formatter with non-default options would be recreated every render) - [x] Splits out ordinal formatting from number formatter, as it didn't make much sense in some non-English locales - [x] Adds a little demo of date/time/duration/number formatting so you can see what effect your language settings have https://github.com/user-attachments/assets/724858cb-b140-4d72-a38d-83f602c71bc7 --------- Signed-off-by: emma <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-12-13 22:31:26 -05:00
Ilya Kreymer	db39333ef4	Send subscription cancelation email (#2234 ) Adds sending a cancellation email when a subscription is cancelled. - The email may also include an optional survey URL, if configured in helm chart `survey_url` setting. - Cancellation e-mail configured in `sub_cancel` e-mail template - E-mails are sent to all org admins. - Also adds `trialing_canceled` subscription state to differentiate from a default `trialing` which will automatically rollover into `active`. - The email is sent when: a new cancellation date is added for an `active` subscription, or a `trialing` subscription is changed to `trialing_canceled`. (A subscription can be canceled/uncanceled several times before actual date, and e-mail is sent every time it is canceled.) - The 'You have X days left of your trial' is also always displayed when state is in trialing_canceled. Fixes #2229 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-12-12 11:52:38 -08:00
Tessa Walsh	b7604ee61d	Add superuser endpoint to get user emails with org info (#2211 ) Fixes #2203 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-12-09 16:38:01 -08:00
Tessa Walsh	661e5d9fae	Fix issue with failed background job emails not being sent (#2187 ) Fixes #2186 Background job emails will no longer fail to send for jobs unrelated to file replication or replica deletion. Also uses `AnyJob` for paginated background job response model, to fix typing being out of date following addition of other types of background jobs and lower overhead for adding new ones moving forward.	2024-11-27 17:00:35 -08:00
Ilya Kreymer	50dac7dc50	1.12.2 release -> main (#2181 ) Merge 1.12.2 release changes into main, includes: - Collection replay full refresh on metadata / archived items (#2176) - Fix for self-registration default org (#2178) - Prepend missing https in start URL (#2177) - Updated billing to support free trial messaging (#2179) --------- Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: SuaYoo <SuaYoo@users.noreply.github.com>	2024-11-26 11:17:07 -08:00
Tessa Walsh	ba5ca3fdd9	Move org storage recalculation into background job (#2138 ) Fixes #2112 - Moves org storage recalculation to background job, modify endpoint to return job id as part of response - Updates crawl + QA backend tests that broke due to https://webrecorder.net website changes --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-11-19 17:32:57 -05:00
Henry Wilkinson	74161f8477	Update Webrecorder.net links (#2120 ) - Updates documentation links to point to new Browsertrix landing page - Updates redoc links	2024-10-31 16:33:54 -04:00
Tessa Walsh	55a758f342	Consolidate ops class initialization (#2117 ) Fixes #2111 The background job and operator entrypoints now use a shared function that initializes and returns the ops classes. This is not applied in the main entrypoint as that also initializes the backend API, which we don't want in the other entrypoints.	2024-10-30 15:33:22 -04:00
Tessa Walsh	0dc025e9fd	Update nightly org deletion tests to account for bg job (#2118 ) Follow-up to https://github.com/webrecorder/browsertrix/pull/2098 Updates I missed to nightly org deletion tests following the shift to deleting orgs in a background job. I think this should be the last thing to get nightly tests passing consistently again.	2024-10-30 15:31:33 -04:00
Tessa Walsh	3ea20e538d	Fix nightly tests: Add boto3 as test requirement (#2116 )	2024-10-23 13:41:22 -07:00
Tessa Walsh	f7426cc46a	Fix nightly tests: modify kubectl exec syntax for creating new minio bucket (#2097 ) Fixes #2096 For example failing test run, see: https://github.com/webrecorder/browsertrix/actions/runs/11121185534/job/30899729448 --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-10-21 17:41:19 -07:00
Tessa Walsh	1b1819ba5a	Move org deletion to background job with access to backend ops classes (#2098 ) This PR introduces background jobs that have full access to the backend ops classes and moves the delete org job to a background job.	2024-10-10 14:41:05 -04:00
Ilya Kreymer	84a74c43a4	version: bump to 1.13.0-beta.0	2024-10-10 11:38:13 -07:00
Ilya Kreymer	6032e28231	fix: firstOrgAdmin being set to true even if invite was not for an admin (#2110 ) Non-admin users should not be given option to rename org when invited to a new org: - set firstOrgAdmin to true only when invite is for an admin - default to false instead of null - update tests to check	2024-10-08 16:42:30 -07:00
Ilya Kreymer	8192e5bed6	version: bump to 1.12.0	2024-10-03 16:45:54 -07:00
Ilya Kreymer	104ea097c4	switch to simpler streaming download + multiwacz metadata improvements: (#1982 ) - download via presigned URLs via requests instead of boto APIs, remove boto - follow-up to #1933 for streaming download improvements - fixes datapackage.json in multi-wacz to contain the same resources objects with: `name`, `path`, `hash`, `bytes` to match single WACZ. - Add additional metadata to multi-wacz datapackage.json, including `type` (`crawl`, `upload`, `collection`, `qaRun`), `id` (unique id for the object), `title` / `description` if available (for crawl/upload/collection), and `crawlId` for `qaRun`	2024-10-03 16:13:31 -07:00
Vinzenz Sinapius	bb6e703f6a	Configure browsertrix proxies (#1847 ) Resolves #1354 Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires browsertrix crawler 1.3+) Config: - proxies defined in btrix-proxies subchart - can be configured via btrix-proxies key or separate proxies.yaml file via separate subchart - proxies list refreshed automatically if crawler_proxies.json changes if subchart is deployed - support for ssh and socks5 proxies - proxy keys added to secrets in subchart - support for default proxy to be always used if no other proxy configured, prevent starting cluster if default proxy not available - prevent starting manual crawl if previously configured proxy is no longer available, return error - force 'btrix' username and group name on browsertrix-crawler non-root user to support ssh Operator: - support crawling through proxies, pass proxyId in CrawlJob - support running profile browsers with designated proxy, pass proxyId to ProfileJob - prevent starting scheduled crawl if previously configured proxy is no longer available API / Access: - /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only) - /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org - /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only) - superadmin can configure which orgs can use which proxies, stored on the org - superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org. UI: - Superadmin has 'Edit Proxies' dialog to configure for each org if it has: dedicated proxies, has access to shared proxies. - User can select a proxy in Crawl Workflow browser settings - Users can choose to launch a browser profile with a particular proxy - Display which proxy is used to create profile in profile selector - Users can choose which default proxy to use for new workflows in Crawling Defaults --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-10-02 18:35:45 -07:00
Ilya Kreymer	62da0fbd6c	security: tweak get /invite endpoints / InviteOut to: (#2087 ) don't set inviterEmail / inviterName if the inviter is the superuser: - return fromSuperuser true/false - if fromSuperuser, don't set inviterEmail / inviterName - tests: add tests for non-superuser admin invites	2024-09-20 11:52:56 -07:00
Ilya Kreymer	feb6b1f26c	Ensure email comparisons are case-insensitive, emails stored as lowercase (#2084 ) (#2086 ) (fixes from 1.11.7) - Add a custom EmailStr type which lowercases the full e-mail, not just the domain. - Ensure EmailStr is used throughout wherever e-mails are used, both for invites and user models - Tests: update to check for lowercase email responses, e-mails returned from APIs are always lowercase - Tests: remove tests where '@' was url-encoded, should not be possible since POSTing JSON and no url-decoding is done/expected. E-mails should have '@' present. - Fixes #2083 where invites were rejected due to case differences - CI: pin pymongo dependency due to latest releases update, update python used for CI	2024-09-19 12:20:34 -07:00
Tessa Walsh	123705c53f	Serialize datetimes with Z suffix (#2058 ) Use timezone aware datetimes instead of timezone naive datetimes: - Update mongodb client to use tz-aware conversion - Convert dt_now() to return timezone aware UTC date - Rename to_k8s_date -> date_to_str, just returns ISO UTC date with 'Z' (instead of '+00:00' suffix) - Rename from_k8s_date -> str_to_date, returns timezone aware date from str - Standardize all string<->date conversion to use either date_to_str or str_to_date - Update frontend to assume iso date, not append 'Z' directly - Update tests to check for 'Z' suffix on some dates --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-09-12 16:16:13 -07:00
Ilya Kreymer	c242bb96d2	version: bump to 1.12.0-beta.0	2024-09-12 14:30:15 -07:00
Ilya Kreymer	1f919de294	Make auto-resize crawler volume ratio adjustable (#2076 ) Make the avail / used storage ratio (for crawler volumes) adjustable. Disable auto-resize if set to 0. Follow-up to #2023	2024-09-12 09:28:19 -07:00
sua yoo	4c36c80351	feat: Display scale as number of browser windows (#2057 ) Resolves https://github.com/webrecorder/browsertrix/issues/2048 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-09-05 17:32:40 -07:00
Ilya Kreymer	b3c1195878	version: bump to 1.11.6	2024-09-05 17:31:10 -07:00
Ilya Kreymer	ea252e8da9	version: bump to 1.11.5	2024-08-27 10:00:53 -07:00
sua yoo	337454f8c9	feat: Add link to hosted sign-up page (#2045 ) Resolves https://github.com/webrecorder/browsertrix/issues/2043 Changes: - Shows link to sign up in UI if `sign_up_url` is configured. - Expires settings in session storage (for now)	2024-08-26 17:26:25 -07:00
Ilya Kreymer	95969ec747	Attempt to auto-adjust storage if usage is running out while crawl is running (#2023 ) Attempt to auto-adjust PVC storage if: - used storage (as reported in redis by the crawler) * 2.5 > total_storage - will cause PVC to resize, if possible (not supported by all drivers) - uses multiples of 1Gi, rounding up to next GB - AVAIL_STORAGE_RATIO hard-coded to 2.5 for now, to account for 2x space for WACZ plus change for fast updating crawls Some caveats: - only works if the storageClass used for PVCs has `allowVolumeExpansion: true`, if not, it will have no effect - designed as a last resort option: the `crawl_storage` in values and `--sizeLimit` and `--diskUtilization` should generally result in this not being needed. - can be useful in cases where a crawl is rapidly capturing a lot of content in one page, and there's no time to interrupt / restart, since the other limits apply only at page end. - May want to have crawler update the disk usage more frequently, not just at page end to make this more effective.	2024-08-26 14:19:20 -07:00
Ilya Kreymer	a1df689729	stats recompute fixes: (#2022 ) - fix stats_recompute_last() and stats_recompute_all() to not update the lastCrawl* properties of a crawl workflow if a crawl is running, as those stats now point to the running crawl - refactor _add_running_curr_crawl_stats() to make it clear stats only updated if crawl is running - stats_recompute_all() change order to ascending to actually get last crawl, not first!	2024-08-26 14:18:59 -07:00
Ilya Kreymer	135c97419d	version: update to 1.11.4	2024-08-26 12:31:56 -07:00
Ilya Kreymer	96e393e80d	update crawler channel fix: add crawlerChannel to update check (#2046 ) Add missing check for crawlerChannel update	2024-08-26 10:41:54 -04:00
Ilya Kreymer	04c8b50423	add crawling defaults on the Org to allow setting certain crawl workflow fields as defaults: (#2031 ) - add POST /orgs/<id>/defaults/crawling API to update all defaults (defaults unset are cleared) - defaults returned as 'crawlingDefaults' object on Org, if set - fixes #2016 --------- Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>	2024-08-22 10:36:04 -07:00
Ilya Kreymer	86c9e538c1	quickfix: webhooks: ensure the 'crawl_reviewed' webhook is sent async, doesn't delay submitting a review (#2033 ) make the call to `create_crawl_reviewed_notification` be called with create_task (similar to other user-initiated webhook events), to avoid extra wait for webhook to complete	2024-08-20 17:50:18 -07:00
Ilya Kreymer	8c9a14b6a2	Ensure Subscription Update doesn't update the gifted quotas (#2012 ) - add a separate OrgQuotasIn where all quota updates are optional - ensure gifted quotas are never updated as part of org update - update tests	2024-08-20 13:15:03 -07:00
Tessa Walsh	916813af2d	Include user and user org info in login response (#2014 ) Fixes #2013 Adds the `/users/me` response data to the API login endpoint response under the key `user_info` and adds a test.	2024-08-12 18:51:42 -07:00
Ilya Kreymer	d9f49afcc5	type fixes on util functions (#2009 ) Some additional typing for util.py functions and resultant changes	2024-08-12 10:54:45 -07:00
Ilya Kreymer	12f994b864	QA: Count QA execution minutes separately for now (#2011 ) For now, keep QA exec time separate, as it may be scaled differently and currently still in beta.	2024-08-09 13:13:21 -07:00
Ilya Kreymer	4ec7cf8adc	Additional operator edge case fixes (#2007 ) Fix a few edge-case situations: - Restart evicted pods that have reached the terminal `Failed` state with reason `Evicted`, by just recreating them. These pods will not be automatically retried, so need to be recreated (usually happens due to memory pressure from the node) - Don't treat containers in ContainerCreating as running: even though this state is usually quick, it's possible for containers to get stuck there, and excluding it will improve accuracy of exec seconds tracking. - Consolidate state transition for running states, either sets to running or to pending-wait/generate-wacz/upload-wacz and allows changing to either of these states from each other or waiting_capacity	2024-08-09 13:12:25 -07:00
Ilya Kreymer	8ff1ad39a7	version: bump to 1.11.3	2024-08-08 15:16:18 -07:00
Ilya Kreymer	ed9038fbdb	version: bump to 1.11.2	2024-08-07 12:37:26 -07:00
Ilya Kreymer	5f53db75ee	fix resetting of invalid logins: (#2002 ) * Fixes issue in FailedLogin model: - fix data-model to remove nested 'attempted.attempted' - migrate existing data to remove nested field * Also, avoid setting dt_now() in model as that results in fixed date for all objects: - update FailedLogin to update 'attempted' date on every attempt - also update PageNote object to set date in constructor * Update text for too many logins to make it clear it is set only if it's a valid email * fixes #2001	2024-08-07 12:36:06 -07:00
Ilya Kreymer	41d43ae249	Fix forgot password for invalid user (#1999 ) - fix validation error if user doesn't exist - always return success even if user doesn't exist for security reasons - add test for forgot password endpoint	2024-08-07 11:02:40 -07:00
Ilya Kreymer	7fa2b61b29	Execution time tracking tweaks (#1994 ) Tweaks to how execution time is tracked for more accuracy + excluding waiting states: - don't update if crawl state is in a 'waiting state' (waiting for capacity or waiting for org limit) - rename start states -> waiting states for clarity - reset lastUpdatedTime if two consecutive updates of non-running state, to ensure non-running states don't count, but also account for occasional hiccups -- if only one update detects non-running state, don't reset - webhooks: move start webhook to when crawl actually starts for first time (db lastUpdatedTime is not yet set + crawl is running) - don't set lastUpdatedTime until pods actually running - set crawljob update interval to every 10 seconds for more accurate execution time tracking - frontend: show seconds in 'Execution Time' display	2024-08-06 09:44:44 -07:00

661 Commits