browsertrix

Author	SHA1	Message	Date
Tessa Walsh	38a01860b8	Add API endpoints for crawl statistics (#1461 ) Fixes #1158 Introduces two new API endpoints that stream crawling statistics CSVs (with a suggested attachment filename header): - `GET /api/orgs/all/crawls/stats` - crawls from all orgs (superuser only) - `GET /api/orgs/{oid}/crawls/stats` - crawls from just one org (available to org crawler/admin users as well as superusers) Also includes tests for both endpoints.	2024-01-10 13:30:47 -08:00
Ilya Kreymer	a6936299d3	version: bump to 1.9.0-beta.0	2023-12-20 00:08:16 -08:00
Ilya Kreymer	d74d9ac09d	Recreate configmaps if missing (#1444 ) If configmap is missing (eg. was accidentally deleted from k8s) recreate the configmap when updating the crawl workflow or running a crawl. Previously, this would result in an error, but now the configmap should be correctly recreated.	2023-12-12 17:48:27 -05:00
Ilya Kreymer	d902cf5338	version: bump to 1.8.2	2023-12-07 13:34:37 -08:00
Tessa Walsh	be41c48c27	Add extra and gifted execution minutes (#1361 ) Fixes #1358 - Adds `extraExecMinutes` and `giftedExecMinutes` org quotas, which are not reset monthly but are updateable amounts that carry across months - Adds `quotaUpdate` field to `Organization` to track when quotas were updated with timestamp - Adds `extraExecMinutesAvailable` and `giftedExecMinutesAvailable` fields to `Organization` to help with tracking available time left (includes tested migration to initialize these to 0) - Modifies org backend to track time across multiple categories, using monthlyExecSeconds, then giftedExecSeconds, then extraExecSeconds. All time is also written into crawlExecSeconds, which is now the monthly total and also contains any overage time above the quotas - Updates Dashboard crawling meter to include all types of execution time if `extraExecMinutes` and/or `giftedExecMinutes` are set above 0 - Updates Dashboard Usage History table to include all types of execution time (only displaying columns that have data) - Adds backend nightly test to check handling of quotas and execution time - Includes migration to add new fields and copy crawlExecSeconds to monthlyExecSeconds for previous months Co-authored-by: emma <hi@emma.cafe>	2023-12-07 14:34:37 -05:00
Tessa Walsh	478b794f9b	Add API endpoint to retry all failed bg jobs (#1396 ) Fixes #1395 - Adds new `POST /orgs/<orgid>/jobs/retryFailed` API endpoint to retry all failed background jobs for a specific org. - Also adds `POST /orgs/all/jobs/retryFailed` for superadmin to retry all failed background jobs for all orgs	2023-12-05 13:00:45 -08:00
Tessa Walsh	3d93d0a0d0	Add API tests for browser profiles (#1392 ) Fixes #1330	2023-11-28 10:40:58 -05:00
Henry Wilkinson	f507f1d2ec	Fixes allowed actions for viewers and crawlers throughout the app (#1326 ) Closes #1294 ### Changes - `crawl-list` component - Adds a check if there are any items in the actions menu. If not, skip rendering the actions menu. - This allows us to give the component no actions! Currently required to remove them for viewers! - Collection Details - Hides "Remove from Collection" option for viewers - Crawls List - Removes the single "View Crawl Details" option from archived items for viewers - All the other actions were already set up correctly to be used by all roles! - Dashboard - Hides org settings gear icon button unless the user is an admin - Hides "Create New" dropdown for viewers - Workflow Details - Hides workflow edit icon button for viewers - Hides the "Delete Crawl" option in archived items for viewers - Hides the "Run Crawl" option for viewers - Workflow List - Hides all edit-related options for viewers, the only option now is copying tags - Removes the deactivate / delete options (were only visible when running a crawl) in the workflow list actions --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@suayoo.com>	2023-11-17 14:41:21 -08:00
Ilya Kreymer	1218d6e767	version: bump to 1.8.1	2023-11-17 14:39:52 -08:00
Ilya Kreymer	b6f8c968e9	version: bump to 1.8.0	2023-11-15 17:57:43 -08:00
Ilya Kreymer	b23eed5003	Email Templates (#1375 ) - Emails are now processed from Jinja2 templates found in `charts/email-templates`, to support easier updates via helm chart in the future. - The available templates are: `invite`, `password_reset`, `validate` and `failed_bg_job`. - Each template can be text only or also include HTML. The format of the template is: ``` subject ~~~ <html content> ~~~ text ``` - A new `support_email` field is also added to the email block in values.yaml Invite Template: - Currently, only the invite template includes an HTML version, other templates are text only. - The same template is used for new and existing users, with slightly different text if adding user to an existing org. - If user is invited by the superadmin, the invited by field is not included, otherwise it also includes 'You have been invited by X to join Y'	2023-11-15 15:22:12 -08:00
Ilya Kreymer	7d985a9688	version: bump to 1.8.0-beta.4	2023-11-14 11:59:04 -08:00
Ilya Kreymer	dfba4b3940	Replace partial_complete -> stopped_by_user or stopped_quota_reached + operator edge cases (#1368 ) - Adds two new crawl finished states, stopped_by_user and stopped_quota_reached - Tracking other possible 'stop reasons' in operator, though not making them distinct states for now. - Updated frontend with 'Stopped by User' and 'Stopped: Time Quota Reached', shown with same icon as current partial_complete - Added migration of partial_complete to either stopped_by_user or complete (no historical quota data available) - Addresses edge case in scaling: if crawl never scaled (no redis entry, no pod), automatically scale down - Edge case in status: if crawl is somehow 'canceled' but not deleted, immediately delete crawl object and begin finalizing. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-14 11:17:16 -08:00
Ilya Kreymer	67892994a6	version: bump to 1.8.0-beta.3	2023-11-09 18:20:04 -08:00
Tessa Walsh	f3cbd9e179	Add crawl, upload, and collection delete webhook event notifications (#1363 ) Fixes #1307 Fixes #1132 Related to #1306 Deleted webhook notifications include the org id and item/collection id. This PR also includes API docs for the new webhooks and extends the existing tests to account for the new webhooks. This PR also does some additional cleanup for existing webhooks: - Remove `downloadUrls` from item finished webhook bodies - Rename collection webhook body `downloadUrls` to `downloadUrl`, since we only ever have one per collection - Fix API docs for existing webhooks, one of which had the wrong response body	2023-11-09 18:19:08 -08:00
Tessa Walsh	1afc411114	Implement retry API endpoint for failed background jobs (#1356 ) Fixes #1328 - Adds /retry endpoint for retrying failed jobs. - Returns 400 error if previous job still running or has succeeded - Keeps track of previous failed attempts in previousAttempts array on failed job. - Also amends the similar webhook /retry endpoint to use `POST` for consistency. - Remove duplicate api tag for backgroundjobs	2023-11-09 18:09:37 -08:00
Tessa Walsh	82a5d1e4e4	Regression fix: add profiles/ prefix to profile filenames (#1365 ) Fixes #1364 Regression fix for issue introduced in storage refactoring (see issue for more details). Changes: 1. Add `profiles/` prefix to profile filename passed in to crawler for profile creation and written into db 2. Remove hardcoded `profiles/` prefix from crawler YAML 3. Add migration to add `profiles/` prefix to profile filenames that don't already have it, including updating PROFILE_FILENAME in ConfigMaps This way between the related storage document and the profile filename, we have the full path to the object in the database rather than relying on additional prefixes hardcoded into k8s job YAML files. Note that as a follow-up it'll be necessary to manually move any profiles that had been written into the `<oid>` "directory" in object storage rather than `<oid>/profiles` to the latter. This should only affect profiles created very recently in a 1.8.0-beta release.	2023-11-09 17:44:16 -08:00
Tessa Walsh	30bbefbeaa	Send email to superuser when background job fails (#1355 ) Fixes #1344 Sends email to superadmin when a background job fails.	2023-11-08 19:55:59 -08:00
Ilya Kreymer	ff10124d01	charts cleanup: (#1360 ) - move authsign secret to signer and make port configurable - rename storages to more general ops-configs - put 'storages.json' path into env var - rename backend secret to backend-auth - cronjobs: don't keep succeeded jobs around, triggers operator update	2023-11-08 19:24:00 -08:00
Ilya Kreymer	d2d7240455	background jobs fix: ensure bucket is parsed correctly (#1359 ) Follow-up to #1321 - correctly parse the endpoint_url into prefix and bucket path - also add region and s3 provider type to storage secrets	2023-11-08 15:08:23 -08:00
Ilya Kreymer	3aebf2e37f	version: bump to 1.8.0-beta.2	2023-11-06 16:35:15 -08:00
Ilya Kreymer	b4fd5e6e94	Crawl Timeout via elapsed time (#1338 ) Fixes #1337 Crawl timeout is tracked via `elapsedCrawlTime` field on the crawl status, which is similar to regular crawl execution time, but only counts one pod if scale > 1. If scale == 1, this time is equivalent. Crawl is gracefully stopped when the elapsed execution time exceeds the timeout. For more responsiveness, also adding current crawl time since last update interval. Details: - handle crawl timeout via elapsed crawl time - longest running time of a single pod, instead of expire time. - include current running from last update for best precision - more accurately count elapsed time crawl is actually running - store elapsedCrawlTime in addition to crawlExecTime, storing the longest duration of each pod since last test interval --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-06 16:32:58 -08:00
Ilya Kreymer	5530ca92e1	Move backend app templates to be installed from configmap volume (#1331 ) Instead of adding the app templates launched from the backend via `backend/btrixcloud/templates`, add them to a configmap and mount the configmap in the same location. This allows these templates to be updated, like other values in charts/... without having to rebuild any of the images, speeding up dev and maintenance time. Changes include: - move backend/btrixcloud/templates -> chart/app-templates/ - add app-templates/*.yaml to app-templates configmap - mount app-templates configmap to /app/btrixcloud/templates/ in api and op containers	2023-11-06 09:37:48 -08:00
Ilya Kreymer	0935d43a97	exclusion optimizations: dynamic exclusions (part of #1216 ): (#1268 ) - instead of restarting crawler when exclusion added/removed, add a message to a redis list (per crawler instance) - no longer filtering existing queue on backend, now handled via crawler (implemented in 0.12.0 via webrecorder/browsertrix-crawler#408) - match response optimization: instead of returning first 1000 matches, limits response to 500K and returns however many matches fit in that response size (for optional pagination on frontend)	2023-11-06 09:36:25 -08:00
Ilya Kreymer	fb3d88291f	Background Jobs Work (#1321 ) Fixes #1252 Supports a generic background job system, with two background jobs, CreateReplicaJob and DeleteReplicaJob. - CreateReplicaJob runs on new crawls, uploads, profiles and updates the `replicas` array with the info about the replica after the job succeeds. - DeleteReplicaJob deletes the replica. - Both jobs are created from the new `replica_job.yaml` template. The CreateReplicaJob sets secrets for primary storage + replica storage, while DeleteReplicaJob only needs the replica storage. - The job is processed in the operator when the job is finalized (deleted), which should happen immediately when the job is done, either because it succeeds or because the backoffLimit is reached (currently set to 3). - /jobs/ api lists all jobs using a paginated response, including filtering and sorting - /jobs/<job id> returns details for a particular job - tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints - tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 13:02:17 -07:00
Ilya Kreymer	6384d8b5f1	Additional Type Hints / Type Fix Pass (#1320 ) This PR adds more type safety to the backend codebase: - All ops classes calls should be type checked - Avoiding circular references with TYPE_CHECKING conditional - Consistent UUID usage: uuid.UUID / UUID4 with just UUID - Crawl states moved to models, made into lists - Additional typing added as needed, fixed a few type related errors - CrawlOps / UploadOps / BaseCrawlOps now all have same param init order to simplify changes	2023-10-30 12:59:24 -04:00
Ilya Kreymer	72f1840ae7	fix regression in concurrent crawls: (#1324 ) - check the 'btrix.org' instead of 'oid' labels in getting related crawls - fixes regression introduced in #1296 where all org id labels were switched to 'btrix.org' for consistency	2023-10-30 12:58:07 -04:00
Ilya Kreymer	8c09934298	version: bump to 1.8.0-beta.1	2023-10-27 14:35:24 -07:00
Ilya Kreymer	c1d3beda9c	users: add case-insensitive index to maintain backwards compatibility with fastapi-users (#1319 ) follow up to #1290 Based on implementation in: https://github.com/fastapi-users/fastapi-users-db-mongodb/blob/main/fastapi_users_db_mongodb/__init__.py	2023-10-27 14:31:29 -07:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296 ) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contain a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Tessa Walsh	38f32f11ea	Enforce quota and hard cap for monthly execution minutes (#1284 ) Fixes #1261 Closes #1092 The quota for monthly execution minutes is treated as a hard cap. Once it is exceeded, an alert indicating that an org has exceeded its monthly execution minutes will display and the user will be unable to start new crawls. Any running crawls will be stopped once the quota is exceeded. An execution minutes meter bar is also added in the Org Dashboard and displayed if a quota is set. More detail in #1305 which was merged into this branch. ## Changes - Enable setting 'maxExecMinutesPerMonth' in orgs list quotas by superadmin - Enforce quota by stopping crawls in operator once quota is reached - Show alert banner once execution time quota is hit: - Once quota is hit, disable Run Crawl buttons in frontend, return 403 message with `exec_minutes_quota_reached` detail in backend from crawl config `/run` endpoint, and don't run new workflows on creation (similar to storage quota) - Display execution time for crawls in the crawl details overview, immediately below - Show execution minutes meter on dashboard (from #1305) --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@webrecorder.org>	2023-10-26 15:38:51 -07:00
Tessa Walsh	5fadc630ce	Check for empty string for SMTP password (#1317 ) Follow-up fix for #1136 based on this comment: https://github.com/webrecorder/browsertrix-cloud/issues/1136#issuecomment-1777119534	2023-10-26 09:44:55 -07:00
Ilya Kreymer	4591db1afe	More stringent UUID types for user input / avoid 500 errors (#1309 ) Fixes #1297 Ensures proper typing for UUIDs in FastAPI input models, to avoid explicit conversions, which may throw errors. This avoids possible 500 errors (due to ValueError exceptions) when converting UUIDs from user input. Instead, will get more 422 errors from FastAPI. UUID conversions remaining are in operator / profile handling where UUIDs are retrieved from previously set fields, remaining user input conversions in user auth and collection list are wrapped in exceptions. For `profileid`, update fastapi models to support union of UUID, null, and EmptyStr (new empty string only type), to differentiate removing profile (empty string) vs not changing at all (null) for config updates	2023-10-25 15:15:53 -04:00
Tessa Walsh	d58747dfa2	Provide full resources in archived items finished webhooks (#1308 ) Fixes #1306 - Include full `resources` with expireAt (as string) in crawlFinished and uploadFinished webhook notifications rather than using the `downloadUrls` field (this is retained for collections). - Set default presigned duration to one minute short of 1 week and enforce maximum supported by S3 - Add 'storage_presign_duration_minutes' commented out to helm values.yaml - Update tests --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-23 19:01:58 -07:00
Tessa Walsh	5c5ef68a8a	Prevent user from logging in after 5 consecutive failed login attempts until pw is reset (#1281 ) Fixes #1270 After 5 consecutive failed logins from the same user, we now prevent the user from logging in even with the correct password until they reset it via their email, or wait an hour. - After failure threshold is reached, all further login attempts are rejected - Attempts for invalid email addresses are also tracked - On 6th try, a reset password email is automatically sent, only once - Failed login counter resets after an hour of no further logins after last attempted login. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-20 14:10:56 -07:00
Tessa Walsh	733809b5a8	Update user names in crawls and workflows after username update (#1299 ) Fixes #1275	2023-10-19 23:34:49 -07:00
Ilya Kreymer	63291e95a5	avoid exception if 'errors' key doesn't exist (#1301 ) - avoid exception if 'errors' (or 'files' keys) don't exist (part of #1297) - ensure 'errors' list always set on output model for consistency, defaulting to empty list - fix tests for 'errors' being an empty list follow-up to #1300 (merging 1.7.1 release into main)	2023-10-19 14:39:54 -07:00
Ilya Kreymer	9a2787f9c4	User refactor + remove fastapi_users dependency + update fastapi (#1290 ) Fixes #1050 Major refactor of the user/auth system to remove fastapi_users dependency. Refactors users.py to be standalone and adds new auth.py module for handling auth. UserManager now works similar to other ops classes. The auth should be fully backwards compatible with fastapi_users auth, including accepting previous JWT tokens w/o having to re-login. The User data model in mongodb is also unchanged. Additional fixes: - allows updating fastapi to latest - add webhook docs to openapi (follow up to #1041) API changes: - Removing the `GET, PATCH, DELETE /users/<id>` endpoints, which were not in use before, as users are scoped to orgs. For deletion, probably auto-delete when user is removed from last org (to be implemented). - `/users/me-with-orgs` is renamed to just `/users/me/` - New `PUT /users/me/change-password` endpoint with password required to update password, fixes #1269, supersedes #1272 Frontend changes: - Fixes from #1272 to support new change password endpoint. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: sua yoo <sua@suayoo.com>	2023-10-18 10:49:23 -07:00
sua yoo	4610d95cd7	Use org slug in place of UUIDs in app URLs (#1277 ) - Replaces org UUID in URL/browser location bar with org slug. - Refactor: Adds shared app state utility using https://sijakret.github.io/lit-shared-state/ to access org data from deep descendants. - Backwards compatible: org UUID URLs should auto-redirect to org slug URLs. - Show the org UUID in org settings general tab for use with APIs (Resolves #1258, Follows #1279)	2023-10-18 09:28:30 -07:00
Ilya Kreymer	36bd228115	version: update to 1.8.0-beta.0	2023-10-17 18:06:55 -07:00
Ilya Kreymer	b3f530f8e6	version: bump to 1.7.0	2023-10-16 18:39:20 -07:00
Ilya Kreymer	ddc4e03422	operator status typo fix: (#1293 ) - don't log normal exits as crashes! - set pod_status.exitCode to the exitCode - count exit code 13 as not-a-crash also (force interrupt)	2023-10-16 15:01:46 -07:00
Ilya Kreymer	1bc4697995	optimization: avoid updating whole org when only need to set one field (#1288 ) - add update_users and update_slug_and_name - rename update to update_full	2023-10-16 10:54:04 -07:00
Ilya Kreymer	dc8d510b11	webhook tweak: pass oid to crawl finished and upload finished webhooks (#1287 ) Optimizes webhooks by passing oid directly to webhooks: - avoids extra crawl lookup - possible for crawl to be deleted before webhook is processed via operator (resulting in crawl lookup to fail) - add more typing to operator and webhooks	2023-10-16 10:51:36 -07:00
Ilya Kreymer	a295f5d05d	version: bump to 1.7.0-beta.3	2023-10-15 18:31:03 -07:00
Tessa Walsh	2383b0d616	Set log download attachment name to crawl_id.log (#1280 ) Fixes #1271 Using .log for now due to broader support for opening with default viewers --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-10-13 20:00:37 -07:00
Tessa Walsh	c5ca250f37	Add id-slug lookup and restrict slugs endpoints to superadmins (#1279 ) Fixes #1278 - Adds `GET /orgs/slug-lookup` endpoint returning `{id: slug}` for all orgs - Restricts new endpoint and existing `GET /orgs/slugs` to superadmins	2023-10-13 17:02:19 -07:00
Ilya Kreymer	41c054d209	Storage ops followup type checking (#1274 ) * storage ops: follow up to #1257: - fix refactor typo - add type hints for all storageops apis (add mypy_boto3_s3 and types_aiobotocore_s3 for type hints)	2023-10-11 14:03:00 -07:00
Tessa Walsh	266afdf8d9	Add slugs to org backend (#1250 ) - Add slug field with uniqueness constraint to Organization - Use python-slugify to generate slug from name and import that in migration - Require name in all /rename and org creation requests - Auto-generate slug for new org with no slug or when /rename is called w/o a slug - Auto-generate slug for 'default-org' based on name - Add /api/orgs/slugs GET endpoint to return all slugs in use - tests: extend backend test-requirements.txt from requirements to allow testing slugify - tests: move get_redis_crawl_stats() to avoid extra dependency in utils	2023-10-10 18:30:09 -07:00
Ilya Kreymer	16e7a1d0a2	Storage Ops Refactor (#1257 ) * storage ops refactor: - create StorageOps class similar to other ops classes - init storages list in StorageOps, no longer require looking up default storages via CrawlManager - convert all storage functions to members, add storageops to operator - remove unused params, ensure crawl exists for rollover restart - add env var to determine if using local minio to use correct endpoint URL * crawls /seeds endpoint: just return empty list if not a crawl (eg. upload) * crawlmanager: remove unused code, rename check_storage -> has_storage	2023-10-10 15:04:23 -07:00
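
The template layout given in the Email Templates commit (#1375), `subject ~~~ <html content> ~~~ text`, could be split into its parts roughly as follows. This is a minimal sketch, not the backend's actual code: `parse_email_template` and `ParsedEmail` are hypothetical names, and treating a two-part template (or an empty middle section) as text-only is an assumption, since the commit message does not say how text-only templates are laid out.

```python
from typing import NamedTuple, Optional

# Literal delimiter taken from the commit message; how the real backend
# splits the rendered template is not shown there.
SEPARATOR = "~~~"


class ParsedEmail(NamedTuple):
    """Hypothetical container for the pieces of one rendered template."""

    subject: str
    html: Optional[str]
    text: str


def parse_email_template(rendered: str) -> ParsedEmail:
    """Split an already-rendered template into subject, HTML and text."""
    parts = [part.strip() for part in rendered.split(SEPARATOR)]

    if len(parts) == 3:
        subject, html, text = parts
        # Assumption: an empty middle section means "no HTML version".
        return ParsedEmail(subject, html or None, text)

    if len(parts) == 2:
        # Assumption: a two-part template is subject + text only.
        subject, text = parts
        return ParsedEmail(subject, None, text)

    raise ValueError(f"expected 2 or 3 sections, got {len(parts)}")
```

Keeping the HTML part optional matches the note that only the invite template currently includes an HTML version.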

369 Commits
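
The accounting order described in the extra and gifted execution minutes commit (#1361), where time is drawn from the monthly quota first, then gifted, then extra, with the full amount always added to `crawlExecSeconds`, might be sketched like this. The function name `allocate_exec_seconds`, its parameters, and the separate `overageSeconds` key are illustrative assumptions; the commit message does not show how the backend actually stores these values.

```python
from typing import Dict


def allocate_exec_seconds(
    used: int,
    monthly_available: int,
    gifted_available: int,
    extra_available: int,
) -> Dict[str, int]:
    """Spread `used` seconds over the three categories in priority order."""
    remaining = used

    # 1. Monthly quota is consumed first.
    monthly = min(remaining, monthly_available)
    remaining -= monthly

    # 2. Then gifted time.
    gifted = min(remaining, gifted_available)
    remaining -= gifted

    # 3. Then extra time.
    extra = min(remaining, extra_available)
    remaining -= extra

    return {
        "monthlyExecSeconds": monthly,
        "giftedExecSeconds": gifted,
        "extraExecSeconds": extra,
        # The total includes any overage above all quotas.
        "crawlExecSeconds": used,
        # Hypothetical field: whatever no category could absorb.
        "overageSeconds": remaining,
    }


if __name__ == "__main__":
    print(allocate_exec_seconds(100, 60, 30, 5))
```

Under these assumptions, time beyond all three allowances appears only in the total, which is consistent with the commit's note that `crawlExecSeconds` also contains overage.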