browsertrix

Author	SHA1	Message	Date
Tessa Walsh	a51f7c635e	Add behavior logs from Redis to database and add endpoint to serve (#2526 ) Backend work for #2524 This PR adds a second dedicated endpoint similar to `/errors`, as a combined log endpoint would give a false impression of being the complete crawl logs (which is far from what we're serving in Browsertrix at this point). Eventually when we have support for streaming live crawl logs in `crawls/<id>/logs` I'd ideally like to deprecate these two dedicated endpoints in favor of using that, but for now this seems like the best solution. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-08 02:16:10 +02:00
Ilya Kreymer	4c0ddd0fe3	crawl replay: remove isSeed=true from initialPages query (#2509 ) - matches initial query for collections - fixes 'Show Non-Seed Pages' not appearing for crawl replay	2025-03-20 15:03:41 -07:00
Ilya Kreymer	afa892000b	replay api: add downloadUrl to replay endpoints to be used by RWP (#2456 ) RWP (2.3.3+) can determine if the 'Download Archive' menu item should be showed based on the value of downloadUrl. If set to 'null', will hide the menu item: - set downloadUrl to public collection download for public collections replay - set downloadUrl to null for private collection and crawl replay to hide the download menu item in RWP (otherwise have to add the auth_header query with bearer token and should assess security before doing that..) --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-03-03 14:11:28 -08:00
Ilya Kreymer	702c9ab3b7	Better cacheing of presigned URLs + support for thumbnails (#2446 ) Overhauls URL presigning by: - cache the presigned urls in a flat, separate mongodb collection which has an expiring index - update presigned urls if not found / expired automatically in index - remove logic on storing presignedUrl in files - support cacheing presigned URL for thumbnails. - add endpoints to clear presigned urls for org or for all files in all orgs (superadmin only) - supersedes #2438, fix for #2437 - removes previous presignedUrl and expireAt data from crawls and QA runs --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-03-03 12:05:23 -08:00
Ilya Kreymer	8a507f0473	Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417 ) - consolidate list_pages() and list_replay_query_pages() into list_pages() - to keep backwards compatibility, add <crawl>/pagesSearch that does not include page totals, keep <crawl>/pages with page total (slower) - qa frontend: add default 'Crawl Order' sort order, to better show pages in QA view - bgjob: account for parallelism in bgjobs, add logging if succeeded mismatches parallelism - QA sorting: default to 'crawl order' by default to get better results. - Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats - Bgjobs: give custom op jobs more memory	2025-02-21 13:47:20 -08:00
Tessa Walsh	f8fb2d2c8d	Rework crawl page migration + MongoDB Query Optimizations (#2412 ) Fixes #2406 Converts migration 0042 to launch a background job (parallelized across several pods) to migrate all crawls by optimizing their pages and setting `version: 2` on the crawl when complete. Also Optimizes MongoDB queries for better performance. Migration Improvements: - Add `isMigrating` and `version` fields to `BaseCrawl` - Add new background job type to use in migration with accompanying `migration_job.yaml` template that allows for parallelization - Add new API endpoint to launch this crawl migration job, and ensure that we have list and retry endpoints for superusers that work with background jobs that aren't tied to a specific org - Rework background job models and methods now that not all background jobs are tied to a single org - Ensure new crawls and uploads have `version` set to `2` - Modify crawl and collection replay.json endpoints to only include fields for replay optimization (`initialPages`, `pageQueryUrl`, `preloadResources`) if all relevant crawls/uploads have `version` set to `2` - Remove `distinct` calls from migration pathways - Consolidate collection recompute stats Query Optimizations: - Remove all uses of $group and $facet - Optimize /replay.json endpoints to precompute preload_resources, avoid fetching crawl list twice - Optimize /collections endpoint by not fetching resources - Rename /urls -> /pageUrlCounts and avoid $group, instead sort with index, either by seed + ts or by url to get top matches. - Use $gte instead of $regex to get prefix matches on URL - Use $text instead of $regex to get text search on title - Remove total from /pages and /pageUrlCounts queries by not using $facet - frontend: only call /pageUrlCounts when dialog is opened. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Emma Segal-Grossman <hi@emma.cafe> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-02-20 15:26:11 -08:00
Ilya Kreymer	7b2932c582	Add initial pages + pagesQuery endpoint to /replay.json APIs (#2380 ) Fixes #2360 - Adds `initialPages` to /replay.json response for collections, returning up-to 25 pages (seed pages first, then sorted by capture time). - Adds `pagesQueryUrl` to /replay.json - Adds a public pages search endpoint to support public collections. - Adds `preloadResources`, including list of WACZ files that should always be loaded, to /replay.json --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-02-13 16:53:47 -08:00
Tessa Walsh	b0aebb599a	Reformat with Black for 2025 ruleset (#2349 )	2025-01-29 16:57:06 -05:00
Tessa Walsh	763c654484	feat: Update collection sorting, metadata, stats (#2327 ) - Refactors dashboard and org profile preview to use private API endpoint, to fix public collections not showing when the org visibility is hidden - Adds additional sorting options for collections - Adds unique page url counts for archived items, collections, and organizations to backend and exposes this in collections - Shows collection period (i.e. `dateEarliest` to `dateLatest`) in collections list - Shows same collection metadata in private and public views, updates private view info bar - Fixes "Update Org Profile" action item showing for crawler roles --------- Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: sua yoo <sua@suayoo.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-01-23 13:32:23 -05:00
Tessa Walsh	cbcf087a48	Add last crawl and subscription status indicators to org list (#2273 ) Fixes #2260 - Adds `lastCrawlFinished` to Organization model, updated after crawls are added/deleted and with an idempotent migration to backfill existing orgs - Adds Last Crawl column to end of admin orgs list table - Adds subscription icon next to existing status icon in orgs list - Adds "lastCrawlFinished", "subscriptionStatus", and "subscriptionPlan" sort options to orgs list backend endpoint in anticipation of future sorting/filtering of orgs list --------- Co-authored-by: emma <hi@emma.cafe> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-01-14 10:57:06 -05:00
Tessa Walsh	a031fab313	Backend work for public collections (#2198 ) Fixes #2182 This rather large PR adds the rest of what should be needed for public collections work in the frontend. New API endpoints include: - Public collections endpoints: GET, streaming download - Paginated list of URLs in collection with snapshot (page) info for each - Collection endpoint to set home URL - Collection endpoint to upload thumbnail as stream - DELETE endpoint to remove collection thumbnail Changes to existing API endpoints include: - Paginating public collection list results - Several `pages` endpoints that previously only supported `/crawls/` in their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support `/uploads/` and `/all-crawls/` namespaces as well. This is necessitated by adding pages for uploads to the database (see below). For `/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will serve as a filter to only affect crawls of that given type. Other endpoints are more liberal at this point, and will perform the same action regardless of the namespace used in the route (we'll likely want to change this in a follow-up to be more consistent). - `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job rather than doing all of the computation in an asyncio task in the backend container. The background job additionally updates collection date ranges, page/size counts, and tags for each collection in the org after pages have been (re)added. Other big changes: - New uploads will now have their pages read into the database! Collection page counts now also include uploads - A migration was added to start a background job for each org that will add the pages for previously-uploaded WACZ files to the database and update collections accordingly - Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we can use for other user-uploaded image files moving forward, with separate output models for authenticated and public endpoints	2025-01-13 15:15:48 -08:00
Ilya Kreymer	104ea097c4	switch to simpler streaming download + multiwacz metadata improvements: (#1982 ) - download via presigned URLs via requests instead of boto APIs, remove boto - follow-up to #1933 for streaming download improvements - fixes datapackage.json in multi-wacz to contain the same resources objects with: `name`, `path`, `hash`, `bytes` to match single WACZ. - Add additional metadata to multi-wacz datapackage.json, including `type` (`crawl`, `upload`, `collection`, `qaRun`), `id` (unique id for the object), `title` / `description` if available (for crawl/upload/collection), and `crawlId` for `qaRun`	2024-10-03 16:13:31 -07:00
Tessa Walsh	123705c53f	Serialize datetimes with Z suffix (#2058 ) Use timezone aware datetimes instead of timezone naive datetimes: - Update mongodb client to use tz-aware conversion - Convert dt_now() to return timezone aware UTC date - Rename to_k8s_date -> date_to_str, just returns ISO UTC date with 'Z' (instead of '+00:00' suffix) - Rename from_k8s_date -> str_to_date, returns timezone aware date from str - Standardize all string<->date conversion to use either date_to_str or str_to_date - Update frontend to assume iso date, not append 'Z' directly - Update tests to check for 'Z' suffix on some dates --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-09-12 16:16:13 -07:00
Ilya Kreymer	86c9e538c1	quickfix: webhooks: ensure the 'crawl_reviewed' webhook is sent async, doesn't delay submitting a review (#2033 ) make the call to `create_crawl_reviewed_notification` be called with create_task (similar to other user-initiated webhook events), to avoid extra wait for webhook to complete	2024-08-20 17:50:18 -07:00
Ilya Kreymer	d9f49afcc5	type fixes on util functions (#2009 ) Some additional typing for util.py functions and resultant changes	2024-08-12 10:54:45 -07:00
Ilya Kreymer	7fa2b61b29	Execution time tracking tweaks (#1994 ) Tweaks to how execution time is tracked for more accuracy + excluding waiting states: - don't update if crawl state is in a 'waiting state' (waiting for capacity or waiting for org limit) - rename start states -> waiting states for clarity - reset lastUpdatedTime if two consecutive updates of non-running state, to ensure non-running states don't count, but also account for occasional hiccups -- if only one update detects non-running state, don't reset - webhooks: move start webhook to when crawl actually starts for first time (db lastUpdatedTime is not yet + crawl is running) - don't set lastUpdatedTime until pods actually running - set crawljob update interval to every 10 seconds for more accurate execution time tracking - frontend: show seconds in 'Execution Time' display	2024-08-06 09:44:44 -07:00
Ilya Kreymer	894aa29d4b	remove crc32 from CrawlFile (#1980 ) - no longer being used with latest stream-zip - was not computed correctly in the crawler - counterpart to webrecorder/browsertrix-crawler#657 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-30 11:23:15 -07:00
Tessa Walsh	551660bb62	Add webhooks for qaAnalysisStarted, qaAnalysisFinished, and crawlReviewed (#1974 ) Fixes #1957 Adds three new webhook events related to QA: analysis started, analysis ended, and crawl reviewed. Tests have been updated accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 16:53:49 -07:00
Ilya Kreymer	94e985ae13	optimize org quota lookups (#1973 ) - instead of looking up storage and exec min quotas from oid, and loading an org each time, load org once and then check quotas on the org object - many times the org was already available, and was looked up again - storage and exec quota checks become sync - rename can_run_crawl() to more generic can_write_data(), optionally also checks exec minutes - typing: get_org_by_id() always returns org, or throws, adjust methods accordingly (don't check for none, catch exception) - typing: fix typo in BaseOperator, catch type errors in operator 'org_ops' - operator quota check: use up-to-date 'status.size' for current job, ignore current job in all jobs list to avoid double-counting - follow up to #1969	2024-07-25 14:00:16 -07:00
Tessa Walsh	27ee16d308	Implement downloading archived item + QA runs as multi-WACZ (#1933 ) Fixes #1412 ## Changes ### Backend - Adds `all-crawls`, `crawls`, and `uploads` API endpoints to download archived item as multi-WACZ - Download QA runs as multi-WACZ - Adds backend tests for new endpoints - Update to new version of stream-zip library which does not require crc-32 to be present for ZIP members, computes after streaming, fixing invalid crc-32 issues as previously computed crc-32s from crawler may be invalid. ### Frontend Adds ability to download archived item from: - Button in archived item detail Files tab - Archived item details actions menu - Archived items list menu --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: sua yoo <sua@webrecorder.org> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-25 10:28:57 -07:00
Ilya Kreymer	cd00f52cca	Fix queue response models + additional testing for queue + exclusions (#1948 ) Follow-up to regressions from #1928, this PR: - Fixes response models for queue endpoints, which had incorrect model - Adds tests for queue get, queue match, and exclusions add / remove to ensure regressions like this can be caught via tests. This involves starting a new crawl in test_run_crawls() instead of relying on implicit running via fixtures, make it easier to test crawl while it's running. - Adds additional typing for crawls apis, including making delete_crawls() have correct typing, consistent derived class override - Adds check to ensure queue + exclusion operations can not be called when crawl is not running	2024-07-22 09:00:23 -07:00
Tessa Walsh	2237120cd5	Add API endpoint to recalculate org storage (#1943 ) Fixes #1942 This process might be a bit slow for large orgs, may consider moving it to background job in #1898.	2024-07-19 18:29:20 -07:00
Ilya Kreymer	335700e683	Additional typing cleanup (#1938 ) Misc typing fixes, including in profiles and time functions --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-07-17 10:49:22 -07:00
Tessa Walsh	d41647e6c2	Document all API endpoints with response models (#1928 ) Fixes #1920 Adds response models to all API endpoints that were missing them, documenting current behavior without making any changes at this stage to standardize responses. Follow-up work will involve adding generics to some of the response models	2024-07-16 12:48:38 -07:00
Tessa Walsh	f076e7d9e3	Add superuser API endpoints to export and import org data (#1394 ) Fixes #890 This PR introduces new streaming superuser-only API endpoints to export and import database information for an organization. New Adminstrator deployment documentation on how to manage the process and copy files between S3 buckets as needed is also included. --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-07-02 17:14:34 -04:00
Ilya Kreymer	fa6627ce70	ensure QA configmap is updated for long running QA runs: (#1865 ) - add a 'expire_at_duration_seconds' which is 75% of actual presign duration time, or <25% remaining until presigned URL actually expires to ensure presigned URLs are updated early than when they actually expire - set cached expireAt time to the renew at time for more frequent updates - update QA configmap in place with updated presigned URLs when expireAt time is reached - mount qa config volume under /tmp/qa/ without subPath to get automatic updates, which crawler will handle - tests: fix qa test typo (from main) - fixes #1864	2024-06-12 10:51:35 -07:00
Ilya Kreymer	26abbddd62	regression fix: remove 'project' from non-raw crawl object getters, (#1778 ) unnecessary projection prevents creating Crawl object, now removed Fixes #1777	2024-05-02 01:02:22 +02:00
Tessa Walsh	b8caeb88e9	Ensure QA run WACZs are deleted (#1715 ) - When qa run is deleted - When crawl is deleted And adds tests for WACZ deletion. Fixes #1713	2024-04-22 18:04:09 -04:00
Ilya Kreymer	1844e761dc	Support sorting by last QA started time (#1712 ) To support #1683, it would be useful to be able to sort by 'last QA start time' in addition to/instead of last QA state. - make sorting consistent with workflow sorting - sortBy fields renamed to lastQAState and lastQAStarted - Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-22 13:00:52 -07:00
Ilya Kreymer	9609ff4194	Add 'activeQAStats' field (#1694 ) As additional support for #1683, include the active QA stats in the crawl response, along with active QA state. This will allow showing progress of QA run in the archived items list.	2024-04-18 10:05:39 -04:00
Tessa Walsh	b87860c68a	Ensure /all-crawls?sortBy=qaState always sorts crawls above uploads (#1691 ) Follow-up to #1686	2024-04-17 19:14:29 -07:00
Tessa Walsh	c800da1732	Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686 ) Backend work for #1672 Adds new sort options to /crawls and /all-crawls GET list endpoints: - `reviewStatus` - `qaRunCount`: number of completed QA runs for crawl (also added to CrawlOut) - `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of which are added to CrawlOut)	2024-04-16 23:54:09 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586 ) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running,`QARun` data stored in a single `qa` object, while finished qa runs are added to `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Tessa Walsh	14189b7cfb	Add crawl pages and related API endpoints (#1516 ) Fixes #1502 - Adds pages to database as they get added to Redis during crawl - Adds migration to add pages to database for older crawls from pages.jsonl and extraPages.jsonl files in WACZ - Adds GET, list GET, and PATCH update endpoints for pages - Adds POST (add), PATCH, and POST (delete) endpoints for page notes, each with their own id, timestamp, and user info in addition to text - Adds page_ops methods for 1. adding resources/urls to page, and 2. adding automated heuristics and supplemental info (mime, type, etc.) to page (for use in crawl QA job) - Modifies `Migration` class to accept kwargs so that we can pass in ops classes as needed for migrations - Deletes WACZ files and pages from database for failed crawls during crawl_finished process - Deletes crawl pages when a crawl is deleted Note: Requires a crawler version 1.0.0 beta3 or later, with support for `--writePagesToRedis` to populate pages at crawl completion. Beta 4 is configured in the test chart, which should be upgraded to stable 1.0.0 when it's released. Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-28 12:11:35 -05:00
Tessa Walsh	f3cbd9e179	Add crawl, upload, and collection delete webhook event notifications (#1363 ) Fixes #1307 Fixes #1132 Related to #1306 Deleted webhook notifications include the org id and item/collection id. This PR also includes API docs for the new webhooks and extends the existing tests to account for the new webhooks. This PR also does some additional cleanup for existing webhooks: - Remove `downloadUrls` from item finished webhook bodies - Rename collection webhook body `downloadUrls` to `downloadUrl`, since we only ever have one per collection - Fix API docs for existing webhooks, one of which had the wrong response body	2023-11-09 18:19:08 -08:00
Ilya Kreymer	fb3d88291f	Background Jobs Work (#1321 ) Fixes #1252 Supports a generic background job system, with two background jobs, CreateReplicaJob and DeleteReplicaJob. - CreateReplicaJob runs on new crawls, uploads, profiles and updates the `replicas` array with the info about the replica after the job succeeds. - DeleteReplicaJob deletes the replica. - Both jobs are created from the new `replica_job.yaml` template. The CreateReplicaJob sets secrets for primary storage + replica storage, while DeleteReplicaJob only needs the replica storage. - The job is processed in the operator when the job is finalized (deleted), which should happen immediately when the job is done, either because it succeeds or because the backoffLimit is reached (currently set to 3). - /jobs/ api lists all jobs using a paginated response, including filtering and sorting - /jobs/<job id> returns details for a particular job - tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints - tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 13:02:17 -07:00
Ilya Kreymer	6384d8b5f1	Additional Type Hints / Type Fix Pass (#1320 ) This PR adds more type safety to the backend codebase: - All ops classes calls should be type checked - Avoiding circular references with TYPE_CHECKING conditional - Consistent UUID usage: uuid.UUID / UUID4 with just UUID - Crawl states moved to models, made into lists - Additional typing added as needed, fixed a few type related errors - CrawlOps / UploadOps / BaseCrawlOps now all have same param init order to simplify changes	2023-10-30 12:59:24 -04:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296 ) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contain a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Tessa Walsh	38f32f11ea	Enforce quota and hard cap for monthly execution minutes (#1284 ) Fixes #1261 Closes #1092 The quota for monthly execution minutes is treated as a hard cap. Once it is exceeded, an alert indicating that an org has exceeded its monthly execution minutes will display and the user will be unable to start new crawls. Any running crawls will be stopped once the quota is exceeded. An execution minutes meter bar is also added in the Org Dashboard and displayed if a quota is set. More detail in #1305 which was merged into this branch. ## Changes - Enable setting 'maxExecMinutesPerMonth' in orgs list quotas by superadmin - Enforce quota by stopping crawls in operator once quota is reached - Show alert banner once execution time quota is hit: - Once quota is hit, disable Run Crawl buttons in frontend, return 403 message with `exec_minutes_quota_reached` detail in backend from crawl config `/run` endpoint, and don't run new workflows on creation (similar to storage quota) - Display execution time for crawls in the crawl details overview, immediately below - Show execution minutes meter on dashboard (from #1305) --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: sua yoo <sua@webrecorder.org>	2023-10-26 15:38:51 -07:00
Tessa Walsh	d58747dfa2	Provide full resources in archived items finished webhooks (#1308 ) Fixes #1306 - Include full `resources` with expireAt (as string) in crawlFinished and uploadFinished webhook notifications rather than using the `downloadUrls` field (this is retained for collections). - Set default presigned duration to one minute short of 1 week and enforce maximum supported by S3 - Add 'storage_presign_duration_minutes' commented out to helm values.yaml - Update tests --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-23 19:01:58 -07:00
Tessa Walsh	733809b5a8	Update user names in crawls and workflows after username update (#1299 ) Fixes #1275	2023-10-19 23:34:49 -07:00
Ilya Kreymer	63291e95a5	avoid exception if 'errors' key doesn't exist (#1301 ) - avoid exception if 'errors' (or 'files' keys) don't exist (part of #1297) - ensure 'errors' list always set on output model for consistency, defaulting to empty list - fix tests for 'errors' being an empty empty list follow-up to #1300 (merging 1.7.1 release into main)	2023-10-19 14:39:54 -07:00
Ilya Kreymer	16e7a1d0a2	Storage Ops Refactor (#1257 ) * storage ops refactor: - create StorageOps class similar to other ops classes - init storages list in StorageOps, no longer require lookup up default storages via CrawlManager - convert all storage functions to members, add storageops to operator - remove unused params, ensure crawl exists for rollover restart - add env var to determine if using local minio to use correct endpoint URL * crawls /seeds endpoint: just return empty list if not a crawl (eg. upload) * crawlmanager: remove unused code, rename check_storage -> has_storage	2023-10-10 15:04:23 -07:00
Ilya Kreymer	fa86555eed	Track pod resource usage, detect OOM crashes, handle auto-scaling (#1235 ) * keep track of per pod status on crawljob: - crashes time, and reason - 'used' vs 'allocated' resources - 'percent' used / allocated * crawl log errors: log error when crawler crashes via OOM, either via redis error log or to console * add initial autoscaling support! - detect if metrics server is available via K8SApi.is_pod_metrics_available() - if available, use metrics for 'used' fields - if no metrics, set memory used for redis only (using redis apis) - allow overriding memory and cpu via newMemory and newCpu settings on pod status - scale memory / cpu based on newMemory and newCpu setting - templates: update jinja templates to allow restarting crawler and redis with new resources - ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs * roles: cleanup unused roles, add permissions for listing metrics * stats for running crawls: - update in db via operator - avoids losing stats if redis pod happens to be done - tradeoff is more db access in operator, but less extra connections to redis + already loading from db in backend - size stat: ensure size of previous files is added to the stats * crawler deployment tweaks: - adjust cpu/mem per browser - add --headless flag to configmap to use new headless mode by default!	2023-10-05 20:41:18 -07:00
Tessa Walsh	e9bac4c088	API delete endpoint improvements (#1232 ) - Applies user permissions check before deleting anything in all /delete endpoints - Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete - Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete - Updates frontend notification message to Only org owners can delete other users' archived items. when a crawler user attempts to delete another users' archived items	2023-10-03 13:05:00 -07:00
sua yoo	941a75ef12	Separate seeds into a new endpoints (#1217 ) - Remove config.seeds from workflow and crawl detail endpoints - Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow - Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before) - Modify frontend to fetch seeds from new /seeds endpoints with loading indicator --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-02 10:56:12 -07:00
Tessa Walsh	7a56fa23f5	Remove username lookups for crawls and workflows by storing usernames in db (#1199 ) * store usernames (createdByName, modifiedByName, startedByName) in db for workflows * store userName for userid for crawls in db * update output models to return usernames * add migration 0018 to add usernames to existing crawls and crawlconfigs * updated tests for crawl and config usernames * use async for to iterate over crawls and crawlconfigs --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-28 09:37:23 -07:00
Tessa Walsh	094f27bcff	Track bytes stored per file type and include in org metrics (#1207 ) * Add bytes stored per type to org and metrics The org now tracks bytesStored by type of crawl, uploads, and browser profiles in addition to the total, and returns these values in the org metrics endpoint. A migration is added to precompute these values in existing deployments. In addition, all /metrics storage values are now returned solely as bytes, as the GB form wasn't being used in the frontend and is unnecessary. * Improve deletion of multiple archived item types via `/all-crawls` delete endpoint - Update `/all-crawls` delete test to check that org and workflow size values are correct following deletion. - Fix bug where it was always assumed only one crawl was deleted per cid and size was not tracked per cid - Add type check within delete_crawls	2023-09-22 12:55:21 -04:00
Tessa Walsh	9224f52f51	Remove config from list endpoints to speed up responses (#1193 ) * Remove config from list endpoints - Remove config field from workflow and crawl list endpoints - Add seedCount to CrawlConfigOut on backend and Workflow on frontend - Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional - Refactor workflow list in frontend to use firstSeed and seedCount - Frontend uses ListWorkflow type which is Omit<Workflow, "config">	2023-09-19 11:05:48 -05:00
Ilya Kreymer	feb7ab7652	Improved type checking for backend with mypy (#1174 ) * add mypy type check - run type check on backend fix ambiguous typing issues - add mypy to lint gh action + precommit hook - add mypy.ini	2023-09-13 19:40:26 -07:00

1 2

67 Commits