Commit Graph

447 Commits

Tessa Walsh
9224f52f51
Remove config from list endpoints to speed up responses (#1193)
* Remove config from list endpoints

- Remove config field from workflow and crawl list endpoints
- Add seedCount to CrawlConfigOut on backend and Workflow on frontend
- Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional
- Refactor workflow list in frontend to use firstSeed and seedCount
- Frontend uses ListWorkflow type which is Omit<Workflow, "config">
2023-09-19 11:05:48 -05:00
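A minimal sketch of the model split described in the commit above, assuming pydantic (which the backend uses) and a hypothetical subset of fields:

```python
from typing import Optional
from pydantic import BaseModel

class CrawlConfigCore(BaseModel):
    """Lightweight fields shared by list and detail views (hypothetical subset)."""
    name: Optional[str] = None
    schedule: str = ""
    seedCount: int = 0  # precomputed so list views don't need the full seed list

class CrawlConfigAdditional(BaseModel):
    """Heavier fields returned only from detail endpoints."""
    config: Optional[dict] = None  # the full, potentially large crawl config

class CrawlConfigOut(CrawlConfigCore, CrawlConfigAdditional):
    """Detail response: core plus additional fields."""

# List endpoints return only CrawlConfigCore-shaped data, dropping the
# large 'config' blob, which is what speeds up the responses.
```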
Ilya Kreymer
65b7c10ba1 bump version to 1.7.0-beta.1 2023-09-18 14:33:03 -07:00
Ilya Kreymer
ff327c0b8b
Reset crawl state to running when any crawlers are running (after post-process states) (#1179)
* operator state changes: (fixes #1178)
- if at least one crawler is 'running' ensure state is reset back to running
- for multiple instances, set status to earliest state (not latest) to be consistent,
eg. if at least one crawl is running, set to running, if at least one is generating wacz, set to that
2023-09-15 09:16:46 -07:00
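A sketch of the "earliest state wins" rule from the commit above; the state ordering here is a hypothetical subset:

```python
# Earliest-to-latest lifecycle order (hypothetical subset of states).
STATE_ORDER = ["running", "generate-wacz", "uploading-wacz", "pending-wait"]

def combined_state(instance_states: list[str]) -> str:
    """For multiple crawler instances, report the earliest state, so one
    still-'running' crawler resets the whole crawl back to running."""
    return min(instance_states, key=STATE_ORDER.index)

assert combined_state(["generate-wacz", "running"]) == "running"
```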
Tessa Walsh
2efc461b9b
Implement sync streaming for finished crawl logs (#1168)
- Crawl logs streamed from WACZs using the sync boto client
2023-09-14 17:05:19 -07:00
Tessa Walsh
c7cd4e61fd
Increase wait to 30 seconds to ensure webhooks are sent (#1173) 2023-09-13 20:20:47 -07:00
Ilya Kreymer
feb7ab7652
Improved type checking for backend with mypy (#1174)
* add mypy type check
- run type check on backend fix ambiguous typing issues
- add mypy to lint gh action + precommit hook
- add mypy.ini
2023-09-13 19:40:26 -07:00
Ilya Kreymer
4b34da033a
Refactor / Cleanup: move ops functions back into classes (#1171)
* remove almost all standalone functions and move them back into ops member functions
* operator now has access to all the ops classes as well
* keep two standalone functions used only in migrations

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-13 11:56:09 -07:00
Ilya Kreymer
9159c7c914
ensure max crawl size and max crawl timeout values are set to 0 when unused, instead of null (#1167)
- convert None->0 when creating CrawlJob
- ensure frontend sends 0 not null
- make input model require 'int = 0' instead of 'Optional[int] = 0'
2023-09-13 09:51:26 -07:00
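An illustrative pydantic comparison of the input-model change above (field name taken from the commits; the surrounding models are assumed):

```python
from typing import Optional
from pydantic import BaseModel

class LimitsOld(BaseModel):
    crawlTimeout: Optional[int] = 0   # clients may send an explicit null

class LimitsNew(BaseModel):
    crawlTimeout: int = 0             # null rejected; unset simply means 0

LimitsOld(crawlTimeout=None)          # accepted, stores None
# LimitsNew(crawlTimeout=None)        # would raise a ValidationError
```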
Tessa Walsh
7cf2b11eb7
Add event webhook tests (#1155)
* Add success filter to webhook list GET endpoint

* Add sorting to webhooks list API and add event filter

* Test webhooks via echo server

* Set address to echo server on host from CI env var for k3d and microk8s

* Add -s back to pytest command for k3d ci

* Change pytest test path to avoid hanging on collecting tests

* Revert microk8s to only run on push to main
2023-09-12 22:08:40 -07:00
Tessa Walsh
f980c3c509
Expect that crawl deleted response is bool, not int (#1170) 2023-09-12 15:03:17 -07:00
Ilya Kreymer
c9c39d47b7
Scheduled Crawl Refactor: Handle via Operator + Add Skipped Crawls on Quota Reached (#1162)
* use metacontroller's decoratorcontroller to create CrawlJob from Job
* scheduled job work:
- use existing job name for scheduled crawljob
- use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done
- simplify cronjob template: remove job_image and cron_namespace, use the same namespace as crawls,
and a placeholder job image for cronjobs

* move storage quota check to crawljob handler:
- add 'skipped_quota_reached' as new failed status type
- check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created)

* frontend:
- show all crawls in crawl workflow, no need to filter by status
- add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed

* migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template
2023-09-12 13:05:43 -07:00
Tessa Walsh
9377a6f456
Issue all non-upload storage-quota-update events from LiteElement (#1151)
- More specific toast notification error messages to the action being attempted
- Single dismissable global banner shown when org storage is reached
- Removed check for storage quota reached in `runNow`, since buttons are disabled in UI, and errors handled if request fails.
- Allow creating new workflow when storage quota reached
- More responsive storage quota updates: add storageQuotaReached to archived item replay.json, updates w/o reload when crawl pushes quota over limit
- Modify LiteElement to check for storageQuotaReached on GET requests

---------
Co-authored-by: sua yoo <sua@suayoo.com>
2023-09-11 18:17:48 -07:00
Ilya Kreymer
ad9bca2e92
Operator refactor to control pods + pvcs directly instead of statefulsets (#1149)
- Ability for a pod to be Completed, unlike in a StatefulSet, where if 3 pods are running and the first finishes, all 3 must keep running until all 3 are done. With this setup, the first finished pod can remain in Completed state.
- Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion.
- Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl
- Create priority classes up to 'max_crawl_scale', configured in values.yaml
- Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale,
gracefully stop the scaled-down instances via the redis 'stopone' key, and wait until they exit with Completed state
before adjusting status.scale / removing scaled-down pods. Ensures unaccepted interrupts don't cause scaled-down data to be deleted.
- Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds
- Configurable Redis storage with 'redis_storage' value, set to 3Gi by default
- CrawlJob deletion starts as soon as post-finish crawl operations are run
- Post-crawl operations get their own redis instance, since the existing one is being cleaned up in the finalizer
- Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished)
- Current resource usage added to status
- Profile browser: also manage single pod directly without statefulset for consistency.
- Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0)
- Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0)
- Failed crawl logging: add 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status
- tests: check other finished states to avoid stuck in infinite loop if crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
fixes #1147 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-11 10:38:04 -07:00
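A sketch of the scale-based priority scheme from the commit above; the class names and base value are hypothetical:

```python
MAX_CRAWL_SCALE = 3  # from values.yaml in this sketch

def priority_class_name(instance_index: int) -> str:
    # One priority class per instance index, created up to max_crawl_scale.
    return f"crawl-instance-{instance_index}"

def priority_value(instance_index: int, base: int = 1000) -> int:
    """Priority decreases with scale, so the 1st instance of a new crawl
    can preempt the 2nd or 3rd instance of another crawl."""
    return base - instance_index

for i in range(MAX_CRAWL_SCALE):
    print(priority_class_name(i), priority_value(i))
```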
Anish Lakhwara
e57148d0e9
feat: add SMTP {port, use_tls} config (#1142)
* feat: add SMTP {port, use_tls} config
* If `password` is None don't attempt to log in
* remove 'can be omitted' comment

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-09-08 08:18:36 -07:00
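A hedged sketch of the resulting SMTP behavior, using only the standard library (the real sender's signature is assumed):

```python
import smtplib
from email.message import EmailMessage

def send_email(msg: EmailMessage, host: str, port: int = 587,
               use_tls: bool = True, password: str | None = None,
               username: str = "noreply@example.com") -> None:
    with smtplib.SMTP(host, port) as smtp:
        if use_tls:
            smtp.starttls()
        # If no password is configured, skip login entirely,
        # e.g. for a local unauthenticated relay.
        if password is not None:
            smtp.login(username, password)
        smtp.send_message(msg)
```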
Ilya Kreymer
e75b207f7e
Fix 0015 migration (#1154)
* migration: fix 0015 migration to ensure it reads the correct mongo collection, avoids variable overwrites, and uses the org _id field. fixes #1153
2023-09-08 08:17:40 -07:00
Tessa Walsh
d2ededc895
Add and enforce org storage quota (#1106)
* Implement in backend

- Track bytesStored in org
- Add migration to pre-calculate based on size of crawlfiles and profilefiles
- Add methods to increase or decrease org storage when crawl or profile files
are added or deleted
- Include storageQuotaReached boolean in API responses that alter storage
- Don't start new crawls and fail uploads if storage quota reached

* Implement in frontend

- Add to orgs-list quotas
- Update org's storageQuotaReached based on backend endpoint responses
- Disable buttons when storage quota is met
- Show toast notification when attempting to run a crawl when org
storage quota is met
2023-09-07 12:45:43 -04:00
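A sketch of the quota bookkeeping described above, assuming Motor/pymongo and the field names from the commit (the quota field's location is assumed):

```python
from pymongo import ReturnDocument

async def inc_org_storage(orgs, oid, size: int) -> bool:
    """Increment bytesStored and report whether the quota is now reached."""
    org = await orgs.find_one_and_update(
        {"_id": oid},
        {"$inc": {"bytesStored": size}},
        return_document=ReturnDocument.AFTER,
    )
    quota = org.get("quotas", {}).get("storageQuota", 0)
    # A quota of 0 means unlimited. The boolean is included in API
    # responses so the frontend can disable buttons without a reload.
    return bool(quota) and org["bytesStored"] >= quota
```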
Ilya Kreymer
68bc053ba0
Print crawl log to operator log (mostly for testing) (#1148)
* log only if 'log_failed_crawl_lines' value is set to number of last lines to log
from failed container

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-06 17:53:02 -07:00
Ilya Kreymer
dce1ae6129
better resources scaling by number of browsers per crawler container (#1103)
- set crawler cpu / memory with fixed base + incremental bumps based on number of browsers
- allow parsing k8s quantities with parse_quantity, compute in operator
- set 'crawler_cpu = crawler_cpu_base + crawler_extra_cpu_per_browser * (num_browsers - 1)'
and same for memory
2023-09-06 01:42:44 -04:00
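The formula above, sketched with the kubernetes client's parse_quantity (the helper name and quantities are illustrative):

```python
from kubernetes.utils import parse_quantity

def crawler_cpu(base: str, extra_per_browser: str, num_browsers: int):
    """Fixed base plus an incremental bump per extra browser;
    the same formula applies to memory."""
    return parse_quantity(base) + parse_quantity(extra_per_browser) * (num_browsers - 1)

# e.g. with hypothetical values.yaml quantities:
print(crawler_cpu("650m", "300m", 4))  # 0.65 + 0.3 * 3 = 1.55 cores (Decimal)
```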
Ilya Kreymer
876ba1bf24
null check: check before accessing config in 'get_all_crawl_search_values' (#1144) 2023-09-05 23:57:05 -04:00
Tessa Walsh
147bfd9d44
Add event webhook notifications system to backend (#1061)
Initial set of backend API for event webhook notifications for the following events:
* Crawl started (including boolean indicating if crawl was scheduled)
* Crawl finished
* Upload finished
* Archived item added to collection
* Archived item removed from collection

Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and a send is attempted (up to 5 tries per overall attempt, with increasing backoff between tries, implemented via the backoff library, which supports async).

Webhook status is available via /api/orgs/<oid>/webhooks

(Additional testing + potential fastapi integration left to separate follow-ups)
Fixes #1041
2023-08-31 19:52:37 -07:00
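A hedged sketch of the retry behavior, assuming aiohttp as the HTTP client (the actual client and payload shape aren't specified above):

```python
import aiohttp
import backoff

@backoff.on_exception(backoff.expo, aiohttp.ClientError, max_tries=5)
async def send_webhook(url: str, body: dict) -> None:
    """One 'overall attempt': up to 5 tries with exponential backoff."""
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=body) as resp:
            resp.raise_for_status()  # non-2xx counts as a failed try
```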
Tessa Walsh
1aa951132c
Fix unsetting all collections via PATCH update (#1126) 2023-08-30 18:16:21 -04:00
Tessa Walsh
f6369ee01e
Add support for collectionIds to archived item PATCH endpoints (#1121)
* Add support for collectionIds to patch endpoints

* Make update available via all-crawls/ and add test

* Fix tests

* Always remove collectionIds from update

* Remove unnecessary fallback

* One more pass on expected values before update
2023-08-30 10:41:30 -04:00
Tessa Walsh
e667fe2e97
Add max crawl size option to backend and frontend (#1045)
Backend:
- add 'maxCrawlSize' to models and crawljob spec
- add 'MAX_CRAWL_SIZE' to configmap
- add maxCrawlSize to new crawlconfig + update APIs
- operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize
- tests: add max crawl size tests

Frontend:
- Add Max Crawl Size text box Limits tab
- Users enter max crawl size in GB, convert to bytes
- Add BYTES_PER_GB as constant for converting to bytes
- docs: Crawl Size Limit to user guide workflow setup section

Operator Refactor:
- use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing the latter has no effect in the operator
- add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached
- crawlerjob status: store raw byte size under 'size', human-readable size under 'sizeHuman' for clarity
- size stat always exists so remove unneeded conditional (defaults to 0)

Charts:
- subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping
- subchart: show 'sizeHuman' property instead of 'size'
- bump subchart version to 0.1.1

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-08-26 22:00:37 -07:00
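A sketch of the is_crawl_stopping() decision described above (attribute names assumed from the commit text):

```python
def is_crawl_stopping(crawl, size: int, elapsed_secs: int) -> bool:
    if crawl.stopping:                                    # manual graceful stop
        return True
    if crawl.max_crawl_size and size >= crawl.max_crawl_size:
        return True                                       # size limit, from stats
    if crawl.timeout and elapsed_secs >= crawl.timeout:
        return True                                       # time limit
    return False
```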
Ilya Kreymer
2da6c1c905
1.6.3 Fixes - Fix workflow sort order for Latest Crawl + 'Remove From Collection' action menu on archived items in collections (#1113)
* fix latest crawl (lastRun) sort:
- don't cast 'started' value to string when setting as starting crawl time (regression from #937)
- caused incorrect sorting as finished crawl time was a datetime, while starting crawl time was a string
- move updated config crawl info in one place, simplify to avoid returning started time altogether, just set directly
- pass mdb crawlconfigs and crawls collections directly to add_new_crawl() function
- fixes #1108

* Add dropdown menu containing 'Remove from Collection' to archived items in collection view (#1110)
- Enables users to remove an item from a collection from the collection detail view - menu was previously missing
- Fixes: #1102 (missing dropdown menu) by making use of the inactive menu trigger button.
- Updates collection items page size to match "Archived Items" page size (20 items per page)

---------
Co-authored-by: sua yoo <sua@webrecorder.org>
2023-08-25 21:08:47 -07:00
Anish Lakhwara
8b16124675
feat: implement 'collections' array with {name, id} for archived item details (#1098)
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of None if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-08-25 00:26:46 -07:00
Ilya Kreymer
989ed2a8da
Use Shared Services for Crawling, Redis, Profile Browsers (#1088)
* refactor to use shared role-based service shared across pods:
- 'crawler' service for all crawler screencasting, scales 0 .. N with crawler-<ID>-N.crawl
- 'redis' service for all redis access, redis-<ID>-0.redis
- 'browser' service for all browser access (profile browsers), browser-<ID>-0.browser
- don't create a new service per crawl/profile at all
- enable 'publishNotReadyAddresses' for potentially faster resolving, esp for redis
- remove service as type managed by operator as no longer creating services dynamically
- remove frontend var CRAWLER_SVC_SUFFIX, suffix always '.crawler' to match crawler service name
2023-08-24 20:08:53 -07:00
Ilya Kreymer
e7f2d93f80 bump version to 1.7.0-beta.0 2023-08-23 12:03:45 -07:00
Tessa Walsh
ce5b52f8af
Add and enforce org maxPagesPerCrawl quota (#1044) 2023-08-23 10:38:36 -04:00
sua yoo
54cf4f23e4
Paginate Workflows and refactor to use server-side queries (#1078)
- Paginates Crawl Workflows when there are more than 10 workflows
- Refactors workflow search and crawl search to use the same component
- Adds sort by first seed, workflow creation date, and workflow modified date
- Separates "last run" date from "modified" date
- Updates column layout to Name & Schedule (or Manual Run), Latest Crawl (<finish time> in <duration>), total size, and last modified (modified by and modified time)
2023-08-22 16:29:17 -07:00
Ilya Kreymer
422452b5c1 bump to 1.6.2 2023-08-18 18:27:37 -07:00
Ilya Kreymer
90b2f94aef
follow-up to #1066: update redis to 5.0.0 which includes full fix for connection leak in from_url(), (#1081)
simplifies previous workaround addressed in 5.0.0
2023-08-15 20:34:47 -07:00
Ilya Kreymer
2e73148bea
fix redis connection leaks + exclusions error: (fixes #1065) (#1066)
* fix redis connection leaks + exclusions error: (fixes #1065)
- use contextmanager for accessing redis to ensure redis.close() is always called
- add get_redis_client() to k8sapi to ensure unified place to get redis client
- use connectionpool.from_url() until redis 5.0.0 is released to ensure auto close and single client settings are applied
- also: catch invalid regex passed to re.compile() in queue regex check, return 400 instead of 500 for invalid regex
- redis requirements: bump to 5.0.0rc2
2023-08-14 18:29:28 -07:00
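A minimal sketch of the contextmanager approach, using redis-py's asyncio API:

```python
from contextlib import asynccontextmanager
from redis import asyncio as aioredis

@asynccontextmanager
async def get_redis(redis_url: str):
    """Yield a client and guarantee close() always runs, plugging the leak."""
    redis = aioredis.from_url(redis_url, decode_responses=True)
    try:
        yield redis
    finally:
        await redis.close()

# usage: async with get_redis(url) as redis: ... await redis.llen(key) ...
```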
Ilya Kreymer
9553115bbe
helm chart tweaks: (#1067)
* helm chart tweaks:
- lower mem requirements for backend and crawler
- disable cors in ingress to pass through cors headers from backend
- crawler statefulset: use ordered instead of parallel scaling policy to avoid single crawl taking up all crawling capacity quickly
2023-08-14 16:43:12 -07:00
Ilya Kreymer
d93ddaf620 bump version to 1.6.1 2023-08-11 12:50:41 -07:00
Ilya Kreymer
35ab6d6df6 bump to 1.6.0! 2023-08-09 15:40:27 -07:00
sua yoo
37733483d5
Standardize archived item filtering, sorting and labels (#1054)
Frontend:
- Renames list view to "All Archived Items"
- Refactors fetches to use single all-crawls endpoints
- Removes search by config ID for more search parity with uploads
- Adds sort by size
- Refactors property and method names to replace crawl*
- Replaces remaining references to "crawl" in copy with "item"
- Renames Upload Archive button to Upload WACZ
- Fixes focusout in item menu so menus close

Backend:
- Filter search values by type as well
- Only get list of cids for crawls in search values
- Don't list crawl/workflow ids in search values

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-08-09 12:13:55 -07:00
Ilya Kreymer
7a8f370bc2 bump version to 1.6.0-beta.4 for testing 2023-08-09 12:09:37 -07:00
Ilya Kreymer
de3e5907a7
backend: crawlout: include raw crawlconfig in api details, fixes #1030 (#1055) 2023-08-09 08:46:42 -07:00
Ilya Kreymer
8d0a4f2ca9
fix public collections endpoint returning 404 when not public (#1052)
tests: add tests for public collections endpoint when collection is public and when not
2023-08-04 13:29:13 -04:00
Tessa Walsh
7ff57ce6b5
Backend: standardize search values, filters, and sorting for archived items (#1039)
- all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in
- Uploads list endpoint now includes all all-crawls filters relevant to uploads
- An all-crawls/search-values endpoint is added to support searching across all archived item types
- Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI)
- Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment
- New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass
- Tests coverage added for all new all-crawls endpoints, filters, and sort values
2023-08-04 09:56:52 -07:00
Ilya Kreymer
362afa47bd
Support for Public / Shareable Collections (#1038)
* collections: support toggling collections public/private, viewable via RWP
- backend: add 'public' to collection model, support patching to update
- backend: add .../collections/<id>/public/replay.json for public access
- backend: add CORS handling for public endpoint
- frontend: support 'make shareable / make private' dropdown actions on collection detail + collection list views
- frontend: show shareable / private icons by collection name on detail + list views
- frontend: link to replayweb.page for standalone browsing
- frontend: add embed code popup when a collection is shareable
- refer to public collections as 'shareable' for now

---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-08-03 19:11:01 -07:00
Ilya Kreymer
45eaa0b3a3 version: bump to 1.6.0-beta.3 2023-08-01 09:48:17 -07:00
Ilya Kreymer
06cf9c7cc3
add crawl ending states: 'generate-wacz', 'uploading-wacz', 'pending-wait' that occur after a crawl is finished or is being stopped (#1022)
operator: ensure transitions from each of these states are supported, including to 'waiting_capacity'
add extra check on stopping to avoid transitioning back to a running state after crawl is finished
ui: add states to UI display, localization, add as active states
fixes #263
2023-08-01 00:15:59 -07:00
Ilya Kreymer
7ea6d76f10
Resource Constraints Cleanup: (fixes #895) (#1019)
* resource constraints: (fixes #895)
- for cpu, only set cpu requests
- for memory, set mem requests == mem limits
- add missing resource constraints for minio and scheduled job
- for crawler, set mem and cpu constraints per browser, scale based on browser instances per crawler
- add comments in values.yaml for crawler values being multiplied
- default values: bump crawler to 650 millicpu per browser instance just in case

cleanup: remove unused entries from main backend configmap
2023-08-01 00:11:16 -07:00
Vinzenz Sinapius
5807507f29
Add proxy settings for crawler and profilebrowser (#997) 2023-07-26 16:11:10 -07:00
Ilya Kreymer
6506965d98
Streaming Download for Collections (#1012)
* support streaming download of collections (part of #927)
- WACZ zip created on the fly using stream-zip
- add 'Download Collection' option to collection detail and list
- after editing collection, return to collection view
- tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-07-26 15:42:17 -07:00
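A sketch of the on-the-fly ZIP creation, assuming stream-zip's member-tuple API; WACZ members use STORE (no compression) since they are already compressed:

```python
from datetime import datetime
from stream_zip import stream_zip, NO_COMPRESSION_64

def zipped_collection(wacz_files):
    """wacz_files: iterable of (name, byte-chunk iterable) pairs, e.g.
    streamed from object storage. Yields ZIP bytes suitable for a
    streaming HTTP response."""
    members = (
        (name, datetime.utcnow(), 0o644, NO_COMPRESSION_64, chunks)
        for name, chunks in wacz_files
    )
    yield from stream_zip(members)
```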
Tessa Walsh
c21153255a
Rename notes to description in frontend and backend (#1011)
- Rename crawl notes to description
- Add migration renaming notes -> description
- Stop inheriting workflow description in crawl
- Update frontend to replace crawl/upload notes with description
- Remove setting of config description from crawl list
- Adjust tests for changes
2023-07-26 13:00:04 -07:00
Ilya Kreymer
4bea7565bc
load handling: scale up redis only when crawler pods running (#1009)
Operator: Modified init behavior to only load redis when at least one crawler pod available:
- waits for at least one crawler pod to be available before starting redis pod, to avoid situation where many crawler pods are in pending mode, but redis pods are still running.
- redis statefulset starts at scale of 0
- once crawler pod becomes available, redis sts is scaled to 1 (via `initRedis==true` status)
- crawl remains in 'starting' or 'waiting_capacity' state, without the redis pod running, until a crawler pod becomes available
- set to 'running' state only after redis and at least one crawler pod is available
- if no crawler pods available after running, or if stuck in starting for >60 seconds, switch to 'waiting_capacity' state
- when switching to 'waiting_capacity', also scale down redis to 0, wait for crawler pod to become available, only then scale up redis to 1, and get back to 'running'

other tweaks:
- add new status field 'initRedis', default to false, not displayed
- crawler pod: consider 'ContainerCreating' state as available, as container will not be blocked by resource limits
- add a resync after 3 seconds when waiting for crawler pod or redis pod to become available, configurable via 'operator_fast_resync_secs'
- set_state: if not updating state, ensure state reflects actual value in db
2023-07-26 08:40:05 -07:00
Tessa Walsh
608a744aaf
Add migration to replace None with 0 for configmap CRAWL_TIMEOUT (#1008) 2023-07-24 15:49:26 -04:00
Tessa Walsh
fcd48b1831
Add totalSize to collections and make it sortable in list endpoint (#1001)
* Precompute collection.totalSize and make sortable

* Add migration to recompute collection data with totalSize
2023-07-24 13:12:23 -04:00
Tessa Walsh
9f32aa697b
Add collections and tags to upload API endpoints (#993)
* Add collections and tags to uploads

* Fix order of deletion check test

* Re-add tags to UploadedCrawl model after rebase

* Fix Users model heading
2023-07-21 16:44:56 +02:00
Tessa Walsh
4014d98243
Move pydantic models to separate module + refactor crawl response endpoints to be consistent (#983)
* Move all pydantic models to models.py to avoid circular dependencies
* Include automated crawl details in all-crawls GET endpoints
- ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. Fields added in GET and list all-crawls endpoints for automated crawls only:
- cid
- name
- description
- firstSeed
- seedCount
- profileName

* Add automated crawl fields to list all-crawls test

* Uncomment mongo readinessProbe

* cleanup CrawlOutWithResources:
- remove 'files' from output model, only resources should be returned
- add _files_to_resources() to simplify computing presigned 'resources' from raw 'files'
- update upload tests to be more consistent, 'files' never present, 'errors' always none

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-20 13:05:33 +02:00
Tessa Walsh
d5c3a8519f
Add crawler Use Sitemap option to Browsertrix Cloud (#978)
* Add user-guide docs for Use Sitemap option
---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-07-19 13:57:52 -04:00
Ilya Kreymer
a5312709bb
fix issues that caused cronjob container to crash: (#987)
- don't set CRAWL_TIMEOUT to "None" in configmap, and if encountered, just set to 0
- run register_exit_handler() after run loop has been inited
2023-07-18 18:08:53 +02:00
Ilya Kreymer
7d694754c6
uploads api ext: (#970)
- also support collectionId filter on /all-crawls
- update tests
2023-07-09 22:12:54 -07:00
Ilya Kreymer
f1bce310d0
uploads api: support filtering uploads by collectionId (#969)
tests: add collection filter test
2023-07-09 10:54:30 -07:00
Ilya Kreymer
a640f58657
Tests: fix test get crawl loop (#967)
* tests: add sleep() between all looping get_crawl() calls to avoid tight request loop, also remove unneeded loop
will likely fix occasional '504 timeout' test failures where frontend is overwhelmed with /replay.json requests
2023-07-08 17:16:11 -07:00
Ilya Kreymer
2038e3d668
remove default: similar to #952, remove default extraHops setting as it disables 'url list' extraHops by forcing the value to 0 (#954) 2023-07-07 12:08:30 -07:00
Ilya Kreymer
7139b9a7a9
operator: ensure finished is always set (#953) 2023-07-07 12:08:15 -07:00
Ilya Kreymer
00eb62214d
Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937)
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove

* uploads api: (part of #929)
- new UploadCrawl object which extends BaseCrawl, has name and description
- support multipart form data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to support for replay
- add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')

* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects / missing type are set to 'type: crawl'
- indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state

* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'

* collections: support adding and remove both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads

bump version to 1.6.0-beta.2
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-07-07 09:13:26 -07:00
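A sketch of the upload-placement rule quoted above; the exact sanitization is assumed:

```python
import re
import secrets
import uuid

def upload_path(filename: str) -> str:
    """Sanitize a user-supplied filename and place it under a
    per-upload UUID prefix with a random suffix."""
    base = re.sub(r"[^\w.-]", "", filename.rsplit("/", 1)[-1]) or "upload"
    base = base.removesuffix(".wacz")
    return f"uploads/{uuid.uuid4()}/{base}-{secrets.token_hex(4)}.wacz"

print(upload_path("My Crawl (final).wacz"))
# e.g. uploads/1b9f.../MyCrawlfinal-a3f09c2d.wacz
```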
Tessa Walsh
bf1e817da3
Unset default scopeType for seeds so they inherit parent scopeType by default (#952) 2023-07-06 15:03:05 -07:00
Ilya Kreymer
e37f220d6c version: bump to 1.6.0-beta.1 2023-06-16 18:53:32 -07:00
Tessa Walsh
c7051d5fbf
Backend API consistency pass (#921)
* Make API add and update method returns consistent

- Updates return {"updated": True}
- Adds return {"added": True}
- Both can additionally have other fields as needed, e.g. id or name

- remove Profile response model, as returning added / id only
- reformat

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-06-16 18:52:46 -07:00
Tessa Walsh
bd6dc79449
Add frontend support for auto-adding collections to workflows (#916)
- Adds collections search and list to workflow editor
- Adds collections to workflow details component
- Adds namePrefix filter to backend GET /orgs/{oid}/collections endpoint to support case-insensitive searching of collections
- Adds documentation for new setting

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-06-12 18:18:05 -07:00
Tessa Walsh
325355d991
Fix post-crawl collection stats update and add test (#918)
This fixes #917, where crawls added to a collection via the workflow
autoAddCollections were not successfully represented in the crawl
and page count stats in the collection after completing.
2023-06-10 19:06:25 -07:00
Tessa Walsh
e10b7093c7
Fix bug preventing deleting collections with no crawls (#912) 2023-06-08 11:28:30 -07:00
Ilya Kreymer
4428184aea
frontend: configure running with a fixed 'replay.json', auth headers passed via separate config (#899)
wabac.js will reload the replay.json on 403 with new token (will be in next version of wabac.js)
presign urls: make presign timeout configurable (in minutes), defaults to 60 mins
dockerfile: fix configuring RWP_BASE_URL
2023-06-08 11:26:26 -07:00
Tessa Walsh
120f7ca158
Precompute crawl file stats (#906) 2023-06-07 16:39:49 -07:00
sua yoo
66b3befef9
Frontend collections beta UI (#886)
- Support for creating new collections and editing existing collections
- Can select crawling workflows which adds entire workflow, and then deselect individual crawls
- Can edit existing collections and add more crawls
- Can view, create and delete collections via new Collections top-level nav entry
2023-06-06 17:52:01 -07:00
Ilya Kreymer
f2b7b6bcd5
Nightly Tests Fix (#905)
* tests: fix nightly test to account for 'waiting_capacity' state

* readd missing --logErrorsToRedis flag
2023-06-02 21:47:41 -07:00
Ilya Kreymer
3f42515914
crawls list: unset errors in crawls list response to avoid very large responses (#904)
* crawls list: unset errors in crawls list response to avoid very large responses #872

* Remove errors from crawl replay.json

* Add tests to ensure errors are excluded from crawl GET endpoints

* Update tests to accept None for errors
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-06-02 18:52:59 -07:00
Ilya Kreymer
00fb8ac048
Concurrent Crawl Limit (#874)
concurrent crawl limits: (addresses #866)
- support limits on concurrent crawls that can be run within a single org
- change 'waiting' state to 'waiting_org_limit' for concurrent crawl limit and 'waiting_capacity' for capacity-based
limits

orgs:
- add 'maxConcurrentCrawl' to new 'quotas' object on orgs
- add /quotas endpoint for updating quotas object

operator:
- add all crawljobs as related, appear to be returned in creation order
- operator: if concurrent crawl limit set, ensures current job is in the first N set of crawljobs (as provided via 'related' list of crawljob objects) before it can proceed to 'starting', otherwise set to 'waiting_org_limit'
- api: add org /quotas endpoint for configuring quotas
- remove 'new' state, always start with 'starting'
- crawljob: add 'oid' to crawljob spec and label for easier querying
- more stringent state transitions: add allowed_from to set_state()
- ensure state transitions only happened from allowed states, while failed/canceled can happen from any state
- ensure finished and state synched from db if transition not allowed
- add crawl indices by oid and cid

frontend: 
- show different waiting states on frontend: 'Waiting (Crawl Limit)' and 'Waiting (At Capacity)'
- add gear icon on orgs admin page
- and initial popup for setting org quotas, showing all properties from org 'quotas' object

tests:
- add concurrent crawl limit nightly tests
- fix state waiting -> waiting_capacity
- ci: add logging of operator output on test failure
2023-05-30 15:38:03 -07:00
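A sketch of the first-N check described above; the operator receives related crawljobs in creation order, and the dict shape here is assumed:

```python
def can_start(job_name: str, related_jobs: list[dict], max_concurrent: int) -> bool:
    """A job may proceed to 'starting' only if it is among the N oldest
    crawljobs for its org; 0/unset means no limit."""
    if not max_concurrent:
        return True
    ordered = [job["metadata"]["name"] for job in related_jobs]  # creation order
    return job_name in ordered[:max_concurrent]

# If this returns False, the operator sets the state to 'waiting_org_limit'.
```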
sua yoo
6208ead040
Sort collection by last updated (modified) (#897)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-05-30 14:09:10 -04:00
Ilya Kreymer
4d30a64bc9
collection delete: (#896)
set delete endpoint to use DELETE verb, fix for #869
2023-05-29 18:19:04 -07:00
Tessa Walsh
df4c4e6c5a
Optimize workflow statistics updates (#892)
* optimizations:
- rename update_crawl_config_stats to stats_recompute_all, only used in migration to fetch all crawls
and do a full recompute of all file sizes
- add stats_recompute_last to only get last crawl by size, increment total size by specified amount, and incr/decr number of crawls
- Update migration 0007 to use stats_recompute_all
- Add isCrawlRunning, lastCrawlStopping, and lastRun to
stats_recompute_last
- Increment crawlSuccessfulCount in stats_recompute_last

* operator/crawls:
- operator: keep track of filesAddedSize in redis as well
- rename update_crawl to update_crawl_state_if_changed() and only update
if state is different, otherwise return false
- ensure mark_finished() operations only occur if crawl state has changed
- don't clear 'stopping' flag, can track if crawl was stopped
- state always starts with "starting", don't reset to starting

tests:
- Add test for incremental workflow stats updating
- don't clear stopping==true, indicates crawl was manually stopped

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-05-26 22:57:08 -07:00
Tessa Walsh
9c7a312a4c
Rework collections to track collections in Crawl (#878)
* Track collections in Crawl rather than crawls in Collection
* Add delete collection API endpoint and tests
* Precompute collection crawlCount, pageCount, and tags and add them to
GET collection responses
* Add modified field to Collection
* Update collection replay.json method
* Make add and remove crawls accept list of crawl ids
* Auto-add new workflow crawls to collections when they successfully
complete via CrawlConfig.autoAddCollections field
* Move long-running post-crawl operator tasks into asyncio task
* Make CrawlConfig.autoAddCollections updatable via /update API endpoint
2023-05-25 15:41:50 -04:00
Ilya Kreymer
d7c19c7613
Wait for DB init for healthcheck + settings (#885)
* init check: (backend fix for #794)
- wait until db is inited before allowing /api/settings to return 200
- also return 503 from healthcheck endpoint, until db is available
2023-05-25 09:58:30 -07:00
Tessa Walsh
e94e179bb9
Fix crawl stopping tests (#875)
* Update currCrawlStopping references in backend tests

* Make sure previous crawl is fully stopped before next test
2023-05-23 12:39:53 -07:00
Tessa Walsh
5c944d4626
Remove uniqueness constraint on collection descriptions
Fix for copy-paste error
2023-05-23 11:03:13 -04:00
Ilya Kreymer
12f7db3ae2
tests: fixes for crawl cancel + crawl stopped (#864)
* tests:
- fix cancel crawl test by ensuring state is not running or waiting
- fix stop crawl test by ensuring stop is only initiated after at least one page has been crawled,
otherwise result may be failed, as no crawl data has been crawled yet (separate fix in crawler to avoid loop if stopped
before any data written webrecorder/browsertrix-crawler#314)
- bump page limit to 4 for tests to ensure crawl is partially complete, not fully complete when stopping
- allow canceled or partial_complete due to race condition

* chart: bump frontend limits in default, not just for tests (addresses #780)

* crawl stop before starting:
- if crawl stopped before it started, mark as canceled
- add test for stopping immediately, which should result in 'canceled' crawl
- attempt to increase resync interval for immediate failure
- nightly tests: increase page limit to test timeout

* backend:
- detect stopped-before-start crawl as 'failed' instead of 'done'
- stats: return stats counters as int instead of string
2023-05-22 20:17:29 -07:00
Tessa Walsh
28f1c815d0
Add crawlSuccessfulCount to workflows (#871) 2023-05-22 19:06:37 -04:00
Tessa Walsh
bd8b306fbd
Improve sorting workflows by lastUpdated (#826)
* Precompute config crawl stats

Includes a database migration to move preciously dynamically computed
crawl stats for workflows into the CrawlConfig model.

* Add lastRun sorting option and enable it by default

* Add modified as final sort key to order non-run workflows

* Remove currCrawl* fields and update frontend accordingly

* Add isCrawlRunning field to backend and use in frontend
2023-05-22 18:42:30 -04:00
Tessa Walsh
60fac2b677
Add collection sorting and filtering (#863)
* Sort by name and description (ascending by default)
* Filter by name
* Add endpoint to fetch collection names for search
* Add collation so that utf-8 chars sort as expected
2023-05-22 16:53:49 -04:00
Ilya Kreymer
826c2e8298 version: bump to 1.6.0-beta.0 2023-05-19 11:29:31 -07:00
Tessa Walsh
f482831d53
Use collection uuid as id (instead of name) (#855)
Also ensure name is not empty by adding minimum length of 1

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-05-19 09:03:48 -04:00
Ilya Kreymer
d07204e59d version: bump to 1.5.1 2023-05-18 17:28:42 -07:00
Ilya Kreymer
a1ef93a46a version: bump to 1.5.0 for release! 2023-05-16 17:36:58 +02:00
Ilya Kreymer
ebee5e1788 version: bump to 1.5.0-beta.4 2023-05-12 07:34:50 +02:00
Ilya Kreymer
d8b36c0ae2 version: bump to 1.5.0-beta.3 2023-05-11 03:05:46 +02:00
Ilya Kreymer
d1e5b0a021 version: bump to 1.5.0-beta.2 2023-05-10 14:55:35 +02:00
Ilya Kreymer
a6ddde496d
backend: fixes to 0005 migration: (#843)
- catch any errors on updating config (likely due to missing configmap), fix formatting
2023-05-10 12:00:41 +02:00
Ilya Kreymer
cf15d9c873
backend: ensure cid is a UUID, remove unneeded inactive check on crawls (#842)
* backend: ensure cid is a UUID, remove unneeded inactive check on crawls

* add UUID cast to cancel only
2023-05-10 11:59:44 +02:00
Ilya Kreymer
2cae065c46
Add Waiting state on the backend and frontend (#839)
* operator: add waiting state
- add pods as related objects
- inspect pod status, set crawl status to 'waiting' if no pods are running

frontend:
- frontend support for 'waiting' state
- show waiting icon from mocks

---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-05-08 17:05:01 -07:00
Ilya Kreymer
70319594c2
crawlconfig: fix default filename template, make configurable (#835)
* crawlconfig: fix default filename template, make configurable
- make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml
- set to '@ts-@hostsuffix.wacz' by default
- allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap
- tests: add test for custom 'default_crawl_filename_template'
2023-05-08 14:03:27 -07:00
Ilya Kreymer
fd7e81b8b7
stopping fix: backend fixes for #836 + prep for additional status fields (#837)
* stopping fix: backend fixes for #836
- sets 'stopping' field on crawl when crawl is being stopped (both via db and on k8s object)
- k8s: show 'stopping' as part of crawljob object, update subchart
- set 'currCrawlStopping' on workflow
- support old and new browsertrix-crawler stopping keys
- tests: add tests for new stopping state, also test canceling crawl (disable test for stopping crawl, currently failing)
- catch redis error when getting stats

operator: additional optimizations:
- run pvc removal as background task
- catch any exceptions in finalizer stage (eg. if db is down), return false until finalizer completes
2023-05-08 14:02:20 -07:00
Ilya Kreymer
064cd7e08a quickfix: fix stopping crawls with current browsertrix-crawler beta 2023-05-06 23:35:25 -07:00
Ilya Kreymer
b40d599e17
operator fixes: (#834)
- just pass cid from operator for consistency, don't load crawl from update_crawl (different object)
- don't throw in update_config_crawl_stats() to avoid exception in operator, only throw in crawlconfigs api
2023-05-06 13:02:33 -07:00
Ilya Kreymer
f992704491 version: bump version to 1.5.0-beta.1 2023-05-06 00:31:03 -07:00
Tessa Walsh
4f121fb868
Update precompute migration to only update active workflows (#833) 2023-05-05 21:35:03 -07:00
Tessa Walsh
8281ba723e
Pre-compute workflow last crawl information (#812)
* Precompute config crawl stats

* Includes a database migration to move previously dynamically computed crawl stats for workflows into the CrawlConfig model.

* Add crawls.finished descending index

* Add last crawl fields to workflow tests
2023-05-05 15:12:52 -07:00
Ilya Kreymer
aae0e6590e
Ensure Volumes are deleted when crawl is canceled (#828)
* operator:
- ensures crawler pvcs are always deleted before crawl object is finalized (fixes #827)
- refactor to ensure finalizer handler always run when finalizing
- remove obsolete config entries
2023-05-05 12:05:54 -07:00
Tessa Walsh
48d34bc3c4
Add option to list workflows API endpoint to filter by schedule (#822)
* Add option to filter workflows by empty or non-empty schedule

* Add tests
2023-05-05 12:05:19 -07:00
Tessa Walsh
542ad7a24a
Update scale in workflow when crawl scale is updated (#820) 2023-05-05 11:59:57 -07:00
Tessa Walsh
774ae518f4
Set crawl-stop in redis from operator when crawl is stopped (#815)
Change redis key to <crawl-id>:crawl-stop to match webrecorder/browsertrix-crawler#303
2023-05-05 11:34:24 -07:00
Tessa Walsh
b2005fe389
Fix crawl /errors API endpoint (#813)
* Fix crawl error slicing to ensure a consistent number of errors per page
* Fix total count in paginated API response
2023-05-03 13:58:38 -04:00
Tessa Walsh
1a63c31b71
backend: errors endpoint: Parse JSON-l errors before returning (#799) 2023-04-26 14:36:48 -07:00
Ilya Kreymer
7aefe09581
startup fixes: (#793)
- don't run migrations on first init, just set to CURR_DB_VERSION
- implement 'run once lock' with mkdir/rmdir
- move register_exit_handler() to utils
- remove old run once handler
2023-04-24 18:32:52 -07:00
Ilya Kreymer
60ba9e366f
Refactor to use new operator on backend (#789)
* Btrixjobs Operator - Phase 1 (#679)

- add metacontroller and custom crds
- add main_op entrypoint for operator

* Btrix Operator Crawl Management (#767)

* operator backend:
- run operator api in separate container but in same pod, with WEB_CONCURRENCY=1
- operator creates statefulsets and services for CrawlJob and ProfileJob
- operator: use service hook endpoint, set port in values.yaml

* crawls working with CrawlJob
- jobs start with 'crawljob-' prefix
- update status to reflect current crawl state
- set sync time to 10 seconds by default, overridable with 'operator_resync_seconds'
- mark crawl as running, failed, complete when finished
- store finished status when crawl is complete
- support updating scale, forcing rollover, stop via patching CrawlJob
- support cancel via deletion
- requires hack to content-length for patching custom resources
- auto-delete of CrawlJob via 'ttlSecondsAfterFinished'
- also delete pvcs until autodelete supported via statefulset (k8s >1.27)
- ensure filesAdded always set correctly, keep counter in redis, add to status display
- optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added
- add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled
- add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291)
- support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284)

* support for scheduled jobs!
- add main_scheduled_job entrypoint to run scheduled jobs
- add crawl_cron_job.yaml template for declaring CronJob
- CronJobs moved to default namespace

* operator manages ProfileJobs:
- jobs start with 'profilejob-'
- update expiry time by updating ProfileJob object 'expireTime' while profile is active

* refactor/cleanup:
- remove k8s package
- merge k8sman and basecrawlmanager into crawlmanager
- move templates, k8sapi, utils into root package
- delete all *_job.py files
- remove dt_now, ts_now from crawls, now in utils
- all db operations happen in crawl/crawlconfig/org files
- move shared crawl/crawlconfig/org functions that use the db to be importable directly,
including get_crawl_config, add_new_crawl, inc_crawl_stats

* role binding: more secure setup, don't allow crawler namespace any k8s permissions
- move cronjobs to be created in default namespace
- grant default namespace access to create cronjobs in default namespace
- remove role binding from crawler namespace

* additional tweaks to templates:
- templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately)

* stats / redis optimization:
- don't update stats in mongodb on every operator sync, only when crawl is finished
- for api access, read stats directly from redis to get up-to-date stats
- move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access

* Add migration for operator changes
- Update configmap for crawl configs with scale > 1 or
crawlTimeout > 0 and schedule exists to recreate CronJobs
- add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1

* subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart
- metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart

* backend api fixes
- ensure changing scale of crawl also updates it in the db
- crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api

---------

Co-authored-by: D. Lee <leepro@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-24 18:30:52 -07:00
Tessa Walsh
a2435a013b
Add totalSize to workflow API endpoints (#783) 2023-04-20 17:23:59 -04:00
Ilya Kreymer
3f41498c5c quickfix: fix typo, remove unnecessary async 2023-04-18 16:14:15 -07:00
Ilya Kreymer
821d29bd2a
crawlconfig api: add 'currCrawlState' and 'currCrawlTimeStart' to crawlconfig list api (already queried on backend) (#770)
* crawlconfig api: add 'currCrawlState' and 'currCrawlTimeStart' to crawlconfig list api (already queried on backend)
2023-04-17 23:13:13 -07:00
Tessa Walsh
6b19f72a89
Add crawl errors endpoint (#757)
* Add crawl errors endpoint

If this endpoint is called while the crawl is running, errors are
pulled directly from redis.

If this endpoint is called when the crawl is finished, errors are
pulled from mongodb, where they're written when crawls complete.

* Add nightly backend test for errors endpoint

* Add errors for failed and cancelled crawls to mongo

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-04-17 12:59:25 -04:00
Ilya Kreymer
4a46f894a2
backend: add 'lastCrawlStartTime' and 'lastStartedByName' fields to crawlconfigs apis (#753) 2023-04-17 08:34:29 -07:00
Tessa Walsh
59e49eacd5
Update collections backend API (#759)
* Re-implement collections, storing crawlIds in collection

* Return collections for crawl endpoints and filter on coll name

* Remove crawl from all collections when deleted

* Revert get_collection_crawls to flat array of resources

* Fix tests
2023-04-14 12:17:18 -04:00
Tessa Walsh
1ad82a63e6
Add crawl timeout nightly test (#762) 2023-04-11 19:36:18 -07:00
Ilya Kreymer
85b6a05419
Upgrade to mongo 6 and use sortArray for workflow crawls (#764) (#765)
fixes from 1.4.1:
* Upgrade to mongo 6 and use sortArray for workflow crawls

* update readiness probe with timeouts doubled, and failure threshold increased for slower 'mongosh' readiness check

update versions to 1.5.0-beta.0 in backend and frontend

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-11 18:22:07 -07:00
Tessa Walsh
fb80a04f18 Add crawl /log API endpoint
If a crawl is completed, the endpoint streams the logs from the log
files in all of the created WACZ files, sorted by timestamp.

The API endpoint supports filtering by log_level and context, whether
or not the crawl is still running.

This is not yet proper streaming because the entire log file is read
into memory before being streamed to the client. We will want to
switch to proper streaming eventually, but are currently blocked by
an aiobotocore bug - see:

https://github.com/aio-libs/aiobotocore/issues/991?#issuecomment-1490737762
2023-04-11 11:51:17 -04:00
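A sketch of merging per-WACZ log files by timestamp; since each log file is itself time-ordered, heapq.merge can interleave them lazily (the timestamp field name is assumed):

```python
import heapq
import json

def merged_log_lines(log_streams):
    """log_streams: iterables of JSON-Lines strings, one per WACZ log file,
    each already sorted by timestamp. Yields parsed entries in global order."""
    parsed = ((json.loads(line) for line in stream) for stream in log_streams)
    yield from heapq.merge(*parsed, key=lambda entry: entry["timestamp"])
```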
Ilya Kreymer
631c84e488 version: bump to 1.4.0! 2023-04-06 10:12:43 -07:00
Ilya Kreymer
3ab62547a9 version: bump to 1.4.0-beta.2 2023-04-06 02:45:20 -07:00
Tessa Walsh
11ca3e678a
Configure crawler disk utilization threshold via helm chart (#748) 2023-04-05 21:51:53 -07:00
Ilya Kreymer
7f757d396a
config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config (#742)
* config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config
- add 'default_page_load_timeout_seconds' to values.yaml, defaulting to 120, for pageLoadTimeout
- add 'defaultPageLoadTimeSeconds' to /api/settings, update tests for /api/settings
addresses issue in #636
2023-04-04 19:52:23 -07:00
Ilya Kreymer
67172ca1e2
fix: only include finished crawls in crawlCount value for /api/crawlconfigs (#746) 2023-04-04 19:50:14 -07:00
Ilya Kreymer
1c47a648a9
Max page limit override (#737)
* max page limit: update to #717: instead of setting --limit in each crawlconfig,
apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit

* update tests, no longer returning 'crawl_page_limit_exceeds_allowed'
2023-04-03 14:01:32 -07:00
Tessa Walsh
3b99bdf26a
Update nightly test fixtures to use Seed objects (#734) 2023-04-03 16:21:25 -04:00
Tessa Walsh
e9b61c632d
Add pageSize to pagination format (#736) 2023-04-03 15:57:47 -04:00
Ilya Kreymer
887cb16146
Allow configurable max pages per crawl in deployment settings (#717)
* backend: max pages per crawl limit, part of fix for #716:
- set 'max_pages_crawl_limit' in values.yaml, default to 100,000
- if set/non-0, automatically set limit if none provided
- if set/non-0, return 400 if adding config with limit exceeding max limit
- return limit as 'maxPagesPerCrawl' in /api/settings
- api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing)

tests: add test for 'max_pages_per_crawl' setting
- ensure 'limit' can not be set higher than max_pages_per_crawl
- ensure pages crawled is at the limit
- set test limit to max 2 pages
- add settings test
- check for pages.jsonl and extraPages.jsonl when crawling 2 pages
2023-03-28 16:26:29 -07:00
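A sketch of the page-limit enforcement from #717 (the error detail string appears in the later #737 commit above):

```python
from fastapi import HTTPException

MAX_PAGES_CRAWL_LIMIT = 100_000  # max_pages_crawl_limit from values.yaml; 0 disables

def apply_page_limit(requested: int) -> int:
    if not MAX_PAGES_CRAWL_LIMIT:
        return requested
    if not requested:
        return MAX_PAGES_CRAWL_LIMIT  # no limit provided: set it automatically
    if requested > MAX_PAGES_CRAWL_LIMIT:
        raise HTTPException(status_code=400,
                            detail="crawl_page_limit_exceeds_allowed")
    return requested
```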
Tessa Walsh
4724754efc
Filter and sort crawl and workflow list API endpoints in backend (#724)
* Re-implement pagination and paginate crawlconfig revs

First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.

* Migrate all HttpUrl seeds to Seeds

This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.

* Filter and sort crawls and workflows

Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend

Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime

* Add crawlconfigs search-values API endpoint and test
2023-03-28 17:55:40 -04:00
Tessa Walsh
e293e98ac3
Fix migration to avoid jobType KeyError (#727)
* Fix migration to avoid KeyError

* Use .get() for other optional fields
2023-03-27 13:52:05 -07:00
Tessa Walsh
4136bdad2e
Add optional description to crawl configs and return in crawl endpoints (#707) 2023-03-21 15:39:09 -04:00
Ilya Kreymer
ba70d3227e version: update to 1.4.0-beta.1 2023-03-17 21:14:42 -07:00
Ilya Kreymer
07e9f51292
backend: update queue apis to work with new sorted queue apis (also backwards compatible) (#712)
* backend: update queue apis to work with new sorted queue apis (also backwards compatible to existing apis)
designed for browsertrix-crawler 0.9.0-beta.1 but also backwards compatible with older list-based queue as well
2023-03-17 21:11:17 -07:00
Ilya Kreymer
de9212eec7
exclusions editor fix: (#692)
- backend: fix updating model after exclusions change
- frontend: don't check for new_cid, just success
- fixes #691
2023-03-10 22:36:10 -08:00
Ilya Kreymer
86ca9c4bac
backend: Fix for total crawl time limit. (#665)
* backend: fix for total crawl timelimit:
- time limit is computed for total job run time
- when limit is exceeded, job starts to stop crawls gracefully, equivalent to 'stop crawl' operation
- fix for #664

* rename crawl-timeout -> crawl_expire_time

* fix lint
2023-03-10 11:43:16 -08:00
Ilya Kreymer
c2fa78859b
permissions: allow user with 'viewer' permissions to access read-only crawlconfig apis (#687)
addresses issue in #653, fixes #685
2023-03-08 09:29:25 -08:00
Ilya Kreymer
544346d1d4
backend: make crawlconfigs mutable! (#656) (#662)
* backend: make crawlconfigs mutable! (#656)
- crawlconfig PATCH /{id} can now receive a new JSON config to replace the old one (in addition to scale, schedule, tags)
- exclusions: add / remove APIs mutate the current crawlconfig, do not result in a new crawlconfig created
- exclusions: ensure crawl job 'config' is updated when exclusions are added/removed, unify add/remove exclusions on crawl
- k8s: crawlconfig json is updated along with scale
- k8s: stateful set is restarted by updating annotation, instead of changing template
- crawl object: now has 'config', as well as 'profileid', 'schedule', 'crawlTimeout', 'jobType' properties to ensure anything that is changeable is stored on the crawl
- crawlconfigcore: store shared properties between crawl and crawlconfig in new crawlconfigcore (includes 'schedule', 'jobType', 'config', 'profileid', 'crawlTimeout', 'tags', 'oid')
- crawlconfig object: remove 'oldId', 'newId', disallow deactivating/deleting while crawl is running
- rename 'userid' -> 'createdBy'
- remove unused 'completions' field
- add missing return to fix /run response
- crawlout: ensure 'profileName' is resolved on CrawlOut from profileid
- crawlout: return 'name' instead of 'configName' for consistent response
- update: 'modified', 'modifiedBy' fields to set modification date and user modifying config
- update: ensure PROFILE_FILENAME is updated in configmap if profileid provided, clear if profileid==""
- update: return 'settings_changed' and 'metadata_changed' if either crawl settings or metadata changed
- tests: update tests to check settings_changed/metadata_changed return values

add revision tracking to crawlconfig:
- store each revision separate mongo db collection
- revisions accessible via /crawlconfigs/{cid}/revs
- store 'rev' int in crawlconfig and in crawljob
- only add revision history if crawl config changed

migration:
- update to db v3
- copy fields from crawlconfig -> crawl
- rename userid -> createdBy
- copy userid -> modifiedBy, created -> modified
- skip invalid crawls (missing config), make createdBy optional (just in case)

frontend: Update crawl config keys with new API (#681), update frontend to use new PATCH endpoint, load config from crawl object in details view

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
2023-03-07 20:36:50 -08:00
Tessa Walsh
e98c7172a9
Paginate API list endpoints (#659)
* Paginate API list endpoints

fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.

* Increase page size via overridden Params and Page classes

* update api resource list keys

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-03-06 14:41:25 -05:00
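For reference, a hand-rolled sketch of the page/size pagination the endpoints above now expose; the actual implementation uses fastapi-pagination 0.9.3, and the larger default size here is an assumption:

```python
DEFAULT_PAGE_SIZE = 1_000  # assumed override of the library's smaller default


async def paginated_list(coll, query: dict, page: int = 1, size: int = DEFAULT_PAGE_SIZE) -> dict:
    # 'coll' is a motor collection; skip/limit implement the page window
    total = await coll.count_documents(query)
    items = await coll.find(query).skip((page - 1) * size).limit(size).to_list(size)
    return {"items": items, "total": total, "page": page, "size": size}
```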
Ilya Kreymer
ace4e79e3f version: bump version to 1.4.0-beta.0 2023-03-06 10:20:56 -08:00
Ilya Kreymer
df9a7eccf3 version: bump to 1.3.1 2023-02-28 18:40:15 -08:00
Ilya Kreymer
4901fc2fe9 version: bump to 1.3.0 2023-02-24 18:07:56 -08:00
Tessa Walsh
e2f359c352
CrawlConfig migration and crawl stats query optimization (#633)
* Drop crawl stats fields from CrawlConfig and add migration

* Remove migrate_down from BaseMigration

* Get crawl stats from optimized mongo query
2023-02-24 18:01:15 -08:00
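A hypothetical version of the optimized stats query from the commit above; the field names ('cid', 'finished') are guesses at the schema rather than the exact pipeline:

```python
async def get_crawl_stats(crawls_coll, cid: str) -> dict:
    # aggregate finished-crawl stats per config in a single query,
    # instead of storing counters on the crawlconfig itself
    pipeline = [
        {"$match": {"cid": cid, "finished": {"$ne": None}}},
        {"$group": {
            "_id": "$cid",
            "crawlCount": {"$sum": 1},
            "lastCrawlTime": {"$max": "$finished"},
        }},
    ]
    results = await crawls_coll.aggregate(pipeline).to_list(length=1)
    return results[0] if results else {"crawlCount": 0, "lastCrawlTime": None}
```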
Sara Tavares
8167d7da8d
fix typos (#640) 2023-02-24 11:10:49 -08:00
Tessa Walsh
1b1bc10c60
Fix nightly tests (#632) 2023-02-23 13:57:22 -05:00
Tessa Walsh
567e851235
Dynamically calculate crawl stats for crawlconfig endpoints (#623) 2023-02-22 22:17:45 -05:00
Tessa Walsh
ed94dde7e6
Include firstSeed and seedCount in crawl endpoints (#618) 2023-02-22 10:27:31 -05:00
Ilya Kreymer
0fd18ed3dd version: bump to 1.3.0-beta.0
CHANGES: add upcoming release, link to release changelist for 1.2.0
2023-02-21 10:14:08 -08:00
Tessa Walsh
4234f89d25
Rename crawlconfig name from file suffixes (#610) 2023-02-21 12:52:22 -05:00
Tessa Walsh
30f1930519
Add back GET /users/invite/{token} used by frontend (#607) 2023-02-16 13:02:38 -05:00
Tessa Walsh
bd4fba7af7
Fix POST /orgs/{oid}/crawls/delete (#591)
* Fix POST /orgs/{oid}/crawls/delete

- Add permissions check to ensure crawler users can only delete
their own crawls
- Fix broken delete_crawls endpoint
- Delete files from storage as well as deleting crawl from db
- Add tests, including nightly test that ensures crawl files are
no longer accessible after the crawl is deleted
2023-02-15 21:06:12 -05:00
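A sketch of the fixed delete flow; org.is_owner, storage.delete_file, and the field names are illustrative stand-ins for the real role checks and storage operations:

```python
from fastapi import HTTPException


async def delete_crawls(db, storage, org, user, crawl_ids: list) -> dict:
    for crawl_id in crawl_ids:
        crawl = await db.crawls.find_one({"_id": crawl_id, "oid": org.id})
        if not crawl:
            raise HTTPException(status_code=404, detail="crawl_not_found")
        # 'crawler'-role users may only delete their own crawls
        if not org.is_owner(user) and crawl.get("userid") != user.id:
            raise HTTPException(status_code=403, detail="not_allowed")
        # delete files from storage as well as the db record
        for file_ in crawl.get("files", []):
            await storage.delete_file(file_["filename"])
    res = await db.crawls.delete_many({"_id": {"$in": crawl_ids}, "oid": org.id})
    return {"deleted": res.deleted_count}
```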
Tessa Walsh
14b349443f
Make pending invites expire via TTL index (#568)
* Make invites expire after a configurable window

The value can be set via the EXPIRE_AFTER_SECONDS env var and
helm chart values, and defaults to 7 days.

* Create nightly test CI and add invite expiration test to it

* Update 404 error message for missing or expired invite

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-02-14 16:07:14 -05:00
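Invite expiration here relies on a MongoDB TTL index, which deletes documents once a datetime field passes the configured age. A minimal sketch, assuming the invite documents carry a 'created' datetime:

```python
import os

# window defaults to 7 days, matching the helm chart default
EXPIRE_AFTER_SECONDS = int(os.environ.get("EXPIRE_AFTER_SECONDS", 60 * 60 * 24 * 7))


async def init_invites_index(invites_coll) -> None:
    # Mongo's TTL monitor removes each invite once its 'created'
    # timestamp is older than expireAfterSeconds; the field must be
    # a BSON date for the index to apply
    await invites_coll.create_index("created", expireAfterSeconds=EXPIRE_AFTER_SECONDS)
```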
Tessa Walsh
103d91556f
Remove non-org-scoped invites from backend (#585)
* Remove non-org-scoped invites
- remove POST /users/invite and related tests
- remove GET /users/invite-delete/{token}
2023-02-08 18:56:28 -08:00
Tessa Walsh
b642c53c59
Make crawlconfig name optional (#588) 2023-02-08 18:38:15 -08:00
Tessa Walsh
ce8f426978
Add notes to crawl and crawl updates (#587) 2023-02-08 18:36:22 -08:00
Ilya Kreymer
40fb04b385
backend: /orgs/<id>/remove: return 404 if org user doesn't exist, fix… (#561)
* backend: /orgs/<id>/remove: return 404 if org user doesn't exist, fixes issue in #535

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-02-08 16:22:36 -05:00
Tessa Walsh
a7a18b9db0
Add org-specific delete invite endpoint (#575)
Adds POST /orgs/{oid}/invites/delete, which expects the invited
email address in the POST body.

This endpoint will also delete duplicate invites with the same
email/oid combination if env var ALLOW_DUPE_INVITES allows dupes.
2023-02-08 16:10:09 -05:00
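A sketch of the endpoint's shape; the router wiring and the module-level 'invites_coll' handle are assumptions:

```python
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter()
invites_coll = None  # assumed: a motor collection, set during app startup


class RemoveInvite(BaseModel):
    email: str


@router.post("/orgs/{oid}/invites/delete")
async def delete_invite(oid: str, invite: RemoveInvite):
    # delete_many also sweeps up duplicate invites for the same
    # email/oid combination (possible when ALLOW_DUPE_INVITES is set)
    res = await invites_coll.delete_many({"email": invite.email, "oid": oid})
    if res.deleted_count == 0:
        raise HTTPException(status_code=404, detail="invite_not_found")
    return {"removed": True, "count": res.deleted_count}
```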
Tessa Walsh
95155e6fbf
Invite token improvements (#564)
- URL decode email address in invites.invite_user
- Add tests for accepting invites
2023-02-07 20:40:28 -08:00
Tessa Walsh
6d424a1ae0
Serialize pending invites to return "id" not "_id" (#559) 2023-02-06 12:28:11 -05:00
Ilya Kreymer
67df783885 bump version to 1.2.1-beta.0 2023-02-05 12:27:45 -08:00
Ilya Kreymer
af7ba4c90a version: update to 1.2.0 2023-02-02 23:46:23 -08:00
Tessa Walsh
2e3b3cb228
Add API endpoint to update crawl tags (#545)
* Add API endpoint to update crawls (tags only for now)
* Allow setting tags to empty list in crawlconfig updates
2023-02-01 22:24:36 -05:00
Tessa Walsh
23022193fb
Reformat backend for black 23.1.0 (#548) 2023-02-01 20:01:09 -05:00
Tessa Walsh
58aafc4191
Make API updates for member updates (#541)
* Add API endpoint that lists pending invites for all orgs (superuser-only)
* Add API endpoint that lists pending invites for org
* Add user emails to /api/orgs/<oid> response
2023-02-01 16:44:00 -05:00
Ilya Kreymer
9048d46c6c backend: add extraHops to support #543 2023-02-01 13:21:26 -08:00
Tessa Walsh
7d25565ef4
Add org role to /users/me-with-orgs (#536)
* Add org role to /users/me-with-orgs
* Add SUPERADMIN role and return in /me-with-orgs for superusers
2023-01-31 16:27:13 -05:00
Tessa Walsh
6cb79b580a
Fix issue where users are added to default org as admin (#534)
Users should only be added to the default org with Owner permissions
if they are not specifically being invited to another org. This commit
fixes the logic in the post-registration callback to make this the case.
2023-01-31 12:55:31 -08:00
Ilya Kreymer
6df31e13ab
backend: profile api: return additional data in profile /browser/<id> endpoint (#537)
supports #533 , switching to client side rendering from VNC websocket
2023-01-31 11:58:50 -08:00
Tessa Walsh
2e6bf7535d
Add support for tags to update_crawl_config API endpoint (#521)
* Add test for updating crawlconfigs
2023-01-30 21:46:54 -08:00
Tessa Walsh
231c37108c
Handle DuplicateKeyError on org rename requests (#514)
* Handle DuplicateKeyError on org rename requests
2023-01-25 17:46:35 -08:00
Tessa Walsh
9f0abd6a28
Only drop indexes if migrations are run (#515) 2023-01-25 17:46:10 -08:00
Tessa Walsh
0486d50fe9
Add new /users/me-with-orgs API endpoint (#510) 2023-01-24 10:23:30 -05:00
Tessa Walsh
31e7939cba
Add new API user management endpoints (#511)
- Remove user from org
- Delete user invite
2023-01-23 17:03:07 -08:00
Tessa Walsh
c0e2ec6155
Fix logic for creating pidfile parent dir (#512) 2023-01-23 17:02:25 -08:00
Ilya Kreymer
ccd87e0dff
Rename api / nginx settings -> backend / frontend, set pull policy job images (#504)
* rename config values
- api -> backend
- nginx -> frontend

* job pods:
- set job_pull_policy from api_pull_policy (same as backend image)
- default to Always, but can be overridden for local deployment (same as backend image)

typo fix: CRAWL_NAMESPACE -> CRAWLER_NAMESPACE (part of #491)
ansible: set default label to :latest instead of :dev
2023-01-18 20:21:36 -08:00
Ilya Kreymer
1dfa494210
backend: add default behavior time to /api/settings (part of #321) (#499) 2023-01-18 14:52:15 -08:00
Tessa Walsh
0fa60ebc45
Rename archives/teams -> orgs in codebase + add db migration (#486)
* Rename archives to orgs and aid to oid on backend

* Rename archive to org and aid to oid in frontend

* Remove translation artifact

* Rename team -> organization

* Add database migrations and run once on startup

* This commit also applies the new by_one_worker decorator to other
asyncio tasks to prevent heavy tasks from being run in each worker.

* Run black, pylint, and husky via pre-commit

* Set db version and use in migrations

* Update and prepare database in single task

* Migrate k8s configmaps
2023-01-18 14:51:04 -08:00
Ilya Kreymer
d028b93412
backend: password related fixes: (#479)
- mongodb: support passwords with '@' by escaping mongo username and password
- superadmin: update superadmin email and password after initial creation if updated in helm values
2023-01-13 18:22:50 -08:00
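The '@' fix follows standard pymongo guidance: percent-escape the username and password before building the connection URL. A small sketch:

```python
from urllib.parse import quote_plus


def mongo_url(user: str, password: str, host: str = "localhost") -> str:
    # quote_plus escapes '@', ':' and other reserved characters so they
    # cannot break the mongodb:// URL structure
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:27017"


print(mongo_url("admin", "p@ssw:rd"))
# mongodb://admin:p%40ssw%3Ard@localhost:27017
```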
Ilya Kreymer
bc67cc8443
backend: registration: (#472)
- if registration is enabled, newly registered users get added to the default org, instead of getting their own org/archive
2023-01-13 00:03:37 -08:00
Ilya Kreymer
827b643262
backend: add 'allow_dupe_invites' option to allow re-inviting users. if not set (default), duplicate invites will result in errors (#471) 2023-01-12 23:25:48 -08:00
Ilya Kreymer
4dbca8c421
email sending tweaks: (#470)
- support 'reply-to' email field in values, and in ansible-based values
- set 'subject' for different types of messages
2023-01-12 23:25:23 -08:00
Ilya Kreymer
a916322c30
ansible: digitalocean tweaks: (#469)
* ansible: digitalocean tweaks:
- add org_name to template
- better check for db existence
- simplify domain, fix default_org

chart:
- make job images pull IfNotPresent
2023-01-12 23:11:20 -08:00
Ilya Kreymer
2daa742585
Copy tags from crawlconfig to crawl (#467), fixes #466
- add tags to crawl object
- ensure tags are copied from crawlconfig to crawl when crawl is created (both manually and scheduled)
- tests: add test to ensure tags are added to crawl, remove redundant wait, now replaced with fixtures
2023-01-12 17:46:19 -08:00
Tessa Walsh
49460bb070
Add default organization + invite to default org (#465), #455
- Add default switch to Archive (org) model
- Set default org name via values.yaml
- Add check to ensure only one org with default org name exists
- Stop creating new orgs for new users
- Add new API endpoints for creating and renaming orgs (part of #457)
- Make Archive.name unique via index
- Wait for db connection on init, log if waiting
- Make archive-less invites invite user to default org with Owner role
- Rename default org from chart value if changed
- Don't create new org for invited users
2023-01-12 16:44:18 -08:00
Ilya Kreymer
5efeaa58b1
API filters by user + crawl collection ids (#462)
backend: object filtering:
- add filtering crawls, crawlconfigs and profiles by userid= query arg, fixes #460
- add filtering crawls by crawlconfig via cid= query arg, fixes #400 
- tests: add test_filter_results test suite to test filtering crawls and crawlconfigs by user, also create user with 'crawler' permissions, run second crawl with that user.
2023-01-11 16:50:38 -08:00
Ilya Kreymer
7b5d82936d
backend: initial tags api support (addresses #365): (#434)
* backend: initial tags api support (addresses #365):
- add 'tags' field to crawlconfig (array of strings)
- allow querying crawlconfigs to specify multiple 'tag' query args, eg. tag=A&tag=B
- add /archives/<aid>/crawlconfigs/tags api to query by distinct tag, include index on aid + tag
tests: add tests for adding configs, querying by tags
tests: fix fixtures to retry login if initial attempt fails, use test seed of https://webrecorder.net instead of https://example.com/
2023-01-11 13:29:35 -08:00
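A sketch of the two endpoints described above; FastAPI collects repeated query params (?tag=A&tag=B) into a list. Whether tags combine with $all or $in is an assumption here, as is the module-level 'configs_coll' handle:

```python
from typing import List, Optional

from fastapi import APIRouter, Query

router = APIRouter()
configs_coll = None  # assumed: a motor collection, set during app startup


@router.get("/archives/{aid}/crawlconfigs")
async def list_configs(aid: str, tag: Optional[List[str]] = Query(None)):
    query = {"aid": aid}
    if tag:
        query["tags"] = {"$all": tag}  # match configs carrying every requested tag
    return await configs_coll.find(query).to_list(1_000)


@router.get("/archives/{aid}/crawlconfigs/tags")
async def list_tags(aid: str):
    # backed by the aid + tag index mentioned above
    return await configs_coll.distinct("tags", {"aid": aid})
```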
Ilya Kreymer
edfb1bd513
quickfix: pydantic / lint fix (#452)
* backend: use latest pydantic again, fix pylint with custom .pylintrc (as suggested in pydantic/pydantic#1961)
2023-01-10 18:54:11 -08:00
Ilya Kreymer
56a6d7a5d8
Backend lint check (#451)
- apply lint + format fixes to backend
- add ci for lint + format fixes for backend
- use fixed version of pydantic
2023-01-10 16:17:06 -08:00
Ilya Kreymer
30bda8c75d
VNC-Based Profile Browser (#433)
* profile browser vnc support + fixes:
- switch profile browser rendering to use VNC
- frontend: add @novnc/novnc as dependency, create separate bundle novnc.js to load into vnc browser (to avoid loading from each container)
- frontend: update proxy paths to proxy websocket, index page to crawler
- frontend: allow browser profiles in all browsers, remove browser compatibility check
- frontend: update webpack dev config, apply prettier
- frontend: node version fix
- backend: get vncpassword, build new URL for proxying to crawler iframe
- backend: fix profile / crawl job pull policy from 'Always' -> 'Never', should use existing image for job
- backend: fix kill signal to use bash -c to work with latest backend image
- backend/chart: add 'profile_browser_timeout_seconds' to chart values to control how long the profile browser remains when idle (defaults to 60)
- backend: remove utils.py, now using secrets.token_hex() for random suffix
Co-authored-by: sua yoo <sua@suayoo.com>
2023-01-10 14:42:42 -08:00
Tessa Walsh
d1b59c9bd0
Use archive_viewer_dep permissions to GET crawls (#443)
* Use archive_viewer_dep permissions to GET crawls

* Add is_viewer check to archive_dep

* Add API endpoint to add new user to archive directly (/archive/<id>/add-user)

* Add tests

* Refactor tests to use fixtures

* And remove login test that duplicates fixtures
2023-01-09 19:11:53 -08:00
Ilya Kreymer
dfca09fc9c
Add single crawl info api at /crawls/{crawl_id} (#418)
* backend: crawl info apis:
- add /crawls/{crawl_id} api endpoint which just lists the crawl info, without resolving the individual files
- move /crawls/{crawl_id}.json -> /crawls/{crawl_id}/replay.json for clarity that it's used for replay

* frontend: update api for new replay.json endpoint
2022-12-19 14:54:48 -08:00
sua yoo
28346e0a54
New create crawl config user workflow (#391) 2022-12-12 13:50:33 -08:00
Ilya Kreymer
61c63d0be9
Remove Code and Configs for Swarm/podman support (#407)
- remove swarm / podman support
- remove docker-compose.yml, btrixcloud.swarm package, and podman/swarm scripts from scripts/ dir
- remove python-on-whales
- add error if not running in k8s
2022-12-08 18:19:58 -08:00
Ilya Kreymer
2d93cef966
CI: Add K3D CI test (#405)
- add testing with K3D cluster
- bump backend image to python 3.10-slim for newer python, smaller image.
- bump to 1.2.0-beta.0
2022-12-07 23:26:16 -08:00
Ilya Kreymer
0aa09be8c3
README + CHANGES + doc tweaks for 1.1.0 release (#402)
- update README + docs with deprecation of non-k8s deployment
- add CHANGES.md
- bump version to 1.1.0
2022-12-06 12:27:27 -08:00
Ilya Kreymer
829548af0f doc tweaks:
- fix typos in docs
- update prod deployment info
- update minikube info
- add info on how to run with local images
- bump version to 1.1.0-beta.3 for testing multiarch build
2022-12-05 18:14:19 -08:00
Ilya Kreymer
82ffc0dfbc
Local Deployment Work: Support running locally + test cluster on CI (#396)
* k8s local deployment work:
- make it easier to deploy w/o ingress by setting 'local_service_port' (suggested port 30870)
- if using local minio, ensure file endpoints are set to /data/ and that /data/ proxies correctly to the local bucket
- if not using minio, ensure file endpoints point to correct access / endpoint url.
- setup should work with docker desktop, minikube, microk8s and k3s!
- nginx chart: bump nginx memory limit to 20Mi
- nginx image: 00-default-override-resolver-config -> 00-browsertrix-nginx-init for clarity
- nginx image: use default nginx.conf, pin to nginx 1.23.2
- mongo: readd readiness probe, bump connect wait timeout (needed for ci)
- config: set superadmin username to 'admin'
- config schema: set 'name' as required 
- add sample chart values overrides:
- chart values: local-config.yaml for running locally with 'local_service_port'
- chart values: add microk8s-hosted.yaml for configuring a hosted microk8s setup
- chart values: add microk8s-ci.yaml for ci tests
- ci: remove docker swarm tests
- ci: add microk8s integration tests: launching cluster, logging in, running a crawl of example.com, downloading/checking WACZ
- bump to 1.1.0-beta.2
2022-12-02 19:58:34 -08:00
Ilya Kreymer
aabb0b2a92
chart / deployment fixes to run on microk8s: (fixes #385) (#387)
- ingress: fix proxying /data to minio, use another ingress which proxies the correct host to ensure presigned urls work
- presigning: determine whether to sign with the endpoint url (minio) or access endpoint (cloud bucket) based on whether an access endpoint is provided, set bool on storage object (see sketch below)
- chart: fix indent on incorrect storageClassName configs
- ingress: make 'ingress_class' configurable (set to 'public' for microk8s, default to 'nginx')
- minio: use older minio image which supports legacy fs based setup (for now)
- nginx service: add 'nginx_service_use_node_port' config setting: if true, will use NodePort for the frontend, otherwise will use the default (ClusterIP), and only for the frontend / nginx
- chart: remove changing service type for other services
2022-11-30 09:21:58 -08:00
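A sketch of the presigning decision described above, using boto3; the storage attribute names are illustrative. Note the access-endpoint swap only works if that endpoint proxies back to the signing host:

```python
import boto3


def get_presigned_url(storage, bucket: str, key: str, duration: int = 3600) -> str:
    client = boto3.client(
        "s3",
        endpoint_url=storage.endpoint_url,  # minio or cloud bucket endpoint
        aws_access_key_id=storage.access_key,
        aws_secret_access_key=storage.secret_key,
    )
    url = client.generate_presigned_url(
        "get_object", Params={"Bucket": bucket, "Key": key}, ExpiresIn=duration
    )
    if storage.access_endpoint_url:
        # cloud bucket: expose the public access endpoint instead of the
        # internal signing endpoint
        url = url.replace(storage.endpoint_url, storage.access_endpoint_url)
    return url
```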
Ilya Kreymer
afe536e568 version: bump to 1.1.0-beta.1 2022-11-23 23:37:33 -08:00
Ilya Kreymer
d6386b7051
Release Build + Versioning (#373)
- Adds version to version.txt in root
- adds update-version.sh which updates version in frontend/package.json and backend/btrixcloud/version.py
- frontend: loads version from $VERSION env var, ../version.txt or package.json
- ci: on new github release, pushes webrecorder/browsertrix-backend and webrecorder/browsertrix-frontend images to Dockerhub with current version, as well as latest.
- version set to 1.1.0-beta.0
- closes #357
2022-11-18 17:15:25 -08:00
Ilya Kreymer
704838f562 backend: add 'lang' and 'blockAds' field to raw config, in prep for future work, and fixes #369 2022-11-18 17:09:31 -08:00
Ilya Kreymer
793611e5bb
add exclusion api, fixes #311 (#349)
* add exclusion api, fixes #311
add new apis: `POST crawls/{crawl_id}/exclusion?regex=...` and `DELETE crawls/{crawl_id}/exclusion?regex=...` which will:
- create new config with 'regex' added as an exclusion (deleting or deactivating the previous config), OR with the exclusion removed.
- update crawl to point to new config
- update statefulset to point to new config, causing crawler pods to restart
- filter out urls matching 'regex' from both queue and seen list (when adding only; currently a bit slow)
- return 400 if exclusion already existing when adding, or doesn't exist when removing
- api reads redis list in reverse to match how exclusion queue is used
2022-11-12 17:24:30 -08:00
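A rough sketch of the queue-filtering step when an exclusion is added; the redis key layout and plain-URL entries are illustrative (the crawler's actual queue entries may be structured differently):

```python
import re

import redis.asyncio as redis


async def filter_queue(r: redis.Redis, queue_key: str, regex: str) -> int:
    compiled = re.compile(regex)
    removed = 0
    # read the list in reverse to match how the exclusion queue is used
    for entry in reversed(await r.lrange(queue_key, 0, -1)):
        url = entry.decode() if isinstance(entry, bytes) else entry
        if compiled.search(url):
            # count=1 removes the first element equal to 'entry'
            removed += await r.lrem(queue_key, 1, entry)
    return removed
```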
Ilya Kreymer
d340bceb39 style pass: normalize docstring spacing 2022-10-19 21:47:34 -07:00