browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	b2a5dbf2cd	enable screenshots by default + fix py version formatting (#1518 ) configmap: add --screenshot thumbnail,view as default screenshots version: update update-version.sh to add newline in version.py to match new black formatting (from changes in #1507) Fixes #1519	2024-02-07 17:07:28 -08:00
Tessa Walsh	07fa46d9aa	Add custom user agent to workflows (#1465 ) Fixes #1341 Adds "User Agent" field to workflow editor under the Browser Settings tab. If not set, the crawler will use the browser's default user agent. Also added to docs and to the workflow details page (if set). --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-01-17 17:33:50 -05:00
Ilya Kreymer	90197b2a85	Backend mem usage fix - use fixed MOTOR_MAX_WORKERS + switch to gunicorn (#1468 ) Refactors backend deployment to: - Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by mongodb connections - Also sets backend workers to 1 by default to reduce default memory usage - Switches to gunicorn with uvloop worker for production use instead of uvicorn (as recommended by uvicorn) Lower thread count should address memory leak/increased usage, which resulted in 5x thread x cpus x workers, eg. potentially 20 or 40 threads just for mongodb connections. Lower default number of workers should make it easier to scale backend with HPA if additional capacity. Fixes #1467	2024-01-16 15:32:42 -08:00
Tessa Walsh	032859f361	Support multiple crawler versions (#1420 ) Fixes #1385 ## Changes Supports multiple crawler 'channels' which can be configured to different browsertrix-crawler versions - Replaces `crawler_image` in helm chart with `crawler_channels` array similar to how storages are handled - The `default` crawler channel must always be provided and specifies the default crawler image - Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint to fetch information about available crawler versions (name, image, and label) and test - Adds crawler channel select to workflow creation/edit screens and profile creation dialog, and updates related API endpoints and configmaps accordingly. The select dropdown is shown only if more than one channel is configured. - Adds `crawlerChannel` to workflow and crawl details. - Add `image` to crawler image, used to display actual image used as part of the crawl. - Modifies `crawler_crawl_id` backend test fixture to use `test` crawler version to ensure crawler versions other than latest work - Adds migration to add `crawlerChannel` set to `default` to existing workflow and profile objects and workflow configmaps --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-01-16 15:32:12 -08:00
Ilya Kreymer	b23eed5003	Email Templates (#1375 ) - Emails are now processed from Jinja2 templates found in `charts/email-templates`, to support easier updates via helm chart in the future. - The available templates are: `invite`, `password_reset`, `validate` and `failed_bg_job`. - Each template can be text only or also include HTML. The format of the template is: ``` subject ~~~ <html content> ~~~ text ``` - A new `support_email` field is also added to the email block in values.yaml Invite Template: - Currently, only the invite template includes an HTML version, other templates are text only. - The same template is used for new and existing users, with slightly different text if adding user to an existing org. - If user is invited by the superadmin, the invited by field is not included, otherwise it also includes 'You have been invited by X to join Y'	2023-11-15 15:22:12 -08:00
Ilya Kreymer	ff10124d01	charts cleanup: (#1360 ) - move authsign secret to signer and make port configurable - rename storages to more general ops-configs - put 'storages.json' path into env var - rename backend secret to backend-auth - cronjobs: don't keep succeeded jobs around, triggers operator update	2023-11-08 19:24:00 -08:00
Ilya Kreymer	d2d7240455	background jobs fix: ensure bucket is parsed correctly (#1359 ) Follow-up to #1321 - correctly parse the endpoint_url into prefix and bucket path - also add region and s3 provider type to storage secrets	2023-11-08 15:08:23 -08:00
Ilya Kreymer	5530ca92e1	Move backend app templates to be installed from configmap volume (#1331 ) Instead of adding the app templates launched from the backend via `backend/btrixcloud/templates`, add them to a configmap and mount the configmap in the same location. This allows these templates to be updated, like other values in charts/... without having to rebuild any of the images, speeding up dev and maintenance time. Changes include: - move backend/btrixcloud/templates -> chart/app-templates/ - add app-templates/*.yaml to app-templates configmap - mount app-templates configmap to /app/btrixcloud/templates/ in api and op containers	2023-11-06 09:37:48 -08:00
Francesco Servida	0b8bbcf8e6	Allow User to specify custom cluster-issuer (#1332 ) Implemented variable and defaults for cluster-issuer to allow users to specify, if needed, their own cluster issuer. (eg. installations with only outbound traffic that cannot solve ACME https challenge)	2023-11-04 13:29:17 -07:00
Francesco Servida	4998274ab0	correctly suffix Auth-Signer url when running in custom namespace (#1335 )	2023-11-04 10:34:05 -07:00
Ilya Kreymer	fb3d88291f	Background Jobs Work (#1321 ) Fixes #1252 Supports a generic background job system, with two background jobs, CreateReplicaJob and DeleteReplicaJob. - CreateReplicaJob runs on new crawls, uploads, profiles and updates the `replicas` array with the info about the replica after the job succeeds. - DeleteReplicaJob deletes the replica. - Both jobs are created from the new `replica_job.yaml` template. The CreateReplicaJob sets secrets for primary storage + replica storage, while DeleteReplicaJob only needs the replica storage. - The job is processed in the operator when the job is finalized (deleted), which should happen immediately when the job is done, either because it succeeds or because the backoffLimit is reached (currently set to 3). - /jobs/ api lists all jobs using a paginated response, including filtering and sorting - /jobs/<job id> returns details for a particular job - tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints - tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 13:02:17 -07:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296 ) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contain a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Tessa Walsh	d58747dfa2	Provide full resources in archived items finished webhooks (#1308 ) Fixes #1306 - Include full `resources` with expireAt (as string) in crawlFinished and uploadFinished webhook notifications rather than using the `downloadUrls` field (this is retained for collections). - Set default presigned duration to one minute short of 1 week and enforce maximum supported by S3 - Add 'storage_presign_duration_minutes' commented out to helm values.yaml - Update tests --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-23 19:01:58 -07:00
Anish Lakhwara	834fa72baf	Refactor microk8s playbook to follow "new" structure (#1264 ) * Refactor microk8s playbook to follow structure with shared roles - Integrates with btrix/deploy role for deploying - Separated RedHat and Debian into separate roles - Created Common role - allow running remotely by default - use 'browsertrix_cloud_home' for charts path - add additional customizable options to btrix_values.j2 (todo: unify all the templates) - docs: update to new playbook path --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-11 19:33:30 -07:00
Ilya Kreymer	16e7a1d0a2	Storage Ops Refactor (#1257 ) * storage ops refactor: - create StorageOps class similar to other ops classes - init storages list in StorageOps, no longer require lookup up default storages via CrawlManager - convert all storage functions to members, add storageops to operator - remove unused params, ensure crawl exists for rollover restart - add env var to determine if using local minio to use correct endpoint URL * crawls /seeds endpoint: just return empty list if not a crawl (eg. upload) * crawlmanager: remove unused code, rename check_storage -> has_storage	2023-10-10 15:04:23 -07:00
Ilya Kreymer	fa86555eed	Track pod resource usage, detect OOM crashes, handle auto-scaling (#1235 ) * keep track of per pod status on crawljob: - crashes time, and reason - 'used' vs 'allocated' resources - 'percent' used / allocated * crawl log errors: log error when crawler crashes via OOM, either via redis error log or to console * add initial autoscaling support! - detect if metrics server is available via K8SApi.is_pod_metrics_available() - if available, use metrics for 'used' fields - if no metrics, set memory used for redis only (using redis apis) - allow overriding memory and cpu via newMemory and newCpu settings on pod status - scale memory / cpu based on newMemory and newCpu setting - templates: update jinja templates to allow restarting crawler and redis with new resources - ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs * roles: cleanup unused roles, add permissions for listing metrics * stats for running crawls: - update in db via operator - avoids losing stats if redis pod happens to be done - tradeoff is more db access in operator, but less extra connections to redis + already loading from db in backend - size stat: ensure size of previous files is added to the stats * crawler deployment tweaks: - adjust cpu/mem per browser - add --headless flag to configmap to use new headless mode by default!	2023-10-05 20:41:18 -07:00
Ilya Kreymer	86a424af93	migration improvements: (#1228 ) * migration improvements + rerunning migrations: (fixes #1227) - avoid starting some workers while migration is still running - ensure workers that aren't performing migration await for migration to complete - backend will not be valid until migration is run * allow rerunning migration from specified version via --set rerun_from_migration=<VERSION> (replaces rerun_last_migration)	2023-09-28 12:04:19 -07:00
Ilya Kreymer	c9c39d47b7	Scheduled Crawl Refactor: Handle via Operator + Add Skipped Crawls on Quota Reached (#1162 ) * use metacontroller's decoratorcontroller to create CrawlJob from Job * scheduled job work: - use existing job name for scheduled crawljob - use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done - simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls, placeholder job image for cronjobs * move storage quota check to crawljob handler: - add 'skipped_quota_reached' as new failed status type - check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created) * frontend: - show all crawls in crawl workflow, no need to filter by status - add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed * migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template	2023-09-12 13:05:43 -07:00
Ilya Kreymer	ad9bca2e92	Operator refactor to control pods + pvcs directly instead of statefulsets (#1149 ) - Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state. - Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion. - Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl - Create priority classes up to 'max_crawl_scale', configured in values.yaml - Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale, gracefully stop the scaled-down instances via redis 'stopone' key so they can complete, wait until they exit with Completed state before adjusting status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted. - Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds - Configurable Redis storage with 'redis_storage' value, set to 3Gi by default - CrawlJob deletion starts as soon as post-finish crawl operations are run - Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer - Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished) - Current resource usage added to status - Profile browser: also manage single pod directly without statefulset for consistency. - Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0) - Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0) - Failed crawl logging: add 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status - tests: check other finished states to avoid stuck in infinite loop if crawl fails - tests: disable disk utilization check, which adds unpredictability to crawl testing! fixes #1147 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-11 10:38:04 -07:00
Anish Lakhwara	e57148d0e9	feat: add SMTP {port, use_tls} config (#1142 ) * feat: add SMTP {port, use_tls} config * If `password` is None don't attempt to log in * remove 'can be omitted' comment --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-09-08 08:18:36 -07:00
Ilya Kreymer	2967f1e320	ingress: simplify ingress config: (fixes #1135 ) (#1146 ) * ingress: simplify ingress config: (fixes #1135) - use standard Prefix pathTypes - remove nginx-specific rewriting - remove 'scheme', use https/http based on 'tls' setting (in ingress and configmap) - fix signing ingress to use ingressClassName	2023-09-07 09:51:48 -07:00
Ilya Kreymer	68bc053ba0	Print crawl log to operator log (mostly for testing) (#1148 ) * log only if 'log_failed_crawl_lines' value is set to number of last lines to log from failed container --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-06 17:53:02 -07:00
Ilya Kreymer	38f596fd81	chart: move minio credentials to separate secret, part of #490 (#1143 )	2023-09-06 17:35:30 -07:00
Ilya Kreymer	dce1ae6129	better resources scaling by number of browsers per crawler container (#1103 ) - set crawler cpu / memory with fixed base + incremental bumps based on number of browsers - allow parsing k8s quantities with parse_quantity, compute in operator - set 'crawler_cpu = crawler_cpu_base + crawler_extra_cpu_per_browser * (num_browsers - 1)' and same for memory	2023-09-06 01:42:44 -04:00
Ilya Kreymer	6dca2f1c03	supports overriding the replayweb.page version without having to be r… (#1122 ) * supports overriding the replayweb.page version without having to rebuild the frontend image: - ensures 'rwp_base_url' from helm chart is passed to nginx - ensures both ui.js and sw.js are loaded based on nginx environment variable, not hard-coded - ui.js loaded via redirect from new /replay/ui.js path - pin RWP to known working release in default values.yaml - remove RWP_BASE_URL from Dockerfile, no longer needed, set via chart env var - set default RWP_BASE_URL for devserver to use CDN - set RWP version to 1.8.11	2023-09-05 20:10:21 -04:00
Ilya Kreymer	a9ab17fc61	publish helm chart on release (fixes #1114 ) (#1117 ) (#1123 ) - no longer using :latest by default in values.yaml, instead updating version with each release - set chart version to match app version in Chart.yaml - update version in helm chart and values.yaml as part of update-version.sh script - update test.yaml and local-config.yaml to enable using :latest tag images - ci: add ci script for packaging current helm chart - docs: updates docs to indicate deploying directly from GitHub release - docs: add script to fill in latest version for 'VERSION' using custom script - chart: set local_service_port to 30870 by default, but use only if no ingress. - default values.yaml set up for local deployment, local-config.yaml contains additional commented out examples - ci draft: add deployment info to draft with helm install command for current version - test: fix password check test	2023-08-30 12:02:02 -07:00
Ilya Kreymer	989ed2a8da	Use Shared Services for Crawling, Redis, Profile Browsers (#1088 ) * refactor to use shared role-based service shared across pods: - 'crawler' service for all crawler screencasting, scales 0 .. N with crawler-<ID>-N.crawl - 'redis' service for all redis access, redis-<ID>-0.redis - 'browser' service for all browser access (profile browsers), browser-<ID>-0.browser - don't create a new service per crawl/profile at all - enable 'publishNotReadyAddresses' for potentially faster resolving, esp for redis - remove service as type managed by operator as no longer creating services dynamically - remove frontend var CRAWLER_SVC_SUFFIX, suffix always '.crawler' to match crawler service name	2023-08-24 20:08:53 -07:00
Ilya Kreymer	63b776bce8	ingress: minor tweaks to ingress to update to latest spec: (#1096 ) - use pathType ImplementationSpecific for regexes - use ingressClassName instead of annotation	2023-08-23 11:36:52 -07:00
Ilya Kreymer	9553115bbe	helm chart tweaks: (#1067 ) * helm chart tweaks: - lower mem requirements for backend and crawler - disable cors in ingress to pass through cors headers from backend - crawler statefulset: use ordered instead of parallel scaling policy to avoid single crawl taking up all crawling capacity quickly	2023-08-14 16:43:12 -07:00
Ilya Kreymer	7ea6d76f10	Resource Constraints Cleanup: (fixes #895 ) (#1019 ) * resource constraints: (fixes #895) - for cpu, only set cpu requests - for memory, set mem requests == mem limits - add missing resource constraints for minio and scheduled job - for crawler, set mem and cpu constraints per browser, scale based on browser instances per crawler - add comments in values.yaml for crawler values being multiplied - default values: bump crawler to 650 millicpu per browser instance just in case cleanup: remove unused entries from main backend configmap	2023-08-01 00:11:16 -07:00
Vinzenz Sinapius	5807507f29	Add proxy settings for crawler and profilebrowser (#997 )	2023-07-26 16:11:10 -07:00
Ilya Kreymer	4bea7565bc	load handling: scale up redis only when crawler pods running (#1009 ) Operator: Modified init behavior to only load redis when at least one crawler pod available: - waits for at least one crawler pod to be available before starting redis pod, to avoid situation where many crawler pods are in pending mode, but redis pods are still running. - redis statefulset starts at scale of 0 - once crawler pod becomes available, redis sts is scaled to 1 (via `initRedis==true` status) - crawl remains in 'starting' or 'waiting_capacity' state until pod becomes available without redis pod running - set to 'running' state only after redis and at least one crawler pod is available - if no crawler pods available after running, or, if stuck in starting for >60 seconds, switch to 'waiting_capacity' state - when switching to 'waiting_capacity', also scale down redis to 0, wait for crawler pod to become available, only then scale up redis to 1, and get back to 'running' other tweaks: - add new status field 'initRedis', default to false, not displayed - crawler pod: consider 'ContainerCreating' state as available, as container will not be blocked by resource limits - add a resync after 3 seconds when waiting for crawler pod or redis pod to become available, configurable via 'operator_fast_resync_secs' - set_state: if not updating state, ensure state reflects actual value in db	2023-07-26 08:40:05 -07:00
Ilya Kreymer	d7cb47390e	readd support for passing in 'crawler_extra_args' for additional/custom (#957 ) options not covered by standard crawler opts (removed setting all args this way in #889)	2023-07-07 12:08:40 -07:00
Ilya Kreymer	00eb62214d	Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937 ) * basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives - move shared functionality to basecrawl.py - create a base BaseCrawl object, which contains start / finish time, metadata and files array - create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove * uploads api: (part of #929) - new UploadCrawl object which extends BaseCrawl, has name and description - support multipart form data upload to /uploads/formdata - support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts - require 'filename' param to set upload filename for streaming uploads (otherwise use form data names) - sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz - uploads have internal id 'upload-<uuid>' - create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete' - handle upload failures, abort multipart upload - ensure uploads added within org bucket path - return id / added when adding new UploadedCrawl - support listing, deleting, and patch /uploads - support upload details via /replay.json to support replay - add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name') * base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources) - support all-crawls/<id>/replay.json with resources - Use ListCrawlOut model for /all-crawls list endpoint - Extend BaseCrawlOut from ListCrawlOut, add type - use 'type: crawl' for crawls and 'type: upload' for uploads - migration: ensure all previous crawl objects / missing type are set to 'type: crawl' - indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state * tests: add test for multipart and streaming upload, listing uploads, deleting upload - add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz' * collections: support adding and remove both crawls and uploads via base crawl - include collection_ids in /all-crawls list - collections replay.json can include both crawls and uploads bump version to 1.6.0-beta.2 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-07-07 09:13:26 -07:00
Ilya Kreymer	4c8de3160b	typo fix: fix extra trailing quote on CRAWL_ARGS in configmap.yaml	2023-06-16 18:55:21 -07:00
Ilya Kreymer	4428184aea	frontend: configure running with a fixed 'replay.json', auth headers passed via separate config (#899 ) wabac.js will reload the replay.json on 403 with new token (will be in next version of wabac.js) presign urls: make presign timeout configurable (in minutes), defaults to 60 mins dockerfile: fix configuring RWP_BASE_URL	2023-06-08 11:26:26 -07:00
Ilya Kreymer	dd757961fc	config: add overridable 'user_agent_suffix' and 'user_agent' to values.yaml, (#910 ) passed to crawler --userAgentSuffix and --userAgent params, respectively, using 'quote' to support spaces in user-agent. config: re-order settings to put 'Crawler Settings' section first, followed by 'Cluster Settings' fixes #787	2023-06-07 12:01:12 -07:00
Ilya Kreymer	f2b7b6bcd5	Nightly Tests Fix (#905 ) * tests: fix nightly test to account for 'waiting_capacity' state * readd missing --logErrorsToRedis flag	2023-06-02 21:47:41 -07:00
Tessa Walsh	0284903b34	Cleanup crawler args (#889 ) * crawler args cleanup: - move crawler args command line entirely to configmap - add required settings like --generateWACZ and --waitOnDone to configmap to not be overridable - values files can configure individual settings, assembled in configmap - move disk_utilization_threshold to configmap - add 'crawler_logging_opts' and 'crawler_extract_full_text' options to values.yaml to more easily set these options --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-05-30 19:29:07 -04:00
Ilya Kreymer	70319594c2	crawlconfig: fix default filename template, make configurable (#835 ) * crawlconfig: fix default filename template, make configurable - make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml - set to '@ts-@hostsuffix.wacz' by default - allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap - tests: add test for custom 'default_crawl_filename_template'	2023-05-08 14:03:27 -07:00
Ilya Kreymer	aae0e6590e	Ensure Volumes are deleted when crawl is canceled (#828 ) * operator: - ensures crawler pvcs are always deleted before crawl object is finalized (fixes #827) - refactor to ensure finalizer handler always run when finalizing - remove obsolete config entries	2023-05-05 12:05:54 -07:00
Ilya Kreymer	60ba9e366f	Refactor to use new operator on backend (#789 ) * Btrixjobs Operator - Phase 1 (#679) - add metacontroller and custom crds - add main_op entrypoint for operator * Btrix Operator Crawl Management (#767) * operator backend: - run operator api in separate container but in same pod, with WEB_CONCURRENCY=1 - operator creates statefulsets and services for CrawlJob and ProfileJob - operator: use service hook endpoint, set port in values.yaml * crawls working with CrawlJob - jobs start with 'crawljob-' prefix - update status to reflect current crawl state - set sync time to 10 seconds by default, overridable with 'operator_resync_seconds' - mark crawl as running, failed, complete when finished - store finished status when crawl is complete - support updating scale, forcing rollover, stop via patching CrawlJob - support cancel via deletion - requires hack to content-length for patching custom resources - auto-delete of CrawlJob via 'ttlSecondsAfterFinished' - also delete pvcs until autodelete supported via statefulset (k8s >1.27) - ensure filesAdded always set correctly, keep counter in redis, add to status display - optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added - add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled - add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291) - support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284) * support for scheduled jobs! 
- add main_scheduled_job entrypoint to run scheduled jobs - add crawl_cron_job.yaml template for declaring CronJob - CronJobs moved to default namespace * operator manages ProfileJobs: - jobs start with 'profilejob-' - update expiry time by updating ProfileJob object 'expireTime' while profile is active * refactor/cleanup: - remove k8s package - merge k8sman and basecrawlmanager into crawlmanager - move templates, k8sapi, utils into root package - delete all _job.py files - remove dt_now, ts_now from crawls, now in utils - all db operations happen in crawl/crawlconfig/org files - move shared crawl/crawlconfig/org functions that use the db to be importable directly, including get_crawl_config, add_new_crawl, inc_crawl_stats role binding: more secure setup, don't allow crawler namespace any k8s permissions - move cronjobs to be created in default namespace - grant default namespace access to create cronjobs in default namespace - remove role binding from crawler namespace * additional tweaks to templates: - templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately) * stats / redis optimization: - don't update stats in mongodb on every operator sync, only when crawl is finished - for api access, read stats directly from redis to get up-to-date stats - move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access * Add migration for operator changes - Update configmap for crawl configs with scale > 1 or crawlTimeout > 0 and schedule exists to recreate CronJobs - add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1 * subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart - metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart * backend api fixes - ensure changing scale 
of crawl also updates it in the db - crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api --------- Co-authored-by: D. Lee <leepro@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-24 18:30:52 -07:00
Ilya Kreymer	f6dc26eeb5	nginx: enable worker processes autotune to correctly set the number of processes for nginx, possible fix for #780 (#785 )	2023-04-21 18:13:22 -07:00
Ilya Kreymer	85b6a05419	Upgrade to mongo 6 and use sortArray for workflow crawls (#764 ) (#765 ) fixes from 1.4.1: * Upgrade to mongo 6 and use for workflow crawls * update readiness probe with timeouts doubled, and failure threshold increased for slower 'mongosh' readiness check update versions to 1.5.0-beta.0 in backend and frontend Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-11 18:22:07 -07:00
Ilya Kreymer	7f757d396a	config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend… (#742 ) * config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config - add 'default_page_load_timeout_seconds' to values.yaml, defaulting to 120, for pageLoadTimeout - add 'defaultPageLoadTimeSeconds' to /api/settings, update tests for /api/settings addresses issue in #636	2023-04-04 19:52:23 -07:00
Ilya Kreymer	1c47a648a9	Max page limit override (#737 ) * more page limit: update to #717, instead of setting --limit in each crawlconfig, apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit * update tests, no longer returning 'crawl_page_limit_exceeds_allowed'	2023-04-03 14:01:32 -07:00
Ilya Kreymer	887cb16146	Allow configurable max pages per crawl in deployment settings (#717 ) * backend: max pages per crawl limit, part of fix for #716: - set 'max_pages_crawl_limit' in values.yaml, default to 100,000 - if set/non-0, automatically set limit if none provided - if set/non-0, return 400 if adding config with limit exceeding max limit - return limit as 'maxPagesPerCrawl' in /api/settings - api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing) tests: add test for 'max_pages_per_crawl' setting - ensure 'limit' can not be set higher than max_pages_per_crawl - ensure pages crawled is at the limit - set test limit to max 2 pages - add settings test - check for pages.jsonl and extraPages.jsonl when crawling 2 pages	2023-03-28 16:26:29 -07:00
Ilya Kreymer	413fd8d7ea	Chart: split Crawl args into separate variables (#639 ) * chart crawl args cleanup: - move configurable settings out of 'crawler_args' - add 'crawler_session_size_limit_bytes' and 'crawler_session_time_limit_seconds' for --timeLimit and --sizeLimit option for crawler - remove hard-coded 'timeout' to allow configuring via crawl config - set liveness check port from existing config value - add comments that requests hd must be at least double the size limit - defaults: set crawler_requests_hd to 22GB, default crawl session size limit to 10GB	2023-02-24 17:24:04 -08:00
Ilya Kreymer	3df6e0f146	crawler arguments fixes: (#621 ) - partial fix to #321, don't hard-code behavior limit into crawler args - allow setting number of crawler browser instances via 'crawler_browser_instances' to avoid having to override the full crawler args	2023-02-22 13:23:19 -08:00
Tessa Walsh	14b349443f	Make pending invites expire via TTL index (#568 ) * Make invites expire after configurable window The value can be set in EXPIRE_AFTER_SECONDS env var and via helm chart values, and defaults to 7 days. * Create nightly test CI and add invite expiration test to it * Update 404 error message for missing or expired invite --------- Co-authored-by: sua yoo <sua@suayoo.com>	2023-02-14 16:07:14 -05:00
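
The per-browser resource formula quoted in commit dce1ae6129 (#1103), 'crawler_cpu = crawler_cpu_base + crawler_extra_cpu_per_browser * (num_browsers - 1)', can be sketched in Python as below. This is only an illustration: `parse_simple_quantity` and `scaled_resource` are hypothetical names, the parser is a simplified stand-in for the `parse_quantity` helper the commit mentions, and the example values are made up rather than taken from the chart defaults.

```python
from decimal import Decimal

# Hypothetical, simplified suffix table for Kubernetes-style quantities.
# Two-letter binary suffixes are listed first so they are matched first.
_SUFFIXES = {
    "Ki": Decimal(1024),
    "Mi": Decimal(1024) ** 2,
    "Gi": Decimal(1024) ** 3,
    "m": Decimal("0.001"),
    "k": Decimal(1000),
    "M": Decimal(1000) ** 2,
    "G": Decimal(1000) ** 3,
}


def parse_simple_quantity(value: str) -> Decimal:
    """Convert a quantity such as '900m' or '512Mi' to a plain number."""
    for suffix, factor in _SUFFIXES.items():
        if value.endswith(suffix):
            return Decimal(value[: -len(suffix)]) * factor
    return Decimal(value)


def scaled_resource(base: str, extra_per_browser: str, num_browsers: int) -> Decimal:
    """Apply: total = base + extra_per_browser * (num_browsers - 1)."""
    if num_browsers < 1:
        raise ValueError("num_browsers must be at least 1")
    extra = parse_simple_quantity(extra_per_browser) * (num_browsers - 1)
    return parse_simple_quantity(base) + extra


# Illustrative values only: 900m base CPU plus 600m per additional browser.
cpu_cores = scaled_resource("900m", "600m", 3)
# The commit says the same formula applies to memory (result here is in bytes).
memory_bytes = scaled_resource("1Gi", "512Mi", 2)
```

With a single browser the extra term is zero, so the container gets exactly the base amount.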

102 Commits
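
The email template layout described in commit b23eed5003 (#1375), subject, then HTML content, then text, separated by `~~~`, could be split as sketched below. The names `EmailParts` and `split_email_template` are hypothetical, and the sketch assumes a text-only template simply leaves the HTML segment empty, which the commit message does not confirm.

```python
from typing import NamedTuple, Optional


class EmailParts(NamedTuple):
    subject: str
    html: Optional[str]  # None when the template is text only
    text: str


def split_email_template(rendered: str, sep: str = "~~~") -> EmailParts:
    """Split an already-rendered template into subject, HTML and text."""
    parts = [part.strip() for part in rendered.split(sep)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 sections separated by {sep!r}, got {len(parts)}")
    subject, html, text = parts
    # Assumption: an empty middle section means there is no HTML version.
    return EmailParts(subject, html or None, text)


invite = split_email_template("Welcome\n~~~\n<p>Hello</p>\n~~~\nHello")
text_only = split_email_template("Reset password\n~~~\n~~~\nUse the link below")
```

Rendering the Jinja2 template first and splitting afterwards keeps the sketch independent of any templating library.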