browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	86311ab4ea	merge 1.9.5 fixes (#1637 ) retry loading profile if initial load fails, follow-up to #1604 - Add missing setTimeout to retry profile loading bump RWP to 1.8.15	2024-03-27 21:49:19 -07:00
Ilya Kreymer	4f676e4e82	QA Runs Initial Backend Implementation (#1586 ) Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498 Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch) Notable changes: - QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls. - Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable. - While running, `QARun` data is stored in a single `qa` object, while finished QA runs are added to the `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first. - Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API - Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 22:42:16 -07:00
Tessa Walsh	21ae38362e	Add endpoints to read pages from older crawl WACZs into database (#1562 ) Fixes #1597 New endpoints (replacing old migration) to re-add crawl pages to db from WACZs. After a few implementation attempts, we settled on using [remotezip](https://github.com/gtsystem/python-remotezip) to handle parsing of the zip files and streaming their contents line-by-line for pages. I've also modified the sync log streaming to use remotezip as well, which allows us to remove our own zip module and let remotezip handle the complexity of parsing zip files. Database inserts for pages from WACZs are batched 100 at a time to help speed up the endpoint, and the task is kicked off using asyncio.create_task so as not to block before giving a response. StorageOps now contains a method for streaming the bytes of any file in a remote WACZ, requiring only the presigned URL for the WACZ and the name of the file to stream.	2024-03-19 14:14:21 -07:00
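The batching described in 21ae38362e (page inserts grouped 100 at a time) can be illustrated with a minimal sketch. The helper name `batched` is hypothetical and not taken from the Browsertrix codebase. The real endpoint streams page lines out of a WACZ and writes each batch to the database, which is omitted here.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            # A full batch is ready; hand it to the caller (e.g. one bulk insert).
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly shorter, batch.
        yield batch
```

Each yielded list would correspond to one bulk insert, so 250 pages would need three database round trips rather than 250.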
Ilya Kreymer	e7af081af1	profile browser fixes: better resource usage + load retry (main) (#1604 ) - Backend: Use separate resource constraints for profiles: default profile browser resources to either 'profile_browser_cpu' / 'profile_browser_memory' or single browser 'crawler_memory_base' / 'crawler_cpu_base', instead of scaled to the number of browser workers - Frontend: check that profile html page is loading, keep retrying if still getting nginx error instead of loading an iframe with the error. Fixes #1598 (Copy of #1599 from 1.9.4)	2024-03-16 15:07:04 -07:00
Ilya Kreymer	a8e3ff1141	version: bump to 1.10.0-beta.0	2024-02-20 00:22:29 -08:00
Ilya Kreymer	c1cffe9ecd	version: bump to 1.9.1	2024-02-16 09:44:18 -08:00
Ilya Kreymer	64bf21311d	version: bump to 1.9.0!	2024-02-14 13:30:46 -08:00
Ilya Kreymer	1d266e3cea	bump to 1.9.0.beta.5	2024-02-12 18:29:39 -08:00
Ilya Kreymer	4bc8152640	version: bump to 1.9.0-beta.4	2024-02-09 16:17:13 -08:00
Ilya Kreymer	7aebce66f6	version: bump to 1.9.0-beta.3	2024-02-07 15:21:10 -08:00
Ilya Kreymer	e43feedc43	version: bump to 1.9.0-beta.2	2024-01-18 10:01:38 -08:00
Ilya Kreymer	370590b14f	version: bump to 1.9.0-beta.1	2024-01-17 14:58:25 -08:00
Tessa Walsh	07fa46d9aa	Add custom user agent to workflows (#1465 ) Fixes #1341 Adds "User Agent" field to workflow editor under the Browser Settings tab. If not set, the crawler will use the browser's default user agent. Also added to docs and to the workflow details page (if set). --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-01-17 17:33:50 -05:00
Ilya Kreymer	90197b2a85	Backend mem usage fix - use fixed MOTOR_MAX_WORKERS + switch to gunicorn (#1468 ) Refactors backend deployment to: - Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by mongodb connections - Also sets backend workers to 1 by default to reduce default memory usage - Switches to gunicorn with uvloop worker for production use instead of uvicorn (as recommended by uvicorn) Lower thread count should address memory leak/increased usage, which resulted in 5 threads x cpus x workers, eg. potentially 20 or 40 threads just for mongodb connections. Lower default number of workers should make it easier to scale backend with HPA if additional capacity is needed. Fixes #1467	2024-01-16 15:32:42 -08:00
Tessa Walsh	032859f361	Support multiple crawler versions (#1420 ) Fixes #1385 ## Changes Supports multiple crawler 'channels' which can be configured to different browsertrix-crawler versions - Replaces `crawler_image` in helm chart with `crawler_channels` array similar to how storages are handled - The `default` crawler channel must always be provided and specifies the default crawler image - Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint to fetch information about available crawler versions (name, image, and label) and test - Adds crawler channel select to workflow creation/edit screens and profile creation dialog, and updates related API endpoints and configmaps accordingly. The select dropdown is shown only if more than one channel is configured. - Adds `crawlerChannel` to workflow and crawl details. - Add `image` to crawler image, used to display actual image used as part of the crawl. - Modifies `crawler_crawl_id` backend test fixture to use `test` crawler version to ensure crawler versions other than latest work - Adds migration to add `crawlerChannel` set to `default` to existing workflow and profile objects and workflow configmaps --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-01-16 15:32:12 -08:00
Ilya Kreymer	a6936299d3	version: bump to 1.9.0-beta.0	2023-12-20 00:08:16 -08:00
Ilya Kreymer	d902cf5338	version: bump to 1.8.2	2023-12-07 13:34:37 -08:00
Ilya Kreymer	1218d6e767	version: bump to 1.8.1	2023-11-17 14:39:52 -08:00
Ilya Kreymer	b6f8c968e9	version: bump to 1.8.0	2023-11-15 17:57:43 -08:00
Ilya Kreymer	b23eed5003	Email Templates (#1375 ) - Emails are now processed from Jinja2 templates found in `charts/email-templates`, to support easier updates via helm chart in the future. - The available templates are: `invite`, `password_reset`, `validate` and `failed_bg_job`. - Each template can be text only or also include HTML. The format of the template is: ``` subject ~~~ <html content> ~~~ text ``` - A new `support_email` field is also added to the email block in values.yaml Invite Template: - Currently, only the invite template includes an HTML version, other templates are text only. - The same template is used for new and existing users, with slightly different text if adding user to an existing org. - If user is invited by the superadmin, the invited by field is not included, otherwise it also includes 'You have been invited by X to join Y'	2023-11-15 15:22:12 -08:00
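The `subject ~~~ <html content> ~~~ text` layout described in b23eed5003 could be split along these lines after Jinja2 rendering. This is a sketch only: the names `split_email_template` and `EmailParts` are hypothetical, and it assumes a text-only template leaves the HTML section empty, which the commit message does not state.

```python
from typing import NamedTuple, Optional

SEPARATOR = "~~~"  # section delimiter named in the commit message


class EmailParts(NamedTuple):
    subject: str
    html: Optional[str]  # None when the template is text only (assumption)
    text: str


def split_email_template(rendered: str) -> EmailParts:
    """Split an already-rendered template into subject, HTML, and text parts."""
    parts = [part.strip() for part in rendered.split(SEPARATOR)]
    if len(parts) != 3:
        raise ValueError("expected exactly three sections separated by '~~~'")
    subject, html, text = parts
    # Treat an empty HTML section as "no HTML version available".
    return EmailParts(subject=subject, html=html or None, text=text)
```

A mail sender could then attach the HTML alternative only when `html` is not `None`.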
Ilya Kreymer	7d985a9688	version: bump to 1.8.0-beta.4	2023-11-14 11:59:04 -08:00
Ilya Kreymer	67892994a6	version: bump to 1.8.0-beta.3	2023-11-09 18:20:04 -08:00
Ilya Kreymer	3aebf2e37f	version: bump to 1.8.0-beta.2	2023-11-06 16:35:15 -08:00
Francesco Servida	0b8bbcf8e6	Allow User to specify custom cluster-issuer (#1332 ) Implemented variable and defaults for cluster-issuer to allow users to specify, if needed, their own cluster issuer. (eg. installations with only outbound traffic that cannot solve ACME https challenge)	2023-11-04 13:29:17 -07:00
Ilya Kreymer	fb3d88291f	Background Jobs Work (#1321 ) Fixes #1252 Supports a generic background job system, with two background jobs, CreateReplicaJob and DeleteReplicaJob. - CreateReplicaJob runs on new crawls, uploads, profiles and updates the `replicas` array with the info about the replica after the job succeeds. - DeleteReplicaJob deletes the replica. - Both jobs are created from the new `replica_job.yaml` template. The CreateReplicaJob sets secrets for primary storage + replica storage, while DeleteReplicaJob only needs the replica storage. - The job is processed in the operator when the job is finalized (deleted), which should happen immediately when the job is done, either because it succeeds or because the backoffLimit is reached (currently set to 3). - /jobs/ api lists all jobs using a paginated response, including filtering and sorting - /jobs/<job id> returns details for a particular job - tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints - tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 13:02:17 -07:00
Ilya Kreymer	8c09934298	version: bump to 1.8.0-beta.1	2023-10-27 14:35:24 -07:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296 ) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contain a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Tessa Walsh	d58747dfa2	Provide full resources in archived items finished webhooks (#1308 ) Fixes #1306 - Include full `resources` with expireAt (as string) in crawlFinished and uploadFinished webhook notifications rather than using the `downloadUrls` field (this is retained for collections). - Set default presigned duration to one minute short of 1 week and enforce maximum supported by S3 - Add 'storage_presign_duration_minutes' commented out to helm values.yaml - Update tests --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-10-23 19:01:58 -07:00
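The presign duration rule in d58747dfa2 amounts to simple arithmetic: S3 presigned URLs can be valid for at most 7 days, and the default is one minute less. The sketch below assumes an over-limit value is clamped rather than rejected, since the commit only says the maximum is enforced. The function name is hypothetical.

```python
from typing import Optional

# Maximum lifetime S3 accepts for a presigned URL: 7 days, in minutes.
MAX_PRESIGN_MINUTES = 7 * 24 * 60  # 10080
# Default from the commit message: one minute short of one week.
DEFAULT_PRESIGN_MINUTES = MAX_PRESIGN_MINUTES - 1  # 10079


def presign_duration_minutes(configured: Optional[int] = None) -> int:
    """Return the presign duration in minutes, capped at the S3 maximum."""
    if configured is None:
        return DEFAULT_PRESIGN_MINUTES
    # Clamping is an assumption; the real backend may handle this differently.
    return min(configured, MAX_PRESIGN_MINUTES)
```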
Ilya Kreymer	36bd228115	version: update to 1.8.0-beta.0	2023-10-17 18:06:55 -07:00
Ilya Kreymer	b3f530f8e6	version: bump to 1.7.0	2023-10-16 18:39:20 -07:00
Ilya Kreymer	a295f5d05d	version: bump to 1.7.0-beta.3	2023-10-15 18:31:03 -07:00
Ilya Kreymer	fa86555eed	Track pod resource usage, detect OOM crashes, handle auto-scaling (#1235 ) * keep track of per pod status on crawljob: - crashes time, and reason - 'used' vs 'allocated' resources - 'percent' used / allocated * crawl log errors: log error when crawler crashes via OOM, either via redis error log or to console * add initial autoscaling support! - detect if metrics server is available via K8SApi.is_pod_metrics_available() - if available, use metrics for 'used' fields - if no metrics, set memory used for redis only (using redis apis) - allow overriding memory and cpu via newMemory and newCpu settings on pod status - scale memory / cpu based on newMemory and newCpu setting - templates: update jinja templates to allow restarting crawler and redis with new resources - ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs * roles: cleanup unused roles, add permissions for listing metrics * stats for running crawls: - update in db via operator - avoids losing stats if redis pod happens to be done - tradeoff is more db access in operator, but less extra connections to redis + already loading from db in backend - size stat: ensure size of previous files is added to the stats * crawler deployment tweaks: - adjust cpu/mem per browser - add --headless flag to configmap to use new headless mode by default!	2023-10-05 20:41:18 -07:00
Ilya Kreymer	20560abb81	version: bump to 1.7.0-beta.2	2023-10-05 20:33:38 -07:00
Ilya Kreymer	d6bc467c54	improvements to redis pod: (#1219 ) - add liveness check/fix readiness check - ensure 'redis-cli ping' actually returns 'PONG', as the exit code is 0 even on errors; this will detect situations where redis is not available, such as due to max clients being reached - bump redis memory/cpu for now (until autoscaling/automatic adjustment is available)	2023-09-28 13:00:31 -07:00
Ilya Kreymer	18b2c1abfc	limit: set default page limit to 50k pages (#1211 )	2023-09-26 10:29:03 -07:00
Ilya Kreymer	65b7c10ba1	bump version to 1.7.0-beta.1	2023-09-18 14:33:03 -07:00
Ilya Kreymer	3d4ff264b6	version: bump RWP version to 1.8.12 (#1181 )	2023-09-15 11:32:45 -07:00
Ilya Kreymer	ad9bca2e92	Operator refactor to control pods + pvcs directly instead of statefulsets (#1149 ) - Ability for pod to be Completed, unlike in a StatefulSet, where eg. if 3 pods are running and the first one finishes, all 3 must keep running until all 3 are done. With this setup, the first finished pod can remain in Completed state. - Fixed shutdown order - crawler pods now correctly shut down first before redis pods, by switching to background deletion. - Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl - Create priority classes up to 'max_crawl_scale', configured in values.yaml - Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale, gracefully stop scaled-down instances via redis 'stopone' key, then wait until they exit with Completed state before adjusting status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted. - Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds - Configurable Redis storage with 'redis_storage' value, set to 3Gi by default - CrawlJob deletion starts as soon as post-finish crawl operations are run - Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer - Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished) - Current resource usage added to status - Profile browser: also manage single pod directly without statefulset for consistency. - Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases). - Update to latest metacontroller (v4.11.0) - Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0) - Failed crawl logging: add 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status - tests: check other finished states to avoid getting stuck in an infinite loop if crawl fails - tests: disable disk utilization check, which adds unpredictability to crawl testing! fixes #1147 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-11 10:38:04 -07:00
Anish Lakhwara	e57148d0e9	feat: add SMTP {port, use_tls} config (#1142 ) * feat: add SMTP {port, use_tls} config * If `password` is None don't attempt to log in * remove 'can be omitted' comment --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-09-08 08:18:36 -07:00
Ilya Kreymer	2967f1e320	ingress: simplify ingress config: (fixes #1135 ) (#1146 ) * ingress: simplify ingress config: (fixes #1135) - use standard Prefix pathTypes - remove nginx-specific rewriting - remove 'scheme', use https/http based on 'tls' setting (in ingress and configmap) - fix signing ingress to use ingressClassName	2023-09-07 09:51:48 -07:00
Ilya Kreymer	68bc053ba0	Print crawl log to operator log (mostly for testing) (#1148 ) * log only if 'log_failed_crawl_lines' value is set to number of last lines to log from failed container --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-06 17:53:02 -07:00
Ilya Kreymer	dce1ae6129	better resources scaling by number of browsers per crawler container (#1103 ) - set crawler cpu / memory with fixed base + incremental bumps based on number of browsers - allow parsing k8s quantities with parse_quantity, compute in operator - set 'crawler_cpu = crawler_cpu_base + crawler_extra_cpu_per_browser * (num_browsers - 1)' and same for memory	2023-09-06 01:42:44 -04:00
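The scaling rule quoted in dce1ae6129 is a fixed base plus an increment for each browser beyond the first. A minimal sketch follows, assuming the Kubernetes quantities have already been parsed into plain numbers (the commit uses `parse_quantity` for that step). The function name and the example values are illustrative only.

```python
def scaled_resource(base: float, extra_per_browser: float, num_browsers: int) -> float:
    """Compute base + extra_per_browser * (num_browsers - 1).

    Applies equally to cpu (e.g. millicores) and memory (e.g. bytes),
    as long as both inputs use the same unit.
    """
    if num_browsers < 1:
        raise ValueError("num_browsers must be at least 1")
    return base + extra_per_browser * (num_browsers - 1)


# Illustrative values only, not the chart defaults:
# a 900m base with 600m per extra browser and 3 browsers.
crawler_cpu = scaled_resource(900, 600, 3)
```

With a single browser the result equals the base, so the extra-per-browser setting only matters once more than one browser runs in a crawler container.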
Ilya Kreymer	6dca2f1c03	supports overriding the replayweb.page version without having to be r… (#1122 ) * supports overriding the replayweb.page version without having to rebuild the frontend image: - ensures 'rwp_base_url' from helm chart is passed to nginx - ensures both ui.js and sw.js are loaded based on nginx environment variable, not hard-coded - ui.js loaded via redirect from new /replay/ui.js path - pin RWP to known working release in default values.yaml - remove RWP_BASE_URL from Dockerfile, no longer needed, set via chart env var - set default RWP_BASE_URL for devserver to use CDN - set RWP version to 1.8.11	2023-09-05 20:10:21 -04:00
Ilya Kreymer	a9ab17fc61	publish helm chart on release (fixes #1114 ) (#1117 ) (#1123 ) - no longer using :latest by default in values.yaml, instead updating version with each release - set chart version to match app version in Chart.yaml - update version in helm chart and values.yaml as part of update-version.sh script - update test.yaml and local-config.yaml to enable using :latest tag images - ci: add ci script for packaging current helm chart - docs: updates docs to indicate deploying directly from GitHub release - docs: add script to fill in latest version for 'VERSION' using custom script - chart: set local_service_port to 30870 by default, but use only if no ingress. - default values.yaml set up for local deployment, local-config.yaml contains additional commented out examples - ci draft: add deployment info to draft with helm install command for current version - test: fix password check test	2023-08-30 12:02:02 -07:00
Ilya Kreymer	8e43940196	chart resources: adjust backend memory to 350Mi, as 200Mi was too low (#1082 )	2023-08-15 21:59:57 -07:00
Ilya Kreymer	9553115bbe	helm chart tweaks: (#1067 ) * helm chart tweaks: - lower mem requirements for backend and crawler - disable cors in ingress to pass through cors headers from backend - crawler statefulset: use ordered instead of parallel scaling policy to avoid single crawl taking up all crawling capacity quickly	2023-08-14 16:43:12 -07:00
Ilya Kreymer	7ea6d76f10	Resource Constraints Cleanup: (fixes #895 ) (#1019 ) * resource constraints: (fixes #895) - for cpu, only set cpu requests - for memory, set mem requests == mem limits - add missing resource constraints for minio and scheduled job - for crawler, set mem and cpu constraints per browser, scale based on browser instances per crawler - add comments in values.yaml for crawler values being multiplied - default values: bump crawler to 650 millicpu per browser instance just in case cleanup: remove unused entries from main backend configmap	2023-08-01 00:11:16 -07:00
Ilya Kreymer	c76dd10928	chart: always pull latest crawler image - since default image is pointing to webrecorder/browsertrix-crawler:latest, makes sense to always pull latest (#1018 )	2023-07-27 12:41:41 -07:00
Vinzenz Sinapius	5807507f29	Add proxy settings for crawler and profilebrowser (#997 )	2023-07-26 16:11:10 -07:00
Anish Lakhwara	b5a9c42df1	feat: add pre-commit to check we don't have real passwords in yml files (#990 ) * feat: use existing pre-commit framework * feat(ci): add github action for password_check * feat: add some simple tests to password_check.py * fix: set `backend_password_secret` in default values.yaml to an allowed password	2023-07-26 13:29:37 -07:00

105 Commits