browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	60ba9e366f	Refactor to use new operator on backend (#789 ) * Btrixjobs Operator - Phase 1 (#679) - add metacontroller and custom crds - add main_op entrypoint for operator * Btrix Operator Crawl Management (#767) * operator backend: - run operator api in separate container but in same pod, with WEB_CONCURRENCY=1 - operator creates statefulsets and services for CrawlJob and ProfileJob - operator: use service hook endpoint, set port in values.yaml * crawls working with CrawlJob - jobs start with 'crawljob-' prefix - update status to reflect current crawl state - set sync time to 10 seconds by default, overridable with 'operator_resync_seconds' - mark crawl as running, failed, complete when finished - store finished status when crawl is complete - support updating scale, forcing rollover, stop via patching CrawlJob - support cancel via deletion - requires hack to content-length for patching custom resources - auto-delete of CrawlJob via 'ttlSecondsAfterFinished' - also delete pvcs until autodelete supported via statefulset (k8s >1.27) - ensure filesAdded always set correctly, keep counter in redis, add to status display - optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added - add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled - add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291) - support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284) * support for scheduled jobs! - add main_scheduled_job entrypoint to run scheduled jobs - add crawl_cron_job.yaml template for declaring CronJob - CronJobs moved to default namespace * operator manages ProfileJobs: - jobs start with 'profilejob-' - update expiry time by updating ProfileJob object 'expireTime' while profile is active * refactor/cleanup: - remove k8s package - merge k8sman and basecrawlmanager into crawlmanager - move templates, k8sapi, utils into root package - delete all _job.py files - remove dt_now, ts_now from crawls, now in utils - all db operations happen in crawl/crawlconfig/org files - move shared crawl/crawlconfig/org functions that use the db to be importable directly, including get_crawl_config, add_new_crawl, inc_crawl_stats role binding: more secure setup, don't allow crawler namespace any k8s permissions - move cronjobs to be created in default namespace - grant default namespace access to create cronjobs in default namespace - remove role binding from crawler namespace * additional tweaks to templates: - templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately) * stats / redis optimization: - don't update stats in mongodb on every operator sync, only when crawl is finished - for api access, read stats directly from redis to get up-to-date stats - move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access * Add migration for operator changes - Update configmap for crawl configs with scale > 1 or crawlTimeout > 0 and schedule exists to recreate CronJobs - add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1 * subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart - metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart * backend api fixes - ensure changing scale of crawl also updates it in the db - crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api --------- Co-authored-by: D. Lee <leepro@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-24 18:30:52 -07:00
Ilya Kreymer	f6dc26eeb5	nginx: enable worker processes autotune to correctly set the number of processes for nginx, possible fix for #780 (#785 )	2023-04-21 18:13:22 -07:00
Tessa Walsh	6b19f72a89	Add crawl errors endpoint (#757 ) * Add crawl errors endpoint If this endpoint is called while the crawl is running, errors are pulled directly from redis. If this endpoint is called when the crawl is finished, errors are pulled from mongodb, where they're written when crawls complete. * Add nightly backend test for errors endpoint * Add errors for failed and cancelled crawls to mongo Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-04-17 12:59:25 -04:00
Ilya Kreymer	85b6a05419	Upgrade to mongo 6 and use sortArray for workflow crawls (#764 ) (#765 ) fixes from 1.4.1: * Upgrade to mongo 6 and use for workflow crawls * update readiness probe with timeouts doubled, and failure threshold increased for slower 'mongosh' readiness check update versions to 1.5.0-beta.0 in backend and frontend Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-11 18:22:07 -07:00
Tessa Walsh	11ca3e678a	Configure crawler disk utilization threshold via helm chart (#748 )	2023-04-05 21:51:53 -07:00
Ilya Kreymer	7f757d396a	config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend… (#742 ) * config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config - add 'default_page_load_timeout_seconds' to values.yaml, defaulting to 120, for pageLoadTimeout - add 'defaultPageLoadTimeSeconds' to /api/settings, update tests for /api/settings addresses issue in #636	2023-04-04 19:52:23 -07:00
Ilya Kreymer	1c47a648a9	Max page limit override (#737 ) * more page limit: update to #717, instead of setting --limit in each crawlconfig, apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit * update tests, no longer returning 'crawl_page_limit_exceeds_allowed'	2023-04-03 14:01:32 -07:00
Ilya Kreymer	887cb16146	Allow configurable max pages per crawl in deployment settings (#717 ) * backend: max pages per crawl limit, part of fix for #716: - set 'max_pages_crawl_limit' in values.yaml, default to 100,000 - if set/non-0, automatically set limit if none provided - if set/non-0, return 400 if adding config with limit exceeding max limit - return limit as 'maxPagesPerCrawl' in /api/settings - api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing) tests: add test for 'max_pages_per_crawl' setting - ensure 'limit' can not be set higher than max_pages_per_crawl - ensure pages crawled is at the limit - set test limit to max 2 pages - add settings test - check for pages.jsonl and extraPages.jsonl when crawling 2 pages	2023-03-28 16:26:29 -07:00
D. Lee	7528f2ec6d	Add lightweight logging mode (#668 ) Enabled with `logging.fileMode`: true - disables elasticsearch, kibana and ingress - only enables fluentd to write logs in the node's volume - lightweight logging into files (in JSON format and compressed in gzip) - log file rotation (default: rotating files every 4 hours, retention 3 days)	2023-03-10 14:34:37 -08:00
Francis Kayiwa	3ba77f0ed2	ansible: rocky firewall (#635 ) * modify the template file to highlight optional host that stores WAC files * numerically reorder the tcp ports - fix the 404's on the documentation * add a configuration file - this allows automatic selection of inventory directory * provide better examples on documentation	2023-02-24 17:28:21 -08:00
Ilya Kreymer	413fd8d7ea	Chart: split Crawl args into separate variables (#639 ) * chart crawl args cleanup: - move configurable settings out of 'crawler_args' - add 'crawler_session_size_limit_bytes' and 'crawler_session_time_limit_seconds' for --timeLimit and --sizeLimit option for crawler - remove hard-coded 'timeout' to allow configuring via crawl config - set liveness check port from existing config value - add comments that requests hd must be at least double the size limit - defaults: set crawler_requests_hd to 22GB, default crawl session size limit to 10GB	2023-02-24 17:24:04 -08:00
Tessa Walsh	fff74ee754	Fix microk8s CI (#634 )	2023-02-23 16:58:25 -05:00
Ilya Kreymer	3df6e0f146	crawler arguments fixes: (#621 ) - partial fix to #321, don't hard-code behavior limit into crawler args - allow setting number of crawler browser instances via 'crawler_browser_instances' to avoid having to override the full crawler args	2023-02-22 13:23:19 -08:00
D. Lee	362dc67532	fix the admin logging doc (#612 ) * fix the admin logging doc * Update chart/admin/logging/README.md --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-02-21 09:52:46 -08:00
Tessa Walsh	14b349443f	Make pending invites expire via TTL index (#568 ) * Make invites expire after configurable window The value can be set in EXPIRE_AFTER_SECONDS env var and via helm chart values, and defaults to 7 days. * Create nightly test CI and add invite expiration test to it * Update 404 error message for missing or expired invite --------- Co-authored-by: sua yoo <sua@suayoo.com>	2023-02-14 16:07:14 -05:00
Ilya Kreymer	21745fb6f8	health readiness check: add /healthz endpoint for nginx readiness check, set failure threshold to 3 (similar to ingress) (#562 )	2023-02-06 15:08:05 -08:00
D. Lee	5fac103e10	[FIX] Add ingress class for admin logging (#532 ) * add ingress class, support changing ingress class for microk8s * update dependency	2023-01-31 12:52:57 -08:00
D. Lee	be4f918149	Merge pull request #442 from webrecorder/admin-logging-service Add logging service	2023-01-19 22:15:16 -08:00
Ilya Kreymer	ccd87e0dff	Rename api / nginx settings -> backend / frontend, set pull policy job images (#504 ) * rename config values - api -> backend - nginx -> frontend * job pods: - set job_pull_policy from api_pull_policy (same as backend image) - default to Always, but can be overridden for local deployment (same as backend image) typo fix: CRAWL_NAMESPACE -> CRAWLER_NAMESPACE (part of #491) ansible: set default label to :latest instead of :dev for	2023-01-18 20:21:36 -08:00
Ilya Kreymer	1dfa494210	backend: add default behavior time to /api/settings (part of #321 ) (#499 )	2023-01-18 14:52:15 -08:00
Tessa Walsh	0fa60ebc45	Rename archives/teams -> orgs in codebase + add db migration (#486 ) * Rename archives to orgs and aid to oid on backend * Rename archive to org and aid to oid in frontend * Remove translation artifact * Rename team -> organization * Add database migrations and run once on startup * This commit also applies the new by_one_worker decorator to other asyncio tasks to prevent heavy tasks from being run in each worker. * Run black, pylint, and husky via pre-commit * Set db version and use in migrations * Update and prepare database in single task * Migrate k8s configmaps	2023-01-18 14:51:04 -08:00
Ilya Kreymer	d028b93412	backend: password related fixes: (#479 ) - mongodb: support passwords with '@' by escaping mongo username and password - superadmin: update superadmin email and password after initial creation if updated in helm values	2023-01-13 18:22:50 -08:00
Ilya Kreymer	827b643262	backend: add 'allow_dupe_invites' option to allow re-inviting users. if not set (default), duplicate invites will result in errors (#471 )	2023-01-12 23:25:48 -08:00
Ilya Kreymer	4dbca8c421	email sending tweaks: (#470 ) - support 'reply-to' email field in values, and in ansible-based values - set 'subject' for different types of messages	2023-01-12 23:25:23 -08:00
Tessa Walsh	49460bb070	Add default organization + invite to default org (#465 ), #455 - Add default switch to Archive (org) model - Set default org name via values.yaml - Add check to ensure only one org with default org name exists - Stop creating new orgs for new users - Add new API endpoints for creating and renaming orgs (part of #457) - Make Archive.name unique via index - Wait for db connection on init, log if waiting - Make archive-less invites invite user to default org with Owner role - Rename default org from chart value if changed - Don't create new org for invited users	2023-01-12 16:44:18 -08:00
Ilya Kreymer	30bda8c75d	VNC-Based Profile Browser (#433 ) * profile browser vnc support + fixes: - switch profile browser rendering to use VNC - frontend: add @novnc/novnc as dependency, create separate bundle novnc.js to load into vnc browser (to avoid loading from each container) - frontend: update proxy paths to proxy websocket, index page to crawler - frontend: allow browser profiles in all browsers, remove browser compatibility check - frontend: update webpack dev config, apply prettier - frontend: node version fix - backend: get vncpassword, build new URL for proxying to crawler iframe - backend: fix profile / crawl job pull policy from 'Always' -> 'Never', should use existing image for job - backend: fix kill signal to use bash -c to work with latest backend image - backend/chart: add 'profile_browser_timeout_seconds' to chart values to control how long the profile browser remains when idle (default to 60) - backend: remove utils.py, now using secrets.token_hex() for random suffix Co-authored-by: sua yoo <sua@suayoo.com>	2023-01-10 14:42:42 -08:00
DongWoo Lee	b03b848ec2	have ingress for signer only when it is enabled	2023-01-06 14:06:33 -08:00
DongWoo Lee	539a4556aa	add logging service CI with k3d	2023-01-04 22:52:41 -08:00
DongWoo Lee	6a228bb370	update doc	2023-01-04 00:31:59 -08:00
DongWoo Lee	75ce09a163	allow for non-local setup	2023-01-04 00:23:44 -08:00
DongWoo Lee	5b7d214c8a	add logging	2023-01-04 00:15:23 -08:00
Ilya Kreymer	2d93cef966	CI: Add K3D CI test (#405 ) - add testing with K3D cluster - bump backend image to python 3.10-slim for newer python, smaller image. - bump to 1.2.0-beta.0	2022-12-07 23:26:16 -08:00
Ilya Kreymer	82ffc0dfbc	Local Deployment Work: Support running locally + test cluster on CI (#396 ) * k8s local deployment work: - make it easier to deploy w/o ingress by setting 'local_service_port' (suggested port 30870) - if using local minio, ensure file endpoints set to /data/ and /data/ proxies correctly to local bucket - if not using minio, ensure file endpoints point to correct access / endpoint url. - setup should work with docker desktop, minikube, microk8s and k3s! - nginx chart: bump nginx memory limit to 20Mi - nginx image: 00-default-override-resolver-config -> 00-browsertrix-nginx-init for clarity - nginx image: use default nginx.conf, pin to nginx 1.23.2 - mongo: readd readiness probe, bump connect wait timeout (needed for ci) - config: set superadmin username to 'admin' - config schema: set 'name' as required - add sample chart values overrides: - chart values: local-config.yaml for running locally with 'local_service_port' - chart values: add microk8s-hosted.yaml for configuring a hosted microk8s setup - chart values: add microk8s-ci.yaml for ci tests - ci: remove docker swarm tests - ci: add microk8s integration tests: launching cluster, logging in, running a crawl of example.com, downloading/checking WACZ - bump to 1.1.0-beta.2	2022-12-02 19:58:34 -08:00
Ilya Kreymer	aabb0b2a92	chart / deployment fixes to run on microk8s: (fixes #385 ) (#387 ) - ingress: fix proxying /data to minio, use another ingress which proxies correct host to ensure presigned urls work - presigning: determine if signing endpoint url (minio) or access endpoint (cloud bucket) based on if access endpoint is provided, set bool on storage object - chart: fix indent on incorrect storageClassName configs - ingress: make 'ingress_class' configurable (set to 'public' for microk8s, default to 'nginx') - minio: use older minio image which supports legacy fs based setup (for now) - nginx service: add 'nginx_service_use_node_port' config setting: if true, will use NodePort for frontend, other will use default (ClusterIP) and only for the frontend / nginx - chart: remove changing service type for other services	2022-11-30 09:21:58 -08:00
Francis Kayiwa	6833c9d676	Digital ocean setup (#314 ) - Ansible playbook for deploying on DigitalOcean, configuring space, k8s cluster, mongodb, domain / subdomain, signing subdomain, container registry, and cors - Generates helm chart in ./deploys/ directory for future use with helm directly - Initial support for deletion of created resources as well. - add documentation on how to use playbook default helm values: update to latest authsign, set default timeout to 120 seconds	2022-11-15 13:44:24 -08:00
Ilya Kreymer	dde4c5ee68	k8s chart: ingress: use separate ingress for authsign to allow ssl-redirect true on main ingress mongo: local: disable readiness check for now due to issues with eval command (for now)	2022-10-15 13:46:31 -07:00
Ilya Kreymer	447b0bf9b9	k8s chart + values tweak: (#317 ) - mongo chart to avoid requiring username/password if passing db_url - tweaks to default values (set registration enabled by default, longer) add missing options	2022-09-21 12:45:08 -07:00
Ilya Kreymer	1216f6cb66	K8s: update chart for local minio + mongo default (#301 ) * k8s chart fixes: mongo: pin to 5.0.11 version for now minio: create empty dir for local storage for now instead of using mc, use 'btrix-data' as bucket name	2022-09-02 13:07:47 -07:00
Ilya Kreymer	f0c079dc1b	k8s: update default images to dev images in values.yaml	2022-09-01 16:18:15 -07:00
Ilya Kreymer	68ec582f73	nginx simplify: (#259 ) - add custom init script for ./docker-entrypoint.d/ to setup resolver from local /etc/resolv.conf - custom init script also removes default.conf, and removes minio route if NO_MINIO_ROUTE=1 is set - assign template vars to nginx vars to avoid conflicts when interpolating - k8s: remove initContainers and volumes, now handled via custom init script in image	2022-06-13 11:53:15 -07:00
Ilya Kreymer	5b6aa3bc95	Affinity + Tolerations + Cleanup Crawl Job (#256 ) * k8s: add tolerations for 'nodeType=crawling:NoSchedule' to allow scheduling crawling on designated nodes for crawler and profiles jobs and statefulsets * add affinity for 'nodeType=crawling' on crawling and profile browser statefulsets * refactor crawljob: combine crawl_updater logic into base crawl_job * increment new 'crawlAttemptCount' counter crawlconfig when crawl is started, not necessarily finished, to avoid deleting configs that had attempted but not finished crawls. * better external mongodb support: use MONGO_DB_URL to set custom url directly, otherwise build from username, password and mongo host	2022-06-10 19:21:37 -07:00
Ilya Kreymer	dee354f252	affinity: add affinity for k8s crawl deployments: - prefer deploy crawler, redis and job to same zone - prefer deploying crawler and job together via crawler node type, redis via redis node type (all optional)	2022-06-07 21:52:04 -07:00
Ilya Kreymer	0c8a5a49b4	refactor to use docker swarm for local alternative to k8s instead of docker compose (#247 ): - use python-on-whale to use docker cli api directly, creating docker stack for each crawl or profile browser - configure storages via storages.yaml secret - add crawl_job, profile_job, splitting into base and k8s/swarm implementations - split manager into base crawlmanager and k8s/swarm implementations - swarm: load initial scale from db to avoid modifying fixed configs, in k8s, load from configmap - swarm: support scheduled jobs via swarm-cronjob service - remove docker dependencies (aiodocker, apscheduler, scheduling) - swarm: when using local minio, expose via /data/ route in nginx via extra include (in k8s, include dir is empty and routing handled via ingress) - k8s: cleanup minio chart: move init containers to minio.yaml - swarm: stateful set implementation to be consistent with k8s scaling: - don't use service replicas, - create a unique service with '-N' appended and allocate unique volume for each replica - allows crawl containers to be restarted w/o losing data - add volume pruning background service, as volumes can be deleted only after service shuts down fully - watch: fully simplify routing, route via replica index instead of ip for both k8s and swarm - rename network btrix-cloud-net -> btrix-net to avoid conflict with compose network	2022-06-05 10:37:17 -07:00
Ilya Kreymer	bf79959a5a	refactoring to use statefulsets + job (#245 ) - use statefulsets instead of deployments for mongo, redis, signer - use k8s job + statefulset for running crawls - use separate statefulset for crawl (scaled) and single-replica redis stateful set - move crawl job update logic to crawl_updater - remove shared redis chart package refactor: - move shared code to 'btrixcloud' - move k8s to 'btrixcloud.k8s' - move docker to 'btrixcloud.docker'	2022-06-05 10:37:17 -07:00
Ilya Kreymer	c023fe7c9a	Backend API prefix (#240 ) * apply /api prefix consistently, both directly through backend and when accessing via frontend, fixes #236 * docs: update local deployment docs to use 9871 instead of 8000, don't expose 8000 by default * schemas: don't include /openapi.json as /healthz in documentation, keep /healthz at root * k8s: route backend to /api without additional rewriting	2022-05-31 19:29:20 -07:00
Ilya Kreymer	3df310ee4f	Backend: Crawls with Multiple WACZ files + Profile + Misc Fixes (#232 ) * backend: k8s: - support crawls with multiple wacz files, don't assume crawl complete after first wacz uploaded - if crawl is running and has wacz file, still show as running - k8s: allow configuring node selector for main pods (eg. nodeType=main) and for crawlers (eg. nodeType=crawling) - profiles: support uploading to alternate storage if 'shared_profile_storage' value is set - misc fixes for profiles * backend: ensure docker run_profile api matches k8s k8s chart: don't delete pvc and pv in helm chart * dependency: bump authsign to 0.4.0 docker: disable public redis port * profiles: fix path, profile browser return value * fix typo in presigned url caching	2022-05-19 18:40:41 -07:00
Ilya Kreymer	aa83d51f7a	k8s backend improvements: (#205 ) - add liveness probe for crawls, configurable via 'crawler_liveness_port' - add User system:anonymous permissions - treat jobs that have exceeded total as 'partial_complete' (experimental)	2022-03-30 14:39:06 -07:00
Ilya Kreymer	4b2f89db91	k8s: support for using a pre-made persistent volume/claim for crawling, configurable via CRAWLER_PV_CLAIM, otherwise using emptyDir k8s: ability to set deployment scale for frontend as well	2022-03-15 11:18:23 -07:00
Ilya Kreymer	8ce7a9802b	backend quick fix: chart/config: use screencastPort, fixed collection name k8s: set pod to never restart to see logs	2022-03-14 11:42:53 -07:00
Ilya Kreymer	fb51f8e33e	Mongo auth fix (#190 ) * backend: makes mongo auth configurable! use mongo_auth secret in k8s and set env vars in docker fixes #177 * docker: update config.sample.env: use ws screencast by default, add NO_DELETE_ON_FAIL option, extend default login lifetime	2022-03-04 15:04:33 -08:00

81 Commits