browsertrix/chart/templates/configmap.yaml
Ilya Kreymer 60ba9e366f
Refactor to use new operator on backend (#789)
* Btrixjobs Operator - Phase 1 (#679)

- add metacontroller and custom crds
- add main_op entrypoint for operator

* Btrix Operator Crawl Management (#767)

* operator backend:
- run operator api in separate container but in same pod, with WEB_CONCURRENCY=1
- operator creates statefulsets and services for CrawlJob and ProfileJob
- operator: use service hook endpoint, set port in values.yaml

* crawls working with CrawlJob
- jobs start with 'crawljob-' prefix
- update status to reflect current crawl state
- set sync time to 10 seconds by default, overridable with 'operator_resync_seconds'
- mark crawl as running, failed, complete when finished
- store finished status when crawl is complete
- support updating scale, forcing rollover, stop via patching CrawlJob
- support cancel via deletion
- requires a workaround to set Content-Length when patching custom resources
- auto-delete of CrawlJob via 'ttlSecondsAfterFinished'
- also delete PVCs directly, until StatefulSet auto-deletion of PVCs is supported (k8s >1.27)
- ensure filesAdded always set correctly, keep counter in redis, add to status display
- optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added
- add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled
- add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291)
- support both the new page size key (>0.9.0) and the old page size key (changed in webrecorder/browsertrix-crawler#284)
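The CrawlJob operations above (updating scale, stopping, canceling via deletion) could be exercised directly with kubectl; the resource names, namespace, and spec field names below are illustrative assumptions, not taken from the chart:

```shell
# Hypothetical object name and namespace for illustration; jobs are created
# with a 'crawljob-' prefix, per the notes above.

# Change the scale of a running crawl by patching the CrawlJob spec:
kubectl patch crawljob crawljob-example --type=merge \
  -p '{"spec": {"scale": 2}}'

# Cancel a crawl by deleting its CrawlJob; the operator handles the deletion:
kubectl delete crawljob crawljob-example
```

These are operational sketches against a live cluster, so the exact field paths should be confirmed against the installed CRD.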

* support for scheduled jobs!
- add main_scheduled_job entrypoint to run scheduled jobs
- add crawl_cron_job.yaml template for declaring CronJob
- CronJobs moved to default namespace

* operator manages ProfileJobs:
- jobs start with 'profilejob-'
- update expiry time by updating the ProfileJob object's 'expireTime' while the profile is active
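Keeping a profile browser alive by bumping 'expireTime' could look like the following; the object name and the exact field path are assumptions for illustration:

```shell
# Hypothetical: extend a profile browser's lifetime by patching the
# ProfileJob's expireTime (field path assumed, not confirmed from the chart).
kubectl patch profilejob profilejob-example --type=merge \
  -p '{"spec": {"expireTime": "2023-05-01T00:00:00Z"}}'
```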

* refactor/cleanup:
- remove k8s package
- merge k8sman and basecrawlmanager into crawlmanager
- move templates, k8sapi, utils into root package
- delete all *_job.py files
- remove dt_now, ts_now from crawls, now in utils
- all db operations happen in crawl/crawlconfig/org files
- move shared crawl/crawlconfig/org functions that use the db to be importable directly,
including get_crawl_config, add_new_crawl, inc_crawl_stats

* role binding: more secure setup, don't grant the crawler namespace any k8s permissions
- move cronjobs to be created in default namespace
- grant default namespace access to create cronjobs in default namespace
- remove role binding from crawler namespace

* additional tweaks to templates:
- templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately)

* stats / redis optimization:
- don't update stats in mongodb on every operator sync, only when crawl is finished
- for api access, read stats directly from redis to get up-to-date stats
- move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access

* Add migration for operator changes
- Update configmap for crawl configs with scale > 1 or crawlTimeout > 0, and recreate CronJobs where a schedule exists
- add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1
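Rerunning the last migration via the flag mentioned above might look like this; the release name and chart path are placeholders, only the `--set` flag comes from the change description:

```shell
# Placeholder release name and chart path; rerun_last_migration=1 enables
# the rerun-last-migration env var described above.
helm upgrade --install btrix ./chart -f ./chart/values.yaml \
  --set rerun_last_migration=1
```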

* subcharts: move crawljob and profilejob CRDs to a separate subchart, as this seems the best way to guarantee proper install order and updates on upgrade with helm; add built btrix-crds-0.1.0.tgz subchart
- metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart

* backend api fixes
- ensure changing scale of crawl also updates it in the db
- crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api

---------

Co-authored-by: D. Lee <leepro@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-24 18:30:52 -07:00


---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Values.name }}-env-config
  namespace: {{ .Release.Namespace }}
data:
  APP_ORIGIN: {{ .Values.ingress.scheme }}://{{ .Values.ingress.host | default "localhost:9870" }}
  CRON_NAMESPACE: {{ .Release.Namespace }}
  CRAWLER_NAMESPACE: {{ .Values.crawler_namespace }}
  CRAWLER_IMAGE: {{ .Values.crawler_image }}
  CRAWLER_PULL_POLICY: {{ .Values.crawler_pull_policy }}
  CRAWLER_FQDN_SUFFIX: ".{{ .Values.crawler_namespace }}.svc.cluster.local"
  CRAWLER_TIMEOUT: "{{ .Values.crawl_timeout }}"
  CRAWLER_RETRIES: "{{ .Values.crawl_retries }}"
  CRAWLER_REQUESTS_CPU: "{{ .Values.crawler_requests_cpu }}"
  CRAWLER_LIMITS_CPU: "{{ .Values.crawler_limits_cpu }}"
  CRAWLER_REQUESTS_MEM: "{{ .Values.crawler_requests_memory }}"
  CRAWLER_LIMITS_MEM: "{{ .Values.crawler_limits_memory }}"
  CRAWLER_LIVENESS_PORT: "{{ .Values.crawler_liveness_port | default 0 }}"
  DEFAULT_ORG: "{{ .Values.default_org }}"
  INVITE_EXPIRE_SECONDS: "{{ .Values.invite_expire_seconds }}"
  JOB_IMAGE: "{{ .Values.backend_image }}"
  JOB_PULL_POLICY: "{{ .Values.backend_pull_policy }}"
  {{- if .Values.crawler_pv_claim }}
  CRAWLER_PV_CLAIM: "{{ .Values.crawler_pv_claim }}"
  {{- end }}
  CRAWLER_NODE_TYPE: "{{ .Values.crawler_node_type }}"
  REDIS_URL: "{{ .Values.redis_url }}"
  REDIS_CRAWLS_DONE_KEY: "crawls-done"
  NO_DELETE_JOBS: "{{ .Values.no_delete_jobs | default 0 }}"
  GRACE_PERIOD_SECS: "{{ .Values.grace_period_secs | default 600 }}"
  REGISTRATION_ENABLED: "{{ .Values.registration_enabled | default 0 }}"
  ALLOW_DUPE_INVITES: "{{ .Values.allow_dupe_invites | default 0 }}"
  JWT_TOKEN_LIFETIME_MINUTES: "{{ .Values.jwt_token_lifetime_minutes | default 60 }}"
  DEFAULT_BEHAVIOR_TIME_SECONDS: "{{ .Values.default_behavior_time_seconds }}"
  DEFAULT_PAGE_LOAD_TIME_SECONDS: "{{ .Values.default_page_load_time_seconds }}"
  MAX_PAGES_PER_CRAWL: "{{ .Values.max_pages_per_crawl | default 0 }}"
  IDLE_TIMEOUT: "{{ .Values.profile_browser_idle_seconds | default 60 }}"
  RERUN_LAST_MIGRATION: "{{ .Values.rerun_last_migration }}"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: shared-crawler-config
  namespace: {{ .Values.crawler_namespace }}
data:
  CRAWL_ARGS: "{{ .Values.crawler_args }} --workers {{ .Values.crawler_browser_instances | default 1 }} --sizeLimit {{ .Values.crawler_session_size_limit_bytes }} --timeLimit {{ .Values.crawler_session_time_limit_seconds }} --maxPageLimit {{ .Values.max_pages_per_crawl | default 0 }} --healthCheckPort {{ .Values.crawler_liveness_port }}"
---
apiVersion: v1
kind: ConfigMap
metadata:
name: shared-job-config
#namespace: {{ .Values.crawler_namespace }}
namespace: {{ .Release.Namespace }}
data:
config.yaml: |
namespace: {{ .Values.crawler_namespace }}
termination_grace_secs: "{{ .Values.grace_period_secs | default 600 }}"
volume_storage_class: "{{ .Values.volume_storage_class }}"
requests_hd: "{{ .Values.crawler_requests_storage }}"
# redis
redis_image: {{ .Values.redis_image }}
redis_image_pull_policy: {{ .Values.redis_pull_policy }}
redis_requests_cpu: "{{ .Values.redis_requests_cpu }}"
redis_limits_cpu: "{{ .Values.redis_limits_cpu }}"
redis_requests_memory: "{{ .Values.redis_requests_memory }}"
redis_limits_memory: "{{ .Values.redis_limits_memory }}"
# crawler
crawler_image: {{ .Values.crawler_image }}
crawler_image_pull_policy: {{ .Values.crawler_pull_policy }}
crawler_requests_cpu: "{{ .Values.crawler_requests_cpu }}"
crawler_limits_cpu: "{{ .Values.crawler_limits_cpu }}"
crawler_requests_memory: "{{ .Values.crawler_requests_memory }}"
crawler_limits_memory: "{{ .Values.crawler_limits_memory }}"
crawler_liveness_port: "{{ .Values.crawler_liveness_port | default 0 }}"
crawler_node_type: "{{ .Values.crawler_node_type }}"
redis_node_type: "{{ .Values.redis_node_type }}"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: shared-redis-conf
  namespace: {{ .Values.crawler_namespace }}
data:
  redis.conf: |
    appendonly yes
    dir /data
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
  namespace: {{ .Release.Namespace }}
data:
{{ (.Files.Glob "*.conf").AsConfig | indent 2 }}
#{{ (.Files.Glob "frontend/*.*").AsConfig | indent 2 }}