browsertrix/backend/btrixcloud/templates/redis.yaml
Ilya Kreymer fa86555eed
Track pod resource usage, detect OOM crashes, handle auto-scaling (#1235)
* keep track of per pod status on crawljob:
- crash time and reason
- 'used' vs 'allocated' resources 
- 'percent' used / allocated
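
A minimal sketch of how the 'used' vs 'allocated' bookkeeping might look, assuming memory is reported as Kubernetes-style quantity strings and that 'percent' is expressed on a 0-100 scale. `parse_memory` and `PodResourceStatus` are hypothetical names for illustration, not the operator's actual schema:

```python
from dataclasses import dataclass

# Binary suffixes used by Kubernetes memory quantities (a subset, for illustration).
_MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' to bytes; bare numbers are taken as bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


@dataclass
class PodResourceStatus:
    """Hypothetical per-pod record; field names are assumptions."""

    allocated_memory: int  # bytes requested for the pod
    used_memory: int  # bytes currently in use (from metrics or redis)

    @property
    def percent_memory(self) -> float:
        """Used memory as a percentage of allocated memory (0.0 if nothing allocated)."""
        if self.allocated_memory == 0:
            return 0.0
        return 100.0 * self.used_memory / self.allocated_memory


status = PodResourceStatus(
    allocated_memory=parse_memory("1Gi"),
    used_memory=parse_memory("768Mi"),
)
print(status.percent_memory)
```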

* crawl log errors: log error when crawler crashes via OOM, either via redis error log
or to console

* add initial autoscaling support!
- detect if metrics server is available via K8SApi.is_pod_metrics_available()
- if available, use metrics for 'used' fields
- if no metrics, set memory used for redis only (using redis apis)
- allow overriding memory and cpu via newMemory and newCpu settings on pod status
- scale memory / cpu based on newMemory and newCpu setting
- templates: update jinja templates to allow restarting crawler and redis with new resources
- ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs
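
The scaling step above could be sketched roughly as follows. The threshold, growth percentage, ceiling, and the `next_memory` helper are all assumptions made for illustration; the commit does not state the actual values or function names. The returned value stands in for what would be written to the `newMemory` setting:

```python
from typing import Optional

# All three constants are assumed values, not taken from the operator.
SCALE_UP_THRESHOLD_PERCENT = 90  # scale once usage reaches 90% of the allocation
SCALE_UP_PERCENT = 120  # grow the allocation to 120% of its current size
MAX_MEMORY = 4 * 1024**3  # never request more than 4 GiB


def next_memory(allocated: int, used: int) -> Optional[int]:
    """Return a new memory allocation in bytes, or None if no change is needed."""
    if allocated <= 0:
        return None
    # Integer arithmetic avoids floating-point rounding surprises.
    if used * 100 < allocated * SCALE_UP_THRESHOLD_PERCENT:
        return None
    proposed = min(allocated * SCALE_UP_PERCENT // 100, MAX_MEMORY)
    # Only report a change when the allocation would actually grow.
    return proposed if proposed > allocated else None


print(next_memory(allocated=1000, used=950))
```

A `None` result would leave the pod untouched, while any other result would trigger a restart of the pod with the new resources through the updated templates.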

* roles: cleanup unused roles, add permissions for listing metrics

* stats for running crawls:
- update in db via operator
- avoids losing stats if redis pod happens to be done
- tradeoff: more db access in the operator, but fewer extra connections to redis, and the
backend already loads from the db
- size stat: ensure size of previous files is added to the stats

* crawler deployment tweaks:
- adjust cpu/mem per browser
- add --headless flag to configmap to use new headless mode by default!
2023-10-05 20:41:18 -07:00

131 lines
2.6 KiB
YAML

# -------
# PVC
# -------
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: {{ name }}
  namespace: {{ namespace }}
  labels:
    crawl: {{ id }}
    role: redis
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: {{ redis_storage }}
  {% if volume_storage_class %}
  storageClassName: {{ volume_storage_class }}
  {% endif %}
# --------
# REDIS
# --------
{% if init_redis %}
---
apiVersion: v1
kind: Pod
metadata:
  name: {{ name }}
  namespace: {{ namespace }}
  labels:
    crawl: {{ id }}
    role: redis
spec:
  hostname: {{ name }}
  subdomain: redis
  terminationGracePeriodSeconds: 10
  volumes:
    - name: shared-redis-conf
      configMap:
        name: shared-redis-conf
        items:
          - key: redis.conf
            path: redis.conf
    - name: redis-data
      persistentVolumeClaim:
        claimName: {{ name }}
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: nodeType
                operator: In
                values:
                  - "{{ redis_node_type }}"
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 2
          podAffinityTerm:
            topologyKey: "failure-domain.beta.kubernetes.io/zone"
            labelSelector:
              matchLabels:
                crawl: {{ id }}
  tolerations:
    - key: nodeType
      operator: Equal
      value: crawling
      effect: NoSchedule
    - key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
      effect: NoExecute
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
  containers:
    - name: redis
      image: {{ redis_image }}
      imagePullPolicy: {{ redis_image_pull_policy }}
      args: ["/redis-conf/redis.conf", "--appendonly", "yes"]
      volumeMounts:
        - name: redis-data
          mountPath: /data
        - name: shared-redis-conf
          mountPath: /redis-conf
      resources:
        limits:
          memory: {{ memory }}
        requests:
          cpu: {{ cpu }}
          memory: {{ memory }}
      readinessProbe:
        initialDelaySeconds: 10
        timeoutSeconds: 5
        exec:
          command:
            - bash
            - -c
            - "res=$(redis-cli ping); [[ $res = 'PONG' ]]"
      livenessProbe:
        initialDelaySeconds: 10
        timeoutSeconds: 5
        exec:
          command:
            - bash
            - -c
            - "res=$(redis-cli ping); [[ $res = 'PONG' ]]"
{% endif %}