browsertrix/backend/btrixcloud/k8s/templates/crawl_job.yaml
Ilya Kreymer 793611e5bb
add exclusion api, fixes #311 (#349)
add new apis: `POST crawls/{crawl_id}/exclusion?regex=...` and `DELETE crawls/{crawl_id}/exclusion?regex=...` which will:
- create a new config with 'regex' added as an exclusion (deleting or deactivating the previous config), or with 'regex' removed as an exclusion.
- update crawl to point to new config
- update statefulset to point to new config, causing crawler pods to restart
- filter out urls matching 'regex' from both the queue and the seen list (when adding only; currently a bit slow)
- return 400 if the exclusion already exists when adding, or doesn't exist when removing
- api reads redis list in reverse to match how exclusion queue is used
2022-11-12 17:24:30 -08:00
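The add/remove behaviour described in the commit message can be sketched in plain Python. This is an in-memory illustration only, not the btrixcloud implementation: the function names, the `ExclusionError` type, the use of a list and a set in place of the Redis queue and seen list, and the choice of `re.search` for matching are all assumptions made for the example. Creating the new config and restarting the crawler pods are left out.

```python
import re


class ExclusionError(Exception):
    """Hypothetical error type; an API layer could map it to HTTP 400."""


def add_exclusion(exclusions, queue, seen, regex):
    """Return (exclusions, queue, seen) with `regex` added as an exclusion.

    URLs matching `regex` are dropped from both the queue and the seen set.
    Raises ExclusionError if the exclusion is already present.
    """
    if regex in exclusions:
        raise ExclusionError(f"exclusion already exists: {regex}")
    pattern = re.compile(regex)
    new_exclusions = exclusions + [regex]
    # Assumption: a URL is excluded if the pattern matches anywhere in it.
    new_queue = [url for url in queue if not pattern.search(url)]
    new_seen = {url for url in seen if not pattern.search(url)}
    return new_exclusions, new_queue, new_seen


def remove_exclusion(exclusions, regex):
    """Return the exclusion list without `regex`.

    Raises ExclusionError if the exclusion is not present. Nothing is
    filtered here, since filtering applies when adding only.
    """
    if regex not in exclusions:
        raise ExclusionError(f"exclusion does not exist: {regex}")
    return [r for r in exclusions if r != regex]
```

Returning new collections rather than mutating the inputs mirrors the commit's approach of creating a new config instead of editing the existing one in place.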

107 lines
2.6 KiB
YAML

apiVersion: batch/v1
kind: Job
metadata:
  name: job-{{ id }}
  annotations:
    btrix.run.manual: "{{ manual }}"
  labels:
    btrix.user: {{ userid }}
    btrix.archive: {{ aid }}
    btrix.crawlconfig: {{ cid }}
spec:
  backoffLimit: 1000
  ttlSecondsAfterFinished: 20
  template:
    metadata:
      labels:
        btrix.user: {{ userid }}
        btrix.archive: {{ aid }}
        btrix.crawlconfig: {{ cid }}
    spec:
      restartPolicy: OnFailure
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: nodeType
                    operator: In
                    values:
                      - "{{ crawler_node_type }}"
      tolerations:
        - key: "nodeType"
          operator: "Equal"
          value: "crawling"
          effect: "NoSchedule"
      containers:
        - name: crawl-job
          image: {{ job_image }}
          imagePullPolicy: Always
          command: ["uvicorn", "btrixcloud.k8s.crawl_job:app", "--host", "0.0.0.0", "--access-log", "--log-level", "info"]
          volumeMounts:
            - name: config-volume
              mountPath: /config
          envFrom:
            - secretRef:
                name: mongo-auth
          env:
            - name: JOB_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['job-name']
            - name: RUN_MANUAL
              value: "{{ manual }}"
            - name: USER_ID
              value: "{{ userid }}"
            - name: ARCHIVE_ID
              value: "{{ aid }}"
            - name: CRAWL_CONFIG_ID
              value: "{{ cid }}"
            - name: STORE_PATH
              valueFrom:
                configMapKeyRef:
                  name: crawl-config-{{ cid }}
                  key: STORE_PATH
            - name: STORE_FILENAME
              valueFrom:
                configMapKeyRef:
                  name: crawl-config-{{ cid }}
                  key: STORE_FILENAME
            - name: STORAGE_NAME
              valueFrom:
                configMapKeyRef:
                  name: crawl-config-{{ cid }}
                  key: STORAGE_NAME
            - name: PROFILE_FILENAME
              valueFrom:
                configMapKeyRef:
                  name: crawl-config-{{ cid }}
                  key: PROFILE_FILENAME
      volumes:
        - name: config-volume
          configMap:
            name: shared-job-config
            items:
              - key: config.yaml
                path: config.yaml
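
The `env` entries above reach the container as plain environment variables. A minimal sketch of how a Python process might collect them is shown here; the `JobSettings` dataclass, the `ENV_NAMES` mapping, and `load_job_settings` are hypothetical names for illustration and are not taken from `btrixcloud.k8s.crawl_job`. Values are kept as raw strings, since how the application interprets them (for example `RUN_MANUAL`) is not shown in this file.

```python
import os
from dataclasses import dataclass


@dataclass
class JobSettings:
    """Hypothetical container for the variables set in the Job template."""
    job_id: str
    run_manual: str
    user_id: str
    archive_id: str
    crawl_config_id: str
    store_path: str
    store_filename: str
    storage_name: str
    profile_filename: str


# Field name -> environment variable name, as declared under `env:`.
ENV_NAMES = {
    "job_id": "JOB_ID",
    "run_manual": "RUN_MANUAL",
    "user_id": "USER_ID",
    "archive_id": "ARCHIVE_ID",
    "crawl_config_id": "CRAWL_CONFIG_ID",
    "store_path": "STORE_PATH",
    "store_filename": "STORE_FILENAME",
    "storage_name": "STORAGE_NAME",
    "profile_filename": "PROFILE_FILENAME",
}


def load_job_settings(env=None):
    """Build JobSettings from a mapping (defaults to os.environ).

    Raises KeyError listing every variable that is missing.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return JobSettings(**{field: env[name] for field, name in ENV_NAMES.items()})
```

Accepting the mapping as a parameter keeps the function testable without touching the real process environment.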