Fixes #2406

Converts migration 0042 to launch a background job (parallelized across several pods) to migrate all crawls by optimizing their pages and setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying `migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure that we have list and retry endpoints for superusers that work with background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include fields for replay optimization (`initialPages`, `pageQueryUrl`, `preloadResources`) if all relevant crawls/uploads have `version` set to `2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:

- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid fetching crawl list twice
- Optimize /collections endpoint by not fetching resources
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using $facet
- frontend: only call /pageUrlCounts when dialog is opened.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
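The switch from `$regex` to `$gte` for URL prefix matches can be illustrated with a small sketch. It is not the code from this change: `prefix_filter` and `prefix_matches` are hypothetical names, and the in-memory sorted list only stands in for a MongoDB index on `url`. The idea, as far as the commit message describes it, is to seek to the first URL at or after the prefix and read forward in index order, rather than evaluating a pattern.

```python
from bisect import bisect_left


def prefix_filter(prefix: str) -> dict:
    """Build a MongoDB-style filter selecting documents whose url >= prefix.

    Hypothetical helper: the field name "url" is an assumption.
    """
    return {"url": {"$gte": prefix}}


def prefix_matches(sorted_urls: list[str], prefix: str, limit: int = 10) -> list[str]:
    """Emulate an index scan over a sorted list of URLs.

    Seek to the first url >= prefix, then read forward until a url no
    longer starts with the prefix or the limit is reached.
    """
    start = bisect_left(sorted_urls, prefix)
    matches: list[str] = []
    for url in sorted_urls[start:]:
        if len(matches) >= limit or not url.startswith(prefix):
            break
        matches.append(url)
    return matches


# Stand-in for an index sorted by url.
urls = sorted(
    [
        "https://example.com/",
        "https://example.com/about",
        "https://example.com/blog/1",
        "https://example.org/",
    ]
)

print(prefix_filter("https://example.com/b"))
print(prefix_matches(urls, "https://example.com/b"))
```

Because all matches for a prefix are contiguous in sorted order, the scan can stop at the first non-matching URL, which is what makes a sorted, limited `$gte` query a plausible replacement for a regex prefix search.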
74 lines
1.5 KiB
YAML
apiVersion: batch/v1
kind: Job
metadata:
  name: "{{ id }}"
  labels:
    role: "background-job"
    job_type: {{ job_type }}
{% if oid %}
    btrix.org: {{ oid }}
{% endif %}

spec:
  ttlSecondsAfterFinished: 90
  backoffLimit: 3
{% if scale %}
  parallelism: {{ scale }}
{% endif %}
  template:
    spec:
      restartPolicy: Never
      priorityClassName: bg-job
      podFailurePolicy:
        rules:
          - action: FailJob
            onExitCodes:
              containerName: btrixbgjob
              operator: NotIn
              values: [0]

      volumes:
        - name: ops-configs
          secret:
            secretName: ops-configs

      containers:
        - name: btrixbgjob
          image: {{ backend_image }}
          imagePullPolicy: {{ pull_policy }}
          env:
            - name: BG_JOB_TYPE
              value: {{ job_type }}

{% if oid %}
            - name: OID
              value: {{ oid }}
{% endif %}
            - name: CRAWL_TYPE
              value: {{ crawl_type }}

{% if crawl_id %}
            - name: CRAWL_ID
              value: {{ crawl_id }}
{% endif %}

          envFrom:
            - configMapRef:
                name: backend-env-config
            - secretRef:
                name: mongo-auth

          volumeMounts:
            - name: ops-configs
              mountPath: /ops-configs/

          command: ["python3", "-m", "btrixcloud.main_bg"]

          resources:
            limits:
              memory: "500Mi"

            requests:
              memory: "250Mi"
              cpu: "200m"
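The template above passes the job parameters to the container as environment variables: `BG_JOB_TYPE` and `CRAWL_TYPE` always, `OID` and `CRAWL_ID` only when set. As a rough illustration of how an entrypoint might read them, here is a minimal sketch. It is not the actual `btrixcloud.main_bg` module: `BgJobConfig`, `load_config`, and the sample values are hypothetical, and treating an empty value as unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class BgJobConfig:
    """Hypothetical container-side view of the env vars set by the template."""

    job_type: str
    crawl_type: Optional[str] = None
    oid: Optional[str] = None
    crawl_id: Optional[str] = None


def load_config(env: Mapping[str, str] = os.environ) -> BgJobConfig:
    """Read job settings from the environment.

    BG_JOB_TYPE is required (the template always sets it). OID and
    CRAWL_ID sit behind {% if %} blocks, so they may be absent; empty
    strings are treated the same as missing (an assumption).
    """
    return BgJobConfig(
        job_type=env["BG_JOB_TYPE"],
        crawl_type=env.get("CRAWL_TYPE") or None,
        oid=env.get("OID") or None,
        crawl_id=env.get("CRAWL_ID") or None,
    )


# Example with placeholder values; a job not tied to an org has no OID.
config = load_config({"BG_JOB_TYPE": "example-job", "CRAWL_TYPE": "crawl"})
print(config)
```

Keeping the org and crawl identifiers optional mirrors the conditional blocks in the template, which is what lets the same job definition serve background jobs that are not tied to a single org.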