browsertrix/backend/btrixcloud
Tessa Walsh f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats
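
As a rough illustration of the `version` gating described above (a minimal sketch, not the actual endpoint code: the function names and the dict-based crawl records are hypothetical, and only the field names `version`, `initialPages`, `pageQueryUrl` and `preloadResources` come from this commit message):

```python
from typing import Any


def all_crawls_optimized(crawls: list[dict[str, Any]]) -> bool:
    """Return True only if every crawl/upload record has version == 2.

    Note: all() on an empty list is True, so an empty collection is
    treated as optimized in this sketch (an assumption, not confirmed).
    """
    return all(c.get("version") == 2 for c in crawls)


def build_replay_json(
    base: dict[str, Any],
    crawls: list[dict[str, Any]],
    optimized_fields: dict[str, Any],
) -> dict[str, Any]:
    """Merge the replay optimization fields into the response only when
    every relevant crawl has been migrated; otherwise return base as-is."""
    out = dict(base)  # copy so the caller's dict is not mutated
    if all_crawls_optimized(crawls):
        out.update(optimized_fields)
    return out


# Hypothetical example data
optimized = {"initialPages": [], "pageQueryUrl": "/pages", "preloadResources": []}
migrated = build_replay_json({"id": "c1"}, [{"version": 2}, {"version": 2}], optimized)
pending = build_replay_json({"id": "c1"}, [{"version": 2}, {}], optimized)
```

Here `migrated` carries the three optimization fields, while `pending` does not, because one record has no `version` yet.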

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group; instead, sort using an
index (either by seed + ts or by url) to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex for text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
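
To illustrate the `$gte` prefix-match idea from the list above, here is a minimal sketch of how such a filter could be built. The commit message mentions only `$gte`; the `$lt` upper bound, the helper name `prefix_range_filter` and the field name `url` are assumptions made for illustration. A range filter like this can be served by an ordinary index on the field.

```python
def prefix_range_filter(field: str, prefix: str) -> dict:
    """Build a MongoDB-style filter matching values of `field` that start
    with `prefix`, using a range comparison instead of $regex.

    Assumes default binary string comparison and a prefix whose last
    character is not the maximum Unicode code point.
    """
    if not prefix:
        return {}  # empty prefix matches everything
    # Smallest string that sorts after every string starting with `prefix`:
    # bump the final character by one code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return {field: {"$gte": prefix, "$lt": upper}}


def in_range(value: str, bounds: dict) -> bool:
    """Local stand-in for the database comparison, for demonstration only."""
    return bounds["$gte"] <= value < bounds["$lt"]


query = prefix_range_filter("url", "https://example.com/a")
urls = [
    "https://example.com/a",
    "https://example.com/about",
    "https://example.com/b",
    "https://example.org/a",
]
hits = [u for u in urls if in_range(u, query["url"])]
```

In this example `hits` contains only the two URLs that begin with `https://example.com/a`.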


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
migrations Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
operator Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
__init__.py
auth.py Reformat with Black for 2025 ruleset (#2349) 2025-01-29 16:57:06 -05:00
background_jobs.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
basecrawls.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
colls.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
crawlconfigs.py Fix max pages quota setting and display (#2370) 2025-02-10 16:15:21 -08:00
crawlmanager.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
crawls.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
db.py Add WACZ filename, depth, favIconUrl, isSeed to pages (#2352) 2025-02-05 15:50:04 -05:00
emailsender.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
invites.py Reformat with Black for 2025 ruleset (#2349) 2025-01-29 16:57:06 -05:00
k8sapi.py Reformat with Black for 2025 ruleset (#2349) 2025-01-29 16:57:06 -05:00
main_bg.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
main_op.py Reformat with Black for 2025 ruleset (#2349) 2025-01-29 16:57:06 -05:00
main.py Add collection page list/search endpoint (#2354) 2025-02-10 16:44:37 -08:00
models.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
ops.py Modify page upload migration (#2400) 2025-02-17 16:47:58 -08:00
orgs.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
pages.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
pagination.py Format backend with Black 24 (#1507) 2024-02-07 11:35:34 -08:00
profiles.py Reformat with Black for 2025 ruleset (#2349) 2025-01-29 16:57:06 -05:00
storages.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
subs.py Send subscription cancelation email (#2234) 2024-12-12 11:52:38 -08:00
uploads.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
users.py Add superuser endpoint to get user emails with org info (#2211) 2024-12-09 16:38:01 -08:00
utils.py Reformat with Black for 2025 ruleset (#2349) 2025-01-29 16:57:06 -05:00
version.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
webhooks.py Add webhooks for qaAnalysisStarted, qaAnalysisFinished, and crawlReviewed (#1974) 2024-07-25 16:53:49 -07:00