browsertrix


Tessa Walsh f8fb2d2c8d Rework crawl page migration + MongoDB Query Optimizations (#2412)		2025-02-20 15:26:11 -08:00

Fixes #2406

Converts migration 0042 to launch a background job (parallelized across several pods) to migrate all crawls by optimizing their pages and setting `version: 2` on the crawl when complete.

Also Optimizes MongoDB queries for better performance.

Migration Improvements:
- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying `migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure that we have list and retry endpoints for superusers that work with background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include fields for replay optimization (`initialPages`, `pageQueryUrl`, `preloadResources`) if all relevant crawls/uploads have `version` set to `2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid fetching crawl list twice
- Optimize /collections endpoint by not fetching resources
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using $facet
- frontend: only call /pageUrlCounts when dialog is opened.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
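The "Use $gte instead of $regex to get prefix matches on URL" item in the commit message can be illustrated with a rough sketch. This is not the code from the commit: the helper names `prefix_range_filter` and `in_range` are hypothetical, and the `$lt` upper bound is an assumption added here to close the range (the commit message mentions only `$gte`). It also assumes the default binary string comparison, where ordering follows Unicode code points, and builds only the filter document, so it runs without a database.

```python
from typing import Any, Dict

MAX_CODE_POINT = 0x10FFFF


def prefix_range_filter(field: str, prefix: str) -> Dict[str, Any]:
    """Build a MongoDB-style range filter matching values that start with `prefix`.

    Hypothetical helper: a range on an indexed field can use the index order,
    whereas a regex may need to examine many more entries.
    """
    if not prefix:
        return {}  # an empty prefix matches everything, so no filter is needed
    bounds: Dict[str, Any] = {"$gte": prefix}
    last = ord(prefix[-1])
    if last < MAX_CODE_POINT:
        # Assumed upper bound: the prefix with its last character incremented.
        bounds["$lt"] = prefix[:-1] + chr(last + 1)
    return {field: bounds}


def in_range(value: str, bounds: Dict[str, Any]) -> bool:
    """Emulate the range check locally with Python's code-point string ordering."""
    if value < bounds["$gte"]:
        return False
    return "$lt" not in bounds or value < bounds["$lt"]


query = prefix_range_filter("url", "https://example.com/")
print(query)
print(in_range("https://example.com/about", query["url"]))
print(in_range("https://example.org/", query["url"]))
```

With a real driver, a filter of this shape would be passed to a `find` call on the pages collection; the field name `url` is taken from the commit message, and everything else is illustrative.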
btrixcloud	Rework crawl page migration + MongoDB Query Optimizations (#2412)	2025-02-20 15:26:11 -08:00
test	Rework crawl page migration + MongoDB Query Optimizations (#2412)	2025-02-20 15:26:11 -08:00
test_nightly	Add superuser endpoint to get user emails with org info (#2211)	2024-12-09 16:38:01 -08:00
.pylintrc	security: tweak get /invite endpoints / InviteOut to: (#2087)	2024-09-20 11:52:56 -07:00
dev-requirements.txt	quickfix: pin mypy version to avoid issues with latest release	2024-07-19 18:30:57 -07:00
Dockerfile	Pydantic 2.x update + type fixes + python 3.12 (#1947)	2024-07-22 17:23:03 -07:00
mypy.ini	Support multiple crawler versions (#1420)	2024-01-16 15:32:12 -08:00
requirements.txt	switch to simpler streaming download + multiwacz metadata improvements: (#1982)	2024-10-03 16:13:31 -07:00
test-requirements.txt	Fix nightly tests: Add boto3 as test requirement (#2116)	2024-10-23 13:41:22 -07:00