browsertrix/backend/test
Tessa Walsh f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats
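
The `version` gating of the replay.json fields described above can be sketched roughly as follows. This is a minimal illustration and not the actual backend code: the helper name `replay_optimization_fields`, its signature, and the plain-dict representation of crawls are assumptions. Only the field names `version`, `initialPages`, `pageQueryUrl`, and `preloadResources` come from the description above.

```python
from typing import Any


def replay_optimization_fields(
    crawls: list[dict[str, Any]],
    initial_pages: list[dict[str, Any]],
    page_query_url: str,
    preload_resources: list[dict[str, Any]],
) -> dict[str, Any]:
    """Return the replay optimization fields only if every crawl is migrated.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # A crawl or upload counts as migrated once its `version` is 2.
    # Treating an empty crawl list as "not ready" is an assumption.
    if not crawls or any(crawl.get("version") != 2 for crawl in crawls):
        return {}

    return {
        "initialPages": initial_pages,
        "pageQueryUrl": page_query_url,
        "preloadResources": preload_resources,
    }
```

With this shape, a collection that still contains a single unmigrated crawl simply omits the three fields, so the response stays valid while the migration job is in progress.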

Query Optimizations:

- Remove all uses of `$group` and `$facet`
- Optimize `/replay.json` endpoints to precompute `preload_resources` and
avoid fetching the crawl list twice
- Optimize `/collections` endpoint by not fetching resources
- Rename `/urls` -> `/pageUrlCounts` and avoid `$group`; instead, sort with
an index, either by seed + ts or by url, to get the top matches
- Use `$gte` instead of `$regex` to get prefix matches on URL
- Use `$text` instead of `$regex` to get text search on title
- Remove total from `/pages` and `/pageUrlCounts` queries by not using
`$facet`
- frontend: only call `/pageUrlCounts` when the dialog is opened
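
The `$gte` and `$text` replacements for `$regex` listed above might look roughly like the filters below. This is a sketch under stated assumptions, not the queries used in the backend: the helper names are hypothetical, the `url` field name is assumed, and the `$lt` upper bound is an addition for illustration (the list above only mentions `$gte`). The `$text` operator requires a text index on the searched field. The plain-Python check assumes the default binary string ordering, with no custom collation.

```python
from typing import Any


def url_prefix_filter(prefix: str) -> dict[str, Any]:
    """Build a range filter that matches URLs starting with `prefix`."""
    # Assumption: the upper bound appends the highest Unicode code point,
    # so the range [prefix, prefix + U+10FFFF) covers any realistic URL
    # that starts with `prefix`.
    return {"url": {"$gte": prefix, "$lt": prefix + "\U0010ffff"}}


def title_text_filter(search: str) -> dict[str, Any]:
    """Build a `$text` filter; MongoDB needs a text index for this."""
    return {"$text": {"$search": search}}


def matches_prefix_filter(url: str, query: dict[str, Any]) -> bool:
    """Evaluate the range filter in plain Python to check the bounds."""
    bounds = query["url"]
    return bounds["$gte"] <= url < bounds["$lt"]
```

A range comparison on an indexed field can be served directly by that index, and combined with a sort on the same index it returns the top matches without a grouping stage.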


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
| Name | Last commit | Date |
|---|---|---|
| `data` | Backend work for public collections (#2198) | 2025-01-13 15:15:48 -08:00 |
| `__init__.py` | | |
| `conftest.py` | Add collection page list/search endpoint (#2354) | 2025-02-10 16:44:37 -08:00 |
| `echo_server.py` | | |
| `test_api.py` | quickfix: update test_api.py to match all locales enabled by default (#2241) | 2024-12-13 20:30:06 -08:00 |
| `test_collections.py` | Rework crawl page migration + MongoDB Query Optimizations (#2412) | 2025-02-20 15:26:11 -08:00 |
| `test_crawl_config_search_values.py` | | |
| `test_crawl_config_tags.py` | | |
| `test_crawlconfigs.py` | Add support for custom link selectors to backend (#2346) | 2025-02-13 22:22:27 -08:00 |
| `test_filter_sort_results.py` | | |
| `test_login.py` | | |
| `test_org_subs.py` | security: tweak get /invite endpoints / InviteOut to: (#2087) | 2024-09-20 11:52:56 -07:00 |
| `test_org.py` | Rework crawl page migration + MongoDB Query Optimizations (#2412) | 2025-02-20 15:26:11 -08:00 |
| `test_permissions.py` | | |
| `test_profiles.py` | Serialize datetimes with Z suffix (#2058) | 2024-09-12 16:16:13 -07:00 |
| `test_qa.py` | Move org storage recalculation into background job (#2138) | 2024-11-19 17:32:57 -05:00 |
| `test_run_crawl.py` | Rework crawl page migration + MongoDB Query Optimizations (#2412) | 2025-02-20 15:26:11 -08:00 |
| `test_stop_cancel_crawl.py` | | |
| `test_uploads.py` | Rework crawl page migration + MongoDB Query Optimizations (#2412) | 2025-02-20 15:26:11 -08:00 |
| `test_users.py` | Add superuser endpoint to get user emails with org info (#2211) | 2024-12-09 16:38:01 -08:00 |
| `test_utils.py` | | |
| `test_webhooks.py` | | |
| `test_workflow_auto_add_to_collection.py` | | |
| `test_y_org_import_export.py` | | |
| `test_z_delete_org.py` | Move org storage recalculation into background job (#2138) | 2024-11-19 17:32:57 -05:00 |
| `utils.py` | | |