Fixes #2406

Converts migration 0042 to launch a background job (parallelized across several pods) that migrates all crawls by optimizing their pages and setting `version: 2` on each crawl when complete. Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration, with accompanying `migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure that we have list and retry endpoints for superusers that work with background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include fields for replay optimization (`initialPages`, `pageQueryUrl`, `preloadResources`) if all relevant crawls/uploads have `version` set to `2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:

- Remove all uses of `$group` and `$facet`
- Optimize `/replay.json` endpoints to precompute `preload_resources` and avoid fetching the crawl list twice
- Optimize `/collections` endpoint by not fetching resources
- Rename `/urls` -> `/pageUrlCounts` and avoid `$group`; instead sort with an index, either by seed + ts or by url, to get top matches
- Use `$gte` instead of `$regex` to get prefix matches on URL
- Use `$text` instead of `$regex` to get text search on title
- Remove total from `/pages` and `/pageUrlCounts` queries by not using `$facet`
- frontend: only call `/pageUrlCounts` when dialog is opened

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>

||
---|---|---|
data | ||
__init__.py | ||
conftest.py | ||
echo_server.py | ||
test_api.py | ||
test_collections.py | ||
test_crawl_config_search_values.py | ||
test_crawl_config_tags.py | ||
test_crawlconfigs.py | ||
test_filter_sort_results.py | ||
test_login.py | ||
test_org_subs.py | ||
test_org.py | ||
test_permissions.py | ||
test_profiles.py | ||
test_qa.py | ||
test_run_crawl.py | ||
test_stop_cancel_crawl.py | ||
test_uploads.py | ||
test_users.py | ||
test_utils.py | ||
test_webhooks.py | ||
test_workflow_auto_add_to_collection.py | ||
test_y_org_import_export.py | ||
test_z_delete_org.py | ||
utils.py |
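
The commit message at the top of this listing mentions using `$gte` instead of `$regex` for prefix matches on URL. As a rough illustration of why that helps, a prefix match can be expressed as a range over sorted string values, which a sorted index can answer with a contiguous scan. The sketch below is not the project's implementation: the helper names `prefix_range_filter` and `prefix_slice` are hypothetical, and the explicit `$lt` upper bound is an assumption added here (the commit message only mentions `$gte`, which could equally be paired with a sort and limit).

```python
from bisect import bisect_left


def _upper_bound(prefix: str) -> str:
    # Smallest string that sorts after every string starting with `prefix`.
    # Assumption: the last character can be incremented to a valid code point.
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_range_filter(field: str, prefix: str) -> dict:
    """Build a MongoDB-style filter dict for values of `field` starting with `prefix`."""
    if not prefix:
        return {}  # an empty prefix matches every document
    return {field: {"$gte": prefix, "$lt": _upper_bound(prefix)}}


def prefix_slice(sorted_values: list[str], prefix: str) -> list[str]:
    """Emulate the index range scan on an in-memory sorted list."""
    if not prefix:
        return list(sorted_values)
    lo = bisect_left(sorted_values, prefix)
    hi = bisect_left(sorted_values, _upper_bound(prefix))
    return sorted_values[lo:hi]
```

For example, `prefix_range_filter("url", "https://example.com/")` builds a filter bounded below by the prefix itself and above by `"https://example.com0"`, since `0` is the character immediately after `/`. `prefix_slice` shows the same idea on a plain sorted list: two binary searches locate the matching range without testing every value against a pattern.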