browsertrix/backend/btrixcloud/migrations
Tessa Walsh f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats
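
As a rough illustration of the `version` gating described above, here is a minimal sketch. It is not the actual endpoint code: the function names, the plain-dict representation of crawls, and the shape of the response are assumptions made for the example. Only the field names (`version`, `initialPages`, `pageQueryUrl`, `preloadResources`) come from the list above.

```python
from typing import Any

# Replay optimization fields named in the commit message.
REPLAY_OPT_FIELDS = ("initialPages", "pageQueryUrl", "preloadResources")


def all_items_migrated(items: list[dict[str, Any]]) -> bool:
    """Return True only if every crawl/upload has `version` set to 2.

    Hypothetical helper. Note that an empty list passes vacuously,
    which may or may not match the real behavior.
    """
    return all(item.get("version") == 2 for item in items)


def build_replay_json(
    base: dict[str, Any],
    optimized: dict[str, Any],
    items: list[dict[str, Any]],
) -> dict[str, Any]:
    """Merge the optimization fields into the response only when safe.

    `base` holds the fields that are always returned; `optimized` holds
    the precomputed replay optimization fields (both hypothetical).
    """
    result = dict(base)
    if all_items_migrated(items):
        for field in REPLAY_OPT_FIELDS:
            if field in optimized:
                result[field] = optimized[field]
    return result
```

With one unmigrated crawl in `items`, the response contains only the `base` fields, so a client would presumably fall back to the unoptimized replay path until the background job has finished.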

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources and avoid
fetching the crawl list twice
- Optimize /collections endpoint by not fetching resources
- Rename /urls -> /pageUrlCounts and avoid $group; instead, sort with an
index, either by seed + ts or by url, to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
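
To show why `$gte` can stand in for `$regex` on URL prefix matches, here is a minimal sketch that needs no database. The filter shape uses the standard MongoDB `$gte` operator; the function names and the in-memory simulation are assumptions for illustration, not code from this change. The idea is that in an index sorted by `url`, every URL sharing a prefix sits in one contiguous run that begins at the first entry greater than or equal to the prefix.

```python
import bisect
from itertools import islice, takewhile


def prefix_query(prefix: str) -> dict:
    """Build a MongoDB-style filter for URLs at or after `prefix`.

    Hypothetical helper: paired with an ascending sort on `url` and a
    limit, the first results are the prefix matches.
    """
    return {"url": {"$gte": prefix}}


def simulate_prefix_scan(sorted_urls: list[str], prefix: str, limit: int = 10) -> list[str]:
    """Mimic an index range scan over an already-sorted list of URLs."""
    # Seek to the first URL >= prefix, as a sorted index would.
    start = bisect.bisect_left(sorted_urls, prefix)
    # Matches are contiguous, so stop at the first non-matching URL.
    run = takewhile(lambda u: u.startswith(prefix), islice(sorted_urls, start, None))
    return list(islice(run, limit))


urls = sorted(
    [
        "https://example.com/",
        "https://example.com/blog",
        "https://example.com/about",
        "https://example.org/a",
        "https://example.com/archive",
    ]
)
matches = simulate_prefix_scan(urls, "https://example.com/a")
```

Here `matches` holds the two URLs beginning with `https://example.com/a`. The scan seeks once and reads only the matching run, whereas a `$regex` filter may have to examine many more index entries or documents, depending on the pattern.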


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
__init__.py Fix migration to avoid duplicate collection slugs and names (#2318) 2025-01-21 14:23:32 -08:00
migration_0001_archives_to_orgs.py
migration_0002_crawlconfig_crawlstats.py
migration_0003_mutable_crawl_configs.py
migration_0004_config_seeds.py
migration_0005_operator_scheduled_jobs.py
migration_0006_precompute_crawl_stats.py
migration_0007_colls_and_config_update.py Reformat with Black for 2025 ruleset (#2349) 2025-01-29 16:57:06 -05:00
migration_0008_precompute_crawl_file_stats.py
migration_0009_crawl_types.py
migration_0010_collection_total_size.py
migration_0011_crawl_timeout_configmap.py
migration_0012_notes_to_description.py
migration_0013_crawl_name.py
migration_0014_to_collection_ids.py
migration_0015_org_storage_usage.py
migration_0016_operator_scheduled_jobs_v2.py
migration_0017_storage_by_type.py
migration_0018_usernames.py
migration_0019_org_slug.py
migration_0020_org_storage_refs.py
migration_0021_profile_filenames.py
migration_0022_partial_complete.py
migration_0023_available_extra_exec_mins.py
migration_0024_crawlerchannel.py
migration_0025_workflow_db_configmap_fixes.py
migration_0026_crawl_review_status.py
migration_0027_profile_modified.py
migration_0028_page_files_errors.py
migration_0029_remove_workflow_configmaps.py
migration_0030_user_invites_flatten.py
migration_0031_org_created.py
migration_0032_dupe_org_names.py
migration_0033_crawl_quota_states.py
migration_0034_drop_invalid_crc.py remove crc32 from CrawlFile (#1980) 2024-07-30 11:23:15 -07:00
migration_0035_fix_failed_logins.py fix resetting of invalid logins: (#2002) 2024-08-07 12:36:06 -07:00
migration_0036_coll_visibility.py Make changes to collections to support publicly listed collections (#2164) 2025-01-13 15:15:47 -08:00
migration_0037_upload_pages.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
migration_0038_org_last_crawl_finished.py Add last crawl and subscription status indicators to org list (#2273) 2025-01-14 10:57:06 -05:00
migration_0039_coll_slugs.py Fix migration to avoid duplicate collection slugs and names (#2318) 2025-01-21 14:23:32 -08:00
migration_0040_archived_item_page_count.py feat: Update collection sorting, metadata, stats (#2327) 2025-01-23 13:32:23 -05:00
migration_0041_pages_snapshots.py feat: Update collection sorting, metadata, stats (#2327) 2025-01-23 13:32:23 -05:00
migration_0042_page_filenames.py Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00