Tessa Walsh f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats
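
The version gate on the replay.json endpoints could look roughly like the sketch below. This is an illustration only, not the backend's actual code: the helper name `replay_optimization_fields` and the plain-dict crawl records are assumptions; only the field names `version`, `initialPages`, `pageQueryUrl`, and `preloadResources` come from the description above.

```python
from typing import Any, Dict, List

# Version value that marks a crawl/upload as migrated (from the description above).
MIGRATED_VERSION = 2


def replay_optimization_fields(
    crawls: List[Dict[str, Any]],
    initial_pages: List[Dict[str, Any]],
    page_query_url: str,
    preload_resources: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical helper: return the replay optimization fields only if
    every relevant crawl/upload has been migrated to version 2."""
    # A missing `version` field is treated as "not yet migrated".
    # Note: all() on an empty list is True, so an empty list passes the gate here.
    if not all(crawl.get("version") == MIGRATED_VERSION for crawl in crawls):
        return {}
    return {
        "initialPages": initial_pages,
        "pageQueryUrl": page_query_url,
        "preloadResources": preload_resources,
    }
```

With this shape, a single unmigrated crawl in a collection is enough to leave the optimization fields out of the response.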

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
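
As a rough sketch of the `$gte` and `$text` changes above, the filters might be built as follows. The function names are hypothetical, and the `$lt` upper bound is an assumption added here to close the range; the commit description only mentions `$gte`. The `$text` filter also assumes a text index exists on the title field, and `$text` matches on words rather than arbitrary substrings, so results can differ from a `$regex` search.

```python
from typing import Any, Dict


def url_prefix_filter(prefix: str) -> Dict[str, Any]:
    """Hypothetical helper: match URLs starting with `prefix` via a range
    comparison (which can use an index on `url`) instead of $regex."""
    if not prefix:
        return {}
    last = ord(prefix[-1])
    if last >= 0x10FFFF:
        # No larger code point exists, so leave the range open-ended.
        return {"url": {"$gte": prefix}}
    # Assumption: bump the last character to get the smallest string that
    # sorts after every string starting with `prefix`.
    upper = prefix[:-1] + chr(last + 1)
    return {"url": {"$gte": prefix, "$lt": upper}}


def title_text_filter(search: str) -> Dict[str, Any]:
    """Hypothetical helper: full-text search filter; requires a text index
    on the collection (assumed here to cover the title field)."""
    return {"$text": {"$search": search}}
```

Either dict can be passed as the filter argument of a MongoDB `find()` call; the range form lets the server walk a sorted index on `url` rather than evaluate a pattern against each document.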


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
.github quickfix: add missing dependency for docs (#2388) 2025-02-12 16:39:06 -05:00
.vscode chore: Add pylint to vscode extensions (#2387) 2025-02-12 19:40:27 -08:00
ansible
assets refactor: Implement brand colors (#2141) 2024-11-12 08:54:11 -08:00
backend Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
chart Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
configs
frontend Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00
scripts Configure browsertrix proxies (#1847) 2024-10-02 18:35:45 -07:00
test
.gitattributes Add linguist-generated attribute to generated files (#2221) 2024-12-07 01:27:50 -05:00
.gitignore
.pre-commit-config.yaml
btrix
CHANGES.md
LICENSE
NOTICE
pylintrc
README.md Update Readme to add Weblate information (#2109) 2024-10-31 16:36:12 -04:00
update-version.sh style change: remove spaces from python version docstring 2025-02-17 16:52:49 -08:00
version.txt Rework crawl page migration + MongoDB Query Optimizations (#2412) 2025-02-20 15:26:11 -08:00

Browsertrix

 

Browsertrix is a cloud-native, high-fidelity, browser-based crawling service designed to make web archiving easier and more accessible for everyone.

The service provides an API and UI for scheduling crawls, viewing results, and managing all aspects of the crawling process. It handles the orchestration and management around crawling, while the actual crawling is performed by Browsertrix Crawler containers, which are launched for each crawl.

See webrecorder.net/browsertrix for a feature overview and information about how to sign up for Webrecorder's hosted Browsertrix service.

Documentation

The full docs for using, deploying, and developing Browsertrix are available at docs.browsertrix.com.

Our docs are created with Material for MkDocs.

Deployment

The latest deployment documentation is available at docs.browsertrix.com/deploy.

The docs cover deploying Browsertrix in different environments using Kubernetes, from a single-node setup to scalable clusters in the cloud.

Early on, Browsertrix also supported Docker Compose and Podman-based deployment. This was deprecated because of the complexity of maintaining feature parity across different setups, and because various Kubernetes deployment options are now available and easy to deploy, even on a single machine.

Making deployment of Browsertrix as easy as possible remains a key goal, and we welcome suggestions for how we can further improve our Kubernetes deployment options.

If you are looking to just try running a single crawl, you may want to try Browsertrix Crawler first to test out the crawling capabilities.

Contributing

Though the system and backend API are fairly stable, we are working on many additional features. Please see the GitHub issues and this GitHub Project for our current project plan and tasks.

Guides for getting started with local development are available at docs.browsertrix.com/develop.

Translation

We use Weblate to manage translation contributions.

License

Browsertrix is made available under the AGPLv3 License.

Documentation is made available under the Creative Commons Attribution 4.0 International License.