Browsertrix


Browsertrix is a cloud-native, high-fidelity, browser-based crawling service designed to make web archiving easier and more accessible for everyone.

The service provides an API and UI for scheduling crawls, viewing results, and managing all aspects of the crawling process. The system handles the orchestration and management around crawling, while the actual crawling is performed by Browsertrix Crawler containers, which are launched for each crawl.
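As an illustrative sketch of API-driven crawl scheduling (the endpoint path, field names, and org ID placeholder here are assumptions for illustration only; consult the API reference at docs.browsertrix.com for the actual interface), creating and immediately running a crawl workflow might look like:

```shell
# Hypothetical sketch: create a crawl workflow via the backend API.
# The endpoint path, JSON fields, and <org-id> are illustrative assumptions;
# see docs.browsertrix.com for the real API reference.
curl -X POST "https://app.example.com/api/orgs/<org-id>/crawlconfigs/" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Example Crawl",
       "config": {"seeds": [{"url": "https://example.com/"}]},
       "runNow": true}'
```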

See webrecorder.net/browsertrix for a feature overview and information about how to sign up for Webrecorder's hosted Browsertrix service.

Documentation

The full docs for using, deploying, and developing Browsertrix are available at docs.browsertrix.com.

Our docs are created with Material for MkDocs.

Deployment

The latest deployment documentation is available at docs.browsertrix.com/deploy.

The docs cover deploying Browsertrix in different environments using Kubernetes, from a single-node setup to scalable clusters in the cloud.
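As a rough sketch of a single-node install (assuming a local Kubernetes cluster such as one from Docker Desktop or k3s, with Helm installed; the exact chart values and steps are in the deployment docs), deployment from this repository's Helm chart might look like:

```shell
# Sketch of a local single-node deployment using the chart/ directory
# in this repository. Values and release name are illustrative; follow
# docs.browsertrix.com/deploy for the exact, current steps.
helm upgrade --install btrix ./chart/ -f ./chart/values.yaml

# Watch the pods come up; the UI is available once they are Running.
kubectl get pods
```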

Early on, Browsertrix also supported Docker Compose and podman-based deployment. These were deprecated due to the complexity of maintaining feature parity across different setups, and because Kubernetes deployment options are now widely available and easy to run, even on a single machine.

Making deployment of Browsertrix as easy as possible remains a key goal, and we welcome suggestions for how we can further improve our Kubernetes deployment options.

If you are looking to just try running a single crawl, you may want to try Browsertrix Crawler first to test out the crawling capabilities.
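For a one-off crawl without deploying the full service, Browsertrix Crawler can be run directly in Docker. A minimal example (flags per the crawler's documentation; the collection name is arbitrary):

```shell
# Run a single crawl with Browsertrix Crawler, writing output to ./crawls/.
# --generateWACZ packages the result as a portable WACZ archive.
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler \
  crawl --url https://example.com/ --generateWACZ --collection example
```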

Contributing

Though the system and backend API are fairly stable, we are working on many additional features. Please see the GitHub issues and this GitHub Project for our current project plan and tasks.

Guides for getting started with local development are available at docs.browsertrix.com/develop.

Translation

We use Weblate to manage translation contributions.


License

Browsertrix is made available under the AGPLv3 License.

Documentation is made available under the Creative Commons Attribution 4.0 International License.