Vinzenz Sinapius bb6e703f6a

Resolves #1354

Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires browsertrix crawler 1.3+)

Config:
- proxies defined in btrix-proxies subchart
- can be configured via btrix-proxies key or separate proxies.yaml file via separate subchart
- if the subchart is deployed, proxies list is refreshed automatically when crawler_proxies.json changes
- support for ssh and socks5 proxies
- proxy keys added to secrets in subchart
- support for default proxy to be always used if no other proxy configured, prevent starting cluster if default proxy not available
- prevent starting manual crawl if previously configured proxy is no longer available, return error
- force 'btrix' username and group name on browsertrix-crawler non-root user to support ssh

Operator:
- support crawling through proxies, pass proxyId in CrawlJob
- support running profile browsers with designated proxy, pass proxyId to ProfileJob
- prevent starting scheduled crawl if previously configured proxy is no longer available

API / Access:
- /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only)
- /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org
- /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only)
- superadmin can configure which orgs can use which proxies, stored on the org
- superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org.

UI:
- Superadmin has an 'Edit Proxies' dialog to configure, for each org, whether it has dedicated proxies and whether it has access to shared proxies.
- User can select a proxy in Crawl Workflow browser settings
- Users can choose to launch a browser profile with a particular proxy
- Display which proxy is used to create profile in profile selector
- Users can choose which default proxy to use for new workflows in Crawling Defaults

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

2024-10-02 18:35:45 -07:00

1.3 KiB


Self-Hosting

!!! info "Already signed up for Browsertrix?"

    This guide is for developers and users who are self-hosting Browsertrix. If you've registered through [browsertrix.com](https://browsertrix.com/), you may be looking for the [user guide](../user-guide/index.md).

Browsertrix is designed to be a cloud-native application running in Kubernetes.

However, it is perfectly reasonable to deploy Browsertrix locally using one of the many available local Kubernetes options.

The main requirements for Browsertrix are:

- A Kubernetes Cluster
- Helm 3 (package manager for Kubernetes)
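
Before following a deployment guide, it can help to confirm that the client tools are installed. The following is a minimal sketch, not part of Browsertrix: it assumes `kubectl` is the Kubernetes client in use and that `helm version --short` prints a string such as `v3.14.0+g3fc9f4b`. The function names are hypothetical, and the script does not verify that a cluster is actually reachable.

```python
import re
import shutil
import subprocess

# Assumption: kubectl is the Kubernetes client used to reach the cluster.
REQUIRED_TOOLS = ("kubectl", "helm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def helm_major_version(version_text):
    """Extract the major version from text such as 'v3.14.0+g3fc9f4b'."""
    match = re.search(r"v(\d+)\.\d+\.\d+", version_text)
    return int(match.group(1)) if match else None


def check_prerequisites():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"{tool} not found on PATH" for tool in missing_tools()]
    if shutil.which("helm") is not None:
        result = subprocess.run(
            ["helm", "version", "--short"],
            capture_output=True,
            text=True,
            check=False,
        )
        if helm_major_version(result.stdout) != 3:
            problems.append("Helm 3 is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

An empty result only means the binaries are present and Helm reports major version 3; the cluster itself still needs to be set up as described in the deployment guides.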

We have prepared a Local Deployment Guide which covers several options for testing Browsertrix locally on a single machine, as well as a Production (Self-Hosted and Cloud) Deployment guide to help with setting up Browsertrix in different production scenarios. Information about configuring storage, crawler channels, and other details in local or production deployments is in the Customizing Browsertrix Deployment Guide. Information about configuring proxies to use with Browsertrix can be found in the Configuring Proxies guide.

Details on managing org export and import for existing clusters can be found in the Org Import & Export guide.
