docs: add docs for path / virtual addressing (#2669)

Add docs about path / virtual 'access_addressing_style' that is
available for each storage option.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Ilya Kreymer 2025-06-12 13:08:27 -04:00 committed by GitHub
parent 8516d70486
commit 001277ac9d
GPG Key ID: B5690EEEBB952194
2 changed files with 38 additions and 5 deletions

@@ -425,6 +425,14 @@ storages:
 endpoint_url: "http://local-minio:9000/"
 access_endpoint_url: *minio_access_path
+#access_addressing_style: 'path' or 'virtual'
+# determine if bucket should be accessed as:
+# - virtual - https://<bucket>.<host>/<key>
+# - path - https://<host>/<bucket>/<key>
+#
+# if not specified, defaults to 'path' for local minio or
+# 'virtual' for all other storages
 # optional: duration in minutes for WACZ download links to be valid
 # used by webhooks and replay

@@ -92,12 +92,33 @@ Since the local Minio service is not used, `minio_local: false` can be set to sa
 ### Custom Access Endpoint URL
-It may be useful to provide a custom access endpoint for accessing WACZ files and other data. if the `access_endpoint_url` is provided,
-it should be in 'virtual host' form (the bucket is not added to the path, but is assumed to be the in the host).
-The host portion of the URL is then replaced with the `access_endpoint_url`. For example, given `endpoint_url: https://s3provider.example.com/bucket/path/` and `access_endpoint_url: https://my-custom-domain.example.com/path/`, a URL to a WACZ files in 'virtual host' form may be `https://bucket.s3provider.example.com/path/to/files/crawl.wacz?signature...`.
-The `https://bucket.s3provider.example.com/path/` is then replaced with the `https://my-custom-domain.example.com/path/`, and the final URL becomes `https://my-custom-domain.example.com/path/to/files/crawl.wacz?signature...`.
+It may be useful to provide a custom access endpoint for accessing WACZ files and other data. If the `access_endpoint_url` is provided, it can be in either the 'virtual host' or 'path' form, while the `endpoint_url` should always be in path-prefix form.
+Here are two example of the addressing modes:
+#### Virtual Host vs Path Addressing for Access Endpoints
+Virtual host addressing:
+```
+endpoint_url: https://s3provider.example.com/bucket/path/
+access_endpoint_url: https://my-custom-domain.example.com/path/
+access_addressing_style: virtual
+# Files loaded from: https://my-custom-domain.example.com/path/to/files/crawl.wacz?signature...
+```
+Path addressing:
+```
+...
+endpoint_url: https://s3provider.example.com/bucket/path/
+access_endpoint_url: https://my-custom-domain.example.com/bucket/path/
+access_addressing_style: path
+# Files loaded from: https://my-custom-domain.example.com/bucket/path/to/files/crawl.wacz?signature...
+```
+Note that when using the local Minio for storage, path-style addressing is used automatically as the
+data is accessed via `/data/path/to/files`. Otherwise, virtual-style addressing is assumed as the default.
 ### Storage Replicas
@@ -117,6 +138,8 @@ storages:
 endpoint_url: "http://local-minio.default:9000/"
 is_default_primary: true
+# default for local minio is path
+access_addressing_style: path
 - name: "replica-0"
 type: "s3"
@@ -126,6 +149,7 @@ storages:
 endpoint_url: "http://local-minio.default:9000/"
 is_default_replica: true
+access_addressing_style: path
 - name: "replica-1"
 type: "s3"
@@ -134,7 +158,8 @@ storages:
 bucket_name: "replica-1"
 endpoint_url: "https://s3provider.example.com/bucket/path/"
-access_endpoint_url: "https://my-custom-domain.example.com/path/"
+access_endpoint_url: "https://bucket.my-custom-domain.example.com/path/"
+access_addressing_style: virtual
 ```
 When replica locations are set, the default behavior when a crawl, upload, or browser profile is deleted is that the replica files are deleted at the same time as the file in primary storage. To delay deletion of replicas, set `replica_deletion_delay_days` in the Helm chart to the number of days by which to delay replica file deletion. This feature gives Browsertrix administrators time in the event of files being deleted accidentally or maliciously to recover copies from configured replica locations.
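
The two addressing styles documented in this commit follow the general patterns `https://<bucket>.<host>/<key>` (virtual) and `https://<host>/<bucket>/<key>` (path), with `path` as the default for local Minio and `virtual` otherwise. The following is a minimal Python sketch of those two rules only. It is not Browsertrix's implementation, the function names `default_addressing_style` and `object_url` are hypothetical, and it does not model how a custom `access_endpoint_url` or URL signing is applied.

```python
from urllib.parse import urlsplit, urlunsplit


def default_addressing_style(is_local_minio: bool) -> str:
    """Hypothetical helper mirroring the documented default:
    'path' for local Minio, 'virtual' for all other storages."""
    return "path" if is_local_minio else "virtual"


def object_url(host_url: str, bucket: str, key: str, style: str) -> str:
    """Hypothetical helper: build an object URL for bucket/key
    using the given addressing style ('virtual' or 'path')."""
    parts = urlsplit(host_url)
    key = key.lstrip("/")
    if style == "virtual":
        # virtual: https://<bucket>.<host>/<key>
        netloc = f"{bucket}.{parts.netloc}"
        return urlunsplit((parts.scheme, netloc, f"/{key}", "", ""))
    if style == "path":
        # path: https://<host>/<bucket>/<key>
        return urlunsplit((parts.scheme, parts.netloc, f"/{bucket}/{key}", "", ""))
    raise ValueError(f"unknown addressing style: {style!r}")
```

For example, calling `object_url("https://s3provider.example.com", "bucket", "path/to/files/crawl.wacz", "virtual")` places the bucket in the host name, while the same call with `"path"` places it as the first path segment.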