Commit Graph

18 Commits

Author SHA1 Message Date
Ilya Kreymer
841c45fe59
volumes: use emptyDir for tmp dir volume (#2713)
- don't use a persistent volume for /tmp, instead use a temporary
emptyDir
- use volume to avoid permission issues with default /tmp dir
- follow-up to #2623
2025-07-08 13:10:02 -07:00
Ilya Kreymer
8ea16393c5
Optimize single-page crawl workflows (#2656)
For single page crawls:
- Always force 1 browser to be used, ignoring browser windows/scale
setting
- Don't use custom PVC volumes in crawler / redis, just use emptyDir -
no chance of crawler being interrupted and restarted on different
machine for a single page.

Adds a 'is_single_page' check to CrawlConfig, checking for either limit
or scopeType / no extra hops.

Fixes #2655
2025-06-10 12:13:57 -07:00
Ilya Kreymer
5b0f851857
Fix securityContext for pod (#2623)
Some of the `securityContext` settings need to be on the container, not
on the pod, including the read-only file system, which was not previously enabled.
This now enables the read-only file system.
Also map the crawler /tmp directory to use the same volume as crawls (as
crawler currently uses /tmp dir) as /tmp becomes read-only otherwise.
2025-05-27 10:59:50 -07:00
Ilya Kreymer
cb50c7c2c2
Pause / Resume Crawls Initial Implmentation. (#2572)
- add 'pause' crawl state (fixes #2567)
- gracefully shut down crawler pods, and then redis pod when paused
- crawler uploads WACZ before shutting down (dependent on
webrecorder/browsertrix-crawler#824, supported in 1.6.1+)
- add 'paused_at' on crawl spec to indicate when crawl is paused
- support max pause time limit, after which crawl becomes automatically
stopped.
- add 'stopped_pause_expired' when pause automatically expires and crawl
is stopped
- /crawl/<id>/{pause,resume} apis to toggle 'paused' on crawl spec
- ui: add pause/resume button, paused state (partially addresses #2568)
- ui: add pausing/resuming derivative states when crawl is running and
pausing, or paused and not pausing (partially addresses #2569)
- Designed to work with crawler 1.6.1+ which support pausing + uploading on pause

Work on #2566, Fixes #2576 

---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
2025-05-21 14:05:16 -07:00
Ilya Kreymer
88a9f3baf7
ensure running crawl configmap is updated when exclusions are added/removed (#2409)
exclusions are already updated dynamically if crawler pod is running,
but when crawler pod is restarted, this ensures new exclusions are also
picked up:
- mount configmap in separate path, avoiding subPath, to allow dynamic
updates of mounted volume
- adds a lastConfigUpdate timestamp to CrawlJob - if lastConfigUpdate in
spec is different from current, the configmap is recreated by operator
- operator: also update image from channel avoid any issues with
updating crawler in channel
- only updates for exclusion add/remove so far, can later be expanded to
other crawler settings (see: #2355 for broader running crawl config
updates)
- fixes #2408
2025-02-19 11:42:19 -08:00
Vinzenz Sinapius
bb6e703f6a
Configure browsertrix proxies (#1847)
Resolves #1354

Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires browsertrix crawler 1.3+)

Config:
- proxies defined in btrix-proxies subchart
- can be configured via btrix-proxies key or separate proxies.yaml file via separate subchart
- proxies list refreshed automatically if crawler_proxies.json changes if subchart is deployed
- support for ssh and socks5 proxies
- proxy keys added to secrets in subchart
- support for default proxy to be always used if no other proxy configured, prevent starting cluster if default proxy not available
- prevent starting manual crawl if previously configured proxy is no longer available, return error
- force 'btrix' username and group name on browsertrix-crawler non-root user to support ssh

Operator:
- support crawling through proxies, pass proxyId in CrawlJob
- support running profile browsers which designated proxy, pass proxyId to ProfileJob
- prevent starting scheduled crawl if previously configured proxy is no longer available

API / Access:
- /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only)
- /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org
- /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only)
- superadmin can configure which orgs can use which proxies, stored on the org
- superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org.

UI:
- Superadmin has 'Edit Proxies' dialog to configure for each org if it has: dedicated proxies, has access to shared proxies.
- User can select a proxy in Crawl Workflow browser settings
- Users can choose to launch a browser profile with a particular proxy
- Display which proxy is used to create profile in profile selector
- Users can choose with default proxy to use for new workflows in Crawling Defaults

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-02 18:35:45 -07:00
Ilya Kreymer
95969ec747
Attempt to auto-adjust storage if usage is running out while crawl is running (#2023)
Attempt to auto-adjust PVC storage if:
- used storage (as reported in redis by the crawler) * 2.5 >
total_storage
- will cause PVC to resize, if possible (not supported by all drivers)
- uses multiples of 1Gi, rounding up to next GB
- AVAIL_STORAGE_RATIO hard-coded to 2.5 for now, to account for 2x space
for WACZ plus change for fast updating crawls

Some caveats:
- only works if the storageClass used for PVCs has
`allowVolumeExpansion: true`, if not, it will have no effect
- designed as a last resort option: the `crawl_storage` in values and
`--sizeLimit` and `--diskUtilization` should generally result in this
not being needed.
- can be useful in cases where a crawl is rapidly capturing a lot of
content in one page, and there's no time to interrupt / restart, since
the other limits apply only at page end.
- May want to have crawler update the disk usage more frequently, not
just at page end to make this more effective.
2024-08-26 14:19:20 -07:00
Vinzenz Sinapius
01d8bdc5e6
Crawler network policy (#1727)
Limit egress traffic from crawler/profilebrowser pods to the internet
and limited internal services like dns, redis, frontend, auth-signer on certain ports

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-03 10:55:03 -07:00
Ilya Kreymer
3cd52342a7
Remove Crawl Workflow Configmaps (#1894)
Fixes #1893 

- Removes crawl workflow-scoped configmaps, and replaces with operator-controlled
per-crawl configmaps that only contain the json config passed to Browsertrix
Crawler (as a volume).
- Other configmap settings replaced are replaced the custom CrawlJob options
(mostly already were, just added profile_filename and storage_filename)
- Cron jobs also updated to create CrawlJob without relying on configmaps,
querying the db for additional settings.
- The `userid` associated with cron jobs is set to the user that last modified
 the schedule of the crawl, rather than whomever last modified the workflow
- Various functions that deal with updating configmaps have been removed,
including in migrations.
- New migration 0029 added to remove all crawl workflow configmaps
2024-06-28 15:25:23 -07:00
Ilya Kreymer
fa6627ce70
ensure QA configmap is updated for long running QA runs: (#1865)
- add a 'expire_at_duration_seconds' which is 75% of actual presign
duration time, or <25% remaining until presigned URL actually expires to
ensure presigned URLs are updated early than when they actually expire
- set cached expireAt time to the renew at time for more frequent
updates
- update QA configmap in place with updated presigned URLs when expireAt
time is reached
- mount qa config volume under /tmp/qa/ without subPath to get automatic
updates, which crawler will handle
- tests: fix qa test typo (from main)
- fixes #1864
2024-06-12 10:51:35 -07:00
Ilya Kreymer
61239a40ed
include workflow config in QA runs + different browser instances for QA (#1829)
Currently, the workflow crawl settings were not being included at all in
QA runs.
This mounts the crawl workflow config, as well as QA configmap, into QA
run crawls, allowing for page limits from crawl workflow to be applied
to QA runs.

It also allows a different number of browser instances to be used for QA
runs, as QA runs might work better with less browsers, (eg. 2 instead of
4). This can be set with `qa_browser_instances` in helm chart.

Default qa browser workers to 1 if unset (for now, for best results)

Fixes #1828
2024-05-29 13:32:25 -07:00
Vinzenz Sinapius
a8336925b6
Run crawler and profilebrowser with non-root user (#1625)
With these changes, crawler and profilebrowser jobs run as a
non-root user.
2024-04-17 12:03:33 -07:00
Ilya Kreymer
3438133fcb
Crawler pod memory padding + auto scaling (#1631)
- set memory limit to 1.2x memory request to provide extra padding and
avoid OOM
- attempt to resize crawler pods by 1.2x when exceeding 90% of available
memory
- do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of
requested memory, resulting in faster graceful restart, but avoiding a
system-instant OOM Kill
- Fixes #1632

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-28 16:39:00 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
Ilya Kreymer
8ae032ff88 More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. (#1537)
Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug
[crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler
1.0.0-beta.4 or higher
If crawl name is provided, uses crawl name, other hostname of first
seed. The name is 'sluggified', using lowercase alphanum characters
separated by dashes.

Ex: in an organization called `Default Org`, a crawl of
`https://specs.webrecorder.net/` and no name will have WARCs named:
`default-org-specs-webrecorder-net-....warc.gz`
If the crawl is given the name `SPECS`, the WARCs will be named
`default-org-specs-manual-....warc.gz`

Fixes #412 in a default way.
2024-02-22 23:54:23 -08:00
Ilya Kreymer
a6a78c9ef2
node affinity: set to required instead of preferred to keep crawlers on dedicated infrastructure (#1366)
Previously, the crawler pods use preferred node affinity, instead of
required node affinity. This results in crawler nodes running on the
main node pool. Instead, we want to ensure crawler nodes are running on
dedicated node pool (if configured).
- Converts 'preferred node affinity' to 'required node affinity' for
the node pool, while keeping preferred pod affinity for keeping all
crawler / redis pods together.
- For profiles, updates to same node affinity, and also adds
resource constraint to match a single crawler for profile browser,
which did not have resource constraints.
2023-11-13 10:02:05 -08:00
Tessa Walsh
82a5d1e4e4
Regression fix: add profiles/ prefix to profile filenames (#1365)
Fixes #1364 

Regression fix for issue introduced in storage refactoring (see issue
for more details).

Changes:
1. Add `profiles/` prefix to profile filename passed in to crawler for
profile creation and written into db
2. Remove hardcoded `profiles/` prefix from crawler YAML
3. Add migration to add `profiles/` prefix to profile filenames that
don't already have it, including updating PROFILE_FILENAME in ConfigMaps

This way between the related storage document and the profile filename,
we have the full path to the object in the database rather than relying
on additional prefixes hardcoded into k8s job YAML files.

Note that this as a follow-up it'll be necessary to manually move any
profiles that had been written into the `<oid>` "directory" in object
storage rather than `<oid>/profiles` to the latter. This should only
affect profiles created very recently in a 1.8.0-beta release.
2023-11-09 17:44:16 -08:00
Ilya Kreymer
5530ca92e1
Move backend app templates to be installed from configmap volume (#1331)
Instead of adding the app templates launched from the backend via
`backend/btrixcloud/templates`, add them to a configmap and mount the
configmap in the same location.

This allows these templates to be updated, like other values in
charts/... without having to rebuild any of the images, speeding up dev
and maintenance time.

Changes include:
- move backend/btrixcloud/templates -> chart/app-templates/
- add app-templates/*.yaml to app-templates configmap
- mount app-templates configmap to /app/btrixcloud/templates/ in api and op containers
2023-11-06 09:37:48 -08:00