## Changes
- Deletes and rewrites arrays in URL search params in workflow list when
editing array filters (i.e. tags & profiles)
- Removes a missed `console.log`
- bump to 1.17.3
cc @SuaYoo
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- don't use a persistent volume for /tmp, instead use a temporary
emptyDir (see the sketch after this list)
- use a volume to avoid permission issues with the default /tmp dir
- follow-up to #2623
- follow-up to #2627
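A minimal sketch of the emptyDir-backed /tmp mount (volume and container names here are illustrative, not the chart's actual names):

```yaml
# Sketch only: illustrative names, not the chart's actual template
volumes:
  - name: crawler-tmp
    emptyDir: {}            # temporary per-pod storage instead of a PVC
containers:
  - name: crawler
    volumeMounts:
      - name: crawler-tmp
        mountPath: /tmp     # avoids permission issues with the image's default /tmp
```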
- use qa_num_browser_windows to set the exact number of QA browsers,
falling back to qa_scale (see the values sketch after this list)
- set num_browser_windows and num_browsers_per_pod using crawler / QA
values depending on whether it is a QA crawl
- scale_from_browser_windows() accepts an optional browsers_per_pod when
dealing with a possible QA override
- store 'desiredScale' in CrawlStatus to avoid recomputing it when
resolving scale later
- ensure status.scale is always the actual scale observed
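For example, the QA override in the chart values might look like this (only the `qa_num_browser_windows` / `qa_scale` keys are from this change; the values are illustrative):

```yaml
# exact number of QA browser windows; takes precedence when set
qa_num_browser_windows: 2

# legacy fallback, only used if qa_num_browser_windows is unset
# qa_scale: 1
```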
Add docs about the path / virtual 'access_addressing_style' setting that is
available for each storage option.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Adds a `crawler_network_policy_additional_egress` setting to add egress
rules to the existing crawler network policy. Useful for when you want
to allow-list a single IP without replacing the whole network policy (see the example below).
- Adds docs about `crawler_network_policy_additional_egress` to the customization page.
- Resolves #2121
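A hedged example of such an allow-list entry in the chart values; the rule body follows standard Kubernetes NetworkPolicy egress syntax and the IP is illustrative:

```yaml
crawler_network_policy_additional_egress:
  - to:
      - ipBlock:
          cidr: 203.0.113.10/32   # allow a single external IP
    ports:
      - port: 443
        protocol: TCP
```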
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
It seems the latest redis image changed security settings so
root-mounted volumes no longer work.
This change:
- mount redis volumes as the redis user/group (999)
- needed to run with redis >= 8.0.2
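A sketch of the resulting pod-level securityContext for the redis pod (the 999 uid/gid is from this change; exact placement in the template may differ):

```yaml
securityContext:
  runAsUser: 999     # redis user in the official image
  runAsGroup: 999
  fsGroup: 999       # mounted volumes become group-writable by redis
```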
For single page crawls:
- Always force 1 browser to be used, ignoring browser windows/scale
setting
- Don't use custom PVC volumes in crawler / redis, just use emptyDir - there is
no chance of the crawler being interrupted and restarted on a different
machine for a single page.
Adds an 'is_single_page' check to CrawlConfig, checking either the page limit
or the scopeType / extra hops settings (see the example below).
Fixes #2655
Fixes #2425
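As a rough illustration, a workflow config like the following would count as single-page (field names are the usual workflow config fields, shown only as an example):

```yaml
config:
  seeds:
    - url: https://example.com/
  scopeType: page     # single-page scope, no extra hops
  extraHops: 0
  # alternatively, a page limit of 1 has the same effect
  # limit: 1
```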
## Changed
- Switch backend to primarily using number of browser windows rather
than scale multiplier (including migration to calculate `browserWindows`
from `scale` for existing workflows and crawls)
- Still support `scale` in addition to `browserWindows` in input models
for creating and updating workflows and re-adjusting live crawl scale
for backwards compatibility
- Adds a new `max_browser_windows` value to the Helm chart, calculating the
value from `max_crawl_scale` as a fallback for users who already have that value
set in local charts (see the values sketch after this list)
- Rework frontend to allow users to select multiples of
`crawler_browser_instances` or any value below
`crawler_browser_instances` for browser windows. For instance, with
`crawler_browser_instances=4` and `max_browser_windows=8`, the user
would be presented with the following options: 1, 2, 3, 4, 8
- Sets maximum width of screencast to image width returned by `message`
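The relevant chart values might then look like this (key names are from this change; the numbers mirror the example above, and the fallback formula is an assumption):

```yaml
# browser windows per crawler pod
crawler_browser_instances: 4

# new: maximum total browser windows per crawl
max_browser_windows: 8

# legacy fallback: if max_browser_windows is unset, it is derived from
# max_crawl_scale (presumably max_crawl_scale * crawler_browser_instances)
# max_crawl_scale: 2
```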
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- Update the docs on k3s deployment for installing `ingress-nginx`, fixes
#2619.
- Also fix the indentation on the code blocks so markdown carries on the list
numbering; at the moment the numbering confusingly resets after point 3.
- Update indentation on all code blocks so they show up as part of the list and
wrap long commands.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Some of the `securityContext` settings need to be set on the container, not
on the pod, including the read-only root filesystem, which was not previously enabled.
This change now enables the read-only root filesystem.
Also maps the crawler /tmp directory to use the same volume as crawls (as the
crawler currently uses the /tmp dir), since /tmp becomes read-only otherwise.
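A sketch of the container-level settings described above (volume names and the subPath mapping are illustrative; the change only states that /tmp reuses the crawls volume):

```yaml
containers:
  - name: crawler
    securityContext:
      readOnlyRootFilesystem: true   # must be set on the container, not the pod
    volumeMounts:
      - name: crawl-data             # same volume used for crawl output
        mountPath: /crawls
      - name: crawl-data
        mountPath: /tmp              # /tmp would otherwise be read-only
        subPath: tmp                 # one way to map /tmp onto the crawls volume
```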
- add 'pause' crawl state (fixes #2567)
- gracefully shut down crawler pods, and then redis pod when paused
- crawler uploads WACZ before shutting down (dependent on
webrecorder/browsertrix-crawler#824, supported in 1.6.1+)
- add 'paused_at' to the crawl spec to indicate when the crawl was paused (see
the spec excerpt after this list)
- support a max pause time limit, after which the crawl is automatically
stopped
- add a 'stopped_pause_expired' state for when the pause automatically expires and
the crawl is stopped
- /crawl/<id>/{pause,resume} APIs to toggle 'paused' on the crawl spec
- ui: add pause/resume button, paused state (partially addresses #2568)
- ui: add pausing/resuming derivative states when crawl is running and
pausing, or paused and not pausing (partially addresses #2569)
- Designed to work with crawler 1.6.1+, which supports pausing and uploading on pause
Work on #2566, fixes #2576
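As an illustration of the new spec field, a paused CrawlJob would carry something like the following (only `paused_at` is from this change; the rest of the spec is elided):

```yaml
# CrawlJob spec excerpt (other fields elided)
spec:
  paused_at: "2025-01-01T12:00:00Z"   # set via /crawl/<id>/pause, cleared by /crawl/<id>/resume
```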
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
Fixes #2515.
This PR introduces a significantly optimized logic for presigning URLs
for crawls and collections.
- For collections, the files needed from all crawls are looked up, and
then the 'presign_urls' table is merged in one pass, resulting in a
unified iterator containing files and presign URLs for those files.
- For crawls, the presign URLs are also looked up once, and the same
iterator is used for a single crawl with a passed-in list of CrawlFiles.
- URLs that are already signed are added to the return list.
- For any remaining URLs to be signed, a bulk presigning function is
added, which shares an HTTP connection and signs 8 files in parallel
(customizable via the helm chart, though this may not be needed). This function
is used to call the presigning API in parallel.
Use the V4 ('s3v4') signature version for all presigned URLs to support
Backblaze, fixes #2472
- add 'access_addressing_style' to be able to choose virtual/path
addressing for the access endpoint (defaults to 'virtual' as before) - see the storage example below
- fix minio presigning with v4 by using 'path' addressing style for
minio
- if path matches '/data/' for internal minio bucket, then always use
'path'
- also make minio access path '/data/' configurable
Also simplify running in any namespace with default settings:
- don't hardcode 'local-minio.default'
- in the crawlers namespace, add a 'local-minio' ExternalName service which
maps to the main namespace service.
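Two hedged snippets tying this together: a storage entry using `access_addressing_style`, and the ExternalName service in the crawlers namespace (`btrix` stands in for the main namespace and is a placeholder):

```yaml
# chart values: per-storage addressing style for presigned access URLs
storages:
  - name: default
    access_endpoint_url: "https://s3.example.com/bucket/"
    access_addressing_style: "path"   # "virtual" (default) or "path"; minio needs "path"
---
# crawlers namespace: alias for the main namespace's minio service
apiVersion: v1
kind: Service
metadata:
  name: local-minio
  namespace: crawlers
spec:
  type: ExternalName
  externalName: local-minio.btrix.svc.cluster.local   # "btrix" = main namespace (placeholder)
```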
Backend work for #2524
This PR adds a second dedicated endpoint similar to `/errors`, as a
combined log endpoint would give a false impression of being the
complete crawl logs (which is far from what we're serving in Browsertrix
at this point).
Eventually when we have support for streaming live crawl logs in
`crawls/<id>/logs` I'd ideally like to deprecate these two dedicated
endpoints in favor of using that, but for now this seems like the best
solution.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- add 'imagePullPolicy' field to each crawler channel declaration
- if unset, defaults to the setting in the existing
'crawler_image_pull_policy' field (see the example below)
fixes #2522
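For example (a sketch; the `id`/`image` layout of `crawler_channels` follows the existing chart, the tags are illustrative):

```yaml
crawler_image_pull_policy: "IfNotPresent"   # global default

crawler_channels:
  - id: default
    image: "docker.io/webrecorder/browsertrix-crawler:latest"
    imagePullPolicy: Always     # per-channel override
  - id: stable
    image: "docker.io/webrecorder/browsertrix-crawler:1.6.1"
    # no imagePullPolicy here: falls back to crawler_image_pull_policy
```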
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Removes chart values that are unused
- Also change `local-mongo.default` -> `local-mongo` and
`local-minio.default` -> `local-minio`, as some users have reported
issues with `.default`, and it will certainly break if not deploying
Browsertrix in the `default` namespace.
Fixes #2459
- Set `/data/` as primary storage `access_endpoint_url` in nightly test
chart
- Modify nightly test GH Actions workflow to spawn a separate job per
nightly test module using a dynamic matrix (sketched after this list)
- Set configuration not to fail other jobs if one job fails
- Modify failing tests:
- Add fixture to background job nightly test module so it can run alone
- Add retry loop to crawlconfig stats nightly test so it's less
dependent on timing
GitHub limits each workflow to 256 jobs, so this should continue to be
able to scale up for us without issue.
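The job-per-module setup is roughly of this shape (a sketch of the GitHub Actions pattern only; paths and setup steps are assumptions, not the workflow verbatim):

```yaml
jobs:
  collect:
    runs-on: ubuntu-latest
    outputs:
      modules: ${{ steps.list.outputs.modules }}
    steps:
      - uses: actions/checkout@v4
      - id: list   # build a JSON list of nightly test modules for the matrix
        run: echo "modules=$(ls backend/test_nightly/test_*.py | jq -R -s -c 'split("\n")[:-1]')" >> "$GITHUB_OUTPUT"

  nightly:
    needs: collect
    strategy:
      fail-fast: false           # one failing module doesn't cancel the others
      matrix:
        module: ${{ fromJSON(needs.collect.outputs.modules) }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest ${{ matrix.module }}   # cluster setup etc. elided
```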
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- should avoid gunicorn worker timeouts for long-running migrations,
also fixes #2439
- add main_migrations as an entrypoint to just run db migrations, using the
existing init_ops() call
- first run a 'migrations' container with the same resources as 'app' and 'op'
- additional typing for initializing the db
- clean up unused code related to running only once and waiting for the db to be ready
- fixes #2447
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if the succeeded
count mismatches parallelism
- QA sorting: default to 'crawl order' to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
Fixes #2406
Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.
Also optimizes MongoDB queries for better performance.
Migration Improvements:
- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats
Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when the dialog is opened.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Exclusions are already updated dynamically if the crawler pod is running, but
when a crawler pod is restarted, this ensures new exclusions are also picked up:
- mount the configmap at a separate path, avoiding subPath, to allow dynamic
updates of the mounted volume (see the mount sketch after this list)
- adds a lastConfigUpdate timestamp to CrawlJob - if lastConfigUpdate in the
spec differs from the current one, the configmap is recreated by the operator
- operator: also update the image from the channel to avoid any issues with
updating the crawler in a channel
- only updates for exclusion add/remove so far, can later be expanded to
other crawler settings (see: #2355 for broader running crawl config
updates)
- fixes #2408
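Illustratively, the mount goes from a subPath mount to mounting the whole ConfigMap at its own path (names are placeholders):

```yaml
containers:
  - name: crawler
    volumeMounts:
      - name: crawl-config
        mountPath: /tmp/crawl-config     # whole-volume mount: kubelet refreshes contents in place
        # previously something like:
        #   mountPath: /tmp/crawl-config.json
        #   subPath: crawl-config.json   # subPath mounts never receive ConfigMap updates
volumes:
  - name: crawl-config
    configMap:
      name: crawl-config-<crawl-id>      # placeholder name
```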
Adds `filename` to pages, pointing to the WACZ file those pages come
from, as well as depth, favIconUrl, and isSeed. Also adds an idempotent
migration to backfill this information for existing pages, and increases
the backend container's startupProbe time to 24 hours to give it sufficient
time to finish the migration.
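For the startup probe, 24 hours works out to, e.g., a 30-second period with 2880 allowed failures (a sketch; the endpoint, port, and exact numbers are assumptions):

```yaml
startupProbe:
  httpGet:
    path: /healthz        # assumed health endpoint
    port: 8000            # assumed backend port
  periodSeconds: 30
  failureThreshold: 2880  # 30s * 2880 = 86400s = 24 hours for the migration to finish
```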
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#2259
This PR brings backend and frontend support for the new autoclick
behavior in Browsertrix, introduced in Browsertrix Crawler 1.5.0+.
On the backend, we introduce `min_autoclick_crawler_image` to
`values.yaml`, with a default value of
`"docker.io/webrecorder/browsertrix-crawler:1.5.0"`. If this is set and
the crawler version for a new crawl is less than this value, the
autoclick behavior is removed from the behaviors list in the configmap
created for the crawl.
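In the chart values this looks like the following (the default is quoted from this change):

```yaml
# Minimum crawler image version that supports the autoclick behavior.
# If a crawl's crawler version is older than this, "autoclick" is dropped
# from the behaviors list in the configmap created for the crawl.
min_autoclick_crawler_image: "docker.io/webrecorder/browsertrix-crawler:1.5.0"
```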
The one caveat for this is that a crawler image tag like "latest" will
always be parsed as greater than `min_autoclick_crawler_image`, so there
is the potential for the crawler to run into issues if using a
non-numeric image tag with an older version of the crawler. For
production we use hardcoded specific versions of the crawler except for
the dev channel, which from here on out will include autoclick
support, so I think this should be okay (and is also true of the
existing implementation for checking `min_qa_crawler_image`).
On the frontend, I've added a checkbox (unchecked by default) in the
"Limits" section just below the current checkbox for autoscroll. We
might want to move these to a different section eventually - I'm not
sure Limits is the right place for them - but I wanted to be consistent
with things as they are.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>