Fixes#2673
Changes in this PR:
- Adds a new `file_uploads.py` module and corresponding `/files` API
prefix with methods/endpoints for uploading, GETing, and deleting seed
files (can be extended to other types of files moving forward)
- Seed files are supported via `CrawlConfig.config.seedFileId` on POST
and PATCH endpoints. This seedFileId is replaced by a presigned url when
passed to the crawler by the operator
- Seed files are read when first uploaded to calculate `firstSeed` and
`seedCount` and store them in the database, and this is copied into the
workflow and crawl documents when they are created.
- Logic is added to store `firstSeed` and `seedCount` for other
workflows as well, and a migration added to backfill data, to maintain
consistency and fix some of the pymongo aggregations that previously
assumed all workflows would have at least one `Seed` object in
`CrawlConfig.seeds`
- Seed file and thumbnail storage stats are added to org stats
- Seed file and thumbnail uploads first check that the org's storage
quota has not been exceeded and return a 400 if so
- A cron background job (run weekly each Sunday at midnight by default,
but configurable) is added to look for seed files at least x minutes old
(1440 minutes, or 1 day, by default, but configurable) that are not in
use in any workflows, and to delete them when they are found. The
backend pods will ensure this k8s batch job exists when starting up and
create it if it does not already exist. A database entry for each run of
the job is created in the operator on job completion so that it'll
appear in the `/jobs` API endpoints, but retrying of this type of
regularly scheduled background job is not supported as we don't want to
accidentally create multiple competing scheduled jobs.
- Adds a `min_seed_file_crawler_image` value to the Helm chart that is
checked before creating a crawl from a workflow if set. If a workflow
cannot be run, return the detail of the exception in
`CrawlConfigAddedResponse.errorDetail` so that we can display the reason
in the frontend
- Add SeedFile model from base UserFile (former ImageFIle), ensure all APIs
returning uploaded files return an absolute pre-signed URL (either with external origin or internal service origin)
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Fix race condition related to browser commit time
- The profile commit request waits for browser to actual finish, and
profile saved. This can cause request to time out, resulting in a retry,
in which the browser has already been closed.
- With these changes, the commit is now 'idempotent' and returns a
waiting_for_browser until the profile is actually committed.
- On frontend, keep pinging commit endpoint with a timeout while 'waiting_for_browser' is returned, actual committed when endpoint returns profile id.
---------
Co-authored-by: sua yoo <sua@suayoo.com>
## Changes
- Deletes and rewrites arrays in URL search params in workflow list when
editing array filters (i.e. tags & profiles)
- Removes a missed `console.log`
- bump to 1.17.3
cc @SuaYoo
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- don't use a persistent volume for /tmp, instead use a temporary
emptyDir
- use volume to avoid permission issues with default /tmp dir
- follow-up to #2623
- follow up to #2627
- use qa_num_browser_windows to set exact number of QA browsers,
fallback to qa_scale
- set num_browser_windows and num_browsers_per_pod using crawler / qa
values depending if QA crawl
- scale_from_browser_windows() accepts optional browsers_per_pod if
dealing with possible QA override
- store 'desiredScale' in CrawlStatus to avoid recomputing for later
scale resolving
- ensure status.scale is always the actual scale observed
Add docs about path / virtual 'access_addressing_style' that is
available for each storage option.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Adds `crawler_network_policy_additional_egress` setting, to add egress
rules to the existing crawler network policy. Useful for when you want
to allow-list a single IPs without replacing the whole network policy.
- Adds docs about `crawler_network_policy_additional_egress` to the customization page.
- Resolves#2121
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
It seems the latest redis image changed security settings so
root-mounted volumes no longer work.
This change:
- mount redis volumes as redis user/group 999
- needed to run with redis >=8.0.2
For single page crawls:
- Always force 1 browser to be used, ignoring browser windows/scale
setting
- Don't use custom PVC volumes in crawler / redis, just use emptyDir -
no chance of crawler being interrupted and restarted on different
machine for a single page.
Adds a 'is_single_page' check to CrawlConfig, checking for either limit
or scopeType / no extra hops.
Fixes#2655
Fixes#2425
## Changed
- Switch backend to primarily using number of browser windows rather
than scale multiplier (including migration to calculate `browserWindows`
from `scale` for existing workflows and crawls)
- Still support `scale` in addition to `browserWindows` in input models
for creating and updating workflows and re-adjusting live crawl scale
for backwards compatibility
- Adds new `max_browser_windows` value to Helm chart, but calculates the
value from `max_crawl_scale` as fallback for users with that value
already set in local charts
- Rework frontend to allow users to select multiples of
`crawler_browser_instances` or any value below
`crawler_browser_instances` for browser windows. For instance, with
`crawler_browser_instances=4` and `max_browser_windows=8`, the user
would be presented with the following options: 1, 2, 3, 4, 8
- Sets maximum width of screencast to image width returned by `message`
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- Update the docs on k3s deployment for installing `ingress-nginx`, fixes
#2619.
- Also fix the indentation on the code blocks so markdown carries on list
numbering. At the moment the numbering confusingly resets after point 3.
- Update indentation on all code blocks so they show up as part of list +
wrap long commands.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Some of the `securityContext` settings need to be on the container, not
on the pod, including the read-only file system, which was not previously enabled.
This now enables the read-only file system.
Also map the crawler /tmp directory to use the same volume as crawls (as
crawler currently uses /tmp dir) as /tmp becomes read-only otherwise.
- add 'pause' crawl state (fixes#2567)
- gracefully shut down crawler pods, and then redis pod when paused
- crawler uploads WACZ before shutting down (dependent on
webrecorder/browsertrix-crawler#824, supported in 1.6.1+)
- add 'paused_at' on crawl spec to indicate when crawl is paused
- support max pause time limit, after which crawl becomes automatically
stopped.
- add 'stopped_pause_expired' when pause automatically expires and crawl
is stopped
- /crawl/<id>/{pause,resume} apis to toggle 'paused' on crawl spec
- ui: add pause/resume button, paused state (partially addresses #2568)
- ui: add pausing/resuming derivative states when crawl is running and
pausing, or paused and not pausing (partially addresses #2569)
- Designed to work with crawler 1.6.1+ which support pausing + uploading on pause
Work on #2566, Fixes#2576
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
Fixes#2515.
This PR introduces a significantly optimized logic for presigning URLs
for crawls and collections.
- For collections, the files needed from all crawls are looked up, and
then the 'presign_urls' table is merged in one pass, resulting in a
unified iterator containing files and presign urls for those files.
- For crawls, the presign URLs are also looked up once, and the same
iterator is used for a single crawl with passed in list of CrawlFiles
- URLs that are already signed are added to the return list.
- For any remaining URLs to be signed, a bulk presigning function is
added, which shares an HTTP connection and signing 8 files in parallels
(customizable via helm chart, though may not be needed). This function
is used to call the presigning API in parallel.
Use V4 ('s3v4') signature version for for all presigning URLs to support
backblaze, fixes#2472
- add 'access_addressing_style' to be able to choose virtual/path
addressing for access endpoint (default to 'virtual' as before)
- fix minio presigning with v4 by using 'path' addressing style for
minio
- if path matches '/data/' for internal minio bucket, then always use
'path'
- also make minio access path '/data/' configurable
also simplify running in any namespace with default settings:
- don't hardcode 'local-minio.default'
- in crawlers namespace, add a 'local-minio' externalName service which
maps to the main namespace service.
Backend work for #2524
This PR adds a second dedicated endpoint similar to `/errors`, as a
combined log endpoint would give a false impression of being the
complete crawl logs (which is far from what we're serving in Browsertrix
at this point).
Eventually when we have support for streaming live crawl logs in
`crawls/<id>/logs` I'd ideally like to deprecate these two dedicated
endpoints in favor of using that, but for now this seems like the best
solution.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- add 'imagePullPolicy' field to each crawler channel declaration
- if unset, defaults to the setting in the existing
'crawler_image_pull_policy' field.
fixes#2522
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Removes chart values that are unused
- Also change `local-mongo.default` -> `local-mongo`,
`local-minio.default` -> `local-minio` as some users have reported
issues with `.default` and it will certainly break if not deploying
Browsertrix in the `default `namespace.
Fixes#2459
- Set `/data/` as primary storage `access_endpoint_url` in nightly test
chart
- Modify nightly test GH Actions workflow to spawn a separate job per
nightly test module using dynamic matrix
- Set configuration not to fail other jobs if one job fails
- Modify failing tests:
- Add fixture to background job nightly test module so it can run alone
- Add retry loop to crawlconfig stats nightly test so it's less
dependent on timing
GitHub limits each workflow to 256 jobs, so this should continue to be
able to scale up for us without issue.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- should avoid gunicorn worker timeouts for long running migrations,
also fixes#2439
- add main_migrations as entrypoint to just run db migrations, using
existing init_ops() call
- first run 'migrations' container with same resources as 'app' and 'op'
- additional typing for initializing db
- cleanup unused code related to running only once, waiting for db to be ready
- fixes#2447
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if succeeded
mismatches parallelism
- QA sorting: default to 'crawl order' by default to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
Fixes#2406
Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.
Also Optimizes MongoDB queries for better performance.
Migration Improvements:
- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats
Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
exclusions are already updated dynamically if crawler pod is running,
but when crawler pod is restarted, this ensures new exclusions are also
picked up:
- mount configmap in separate path, avoiding subPath, to allow dynamic
updates of mounted volume
- adds a lastConfigUpdate timestamp to CrawlJob - if lastConfigUpdate in
spec is different from current, the configmap is recreated by operator
- operator: also update image from channel avoid any issues with
updating crawler in channel
- only updates for exclusion add/remove so far, can later be expanded to
other crawler settings (see: #2355 for broader running crawl config
updates)
- fixes#2408