- only enable if 'enable_auto_resize' is true, default to false
- if true, set memory limit to 1.2 of memory requests, resize when
hitting 'soft oom' of initial request, adjust by 1.2 (current behavior)
up to max_crawler_memory
- if false, set memory limit to max_crawler_memory and never adjust
memory requests or memory limits
- part of #1959
Fixes https://github.com/webrecorder/browsertrix/issues/1905
- adds a new top-level `/api/subscriptions` endpoint and SubOps handler on
the backend.
- enable subscriptions API endpoints available only if `billing_enabled` is
set in helm chart
- new POST /subscriptions/create, /subscriptions/update,
/subscriptions/cancel API endpoints
- Subscriptions mongo collection storing timestamped /subscription
API events
- GET /subscriptions/events API to get subscription events, support for filtering and sorting
- Subscription data model
- Support for setting and handling readOnlyOnCancel on org
- /orgs/<id>/billing-portal to lookup portalUrl using external API
- subscription in org getter and list views
- mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active`
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Limit egress traffic from crawler/profilebrowser pods to the internet
and limited internal services like dns, redis, frontend, auth-signer on certain ports
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- needed to support js-wacz signing requests in upcoming crawler versions
- Also has slightly increased memory requirements due to new versions of
some libraries.
- 0.5.2 adds a fix to dropping the fractional part of the second, to make
it work with ISO date strings that have microseconds, such as those from
js-wacz.
Resolves https://github.com/webrecorder/browsertrix/issues/1874
Support for new two-part sign up flow if first admin user is added to org
- If new user, user registers first, then is able to change the org name / slug on following screen
- If existing user, user accepts invite, then is able to change the org name / slug on following screen
- After confirming org slug name, user is taken to dashboard, or error is shown if org name or slug already taken.
- If org name == org id, org name and slug is automatically set to `{Your Name}'s Archive` when first user is registered / accepts invite
- Email templates updated to better reflect new / existing users and not show org name if it is 'unset' (org name == org id internally)
- tests: frontend unit testing for accept + invite screens.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Backend work for first two tasks of
https://github.com/webrecorder/browsertrix/issues/1875
New /billing API endpoint to be added separately once we have a better
idea of what data we can get from the payment processor.
- allow configuring QA run scale via 'qa_scale' setting in helm values
(overriding any setting on the qa crawljob)
- adds additional comments to browser instances helm values settings for clarity
- fixes#1842
Adds a `max_crawler_memory` chart setting, which, if set, will
defines the upper crawler memory limit that crawler pods can be resized up to.
If not set, auto resizing is disabled and pods are always set to 'crawler_memory' memory
if 'registration_enabled' is set, check 'registration_org_id' for org id
of an existing org that new users should be added to when they register.
if omitted, default to the default org
Fixes#1729
- fixes#1684
- can be used to optionally restrict QA to only some crawls (eg. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
Supports horizontal pod autoscaling (hpa) for backend and frontend pods:
- use cpu and memory averages
- adjust base memory + cpu for backend
- threshold set to 80% cpu and 95% memory utilization by default
(configurable in values.yaml)
- instead of backend and frontend replicas, set max replicas in
values.yaml
- only enable hpa if backend_max_replicas or frontend_max_replicas is
>1, default to 1 for now
Supports running QA Runs via the QA API!
Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498
Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)
Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.
- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch)
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1597
New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.
After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.
Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.
StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers
- Frontend: check that profile html page is loading, keep retrying if
still getting nginx error instead of loading an iframe with the error.
Fixes#1598 (Copy of #1599 from 1.9.4)
Fixes#1341
Adds "User Agent" field to workflow editor under the Browser Settings
tab. If not set, the crawler will use the browser's default user agent.
Also added to docs and to the workflow details page (if set).
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Refactors backend deployment to:
- Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by
mongodb connections
- Also sets backend workers to 1 by default to reduce default memory
usage
- Switches to gunicorn with uvloop worker for production use instead of
uvicorn (as recommended by uvicorn)
Lower thread count should address memory leak/increased usage, which
resulted in 5x thread x cpus x workers, eg. potentially 20 or 40 threads
just for mongodb connections. Lower default number of workers should
make it easier to scale backend with HPA if additional capacity.
Fixes#1467
Fixes#1385
## Changes
Supports multiple crawler 'channels' which can be configured to
different browsertrix-crawler versions
- Replaces `crawler_image` in helm chart with `crawler_channels` array
similar to how storages are handled
- The `default` crawler channel must always be provided and specifies
the default crawler image
- Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint
to fetch information about available crawler versions (name, image, and
label) and test
- Adds crawler channel select to workflow creation/edit screens and
profile creation dialog, and updates related API endpoints and
configmaps accordingly. The select dropdown is shown only if more than
one channel is configured.
- Adds `crawlerChannel` to workflow and crawl details.
- Add `image` to crawler image, used to display actual image used as
part of the crawl.
- Modifies `crawler_crawl_id` backend test fixture to use `test` crawler
version to ensure crawler versions other than latest work
- Adds migration to add `crawlerChannel` set to `default` to existing
workflow and profile objects and workflow configmaps
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- Emails are now processed from Jinja2 templates found in
`charts/email-templates`, to support easier updates via helm chart in
the future.
- The available templates are: `invite`, `password_reset`, `validate` and
`failed_bg_job`.
- Each template can be text only or also include HTML. The format of the
template is:
```
subject
~~~
<html content>
~~~
text
```
- A new `support_email` field is also added to the email block in
values.yaml
Invite Template:
- Currently, only the invite template includes an HTML version, other
templates are text only.
- The same template is used for new and existing users, with slightly
different text if adding user to an existing org.
- If user is invited by the superadmin, the invited by field is not
included, otherwise it also includes 'You have been invited by X to join Y'
Implemented variable and defaults for cluster-issuer to allow users to
specify, if needed, their own cluster issuer. (eg. installations with
only outbound traffic that cannot solve ACME https challenge)
Fixes#1252
Supports a generic background job system, with two background jobs,
CreateReplicaJob and DeleteReplicaJob.
- CreateReplicaJob runs on new crawls, uploads, profiles and updates the
`replicas` array with the info about the replica after the job succeeds.
- DeleteReplicaJob deletes the replica.
- Both jobs are created from the new `replica_job.yaml` template. The
CreateReplicaJob sets secrets for primary storage + replica storage,
while DeleteReplicaJob only needs the replica storage.
- The job is processed in the operator when the job is finalized
(deleted), which should happen immediately when the job is done, either
because it succeeds or because the backoffLimit is reached (currently
set to 3).
- /jobs/ api lists all jobs using a paginated response, including filtering and sorting
- /jobs/<job id> returns details for a particular job
- tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints
- tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>