Commit Graph

182 Commits

Author SHA1 Message Date
Ilya Kreymer
b574f00d2b
Add Repository Index + Chart Rename + Docs Rename (#1708)
Repository Index: Generate an index.yaml in ./docx/helm-repo/index.yaml
to allow for browsertrix to be a helm repository.
docs: rename docs.browsertrix.cloud -> docs.browsertrix.com
docs: update deployment doc to mention helm repo as preferred way to
install
docs build action: generate repository index in GH action
publish action: update auto-generated message to mention installing from
the repo.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-21 09:42:25 -07:00
Ilya Kreymer
4360e0c1b5
Update tests with latest crawler (#1711)
tests: use 'latest' crawler release for testing, now that 1.1.x is released.
2024-04-20 15:56:26 -07:00
Vinzenz Sinapius
a8336925b6
Run crawler and profilebrowser with non-root user (#1625)
With these changes, crawler and profilebrowser jobs run as a
non-root user.
2024-04-17 12:03:33 -07:00
Ilya Kreymer
835014d829
restrict qa runs to a 'min_qa_crawler_image' if set in the chart (#1685)
- fixes #1684
- can be used to optionally restrict QA to only some crawls (eg. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
2024-04-17 08:48:33 -07:00
Vinzenz Sinapius
1b034957ff
Improve reliability of backend tests (#1675)
- Remove globals from profile, uploads, and qa test modules in favor of fixtures
- Add retries to fix intermittent test failures due to timing

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-04-16 14:22:41 -04:00
Ilya Kreymer
95f5605af7
renumber crawl priority classes: (#1673)
- priority classes <-10 are ignored by cluster-autoscaler so QA jobs
with too low priorities never run
- start crawl priorities at 0 going down (same as before)
- start qa run priorities at -2 going down (instead of -100)
- this means a crawl of with scale of 3 can be preempted by 1st qa pod,
but otherwise crawls have higher priority
- rename priority classes as they are otherwise immutable and error on
helm upgrade

This allows for more room in lower pri classes for other type of
objects, while keeping in mind the -10 and below threshold: (see:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
2024-04-13 12:24:43 -07:00
Ilya Kreymer
17f49a52de
email templates update + customization + doc update (fixes #1652) (#1653)
- modify invite email template to answer common questions
- email templates: make each email template overridable with --set-file
- docs: update customization doc to document how to customize email
templates

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-08 12:27:47 -07:00
Ilya Kreymer
a7cda3b11b version: bump to 1.10.0-beta.1 2024-04-05 18:24:14 -07:00
Ilya Kreymer
c1817cbe04
add horizontal pod autoscaler for backend and frontend via helm charts (#1633)
Supports horizontal pod autoscaling (hpa) for backend and frontend pods:
- use cpu and memory averages
- adjust base memory + cpu for backend
- threshold set to 80% cpu and 95% memory utilization by default
(configurable in values.yaml)
- instead of backend and frontend replicas, set max replicas in
values.yaml
- only enable hpa if backend_max_replicas or frontend_max_replicas is
>1, default to 1 for now
2024-03-28 16:39:27 -07:00
Ilya Kreymer
3438133fcb
Crawler pod memory padding + auto scaling (#1631)
- set memory limit to 1.2x memory request to provide extra padding and
avoid OOM
- attempt to resize crawler pods by 1.2x when exceeding 90% of available
memory
- do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of
requested memory, resulting in faster graceful restart, but avoiding a
system-instant OOM Kill
- Fixes #1632

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-28 16:39:00 -07:00
Ilya Kreymer
86311ab4ea
merge 1.9.5 fixes (#1637)
retry loading profile if initial load fails, follow-up to #1604
- Add missing setTimeout to retry profile loading

bump RWP to 1.8.15
2024-03-27 21:49:19 -07:00
Ilya Kreymer
412eb2ef32
MetaController update (#1630)
Bump metacontroller to latest (4.11)
2024-03-27 08:49:56 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
Tessa Walsh
21ae38362e
Add endpoints to read pages from older crawl WACZs into database (#1562)
Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.

After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.

Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.

StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
2024-03-19 14:14:21 -07:00
Ilya Kreymer
e7af081af1
profile browser fixes: better resource usage + load retry (main) (#1604)
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers

- Frontend: check that profile html page is loading, keep retrying if
still getting nginx error instead of loading an iframe with the error.

Fixes #1598 (Copy of #1599 from 1.9.4)
2024-03-16 15:07:04 -07:00
Henry Wilkinson
8ba29ca776
Browsertrix Cloud → Browsertrix text rename (#1466)
Part of #1241

### Changes
- Renames all instances of "Browsertrix Cloud" to "Browsertrix" on the
front end, emails, and documentation

---------

Co-authored-by: emma <hi@emma.cafe>
2024-03-12 11:30:05 -04:00
Ilya Kreymer
804f755787
Increase startup probe time to account for long-running migrations (#1560)
- increases the failureThreshold for startupProbe for the api backend
container to account for long running migrations, upto 300 seconds
- add `/healthzStartup` which checks if db is ready
- bump 
- keeps `/healthz` to always return 200 when running
- increases livenessProbe failureThreshold to be higher than readiness
probe, following recommended best practice of liveness probe > readiness
probe
- fixes #1559
2024-02-28 14:22:33 -08:00
Tessa Walsh
14189b7cfb
Add crawl pages and related API endpoints (#1516)
Fixes #1502 

- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted

Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.

Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-28 12:11:35 -05:00
Ilya Kreymer
8ae032ff88 More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. (#1537)
Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug
[crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler
1.0.0-beta.4 or higher
If crawl name is provided, uses crawl name, other hostname of first
seed. The name is 'sluggified', using lowercase alphanum characters
separated by dashes.

Ex: in an organization called `Default Org`, a crawl of
`https://specs.webrecorder.net/` and no name will have WARCs named:
`default-org-specs-webrecorder-net-....warc.gz`
If the crawl is given the name `SPECS`, the WARCs will be named
`default-org-specs-manual-....warc.gz`

Fixes #412 in a default way.
2024-02-22 23:54:23 -08:00
Ilya Kreymer
a8e3ff1141 version: bump to 1.10.0-beta.0 2024-02-20 00:22:29 -08:00
Ilya Kreymer
c1cffe9ecd version: bump to 1.9.1 2024-02-16 09:44:18 -08:00
Ilya Kreymer
64bf21311d version: bump to 1.9.0! 2024-02-14 13:30:46 -08:00
Ilya Kreymer
1d266e3cea bump to 1.9.0.beta.5 2024-02-12 18:29:39 -08:00
Ilya Kreymer
4bc8152640 version: bump to 1.9.0-beta.4 2024-02-09 16:17:13 -08:00
Ilya Kreymer
b2a5dbf2cd
enable screenshots by default + fix py version formatting (#1518)
configmap: add --screenshot thumbnail,view as default screenshots 
version: update update-version.sh to add newline in version.py to match
new black formatting (from changes in #1507)
Fixes #1519
2024-02-07 17:07:28 -08:00
Ilya Kreymer
7aebce66f6 version: bump to 1.9.0-beta.3 2024-02-07 15:21:10 -08:00
Ilya Kreymer
e43feedc43 version: bump to 1.9.0-beta.2 2024-01-18 10:01:38 -08:00
Ilya Kreymer
370590b14f version: bump to 1.9.0-beta.1 2024-01-17 14:58:25 -08:00
Tessa Walsh
07fa46d9aa
Add custom user agent to workflows (#1465)
Fixes #1341

Adds "User Agent" field to workflow editor under the Browser Settings
tab. If not set, the crawler will use the browser's default user agent.

Also added to docs and to the workflow details page (if set).

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-01-17 17:33:50 -05:00
Ilya Kreymer
90197b2a85
Backend mem usage fix - use fixed MOTOR_MAX_WORKERS + switch to gunicorn (#1468)
Refactors backend deployment to:
- Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by
mongodb connections
- Also sets backend workers to 1 by default to reduce default memory
usage
- Switches to gunicorn with uvloop worker for production use instead of
uvicorn (as recommended by uvicorn)

Lower thread count should address memory leak/increased usage, which
resulted in 5x thread x cpus x workers, eg. potentially 20 or 40 threads
just for mongodb connections. Lower default number of workers should
make it easier to scale backend with HPA if additional capacity.

Fixes #1467
2024-01-16 15:32:42 -08:00
Tessa Walsh
032859f361
Support multiple crawler versions (#1420)
Fixes #1385 

## Changes
Supports multiple crawler 'channels' which can be configured to
different browsertrix-crawler versions
- Replaces `crawler_image` in helm chart with `crawler_channels` array
similar to how storages are handled
- The `default` crawler channel must always be provided and specifies
the default crawler image
- Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint
to fetch information about available crawler versions (name, image, and
label) and test
- Adds crawler channel select to workflow creation/edit screens and
profile creation dialog, and updates related API endpoints and
configmaps accordingly. The select dropdown is shown only if more than
one channel is configured.
- Adds `crawlerChannel` to workflow and crawl details.
- Add `image` to crawler image, used to display actual image used as
part of the crawl.
- Modifies `crawler_crawl_id` backend test fixture to use `test` crawler
version to ensure crawler versions other than latest work
- Adds migration to add `crawlerChannel` set to `default` to existing
workflow and profile objects and workflow configmaps

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-01-16 15:32:12 -08:00
Ilya Kreymer
a6936299d3 version: bump to 1.9.0-beta.0 2023-12-20 00:08:16 -08:00
Ilya Kreymer
d902cf5338 version: bump to 1.8.2 2023-12-07 13:34:37 -08:00
Ilya Kreymer
1218d6e767 version: bump to 1.8.1 2023-11-17 14:39:52 -08:00
Ilya Kreymer
b6f8c968e9 version: bump to 1.8.0 2023-11-15 17:57:43 -08:00
Ilya Kreymer
b23eed5003
Email Templates (#1375)
- Emails are now processed from Jinja2 templates found in
`charts/email-templates`, to support easier updates via helm chart in
the future.
- The available templates are: `invite`, `password_reset`, `validate` and
`failed_bg_job`.
- Each template can be text only or also include HTML. The format of the
template is:
```
subject
~~~
<html content>
~~~
text
```
- A new `support_email` field is also added to the email block in
values.yaml

Invite Template: 
- Currently, only the invite template includes an HTML version, other
templates are text only.
- The same template is used for new and existing users, with slightly
different text if adding user to an existing org.
- If user is invited by the superadmin, the invited by field is not
included, otherwise it also includes 'You have been invited by X to join Y'
2023-11-15 15:22:12 -08:00
Ilya Kreymer
7d985a9688 version: bump to 1.8.0-beta.4 2023-11-14 11:59:04 -08:00
Ilya Kreymer
dfba4b3940
Replace partial_complete -> stopped_by_user or stopped_quota_reached + operator edge cases (#1368)
- Adds two new crawl finished state, stopped_by_user and
stopped_quota_reached
- Tracking other possible 'stop reasons' in operator, though not making
them distinct states for now.
- Updated frontend with 'Stopped by User' and 'Stopped: Time Quota
Reached', shown with same icon as current partial_complete
- Added migration of partial_complete to either stopped_by_user or
complete (no historical quota data available)
- Addresses edge case in scaling: if crawl never scaled (no redis entry,
no pod), automatically scale down
- Edge case in status: if crawl is somehow 'canceled' but not deleted,
immediately delete crawl object and begin finalizing.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-11-14 11:17:16 -08:00
Ilya Kreymer
3e82c4513b quickfix: bump replica_job memory to 200Mi 2023-11-13 13:45:24 -08:00
Ilya Kreymer
a6a78c9ef2
node affinity: set to required instead of preferred to keep crawlers on dedicated infrastructure (#1366)
Previously, the crawler pods use preferred node affinity, instead of
required node affinity. This results in crawler nodes running on the
main node pool. Instead, we want to ensure crawler nodes are running on
dedicated node pool (if configured).
- Converts 'preferred node affinity' to 'required node affinity' for
the node pool, while keeping preferred pod affinity for keeping all
crawler / redis pods together.
- For profiles, updates to same node affinity, and also adds
resource constraint to match a single crawler for profile browser,
which did not have resource constraints.
2023-11-13 10:02:05 -08:00
Ilya Kreymer
67892994a6 version: bump to 1.8.0-beta.3 2023-11-09 18:20:04 -08:00
Tessa Walsh
82a5d1e4e4
Regression fix: add profiles/ prefix to profile filenames (#1365)
Fixes #1364 

Regression fix for issue introduced in storage refactoring (see issue
for more details).

Changes:
1. Add `profiles/` prefix to profile filename passed in to crawler for
profile creation and written into db
2. Remove hardcoded `profiles/` prefix from crawler YAML
3. Add migration to add `profiles/` prefix to profile filenames that
don't already have it, including updating PROFILE_FILENAME in ConfigMaps

This way between the related storage document and the profile filename,
we have the full path to the object in the database rather than relying
on additional prefixes hardcoded into k8s job YAML files.

Note that this as a follow-up it'll be necessary to manually move any
profiles that had been written into the `<oid>` "directory" in object
storage rather than `<oid>/profiles` to the latter. This should only
affect profiles created very recently in a 1.8.0-beta release.
2023-11-09 17:44:16 -08:00
Ilya Kreymer
ff10124d01
charts cleanup: (#1360)
- move authsign secret to signer and make port configurable
- rename storages to more general ops-configs
- put 'storages.json' path into env var
- rename backend secret to backend-auth
- cronjobs: don't keep succeeded jobs around, triggers operator update
2023-11-08 19:24:00 -08:00
Ilya Kreymer
e4660dd010 nightly tests: bump page limit to 200 to ensure crawls don't end too quickly when testing quotas/limits 2023-11-08 15:10:55 -08:00
Ilya Kreymer
d2d7240455
background jobs fix: ensure bucket is parsed correctly (#1359)
Follow-up to #1321 
- correctly parse the endpoint_url into prefix and bucket path
- also add region and s3 provider type to storage secrets
2023-11-08 15:08:23 -08:00
Ilya Kreymer
3aebf2e37f version: bump to 1.8.0-beta.2 2023-11-06 16:35:15 -08:00
Ilya Kreymer
b4fd5e6e94
Crawl Timeout via elapsed time (#1338)
Fixes #1337 

Crawl timeout is tracked via `elapsedCrawlTime` field on the crawl
status, which is similar to regular crawl execution time, but only
counts one pod if scale > 1. If scale == 1, this time is equivalent.

Crawl is gracefully stopped when the elapsed execution time exceeds the
timeout. For more responsiveness, also adding current crawl time since
last update interval.

Details:
- handle crawl timeout via elapsed crawl time - longest running time of a
single pod, instead of expire time.
- include current running from last update for best precision
- more accurately count elapsed time crawl is actually running
- store elapsedCrawlTime in addition to crawlExecTime, storing the
longest duration of each pod since last test interval

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-11-06 16:32:58 -08:00
Ilya Kreymer
5530ca92e1
Move backend app templates to be installed from configmap volume (#1331)
Instead of adding the app templates launched from the backend via
`backend/btrixcloud/templates`, add them to a configmap and mount the
configmap in the same location.

This allows these templates to be updated, like other values in
charts/... without having to rebuild any of the images, speeding up dev
and maintenance time.

Changes include:
- move backend/btrixcloud/templates -> chart/app-templates/
- add app-templates/*.yaml to app-templates configmap
- mount app-templates configmap to /app/btrixcloud/templates/ in api and op containers
2023-11-06 09:37:48 -08:00
Francesco Servida
0b8bbcf8e6
Allow User to specify custom cluster-issuer (#1332)
Implemented variable and defaults for cluster-issuer to allow users to
specify, if needed, their own cluster issuer. (eg. installations with
only outbound traffic that cannot solve ACME https challenge)
2023-11-04 13:29:17 -07:00
Francesco Servida
4998274ab0
correctly suffix Auth-Signer url when running in custom namespace (#1335) 2023-11-04 10:34:05 -07:00