Commit Graph

174 Commits

Author SHA1 Message Date
Ilya Kreymer
946739b08b
Update authsigner to 0.5.2 (#1899)
- needed to support js-wacz signing requests in upcoming crawler versions
- Also has slightly increased memory requirements due to new versions of
some libraries.
- 0.5.2 adds a fix to dropping the fractional part of the second, to make
it work with ISO date strings that have microseconds, such as those from
js-wacz.
2024-06-28 13:38:24 -07:00
Tessa Walsh
8a904c9031
feat: Rename org when accepting org invite for first admin (#1870)
Resolves https://github.com/webrecorder/browsertrix/issues/1874

Support for new two-part sign up flow if first admin user is added to org
- If new user, user registers first, then is able to change the org name / slug on following screen
- If existing user, user accepts invite, then is able to change the org name / slug on following screen
- After confirming org slug name, user is taken to dashboard, or error is shown if org name or slug already taken.
- If org name == org id, org name and slug is automatically set to `{Your Name}'s Archive` when first user is registered / accepts invite
- Email templates updated to better reflect new / existing users and not show org name if it is 'unset' (org name == org id internally)
- tests: frontend unit testing for accept + invite screens.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
2024-06-27 16:08:31 -07:00
Tessa Walsh
7af3980323
Add billing enabled and sales email to Helm chart and /settings API endpoint (#1873)
Backend work for first two tasks of
https://github.com/webrecorder/browsertrix/issues/1875

New /billing API endpoint to be added separately once we have a better
idea of what data we can get from the payment processor.
2024-06-25 10:55:29 -04:00
Ilya Kreymer
e3ee63f9b0 version: bump to 1.11.0-beta.0 2024-06-04 13:37:44 -07:00
Ilya Kreymer
d42de92d75
QA analysis scale configurable in helm chart (#1843)
- allow configuring QA run scale via 'qa_scale' setting in helm values
(overriding any setting on the qa crawljob)
- adds additional comments to browser instances helm values settings for clarity
- fixes #1842
2024-05-30 12:59:21 -07:00
Ilya Kreymer
4b6dd97c11 version: bump to 1.10.1 2024-05-23 22:24:58 -07:00
Ilya Kreymer
e853b62401 version: update to 1.10.0! 2024-05-20 19:30:22 -07:00
Ilya Kreymer
94d57b98ce version bump to 1.10.0-beta.7 2024-05-15 11:30:05 -07:00
Ilya Kreymer
e022994f4e version: update to 1.10.0-beta.6 2024-04-30 20:34:11 +02:00
Ilya Kreymer
a3911f6a8a version: bump to 1.10.0-beta.5 2024-04-25 09:00:54 +02:00
Ilya Kreymer
a09f565ce5 version: bump to 1.10.0-beta.4 2024-04-24 16:53:39 +02:00
Ilya Kreymer
f89027ac89 version: 1.10.0-beta.3 2024-04-24 15:45:17 +02:00
Ilya Kreymer
ec74eb4242
operator: add 'max_crawler_memory' to limit autosizing of crawler pods (#1746)
Adds a `max_crawler_memory` chart setting, which, if set, will
defines the upper crawler memory limit that crawler pods can be resized up to.
If not set, auto resizing is disabled and pods are always set to 'crawler_memory' memory
2024-04-24 15:16:32 +02:00
Ilya Kreymer
41655ef829 version: bump to 1.10.0-beta.2 2024-04-23 23:19:16 +02:00
Ilya Kreymer
b94070160b
allow configuring designated registration org to which new users can register (#1735)
if 'registration_enabled' is set, check 'registration_org_id' for org id
of an existing org that new users should be added to when they register.
if omitted, default to the default org

Fixes #1729
2024-04-23 17:11:37 -04:00
Vinzenz Sinapius
a8336925b6
Run crawler and profilebrowser with non-root user (#1625)
With these changes, crawler and profilebrowser jobs run as a
non-root user.
2024-04-17 12:03:33 -07:00
Ilya Kreymer
835014d829
restrict qa runs to a 'min_qa_crawler_image' if set in the chart (#1685)
- fixes #1684
- can be used to optionally restrict QA to only some crawls (eg. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
2024-04-17 08:48:33 -07:00
Ilya Kreymer
a7cda3b11b version: bump to 1.10.0-beta.1 2024-04-05 18:24:14 -07:00
Ilya Kreymer
c1817cbe04
add horizontal pod autoscaler for backend and frontend via helm charts (#1633)
Supports horizontal pod autoscaling (hpa) for backend and frontend pods:
- use cpu and memory averages
- adjust base memory + cpu for backend
- threshold set to 80% cpu and 95% memory utilization by default
(configurable in values.yaml)
- instead of backend and frontend replicas, set max replicas in
values.yaml
- only enable hpa if backend_max_replicas or frontend_max_replicas is
>1, default to 1 for now
2024-03-28 16:39:27 -07:00
Ilya Kreymer
86311ab4ea
merge 1.9.5 fixes (#1637)
retry loading profile if initial load fails, follow-up to #1604
- Add missing setTimeout to retry profile loading

bump RWP to 1.8.15
2024-03-27 21:49:19 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
Tessa Walsh
21ae38362e
Add endpoints to read pages from older crawl WACZs into database (#1562)
Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.

After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.

Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.

StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
2024-03-19 14:14:21 -07:00
Ilya Kreymer
e7af081af1
profile browser fixes: better resource usage + load retry (main) (#1604)
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers

- Frontend: check that profile html page is loading, keep retrying if
still getting nginx error instead of loading an iframe with the error.

Fixes #1598 (Copy of #1599 from 1.9.4)
2024-03-16 15:07:04 -07:00
Ilya Kreymer
a8e3ff1141 version: bump to 1.10.0-beta.0 2024-02-20 00:22:29 -08:00
Ilya Kreymer
c1cffe9ecd version: bump to 1.9.1 2024-02-16 09:44:18 -08:00
Ilya Kreymer
64bf21311d version: bump to 1.9.0! 2024-02-14 13:30:46 -08:00
Ilya Kreymer
1d266e3cea bump to 1.9.0.beta.5 2024-02-12 18:29:39 -08:00
Ilya Kreymer
4bc8152640 version: bump to 1.9.0-beta.4 2024-02-09 16:17:13 -08:00
Ilya Kreymer
7aebce66f6 version: bump to 1.9.0-beta.3 2024-02-07 15:21:10 -08:00
Ilya Kreymer
e43feedc43 version: bump to 1.9.0-beta.2 2024-01-18 10:01:38 -08:00
Ilya Kreymer
370590b14f version: bump to 1.9.0-beta.1 2024-01-17 14:58:25 -08:00
Tessa Walsh
07fa46d9aa
Add custom user agent to workflows (#1465)
Fixes #1341

Adds "User Agent" field to workflow editor under the Browser Settings
tab. If not set, the crawler will use the browser's default user agent.

Also added to docs and to the workflow details page (if set).

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-01-17 17:33:50 -05:00
Ilya Kreymer
90197b2a85
Backend mem usage fix - use fixed MOTOR_MAX_WORKERS + switch to gunicorn (#1468)
Refactors backend deployment to:
- Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by
mongodb connections
- Also sets backend workers to 1 by default to reduce default memory
usage
- Switches to gunicorn with uvloop worker for production use instead of
uvicorn (as recommended by uvicorn)

Lower thread count should address memory leak/increased usage, which
resulted in 5x thread x cpus x workers, eg. potentially 20 or 40 threads
just for mongodb connections. Lower default number of workers should
make it easier to scale backend with HPA if additional capacity.

Fixes #1467
2024-01-16 15:32:42 -08:00
Tessa Walsh
032859f361
Support multiple crawler versions (#1420)
Fixes #1385 

## Changes
Supports multiple crawler 'channels' which can be configured to
different browsertrix-crawler versions
- Replaces `crawler_image` in helm chart with `crawler_channels` array
similar to how storages are handled
- The `default` crawler channel must always be provided and specifies
the default crawler image
- Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint
to fetch information about available crawler versions (name, image, and
label) and test
- Adds crawler channel select to workflow creation/edit screens and
profile creation dialog, and updates related API endpoints and
configmaps accordingly. The select dropdown is shown only if more than
one channel is configured.
- Adds `crawlerChannel` to workflow and crawl details.
- Add `image` to crawler image, used to display actual image used as
part of the crawl.
- Modifies `crawler_crawl_id` backend test fixture to use `test` crawler
version to ensure crawler versions other than latest work
- Adds migration to add `crawlerChannel` set to `default` to existing
workflow and profile objects and workflow configmaps

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-01-16 15:32:12 -08:00
Ilya Kreymer
a6936299d3 version: bump to 1.9.0-beta.0 2023-12-20 00:08:16 -08:00
Ilya Kreymer
d902cf5338 version: bump to 1.8.2 2023-12-07 13:34:37 -08:00
Ilya Kreymer
1218d6e767 version: bump to 1.8.1 2023-11-17 14:39:52 -08:00
Ilya Kreymer
b6f8c968e9 version: bump to 1.8.0 2023-11-15 17:57:43 -08:00
Ilya Kreymer
b23eed5003
Email Templates (#1375)
- Emails are now processed from Jinja2 templates found in
`charts/email-templates`, to support easier updates via helm chart in
the future.
- The available templates are: `invite`, `password_reset`, `validate` and
`failed_bg_job`.
- Each template can be text only or also include HTML. The format of the
template is:
```
subject
~~~
<html content>
~~~
text
```
- A new `support_email` field is also added to the email block in
values.yaml

Invite Template: 
- Currently, only the invite template includes an HTML version, other
templates are text only.
- The same template is used for new and existing users, with slightly
different text if adding user to an existing org.
- If user is invited by the superadmin, the invited by field is not
included, otherwise it also includes 'You have been invited by X to join Y'
2023-11-15 15:22:12 -08:00
Ilya Kreymer
7d985a9688 version: bump to 1.8.0-beta.4 2023-11-14 11:59:04 -08:00
Ilya Kreymer
67892994a6 version: bump to 1.8.0-beta.3 2023-11-09 18:20:04 -08:00
Ilya Kreymer
3aebf2e37f version: bump to 1.8.0-beta.2 2023-11-06 16:35:15 -08:00
Francesco Servida
0b8bbcf8e6
Allow User to specify custom cluster-issuer (#1332)
Implemented variable and defaults for cluster-issuer to allow users to
specify, if needed, their own cluster issuer. (eg. installations with
only outbound traffic that cannot solve ACME https challenge)
2023-11-04 13:29:17 -07:00
Ilya Kreymer
fb3d88291f
Background Jobs Work (#1321)
Fixes #1252 

Supports a generic background job system, with two background jobs,
CreateReplicaJob and DeleteReplicaJob.
- CreateReplicaJob runs on new crawls, uploads, profiles and updates the
`replicas` array with the info about the replica after the job succeeds.
- DeleteReplicaJob deletes the replica.
- Both jobs are created from the new `replica_job.yaml` template. The
CreateReplicaJob sets secrets for primary storage + replica storage,
while DeleteReplicaJob only needs the replica storage.
- The job is processed in the operator when the job is finalized
(deleted), which should happen immediately when the job is done, either
because it succeeds or because the backoffLimit is reached (currently
set to 3).
- /jobs/ api lists all jobs using a paginated response, including filtering and sorting
- /jobs/<job id> returns details for a particular job
- tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints
- tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-11-02 13:02:17 -07:00
Ilya Kreymer
8c09934298 version: bump to 1.8.0-beta.1 2023-10-27 14:35:24 -07:00
Ilya Kreymer
6dc452ebad
Storage Refactor: Replication + Custom Storage Support (#1296)
- Refactors storage to support replicas + custom storages on the Org.
- There is a default primary + replica storage, while an Org can also have
primary and replica storages.
- StorageRef object is used to store references to default and custom
storage.

- CrawlFile has been updated to contain a StorageRef instead of a
def_storage_name, which references
either a default storage (in StorageOps) or custom storage (in
Organization)
- There is also a 'replicas' Optional[List[StorageRef]] which contains
replicas, if any.
- CrawlFileOut contain a numReplicas for how many replicas exist for
a given file.
- Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile)


Part of #1262

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-26 21:44:09 -07:00
Tessa Walsh
d58747dfa2
Provide full resources in archived items finished webhooks (#1308)
Fixes #1306 

- Include full `resources` with expireAt (as string) in crawlFinished
and uploadFinished webhook notifications rather than using the
`downloadUrls` field (this is retained for collections).
- Set default presigned duration to one minute short of 1 week and enforce
maximum supported by S3
- Add 'storage_presign_duration_minutes' commented out to helm values.yaml
- Update tests

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-10-23 19:01:58 -07:00
Ilya Kreymer
36bd228115 version: update to 1.8.0-beta.0 2023-10-17 18:06:55 -07:00
Ilya Kreymer
b3f530f8e6 version: bump to 1.7.0 2023-10-16 18:39:20 -07:00
Ilya Kreymer
a295f5d05d version: bump to 1.7.0-beta.3 2023-10-15 18:31:03 -07:00
Ilya Kreymer
fa86555eed
Track pod resource usage, detect OOM crashes, handle auto-scaling (#1235)
* keep track of per pod status on crawljob:
- crashes time, and reason
- 'used' vs 'allocated' resources 
- 'percent' used / allocated

* crawl log errors: log error when crawler crashes via OOM, either via redis error log
or to console

* add initial autoscaling support!
- detect if metrics server is available via K8SApi.is_pod_metrics_available()
- if available, use metrics for 'used' fields
- if no metrics, set memory used for redis only (using redis apis)
- allow overriding memory and cpu via newMemory and newCpu settings on pod status
- scale memory / cpu based on newMemory and newCpu setting
- templates: update jinja templates to allow restarting crawler and redis with new resources
- ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs

* roles: cleanup unused roles, add permissions for listing metrics

* stats for running crawls:
- update in db via operator
- avoids losing stats if redis pod happens to be done
- tradeoff is more db access in operator, but less extra connections to redis + already
loading from db in backend
- size stat: ensure size of previous files is added to the stats

* crawler deployment tweaks:
- adjust cpu/mem per browser
- add --headless flag to configmap to use new headless mode by default!
2023-10-05 20:41:18 -07:00
Ilya Kreymer
20560abb81 version: bump to 1.7.0-beta.2 2023-10-05 20:33:38 -07:00
Ilya Kreymer
d6bc467c54
improvements to redis pod: (#1219)
- add liveness check/fix readiness check - ensure 'redis-cli ping' actually returns 'PONG', as exit code is 0 even if errors
will detect situations where redis is not available, such as due to to max clients being reached
- bump redis memory/cpu for now (until autoscaling/automatic adjustment is available)
2023-09-28 13:00:31 -07:00
Ilya Kreymer
18b2c1abfc
limit: set default page limit to 50k pages (#1211) 2023-09-26 10:29:03 -07:00
Ilya Kreymer
65b7c10ba1 bump version to 1.7.0-beta.1 2023-09-18 14:33:03 -07:00
Ilya Kreymer
3d4ff264b6
version: bump RWP version to 1.8.12 (#1181) 2023-09-15 11:32:45 -07:00
Ilya Kreymer
ad9bca2e92
Operator refactor to control pods + pvcs directly instead of statefulsets (#1149)
- Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state.
- Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion.
- Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl
- Create priority classes upto 'max_crawl_scale, configured in values.yaml
- Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale,
graceful stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state
before adjust status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted.
- Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds
- Configurable Redis storage with 'redis_storage' value, set to 3Gi by default
- CrawlJob deletion starts as soon as post-finish crawl operations are run
- Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer
- Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished)
- Current resource usage added to status
- Profile browser: also manage single pod directly without statefulset for consistency.
- Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0)
- Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0)
- Failed crawl logging: dd 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status
- tests: check other finished states to avoid stuck in infinite loop if crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
fixes #1147 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-11 10:38:04 -07:00
Anish Lakhwara
e57148d0e9
feat: add SMTP {port, use_tls} config (#1142)
* feat: add SMTP {port, use_tls} config
* If `password` is None don't attempt to log in
* remove 'can be omitted' comment

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-09-08 08:18:36 -07:00
Ilya Kreymer
2967f1e320
ingress: simplify ingress config: (fixes #1135) (#1146)
* ingress: simplify ingress config: (fixes #1135)
- use standard Prefix pathTypes
- remove nginx-specific rewriting
- remove 'scheme', use https/http based on 'tls' setting (in ingress and configmap)
- fix signing ingress to use ingressClassName
2023-09-07 09:51:48 -07:00
Ilya Kreymer
68bc053ba0
Print crawl log to operator log (mostly for testing) (#1148)
* log only if 'log_failed_crawl_lines' value is set to number of last lines to log
from failed container

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-06 17:53:02 -07:00
Ilya Kreymer
dce1ae6129
better resources scaling by number of browsers per crawler container (#1103)
- set crawler cpu / memory with fixed base + incremental bumps based on number of browsers
- allow parsing k8s quantities with parse_quantity, compute in operator
- set 'crawler_cpu = crawler_cpu_base + crawler_extra_cpu_per_browser * (num_browsers - 1)'
and same for memory
2023-09-06 01:42:44 -04:00
Ilya Kreymer
6dca2f1c03
supports overriding the replayweb.page version without having to be r… (#1122)
* supports overriding the replayweb.page version without having to be rebuild frontend image:
- ensures 'rwp_base_url' from helm chart is passed to nginx
- ensures both ui.js and sw.js are loaded based on nginx environment variable, not hard-coded
- ui.js loaded via redirect from new /replay/ui.js path
- pin RWP to known working release in default values.yaml
- remove RWP_BASE_URL from Dockerfile, no longer needed, set via chart env var
- set default RWP_BASE_URL for devserver to use CDN
- set RWP version to 1.8.11
2023-09-05 20:10:21 -04:00
Ilya Kreymer
a9ab17fc61
publish helm chart on release (fixes #1114) (#1117) (#1123)
- no longer using :latest by default in values.yaml, instead updating version with each release
- set chart version to match app version in Chart.yaml
- update version in helm chart and values.yaml as part of update-version.sh script
- update test.yaml and local-config.yaml to enable using :latest tag images
- ci: add ci script for packaging current helm chart
- docs: updates docs to indicate deploying directly from GitHub release
- docs: add script to fill in latest version for 'VERSION' using custom script
- chart: set local_service_port to 30870 by default, but use only if no ingress.
- default values.yaml set up for local deployment, local-config.yaml contains additional commented out examples
- ci draft: add deployment info to draft with helm install command for current version
- test: fix password check test
2023-08-30 12:02:02 -07:00
Ilya Kreymer
8e43940196
chart resources: adjust backend memory to 350Mi, as 200Mi was too low (#1082) 2023-08-15 21:59:57 -07:00
Ilya Kreymer
9553115bbe
helm chart tweaks: (#1067)
* helm chart tweaks:
- lower mem requirements for backend and crawler
- disable cors in ingress to pass through cors headers from backend
- crawler statefulset: use ordered instead of parallel scaling policy to avoid single crawl taking up all crawling capacity quickly
2023-08-14 16:43:12 -07:00
Ilya Kreymer
7ea6d76f10
Resource Constraints Cleanup: (fixes #895) (#1019)
* resource constraints: (fixes #895)
- for cpu, only set cpu requests
- for memory, set mem requests == mem limits
- add missing resource constraints for minio and scheduled job
- for crawler, set mem and cpu constraints per browser, scale based on browser instances per crawler
- add comments in values.yaml for crawler values being multiplied
- default values: bump crawler to 650 millicpu per browser instance just in case

cleanup: remove unused entries from main backend configmap
2023-08-01 00:11:16 -07:00
Ilya Kreymer
c76dd10928
chart: always pull latest crawler image - since default image is pointing to webrecorder/browsertrix-crawler:latest, makes sense to always pull latest (#1018) 2023-07-27 12:41:41 -07:00
Vinzenz Sinapius
5807507f29
Add proxy settings for crawler and profilebrowser (#997) 2023-07-26 16:11:10 -07:00
Anish Lakhwara
b5a9c42df1
feat: add pre-commit to check we don't have real passwords in yml files (#990)
* feat: use existing pre-commit framework

* feat(ci): add github action for password_check

* feat: add some simple tests to password_check.py

* fix: set `backend_password_secret` in default values.yaml to an allowed password
2023-07-26 13:29:37 -07:00
Ilya Kreymer
d7cb47390e
readd support for passing in 'crawler_extra_args' for additional/custom (#957)
options not covered by standard crawler opts (removed setting all args this way in #889)
2023-07-07 12:08:40 -07:00
Ilya Kreymer
dd757961fc
config: add overridable 'user_agent_suffix' and 'user_agent' to values.yaml, (#910)
passed to crawler --userAgentSuffix and --userAgent params, respectively, using
'quote' to support spaces in user-agent.
config: re-order settings to put 'Crawler Settings' section first, followed by 'Cluster Settings'
fixes #787
2023-06-07 12:01:12 -07:00
Tessa Walsh
0284903b34
Cleanup carwler args (#889)
* crawler args cleanup:
- move crawler args command line entirely to configmap
- add required settings like --generateWACZ and --waitOnDone to configmap to not be overridable
- values files can configure individual settings, assembled in configmap
- move disk_utilization_threshold to configmap
- add 'crawler_logging_opts' and 'crawler_extract_full_text' options to values.yaml to more easily set these options

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-05-30 19:29:07 -04:00
Ilya Kreymer
12f7db3ae2
tests: fixes for crawl cancel + crawl stopped (#864)
* tests:
- fix cancel crawl test by ensuring state is not running or waiting
- fix stop crawl test by ensuring stop is only initiated after at least one page has been crawled,
otherwise result may be failed, as no crawl data has been crawled yet (separate fix in crawler to avoid loop if stopped
before any data written webrecorder/browsertrix-crawler#314)
- bump page limit to 4 for tests to ensure crawl is partially complete, not fully complete when stopping
- allow canceled or partial_complete due to race condition

* chart: bump frontend limits in default, not just for tests (addresses #780)

* crawl stop before starting:
- if crawl stopped before it started, mark as canceled
- add test for stopping immediately, which should result in 'canceled' crawl
- attempt to increase resync interval for immediate failure
- nightly tests: increase page limit to test timeout

* backend:
- detect stopped-before-start crawl as 'failed' instead of 'done'
- stats: return stats counters as int instead of string
2023-05-22 20:17:29 -07:00
Ilya Kreymer
70319594c2
crawlconfig: fix default filename template, make configurable (#835)
* crawlconfig: fix default filename template, make configurable
- make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml
- set to '@ts-@hostsuffix.wacz' by default
- allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap
- tests: add test for custom 'default_crawl_filename_template'
2023-05-08 14:03:27 -07:00
Ilya Kreymer
aae0e6590e
Ensure Volumes are deleted when crawl is canceled (#828)
* operator:
- ensures crawler pvcs are always deleted before crawl object is finalized (fixes #827)
- refactor to ensure finalizer handler always run when finalizing
- remove obsolete config entries
2023-05-05 12:05:54 -07:00
Ilya Kreymer
60ba9e366f
Refactor to use new operator on backend (#789)
* Btrixjobs Operator - Phase 1 (#679)

- add metacontroller and custom crds
- add main_op entrypoint for operator

* Btrix Operator Crawl Management (#767)

* operator backend:
- run operator api in separate container but in same pod, with WEB_CONCURRENCY=1
- operator creates statefulsets and services for CrawlJob and ProfileJob
- operator: use service hook endpoint, set port in values.yaml

* crawls working with CrawlJob
- jobs start with 'crawljob-' prefix
- update status to reflect current crawl state
- set sync time to 10 seconds by default, overridable with 'operator_resync_seconds'
- mark crawl as running, failed, complete when finished
- store finished status when crawl is complete
- support updating scale, forcing rollover, stop via patching CrawlJob
- support cancel via deletion
- requires hack to content-length for patching custom resources
- auto-delete of CrawlJob via 'ttlSecondsAfterFinished'
- also delete pvcs until autodelete supported via statefulset (k8s >1.27)
- ensure filesAdded always set correctly, keep counter in redis, add to status display
- optimization: attempt to reduce automerging, by reusing volumeClaimTemplates from existing children, as these may have additional props added
- add add_crawl_errors_to_db() for storing crawl errors from redis '<crawl>:e' key to mongodb when crawl is finished/failed/canceled
- add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291)
- support new page size, >0.9.0 and old page size key (changed in webrecorder/browsertrix-crawler#284)

* support for scheduled jobs!
- add main_scheduled_job entrypoint to run scheduled jobs
- add crawl_cron_job.yaml template for declaring CronJob
- CronJobs moved to default namespace

* operator manages ProfileJobs:
- jobs start with 'profilejob-'
- update expiry time by updating ProfileJob object 'expireTime' while profile is active

* refactor/cleanup:
- remove k8s package
- merge k8sman and basecrawlmanager into crawlmanager
- move templates, k8sapi, utils into root package
- delete all *_job.py files
- remove dt_now, ts_now from crawls, now in utils
- all db operations happen in crawl/crawlconfig/org files
- move shared crawl/crawlconfig/org functions that use the db to be importable directly,
including get_crawl_config, add_new_crawl, inc_crawl_stats

* role binding: more secure setup, don't allow crawler namespace any k8s permissions
- move cronjobs to be created in default namespace
- grant default namespace access to create cronjobs in default namespace
- remove role binding from crawler namespace

* additional tweaks to templates:
- templates: split crawler and redis statefulset into separate yaml file (in case need to load one or other separately)

* stats / redis optimization:
- don't update stats in mongodb on every operator sync, only when crawl is finished
- for api access, read stats directly from redis to get up-to-date stats
- move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access

* Add migration for operator changes
- Update configmap for crawl configs with scale > 1 or
crawlTimeout > 0 and schedule exists to recreate CronJobs
- add option to rerun last migration, enabled via env var and by running helm with --set=rerun_last_migration=1

* subcharts: move crawljob and profilejob crds to separate subchart, as this seems best way to guarantee proper install order with + update on upgrade with helm, add built btrix-crds-0.1.0.tgz subchart
- metacontroller: use release from ghcr, add metacontroller-helm-v4.10.1.tgz subchart

* backend api fixes
- ensure changing scale of crawl also updates it in the db
- crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to crawlconfig api

---------

Co-authored-by: D. Lee <leepro@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-24 18:30:52 -07:00
Tessa Walsh
6b19f72a89
Add crawl errors endpoint (#757)
* Add crawl errors endpoint

If this endpoint is called while the crawl is running, errors are
pulled directly from redis.

If this endpoint is called when the crawl is finished, errors are
pulled from mongodb, where they're written when crawls complete.

* Add nightly backend test for errors endpoint

* Add errors for failed and cancelled crawls to mongo

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-04-17 12:59:25 -04:00
Ilya Kreymer
85b6a05419
Upgrade to mongo 6 and use sortArray for workflow crawls (#764) (#765)
fixes from 1.4.1:
* Upgrade to mongo 6 and use  for workflow crawls

* update readiness probe with timeouts doubled, and failure threshold increased for slower 'mongosh' readiness check

update versions to 1.5.0-beta.0 in backend and frontend

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-11 18:22:07 -07:00
Tessa Walsh
11ca3e678a
Configure crawler disk utilization threshold via helm chart (#748) 2023-04-05 21:51:53 -07:00
Ilya Kreymer
7f757d396a
config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend… (#742)
* config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config
- add 'default_page_load_timeout_seconds' to values.yaml, defaulting to 120, for pageLoadTimeout
- add 'defaultPageLoadTimeSeconds ' to /api/settings, update tests for /api/settings
addresses issue in #636
2023-04-04 19:52:23 -07:00
Ilya Kreymer
887cb16146
Allow configurable max pages per crawl in deployment settings (#717)
* backend: max pages per crawl limit, part of fix for #716:
- set 'max_pages_crawl_limit' in values.yaml, default to 100,000
- if set/non-0, automatically set limit if none provided
- if set/non-0, return 400 if adding config with limit exceeding max limit
- return limit as 'maxPagesPerCrawl' in /api/settings
- api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing)

tests: add test for 'max_pages_per_crawl' setting
- ensure 'limit' can not be set higher than max_pages_per_crawl
- ensure pages crawled is at the limit
- set test limit to max 2 pages
- add settings test
- check for pages.jsonl and extraPages.jsonl when crawling 2 pages
2023-03-28 16:26:29 -07:00
Ilya Kreymer
413fd8d7ea
Chart: split Crawl args into separate variables (#639)
* chart crawl args cleanup:
- move configurable settings out of 'crawler_args'
- add 'crawler_session_size_limit_bytes' and 'crawler_session_time_limit_seconds' for --timeLimit and --sizeLimit option for crawler
- remove hard-coded 'timeout' to allow configuring via crawl config
- set liveness check port from existing config value
- add comments that requests hd must be at least double the size limit
- defaults: set crawler_requests_hd to 22GB, default crawl session size limit to 10GB
2023-02-24 17:24:04 -08:00
Ilya Kreymer
3df6e0f146
crawler arguments fixes: (#621)
- partial fix to #321, don't hard-code behavior limit into crawler args
- allow setting number of crawler browser instances via 'crawler_browser_instances' to avoid having to override the full crawler args
2023-02-22 13:23:19 -08:00
Tessa Walsh
14b349443f
Make pending invites expire via TTL index (#568)
* Make invites expire after configurable window

The value can be set in EXPIRE_AFTER_SECONDS env var and via
helm chart values, and defaults to 7 days.

* Create nightly test CI and add invite expiration test to it

* Update 404 error message for missing or expired invite

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-02-14 16:07:14 -05:00
D. Lee
be4f918149
Merge pull request #442 from webrecorder/admin-logging-service
Add logging service
2023-01-19 22:15:16 -08:00
Ilya Kreymer
ccd87e0dff
Rename api / nginx settings -> backend / frontend, set pull policy job images (#504)
* rename config values
- api -> backend
- nginx -> frontend

* job pods:
- set job_pull_policy from api_pull_policy (same as backend image)
- default to Always, but can be overridden for local deployment (same as backend image)

typo fix: CRAWL_NAMESPACE -> CRAWLER_NAMESPACE (part of #491)
ansible: set default label to :latest instead of :dev for
2023-01-18 20:21:36 -08:00
Ilya Kreymer
1dfa494210
backend: add default behavior time to /api/settings (part of #321) (#499) 2023-01-18 14:52:15 -08:00
Ilya Kreymer
827b643262
backend: add 'allow_dupe_invites' option to allow re-inviting users. if not set (default), duplicate invites will result in errors (#471) 2023-01-12 23:25:48 -08:00
Ilya Kreymer
4dbca8c421
email sending tweaks: (#470)
- support 'reply-to' email field in values, and in ansible-based values
- set 'subject' for different types of messages
2023-01-12 23:25:23 -08:00
Tessa Walsh
49460bb070
Add default organization + invite to default org (#465), #455
- Add default switch to Archive (org) model
- Set default org name via values.yaml
- Add check to ensure only one org with default org name exists
- Stop creating new orgs for new users
- Add new API endpoints for creating and renaming orgs (part of #457)
- Make Archive.name unique via index
- Wait for db connection on init, log if waiting
- Make archive-less invites invite user to default org with Owner role
- Rename default org from chart value if changed
- Don't create new org for invited users
2023-01-12 16:44:18 -08:00
Ilya Kreymer
30bda8c75d
VNC-Based Profile Browser (#433)
* profile browser vnc support + fixes:
- switch profile browser rendering to use VNC
- frontend: add @novnc/novnc as dependency, create separate bundle novnc.js to load into vnc browser (to avoid loading from each container)
- frontend: update proxy paths to proxy websocket, index page to crawler
- frontend: allow browser profiles in all browsers, remove browser compatibility check
- frontend: update webpack dev config, apply prettier
- frontend: node version fix
- backend: get vncpassword, build new URL for proxying to crawler iframe
- backend: fix profile / crawl job pull policy from 'Always' -> 'Never', should use existing image for job
- backend: fix kill signal to use bash -c to work with latest backend image
- backend/chart: add 'profile_browser_timeout_seconds' to chart values to control how long profile browser to remain when idle (default to 60)
- backend: remove utils.py, now using secret.token_hex() for random suffix
Co-authored-by: sua yoo <sua@suayoo.com>
2023-01-10 14:42:42 -08:00
DongWoo Lee
5b7d214c8a add logging 2023-01-04 00:15:23 -08:00
Ilya Kreymer
82ffc0dfbc
Local Deployment Work: Support running locally + test cluster on CI (#396)
* k8s local deployment work:
- make it easier to deploy w/o ingress by setting 'local_service_port' (suggested port 30870)
- if using local minio, ensure file endpoints set to /data/ and /data/ proxies correctly to local bucket
- if not using minio, ensure file endpoints point to correct access / endpoint url.
- setup should work with docker desktop, minikube, microk8s and k3s!
- nginx chart: bump nginx memory limit to 20Mi
- nginx image: 00-default-override-resolver-config -> 00-browsertrix-nginx-init for clarity
- nginx image: use default nginx.conf, pin to nginx 1.23.2
- mongo: readd readiness probe, bump connect wait timeout (needed for ci)
- config: set superadmin username to 'admin'
- config schema: set 'name' as required 
- add sample chart values overrides:
- chart values: local-config.yaml for running locally with 'local_service_port'
- chart values: add microk8s-hosted.yaml for configuring a hosted microk8s setup
- chart values: add microk8s-ci.yaml for ci tests
- ci: remove docker swarm tests
- ci: add microk8s integration tests: launching cluster, logging in, running a crawl of example.com, downloading/checking WACZ
- bump to 1.1.0-beta.2
2022-12-02 19:58:34 -08:00
Ilya Kreymer
aabb0b2a92
chart / deployment fixes to run on microk8s: (fixes #385) (#387)
- ingress: fix proxying /data to minio, use another ingress which proxies correct host to ensure presigned urls work
- presigning: determine if signing endpoint url (minio) or access endpoint (cloud bucket) based on if access endpoint is provided, set bool on storage object
- chart: fix indent on incorrect storageClassName configs
- ingress: make 'ingress_class' configurable (set to 'public' for microk8s, default to 'nginx')
- minio: use older minio image which supports legacy fs based setup (for now)
- nginx service: add 'nginx_service_use_node_port' config setting: if true, will use NodePort for frontend,
other will use default (ClusterIP) and only for the frontend / nginx
- chart: remove changing service type for other services
2022-11-30 09:21:58 -08:00
Francis Kayiwa
6833c9d676
Digital ocean setup (#314)
- Ansible playbook for deploying on DigitalOcean, configuring space, k8s cluster, mongodb, domain / subdomain, signing subdomain, container registry, and cors
- Generates helm chat in ./deploys/ directory for future use with helm directly
- Initial support for deletion of created resources as well.
- add documentation on how to use playbook
default helm values: update to latest authsign, set default timeout to 120 seconds
2022-11-15 13:44:24 -08:00
Ilya Kreymer
447b0bf9b9
k8s chart + values tweak: (#317)
- mongo chart to avoid requiring username/password if passing db_url
- tweaks to default values (set registration enabled by default, longer) add missing options
2022-09-21 12:45:08 -07:00
Ilya Kreymer
1216f6cb66
K8s: update chart for local minio + mongo default (#301)
* k8s chart fixes:
mongo: pin to 5.0.11 version for now
minio: create empty dir for local storage for now instead of using mc, use 'btrix-data' as bucket name
2022-09-02 13:07:47 -07:00
Ilya Kreymer
f0c079dc1b k8s: update default images to dev images in values.yaml 2022-09-01 16:18:15 -07:00
Ilya Kreymer
5b6aa3bc95
Affinity + Tolerations + Cleanup Crawl Job (#256)
* k8s: add tolerations for 'nodeType=crawling:NoSchedule' to allow scheduling crawling on designated nodes for crawler and profiles jobs and statefulsets
* add affinity for 'nodeType=crawling' on crawling and profile browser statefulsets
* refactor crawljob: combine crawl_updater logic into base crawl_job
* increment new 'crawlAttemptCount' counter crawlconfig when crawl is started, not necessarily finished, to avoid deleting configs that had attempted but not finished crawls.
* better external mongodb support: use MONGO_DB_URL to set custom url directly, otherwise build from username, password and mongo host
2022-06-10 19:21:37 -07:00
Ilya Kreymer
bf79959a5a refactoring to use statefulsets + job (#245)
- use statefulsets instead of deployments for mongo, redis, signer
- use k8s job + statefulset for running crawls
- use separate statefulset for crawl (scaled) and single-replica redis stateful set
- move crawl job update login to crawl_updater
- remove shared redis chart

package refactor:
- move to shared code to 'btrixcloud'
- move k8s to 'btrixcloud.k8s'
- move docker to 'btrixcloud.docker'
2022-06-05 10:37:17 -07:00