Fixes#1888
Refactors scale handling:
- Ensures number of scaled instances does not exceed number of pages,
but is also at minimum 1
- Checks for finish condition to be numFailed + numDone >= desired scale
- If at least one instance succeeds, crawl considers successful / done.
- If all instances fail, crawl considered failed
- Ensures that pod done count >= redis done count
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes https://github.com/webrecorder/browsertrix/issues/1883
Backend work for https://github.com/webrecorder/browsertrix/issues/1876
- If readOnly is set true, disallow crawls and QA analysis runs
- If readOnly is set to true, skip scheduled crawls
- Add endpoint to set `readOnly` with optional `readOnlyReason` (which
is automatically set back to an empty string when `readOnly` is being
set to false), which can be displayed in banner
- Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Backend work for first two tasks of
https://github.com/webrecorder/browsertrix/issues/1875
New /billing API endpoint to be added separately once we have a better
idea of what data we can get from the payment processor.
Backend work for #1859
- Remove file count from qa stats endpoint
- Compute isFile or isError per page when page is added
- Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages
- Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount)
- Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl
- Determine if page is a file based on loadState == 2, mime type or status code and lack of title
- add a 'expire_at_duration_seconds' which is 75% of actual presign
duration time, or <25% remaining until presigned URL actually expires to
ensure presigned URLs are updated early than when they actually expire
- set cached expireAt time to the renew at time for more frequent
updates
- update QA configmap in place with updated presigned URLs when expireAt
time is reached
- mount qa config volume under /tmp/qa/ without subPath to get automatic
updates, which crawler will handle
- tests: fix qa test typo (from main)
- fixes#1864
Fixes#1846
- Ensure meter auto-updates as new stats are ready
- Switch meter to new QA run when new analysis run is started
- Remove Files from QA meter (files and errors will be reported separately)
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
Fixes#1833
- Add firstSeed and seedCount to workflow information in profile detail
API endpoint (tests updated accordingly), update name of model used for
limited workflow information to be more accurate
- Fix name display in Crawl Workflows list at bottom of Profile detail
page to be consistent with rest of application
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
This PR adds Identical Files to the QA Page Match Analysis meter bars.
To do this, the backend calculates the number of non-HTML pages once and
includes it under the key `Files` in each of the `screenshotMatch` and
`textMatch` QA stats return arrays.
The backend additionally removes the file count from "No Data" to
prevent these from being counted twice.
---------
Co-authored-by: emma <hi@emma.cafe>
Resolves https://github.com/webrecorder/browsertrix/issues/1409
### Changes
- Enables clicking on Browser Profiles column header to sort the table, including by starting URL
- More consistent column widths throughout app
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- allow configuring QA run scale via 'qa_scale' setting in helm values
(overriding any setting on the qa crawljob)
- adds additional comments to browser instances helm values settings for clarity
- fixes#1842
Currently, the workflow crawl settings were not being included at all in
QA runs.
This mounts the crawl workflow config, as well as QA configmap, into QA
run crawls, allowing for page limits from crawl workflow to be applied
to QA runs.
It also allows a different number of browser instances to be used for QA
runs, as QA runs might work better with less browsers, (eg. 2 instead of
4). This can be set with `qa_browser_instances` in helm chart.
Default qa browser workers to 1 if unset (for now, for best results)
Fixes#1828
This PR introduces backend changes that add the following fields to the
Profile model:
- `modified`
- `modifiedBy`
- `modifiedByName`
- `createdBy`
- `createdByName`
Modified fields are set to the same as the created fields when the
resource is created, and changed when the profile is updated (profile
itself or metadata).
The list profiles endpoint now also supports `sortBy` and
`sortDirection` options. The endpoint defaults to sorting by `modified`
in descending order, but can also sort on `created` and `name`.
Tests have also been updated to reflect all new behavior.
clean up adding user vs changing role logic:
- when adding user, ensure user doesn't exist
- when changing roles, ensure user does exist
add test for changing roles of existing user
Fixes#1821
- ensure max_crawler_memory_size is inited before it is set!
- pass profile_browser_memory / profile_browser_cpu from chart values
- map volume to /tmp/home to avoid persisting /tmp for profiles
Fixes https://github.com/webrecorder/browsertrix/issues/1743
On the backend, support adding user to new org and improved error
messaging:
- if user exists is not part of the org to be added to, add user to the
registration org, return 201
- if user is already part of the org, return 400 with
'user_already_is_org_member' error
- if user is not being added to a new org but already exists, return
'user_already_exists'
frontend:
- if user user same password, they will just be logged in and added to
registration org (same as before)
- if user uses wrong password, show alert message
- handle both user_already_is_org_member and user_already_registered
with alert message and link to log-in page.
note: this means existing user is added to the registration org even if
they provide wrong password, but they won't be able
to login until they use correct password/reset/etc...
Adds a `max_crawler_memory` chart setting, which, if set, will
defines the upper crawler memory limit that crawler pods can be resized up to.
If not set, auto resizing is disabled and pods are always set to 'crawler_memory' memory
if 'registration_enabled' is set, check 'registration_org_id' for org id
of an existing org that new users should be added to when they register.
if omitted, default to the default org
Fixes#1729
To support #1683, it would be useful to be able to sort by 'last QA
start time' in addition to/instead of last QA state.
- make sorting consistent with workflow sorting
- sortBy fields renamed to lastQAState and lastQAStarted
- Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Repository Index: Generate an index.yaml in ./docx/helm-repo/index.yaml
to allow for browsertrix to be a helm repository.
docs: rename docs.browsertrix.cloud -> docs.browsertrix.com
docs: update deployment doc to mention helm repo as preferred way to
install
docs build action: generate repository index in GH action
publish action: update auto-generated message to mention installing from
the repo.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
As additional support for #1683, include the active QA stats in the
crawl response, along with active QA state.
This will allow showing progress of QA run in the archived items list.
Fixes#1659
Takes an arbitrary set of thresholds for text and screenshot matches as
a comma-separated list of floats.
Returns a list of groupings for each that include the lower boundary and
count for all thresholds passed in.
- fixes#1684
- can be used to optionally restrict QA to only some crawls (eg. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
Backend work for #1672
Adds new sort options to /crawls and /all-crawls GET list endpoints:
- `reviewStatus`
- `qaRunCount`: number of completed QA runs for crawl (also added to
CrawlOut)
- `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of
which are added to CrawlOut)
- priority classes <-10 are ignored by cluster-autoscaler so QA jobs
with too low priorities never run
- start crawl priorities at 0 going down (same as before)
- start qa run priorities at -2 going down (instead of -100)
- this means a crawl of with scale of 3 can be preempted by 1st qa pod,
but otherwise crawls have higher priority
- rename priority classes as they are otherwise immutable and error on
helm upgrade
This allows for more room in lower pri classes for other type of
objects, while keeping in mind the -10 and below threshold: (see:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
Fixes#1670
No longer need to pass pages to the ConfigMap. The ConfigMap has a size
limit and will fail if there are too many pages.
With this change, the page list for QA will be read directly from the
WACZ files pages.jsonl / extraPages.jsonl entries.
Fixes#1648
- Tracks failed QA runs in database, not only successful ones
- Includes failed QA runs in list endpoint by default
- Adds `skipFailed` param to list endpoint to return only successful
runs
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
QA Details page:
- Enables QA tab with ability to start automated analysis QA Run + view a and manual review status
- Pages listed with review status + overall crawl review status shown on QA details (relates to #1508)
- Initial placeholder for QA run analytics (part of #1589)
- Addresses a good deal of #1477
Automated Analysis QA in Review Mode:
- Ability to select from multiple analysis QA runs / view QA runs in QA details
- Shows analysis screenshot, text and resources compare and replay tabs (fixes#1496)
- Sorting by worst screenshot / worst text score for each QA run
- Includes pages sidebar with screenshot/text/resource compare results (fixes#1497)
Manual Review QA in Review Mode:
- Per-page replay available as separate tab (fixes#1499)
- Supports thumbs up, thumbs down, notes for each page
- Supports entering review status approval (good/acceptable/bad can be entered when finishing review
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- increase time for going to waiting_capacity from starting to 150
seconds
- relax requirement for state transitions, allow complete from waiting
- additional type safety for different states, ensure mark_finished()
only called with non-running states, add `Literal` types for all the
state types.
- set memory limit to 1.2x memory request to provide extra padding and
avoid OOM
- attempt to resize crawler pods by 1.2x when exceeding 90% of available
memory
- do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of
requested memory, resulting in faster graceful restart, but avoiding a
system-instant OOM Kill
- Fixes#1632
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1617
Filters added:
- reviewed: filter by page has approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts list of strings,
comma-separated, each of which are coerced into True, False, or None, or
ignored if they are invalid values)
- hasNotes: filter by has at least one note (true) or not (false)
Tests have also been added to ensure that results are as expected.
Supports running QA Runs via the QA API!
Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498
Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)
Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.
- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch)
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1597
New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.
After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.
Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.
StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
- instead of overriding the content-type header globally, pass
'application/merge-patch+json' to
self.custom_api.patch_namespaced_custom_object() directly
- bump kubernetes-asyncio to 29.0.0
- fixes potential issues with global override of the header in
kubernetes-asyncio
- copy of #1602 for main
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers
- Frontend: check that profile html page is loading, keep retrying if
still getting nginx error instead of loading an iframe with the error.
Fixes#1598 (Copy of #1599 from 1.9.4)
I came across [this
problem](https://forum.webrecorder.net/t/deleting-crawl-failure/512) and
noticed that the access URL is used when deleting files, causing my file
deletions to fail on OpenStack SWIFT S3 (relates to #1090). This trivial
change makes it work there.
Allow maximum scale option to be fully configurable via
`max_crawl_scale`. Already configurable on the backend, and now exposed
to the frontend via API `/api/settings` `maxCrawlScale` value.
The workflow editor and workflow details are updated to allow selecting
the scale up to the maxCrawlScale setting (which defaults to 3 if not
set).
Fixes#1539
Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl
update API endpoint. Acceptable values are "good", "acceptable" or
"failure", enforced by an Enum.
Added to `BaseCrawl` so that we can extend support to uploads more
easily later on, but for now we'll only display this for crawls in the
frontend.
The operator class has gotten fairly large, this is a first pass in
refactoring operator.py into a submodule instead, with multiple operator
instances which handle different types of objects.
- The main k8s interface has been split into K8sOpApi which extends K8sApi
and is shared across all operators.
- Each operator extends BaseOperator which also has an instance of K8sOpApi
- The CrawlOperator is still the bulk of the functionality, but will likely be further refactored
to support QA jobs
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1558
- Adds crawl errors to database incrementally during crawl rather than
after crawl completes
- Simplifies crawl /errors API endpoint to always return errors from
database
- increases the failureThreshold for startupProbe for the api backend
container to account for long running migrations, upto 300 seconds
- add `/healthzStartup` which checks if db is ready
- bump
- keeps `/healthz` to always return 200 when running
- increases livenessProbe failureThreshold to be higher than readiness
probe, following recommended best practice of liveness probe > readiness
probe
- fixes#1559
Fixes#1502
- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted
Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.
Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug
[crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler
1.0.0-beta.4 or higher
If crawl name is provided, uses crawl name, other hostname of first
seed. The name is 'sluggified', using lowercase alphanum characters
separated by dashes.
Ex: in an organization called `Default Org`, a crawl of
`https://specs.webrecorder.net/` and no name will have WARCs named:
`default-org-specs-webrecorder-net-....warc.gz`
If the crawl is given the name `SPECS`, the WARCs will be named
`default-org-specs-manual-....warc.gz`
Fixes#412 in a default way.
This PR addresses a possible failure when Redis pod was inaccessible
from Crawler pod.
- Ensure crawl is set to 'waiting_for_capacity' if either no crawler
pods are available or no redis pod. previously, missing/inaccessible
redis would not result in 'waiting_for_capacity' if crawler pods are
available
- Rework logic: if no crawler and redis after >60 seconds, shutdown
redis. if crawler and no redis, init (or reinit) redis
- track 'lastUpdatedTime' in db when incrementing exec time to avoid
double counting if lastUpdatedTime has not changed, eg. if operator sync
fails.
- add redis timeout of 20 seconds to avoid timing out operator responses
if redis conn takes too long, assume unavailable
configmap: add --screenshot thumbnail,view as default screenshots
version: update update-version.sh to add newline in version.py to match
new black formatting (from changes in #1507)
Fixes#1519
Fixes#1341
Adds "User Agent" field to workflow editor under the Browser Settings
tab. If not set, the crawler will use the browser's default user agent.
Also added to docs and to the workflow details page (if set).
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Fixes#1385
## Changes
Supports multiple crawler 'channels' which can be configured to
different browsertrix-crawler versions
- Replaces `crawler_image` in helm chart with `crawler_channels` array
similar to how storages are handled
- The `default` crawler channel must always be provided and specifies
the default crawler image
- Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint
to fetch information about available crawler versions (name, image, and
label) and test
- Adds crawler channel select to workflow creation/edit screens and
profile creation dialog, and updates related API endpoints and
configmaps accordingly. The select dropdown is shown only if more than
one channel is configured.
- Adds `crawlerChannel` to workflow and crawl details.
- Add `image` to crawler image, used to display actual image used as
part of the crawl.
- Modifies `crawler_crawl_id` backend test fixture to use `test` crawler
version to ensure crawler versions other than latest work
- Adds migration to add `crawlerChannel` set to `default` to existing
workflow and profile objects and workflow configmaps
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Fixes#1158
Introduces two new API endpoints that stream crawling statistics CSVs
(with a suggested attachment filename header):
- `GET /api/orgs/all/crawls/stats` - crawls from all orgs (superuser
only)
- `GET /api/orgs/{oid}/crawls/stats` - crawls from just one org
(available to org crawler/admin users as well as superusers)
Also includes tests for both endpoints.
If configmap is missing (eg. was accidentally deleted from k8s) recreate
the configmap when updating the crawl workflow or running a crawl.
Previously, this would result in an error, but now the configmap should
be correctly recreated.
Fixes#1358
- Adds `extraExecMinutes` and `giftedExecMinutes` org quotas, which are
not reset monthly but are updateable amounts that carry across months
- Adds `quotaUpdate` field to `Organization` to track when quotas were
updated with timestamp
- Adds `extraExecMinutesAvailable` and `giftedExecMinutesAvailable`
fields to `Organization` to help with tracking available time left
(includes tested migration to initialize these to 0)
- Modifies org backend to track time across multiple categories, using
monthlyExecSeconds, then giftedExecSeconds, then extraExecSeconds.
All time is also written into crawlExecSeconds, which is now the monthly
total and also contains any overage time above the quotas
- Updates Dashboard crawling meter to include all types of execution
time if `extraExecMinutes` and/or `giftedExecMinutes` are set above 0
- Updates Dashboard Usage History table to include all types of
execution time (only displaying columns that have data)
- Adds backend nightly test to check handling of quotas and execution
time
- Includes migration to add new fields and copy crawlExecSeconds to
monthlyExecSeconds for previous months
Co-authored-by: emma <hi@emma.cafe>
Fixes#1395
- Adds new `POST /orgs/<orgid>/jobs/retryFailed` API endpoint to retry all failed
background jobs for a specific org.
- Also adds `POST /orgs/all/jobs/retryFailed` for superadmin to retry all failed background jobs for all orgs
Closes#1294
### Changes
- `crawl-list` component
- Adds a check if there are any items in the actions menu. If not, skip
rendering the actions menu.
- This allows us to give the component no actions! Currently required to
remove them for viewers!
- Collection Details
- Hides "Remove from Collection" option for viewers
- Crawls List
- Removes the single "View Crawl Details" option from archived items for
viewers
- All the other actions were already set up correctly to be used by all
roles!
- Dashboard
- Hides org settings gear icon button unless the user is an admin
- Hides "Create New" dropdown for viewers
- Workflow Details
- Hides workflow edit icon button for viewers
- Hides the "Delete Crawl" option in archived items for viewers
- Hides the "Run Crawl" option for viewers
- Workflow List
- Hides all edit-related options for viewers, the only option now is
copying tags
- Removes the deactivate / delete options (were only visible when
running a crawl) in the workflow list actions
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
- Emails are now processed from Jinja2 templates found in
`charts/email-templates`, to support easier updates via helm chart in
the future.
- The available templates are: `invite`, `password_reset`, `validate` and
`failed_bg_job`.
- Each template can be text only or also include HTML. The format of the
template is:
```
subject
~~~
<html content>
~~~
text
```
- A new `support_email` field is also added to the email block in
values.yaml
Invite Template:
- Currently, only the invite template includes an HTML version, other
templates are text only.
- The same template is used for new and existing users, with slightly
different text if adding user to an existing org.
- If user is invited by the superadmin, the invited by field is not
included, otherwise it also includes 'You have been invited by X to join Y'
- Adds two new crawl finished state, stopped_by_user and
stopped_quota_reached
- Tracking other possible 'stop reasons' in operator, though not making
them distinct states for now.
- Updated frontend with 'Stopped by User' and 'Stopped: Time Quota
Reached', shown with same icon as current partial_complete
- Added migration of partial_complete to either stopped_by_user or
complete (no historical quota data available)
- Addresses edge case in scaling: if crawl never scaled (no redis entry,
no pod), automatically scale down
- Edge case in status: if crawl is somehow 'canceled' but not deleted,
immediately delete crawl object and begin finalizing.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1307Fixes#1132
Related to #1306
Deleted webhook notifications include the org id and item/collection id.
This PR also includes API docs for the new webhooks and extends the
existing tests to account for the new webhooks.
This PR also does some additional cleanup for existing webhooks:
- Remove `downloadUrls` from item finished webhook bodies
- Rename collection webhook body `downloadUrls` to `downloadUrl`, since
we only ever have one per collection
- Fix API docs for existing webhooks, one of which had the wrong
response body
Fixes#1328
- Adds /retry endpoint for retrying failed jobs.
- Returns 400 error if previous job still running or has succeeded
- Keeps track of previous failed attempts in previousAttempts array on failed job.
- Also amends the similar webhook /retry endpoint to use `POST` for consistency.
- Remove duplicate api tag for backgroundjobs
Fixes#1364
Regression fix for issue introduced in storage refactoring (see issue
for more details).
Changes:
1. Add `profiles/` prefix to profile filename passed in to crawler for
profile creation and written into db
2. Remove hardcoded `profiles/` prefix from crawler YAML
3. Add migration to add `profiles/` prefix to profile filenames that
don't already have it, including updating PROFILE_FILENAME in ConfigMaps
This way between the related storage document and the profile filename,
we have the full path to the object in the database rather than relying
on additional prefixes hardcoded into k8s job YAML files.
Note that this as a follow-up it'll be necessary to manually move any
profiles that had been written into the `<oid>` "directory" in object
storage rather than `<oid>/profiles` to the latter. This should only
affect profiles created very recently in a 1.8.0-beta release.
- move authsign secret to signer and make port configurable
- rename storages to more general ops-configs
- put 'storages.json' path into env var
- rename backend secret to backend-auth
- cronjobs: don't keep succeeded jobs around, triggers operator update
Fixes#1337
Crawl timeout is tracked via `elapsedCrawlTime` field on the crawl
status, which is similar to regular crawl execution time, but only
counts one pod if scale > 1. If scale == 1, this time is equivalent.
Crawl is gracefully stopped when the elapsed execution time exceeds the
timeout. For more responsiveness, also adding current crawl time since
last update interval.
Details:
- handle crawl timeout via elapsed crawl time - longest running time of a
single pod, instead of expire time.
- include current running from last update for best precision
- more accurately count elapsed time crawl is actually running
- store elapsedCrawlTime in addition to crawlExecTime, storing the
longest duration of each pod since last test interval
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Instead of adding the app templates launched from the backend via
`backend/btrixcloud/templates`, add them to a configmap and mount the
configmap in the same location.
This allows these templates to be updated, like other values in
charts/... without having to rebuild any of the images, speeding up dev
and maintenance time.
Changes include:
- move backend/btrixcloud/templates -> chart/app-templates/
- add app-templates/*.yaml to app-templates configmap
- mount app-templates configmap to /app/btrixcloud/templates/ in api and op containers
- instead of restarting crawler when exclusion added/removed, add a
message to a redis list (per crawler instance)
- no longer filtering existing queue on backend, now handled via crawler (implemented in 0.12.0 via webrecorder/browsertrix-crawler#408)
- match response optimization: instead of returning first 1000 matches,
limits response to 500K and returns however many matches fit in that
response size (for optional pagination on frontend)
Fixes#1252
Supports a generic background job system, with two background jobs,
CreateReplicaJob and DeleteReplicaJob.
- CreateReplicaJob runs on new crawls, uploads, profiles and updates the
`replicas` array with the info about the replica after the job succeeds.
- DeleteReplicaJob deletes the replica.
- Both jobs are created from the new `replica_job.yaml` template. The
CreateReplicaJob sets secrets for primary storage + replica storage,
while DeleteReplicaJob only needs the replica storage.
- The job is processed in the operator when the job is finalized
(deleted), which should happen immediately when the job is done, either
because it succeeds or because the backoffLimit is reached (currently
set to 3).
- /jobs/ api lists all jobs using a paginated response, including filtering and sorting
- /jobs/<job id> returns details for a particular job
- tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints
- tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
This PR adds more type safety to the backend codebase:
- All ops classes calls should be type checked
- Avoiding circular references with TYPE_CHECKING conditional
- Consistent UUID usage: uuid.UUID / UUID4 with just UUID
- Crawl states moved to models, made into lists
- Additional typing added as needed, fixed a few type related errors
- CrawlOps / UploadOps / BaseCrawlOps now all have same param init order
to simplify changes
- check the 'btrix.org' instead of 'oid' labels in getting related
crawls
- fixes regression introduced in #1296 where labels where all org id
labels were switched to 'btrix.org' for consistency
- Refactors storage to support replicas + custom storages on the Org.
- There is a default primary + replica storage, while an Org can also have
primary and replica storages.
- StorageRef object is used to store references to default and custom
storage.
- CrawlFile has been updated to contain a StorageRef instead of a
def_storage_name, which references
either a default storage (in StorageOps) or custom storage (in
Organization)
- There is also a 'replicas' Optional[List[StorageRef]] which contains
replicas, if any.
- CrawlFileOut contain a numReplicas for how many replicas exist for
a given file.
- Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile)
Part of #1262
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1261Closes#1092
The quota for monthly execution minutes is treated as a hard cap. Once
it is exceeded, an alert indicating that an org has exceeded its monthly
execution minutes will display and the user will be unable to start new
crawls. Any running crawls will be stopped once the quota is exceeded.
An execution minutes meter bar is also added in the Org Dashboard and
displayed if a quota is set. More detail in #1305 which was
merged into this branch.
## Changes
- Enable setting 'maxExecMinutesPerMonth' in orgs list quotas by superadmin
- Enforce quota by stopping crawls in operator once quota is reached
- Show alert banner once execution time quota is hit:
- Once quota is hit, disable Run Crawl buttons in frontend, return 403
message with `exec_minutes_quota_reached` detail in backend from
crawl config `/run` endpoint, and don't run new workflows on creation
(similar to storage quota)
- Display execution time for crawls in the crawl details overview,
immediately below
- Show execution minutes meter on dashboard (from #1305)
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Fixes#1297
Ensures proper typing for UUIDs in FastAPI input models, to avoid
explicit conversions, which may throw errors.
This avoids possible 500 errors (due to ValueError exceptions) when
converting UUIDs from user input.
Instead, will get more 422 errors from FastAPI.
UUID conversions remaining are in operator / profile handling where
UUIDs are retrieved from previously set fields, remaining user input
conversions in user auth and collection list are wrapped in exceptions.
For `profileid`, update fastapi models to support union of UUID, null,
and EmptyStr (new empty string only type), to differentiate removing
profile (empty string) vs not changing at all (null) for config updates
Fixes#1306
- Include full `resources` with expireAt (as string) in crawlFinished
and uploadFinished webhook notifications rather than using the
`downloadUrls` field (this is retained for collections).
- Set default presigned duration to one minute short of 1 week and enforce
maximum supported by S3
- Add 'storage_presign_duration_minutes' commented out to helm values.yaml
- Update tests
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#1270
After 5 consecutive failed logins from the same user, we now prevent the
user from logging in even with the correct password until they reset it
via their email, or wait an hour.
- After failure threshold is reached, all further login attempts are rejected
- Attempts for invalid email addresses are also tracked
- On 6th try, a reset password email is automatically sent, only once
- Failed login counter resets after an hour of no further logins after last attempted login.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- avoid exception if 'errors' (or 'files' keys) don't exist (part of
#1297)
- ensure 'errors' list always set on output model for consistency,
defaulting to empty list
- fix tests for 'errors' being an empty empty list
follow-up to #1300 (merging 1.7.1 release into main)
Fixes#1050
Major refactor of the user/auth system to remove fastapi_users
dependency. Refactors users.py to be standalone
and adds new auth.py module for handling auth. UserManager now works
similar to other ops classes.
The auth should be fully backwards compatible with fastapi_users auth,
including accepting previous JWT tokens w/o having to re-login. The User
data model in mongodb is also unchanged.
Additional fixes:
- allows updating fastapi to latest
- add webhook docs to openapi (follow up to #1041)
API changes:
- Removing the`GET, PATCH, DELETE /users/<id>` endpoints, which were not
in used before, as users are scoped to orgs. For deletion, probably
auto-delete when user is removed from last org (to be implemented).
- Rename `/users/me-with-orgs` is renamed to just `/users/me/`
- New `PUT /users/me/change-password` endpoint with password required to update password, fixes #1269, supersedes #1272
Frontend changes:
- Fixes from #1272 to support new change password endpoint.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
- Replaces org UUID in URL/browser location bar with org slug.
- Refactor: Adds shared app state utility using https://sijakret.github.io/lit-shared-state/ to
access org data from deep descendants.
- Backwards compatible: org UUID URLs should auto-redirect to org slug URLs.
- Show the org UUID in org settings general tab for use with APIs
(Resolves#1258, Follows #1279)
Optimizes webhooks by passing oid directly to webhooks:
- avoids extra crawl lookup
- possible for crawl to be deleted before webhook is processed via
operator (resulting in crawl lookup to fail)
- add more typing to operator and webhooks
Fixes#1271
Using .log for now due to broader support for opening with default viewers
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Fixes#1278
- Adds `GET /orgs/slug-lookup` endpoint returning `{id: slug}` for all
orgs
- Restricts new endpoint and existing `GET /orgs/slugs` to superadmins
* storage ops: follow up to #1257:
- fix refactor typo
- add type hints for all storageops apis (add mypy_boto3_s3 and types_aiobotocore_s3 for type hints)
- Add slug field with uniqueness constraint to Organization
- Use python-slugify to generate slug from name and import that in migration
- Require name in all /rename and org creation requests
- Auto-generate slug for new org with no slug or when /rename is called w/o a slug
- Auto-generate slug for 'default-org' based on name
- Add /api/orgs/slugs GET endpoint to return all slugs in use
- tests: extend backend test-requirements.txt from requirements to allow testing slugify
- tests: move get_redis_crawl_stats() to avoid extra dependency in utils
* storage ops refactor:
- create StorageOps class similar to other ops classes
- init storages list in StorageOps, no longer require lookup up default storages via CrawlManager
- convert all storage functions to members, add storageops to operator
- remove unused params, ensure crawl exists for rollover restart
- add env var to determine if using local minio to use correct endpoint URL
* crawls /seeds endpoint: just return empty list if not a crawl (eg. upload)
* crawlmanager: remove unused code, rename check_storage -> has_storage
* store execution time in operator:
- rename isNewCrash -> isNewExit, crashTime -> exitTime
- keep track of exitCode
- add execTime counter, increment when state has a 'finishedAt' and 'startedAt' state
- ensure pods are complete before deleting
- store 'crawlExecSeconds' on crawl and org levels, add to Crawl, CrawlOut, Organization models
* support for fast cancel:
- set redis ':canceled' key to immediately cancel crawl
- delete crawl pods to ensure pod exits immediately
- in finalizer, don't wait for pods to complete when canceling (but still check if terminated)
- add currentTime in pod.status.running.startedAt times for all existing pods
- logging: log exec time, missing finishedAt
- logging: don't log exit code 11 (interrupt due to time/size limits) as a crash
* don't wait for pods completed on failed with existing browsertrix-crawler image
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* keep track of per pod status on crawljob:
- crashes time, and reason
- 'used' vs 'allocated' resources
- 'percent' used / allocated
* crawl log errors: log error when crawler crashes via OOM, either via redis error log
or to console
* add initial autoscaling support!
- detect if metrics server is available via K8SApi.is_pod_metrics_available()
- if available, use metrics for 'used' fields
- if no metrics, set memory used for redis only (using redis apis)
- allow overriding memory and cpu via newMemory and newCpu settings on pod status
- scale memory / cpu based on newMemory and newCpu setting
- templates: update jinja templates to allow restarting crawler and redis with new resources
- ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs
* roles: cleanup unused roles, add permissions for listing metrics
* stats for running crawls:
- update in db via operator
- avoids losing stats if redis pod happens to be done
- tradeoff is more db access in operator, but less extra connections to redis + already
loading from db in backend
- size stat: ensure size of previous files is added to the stats
* crawler deployment tweaks:
- adjust cpu/mem per browser
- add --headless flag to configmap to use new headless mode by default!
- Require that all passwords are between 8 and 64 characters
- Fixes account settings password reset form to only trigger
logged-in event after successful password change.
- Password validation can be extended within the UserManager's
validate_password method to add or modify requirements.
- Add tests for password validation
- If set, and any of the seeds fails, the entire crawl is marked as a failure.
- Add checkbox which adds --failOnFailedSeed checkbox to URL list workflows
- Add 'Fail Crawl On Failed URL' to crawl workflow setup docs
- Applies user permissions check before deleting anything in all /delete endpoints
- Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete
- Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete
- Updates frontend notification message to Only org owners can delete other users' archived items. when a crawler user attempts to delete another users' archived items
- Adds "Logs" tab to workflow detail
- Shows error logs in expandable section in "Watch" tab
- Show corresponding message (no logs yet or logs temporarily unavailable) when `/errors` returns 503 based on crawl state
- text tweaks: use error logs instead of logs, change 'crawl start' -> 'crawl continue' in log message
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Remove config.seeds from workflow and crawl detail endpoints
- Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow
- Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before)
- Modify frontend to fetch seeds from new /seeds endpoints with loading indicator
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>