Fixes #2186
Background job emails will no longer fail to send for jobs unrelated to
file replication or replica deletion.
Also uses `AnyJob` for the paginated background job response model, to fix
typing that was out of date following the addition of other types of background
jobs, and to lower the overhead of adding new ones going forward.
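For reference, a minimal sketch (not the actual Browsertrix models) of how a union job type keeps the paginated response in sync as new job types are added; the job classes below are hypothetical stand-ins:

```python
from typing import List, Union
from pydantic import BaseModel

class CreateReplicaJob(BaseModel):        # hypothetical stand-in
    type: str = "create-replica"
    id: str

class DeleteReplicaJob(BaseModel):        # hypothetical stand-in
    type: str = "delete-replica"
    id: str

class RecalculateOrgStatsJob(BaseModel):  # hypothetical stand-in
    type: str = "recalculate-org-stats"
    id: str

# union of all job types: the paginated endpoint can return any of them,
# and adding a new job type only means extending this union
AnyJob = Union[CreateReplicaJob, DeleteReplicaJob, RecalculateOrgStatsJob]

class PaginatedBackgroundJobResponse(BaseModel):
    items: List[AnyJob]
    total: int
```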
Fixes #2112
- Moves org storage recalculation to a background job, and modifies the endpoint to
return the job id as part of the response
- Updates crawl + QA backend tests that broke due to
https://webrecorder.net website changes
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Non-admin users should not be given the option to rename the org when invited to
a new org:
- set firstOrgAdmin to true only when invite is for an admin
- default to false instead of null
- update tests to check
Resolves #1354
Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires Browsertrix Crawler 1.3+)
Config:
- proxies defined in btrix-proxies subchart
- can be configured via the btrix-proxies key or a separate proxies.yaml file for the subchart
- proxies list is refreshed automatically when crawler_proxies.json changes, if the subchart is deployed
- support for SSH and SOCKS5 proxies
- proxy keys added to secrets in subchart
- support for a default proxy that is always used if no other proxy is configured; prevent starting the cluster if the default proxy is not available
- prevent starting a manual crawl if a previously configured proxy is no longer available, returning an error
- force 'btrix' username and group name on the browsertrix-crawler non-root user to support SSH
Operator:
- support crawling through proxies, pass proxyId in CrawlJob
- support running profile browsers with a designated proxy, pass proxyId to ProfileJob
- prevent starting a scheduled crawl if a previously configured proxy is no longer available
API / Access:
- /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only)
- /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org
- /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only)
- superadmin can configure which orgs can use which proxies, stored on the org
- superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org (example API calls are sketched below)
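A hedged sketch of calling these endpoints with `requests`; the bearer token and the update payload shape are assumptions, not the documented schema:

```python
import requests

API = "https://app.example.com/api"
HEADERS = {"Authorization": "Bearer <superadmin-jwt>"}
oid = "<org-id>"

# all proxies (superadmin only)
all_proxies = requests.get(
    f"{API}/orgs/all/crawlconfigs/crawler-proxies", headers=HEADERS
).json()

# proxies available to a particular org
org_proxies = requests.get(
    f"{API}/orgs/{oid}/crawlconfigs/crawler-proxies", headers=HEADERS
).json()

# update allowed proxies for an org (superadmin only) -- the payload fields
# here are hypothetical, not the actual API schema
requests.post(
    f"{API}/orgs/{oid}/proxies",
    headers=HEADERS,
    json={"allowedProxies": ["proxy-us-east"], "allowSharedProxies": True},
)
```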
UI:
- Superadmin has an 'Edit Proxies' dialog to configure, for each org, its dedicated proxies and whether it has access to shared proxies
- User can select a proxy in Crawl Workflow browser settings
- Users can choose to launch a browser profile with a particular proxy
- Display which proxy is used to create profile in profile selector
- Users can choose which default proxy to use for new workflows in Crawling Defaults
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
don't set inviterEmail / inviterName if the inviter is the superuser:
- return fromSuperuser true/false
- if fromSuperuser, don't set inviterEmail / inviterName
- tests: add tests for non-superuser admin invites
- Add a custom EmailStr type which lowercases the full e-mail, not just
the domain (a minimal sketch follows this list)
- Ensure EmailStr is used throughout wherever e-mails are used, both for
invites and user models
- Tests: update to check for lowercase email responses, e-mails returned
from APIs are always lowercase
- Tests: remove tests where '@' was url-encoded; this should not be possible
since we are POSTing JSON and no url-decoding is done/expected. E-mails should
have '@' present.
- Fixes #2083 where invites were rejected due to case differences
- CI: pin pymongo dependency due to changes in its latest releases, update the Python version used for CI
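A minimal sketch of an email type that lowercases the entire address (pydantic v2 style); the project's actual EmailStr implementation may differ:

```python
from typing import Annotated
from pydantic import AfterValidator, BaseModel
from pydantic import EmailStr as PydanticEmailStr

# lowercase the full address, not just the domain
EmailStr = Annotated[PydanticEmailStr, AfterValidator(lambda v: v.lower())]

class InviteRequest(BaseModel):
    email: EmailStr

print(InviteRequest(email="User.Name@Example.COM").email)
# -> user.name@example.com
```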
- add POST /orgs/<id>/defaults/crawling API to update all defaults
(defaults unset are cleared)
- defaults returned as 'crawlingDefaults' object on Org, if set
- Fixes #2016
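An illustrative call to the new endpoint; the payload field names are assumptions (typical workflow settings), not the documented schema:

```python
import requests

API = "https://app.example.com/api"
HEADERS = {"Authorization": "Bearer <jwt>"}
oid = "<org-id>"

# any default omitted from the payload is cleared, per the behavior above
requests.post(
    f"{API}/orgs/{oid}/defaults/crawling",
    headers=HEADERS,
    json={"crawlTimeout": 7200, "lang": "en"},  # hypothetical field names
)

# if set, defaults come back on the org as a 'crawlingDefaults' object
org = requests.get(f"{API}/orgs/{oid}", headers=HEADERS).json()
print(org.get("crawlingDefaults"))
```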
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
* Fixes issue in FailedLogin model:
- fix data-model to remove nested 'attempted.attempted'
- migrate existing data to remove nested field
* Also, avoid calling dt_now() as a default in the model, as that results in a fixed date for
all objects (see the sketch after this list):
- update FailedLogin to update the 'attempted' date on every attempt
- also update PageNote object to set the date in its constructor
* Update text for too many logins to make it clear it is set only if it's a
valid email
* Fixes #2001
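A small pydantic sketch of the pitfall and fix described above (field names illustrative):

```python
from datetime import datetime, timezone
from pydantic import BaseModel, Field

def dt_now() -> datetime:
    return datetime.now(timezone.utc)

class FailedLoginBroken(BaseModel):
    # BUG: dt_now() runs once when the class is defined, so every
    # object shares the same fixed 'attempted' date
    attempted: datetime = dt_now()

class FailedLogin(BaseModel):
    # fixed: default_factory is evaluated per object; the date can then
    # also be refreshed explicitly on every new login attempt
    attempted: datetime = Field(default_factory=dt_now)
```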
Tweaks to how execution time is tracked for more accuracy + excluding
waiting states:
- don't update if crawl state is in a 'waiting state' (waiting for
capacity or waiting for org limit)
- rename start states -> waiting states for clarity
- reset lastUpdatedTime if two consecutive updates of non-running state,
to ensure non-running states don't count, but also account for
occasional hiccups -- if only one update detects non-running state,
don't reset
- webhooks: move start webhook to when crawl actually starts for the first
time (db lastUpdatedTime is not yet set + crawl is running)
- don't set lastUpdatedTime until pods actually running
- set crawljob update interval to every 10 seconds for more accurate
execution time tracking
- frontend: show seconds in 'Execution Time' display
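As a toy sketch (not the operator's actual code), the "two consecutive non-running updates" rule might look like this:

```python
from datetime import datetime
from typing import Optional

class ExecTimeTracker:
    """Only count execution time while pods are actually running."""

    def __init__(self) -> None:
        self.last_updated_time: Optional[datetime] = None
        self.non_running_count = 0
        self.exec_seconds = 0

    def update(self, running: bool, now: datetime) -> None:
        if running:
            self.non_running_count = 0
            if self.last_updated_time:
                self.exec_seconds += int((now - self.last_updated_time).total_seconds())
            self.last_updated_time = now
            return
        # a single non-running update may just be a hiccup; only reset
        # after two consecutive non-running updates
        self.non_running_count += 1
        if self.non_running_count >= 2:
            self.last_updated_time = None
```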
- Follow-up to #1914, allows SubscriptionUpdate event to also update
quotas.
- Passes current usage info + current billing page URL to the portalUrl
request so the external app can respond with the best portalUrl
- get_origin() moved to utils to be available more generally.
- Updates billing tab to show current plans, switches the order of quotas to
list execution time first, then storage
- no longer being used with latest stream-zip
- was not computed correctly in the crawler
- counterpart to webrecorder/browsertrix-crawler#657
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Handle the case where an org is made read-only while crawls are running:
- treat similar to other stopped_* states, do a graceful stop
- update UI to display "Stopped: Crawling Disabled" for this status
- don't add corresponding skipped status - just skip running crawls if org is read-only
Fixes #1957
Adds three new webhook events related to QA: analysis started, analysis
ended, and crawl reviewed.
Tests have been updated accordingly.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1968
Changes:
- `stopped_quota_reached` and `skipped_quota_reached` migrated to new
values that indicate which quota was reached
- Before crawls are run, the operator checks if storage or exec mins
quotas are reached and if so fails the crawl with the appropriate state
of `skipped_storage_quota_reached` or `skipped_time_quota_reached`
- While crawls are running, the operator checks if the exec mins quota
is reached or if the size of all running crawls will mean the storage
quota is reached once uploaded; if so, the crawl is stopped gracefully
and given `stopped_storage_quota_reached` or `stopped_time_quota_reached`
state as appropriate
- Adds new nightly tests for enforcing storage quota
* updates pydantic to 2.x
* also update to python 3.12
* additional type fixes:
- all Optional[] types must have a default value
- update to constrained types
- URL types converted from str
- test updates
Fixes #1940
Follow-up to regressions from #1928, this PR:
- Fixes response models for queue endpoints, which had incorrect model
- Adds tests for queue get, queue match, and exclusions add / remove to
ensure regressions like this can be caught via tests. This involves
starting a new crawl in test_run_crawls() instead of relying on implicit
running via fixtures, making it easier to test a crawl while it's running.
- Adds additional typing for crawls APIs, including making
delete_crawls() have correct typing, with consistent derived class overrides
- Adds check to ensure queue + exclusion operations can not be called
when crawl is not running
- ensure crawlFilenameTemplate is part of the CrawlConfig model
- change CrawlConfig init to use type-safe construction
- add a run_now_internal() that is shared for starting crawl, either on
demand or from new config
- add OrgOps.can_run_crawls() to check against org quotas for crawling
- cleanup profile updates, remove _lookup_profile, only check for
EmptyStr in update
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #1926
- adds /subscriptions/import endpoint for importing an existing subscription to an existing org
- add SubscriptionImport object and log as 'import' event in subscription events collection
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1920
Adds response models to all API endpoints that were missing them,
documenting current behavior without making any changes at this stage to
standardize responses.
Follow-up work will involve adding generics to some of the response models
Fixes #1916
- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
Fixes https://github.com/webrecorder/browsertrix/issues/1905
- adds a new top-level `/api/subscriptions` endpoint and SubOps handler on
the backend.
- subscriptions API endpoints are enabled only if `billing_enabled` is
set in the helm chart
- new POST /subscriptions/create, /subscriptions/update,
/subscriptions/cancel API endpoints
- Subscriptions mongo collection storing timestamped /subscriptions
API events
- GET /subscriptions/events API to get subscription events, support for filtering and sorting
- Subscription data model
- Support for setting and handling readOnlyOnCancel on org
- /orgs/<id>/billing-portal to look up portalUrl using the external API
- subscription info included in org getter and list views
- mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active`
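A sketch of the read-only handling in the last bullet (field updates shown directly on an org dict; the real backend goes through its org ops layer):

```python
def apply_subscription_status(org: dict, status: str) -> dict:
    if status == "paused_payment_failed":
        org["readOnly"] = True
    elif status == "active":
        # clear read-only once the subscription is active again
        org["readOnly"] = False
    return org
```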
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #1432
Refactors the invite + registration system to be simpler and more consistent
with regard to existing user invites. Previously, per-user invites were
stored in the user.invites dict instead of in the invites collection,
which created a few issues:
- Existing users do not show up in the Org Invites list: #1432
- Existing user invites also do not expire, unlike new user invites,
creating a potential security issue.
Instead, existing user invites should be treated like new user invites.
This PR moves them into the same collection,
adding a `userid` field to InvitePending to match with an existing user.
If a user already exists, it will be matched by userid, instead of by
email. This allows the user to update their email while still being
invited. Note that the email of the invited existing user will not
change in the invite email. This is also by design: an admin of one org
should not be given any hint that an invited user already has an
account, such as by having their email automatically update. For an org
admin, the invite to a new or existing user should be indistinguishable.
The sha256 of the invite token is stored instead of the actual token for better
security.
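A minimal sketch of storing and matching invites by token hash rather than the raw token:

```python
import hashlib
from uuid import uuid4

def hash_token(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()

# on invite creation: email the raw token to the user, store only its hash
raw_token = str(uuid4())
pending_invites = {hash_token(raw_token): {"userid": "existing-user-id", "oid": "org-id"}}

# on accept: hash the presented token and look it up -- the raw token is never stored
presented = raw_token  # what the user pastes from the invite link
invite = pending_invites.get(hash_token(presented))
print(invite)
```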
The registration system has also been refactored with the following
changes:
- Auto-creation of new orgs for new users has been removed
- User.create_user() replaces the old User._create() and just creates the user, without the
additional complex logic around org auto-add
- Users are added to org in org add_user_to_org()
- Users are added to org through invites with add_user_with_invite()
Tests:
- Additional tests include verifying that existing and new pending
invites appear in the pending invites list
- Tests for `/users/invite/<token>?email=` and
`/users/me/invite/<token>` endpoints
- Deleting pending invites
- Additional tests added for user self-registration, including existing
user self-registration to default org of existing user (in nightly
tests)
Fixes #890
This PR introduces new streaming superuser-only API endpoints to export
and import database information for an organization. New Administrator
deployment documentation on how to manage the process and copy files
between S3 buckets as needed is also included.
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Updates the /api/orgs/create endpoint to:
- not require name / slug; the org will be renamed by the first user via
#1870
- support optional quotas
- support optional first admin user email, who will receive an invite to
join the org.
Also supports a new shared secret mechanism, to allow an external
automation to access the /api/orgs/create endpoint (and only that
endpoint thus far) via a shared secret instead of normal login.
Fixes https://github.com/webrecorder/browsertrix/issues/1883
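A hedged FastAPI-style sketch of the shared-secret check; the header name and environment variable are assumptions, not the deployed configuration:

```python
import os
import secrets
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
ORG_CREATE_SECRET = os.environ.get("ORG_CREATE_SHARED_SECRET", "")

@app.post("/api/orgs/create")
async def create_org(payload: dict, authorization: str = Header(default="")):
    # a real deployment would also accept a normal superadmin login here
    presented = authorization.removeprefix("Bearer ")
    if not ORG_CREATE_SECRET or not secrets.compare_digest(presented, ORG_CREATE_SECRET):
        raise HTTPException(status_code=403, detail="invalid_auth")
    # ... create the org, apply optional quotas, invite first admin user ...
    return {"added": True}
```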
Backend work for https://github.com/webrecorder/browsertrix/issues/1876
- If readOnly is set to true, disallow crawls and QA analysis runs
- If readOnly is set to true, skip scheduled crawls
- Add endpoint to set `readOnly` with optional `readOnlyReason` (which
is automatically set back to an empty string when `readOnly` is being
set to false), which can be displayed in banner
- Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Backend work for #1859
- Remove file count from qa stats endpoint
- Compute isFile or isError per page when page is added
- Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages
- Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount)
- Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl
- Determine if a page is a file based on loadState == 2, mime type or status code, and lack of title
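A rough sketch of that heuristic (the exact rules live in the backend's page ops; the conditions below are illustrative):

```python
def is_file_page(load_state: int, mime: str, status: int, title: str) -> bool:
    # a page with loadState == 2 and no title, whose mime type (or status code)
    # doesn't look like a rendered HTML page, is treated as a direct file
    if load_state != 2 or title:
        return False
    if mime and not mime.startswith("text/html"):
        return True
    return status == 206  # partial-content responses typically mean a download

print(is_file_page(2, "application/pdf", 200, ""))  # -> True
```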
Fixes #1833
- Add firstSeed and seedCount to workflow information in profile detail
API endpoint (tests updated accordingly), update name of model used for
limited workflow information to be more accurate
- Fix name display in Crawl Workflows list at bottom of Profile detail
page to be consistent with rest of application
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
This PR introduces backend changes that add the following fields to the
Profile model:
- `modified`
- `modifiedBy`
- `modifiedByName`
- `createdBy`
- `createdByName`
Modified fields are set to the same as the created fields when the
resource is created, and changed when the profile is updated (profile
itself or metadata).
The list profiles endpoint now also supports `sortBy` and
`sortDirection` options. The endpoint defaults to sorting by `modified`
in descending order, but can also sort on `created` and `name`.
Tests have also been updated to reflect all new behavior.
Fixes https://github.com/webrecorder/browsertrix/issues/1743
On the backend, support adding a user to a new org and improve error
messaging:
- if the user exists but is not part of the org to be added to, add the user to the
registration org, return 201
- if user is already part of the org, return 400 with
'user_already_is_org_member' error
- if user is not being added to a new org but already exists, return
'user_already_exists'
frontend:
- if the user uses the same password, they will just be logged in and added to the
registration org (same as before)
- if user uses wrong password, show alert message
- handle both user_already_is_org_member and user_already_registered
with alert message and link to log-in page.
note: this means an existing user is added to the registration org even if
they provide the wrong password, but they won't be able
to log in until they use the correct password/reset/etc...
To support #1683, it would be useful to be able to sort by 'last QA
start time' in addition to/instead of last QA state.
- make sorting consistent with workflow sorting
- sortBy fields renamed to lastQAState and lastQAStarted
- Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
As additional support for #1683, include the active QA stats in the
crawl response, along with active QA state.
This will allow showing progress of QA run in the archived items list.
Fixes #1659
Takes an arbitrary set of thresholds for text and screenshot matches as
a comma-separated list of floats.
Returns a list of groupings for each that include the lower boundary and
count for all thresholds passed in.
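A sketch of the grouping logic (function and field names illustrative):

```python
from typing import List

def group_by_thresholds(scores: List[float], thresholds_param: str):
    thresholds = sorted(float(t) for t in thresholds_param.split(","))
    boundaries = [0.0] + thresholds
    groups = [{"lowerBoundary": b, "count": 0} for b in boundaries]
    for score in scores:
        # highest lower boundary that the score meets or exceeds
        idx = max(i for i, b in enumerate(boundaries) if score >= b)
        groups[idx]["count"] += 1
    return groups

print(group_by_thresholds([0.2, 0.55, 0.9, 0.95], "0.5,0.9"))
# -> [{'lowerBoundary': 0.0, 'count': 1}, {'lowerBoundary': 0.5, 'count': 1},
#     {'lowerBoundary': 0.9, 'count': 2}]
```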
Backend work for #1672
Adds new sort options to /crawls and /all-crawls GET list endpoints:
- `reviewStatus`
- `qaRunCount`: number of completed QA runs for crawl (also added to
CrawlOut)
- `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of
which are added to CrawlOut)
Fixes #1670
No longer need to pass pages to the ConfigMap. The ConfigMap has a size
limit and will fail if there are too many pages.
With this change, the page list for QA will be read directly from the
WACZ files pages.jsonl / extraPages.jsonl entries.
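A minimal sketch of reading the page list straight from a WACZ (which is a zip file), instead of passing pages through the ConfigMap:

```python
import json
import zipfile

def iter_wacz_pages(wacz_path: str):
    with zipfile.ZipFile(wacz_path) as wacz:
        for name in ("pages/pages.jsonl", "pages/extraPages.jsonl"):
            if name not in wacz.namelist():
                continue
            with wacz.open(name) as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    # the first line is a format header; real pages have a 'url'
                    if "url" in record:
                        yield record

for page in iter_wacz_pages("crawl.wacz"):
    print(page["url"], page.get("title"))
```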
- increase the time allowed for going from starting to waiting_capacity to 150
seconds
- relax requirement for state transitions, allow complete from waiting
- additional type safety for different states, ensure mark_finished()
only called with non-running states, add `Literal` types for all the
state types.
Supports running QA Runs via the QA API!
Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498
Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)
Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.
- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch)
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #1539
Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl
update API endpoint. Acceptable values are "good", "acceptable" or
"failure", enforced by an Enum.
Added to `BaseCrawl` so that we can extend support to uploads more
easily later on, but for now we'll only display this for crawls in the
frontend.
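A sketch of the new field (pydantic-style, names approximate):

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel

class ReviewStatus(str, Enum):
    GOOD = "good"
    ACCEPTABLE = "acceptable"
    FAILURE = "failure"

class BaseCrawl(BaseModel):
    id: str
    reviewStatus: Optional[ReviewStatus] = None

# the crawl update endpoint accepts only the enum values above
crawl = BaseCrawl(id="crawl-1", reviewStatus="acceptable")
print(crawl.reviewStatus)
```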
Fixes #1502
- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted
Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.
Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1341
Adds "User Agent" field to workflow editor under the Browser Settings
tab. If not set, the crawler will use the browser's default user agent.
Also added to docs and to the workflow details page (if set).
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Fixes #1385
## Changes
Supports multiple crawler 'channels' which can be configured to
different browsertrix-crawler versions
- Replaces `crawler_image` in helm chart with `crawler_channels` array
similar to how storages are handled
- The `default` crawler channel must always be provided and specifies
the default crawler image
- Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint
to fetch information about available crawler versions (name, image, and
label), along with a test
- Adds crawler channel select to workflow creation/edit screens and
profile creation dialog, and updates related API endpoints and
configmaps accordingly. The select dropdown is shown only if more than
one channel is configured.
- Adds `crawlerChannel` to workflow and crawl details.
- Adds `image` to the crawl, used to display the actual crawler image used as
part of the crawl.
- Modifies `crawler_crawl_id` backend test fixture to use `test` crawler
version to ensure crawler versions other than latest work
- Adds migration to add `crawlerChannel` set to `default` to existing
workflow and profile objects and workflow configmaps
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>