Subscription Management: add check to ensure subscription can be auto-canceled if
not activated.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Overhauls URL presigning by:
- cache the presigned URLs in a flat, separate MongoDB collection which
has an expiring index (see the sketch after this list)
- update presigned URLs automatically if not found in / expired from the index
- remove logic for storing presignedUrl in files
- support caching presigned URLs for thumbnails.
- add endpoints to clear presigned urls for org or for all files in all
orgs (superadmin only)
- supersedes #2438, fix for #2437
- removes previous presignedUrl and expireAt data from crawls and QA
runs
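A minimal sketch of how the expiring cache collection could be wired up with a
MongoDB TTL index (collection, field, and helper names here are illustrative,
not necessarily what the PR uses):

```python
# Sketch only: expiring presigned-URL cache backed by a MongoDB TTL index.
from datetime import datetime, timezone

PRESIGN_DURATION_SECONDS = 3600 * 4  # assumed signing window


async def init_presigned_url_cache(db):
    # TTL index: MongoDB drops documents automatically once `signedAt`
    # is older than expireAfterSeconds.
    await db["presigned_urls"].create_index(
        "signedAt", expireAfterSeconds=PRESIGN_DURATION_SECONDS
    )


async def get_presigned_url(db, file_id: str, sign_func):
    # Return the cached URL if still present; otherwise re-sign and cache.
    cached = await db["presigned_urls"].find_one({"_id": file_id})
    if cached:
        return cached["url"]

    url = await sign_func(file_id)
    await db["presigned_urls"].replace_one(
        {"_id": file_id},
        {"_id": file_id, "url": url, "signedAt": datetime.now(timezone.utc)},
        upsert=True,
    )
    return url
```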
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #2406
Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.
Also optimizes MongoDB queries for better performance.
Migration Improvements:
- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection stats recomputation
Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources
- Rename /urls -> /pageUrlCounts and avoid $group; instead sort with an
index, either by seed + ts or by url, to get top matches.
- Use $gte instead of $regex to get prefix matches on URL (see the sketch
after this list)
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
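A rough sketch of the query shapes behind the $regex swaps above (field names
are illustrative; assumes an index on `url` and a text index on `title`):

```python
# Sketch only: query shapes used instead of $regex.

def url_prefix_query(prefix: str) -> dict:
    # Bounded range on the indexed `url` field matches the prefix without
    # a $regex scan.
    return {"url": {"$gte": prefix, "$lt": prefix + "\uffff"}}


def title_text_query(search: str) -> dict:
    # $text uses the text index on `title` rather than a $regex scan.
    return {"$text": {"$search": search}}
```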
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- Refactors dashboard and org profile preview to use private API
endpoint, to fix public collections not showing when the org
visibility is hidden
- Adds additional sorting options for collections
- Adds unique page url counts for archived items, collections, and
organizations to backend and exposes this in collections
- Shows collection period (i.e. `dateEarliest` to `dateLatest`) in
collections list
- Shows same collection metadata in private and public views, updates
private view info bar
- Fixes "Update Org Profile" action item showing for crawler roles
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #2257
This is a follow-up to the public collections work, which adds pages to
the database for uploads. All crawls and uploads now have a `pageCount`
field which is populated when the item is successfully added. A new
migration is also added to populate the field for existing archived
items that don't have it set yet.
OrgMetrics have also been modified to include `crawlPageCount` and
`uploadPageCount`, and to include the total of both in `pageCount`; all
three are shown in the frontend org dashboard.
The frontend has been updated to use `pageCount` rather than
`stats.done` wherever appropriate, meaning that in archived item lists
and details we now have a consistent page count for both crawls and
uploads.
### New functionality
- Deploy this branch
- Create new crawls and uploads and verify that page count appears
correctly throughout the frontend for all new crawls and uploads
### Migration
- Deploy from latest main
- Create some crawls and uploads
- Change to this branch and re-deploy
- Verify migration ran without errors in backend logs
- Verify that page count has been populated successfully by checking
archived items lists, crawl and upload detail pages, and dashboard to
ensure there are no longer any missing page counts.
---------
Co-authored-by: emma <hi@emma.cafe>
Resolves https://github.com/webrecorder/browsertrix/issues/2298
## Changes
- Slugs added to collections; they can be specified separately when creating
or updating a collection, or else are derived from the supplied collection
name (see the sketch after this list)
- Migration added to backfill slugs for existing collections
- Redirect collection to newest slug if changed
- Adds option to copy public profile link to "Public Collections" action
menu
- Show "Back to <Org>" link instead of breadcrumbs
---------
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #2260
- Adds `lastCrawlFinished` to Organization model, updated after crawls
are added/deleted and with an idempotent migration to backfill existing
orgs
- Adds Last Crawl column to end of admin orgs list table
- Adds subscription icon next to existing status icon in orgs list
- Adds "lastCrawlFinished", "subscriptionStatus", and "subscriptionPlan"
sort options to orgs list backend endpoint in anticipation of future
sorting/filtering of orgs list
---------
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1051
If the org with the provided slug doesn't exist, or no public collections
exist for that org, return the same 404 response with a detail of
"public_profile_not_found" to prevent people from using the public endpoint
to determine whether an org exists.
Endpoint is `GET /api/public-collections/<org-slug>` (no auth needed) to
avoid collisions with existing org and collection endpoints.
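A minimal FastAPI-style sketch of that behavior (the lookup helpers below are
hypothetical stand-ins, not the real backend functions):

```python
from typing import Optional

from fastapi import APIRouter, HTTPException

router = APIRouter()


async def find_org_by_slug(org_slug: str) -> Optional[dict]:
    # Hypothetical helper; the real app queries MongoDB here.
    return None


async def list_public_collections(org: dict) -> list:
    # Hypothetical helper returning the org's public collections.
    return []


@router.get("/api/public-collections/{org_slug}")
async def get_public_collections(org_slug: str):
    org = await find_org_by_slug(org_slug)
    collections = await list_public_collections(org) if org else []

    # Same 404 whether the org is missing or just has no public
    # collections, so the endpoint can't be used to probe org existence.
    if not org or not collections:
        raise HTTPException(status_code=404, detail="public_profile_not_found")

    return {"org": org["slug"], "collections": collections}
```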
Fixes #2158
- Adds `Organization.listPublicCollections` field and API endpoint to
update it
- Replaces `Collection.isPublic` boolean with `Collection.access`
(values: `private`, `unlisted`, `public`) and adds a database migration
(see the sketch after this list)
- Updates frontend to use `Collection.access` instead of `isPublic`,
otherwise not changing current behavior
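A minimal sketch of the new access model (pydantic-style; the default shown is
an assumption):

```python
from enum import Enum

from pydantic import BaseModel


class CollectionAccessType(str, Enum):
    # Replaces the old `isPublic` boolean.
    PRIVATE = "private"
    UNLISTED = "unlisted"
    PUBLIC = "public"


class Collection(BaseModel):
    name: str
    access: CollectionAccessType = CollectionAccessType.PRIVATE  # assumed default
```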
---------
Co-authored-by: sua yoo <sua@suayoo.com>
Adds sending a cancellation email when a subscription is cancelled.
- The email may also include an optional survey URL, if configured via the
helm chart `survey_url` setting.
- Cancellation e-mail configured in `sub_cancel` e-mail template
- E-mails are sent to all org admins.
- Also adds `trialing_canceled` subscription state to differentiate from
a default `trialing` which will automatically rollover into `active`.
- The email is sent when: a new cancellation date is added for an
`active` subscription, or a `trialing` subscription is changed to
`trialing_canceled`. (A subscription can be canceled/uncanceled several
times before the actual date, and an e-mail is sent every time it is canceled.)
- The 'You have X days left of your trial' message is also always displayed
when the state is `trialing_canceled`.
Fixes #2229
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #2112
- Moves org storage recalculation to a background job, modifies endpoint to
return job id as part of response
- Updates crawl + QA backend tests that broke due to
https://webrecorder.net website changes
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Resolves #1354
Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires Browsertrix Crawler 1.3+)
Config:
- proxies defined in btrix-proxies subchart
- can be configured via btrix-proxies key or separate proxies.yaml file via separate subchart
- proxies list refreshed automatically when crawler_proxies.json changes, if subchart is deployed
- support for ssh and socks5 proxies
- proxy keys added to secrets in subchart
- support for default proxy to be always used if no other proxy configured, prevent starting cluster if default proxy not available
- prevent starting manual crawl if previously configured proxy is no longer available, return error
- force 'btrix' username and group name on browsertrix-crawler non-root user to support ssh
Operator:
- support crawling through proxies, pass proxyId in CrawlJob
- support running profile browsers with a designated proxy, pass proxyId to ProfileJob
- prevent starting scheduled crawl if previously configured proxy is no longer available
API / Access:
- /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only)
- /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org
- /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only)
- superadmin can configure which orgs can use which proxies, stored on the org
- superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org.
UI:
- Superadmin has an 'Edit Proxies' dialog to configure, for each org, whether it has dedicated proxies and whether it has access to shared proxies.
- User can select a proxy in Crawl Workflow browser settings
- Users can choose to launch a browser profile with a particular proxy
- Display which proxy is used to create profile in profile selector
- Users can choose which default proxy to use for new workflows in Crawling Defaults
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
don't set inviterEmail / inviterName if the inviter is the superuser:
- return fromSuperuser true/false
- if fromSuperuser, don't set inviterEmail / inviterName
- tests: add tests for non-superuser admin invites
- Add a custom EmailStr type which lowercases the full e-mail, not just
the domain (see the sketch after this list).
- Ensure EmailStr is used throughout wherever e-mails are used, both for
invites and user models
- Tests: update to check for lowercase email responses, e-mails returned
from APIs are always lowercase
- Tests: remove tests where '@' was url-encoded, should not be possible
since POSTing JSON and no url-decoding is done/expected. E-mails should
have '@' present.
- Fixes #2083 where invites were rejected due to case differences
- CI: pin pymongo dependency due to changes in the latest release, update Python version used for CI
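A minimal sketch of a lowercasing email type in pydantic 2 style (the actual
type in the codebase may be implemented differently):

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, EmailStr

# Lowercase the entire address, not just the domain, so stored e-mails and
# invite lookups are consistently case-insensitive.
LowercaseEmailStr = Annotated[EmailStr, AfterValidator(lambda v: v.lower())]


class InviteRequest(BaseModel):
    email: LowercaseEmailStr


# InviteRequest(email="User@Example.COM").email == "user@example.com"
```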
- add POST /orgs/<id>/defaults/crawling API to update all defaults
(defaults unset are cleared)
- defaults returned as 'crawlingDefaults' object on Org, if set
- fixes #2016
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Tweaks to how execution time is tracked for more accuracy + excluding
waiting states:
- don't update if crawl state is in a 'waiting state' (waiting for
capacity or waiting for org limit)
- rename start states -> waiting states for clarity
- reset lastUpdatedTime after two consecutive updates with a non-running
state, to ensure non-running states don't count, but also account for
occasional hiccups -- if only one update detects a non-running state,
don't reset
- webhooks: move start webhook to when crawl actually starts for first
time (db lastUpdatedTime is not yet set + crawl is running)
- don't set lastUpdatedTime until pods actually running
- set crawljob update interval to every 10 seconds for more accurate
execution time tracking
- frontend: show seconds in 'Execution Time' display
- Follow-up to #1914, allows SubscriptionUpdate event to also update
quotas.
- Passes current usage info + current billing page URL to portalUrl
request for external app to be able to respond with best portalUrl
- get_origin() moved to utils to be available more generally.
- Updates billing tab to show current plans, switches order of quotas to
list execution time, storage first
- instead of looking up storage and exec minute quotas from the oid and
loading an org each time, load the org once and then check quotas on the org
object - many times the org was already available, and was looked up
again
- storage and exec quota checks become sync
- rename can_run_crawl() to more generic can_write_data(), optionally
also checks exec minutes
- typing: get_org_by_id() always returns org, or throws, adjust methods
accordingly (don't check for none, catch exception)
- typing: fix typo in BaseOperator, catch type errors in operator
'org_ops'
- operator quota check: use up-to-date 'status.size' for current job,
ignore current job in all jobs list to avoid double-counting
- follow up to #1969
Fixes #1968
Changes:
- `stopped_quota_reached` and `skipped_quota_reached` migrated to new
values that indicate which quota was reached
- Before crawls are run, the operator checks if storage or exec mins
quotas are reached and if so fails the crawl with the appropriate state
of `skipped_storage_quota_reached` or `skipped_time_quota_reached`
- While crawls are running, the operator checks if the exec mins quota
is reached or if the size of all running crawls will mean the storage
quota is reached once uploaded; if so, the crawl is stopped gracefully
and given `stopped_storage_quota_reached` or `stopped_time_quota_reached`
state as appropriate
- Adds new nightly tests for enforcing storage quota
Fixes #1955
Orgs list endpoint sorting now works as follows:
- Default org is always sorted first
- Name sorting now works on a lowercased version of the org names to
ensure case-insensitive lexical sorting (see the sketch below)
The lodash `sortBy` resorting of orgs in the "All Organizations"
dropdown list in the nav bar has also been removed so that the backend
sorting is applied instead.
Tests have been updated accordingly.
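A rough sketch of an aggregation that keeps the default org first and sorts
the rest case-insensitively (field names are illustrative; the real pipeline
may differ):

```python
def org_sort_pipeline(default_org_id) -> list:
    # Default org sorts first, remaining orgs by lowercased name ascending.
    return [
        {
            "$addFields": {
                "isDefault": {"$eq": ["$_id", default_org_id]},
                "nameLower": {"$toLower": "$name"},
            }
        },
        {"$sort": {"isDefault": -1, "nameLower": 1}},
    ]
```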
* updates pydantic to 2.x
* also update to python 3.12
* additional type fixes:
- all Optional[] types must have a default value (see the sketch below)
- update to constrained types
- URL types converted from str
- test updates
Fixes #1940
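An illustrative example of the Optional-default and URL-type changes under
pydantic 2:

```python
from typing import Optional

from pydantic import BaseModel, HttpUrl


class ExampleSettings(BaseModel):
    # pydantic 1.x implicitly defaulted Optional fields to None;
    # pydantic 2.x requires the default to be written out.
    description: Optional[str] = None

    # URL fields move from plain str to a dedicated URL type.
    origin: Optional[HttpUrl] = None
```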
Fixes #1927
Also adds tests to ensure the index is working as expected, and a migration
to rename orgs that have names or slugs identical to other orgs except for
case, before the new case-insensitive index is built.
- ensure crawlFilenameTemplate is part of the CrawlConfig model
- change CrawlConfig init to use type-safe construction
- add a run_now_internal() that is shared for starting crawl, either on
demand or from new config
- add OrgOps.can_run_crawls() to check against org quotas for crawling
- cleanup profile updates, remove _lookup_profile, only check for
EmptyStr in update
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #1926
- adds /subscriptions/import endpoint for importing an existing subscription to an existing org
- add SubscriptionImport object and log as 'import' event in subscription events collection
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1920
Adds response models to all API endpoints that were missing them,
documenting current behavior without making any changes at this stage to
standardize responses.
Follow-up work will involve adding generics to some of the response models
Fixes #1916
- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone-naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
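A minimal sketch of the helper (the real `utils.dt_now` may do more, e.g.
truncate microseconds):

```python
from datetime import datetime, timezone


def dt_now() -> datetime:
    # Timezone-aware replacement for the deprecated datetime.utcnow().
    return datetime.now(timezone.utc)
```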
Initial implementation of #1892
- Modifies the backend to return `duplicate_org_name` or
`duplicate_org_slug` as appropriate on a pymongo `DuplicateKeyError` (see
the sketch below)
- Updates frontend to handle `duplicate_org_name`, `duplicate_org_slug`,
and `invalid_slug` error details
- Update errors to be more consistent, also return `duplicate_org_subscription.subId` for duplicate subscription instead of the more generic `already_exists`
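A rough sketch of the error mapping (illustrative; assumes the violated unique
index can be identified from the error details):

```python
from fastapi import HTTPException
from pymongo.errors import DuplicateKeyError


async def insert_org(orgs, org_doc: dict) -> None:
    try:
        await orgs.insert_one(org_doc)
    except DuplicateKeyError as err:
        # Check which unique index was violated to return a specific detail.
        key_pattern = (err.details or {}).get("keyPattern", {})
        if "slug" in key_pattern:
            raise HTTPException(status_code=400, detail="duplicate_org_slug")
        raise HTTPException(status_code=400, detail="duplicate_org_name")
```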
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes https://github.com/webrecorder/browsertrix/issues/1905
- adds a new top-level `/api/subscriptions` endpoint and SubOps handler on
the backend.
- enable subscriptions API endpoints only if `billing_enabled` is
set in helm chart
- new POST /subscriptions/create, /subscriptions/update,
/subscriptions/cancel API endpoints
- Subscriptions mongo collection storing timestamped /subscription
API events
- GET /subscriptions/events API to get subscription events, support for filtering and sorting
- Subscription data model
- Support for setting and handling readOnlyOnCancel on org
- /orgs/<id>/billing-portal to lookup portalUrl using external API
- subscription in org getter and list views
- mark org as readOnly for subscription status `paused_payment_failed`, clear it on status `active`
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
The default org will always be sorted first, regardless of sort options.
Orgs after the first will be sorted by name ascending by default.
Sorting currently supported on name, slug, and readOnly.
Fixes #1432
Refactors the invite + registration system to be simpler and more consistent
with regard to existing user invites. Previously, per-user invites were
stored in the user.invites dict instead of in the invites collection,
which created a few issues:
- Existing user invites did not show up in the Org Invites list: #1432
- Existing user invites also did not expire, unlike new user invites,
creating a potential security issue.
Instead, existing user invites should be treated like new user invites.
This PR moves them into the same collection,
adding a `userid` field to InvitePending to match with an existing user.
If a user already exists, it will be matched by userid, instead of by
email. This allows the user to update their email while still being
invited. Note that the email of the invited existing user will not
change in the invite email. This is also by design: an admin of one org
should not be given any hint that an invited user already has an
account, such as by having their email automatically update. For an org
admin, the invite to a new or existing user should be indistinguishable.
The sha256 of the invite token is stored instead of the actual token for
better security.
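Storing only a hash of the token might look roughly like this:

```python
import hashlib
from uuid import uuid4


def hash_token(token: str) -> str:
    # Only the sha256 digest is persisted, so a leaked invites collection
    # does not expose usable invite tokens.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


# On invite creation: email the raw token, store only its hash.
token = str(uuid4())
stored_hash = hash_token(token)

# On acceptance: hash the presented token and compare against the stored hash.
assert hash_token(token) == stored_hash
```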
The registration system has also been refactored with the following
changes:
- Auto-creation of new orgs for new users has been removed
- User.create_user() replaces the old User._create() and just creates the user, without the
additional complex logic around org auto-add
- Users are added to an org via add_user_to_org()
- Users are added to org through invites with add_user_with_invite()
Tests:
- Additional tests include verifying that existing and new pending
invites appear in the pending invites list
- Tests for `/users/invite/<token>?email=` and
`/users/me/invite/<token>` endpoints
- Deleting pending invites
- Additional tests added for user self-registration, including existing
user self-registration to default org of existing user (in nightly
tests)
Fixes #890
This PR introduces new streaming superuser-only API endpoints to export
and import database information for an organization. New Administrator
deployment documentation on how to manage the process and copy files
between S3 buckets as needed is also included.
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Updates the /api/orgs/create endpoint to:
- not have name / slug be required; the org will be renamed by the first user via
#1870
- support optional quotas
- support optional first admin user email, who will receive an invite to
join the org.
Also supports a new shared secret mechanism, to allow an external
automation to access the /api/orgs/create endpoint (and only that
endpoint thus far) via a shared secret instead of normal login.
Resolves https://github.com/webrecorder/browsertrix/issues/1874
Support for new two-part sign-up flow if a first admin user is added to an org
- If new user, user registers first, then is able to change the org name / slug on following screen
- If existing user, user accepts invite, then is able to change the org name / slug on following screen
- After confirming org slug name, user is taken to dashboard, or error is shown if org name or slug already taken.
- If org name == org id, org name and slug are automatically set to `{Your Name}'s Archive` when the first user is registered / accepts invite
- Email templates updated to better reflect new / existing users and not show org name if it is 'unset' (org name == org id internally)
- tests: frontend unit testing for accept + invite screens.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Fixes #1890
Adds validation for org slugs, ensuring that they contain only ASCII
alphanumeric characters and dashes (`-`). If an invalid slug is
provided, an HTTPException is returned with status code 400 and detail
`invalid_slug`.
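A minimal sketch of that validation (illustrative):

```python
import re

from fastapi import HTTPException

# ASCII alphanumeric characters and dashes only, and not empty.
SLUG_RE = re.compile(r"^[a-zA-Z0-9-]+$")


def validate_slug(slug: str) -> None:
    if not SLUG_RE.match(slug):
        raise HTTPException(status_code=400, detail="invalid_slug")
```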
Fixes https://github.com/webrecorder/browsertrix/issues/1883
Backend work for https://github.com/webrecorder/browsertrix/issues/1876
- If readOnly is set to true, disallow crawls and QA analysis runs
- If readOnly is set to true, skip scheduled crawls
- Add endpoint to set `readOnly` with optional `readOnlyReason` (which
is automatically set back to an empty string when `readOnly` is being
set to false), which can be displayed in banner
- Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
clean up adding user vs changing role logic:
- when adding user, ensure user doesn't exist
- when changing roles, ensure user does exist
add test for changing roles of existing user
Fixes #1821
Fixes https://github.com/webrecorder/browsertrix/issues/1743
On the backend, support adding user to new org and improved error
messaging:
- if user exists but is not part of the org to be added to, add user to the
registration org, return 201
- if user is already part of the org, return 400 with
'user_already_is_org_member' error
- if user is not being added to a new org but already exists, return
'user_already_exists'
frontend:
- if user uses same password, they will just be logged in and added to
registration org (same as before)
- if user uses wrong password, show alert message
- handle both user_already_is_org_member and user_already_registered
with alert message and link to log-in page.
note: this means an existing user is added to the registration org even if
they provide the wrong password, but they won't be able
to log in until they use the correct password/reset/etc...
If 'registration_enabled' is set, check 'registration_org_id' for the org id
of an existing org that new users should be added to when they register.
If omitted, default to the default org.
Fixes #1729
Supports running QA Runs via the QA API!
Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498
Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)
Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.
- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch)
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
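A rough sketch of how the comparison params could translate into a page query
(the page document layout shown is an assumption):

```python
def qa_page_filter(qa_run_id: str, filter_by: str, gte=None, gt=None,
                   lte=None, lt=None) -> dict:
    # e.g. filter_by="screenshotMatch", gte=0.9 -> pages whose screenshot
    # match score for this QA run is at least 0.9.
    field = f"qa.{qa_run_id}.{filter_by}"  # assumed layout of per-run scores
    bounds = {}
    if gte is not None:
        bounds["$gte"] = gte
    if gt is not None:
        bounds["$gt"] = gt
    if lte is not None:
        bounds["$lte"] = lte
    if lt is not None:
        bounds["$lt"] = lt
    return {field: bounds} if bounds else {}
```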
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>