Fixes#2260
- Adds `lastCrawlFinished` to Organization model, updated after crawls
are added/deleted and with an idempotent migration to backfill existing
orgs
- Adds Last Crawl column to end of admin orgs list table
- Adds subscription icon next to existing status icon in orgs list
- Adds "lastCrawlFinished", "subscriptionStatus", and "subscriptionPlan"
sort options to orgs list backend endpoint in anticipation of future
sorting/filtering of orgs list
---------
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#2182
This rather large PR adds the rest of what should be needed for public
collections work in the frontend.
New API endpoints include:
- Public collections endpoints: GET, streaming download
- Paginated list of URLs in collection with snapshot (page) info for
each
- Collection endpoint to set home URL
- Collection endpoint to upload thumbnail as stream
- DELETE endpoint to remove collection thumbnail
Changes to existing API endpoints include:
- Paginating public collection list results
- Several `pages` endpoints that previously only supported `/crawls/` in
their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support
`/uploads/` and `/all-crawls/` namespaces as well. This is necessitated
by adding pages for uploads to the database (see below). For
`/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will
serve as a filter to only affect crawls of that given type. Other
endpoints are more liberal at this point, and will perform the same
action regardless of the namespace used in the route (we'll likely want
to change this in a follow-up to be more consistent).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job
rather than doing all of the computation in an asyncio task in the
backend container. The background job additionally updates collection
date ranges, page/size counts, and tags for each collection in the org
after pages have been (re)added.
Other big changes:
- New uploads will now have their pages read into the database!
Collection page counts now also include uploads
- A migration was added to start a background job for each org that will
add the pages for previously-uploaded WACZ files to the database and
update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we
can use for other user-uploaded image files moving forward, with
separate output models for authenticated and public endpoints
Fixes#1051
If org with provided slug doesn't exist or no public collections exist
for that org, return same 404 response with a detail of
"public_profile_not_found" to prevent people from using public endpoint
to determine whether an org exists.
Endpoint is `GET /api/public-collections/<org-slug>` (no auth needed) to
avoid collisions with existing org and collection endpoints.
Fixes#2158
- Adds `Organization.listPublicCollections` field and API endpoint to
update it
- Replaces `Collection.isPublic` boolean with `Collection.access`
(values: `private`, `unlisted`, `public`) and add database migration
- Update frontend to use `Collection.access` instead of `isPublic`,
otherwise not changing current behavior
---------
Co-authored-by: sua yoo <sua@suayoo.com>
Closes#2223
- [x] Adds `localesAvailable` to `/api/settings` endpoint, and uses that
list if available, rather than the full list of translated locales, to
determine which options to display to users
- [x] ~~Uses the user's browser locales, filtered to the current
language setting, for formatting numbers, dates, and durations~~
- [x] Adds & persists checkbox for "use same language for formatting
dates and numbers" in user settings
- [x] Replaces uses of `sl-format-bytes` with `localize.bytes(...)`, and
`sl-format-date` with replacement `btrix-format-date` that properly
handles fallback locales
- [x] Caches all number/duration/datetime formatters by a combined key
consisting of app language, browser language, browser setting, and
formatter options so that all formatters can be reused if needed
(previously any formatter with non-default options would be recreated
every render)
- [x] Splits out ordinal formatting from number formatter, as it didn't
make much sense in some non-English locales
- [x] Adds a little demo of date/time/duration/number formatting so you
can see what effect your language settings have
https://github.com/user-attachments/assets/724858cb-b140-4d72-a38d-83f602c71bc7
---------
Signed-off-by: emma <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Fixes#2112
- Moves org storage recalculation to background job, modify endpoint to
return job id as part of response
- Updates crawl + QA backend tests that broke due to
https://webrecorder.net website changes
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Non-admin users should not be given option to rename org when invited to
a new org:
- set firstOrgAdmin to true only when invite is for an admin
- default to false instead of null
- update tests to check
- download via presigned URLs via requests instead of boto APIs, remove boto
- follow-up to #1933 for streaming download improvements
- fixes datapackage.json in multi-wacz to contain the same resources
objects with: `name`, `path`, `hash`, `bytes` to match single WACZ.
- Add additional metadata to multi-wacz datapackage.json, including `type`
(`crawl`, `upload`, `collection`, `qaRun`), `id` (unique id for the
object), `title` / `description` if available (for
crawl/upload/collection), and `crawlId` for `qaRun`
don't set inviterEmail / inviterName if the inviter is the superuser:
- return fromSuperuser true/false
- if fromSuperuser, don't set inviterEmail / inviterName
- tests: add tests for non-superuser admin invites
- Add a custom EmailStr type which lowercases the full e-mail, not just
the domain.
- Ensure EmailStr is used throughout wherever e-mails are used, both for
invites and user models
- Tests: update to check for lowercase email responses, e-mails returned
from APIs are always lowercase
- Tests: remove tests where '@' was ur-lencoded, should not be possible
since POSTing JSON and no url-decoding is done/expected. E-mails should
have '@' present.
- Fixes#2083 where invites were rejected due to case differences
- CI: pin pymongo dependency due to latest releases update, update python used for CI
Use timezone aware datetimes instead of timezone naive datetimes:
- Update mongodb client to use tz-aware conversion
- Convert dt_now() to return timezone aware UTC date
- Rename to_k8s_date -> date_to_str, just returns ISO UTC date with 'Z'
(instead of '+00:00' suffix)
- Rename from_k8s_date -> str_to_date, returns timezone aware date from
str
- Standardize all string<->date conversion to use either date_to_str or
str_to_date
- Update frontend to assume iso date, not append 'Z' directly
- Update tests to check for 'Z' suffix on some dates
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- add POST /orgs/<id>/defaults/crawling API to update all defaults
(defaults unset are cleared)
- defaults returned as 'crawlingDefaults' object on Org, if set
- fixes#2016
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
* Fixes issue in FailedLogin model:
- fix data-model to remove nested 'attempted.attempted'
- migrate existing data to remove nested field
* Also, avoid setting dt_now() in model as that results in fixed date for
all objects:
- update FailedLogin to update 'attempted' date on every attempt
- also update PageNote object to set date in constructor
* Update text for too many logins to make it clear it is set only if its a
valid email
* fixes#2001
- fix validation error if user doesn'r exist
- always return success even if user doesn't exist for security reasons
- add test for forgot password endpoint
- Follow-up to #1914, allows SubscriptionUpdate event to also update
quotas.
- Passes current usage info + current billing page URL to portalUrl
request for external app to be able to respond with best portalUrl
- get_origin() moved to utils to be available more generally.
- Updates billing tab to show current plans, switches order of quotas to
list execution time, storage first
Fixes#1957
Adds three new webhook events related to QA: analysis started, analysis
ended, and crawl reviewed.
Tests have been updated accordingly.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#1412
## Changes
### Backend
- Adds `all-crawls`, `crawls`, and `uploads` API endpoints to download
archived item as multi-WACZ
- Download QA runs as multi-WACZ
- Adds backend tests for new endpoints
- Update to new version of stream-zip library which does not require crc-32 to be present for ZIP members,
computes after streaming, fixing invalid crc-32 issues as previously computed crc-32s from crawler may be invalid.
### Frontend
Adds ability to download archived item from:
- Button in archived item detail Files tab
- Archived item details actions menu
- Archived items list menu
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#1955
Orgs list endpoint sorting now works as follows:
- Default org is always sorted first
- Name sorting now works on a lowercased version of the org names to
ensure lexical sorting
The lodash `sortBy` resorting of orgs in the "All Organizations"
dropdown list in the nav bar has also been removed so that the backend
sorting is applied instead.
Tests have been updated accordingly.
* updates pydantic to 2.x
* also update to python 3.12
* additional type fixes:
- all Optional[] types must have a default value
- update to constrained types
- URL types converted from str
- test updates
Fixes#1940
Follow-up to regressions from #1928, this PR:
- Fixes response models for queue endpoints, which had incorrect model
- Adds tests for queue get, queue match, and exclusions add / remove to
ensure regressions like this can be caught via tests. This involves
starting a new crawl in test_run_crawls() instead of relying on implicit
running via fixtures, make it easier to test crawl while it's running.
- Adds additional typing for crawls apis, including making
delete_crawls() have correct typing, consistent derived class override
- Adds check to ensure queue + exclusion operations can not be called
when crawl is not running
Fixes#1927
Also adds tests to ensure index is working as expected, and migration to
rename orgs that have names or slugs identical to other orgs except for
case before the new case-insensitive index is built.
Fixes#1926
- adds /subscriptions/import endpoint for importing an existing subscription to an existing org
- add SubscriptionImport object and log as 'import' event in subscription events collection
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#1916
- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
Initial implementation of #1892
- Modifies the backend to return `duplicate_org_name` or
`duplicate_org_slug` as appropriate on a pymongo `DuplicateKeyError`
- Updates frontend to handle `duplicate_org_name`, `duplicate_org_slug`,
and `invalid_slug` error details
- Update errors to be more consistent, also return `duplicate_org_subscription.subId` for duplicate subscription instead of the more generic `already_exists`
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes https://github.com/webrecorder/browsertrix/issues/1905
- adds a new top-level `/api/subscriptions` endpoint and SubOps handler on
the backend.
- enable subscriptions API endpoints available only if `billing_enabled` is
set in helm chart
- new POST /subscriptions/create, /subscriptions/update,
/subscriptions/cancel API endpoints
- Subscriptions mongo collection storing timestamped /subscription
API events
- GET /subscriptions/events API to get subscription events, support for filtering and sorting
- Subscription data model
- Support for setting and handling readOnlyOnCancel on org
- /orgs/<id>/billing-portal to lookup portalUrl using external API
- subscription in org getter and list views
- mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active`
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
The default org will always be sorted first, regardless of sort options.
Orgs after the first will be sorted by name ascending by default.
Sorting currently supported on name, slug, and readOnly.
Fixes#1432
Refactors the invite + registration system to be simpler and more consistent
with regards to existing user invites. Previously, per-user invites are
stored in the user.invites dict instead of in the invites collection,
which creates a few issues:
- Existing user do not show up in Org Invites list: #1432
- Existing user invites also do not expire, unlike new user invites,
creating potential security issue.
Instead, existing user invites should be treated like new user invites.
This PR moves them into the same collection,
adding a `userid` field to InvitePending to match with an existing user.
If a user already exists, it will be matched by userid, instead of by
email. This allows for user to update their email while still being
invited. Note that the email of the invited existing user will not
change in the invite email. This is also by design: an admin of one org
should not be given any hint that an invited user already has an
account, such as by having their email automatically update. For an org
admin, the invite to a new or existing user should be indistinguishable.
The sha256 of invite token is stored instead of actual token for better
security.
The registration system has also been refactored with the following
changes:
- Auto-creation of new orgs for new users has been removed
- User.create_user() replaces the old User._create() and just creates the user with
additional complex logic around org auto-add
- Users are added to org in org add_user_to_org()
- Users are added to org through invites with add_user_with_invite()
Tests:
- Additional tests include verifying that existing and new pending
invites appear in the pending invites list
- Tests for `/users/invite/<token>?email=` and
`/users/me/invite/<token>` endpoints
- Deleting pending invites
- Additional tests added for user self-registration, including existing
user self-registration to default org of existing user (in nightly
tests)
Fixes#890
This PR introduces new streaming superuser-only API endpoints to export
and import database information for an organization. New Adminstrator
deployment documentation on how to manage the process and copy files
between S3 buckets as needed is also included.
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Updates the /api/orgs/create endpoint to:
- not have name / slug be required, will be renamed on first user via
#1870
- support optional quotas
- support optional first admin user email, who will receive an invite to
join the org.
Also supports a new shared secret mechanism, to allow an external
automation to access the /api/orgs/create endpoint (and only that
endpoint thus far) via a shared secret instead of normal login.
Fixes#1890
Adds validation for org slugs, ensuring that they contain only ASCII
alphanumeric characters and dashes (`-`). If an invalid slug is
provided, an HTTPException is returned with status code 400 and detail
`invalid_slug`.
Fixes https://github.com/webrecorder/browsertrix/issues/1883
Backend work for https://github.com/webrecorder/browsertrix/issues/1876
- If readOnly is set true, disallow crawls and QA analysis runs
- If readOnly is set to true, skip scheduled crawls
- Add endpoint to set `readOnly` with optional `readOnlyReason` (which
is automatically set back to an empty string when `readOnly` is being
set to false), which can be displayed in banner
- Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Backend work for first two tasks of
https://github.com/webrecorder/browsertrix/issues/1875
New /billing API endpoint to be added separately once we have a better
idea of what data we can get from the payment processor.
Backend work for #1859
- Remove file count from qa stats endpoint
- Compute isFile or isError per page when page is added
- Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages
- Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount)
- Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl
- Determine if page is a file based on loadState == 2, mime type or status code and lack of title
Fixes#1846
- Ensure meter auto-updates as new stats are ready
- Switch meter to new QA run when new analysis run is started
- Remove Files from QA meter (files and errors will be reported separately)
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>