Fixes #1955
Orgs list endpoint sorting now works as follows:
- Default org is always sorted first
- Name sorting now works on a lowercased version of the org names to
ensure lexical sorting
The lodash `sortBy` resorting of orgs in the "All Organizations"
dropdown list in the nav bar has also been removed so that the backend
sorting is applied instead.
Tests have been updated accordingly.
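The ordering rules above can be sketched in memory as follows. This is only an illustration: the `Org` dataclass and `sort_orgs` helper are hypothetical names, and the backend may well apply the same rules in a database query rather than in Python.

```python
from dataclasses import dataclass


@dataclass
class Org:
    # Hypothetical stand-in for the real org model
    name: str
    default: bool = False


def sort_orgs(orgs: list[Org], descending: bool = False) -> list[Org]:
    """Sort orgs by lowercased name, then move the default org to the front."""
    by_name = sorted(orgs, key=lambda org: org.name.lower(), reverse=descending)
    # sorted() is stable, so the name order is preserved within each group;
    # False sorts before True, so the default org (not default == False) is first
    return sorted(by_name, key=lambda org: not org.default)
```

Sorting on the lowercased name keeps "Alpha" ahead of "beta", which a plain case-sensitive comparison does not guarantee for mixed-case names.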
* updates pydantic to 2.x
* also update to python 3.12
* additional type fixes:
- all Optional[] types must have a default value
- update to constrained types
- URL types converted from str
- test updates
Fixes #1940
Follow-up to regressions from #1928, this PR:
- Fixes response models for queue endpoints, which had the incorrect model
- Adds tests for queue get, queue match, and exclusions add / remove to
ensure regressions like this can be caught via tests. This involves
starting a new crawl in test_run_crawls() instead of relying on implicit
running via fixtures, making it easier to test the crawl while it's running.
- Adds additional typing for crawls APIs, including making
delete_crawls() have correct typing, consistent derived class override
- Adds check to ensure queue + exclusion operations cannot be called
when crawl is not running
Fixes #1927
Also adds tests to ensure index is working as expected, and migration to
rename orgs that have names or slugs identical to other orgs except for
case before the new case-insensitive index is built.
Fixes #1926
- adds /subscriptions/import endpoint for importing an existing subscription to an existing org
- add SubscriptionImport object and log as 'import' event in subscription events collection
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1916
- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
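A minimal sketch of such a helper, based only on the description above (the real `utils.dt_now` may do more than this):

```python
from datetime import datetime, timezone


def dt_now() -> datetime:
    """Return the current time as a timezone-aware UTC datetime."""
    # Unlike datetime.utcnow(), the result carries tzinfo=timezone.utc
    return datetime.now(timezone.utc)
```

Aware and naive datetimes cannot be compared with each other in Python, which is one reason to use a single helper everywhere rather than mixing the two styles.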
Initial implementation of #1892
- Modifies the backend to return `duplicate_org_name` or
`duplicate_org_slug` as appropriate on a pymongo `DuplicateKeyError`
- Updates frontend to handle `duplicate_org_name`, `duplicate_org_slug`,
and `invalid_slug` error details
- Update errors to be more consistent, also return `duplicate_org_subscription.subId` for duplicate subscription instead of the more generic `already_exists`
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes https://github.com/webrecorder/browsertrix/issues/1905
- adds a new top-level `/api/subscriptions` endpoint and SubOps handler on
the backend.
- makes the subscriptions API endpoints available only if `billing_enabled` is
set in helm chart
- new POST /subscriptions/create, /subscriptions/update,
/subscriptions/cancel API endpoints
- Subscriptions mongo collection storing timestamped /subscription
API events
- GET /subscriptions/events API to get subscription events, support for filtering and sorting
- Subscription data model
- Support for setting and handling readOnlyOnCancel on org
- /orgs/<id>/billing-portal to lookup portalUrl using external API
- subscription in org getter and list views
- mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active`
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
The default org will always be sorted first, regardless of sort options.
Orgs after the first will be sorted by name ascending by default.
Sorting currently supported on name, slug, and readOnly.
Fixes #1432
Refactors the invite + registration system to be simpler and more consistent
with regard to existing user invites. Previously, per-user invites were
stored in the user.invites dict instead of in the invites collection,
which created a few issues:
- Existing users did not show up in Org Invites list: #1432
- Existing user invites also did not expire, unlike new user invites,
creating a potential security issue.
Instead, existing user invites should be treated like new user invites.
This PR moves them into the same collection,
adding a `userid` field to InvitePending to match with an existing user.
If a user already exists, it will be matched by userid, instead of by
email. This allows the user to update their email while still being
invited. Note that the email of the invited existing user will not
change in the invite email. This is also by design: an admin of one org
should not be given any hint that an invited user already has an
account, such as by having their email automatically update. For an org
admin, the invite to a new or existing user should be indistinguishable.
The sha256 of invite token is stored instead of actual token for better
security.
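As a rough illustration of storing only the hash, assuming hypothetical helper names and a hex-encoded digest (the actual token format and storage layout are not specified here):

```python
import hashlib
import secrets


def hash_token(token: str) -> str:
    """Return the hex SHA-256 digest of an invite token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def new_invite_token() -> tuple[str, str]:
    """Create a random token and the hash that would be persisted.

    The raw token goes into the invite link; only the hash is stored.
    """
    token = secrets.token_urlsafe(32)
    return token, hash_token(token)


def token_matches(token: str, stored_hash: str) -> bool:
    """Check a presented token against the stored hash in constant time."""
    return secrets.compare_digest(hash_token(token), stored_hash)
```

With this scheme, a leaked invites collection does not reveal usable invite links, since the raw token cannot be recovered from its digest.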
The registration system has also been refactored with the following
changes:
- Auto-creation of new orgs for new users has been removed
- User.create_user() replaces the old User._create() and just creates the user without
additional complex logic around org auto-add
- Users are added to org in org add_user_to_org()
- Users are added to org through invites with add_user_with_invite()
Tests:
- Additional tests include verifying that existing and new pending
invites appear in the pending invites list
- Tests for `/users/invite/<token>?email=` and
`/users/me/invite/<token>` endpoints
- Deleting pending invites
- Additional tests added for user self-registration, including existing
user self-registration to default org of existing user (in nightly
tests)
Fixes #890
This PR introduces new streaming superuser-only API endpoints to export
and import database information for an organization. New Administrator
deployment documentation on how to manage the process and copy files
between S3 buckets as needed is also included.
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Updates the /api/orgs/create endpoint to:
- not have name / slug be required, will be renamed on first user via
#1870
- support optional quotas
- support optional first admin user email, who will receive an invite to
join the org.
Also supports a new shared secret mechanism, to allow an external
automation to access the /api/orgs/create endpoint (and only that
endpoint thus far) via a shared secret instead of normal login.
Fixes #1890
Adds validation for org slugs, ensuring that they contain only ASCII
alphanumeric characters and dashes (`-`). If an invalid slug is
provided, an HTTPException is returned with status code 400 and detail
`invalid_slug`.
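The check itself could look roughly like the sketch below. The pattern and function name are assumptions, as is accepting both upper- and lowercase letters; in the API, a failed check would be translated into the 400 `invalid_slug` response described above.

```python
import re

# Hypothetical pattern: one or more ASCII letters, digits, or dashes
SLUG_RE = re.compile(r"[a-zA-Z0-9-]+")


def is_valid_slug(slug: str) -> bool:
    """Return True if the slug contains only ASCII alphanumerics and dashes."""
    # fullmatch() requires the whole string to match, so empty slugs,
    # spaces, and non-ASCII letters are all rejected
    return SLUG_RE.fullmatch(slug) is not None
```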
Fixes https://github.com/webrecorder/browsertrix/issues/1883
Backend work for https://github.com/webrecorder/browsertrix/issues/1876
- If readOnly is set true, disallow crawls and QA analysis runs
- If readOnly is set to true, skip scheduled crawls
- Add endpoint to set `readOnly` with optional `readOnlyReason` (which
is automatically set back to an empty string when `readOnly` is being
set to false), which can be displayed in banner
- Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Backend work for first two tasks of
https://github.com/webrecorder/browsertrix/issues/1875
New /billing API endpoint to be added separately once we have a better
idea of what data we can get from the payment processor.
Backend work for #1859
- Remove file count from qa stats endpoint
- Compute isFile or isError per page when page is added
- Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages
- Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount)
- Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl
- Determine if page is a file based on loadState == 2, mime type or status code and lack of title
Fixes #1846
- Ensure meter auto-updates as new stats are ready
- Switch meter to new QA run when new analysis run is started
- Remove Files from QA meter (files and errors will be reported separately)
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
Fixes #1833
- Add firstSeed and seedCount to workflow information in profile detail
API endpoint (tests updated accordingly), update name of model used for
limited workflow information to be more accurate
- Fix name display in Crawl Workflows list at bottom of Profile detail
page to be consistent with rest of application
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
This PR adds Identical Files to the QA Page Match Analysis meter bars.
To do this, the backend calculates the number of non-HTML pages once and
includes it under the key `Files` in each of the `screenshotMatch` and
`textMatch` QA stats return arrays.
The backend additionally removes the file count from "No Data" to
prevent these from being counted twice.
---------
Co-authored-by: emma <hi@emma.cafe>
Resolves https://github.com/webrecorder/browsertrix/issues/1409
### Changes
- Enables clicking on Browser Profiles column header to sort the table, including by starting URL
- More consistent column widths throughout app
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
This PR introduces backend changes that add the following fields to the
Profile model:
- `modified`
- `modifiedBy`
- `modifiedByName`
- `createdBy`
- `createdByName`
Modified fields are set to the same as the created fields when the
resource is created, and changed when the profile is updated (profile
itself or metadata).
The list profiles endpoint now also supports `sortBy` and
`sortDirection` options. The endpoint defaults to sorting by `modified`
in descending order, but can also sort on `created` and `name`.
Tests have also been updated to reflect all new behavior.
clean up adding user vs changing role logic:
- when adding user, ensure user doesn't exist
- when changing roles, ensure user does exist
add test for changing roles of existing user
Fixes #1821
To support #1683, it would be useful to be able to sort by 'last QA
start time' in addition to/instead of last QA state.
- make sorting consistent with workflow sorting
- sortBy fields renamed to lastQAState and lastQAStarted
- Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
As additional support for #1683, include the active QA stats in the
crawl response, along with active QA state.
This will allow showing progress of QA run in the archived items list.
Fixes #1659
Takes an arbitrary set of thresholds for text and screenshot matches as
a comma-separated list of floats.
Returns a list of groupings for each that include the lower boundary and
count for all thresholds passed in.
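One way to picture the grouping is the in-memory sketch below. The function names, the `lowerBoundary` / `count` keys, and the implicit `0.0` first boundary are all assumptions made for illustration; the endpoint may compute the groups differently (for example in a database aggregation).

```python
from bisect import bisect_right


def parse_thresholds(raw: str) -> list[float]:
    """Parse a comma-separated list of floats into sorted, unique values."""
    return sorted({float(part) for part in raw.split(",") if part.strip()})


def group_scores(scores: list[float], thresholds: list[float]) -> list[dict]:
    """Count scores per group, where each group starts at its lower boundary."""
    # Assumption: a 0.0 boundary always exists to catch low scores
    boundaries = sorted({0.0, *thresholds})
    counts = [0] * len(boundaries)
    for score in scores:
        # Index of the highest boundary that is <= score
        index = bisect_right(boundaries, score) - 1
        counts[max(index, 0)] += 1
    return [
        {"lowerBoundary": boundary, "count": count}
        for boundary, count in zip(boundaries, counts)
    ]
```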
Backend work for #1672
Adds new sort options to /crawls and /all-crawls GET list endpoints:
- `reviewStatus`
- `qaRunCount`: number of completed QA runs for crawl (also added to
CrawlOut)
- `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of
which are added to CrawlOut)
- Remove globals from profile, uploads, and qa test modules in favor of fixtures
- Add retries to fix intermittent test failures due to timing
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1648
- Tracks failed QA runs in database, not only successful ones
- Includes failed QA runs in list endpoint by default
- Adds `skipFailed` param to list endpoint to return only successful
runs
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1617
Filters added:
- reviewed: filter by page has approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts a comma-separated list of
strings, each of which is coerced into True, False, or None, or
ignored if it is an invalid value)
- hasNotes: filter by has at least one note (true) or not (false)
Tests have also been added to ensure that results are as expected.
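The coercion for `approved` can be sketched as below. The helper name and the accepted spellings (`true`, `false`, `none`, case-insensitive) are assumptions; only the coerce-or-ignore behaviour comes from the description above.

```python
from typing import Optional

# Hypothetical mapping of accepted spellings to approval values
APPROVED_VALUES = {"true": True, "false": False, "none": None}


def parse_approved(raw: str) -> list[Optional[bool]]:
    """Coerce a comma-separated string into approval values, skipping invalid ones."""
    values: list[Optional[bool]] = []
    for part in raw.split(","):
        key = part.strip().lower()
        if key in APPROVED_VALUES:
            values.append(APPROVED_VALUES[key])
        # Anything else is silently ignored rather than raising an error
    return values
```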
Supports running QA Runs via the QA API!
Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498
Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)
Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.
- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch)
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
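The range parameters map naturally onto MongoDB comparison operators. A hedged sketch, in which the function name and the flat field path are hypothetical (the real query presumably nests the field under a specific QA run):

```python
from typing import Optional

ALLOWED_QA_FIELDS = {"screenshotMatch", "textMatch"}


def build_qa_range_filter(
    qa_filter_by: str,
    gt: Optional[float] = None,
    gte: Optional[float] = None,
    lt: Optional[float] = None,
    lte: Optional[float] = None,
) -> dict:
    """Build a MongoDB-style range filter from the optional bound params."""
    if qa_filter_by not in ALLOWED_QA_FIELDS:
        raise ValueError(f"unsupported qaFilterBy value: {qa_filter_by}")
    bounds = {"$gt": gt, "$gte": gte, "$lt": lt, "$lte": lte}
    # Keep only the bounds that were actually supplied
    range_query = {op: value for op, value in bounds.items() if value is not None}
    return {qa_filter_by: range_query} if range_query else {}
```

For example, `build_qa_range_filter("textMatch", gte=0.5, lt=0.9)` yields a filter selecting pages whose text match falls in the half-open range from 0.5 up to 0.9.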
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #1597
New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.
After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.
Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.
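The batching step can be sketched with a small generator. The `batched` helper below is a hypothetical name, and the database insert itself is omitted; each yielded list would be passed to a single bulk insert call.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, size: int = 100) -> Iterator[list]:
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    # islice() pulls up to `size` items per pass; an empty list ends the loop
    while batch := list(islice(iterator, size)):
        yield batch
```

Because the helper consumes its input lazily, it also works on pages streamed line by line, without loading the whole file into memory first.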
StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
Allow maximum scale option to be fully configurable via
`max_crawl_scale`. Already configurable on the backend, and now exposed
to the frontend via API `/api/settings` `maxCrawlScale` value.
The workflow editor and workflow details are updated to allow selecting
the scale up to the maxCrawlScale setting (which defaults to 3 if not
set).
Fixes #1539
Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl
update API endpoint. Acceptable values are "good", "acceptable" or
"failure", enforced by an Enum.
Added to `BaseCrawl` so that we can extend support to uploads more
easily later on, but for now we'll only display this for crawls in the
frontend.
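A minimal sketch of enforcing the accepted values with an Enum. The class and member names are assumptions; only the three string values come from the description above.

```python
from enum import Enum


class ReviewStatus(str, Enum):
    """Accepted review status values (class and member names are hypothetical)."""

    GOOD = "good"
    ACCEPTABLE = "acceptable"
    FAILURE = "failure"


def parse_review_status(value: str) -> ReviewStatus:
    """Convert a raw string to a ReviewStatus, raising ValueError if invalid."""
    return ReviewStatus(value)
```

When such an Enum is used as a field type in a validated model, any other string is rejected at validation time, before it can reach the database.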
Fixes #1502
- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted
Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.
Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1385
## Changes
Supports multiple crawler 'channels' which can be configured to
different browsertrix-crawler versions
- Replaces `crawler_image` in helm chart with `crawler_channels` array
similar to how storages are handled
- The `default` crawler channel must always be provided and specifies
the default crawler image
- Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint
to fetch information about available crawler versions (name, image, and
label) and test
- Adds crawler channel select to workflow creation/edit screens and
profile creation dialog, and updates related API endpoints and
configmaps accordingly. The select dropdown is shown only if more than
one channel is configured.
- Adds `crawlerChannel` to workflow and crawl details.
- Add `image` to crawler image, used to display actual image used as
part of the crawl.
- Modifies `crawler_crawl_id` backend test fixture to use `test` crawler
version to ensure crawler versions other than latest work
- Adds migration to add `crawlerChannel` set to `default` to existing
workflow and profile objects and workflow configmaps
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>