Commit Graph

623 Commits

Author SHA1 Message Date
Ilya Kreymer
04c8b50423
add a crawling defaults on the Org to allow setting certain crawl workflow fields as defaults: (#2031)
- add POST /orgs/<id>/defaults/crawling API to update all defaults
(defaults unset are cleared)
- defaults returned as 'crawlingDefaults' object on Org, if set
- fixes #2016

---------

Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
2024-08-22 10:36:04 -07:00
Ilya Kreymer
86c9e538c1
quickfix: webhooks: ensure the 'crawl_reviewed' webhook is sent async, doesn't delay submitting a review (#2033)
make the call to `create_crawl_reviewed_notification` be called with
create_task (similar to other user-initiated webhook events), to avoid
extra wait for webhook to complete
2024-08-20 17:50:18 -07:00
Ilya Kreymer
8c9a14b6a2
Ensure Subscription Update doesn't update the gifted quotas (#2012)
- add a separate OrgQuotasIn where all quota updates are optional
- ensure gifted quotas are never updated as part of org update
- update tests
2024-08-20 13:15:03 -07:00
Tessa Walsh
916813af2d
Include user and user org info in login response (#2014)
Fixes #2013 

Adds the `/users/me` response data to the API login endpoint response
under the key `user_info` and adds a test.
2024-08-12 18:51:42 -07:00
Ilya Kreymer
d9f49afcc5
type fixes on util functions (#2009)
Some additional typing for util.py functions and resultant changes
2024-08-12 10:54:45 -07:00
Ilya Kreymer
12f994b864
QA: Count QA execution minutes separately for now (#2011)
For now, keep QA exec time separate, as it may be scaled differently and currently still in beta.
2024-08-09 13:13:21 -07:00
Ilya Kreymer
4ec7cf8adc
Additional operator edge case fixes (#2007)
Fix a few edge-case situations:
- Restart evicted pods that have reached the terminal `Failed` state
with reason `Evicted`, by just recreating them. These pods will not be
automatically retried, so need to be recreated (usually happens due to
memory pressure from the node)
- Don't treat containers in ContainerCreating as running, even though
this state is usually quick, its possible for containers to get stuck
there, and will improve accuracy of exec seconds tracking.
- Consolidate state transition for running states, either sets to
running or to pending-wait/generate-wacz/upload-wacz and allows changing
from to either of these states from each other or waiting_capacity
2024-08-09 13:12:25 -07:00
Ilya Kreymer
8ff1ad39a7 version: bump to 1.11.3 2024-08-08 15:16:18 -07:00
Ilya Kreymer
ed9038fbdb version: bump to 1.11.2 2024-08-07 12:37:26 -07:00
Ilya Kreymer
5f53db75ee
fix resetting of invalid logins: (#2002)
* Fixes issue in FailedLogin model:
- fix data-model to remove nested 'attempted.attempted'
- migrate existing data to remove nested field

* Also, avoid setting dt_now() in model as that results in fixed date for
all objects:
- update FailedLogin to update 'attempted' date on every attempt
- also update PageNote object to set date in constructor

* Update text for too many logins to make it clear it is set only if its a
valid email

* fixes #2001
2024-08-07 12:36:06 -07:00
Ilya Kreymer
41d43ae249
Fix forgot password for invalid user (#1999)
- fix validation error if user doesn'r exist
- always return success even if user doesn't exist for security reasons
- add test for forgot password endpoint
2024-08-07 11:02:40 -07:00
Ilya Kreymer
7fa2b61b29
Execution time tracking tweaks (#1994)
Tweaks to how execution time is tracked for more accuracy + excluding
waiting states:
- don't update if crawl state is in a 'waiting state' (waiting for
capacity or waiting for org limit)
- rename start states -> waiting states for clarity
- reset lastUpdatedTime if two consecutive updates of non-running state,
to ensure non-running states don't count, but also account for
occasional hiccups -- if only one update detects non-running state,
don't reset
- webhooks: move start webhook to when crawl actually starts for first
time (db lastUpdatedTime is not yet + crawl is running)
- don't set lastUpdatedTime until pods actually running
- set crawljob update interval to every 10 seconds for more accurate
execution time tracking
- frontend: show seconds in 'Execution Time' display
2024-08-06 09:44:44 -07:00
Ilya Kreymer
4a2725aaa6
operator: adjust state transition rules to ensure 'running' state always accounted for in db (#1989)
don't rely on current status, always set state to running when running
to ensure idempotency in case of multiple calls
2024-08-05 16:00:21 -07:00
Ilya Kreymer
1c153dfd3c
Subscription Update Quotas (#1988)
- Follow-up to #1914, allows SubscriptionUpdate event to also update
quotas.
- Passes current usage info + current billing page URL to portalUrl
request for external app to be able to respond with best portalUrl
- get_origin() moved to utils to be available more generally.
- Updates billing tab to show current plans, switches order of quotas to
list execution time, storage first
2024-08-05 15:59:47 -07:00
Ilya Kreymer
0c29008b7d version: bump to 1.11.1 2024-07-30 11:23:41 -07:00
Ilya Kreymer
894aa29d4b
remove crc32 from CrawlFile (#1980)
- no longer being used with latest stream-zip
- was not computed correctly in the crawler
- counterpart to webrecorder/browsertrix-crawler#657

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-07-30 11:23:15 -07:00
Ilya Kreymer
4aca107710 version: bump to 1.11.0 2024-07-29 12:52:39 -07:00
Ilya Kreymer
e9aeff1836
add a 'stopped_org_readonly' state for crawls that are running while org is made read-only (#1977)
an org is made read-only while crawls are running:
- treat similar to other stopped_* states, do a graceful stop
- update UI to display "Stopped: Crawling Disabled" for this status
- don't add corresponding skipped status - just skip running crawls if org is read-only
2024-07-29 12:24:40 -07:00
Ilya Kreymer
96691a33fa
Fix for cronjob skipping response (#1976)
If a cronjob is disabled, the operator should quickly return a success
value so that the job can be terminated.
Was previously returning an incorrect response, causing disabled
cronjobs to not be cleaned up. Add proper typing to always return correct response
2024-07-29 12:24:18 -07:00
Tessa Walsh
551660bb62
Add webhooks for qaAnalysisStarted, qaAnalysisFinished, and crawlReviewed (#1974)
Fixes #1957 

Adds three new webhook events related to QA: analysis started, analysis
ended, and crawl reviewed.

Tests have been updated accordingly.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-25 16:53:49 -07:00
Ilya Kreymer
94e985ae13
optimize org quota lookups (#1973)
- instead of looking up storage and exec min quotas from oid, and
loading an org each time, load org once and then check quotas on the org
object - many times the org was already available, and was looked up
again
- storage and exec quota checks become sync
- rename can_run_crawl() to more generic can_write_data(), optionally
also checks exec minutes
- typing: get_org_by_id() always returns org, or throws, adjust methods
accordingly (don't check for none, catch exception)
- typing: fix typo in BaseOperator, catch type errors in operator
'org_ops'
- operator quota check: use up-to-date 'status.size' for current job,
ignore current job in all jobs list to avoid double-counting
- follow up to #1969
2024-07-25 14:00:16 -07:00
Tessa Walsh
d38abbca7f
Standardize handling of storage and execution time quotas (#1969)
Fixes #1968 

Changes:
- `stopped_quota_reached` and `skipped_quota_reached` migrated to new
values that indicate which quota was reached
- Before crawls are run, the operator checks if storage or exec mins
quotas are reached and if so fails the crawl with the appropriate state
of `skipped_storage_quota_reached` or `skipped_time_quota_reached`
- While crawls are running, the operator checks if the exec mins quota
is reached or if the size of all running crawls will mean the storage
quota is reached once uploaded; if so, the crawl is stopped gracefully
and given `stopped_storage_quota_needed` or `stopped_time_quota_reached`
state as appropriate
- Adds new nightly tests for enforcing storage quota
2024-07-25 12:49:11 -07:00
Tessa Walsh
27ee16d308
Implement downloading archived item + QA runs as multi-WACZ (#1933)
Fixes #1412 

## Changes

### Backend
- Adds `all-crawls`, `crawls`, and `uploads` API endpoints to download
archived item as multi-WACZ
- Download QA runs as multi-WACZ
- Adds backend tests for new endpoints
- Update to new version of stream-zip library which does not require crc-32 to be present for ZIP members,
computes after streaming, fixing invalid crc-32 issues as previously computed crc-32s from crawler may be invalid.

### Frontend

Adds ability to download archived item from:

- Button in archived item detail Files tab
- Archived item details actions menu
- Archived items list menu



---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-25 10:28:57 -07:00
Ilya Kreymer
01ddf95a56
allow disabling of auto-resize of crawler pods (#1964)
- only enable if 'enable_auto_resize' is true, default to false
- if true, set memory limit to 1.2 of memory requests, resize when
hitting 'soft oom' of initial request, adjust by 1.2 (current behavior)
up to max_crawler_memory
- if false, set memory limit to max_crawler_memory and never adjust
memory requests or memory limits
- part of #1959
2024-07-23 21:00:40 -04:00
Ilya Kreymer
a8c5f07b7c
Add support e-mail to settings (#1960)
Adds support email to /api/settings
Also adds a response model for this endpoint and consolidates api tests

Addresses request in #1912
2024-07-23 20:58:12 -04:00
Tessa Walsh
a02f7a6826
Ensure lexical sort for org names (#1958)
Fixes #1955 

Orgs list endpoint sorting now works as follows:
- Default org is always sorted first
- Name sorting now works on a lowercased version of the org names to
ensure lexical sorting

The lodash `sortBy` resorting of orgs in the "All Organizations"
dropdown list in the nav bar has also been removed so that the backend
sorting is applied instead.

Tests have been updated accordingly.
2024-07-23 13:13:04 -07:00
Ilya Kreymer
8c0321bdea
Pydantic 2.x update + type fixes + python 3.12 (#1947)
* updates pydantic to 2.x
* also update to python 3.12
* additional type fixes:
- all Optional[] types must have a default value
- update to constrained types
- URL types converted from str
- test updates

Fixes #1940
2024-07-22 17:23:03 -07:00
Ilya Kreymer
cb909ffc95
api docs cleanup + readd webhooks: (#1949)
- readd webhooks (regression from #1941)
- set order of tags in docs
- add missing tag to route
2024-07-22 09:00:59 -07:00
Ilya Kreymer
cd00f52cca
Fix queue response models + additional testing for queue + exclusions (#1948)
Follow-up to regressions from #1928, this PR:
- Fixes response models for queue endpoints, which had incorrect model
- Adds tests for queue get, queue match, and exclusions add / remove to
ensure regressions like this can be caught via tests. This involves
starting a new crawl in test_run_crawls() instead of relying on implicit
running via fixtures, make it easier to test crawl while it's running.
- Adds additional typing for crawls apis, including making
delete_crawls() have correct typing, consistent derived class override
- Adds check to ensure queue + exclusion operations can not be called
when crawl is not running
2024-07-22 09:00:23 -07:00
Ilya Kreymer
0cc99044e7 quickfix: pin mypy version to avoid issues with latest release 2024-07-19 18:30:57 -07:00
Tessa Walsh
2237120cd5
Add API endpoint to recalculate org storage (#1943)
Fixes #1942 

This process might be a bit slow for large orgs, may consider moving it to background job in #1898.
2024-07-19 18:29:20 -07:00
Tessa Walsh
6ccaad26d8
Ensure org name and slug uniqueness is case-insensitive (#1929)
Fixes #1927 

Also adds tests to ensure index is working as expected, and migration to
rename orgs that have names or slugs identical to other orgs except for
case before the new case-insensitive index is built.
2024-07-18 15:30:12 -07:00
Ilya Kreymer
b1ccdc4d16
OpenAPI Metadata for API Endpoints (#1941)
- Updates the `/docs` and `/redoc` API endpoints to have better metadata,
including using Browsertrix favicon and our logo for the `/redoc` endpoint.
- add new logo file 'docs-logo.svg' to root

Based on info at:
https://fastapi.tiangolo.com/how-to/extending-openapi/
https://fastapi.tiangolo.com/tutorial/metadata/

---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-07-18 11:11:38 -07:00
Tessa Walsh
3bf7967754
Fix regression with saving new workflow due to profileid type error (#1946)
Fixes #1945
2024-07-18 09:35:52 -07:00
Tessa Walsh
c772ee2362
Fix response model for crawl errors API endpoint (#1939)
Follow-up fix for #1920 for crawl errors endpoint, which returns a 500
following #1928, caught in nightly tests.
2024-07-17 10:52:14 -07:00
Ilya Kreymer
335700e683
Additional typing cleanup (#1938)
Misc typing fixes, including in profiles and time functions

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-07-17 10:49:22 -07:00
Ilya Kreymer
4db3053a9f
fix crawlFilenameTemplate + add_crawl_config cleanup (fixes #1932) (#1935)
- ensure crawlFilenameTemplate is part of the CrawlConfig model
- change CrawlConfig init to use type-safe construction
- add a run_now_internal() that is shared for starting crawl, either on
demand or from new config
- add OrgOps.can_run_crawls() to check against org quotas for crawling
- cleanup profile updates, remove _lookup_profile, only check for
EmptyStr in update

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-07-17 10:48:25 -07:00
Ilya Kreymer
27059c91a5 version: bump to 1.11.0-beta.1 2024-07-17 10:06:49 -07:00
Tessa Walsh
60afb19472
Add API endpoint to import subscription for existing org (#1930)
Fixes #1926 

- adds /subscriptions/import endpoint for importing an existing subscription to an existing org
- add SubscriptionImport object and log as 'import' event in subscription events collection

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-16 16:17:02 -07:00
Tessa Walsh
d41647e6c2
Document all API endpoints with response models (#1928)
Fixes #1920 

Adds response models to all API endpoints that were missing them,
documenting current behavior without making any changes at this stage to
standardize responses.

Follow-up work will involve adding generics to some of the response models
2024-07-16 12:48:38 -07:00
Tessa Walsh
aaf18e70a0
Add created date to Organization and fix datetimes across backend (#1921)
Fixes #1916

- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
2024-07-15 19:46:32 -07:00
Tessa Walsh
a546fb6fe0
Improve handling of duplicate org name/slug (#1917)
Initial implementation of #1892 

- Modifies the backend to return `duplicate_org_name` or
`duplicate_org_slug` as appropriate on a pymongo `DuplicateKeyError`
- Updates frontend to handle `duplicate_org_name`, `duplicate_org_slug`,
and `invalid_slug` error details
- Update errors to be more consistent, also return `duplicate_org_subscription.subId` for duplicate subscription instead of the more generic `already_exists`
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-10 19:24:50 -07:00
Ilya Kreymer
9a67e28f13
Adds Subscription API (#1914)
Fixes https://github.com/webrecorder/browsertrix/issues/1905

- adds a new top-level `/api/subscriptions` endpoint and SubOps handler on
the backend.
- enable subscriptions API endpoints available only if `billing_enabled` is
set in helm chart
- new POST /subscriptions/create, /subscriptions/update,
/subscriptions/cancel API endpoints
- Subscriptions mongo collection storing timestamped /subscription
API events
- GET /subscriptions/events API to get subscription events, support for filtering and sorting
- Subscription data model 
- Support for setting and handling readOnlyOnCancel on org
- /orgs/<id>/billing-portal to lookup portalUrl using external API
- subscription in org getter and list views
- mark org as readOnly for subscription status `paused_payment_failed`, clears it on status `active`

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-07-10 17:41:16 -07:00
Tessa Walsh
5aa0ab62cb
Add nightly backend tests for org deletion while browsers are running (#1919)
Fixes #1918
2024-07-10 16:52:27 -07:00
sua yoo
c97900ec2b
Merge branch 'main' into frontend-org-manage-readonly 2024-07-08 11:20:30 -07:00
Tessa Walsh
192737ea99
Add API endpoint to delete org (#1448)
Fixes #903 

Adds superuser-only API endpoint to delete an org and all of its data

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-03 16:00:11 -04:00
Tessa Walsh
bca05ac185 Fix typing 2024-07-03 11:25:01 -04:00
Tessa Walsh
497cfdc561
Merge branch 'main' into frontend-org-manage-readonly 2024-07-03 11:15:47 -04:00
Tessa Walsh
787ebc8738 Add one more pylint disable comment 2024-07-03 11:14:46 -04:00
Tessa Walsh
5a563d20d9 Fix linting issues 2024-07-03 11:10:10 -04:00
Tessa Walsh
d3fb33a78a Add and apply backend sorting for org list
The default org will always be sorted first, regardless of sort options.
Orgs after the first will be sorted by name ascending by default.
Sorting currently supported on name, slug, and readOnly.
2024-07-03 11:01:01 -04:00
Ilya Kreymer
1c42e21b8a
Refactor Invites and Registration, Flatten Per-User Invites (#1902)
Fixes #1432

Refactors the invite + registration system to be simpler and more consistent
with regards to existing user invites. Previously, per-user invites are
stored in the user.invites dict instead of in the invites collection,
which creates a few issues:
- Existing user do not show up in Org Invites list: #1432 
- Existing user invites also do not expire, unlike new user invites,
creating potential security issue.

Instead, existing user invites should be treated like new user invites.
This PR moves them into the same collection,
adding a `userid` field to InvitePending to match with an existing user.

If a user already exists, it will be matched by userid, instead of by
email. This allows for user to update their email while still being
invited. Note that the email of the invited existing user will not
change in the invite email. This is also by design: an admin of one org
should not be given any hint that an invited user already has an
account, such as by having their email automatically update. For an org
admin, the invite to a new or existing user should be indistinguishable.

The sha256 of invite token is stored instead of actual token for better
security.

The registration system has also been refactored with the following
changes:
- Auto-creation of new orgs for new users has been removed
- User.create_user() replaces the old User._create() and just creates the user with
additional complex logic around org auto-add
- Users are added to org in org add_user_to_org()
- Users are added to org through invites with add_user_with_invite()

Tests:
- Additional tests include verifying that existing and new pending
invites appear in the pending invites list
- Tests for `/users/invite/<token>?email=` and
`/users/me/invite/<token>` endpoints
- Deleting pending invites
- Additional tests added for user self-registration, including existing
user self-registration to default org of existing user (in nightly
tests)
2024-07-02 15:13:27 -07:00
Tessa Walsh
f076e7d9e3
Add superuser API endpoints to export and import org data (#1394)
Fixes #890 

This PR introduces new streaming superuser-only API endpoints to export
and import database information for an organization. New Adminstrator
deployment documentation on how to manage the process and copy files
between S3 buckets as needed is also included.

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-07-02 17:14:34 -04:00
Tessa Walsh
bdfc0948d3
Disable uploading and creating browser profiles when org is read-only (#1907)
Fixes #1904 

Follow-up to read-only enforcement, with improved tests.
2024-07-01 23:15:38 -07:00
Ilya Kreymer
e1ef894275
Extends Org Create endpont + shared secret auth (#1897)
Updates the /api/orgs/create endpoint to:
- not have name / slug be required, will be renamed on first user via
#1870
- support optional quotas
- support optional first admin user email, who will receive an invite to
join the org.

Also supports a new shared secret mechanism, to allow an external
automation to access the /api/orgs/create endpoint (and only that
endpoint thus far) via a shared secret instead of normal login.
2024-07-01 09:37:02 -07:00
Ilya Kreymer
3cd52342a7
Remove Crawl Workflow Configmaps (#1894)
Fixes #1893 

- Removes crawl workflow-scoped configmaps, and replaces with operator-controlled
per-crawl configmaps that only contain the json config passed to Browsertrix
Crawler (as a volume).
- Other configmap settings replaced are replaced the custom CrawlJob options
(mostly already were, just added profile_filename and storage_filename)
- Cron jobs also updated to create CrawlJob without relying on configmaps,
querying the db for additional settings.
- The `userid` associated with cron jobs is set to the user that last modified
 the schedule of the crawl, rather than whomever last modified the workflow
- Various functions that deal with updating configmaps have been removed,
including in migrations.
- New migration 0029 added to remove all crawl workflow configmaps
2024-06-28 15:25:23 -07:00
Tessa Walsh
8a904c9031
feat: Rename org when accepting org invite for first admin (#1870)
Resolves https://github.com/webrecorder/browsertrix/issues/1874

Support for new two-part sign up flow if first admin user is added to org
- If new user, user registers first, then is able to change the org name / slug on following screen
- If existing user, user accepts invite, then is able to change the org name / slug on following screen
- After confirming org slug name, user is taken to dashboard, or error is shown if org name or slug already taken.
- If org name == org id, org name and slug is automatically set to `{Your Name}'s Archive` when first user is registered / accepts invite
- Email templates updated to better reflect new / existing users and not show org name if it is 'unset' (org name == org id internally)
- tests: frontend unit testing for accept + invite screens.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
2024-06-27 16:08:31 -07:00
Tessa Walsh
b7631d1b91
Add slug validation and test (#1891)
Fixes #1890 

Adds validation for org slugs, ensuring that they contain only ASCII
alphanumeric characters and dashes (`-`). If an invalid slug is
provided, an HTTPException is returned with status code 400 and detail
`invalid_slug`.
2024-06-26 15:04:54 -04:00
Ilya Kreymer
6df10d5fb0
Improved Scale Handling (#1889)
Fixes #1888 

Refactors scale handling:
- Ensures number of scaled instances does not exceed number of pages,
but is also at minimum 1
- Checks for finish condition to be numFailed + numDone >= desired scale
- If at least one instance succeeds, crawl considers successful / done.
- If all instances fail, crawl considered failed
- Ensures that pod done count >= redis done count

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-26 10:24:45 -07:00
Tessa Walsh
9140dd75bc
Add and enforce readOnly field in Organization (#1886)
Fixes https://github.com/webrecorder/browsertrix/issues/1883
Backend work for https://github.com/webrecorder/browsertrix/issues/1876

- If readOnly is set true, disallow crawls and QA analysis runs
- If readOnly is set to true, skip scheduled crawls
- Add endpoint to set `readOnly` with optional `readOnlyReason` (which
is automatically set back to an empty string when `readOnly` is being
set to false), which can be displayed in banner
- Operator: ensures cronjobs that are skipped due to internal logic (eg. readonly mode) simply succeed right away and do not leave a k8s job dangling.

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-06-25 19:30:53 -07:00
Ilya Kreymer
3bd714ea9d
QA stats aggregation: exclude isFile / isError pages from stats (#1879)
Follow-up to: #1868, exclude pages that have isFile or isError set to
true from the stats aggregation.
2024-06-25 08:54:42 -07:00
Tessa Walsh
7af3980323
Add billing enabled and sales email to Helm chart and /settings API endpoint (#1873)
Backend work for first two tasks of
https://github.com/webrecorder/browsertrix/issues/1875

New /billing API endpoint to be added separately once we have a better
idea of what data we can get from the payment processor.
2024-06-25 10:55:29 -04:00
Tessa Walsh
879e509b39 Backend: Move page file and error counts to crawl replay.json endpoint (#1868)
Backend work for #1859

- Remove file count from qa stats endpoint
- Compute isFile or isError per page when page is added
- Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages
- Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount)
- Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl
- Determine if page is a file based on loadState == 2, mime type or status code and lack of title
2024-06-20 19:02:57 -07:00
Ilya Kreymer
553e2e352b
Merge branch 'main' into 1.10.2-release 2024-06-12 23:59:56 -07:00
Ilya Kreymer
fa6627ce70
ensure QA configmap is updated for long running QA runs: (#1865)
- add a 'expire_at_duration_seconds' which is 75% of actual presign
duration time, or <25% remaining until presigned URL actually expires to
ensure presigned URLs are updated early than when they actually expire
- set cached expireAt time to the renew at time for more frequent
updates
- update QA configmap in place with updated presigned URLs when expireAt
time is reached
- mount qa config volume under /tmp/qa/ without subPath to get automatic
updates, which crawler will handle
- tests: fix qa test typo (from main)
- fixes #1864
2024-06-12 10:51:35 -07:00
Tessa Walsh
8b0d1432af
Show QA meter while analysis is running (#1854)
Fixes #1846 

- Ensure meter auto-updates as new stats are ready
- Switch meter to new QA run when new analysis run is started
- Remove Files from QA meter (files and errors will be reported separately)

Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
2024-06-12 12:32:01 -04:00
Ilya Kreymer
1ab6ec325b version: bump to 1.10.2 2024-06-11 16:28:40 -07:00
Ilya Kreymer
2ffb37bd14
tests: fix typo in waiting for qa run to stop test! (#1857)
Fixes not properly testing if activeQA is null, hopefully fixes
intermittent test failures!
2024-06-11 11:07:55 -04:00
Tessa Walsh
4edc05d503
Use standard firstSeed/seedCount fallback for workflows with no name in profile details (#1852)
Fixes #1833 

- Add firstSeed and seedCount to workflow information in profile detail
API endpoint (tests updated accordingly), update name of model used for
limited workflow information to be more accurate
- Fix name display in Crawl Workflows list at bottom of Profile detail
page to be consistent with rest of application

---------

Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
2024-06-06 14:28:19 -04:00
Tessa Walsh
a85f9496b0
Include number of Identical Files in QA stats and meter (#1848)
This PR adds Identical Files to the QA Page Match Analysis meter bars.
To do this, the backend calculates the number of non-HTML pages once and
includes it under the key `Files` in each of the `screenshotMatch` and
`textMatch` QA stats return arrays.

The backend additionally removes the file count from "No Data" to
prevent these from being counted twice.

---------

Co-authored-by: emma <hi@emma.cafe>
2024-06-06 13:15:19 -04:00
Ilya Kreymer
e3ee63f9b0 version: bump to 1.11.0-beta.0 2024-06-04 13:37:44 -07:00
sua yoo
4d4c8a04d4
feat: User-sort browser profiles list (#1839)
Resolves https://github.com/webrecorder/browsertrix/issues/1409

### Changes

- Enables clicking on Browser Profiles column header to sort the table, including by starting URL
- More consistent column widths throughout app

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-06-04 13:57:03 -04:00
Ilya Kreymer
d42de92d75
QA analysis scale configurable in helm chart (#1843)
- allow configuring QA run scale via 'qa_scale' setting in helm values
(overriding any setting on the qa crawljob)
- adds additional comments to browser instances helm values settings for clarity
- fixes #1842
2024-05-30 12:59:21 -07:00
Ilya Kreymer
61239a40ed
include workflow config in QA runs + different browser instances for QA (#1829)
Currently, the workflow crawl settings were not being included at all in
QA runs.
This mounts the crawl workflow config, as well as QA configmap, into QA
run crawls, allowing for page limits from crawl workflow to be applied
to QA runs.

It also allows a different number of browser instances to be used for QA
runs, as QA runs might work better with less browsers, (eg. 2 instead of
4). This can be set with `qa_browser_instances` in helm chart.

Default qa browser workers to 1 if unset (for now, for best results)

Fixes #1828
2024-05-29 13:32:25 -07:00
Tessa Walsh
18e5ed94f1
Add migration to set profile modified date (#1832)
If modified field doesn't exist on older browser profiles, set it equal
to created date to match what we do for new profile creation.
2024-05-29 15:56:27 -04:00
Tessa Walsh
7e5d742fd1
Backend: Add modified field and track created/modifier users for profiles (#1820)
This PR introduces backend changes that add the following fields to the
Profile model:
- `modified`
- `modifiedBy`
- `modifiedByName`
- `createdBy`
- `createdByName`

Modified fields are set to the same as the created fields when the
resource is created, and changed when the profile is updated (profile
itself or metadata).

The list profiles endpoint now also supports `sortBy` and
`sortDirection` options. The endpoint defaults to sorting by `modified`
in descending order, but can also sort on `created` and `name`.

Tests have also been updated to reflect all new behavior.
2024-05-28 17:25:22 -04:00
Ilya Kreymer
85cd214101
Fix regression to changing user roles via PATCH /user-role API (#1824)
clean up adding user vs changing role logic:
- when adding user, ensure user doesn't exist
- when changing roles, ensure user does exist

add test for changing roles of existing user

Fixes #1821
2024-05-24 10:41:05 -07:00
Ilya Kreymer
4b6dd97c11 version: bump to 1.10.1 2024-05-23 22:24:58 -07:00
Ilya Kreymer
e853b62401 version: update to 1.10.0! 2024-05-20 19:30:22 -07:00
Ilya Kreymer
1ca107fa2b
quickfix: fix for additional lint error in updated pylint (#1805) 2024-05-15 17:00:15 -07:00
Ilya Kreymer
94d57b98ce version bump to 1.10.0-beta.7 2024-05-15 11:30:05 -07:00
Ilya Kreymer
26abbddd62
regression fix: remove 'project' from non-raw crawl object getters, (#1778)
unnecessary projection prevents creating Crawl object, now removed

Fixes #1777
2024-05-02 01:02:22 +02:00
Ilya Kreymer
e022994f4e version: update to 1.10.0-beta.6 2024-04-30 20:34:11 +02:00
Ilya Kreymer
375057a819
check that status.lastUpdatedTime is set before attempting to subtract! (#1754)
Don't subtract none value!
2024-04-30 20:33:46 +02:00
Ilya Kreymer
a3911f6a8a version: bump to 1.10.0-beta.5 2024-04-25 09:00:54 +02:00
Ilya Kreymer
f6c0791dc1
fix missing settings / typos: (#1748)
- ensure max_crawler_memory_size is inited before it is set!
- pass profile_browser_memory / profile_browser_cpu from chart values
- map volume to /tmp/home to avoid persisting /tmp for profiles
2024-04-25 09:00:17 +02:00
Ilya Kreymer
a09f565ce5 version: bump to 1.10.0-beta.4 2024-04-24 16:53:39 +02:00
Ilya Kreymer
b061a39d5f
Handle registration when user already exists (#1744)
Fixes https://github.com/webrecorder/browsertrix/issues/1743

On the backend, support adding user to new org and improved error
messaging:
- if user exists is not part of the org to be added to, add user to the
registration org, return 201
- if user is already part of the org, return 400 with
'user_already_is_org_member' error
- if user is not being added to a new org but already exists, return
'user_already_exists'

frontend:
- if user user same password, they will just be logged in and added to
registration org (same as before)
- if user uses wrong password, show alert message
- handle both user_already_is_org_member and user_already_registered
with alert message and link to log-in page.

note: this means existing user is added to the registration org even if
they provide wrong password, but they won't be able
to login until they use correct password/reset/etc...
2024-04-24 16:40:25 +02:00
Ilya Kreymer
f286f04130
Typo Fix: fix typos in setting max crawler memory (#1747)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-24 10:29:09 -04:00
Ilya Kreymer
f89027ac89 version: 1.10.0-beta.3 2024-04-24 15:45:17 +02:00
Ilya Kreymer
ec74eb4242
operator: add 'max_crawler_memory' to limit autosizing of crawler pods (#1746)
Adds a `max_crawler_memory` chart setting, which, if set, will
defines the upper crawler memory limit that crawler pods can be resized up to.
If not set, auto resizing is disabled and pods are always set to 'crawler_memory' memory
2024-04-24 15:16:32 +02:00
Ilya Kreymer
41655ef829 version: bump to 1.10.0-beta.2 2024-04-23 23:19:16 +02:00
Ilya Kreymer
b94070160b
allow configuring designated registration org to which new users can register (#1735)
if 'registration_enabled' is set, check 'registration_org_id' for org id
of an existing org that new users should be added to when they register.
if omitted, default to the default org

Fixes #1729
2024-04-23 17:11:37 -04:00
sua yoo
1915274e26
Fix QA review comments (#1723)
Fixes https://github.com/webrecorder/browsertrix/issues/1710

Fixes date and deletion for newly added comments.
2024-04-23 16:31:52 -04:00
Tessa Walsh
b8caeb88e9
Ensure QA run WACZs are deleted (#1715)
- When qa run is deleted
- When crawl is deleted

And adds tests for WACZ deletion.

Fixes #1713
2024-04-22 18:04:09 -04:00
Ilya Kreymer
1844e761dc
Support sorting by last QA started time (#1712)
To support #1683, it would be useful to be able to sort by 'last QA
start time' in addition to/instead of last QA state.
- make sorting consistent with workflow sorting
- sortBy fields renamed to lastQAState and lastQAStarted
- Current QA runs are now included in the lastQAState/lastQAStarted fields, rather than being separated out to different values

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-22 13:00:52 -07:00
Ilya Kreymer
b574f00d2b
Add Repository Index + Chart Rename + Docs Rename (#1708)
Repository Index: Generate an index.yaml in ./docx/helm-repo/index.yaml
to allow for browsertrix to be a helm repository.
docs: rename docs.browsertrix.cloud -> docs.browsertrix.com
docs: update deployment doc to mention helm repo as preferred way to
install
docs build action: generate repository index in GH action
publish action: update auto-generated message to mention installing from
the repo.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-21 09:42:25 -07:00
Ilya Kreymer
4360e0c1b5
Update tests with latest crawler (#1711)
tests: use 'latest' crawler release for testing, now that 1.1.x is released.
2024-04-20 15:56:26 -07:00
Tessa Walsh
80008a2853
Add post load delay to Browsertrix (#1700)
Fixes #1699 

Adds post load delay to:
- Backend `RawCrawlConfig` model
- Frontend (workflow editor and config details component)
- Workflow setup docs
2024-04-18 20:03:47 -07:00
Ilya Kreymer
9609ff4194
Add 'activeQAStats' field (#1694)
As additional support for #1683, include the active QA stats in the
crawl response, along with active QA state.
This will allow showing progress of QA run in the archived items list.
2024-04-18 10:05:39 -04:00
Tessa Walsh
b87860c68a
Ensure /all-crawls?sortBy=qaState always sorts crawls above uploads (#1691)
Follow-up to #1686
2024-04-17 19:14:29 -07:00
Tessa Walsh
30ab139ff2
Add QA run aggregate stats API endpoint (#1682)
Fixes #1659 

Takes an arbitrary set of thresholds for text and screenshot matches as
a comma-separated list of floats.

Returns a list of groupings for each that include the lower boundary and
count for all thresholds passed in.
2024-04-17 13:24:18 -04:00
Ilya Kreymer
835014d829
restrict qa runs to a 'min_qa_crawler_image' if set in the chart (#1685)
- fixes #1684
- can be used to optionally restrict QA to only some crawls (eg. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
2024-04-17 08:48:33 -07:00
Tessa Walsh
c800da1732
Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686)
Backend work for #1672 

Adds new sort options to /crawls and /all-crawls GET list endpoints:

- `reviewStatus`
- `qaRunCount`: number of completed QA runs for crawl (also added to
CrawlOut)
- `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of
which are added to CrawlOut)
2024-04-16 23:54:09 -07:00
Tessa Walsh
87e0873f1a
Add mime field to Page model (#1678) 2024-04-17 00:57:49 -04:00
Vinzenz Sinapius
1b034957ff
Improve reliability of backend tests (#1675)
- Remove globals from profile, uploads, and qa test modules in favor of fixtures
- Add retries to fix intermittent test failures due to timing

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-04-16 14:22:41 -04:00
Ilya Kreymer
95f5605af7
renumber crawl priority classes: (#1673)
- priority classes <-10 are ignored by cluster-autoscaler so QA jobs
with too low priorities never run
- start crawl priorities at 0 going down (same as before)
- start qa run priorities at -2 going down (instead of -100)
- this means a crawl of with scale of 3 can be preempted by 1st qa pod,
but otherwise crawls have higher priority
- rename priority classes as they are otherwise immutable and error on
helm upgrade

This allows for more room in lower pri classes for other type of
objects, while keeping in mind the -10 and below threshold: (see:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
2024-04-13 12:24:43 -07:00
Ilya Kreymer
f243d34395
Remove pages from QA Configmap (#1671)
Fixes #1670 

No longer need to pass pages to the ConfigMap. The ConfigMap has a size
limit and will fail if there are too many pages.

With this change, the page list for QA will be read directly from the
WACZ files pages.jsonl / extraPages.jsonl entries.
2024-04-12 16:04:33 -07:00
Tessa Walsh
172a9bf0cd
Change crawl.reviewStatus to 1-5 scale int (#1664) 2024-04-09 17:51:06 -07:00
Ilya Kreymer
a7cda3b11b version: bump to 1.10.0-beta.1 2024-04-05 18:24:14 -07:00
Tessa Walsh
4229b94736
Track failed QA runs and include in list endpoint (#1650)
Fixes #1648 

- Tracks failed QA runs in database, not only successful ones
- Includes failed QA runs in list endpoint by default
- Adds `skipFailed` param to list endpoint to return only successful
runs

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-04-04 18:51:06 -07:00
Ilya Kreymer
5c08c9679c
fix issue with incorrect number of total pages if any of the seeds is a redirect (#1649)
Following changes in webrecorder/browsertrix-crawler#475,
webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed
to the seen list. To account for this, it needs to be subtracted to get
the total page count.
2024-04-04 15:55:44 -07:00
sua yoo
83c9203a11
Initial QA Review UI! (#1624)
QA Details page:
- Enables QA tab with ability to start automated analysis QA Run + view a and manual review status
- Pages listed with review status + overall crawl review status shown on QA details (relates to #1508)
- Initial placeholder for QA run analytics (part of #1589)
- Addresses a good deal of #1477

Automated Analysis QA in Review Mode:
- Ability to select from multiple analysis QA runs / view QA runs in QA details
- Shows analysis screenshot, text and resources compare and replay tabs (fixes #1496)
- Sorting by worst screenshot / worst text score for each QA run
- Includes pages sidebar with screenshot/text/resource compare results (fixes #1497)

Manual Review QA in Review Mode:
- Per-page replay available as separate tab (fixes #1499)
- Supports thumbs up, thumbs down, notes for each page
- Supports entering review status approval (good/acceptable/bad can be entered when finishing review

---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-04-04 15:09:52 -07:00
Ilya Kreymer
ffc4b5b58f
operator state fixes (follow up fomr #1639) (#1640)
- increase time for going to waiting_capacity from starting to 150
seconds
- relax requirement for state transitions, allow complete from waiting
- additional type safety for different states, ensure mark_finished()
only called with non-running states, add `Literal` types for all the
state types.
2024-03-29 15:12:16 -07:00
Ilya Kreymer
3438133fcb
Crawler pod memory padding + auto scaling (#1631)
- set memory limit to 1.2x memory request to provide extra padding and
avoid OOM
- attempt to resize crawler pods by 1.2x when exceeding 90% of available
memory
- do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of
requested memory, resulting in faster graceful restart, but avoiding a
system-instant OOM Kill
- Fixes #1632

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-28 16:39:00 -07:00
Tessa Walsh
00ced6dd6b
Add single page QA GET endpoint (#1635)
Fixes #1634 

Also make sure other get page endpoint without qa uses PageOut model
2024-03-27 14:57:59 -07:00
Tessa Walsh
66b4532321
Give test_crawl_timeout 10 mins to finish (#1627)
Related to https://github.com/webrecorder/browsertrix-cloud/issues/1620

Follow-up to https://github.com/webrecorder/browsertrix-cloud/pull/1621,
which didn't seem to fix the problem.

I'm giving it much more time here in the hopes that it solves it (since
it's a nightly test, time shouldn't be such a pressing issue).
2024-03-26 18:33:30 -07:00
Tessa Walsh
e9895e78a2
Add additional filters to page list endpoints (#1622)
Fixes #1617 

Filters added:

- reviewed: filter by page has approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts list of strings,
comma-separated, each of which are coerced into True, False, or None, or
ignored if they are invalid values)
- hasNotes: filter by has at least one note (true) or not (false)

Tests have also been added to ensure that results are as expected.
2024-03-21 21:33:07 -07:00
Tessa Walsh
b3b1e0d7d8
Fix intermittent crawl timeout test failure (#1621)
Fixes #1620 

This increases the total timeout from 60 seconds to 120 seconds for
crawl to complete, which should be sufficient given how intermittently
the failure has been happening. Can increase it further if needed.
2024-03-21 17:18:27 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
Tessa Walsh
21ae38362e
Add endpoints to read pages from older crawl WACZs into database (#1562)
Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.

After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.

Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.

StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
2024-03-19 14:14:21 -07:00
Ilya Kreymer
5a4902b6d4
kubernetes api: avoid overriding content-type header in kubernetes-asyncio, pass in via arg instead (main) (#1605)
- instead of overriding the content-type header globally, pass
'application/merge-patch+json' to
self.custom_api.patch_namespaced_custom_object() directly
- bump kubernetes-asyncio to 29.0.0
- fixes potential issues with global override of the header in
kubernetes-asyncio
- copy of #1602 for main
2024-03-18 11:17:54 -07:00
Ilya Kreymer
e7af081af1
profile browser fixes: better resource usage + load retry (main) (#1604)
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers

- Frontend: check that profile html page is loading, keep retrying if
still getting nginx error instead of loading an iframe with the error.

Fixes #1598 (Copy of #1599 from 1.9.4)
2024-03-16 15:07:04 -07:00
wvengen
6278157f40
Make storage deletion work on more S3 providers, don't use access URL for deletion (#1600)
I came across [this
problem](https://forum.webrecorder.net/t/deleting-crawl-failure/512) and
noticed that the access URL is used when deleting files, causing my file
deletions to fail on OpenStack SWIFT S3 (relates to #1090). This trivial
change makes it work there.
2024-03-16 04:17:23 -04:00
Ilya Kreymer
08f6847194
Configurable Max Scale for frontend (#1557)
Allow maximum scale option to be fully configurable via
`max_crawl_scale`. Already configurable on the backend, and now exposed
to the frontend via API `/api/settings` `maxCrawlScale` value.

The workflow editor and workflow details are updated to allow selecting
the scale up to the maxCrawlScale setting (which defaults to 3 if not
set).
2024-03-11 16:21:20 -07:00
Ilya Kreymer
ea494fa6e6
Merge V1.9.3 changes into main (#1583)
- Fix execution time checking by keeping lastUpdatedTime in db by
@ikreymer in https://github.com/webrecorder/browsertrix-cloud/pull/1573
- disable postcss-lit for var css
- Prevent closing tooltips from closing collection share dialog by
@SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1579
- Fix pending exclusion pagination by @SuaYoo in
https://github.com/webrecorder/browsertrix-cloud/pull/1578
- Fix regex escape in exclusion editor text match by @SuaYoo in
https://github.com/webrecorder/browsertrix-cloud/pull/1577

---------
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
2024-03-06 15:38:22 -08:00
Tessa Walsh
c20e754269
Add updatable QA reviewStatus field to crawls (#1575)
Fixes #1539 

Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl
update API endpoint. Acceptable values are "good", "acceptable" or
"failure", enforced by an Enum.

Added to `BaseCrawl` so that we can extend support to uploads more
easily later on, but for now we'll only display this for crawls in the
frontend.
2024-03-05 16:49:23 -08:00
Tessa Walsh
ec0db1c323
Temporarily remove pages migration (#1572)
Removing until we have a better tested solution, including to avoid testing of QA runs for new crawls in beta.
2024-03-04 10:30:04 -08:00
Ilya Kreymer
09a0d51843
pages: set page status to 200 if unset and loadState != 0 (#1563)
Follow up to #1516, ensure page status is set to 200 if no status is
provided, if loadState is not 0
2024-02-29 15:15:17 -08:00
Ilya Kreymer
2ac6584942
Refactor operator class into module (#1564)
The operator class has gotten fairly large, this is a first pass in
refactoring operator.py into a submodule instead, with multiple operator
instances which handle different types of objects.

- The main k8s interface has been split into K8sOpApi which extends K8sApi
and is shared across all operators.
- Each operator extends BaseOperator which also has an instance of K8sOpApi
- The CrawlOperator is still the bulk of the functionality, but will likely be further refactored
to support QA jobs

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-02-29 14:40:12 -08:00
Tessa Walsh
da19691184
Add crawl errors incrementally during crawl (#1561)
Fixes #1558 

- Adds crawl errors to database incrementally during crawl rather than
after crawl completes
- Simplifies crawl /errors API endpoint to always return errors from
database
2024-02-29 09:16:34 -08:00
Ilya Kreymer
804f755787
Increase startup probe time to account for long-running migrations (#1560)
- increases the failureThreshold for startupProbe for the api backend
container to account for long running migrations, upto 300 seconds
- add `/healthzStartup` which checks if db is ready
- bump 
- keeps `/healthz` to always return 200 when running
- increases livenessProbe failureThreshold to be higher than readiness
probe, following recommended best practice of liveness probe > readiness
probe
- fixes #1559
2024-02-28 14:22:33 -08:00
Tessa Walsh
14189b7cfb
Add crawl pages and related API endpoints (#1516)
Fixes #1502 

- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted

Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.

Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-28 12:11:35 -05:00
Ilya Kreymer
8ae032ff88 More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. (#1537)
Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug
[crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler
1.0.0-beta.4 or higher
If crawl name is provided, uses crawl name, other hostname of first
seed. The name is 'sluggified', using lowercase alphanum characters
separated by dashes.

Ex: in an organization called `Default Org`, a crawl of
`https://specs.webrecorder.net/` and no name will have WARCs named:
`default-org-specs-webrecorder-net-....warc.gz`
If the crawl is given the name `SPECS`, the WARCs will be named
`default-org-specs-manual-....warc.gz`

Fixes #412 in a default way.
2024-02-22 23:54:23 -08:00
Ilya Kreymer
a8e3ff1141 version: bump to 1.10.0-beta.0 2024-02-20 00:22:29 -08:00
Ilya Kreymer
c1cffe9ecd version: bump to 1.9.1 2024-02-16 09:44:18 -08:00
Ilya Kreymer
64bf21311d version: bump to 1.9.0! 2024-02-14 13:30:46 -08:00
Ilya Kreymer
1d266e3cea bump to 1.9.0.beta.5 2024-02-12 18:29:39 -08:00
Ilya Kreymer
4bc8152640 version: bump to 1.9.0-beta.4 2024-02-09 16:17:13 -08:00
Ilya Kreymer
0653657637
better handling of failed redis connection + exec time updates (#1520)
This PR addresses a possible failure when Redis pod was inaccessible
from Crawler pod.
- Ensure crawl is set to 'waiting_for_capacity' if either no crawler
pods are available or no redis pod. previously, missing/inaccessible
redis would not result in 'waiting_for_capacity' if crawler pods are
available
- Rework logic: if no crawler and redis after >60 seconds, shutdown
redis. if crawler and no redis, init (or reinit) redis
- track 'lastUpdatedTime' in db when incrementing exec time to avoid
double counting if lastUpdatedTime has not changed, eg. if operator sync
fails.
- add redis timeout of 20 seconds to avoid timing out operator responses
if redis conn takes too long, assume unavailable
2024-02-09 16:14:29 -08:00
Ilya Kreymer
65fec64197
storages: use asynccontextmanager instead of sync to close client (#1521)
Follow-up to #1481, use the asyncontextmanager with `async with` as only
used in async functions (which call run_in_executor)
2024-02-08 08:28:53 -08:00
Ilya Kreymer
b2a5dbf2cd
enable screenshots by default + fix py version formatting (#1518)
configmap: add --screenshot thumbnail,view as default screenshots 
version: update update-version.sh to add newline in version.py to match
new black formatting (from changes in #1507)
Fixes #1519
2024-02-07 17:07:28 -08:00
Ilya Kreymer
7aebce66f6 version: bump to 1.9.0-beta.3 2024-02-07 15:21:10 -08:00
Tessa Walsh
a898c2b456
Format backend with Black 24 (#1507)
Fixes #1506
2024-02-07 11:35:34 -08:00
Tessa Walsh
b252931c71
Add scale to CrawlOut (#1487)
Fixes #1486 

`scale` is already saved in the crawl but needed to be added to
`CrawlOut`.
2024-01-23 14:10:37 -08:00
Ilya Kreymer
bf38063e0a
Close sync S3 client (#1481)
Cleanup of boto3 sync client, ensure that it is used as a context manager like
async client.
2024-01-18 18:18:41 -05:00
Tessa Walsh
950844dc92
Add migration to fix issues with previous migrations (#1480)
Fixes #1479 

- Update null crawlTimeouts in db from null to 0
- Update crawlerChannel in configmaps
2024-01-18 16:59:40 -05:00
Ilya Kreymer
ad19941318
operator: use 'default' CRAWLER_CHANNEL if none is set (#1478)
Use default channel if CRAWLER_CHANNEL not set in crawlconfig configmap,
consistent with how other configmap settings for cronjobs are used.
2024-01-18 11:13:03 -08:00
Ilya Kreymer
e43feedc43 version: bump to 1.9.0-beta.2 2024-01-18 10:01:38 -08:00
Ilya Kreymer
370590b14f version: bump to 1.9.0-beta.1 2024-01-17 14:58:25 -08:00