Commit Graph

426 Commits

Ilya Kreymer
4360e0c1b5
Update tests with latest crawler (#1711)
tests: use 'latest' crawler release for testing, now that 1.1.x is released.
2024-04-20 15:56:26 -07:00
Tessa Walsh
80008a2853
Add post load delay to Browsertrix (#1700)
Fixes #1699 

Adds post load delay to:
- Backend `RawCrawlConfig` model
- Frontend (workflow editor and config details component)
- Workflow setup docs
2024-04-18 20:03:47 -07:00
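Editor's note: to illustrate the backend half of the change above, here is a minimal sketch of a post-load delay field on a pydantic crawl config model. The field name `postLoadDelay` and its unit (seconds) are assumptions for illustration, not confirmed by the commit.

```python
# Hypothetical sketch of a post-load delay option on the crawl config model.
from typing import Optional
from pydantic import BaseModel

class RawCrawlConfig(BaseModel):
    # ... other workflow settings elided ...
    postLoadDelay: Optional[int] = 0  # assumed: seconds to wait after page load
```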
Ilya Kreymer
9609ff4194
Add 'activeQAStats' field (#1694)
As additional support for #1683, include the active QA stats in the
crawl response, along with the active QA state.
This will allow showing the progress of a QA run in the archived items list.
2024-04-18 10:05:39 -04:00
Tessa Walsh
b87860c68a
Ensure /all-crawls?sortBy=qaState always sorts crawls above uploads (#1691)
Follow-up to #1686
2024-04-17 19:14:29 -07:00
Tessa Walsh
30ab139ff2
Add QA run aggregate stats API endpoint (#1682)
Fixes #1659 

Takes an arbitrary set of thresholds for text and screenshot matches as
a comma-separated list of floats.

Returns a list of groupings for each that include the lower boundary and
count for all thresholds passed in.
2024-04-17 13:24:18 -04:00
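Editor's note: a minimal sketch of the grouping logic described above, assuming match scores in [0, 1]; the names and response shape are illustrative, not the endpoint's actual schema.

```python
# Bucket match scores by an arbitrary comma-separated list of thresholds,
# returning the lower boundary and count of each bucket.
from bisect import bisect_right

def aggregate_by_thresholds(scores: list[float], thresholds_param: str) -> list[dict]:
    thresholds = sorted(float(t) for t in thresholds_param.split(","))
    boundaries = [0.0] + thresholds
    counts = [0] * len(boundaries)
    for score in scores:
        counts[bisect_right(boundaries, score) - 1] += 1
    return [{"lowerBoundary": b, "count": c} for b, c in zip(boundaries, counts)]

# e.g. aggregate_by_thresholds([0.2, 0.6, 0.95], "0.5,0.9")
# -> [{'lowerBoundary': 0.0, 'count': 1}, {'lowerBoundary': 0.5, 'count': 1},
#     {'lowerBoundary': 0.9, 'count': 1}]
```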
Ilya Kreymer
835014d829
restrict qa runs to a 'min_qa_crawler_image' if set in the chart (#1685)
- fixes #1684
- can be used to optionally restrict QA to only some crawls (eg. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
2024-04-17 08:48:33 -07:00
Tessa Walsh
c800da1732
Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686)
Backend work for #1672 

Adds new sort options to /crawls and /all-crawls GET list endpoints:

- `reviewStatus`
- `qaRunCount`: number of completed QA runs for crawl (also added to
CrawlOut)
- `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of
which are added to CrawlOut)
2024-04-16 23:54:09 -07:00
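Editor's note: illustrative Mongo-style sort specs for the new options; the sort directions and the exact pipeline (including how crawls are kept above uploads for `qaState`, per #1691) are assumptions.

```python
# Hypothetical sort specs keyed by the sortBy param described above.
SORT_SPECS = {
    "reviewStatus": [("reviewStatus", -1)],
    "qaRunCount": [("qaRunCount", -1)],
    # active QA state takes precedence over the last finished QA state
    "qaState": [("activeQAState", -1), ("lastQAState", -1)],
}
```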
Tessa Walsh
87e0873f1a
Add mime field to Page model (#1678) 2024-04-17 00:57:49 -04:00
Vinzenz Sinapius
1b034957ff
Improve reliability of backend tests (#1675)
- Remove globals from profile, uploads, and qa test modules in favor of fixtures
- Add retries to fix intermittent test failures due to timing

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-04-16 14:22:41 -04:00
Ilya Kreymer
95f5605af7
renumber crawl priority classes: (#1673)
- priority classes <-10 are ignored by cluster-autoscaler so QA jobs
with too low priorities never run
- start crawl priorities at 0 going down (same as before)
- start qa run priorities at -2 going down (instead of -100)
- this means a crawl with a scale of 3 can be preempted by the 1st qa pod,
but otherwise crawls have higher priority
- rename priority classes as they are otherwise immutable and error on
helm upgrade

This allows for more room in lower pri classes for other types of
objects, while keeping in mind the -10 and below threshold (see:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
2024-04-13 12:24:43 -07:00
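Editor's note: a sketch of the renumbering, under the assumption that each pod instance gets its own priority class value; the exact per-instance offsets beyond "crawls start at 0, QA runs start at -2" are illustrative.

```python
# Hypothetical priority assignment keeping all classes above the
# cluster-autoscaler's -10 cutoff for modest scales.
def crawl_priority(instance: int) -> int:
    return -instance        # 0, -1, -2, ... per crawl pod instance

def qa_priority(instance: int) -> int:
    return -2 - instance    # -2, -3, -4, ... per QA pod instance
```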
Ilya Kreymer
f243d34395
Remove pages from QA Configmap (#1671)
Fixes #1670 

No longer need to pass pages to the ConfigMap. The ConfigMap has a size
limit and will fail if there are too many pages.

With this change, the page list for QA will be read directly from the
WACZ files pages.jsonl / extraPages.jsonl entries.
2024-04-12 16:04:33 -07:00
Tessa Walsh
172a9bf0cd
Change crawl.reviewStatus to 1-5 scale int (#1664) 2024-04-09 17:51:06 -07:00
Ilya Kreymer
a7cda3b11b version: bump to 1.10.0-beta.1 2024-04-05 18:24:14 -07:00
Tessa Walsh
4229b94736
Track failed QA runs and include in list endpoint (#1650)
Fixes #1648 

- Tracks failed QA runs in database, not only successful ones
- Includes failed QA runs in list endpoint by default
- Adds `skipFailed` param to list endpoint to return only successful
runs

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-04-04 18:51:06 -07:00
Ilya Kreymer
5c08c9679c
fix issue with incorrect number of total pages if any of the seeds is a redirect (#1649)
Following changes in webrecorder/browsertrix-crawler#475,
webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed
to the seen list. To account for this, the redirected seed needs to be
subtracted from the seen count to get the total page count.
2024-04-04 15:55:44 -07:00
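Editor's note: the adjustment amounts to a subtraction; a one-function sketch with assumed names.

```python
def total_page_count(pages_seen: int, redirected_seeds: int) -> int:
    # each redirected seed appears once extra in the crawler's seen list
    return pages_seen - redirected_seeds
```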
sua yoo
83c9203a11
Initial QA Review UI! (#1624)
QA Details page:
- Enables QA tab with ability to start an automated analysis QA run + view analysis and manual review status
- Pages listed with review status + overall crawl review status shown on QA details (relates to #1508)
- Initial placeholder for QA run analytics (part of #1589)
- Addresses a good deal of #1477

Automated Analysis QA in Review Mode:
- Ability to select from multiple analysis QA runs / view QA runs in QA details
- Shows analysis screenshot, text and resources compare and replay tabs (fixes #1496)
- Sorting by worst screenshot / worst text score for each QA run
- Includes pages sidebar with screenshot/text/resource compare results (fixes #1497)

Manual Review QA in Review Mode:
- Per-page replay available as separate tab (fixes #1499)
- Supports thumbs up, thumbs down, notes for each page
- Supports entering review status approval (good/acceptable/bad can be entered when finishing review)

---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-04-04 15:09:52 -07:00
Ilya Kreymer
ffc4b5b58f
operator state fixes (follow up from #1639) (#1640)
- increase time for going to waiting_capacity from starting to 150
seconds
- relax requirement for state transitions, allow complete from waiting
- additional type safety for different states, ensure mark_finished()
only called with non-running states, add `Literal` types for all the
state types.
2024-03-29 15:12:16 -07:00
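Editor's note: a sketch of the `Literal`-based state typing mentioned above; the state names shown are a plausible subset, not the operator's full set.

```python
from typing import Literal

RunningStates = Literal["starting", "running", "waiting_capacity"]
NonRunningStates = Literal["complete", "failed", "canceled"]

def mark_finished(state: NonRunningStates) -> None:
    # a type checker rejects calls that pass a running state here
    print(f"marking crawl finished with state: {state}")
```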
Ilya Kreymer
3438133fcb
Crawler pod memory padding + auto scaling (#1631)
- set memory limit to 1.2x memory request to provide extra padding and
avoid OOM
- attempt to resize crawler pods by 1.2x when exceeding 90% of available
memory
- do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of
requested memory, resulting in a faster graceful restart while avoiding an
instant system OOM kill
- Fixes #1632

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-28 16:39:00 -07:00
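Editor's note: a sketch of the thresholds described above; the 1.2x multiplier and 90%/100% cut-offs come from the commit message, while the helper names are hypothetical.

```python
MEM_LIMIT_PADDING = 1.2

def memory_limit(requested_bytes: int) -> int:
    # limit is padded to 1.2x the request to provide headroom against OOM
    return int(requested_bytes * MEM_LIMIT_PADDING)

def memory_action(used: int, requested: int) -> str:
    if used >= requested:          # 100% of requested: "soft OOM"
        return "send SIGTERM for graceful restart"
    if used >= 0.9 * requested:    # 90% of requested: try scaling up
        return "resize pod memory by 1.2x"
    return "ok"
```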
Tessa Walsh
00ced6dd6b
Add single page QA GET endpoint (#1635)
Fixes #1634 

Also makes sure the other GET page endpoint (without QA) uses the PageOut model
2024-03-27 14:57:59 -07:00
Tessa Walsh
66b4532321
Give test_crawl_timeout 10 mins to finish (#1627)
Related to https://github.com/webrecorder/browsertrix-cloud/issues/1620

Follow-up to https://github.com/webrecorder/browsertrix-cloud/pull/1621,
which didn't seem to fix the problem.

I'm giving it much more time here in the hopes that it solves it (since
it's a nightly test, time shouldn't be such a pressing issue).
2024-03-26 18:33:30 -07:00
Tessa Walsh
e9895e78a2
Add additional filters to page list endpoints (#1622)
Fixes #1617 

Filters added:

- reviewed: filter by page has approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts a comma-separated list of
strings, each of which is coerced into True, False, or None, or ignored
if invalid)
- hasNotes: filter by has at least one note (true) or not (false)

Tests have also been added to ensure that results are as expected.
2024-03-21 21:33:07 -07:00
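Editor's note: a sketch of the `approved` value coercion described above, with illustrative names.

```python
from typing import Optional

def parse_approved(param: str) -> list[Optional[bool]]:
    # coerce each comma-separated value to True/False/None, dropping invalid ones
    coerce = {"true": True, "false": False, "none": None}
    return [coerce[v] for v in (s.strip().lower() for s in param.split(","))
            if v in coerce]

# parse_approved("True,false,None,bogus") -> [True, False, None]
```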
Tessa Walsh
b3b1e0d7d8
Fix intermittent crawl timeout test failure (#1621)
Fixes #1620 

This increases the total timeout from 60 seconds to 120 seconds for
crawl to complete, which should be sufficient given how intermittently
the failure has been happening. Can increase it further if needed.
2024-03-21 17:18:27 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
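Editor's note: an illustrative, Mongo-style sketch of building a page query from `qaFilterBy` plus the comparison params; the actual document layout for per-run QA results is an assumption.

```python
# Hypothetical query builder for filtering pages by QA results.
def qa_filter_query(qa_run_id: str, filter_by: str,
                    gt=None, gte=None, lt=None, lte=None) -> dict:
    assert filter_by in ("screenshotMatch", "textMatch")
    ops = {"$gt": gt, "$gte": gte, "$lt": lt, "$lte": lte}
    comparisons = {op: val for op, val in ops.items() if val is not None}
    return {f"qa.{qa_run_id}.{filter_by}": comparisons}
```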
Tessa Walsh
21ae38362e
Add endpoints to read pages from older crawl WACZs into database (#1562)
Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.

After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.

Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.

StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
2024-03-19 14:14:21 -07:00
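Editor's note: a minimal sketch of the remotezip-based streaming described above. `RemoteZip` subclasses `zipfile.ZipFile`, so a WACZ entry can be opened and read line by line over HTTP range requests; the entry path shown and the omission of batching are simplifications.

```python
import json
from remotezip import RemoteZip

def stream_pages(presigned_wacz_url: str, name: str = "pages/pages.jsonl"):
    # read page records out of a remote WACZ without downloading the whole file
    with RemoteZip(presigned_wacz_url) as wacz:
        with wacz.open(name) as pages_file:
            for line in pages_file:
                yield json.loads(line)
```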
Ilya Kreymer
5a4902b6d4
kubernetes api: avoid overriding content-type header in kubernetes-asyncio, pass in via arg instead (main) (#1605)
- instead of overriding the content-type header globally, pass
'application/merge-patch+json' to
self.custom_api.patch_namespaced_custom_object() directly
- bump kubernetes-asyncio to 29.0.0
- fixes potential issues with global override of the header in
kubernetes-asyncio
- copy of #1602 for main
2024-03-18 11:17:54 -07:00
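Editor's note: a sketch of the per-call content type, assuming the `_content_type` keyword accepted by kubernetes-asyncio's generated methods (as the commit describes passing it directly); the CRD group/plural shown are illustrative.

```python
async def patch_crawljob(custom_api, name: str, patch_body: dict) -> None:
    # pass the merge-patch content type per call instead of overriding
    # the client's default headers globally
    await custom_api.patch_namespaced_custom_object(
        group="btrix.cloud", version="v1", namespace="crawlers",
        plural="crawljobs", name=name, body=patch_body,
        _content_type="application/merge-patch+json",
    )
```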
Ilya Kreymer
e7af081af1
profile browser fixes: better resource usage + load retry (main) (#1604)
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers

- Frontend: check that the profile HTML page is loading, and keep retrying
if still getting an nginx error instead of loading an iframe with the error.

Fixes #1598 (Copy of #1599 from 1.9.4)
2024-03-16 15:07:04 -07:00
wvengen
6278157f40
Make storage deletion work on more S3 providers, don't use access URL for deletion (#1600)
I came across [this
problem](https://forum.webrecorder.net/t/deleting-crawl-failure/512) and
noticed that the access URL is used when deleting files, causing my file
deletions to fail on OpenStack SWIFT S3 (relates to #1090). This trivial
change makes it work there.
2024-03-16 04:17:23 -04:00
Ilya Kreymer
08f6847194
Configurable Max Scale for frontend (#1557)
Allow the maximum scale option to be fully configurable via
`max_crawl_scale`. Already configurable on the backend, it is now exposed
to the frontend via the `/api/settings` API as the `maxCrawlScale` value.

The workflow editor and workflow details are updated to allow selecting
the scale up to the maxCrawlScale setting (which defaults to 3 if not
set).
2024-03-11 16:21:20 -07:00
Ilya Kreymer
ea494fa6e6
Merge V1.9.3 changes into main (#1583)
- Fix execution time checking by keeping lastUpdatedTime in db by
@ikreymer in https://github.com/webrecorder/browsertrix-cloud/pull/1573
- disable postcss-lit for var css
- Prevent closing tooltips from closing collection share dialog by
@SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1579
- Fix pending exclusion pagination by @SuaYoo in
https://github.com/webrecorder/browsertrix-cloud/pull/1578
- Fix regex escape in exclusion editor text match by @SuaYoo in
https://github.com/webrecorder/browsertrix-cloud/pull/1577

---------
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
2024-03-06 15:38:22 -08:00
Tessa Walsh
c20e754269
Add updatable QA reviewStatus field to crawls (#1575)
Fixes #1539 

Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl
update API endpoint. Acceptable values are "good", "acceptable" or
"failure", enforced by an Enum.

Added to `BaseCrawl` so that we can extend support to uploads more
easily later on, but for now we'll only display this for crawls in the
frontend.
2024-03-05 16:49:23 -08:00
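Editor's note: a sketch of the enum-enforced field, using the string values from the commit message (note that #1664 later changed this to a 1-5 integer scale).

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel

class ReviewStatus(str, Enum):
    GOOD = "good"
    ACCEPTABLE = "acceptable"
    FAILURE = "failure"

class BaseCrawl(BaseModel):
    # updatable via the crawl update API endpoint
    reviewStatus: Optional[ReviewStatus] = None
```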
Tessa Walsh
ec0db1c323
Temporarily remove pages migration (#1572)
Removing until we have a better-tested solution, in part to avoid interfering with testing of QA runs for new crawls in beta.
2024-03-04 10:30:04 -08:00
Ilya Kreymer
09a0d51843
pages: set page status to 200 if unset and loadState != 0 (#1563)
Follow-up to #1516: ensure page status is set to 200 if no status is
provided and loadState is not 0
2024-02-29 15:15:17 -08:00
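Editor's note: the rule is a small conditional; a sketch with assumed names.

```python
from typing import Optional

def normalize_page_status(status: Optional[int], load_state: int) -> Optional[int]:
    # treat an unset status as a 200 as long as the page actually loaded
    if not status and load_state != 0:
        return 200
    return status
```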
Ilya Kreymer
2ac6584942
Refactor operator class into module (#1564)
The operator class has gotten fairly large; this is a first pass at
refactoring operator.py into a submodule instead, with multiple operator
instances that handle different types of objects.

- The main k8s interface has been split into K8sOpApi which extends K8sApi
and is shared across all operators.
- Each operator extends BaseOperator which also has an instance of K8sOpApi
- The CrawlOperator is still the bulk of the functionality, but will likely be further refactored
to support QA jobs

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-02-29 14:40:12 -08:00
Tessa Walsh
da19691184
Add crawl errors incrementally during crawl (#1561)
Fixes #1558 

- Adds crawl errors to database incrementally during crawl rather than
after crawl completes
- Simplifies crawl /errors API endpoint to always return errors from
database
2024-02-29 09:16:34 -08:00
Ilya Kreymer
804f755787
Increase startup probe time to account for long-running migrations (#1560)
- increases the failureThreshold for startupProbe for the api backend
container to account for long-running migrations, up to 300 seconds
- add `/healthzStartup` which checks if db is ready
- bump 
- keeps `/healthz` to always return 200 when running
- increases livenessProbe failureThreshold to be higher than readiness
probe, following recommended best practice of liveness probe > readiness
probe
- fixes #1559
2024-02-28 14:22:33 -08:00
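Editor's note: a FastAPI-style sketch of the split probes; the body of the readiness check is hypothetical.

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

async def db_is_ready() -> bool:
    # hypothetical readiness check, e.g. awaiting a cheap MongoDB "ping"
    return True

@app.get("/healthz")
async def healthz():
    # always 200 once the process is running
    return {}

@app.get("/healthzStartup")
async def healthz_startup():
    if not await db_is_ready():
        raise HTTPException(status_code=503, detail="not_ready")
    return {}
```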
Tessa Walsh
14189b7cfb
Add crawl pages and related API endpoints (#1516)
Fixes #1502 

- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted

Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.

Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-28 12:11:35 -05:00
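Editor's note: to illustrate one piece of the above, a sketch of a page note carrying its own id, timestamp, and user info in addition to text; the field names are assumptions.

```python
from datetime import datetime
from uuid import UUID, uuid4
from pydantic import BaseModel, Field

class PageNote(BaseModel):
    id: UUID = Field(default_factory=uuid4)
    created: datetime = Field(default_factory=datetime.utcnow)
    userid: UUID
    userName: str
    text: str
```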
Ilya Kreymer
8ae032ff88 More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. (#1537)
Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug
[crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler
1.0.0-beta.4 or higher
If a crawl name is provided, uses the crawl name, otherwise the hostname of
the first seed. The name is 'sluggified', using lowercase alphanumeric
characters separated by dashes.

Ex: in an organization called `Default Org`, a crawl of
`https://specs.webrecorder.net/` and no name will have WARCs named:
`default-org-specs-webrecorder-net-....warc.gz`
If the crawl is given the name `SPECS`, the WARCs will be named
`default-org-specs-manual-....warc.gz`

Fixes #412 in a default way.
2024-02-22 23:54:23 -08:00
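Editor's note: a sketch of the 'sluggified' prefix construction matching the first example above; the helper names are illustrative.

```python
import re

def slugify(value: str) -> str:
    # lowercase alphanumeric runs joined by dashes
    return "-".join(re.findall(r"[a-z0-9]+", value.lower()))

def warc_prefix(org_slug: str, crawl_name: str, first_seed_host: str) -> str:
    return f"{org_slug}-{slugify(crawl_name or first_seed_host)}"

# warc_prefix("default-org", "", "specs.webrecorder.net")
# -> "default-org-specs-webrecorder-net"
```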
Ilya Kreymer
a8e3ff1141 version: bump to 1.10.0-beta.0 2024-02-20 00:22:29 -08:00
Ilya Kreymer
c1cffe9ecd version: bump to 1.9.1 2024-02-16 09:44:18 -08:00
Ilya Kreymer
64bf21311d version: bump to 1.9.0! 2024-02-14 13:30:46 -08:00
Ilya Kreymer
1d266e3cea bump to 1.9.0.beta.5 2024-02-12 18:29:39 -08:00
Ilya Kreymer
4bc8152640 version: bump to 1.9.0-beta.4 2024-02-09 16:17:13 -08:00
Ilya Kreymer
0653657637
better handling of failed redis connection + exec time updates (#1520)
This PR addresses a possible failure when Redis pod was inaccessible
from Crawler pod.
- Ensure crawl is set to 'waiting_for_capacity' if either no crawler
pods are available or no redis pod. Previously, missing/inaccessible
redis would not result in 'waiting_for_capacity' if crawler pods were
available.
- Rework logic: if there are no crawler pods but redis is still up after
>60 seconds, shut down redis; if there are crawler pods but no redis, init
(or reinit) redis
- track 'lastUpdatedTime' in db when incrementing exec time to avoid
double counting if lastUpdatedTime has not changed, eg. if operator sync
fails.
- add a redis timeout of 20 seconds to avoid timing out operator responses;
if the redis connection takes too long, assume it is unavailable
2024-02-09 16:14:29 -08:00
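Editor's note: a redis-py (asyncio) style sketch of the 20-second cap; the exact client options used are assumptions.

```python
from redis import asyncio as aioredis

def get_redis(redis_url: str):
    # fail fast instead of letting a slow redis connection stall the
    # operator's sync response
    return aioredis.from_url(
        redis_url,
        socket_timeout=20,
        socket_connect_timeout=20,
        decode_responses=True,
    )
```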
Ilya Kreymer
65fec64197
storages: use asynccontextmanager instead of sync to close client (#1521)
Follow-up to #1481: use asynccontextmanager with `async with`, as it is only
used in async functions (which call run_in_executor)
2024-02-08 08:28:53 -08:00
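Editor's note: a minimal sketch of the pattern, assuming a boto3 S3 client wrapped so async callers can use `async with`.

```python
from contextlib import asynccontextmanager

import boto3

@asynccontextmanager
async def get_s3_client(endpoint_url: str, access_key: str, secret_key: str):
    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
    try:
        yield client
    finally:
        # ensure the underlying client is closed when the block exits
        client.close()

# usage: async with get_s3_client(url, key, secret) as s3: ...
```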
Ilya Kreymer
b2a5dbf2cd
enable screenshots by default + fix py version formatting (#1518)
configmap: add `--screenshot thumbnail,view` as default screenshots
version: update update-version.sh to add a newline in version.py to match
new Black formatting (from changes in #1507)
Fixes #1519
2024-02-07 17:07:28 -08:00
Ilya Kreymer
7aebce66f6 version: bump to 1.9.0-beta.3 2024-02-07 15:21:10 -08:00
Tessa Walsh
a898c2b456
Format backend with Black 24 (#1507)
Fixes #1506
2024-02-07 11:35:34 -08:00
Tessa Walsh
b252931c71
Add scale to CrawlOut (#1487)
Fixes #1486 

`scale` is already saved in the crawl but needed to be added to
`CrawlOut`.
2024-01-23 14:10:37 -08:00
Ilya Kreymer
bf38063e0a
Close sync S3 client (#1481)
Cleanup of the boto3 sync client, ensuring that it is used as a context
manager like the async client.
2024-01-18 18:18:41 -05:00
Tessa Walsh
950844dc92
Add migration to fix issues with previous migrations (#1480)
Fixes #1479 

- Update null crawlTimeouts in db from null to 0
- Update crawlerChannel in configmaps
2024-01-18 16:59:40 -05:00