Commit Graph

1343 Commits

Author SHA1 Message Date
Henry Wilkinson
6cabb9c875
Changes download icons to cloud-download (#1688)
And now they're all the same!
2024-04-17 15:27:38 -04:00
Vinzenz Sinapius
a8336925b6
Run crawler and profilebrowser with non-root user (#1625)
With these changes, crawler and profilebrowser jobs run as a
non-root user.
2024-04-17 12:03:33 -07:00
Tessa Walsh
30ab139ff2
Add QA run aggregate stats API endpoint (#1682)
Fixes #1659 

Takes an arbitrary set of thresholds for text and screenshot matches as
a comma-separated list of floats.

Returns a list of groupings for each that include the lower boundary and
count for all thresholds passed in.
2024-04-17 13:24:18 -04:00
Ilya Kreymer
835014d829
restrict qa runs to a 'min_qa_crawler_image' if set in the chart (#1685)
- fixes #1684
- can be used to optionally restrict QA to only some crawls (eg. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
2024-04-17 08:48:33 -07:00
Tessa Walsh
c800da1732
Add reviewStatus, qaState, and qaRunCount sort options to crawls/all-crawls list endpoints (#1686)
Backend work for #1672 

Adds new sort options to /crawls and /all-crawls GET list endpoints:

- `reviewStatus`
- `qaRunCount`: number of completed QA runs for crawl (also added to
CrawlOut)
- `qaState` (sorts by `activeQAState` first, then `lastQAState`, both of
which are added to CrawlOut)
2024-04-16 23:54:09 -07:00
Tessa Walsh
87e0873f1a
Add mime field to Page model (#1678) 2024-04-17 00:57:49 -04:00
Vinzenz Sinapius
1b034957ff
Improve reliability of backend tests (#1675)
- Remove globals from profile, uploads, and qa test modules in favor of fixtures
- Add retries to fix intermittent test failures due to timing

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-04-16 14:22:41 -04:00
Ilya Kreymer
95f5605af7
renumber crawl priority classes: (#1673)
- priority classes <-10 are ignored by cluster-autoscaler so QA jobs
with too low priorities never run
- start crawl priorities at 0 going down (same as before)
- start qa run priorities at -2 going down (instead of -100)
- this means a crawl of with scale of 3 can be preempted by 1st qa pod,
but otherwise crawls have higher priority
- rename priority classes as they are otherwise immutable and error on
helm upgrade

This allows for more room in lower pri classes for other type of
objects, while keeping in mind the -10 and below threshold: (see:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
2024-04-13 12:24:43 -07:00
Ilya Kreymer
f243d34395
Remove pages from QA Configmap (#1671)
Fixes #1670 

No longer need to pass pages to the ConfigMap. The ConfigMap has a size
limit and will fail if there are too many pages.

With this change, the page list for QA will be read directly from the
WACZ files pages.jsonl / extraPages.jsonl entries.
2024-04-12 16:04:33 -07:00
emma
ed08b734ba Fix failing Webpack build
- Includes eslint-related files in Dockerfile build step
- Switches to using a cache mount in Dockerfile build step for yarn
cache, rather than deleting yarn cache, which should speed up most
builds
2024-04-10 14:40:44 -04:00
Tessa Walsh
172a9bf0cd
Change crawl.reviewStatus to 1-5 scale int (#1664) 2024-04-09 17:51:06 -07:00
Emma Segal-Grossman
9aba24a90e
Emit eslint errors during webpack builds & during dev (#1667)
Following [discussion on
Discord](https://discord.com/channels/895426029194207262/910966759165657161/1227299497546223707),
this enables ESLint errors showing up during dev within Webpack, and
also enforces that all issues (both warnings and errors) will cause
builds to fail during CI.

Other changes:
- Updates some dependencies (primarily to versions that are
type-checked)
2024-04-09 17:42:31 -04:00
Tessa Walsh
435ca08f26
Remove URL prefix dropdown from new browser profile screen (#1660)
Switch to a text field and prepend `https://` as prefix if no valid
prefix is provided in the provided URL.
2024-04-08 15:36:21 -04:00
Ilya Kreymer
17f49a52de
email templates update + customization + doc update (fixes #1652) (#1653)
- modify invite email template to answer common questions
- email templates: make each email template overridable with --set-file
- docs: update customization doc to document how to customize email
templates

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-08 12:27:47 -07:00
Ilya Kreymer
a7cda3b11b version: bump to 1.10.0-beta.1 2024-04-05 18:24:14 -07:00
Tessa Walsh
4229b94736
Track failed QA runs and include in list endpoint (#1650)
Fixes #1648 

- Tracks failed QA runs in database, not only successful ones
- Includes failed QA runs in list endpoint by default
- Adds `skipFailed` param to list endpoint to return only successful
runs

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-04-04 18:51:06 -07:00
Ilya Kreymer
5c08c9679c
fix issue with incorrect number of total pages if any of the seeds is a redirect (#1649)
Following changes in webrecorder/browsertrix-crawler#475,
webrecorder/browsertrix-crawler#509, the crawler adds a redirected seed
to the seen list. To account for this, it needs to be subtracted to get
the total page count.
2024-04-04 15:55:44 -07:00
sua yoo
83c9203a11
Initial QA Review UI! (#1624)
QA Details page:
- Enables QA tab with ability to start automated analysis QA Run + view a and manual review status
- Pages listed with review status + overall crawl review status shown on QA details (relates to #1508)
- Initial placeholder for QA run analytics (part of #1589)
- Addresses a good deal of #1477

Automated Analysis QA in Review Mode:
- Ability to select from multiple analysis QA runs / view QA runs in QA details
- Shows analysis screenshot, text and resources compare and replay tabs (fixes #1496)
- Sorting by worst screenshot / worst text score for each QA run
- Includes pages sidebar with screenshot/text/resource compare results (fixes #1497)

Manual Review QA in Review Mode:
- Per-page replay available as separate tab (fixes #1499)
- Supports thumbs up, thumbs down, notes for each page
- Supports entering review status approval (good/acceptable/bad can be entered when finishing review

---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-04-04 15:09:52 -07:00
Henry Wilkinson
432a727218
Frontend: Fixes the "Replay Latest Crawl" button path (#1636)
- Fixes the path for the "Replay Latest Crawl" button on the watch crawl
page when the crawl workflow isn't running
2024-04-03 18:19:46 -04:00
Henry Wilkinson
6e8867c550
Adds documentation for exporting files (#1643)
Closes #1642 

### Changes
- Adds section to the collections page on downloading collections
- Changes the Files section on the archived items page to be more
explicit about downloading files because that's the only action you can
do there!

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-03 17:33:57 -04:00
Ilya Kreymer
ffc4b5b58f
operator state fixes (follow up fomr #1639) (#1640)
- increase time for going to waiting_capacity from starting to 150
seconds
- relax requirement for state transitions, allow complete from waiting
- additional type safety for different states, ensure mark_finished()
only called with non-running states, add `Literal` types for all the
state types.
2024-03-29 15:12:16 -07:00
Ilya Kreymer
c1817cbe04
add horizontal pod autoscaler for backend and frontend via helm charts (#1633)
Supports horizontal pod autoscaling (hpa) for backend and frontend pods:
- use cpu and memory averages
- adjust base memory + cpu for backend
- threshold set to 80% cpu and 95% memory utilization by default
(configurable in values.yaml)
- instead of backend and frontend replicas, set max replicas in
values.yaml
- only enable hpa if backend_max_replicas or frontend_max_replicas is
>1, default to 1 for now
2024-03-28 16:39:27 -07:00
Ilya Kreymer
3438133fcb
Crawler pod memory padding + auto scaling (#1631)
- set memory limit to 1.2x memory request to provide extra padding and
avoid OOM
- attempt to resize crawler pods by 1.2x when exceeding 90% of available
memory
- do a 'soft OOM' (send extra SIGTERM) to pod when reaching 100% of
requested memory, resulting in faster graceful restart, but avoiding a
system-instant OOM Kill
- Fixes #1632

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-28 16:39:00 -07:00
Ilya Kreymer
86311ab4ea
merge 1.9.5 fixes (#1637)
retry loading profile if initial load fails, follow-up to #1604
- Add missing setTimeout to retry profile loading

bump RWP to 1.8.15
2024-03-27 21:49:19 -07:00
Tessa Walsh
00ced6dd6b
Add single page QA GET endpoint (#1635)
Fixes #1634 

Also make sure other get page endpoint without qa uses PageOut model
2024-03-27 14:57:59 -07:00
Henry Wilkinson
275f69493f
Frontend: icon-button Cleanup (#1628)
Closes #1591

### Changes
- Converts one instance of a button with an icon in it to an `icon-button`
- Makes all the trashcan icon buttons have a red hover state
- Adds localization function & placeholder to upload dialog "Name" field
- Adds localization functions to some missing icon-button label
instances
- Adds a few missing icon button labels

Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
2024-03-27 14:57:32 -04:00
Ilya Kreymer
412eb2ef32
MetaController update (#1630)
Bump metacontroller to latest (4.11)
2024-03-27 08:49:56 -07:00
Tessa Walsh
66b4532321
Give test_crawl_timeout 10 mins to finish (#1627)
Related to https://github.com/webrecorder/browsertrix-cloud/issues/1620

Follow-up to https://github.com/webrecorder/browsertrix-cloud/pull/1621,
which didn't seem to fix the problem.

I'm giving it much more time here in the hopes that it solves it (since
it's a nightly test, time shouldn't be such a pressing issue).
2024-03-26 18:33:30 -07:00
Tessa Walsh
e9895e78a2
Add additional filters to page list endpoints (#1622)
Fixes #1617 

Filters added:

- reviewed: filter by page has approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts list of strings,
comma-separated, each of which are coerced into True, False, or None, or
ignored if they are invalid values)
- hasNotes: filter by has at least one note (true) or not (false)

Tests have also been added to ensure that results are as expected.
2024-03-21 21:33:07 -07:00
Tessa Walsh
b3b1e0d7d8
Fix intermittent crawl timeout test failure (#1621)
Fixes #1620 

This increases the total timeout from 60 seconds to 120 seconds for
crawl to complete, which should be sufficient given how intermittently
the failure has been happening. Can increase it further if needed.
2024-03-21 17:18:27 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
sua yoo
05e03e0b90
Disable Prettier check in CI (#1619)
Disables `prettier:check` until discrepancies are handled in
https://github.com/webrecorder/browsertrix-cloud/issues/1618 so that
formatting issues don't fail CI runs.
2024-03-20 15:01:51 -07:00
Emma Segal-Grossman
d2862ff797
Emit more modern code for browsers (#1614)
Adds a `browserlist` field to `package.json`, which Webpack picks up so
it doesn't convert nullish coalescing operators into obfuscated messes
like `key !== null && key !== void 0 ? key : null`.

This improves output size a little & improves the debugging experience
as well.

Tested in Chrome, FF, & Safari locally and didn't encounter any issues.
2024-03-19 17:22:41 -04:00
Emma Segal-Grossman
41d6e79cb3
Clean up ESLint warnings in main (#1616)
See title.

The only place this changes behaviour is in the placeholder page list,
which will be replaced by the real one shortly, so I'm going to just
merge this.
2024-03-19 17:22:27 -04:00
Tessa Walsh
21ae38362e
Add endpoints to read pages from older crawl WACZs into database (#1562)
Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.

After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.

Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.

StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
2024-03-19 14:14:21 -07:00
Emma Segal-Grossman
2c44011b5b
Update node version mentioned in docs (#1615)
Follow-up to #1612 

cc @SuaYoo
2024-03-19 16:40:53 -04:00
sua yoo
dcd2efcd3b
Fix asset imports in tests (#1611)
Addresses failing test in
https://github.com/webrecorder/browsertrix-cloud/pull/1592 by fixing
asset imports in unit tests. Unit tests now import an empty string for
all assets--note: if we want to test actual asset content, will need to
update this config.
2024-03-19 13:06:07 -07:00
sua yoo
26820cbaba
Upgrade Node 16 > 18 (#1612) 2024-03-19 13:02:08 -07:00
sua yoo
b43f550ff3
Fix missing page component imports (#1610)
Missed bug introduced in
https://github.com/webrecorder/browsertrix-cloud/pull/1608, adds back
imports and disables `import-x` rule.
2024-03-18 20:55:35 -07:00
Emma Segal-Grossman
91df222cdf
Fix mismatch in prettier import order config (#1609)
Follow-up to #1608 — quick fix for an issue I encountered after merging
main into #1497

Just going to directly merge once this completes (cc @SuaYoo for
visibility)
2024-03-18 22:14:13 -04:00
sua yoo
c9c57fafee
fix: hide wip qa tab 2024-03-18 18:59:24 -07:00
Emma Segal-Grossman
b1e2f1b325
Add ESLint rules for import ordering (#1608)
Follow-up from
https://github.com/webrecorder/browsertrix-cloud/pull/1546#discussion_r1529001599
(cc @SuaYoo)

- Adds `eslint-plugin-import-x` and
`@ianvs/prettier-plugin-sort-imports` and configures rules for them both
so imports get sorted on format & on lint.
- Runs both on everything!
2024-03-18 21:50:02 -04:00
Ilya Kreymer
5a4902b6d4
kubernetes api: avoid overriding content-type header in kubernetes-asyncio, pass in via arg instead (main) (#1605)
- instead of overriding the content-type header globally, pass
'application/merge-patch+json' to
self.custom_api.patch_namespaced_custom_object() directly
- bump kubernetes-asyncio to 29.0.0
- fixes potential issues with global override of the header in
kubernetes-asyncio
- copy of #1602 for main
2024-03-18 11:17:54 -07:00
sua yoo
6e9c14aea6
test: fix frontend auth unit test 2024-03-18 11:00:13 -07:00
Henry Wilkinson
1093aa959f
Adds favicons! (#1584)
Closes #328 

## Changes

The app has favicons now!

Added:
- SVG 
- Changes to slightly brighter colours in dark mode for better contrast!
- Fallback ICO
- `apple-touch-icon` (some browsers also use this, not just iOS)
- Web manifest with app description
- Two web manifest icon sizes should users add the app to their local
launcher (Windows' Start or macOS' Dock / Launchpad
  - Lighting & render by @emma-sg, thanks!

The manifest and icons are copied to the root directory at build time by
webpack. All of the dedicated ways of doing this seemed more complicated
than this?

---------
Co-authored-by: emma <hi@emma.cafe>
2024-03-16 15:11:31 -07:00
Henry Wilkinson
fa194c3d0d
Docs: Update docs theme (#1594)
Partially addresses #1241 

### Changes
- Adds Browsertrix logo to readme
- It detects if you're in light or dark mode and adjusts the text color
accordingly! _The future is now!_
- Minor readme updates
- Updates icon and adds favicon SVGs to the docs
- This does not yet use Konsole for the docs site title. Will have to
sort this out later along with private hosting for that font.
- Updates docs theme to use new brand colours — picked the green for
this one, will probably be consistent across all of Webrecorder's MKDocs
sites.
2024-03-16 15:09:31 -07:00
Ilya Kreymer
e7af081af1
profile browser fixes: better resource usage + load retry (main) (#1604)
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers

- Frontend: check that profile html page is loading, keep retrying if
still getting nginx error instead of loading an iframe with the error.

Fixes #1598 (Copy of #1599 from 1.9.4)
2024-03-16 15:07:04 -07:00
sua yoo
960f54bf4e
Update issue reporting templates (#1596)
Changes:
- Edits templates for succinctness and precision
- Separate section for screenshots and OS/browser for bugs
- Removes requirements and TODO section of features to simplify
interface for external-facing requests
2024-03-16 07:27:19 -04:00
wvengen
6278157f40
Make storage deletion work on more S3 providers, don't use access URL for deletion (#1600)
I came across [this
problem](https://forum.webrecorder.net/t/deleting-crawl-failure/512) and
noticed that the access URL is used when deleting files, causing my file
deletions to fail on OpenStack SWIFT S3 (relates to #1090). This trivial
change makes it work there.
2024-03-16 04:17:23 -04:00
sua yoo
eb7036bf87
Add QA tab to archived item detail (#1590)
Adds tab with placeholders as a starting point to work off of. The badge and button is not currently linked up to any data or actions.
2024-03-12 14:05:16 -07:00