1319 Commits

Author SHA1 Message Date
Tessa Walsh
00ced6dd6b
Add single page QA GET endpoint (#1635)
Fixes #1634 

Also make sure other get page endpoint without qa uses PageOut model
2024-03-27 14:57:59 -07:00
Henry Wilkinson
275f69493f
Frontend: icon-button Cleanup (#1628)
Closes #1591

### Changes
- Converts one instance of a button with an icon in it to an `icon-button`
- Makes all the trashcan icon buttons have a red hover state
- Adds localization function & placeholder to upload dialog "Name" field
- Adds localization functions to some missing icon-button label
instances
- Adds a few missing icon button labels

Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
2024-03-27 14:57:32 -04:00
Ilya Kreymer
412eb2ef32
MetaController update (#1630)
Bump metacontroller to latest (4.11)
2024-03-27 08:49:56 -07:00
Tessa Walsh
66b4532321
Give test_crawl_timeout 10 mins to finish (#1627)
Related to https://github.com/webrecorder/browsertrix-cloud/issues/1620

Follow-up to https://github.com/webrecorder/browsertrix-cloud/pull/1621,
which didn't seem to fix the problem.

I'm giving it much more time here in the hopes that it solves it (since
it's a nightly test, time shouldn't be such a pressing issue).
2024-03-26 18:33:30 -07:00
Tessa Walsh
e9895e78a2
Add additional filters to page list endpoints (#1622)
Fixes #1617 

Filters added:

- reviewed: filter by whether page has approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts list of strings,
comma-separated, each of which is coerced into True, False, or None, or
ignored if it is an invalid value)
- hasNotes: filter by whether page has at least one note (true) or not (false)

Tests have also been added to ensure that results are as expected.
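
As an illustration of the `approved` coercion described above, here is a minimal sketch. The function name `parse_approved` and the accepted spellings ("true", "false", "none", case-insensitive) are assumptions made for this example, not the endpoint's actual code.

```python
from typing import List, Optional

# Hypothetical lookup table: which spellings map to which value is an assumption.
_APPROVED_VALUES = {"true": True, "false": False, "none": None}


def parse_approved(raw: str) -> List[Optional[bool]]:
    """Coerce a comma-separated string into True / False / None values.

    Tokens that are not recognised are skipped rather than raising an error.
    """
    parsed: List[Optional[bool]] = []
    for token in raw.split(","):
        key = token.strip().lower()
        if key in _APPROVED_VALUES:
            parsed.append(_APPROVED_VALUES[key])
    return parsed
```

With this sketch, a value such as `true,none,maybe` would yield two entries, since `maybe` is dropped as invalid.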
2024-03-21 21:33:07 -07:00
Tessa Walsh
b3b1e0d7d8
Fix intermittent crawl timeout test failure (#1621)
Fixes #1620 

This increases the total timeout from 60 seconds to 120 seconds for
crawl to complete, which should be sufficient given how intermittently
the failure has been happening. Can increase it further if needed.
2024-03-21 17:18:27 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running, `QARun` data is stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
sua yoo
05e03e0b90
Disable Prettier check in CI (#1619)
Disables `prettier:check` until discrepancies are handled in
https://github.com/webrecorder/browsertrix-cloud/issues/1618 so that
formatting issues don't fail CI runs.
2024-03-20 15:01:51 -07:00
Emma Segal-Grossman
d2862ff797
Emit more modern code for browsers (#1614)
Adds a `browserslist` field to `package.json`, which Webpack picks up so
it doesn't convert nullish coalescing operators into obfuscated messes
like `key !== null && key !== void 0 ? key : null`.

This improves output size a little & improves the debugging experience
as well.

Tested in Chrome, FF, & Safari locally and didn't encounter any issues.
2024-03-19 17:22:41 -04:00
Emma Segal-Grossman
41d6e79cb3
Clean up ESLint warnings in main (#1616)
See title.

The only place this changes behaviour is in the placeholder page list,
which will be replaced by the real one shortly, so I'm going to just
merge this.
2024-03-19 17:22:27 -04:00
Tessa Walsh
21ae38362e
Add endpoints to read pages from older crawl WACZs into database (#1562)
Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.

After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.

Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.

StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
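
The batching and background-task pattern described above can be sketched roughly as follows. The names `insert_in_batches`, `start_import`, and the `insert_many` callback are hypothetical stand-ins for this example; only the batch size of 100 and the use of `asyncio.create_task` come from the description.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List

BATCH_SIZE = 100  # batch size mentioned in the commit message


async def insert_in_batches(
    pages: AsyncIterator[dict],
    insert_many: Callable[[List[dict]], Awaitable[None]],
    batch_size: int = BATCH_SIZE,
) -> int:
    """Consume pages one at a time and write them out in fixed-size batches."""
    batch: List[dict] = []
    total = 0
    async for page in pages:
        batch.append(page)
        if len(batch) >= batch_size:
            await insert_many(batch)
            total += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        await insert_many(batch)
        total += len(batch)
    return total


def start_import(
    pages: AsyncIterator[dict],
    insert_many: Callable[[List[dict]], Awaitable[None]],
) -> asyncio.Task:
    """Schedule the import on the running event loop and return immediately.

    Must be called from inside a coroutine; the caller should keep a
    reference to the returned task so it is not garbage-collected early.
    """
    return asyncio.create_task(insert_in_batches(pages, insert_many))
```

An endpoint handler could call `start_import(...)` and send its response right away while the inserts continue in the background.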
2024-03-19 14:14:21 -07:00
Emma Segal-Grossman
2c44011b5b
Update node version mentioned in docs (#1615)
Follow-up to #1612 

cc @SuaYoo
2024-03-19 16:40:53 -04:00
sua yoo
dcd2efcd3b
Fix asset imports in tests (#1611)
Addresses failing test in
https://github.com/webrecorder/browsertrix-cloud/pull/1592 by fixing
asset imports in unit tests. Unit tests now import an empty string for
all assets--note: if we want to test actual asset content, will need to
update this config.
2024-03-19 13:06:07 -07:00
sua yoo
26820cbaba
Upgrade Node 16 > 18 (#1612) 2024-03-19 13:02:08 -07:00
sua yoo
b43f550ff3
Fix missing page component imports (#1610)
Missed bug introduced in
https://github.com/webrecorder/browsertrix-cloud/pull/1608, adds back
imports and disables `import-x` rule.
2024-03-18 20:55:35 -07:00
Emma Segal-Grossman
91df222cdf
Fix mismatch in prettier import order config (#1609)
Follow-up to #1608 — quick fix for an issue I encountered after merging
main into #1497

Just going to directly merge once this completes (cc @SuaYoo for
visibility)
2024-03-18 22:14:13 -04:00
sua yoo
c9c57fafee
fix: hide wip qa tab 2024-03-18 18:59:24 -07:00
Emma Segal-Grossman
b1e2f1b325
Add ESLint rules for import ordering (#1608)
Follow-up from
https://github.com/webrecorder/browsertrix-cloud/pull/1546#discussion_r1529001599
(cc @SuaYoo)

- Adds `eslint-plugin-import-x` and
`@ianvs/prettier-plugin-sort-imports` and configures rules for them both
so imports get sorted on format & on lint.
- Runs both on everything!
2024-03-18 21:50:02 -04:00
Ilya Kreymer
5a4902b6d4
kubernetes api: avoid overriding content-type header in kubernetes-asyncio, pass in via arg instead (main) (#1605)
- instead of overriding the content-type header globally, pass
'application/merge-patch+json' to
self.custom_api.patch_namespaced_custom_object() directly
- bump kubernetes-asyncio to 29.0.0
- fixes potential issues with global override of the header in
kubernetes-asyncio
- copy of #1602 for main
2024-03-18 11:17:54 -07:00
sua yoo
6e9c14aea6
test: fix frontend auth unit test 2024-03-18 11:00:13 -07:00
Henry Wilkinson
1093aa959f
Adds favicons! (#1584)
Closes #328 

## Changes

The app has favicons now!

Added:
- SVG 
- Changes to slightly brighter colours in dark mode for better contrast!
- Fallback ICO
- `apple-touch-icon` (some browsers also use this, not just iOS)
- Web manifest with app description
- Two web manifest icon sizes should users add the app to their local
launcher (Windows' Start or macOS' Dock / Launchpad)
  - Lighting & render by @emma-sg, thanks!

The manifest and icons are copied to the root directory at build time by
webpack. All of the dedicated ways of doing this seemed more complicated
than this?

---------
Co-authored-by: emma <hi@emma.cafe>
2024-03-16 15:11:31 -07:00
Henry Wilkinson
fa194c3d0d
Docs: Update docs theme (#1594)
Partially addresses #1241 

### Changes
- Adds Browsertrix logo to readme
- It detects if you're in light or dark mode and adjusts the text color
accordingly! _The future is now!_
- Minor readme updates
- Updates icon and adds favicon SVGs to the docs
- This does not yet use Konsole for the docs site title. Will have to
sort this out later along with private hosting for that font.
- Updates docs theme to use new brand colours — picked the green for
this one, will probably be consistent across all of Webrecorder's MKDocs
sites.
2024-03-16 15:09:31 -07:00
Ilya Kreymer
e7af081af1
profile browser fixes: better resource usage + load retry (main) (#1604)
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers

- Frontend: check that profile html page is loading, keep retrying if
still getting nginx error instead of loading an iframe with the error.

Fixes #1598 (Copy of #1599 from 1.9.4)
2024-03-16 15:07:04 -07:00
sua yoo
960f54bf4e
Update issue reporting templates (#1596)
Changes:
- Edits templates for succinctness and precision
- Separate section for screenshots and OS/browser for bugs
- Removes requirements and TODO section of features to simplify
interface for external-facing requests
2024-03-16 07:27:19 -04:00
wvengen
6278157f40
Make storage deletion work on more S3 providers, don't use access URL for deletion (#1600)
I came across [this
problem](https://forum.webrecorder.net/t/deleting-crawl-failure/512) and
noticed that the access URL is used when deleting files, causing my file
deletions to fail on OpenStack SWIFT S3 (relates to #1090). This trivial
change makes it work there.
2024-03-16 04:17:23 -04:00
sua yoo
eb7036bf87
Add QA tab to archived item detail (#1590)
Adds tab with placeholders as a starting point to work off of. The badge and button are not currently linked up to any data or actions.
2024-03-12 14:05:16 -07:00
Henry Wilkinson
16e8b761c0
Frontend: Various icon updates (#1569)
Closes #1568 

## Changes
- Status icons are now filled!
- Uses Bootstrap Icons' new `copy` icon for all actions involving
copying to clipboard!
  - Finally! A real copy icon! 🎉 
  - Removes `copy-code.svg` as it is no longer used
- Actions involving duplicating objects still use `files`... Which is
good! Now they have distinct symbols!
- Adds orange to the tailwind colour palette

---------

Co-authored-by: sua yoo <sua@webrecorder.org>
2024-03-12 15:18:10 -04:00
sua yoo
9f312c075e
Manually approve pages in QA review (#1576)
- Automatically update view to first page if page ID isn't specified
- Show current page URL in location bar (resolves
https://github.com/webrecorder/browsertrix-cloud/issues/1495)
- Approve, reject, or leave notes on a page
- Display temporary list of links to pages in the sidebar
2024-03-12 10:08:51 -07:00
Henry Wilkinson
8ba29ca776
Browsertrix Cloud → Browsertrix text rename (#1466)
Part of #1241

### Changes
- Renames all instances of "Browsertrix Cloud" to "Browsertrix" on the
front end, emails, and documentation

---------

Co-authored-by: emma <hi@emma.cafe>
2024-03-12 11:30:05 -04:00
Ilya Kreymer
08f6847194
Configurable Max Scale for frontend (#1557)
Allow maximum scale option to be fully configurable via
`max_crawl_scale`. Already configurable on the backend, and now exposed
to the frontend via API `/api/settings` `maxCrawlScale` value.

The workflow editor and workflow details are updated to allow selecting
the scale up to the maxCrawlScale setting (which defaults to 3 if not
set).
2024-03-11 16:21:20 -07:00
Emma Segal-Grossman
8462c08206
Fix a couple linting issues (#1565) 2024-03-11 16:20:37 -07:00
sua yoo
548261e663
Fix shoelace icon loading (#1587)
Loads `sl-icon` synchronously to get correct base path when running
webpack-dev-server.
2024-03-11 13:38:58 -07:00
dependabot[bot]
a5521c6866
Bump cryptography from 41.0.1 to 42.0.4 in /ansible (#1574)
Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.1
to 42.0.4.

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-03-06 16:24:36 -08:00
Ilya Kreymer
ea494fa6e6
Merge V1.9.3 changes into main (#1583)
- Fix execution time checking by keeping lastUpdatedTime in db by
@ikreymer in https://github.com/webrecorder/browsertrix-cloud/pull/1573
- disable postcss-lit for var css
- Prevent closing tooltips from closing collection share dialog by
@SuaYoo in https://github.com/webrecorder/browsertrix-cloud/pull/1579
- Fix pending exclusion pagination by @SuaYoo in
https://github.com/webrecorder/browsertrix-cloud/pull/1578
- Fix regex escape in exclusion editor text match by @SuaYoo in
https://github.com/webrecorder/browsertrix-cloud/pull/1577

---------
Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
2024-03-06 15:38:22 -08:00
Tessa Walsh
c20e754269
Add updatable QA reviewStatus field to crawls (#1575)
Fixes #1539 

Adds `reviewStatus` field to `BaseCrawl` model, updatable via the crawl
update API endpoint. Acceptable values are "good", "acceptable" or
"failure", enforced by an Enum.

Added to `BaseCrawl` so that we can extend support to uploads more
easily later on, but for now we'll only display this for crawls in the
frontend.
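
For illustration, restricting a field to the three values above with an Enum might look like the sketch below. The class name `ReviewStatus` and helper `parse_review_status` are assumptions for this example and may not match the actual model code.

```python
from enum import Enum


class ReviewStatus(str, Enum):
    """Allowed review status values, as listed in the commit message."""

    GOOD = "good"
    ACCEPTABLE = "acceptable"
    FAILURE = "failure"


def parse_review_status(value: str) -> ReviewStatus:
    """Return the matching enum member; raises ValueError for any other string."""
    return ReviewStatus(value)
```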
2024-03-05 16:49:23 -08:00
Emma Segal-Grossman
780dd09321
Create ArchivedItemPage and ArchivedItemPageComment types (#1567)
Based on #1534

Figured this should be in place so we can work on other front-end things
with these, rather than dealing with refactoring later

### Changes

- Adds `ArchivedItemPage` and `ArchivedItemPageComment` types from #1534
(thank you @SuaYoo!)
- Adds typedefs for match and resource count properties
- Sets properties that are optional in the db schema to optional in the type as
well

2024-03-04 18:52:09 -05:00
Tessa Walsh
ec0db1c323
Temporarily remove pages migration (#1572)
Removing until we have a better tested solution, including to avoid testing of QA runs for new crawls in beta.
2024-03-04 10:30:04 -08:00
Tessa Walsh
144000c7a3
Add guide for customizing Helm chart values (#1556)
Fixes #1555 

This is a first pass at some of the configuration options within the
Helm chart that might be most applicable to users. Emphasis is placed on
configuration that's particular to our application, such as storage and
crawler channels.

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-03-04 12:03:11 -05:00
Ilya Kreymer
09a0d51843
pages: set page status to 200 if unset and loadState != 0 (#1563)
Follow up to #1516, ensure page status is set to 200 if no status is
provided, if loadState is not 0
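
The rule above amounts to a small defaulting step, sketched here. The function name `effective_status` is hypothetical, and treating "unset" as `None` is an assumption.

```python
from typing import Optional


def effective_status(status: Optional[int], load_state: int) -> Optional[int]:
    """Default the page status to 200 when none was recorded but the page loaded."""
    if status is None and load_state != 0:
        return 200
    # Either a status was provided, or the page never loaded (loadState == 0).
    return status
```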
2024-02-29 15:15:17 -08:00
Ilya Kreymer
2ac6584942
Refactor operator class into module (#1564)
The operator class has gotten fairly large, so this is a first pass at
refactoring operator.py into a submodule instead, with multiple operator
instances which handle different types of objects.

- The main k8s interface has been split into K8sOpApi which extends K8sApi
and is shared across all operators.
- Each operator extends BaseOperator which also has an instance of K8sOpApi
- The CrawlOperator is still the bulk of the functionality, but will likely be further refactored
to support QA jobs

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-02-29 14:40:12 -08:00
Tessa Walsh
da19691184
Add crawl errors incrementally during crawl (#1561)
Fixes #1558 

- Adds crawl errors to database incrementally during crawl rather than
after crawl completes
- Simplifies crawl /errors API endpoint to always return errors from
database
2024-02-29 09:16:34 -08:00
Ilya Kreymer
804f755787
Increase startup probe time to account for long-running migrations (#1560)
- increases the failureThreshold for startupProbe for the api backend
container to account for long running migrations, up to 300 seconds
- add `/healthzStartup` which checks if db is ready
- bump 
- keeps `/healthz` to always return 200 when running
- increases livenessProbe failureThreshold to be higher than readiness
probe, following recommended best practice of liveness probe > readiness
probe
- fixes #1559
2024-02-28 14:22:33 -08:00
Tessa Walsh
14189b7cfb
Add crawl pages and related API endpoints (#1516)
Fixes #1502 

- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted

Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.

Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-28 12:11:35 -05:00
sua yoo
974b919eef
docs: remove reference to prod 2024-02-26 13:26:47 -08:00
sua yoo
86a816662e
add api reference section 2024-02-26 12:58:21 -08:00
Emma Segal-Grossman
f6e82d9335
Archived item nav button quickfix (#1543)
Navigation buttons weren't being laid out properly and were overflowing
in unintentional ways. This fixes that, and then also updates navigation
buttons & puts them into use everywhere elements serving the purpose of
navigation buttons were used instead!


2024-02-25 02:04:53 -05:00
Ilya Kreymer
ae59617e02
ci fix: deploy-dev.yaml fix, install poetry earlier, add decrypt values to sparse checkout 2024-02-23 18:40:36 -08:00
Ilya Kreymer
5e003f36a0
ci: also publish helm chart for *-release branches 2024-02-22 23:54:23 -08:00
Tessa Walsh
fa35d8994f
Disable useSitemap by default in new workflows (#1541) 2024-02-22 23:54:23 -08:00
Ilya Kreymer
8ae032ff88
More friendly WARC prefix inside WACZ based on Org slug + Crawl Name / First Seed URL. (#1537)
Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug
[crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler
1.0.0-beta.4 or higher
If crawl name is provided, uses crawl name, otherwise hostname of first
seed. The name is 'sluggified', using lowercase alphanum characters
separated by dashes.

Ex: in an organization called `Default Org`, a crawl of
`https://specs.webrecorder.net/` and no name will have WARCs named:
`default-org-specs-webrecorder-net-....warc.gz`
If the crawl is given the name `SPECS`, the WARCs will be named
`default-org-specs-manual-....warc.gz`
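
A rough sketch of the prefix construction described above, assuming a simple regex-based slug. The helper names `slugify` and `warc_prefix` are hypothetical, and the real slug rules may differ in edge cases.

```python
import re
from typing import Optional
from urllib.parse import urlsplit


def slugify(text: str) -> str:
    """Lowercase the text and join runs of alphanumeric characters with dashes."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def warc_prefix(org_slug: str, crawl_name: Optional[str], first_seed_url: str) -> str:
    """Build '<org slug>-<slug of crawl name or first seed host>'."""
    # Fall back to the first seed's hostname when no crawl name is given.
    source = crawl_name or (urlsplit(first_seed_url).hostname or "")
    return f"{org_slug}-{slugify(source)}"
```

Under these assumptions, an unnamed crawl of `https://specs.webrecorder.net/` in an org with slug `default-org` gives the prefix `default-org-specs-webrecorder-net`, matching the first example above.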

Fixes #412 in a default way.
2024-02-22 23:54:23 -08:00