1587 Commits

Author SHA1 Message Date
Tessa Walsh
a51f7c635e
Add behavior logs from Redis to database and add endpoint to serve (#2526)
Backend work for #2524

This PR adds a second dedicated endpoint similar to `/errors`, as a
combined log endpoint would give a false impression of being the
complete crawl logs (which is far from what we're serving in Browsertrix
at this point).

Eventually when we have support for streaming live crawl logs in
`crawls/<id>/logs` I'd ideally like to deprecate these two dedicated
endpoints in favor of using that, but for now this seems like the best
solution.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-08 02:16:10 +02:00
Tessa Walsh
f84f6f55e0
Add basic backend validation for selectLinks (#2510)
Follow-up to #2152 

Related to https://github.com/webrecorder/browsertrix/pull/2487

This PR provides very basic validation of the `config.selectLinks`
argument on workflow creation and update. Namely, it checks that:
- `config.selectLinks` is not an empty array
- Each entry consists of two non-empty text sequences separated by `->`

At this point we're not validating the actual CSS selector on the
backend, though we could add that down the road.
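
The two checks listed above could look roughly like the sketch below. It is
illustrative only and not the code from this PR: the function name is
hypothetical, and treating whitespace-only parts as empty is an assumption.

```python
def is_valid_select_links(select_links: list[str]) -> bool:
    """Hypothetical helper mirroring the two checks described in this PR."""
    # Check 1: the array itself must not be empty.
    if not select_links:
        return False
    for entry in select_links:
        # Check 2: exactly two text sequences separated by "->".
        parts = entry.split("->")
        if len(parts) != 2:
            return False
        # Assumption: whitespace-only parts count as empty.
        if not parts[0].strip() or not parts[1].strip():
            return False
    # The CSS selector itself is deliberately not validated here.
    return True
```

For example, `["a[href]->href"]` passes, while `[]`, `["a[href]"]`, and
`["->href"]` are rejected.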

Tests have been added accordingly.

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-07 21:36:05 +02:00
sua yoo
23f9e08a22
feat: Add custom behaviors to workflow (#2520)
Resolves https://github.com/webrecorder/browsertrix/issues/2151
Follows https://github.com/webrecorder/browsertrix/pull/2505

## Changes

- Allows users to set custom behaviors in workflow editor.
- Allows one or more behaviors to be added, each as a simple URL or a Git URL
- Calls validation endpoint to check if URL is valid.

---------

Co-authored-by: emma <hi@emma.cafe>
2025-04-02 17:45:27 -07:00
Tessa Walsh
cd7b695520
Add backend support for custom behaviors + validation endpoint (#2505)
Backend support for #2151 

Adds support for specifying custom behaviors via a list of strings.

When workflows are added or modified, minimal backend validation is done
to ensure that all custom behavior URLs are valid URLs (after removing
the git prefix and custom query arguments).

A separate `POST /crawlconfigs/validate/custom-behavior` endpoint is
also added, which can be used to validate a custom behavior URL. It
performs the same syntax check as above and then:
- For a URL pointing directly to a behavior file, ensures the URL resolves
and returns a 2xx/3xx status code
- For Git repositories, uses `git ls-remote` to ensure they exist (and
that branch exists if specified)
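
A minimal sketch of the syntax check described above, not the code added in
this PR. The function name, the `git+` prefix value, and the restriction to
`http`/`https` are assumptions made for illustration. The network checks (HTTP
status, `git ls-remote`) are left out.

```python
from urllib.parse import urlparse

# Assumption: Git-hosted behaviors are marked with a "git+" prefix.
GIT_PREFIX = "git+"


def is_valid_custom_behavior_url(url: str) -> bool:
    """Hypothetical syntax-only check for a custom behavior URL."""
    # Remove the git prefix, if present.
    if url.startswith(GIT_PREFIX):
        url = url[len(GIT_PREFIX):]
    # Remove custom query arguments before parsing.
    url = url.split("?", 1)[0]
    parsed = urlparse(url)
    # Assumption: only absolute http(s) URLs are accepted.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Under these assumptions, both `https://example.com/behavior.js` and
`git+https://example.com/org/repo?branch=main` pass, while `not a url` does not.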

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-04-02 16:20:51 -07:00
Ilya Kreymer
c067a0fe7c
fix qa page sorting: (#2530)
was sorting on qa.{qa_run_id} after the value was already replaced with
'qa', thus was sorting on non-existent value
fixes #2529
2025-04-02 09:25:38 -07:00
sua yoo
f6481272f4
feat: Specify custom link selectors (#2487)
- Allows users to specify page link selectors in workflow "Scope"
section
- Adds new `<btrix-syntax-input>` component for syntax-highlighted
inputs
- Refactors highlight.js implementation to prevent unnecessary language
loading
- Updates exclusion table header styles

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2025-04-02 00:32:34 -07:00
Ilya Kreymer
b5b4c4da15 version: update to 1.14.8 2025-03-31 14:17:53 -07:00
Ilya Kreymer
62e47a8817
support overriding crawler image pull policy per channel (#2523)
- add 'imagePullPolicy' field to each crawler channel declaration
- if unset, defaults to the setting in the existing
'crawler_image_pull_policy' field.
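
The fallback described above can be sketched as follows. This is not the
operator code from this PR: the helper name, the dict shape, and the example
values are hypothetical. Only the `imagePullPolicy` and
`crawler_image_pull_policy` names come from this change.

```python
def resolve_image_pull_policy(channel: dict, crawler_image_pull_policy: str) -> str:
    """Pick the pull policy for one crawler channel (hypothetical helper)."""
    # A per-channel 'imagePullPolicy' wins; if it is unset or empty,
    # fall back to the existing global setting.
    return channel.get("imagePullPolicy") or crawler_image_pull_policy
```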

fixes #2522

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-31 14:11:41 -07:00
sua yoo
df8c80f3cc
task: Display built-in behaviors as list (#2518)
- Displays built-in behaviors as single field in workflow settings
- Standardizes how "None" is displayed in workflow settings
- Refactors behavior names into enum
2025-03-26 17:09:02 -07:00
Ilya Kreymer
61809ab3c5 ci: typo fix, move 'workflow_dispatch' to correct place 2025-03-26 13:02:38 -07:00
Ilya Kreymer
0925da6768
CI: Update python version + script (#2521)
Ensure we're on the latest versions of CI actions + Python (except lint check, due to issue)
Also allow running the MicroK8s tests on demand with workflow dispatch
2025-03-26 12:53:18 -07:00
Ilya Kreymer
b3950dd03f version: update to 1.14.7 2025-03-25 17:25:24 -07:00
Ilya Kreymer
9250befea4
ingress: remove X-Forward-Proto snippet, no longer needed (and now possibly considered unsafe) (#2519)
X-Forward-Proto is now already provided by the standard ingress-nginx config
2025-03-25 17:24:55 -07:00
Ilya Kreymer
21a372057b
Fix user emails use userout (#2511)
Follow-up to #2495, actually ensure org subscription data is included
in admin email response

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-24 12:04:39 -07:00
Ilya Kreymer
46be6a0cf6 version: bump to 1.14.6 2025-03-20 16:52:20 -07:00
Henry Wilkinson
c797e8446d
docs: Add UI documentation page on status icons (#2506)
### Changes
- Adds status icons page
- Moves action menus page to the UI development docs folder
- Fixes sentence fragment
2025-03-20 16:51:20 -07:00
Henry Wilkinson
c770b9ec22
frontend: move name field to the top of the signup form (#2508)
Fixes #2507

Does what it says on the tin!
2025-03-20 16:50:43 -07:00
Ilya Kreymer
4c0ddd0fe3
crawl replay: remove isSeed=true from initialPages query (#2509)
- matches initial query for collections
- fixes 'Show Non-Seed Pages' not appearing for crawl replay
2025-03-20 15:03:41 -07:00
Ilya Kreymer
cb14ac3a00
add org subs info to /api/users/emails endpoint (#2495)
Include additional info in this superadmin-only endpoint.
2025-03-20 08:31:23 -07:00
Ilya Kreymer
b63caf74ad
cleanup unused chart values + change mongo default (#2484)
- Removes chart values that are unused
- Also change `local-mongo.default` -> `local-mongo`,
`local-minio.default` -> `local-minio` as some users have reported
issues with `.default` and it will certainly break if not deploying
Browsertrix in the `default` namespace.
2025-03-20 08:30:45 -07:00
Henry Wilkinson
cf6690e74a
docs: add development section on action menus (#2429)
Closes #2428
2025-03-19 18:46:09 -04:00
Ilya Kreymer
c9c32d86e2
login: don't set default slug if user not part of any orgs #2491 (#2492)
if logged in user is not part of any orgs, still allow logging in,
instead of throwing an exception due to accessing non-existent org

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2025-03-19 15:23:16 -07:00
sua yoo
0bc210d905
devex: Add frontend code snippet & update dev docs (#2494)
- Adds VSCode file template for component unit testing.
- Updates development docs with details on UI dev
2025-03-19 14:22:20 -07:00
Emma Segal-Grossman
b471192cbc
Workflow editor footer button: ensure isCrawlRunning is false if editing a new workflow (#2496)
Reported by @tw4l 

Quick fix for the bug I introduced in 1bc3c35 in #2481. I didn't
properly test on the workflow editor in a "new workflow" state, and
didn't realize that the component that fetches the workflow state for an
existing workflow wouldn't be rendered for a new workflow, so the update
to the loading state never occurred for new workflows. This fix
explicitly sets `isCrawlRunning` to `false` instead of `null` for new
workflows, so that the loading state isn't displayed.

Tested locally with both new and existing workflows (in both non-running
and running states).
2025-03-19 15:44:16 -04:00
Ilya Kreymer
6be1f6674c
fixes token lifetime bug / improve security (#2490)
- fix jwt_token_lifetime being in hours, not minutes, remove extra * 60
- don't return userids in user list for org admins, instead just key
users by email, which is already unique
2025-03-19 10:07:09 -07:00
Ilya Kreymer
eb300815a7
Fixes #2488 (#2493)
- Fixes #2488 
- Adds a k8s api call to set `suspend=false` on Job when associated
CrawlJob is finished.
- bump version - released as 1.14.5
2025-03-19 10:06:25 -07:00
sua yoo
d2601a037e
feat: Show running crawl when editing workflow (#2481)
Part of https://github.com/webrecorder/browsertrix/issues/2366

## Changes

- Displays latest running crawl status when editing workflow
- Disables "Run Now" button if crawl is currently running

Currently, clicking "Run Now" will result in a preventable server error
if the crawl is already running. The change in this PR is in preparation
for being able to update a currently running crawl and doesn't require
any backend changes.

## Manual testing

1. Log in as crawler
2. Go to edit crawl workflow
3. Open same workflow in another tab
4. Run the workflow
5. Go back to edit tab. Verify "Starting" status is shown next to "Save"
button and "Run Crawl" button is disabled

## Screenshots

| Page | Image/video |
| ---- | ----------- |
| Edit Workflow | <img width="354" alt="Screenshot 2025-03-11 at 1 34 07 PM" src="https://github.com/user-attachments/assets/02f7fb4a-219d-43a4-bb1f-1f2b40ac1480" /> |


---------

Co-authored-by: emma <hi@emma.cafe>
2025-03-18 18:54:04 -04:00
Emma Segal-Grossman
89a6e84377
Fix broken thumbnail images not taking up appropriate size on ff (#2486)
Closes #2485 

Also adds alt text to collection thumbnail images.
2025-03-18 18:53:10 -04:00
sua yoo
bcb73932d4
docs: Organize readme and fix doc links (#2479)
Resolves https://github.com/webrecorder/browsertrix/issues/2478

## Changes

- Organizes README
- Fixes relative links in mkdocs

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-11 18:37:20 -07:00
Emma Segal-Grossman
b2c5b9bc59
Hide breadcrumbs for private orgs (#2477)
Hides "Back to [org name]" breadcrumb when viewing a public/unlisted
collection when the public gallery isn't enabled for the org (except
when logged into that org).
2025-03-11 15:05:35 -04:00
sua yoo
ac1236f15b
feat: Add behaviors section to workflow form (#2464)
- Moves "Per-Page Limits" fields to new "Page Behavior" section
- Fixes workflow settings closing tags with refactor to how sections are
rendered
- Updates user guide with behaviors documentation

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2025-03-11 11:40:20 -07:00
emma
a42d83c9f6
add content-length and etag headers to thumbnail endpoint 2025-03-10 13:58:41 -04:00
Ilya Kreymer
d8365c734f version: bump to 1.14.4 2025-03-08 15:58:18 -08:00
Ilya Kreymer
00a42515c8
docs: add public collections gallery howto (#2462)
- Updated collections gallery and presentation and sharing pages
- Collections gallery page content extracted from blog post, linked from blog post
- Each page has one video covering the gallery setting and individual collection presentation
- Cleaned up text on both to avoid duplicated content (thanks @DaleLore)



---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: DaleLore <DaleLoreNY@gmail.com>
2025-03-08 15:57:13 -08:00
Ilya Kreymer
75eb04c37b
Translations update from Hosted Weblate (#2467) (#2471)
Translations update from [Hosted Weblate](https://hosted.weblate.org)
for
[Browsertrix/Browsertrix](https://hosted.weblate.org/projects/browsertrix/browsertrix/).

Current translation status:

![Weblate translation
status](https://hosted.weblate.org/widget/browsertrix/browsertrix/horizontal-auto.svg)

---------

Co-authored-by: Weblate (bot) <hosted@weblate.org>
Co-authored-by: Anne Paz <anelisespaz@gmail.com>
Co-authored-by: weblate <1607653+weblate@users.noreply.github.com>
2025-03-07 12:40:43 -08:00
Emma Segal-Grossman
8078f3866b
Add missing "payment never made" subscription status to superadmin org list (#2457) 2025-03-07 12:38:09 -08:00
sua yoo
fa05d68292
fix: Open and highlight correct workflow form section on tab click (#2463)
Fixes https://github.com/webrecorder/browsertrix/issues/2461

## Changes

Opens workflow form section when clicking on section navigation link,
fixing issue with scroll position impacting unopened panels.
2025-03-07 12:35:24 -08:00
Ilya Kreymer
03fa00df45
set default crawler channel if not set, possible fix for #2458 (#2469)
update default RWP version
2025-03-07 12:32:19 -08:00
Ilya Kreymer
6c192df49d
Add thumbnail endpoint (#2468)
- Add /thumbnail collections endpoint to serve the thumbnail as an image for public
collections.
- Also fix uploading thumbnail images to use correct mime, if available.
2025-03-07 12:29:36 -08:00
Tessa Walsh
13bf818914
Fix nightly tests (#2460)
Fixes #2459 

- Set `/data/` as primary storage `access_endpoint_url` in nightly test
chart
- Modify nightly test GH Actions workflow to spawn a separate job per
nightly test module using dynamic matrix
- Set configuration not to fail other jobs if one job fails
- Modify failing tests:
  - Add fixture to background job nightly test module so it can run alone
  - Add retry loop to crawlconfig stats nightly test so it's less
    dependent on timing

GitHub limits each workflow to 256 jobs, so this should continue to be
able to scale up for us without issue.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-03-06 16:23:30 -08:00
Ilya Kreymer
9466e83d18 version: bump to 1.14.3 2025-03-03 15:20:40 -08:00
Ilya Kreymer
afa892000b
replay api: add downloadUrl to replay endpoints to be used by RWP (#2456)
RWP (2.3.3+) can determine if the 'Download Archive' menu item should be
shown based on the value of downloadUrl.
If set to 'null', will hide the menu item:
- set downloadUrl to public collection download for public collections
replay
- set downloadUrl to null for private collection and crawl replay to
hide the download menu item in RWP (otherwise have to add the
auth_header query with bearer token and should assess security before
doing that..)

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 14:11:28 -08:00
sua yoo
65a40c4816
feat: Show additional collection details (#2455)
Resolves https://github.com/webrecorder/browsertrix/issues/2452

## Changes

- Displays page count and collection size in listing grid
- Displays month if collection period is in the same year
- Displays collection size in About > Details section
- Minor refactor: move byte formatting into `localize.ts` utility file,
move slash (`/`) separator into own utility file
2025-03-03 13:15:27 -08:00
Ilya Kreymer
e13c3bfb48
move db migrations to initContainers: (#2449)
- should avoid gunicorn worker timeouts for long running migrations,
also fixes #2439
- add main_migrations as entrypoint to just run db migrations, using
existing init_ops() call
- first run 'migrations' container with same resources as 'app' and 'op'
- additional typing for initializing db
- cleanup unused code related to running only once, waiting for db to be ready
- fixes #2447
2025-03-03 13:13:15 -08:00
Ilya Kreymer
702c9ab3b7
Better caching of presigned URLs + support for thumbnails (#2446)
Overhauls URL presigning by:
- cache the presigned urls in a flat, separate mongodb collection which
has an expiring index
- update presigned urls if not found / expired automatically in index
- remove logic on storing presignedUrl in files
- support caching presigned URL for thumbnails.
- add endpoints to clear presigned urls for org or for all files in all
orgs (superadmin only)
- supersedes #2438, fix for #2437
- removes previous presignedUrl and expireAt data from crawls and QA
runs
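
The lookup-or-refresh behavior described above can be illustrated with a small
in-memory sketch. It is a stand-in only: the real change stores entries in a
MongoDB collection with an expiring index, whereas this example uses a plain
dict and explicit expiry timestamps. The class name, the `sign` callback, and
the `clock` parameter are hypothetical.

```python
import time


class PresignedUrlCache:
    """In-memory stand-in for a flat cache of presigned URLs (hypothetical)."""

    def __init__(self, sign, ttl_seconds, clock=time.monotonic):
        self._sign = sign          # callable: file key -> freshly presigned URL
        self._ttl = ttl_seconds    # how long a cached URL stays valid
        self._clock = clock        # injectable clock, useful for testing
        self._entries = {}         # file key -> (url, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        # Reuse the cached URL only if it exists and has not expired.
        if entry is not None and entry[1] > now:
            return entry[0]
        # Not found or expired: presign again and store the new entry.
        url = self._sign(key)
        self._entries[key] = (url, now + self._ttl)
        return url

    def clear(self):
        # Analogous to the "clear presigned urls" endpoints.
        self._entries.clear()
```

Keeping the cache flat and separate from the file records means expiry can be
handled in one place instead of being tracked on every crawl and QA run.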

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 12:05:23 -08:00
Ilya Kreymer
631b019baf
optimize public collection loading: (#2444)
- remove query for /collections endpoint just to get the org name
- add orgName to single /collection endpoint, where it is already
available on the backend
2025-03-03 10:13:30 -08:00
Ilya Kreymer
2263745df3
Fix replay.json 400 response for empty collection (#2445)
- fix #2443 
- don't throw error in list_pages() if no crawls provided, just return
empty list
- ensure an empty collection returns 200 on replay.json, add tests
2025-03-03 09:38:19 -08:00
Ilya Kreymer
2e86ee3fcc
Weblate (#2450)
Translations update from [Hosted Weblate](https://hosted.weblate.org)
for
[Browsertrix/Browsertrix](https://hosted.weblate.org/projects/browsertrix/browsertrix/).

Current translation status:

![Weblate translation
status](https://hosted.weblate.org/widget/browsertrix/browsertrix/horizontal-auto.svg)

Co-authored-by: Weblate (bot) <hosted@weblate.org>
Co-authored-by: Anne Paz <anelisespaz@gmail.com>
Co-authored-by: weblate <1607653+weblate@users.noreply.github.com>
2025-03-02 19:46:00 -08:00
Ilya Kreymer
64621ba6c0
frontend: fix rendering when backend not available yet (#2448)
- don't wait for languages to be ready to render UI, as this can result
in an empty page if the backend cannot be reached.
- catch if /api/settings returns an invalid response to show 'backend
initializing' message
- will support initContainers where backend may return 5xx error while
backend is initializing, via #2449

Note: this results in locale picker showing all available locales if
backend is not available, not just filtered ones, but I think that's a
reasonable trade-off.
2025-03-01 14:02:37 -08:00
Emma Segal-Grossman
53b531ce3e
Show download button on public collection pages regardless of collection access (#2442)
Reported here
https://discord.com/channels/895426029194207262/1011678975636013066/1345095899008860224

Public-facing collections (whether public or unlisted) should have the
download button visible if "show download button" is enabled.
2025-02-28 22:07:38 -08:00