Commit Graph

1613 Commits

Author SHA1 Message Date
Ilya Kreymer
6be1f6674c
fixes token lifetime bug / improve security (#2490)
- fix jwt_token_lifetime being in hours, not minutes, remove extra * 60
- don't return userids in user list for org admins, instead just key
users by email, which is already unique
2025-03-19 10:07:09 -07:00
Ilya Kreymer
eb300815a7
Fixes #2488 (#2493)
- Fixes #2488 
- Adds a k8s api call to set `suspend=false` on Job when associated
CrawlJob is finished.
- bump version - released as 1.14.5
2025-03-19 10:06:25 -07:00
sua yoo
d2601a037e
feat: Show running crawl when editing workflow (#2481)
Part of https://github.com/webrecorder/browsertrix/issues/2366

## Changes

- Displays latest running crawl status when editing workflow
- Disables "Run Now" button if crawl is currently running

Currently, clicking "Run Now" will result in a preventable server error
if the crawl is already running. The change in this PR is in preparation
for being able to update a currently running crawl and doesn't require
any backend changes.

## Manual testing

1. Log in as crawler
2. Go to edit crawl workflow
3. Open same workflow in another tab
4. Run the workflow
5. Go back to edit tab. Verify "Starting" status is shown next to "Save"
button and "Run Crawl" button is disabled

## Screenshots

| Page | Image/video |
| ---- | ----------- |
| Edit Workflow | <img width="354" alt="Screenshot 2025-03-11 at 1 34
07 PM"
src="https://github.com/user-attachments/assets/02f7fb4a-219d-43a4-bb1f-1f2b40ac1480"
/> |


<!-- ## Follow-ups -->

---------

Co-authored-by: emma <hi@emma.cafe>
2025-03-18 18:54:04 -04:00
Emma Segal-Grossman
89a6e84377
Fix broken thumbnail images not taking up appropriate size on ff (#2486)
Closes #2485 

Also adds alt text to collection thumbnail images.
2025-03-18 18:53:10 -04:00
sua yoo
bcb73932d4
docs: Organize readme and fix doc links (#2479)
Resolves https://github.com/webrecorder/browsertrix/issues/2478

## Changes

- Organizes README
- Fixes relative links in mkdocs

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-11 18:37:20 -07:00
Emma Segal-Grossman
b2c5b9bc59
Hide breadcrumbs for private orgs (#2477)
Hides "Back to [org name]" breadcrumb when viewing a public/unlisted
collection when the public gallery isn't enabled for the org (except
when logged into that org).
2025-03-11 15:05:35 -04:00
sua yoo
ac1236f15b
feat: Add behaviors section to workflow form (#2464)
- Moves "Per-Page Limits" fields to new "Page Behavior" section
- Fixes workflow settings closing tags with refactor to how sections are
rendered
- Updates user guide with behaviors documentation

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2025-03-11 11:40:20 -07:00
emma
a42d83c9f6
add content-length and etag headers to thumbnail endpoint 2025-03-10 13:58:41 -04:00
Ilya Kreymer
d8365c734f version: bump to 1.14.4 2025-03-08 15:58:18 -08:00
Ilya Kreymer
00a42515c8
docs: add public collections gallery howto (#2462)
- Updated how collections gallery and presentation and sharing pages
- Collections gallery page content extracted from blog post, linked from blog post
- Each page has one video covering the gallery setting and individual collection presentation
- Cleaned up text on both to avoid duplicated content (thanks @DaleLore)



---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: DaleLore <DaleLoreNY@gmail.com>
2025-03-08 15:57:13 -08:00
Ilya Kreymer
75eb04c37b
Translations update from Hosted Weblate (#2467) (#2471)
Translations update from [Hosted Weblate](https://hosted.weblate.org)
for

[Browsertrix/Browsertrix](https://hosted.weblate.org/projects/browsertrix/browsertrix/).



Current translation status:

![Weblate translation

status](https://hosted.weblate.org/widget/browsertrix/browsertrix/horizontal-auto.svg)

---------

Co-authored-by: Weblate (bot) <hosted@weblate.org>
Co-authored-by: Anne Paz <anelisespaz@gmail.com>
Co-authored-by: weblate <1607653+weblate@users.noreply.github.com>
2025-03-07 12:40:43 -08:00
Emma Segal-Grossman
8078f3866b
Add missing "payment never made" subscription status to superadmin org list (#2457) 2025-03-07 12:38:09 -08:00
sua yoo
fa05d68292
fix: Open and highlight correct workflow form section on tab click (#2463)
Fixes https://github.com/webrecorder/browsertrix/issues/2461

## Changes

Opens workflow form section when clicking on section navigation link,
fixing issue with scroll position impacting unopened panels.
2025-03-07 12:35:24 -08:00
Ilya Kreymer
03fa00df45
set default crawler channel if not set, possible fix for #2458 (#2469)
update default RWP version
2025-03-07 12:32:19 -08:00
Ilya Kreymer
6c192df49d
Add thumbnail endpoint (#2468)
- Add /thumbnail collections endpoint to serve the thumbnail as an image for public
collections.
- Also fix uploading thumbnail images to use correct mime, if available.
2025-03-07 12:29:36 -08:00
Tessa Walsh
13bf818914
Fix nightly tests (#2460)
Fixes #2459 

- Set `/data/` as primary storage `access_endpoint_url` in nightly test
chart
- Modify nightly test GH Actions workflow to spawn a separate job per
nightly test module using dynamic matrix
- Set configuration not to fail other jobs if one job fails
- Modify failing tests:
- Add fixture to background job nightly test module so it can run alone
- Add retry loop to crawlconfig stats nightly test so it's less
dependent on timing

GitHub limits each workflow to 256 jobs, so this should continue to be
able to scale up for us without issue.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-03-06 16:23:30 -08:00
Ilya Kreymer
9466e83d18 version: bump to 1.14.3 2025-03-03 15:20:40 -08:00
Ilya Kreymer
afa892000b
replay api: add downloadUrl to replay endpoints to be used by RWP (#2456)
RWP (2.3.3+) can determine if the 'Download Archive' menu item should be
showed based on the value of downloadUrl.
If set to 'null', will hide the menu item:
- set downloadUrl to public collection download for public collections
replay
- set downloadUrl to null for private collection and crawl replay to
hide the download menu item in RWP (otherwise have to add the
auth_header query with bearer token and should assess security before
doing that..)

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 14:11:28 -08:00
sua yoo
65a40c4816
feat: Show additional collection details (#2455)
Resolves https://github.com/webrecorder/browsertrix/issues/2452

## Changes

- Displays page count and collection size in listing grid
- Displays month if collection period is in the same year
- Displays collection size in About > Details section
- Minor refactor: move byte formatting into `localize.ts` utility file,
move slash (`/`) separator into own utility file
2025-03-03 13:15:27 -08:00
Ilya Kreymer
e13c3bfb48
move db migrations to initContainers: (#2449)
- should avoid gunicorn worker timeouts for long running migrations,
also fixes #2439
- add main_migrations as entrypoint to just run db migrations, using
existing init_ops() call
- first run 'migrations' container with same resources as 'app' and 'op'
- additional typing for initializing db
- cleanup unused code related to running only once, waiting for db to be ready
- fixes #2447
2025-03-03 13:13:15 -08:00
Ilya Kreymer
702c9ab3b7
Better cacheing of presigned URLs + support for thumbnails (#2446)
Overhauls URL presigning by:
- cache the presigned urls in a flat, separate mongodb collection which
has an expiring index
- update presigned urls if not found / expired automatically in index
- remove logic on storing presignedUrl in files
- support cacheing presigned URL for thumbnails.
- add endpoints to clear presigned urls for org or for all files in all
orgs (superadmin only)
- supersedes #2438, fix for #2437
- removes previous presignedUrl and expireAt data from crawls and QA
runs

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 12:05:23 -08:00
Ilya Kreymer
631b019baf
optimize public collection loading: (#2444)
- remove query for /collections endpoint just to get the org name
- add orgName to single /collection endpoint, where it is already
available on the backend
2025-03-03 10:13:30 -08:00
Ilya Kreymer
2263745df3
Fix replay.json 400 response for empty collection (#2445)
- fix #2443 
- don't throw error in list_pages() if no crawls provided, just return
empty list
- ensure an empty collection returns 200 on replay.json, add tests
2025-03-03 09:38:19 -08:00
Ilya Kreymer
2e86ee3fcc
Weblate (#2450)
Translations update from [Hosted Weblate](https://hosted.weblate.org)
for
[Browsertrix/Browsertrix](https://hosted.weblate.org/projects/browsertrix/browsertrix/).

Current translation status:

![Weblate translation
status](https://hosted.weblate.org/widget/browsertrix/browsertrix/horizontal-auto.svg)

Co-authored-by: Weblate (bot) <hosted@weblate.org>
Co-authored-by: Anne Paz <anelisespaz@gmail.com>
Co-authored-by: weblate <1607653+weblate@users.noreply.github.com>
2025-03-02 19:46:00 -08:00
Ilya Kreymer
64621ba6c0
frontend: fix rendering when backend not available yet (#2448)
- don't wait for languages to be ready to render UI, as this can result
in empty page if backend can not be reached.
- catch if /api/settings returns an invalid response to show 'backend
initializing' message
- will support initContainers where backend may return 5xx error while
backend is initializing, via #2449

Note: this results in locale picker showing all available locales if
backend is not available, not just filtered ones, but I think that's a
reasonable trade-off.
2025-03-01 14:02:37 -08:00
Emma Segal-Grossman
53b531ce3e
Show download button on public collection pages regardless of collection access (#2442)
Reported here
https://discord.com/channels/895426029194207262/1011678975636013066/1345095899008860224

Public-facing collections (whether public or unlisted) should have the
download button visible if "show download button" is enabled.
2025-02-28 22:07:38 -08:00
Ilya Kreymer
cb52da66dc version: bump to 1.14.2 2025-02-27 14:13:03 -08:00
Tessa Walsh
45aa0a32b6
Calculate total for crawl QA page endpoint (#2435)
Fixes #2434 

Patch fix for a regression in Browsertrix 1.4.0-1.4.1 where total was
not being calculated for QA page list endpoint but still being included
in response, which led to total always being 0 and pages not loading in
the frontend review screen as a result.
2025-02-27 11:46:35 -08:00
Ilya Kreymer
376c9981dc version: bump to 1.14.1 2025-02-26 23:15:01 -08:00
Tessa Walsh
3dc8c825c6
Add superadmin endpoint to readd scheduled workflow cronjobs (#2430)
Adds new superadmin-only `POST /orgs/all/crawlconfigs/reAddCronjobs`
endpoint to update/recreate scheduled workflow cronjobs across all orgs.
2025-02-26 23:13:53 -08:00
Tessa Walsh
da77b066a4
Prevent btrix helper from doing anything to k8s contexts other than docker-desktop (#2431)
The `./btrix` development helper shouldn't be used for anything other
than local dev, which this commit helps to enforce.

When running any command, if the k8s context is anything other than
`docker-desktop` the script will now shut down immediately without doing
anything and print the message: "Attempting to modify context other than
docker-desktop not supported. Quitting."
2025-02-26 23:13:25 -08:00
Ilya Kreymer
67668438c0
ingress: only set ssl-redirect if using tls (#2432)
otherwise, http path should be accessible. Can be used when TLS
termination handled outside of ingress.
2025-02-26 23:12:07 -08:00
Emma Segal-Grossman
00e85c3e94
Add "Copy <item type> ID" to a bunch of menus (#2426)
Addresses feedback from here
https://discord.com/channels/895426029194207262/910966759165657161/1344367205004873819
by @tw4l.

Add "Copy <item type> ID" to a bunch of menus, including all list and
detail pages, as well as all other item/crawl/page lists.

| Screenshots |
|--------|
| <img width="323" alt="Screenshot 2025-02-26 at 3 56 48 PM"
src="https://github.com/user-attachments/assets/32044c47-65f3-4e80-8f39-df5fd2101324"
/> |
| <img width="246" alt="Screenshot 2025-02-26 at 4 02 06 PM"
src="https://github.com/user-attachments/assets/8f2d6272-f450-4923-b5c9-751a2eea9a26"
/> |
| <img width="419" alt="Screenshot 2025-02-26 at 4 02 55 PM"
src="https://github.com/user-attachments/assets/0c005a33-055d-4fb7-a79e-9bedae57b785"
/> |
| <img width="1104" alt="Screenshot 2025-02-26 at 1 57 01 PM"
src="https://github.com/user-attachments/assets/7ee43400-1b30-4c78-89a0-3ddb89ef90ca"
/> |
| <img width="292" alt="Screenshot 2025-02-26 at 4 01 10 PM"
src="https://github.com/user-attachments/assets/929f7870-aa83-4f3c-947a-efad377e0b49"
/> |
| <img width="240" alt="Screenshot 2025-02-26 at 4 03 19 PM"
src="https://github.com/user-attachments/assets/45bff838-f741-45ce-b1a7-a8cfefa9656b"
/> |

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2025-02-26 16:58:00 -05:00
Ilya Kreymer
e67708bd4f version: update to 1.14.0 2025-02-24 14:49:46 -08:00
Henry Wilkinson
c56481fc66
Add deepLink attribute to public collection replay embed (#2420)
### Changes

- Public collections can now be deeplinked

### Caveats

- When users click the _About this Collection_ tab and then return to
the _Browse Collection_ tab, the deeplink is gone until they visit
another page.
2025-02-24 14:33:39 -08:00
Ilya Kreymer
83180efac9
remove dropping page index on migrations (#2418)
Don't need it for now, and this will now be slow due to amount of pages.
Can readd in future migrations if we need it..

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-24 12:29:02 -08:00
Ilya Kreymer
8a507f0473
Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417)
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if succeeded
mismatches parallelism
- QA sorting: default to 'crawl order' by default to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
2025-02-21 13:47:20 -08:00
sua yoo
06f6d9d4f2
feat: Move admin route to own namespace (#2405)
Resolves https://github.com/webrecorder/browsertrix/issues/2382

## Changes
- Moves superadmin to `/admin` URL namespace
- Removes superadmin views from main webpack chunks
2025-02-20 18:43:31 -08:00
sua yoo
8db80f5570
feat: Workflow form collapsible section enhancements (#2381)
Resolves https://github.com/webrecorder/browsertrix/issues/2359

## Changes

- Track when a workflow form section is opened
- Hide workflow form section navigation on small screens

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-20 18:42:00 -08:00
Ilya Kreymer
3ca68bf1d2 version: 1.14.0-beta.6 2025-02-20 15:37:33 -08:00
Tessa Walsh
f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also Optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
Ilya Kreymer
f7cd476b1a
Additional French Translations from Weblate (#2410)
Co-authored-by: Weblate (bot) <hosted@weblate.org>
Co-authored-by: Bricaud Frédéric <frederic.bricaud@banq.qc.ca>
Co-authored-by: Webrecorder Dev <dev@webrecorder.org>
Co-authored-by: Carole Gagné <carole.gagne@banq.qc.ca>
Co-authored-by: weblate <1607653+weblate@users.noreply.github.com>
2025-02-20 11:04:34 -08:00
Ilya Kreymer
36e723cc51
Adjust crawler pvc on exit code 3 (out of storage) (#2375)
crawler 1.5.0 now has an exit code 3 for when crawler is actually out of
disk space. The operator should handle this by immediately adjusting the
PVC size.

Ideally, crawler will be improved to avoid this, but since this can
still happen, operator should be able to respond and fix the issue.
2025-02-20 11:03:28 -08:00
Ilya Kreymer
88a9f3baf7
ensure running crawl configmap is updated when exclusions are added/removed (#2409)
exclusions are already updated dynamically if crawler pod is running,
but when crawler pod is restarted, this ensures new exclusions are also
picked up:
- mount configmap in separate path, avoiding subPath, to allow dynamic
updates of mounted volume
- adds a lastConfigUpdate timestamp to CrawlJob - if lastConfigUpdate in
spec is different from current, the configmap is recreated by operator
- operator: also update image from channel avoid any issues with
updating crawler in channel
- only updates for exclusion add/remove so far, can later be expanded to
other crawler settings (see: #2355 for broader running crawl config
updates)
- fixes #2408
2025-02-19 11:42:19 -08:00
Emma Segal-Grossman
905fe059a4
Add superadmin instance stats card (#2404)
Closes #2401


https://github.com/user-attachments/assets/cbd288d7-8e9c-4e86-ae87-6a308f6bdd58
2025-02-18 17:29:26 -05:00
Emma Segal-Grossman
f1dc790ab4
Org dashboard: update collection grid empty text state when view is set to "all" (#2402)
Tested locally.

cc @SuaYoo
2025-02-17 21:05:48 -05:00
Ilya Kreymer
d23bca1f73 style change: remove spaces from python version docstring 2025-02-17 16:52:49 -08:00
Ilya Kreymer
a7c8ca4028 version: bump to 1.14.0-beta.1 2025-02-17 16:48:27 -08:00
Tessa Walsh
6c2d8c88c8
Modify page upload migration (#2400)
Related to #2396 

Changes to migration 0037:
- Re-adds pages in migration rather than in background job to avoid race
condition with later migrations
- Re-adds pages for all uploads in all orgs

Fix for readd pages for org:
- Ensure org filter is applied!
- Fix wrong type
- Remove distinct, use iterator to iterate over crawls faster.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-17 16:47:58 -08:00
Emma Segal-Grossman
629cf7c404
Add a small sticky banner when logged in as superadmin (#2393)
While ideally we don't need to use superadmin for many things, there are
still a lot of places where it's necessary, especially around customer
service. This makes it a little more visible when that's the case, just
as a reminder. I could see this coming in handy especially for newer
people who might not have the experience to know to look for the "admin"
and "running crawls" buttons.

<img width="1088" alt="Screenshot 2025-02-13 at 1 12 58 PM"
src="https://github.com/user-attachments/assets/70b975e1-af6b-4e8c-9e49-52c4c66e9721"
/>
2025-02-17 17:42:36 -05:00