Commit Graph

513 Commits

Ilya Kreymer
03fa00df45
set default crawler channel if not set, possible fix for #2458 (#2469)
update default RWP version
2025-03-07 12:32:19 -08:00
Ilya Kreymer
6c192df49d
Add thumbnail endpoint (#2468)
- Add /thumbnail collections endpoint to serve the thumbnail as an image for public
collections.
- Also fix uploading thumbnail images to use correct mime, if available.
2025-03-07 12:29:36 -08:00
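A minimal sketch of the mime handling described above (hypothetical helper names; the real handler lives in the collections API):

```python
from fastapi import HTTPException
from fastapi.responses import StreamingResponse


def thumbnail_response(thumbnail: dict | None, stream) -> StreamingResponse:
    """Hypothetical sketch: serve stored thumbnail bytes with the mime
    type recorded at upload time, defaulting to image/jpeg if missing."""
    if not thumbnail:
        raise HTTPException(status_code=404, detail="thumbnail_not_found")
    return StreamingResponse(
        stream, media_type=thumbnail.get("mime") or "image/jpeg"
    )
```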
Ilya Kreymer
9466e83d18 version: bump to 1.14.3 2025-03-03 15:20:40 -08:00
Ilya Kreymer
afa892000b
replay api: add downloadUrl to replay endpoints to be used by RWP (#2456)
RWP (2.3.3+) can determine whether the 'Download Archive' menu item should be
shown based on the value of downloadUrl.
If set to 'null', RWP will hide the menu item:
- set downloadUrl to the public collection download for public collections
replay
- set downloadUrl to null for private collection and crawl replay to
hide the download menu item in RWP (otherwise we would have to add the
auth_header query with a bearer token, and should assess security before
doing that)

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 14:11:28 -08:00
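A minimal sketch of the downloadUrl logic described above (URL shape assumed, not the exact Browsertrix route):

```python
from typing import Optional


def download_url_for_replay(
    org_slug: str, coll_slug: str, is_public: bool
) -> Optional[str]:
    # Public collection replay points at the public download endpoint;
    # private collection and crawl replay return None, serialized as
    # null, so RWP 2.3.3+ hides the 'Download Archive' menu item.
    if is_public:
        return f"/api/public/orgs/{org_slug}/collections/{coll_slug}/download"
    return None
```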
Ilya Kreymer
e13c3bfb48
move db migrations to initContainers: (#2449)
- should avoid gunicorn worker timeouts for long-running migrations;
also fixes #2439
- add main_migrations as entrypoint to just run db migrations, using
existing init_ops() call
- first run 'migrations' container with same resources as 'app' and 'op'
- additional typing for initializing db
- cleanup unused code related to running only once, waiting for db to be ready
- fixes #2447
2025-03-03 13:13:15 -08:00
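A rough sketch of what such a migrations-only entrypoint might look like (import path and signature assumed):

```python
"""Hypothetical sketch of a migrations-only entrypoint run from an
initContainer before the app and op containers start."""
import asyncio

from btrixcloud.ops import init_ops  # assumed location of existing helper


async def main() -> None:
    # init_ops() initializes the db and runs any pending migrations; once
    # it returns, the container exits and the app containers can start
    # without risking gunicorn worker timeouts on long migrations.
    await init_ops()


if __name__ == "__main__":
    asyncio.run(main())
```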
Ilya Kreymer
702c9ab3b7
Better caching of presigned URLs + support for thumbnails (#2446)
Overhauls URL presigning by:
- cache the presigned urls in a flat, separate MongoDB collection which
has an expiring index
- automatically update presigned urls in the index if not found or expired
- remove logic for storing presignedUrl in files
- support caching presigned URLs for thumbnails.
- add endpoints to clear presigned urls for org or for all files in all
orgs (superadmin only)
- supersedes #2438, fix for #2437
- removes previous presignedUrl and expireAt data from crawls and QA
runs

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 12:05:23 -08:00
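A minimal sketch of the expiring-index cache described above, assuming a motor database handle and an S3 presigning coroutine (collection and field names hypothetical):

```python
from datetime import datetime, timezone

PRESIGN_TTL_SECONDS = 3600  # assumed duration


async def init_presign_index(db) -> None:
    # TTL index: MongoDB deletes each cache entry automatically once
    # signedAt is older than the presign duration, so no manual sweep
    # is needed.
    await db.presigned_urls.create_index(
        "signedAt", expireAfterSeconds=PRESIGN_TTL_SECONDS
    )


async def get_presigned_url(db, filename: str, sign) -> str:
    """Hypothetical sketch: return the cached URL, or call sign(filename)
    (an assumed S3 presigning coroutine) and upsert on a cache miss."""
    cached = await db.presigned_urls.find_one({"_id": filename})
    if cached:
        return cached["url"]
    url = await sign(filename)
    await db.presigned_urls.replace_one(
        {"_id": filename},
        {"url": url, "signedAt": datetime.now(timezone.utc)},
        upsert=True,
    )
    return url
```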
Ilya Kreymer
631b019baf
optimize public collection loading: (#2444)
- remove query for /collections endpoint just to get the org name
- add orgName to single /collection endpoint, where it is already
available on the backend
2025-03-03 10:13:30 -08:00
Ilya Kreymer
2263745df3
Fix replay.json 400 response for empty collection (#2445)
- fix #2443 
- don't throw error in list_pages() if no crawls provided, just return
empty list
- ensure an empty collection returns 200 on replay.json, add tests
2025-03-03 09:38:19 -08:00
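A minimal sketch of the guard (names assumed):

```python
async def list_pages(db, crawl_ids: list[str], page_size: int = 25):
    # Hypothetical sketch: short-circuit instead of raising when a
    # collection has no crawls, so replay.json can return 200 with an
    # empty page list.
    if not crawl_ids:
        return []
    cursor = db.pages.find({"crawl_id": {"$in": crawl_ids}}).limit(page_size)
    return await cursor.to_list(page_size)
```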
Ilya Kreymer
cb52da66dc version: bump to 1.14.2 2025-02-27 14:13:03 -08:00
Tessa Walsh
45aa0a32b6
Calculate total for crawl QA page endpoint (#2435)
Fixes #2434 

Patch fix for a regression in Browsertrix 1.14.0-1.14.1 where the total was
not being calculated for the QA page list endpoint but was still included
in the response, which led to the total always being 0 and pages not loading
in the frontend review screen as a result.
2025-02-27 11:46:35 -08:00
Ilya Kreymer
376c9981dc version: bump to 1.14.1 2025-02-26 23:15:01 -08:00
Tessa Walsh
3dc8c825c6
Add superadmin endpoint to readd scheduled workflow cronjobs (#2430)
Adds new superadmin-only `POST /orgs/all/crawlconfigs/reAddCronjobs`
endpoint to update/recreate scheduled workflow cronjobs across all orgs.
2025-02-26 23:13:53 -08:00
Ilya Kreymer
e67708bd4f version: update to 1.14.0 2025-02-24 14:49:46 -08:00
Ilya Kreymer
83180efac9
remove dropping page index on migrations (#2418)
Don't need it for now, and this will now be slow due to the number of pages.
Can re-add it in future migrations if we need it.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-24 12:29:02 -08:00
Ilya Kreymer
8a507f0473
Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417)
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if succeeded
mismatches parallelism
- QA sorting: default to 'crawl order' to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
2025-02-21 13:47:20 -08:00
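A rough sketch of the consolidation, with the page total computed only for the slower /pages route (names assumed):

```python
async def list_pages(
    db, crawl_id: str, skip: int, limit: int, include_total: bool
):
    # Hypothetical sketch: /pages sets include_total=True (slower);
    # /pagesSearch leaves the total out entirely and stays fast.
    query = {"crawl_id": crawl_id}
    items = await db.pages.find(query).skip(skip).limit(limit).to_list(limit)
    total = await db.pages.count_documents(query) if include_total else None
    return {"items": items, "total": total}
```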
Ilya Kreymer
3ca68bf1d2 version: 1.14.0-beta.6 2025-02-20 15:37:33 -08:00
Tessa Walsh
f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
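A minimal sketch of two of the query optimizations above, assuming a motor database and a text index on the title field (field names hypothetical):

```python
async def page_url_counts(db, crawl_ids, url_prefix: str, limit: int = 10):
    # Index-backed range scan: $gte jumps straight to the prefix in a
    # url-sorted index instead of a full $regex scan; results past the
    # prefix are filtered out.
    cursor = (
        db.pages.find(
            {"crawl_id": {"$in": crawl_ids}, "url": {"$gte": url_prefix}}
        )
        .sort("url", 1)
        .limit(limit)
    )
    return [
        page async for page in cursor if page["url"].startswith(url_prefix)
    ]


async def title_search(db, crawl_ids, text: str, limit: int = 10):
    # Requires a text index on "title"; much cheaper than $regex.
    cursor = db.pages.find(
        {"crawl_id": {"$in": crawl_ids}, "$text": {"$search": text}}
    ).limit(limit)
    return await cursor.to_list(limit)
```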
Ilya Kreymer
36e723cc51
Adjust crawler pvc on exit code 3 (out of storage) (#2375)
crawler 1.5.0 now has an exit code 3 for when the crawler is actually out
of disk space. The operator should handle this by immediately adjusting
the PVC size.

Ideally, the crawler will be improved to avoid this, but since it can
still happen, the operator should be able to respond and fix the issue.
2025-02-20 11:03:28 -08:00
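A rough sketch of the operator-side response (growth factor assumed; the real operator patches the PVC spec through the Kubernetes API):

```python
# Hypothetical sketch: on exit code 3 from the crawler container, bump
# the PVC storage request so the pod can restart with more disk.
OUT_OF_STORAGE_EXIT_CODE = 3


def next_pvc_size_gb(exit_code: int, pvc_size_gb: int) -> int:
    if exit_code == OUT_OF_STORAGE_EXIT_CODE:
        # Grow by 25%, rounded up to at least 1 GB (assumed policy).
        return pvc_size_gb + max(1, pvc_size_gb // 4)
    return pvc_size_gb
```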
Ilya Kreymer
88a9f3baf7
ensure running crawl configmap is updated when exclusions are added/removed (#2409)
exclusions are already updated dynamically if the crawler pod is running,
but when a crawler pod is restarted, this ensures new exclusions are also
picked up:
- mount the configmap in a separate path, avoiding subPath, to allow
dynamic updates of the mounted volume
- adds a lastConfigUpdate timestamp to CrawlJob - if lastConfigUpdate in
the spec differs from the current value, the configmap is recreated by
the operator
- operator: also update the image from the channel to avoid any issues
with updating the crawler in a channel
- only updates for exclusion add/remove so far, can later be expanded to
other crawler settings (see: #2355 for broader running crawl config
updates)
- fixes #2408
2025-02-19 11:42:19 -08:00
Ilya Kreymer
d23bca1f73 style change: remove spaces from python version docstring 2025-02-17 16:52:49 -08:00
Ilya Kreymer
a7c8ca4028 version: bump to 1.14.0-beta.1 2025-02-17 16:48:27 -08:00
Tessa Walsh
6c2d8c88c8
Modify page upload migration (#2400)
Related to #2396 

Changes to migration 0037:
- Re-adds pages in migration rather than in background job to avoid race
condition with later migrations
- Re-adds pages for all uploads in all orgs

Fixes for re-adding pages for an org:
- Ensure the org filter is applied!
- Fix wrong type
- Remove distinct, use an iterator to iterate over crawls faster.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-17 16:47:58 -08:00
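A minimal sketch of the distinct-to-iterator change, with the org filter applied (per-crawl helper passed in; names assumed):

```python
async def readd_upload_pages_for_org(db, oid, readd_pages) -> None:
    """Hypothetical sketch: iterate a filtered cursor (note the org
    filter!) instead of materializing a distinct id list first;
    readd_pages is an assumed per-crawl coroutine."""
    cursor = db.crawls.find({"oid": oid, "type": "upload"}, ["_id"])
    async for crawl in cursor:
        await readd_pages(crawl["_id"])
```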
Ilya Kreymer
5bebb6161a
Issue 2396 readd pages fixes (#2398)
readd pages fixes:
- add additional memory to the background job
- copy page QA data to a separate temp collection when re-adding pages,
then merge it back in
2025-02-17 13:52:11 -08:00
Ilya Kreymer
e112f96614
Upload Fixes: (#2397)
- ensure upload pages are always added with a new uuid, to avoid any
duplicates with existing uploads, even if the uploaded WACZ is actually a
crawl from a different Browsertrix instance, etc.
- clean up upload names with slugify, which also replaces spaces, fixing
uploads of WACZ filenames with spaces in them
- part of fix for #2396

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-17 13:05:33 -08:00
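A rough sketch of both fixes, using the python-slugify package (helper names hypothetical):

```python
from uuid import uuid4

from slugify import slugify  # python-slugify; assumed dependency


def prepare_upload_page(page: dict) -> dict:
    # Always mint a fresh id so pages from a WACZ that originated on
    # another Browsertrix instance cannot collide with existing pages.
    page["_id"] = str(uuid4())
    return page


def clean_upload_name(filename: str) -> str:
    # Slugify also replaces spaces, so names like "My Crawl (1).wacz"
    # upload cleanly.
    stem, _, ext = filename.rpartition(".")
    return f"{slugify(stem)}.{ext}" if stem else slugify(filename)
```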
Tessa Walsh
39d99e7c5d
Add support for custom link selectors to backend (#2346)
Related to #2152 

This PR adds backend support for custom link selectors via `selectLinks`
on the crawl workflow config. Tests have been updated as well.

It also adds `selectLinks` to the frontend in a minimal and for now
hardcoded way that we can use as a basis for proper frontend support
moving forward.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-13 22:22:27 -08:00
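A minimal sketch of how the config field might be modeled (default shown in the crawler's "selector->attribute" rule format; the model name is assumed):

```python
from pydantic import BaseModel


class LinkSelectorConfig(BaseModel):
    # Hypothetical sketch: a list of link-extraction rules, where
    # "a[href]->href" means "follow the href attribute of anchor tags";
    # workflows can override the default per crawl.
    selectLinks: list[str] = ["a[href]->href"]
```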
Ilya Kreymer
4516268a70
misc fixes: cors + disable buffering for uploads (#2395)
- ensure pages endpoint support CORS for local dev
- disable proxy request buffering to support large uploads
2025-02-13 19:38:20 -08:00
Tessa Walsh
7f1af9bb31
Mark all pages from pages.jsonl as seeds (#2390)
Fixes #2389 

All pages from `pages/pages.jsonl` files now have `isSeed: True` in the
database, in addition to any pages that explicitly have `seed` set to
true in the actual JSONL.

Tests have been added to ensure that all pages from our fixture uploads
have `isSeed: True`.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-13 16:54:30 -08:00
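A minimal sketch of the seed-marking rule (field names assumed):

```python
import json


def load_pages(jsonl_text: str, from_seed_file: bool) -> list[dict]:
    """Hypothetical sketch: pages read from pages/pages.jsonl are always
    seeds; elsewhere, honor an explicit per-page "seed" flag."""
    pages = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        page = json.loads(line)
        page["isSeed"] = True if from_seed_file else page.get("seed", False)
        pages.append(page)
    return pages
```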
Ilya Kreymer
7b2932c582
Add initial pages + pagesQuery endpoint to /replay.json APIs (#2380)
Fixes #2360 

- Adds `initialPages` to /replay.json response for collections, returning
up to 25 pages (seed pages first, then sorted by capture time).
- Adds `pagesQueryUrl` to /replay.json
- Adds a public pages search endpoint to support public collections.
- Adds `preloadResources`, including list of WACZ files that should
always be loaded, to /replay.json

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-13 16:53:47 -08:00
sua yoo
f7b9b73a68
fix: Sort filtered collection page URLs (#2384)
Fixes https://github.com/webrecorder/browsertrix/issues/2383

- Fixes unpredictable sort order when typing in collection page URL
- Fixes page URL results flickering in and out while typing

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-12 11:59:20 -05:00
Emma Segal-Grossman
f8a44258d8
Merge pull request #2332 from webrecorder/frontend-collection-editing-dialog
Collection editing and sharing revamp
2025-02-11 18:27:35 -05:00
Tessa Walsh
98a45b0d85
Add collection page list/search endpoint (#2354)
Fixes #2353

Adds a new endpoint to list pages in a collection, with filtering
available on `url` (exact match), `ts`, `urlPrefix`, `isSeed`, and
`depth`, as well as accompanying tests. Additional sort options have
been added as well.

These same filters and sort options have also been added to the crawl
pages endpoint.

Also fixes an issue where `isSeed` wasn't being set in the database when
false but only added on serialization, which was preventing filtering
from working as expected.
2025-02-10 16:44:37 -08:00
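A rough sketch of how those filters might translate into a MongoDB query (field names assumed):

```python
def build_page_query(
    coll_id,
    url=None,
    url_prefix=None,
    ts=None,
    is_seed=None,
    depth=None,
) -> dict:
    # Hypothetical sketch of the filter-to-query mapping described above.
    query: dict = {"collection_ids": coll_id}
    if url is not None:
        query["url"] = url  # exact match
    elif url_prefix is not None:
        # index-friendly prefix range instead of a $regex scan
        query["url"] = {"$gte": url_prefix, "$lt": url_prefix + "\uffff"}
    if ts is not None:
        query["ts"] = ts
    if is_seed is not None:
        query["isSeed"] = is_seed
    if depth is not None:
        query["depth"] = depth
    return query
```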
Ilya Kreymer
001839a521
Fix max pages quota setting and display (#2370)
- add ensure_page_limit_quotas() which sets the config limit to the max
pages quota, if any
- set the page limit on the config when: creating new crawl, creating
configmap
- don't set the quota page limit on new or existing crawl workflows
(remove setting it on new workflows) to allow updated quotas to take
effect for the next crawl
- frontend: correctly display page limit on workflow settings page from
org quotas, if any.
- operator: get org on each sync in one place
- fixes #2369

---------

Co-authored-by: sua yoo <sua@webrecorder.org>
2025-02-10 16:15:21 -08:00
Tessa Walsh
0e9e70f3a3
Add WACZ filename, depth, favIconUrl, isSeed to pages (#2352)
Adds `filename` to pages, pointing to the WACZ file those pages come
from, as well as depth, favIconUrl, and isSeed. Also adds an idempotent
migration to backfill this information for existing pages, and increases
the backend container's startupProbe time to 24 hours to give it sufficient
time to finish the migration.
---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-05 15:50:04 -05:00
Ilya Kreymer
ea3b5e7322 quickfix: fix typo (missing self) that did not make it into #2351 2025-01-30 13:11:42 -08:00
Tessa Walsh
0a8df62ab4
Ensure collection stats are updated when WACZ is added on upload (#2351)
Fixes #2350 

Collection earliest/latest dates and the collection modified date are
also now updated when crawls or uploads are added to a collection via
the collection auto-add feature.
2025-01-30 13:05:56 -08:00
Tessa Walsh
b0aebb599a
Reformat with Black for 2025 ruleset (#2349) 2025-01-29 16:57:06 -05:00
Tessa Walsh
9363095d62
Validate exclusion regexes on backend (#2316) 2025-01-23 13:32:54 -05:00
Tessa Walsh
763c654484
feat: Update collection sorting, metadata, stats (#2327)
- Refactors dashboard and org profile preview to use private API
endpoint, to fix public collections not showing when the org
visibility is hidden
- Adds additional sorting options for collections
- Adds unique page url counts for archived items, collections, and
organizations to backend and exposes this in collections
- Shows collection period (i.e. `dateEarliest` to `dateLatest`) in
collections list
- Shows same collection metadata in private and public views, updates
private view info bar
- Fixes "Update Org Profile" action item showing for crawler roles

---------

Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-01-23 13:32:23 -05:00
Ilya Kreymer
28d39d8c4d
Fix migration to avoid duplicate collection slugs and names (#2318)
Follow-up to #2301 

Updates the 0039 migration to ensure collection slugs and names are
unique by:
- Removing all indexes
- Setting `slug` to random value
- Adding unique index to `slug` field.
- Attempting to set slug from name using `slug_from_name()`
- If rejected due to a duplicate, append `-<counter>` to the end of the
slug. Also update the name with ` <counter>`.
- Now that names should also be unique, add unique index on name field.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-01-21 14:23:32 -08:00
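A minimal sketch of the dedup loop described above (how the update is applied is assumed; `slug_from_name` is the existing helper the migration mentions):

```python
from pymongo.errors import DuplicateKeyError


async def set_unique_slug(colls, coll_id, name: str, slug_from_name) -> None:
    """Hypothetical sketch: try the natural slug, then append -2, -3, ...
    (updating the name to match) until the unique index on slug accepts
    the write."""
    counter = 1
    while True:
        suffix = "" if counter == 1 else f"-{counter}"
        try:
            await colls.find_one_and_update(
                {"_id": coll_id},
                {"$set": {
                    "slug": slug_from_name(name) + suffix,
                    "name": name if counter == 1 else f"{name} {counter}",
                }},
            )
            return
        except DuplicateKeyError:
            counter += 1
```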
Tessa Walsh
6797b41de0
Add pageCount to crawls and uploads and use in frontend for page counts (#2315)
Fixes #2257 

This is a follow-up to the public collections work, which adds pages to
the database for uploads. All crawls and uploads now have a `pageCount`
field which is populated when the item is successfully added. A new
migration is also added to populate the field for existing archived
items that don't have it set yet.

OrgMetrics have also been modified to include `crawlPageCount` and
`uploadPageCount`, and to include the total of both in `pageCount`, and
all three included in the frontend org dashboard.

The frontend has been updated to use `pageCount` rather than
`stats.done` wherever appropriate, meaning that in archived item lists
and details we now have a consistent page count for both crawls and
uploads.

### New functionality

- Deploy this branch
- Create new crawls and uploads and verify that page count appears
correctly throughout the frontend for all new crawls and uploads

### Migration

- Deploy from latest main
- Create some crawls and uploads
- Change to this branch and re-deploy
- Verify migration ran without errors in backend logs
- Verify that page count has been populated successfully by checking
archived items lists, crawl and upload detail pages, and dashboard to
ensure there are no longer any missing page counts.

---------

Co-authored-by: emma <hi@emma.cafe>
2025-01-16 14:41:14 -08:00
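A minimal sketch of the pageCount backfill for a single item (field names assumed):

```python
async def set_page_count(db, crawl_id: str) -> None:
    # Hypothetical sketch: count pages once the item is successfully
    # added, then store the result on the crawl/upload document.
    count = await db.pages.count_documents({"crawl_id": crawl_id})
    await db.crawls.find_one_and_update(
        {"_id": crawl_id}, {"$set": {"pageCount": count}}
    )
```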
Tessa Walsh
5684e896af
Add support for autoclick (#2313)
Fixes #2259 

This PR brings backend and frontend support for the new autoclick
behavior in Browsertrix, introduced in Browsertrix Crawler 1.5.0+.

On the backend, we introduce `min_autoclick_crawler_image` to
`values.yaml`, with a default value of
`"docker.io/webrecorder/browsertrix-crawler:1.5.0"`. If this is set and
the crawler version for a new crawl is less than this value, the
autoclick behavior is removed from the behaviors list in the configmap
created for the crawl.

The one caveat for this is that a crawler image tag like "latest" will
always be parsed as greater than `min_autoclick_crawler_image`, so there
is the potential for the crawler to run into issues if using a
non-numeric image tag with an older version of the crawler. For
production we use hardcoded specific versions of the crawler except for
the dev channel, which from here on out will include autoclick
support, so I think this should be okay (and is also true of the
existing implementation for checking `min_qa_crawler_image`).

On the frontend, I've added a checkbox (unchecked by default) in the
"Limits" section just below the current checkbox for autoscroll. We
might want to move these to a different section eventually - I'm not
sure Limits is the right place for them - but I wanted to be consistent
with things as they are.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-01-16 12:44:00 -08:00
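A minimal sketch of the version check and its "latest" caveat, using packaging (the real check may compare tags differently):

```python
from packaging.version import InvalidVersion, Version


def supports_autoclick(crawler_image: str, min_image: str) -> bool:
    """Hypothetical sketch: compare image tags as versions; a non-numeric
    tag like "latest" cannot be parsed and is treated as new enough,
    mirroring the caveat described above."""
    def tag_of(image: str) -> str:
        return image.rsplit(":", 1)[-1]

    try:
        return Version(tag_of(crawler_image)) >= Version(tag_of(min_image))
    except InvalidVersion:
        return True  # "latest" etc. always passes the check
```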
Tessa Walsh
4583babecb
feat: Add slug to collections and use it in public collection URLs (#2301)
Resolves https://github.com/webrecorder/browsertrix/issues/2298

## Changes

- Slugs added to collections; they can be specified separately when creating
or updating collections, or else are derived from the supplied collection name
- Migration added to backfill slugs for existing collections
- Redirect collection to newest slug if changed
- Adds option to copy public profile link to "Public Collections" action
menu
- Show "Back to <Org>" link instead of breadcrumbs

---------
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-01-15 22:44:32 -08:00
sua yoo
4347fcdba5
feat: Show collection created date (#2302)
- Shows collection created date in detail view (if present)
- Adds `black` formatter to vscode extension recommendations

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-01-14 11:22:00 -05:00
Tessa Walsh
cbcf087a48
Add last crawl and subscription status indicators to org list (#2273)
Fixes #2260 

- Adds `lastCrawlFinished` to Organization model, updated after crawls
are added/deleted and with an idempotent migration to backfill existing
orgs
- Adds Last Crawl column to end of admin orgs list table
- Adds subscription icon next to existing status icon in orgs list
- Adds "lastCrawlFinished", "subscriptionStatus", and "subscriptionPlan"
sort options to orgs list backend endpoint in anticipation of future
sorting/filtering of orgs list

---------

Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-01-14 10:57:06 -05:00
Ilya Kreymer
12f358b826
Merge pull request #2271 from webrecorder/public-collections-feature
feat: Public collections, includes:
- feat: Public org profile page #2172
- feat: Collection thumbnails, start page, and public view updates #2209
- feat: Track collection events #2256
2025-01-13 19:32:45 -08:00
Ilya Kreymer
bab5345ad5 version: bump to 1.14.0-beta.0 for public collections! 2025-01-13 19:29:54 -08:00
Tessa Walsh
d8655d3bc6
Use id for thumbnail size error detail 2025-01-13 15:15:49 -08:00
Tessa Walsh
be9ff04ee8
Make more explicit error message for large thumbnails 2025-01-13 15:15:49 -08:00
Tessa Walsh
eb88e9f90c
Add missing os import 2025-01-13 15:15:48 -08:00
Tessa Walsh
a031fab313
Backend work for public collections (#2198)
Fixes #2182 

This rather large PR adds the rest of what should be needed for public
collections work in the frontend.

New API endpoints include:

- Public collections endpoints: GET, streaming download
- Paginated list of URLs in collection with snapshot (page) info for
each
- Collection endpoint to set home URL
- Collection endpoint to upload thumbnail as stream
- DELETE endpoint to remove collection thumbnail

Changes to existing API endpoints include:

- Paginating public collection list results
- Several `pages` endpoints that previously only supported `/crawls/` in
their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support
`/uploads/` and `/all-crawls/` namespaces as well. This is necessitated
by adding pages for uploads to the database (see below). For
`/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will
serve as a filter to only affect crawls of that given type. Other
endpoints are more liberal at this point, and will perform the same
action regardless of the namespace used in the route (we'll likely want
to change this in a follow-up to be more consistent).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job
rather than doing all of the computation in an asyncio task in the
backend container. The background job additionally updates collection
date ranges, page/size counts, and tags for each collection in the org
after pages have been (re)added.

Other big changes:

- New uploads will now have their pages read into the database!
Collection page counts now also include uploads
- A migration was added to start a background job for each org that will
add the pages for previously-uploaded WACZ files to the database and
update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we
can use for other user-uploaded image files moving forward, with
separate output models for authenticated and public endpoints
2025-01-13 15:15:48 -08:00
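A rough sketch of the `ImageFile` split described above (field names assumed):

```python
from pydantic import BaseModel


class BaseFile(BaseModel):
    # Assumed minimal fields; the real model carries storage metadata.
    filename: str
    size: int
    hash: str


class ImageFile(BaseFile):
    originalFilename: str
    mime: str

    def public_out(self) -> dict:
        # Public endpoints omit storage internals such as the hash.
        return {
            "originalFilename": self.originalFilename,
            "mime": self.mime,
            "size": self.size,
        }
```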