- Updated how collections gallery and presentation and sharing pages
- Collections gallery page content extracted from blog post, linked from blog post
- Each page has one video covering the gallery setting and individual collection presentation
- Cleaned up text on both to avoid duplicated content (thanks @DaleLore)
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: DaleLore <DaleLoreNY@gmail.com>
Resolves https://github.com/webrecorder/browsertrix/issues/2452
## Changes
- Displays page count and collection size in listing grid
- Displays month if collection period is in the same year
- Displays collection size in About > Details section
- Minor refactor: move byte formatting into `localize.ts` utility file,
move slash (`/`) separator into own utility file
- remove query for /collections endpoint just to get the org name
- add orgName to single /collection endpoint, where it is already
available on the backend
- don't wait for languages to be ready to render UI, as this can result
in empty page if backend can not be reached.
- catch if /api/settings returns an invalid response to show 'backend
initializing' message
- will support initContainers where backend may return 5xx error while
backend is initializing, via #2449
Note: this results in locale picker showing all available locales if
backend is not available, not just filtered ones, but I think that's a
reasonable trade-off.
### Changes
- Public collections can now be deeplinked
### Caveats
- When users click the _About this Collection_ tab and then return to
the _Browse Collection_ tab, the deeplink is gone until they visit
another page.
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if succeeded
mismatches parallelism
- QA sorting: default to 'crawl order' by default to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
Resolves https://github.com/webrecorder/browsertrix/issues/2359
## Changes
- Track when a workflow form section is opened
- Hide workflow form section navigation on small screens
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#2406
Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.
Also Optimizes MongoDB queries for better performance.
Migration Improvements:
- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats
Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
While ideally we don't need to use superadmin for many things, there are
still a lot of places where it's necessary, especially around customer
service. This makes it a little more visible when that's the case, just
as a reminder. I could see this coming in handy especially for newer
people who might not have the experience to know to look for the "admin"
and "running crawls" buttons.
<img width="1088" alt="Screenshot 2025-02-13 at 1 12 58 PM"
src="https://github.com/user-attachments/assets/70b975e1-af6b-4e8c-9e49-52c4c66e9721"
/>
Related to #2152
This PR adds backend support for custom link selectors via `selectLinks`
on the crawl workflow config. Tests have been updated as well.
It also adds `selectLinks` to the frontend in a minimal and for now
hardcoded way that we can use as a basis for proper frontend support
moving forward.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Adds a switch to switch between viewing public collections only
(default) and all collections on org dashboard.
Also updates the `house-fill` icon to `house` in a couple places
(@Shrinks99)
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- Merges existing collection content into one page
- Updates ArchiveWeb.page link
- Adds redirect from /collections → /collection
- Moves content relevant to presentation & sharing out of the intro
- Adds new content about sharing collections!
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
Fixes https://github.com/webrecorder/browsertrix/issues/2383
- Fixes unpredictable sort order when typing in collection page URL
- Fixes page URL results flickering in and out while typing
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Updates links to running crawls to redirect to workflow "Watch" tab
- Removes unused "Jump to crawl" superadmin widgets
- Refactors archived item component to remove references to active
crawls
- Moves page count out from under "Size" label in archived item detail
- Renames "Pages Crawled" to "Pages" in archived item leading heading
and detail overview
- Renames "Crawl ID" to "Archived Item ID"
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- add ensure_page_limit_quotas() which sets the config limit to the max
pages quota, if any
- set the page limit on the config when: creating new crawl, creating
configmap
- don't set the quota page limit on new or existing crawl workflows
(remove setting it on new workflows) to allow updated quotas to take
affect for next crawl
- frontend: correctly display page limit on workflow settings page from
org quotas, if any.
- operator: get org on each sync in one place
- fixes#2369
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
- Displays workflow form as collapsible sections
- Combines run now toggle into submit
- Fixes exclusion field errors not preventing form submission
- Refactors `<btrix-observable>` into new `Observable` controller
---------
Co-authored-by: emma <hi@emma.cafe>
- Refactors dashboard and org profile preview to use private API
endpoint, to fix public collections not showing when the org
visibility is hidden
- Adds additional sorting options for collections
- Adds unique page url counts for archived items, collections, and
organizations to backend and exposes this in collections
- Shows collection period (i.e. `dateEarliest` to `dateLatest`) in
collections list
- Shows same collection metadata in private and public views, updates
private view info bar
- Fixes "Update Org Profile" action item showing for crawler roles
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>