Commit Graph

228 Commits

Emma Segal-Grossman
8db0e44843
Feat: New email templating system & service (#2712)
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-08-01 17:00:24 -04:00
Tessa Walsh
d3e241ad03
Validate seed files on backend and add tests (#2781)
Fixes #2780 

This PR adds additional backend validation for seed file uploads so that
the upload fails if no valid seeds are found. It adds two new test cases
to ensure seed uploads fail for binary files and for text files that do
not contain any valid URLs.
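For illustration, a minimal sketch of the kind of check this adds (the helper name and exact rules are assumptions, not the actual backend code):

```python
from urllib.parse import urlparse

def contains_valid_seeds(data: bytes) -> bool:
    """Return True if an uploaded seed file contains at least one valid URL."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # binary files are rejected outright
        return False

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parsed = urlparse(line)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return True
    return False
```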
2025-07-31 23:20:58 -07:00
Tessa Walsh
993f82a49b
Add last crawl's stats object to CrawlConfigOut (#2714)
Fixes #2709 

Will allow us to display information about page counts (found, done) in
the workflow list.
2025-07-23 20:10:46 -07:00
Ilya Kreymer
89027ef16e
quickfix: delete seedfile after the workflow has been deleted (#2763)
Since seedfile deletion checks that the seedfile is not used in any
workflow, it should be deleted after the workflow is removed.
Noticed while checking #2744
2025-07-23 20:10:29 -07:00
Tessa Walsh
f7ba712646
Add seed file support to Browsertrix backend (#2710)
Fixes #2673 

Changes in this PR:

- Adds a new `file_uploads.py` module and corresponding `/files` API
prefix with methods/endpoints for uploading, GETing, and deleting seed
files (can be extended to other types of files moving forward)
- Seed files are supported via `CrawlConfig.config.seedFileId` on POST
and PATCH endpoints. This seedFileId is replaced by a presigned URL when
the config is passed to the crawler by the operator (see the usage sketch
after this list)
- Seed files are read when first uploaded to calculate `firstSeed` and
`seedCount` and store them in the database; these values are copied into
the workflow and crawl documents when they are created.
- Logic is added to store `firstSeed` and `seedCount` for other
workflows as well, and a migration is added to backfill the data, to
maintain consistency and fix some of the pymongo aggregations that
previously assumed all workflows would have at least one `Seed` object in
`CrawlConfig.seeds`
- Seed file and thumbnail storage stats are added to org stats
- Seed file and thumbnail uploads first check that the org's storage
quota has not been exceeded and return a 400 if so
- A cron background job (run weekly each Sunday at midnight by default,
but configurable) is added to look for seed files at least x minutes old
(1440 minutes, or 1 day, by default, but configurable) that are not in
use in any workflows, and to delete them when they are found. The
backend pods will ensure this k8s batch job exists when starting up and
create it if it does not already exist. A database entry for each run of
the job is created in the operator on job completion so that it'll
appear in the `/jobs` API endpoints, but retrying of this type of
regularly scheduled background job is not supported as we don't want to
accidentally create multiple competing scheduled jobs.
- Adds a `min_seed_file_crawler_image` value to the Helm chart that, if
set, is checked before creating a crawl from a workflow. If a workflow
cannot be run, the detail of the exception is returned in
`CrawlConfigAddedResponse.errorDetail` so that we can display the reason
in the frontend
- Adds a SeedFile model derived from the base UserFile model (formerly ImageFile), and ensures all APIs
returning uploaded files return an absolute pre-signed URL (either with the external origin or the internal service origin)
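To make the new flow concrete, here is a hedged sketch of uploading a seed file and referencing it from a workflow. The upload path, HTTP method, and response fields are assumptions based on the `/files` prefix and `config.seedFileId` field described above, not the exact API.

```python
import requests

API = "https://app.example.org/api"  # assumed base URL
HEADERS = {"Authorization": "Bearer <token>"}
ORG = "<org-id>"

# 1. Upload a seed file under the new /files prefix (exact path and method
#    are assumptions; the PR only specifies the /files prefix).
with open("seeds.txt", "rb") as fh:
    resp = requests.put(
        f"{API}/orgs/{ORG}/files/seedFile",
        headers=HEADERS,
        data=fh,
    )
resp.raise_for_status()
seed_file_id = resp.json()["id"]  # response field name assumed

# 2. Reference the uploaded file from a workflow via config.seedFileId.
#    The operator later swaps this id for a presigned URL before passing
#    the config to the crawler.
workflow = {
    "name": "Seed-file crawl",
    "config": {"seedFileId": seed_file_id},
}
resp = requests.post(
    f"{API}/orgs/{ORG}/crawlconfigs/", headers=HEADERS, json=workflow
)
resp.raise_for_status()
```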

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-22 19:11:02 -07:00
Ilya Kreymer
8107b054f6
Profiles: Make browser commit API call idempotent (#2728)
- Fix race condition related to browser commit time
- The profile commit request waits for the browser to actually finish and
the profile to be saved. This can cause the request to time out, resulting
in a retry in which the browser has already been closed.
- With these changes, the commit is now 'idempotent' and returns
`waiting_for_browser` until the profile is actually committed.
- On the frontend, keep pinging the commit endpoint with a timeout while
'waiting_for_browser' is returned; the profile is actually committed when
the endpoint returns the profile id (sketched below).
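A rough sketch of the client-side polling this enables; the endpoint URL and response field names here are assumptions:

```python
import time
import requests

def commit_profile(commit_url: str, headers: dict, timeout: float = 120.0) -> str:
    """Poll the profile commit endpoint until the profile is committed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.post(commit_url, headers=headers)
        resp.raise_for_status()
        body = resp.json()
        if body.get("detail") == "waiting_for_browser":
            time.sleep(2)  # safe to retry: the commit request is idempotent
            continue
        return body["id"]  # a profile id means the commit finished
    raise TimeoutError("profile commit did not complete in time")
```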

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2025-07-22 17:59:49 -07:00
Ilya Kreymer
3af94ca03d
Ensure replay.json returns correct origin for pagesQueryUrl (#2741)
- Use the Host + X-Forwarded-Proto headers from the API request
- Fixes #2740, better fix for #2720 avoiding need for separate alias
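Roughly, the origin for pagesQueryUrl is now derived from the forwarded request headers instead of a separate alias; a minimal sketch (not the actual backend code):

```python
from fastapi import Request

def request_origin(request: Request) -> str:
    """Reconstruct the external origin from the incoming API request.

    Prefer X-Forwarded-Proto set by the ingress, fall back to the request's
    own scheme, and use the Host header for the authority.
    """
    proto = request.headers.get("X-Forwarded-Proto", request.url.scheme)
    host = request.headers.get("Host", request.url.netloc)
    return f"{proto}://{host}"
```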
2025-07-16 10:48:24 -07:00
Emma Segal-Grossman
945c458011
Use curly quote for default archive name instead of straight quote (#2700)
Tiny little thing that's been bugging me for a little while now.
2025-07-16 11:41:19 -04:00
Tessa Walsh
d91a3bc088
Run webhook tests nightly (#2738)
Fixes #2737 

- Moves webhook-related tests to run nightly, to speed up CI runs and
avoid the periodic failures we've been getting lately.
- Also ensures all try/except blocks that have time.sleep in the 'try' also have a time.sleep in 'except'
to avoid fast-looping retries
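The retry pattern looks roughly like this (a sketch of the test-side pattern, not the actual test code):

```python
import time

def wait_for(check, attempts: int = 30) -> None:
    """Poll a condition without fast-looping when the check raises."""
    for _ in range(attempts):
        try:
            if check():
                return
            time.sleep(2)
        except Exception:
            time.sleep(2)  # matching sleep avoids hammering the API on errors
    raise AssertionError("condition was not met in time")
```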

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-15 18:05:57 -07:00
Emma Segal-Grossman
f91bfda42e
Allow searching by multiple tags & profiles with "and"/"or" options for tags (#2717)
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-11 22:35:52 -04:00
Emma Segal-Grossman
74c72ce551
Include tag counts in tag filter & tag input autocomplete (#2711) 2025-07-08 15:20:41 -04:00
Tessa Walsh
5b4fee73e6
Remove workflows from GET profile endpoint + add inUse flag instead (#2703)
Connected to #2661 

- Removes crawl workflows from being returned as part of the profile
response.
- Frontend: removes display of workflows in profile details.
- Adds 'inUse' flag to all profile responses to indicate the profile is in
use by at least one workflow
- Adds 'profileid' as a possible filter for workflow search in
preparation for filtering by profile id (#2708)
- Makes 'profile_in_use' a proper error (returning 400) on profile
delete.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-02 16:44:12 -07:00
Ilya Kreymer
d4a2a66d6d
additional scale / browser window cleanup to properly support QA: (#2663)
- follow up to #2627
- use qa_num_browser_windows to set the exact number of QA browsers,
falling back to qa_scale
- set num_browser_windows and num_browsers_per_pod using crawler / qa
values depending on whether it is a QA crawl
- scale_from_browser_windows() accepts an optional browsers_per_pod when
dealing with a possible QA override (sketched below)
- store 'desiredScale' in CrawlStatus to avoid recomputing it when later
resolving scale
- ensure status.scale is always the actual scale observed
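A sketch of what `scale_from_browser_windows()` computes, assuming it is a ceiling division of windows by browsers per pod (the default shown is illustrative):

```python
import math

def scale_from_browser_windows(
    browser_windows: int, browsers_per_pod: int | None = None
) -> int:
    """Convert a number of browser windows into a crawler pod scale."""
    per_pod = browsers_per_pod or 2  # e.g. crawler_browser_instances
    return math.ceil(browser_windows / per_pod)

# e.g. 6 windows at 2 browsers per pod -> 3 crawler pods
assert scale_from_browser_windows(6) == 3
```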
2025-06-12 13:09:04 -04:00
Ilya Kreymer
8ea16393c5
Optimize single-page crawl workflows (#2656)
For single page crawls:
- Always force 1 browser to be used, ignoring the browser windows/scale
setting
- Don't use custom PVC volumes in crawler / redis, just use emptyDir -
there is no chance of the crawler being interrupted and restarted on a
different machine for a single page.

Adds an `is_single_page` check to CrawlConfig, checking either the page
limit or the scopeType with no extra hops (see the sketch below).

Fixes #2655
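A minimal sketch of such a check; field names follow the crawl config model, but the exact conditions in the real code may differ:

```python
def is_single_page(config) -> bool:
    """True if the workflow can only ever crawl one page (sketch)."""
    limit = getattr(config, "limit", 0) or 0
    if limit == 1:
        return True
    return config.scopeType == "page" and not (config.extraHops or 0)
```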
2025-06-10 12:13:57 -07:00
Tessa Walsh
dc41468daf
Allow users to run crawls with 1 or 2 browser windows (#2627)
Fixes #2425 

## Changed

- Switch backend to primarily using number of browser windows rather
than scale multiplier (including migration to calculate `browserWindows`
from `scale` for existing workflows and crawls)
- Still support `scale` in addition to `browserWindows` in input models
for creating and updating workflows and re-adjusting live crawl scale
for backwards compatibility
- Adds new `max_browser_windows` value to Helm chart, but calculates the
value from `max_crawl_scale` as fallback for users with that value
already set in local charts
- Rework frontend to allow users to select multiples of
`crawler_browser_instances` or any value below
`crawler_browser_instances` for browser windows. For instance, with
`crawler_browser_instances=4` and `max_browser_windows=8`, the user
would be presented with the following options: 1, 2, 3, 4, 8
- Sets maximum width of screencast to image width returned by `message`
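A sketch of how the presented options can be derived from the two settings (illustrative, not the actual frontend code):

```python
def window_options(crawler_browser_instances: int, max_browser_windows: int) -> list[int]:
    """Browser-window choices: every value up to crawler_browser_instances,
    then multiples of it up to max_browser_windows."""
    options = list(range(1, crawler_browser_instances + 1))
    value = crawler_browser_instances * 2
    while value <= max_browser_windows:
        options.append(value)
        value += crawler_browser_instances
    return options

# crawler_browser_instances=4, max_browser_windows=8 -> [1, 2, 3, 4, 8]
assert window_options(4, 8) == [1, 2, 3, 4, 8]
```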

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-06-03 13:37:30 -07:00
Ilya Kreymer
cb50c7c2c2
Pause / Resume Crawls Initial Implementation. (#2572)
- add 'pause' crawl state (fixes #2567)
- gracefully shut down crawler pods, and then the redis pod, when paused
- crawler uploads WACZ before shutting down (dependent on
webrecorder/browsertrix-crawler#824, supported in 1.6.1+)
- add 'paused_at' on the crawl spec to indicate when the crawl was paused
- support a max pause time limit, after which the crawl is automatically
stopped
- add 'stopped_pause_expired' state when the pause automatically expires
and the crawl is stopped
- /crawl/<id>/{pause,resume} APIs to toggle 'paused' on the crawl spec
(example below)
- ui: add pause/resume button, paused state (partially addresses #2568)
- ui: add pausing/resuming derivative states when crawl is running and
pausing, or paused and not pausing (partially addresses #2569)
- Designed to work with crawler 1.6.1+, which supports pausing + uploading
on pause

Work on #2566, Fixes #2576 
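For reference, a hedged example of driving the new endpoints from a client; the `/orgs/<org-id>` scoping and exact paths are assumptions based on the `{pause,resume}` routes named above:

```python
import requests

API = "https://app.example.org/api"  # assumed base URL
HEADERS = {"Authorization": "Bearer <token>"}
ORG, CRAWL = "<org-id>", "<crawl-id>"

# Pause a running crawl; the operator shuts down crawler pods gracefully
# and the crawl moves to the 'paused' state once the WACZ is uploaded.
requests.post(f"{API}/orgs/{ORG}/crawls/{CRAWL}/pause", headers=HEADERS).raise_for_status()

# Later, resume it; pods come back up and the crawl leaves 'paused'.
requests.post(f"{API}/orgs/{ORG}/crawls/{CRAWL}/resume", headers=HEADERS).raise_for_status()
```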

---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
2025-05-21 14:05:16 -07:00
Ilya Kreymer
8a713155ef
remove deleted collections from crawlconfigs (#2615)
Simplified version of #2608: adds a remove_collection_from_all_configs() in CrawlConfigs, and also checks the org.
Updates tests to ensure removal.
2025-05-20 18:38:40 -07:00
Ilya Kreymer
86e35e358d
Add Org Check for Collection access (#2616)
Ensure collection access checks org membership
2025-05-20 15:30:22 -07:00
Tessa Walsh
1492397656
Add ISO-639-1 language code validation to backend (#2602)
- Add backend validation for language codes
- Add migration to look for invalid ISO-639-1 language codes in
workflows, crawls, and org crawling defaults, and fix any found
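A minimal sketch of this kind of validation; the code set shown is a small illustrative subset, not the full ISO-639-1 list used by the backend:

```python
import re

ISO_639_1_CODES = {"en", "fr", "de", "es", "pt", "ja", "zh"}  # subset for illustration

def validate_language_code(code: str) -> str:
    """Raise ValueError if code is not a valid two-letter ISO-639-1 code."""
    if not re.fullmatch(r"[a-z]{2}", code or ""):
        raise ValueError(f"invalid language code: {code!r}")
    if code not in ISO_639_1_CODES:
        raise ValueError(f"unknown ISO-639-1 code: {code!r}")
    return code
```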
2025-05-13 16:54:33 -04:00
Tessa Walsh
6f81d588a9
Ensure crawl page counts are correct when re-adding pages (#2601)
Fixes #2600 

This PR fixes the issue by ensuring that crawl page counts (total,
unique, files, errors) are reset to 0 when crawl pages are deleted, such
as right before being re-added.

It also adds a migration that recalculates file and error page counts
for each crawl without re-adding pages from the WACZ files.
2025-05-13 14:05:41 -04:00
Ilya Kreymer
1570011ec7
compute top page origins for each collection (#2483)
A quick PR to fix #2482:
- compute topPageHosts as part of the existing collection stats computation
- store the top 10 results in the collection for now.
- display in collection About sidebar
- fixes #2482 

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-05-08 14:22:40 -07:00
Tessa Walsh
3e169ebc15
Add API endpoint to check if subscription is activated (#2582)
Subscription Management: this check is used to ensure a subscription can be
auto-canceled if it was never activated.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-05-06 17:36:58 -07:00
sua yoo
1fa43335c0
feat: Apply saved workflow settings to current crawl (#2514)
Resolves https://github.com/webrecorder/browsertrix/issues/2366

## Changes

Allows users to update current crawl with newly saved workflow settings.

## Manual testing

1. Log in as crawler
2. Start a crawl
3. Go to edit workflow. Verify "Update Crawl" button is shown
4. Click "Update Crawl". Verify crawl is updated with new settings

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-04-29 11:43:14 -07:00
Tessa Walsh
55bedcb0b7
feat: Custom autoclick selector (#2517)
Resolves #2504

## Changes

- Allows users to customize autoclick selector in workflows
- Refactors `btrix-syntax-input` to support rendering label and help
text like `sl-input`
- Show autoclick selector in workflow / crawl settings
- Adds 'clickSelector' with default of 'a' to backend crawl config.

---------

Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
2025-04-08 05:53:40 +02:00
Tessa Walsh
a51f7c635e
Add behavior logs from Redis to database and add endpoint to serve (#2526)
Backend work for #2524

This PR adds a second dedicated endpoint similar to `/errors`, as a
combined log endpoint would give a false impression of being the
complete crawl logs (which is far from what we're serving in Browsertrix
at this point).

Eventually when we have support for streaming live crawl logs in
`crawls/<id>/logs` I'd ideally like to deprecate these two dedicated
endpoints in favor of using that, but for now this seems like the best
solution.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-08 02:16:10 +02:00
Tessa Walsh
f84f6f55e0
Add basic backend validation for selectLinks (#2510)
Follow-up to #2152 

Related to https://github.com/webrecorder/browsertrix/pull/2487

This PR provides very basic validation of the `config.selectLinks`
argument on workflow creation and update. Namely, it checks that:
- `config.selectLinks` is not an empty array
- Each entry consists of two non-empty text sequences separated by `->`

At this point we're not validating the actual CSS selector on the
backend, though we could add that down the road.

Tests have been added accordingly.
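A sketch of the described checks; error types and messages are assumptions, and the example entry follows the `selector->attribute` form:

```python
def validate_select_links(select_links: list[str]) -> None:
    """Basic validation of config.selectLinks, e.g. ["a[href]->href"]."""
    if not select_links:
        raise ValueError("selectLinks cannot be an empty list")
    for entry in select_links:
        parts = entry.split("->")
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(f"invalid selectLinks entry: {entry!r}")
```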

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-07 21:36:05 +02:00
Tessa Walsh
cd7b695520
Add backend support for custom behaviors + validation endpoint (#2505)
Backend support for #2151 

Adds support for specifying custom behaviors via a list of strings.

When workflows are added or modified, minimal backend validation is done
to ensure that all custom behavior URLs are valid URLs (after removing
the git prefix and custom query arguments).

A separate `POST /crawlconfigs/validate/custom-behavior` endpoint is
also added, which can be used to validate a custom behavior URL. It
performs the same syntax check as above and then:
- For URL directly to behavior file, ensures URL resolves and returns a
2xx/3xx status code
- For Git repositories, uses `git ls-remote` to ensure they exist (and
that branch exists if specified)
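A sketch of the Git repository check built on `git ls-remote`; the helper name and the handling of the git prefix and query arguments are assumptions:

```python
import subprocess
from urllib.parse import urlparse

def check_git_behavior_repo(repo_url: str, branch: str | None = None) -> bool:
    """Check that a custom-behavior Git repository (and branch) exists."""
    parsed = urlparse(repo_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False

    cmd = ["git", "ls-remote", repo_url]
    if branch:
        cmd.append(branch)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return False
    # when a branch is given, ls-remote must actually list a matching ref
    return bool(result.stdout.strip()) if branch else True
```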

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-04-02 16:20:51 -07:00
Ilya Kreymer
21a372057b
Fix user emails use userout (#2511)
Follow-up to #2495, actually ensure org subscription data is included
in the admin email response

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-24 12:04:39 -07:00
Ilya Kreymer
4c0ddd0fe3
crawl replay: remove isSeed=true from initialPages query (#2509)
- matches initial query for collections
- fixes 'Show Non-Seed Pages' not appearing for crawl replay
2025-03-20 15:03:41 -07:00
Ilya Kreymer
6be1f6674c
fixes token lifetime bug / improve security (#2490)
- fix jwt_token_lifetime being in hours, not minutes, remove extra * 60
- don't return userids in user list for org admins, instead just key
users by email, which is already unique
2025-03-19 10:07:09 -07:00
Ilya Kreymer
afa892000b
replay api: add downloadUrl to replay endpoints to be used by RWP (#2456)
RWP (2.3.3+) can determine whether the 'Download Archive' menu item should be
shown based on the value of downloadUrl.
If set to 'null', RWP will hide the menu item:
- set downloadUrl to the public collection download for public collections
replay
- set downloadUrl to null for private collection and crawl replay to
hide the download menu item in RWP (otherwise we would have to add the
auth_header query with a bearer token and should assess security before
doing that)

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 14:11:28 -08:00
Ilya Kreymer
702c9ab3b7
Better caching of presigned URLs + support for thumbnails (#2446)
Overhauls URL presigning by:
- cache the presigned URLs in a flat, separate MongoDB collection which
has an expiring index (see the sketch below)
- update presigned URLs automatically if not found / expired in the index
- remove logic for storing presignedUrl in files
- support caching presigned URLs for thumbnails
- add endpoints to clear presigned URLs for an org or for all files in all
orgs (superadmin only)
- supersedes #2438, fix for #2437
- removes previous presignedUrl and expireAt data from crawls and QA
runs
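A minimal sketch of the expiring-index approach with pymongo; the collection and field names are assumptions:

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
presigned = client["browsertrix"]["presigned_urls"]  # assumed collection name

# A TTL index: MongoDB removes documents automatically once their expireAt
# time passes, so the cache invalidates itself.
presigned.create_index("expireAt", expireAfterSeconds=0)

def cache_presigned_url(file_id: str, url: str, duration: timedelta) -> None:
    presigned.replace_one(
        {"_id": file_id},
        {"_id": file_id, "url": url, "expireAt": datetime.now(timezone.utc) + duration},
        upsert=True,
    )

def get_presigned_url(file_id: str) -> str | None:
    doc = presigned.find_one({"_id": file_id})
    return doc["url"] if doc else None  # None means re-sign and re-cache
```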

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-03-03 12:05:23 -08:00
Ilya Kreymer
2263745df3
Fix replay.json 400 response for empty collection (#2445)
- fix #2443 
- don't throw error in list_pages() if no crawls provided, just return
empty list
- ensure an empty collection returns 200 on replay.json, add tests
2025-03-03 09:38:19 -08:00
Tessa Walsh
45aa0a32b6
Calculate total for crawl QA page endpoint (#2435)
Fixes #2434 

Patch fix for a regression in Browsertrix 1.4.0-1.4.1 where total was
not being calculated for QA page list endpoint but still being included
in response, which led to total always being 0 and pages not loading in
the frontend review screen as a result.
2025-02-27 11:46:35 -08:00
Ilya Kreymer
8a507f0473
Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417)
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if the succeeded
count does not match parallelism
- QA sorting: default to 'crawl order' by default to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
2025-02-21 13:47:20 -08:00
Tessa Walsh
f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
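For example, the `$gte` prefix match can be sketched like this; collection and field names are illustrative:

```python
from pymongo import MongoClient

pages = MongoClient("mongodb://localhost:27017")["browsertrix"]["pages"]  # illustrative

# Seeking to the prefix with $gte on an indexed, sorted "url" field avoids the
# collection scan an anchored $regex can cause; iteration stops once results
# no longer share the prefix.
prefix = "https://example.com/"
matches: list[str] = []
for doc in pages.find({"url": {"$gte": prefix}}).sort("url", 1):
    if not doc["url"].startswith(prefix):
        break
    matches.append(doc["url"])
    if len(matches) >= 10:
        break
```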


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
Ilya Kreymer
e112f96614
Upload Fixes: (#2397)
- ensure upload pages are always added with a new uuid, to avoid any
duplicates with existing uploads, even if the uploaded WACZ is actually a
crawl from a different Browsertrix instance, etc.
- clean up upload names with slugify, which also replaces spaces, fixing
uploads of WACZ filenames with spaces in them
- part of fix for #2396

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-17 13:05:33 -08:00
Tessa Walsh
39d99e7c5d
Add support for custom link selectors to backend (#2346)
Related to #2152 

This PR adds backend support for custom link selectors via `selectLinks`
on the crawl workflow config. Tests have been updated as well.

It also adds `selectLinks` to the frontend in a minimal and for now
hardcoded way that we can use as a basis for proper frontend support
moving forward.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-13 22:22:27 -08:00
Tessa Walsh
7f1af9bb31
Mark all pages from pages.jsonl as seeds (#2390)
Fixes #2389 

All pages from `pages/pages.jsonl` files now have `isSeed: True` in the
database, in addition to any pages that explicitly have `seed` set to
true in the actual JSONL.

Tests have been added to ensure that all pages from our fixture uploads
have `isSeed: True`.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-13 16:54:30 -08:00
Ilya Kreymer
7b2932c582
Add initial pages + pagesQuery endpoint to /replay.json APIs (#2380)
Fixes #2360 

- Adds `initialPages` to /replay.json response for collections, returning
up to 25 pages (seed pages first, then sorted by capture time).
- Adds `pagesQueryUrl` to /replay.json
- Adds a public pages search endpoint to support public collections.
- Adds `preloadResources`, including list of WACZ files that should
always be loaded, to /replay.json

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-13 16:53:47 -08:00
Emma Segal-Grossman
f8a44258d8
Merge pull request #2332 from webrecorder/frontend-collection-editing-dialog
Collection editing and sharing revamp
2025-02-11 18:27:35 -05:00
Tessa Walsh
98a45b0d85
Add collection page list/search endpoint (#2354)
Fixes #2353

Adds a new endpoint to list pages in a collection, with filtering
available on `url` (exact match), `ts`, `urlPrefix`, `isSeed`, and
`depth`, as well as accompanying tests. Additional sort options have
been added as well.

These same filters and sort options have also been added to the crawl
pages endpoint.

Also fixes an issue where `isSeed` wasn't being set in the database when
false but only added on serialization, which was preventing filtering
from working as expected.
2025-02-10 16:44:37 -08:00
Tessa Walsh
0e9e70f3a3
Add WACZ filename, depth, favIconUrl, isSeed to pages (#2352)
Adds `filename` to pages, pointing to the WACZ file those pages come
from, as well as depth, favIconUrl, and isSeed. Also adds an idempotent
migration to backfill this information for existing pages, and increases
the backend container's startupProbe time to 24 hours to give it sufficient
time to finish the migration.
---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-05 15:50:04 -05:00
Tessa Walsh
0a8df62ab4
Ensure collection stats are updated when WACZ is added on upload (#2351)
Fixes #2350 

Collection earliest/latest dates and the collection modified date are
also now updated when crawls or uploads are added to a collection via
the collection auto-add feature.
2025-01-30 13:05:56 -08:00
Tessa Walsh
9363095d62
Validate exclusion regexes on backend (#2316) 2025-01-23 13:32:54 -05:00
Tessa Walsh
763c654484
feat: Update collection sorting, metadata, stats (#2327)
- Refactors dashboard and org profile preview to use private API
endpoint, to fix public collections not showing when the org
visibility is hidden
- Adds additional sorting options for collections
- Adds unique page url counts for archived items, collections, and
organizations to backend and exposes this in collections
- Shows collection period (i.e. `dateEarliest` to `dateLatest`) in
collections list
- Shows same collection metadata in private and public views, updates
private view info bar
- Fixes "Update Org Profile" action item showing for crawler roles

---------

Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-01-23 13:32:23 -05:00
Tessa Walsh
6797b41de0
Add pageCount to crawls and uploads and use in frontend for page counts (#2315)
Fixes #2257 

This is a follow-up to the public collections work, which adds pages to
the database for uploads. All crawls and uploads now have a `pageCount`
field which is populated when the item is successfully added. A new
migration is also added to populate the field for existing archived
items that don't have it set yet.

OrgMetrics have also been modified to include `crawlPageCount` and
`uploadPageCount`, and to include the total of both in `pageCount`, and
all three included in the frontend org dashboard.

The frontend has been updated to use `pageCount` rather than
`stats.done` wherever appropriate, meaning that in archived item lists
and details we now have a consistent page count for both crawls and
uploads.

### New functionality

- Deploy this branch
- Create new crawls and uploads and verify that page count appears
correctly throughout the frontend for all new crawls and uploads

### Migration

- Deploy from latest main
- Create some crawls and uploads
- Change to this branch and re-deploy
- Verify migration ran without errors in backend logs
- Verify that page count has been populated successfully by checking
archived items lists, crawl and upload detail pages, and dashboard to
ensure there are no longer any missing page counts.

---------

Co-authored-by: emma <hi@emma.cafe>
2025-01-16 14:41:14 -08:00
Tessa Walsh
4583babecb
feat: Add slug to collections and use it in public collection URLs (#2301)
Resolves https://github.com/webrecorder/browsertrix/issues/2298

## Changes

- Slugs added to collections; they can be specified separately when creating
or updating collections, or else are based on the supplied collection name
- Redirect collection to newest slug if changed
- Adds option to copy public profile link to "Public Collections" action
menu
- Show "Back to <Org>" link instead of breadcrumbs

---------
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-01-15 22:44:32 -08:00
sua yoo
4347fcdba5
feat: Show collection created date (#2302)
- Shows collection created date in detail view (if present)
- Adds `black` formatter to vscode extension recommendations

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-01-14 11:22:00 -05:00
Tessa Walsh
cbcf087a48
Add last crawl and subscription status indicators to org list (#2273)
Fixes #2260 

- Adds `lastCrawlFinished` to Organization model, updated after crawls
are added/deleted and with an idempotent migration to backfill existing
orgs
- Adds Last Crawl column to end of admin orgs list table
- Adds subscription icon next to existing status icon in orgs list
- Adds "lastCrawlFinished", "subscriptionStatus", and "subscriptionPlan"
sort options to orgs list backend endpoint in anticipation of future
sorting/filtering of orgs list

---------

Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-01-14 10:57:06 -05:00