Commit Graph

79 Commits

Author SHA1 Message Date
Emma Segal-Grossman
b0f2d87ce2
hotfix: workflow list - rewrite arrays in url search params to remove items (#2734)
## Changes

- Deletes and rewrites arrays in URL search params in workflow list when
editing array filters (i.e. tags & profiles)
- Removes a missed `console.log`
- bump to 1.17.3
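
A minimal sketch of the delete-and-rewrite approach described above, written in Python for illustration only (the actual fix lives in the frontend). The helper name `rewrite_array_param` and the `tags` key are assumptions, not names taken from the codebase:

```python
from urllib.parse import parse_qsl, urlencode


def rewrite_array_param(query: str, key: str, values: list[str]) -> str:
    """Drop every existing `key` entry, then append the new values in order.

    Hypothetical helper: removing all entries first means an item taken out
    of the filter cannot linger in the URL.
    """
    # keep_blank_values so unrelated empty params survive the round trip
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True) if k != key]
    kept.extend((key, v) for v in values)
    return urlencode(kept)


if __name__ == "__main__":
    # Remove "b" from the tags filter while leaving other params alone
    print(rewrite_array_param("tags=a&tags=b&page=2", "tags", ["a"]))
```

Rewriting the whole array, rather than patching single entries, keeps the URL in step with the filter state even when the last item is removed.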

cc @SuaYoo

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-14 14:30:18 -07:00
Ilya Kreymer
b915e734d1 version: bump to 1.17.2 2025-06-30 14:20:43 -07:00
Tessa Walsh
db4621602e
Bump version to 1.17.1 (#2678) 2025-06-18 13:09:49 -04:00
Ilya Kreymer
dde23426b2 version: bump to 1.17.0! 2025-06-12 17:37:07 -04:00
Ilya Kreymer
0e06ccd746 version: bump to 1.17.0-beta.0 2025-06-02 14:46:32 -07:00
Ilya Kreymer
e995811dd4 version: bump to 1.16.2 2025-05-20 18:43:22 -07:00
Tessa Walsh
c73512dbd4
Bump version to 1.16.1 (#2606) 2025-05-13 17:29:49 -04:00
Ilya Kreymer
652e8a6085 version: bump to 1.16.0 2025-05-08 14:30:00 -07:00
Ilya Kreymer
0cb3bd19f6 version: update to 1.15.0 2025-04-09 12:28:01 +02:00
Ilya Kreymer
b5b4c4da15 version: update to 1.14.8 2025-03-31 14:17:53 -07:00
Ilya Kreymer
b3950dd03f version: update to 1.14.7 2025-03-25 17:25:24 -07:00
Ilya Kreymer
46be6a0cf6 version: bump to 1.14.6 2025-03-20 16:52:20 -07:00
Ilya Kreymer
eb300815a7
Fixes #2488 (#2493)
- Fixes #2488 
- Adds a k8s api call to set `suspend=false` on Job when associated
CrawlJob is finished.
- bump version - released as 1.14.5
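
As a rough illustration of the change above, the patch body for such a call might look like the following. This is a sketch only: the function name and the `crawl_finished` flag are hypothetical, and the actual operator code may build and send the patch differently. The only fact relied on is that a Kubernetes `batch/v1` Job has a `spec.suspend` field.

```python
import json
from typing import Optional


def unsuspend_patch(crawl_finished: bool) -> Optional[str]:
    """Return a JSON merge patch body that sets suspend=false, or None.

    `crawl_finished` is an assumed flag standing in for "the associated
    CrawlJob is finished".
    """
    if not crawl_finished:
        return None
    # The merge patch touches only spec.suspend and leaves the rest of the Job as is
    return json.dumps({"spec": {"suspend": False}})


if __name__ == "__main__":
    print(unsuspend_patch(True))
```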
2025-03-19 10:06:25 -07:00
Ilya Kreymer
d8365c734f version: bump to 1.14.4 2025-03-08 15:58:18 -08:00
Ilya Kreymer
9466e83d18 version: bump to 1.14.3 2025-03-03 15:20:40 -08:00
Ilya Kreymer
cb52da66dc version: bump to 1.14.2 2025-02-27 14:13:03 -08:00
Ilya Kreymer
376c9981dc version: bump to 1.14.1 2025-02-26 23:15:01 -08:00
Ilya Kreymer
e67708bd4f version: update to 1.14.0 2025-02-24 14:49:46 -08:00
Ilya Kreymer
8a507f0473
Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417)
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if the succeeded
count does not match parallelism
- QA sorting: use 'crawl order' by default to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
2025-02-21 13:47:20 -08:00
Ilya Kreymer
3ca68bf1d2 version: 1.14.0-beta.6 2025-02-20 15:37:33 -08:00
Tessa Walsh
f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
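
The `$gte`-for-prefix idea in the list above can be sketched as a range filter. This is a minimal illustration, not the project's query code: the function name, the `url` field name, and the choice of upper bound are all assumptions.

```python
def url_prefix_filter(prefix: str) -> dict:
    """Build a MongoDB-style filter for URLs starting with `prefix`.

    Uses a range comparison instead of $regex. Sketch only: a last
    character at the maximum code point is not handled.
    """
    if not prefix:
        # An empty prefix matches everything, so no filter is needed
        return {}
    # Upper bound: the prefix with its last character bumped by one code point
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return {"url": {"$gte": prefix, "$lt": upper}}


if __name__ == "__main__":
    print(url_prefix_filter("https://example.com/a"))
```

A filter of this shape is a plain range comparison, which an ordinary index on the field can serve and which fits with sorting by that same index.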


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
Ilya Kreymer
a7c8ca4028 version: bump to 1.14.0-beta.1 2025-02-17 16:48:27 -08:00
Ilya Kreymer
bab5345ad5 version: bump to 1.14.0-beta.0 for public collections! 2025-01-13 19:29:54 -08:00
Ilya Kreymer
a21b2ff0df version: bump to 1.13.2 2025-01-08 22:58:33 -08:00
Ilya Kreymer
60d07762be version: bump to 1.13.1 2024-12-19 12:01:47 -08:00
Ilya Kreymer
cf60c43df2
version: bump to 1.13.0! (#2242) 2024-12-13 20:32:38 -08:00
Ilya Kreymer
84a74c43a4 version: bump to 1.13.0-beta.0 2024-10-10 11:38:13 -07:00
Ilya Kreymer
8192e5bed6 version: bump to 1.12.0 2024-10-03 16:45:54 -07:00
Vinzenz Sinapius
bb6e703f6a
Configure browsertrix proxies (#1847)
Resolves #1354

Supports crawling through pre-configured proxy servers, allowing users to select which proxy servers to use (requires browsertrix crawler 1.3+)

Config:
- proxies defined in btrix-proxies subchart
- can be configured via btrix-proxies key or separate proxies.yaml file via separate subchart
- if the subchart is deployed, proxies list is refreshed automatically when crawler_proxies.json changes
- support for ssh and socks5 proxies
- proxy keys added to secrets in subchart
- support for default proxy to be always used if no other proxy configured, prevent starting cluster if default proxy not available
- prevent starting manual crawl if previously configured proxy is no longer available, return error
- force 'btrix' username and group name on browsertrix-crawler non-root user to support ssh

Operator:
- support crawling through proxies, pass proxyId in CrawlJob
- support running profile browsers with a designated proxy, pass proxyId to ProfileJob
- prevent starting scheduled crawl if previously configured proxy is no longer available

API / Access:
- /api/orgs/all/crawlconfigs/crawler-proxies - get all proxies (superadmin only)
- /api/orgs/{oid}/crawlconfigs/crawler-proxies - get proxies available to particular org
- /api/orgs/{oid}/proxies - update allowed proxies for particular org (superadmin only)
- superadmin can configure which orgs can use which proxies, stored on the org
- superadmin can also allow an org to access all 'shared' proxies, to avoid having to allow a shared proxy on each org.

UI:
- Superadmin has 'Edit Proxies' dialog to configure, for each org, its dedicated proxies and whether it has access to shared proxies.
- User can select a proxy in Crawl Workflow browser settings
- Users can choose to launch a browser profile with a particular proxy
- Display which proxy is used to create profile in profile selector
- Users can choose which default proxy to use for new workflows in Crawling Defaults

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-02 18:35:45 -07:00
Ilya Kreymer
c242bb96d2 version: bump to 1.12.0-beta.0 2024-09-12 14:30:15 -07:00
Ilya Kreymer
b3c1195878 version: bump to 1.11.6 2024-09-05 17:31:10 -07:00
Ilya Kreymer
ea252e8da9 version: bump to 1.11.5 2024-08-27 10:00:53 -07:00
Ilya Kreymer
135c97419d version: update to 1.11.4 2024-08-26 12:31:56 -07:00
Ilya Kreymer
8ff1ad39a7 version: bump to 1.11.3 2024-08-08 15:16:18 -07:00
Ilya Kreymer
ed9038fbdb version: bump to 1.11.2 2024-08-07 12:37:26 -07:00
Ilya Kreymer
0c29008b7d version: bump to 1.11.1 2024-07-30 11:23:41 -07:00
Ilya Kreymer
4aca107710 version: bump to 1.11.0 2024-07-29 12:52:39 -07:00
Ilya Kreymer
27059c91a5 version: bump to 1.11.0-beta.1 2024-07-17 10:06:49 -07:00
Ilya Kreymer
e3ee63f9b0 version: bump to 1.11.0-beta.0 2024-06-04 13:37:44 -07:00
Ilya Kreymer
4b6dd97c11 version: bump to 1.10.1 2024-05-23 22:24:58 -07:00
Ilya Kreymer
e853b62401 version: update to 1.10.0! 2024-05-20 19:30:22 -07:00
Ilya Kreymer
94d57b98ce version bump to 1.10.0-beta.7 2024-05-15 11:30:05 -07:00
Ilya Kreymer
e022994f4e version: update to 1.10.0-beta.6 2024-04-30 20:34:11 +02:00
Ilya Kreymer
a3911f6a8a version: bump to 1.10.0-beta.5 2024-04-25 09:00:54 +02:00
Ilya Kreymer
a09f565ce5 version: bump to 1.10.0-beta.4 2024-04-24 16:53:39 +02:00
Ilya Kreymer
f89027ac89 version: 1.10.0-beta.3 2024-04-24 15:45:17 +02:00
Ilya Kreymer
41655ef829 version: bump to 1.10.0-beta.2 2024-04-23 23:19:16 +02:00
Ilya Kreymer
b574f00d2b
Add Repository Index + Chart Rename + Docs Rename (#1708)
Repository Index: Generate an index.yaml in ./docs/helm-repo/index.yaml
to allow for browsertrix to be a helm repository.
docs: rename docs.browsertrix.cloud -> docs.browsertrix.com
docs: update deployment doc to mention helm repo as preferred way to
install
docs build action: generate repository index in GH action
publish action: update auto-generated message to mention installing from
the repo.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-21 09:42:25 -07:00
Ilya Kreymer
a7cda3b11b version: bump to 1.10.0-beta.1 2024-04-05 18:24:14 -07:00
Ilya Kreymer
412eb2ef32
MetaController update (#1630)
Bump metacontroller to latest (4.11)
2024-03-27 08:49:56 -07:00