Commit Graph

289 Commits

Author SHA1 Message Date
Ilya Kreymer
dc8d510b11
webhook tweak: pass oid to crawl finished and upload finished webhooks (#1287)
Optimizes webhooks by passing oid directly to webhooks:
- avoids extra crawl lookup
- possible for crawl to be deleted before webhook is processed via
operator (resulting in crawl lookup to fail)
- add more typing to operator and webhooks
2023-10-16 10:51:36 -07:00
Ilya Kreymer
a295f5d05d version: bump to 1.7.0-beta.3 2023-10-15 18:31:03 -07:00
Tessa Walsh
2383b0d616
Set log download attachment name to crawl_id.log (#1280)
Fixes #1271
Using .log for now due to broader support for opening with default viewers

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-10-13 20:00:37 -07:00
Tessa Walsh
c5ca250f37
Add id-slug lookup and restrict slugs endpoints to superadmins (#1279)
Fixes #1278 
- Adds `GET /orgs/slug-lookup` endpoint returning `{id: slug}` for all
orgs
- Restricts new endpoint and existing `GET /orgs/slugs` to superadmins
2023-10-13 17:02:19 -07:00
Ilya Kreymer
41c054d209
Storage ops followup type checking (#1274)
* storage ops: follow up to #1257:
- fix refactor typo
- add type hints for all storageops apis (add mypy_boto3_s3 and types_aiobotocore_s3 for type hints)
2023-10-11 14:03:00 -07:00
Tessa Walsh
266afdf8d9
Add slugs to org backend (#1250)
- Add slug field with uniqueness constraint to Organization
- Use python-slugify to generate slug from name and import that in migration
- Require name in all /rename and org creation requests
- Auto-generate slug for new org with no slug or when /rename is called w/o a slug
- Auto-generate slug for 'default-org' based on name

- Add /api/orgs/slugs GET endpoint to return all slugs in use

- tests: extend backend test-requirements.txt from requirements to allow testing slugify
- tests: move get_redis_crawl_stats() to avoid extra dependency in utils
2023-10-10 18:30:09 -07:00
Ilya Kreymer
16e7a1d0a2
Storage Ops Refactor (#1257)
* storage ops refactor:
- create StorageOps class similar to other ops classes
- init storages list in StorageOps, no longer require lookup up default storages via CrawlManager
- convert all storage functions to members, add storageops to operator
- remove unused params, ensure crawl exists for rollover restart
- add env var to determine if using local minio to use correct endpoint URL

* crawls /seeds endpoint: just return empty list if not a crawl (eg. upload)

* crawlmanager: remove unused code, rename check_storage -> has_storage
2023-10-10 15:04:23 -07:00
Ilya Kreymer
5cad9acee9
Compute crawl execution time in operator (#1256)
* store execution time in operator:
- rename isNewCrash -> isNewExit, crashTime -> exitTime
- keep track of exitCode
- add execTime counter, increment when state has a 'finishedAt' and 'startedAt' state
- ensure pods are complete before deleting
- store 'crawlExecSeconds' on crawl and org levels, add to Crawl, CrawlOut, Organization models

* support for fast cancel:
- set redis ':canceled' key to immediately cancel crawl
- delete crawl pods to ensure pod exits immediately
- in finalizer, don't wait for pods to complete when canceling (but still check if terminated)
- add currentTime in pod.status.running.startedAt times for all existing pods
- logging: log exec time, missing finishedAt
- logging: don't log exit code 11 (interrupt due to time/size limits) as a crash

* don't wait for pods completed on failed with existing browsertrix-crawler image

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-09 17:45:00 -07:00
Tessa Walsh
748c86700d
fix: lookup user object operator to pass to CrawlConfig.add_new_crawl (#1254)
fixes #1253 
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-10-05 21:30:10 -07:00
Ilya Kreymer
fa86555eed
Track pod resource usage, detect OOM crashes, handle auto-scaling (#1235)
* keep track of per pod status on crawljob:
- crashes time, and reason
- 'used' vs 'allocated' resources 
- 'percent' used / allocated

* crawl log errors: log error when crawler crashes via OOM, either via redis error log
or to console

* add initial autoscaling support!
- detect if metrics server is available via K8SApi.is_pod_metrics_available()
- if available, use metrics for 'used' fields
- if no metrics, set memory used for redis only (using redis apis)
- allow overriding memory and cpu via newMemory and newCpu settings on pod status
- scale memory / cpu based on newMemory and newCpu setting
- templates: update jinja templates to allow restarting crawler and redis with new resources
- ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs

* roles: cleanup unused roles, add permissions for listing metrics

* stats for running crawls:
- update in db via operator
- avoids losing stats if redis pod happens to be done
- tradeoff is more db access in operator, but less extra connections to redis + already
loading from db in backend
- size stat: ensure size of previous files is added to the stats

* crawler deployment tweaks:
- adjust cpu/mem per browser
- add --headless flag to configmap to use new headless mode by default!
2023-10-05 20:41:18 -07:00
Ilya Kreymer
20560abb81 version: bump to 1.7.0-beta.2 2023-10-05 20:33:38 -07:00
Tessa Walsh
bbdb7f8ce5
Require that all passwords are between 8 and 64 characters (#1239)
- Require that all passwords are between 8 and 64 characters
- Fixes account settings password reset form to only trigger
logged-in event after successful password change.
- Password validation can be extended within the UserManager's
validate_password method to add or modify requirements.
- Add tests for password validation
2023-10-03 18:57:46 -07:00
Tessa Walsh
b1ead614ee
Add --failOnFailedSeed checkbox to URL list workflows (#1236)
- If set, and any of the seeds fails, the entire crawl is marked as a failure.
- Add checkbox which adds --failOnFailedSeed checkbox to URL list workflows
- Add 'Fail Crawl On Failed URL' to crawl workflow setup docs
2023-10-03 18:46:09 -07:00
Tessa Walsh
e9bac4c088
API delete endpoint improvements (#1232)
- Applies user permissions check before deleting anything in all /delete endpoints
- Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete
- Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete
- Updates frontend notification message to Only org owners can delete other users' archived items. when a crawler user attempts to delete another users' archived items
2023-10-03 13:05:00 -07:00
sua yoo
df190e12b9
Show running workflow error logs (#1224)
- Adds "Logs" tab to workflow detail
- Shows error logs in expandable section in "Watch" tab
- Show corresponding message (no logs yet or logs temporarily unavailable) when `/errors` returns 503 based on crawl state
- text tweaks: use error logs instead of logs, change 'crawl start' -> 'crawl continue' in log message

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-10-03 00:03:21 -07:00
Anish Lakhwara
a2dbad35c3
feat: use is_bool to check EMAIL_SMTP_USE_TLS (#1231)
- use is_bool to check EMAIL_SMTP_USE_TLS
- use is_bool for yaml values that are boolean
2023-10-02 21:29:36 -07:00
sua yoo
941a75ef12
Separate seeds into a new endpoints (#1217)
- Remove config.seeds from workflow and crawl detail endpoints
- Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow
- Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before)
- Modify frontend to fetch seeds from new /seeds endpoints with loading indicator

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-02 10:56:12 -07:00
Anish Lakhwara
1bf531e1ec
Fix: Make Collections Public on Creation (#1213)
- Add isPublic to Add Collection endpoint, send isPublic from frontend
- Fixes #1212
2023-09-29 12:08:10 -07:00
Anish Lakhwara
037396f3d9
Fix: Stream log downloading from WACZ (#1225)
* Fix(backend): Stream logs without causing OOM

Also be smarter about when to use `heapq.merge` and when to use
`itertools.chain`: If all the logs are coming from the same instance we
`chain` them, otherwise we'll `merge` them

iterator fixes:
- group wacz files by instance by suffix, eg. -0.wacz, -1.wacz, -2.wacz
- sort wacz files, and all logs within each wacz file
- chain log iterators for all log files within wacz group
- merge log iterators across wacz files in different groups
- add type hints to help keep track of iterator helper functions
- add iter_lines() from botocore, use that for line parsing for simplicity

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-28 18:54:52 -07:00
Ilya Kreymer
d6bc467c54
improvements to redis pod: (#1219)
- add liveness check/fix readiness check - ensure 'redis-cli ping' actually returns 'PONG', as exit code is 0 even if errors
will detect situations where redis is not available, such as due to to max clients being reached
- bump redis memory/cpu for now (until autoscaling/automatic adjustment is available)
2023-09-28 13:00:31 -07:00
Ilya Kreymer
7eac0fdf95
optimization: convert all uses of 'async for' to use iterator directly (#1229)
- optimization: convert all uses of 'async for' to use iterator directly instead of converting to list to avoid
unbounded size lists
- additional cursor.to_list() to async for conversions for stats computation, simply crawlconfigs stats computation

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-28 12:31:08 -07:00
Vinzenz Sinapius
cabf4ccc21
Disable smtp_use_tls with false instead of empty string (#1184)
`smtp_use_tls = bool(os.environ.get("EMAIL_SMTP_USE_TLS", True))` would only disable tls when `EMAIL_SMTP_USE_TLS` is set to an empty string which is not intuitive
2023-09-28 12:10:20 -07:00
Ilya Kreymer
86a424af93
migration improvements: (#1228)
* migration improvements + rerunning migrations: (fixes #1227)
- avoid starting some workers while migration is still running
- ensure workers that aren't performing migration await for migration to complete
- backend will not be valid until migration is run
* allow rerunning migration from specified version via --set rerun_from_migration=<VERSION> (replaces rerun_last_migration)
2023-09-28 12:04:19 -07:00
Tessa Walsh
1f74f03447
Recalculate Organization.storedBytes in migration 0017 (#1220) 2023-09-28 11:22:10 -07:00
Tessa Walsh
7a56fa23f5
Remove username lookups for crawls and workflows by storing usernames in db (#1199)
* store usernames (createdByName, modifiedByName, startedByName) in db for workflows
* store userName for userid for crawls in db
* update output models to return usernames
* add migration 0018 to add usernames to existing crawls and crawlconfigs
* updated tests for crawl and config usernames
* use async for to iterate over crawls and crawlconfigs

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-28 09:37:23 -07:00
Ilya Kreymer
e6bccac953
exclude match api pagination: (#1214)
- limit how many exclusion matches are returned at once
- option to specify 'offset', 'limit' and return 'nextOffset' for further pagination
- set page limit to 1000 by default
2023-09-26 13:45:54 -07:00
Tessa Walsh
094f27bcff
Track bytes stored per file type and include in org metrics (#1207)
* Add bytes stored per type to org and metrics

The org now tracks bytesStored by type of crawl, uploads, and browser profiles
in addition to the total, and returns these values in the org metrics endpoint.

A migration is added to precompute these values in existing deployments.

In addition, all /metrics storage values are now returned solely as bytes, as
the GB form wasn't being used in the frontend and is unnecessary.

* Improve deletion of multiple archived item types via `/all-crawls` delete endpoint

- Update `/all-crawls` delete test to check that org and workflow size values
are correct following deletion.
- Fix bug where it was always assumed only one crawl was deleted per cid
and size was not tracked per cid
- Add type check within delete_crawls
2023-09-22 12:55:21 -04:00
Tessa Walsh
83f80d4103
Add org metrics API endpoint (#1196)
* Initial implementation of org metrics
 (This can eventually be sped up significantly by precomputing the
values and storing them in the db.)
* Rename storageQuota to storageQuotaBytes to be consistent
* Update tests to include metrics
2023-09-19 16:24:27 -05:00
Tessa Walsh
859f2271da fix(backend): call run now when updating crawlConfig #1194
Update backend/btrixcloud/crawlconfigs.py

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-19 11:57:41 -07:00
Tessa Walsh
9224f52f51
Remove config from list endpoints to speed up responses (#1193)
* Remove config from list endpoints

- Remove config field from workflow and crawl list endpoints
- Add seedCount to CrawlConfigOut on backend and Workflow on frontend
- Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional
- Refactor workflow list in frontend to use firstSeed and seedCount
- Frontend uses ListWorkflow type which is Omit<Workflow, "config">
2023-09-19 11:05:48 -05:00
Ilya Kreymer
65b7c10ba1 bump version to 1.7.0-beta.1 2023-09-18 14:33:03 -07:00
Ilya Kreymer
ff327c0b8b
Reset crawl state to running when any crawlers are running (after post-process states) (#1179)
* operator state changes: (fixes #1178)
- if at least one crawler is 'running' ensure state is reset back to running
- for multiple instances, set status to earliest state (not latest) to be consistent,
eg. if at least one crawl is running, set to running, if at least one is generating wacz, set to that
2023-09-15 09:16:46 -07:00
Tessa Walsh
2efc461b9b
Implement sync streaming for finished crawl logs (#1168)
- Crawl logs streamed from WACZs using the sync boto client
2023-09-14 17:05:19 -07:00
Ilya Kreymer
feb7ab7652
Improved type checking for backend with mypy (#1174)
* add mypy type check
- run type check on backend fix ambiguous typing issues
- add mypy to lint gh action + precommit hook
- add mypy.ini
2023-09-13 19:40:26 -07:00
Ilya Kreymer
4b34da033a
Refactor / Cleanup: move ops functions back into classes (#1171)
* remove almost all standalone functions and move them back into ops member functions
* operator now has access to all the ops classes as well
* keep two standalone functions used only in migrations

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-13 11:56:09 -07:00
Ilya Kreymer
9159c7c914
ensure max crawl size and max crawl timeout values are set to 0 when unused, instead of null (#1167)
- convert None->0 when creating CrawlJob
- ensure frontend sends 0 not null
- make input model require 'int = 0' instead of 'Optional[int] = 0'
2023-09-13 09:51:26 -07:00
Tessa Walsh
7cf2b11eb7
Add event webhook tests (#1155)
* Add success filter to webhook list GET endpoint

* Add sorting to webhooks list API and add event filter

* Test webhooks via echo server

* Set address to echo server on host from CI env var for k3d and microk8s

* Add -s back to pytest command for k3d ci

* Change pytest test path to avoid hanging on collecting tests

* Revert microk8s to only run on push to main
2023-09-12 22:08:40 -07:00
Ilya Kreymer
c9c39d47b7
Scheduled Crawl Refactor: Handle via Operator + Add Skipped Crawls on Quota Reached (#1162)
* use metacontroller's decoratorcontroller to create CrawlJob from Job
* scheduled job work:
- use existing job name for scheduled crawljob
- use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done
- simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls,
placeholder job image for cronjobs

* move storage quota check to crawljob handler:
- add 'skipped_quota_reached' as new failed status type
- check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created)

* frontend:
- show all crawls in crawl workflow, no need to filter by status
- add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed

* migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template
2023-09-12 13:05:43 -07:00
Tessa Walsh
9377a6f456
Issue all non-upload storage-quota-update events from LiteElement (#1151)
- More specific toast notification error messages to the action being attempted
- Single dismissable global banner shown when org storage is reached
- Removed check for storage quota reached in `runNow`, since buttons are disabled in UI, and errors handled if request fails.
- Allow creating new workflow when storage quota reached
- More responsive storage quota updates: add storageQuotaReached to archived item replay.json, updates w/o reload when crawl pushes quota over limit
- Modify LiteElement to check for storageQuotaReached on GET requests

---------
Co-authored-by: sua yoo <sua@suayoo.com>
2023-09-11 18:17:48 -07:00
Ilya Kreymer
ad9bca2e92
Operator refactor to control pods + pvcs directly instead of statefulsets (#1149)
- Ability for pod to be Completed, unlike in Statefulset - eg. if 3 pods are running and first one finishes, all 3 must be running until all 3 are done. With this setup, the first finished pod can remain in Completed state.
- Fixed shutdown order - crawler pods now correctly shutdown first before redis pods, by switching to background deletion.
- Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl
- Create priority classes upto 'max_crawl_scale, configured in values.yaml
- Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale,
graceful stop scaled-down instance to complete via redis 'stopone' key, wait until they exit with Completed state
before adjust status.scale / removing scaled down pods. Ensures unaccepted interrupts don't cause scaled down data to be deleted.
- Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds
- Configurable Redis storage with 'redis_storage' value, set to 3Gi by default
- CrawlJob deletion starts as soon as post-finish crawl operations are run
- Post-crawl operations get their own redis instance, since one during response is being cleaned up in finalizer
- Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished)
- Current resource usage added to status
- Profile browser: also manage single pod directly without statefulset for consistency.
- Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0)
- Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0)
- Failed crawl logging: dd 'fail_crawl()' to be used for failing a crawl, which prints logs for default container (if enabled) as well as pod status
- tests: check other finished states to avoid stuck in infinite loop if crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
fixes #1147 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-11 10:38:04 -07:00
Anish Lakhwara
e57148d0e9
feat: add SMTP {port, use_tls} config (#1142)
* feat: add SMTP {port, use_tls} config
* If `password` is None don't attempt to log in
* remove 'can be omitted' comment

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-09-08 08:18:36 -07:00
Ilya Kreymer
e75b207f7e
Fix 0015 migration (#1154)
* migration: fix 0015 migration to ensure it reads the correct mongo collection, avoid variable overwrites and and uses org _id field. fixes #1153
2023-09-08 08:17:40 -07:00
Tessa Walsh
d2ededc895
Add and enforce org storage quota (#1106)
* Implement in backend

- Track bytesStored in org
- Add migration to pre-calculate based on size of crawlfiles and profilefiles
- Add methods to increase or decrease org storage when crawl or profile files
are added or deleted
- Include storageQuotaReached boolean in API responses that alter storage
- Don't start new crawls and fail uploads if storage quota reached

* Implement in frontend

- Add to orgs-list quotas
- Update org's storageQuotaReached based on backend endpoint responses
- Disable buttons when storage quota is met
- Show toast notification when attempting to run a crawl when org
storage quota is met
2023-09-07 12:45:43 -04:00
Ilya Kreymer
68bc053ba0
Print crawl log to operator log (mostly for testing) (#1148)
* log only if 'log_failed_crawl_lines' value is set to number of last lines to log
from failed container

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-06 17:53:02 -07:00
Ilya Kreymer
dce1ae6129
better resources scaling by number of browsers per crawler container (#1103)
- set crawler cpu / memory with fixed base + incremental bumps based on number of browsers
- allow parsing k8s quantities with parse_quantity, compute in operator
- set 'crawler_cpu = crawler_cpu_base + crawler_extra_cpu_per_browser * (num_browsers - 1)'
and same for memory
2023-09-06 01:42:44 -04:00
Ilya Kreymer
876ba1bf24
null check: check before accessing config in 'get_all_crawl_search_values' (#1144) 2023-09-05 23:57:05 -04:00
Tessa Walsh
147bfd9d44
Add event webhook notifications system to backend (#1061)
Initial set of backend API for event webhook notifications for the following events:
* Crawl started (including boolean indicating if crawl was scheduled)
* Crawl finished
* Upload finished
* Archived item added to collection
* Archived item removed from collection

Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and then attempted to be sent (up to a total of 5 tries per overall attempt, with an increasing backoff between, implemented via use of the backoff library, which supports async).

webhook status available via /api/orgs/<oid>/webhooks

(Additional testing + potential fastapi integration left in separate follow-ups
Fixes #1041
2023-08-31 19:52:37 -07:00
Tessa Walsh
1aa951132c
Fix unsetting all collections via PATCH update (#1126) 2023-08-30 18:16:21 -04:00
Tessa Walsh
f6369ee01e
Add support for collectionIds to archived item PATCH endpoints (#1121)
* Add support for collectionIds to patch endpoints

* Make update available via all-crawls/ and add test

* Fix tests

* Always remove collectionIds from udpate

* Remove unnecessary fallback

* One more pass on expected values before update
2023-08-30 10:41:30 -04:00
Tessa Walsh
e667fe2e97
Add max crawl size option to backend and frontend (#1045)
Backend:
- add 'maxCrawlSize' to models and crawljob spec
- add 'MAX_CRAWL_SIZE' to configmap
- add maxCrawlSize to new crawlconfig + update APIs
- operator: gracefully stop crawl if current size (from stats) exceeds maxCrawlSize
- tests: add max crawl size tests

Frontend:
- Add Max Crawl Size text box Limits tab
- Users enter max crawl size in GB, convert to bytes
- Add BYTES_PER_GB as constant for converting to bytes
- docs: Crawl Size Limit to user guide workflow setup section

Operator Refactor:
- use 'status.stopping' instead of 'crawl.stopping' to indicate crawl is being stopped, as changing later has no effect in operator
- add is_crawl_stopping() to return if crawl is being stopped, based on crawl.stopping or size or time limit being reached
- crawlerjob status: store byte size under 'size', human readable size under 'sizeHuman' for clarity
- size stat always exists so remove unneeded conditional (defaults to 0)
- store raw byte size in 'size', human readable size in 'sizeHuman'

Charts:
- subchart: update crawlerjob crd in btrix-crds to show status.stopping instead of spec.stopping
- subchart: show 'sizeHuman' property instead of 'size'
- bump subchart version to 0.1.1

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-08-26 22:00:37 -07:00
Ilya Kreymer
2da6c1c905
1.6.3 Fixes - Fix workflow sort order for Latest Crawl + 'Remove From Collection' action menu on archived items in collections (#1113)
* fix latest crawl (lastRun) sort:
- don't cast 'started' value to string when setting as starting crawl time (regression from #937)
- caused incorrect sorting as finished crawl time was a datetime, while starting crawl time was a string
- move updated config crawl info in one place, simplify to avoid returning started time altogether, just set directly
- pass mdb crawlconfigs and crawls collections directly to add_new_crawl() function
- fixes #1108

* Add dropdown menu containing 'Remove from Collection' to archived items in collection view (#1110)
- Enables users to remove an item from a collection from the collection detail view - menu was previously missing
- Fixes: #1102 (missing dropdown menu) by making use of the inactive menu trigger button.
- Updates collection items page size to match "Archived Items" page size (20 items per page)

---------
Co-authored-by: sua yoo <sua@webrecorder.org>
2023-08-25 21:08:47 -07:00
Anish Lakhwara
8b16124675
feat: implement 'collections' array with {name, id} for archived item details (#1098)
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-08-25 00:26:46 -07:00
Ilya Kreymer
989ed2a8da
Use Shared Services for Crawling, Redis, Profile Browsers (#1088)
* refactor to use shared role-based service shared across pods:
- 'crawler' service for all crawler screencasting, scales 0 .. N with crawler-<ID>-N.crawl
- 'redis' service for all redis access, redis-<ID>-0.redis
- 'browser' service for all browser access (profile browsers), browser-<ID>-0.browser
- don't create a new service per crawl/profile at all
- enable 'publishNotReadyAddresses' for potentially faster resolving, esp for redis
- remove service as type managed by operator as no longer creating services dynamically
- remove frontend var CRAWLER_SVC_SUFFIX, suffix always '.crawler' to match crawler service name
2023-08-24 20:08:53 -07:00
Ilya Kreymer
e7f2d93f80 bump version to 1.7.0-beta.0 2023-08-23 12:03:45 -07:00
Tessa Walsh
ce5b52f8af
Add and enforce org maxPagesPerCrawl quota (#1044) 2023-08-23 10:38:36 -04:00
sua yoo
54cf4f23e4
Paginate Workflows and refactor to use server-side queries (#1078)
- Paginates Crawl Workflows when there are more than 10 workflows
- Refactors workflow search and crawl search to use the same component
- Adds sort by first seed, workflow creation date, and workflow modified date
- Separates "last run" date from "modified" date
- Update column layout into Name & Schedule (or Manual Ru'ri=), Latest Crawl (<finish time> in <duration>), total size, and last modified (modified by and modified time)
2023-08-22 16:29:17 -07:00
Ilya Kreymer
422452b5c1 bump to 1.6.2 2023-08-18 18:27:37 -07:00
Ilya Kreymer
90b2f94aef
follow-up to #1066: update redis to 5.0.0 which includes full fix for connection leak in from_url(), (#1081)
simplifies previous workaround addressed in 5.0.0
2023-08-15 20:34:47 -07:00
Ilya Kreymer
2e73148bea
fix redis connection leaks + exclusions error: (fixes #1065) (#1066)
* fix redis connection leaks + exclusions error: (fixes #1065)
- use contextmanager for accessing redis to ensure redis.close() is always called
- add get_redis_client() to k8sapi to ensure unified place to get redis client
- use connectionpool.from_url() until redis 5.0.0 is released to ensure auto close and single client settings are applied
- also: catch invalid regex passed to re.compile() in queue regex check, return 400 instead of 500 for invalid regex
- redis requirements: bump to 5.0.0rc2
2023-08-14 18:29:28 -07:00
Ilya Kreymer
9553115bbe
helm chart tweaks: (#1067)
* helm chart tweaks:
- lower mem requirements for backend and crawler
- disable cors in ingress to pass through cors headers from backend
- crawler statefulset: use ordered instead of parallel scaling policy to avoid single crawl taking up all crawling capacity quickly
2023-08-14 16:43:12 -07:00
Ilya Kreymer
d93ddaf620 bump version to 1.6.1 2023-08-11 12:50:41 -07:00
Ilya Kreymer
35ab6d6df6 bump to 1.6.0! 2023-08-09 15:40:27 -07:00
sua yoo
37733483d5
Standardize archived item filtering, sorting and labels (#1054)
Frontend:
- Renames list view to "All Archived Items"
- Refactors fetches to use single all-crawls endpoints
- Removes search by config ID for more search parity with uploads
- Adds sort by size
- Refactors property and method names to replace crawl*
- Replaces remaining references to "crawl" in copy with "item"'
- Rename Upload Archive button to Upload WACZ
- Fix focusout in item menu so menus close

Backend:
- Filter search values by type as well
- Only get list of cids for crawls in search values
- Don't list crawl/workflow ids in search values

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-08-09 12:13:55 -07:00
Ilya Kreymer
7a8f370bc2 bump version to 1.6.0-beta.4 for testing 2023-08-09 12:09:37 -07:00
Ilya Kreymer
de3e5907a7
backend: crawlout: include raw crawnconfig in api details, fixes #1030 (#1055) 2023-08-09 08:46:42 -07:00
Ilya Kreymer
8d0a4f2ca9
fix public collections endpoint returning 404 when not public (#1052)
tests: add tests for public collections endpoint when collection is public and when not
2023-08-04 13:29:13 -04:00
Tessa Walsh
7ff57ce6b5
Backend: standardize search values, filters, and sorting for archived items (#1039)
- all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in
- Uploads list endpoint now includes all all-crawls filters relevant to uploads
- An all-crawls/search-values endpoint is added to support searching across all archived item types
- Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI)
- Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment
- New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass
- Tests coverage added for all new all-crawls endpoints, filters, and sort values
2023-08-04 09:56:52 -07:00
Ilya Kreymer
362afa47bd
Support for Public / Shareable Collections (#1038)
* collections: support toggling collections public/private, viewable via RWP
- backend: add 'public' to collection model, support patching to update
- backend: add .../collections/<id>/public/replay.json for public access
- backend: add CORS handling for public endpoint
- frontend: support 'make shareable / make private' dropdown actions on collection detail + collection list views
- frontend: show shareable / private icons by collection name on detail + list views
- frontend: link to replayweb.page for standalone browsing
- frontend: add embed code popup when a collection is shareable
- refer to public collections as 'shareable' for now

---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-08-03 19:11:01 -07:00
Ilya Kreymer
45eaa0b3a3 version: bump to 1.6.0-beta.3 2023-08-01 09:48:17 -07:00
Ilya Kreymer
06cf9c7cc3
add crawl ending states: 'generate-wacz', 'uploading-wacz', 'pending-wait' that occur after a crawl is finished or is being stopped (#1022)
operator: ensure transitions from each of these states is supported, including to 'waiting_capacity'
add extra check on stopping to avoid transitioning back to a running state after crawl is finished
ui: add states to UI display, localization, add as active states
fixes #263
2023-08-01 00:15:59 -07:00
Ilya Kreymer
7ea6d76f10
Resource Constraints Cleanup: (fixes #895) (#1019)
* resource constraints: (fixes #895)
- for cpu, only set cpu requests
- for memory, set mem requests == mem limits
- add missing resource constraints for minio and scheduled job
- for crawler, set mem and cpu constraints per browser, scale based on browser instances per crawler
- add comments in values.yaml for crawler values being multiplied
- default values: bump crawler to 650 millicpu per browser instance just in case

cleanup: remove unused entries from main backend configmap
2023-08-01 00:11:16 -07:00
Vinzenz Sinapius
5807507f29
Add proxy settings for crawler and profilebrowser (#997) 2023-07-26 16:11:10 -07:00
Ilya Kreymer
6506965d98
Streaming Download for Collections (#1012)
* support streaming download of collections (part of #927)
- WACZ zip created on the fly using stream-zip
- add 'Download Collection' option to collection detail and list
- after editing collection, return to collection view
- tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-07-26 15:42:17 -07:00
Tessa Walsh
c21153255a
Rename notes to description in frontend and backend (#1011)
- Rename crawl notes to description
- Add migration renaming notes -> description
- Stop inheriting workflow description in crawl
- Update frontend to replace crawl/upload notes with description
- Remove setting of config description from crawl list
- Adjust tests for changes
2023-07-26 13:00:04 -07:00
Ilya Kreymer
4bea7565bc
load handling: scale up redis only when crawler pods running (#1009)
Operator: Modified init behavior to only load redis when at least one crawler pod available:
- waits for at least one crawler pod to be available before starting redis pod, to avoid situation where many crawler pods are in pending mode, but redis pods are still running.
- redis statefulset starts at scale of 0
- once crawler pod becomes available, redis sts is scaled to 1 (via `initRedis==true` status)
- crawl remains in 'starting' or 'waiting_capacity' state until pod becomes available without redis pod running
- set to 'running' state only after redis and at least one crawler pod is available
- if no crawler pods available after running, or, if stuck in starting for >60 seconds, switch to 'waiting_capacity' state
- when switching to 'waiting_capacity', also scale down redis to 0, wait for crawler pod to become available, only then scale up redis to 1, and get back to 'running'

other tweaks:
- add new status field 'initRedis', default to false, not displayed
- crawler pod: consider 'ContainerCreating' state as available, as container will not be blocked by resource limits
- add a resync after 3 seconds when waiting for crawler pod or redis pod to become available, configurable via 'operator_fast_resync_secs'
- set_state: if not updating state, ensure state reflects actual value in db
2023-07-26 08:40:05 -07:00
Tessa Walsh
608a744aaf
Add migration to replace None with 0 for configmap CRAWL_TIMEOUT (#1008) 2023-07-24 15:49:26 -04:00
Tessa Walsh
fcd48b1831
Add totalSize to collections and make it sortable in list endpoint (#1001)
* Precompute collection.totalSize and make sortable

* Add migration to recompute collection data with totalSize
2023-07-24 13:12:23 -04:00
Tessa Walsh
9f32aa697b
Add collections and tags to upload API endpoints (#993)
* Add collections and tags to uploads

* Fix order of deletion check test

* Re-add tags to UploadedCrawl model after rebase

* Fix Users model heading
2023-07-21 16:44:56 +02:00
Tessa Walsh
4014d98243
Move pydantic models to separate module + refactor crawl response endpoints to be consistent (#983)
* Move all pydantic models to models.py to avoid circular dependencies
* Include automated crawl details in all-crawls GET endpoints
- ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated
crawls only:
- cid
- name
- description
- firstSeed
- seedCount
- profileName

* Add automated crawl fields to list all-crawls test

* Uncomment mongo readinessProbe

* cleanup CrawlOutWithResources:
- remove 'files' from output model, only resources should be returned
- add _files_to_resources() to simplify computing presigned 'resources' from raw 'files'
- update upload tests to be more consistent, 'files' never present, 'errors' always none

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-20 13:05:33 +02:00
Tessa Walsh
d5c3a8519f
Add crawler Use Sitemap option to Browsertrix Cloud (#978)
* Add user-guide docs for Use Sitemap option
---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-07-19 13:57:52 -04:00
Ilya Kreymer
a5312709bb
fix issues that caused cronjob container to crash: (#987)
- don't set CRAWL_TIMEOUT to "None" in configmap, and if encountered, just set to 0
- run register_exit_handler() after run loop has been inited
2023-07-18 18:08:53 +02:00
Ilya Kreymer
7d694754c6
uploads api ext: (#970)
- also support collectionId filter on /all-crawls
- update tests
2023-07-09 22:12:54 -07:00
Ilya Kreymer
f1bce310d0
uploads api: support filtering uploads by collectionId (#969)
tests: add collection filter test
2023-07-09 10:54:30 -07:00
Ilya Kreymer
2038e3d668
remove default: similar to #952, remove default extraHops setting as it disables 'url list' extraHops by forcing the value to 0 (#954) 2023-07-07 12:08:30 -07:00
Ilya Kreymer
7139b9a7a9
operator: ensure finished is always set (#953) 2023-07-07 12:08:15 -07:00
Ilya Kreymer
00eb62214d
Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937)
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove

* uploads api: (part of #929)
- new UploadCrawl object which extends BaseCrawl, has name and description
- support multipart form data data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to support for replay
- add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')

* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects / missing type are set to 'type: crawl'
- indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state

* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'

* collections: support adding and remove both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads

bump version to 1.6.0-beta.2
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-07-07 09:13:26 -07:00
Tessa Walsh
bf1e817da3
Unset default scopeType for seeds so they inherit parent scopeType by default (#952) 2023-07-06 15:03:05 -07:00
Ilya Kreymer
e37f220d6c version: bump to 1.6.0-beta.1 2023-06-16 18:53:32 -07:00
Tessa Walsh
c7051d5fbf
Backend API consistency pass (#921)
* Make API add and update method returns consistent

- Updates return {"updated": True}
- Adds return {"added": True}
- Both can additionally have other fields as needed, e.g. id or name

- remove Profile response model, as returning added / id only
- reformat

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-06-16 18:52:46 -07:00
Tessa Walsh
bd6dc79449
Add frontend support for auto-adding collections to workflows (#916)
- Adds collections search and list to workflow editor
- Adds collections to workflow details component
- Adds namePrefix filter to backend GET /orgs/{oid}/collections endpoint to support case-insensitive searching of collections
- Adds documentation for new setting

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-06-12 18:18:05 -07:00
Tessa Walsh
325355d991
Fix post-crawl collection stats update and add test (#918)
This fixes #917, where crawls added to a collection via the workflow
autoAddCollections were not successfully represented in the crawl
and page count stats in the collection after completing.
2023-06-10 19:06:25 -07:00
Tessa Walsh
e10b7093c7
Fix bug preventing deleting collections with no crawls (#912) 2023-06-08 11:28:30 -07:00
Ilya Kreymer
4428184aea
frontend: configure running with a fixed 'replay.json', auth headers passed via separate config (#899)
wabac.js will reload the replay.json on 403 with new token (will be in next version of wabac.js)
presign urls: make presign timeout configurable (in minutes), defaults to 60 mins
dockerfile: fix configuring RWP_BASE_URL
2023-06-08 11:26:26 -07:00
Tessa Walsh
120f7ca158
Precompute crawl file stats (#906) 2023-06-07 16:39:49 -07:00
sua yoo
66b3befef9
Frontend collections beta UI (#886)
- Support for creating new collections and editing existing collections
- Can select crawling workflows which adds entire workflow, and then deselect individual crawls
- Can edit existing collections and add more crawls
- Can view, create and delete collections via new Collections top-level nav entry
2023-06-06 17:52:01 -07:00
Ilya Kreymer
3f42515914
crawls list: unset errors in crawls list response to avoid very large… (#904)
* crawls list: unset errors in crawls list response to avoid very large responses #872

* Remove errors from crawl replay.json

* Add tests to ensure errors are excluded from crawl GET endpoints

* Update tests to accept None for errors
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-06-02 18:52:59 -07:00
Ilya Kreymer
00fb8ac048
Concurrent Crawl Limit (#874)
concurrent crawl limits: (addresses #866)
- support limits on concurrent crawls that can be run within a single org
- change 'waiting' state to 'waiting_org_limit' for concurrent crawl limit and 'waiting_capacity' for capacity-based
limits

orgs:
- add 'maxConcurrentCrawl' to new 'quotas' object on orgs
- add /quotas endpoint for updating quotas object

operator:
- add all crawljobs as related, appear to be returned in creation order
- operator: if concurrent crawl limit set, ensures current job is in the first N set of crawljobs (as provided via 'related' list of crawljob objects) before it can proceed to 'starting', otherwise set to 'waiting_org_limit'
- api: add org /quotas endpoint for configuring quotas
- remove 'new' state, always start with 'starting'
- crawljob: add 'oid' to crawljob spec and label for easier querying
- more stringent state transitions: add allowed_from to set_state()
- ensure state transitions only happened from allowed states, while failed/canceled can happen from any state
- ensure finished and state synched from db if transition not allowed
- add crawl indices by oid and cid

frontend: 
- show different waiting states on frontend: 'Waiting (Crawl Limit) and 'Waiting (At Capacity)'
- add gear icon on orgs admin page
- and initial popup for setting org quotas, showing all properties from org 'quotas' object

tests:
- add concurrent crawl limit nightly tests
- fix state waiting -> waiting_capacity
- ci: add logging of operator output on test failure
2023-05-30 15:38:03 -07:00
sua yoo
6208ead040
Sort collection by last updated (modified) (#897)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-05-30 14:09:10 -04:00
Ilya Kreymer
4d30a64bc9
collection delete: (#896)
set delete endpoint to use DELETE verb, fix for #869
2023-05-29 18:19:04 -07:00
Tessa Walsh
df4c4e6c5a
Optimize workflow statistics updates (#892)
* optimizations:
- rename update_crawl_config_stats to stats_recompute_all, only used in migration to fetch all crawls
and do a full recompute of all file sizes
- add stats_recompute_last to only get last crawl by size, increment total size by specified amount, and incr/decr number of crawls
- Update migration 0007 to use stats_recompute_all
- Add isCrawlRunning, lastCrawlStopping, and lastRun to
stats_recompute_last
- Increment crawlSuccessfulCount in stats_recompute_last

* operator/crawls:
- operator: keep track of filesAddedSize in redis as well
- rename update_crawl to update_crawl_state_if_changed() and only update
if state is different, otherwise return false
- ensure mark_finished() operations only occur if crawl is state has changed
- don't clear 'stopping' flag, can track if crawl was stopped
- state always starts with "starting", don't reset to starting

tests:
- Add test for incremental workflow stats updating
- don't clear stopping==true, indicates crawl was manually stopped

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-05-26 22:57:08 -07:00