32 Commits

Author SHA1 Message Date
Tessa Walsh
e9bac4c088
API delete endpoint improvements (#1232)
- Applies user permissions check before deleting anything in all /delete endpoints
- Shuts down running crawls before deleting anything in /all-crawls/delete as well as /crawls/delete
- Splits delete_list.crawl_ids into crawls and upload lists at same time as checks in /all-crawls/delete
- Updates frontend notification message to "Only org owners can delete other users' archived items." when a crawler user attempts to delete another user's archived items
2023-10-03 13:05:00 -07:00
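Note on #1232: the ordering described above (check permissions for every requested item first, split the ids into crawls and uploads in the same pass, and only then delete) could be sketched roughly as follows. This is an illustration only, not the project's code; `Item`, `split_and_check`, and `is_org_owner` are hypothetical names, and the sketch raises `PermissionError` where an API would return an HTTP error.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical archived item; field names are illustrative only."""
    id: str
    type: str   # "crawl" or "upload"
    owner: str  # user who created the item


def split_and_check(items, requested_ids, user, is_org_owner):
    """Check permissions for every requested id and split ids by type.

    Raises PermissionError before returning anything, so a caller that
    deletes only after this function returns never deletes a partial set.
    """
    by_id = {item.id: item for item in items}
    crawls, uploads = [], []
    for item_id in requested_ids:
        item = by_id[item_id]
        # Non-owners may only delete items they created themselves.
        if not is_org_owner and item.owner != user:
            raise PermissionError(
                "Only org owners can delete other users' archived items."
            )
        (crawls if item.type == "crawl" else uploads).append(item_id)
    return crawls, uploads
```

A caller would shut down any running crawls in the returned crawl list before deleting, as the commit message describes.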
sua yoo
941a75ef12
Separate seeds into new endpoints (#1217)
- Remove config.seeds from workflow and crawl detail endpoints
- Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow
- Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before)
- Modify frontend to fetch seeds from new /seeds endpoints with loading indicator

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-02 10:56:12 -07:00
Tessa Walsh
7a56fa23f5
Remove username lookups for crawls and workflows by storing usernames in db (#1199)
* store usernames (createdByName, modifiedByName, startedByName) in db for workflows
* store userName for userid for crawls in db
* update output models to return usernames
* add migration 0018 to add usernames to existing crawls and crawlconfigs
* updated tests for crawl and config usernames
* use async for to iterate over crawls and crawlconfigs

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-28 09:37:23 -07:00
Tessa Walsh
d2ededc895
Add and enforce org storage quota (#1106)
* Implement in backend

- Track bytesStored in org
- Add migration to pre-calculate based on size of crawlfiles and profilefiles
- Add methods to increase or decrease org storage when crawl or profile files
are added or deleted
- Include storageQuotaReached boolean in API responses that alter storage
- Don't start new crawls and fail uploads if storage quota reached

* Implement in frontend

- Add to orgs-list quotas
- Update org's storageQuotaReached based on backend endpoint responses
- Disable buttons when storage quota is met
- Show toast notification when attempting to run a crawl when org
storage quota is met
2023-09-07 12:45:43 -04:00
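Note on #1106: the bookkeeping described above (track bytes stored per org, adjust the total when files are added or deleted, and report whether the quota has been reached) might look something like the sketch below. `OrgStorage` and its fields are hypothetical stand-ins, and treating a quota of 0 as "no quota set" is an assumption of this sketch, not something the commit message states.

```python
from dataclasses import dataclass


@dataclass
class OrgStorage:
    """Hypothetical stand-in for an org's storage accounting."""
    bytes_stored: int = 0
    storage_quota: int = 0  # assumption: 0 means no quota is set

    def add_bytes(self, size: int) -> bool:
        """Adjust stored bytes (negative size for deletions).

        Returns whether the quota is reached after the change, mirroring
        the storageQuotaReached boolean mentioned in the commit message.
        """
        self.bytes_stored = max(0, self.bytes_stored + size)
        return self.quota_reached()

    def quota_reached(self) -> bool:
        """True only when a quota is set and usage has met or passed it."""
        return self.storage_quota > 0 and self.bytes_stored >= self.storage_quota


org = OrgStorage(storage_quota=100)
org.add_bytes(60)    # below quota
org.add_bytes(40)    # quota reached: new crawls and uploads would be refused
org.add_bytes(-50)   # deleting files frees space again
```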
Tessa Walsh
f6369ee01e
Add support for collectionIds to archived item PATCH endpoints (#1121)
* Add support for collectionIds to patch endpoints

* Make update available via all-crawls/ and add test

* Fix tests

* Always remove collectionIds from update

* Remove unnecessary fallback

* One more pass on expected values before update
2023-08-30 10:41:30 -04:00
Tessa Walsh
7ff57ce6b5
Backend: standardize search values, filters, and sorting for archived items (#1039)
- all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in
- Uploads list endpoint now includes all all-crawls filters relevant to uploads
- An all-crawls/search-values endpoint is added to support searching across all archived item types
- Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI)
- Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment
- New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass
- Test coverage added for all new all-crawls endpoints, filters, and sort values
2023-08-04 09:56:52 -07:00
Tessa Walsh
c21153255a
Rename notes to description in frontend and backend (#1011)
- Rename crawl notes to description
- Add migration renaming notes -> description
- Stop inheriting workflow description in crawl
- Update frontend to replace crawl/upload notes with description
- Remove setting of config description from crawl list
- Adjust tests for changes
2023-07-26 13:00:04 -07:00
Tessa Walsh
c7051d5fbf
Backend API consistency pass (#921)
* Make API add and update method returns consistent

- Updates return {"updated": True}
- Adds return {"added": True}
- Both can additionally have other fields as needed, e.g. id or name

- remove Profile response model, as returning added / id only
- reformat

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-06-16 18:52:46 -07:00
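Note on #921: the return convention described above can be illustrated with two small helpers. The helper names are hypothetical; only the `{"updated": True}` and `{"added": True}` shapes, and the option of extra fields such as `id` or `name`, come from the commit message.

```python
def added_response(**extra):
    """Response body for a successful add, plus optional extra fields."""
    return {"added": True, **extra}


def updated_response(**extra):
    """Response body for a successful update, plus optional extra fields."""
    return {"updated": True, **extra}


# Example: an add endpoint that also reports the new resource's id.
body = added_response(id="example-id")
```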
Tessa Walsh
120f7ca158
Precompute crawl file stats (#906)
2023-06-07 16:39:49 -07:00
Ilya Kreymer
3f42515914
crawls list: unset errors in crawls list response to avoid very large… (#904)
* crawls list: unset errors in crawls list response to avoid very large responses #872

* Remove errors from crawl replay.json

* Add tests to ensure errors are excluded from crawl GET endpoints

* Update tests to accept None for errors
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-06-02 18:52:59 -07:00
Tessa Walsh
9c7a312a4c
Rework collections to track collections in Crawl (#878)
* Track collections in Crawl rather than crawls in Collection
* Add delete collection API endpoint and tests
* Precompute collection crawlCount, pageCount, and tags and add them to
GET collection responses
* Add modified field to Collection
* Update collection replay.json method
* Make add and remove crawls accept list of crawl ids
* Auto-add new workflow crawls to collections when they successfully
complete via CrawlConfig.autoAddCollections field
* Move long-running post-crawl operator tasks into asyncio task
* Make CrawlConfig.autoAddCollections updatable via /update API endpoint
2023-05-25 15:41:50 -04:00
Ilya Kreymer
12f7db3ae2
tests: fixes for crawl cancel + crawl stopped (#864)
* tests:
- fix cancel crawl test by ensuring state is not running or waiting
- fix stop crawl test by ensuring stop is only initiated after at least one page has been crawled;
otherwise the result may be failed, as no data has been crawled yet (a separate crawler fix,
webrecorder/browsertrix-crawler#314, avoids a loop if the crawl is stopped before any data is written)
- bump page limit to 4 for tests to ensure crawl is partially complete, not fully complete when stopping
- allow canceled or partial_complete due to race condition

* chart: bump frontend limits in default, not just for tests (addresses #780)

* crawl stop before starting:
- if crawl stopped before it started, mark as canceled
- add test for stopping immediately, which should result in 'canceled' crawl
- attempt to increase resync interval for immediate failure
- nightly tests: increase page limit to test timeout

* backend:
- detect stopped-before-start crawl as 'failed' instead of 'done'
- stats: return stats counters as int instead of string
2023-05-22 20:17:29 -07:00
Ilya Kreymer
2cae065c46
Add Waiting state on the backend and frontend (#839)
* operator: add waiting state
- add pods as related objects
- inspect pod status, set crawl status to 'waiting' if no pods are running

frontend:
- frontend support for 'waiting' state
- show waiting icon from mocks

---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-05-08 17:05:01 -07:00
Ilya Kreymer
70319594c2
crawlconfig: fix default filename template, make configurable (#835)
* crawlconfig: fix default filename template, make configurable
- make default crawl file template configurable with 'default_crawl_filename_template' value in values.yaml
- set to '@ts-@hostsuffix.wacz' by default
- allow updating via 'crawlFilenameTemplate' in crawlconfig patch, which updates configmap
- tests: add test for custom 'default_crawl_filename_template'
2023-05-08 14:03:27 -07:00
Tessa Walsh
59e49eacd5
Update collections backend API (#759)
* Re-implement collections, storing crawlIds in collection

* Return collections for crawl endpoints and filter on coll name

* Remove crawl from all collections when deleted

* Revert get_collection_crawls to flat array of resources

* Fix tests
2023-04-14 12:17:18 -04:00
Ilya Kreymer
1c47a648a9
Max page limit override (#737)
* more page limit: update to #717, instead of setting --limit in each crawlconfig,
apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit

* update tests, no longer returning 'crawl_page_limit_exceeds_allowed'
2023-04-03 14:01:32 -07:00
Ilya Kreymer
887cb16146
Allow configurable max pages per crawl in deployment settings (#717)
* backend: max pages per crawl limit, part of fix for #716:
- set 'max_pages_crawl_limit' in values.yaml, default to 100,000
- if set/non-0, automatically set limit if none provided
- if set/non-0, return 400 if adding config with limit exceeding max limit
- return limit as 'maxPagesPerCrawl' in /api/settings
- api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing)

tests: add test for 'max_pages_per_crawl' setting
- ensure 'limit' can not be set higher than max_pages_per_crawl
- ensure pages crawled is at the limit
- set test limit to max 2 pages
- add settings test
- check for pages.jsonl and extraPages.jsonl when crawling 2 pages
2023-03-28 16:26:29 -07:00
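Note on #717: the limit rules listed above (apply the deployment maximum when no limit is given, reject a config whose limit exceeds it, and do nothing when the maximum is unset or 0) could be sketched as below. The function name is hypothetical, the sketch raises `ValueError` where the backend is described as returning a 400, and it treats only `None` as "no limit provided", which is an assumption. The error string is the one mentioned in #737.

```python
def resolve_page_limit(requested, max_pages):
    """Return the page limit to store for a crawl config.

    requested: limit from the submitted config, or None if not provided.
    max_pages: deployment-wide maximum; 0 or None disables the check.
    """
    if not max_pages:
        # No deployment-wide maximum: keep whatever was requested.
        return requested
    if requested is None:
        # No limit provided: fall back to the deployment maximum.
        return max_pages
    if requested > max_pages:
        # The backend is described as answering 400 in this case.
        raise ValueError("crawl_page_limit_exceeds_allowed")
    return requested
```

Per the later commit #737, this per-config check was replaced by an override applied in the crawler itself, so the sketch reflects only the behavior described for #717.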
Tessa Walsh
4724754efc
Filter and sort crawl and workflow list API endpoints in backend (#724)
* Re-implement pagination and paginate crawlconfig revs

First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.

* Migrate all HttpUrl seeds to Seeds

This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.

* Filter and sort crawls and workflows

Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend

Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime

* Add crawlconfigs search-values API endpoint and test
2023-03-28 17:55:40 -04:00
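Note on #724: the crawl list behavior described above (a comma-separated `state` filter and descending sort by default) can be illustrated on plain dictionaries. This is a sketch under assumptions: the function and parameter names are hypothetical, the default sort field `started` is a guess, and the real backend would apply these filters in database queries rather than in memory.

```python
def list_crawls(crawls, state=None, sort_by="started", sort_direction=-1):
    """Filter crawls by state and sort them.

    state: comma-separated string of accepted states, or None for all.
    sort_direction: -1 for descending (the default), 1 for ascending.
    """
    if state:
        wanted = set(state.split(","))
        crawls = [c for c in crawls if c["state"] in wanted]
    return sorted(crawls, key=lambda c: c[sort_by], reverse=sort_direction == -1)


crawls = [
    {"id": "a", "state": "complete", "started": 1, "fileSize": 30},
    {"id": "b", "state": "failed", "started": 2, "fileSize": 10},
    {"id": "c", "state": "running", "started": 3, "fileSize": 20},
]
finished = list_crawls(crawls, state="complete,failed")  # newest first
```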
Tessa Walsh
4136bdad2e
Add optional description to crawl configs and return in crawl endpoints (#707)
2023-03-21 15:39:09 -04:00
Tessa Walsh
e98c7172a9
Paginate API list endpoints (#659)
* Paginate API list endpoints

fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.

* Increase page size via overridden Params and Page classes

* update api resource list keys

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-03-06 14:41:25 -05:00
Tessa Walsh
ed94dde7e6
Include firstSeed and seedCount in crawl endpoints (#618)
2023-02-22 10:27:31 -05:00
Tessa Walsh
bd4fba7af7
Fix POST /orgs/{oid}/crawls/delete (#591)
* Fix POST /orgs/{oid}/crawls/delete

- Add permissions check to ensure crawler users can only delete
their own crawls
- Fix broken delete_crawls endpoint
- Delete files from storage as well as deleting crawl from db
- Add tests, including nightly test that ensures crawl files are
no longer accessible after the crawl is deleted
2023-02-15 21:06:12 -05:00
Tessa Walsh
ce8f426978
Add notes to crawl and crawl updates (#587)
2023-02-08 18:36:22 -08:00
Tessa Walsh
2e3b3cb228
Add API endpoint to update crawl tags (#545)
* Add API endpoint to update crawls (tags only for now)
* Allow setting tags to empty list in crawlconfig updates
2023-02-01 22:24:36 -05:00
Tessa Walsh
0fa60ebc45
Rename archives/teams -> orgs in codebase + add db migration (#486)
* Rename archives to orgs and aid to oid on backend

* Rename archive to org and aid to oid in frontend

* Remove translation artifact

* Rename team -> organization

* Add database migrations and run once on startup

* This commit also applies the new by_one_worker decorator to other
asyncio tasks to prevent heavy tasks from being run in each worker.

* Run black, pylint, and husky via pre-commit

* Set db version and use in migrations

* Update and prepare database in single task

* Migrate k8s configmaps
2023-01-18 14:51:04 -08:00
Ilya Kreymer
2daa742585
Copy tags from crawlconfig to crawl (#467), fixes #466
- add tags to crawl object
- ensure tags are copied from crawlconfig to crawl when crawl is created (both manually and scheduled)
- tests: add test to ensure tags added to crawl, remove redundant wait replaced with fixtures
2023-01-12 17:46:19 -08:00
Tessa Walsh
49460bb070
Add default organization + invite to default org (#465), #455
- Add default switch to Archive (org) model
- Set default org name via values.yaml
- Add check to ensure only one org with default org name exists
- Stop creating new orgs for new users
- Add new API endpoints for creating and renaming orgs (part of #457)
- Make Archive.name unique via index
- Wait for db connection on init, log if waiting
- Make archive-less invites invite user to default org with Owner role
- Rename default org from chart value if changed
- Don't create new org for invited users
2023-01-12 16:44:18 -08:00
Ilya Kreymer
7b5d82936d
backend: initial tags api support (addresses #365): (#434)
* backend: initial tags api support (addresses #365):
- add 'tags' field to crawlconfig (array of strings)
- allow querying crawlconfigs to specify multiple 'tag' query args, e.g. tag=A&tag=B
- add /archives/<aid>/crawlconfigs/tags api to query by distinct tag, include index on aid + tag
tests: add tests for adding configs, querying by tags
tests: fix fixtures to retry login if initial attempt fails, use test seed of https://webrecorder.net instead of https://example.com/
2023-01-11 13:29:35 -08:00
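Note on #434: a query string with repeated `tag` arguments, as in `tag=A&tag=B`, parses into a list of values. The sketch below shows this with the standard library. The helper names are hypothetical, and requiring a config to carry every requested tag is an assumption of this sketch; the commit message does not say whether multiple tags are combined with AND or OR.

```python
from urllib.parse import parse_qs


def requested_tags(query_string):
    """Collect every 'tag' value from a query string, in order."""
    return parse_qs(query_string).get("tag", [])


def has_all_tags(config_tags, wanted):
    """Assumed AND semantics: the config must carry every requested tag."""
    return set(wanted) <= set(config_tags)


tags = requested_tags("tag=A&tag=B")
matches = has_all_tags(["A", "B", "C"], tags)
```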
Ilya Kreymer
56a6d7a5d8
Backend lint check (#451)
- apply lint + format fixes to backend
- add ci for lint + format fixes for backend
- use fixed version of pydantic
2023-01-10 16:17:06 -08:00
Tessa Walsh
d1b59c9bd0
Use archive_viewer_dep permissions to GET crawls (#443)
* Use archive_viewer_dep permissions to GET crawls

* Add is_viewer check to archive_dep

* Add API endpoint to add new user to archive directly (/archive/<id>/add-user)

* Add tests

* Refactor tests to use fixtures

* And remove login test that duplicates fixtures
2023-01-09 19:11:53 -08:00
Ilya Kreymer
dfca09fc9c
Add single crawl info api at /crawls/{crawl_id} (#418)
* backend: crawl info apis:
- add /crawls/{crawl_id} api endpoint which just lists the crawl info, without resolving the individual files
- move /crawls/{crawl_id}.json -> /crawls/{crawl_id}/replay.json for clarity that it's used for replay

* frontend: update api for new replay.json endpoint
2022-12-19 14:54:48 -08:00
Ilya Kreymer
82ffc0dfbc
Local Deployment Work: Support running locally + test cluster on CI (#396)
* k8s local deployment work:
- make it easier to deploy w/o ingress by setting 'local_service_port' (suggested port 30870)
- if using local minio, ensure file endpoints set to /data/ and /data/ proxies correctly to local bucket
- if not using minio, ensure file endpoints point to correct access / endpoint url.
- setup should work with docker desktop, minikube, microk8s and k3s!
- nginx chart: bump nginx memory limit to 20Mi
- nginx image: 00-default-override-resolver-config -> 00-browsertrix-nginx-init for clarity
- nginx image: use default nginx.conf, pin to nginx 1.23.2
- mongo: readd readiness probe, bump connect wait timeout (needed for ci)
- config: set superadmin username to 'admin'
- config schema: set 'name' as required 
- add sample chart values overrides:
- chart values: local-config.yaml for running locally with 'local_service_port'
- chart values: add microk8s-hosted.yaml for configuring a hosted microk8s setup
- chart values: add microk8s-ci.yaml for ci tests
- ci: remove docker swarm tests
- ci: add microk8s integration tests: launching cluster, logging in, running a crawl of example.com, downloading/checking WACZ
- bump to 1.1.0-beta.2
2022-12-02 19:58:34 -08:00