Commit Graph

22 Commits

Author SHA1 Message Date
sua yoo
941a75ef12
Separate seeds into a new endpoints (#1217)
- Remove config.seeds from workflow and crawl detail endpoints
- Add new paginated GET /crawls/{crawl_id}/seeds and /crawlconfigs/{cid}/seeds endpoints to retrieve seeds for a crawl or workflow
- Include firstSeed in GET /crawlconfigs/{cid} endpoint (was missing before)
- Modify frontend to fetch seeds from new /seeds endpoints with loading indicator

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-10-02 10:56:12 -07:00
Tessa Walsh
7a56fa23f5
Remove username lookups for crawls and workflows by storing usernames in db (#1199)
* store usernames (createdByName, modifiedByName, startedByName) in db for workflows
* store userName for userid for crawls in db
* update output models to return usernames
* add migration 0018 to add usernames to existing crawls and crawlconfigs
* updated tests for crawl and config usernames
* use async for to iterate over crawls and crawlconfigs

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-28 09:37:23 -07:00
Tessa Walsh
094f27bcff
Track bytes stored per file type and include in org metrics (#1207)
* Add bytes stored per type to org and metrics

The org now tracks bytesStored by type of crawl, uploads, and browser profiles
in addition to the total, and returns these values in the org metrics endpoint.

A migration is added to precompute these values in existing deployments.

In addition, all /metrics storage values are now returned solely as bytes, as
the GB form wasn't being used in the frontend and is unnecessary.

* Improve deletion of multiple archived item types via `/all-crawls` delete endpoint

- Update `/all-crawls` delete test to check that org and workflow size values
are correct following deletion.
- Fix bug where it was always assumed only one crawl was deleted per cid
and size was not tracked per cid
- Add type check within delete_crawls
2023-09-22 12:55:21 -04:00
Tessa Walsh
9224f52f51
Remove config from list endpoints to speed up responses (#1193)
* Remove config from list endpoints

- Remove config field from workflow and crawl list endpoints
- Add seedCount to CrawlConfigOut on backend and Workflow on frontend
- Refactor CrawlConfig and CrawlConfigOut to extend CrawlConfigCore + CrawlConfigAdditional
- Refactor workflow list in frontend to use firstSeed and seedCount
- Frontend uses ListWorkflow type which is Omit<Workflow, "config">
2023-09-19 11:05:48 -05:00
Ilya Kreymer
feb7ab7652
Improved type checking for backend with mypy (#1174)
* add mypy type check
- run type check on backend fix ambiguous typing issues
- add mypy to lint gh action + precommit hook
- add mypy.ini
2023-09-13 19:40:26 -07:00
Ilya Kreymer
4b34da033a
Refactor / Cleanup: move ops functions back into classes (#1171)
* remove almost all standalone functions and move them back into ops member functions
* operator now has access to all the ops classes as well
* keep two standalone functions used only in migrations

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-13 11:56:09 -07:00
Ilya Kreymer
c9c39d47b7
Scheduled Crawl Refactor: Handle via Operator + Add Skipped Crawls on Quota Reached (#1162)
* use metacontroller's decoratorcontroller to create CrawlJob from Job
* scheduled job work:
- use existing job name for scheduled crawljob
- use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done
- simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls,
placeholder job image for cronjobs

* move storage quota check to crawljob handler:
- add 'skipped_quota_reached' as new failed status type
- check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created)

* frontend:
- show all crawls in crawl workflow, no need to filter by status
- add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed

* migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template
2023-09-12 13:05:43 -07:00
Tessa Walsh
9377a6f456
Issue all non-upload storage-quota-update events from LiteElement (#1151)
- More specific toast notification error messages to the action being attempted
- Single dismissable global banner shown when org storage is reached
- Removed check for storage quota reached in `runNow`, since buttons are disabled in UI, and errors handled if request fails.
- Allow creating new workflow when storage quota reached
- More responsive storage quota updates: add storageQuotaReached to archived item replay.json, updates w/o reload when crawl pushes quota over limit
- Modify LiteElement to check for storageQuotaReached on GET requests

---------
Co-authored-by: sua yoo <sua@suayoo.com>
2023-09-11 18:17:48 -07:00
Tessa Walsh
d2ededc895
Add and enforce org storage quota (#1106)
* Implement in backend

- Track bytesStored in org
- Add migration to pre-calculate based on size of crawlfiles and profilefiles
- Add methods to increase or decrease org storage when crawl or profile files
are added or deleted
- Include storageQuotaReached boolean in API responses that alter storage
- Don't start new crawls and fail uploads if storage quota reached

* Implement in frontend

- Add to orgs-list quotas
- Update org's storageQuotaReached based on backend endpoint responses
- Disable buttons when storage quota is met
- Show toast notification when attempting to run a crawl when org
storage quota is met
2023-09-07 12:45:43 -04:00
Ilya Kreymer
876ba1bf24
null check: check before accessing config in 'get_all_crawl_search_values' (#1144) 2023-09-05 23:57:05 -04:00
Tessa Walsh
1aa951132c
Fix unsetting all collections via PATCH update (#1126) 2023-08-30 18:16:21 -04:00
Tessa Walsh
f6369ee01e
Add support for collectionIds to archived item PATCH endpoints (#1121)
* Add support for collectionIds to patch endpoints

* Make update available via all-crawls/ and add test

* Fix tests

* Always remove collectionIds from udpate

* Remove unnecessary fallback

* One more pass on expected values before update
2023-08-30 10:41:30 -04:00
Anish Lakhwara
8b16124675
feat: implement 'collections' array with {name, id} for archived item details (#1098)
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-08-25 00:26:46 -07:00
Ilya Kreymer
2e73148bea
fix redis connection leaks + exclusions error: (fixes #1065) (#1066)
* fix redis connection leaks + exclusions error: (fixes #1065)
- use contextmanager for accessing redis to ensure redis.close() is always called
- add get_redis_client() to k8sapi to ensure unified place to get redis client
- use connectionpool.from_url() until redis 5.0.0 is released to ensure auto close and single client settings are applied
- also: catch invalid regex passed to re.compile() in queue regex check, return 400 instead of 500 for invalid regex
- redis requirements: bump to 5.0.0rc2
2023-08-14 18:29:28 -07:00
sua yoo
37733483d5
Standardize archived item filtering, sorting and labels (#1054)
Frontend:
- Renames list view to "All Archived Items"
- Refactors fetches to use single all-crawls endpoints
- Removes search by config ID for more search parity with uploads
- Adds sort by size
- Refactors property and method names to replace crawl*
- Replaces remaining references to "crawl" in copy with "item"'
- Rename Upload Archive button to Upload WACZ
- Fix focusout in item menu so menus close

Backend:
- Filter search values by type as well
- Only get list of cids for crawls in search values
- Don't list crawl/workflow ids in search values

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-08-09 12:13:55 -07:00
Tessa Walsh
7ff57ce6b5
Backend: standardize search values, filters, and sorting for archived items (#1039)
- all-crawls list endpoint filters now conform to 'Standardize list controls for archived items #1025' and URL decode values before passing them in
- Uploads list endpoint now includes all all-crawls filters relevant to uploads
- An all-crawls/search-values endpoint is added to support searching across all archived item types
- Crawl configuration names are now copied to the crawl when the crawl is created, and crawl names and descriptions are now editable via the backend API (note: this will require frontend changes as well to make them editable via the UI)
- Migration added to copy existing config names for active configs into their associated crawls. This migration has been tested in a local deployment
- New statuses generate-wacz, uploading-wacz, and pending-wait are added when relevant to tests to ensure that they pass
- Tests coverage added for all new all-crawls endpoints, filters, and sort values
2023-08-04 09:56:52 -07:00
Ilya Kreymer
06cf9c7cc3
add crawl ending states: 'generate-wacz', 'uploading-wacz', 'pending-wait' that occur after a crawl is finished or is being stopped (#1022)
operator: ensure transitions from each of these states is supported, including to 'waiting_capacity'
add extra check on stopping to avoid transitioning back to a running state after crawl is finished
ui: add states to UI display, localization, add as active states
fixes #263
2023-08-01 00:15:59 -07:00
Ilya Kreymer
6506965d98
Streaming Download for Collections (#1012)
* support streaming download of collections (part of #927)
- WACZ zip created on the fly using stream-zip
- add 'Download Collection' option to collection detail and list
- after editing collection, return to collection view
- tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-07-26 15:42:17 -07:00
Tessa Walsh
c21153255a
Rename notes to description in frontend and backend (#1011)
- Rename crawl notes to description
- Add migration renaming notes -> description
- Stop inheriting workflow description in crawl
- Update frontend to replace crawl/upload notes with description
- Remove setting of config description from crawl list
- Adjust tests for changes
2023-07-26 13:00:04 -07:00
Tessa Walsh
4014d98243
Move pydantic models to separate module + refactor crawl response endpoints to be consistent (#983)
* Move all pydantic models to models.py to avoid circular dependencies
* Include automated crawl details in all-crawls GET endpoints
- ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated
crawls only:
- cid
- name
- description
- firstSeed
- seedCount
- profileName

* Add automated crawl fields to list all-crawls test

* Uncomment mongo readinessProbe

* cleanup CrawlOutWithResources:
- remove 'files' from output model, only resources should be returned
- add _files_to_resources() to simplify computing presigned 'resources' from raw 'files'
- update upload tests to be more consistent, 'files' never present, 'errors' always none

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-20 13:05:33 +02:00
Ilya Kreymer
7d694754c6
uploads api ext: (#970)
- also support collectionId filter on /all-crawls
- update tests
2023-07-09 22:12:54 -07:00
Ilya Kreymer
00eb62214d
Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937)
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove

* uploads api: (part of #929)
- new UploadCrawl object which extends BaseCrawl, has name and description
- support multipart form data data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to support for replay
- add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')

* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects / missing type are set to 'type: crawl'
- indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state

* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'

* collections: support adding and remove both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads

bump version to 1.6.0-beta.2
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-07-07 09:13:26 -07:00