Commit Graph

38 Commits

Author SHA1 Message Date
Tessa Walsh
d41647e6c2
Document all API endpoints with response models (#1928)
Fixes #1920 

Adds response models to all API endpoints that were missing them,
documenting current behavior without making any changes at this stage to
standardize responses.

Follow-up work will involve adding generics to some of the response models
2024-07-16 12:48:38 -07:00
Tessa Walsh
aaf18e70a0
Add created date to Organization and fix datetimes across backend (#1921)
Fixes #1916

- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
2024-07-15 19:46:32 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
Tessa Walsh
a898c2b456
Format backend with Black 24 (#1507)
Fixes #1506
2024-02-07 11:35:34 -08:00
Tessa Walsh
f3cbd9e179
Add crawl, upload, and collection delete webhook event notifications (#1363)
Fixes #1307
Fixes #1132
Related to #1306

Deleted webhook notifications include the org id and item/collection id.
This PR also includes API docs for the new webhooks and extends the
existing tests to account for the new webhooks.

This PR also does some additional cleanup for existing webhooks:
- Remove `downloadUrls` from item finished webhook bodies
- Rename collection webhook body `downloadUrls` to `downloadUrl`, since
we only ever have one per collection
- Fix API docs for existing webhooks, one of which had the wrong
response body
2023-11-09 18:19:08 -08:00
Ilya Kreymer
6384d8b5f1
Additional Type Hints / Type Fix Pass (#1320)
This PR adds more type safety to the backend codebase:
- All ops classes calls should be type checked
- Avoiding circular references with TYPE_CHECKING conditional
- Consistent UUID usage: uuid.UUID / UUID4 with just UUID
- Crawl states moved to models, made into lists
- Additional typing added as needed, fixed a few type related errors
- CrawlOps / UploadOps / BaseCrawlOps now all have same param init order
to simplify changes
2023-10-30 12:59:24 -04:00
Ilya Kreymer
16e7a1d0a2
Storage Ops Refactor (#1257)
* storage ops refactor:
- create StorageOps class similar to other ops classes
- init storages list in StorageOps, no longer require lookup up default storages via CrawlManager
- convert all storage functions to members, add storageops to operator
- remove unused params, ensure crawl exists for rollover restart
- add env var to determine if using local minio to use correct endpoint URL

* crawls /seeds endpoint: just return empty list if not a crawl (eg. upload)

* crawlmanager: remove unused code, rename check_storage -> has_storage
2023-10-10 15:04:23 -07:00
Anish Lakhwara
1bf531e1ec
Fix: Make Collections Public on Creation (#1213)
- Add isPublic to Add Collection endpoint, send isPublic from frontend
- Fixes #1212
2023-09-29 12:08:10 -07:00
Ilya Kreymer
7eac0fdf95
optimization: convert all uses of 'async for' to use iterator directly (#1229)
- optimization: convert all uses of 'async for' to use iterator directly instead of converting to list to avoid
unbounded size lists
- additional cursor.to_list() to async for conversions for stats computation, simply crawlconfigs stats computation

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-28 12:31:08 -07:00
Ilya Kreymer
feb7ab7652
Improved type checking for backend with mypy (#1174)
* add mypy type check
- run type check on backend fix ambiguous typing issues
- add mypy to lint gh action + precommit hook
- add mypy.ini
2023-09-13 19:40:26 -07:00
Ilya Kreymer
4b34da033a
Refactor / Cleanup: move ops functions back into classes (#1171)
* remove almost all standalone functions and move them back into ops member functions
* operator now has access to all the ops classes as well
* keep two standalone functions used only in migrations

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-09-13 11:56:09 -07:00
Tessa Walsh
7cf2b11eb7
Add event webhook tests (#1155)
* Add success filter to webhook list GET endpoint

* Add sorting to webhooks list API and add event filter

* Test webhooks via echo server

* Set address to echo server on host from CI env var for k3d and microk8s

* Add -s back to pytest command for k3d ci

* Change pytest test path to avoid hanging on collecting tests

* Revert microk8s to only run on push to main
2023-09-12 22:08:40 -07:00
Tessa Walsh
147bfd9d44
Add event webhook notifications system to backend (#1061)
Initial set of backend API for event webhook notifications for the following events:
* Crawl started (including boolean indicating if crawl was scheduled)
* Crawl finished
* Upload finished
* Archived item added to collection
* Archived item removed from collection

Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and then attempted to be sent (up to a total of 5 tries per overall attempt, with an increasing backoff between, implemented via use of the backoff library, which supports async).

webhook status available via /api/orgs/<oid>/webhooks

(Additional testing + potential fastapi integration left in separate follow-ups
Fixes #1041
2023-08-31 19:52:37 -07:00
Anish Lakhwara
8b16124675
feat: implement 'collections' array with {name, id} for archived item details (#1098)
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-08-25 00:26:46 -07:00
Ilya Kreymer
8d0a4f2ca9
fix public collections endpoint returning 404 when not public (#1052)
tests: add tests for public collections endpoint when collection is public and when not
2023-08-04 13:29:13 -04:00
Ilya Kreymer
362afa47bd
Support for Public / Shareable Collections (#1038)
* collections: support toggling collections public/private, viewable via RWP
- backend: add 'public' to collection model, support patching to update
- backend: add .../collections/<id>/public/replay.json for public access
- backend: add CORS handling for public endpoint
- frontend: support 'make shareable / make private' dropdown actions on collection detail + collection list views
- frontend: show shareable / private icons by collection name on detail + list views
- frontend: link to replayweb.page for standalone browsing
- frontend: add embed code popup when a collection is shareable
- refer to public collections as 'shareable' for now

---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-08-03 19:11:01 -07:00
Ilya Kreymer
6506965d98
Streaming Download for Collections (#1012)
* support streaming download of collections (part of #927)
- WACZ zip created on the fly using stream-zip
- add 'Download Collection' option to collection detail and list
- after editing collection, return to collection view
- tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-07-26 15:42:17 -07:00
Tessa Walsh
fcd48b1831
Add totalSize to collections and make it sortable in list endpoint (#1001)
* Precompute collection.totalSize and make sortable

* Add migration to recompute collection data with totalSize
2023-07-24 13:12:23 -04:00
Tessa Walsh
4014d98243
Move pydantic models to separate module + refactor crawl response endpoints to be consistent (#983)
* Move all pydantic models to models.py to avoid circular dependencies
* Include automated crawl details in all-crawls GET endpoints
- ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated
crawls only:
- cid
- name
- description
- firstSeed
- seedCount
- profileName

* Add automated crawl fields to list all-crawls test

* Uncomment mongo readinessProbe

* cleanup CrawlOutWithResources:
- remove 'files' from output model, only resources should be returned
- add _files_to_resources() to simplify computing presigned 'resources' from raw 'files'
- update upload tests to be more consistent, 'files' never present, 'errors' always none

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-20 13:05:33 +02:00
Ilya Kreymer
00eb62214d
Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937)
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove

* uploads api: (part of #929)
- new UploadCrawl object which extends BaseCrawl, has name and description
- support multipart form data data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to support for replay
- add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')

* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects / missing type are set to 'type: crawl'
- indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state

* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'

* collections: support adding and remove both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads

bump version to 1.6.0-beta.2
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-07-07 09:13:26 -07:00
Tessa Walsh
c7051d5fbf
Backend API consistency pass (#921)
* Make API add and update method returns consistent

- Updates return {"updated": True}
- Adds return {"added": True}
- Both can additionally have other fields as needed, e.g. id or name

- remove Profile response model, as returning added / id only
- reformat

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-06-16 18:52:46 -07:00
Tessa Walsh
bd6dc79449
Add frontend support for auto-adding collections to workflows (#916)
- Adds collections search and list to workflow editor
- Adds collections to workflow details component
- Adds namePrefix filter to backend GET /orgs/{oid}/collections endpoint to support case-insensitive searching of collections
- Adds documentation for new setting

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-06-12 18:18:05 -07:00
Tessa Walsh
325355d991
Fix post-crawl collection stats update and add test (#918)
This fixes #917, where crawls added to a collection via the workflow
autoAddCollections were not successfully represented in the crawl
and page count stats in the collection after completing.
2023-06-10 19:06:25 -07:00
sua yoo
66b3befef9
Frontend collections beta UI (#886)
- Support for creating new collections and editing existing collections
- Can select crawling workflows which adds entire workflow, and then deselect individual crawls
- Can edit existing collections and add more crawls
- Can view, create and delete collections via new Collections top-level nav entry
2023-06-06 17:52:01 -07:00
sua yoo
6208ead040
Sort collection by last updated (modified) (#897)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-05-30 14:09:10 -04:00
Ilya Kreymer
4d30a64bc9
collection delete: (#896)
set delete endpoint to use DELETE verb, fix for #869
2023-05-29 18:19:04 -07:00
Tessa Walsh
9c7a312a4c
Rework collections to track collections in Crawl (#878)
* Track collections in Crawl rather than crawls in Collection
* Add delete collection API endpoint and tests
* Precompute collection crawlCount, pageCount, and tags and add them to
GET collection responses
* Add modified field to Collection
* Update collection replay.json method
* Make add and remove crawls accept list of crawl ids
* Auto-add new workflow crawls to collections when they successfully
complete via CrawlConfig.autoAddCollections field
* Move long-running post-crawl operator tasks into asyncio task
* Make CrawlConfig.autoAddCollections updatable via /update API endpoint
2023-05-25 15:41:50 -04:00
Tessa Walsh
5c944d4626
Remove uniqueness constraint on collection descriptions
Fix for copy-paste error
2023-05-23 11:03:13 -04:00
Tessa Walsh
60fac2b677
Add collection sorting and filtering (#863)
* Sort by name and description (ascending by default)
* Filter by name
* Add endpoint to fetch collection names for search
* Add collation so that utf-8 chars sort as expected
2023-05-22 16:53:49 -04:00
Tessa Walsh
f482831d53
Use collection uuid as id (instead of name) (#855)
Also ensure name is not empty by adding minimum length of 1

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-05-19 09:03:48 -04:00
Tessa Walsh
59e49eacd5
Update collections backend API (#759)
* Re-implement collections, storing crawlIds in collection

* Return collections for crawl endpoints and filter on coll name

* Remove crawl from all collections when deleted

* Revert get_collection_crawls to flat array of resources

* Fix tests
2023-04-14 12:17:18 -04:00
Tessa Walsh
e9b61c632d
Add pageSize to pagination format (#736) 2023-04-03 15:57:47 -04:00
Tessa Walsh
4724754efc
Filter and sort crawl and workflow list API endpoints in backend (#724)
* Re-implement pagination and paginate crawlconfig revs

First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.

* Migrate all HttpUrl seeds to Seeds

This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.

* Filter and sort crawls and workflows

Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend

Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime

* Add crawlconfigs search-values API endpoint and test
2023-03-28 17:55:40 -04:00
Tessa Walsh
e98c7172a9
Paginate API list endpoints (#659)
* Paginate API list endpoints

fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.

* Increase page size via overriden Params and Page classes

* update api resource list keys

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-03-06 14:41:25 -05:00
Sara Tavares
8167d7da8d
fix typos (#640) 2023-02-24 11:10:49 -08:00
Tessa Walsh
0fa60ebc45
Rename archives/teams -> orgs in codebase + add db migration (#486)
* Rename archives to orgs and aid to oid on backend

* Rename archive to org and aid to oid in frontend

* Remove translation artifact

* Rename team -> organization

* Add database migrations and run once on startup

* This commit also applies the new by_one_worker decorator to other
asyncio tasks to prevent heavy tasks from being run in each worker.

* Run black, pylint, and husky via pre-commit

* Set db version and use in migrations

* Update and prepare database in single task

* Migrate k8s configmaps
2023-01-18 14:51:04 -08:00
Ilya Kreymer
d340bceb39 style pass: normalize docstring spacing 2022-10-19 21:47:34 -07:00
Ilya Kreymer
bf79959a5a refactoring to use statefulsets + job (#245)
- use statefulsets instead of deployments for mongo, redis, signer
- use k8s job + statefulset for running crawls
- use separate statefulset for crawl (scaled) and single-replica redis stateful set
- move crawl job update login to crawl_updater
- remove shared redis chart

package refactor:
- move to shared code to 'btrixcloud'
- move k8s to 'btrixcloud.k8s'
- move docker to 'btrixcloud.docker'
2022-06-05 10:37:17 -07:00