- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if succeeded
mismatches parallelism
- QA sorting: default to 'crawl order' by default to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
Fixes#2406
Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.
Also Optimizes MongoDB queries for better performance.
Migration Improvements:
- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats
Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Fixes#2360
- Adds `initialPages` to /replay.json response for collections, returning
up-to 25 pages (seed pages first, then sorted by capture time).
- Adds `pagesQueryUrl` to /replay.json
- Adds a public pages search endpoint to support public collections.
- Adds `preloadResources`, including list of WACZ files that should
always be loaded, to /replay.json
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes https://github.com/webrecorder/browsertrix/issues/2383
- Fixes unpredictable sort order when typing in collection page URL
- Fixes page URL results flickering in and out while typing
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#2350
Collection earliest/latest dates and the collection modified date are
also now updated when crawls or uploads are added to a collection via
the collection auto-add feature.
- Refactors dashboard and org profile preview to use private API
endpoint, to fix public collections not showing when the org
visibility is hidden
- Adds additional sorting options for collections
- Adds unique page url counts for archived items, collections, and
organizations to backend and exposes this in collections
- Shows collection period (i.e. `dateEarliest` to `dateLatest`) in
collections list
- Shows same collection metadata in private and public views, updates
private view info bar
- Fixes "Update Org Profile" action item showing for crawler roles
---------
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Resolves https://github.com/webrecorder/browsertrix/issues/2298
## Changes
- Slugs added to collections, can be specified separately when creating
or updating collections or else is based off of supplied collection name
- Migration added to backfill slugs for existing collections
- Redirect collection to newest slug if changed
- Adds option to copy public profile link to "Public Collections" action
menu
- Show "Back to <Org>" link instead of breadcrumbs
---------
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#2182
This rather large PR adds the rest of what should be needed for public
collections work in the frontend.
New API endpoints include:
- Public collections endpoints: GET, streaming download
- Paginated list of URLs in collection with snapshot (page) info for
each
- Collection endpoint to set home URL
- Collection endpoint to upload thumbnail as stream
- DELETE endpoint to remove collection thumbnail
Changes to existing API endpoints include:
- Paginating public collection list results
- Several `pages` endpoints that previously only supported `/crawls/` in
their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support
`/uploads/` and `/all-crawls/` namespaces as well. This is necessitated
by adding pages for uploads to the database (see below). For
`/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will
serve as a filter to only affect crawls of that given type. Other
endpoints are more liberal at this point, and will perform the same
action regardless of the namespace used in the route (we'll likely want
to change this in a follow-up to be more consistent).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job
rather than doing all of the computation in an asyncio task in the
backend container. The background job additionally updates collection
date ranges, page/size counts, and tags for each collection in the org
after pages have been (re)added.
Other big changes:
- New uploads will now have their pages read into the database!
Collection page counts now also include uploads
- A migration was added to start a background job for each org that will
add the pages for previously-uploaded WACZ files to the database and
update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we
can use for other user-uploaded image files moving forward, with
separate output models for authenticated and public endpoints
Fixes#1051
If org with provided slug doesn't exist or no public collections exist
for that org, return same 404 response with a detail of
"public_profile_not_found" to prevent people from using public endpoint
to determine whether an org exists.
Endpoint is `GET /api/public-collections/<org-slug>` (no auth needed) to
avoid collisions with existing org and collection endpoints.
Fixes#2158
- Adds `Organization.listPublicCollections` field and API endpoint to
update it
- Replaces `Collection.isPublic` boolean with `Collection.access`
(values: `private`, `unlisted`, `public`) and add database migration
- Update frontend to use `Collection.access` instead of `isPublic`,
otherwise not changing current behavior
---------
Co-authored-by: sua yoo <sua@suayoo.com>
- download via presigned URLs via requests instead of boto APIs, remove boto
- follow-up to #1933 for streaming download improvements
- fixes datapackage.json in multi-wacz to contain the same resources
objects with: `name`, `path`, `hash`, `bytes` to match single WACZ.
- Add additional metadata to multi-wacz datapackage.json, including `type`
(`crawl`, `upload`, `collection`, `qaRun`), `id` (unique id for the
object), `title` / `description` if available (for
crawl/upload/collection), and `crawlId` for `qaRun`
Fixes#1920
Adds response models to all API endpoints that were missing them,
documenting current behavior without making any changes at this stage to
standardize responses.
Follow-up work will involve adding generics to some of the response models
Fixes#1916
- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
Supports running QA Runs via the QA API!
Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498
Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)
Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.
- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch)
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#1307Fixes#1132
Related to #1306
Deleted webhook notifications include the org id and item/collection id.
This PR also includes API docs for the new webhooks and extends the
existing tests to account for the new webhooks.
This PR also does some additional cleanup for existing webhooks:
- Remove `downloadUrls` from item finished webhook bodies
- Rename collection webhook body `downloadUrls` to `downloadUrl`, since
we only ever have one per collection
- Fix API docs for existing webhooks, one of which had the wrong
response body
This PR adds more type safety to the backend codebase:
- All ops classes calls should be type checked
- Avoiding circular references with TYPE_CHECKING conditional
- Consistent UUID usage: uuid.UUID / UUID4 with just UUID
- Crawl states moved to models, made into lists
- Additional typing added as needed, fixed a few type related errors
- CrawlOps / UploadOps / BaseCrawlOps now all have same param init order
to simplify changes
* storage ops refactor:
- create StorageOps class similar to other ops classes
- init storages list in StorageOps, no longer require lookup up default storages via CrawlManager
- convert all storage functions to members, add storageops to operator
- remove unused params, ensure crawl exists for rollover restart
- add env var to determine if using local minio to use correct endpoint URL
* crawls /seeds endpoint: just return empty list if not a crawl (eg. upload)
* crawlmanager: remove unused code, rename check_storage -> has_storage
- optimization: convert all uses of 'async for' to use iterator directly instead of converting to list to avoid
unbounded size lists
- additional cursor.to_list() to async for conversions for stats computation, simply crawlconfigs stats computation
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* remove almost all standalone functions and move them back into ops member functions
* operator now has access to all the ops classes as well
* keep two standalone functions used only in migrations
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Add success filter to webhook list GET endpoint
* Add sorting to webhooks list API and add event filter
* Test webhooks via echo server
* Set address to echo server on host from CI env var for k3d and microk8s
* Add -s back to pytest command for k3d ci
* Change pytest test path to avoid hanging on collecting tests
* Revert microk8s to only run on push to main
Initial set of backend API for event webhook notifications for the following events:
* Crawl started (including boolean indicating if crawl was scheduled)
* Crawl finished
* Upload finished
* Archived item added to collection
* Archived item removed from collection
Configuration of URLs is done via /api/orgs/<oid>/event-webhook-urls. If a URL is configured for a given event, a webhook notification is added to the database and then attempted to be sent (up to a total of 5 tries per overall attempt, with an increasing backoff between, implemented via use of the backoff library, which supports async).
webhook status available via /api/orgs/<oid>/webhooks
(Additional testing + potential fastapi integration left in separate follow-ups
Fixes#1041
- rename 'collections' -> 'collectionIds', adding migration 0014
- only populate 'collections' array with {name, id} pair for get_crawl() / single archived item
path, but not for aggregate/list methods
- remove Crawl.get_crawl(), redundant with BaseCrawl.get_crawl() version
- ensure _files_to_resources returns an empty [] instead of none if empty (matching BaseCrawl.get_crawl() behavior to Crawl.get_crawl())
- tests: update tests to use collectionIds for id list, add 'collections' for {name, id} test
- frontend: change Crawl object to have collectionIds instead of collections
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* collections: support toggling collections public/private, viewable via RWP
- backend: add 'public' to collection model, support patching to update
- backend: add .../collections/<id>/public/replay.json for public access
- backend: add CORS handling for public endpoint
- frontend: support 'make shareable / make private' dropdown actions on collection detail + collection list views
- frontend: show shareable / private icons by collection name on detail + list views
- frontend: link to replayweb.page for standalone browsing
- frontend: add embed code popup when a collection is shareable
- refer to public collections as 'shareable' for now
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
* support streaming download of collections (part of #927)
- WACZ zip created on the fly using stream-zip
- add 'Download Collection' option to collection detail and list
- after editing collection, return to collection view
- tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used
---------
Co-authored-by: sua yoo <sua@suayoo.com>
* Move all pydantic models to models.py to avoid circular dependencies
* Include automated crawl details in all-crawls GET endpoints
- ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated
crawls only:
- cid
- name
- description
- firstSeed
- seedCount
- profileName
* Add automated crawl fields to list all-crawls test
* Uncomment mongo readinessProbe
* cleanup CrawlOutWithResources:
- remove 'files' from output model, only resources should be returned
- add _files_to_resources() to simplify computing presigned 'resources' from raw 'files'
- update upload tests to be more consistent, 'files' never present, 'errors' always none
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove
* uploads api: (part of #929)
- new UploadCrawl object which extends BaseCrawl, has name and description
- support multipart form data data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to support for replay
- add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')
* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects / missing type are set to 'type: crawl'
- indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state
* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'
* collections: support adding and remove both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads
bump version to 1.6.0-beta.2
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Make API add and update method returns consistent
- Updates return {"updated": True}
- Adds return {"added": True}
- Both can additionally have other fields as needed, e.g. id or name
- remove Profile response model, as returning added / id only
- reformat
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Adds collections search and list to workflow editor
- Adds collections to workflow details component
- Adds namePrefix filter to backend GET /orgs/{oid}/collections endpoint to support case-insensitive searching of collections
- Adds documentation for new setting
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
This fixes#917, where crawls added to a collection via the workflow
autoAddCollections were not successfully represented in the crawl
and page count stats in the collection after completing.
- Support for creating new collections and editing existing collections
- Can select crawling workflows which adds entire workflow, and then deselect individual crawls
- Can edit existing collections and add more crawls
- Can view, create and delete collections via new Collections top-level nav entry
* Track collections in Crawl rather than crawls in Collection
* Add delete collection API endpoint and tests
* Precompute collection crawlCount, pageCount, and tags and add them to
GET collection responses
* Add modified field to Collection
* Update collection replay.json method
* Make add and remove crawls accept list of crawl ids
* Auto-add new workflow crawls to collections when they successfully
complete via CrawlConfig.autoAddCollections field
* Move long-running post-crawl operator tasks into asyncio task
* Make CrawlConfig.autoAddCollections updatable via /update API endpoint
* Sort by name and description (ascending by default)
* Filter by name
* Add endpoint to fetch collection names for search
* Add collation so that utf-8 chars sort as expected
* Re-implement collections, storing crawlIds in collection
* Return collections for crawl endpoints and filter on coll name
* Remove crawl from all collections when deleted
* Revert get_collection_crawls to flat array of resources
* Fix tests
* Re-implement pagination and paginate crawlconfig revs
First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.
* Migrate all HttpUrl seeds to Seeds
This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.
* Filter and sort crawls and workflows
Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend
Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime
* Add crawlconfigs search-values API endpoint and test
* Paginate API list endpoints
fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.
* Increase page size via overriden Params and Page classes
* update api resource list keys
---------
Co-authored-by: sua yoo <sua@suayoo.com>