- Users can now upload .WACZ archives from the "Archived Data" page.
- Can specify a name, description, tags, and collection(s) to add the upload to
- Shows upload progress
- Supports canceling an upload
* frontend:
- follow-up to #969: fixes crawl workflows by using the crawl-specific endpoint and merging results
* get crawls and uploads concurrently
---------
Co-authored-by: sua yoo <sua@suayoo.com>
* additional fixes for #935:
- don't use artifactType for detail pages, ensure correct artifact selected based on path
* naming tweaks:
- from uploads detail, return to 'All Uploads' with filter
- from crawls detail, return to 'All Crawls' with filter
- rename general to 'All Archived Data'
- Adds top-level "Archived Data" view, replacing "Finished Crawls" and moving it into the new view as "Crawls"
- Adds list for viewing all artifacts/data
- Adds list for viewing all uploaded crawls
- Updates crawl detail view to show upload details
- Edit upload metadata, including 'name'
- Delete uploads
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove
* uploads api: (part of #929)
- new UploadCrawl object which extends BaseCrawl, has name and description
- support multipart form data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to support for replay
- add support for 'replaceId=<id>', which removes all previous files in the upload after the new upload succeeds; if replaceId doesn't exist, a new upload is created (stream endpoint only so far)
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')
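A minimal sketch of the streaming-upload flow described above, shown with synchronous boto3 for clarity (the backend itself uses async botocore; bucket/key handling and the helper name are illustrative):

```python
import boto3

def stream_to_s3(client, bucket: str, key: str, chunks, part_size=5 * 1024 * 1024):
    """Upload an iterable of byte chunks to S3 in parts; abort on failure (sketch)."""
    mpu = client.create_multipart_upload(Bucket=bucket, Key=key)
    upload_id = mpu["UploadId"]
    parts = []
    try:
        buffer = b""
        part_number = 1
        for chunk in chunks:
            buffer += chunk
            if len(buffer) >= part_size:
                resp = client.upload_part(
                    Bucket=bucket, Key=key, PartNumber=part_number,
                    UploadId=upload_id, Body=buffer,
                )
                parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
                part_number += 1
                buffer = b""
        if buffer:
            resp = client.upload_part(
                Bucket=bucket, Key=key, PartNumber=part_number,
                UploadId=upload_id, Body=buffer,
            )
            parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # on any failure, abort the multipart upload so no orphaned parts remain
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id)
        raise
```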
* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects with a missing type are set to 'type: crawl'
- indexes: add db indexes on the 'type' field, and compound indexes of 'type' with oid, cid, finished, and state
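A rough sketch of those indexes, using synchronous pymongo for brevity (the backend uses async motor; connection details and the collection name are assumptions):

```python
from pymongo import ASCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["browsertrix"]  # illustrative connection

# single-field index on 'type', plus compound indexes of 'type' with common query fields
db.crawls.create_index([("type", ASCENDING)])
for field in ("oid", "cid", "finished", "state"):
    db.crawls.create_index([("type", ASCENDING), (field, ASCENDING)])
```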
* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'
* collections: support adding and removing both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads
bump version to 1.6.0-beta.2
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Make API add and update method return values consistent
- Updates return {"updated": True}
- Adds return {"added": True}
- Both can additionally have other fields as needed, e.g. id or name
- remove the Profile response model, as only added / id are returned
- reformat
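A hedged sketch of the convention (routes and storage are placeholders, not the actual endpoints):

```python
from fastapi import FastAPI

app = FastAPI()

@app.post("/orgs/{oid}/example-items")
async def add_item(oid: str):
    new_id = "some-new-id"  # placeholder for the id returned by the db insert
    # add endpoints return {"added": True}, optionally with extra fields such as id or name
    return {"added": True, "id": new_id}

@app.patch("/orgs/{oid}/example-items/{item_id}")
async def update_item(oid: str, item_id: str):
    # update endpoints return {"updated": True}
    return {"updated": True}
```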
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Adds collections search and list to workflow editor
- Adds collections to workflow details component
- Adds namePrefix filter to backend GET /orgs/{oid}/collections endpoint to support case-insensitive searching of collections
- Adds documentation for new setting
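A possible shape of the namePrefix filter, assuming a Mongo-backed collections list (helper and field names are illustrative):

```python
import re
from typing import Optional

async def list_collections(db, oid: str, name_prefix: Optional[str] = None):
    """Filter an org's collections by case-insensitive name prefix (sketch)."""
    query = {"oid": oid}
    if name_prefix:
        query["name"] = {"$regex": re.compile("^" + re.escape(name_prefix), re.IGNORECASE)}
    return [coll async for coll in db.collections.find(query)]
```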
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
* Updates footer
- Adds documentation link
- Adds label to GitHub link, moves outside of the version code
- Adds copy button to version code for quick access when filing bug reports :)
* Comments out invisible div
* Improves responsiveness on mobile
* scope fix: when using 'Custom Page Prefix' scope (fixes #873)
- don't include primary seed URL in include list
- don't always add a trailing slash to extra in-scope URLs
- set seed scope to 'prefix' (supported via webrecorder/browsertrix-crawler#318) instead of re-including seed URL
- add comments on using 'custom' to indicate 'Custom Prefix Scope' semantics on frontend, setting actual scope to 'prefix' on backend
- remove unneeded conditional for additional urls, main scopeType overridden per seed anyway
- Unifies trash icons on all pages to use trash3 (there were a few stragglers!)
- Brings styling of org quotas dialogue in-line with the rest of our dialogues
- Adds missing localization strings
- Swaps button with icon button to match table row action styling elsewhere
wabac.js will reload the replay.json on 403 with new token (will be in next version of wabac.js)
presign urls: make presign timeout configurable (in minutes), defaults to 60 mins
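Roughly how a configurable presign timeout looks with boto3 (the actual code is async and reads the value from the helm chart; names here are assumptions):

```python
import boto3

PRESIGN_DURATION_MINUTES = 60  # configurable, defaults to 60 minutes

def presign_wacz_url(bucket: str, key: str) -> str:
    client = boto3.client("s3")
    return client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=PRESIGN_DURATION_MINUTES * 60,  # the S3 API expects seconds
    )
```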
dockerfile: fix configuring RWP_BASE_URL
- Adds two new properties, `name` to pick the icon's name and `content` to set a custom tooltip message. These are in line with what Shoelace uses but are perhaps not the best descriptors...
- Swaps the existing anchor links on the Workflow Details' Settings tab for these and relocates them to after the heading. (Navigation to the links is broken right now... but the copying part works nicely!)
- Updates btrix-section-heading to better handle multiple elements with flexbox and an 8px gap between elements
- Support for creating new collections and editing existing collections
- Can select crawling workflows which adds entire workflow, and then deselect individual crawls
- Can edit existing collections and add more crawls
- Can view, create and delete collections via new Collections top-level nav entry
concurrent crawl limits: (addresses #866)
- support limits on concurrent crawls that can be run within a single org
- change 'waiting' state to 'waiting_org_limit' for concurrent crawl limits and 'waiting_capacity' for capacity-based limits
orgs:
- add 'maxConcurrentCrawl' to new 'quotas' object on orgs
- add /quotas endpoint for updating quotas object
operator:
- add all crawljobs as related, appear to be returned in creation order
- operator: if a concurrent crawl limit is set, ensure the current job is within the first N crawljobs (as provided via the 'related' list of crawljob objects) before it can proceed to 'starting'; otherwise set it to 'waiting_org_limit' (see the sketch after this list)
- api: add org /quotas endpoint for configuring quotas
- remove 'new' state, always start with 'starting'
- crawljob: add 'oid' to crawljob spec and label for easier querying
- more stringent state transitions: add allowed_from to set_state()
- ensure state transitions only happen from allowed states, while failed/canceled can happen from any state
- ensure finished and state synched from db if transition not allowed
- add crawl indices by oid and cid
frontend:
- show different waiting states on frontend: 'Waiting (Crawl Limit)' and 'Waiting (At Capacity)'
- add gear icon on orgs admin page
- add initial popup for setting org quotas, showing all properties from org 'quotas' object
tests:
- add concurrent crawl limit nightly tests
- fix state waiting -> waiting_capacity
- ci: add logging of operator output on test failure
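A simplified sketch of the operator's concurrency check described above (pure Python; field names are illustrative): the job may proceed only if it falls within the first N related crawljobs.

```python
def within_org_limit(crawljob_name: str, related_crawljobs: list, max_concurrent: int) -> bool:
    """Return True if this crawl job may start under the org's concurrent-crawl quota.

    related_crawljobs is the operator's 'related' list of crawljob objects for the
    same org, assumed to be returned in creation order.
    """
    if not max_concurrent:
        return True  # no quota configured
    first_n = [job["metadata"]["name"] for job in related_crawljobs[:max_concurrent]]
    return crawljob_name in first_n

# if within_org_limit(...) is False, the crawl is set to 'waiting_org_limit'
# instead of proceeding to 'starting'
```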
* Precompute config crawl stats
Includes a database migration to move previously dynamically computed
crawl stats for workflows into the CrawlConfig model.
* Add lastRun sorting option and enable it by default
* Add modified as final sort key to order non-run workflows
* Remove currCrawl* fields and update frontend accordingly
* Add isCrawlRunning field to backend and use in frontend
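The default ordering might look roughly like this (a pymongo-style sketch; the backend is async and the collection name is assumed):

```python
def list_workflows(db, oid: str):
    """Sort workflows by lastRun descending, with 'modified' as the final sort key
    so workflows that have never run still get a stable order (sketch)."""
    return db.crawl_configs.find({"oid": oid}).sort([("lastRun", -1), ("modified", -1)])
```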
- Updates superadmin "Running Crawls" to show active crawls (starting, waiting, running, stopping), sorted by start time by default
- Navigates to crawl workflow watch view on clicking crawl item
- Adds "Copy Crawl ID" to crawl actions for easy paste into "Jump to crawl"
- Navigates to crawl workflow watch when jumping to crawl
- Adds primary action button next to "Actions" dropdown
- Switches "Edit Workflow Settings" button to icon button
- Redirects user to "Watch Crawl" tab when starting crawl
- Now uses crawl ID from `data.started` in API `/run` response for more responsive UI
- Keeps "Watch Crawl" tab navigation button in list but disable when crawl is not running
- Also handles watch view when workflow is not running to cover navigational edge cases
- Adds banner in "Crawls" list to direct users to the Watch Crawl when workflow is running
- Shows notification when crawl is done to make redirect to Crawls tab smoother
- Uses workflow scale when updating crawl scale
- Removes "All" from "View: All Finished Crawls" on Finished Crawl page for wording consistency
* frontend crawl stopping improvements (#836)
- support new backend 'stopping' property
- for now, keep 'stopping' indicator state when crawl is running but stopping set to true
* operator: add waiting state
- add pods as related objects
- inspect pod status, set crawl status to 'waiting' if no pods are running
frontend:
- frontend support for 'waiting' state
- show waiting icon from mocks
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- Adds an input to the Workflow creation and edit form for specifying crawl depth. This input is conditionally shown for seeded crawls when the scope is set to "Pages on this domain", "Pages on this domain & subdomains" or "Custom page prefix". The "any" scope is also supported for backwards compatibility but is not shown by default or in new configs.
- API implementation: The depth value is set in the primary seed config, i.e. the first seed in seeds: [], not in the outer .config.depth property (see the example below).
Resolves #817
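For illustration, the depth value lands on the primary seed rather than on the outer config (values here are made up):

```python
# hypothetical crawl config: depth is set on the first seed, not on config.depth
crawl_config = {
    "config": {
        "seeds": [
            {"url": "https://example.com/", "scopeType": "prefix", "depth": 2},
        ],
        # no top-level "depth" here
    },
}
```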
- Adds relevant action buttons to each Workflow detail tab header
- Adds "Delete" action menu item to crawls in Crawls tab
- Prevent automatically switching to "Watch" tab after running crawl from detail page
- Removes "Stop" confirmation prompt and only shows "Cancel" confirmation prompt if there are one or more pages crawled
- Replaces "Cancel" confirmation prompt with web component dialog (partially addresses Switch to in-page dialogue boxes #619)
- Fixes hash routing to fix going back with browser back button
- Changes `trash` for `trash3` which I believe wasn't originally available in the version of bootstrap-icons we were using but now it is and I like the tapered edges better :P
- Makes browser profiles action button small to fit with the rest of the dropdown components used elsewhere
- Changes previous file-earmark delete icon to trash icons used everywhere else for delete actions
- Adds some labels to missing icon buttons
- Fixes metadata `aria-label` usage → `label` so it actually gets added to the rendered `button`
- Changes the "More" label to a (hopefully) more descriptive "Actions" label for dropdown actions menus
fixes from 1.4.1:
* Upgrade to mongo 6 and use for workflow crawls
* update readiness probe with timeouts doubled, and failure threshold increased for slower 'mongosh' readiness check
update versions to 1.5.0-beta.0 in backend and frontend
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
As per my note on #745, currently all our other check boxes turn features on when enabled. For consistency I have reversed the states of the autoscroll checkbox so the page autoscrolls when it is checked and does not run the behavior when it is unchecked. Checked is also now the default state.
- Updates help text accordingly
- Renames `disableAutoscrollBehavior` → `autoscrollBehavior`
* misc frontend build fixes:
- pin playwright version to be consistent, fixing the playwright test
- chunking: set max number of chunks generated
* lock playwright version
* remove intl polyfill
---------
Co-authored-by: sua yoo <sua@suayoo.com>
* Re-implement pagination and paginate crawlconfig revs
First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.
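The hand-rolled replacement can be as simple as a skip/limit helper; a sketch under assumed defaults:

```python
def paginate(items: list, page: int = 1, page_size: int = 1000) -> dict:
    """Return one page of results plus the total count (illustrative page size)."""
    start = (page - 1) * page_size
    return {
        "items": items[start : start + page_size],
        "total": len(items),
        "page": page,
        "pageSize": page_size,
    }
```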
* Migrate all HttpUrl seeds to Seeds
This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.
* Filter and sort crawls and workflows
Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend
Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime
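A rough sketch of how those crawl filters could translate into a Mongo query (parameter and field names are assumptions):

```python
def build_crawl_query(oid: str, userid=None, state=None, first_seed=None) -> dict:
    """Build the crawl list filter; 'state' is a comma-separated string (sketch)."""
    query = {"oid": oid}
    if userid:
        query["userid"] = userid
    if state:
        query["state"] = {"$in": state.split(",")}
    if first_seed:
        query["firstSeed"] = first_seed
    return query

# default sort: descending by 'finished', to match the frontend
DEFAULT_SORT = [("finished", -1)]
```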
* Add crawlconfigs search-values API endpoint and test
* backend: make crawlconfigs mutable! (#656)
- crawlconfig PATCH /{id} can now receive a new JSON config to replace the old one (in addition to scale, schedule, tags)
- exclusions: add / remove APIs mutate the current crawlconfig, do not result in a new crawlconfig created
- exclusions: ensure crawl job 'config' is updated when exclusions are added/removed, unify add/remove exclusions on crawl
- k8s: crawlconfig json is updated along with scale
- k8s: stateful set is restarted by updating annotation, instead of changing template
- crawl object: now has 'config', as well as 'profileid', 'schedule', 'crawlTimeout', 'jobType' properties to ensure anything that is changeable is stored on the crawl
- crawlconfigcore: store shared properties between crawl and crawlconfig in new crawlconfigcore (includes 'schedule', 'jobType', 'config', 'profileid', 'crawlTimeout', 'tags', 'oid')
- crawlconfig object: remove 'oldId', 'newId', disallow deactivating/deleting while crawl is running
- rename 'userid' -> 'createdBy'
- remove unused 'completions' field
- add missing return to fix /run response
- crawlout: ensure 'profileName' is resolved on CrawlOut from profileid
- crawlout: return 'name' instead of 'configName' for consistent response
- update: 'modified', 'modifiedBy' fields to set modification date and user modifying config
- update: ensure PROFILE_FILENAME is updated in configmap if profileid provided, clear if profileid==""
- update: return 'settings_changed' and 'metadata_changed' if either crawl settings or metadata changed (see the sketch after this list)
- tests: update tests to check settings_changed/metadata_changed return values
add revision tracking to crawlconfig:
- store each revision in a separate mongo db collection
- revisions accessible via /crawlconfigs/{cid}/revs
- store 'rev' int in crawlconfig and in crawljob
- only add revision history if crawl config changed
migration:
- update to db v3
- copy fields from crawlconfig -> crawl
- rename userid -> createdBy
- copy userid -> modifiedBy, created -> modified
- skip invalid crawls (missing config), make createdBy optional (just in case)
frontend: Update crawl config keys with new API (#681), update frontend to use new PATCH endpoint, load config from crawl object in details view
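A hedged sketch of the mutable-config PATCH flow with revision tracking, as referenced above (motor-style calls; field names are illustrative, not the exact implementation):

```python
async def update_crawl_config(db, cid: str, update: dict, user_id: str) -> dict:
    """Keep a revision only when crawl settings actually change, and report what changed."""
    orig = await db.crawl_configs.find_one({"_id": cid})

    settings_fields = ("config", "schedule", "profileid", "crawlTimeout", "scale")
    metadata_fields = ("name", "tags")
    settings_changed = any(
        f in update and update[f] != orig.get(f) for f in settings_fields
    )
    metadata_changed = any(
        f in update and update[f] != orig.get(f) for f in metadata_fields
    )

    if settings_changed:
        # store the outgoing revision in a separate collection and bump 'rev'
        rev_doc = {k: v for k, v in orig.items() if k != "_id"}
        rev_doc["cid"] = cid
        await db.config_revs.insert_one(rev_doc)
        update["rev"] = orig.get("rev", 0) + 1

    update["modifiedBy"] = user_id
    await db.crawl_configs.update_one({"_id": cid}, {"$set": update})
    return {
        "updated": True,
        "settings_changed": settings_changed,
        "metadata_changed": metadata_changed,
    }
```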
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
* Paginate API list endpoints
fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.
* Increase page size via overridden Params and Page classes
* update api resource list keys
---------
Co-authored-by: sua yoo <sua@suayoo.com>
- Adds icons to details nav items
- Adds replay glyph icon
- Hides "Replay" & "Files" pages if the crawl is running
- Updates border radius 3px → 4px
- Updates colour values, aligns with mockups
- Replaces `margin` from menu items with `gap` values
- Removes animation
Prettier made some spacing adjustments, I also moved some lines around so they're all in the same spot now. 😬
Accessibility improvement, better for screen readers to have h1 be the content of the page as opposed to the application / brand name. Rest of the nav bar to be dealt with at a later date.
### Main Pages + General
- Adds H1 page titles for all main pages
- Moves the New Crawl Config action into the title row from the search controls box
- Gives the Crawl Config search controls box the same style as the Crawls search controls box
- Adds +8px of padding to the search controls box to match mockups
- Search box: medium → small
- Title row control buttons: medium → small
### Details Pages
- h2 → h1 for crawl config and crawl detail pages
- h3 → h2 for crawl config and crawl detail pages
- Removes crawl title margin bottom at medium breakpoint on crawl details page
- Aligns crawl config details title row controls with end of flexbox on mobile to match crawl details controls in the same spot
* Make invites expire after configurable window
The value can be set in EXPIRE_AFTER_SECONDS env var and via
helm chart values, and defaults to 7 days.
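Roughly, the expiration check could look like this (helper name is illustrative; the default window is 7 days):

```python
import os
from datetime import datetime, timedelta

# configurable via the EXPIRE_AFTER_SECONDS env var; defaults to 7 days
EXPIRE_AFTER_SECONDS = int(os.environ.get("EXPIRE_AFTER_SECONDS", 60 * 60 * 24 * 7))

def invite_is_expired(created: datetime) -> bool:
    """Treat an invite as expired once the configured window has passed (sketch)."""
    return datetime.utcnow() - created > timedelta(seconds=EXPIRE_AFTER_SECONDS)
```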
* Create nightly test CI and add invite expiration test to it
* Update 404 error message for missing or expired invite
---------
Co-authored-by: sua yoo <sua@suayoo.com>
* Fixes text overflow problem on crawl details page
- Crawl title length is now unlimited
- Flex items in that row are aligned to the bottom of the box (details bar) instead of the top
- Mirrors changes on config detail page
* Shrinks action button size on config detail page: Matches crawl detail page
* Margin fix: Added 0.5rem, aligned with mockups
Would probably ideally be break-word for all the non URL related things in the form but I don't think it will have any effect on anything that's not URLs in practice?
- Renames the first step as `Crawl Scope` from `Crawl Setup`. Technically this whole process is setting up crawls, `Crawl Scope` should be more descriptive.
- Changes the help text regarding crawler instances to mention rate limiting instead of the amount of resources it takes up on our end which isn't terribly relevant to users.
* Rename archives to orgs and aid to oid on backend
* Rename archive to org and aid to oid in frontend
* Remove translation artifact
* Rename team -> organization
* Add database migrations and run once on startup
* This commit also applies the new by_one_worker decorator to other
asyncio tasks to prevent heavy tasks from being run in each worker.
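One way such a by_one_worker decorator could work is an exclusive file lock, so only the first worker process to grab it runs the task; this is a guess at the mechanism, not the actual implementation:

```python
import fcntl
import functools

def by_one_worker(lock_path: str):
    """Run the decorated async task in only one worker process (hypothetical sketch)."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            lock_file = open(lock_path, "w")
            try:
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                return None  # another worker already holds the lock; skip here
            return await func(*args, **kwargs)
        return wrapper
    return decorator
```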
* Run black, pylint, and husky via pre-commit
* Set db version and use in migrations
* Update and prepare database in single task
* Migrate k8s configmaps
* filter crawls by user id
* filter crawl configs by user id
* text change: Filter My Crawls -> Show Only Mine
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
* profile browser vnc support + fixes:
- switch profile browser rendering to use VNC
- frontend: add @novnc/novnc as dependency, create separate bundle novnc.js to load into vnc browser (to avoid loading from each container)
- frontend: update proxy paths to proxy websocket, index page to crawler
- frontend: allow browser profiles in all browsers, remove browser compatibility check
- frontend: update webpack dev config, apply prettier
- frontend: node version fix
- backend: get vncpassword, build new URL for proxying to crawler iframe
- backend: fix profile / crawl job pull policy from 'Always' -> 'Never', should use existing image for job
- backend: fix kill signal to use bash -c to work with latest backend image
- backend/chart: add 'profile_browser_timeout_seconds' to chart values to control how long the profile browser remains when idle (defaults to 60)
- backend: remove utils.py, now using secret.token_hex() for random suffix
Co-authored-by: sua yoo <sua@suayoo.com>
* backend: crawl info apis:
- add /crawls/{crawl_id} api endpoint which just lists the crawl info, without resolving the individual files
- move /crawls/{crawl_id}.json -> /crawls/{crawl_id}/replay.json for clarity that it's used for replay
* frontend: update api for new replay.json endpoint
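The split between the two endpoints, sketched with FastAPI (db access elided; response bodies are placeholders):

```python
from fastapi import APIRouter

router = APIRouter()

@router.get("/crawls/{crawl_id}")
async def get_crawl(crawl_id: str):
    # crawl info only, without resolving the individual files
    return {"id": crawl_id}

@router.get("/crawls/{crawl_id}/replay.json")
async def get_crawl_replay(crawl_id: str):
    # full crawl info plus presigned 'resources', used for replay
    return {"id": crawl_id, "resources": []}
```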
- fix typos in docs
- update prod deployment info
- update minikube info
- add info on how to run with local images
- bump version to 1.1.0-beta.3 for testing multiarch build
* Initial docs move
* Setup mkdocs
* Adds instructions for building docs
* add new deployment docs, local and prod
* set up three sections: deployment, dev and user guide
* remove old deployment docs
* ci: mkdocs gh-pages publish
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* k8s local deployment work:
- make it easier to deploy w/o ingress by setting 'local_service_port' (suggested port 30870)
- if using local minio, ensure file endpoints set to /data/ and /data/ proxies correctly to local bucket
- if not using minio, ensure file endpoints point to correct access / endpoint url.
- setup should work with docker desktop, minikube, microk8s and k3s!
- nginx chart: bump nginx memory limit to 20Mi
- nginx image: 00-default-override-resolver-config -> 00-browsertrix-nginx-init for clarity
- nginx image: use default nginx.conf, pin to nginx 1.23.2
- mongo: re-add readiness probe, bump connect wait timeout (needed for ci)
- config: set superadmin username to 'admin'
- config schema: set 'name' as required
- add sample chart values overrides:
- chart values: local-config.yaml for running locally with 'local_service_port'
- chart values: add microk8s-hosted.yaml for configuring a hosted microk8s setup
- chart values: add microk8s-ci.yaml for ci tests
- ci: remove docker swarm tests
- ci: add microk8s integration tests: launching cluster, logging in, running a crawl of example.com, downloading/checking WACZ
- bump to 1.1.0-beta.2
Fixes a regression of #361 found after increasing the token timeout, by preventing app load until the authentication service is initialized (and the check whether another tab is logged in has finished).
- Adds version to version.txt in root
- adds update-version.sh which updates version in frontend/package.json and backend/btrixcloud/version.py
- frontend: loads version from $VERSION env var, ../version.txt or package.json
- ci: on new github release, pushes webrecorder/browsertrix-backend and webrecorder/browsertrix-frontend images to Dockerhub with current version, as well as latest.
- version set to 1.1.0-beta.0
- closes #357
- including pagination of queue results (30 results per page currently)
- show numbering on paginated results
- allow user navigation to each result page
Smoother elapsed crawl timer:
- Crawls list: show seconds increment up to 2 minutes, then show minutes only
- Crawls detail: show seconds increment up to one day
- only send signal if stopping, no need for canceling as pods/containers will be removed
- refactor stop/cancel handling to be unified in manager, separate in job
- when stopping / graceful shutdown, return false if sending signal fails
- return success=true in json response if and only if stop/cancel actually succeeds, return an 'error' message on error, should fix #270
- allow canceling after stopping / if stopping fails
- ensure finished time is set in case of cancelation before crawl starts, should fix #273
* animate starting state
* consistent fixed-size slots for each browser (url + screencast)
* add tooltip for expected number of browsers (workers x scale)
* backend: refactor swarm support to also support podman (#260)
- implement podman support as subclass of swarm deployment
- podman is used when 'RUNTIME=podman' env var is set
- podman socket is mapped instead of docker socket
- podman-compose is used instead of docker-compose (though docker-compose works with podman, it does not support secrets, but podman-compose does)
- separate cli utils into SwarmRunner and PodmanRunner which extends it
- using config.yaml and config.env, both copied from sample versions
- work on simplifying config: add docker-compose.podman.yml and docker-compose.swarm.yml and signing and debug configs in ./configs
- add {build,run,stop}-{swarm,podman}.sh in scripts dir
- add init-configs, only copy if configs don't exist
- build local image using current version of podman, to support both podman 3.x and 4.x
- additional fixes for after testing podman on centos
- docs: update Deployment.md to cover swarm, podman, k8s deployment