- Changes `trash` for `trash3` which I believe wasn't originally available in the version of bootstrap-icons we were using but now it is and I like the tapered edges better :P
- Makes browser profiles action button small to fit with the rest of the dropdown components used elswhere
- Changes previous file-earmark delete icon to trash icons used everywhere else for delete actions
- Adds some labels to missing icon buttons
- Fixes metadata `aria-label` usage → `label` so it actually gets added to the rendered `button`
- Changes the "More" label to a (hopefully) more descriptive "Actions" label for dropdown actions menus
fixes from 1.4.1:
* Upgrade to mongo 6 and use for workflow crawls
* update readiness probe with timeouts doubled, and failure threshold increased for slower 'mongosh' readiness check
update versions to 1.5.0-beta.0 in backend and frontend
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
As per my note on #745, currently all our other check boxes turn features on when enabled. For consistency I have reversed the states of the autoscroll checkbox so the page autoscrolls when it is checked and does not run the behavior when it is unchecked. Checked is also now the default state.
- Updates help text accordingly
- Renames `disableAutoscrollBehavior` → autoscrollBehavior
* misc frontend build fixes:
- fix playwright version to be consistent to fix playwright test
- chunking: set max number of chunks generated
* lock playwright version
* remove intl polyfill
---------
Co-authored-by: sua yoo <sua@suayoo.com>
* Re-implement pagination and paginate crawlconfig revs
First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.
* Migrate all HttpUrl seeds to Seeds
This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.
* Filter and sort crawls and workflows
Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend
Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime
* Add crawlconfigs search-values API endpoint and test
* backend: make crawlconfigs mutable! (#656)
- crawlconfig PATCH /{id} can now receive a new JSON config to replace the old one (in addition to scale, schedule, tags)
- exclusions: add / remove APIs mutate the current crawlconfig, do not result in a new crawlconfig created
- exclusions: ensure crawl job 'config' is updated when exclusions are added/removed, unify add/remove exclusions on crawl
- k8s: crawlconfig json is updated along with scale
- k8s: stateful set is restarted by updating annotation, instead of changing template
- crawl object: now has 'config', as well as 'profileid', 'schedule', 'crawlTimeout', 'jobType' properties to ensure anything that is changeable is stored on the crawl
- crawlconfigcore: store share properties between crawl and crawlconfig in new crawlconfigcore (includes 'schedule', 'jobType', 'config', 'profileid', 'schedule', 'crawlTimeout', 'tags', 'oid')
- crawlconfig object: remove 'oldId', 'newId', disallow deactivating/deleting while crawl is running
- rename 'userid' -> 'createdBy'
- remove unused 'completions' field
- add missing return to fix /run response
- crawlout: ensure 'profileName' is resolved on CrawlOut from profileid
- crawlout: return 'name' instead of 'configName' for consistent response
- update: 'modified', 'modifiedBy' fields to set modification date and user modifying config
- update: ensure PROFILE_FILENAME is updated in configmap is profileid provided, clear if profileid==""
- update: return 'settings_changed' and 'metadata_changed' if either crawl settings or metadata changed
- tests: update tests to check settings_changed/metadata_changed return values
add revision tracking to crawlconfig:
- store each revision separate mongo db collection
- revisions accessible via /crawlconfigs/{cid}/revs
- store 'rev' int in crawlconfig and in crawljob
- only add revision history if crawl config changed
migration:
- update to db v3
- copy fields from crawlconfig -> crawl
- rename userid -> createdBy
- copy userid -> modifiedBy, created -> modified
- skip invalid crawls (missing config), make createdBy optional (just in case)
frontend: Update crawl config keys with new API (#681), update frontend to use new PATCH endpoint, load config from crawl object in details view
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
* Paginate API list endpoints
fastapi-pagination is pinned to 0.9.3, the latest release that plays
nicely with pinned versions of fastapi and fastapi-users.
* Increase page size via overriden Params and Page classes
* update api resource list keys
---------
Co-authored-by: sua yoo <sua@suayoo.com>
- Adds icons to details nav items
- Adds replay glyph icon
- Hides "Replay" & "Files" pages if the crawl is running
- Updates border radius 3px → 4px
- Updates colour values, aligns with mockups
- Replaces `margin` from menu items with `gap` values
- Removes animation
Prettier made some spacing adjustments, I also moved some lines around so they're all in the same spot now. 😬
Accessibility improvement, better for screen readers to have h1 be the content of the page as opposed to the application / brand name. Rest of the nav bar to be dealt with at a later date.
### Main Pages + General
- Adds H1 page titles for all main pages
- Moves the New Crawl Config action into the title row from the search controls box
- Gives the Crawl Config search controls box the same style as the Crawls search controls box
- Adds +8px of padding to the search controls box to match mockups
- Search box: medium → small
- Title row control buttons: medium → small
### Details Pages
- h2 → h1 for crawl config and crawl detail pages
- h3 → h2 for crawl config and crawl detail pages
- Removes crawl title margin bottom at medium breakpoint on crawl details page
- Aligns crawl config details title row controls with end of flexbox on mobile to match crawl details controls in the same spot
* Make invites expire after configurable window
The value can be set in EXPIRE_AFTER_SECONDS env var and via
helm chart values, and defaults to 7 days.
* Create nightly test CI and add invite expiration test to it
* Update 404 error message for missing or expired invite
---------
Co-authored-by: sua yoo <sua@suayoo.com>
* Fixes text overflow problem on crawl details page
- Crawl title length is now unlimited
- Flex items in that row are aligned to the bottom of the box (details bar) instead of the top
- Mirrors changes on config detail page
* Shrinks action button size on config detail page: Matches crawl detail page
* Margin fix: Added 0.5rem, aligned with mockups
Would probably ideally be break-word for all the non URL related things in the form but I don't think it will have any effect on anything that's not URLs in practice?
- Renames the first step as `Crawl Scope` from `Crawl Setup`. Technically this whole process is setting up crawls, `Crawl Scope` should be more descriptive.
- Changes the help text regarding crawler instances to mention rate limiting instead of the amount of resources it takes up on our end which isn't terribly relevant to users.
* Rename archives to orgs and aid to oid on backend
* Rename archive to org and aid to oid in frontend
* Remove translation artifact
* Rename team -> organization
* Add database migrations and run once on startup
* This commit also applies the new by_one_worker decorator to other
asyncio tasks to prevent heavy tasks from being run in each worker.
* Run black, pylint, and husky via pre-commit
* Set db version and use in migrations
* Update and prepare database in single task
* Migrate k8s configmaps
* filter crawls by user id
* filter crawl configs by user id
* text change: Filter My Crawls -> Show Only Mine
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
* profile browser vnc support + fixes:
- switch profile browser rendering to use VNC
- frontend: add @novnc/novnc as dependency, create separate bundle novnc.js to load into vnc browser (to avoid loading from each container)
- frontend: update proxy paths to proxy websocket, index page to crawler
- frontend: allow browser profiles in all browsers, remove browser compatibility check
- frontend: update webpack dev config, apply prettier
- frontend: node version fix
- backend: get vncpassword, build new URL for proxying to crawler iframe
- backend: fix profile / crawl job pull policy from 'Always' -> 'Never', should use existing image for job
- backend: fix kill signal to use bash -c to work with latest backend image
- backend/chart: add 'profile_browser_timeout_seconds' to chart values to control how long profile browser to remain when idle (default to 60)
- backend: remove utils.py, now using secret.token_hex() for random suffix
Co-authored-by: sua yoo <sua@suayoo.com>
* backend: crawl info apis:
- add /crawls/{crawl_id} api endpoint which just lists the crawl info, without resolving the individual files
- move /crawls/{crawl_id}.json -> /crawls/{crawl_id}/replay.json for clarity that it's used for replay
* frontend: update api for new replay.json endpoint
- fix typos in docs
- update prod deployment info
- update minikube info
- add info on how to run with local images
- bump version to 1.1.0-beta.3 for testing multiarch build
* Initial docs move
* Setup mkdocs
* Adds instructions for building docs
* add new deployment docs, local and prod
* set up three sections: deployment, dev and user guide
* remove old deployment docs
* ci: mkdocs gh-pages publish
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* k8s local deployment work:
- make it easier to deploy w/o ingress by setting 'local_service_port' (suggested port 30870)
- if using local minio, ensure file endpoints set to /data/ and /data/ proxies correctly to local bucket
- if not using minio, ensure file endpoints point to correct access / endpoint url.
- setup should work with docker desktop, minikube, microk8s and k3s!
- nginx chart: bump nginx memory limit to 20Mi
- nginx image: 00-default-override-resolver-config -> 00-browsertrix-nginx-init for clarity
- nginx image: use default nginx.conf, pin to nginx 1.23.2
- mongo: readd readiness probe, bump connect wait timeout (needed for ci)
- config: set superadmin username to 'admin'
- config schema: set 'name' as required
- add sample chart values overrides:
- chart values: local-config.yaml for running locally with 'local_service_port'
- chart values: add microk8s-hosted.yaml for configuring a hosted microk8s setup
- chart values: add microk8s-ci.yaml for ci tests
- ci: remove docker swarm tests
- ci: add microk8s integration tests: launching cluster, logging in, running a crawl of example.com, downloading/checking WACZ
- bump to 1.1.0-beta.2
Fixes regression to #361 found after increasing the token timeout by preventing app load until the authentication service is initialized (and finishing check if another tab is logged in.)
- Adds version to version.txt in root
- adds update-version.sh which updates version in frontend/package.json and backend/btrixcloud/version.py
- frontend: loads version from $VERSION env var, ../version.txt or package.json
- ci: on new github release, pushes webrecorder/browsertrix-backend and webrecorder/browsertrix-frontend images to Dockerhub with current version, as well as latest.
- version set to 1.1.0-beta.0
- closes#357
- including pagination of queue results (30 results per page currently)
- show numbering on paginated results
- allow user navigation to each result page
Smoother elapsed crawl timer:
- Crawls list: show seconds increment up to 2 minutes, then show minutes only
- Crawls detail: show seconds increment up to one day
- only send signal if stopping, no need for canceling as pods/containers will be removed
- refactor stop/cancel handling to be unified in manager, separate in job
- when stopping / graceful shutdown, return false if sending signal fails
- return success=true in json response if and only if stop/cancel actually succeeds, return 'error' message in error, should fix#270
- allow canceling after stopping / if stopping fails
- ensure finished time is set in case of cancelation before crawl starts, should fix#273
* animate starting state
* consistent fixed-size slots for each browser (url + screencast)
* add tooltip for expected number of browsers (workers x scale)
* backend: refactor swarm support to also support podman (#260)
- implement podman support as subclass of swarm deployment
- podman is used when 'RUNTIME=podman' env var is set
- podman socket is mapped instead of docker socket
- podman-compose is used instead of docker-compose (though docker-compose works with podman, it does not support secrets, but podman-compose does)
- separate cli utils into SwarmRunner and PodmanRunner which extends it
- using config.yaml and config.env, both copied from sample versions
- work on simplifying config: add docker-compose.podman.yml and docker-compose.swarm.yml and signing and debug configs in ./configs
- add {build,run,stop}-{swarm,podman}.sh in scripts dir
- add init-configs, only copy if configs don't exist
- build local image use current version of podman, to support both podman 3.x and 4.x
- additional fixes for after testing podman on centos
- docs: update Deployment.md to cover swarm, podman, k8s deployment
- add custom init script for ./docker-entrypoint.d/ to setup resolver from local /etc/resolv.conf
- custom init script also removes default.conf, and removes minio route if NO_MINIO_ROUTE=1 is set
- assign template vars to nginx vars to avoid conflicts when interpolating
- k8s: remove initContainers and volumes, now handled via custom init script in image
- build backend and frontend with cacheing using GHA cache)
- streamline frontend image to reduce layers
- setup local swarm with test/setup.sh script, wait for containers to init
- copy sample config files as default (add storages.sample.yaml)
- add initial backend test for logging in with default superadmin credentials via 127.0.0.1:9871
- must use 127.0.0.1 instead of localhost for accessing frontend container within action
- use python-on-whale to use docker cli api directly, creating docker stack for each crawl or profile browser
- configure storages via storages.yaml secret
- add crawl_job, profile_job, splitting into base and k8s/swarm implementations
- split manager into base crawlmanager and k8s/swarm implementations
- swarm: load initial scale from db to avoid modifying fixed configs, in k8s, load from configmap
- swarm: support scheduled jobs via swarm-cronjob service
- remove docker dependencies (aiodocker, apscheduler, scheduling)
- swarm: when using local minio, expose via /data/ route in nginx via extra include (in k8s, include dir is empty and routing handled via ingress)
- k8s: cleanup minio chart: move init containers to minio.yaml
- swarm: stateful set implementation to be consistent with k8s scaling:
- don't use service replicas,
- create a unique service with '-N' appended and allocate unique volume for each replica
- allows crawl containers to be restarted w/o losing data
- add volume pruning background service, as volumes can be deleted only after service shuts down fully
- watch: fully simplify routing, route via replica index instead of ip for both k8s and swarm
- rename network btrix-cloud-net -> btrix-net to avoid conflict with compose network
- use statefulsets instead of deployments for mongo, redis, signer
- use k8s job + statefulset for running crawls
- use separate statefulset for crawl (scaled) and single-replica redis stateful set
- move crawl job update login to crawl_updater
- remove shared redis chart
package refactor:
- move to shared code to 'btrixcloud'
- move k8s to 'btrixcloud.k8s'
- move docker to 'btrixcloud.docker'
* ensure editing other config options does not lose profile
* support adding/editing/removing profile of existing config
* when duplicating config, ensure profile setting is also copied in the duplicate
- delete browser profile, if not in use
- if in use, show error message, listing crawl configs that use the profile
- backend: fix check for confirming profile deletion
* apply /api prefix consistently, both directly through backend and when accessing via frontend, fixes#236
* docs: update local deployment docs to use 9871 instead of 8000, don't expose 8000 by default
* schemas: don't include /openapi.json as /healthz in documentation, keep /healthz at root
* k8s: route backend to /api without additional rewriting
* add profile creation, list endpoints at /archives/<aid>/profiles
* add profile browser creation, get, ping, commit, delete endpoints at /archives/<aid>/profiles/browser
* support creation of profile browser using browsertrix-crawler 'create-login-profile' in docker and k8s
* ensure profile browser expires after set time, k8s job or docker container automatically deleted on exit
* profile browser creation returns temporary browser id, or `{"detail": "waiting_for_browser"}` while waiting for browser container init
* nginx frontend: proxy /loadbrowser/ to port 9223 in browsertrix-crawler, connecting directly to chrome devtools
* profile api auth: use redis for auth
- store browserid->archiveid and browserid->browser ip mapping in redis
- browser apis: ensure profile browser is associated with specified archive
- browser ws: pass arcchiveid and browserid to ws query args, browserid is part of archive, and browserid corresponds to specified ip
* store profiles in /profiles/ directory in default storage, include profileid in profile tar.gz filename
* support profile in crawlconfig:
- add profileid to CrawlConfig, and profileName to CrawlConfigOut
- support resolving profile path via profileid, setting '--profile @{path/to/profile.tar.gz}' for crawler (assuming same storage for profile as output for now) in both docker and k8s setups
- docker: support out_filename, custom wacz output filename missing functionality
* backend api
- superadmin has admin access to all archives
- new superadmin endpoints: /archives/all/crawls and /archives/all/crawls/<crawl_id>.json for list all running crawls
and loading crawl data by id
- frontend superadmin view (fixes#201)
* show all archives on superadmin home page
* show jump to crawl for super admin (#200)
* navbar links for: all archives, all running crawls and jump to crawl
Co-authored-by: sua yoo <sua@suayoo.com>
* frontend-tweaks:
- treat 'starting' state same as 'running'
- default to no schedule instead of weekly for default
- add 'Domain' scopeType
* backend: also allow 'domain' as a scopeType
* watch work: proxy directly to crawls instead of redis pubsub
- add 'watchIPs' to crawl detail output
- cache crawl ips for quick access for auth
- add '/ipaccess/{ip}' endpoint for watch ws connection to ensure ws has access to the specified container ip
- enable 'auth_request' in nginx frontend
- requirements: update to latest redis-py
remaining fixes for #134
* frontend docker build: pass GIT_COMMIT_HASH and GIT_BRANCH_NAME as env vars to remove dependency on git in webpack.config.js (for glitchtip)
fixes#150
* default to "unknown" if git and env vars not available
* add comment about error reporting for local use
Co-authored-by: sua yoo <sua@suayoo.com>
* backend support for new watch system (#134):
- support for watch via redis pubsub and websocket connection to backend
- can support watch from any number of crawler instances to support scaled crawls
- use /archives/{aid}/crawls/{crawl_id}/watch/ws websocket endpoint
- ws: ignore graceful connectionclosedok exception, log other exceptions
- set logging to info to instead of debug for now (debug logs all ws traffic)
- remove old watch apis in backend
- remove old websocket routing to crawler instance for old watch system
- oauth bearer check: support websockets, use websocket object if no request object
- crawler args: replace --screencastPort with --screencastRedis
* support for replay via replayweb.page embed, fixes#124
backend:
- pre-sign all files urls
- cache pre-signed urls in redis, presign again when expired (default duration 3600, settable via PRESIGN_DURATION_SECONDS env var)
- change files output -> resources to confirm to Data Package spec supported by replayweb.page
- add CrawlFileOut which contains 'name' (file id), 'path' (presigned url), 'hash', and 'size'
- add /replay/sw.js endpoint to import sw.js from latest replay-web-page release
- update to fastapi-users 9.2.2
- customize backend auth to allow authentication to check 'auth_bearer' query arg if 'Authorization' header not set
- remove sw.js endpoint, handling in frontend
frontend:
- add <replay-web-page> to frontend, include rwp ui.js from latest release in index.html for now
- update crawl api endpoint to end in json
- replay-web-page loads the api endpoint directly!
- update Crawl type to use new format, 'resources' -> instead of 'files', each file has 'name' and 'path'
- nginx: add endpoint to serve the replay sw.js endpoint
- add defer attr to ui.js
- move 'Download' to 'Download Files'
* frontend: support customizing replayweb.page loading url via RWP_BASE_URL env var in Dockerfile
- default prod value set in frontend Dockerfile (set to upcoming 1.5.8 release needed for multi-wacz-file support) (can be overridden during image build via --build-arg)
- rename index.html -> index.ejs to allow interpolation
- RWP_BASE_URL defaults to latest https://replayweb.page/ for testing
- for local testing, add sw.js loading via devServer, also using RWP_BASE_URL (#131)
Co-authored-by: sua yoo <sua@suayoo.com>
- Cancel and stop crawl
- Sorts crawls by start time, status and crawl template ID
- Filters crawls by crawl template ID
- Adds shortcut to copy template ID
frontend:
- add checkbox to basic crawl config component which sets 'extraHops' to 1, otherwise to 0
- text tweaks: rename Scope Type -> Crawl Scope, capitalization
backend: add 'extraHops' to CrawlConfig
fixes#102