Commit Graph

218 Commits

Author SHA1 Message Date
Ilya Kreymer
e6467c3374 backend work:
- support {configname}-{username}-@ts-@hostsuffix.wacz as output filename, sanitize username and config name
- support returning 'starting' for crawl status if no ips or 0/0 pages found.
- fix updating scale via POST crawlconfig update
- fix duplicate user error on superuser init
2022-03-15 18:20:25 -07:00
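As an illustration of the output filename pattern named in the commit above, here is a minimal sketch. The sanitization rule (lowercase, non-alphanumeric runs collapsed to a hyphen) and both function names are assumptions, not taken from the repository; `@ts` and `@hostsuffix` are kept as literal placeholders, on the assumption that they are substituted later.

```python
import re


def sanitize(value: str) -> str:
    """Hypothetical rule: lowercase, collapse anything outside [a-z0-9] to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def output_filename_template(config_name: str, user_name: str) -> str:
    # '@ts' and '@hostsuffix' stay as literal placeholders in the result.
    return f"{sanitize(config_name)}-{sanitize(user_name)}-@ts-@hostsuffix.wacz"
```

For example, `output_filename_template("My Config!", "Jane Doe")` yields `my-config-jane-doe-@ts-@hostsuffix.wacz` under these assumed rules.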
Ilya Kreymer
4b2f89db91 k8s: support for using a pre-made persistent volume/claim for crawling, configurable via CRAWLER_PV_CLAIM, otherwise using emptyDir
k8s: ability to set deployment scale for frontend as well
2022-03-15 11:18:23 -07:00
Ilya Kreymer
8ce7a9802b backend quick fix:
chart/config: use screencastPort, fixed collection name
k8s: set pod to never restart to see logs
2022-03-14 11:42:53 -07:00
Ilya Kreymer
9c99d67b1d quickfix: backend: docker: fix loading ips for watch 2022-03-04 17:12:19 -08:00
Ilya Kreymer
fb51f8e33e
Mongo auth fix (#190)
* backend: makes mongo auth configurable!
use mongo_auth secret in k8s and set env vars in docker
fixes #177 
* docker: update config.sample.env: use ws screencast by default, add NO_DELETE_ON_FAIL option, extend default login lifetime
2022-03-04 15:04:33 -08:00
Ilya Kreymer
cdd0ab34a3
Watch Stream Directly from Browsertrix Crawler (#189)
* watch work: proxy directly to crawls instead of redis pubsub
- add 'watchIPs' to crawl detail output
- cache crawl ips for quick access for auth
- add '/ipaccess/{ip}' endpoint for watch ws connection to ensure ws has access to the specified container ip
- enable 'auth_request' in nginx frontend
- requirements: update to latest redis-py
remaining fixes for #134
2022-03-04 14:55:11 -08:00
Ilya Kreymer
51a573ef1f backend prod settings:
- set WEB_CONCURRENCY env var to configure number of backend api workers for both docker and k8s
- set via 'backend_workers' in values.yaml
- also add 'rwp_base_url' to values.yaml
- update containers to use public webrecorder/browsertrix-backend and webrecorder/browsertrix-frontend containers
- make liveness, readiness and startup health checks more tolerant
2022-02-28 18:09:13 -08:00
Ilya Kreymer
84a9079b1f
support signing in docker deployment: (#166)
- add authsign to docker-compose.yml
- add signing.sample.yaml to be copied to signing.yaml for authsign
- add WACZ_SIGN_URL and WACZ_SIGN_TOKEN to config.sample.env
- signing enabled if WACZ_SIGN_URL is set
- add instructions on how to enable signing to Deployment
- update .gitignore, don't commit 'signing.yaml'
- update images to use public repo browsertrix images
2022-02-28 14:32:19 -08:00
Ilya Kreymer
9bd402fa17
New WS Endpoint for Watching Crawl (#152)
* backend support for new watch system (#134):
- support for watch via redis pubsub and websocket connection to backend
- can support watch from any number of crawler instances to support scaled crawls
- use /archives/{aid}/crawls/{crawl_id}/watch/ws websocket endpoint
- ws: ignore graceful connectionclosedok exception, log other exceptions
- set logging to info instead of debug for now (debug logs all ws traffic)
- remove old watch apis in backend
- remove old websocket routing to crawler instance for old watch system
- oauth bearer check: support websockets, use websocket object if no request object
- crawler args: replace --screencastPort with --screencastRedis
2022-02-22 10:33:10 -08:00
Ilya Kreymer
aa5207915c
backend: fix crawl config revision links (#149)
backend: crawlconfig:
- ensure newId is saved on old config being replaced
- if old config replaced is being deleted, ensure newId link is set on its old config (if any),
and the oldId points to the oldId of config being replaced (if any)
2022-02-21 16:51:27 -08:00
Ilya Kreymer
ee68a2f64e
Support for setting scale in crawlconfig (#148)
* backend: scale support:
- add 'scale' field to crawlconfig
- support updating 'scale' field in crawlconfig patch
- add constraint for crawlconfig and crawl scale (currently 1-3)
2022-02-20 11:27:47 -08:00
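The scale constraint mentioned in the commit above (currently 1-3) could be checked along these lines. This is only a sketch: the constant and function names are hypothetical, and the real backend may enforce the range differently (for instance in its data model).

```python
MIN_SCALE = 1  # lower bound stated in the commit message
MAX_SCALE = 3  # upper bound stated in the commit message ("currently 1-3")


def validate_scale(scale: int) -> int:
    """Return scale unchanged if it is within the allowed range, else raise."""
    if not MIN_SCALE <= scale <= MAX_SCALE:
        raise ValueError(f"scale must be between {MIN_SCALE} and {MAX_SCALE}")
    return scale
```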
Ilya Kreymer
d05f04be9f
Crawl Config Editing Support (#141)
* support inactive configs in same collection, configs with `inactive` set to true (#137)
- add `inactive`, `newId`, `oldId` to crawlconfigs
- filter out inactive configs by default for most operations
- add index for aid + inactive field for faster querying
- delete returns status: 'deactivated' or 'deleted'
- if no crawls ran, config can be deleted, otherwise it is deactivated

* update crawl endpoint: add general PATCH crawl config endpoint, support updating schedule and name
2022-02-17 16:04:07 -08:00
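The delete behavior described in the commit above (a config with no crawls is deleted, otherwise it is deactivated) can be sketched as follows. The function name, the dict-based config, and the `crawl_count` argument are illustrative assumptions; only the `inactive` field and the two status strings come from the commit message.

```python
def delete_or_deactivate(config: dict, crawl_count: int) -> dict:
    """Decide the outcome of a delete request for a crawl config."""
    if crawl_count == 0:
        # No crawls ever ran with this config, so it can be removed outright.
        return {"status": "deleted"}
    # Crawls reference this config, so keep it but mark it inactive.
    config["inactive"] = True
    return {"status": "deactivated"}
```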
Ilya Kreymer
d28ebcc7b6 backend: crawlconfig: don't pass default settings to crawlconfig to avoid redundant settings, use browsertrix-crawler defaults
when config not set
2022-02-14 18:47:52 -08:00
Ilya Kreymer
ca85edc8b3 backend: resource limits:
- set resource mem and cpu requests/limits for all used services (not minio for now)
- add readiness probe to redis, mongo
- adjust crawler limits, set via configmap
2022-02-08 19:53:41 -08:00
Ilya Kreymer
71842be94a backend: k8s setup minor tweaks:
- add 'emptyDir' volume for crawl directory (to allow any pod restarts to have access to the data)
- rename minio and redis volumes to avoid any confusion
- add pod termination grace-period (default to 600 secs)
2022-02-08 15:52:57 -08:00
Ilya Kreymer
8acb43b171 backend: use redis to mark crawls as canceled immediately, avoid dupes in crawl list (even if paging is added for db results) 2022-02-01 15:58:56 -08:00
Ilya Kreymer
4b7522920a backend: k8s: fix finished check, resource limits increase 2022-02-01 15:07:20 -08:00
Ilya Kreymer
b3f21932fc backend: k8s: list running jobs tweak: if succeeded jobs == number of parallel jobs, filter out from list, assume finished and not stopping 2022-02-01 00:05:13 -08:00
Ilya Kreymer
2b2e6fedfa
Misc backend fixes (#133)
* misc backend fixes:
- fix uuid typing: roles list, user invites
- crawlconfig: fix created date setting, fix userName lookup
- docker: fix timezone for scheduler, fix running check
- remove prints
- fix get crawl stuck in 'stopping' - check finished list first, then run list (in case k8s job has not been deleted)
2022-01-31 19:41:04 -08:00
Ilya Kreymer
adb5c835f2
Presign and replay (#127)
* support for replay via replayweb.page embed, fixes #124

backend:
- pre-sign all files urls
- cache pre-signed urls in redis, presign again when expired (default duration 3600, settable via PRESIGN_DURATION_SECONDS env var)
- change files output -> resources to conform to Data Package spec supported by replayweb.page
- add CrawlFileOut which contains 'name' (file id), 'path' (presigned url), 'hash', and 'size'
- add /replay/sw.js endpoint to import sw.js from latest replay-web-page release
- update to fastapi-users 9.2.2
- customize backend auth to allow authentication to check 'auth_bearer' query arg if 'Authorization' header not set
- remove sw.js endpoint, handling in frontend

frontend:
- add <replay-web-page> to frontend, include rwp ui.js from latest release in index.html for now
- update crawl api endpoint to end in json
- replay-web-page loads the api endpoint directly!
- update Crawl type to use new format, 'resources' instead of 'files', each file has 'name' and 'path'

- nginx: add endpoint to serve the replay sw.js endpoint
- add defer attr to ui.js
- move 'Download' to 'Download Files'

* frontend: support customizing replayweb.page loading url via RWP_BASE_URL env var in Dockerfile
- default prod value set in frontend Dockerfile (set to upcoming 1.5.8 release needed for multi-wacz-file support) (can be overridden during image build via --build-arg)
- rename index.html -> index.ejs to allow interpolation
- RWP_BASE_URL defaults to latest https://replayweb.page/ for testing
- for local testing, add sw.js loading via devServer, also using RWP_BASE_URL (#131)

Co-authored-by: sua yoo <sua@suayoo.com>
2022-01-31 17:02:15 -08:00
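The presign caching described in the commit above (cache pre-signed urls, presign again when expired, default duration 3600 via PRESIGN_DURATION_SECONDS) might look roughly like the sketch below. To stay self-contained it uses an in-memory dict where the commit uses redis, and the `PresignCache` class and the `presign` callable are hypothetical stand-ins for the real storage client.

```python
import os
import time
from typing import Callable, Dict, Tuple

# Default of 3600 seconds, overridable via the env var named in the commit.
PRESIGN_DURATION_SECONDS = int(os.environ.get("PRESIGN_DURATION_SECONDS", "3600"))


class PresignCache:
    """In-memory stand-in for the redis cache of pre-signed urls."""

    def __init__(
        self,
        presign: Callable[[str, int], str],
        duration: int = PRESIGN_DURATION_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._presign = presign
        self._duration = duration
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._cache.get(path)
        if entry is not None and entry[1] > now:
            return entry[0]  # cached url is still valid
        # Missing or expired: presign again and remember the new expiry.
        url = self._presign(path, self._duration)
        self._cache[path] = (url, now + self._duration)
        return url
```

With redis, the expiry bookkeeping could instead be delegated to a key TTL; the dict version only makes the control flow visible.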
Ilya Kreymer
f569125a3d
storage: support loading default storage from crawl managers (#126)
support s3-compatible presigning with default storage
backend support for #120
2022-01-31 11:22:03 -08:00
Ilya Kreymer
523b557eac replay route: (prepare for replay, #124)
- add support for /replay/sw.js
- ensure route works in both k8s and docker (routed via main nginx)
2022-01-31 11:18:10 -08:00
Ilya Kreymer
be86505347 backend: crawls api: better fix for graceful stop
- k8s: don't use redis, set to 'stopping' if status.active is not set, toggled immediately on delete_job
- docker: set custom redis key to indicate 'stopping' state (container still running)
- api: remove crawl is_running endpoint, redundant with general get crawl api
2022-01-30 22:01:00 -08:00
Ilya Kreymer
542680daf7
backend fixes: fix graceful stop + stats (#122)
* backend fixes: fix graceful stop + stats
- use redis to track stopping state, to be overwritten when finished
- also include stats in completed crawls
- docker: use short container id for crawl id
- graceful stop returns 'stopping_gracefully' instead of 'stopped_gracefully'
- don't set stopping state when complete!
- beginning files support: resolve absolute urls for crawl detail (not pre-signing yet)
2022-01-30 18:58:47 -08:00
Ilya Kreymer
bcbc40059e
Refactor backend data model to support UUID (fixes #118) (#119)
* uuid fix: (fixes #118)
- update all mongo models to use UUID type as main '_id' (users continue to use 'id' as defined by fastapi-users)
- update all foreign doc references to use UUID instead of string
- api handlers convert str->uuid as needed
api fix:
- fix single crawl api, add CrawlOut response model
- fix collections api
- fix standalone-docker apis
- for manual job, set user to current user, overriding the setting from crawlconfig

* additional fixes:
- rename username -> userName to indicate not the login 'username'
- rename user -> userid, archive -> aid for crawlconfig + crawls
- ensure invites correctly convert str -> uuid as needed
- filter out unset values from browsertrix-crawler config

* convert remaining user -> userid variables
ensure archive id is passed to crawl_manager as str (via archive.id_str)

* remove bulk crawlconfig delete
* add support for `stopping` state when gracefully stopping crawl
* for get crawl endpoint, check stopped crawls first, then running
2022-01-29 19:00:11 -08:00
Ilya Kreymer
9499ebfbba
Crawls API improvements (#117)
* crawls api improvements (fixes #110)
- add GET /crawls/{crawlid} api to return single crawl
- resolve crawlconfig name, add as `configName` to crawl model
- add 'created' date for crawlconfigs 
- flatten list to single 'crawls' list, instead of separate 'finished' and 'running' (running crawls added first)
- include 'fileCount' and 'fileSize', remove files
- remove `files` from crawl list response, also remove `aid`
- remove `schedule` from crawl data altogether, (available in crawl config)
- add ListCrawls response model
2022-01-29 12:08:02 -08:00
Ilya Kreymer
01ad7e656f quickfix: for /cancel immediate crawl cancelation, send SIGABRT instead of SIGUSR1 2022-01-27 20:45:03 -08:00
Ilya Kreymer
0bea0cfff2
crawl config new template: add support for 'extraHops' config option (available in browsertrix-crawler 0.5.0) (#104)
frontend:
- add checkbox to basic crawl config component which sets 'extraHops' to 1, otherwise to 0
- text tweaks: rename Scope Type -> Crawl Scope, capitalization

backend: add 'extraHops' to CrawlConfig
fixes #102
2022-01-26 21:18:22 -08:00
Ilya Kreymer
f55f84c60b backend:
- crawlconfigs cleanup: simplify get_crawl_configs api
- return CrawlConfigOut for single crawlconfig api endpoint, include currCrawlId
2022-01-22 17:41:37 -08:00
Ilya Kreymer
77aa5213f2 quickfix: typo fix, return config, not archive, fixes #96 2022-01-22 17:21:29 -08:00
Ilya Kreymer
b506442b21
backend api: add curr crawl to crawlconfig listing (#95)
* backend api: add current crawl id to crawlconfig listing
- model: add 'currCrawlId' to CrawlConfig model
- output: add response model to /crawlconfigs api response to show correct openapi model
- rename crawl_configs -> crawlConfigs for consistency
2022-01-22 13:52:46 -08:00
Ilya Kreymer
88f1689e0e crawlconfig: add 'name' property to crawl config
superuser init: don't check invite token for verified superuser (automatic init)
fix formatting
2022-01-15 19:06:48 -08:00
Ilya Kreymer
c561fe3af4
Support Invite Info APIs (#82)
* backend: support exposing info about a particular invite, fixes part of #35
new apis are:
- GET /users/invite/{token}?email={email} - no auth needed, get invite to new user
- GET /users/me/invite/{token} - with auth, to get invite to join an archive for an existing user

* get archive.name as well if invite is adding to an archive

* fix camelCase typo
2022-01-14 22:53:02 -08:00
Ilya Kreymer
53beb84c01
Config superuser (#59)
* backend: automatically create super user, fixes #57
- if SUPERUSER_EMAIL is set, superuser is created with `is_superuser` and `is_verified` settings, if user doesn't already exist.
- if SUPERUSER_PASSWORD is set, the password for superuser is set, otherwise a random password is generated
update sample SUPERUSER_EMAIL and SUPERUSER_PASSWORD in config file and chart.
- ensure verification email is not sent if user already verified
2021-12-05 14:12:42 -08:00
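The superuser initialization rule from the commit above can be sketched like this. The function name and the plain-dict environment are assumptions for illustration, and the use of `secrets.token_urlsafe` is one possible way to generate the random password, not necessarily what the backend does.

```python
import secrets
from typing import Mapping, Optional, Tuple


def resolve_superuser(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (email, password) for the superuser, or None if not configured."""
    email = env.get("SUPERUSER_EMAIL")
    if not email:
        return None  # no superuser is created automatically
    # Use the configured password if present, otherwise generate a random one.
    password = env.get("SUPERUSER_PASSWORD") or secrets.token_urlsafe(16)
    return email, password
```

In a real deployment the mapping would typically be `os.environ`.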
Ilya Kreymer
eaf8055063
Support unified docker + k8s deployment (#58)
- adapt nginx config to work both in docker and k8s, using env vars to set urls

backend: additional fixes:
- use env vars with nginx config
- fix settings api route
- when sending e-mail, use the Host header for verification urls when available
- prepare Dockerfile with full build from scratch in image, (disabled 'yarn install' for faster builds for now)
- fix accept invite api for existing user to /archives/accept-invite/{token}
2021-12-05 13:02:26 -08:00
Ilya Kreymer
87c5505c43
Backend Invite System Refactor (#53)
* backend:
- refactor invite system, move to separate InviteOps object, used by archives and user
- supporting three invite use cases:
1) superuser invites any user not registered, not added to any archive
2) archive admin invites any user not registered, add to one of their archives
3) archive admin invites existing registered user, add to one of their archives

- support superadmin invite via /users/invite (fixes #37)
- superadmin invite has no archive set and does not add user to archive

- don't send verification email when accepting from invite, fixes #50
- use different email template / accept url for existing user invite, eg, `/invite/accept/`

- fix default token value in chart
2021-12-04 12:14:28 -08:00
Ilya Kreymer
11b797d535
Add global settings endpoint (#52)
* backend:
- add /api/settings endpoint for misc system-wide settings
- setting 'registrationEnabled' if open registration should be enabled, set via REGISTRATION_ENABLED=1 env var
- setting 'jwtTokenLifetimeMinutes' returns the jwt token expiry in seconds, configured in minutes via JWT_TOKEN_LIFETIME_MINUTES env var (default: 60)
2021-12-03 10:56:57 -08:00
Ilya Kreymer
05c1129fb8
Frontend + Backend Integrated Deployment (K8s only) (#45)
* support running backend + frontend together on k8s
* split nginx container into separate frontend service, which uses nginx-base image and the static frontend files
* add nginx-based frontend image to docker-compose build (for building only, docker-based combined deployment not yet supported)

* backend:
- fix paths for email templates
- chart: support '--set backend_only=1' and '--set frontend_only=1' to only force deploy one or the other
- run backend from root /api in uvicorn
2021-12-03 10:17:22 -08:00
Ilya Kreymer
081d6f8519
User Display Name Support + Token Refresh Support (#44)
* backend api/data model improvements:
- add 'name' property to user, can be set on registration, fixes #43
- in archive user list, include 'name' and 'role' for each user
- don't include is_* property in user create/register and update
- add /auth/jwt/refresh endpoint for refreshing token, fixes #34, support for #22

* allow jwt token lifetime to be settable via JWT_LIFETIME env var (default 3600)
2021-12-01 18:55:10 -08:00
Ilya Kreymer
d0b54dd752 Enable sending emails in K8S, trigger verification e-mail on registration. (#38)
* k8s: support email configuration
support sending reset password email
fix for #32

* fastapi users: update to latest (8.1.2)
send verification email upon registration

* update to latest fastapi-users(8.1.2), refactor to use UserManager class
ensure verification e-mail sent upon registration, w/o requiring separate apicall
fixes #32

* add email options to default chart/values.yaml

* separate usermanager init from fastapi users init, fix for sending invite emails
2021-11-30 23:50:38 -08:00
Ilya Kreymer
3d4d7049a2
Misc backend fixes for cloud deployment (#26)
* misc backend fixes:
- fix running w/o local minio
- ensure crawler image pull policy is configurable, loaded via chart value
- use digitalocean repo for main backend image (for now)
- add bucket_name to config only if using default bucket

* enable all behaviors, support 'access_endpoint_url' for default storages

* debugging: add 'no_delete_jobs' setting for k8s and docker to disable deletion of completed jobs
2021-11-25 11:58:26 -08:00
Ilya Kreymer
57a4b6b46f add collections api:
- collections defined by name per archive
- can update collections with additional metadata (currently just description)
- crawl config api accepts a list of collections by name, resolved to collection uids and stored in config
- finished crawls also associated with collection list
- /archives/{aid}/collections/{name} can list all crawl artifacts (wacz files) from a named collection (in frictionless data package-ish format)
- /archives/{aid}/collections/$all lists all crawled artifacts for the archive

readiness check: add /healthz endpoints for app and nginx
ingress: add /data/ route to local bucket

storage improvements:
- for default storages, store path only, and prepend default storage access endpoint
- collections api returns the paths using the storage access endpoint
- define default storages as secrets in k8s (can support multiple), hard-coded in docker (only one for now)
2021-10-27 09:39:14 -07:00
Ilya Kreymer
c38e0b7bf7 use redis based queue instead of url for crawl done webhook
update docker setup to support redis webhook, add consistent CRAWL_ARGS, additional fixes
2021-10-10 12:18:28 -07:00
Ilya Kreymer
4ae4005d74 add ingress + nginx container for better routing
support screencasting to dynamically created service via nginx (k8s only thus far)
add crawl /watch endpoint to enable watching, creates service if doesn't exist
add crawl /running endpoint to check if crawl is running
nginx auth check in place, but not yet enabled
add k8s nginx.conf
add missing chart files
file reorg: move docker config to configs/
k8s: add readiness check for nginx and api containers for smoother reloading
ensure service deleted along with job
todo: update dockerman with screencast support
2021-10-09 23:47:29 -07:00
Ilya Kreymer
19879fe349 Storage + Data Model Refactor (fixes #3):
- Add default vs custom (s3) storage
 - K8S: All storages correspond to secrets
 - K8S: Default storages inited via helm
 - K8S: Custom storage results in custom secret (per archive)
 - K8S: Don't add secret per crawl config
 - API for changing storage per archive
 - Docker: default storage just hard-coded from env vars (only one for now)
 - Validate custom storage via aiobotocore before confirming
 - Data Model: remove usage from users
 - Data Model: support adding multiple files per crawl for parallel crawls
 - Data Model: track completions for parallel crawls
 - Data Model: initial support for tags per crawl, add collection as 'coll' tag

README fixes
2021-10-09 18:58:40 -07:00
Ilya Kreymer
b6d1e492d7 add redis for storing crawl state data!
- supported in both docker and k8s
- additional pods with same job id automatically use same crawl state in redis
- support dynamic scaling (#2) via /scale endpoint - k8s job parallelism adjusted dynamically for running job (only supported in k8s so far)
2021-09-17 15:02:11 -07:00
Ilya Kreymer
223658cfa2 misc tweaks:
- better error handling for not found resources, ensure 404
- typo in k8smanager
- add pylintrc
- ensure manual jobs are deleted when complete
- fix typos, reformat
2021-08-25 18:34:49 -07:00
Ilya Kreymer
9a3356ad0d add missing scheduler! 2021-08-25 16:18:53 -07:00
Ilya Kreymer
36fb01cbdf docker-compose: use fixed network name 2021-08-25 16:04:34 -07:00
Ilya Kreymer
60b48ee8a6 dockermanager + scheduler:
- run as child process using aioprocessing
- cleanup: support cleanup of orphaned containers
- timeout: support crawlTimeout via check in cleanup loop
- support crawl listing + crawl stopping
2021-08-25 15:28:57 -07:00
Ilya Kreymer
b417d7c185 docker manager: support scheduling with apscheduler and separate 'scheduler' process 2021-08-25 12:21:03 -07:00
Ilya Kreymer
91e9fc8699 dockerman: initial pass
- support for creating, deleting crawlconfigs, running crawls on-demand
- config stored in volume
- listen to docker events and clean up containers when they exit
2021-08-24 22:49:06 -07:00
Ilya Kreymer
20b19f932f make crawlTimeout a per-crawlconfig property
allow crawl complete/partial complete to update existing crawl state, eg. timeout
enable handling backofflimitexceeded / deadlineexceeded failure, with possible success able to override the failure state
filter out only active jobs in running crawls listing
2021-08-24 11:29:15 -07:00
Ilya Kreymer
ed27f3e3ee job handling:
- job watch: add watch loop for job failure (backofflimitexceeded)
- set job retries + job timeout via chart values
- sigterm starts graceful shutdown by default, including for timeout
- use sigusr1 to switch to instant shutdown
- update stop_crawl() to use new semantics
2021-08-23 21:22:01 -07:00
Ilya Kreymer
7146e054a4 crawls work (#1):
- support listing existing crawls
- add 'schedule' and 'manual' annotations to jobs, store in Crawl obj
- ensure manual jobs are deleted when completed
- support deleting crawls by id (but not data)
- rename running crawl delete to '/cancel'

change paths for local minio/mongo to /tmp
2021-08-23 18:01:29 -07:00
Ilya Kreymer
66c4e618eb crawls work (#1), support for:
- canceling a crawl (via sigterm)
- stopping a crawl gracefully (via custom exec sigint)
2021-08-23 12:25:04 -07:00
Ilya Kreymer
a8255a76b2 crawljob:
- support run once on existing crawl job
- support updating/patching existing crawl job with new crawl config, new schedule and run once
2021-08-21 22:10:31 -07:00
Ilya Kreymer
ea9010bf9a add completed crawls to crawls table 2021-08-20 23:53:06 -07:00
Ilya Kreymer
4b08163ead support usage counters per archive, per user -- handle crawl completion 2021-08-20 23:05:42 -07:00
Ilya Kreymer
170958be37 rename crawls -> crawlconfigs.py
add crawls for crawl api management
2021-08-20 15:15:51 -07:00
Ilya Kreymer
f2d9d7ba6a new features:
- sending email for validation + invites, configured via env vars
- inviting new users to join an existing archive
- /crawldone webhook to track and verify crawl id (next: store crawl complete entry)
2021-08-20 11:02:29 -07:00
Ilya Kreymer
627e9a6f14 cleanup crawl config, add separate 'runNow' field
crawler: add cpu/memory limits
minio: auto-create bucket for local minio
2021-08-19 14:15:21 -07:00
Ilya Kreymer
eaa87c8b43 support for user roles (owner, crawler, viewer), owner users can issue invites to other existing users by email to join existing archives 2021-08-18 20:35:51 -07:00
Ilya Kreymer
61a608bfbe update models:
- replace storages with archives, which have a single storage (for now)
- crawls associated with archives
- users belong to archive, with one admin user (if archive created by default)
- update crawlconfig for latest browsertrix-crawler (0.4.4)
- k8s: fix permissions for crawler role
- k8s: fix minio service (now requiring two ports)
2021-08-18 16:53:49 -07:00
Ilya Kreymer
f77eaccf41 support committing to s3 storage
move mongo into separate optional deployment along with minio
support for configuring storages
support for deleting crawls, associated config and secrets
2021-07-02 15:56:24 -07:00
Ilya Kreymer
a111bacfb5 add k8s support
- working apis for adding crawls, removing crawls in mongo, mapped to k8s cronjobs
- more complete crawl spec
- option to start on-demand job from cronjobs
- optional minio in separate deployment/service
2021-06-30 21:48:44 -07:00
Ilya Kreymer
c3143df0a2 rename archives -> storages
add crawlconfig apis
run lint pass, prep for k8s / docker crawl manager support
2021-06-29 20:30:33 -07:00
Ilya Kreymer
b08a188fea initial commit! 2021-06-28 15:48:59 -07:00