Commit Graph

16 Commits

Author SHA1 Message Date
Ilya Kreymer
0c8a5a49b4 refactor to use docker swarm for local alternative to k8s instead of docker compose (#247):
- use python-on-whale to use docker cli api directly, creating docker stack for each crawl or profile browser
- configure storages via storages.yaml secret
- add crawl_job, profile_job, splitting into base and k8s/swarm implementations
- split manager into base crawlmanager and k8s/swarm implementations
- swarm: load initial scale from db to avoid modifying fixed configs, in k8s, load from configmap
- swarm: support scheduled jobs via swarm-cronjob service
- remove docker dependencies (aiodocker, apscheduler, scheduling)
- swarm: when using local minio, expose via /data/ route in nginx via extra include (in k8s, include dir is empty and routing handled via ingress)
- k8s: cleanup minio chart: move init containers to minio.yaml
- swarm: stateful set implementation to be consistent with k8s scaling:
  - don't use service replicas,
  - create a unique service with '-N' appended and allocate unique volume for each replica
  - allows crawl containers to be restarted w/o losing data
- add volume pruning background service, as volumes can be deleted only after service shuts down fully
- watch: fully simplify routing, route via replica index instead of ip for both k8s and swarm
- rename network btrix-cloud-net -> btrix-net to avoid conflict with compose network
2022-06-05 10:37:17 -07:00
Ilya Kreymer
bf79959a5a refactoring to use statefulsets + job (#245)
- use statefulsets instead of deployments for mongo, redis, signer
- use k8s job + statefulset for running crawls
- use separate statefulset for crawl (scaled) and single-replica redis stateful set
- move crawl job update login to crawl_updater
- remove shared redis chart

package refactor:
- move to shared code to 'btrixcloud'
- move k8s to 'btrixcloud.k8s'
- move docker to 'btrixcloud.docker'
2022-06-05 10:37:17 -07:00
Ilya Kreymer
cdd0ab34a3
Watch Stream Directly from Browsertrix Crawler (#189)
* watch work: proxy directly to crawls instead of redis pubsub
- add 'watchIPs' to crawl detail output
- cache crawl ips for quick access for auth
- add '/ipaccess/{ip}' endpoint for watch ws connection to ensure ws has access to the specified container ip
- enable 'auth_request' in nginx frontend
- requirements: update to latest redis-py
remaining fixes for #134
2022-03-04 14:55:11 -08:00
Ilya Kreymer
9bd402fa17
New WS Endpoint for Watching Crawl (#152)
* backend support for new watch system (#134):
- support for watch via redis pubsub and websocket connection to backend
- can support watch from any number of crawler instances to support scaled crawls
- use /archives/{aid}/crawls/{crawl_id}/watch/ws websocket endpoint
- ws: ignore graceful connectionclosedok exception, log other exceptions
- set logging to info to instead of debug for now (debug logs all ws traffic)
- remove old watch apis in backend
- remove old websocket routing to crawler instance for old watch system
- oauth bearer check: support websockets, use websocket object if no request object
- crawler args: replace --screencastPort with --screencastRedis
2022-02-22 10:33:10 -08:00
Ilya Kreymer
adb5c835f2
Presign and replay (#127)
* support for replay via replayweb.page embed, fixes #124

backend:
- pre-sign all files urls
- cache pre-signed urls in redis, presign again when expired (default duration 3600, settable via PRESIGN_DURATION_SECONDS env var)
- change files output -> resources to confirm to Data Package spec supported by replayweb.page
- add CrawlFileOut which contains 'name' (file id), 'path' (presigned url), 'hash', and 'size'
- add /replay/sw.js endpoint to import sw.js from latest replay-web-page release
- update to fastapi-users 9.2.2
- customize backend auth to allow authentication to check 'auth_bearer' query arg if 'Authorization' header not set
- remove sw.js endpoint, handling in frontend

frontend:
- add <replay-web-page> to frontend, include rwp ui.js from latest release in index.html for now
- update crawl api endpoint to end in json
- replay-web-page loads the api endpoint directly!
- update Crawl type to use new format, 'resources' -> instead of 'files', each file has 'name' and 'path'

- nginx: add endpoint to serve the replay sw.js endpoint
- add defer attr to ui.js
- move 'Download' to 'Download Files'

* frontend: support customizing replayweb.page loading url via RWP_BASE_URL env var in Dockerfile
- default prod value set in frontend Dockerfile (set to upcoming 1.5.8 release needed for multi-wacz-file support) (can be overridden during image build via --build-arg)
- rename index.html -> index.ejs to allow interpolation
- RWP_BASE_URL defaults to latest https://replayweb.page/ for testing
- for local testing, add sw.js loading via devServer, also using RWP_BASE_URL (#131)

Co-authored-by: sua yoo <sua@suayoo.com>
2022-01-31 17:02:15 -08:00
Ilya Kreymer
eaf8055063
Support unified docker + k8s deployment (#58)
- adapt nginx config to work both in docker and k8s, using env vars to set urls

backend: additional fixes:
- use env vars with nginx config
- fix settings api route
- when sending e-mail, use the Host header for verification urls when available
- prepare Dockerfile with full build from scratch in image, (disabled 'yarn install' for faster builds for now)
- fix accept invite api for existing user to /archives/accept-invite/{token}
2021-12-05 13:02:26 -08:00
Ilya Kreymer
d0b54dd752 Enable sending emails in K8S, trigger verification e-mail on registration. (#38)
* k8s: support email configuration
support sending reset password email
fix for #32

* fastapi users: update to latest (8.1.2)
send verification email upon registration

* update to latest fastapi-users(8.1.2), refactor to use UserManager class
ensure verification e-mail sent upon registration, w/o requiring separate apicall
fixes #32

* add email options to default chart/values.yaml

* separate usermanager init from fastapi users init, fix for sending invite emails
2021-11-30 23:50:38 -08:00
Ilya Kreymer
19879fe349 Storage + Data Model Refactor (fixes #3):
- Add default vs custom (s3) storage
 - K8S: All storages correspond to secrets
 - K8S: Default storages inited via helm
 - K8S: Custom storage results in custom secret (per archive)
 - K8S: Don't add secret per crawl config
 - API for changing storage per archive
 - Docker: default storage just hard-coded from env vars (only one for now)
 - Validate custom storage via aiobotocore before confirming
 - Data Model: remove usage from users
 - Data Model: support adding multiple files per crawl for parallel crawls
 - Data Model: track completions for parallel crawls
 - Data Model: initial support for tags per crawl, add collection as 'coll' tag

README fixes
2021-10-09 18:58:40 -07:00
Ilya Kreymer
b6d1e492d7 add redis for storing crawl state data!
- supported in both docker and k8s
- additional pods with same job id automatically use same crawl state in redis
- support dynamic scaling (#2) via /scale endpoint - k8s job parallelism adjusted dynamically for running job (only supported in k8s so far)
2021-09-17 15:02:11 -07:00
Ilya Kreymer
60b48ee8a6 dockermanager + scheduler:
- run as child process using aioprocessing
- cleanup: support cleanup of orphaned containers
- timeout: support crawlTimeout via check in cleanup loop
- support crawl listing + crawl stopping
2021-08-25 15:28:57 -07:00
Ilya Kreymer
b417d7c185 docker manager: support scheduling with apscheduler and separate 'scheduler' process 2021-08-25 12:21:03 -07:00
Ilya Kreymer
91e9fc8699 dockerman: initial pass
- support for creating, deleting crawlconfigs, running crawls on-demand
- config stored in volume
- list to docker events and clean up containers when they exit
2021-08-24 22:49:06 -07:00
Ilya Kreymer
627e9a6f14 cleanup crawl config, add separate 'runNow' field
crawler: add cpu/memory limits
minio: auto-create bucket for local minio
2021-08-19 14:15:21 -07:00
Ilya Kreymer
a111bacfb5 add k8s support
- working apis for adding crawls, removing crawls in mongo, mapped to k8s cronjobs
- more complete crawl spec
- option to start on-demand job from cronjobs
- optional minio in separate deployment/service
2021-06-30 21:48:44 -07:00
Ilya Kreymer
c3143df0a2 rename archives -> storages
add crawlconfig apis
run lint pass, prep for k8s / docker crawl manager support
2021-06-29 20:30:33 -07:00
Ilya Kreymer
b08a188fea initial commit! 2021-06-28 15:48:59 -07:00