browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	50c525853f	validation: ensure seed urls, and other url properties, are validated on POST by using pydantic HttpUrl type, fixes #277 (#278 )	2022-06-29 16:09:32 -07:00
Ilya Kreymer	b9d7907ab3	Single config and env vars (#267 ) * simplify back to single config.env! - back to good ole env vars! - remove shared secret, which made it difficult to have scheduled crawls, since secrets are immutable, so could not update config if a scheduled crawl existed :/ - all env vars unified in configs/config.env - run-swarm.sh and run-pod.sh 'source' this config - remove config.sample.yaml - customize minio volume dir via config.env - customize redis port via config.env - include authsign ports in debug-ports config	2022-06-16 21:50:03 -07:00
Ilya Kreymer	418c07bf0d	Local swarm + podman support (#261 ) * backend: refactor swarm support to also support podman (#260) - implement podman support as subclass of swarm deployment - podman is used when 'RUNTIME=podman' env var is set - podman socket is mapped instead of docker socket - podman-compose is used instead of docker-compose (though docker-compose works with podman, it does not support secrets, but podman-compose does) - separate cli utils into SwarmRunner and PodmanRunner which extends it - using config.yaml and config.env, both copied from sample versions - work on simplifying config: add docker-compose.podman.yml and docker-compose.swarm.yml and signing and debug configs in ./configs - add {build,run,stop}-{swarm,podman}.sh in scripts dir - add init-configs, only copy if configs don't exist - build local image use current version of podman, to support both podman 3.x and 4.x - additional fixes for after testing podman on centos - docs: update Deployment.md to cover swarm, podman, k8s deployment	2022-06-14 00:13:49 -07:00
Ilya Kreymer	5b6aa3bc95	Affinity + Tolerations + Cleanup Crawl Job (#256 ) * k8s: add tolerations for 'nodeType=crawling:NoSchedule' to allow scheduling crawling on designated nodes for crawler and profiles jobs and statefulsets * add affinity for 'nodeType=crawling' on crawling and profile browser statefulsets * refactor crawljob: combine crawl_updater logic into base crawl_job * increment new 'crawlAttemptCount' counter crawlconfig when crawl is started, not necessarily finished, to avoid deleting configs that had attempted but not finished crawls. * better external mongodb support: use MONGO_DB_URL to set custom url directly, otherwise build from username, password and mongo host	2022-06-10 19:21:37 -07:00
Ilya Kreymer	dee354f252	affinity: add affinity for k8s crawl deployments: - prefer deploy crawler, redis and job to same zone - prefer deploying crawler and job together via crawler node type, redis via redis node type (all optional)	2022-06-07 21:52:04 -07:00
Ilya Kreymer	21b1a87534	crawljob: detect crawl failure when all crawlers set their status to 'failed'	2022-06-07 21:48:58 -07:00
Ilya Kreymer	e3f268a2e8	CI setup for new swarm mode (#248 ) - build backend and frontend with caching (using GHA cache) - streamline frontend image to reduce layers - setup local swarm with test/setup.sh script, wait for containers to init - copy sample config files as default (add storages.sample.yaml) - add initial backend test for logging in with default superadmin credentials via 127.0.0.1:9871 - must use 127.0.0.1 instead of localhost for accessing frontend container within action	2022-06-06 09:34:02 -07:00
Ilya Kreymer	0c8a5a49b4	refactor to use docker swarm for local alternative to k8s instead of docker compose (#247 ): - use python-on-whale to use docker cli api directly, creating docker stack for each crawl or profile browser - configure storages via storages.yaml secret - add crawl_job, profile_job, splitting into base and k8s/swarm implementations - split manager into base crawlmanager and k8s/swarm implementations - swarm: load initial scale from db to avoid modifying fixed configs, in k8s, load from configmap - swarm: support scheduled jobs via swarm-cronjob service - remove docker dependencies (aiodocker, apscheduler, scheduling) - swarm: when using local minio, expose via /data/ route in nginx via extra include (in k8s, include dir is empty and routing handled via ingress) - k8s: cleanup minio chart: move init containers to minio.yaml - swarm: stateful set implementation to be consistent with k8s scaling: - don't use service replicas, - create a unique service with '-N' appended and allocate unique volume for each replica - allows crawl containers to be restarted w/o losing data - add volume pruning background service, as volumes can be deleted only after service shuts down fully - watch: fully simplify routing, route via replica index instead of ip for both k8s and swarm - rename network btrix-cloud-net -> btrix-net to avoid conflict with compose network	2022-06-05 10:37:17 -07:00
Ilya Kreymer	bf79959a5a	refactoring to use statefulsets + job (#245 ) - use statefulsets instead of deployments for mongo, redis, signer - use k8s job + statefulset for running crawls - use separate statefulset for crawl (scaled) and single-replica redis stateful set - move crawl job update logic to crawl_updater - remove shared redis chart package refactor: - move shared code to 'btrixcloud' - move k8s to 'btrixcloud.k8s' - move docker to 'btrixcloud.docker'	2022-06-05 10:37:17 -07:00
Ilya Kreymer	ae51114a45	backend: fix accessing signed urls when using local minio service - signing url with endpoint_url instead of access_endpoint_url, but replace endpoint_url prefix with access_endpoint_url for access. - keep existing behavior of signing access_endpoint_url only if SIGN_ACCESS_ENDPOINT env var is set	2022-06-04 08:29:57 -07:00
sua yoo	6a78bcd4aa	Delete browser profile (#243 ) - delete browser profile, if not in use - if in use, show error message, listing crawl configs that use the profile - backend: fix check for confirming profile deletion	2022-06-01 19:18:41 -07:00
Ilya Kreymer	c023fe7c9a	Backend API prefix (#240 ) * apply /api prefix consistently, both directly through backend and when accessing via frontend, fixes #236 * docs: update local deployment docs to use 9871 instead of 8000, don't expose 8000 by default * schemas: don't include /openapi.json as /healthz in documentation, keep /healthz at root * k8s: route backend to /api without additional rewriting	2022-05-31 19:29:20 -07:00
Ilya Kreymer	3df310ee4f	Backend: Crawls with Multiple WACZ files + Profile + Misc Fixes (#232 ) * backend: k8s: - support crawls with multiple wacz files, don't assume crawl complete after first wacz uploaded - if crawl is running and has wacz file, still show as running - k8s: allow configuring node selector for main pods (eg. nodeType=main) and for crawlers (eg. nodeType=crawling) - profiles: support uploading to alternate storage specified via 'shared_profile_storage' value, if set - misc fixes for profiles * backend: ensure docker run_profile api matches k8s k8s chart: don't delete pvc and pv in helm chart * dependency: bump authsign to 0.4.0 docker: disable public redis port * profiles: fix path, profile browser return value * fix typo in presigned url caching	2022-05-19 18:40:41 -07:00
Ilya Kreymer	ff42785410	Profiles Backend (part 2) (#224 ) * profiles: api update: - support profile deletion - support listing crawlconfigs using a profile - support using a browser to update existing profile or create new one - cleanup: move profile creation to POST, profile updates to PATCH endpoints - support updating just profile name or description - add new /navigate api to navigate browser	2022-04-24 10:23:52 -07:00
Ilya Kreymer	2f63c7dcf8	Profiles: Backend API + Nginx Devtools Proxy Support (#212 ) * add profile creation, list endpoints at /archives/<aid>/profiles * add profile browser creation, get, ping, commit, delete endpoints at /archives/<aid>/profiles/browser * support creation of profile browser using browsertrix-crawler 'create-login-profile' in docker and k8s * ensure profile browser expires after set time, k8s job or docker container automatically deleted on exit * profile browser creation returns temporary browser id, or `{"detail": "waiting_for_browser"}` while waiting for browser container init * nginx frontend: proxy /loadbrowser/ to port 9223 in browsertrix-crawler, connecting directly to chrome devtools * profile api auth: use redis for auth - store browserid->archiveid and browserid->browser ip mapping in redis - browser apis: ensure profile browser is associated with specified archive - browser ws: pass archiveid and browserid to ws query args, browserid is part of archive, and browserid corresponds to specified ip * store profiles in /profiles/ directory in default storage, include profileid in profile tar.gz filename * support profile in crawlconfig: - add profileid to CrawlConfig, and profileName to CrawlConfigOut - support resolving profile path via profileid, setting '--profile @{path/to/profile.tar.gz}' for crawler (assuming same storage for profile as output for now) in both docker and k8s setups - docker: support out_filename, custom wacz output filename missing functionality	2022-04-13 19:36:06 -07:00
Ilya Kreymer	9a6483630e	Support for Admin interface for viewing web archives (#198 ) * backend api - superadmin has admin access to all archives - new superadmin endpoints: /archives/all/crawls and /archives/all/crawls/<crawl_id>.json for list all running crawls and loading crawl data by id - frontend superadmin view (fixes #201) * show all archives on superadmin home page * show jump to crawl for super admin (#200) * navbar links for: all archives, all running crawls and jump to crawl Co-authored-by: sua yoo <sua@suayoo.com>	2022-04-06 12:42:04 -07:00
Ilya Kreymer	aa83d51f7a	k8s backend improvements: (#205 ) - add liveness probe for crawls, configurable via 'crawler_liveness_port' - add User system:anonymous permissions - treat jobs that have exceeded total as 'partial_complete' (experimental)	2022-03-30 14:39:06 -07:00
Ilya Kreymer	9e45dc35d2	minor frontend-tweaks: (#196 ) * frontend-tweaks: - treat 'starting' state same as 'running' - default to no schedule instead of weekly for default - add 'Domain' scopeType * backend: also allow 'domain' as a scopeType	2022-03-15 21:19:23 -07:00
Ilya Kreymer	e6467c3374	backend work: - support {configname}-{username}-@ts-@hostsuffix.wacz as output filename, sanitize username and config name - support returning 'starting' for crawl status if no ips or 0/0 pages found. - fix updating scale via POST crawlconfig update - fix duplicate user error on superuser init	2022-03-15 18:20:25 -07:00
Ilya Kreymer	4b2f89db91	k8s: support for using a pre-made persistent volume/claim for crawling, configurable via CRAWLER_PV_CLAIM, otherwise using emptyDir k8s: ability to set deployment scale for frontend as well	2022-03-15 11:18:23 -07:00
Ilya Kreymer	8ce7a9802b	backend quick fix: chart/config: use screencastPort, fixed collection name k8s: set pod to never restart to see logs	2022-03-14 11:42:53 -07:00
Ilya Kreymer	9c99d67b1d	quickfix: backend: docker: fix loading ips for watch	2022-03-04 17:12:19 -08:00
Ilya Kreymer	fb51f8e33e	Mongo auth fix (#190 ) * backend: makes mongo auth configurable! use mongo_auth secret in k8s and set env vars in docker fixes #177 * docker: update config.sample.env: use ws screencast by default, add NO_DELETE_ON_FAIL option, extend default login lifetime	2022-03-04 15:04:33 -08:00
Ilya Kreymer	cdd0ab34a3	Watch Stream Directly from Browsertrix Crawler (#189 ) * watch work: proxy directly to crawls instead of redis pubsub - add 'watchIPs' to crawl detail output - cache crawl ips for quick access for auth - add '/ipaccess/{ip}' endpoint for watch ws connection to ensure ws has access to the specified container ip - enable 'auth_request' in nginx frontend - requirements: update to latest redis-py remaining fixes for #134	2022-03-04 14:55:11 -08:00
Ilya Kreymer	51a573ef1f	backend prod settings: - set WEB_CONCURRENCY env var to configure number of backend api workers for both docker and k8s - set via 'backend_workers' in values.yaml - also add 'rwp_base_url' to values.yaml - update containers to use public webrecorder/browsertrix-backend and webrecorder/browsertrix-frontend containers - make liveness, readiness and startup health checks more tolerant	2022-02-28 18:09:13 -08:00
Ilya Kreymer	84a9079b1f	support signing in docker deployment: (#166 ) - add authsign to docker-compose.yml - add signing.sample.yaml to be copied to signing.yaml for authsign - add WACZ_SIGN_URL and WACZ_SIGN_TOKEN to config.sample.env - signing enabled if WACZ_SIGN_URL is set - add instructions on how to enable signing to Deployment - update .gitignore, don't commit 'signing.yaml' - update images to use public repo browsertrix images	2022-02-28 14:32:19 -08:00
Ilya Kreymer	9bd402fa17	New WS Endpoint for Watching Crawl (#152 ) * backend support for new watch system (#134): - support for watch via redis pubsub and websocket connection to backend - can support watch from any number of crawler instances to support scaled crawls - use /archives/{aid}/crawls/{crawl_id}/watch/ws websocket endpoint - ws: ignore graceful connectionclosedok exception, log other exceptions - set logging to info instead of debug for now (debug logs all ws traffic) - remove old watch apis in backend - remove old websocket routing to crawler instance for old watch system - oauth bearer check: support websockets, use websocket object if no request object - crawler args: replace --screencastPort with --screencastRedis	2022-02-22 10:33:10 -08:00
Ilya Kreymer	aa5207915c	backend: fix crawl config revision links (#149 ) backend: crawlconfig: - ensure newId is saved on old config being replaced - if old config replaced is being deleted, ensure newId link is set on its old config (if any), and the oldId points to the oldId of config being replaced (if any)	2022-02-21 16:51:27 -08:00
Ilya Kreymer	ee68a2f64e	Support for setting scale in crawlconfig (#148 ) * backend: scale support: - add 'scale' field to crawlconfig - support updating 'scale' field in crawlconfig patch - add constraint for crawlconfig and crawl scale (currently 1-3)	2022-02-20 11:27:47 -08:00
Ilya Kreymer	d05f04be9f	Crawl Config Editing Support (#141 ) * support inactive configs in same collection, configs with `inactive` set to true (#137) - add `inactive`, `newId`, `oldId` to crawlconfigs - filter out inactive configs by default for most operations - add index for aid + inactive field for faster querying - delete returns status: 'deactivated' or 'deleted' - if no crawls ran, config can be deleted, otherwise it is deactivated * update crawl endpoint: add general PATCH crawl config endpoint, support updating schedule and name	2022-02-17 16:04:07 -08:00
Ilya Kreymer	d28ebcc7b6	backend: crawlconfig: don't pass default settings to crawlconfig to avoid redundant settings, use browsertrix-crawler defaults when config not set	2022-02-14 18:47:52 -08:00
Ilya Kreymer	ca85edc8b3	backend: resource limits: - set resource mem and cpu requests/limits for all used services (not minio for now) - add readiness probe to redis, mongo - adjust crawler limits, set via configmap	2022-02-08 19:53:41 -08:00
Ilya Kreymer	71842be94a	backend: k8s setup minor tweaks: - add 'emptyDir' volume for crawl directory (to allow any pod restarts to have access to the data) - rename minio and redis volumes to avoid any confusion - add pod termination grace-period (default to 600 secs)	2022-02-08 15:52:57 -08:00
Ilya Kreymer	8acb43b171	backend: use redis to mark crawls as canceled immediately, avoid dupes in crawl list (even if paging is added for db results)	2022-02-01 15:58:56 -08:00
Ilya Kreymer	4b7522920a	backend: k8s: fix finished check, resource limits increase	2022-02-01 15:07:20 -08:00
Ilya Kreymer	b3f21932fc	backend: k8s: list running jobs tweak: if succeeded jobs == number of parallel jobs, filter out from list, assume finished and not stopping	2022-02-01 00:05:13 -08:00
Ilya Kreymer	2b2e6fedfa	Misc backend fixes (#133 ) * misc backend fixes: - fix uuid typing: roles list, user invites - crawlconfig: fix created date setting, fix userName lookup - docker: fix timezone for scheduler, fix running check - remove prints - fix get crawl stuck in 'stopping' - check finished list first, then run list (in case k8s job has not been deleted)	2022-01-31 19:41:04 -08:00
Ilya Kreymer	adb5c835f2	Presign and replay (#127 ) * support for replay via replayweb.page embed, fixes #124 backend: - pre-sign all files urls - cache pre-signed urls in redis, presign again when expired (default duration 3600, settable via PRESIGN_DURATION_SECONDS env var) - change files output -> resources to conform to Data Package spec supported by replayweb.page - add CrawlFileOut which contains 'name' (file id), 'path' (presigned url), 'hash', and 'size' - add /replay/sw.js endpoint to import sw.js from latest replay-web-page release - update to fastapi-users 9.2.2 - customize backend auth to allow authentication to check 'auth_bearer' query arg if 'Authorization' header not set - remove sw.js endpoint, handling in frontend frontend: - add <replay-web-page> to frontend, include rwp ui.js from latest release in index.html for now - update crawl api endpoint to end in json - replay-web-page loads the api endpoint directly! - update Crawl type to use new format, 'resources' instead of 'files', each file has 'name' and 'path' - nginx: add endpoint to serve the replay sw.js endpoint - add defer attr to ui.js - move 'Download' to 'Download Files' * frontend: support customizing replayweb.page loading url via RWP_BASE_URL env var in Dockerfile - default prod value set in frontend Dockerfile (set to upcoming 1.5.8 release needed for multi-wacz-file support) (can be overridden during image build via --build-arg) - rename index.html -> index.ejs to allow interpolation - RWP_BASE_URL defaults to latest https://replayweb.page/ for testing - for local testing, add sw.js loading via devServer, also using RWP_BASE_URL (#131) Co-authored-by: sua yoo <sua@suayoo.com>	2022-01-31 17:02:15 -08:00
Ilya Kreymer	f569125a3d	storage: support loading default storage from crawl managers (#126 ) support s3-compatible presigning with default storage backend support for #120	2022-01-31 11:22:03 -08:00
Ilya Kreymer	523b557eac	replay route: (prepare for replay, #124 ) - add support for /replay/sw.js - ensure route works in both k8s and docker (routed via main nginx)	2022-01-31 11:18:10 -08:00
Ilya Kreymer	be86505347	backend: crawls api: better fix for graceful stop - k8s: don't use redis, set to 'stopping' if status.active is not set, toggled immediately on delete_job - docker: set custom redis key to indicate 'stopping' state (container still running) - api: remove crawl is_running endpoint, redundant with general get crawl api	2022-01-30 22:01:00 -08:00
Ilya Kreymer	542680daf7	backend fixes: fix graceful stop + stats (#122 ) * backend fixes: fix graceful stop + stats - use redis to track stopping state, to be overwritten when finished - also include stats in completed crawls - docker: use short container id for crawl id - graceful stop returns 'stopping_gracefully' instead of 'stopped_gracefully' - don't set stopping state when complete! - beginning files support: resolve absolute urls for crawl detail (not pre-signing yet)	2022-01-30 18:58:47 -08:00
Ilya Kreymer	bcbc40059e	Refactor backend data model to support UUID (fixes #118 ) (#119 ) * uuid fix: (fixes #118) - update all mongo models to use UUID type as main '_id' (users continue to use 'id' as defined by fastapi-users) - update all foreign doc references to use UUID instead of string - api handlers convert str->uuid as needed api fix: - fix single crawl api, add CrawlOut response model - fix collections api - fix standalone-docker apis - for manual job, set user to current user, overriding the setting from crawlconfig * additional fixes: - rename username -> userName to indicate not the login 'username' - rename user -> userid, archive -> aid for crawlconfig + crawls - ensure invites correctly convert str -> uuid as needed - filter out unset values from browsertrix-crawler config * convert remaining user -> userid variables ensure archive id is passed to crawl_manager as str (via archive.id_str) * remove bulk crawlconfig delete * add support for `stopping` state when gracefully stopping crawl * for get crawl endpoint, check stopped crawls first, then running	2022-01-29 19:00:11 -08:00
Ilya Kreymer	9499ebfbba	Crawls API improvements (#117 ) * crawls api improvements (fixes #110) - add GET /crawls/{crawlid} api to return single crawl - resolve crawlconfig name, add as `configName` to crawl model - add 'created' date for crawlconfigs - flatten list to single 'crawls' list, instead of separate 'finished' and 'running' (running crawls added first) - include 'fileCount' and 'fileSize', remove files - remove `files` from crawl list response, also remove `aid` - remove `schedule` from crawl data altogether, (available in crawl config) - add ListCrawls response model	2022-01-29 12:08:02 -08:00
Ilya Kreymer	01ad7e656f	quickfix: for /cancel immediate crawl cancelation, send SIGABRT instead of SIGUSR1	2022-01-27 20:45:03 -08:00
Ilya Kreymer	0bea0cfff2	crawl config new template: add support for 'extraHops' config option (available in browsertrix-crawler 0.5.0) (#104 ) frontend: - add checkbox to basic crawl config component which sets 'extraHops' to 1, otherwise to 0 - text tweaks: rename Scope Type -> Crawl Scope, capitalization backend: add 'extraHops' to CrawlConfig fixes #102	2022-01-26 21:18:22 -08:00
Ilya Kreymer	f55f84c60b	backend: - crawlconfigs cleanup: simplify get_crawl_configs api - return CrawlConfigOut for single crawlconfig api endpoint, include currCrawlId	2022-01-22 17:41:37 -08:00
Ilya Kreymer	77aa5213f2	quickfix: typo fix, return config, not archive, fixes #96	2022-01-22 17:21:29 -08:00
Ilya Kreymer	b506442b21	backend api: add curr crawl to crawlconfig listing (#95 ) * backend api: add current crawl id to crawlconfig listing - model: add 'currCrawlId' to CrawlConfig model - output: add response model to /crawlconfigs api response to show correct openapi model - rename crawl_configs -> crawlConfigs for consistency	2022-01-22 13:52:46 -08:00
Ilya Kreymer	88f1689e0e	crawlconfig: add 'name' property to crawl config superuser init: don't check invite token for verified superuser (automatic init) fix formatting	2022-01-15 19:06:48 -08:00

86 Commits