Commit Graph

14 Commits

Author SHA1 Message Date
Ilya Kreymer
ccd87e0dff
Rename api / nginx settings -> backend / frontend, set pull policy for job images (#504)
* rename config values
- api -> backend
- nginx -> frontend

* job pods:
- set job_pull_policy from api_pull_policy (same as backend image)
- default to Always, but can be overridden for local deployment

typo fix: CRAWL_NAMESPACE -> CRAWLER_NAMESPACE (part of #491)
ansible: set default label to :latest instead of :dev
2023-01-18 20:21:36 -08:00
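A minimal sketch of the pull-policy fallback this commit describes, assuming the renamed chart values reach the job as environment variables (the variable names here are illustrative):

```python
import os

def get_job_pull_policy() -> str:
    """Resolve the image pull policy for job pods.

    job_pull_policy follows the backend image's policy: defaults to
    "Always", but can be overridden (e.g. "Never") for local deployments.
    """
    # BACKEND_PULL_POLICY stands in for the renamed api_pull_policy value
    return os.environ.get("JOB_PULL_POLICY") or os.environ.get(
        "BACKEND_PULL_POLICY", "Always"
    )
```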
Tessa Walsh
0fa60ebc45
Rename archives/teams -> orgs in codebase + add db migration (#486)
* Rename archives to orgs and aid to oid on backend

* Rename archive to org and aid to oid in frontend

* Remove translation artifact

* Rename team -> organization

* Add database migrations and run once on startup

* Also apply the new by_one_worker decorator to other asyncio tasks to prevent heavy tasks from being run in each worker

* Run black, pylint, and husky via pre-commit

* Set db version and use in migrations

* Update and prepare database in single task

* Migrate k8s configmaps
2023-01-18 14:51:04 -08:00
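A hedged sketch of the run-once startup migration described above, assuming a Motor (async MongoDB) handle; the collection names, field names, and version number are illustrative, not the project's actual schema:

```python
CURRENT_DB_VERSION = 2  # hypothetical version bumped by this migration

async def run_db_migrations(mdb) -> None:
    """Run pending migrations exactly once on startup (e.g. guarded by
    a by_one_worker-style decorator so only one worker executes them)."""
    version_doc = await mdb["version"].find_one({"_id": "version"}) or {}
    version = int(version_doc.get("version", 1))

    if version < 2:
        # rename archives -> orgs and aid -> oid across collections
        await mdb["archives"].rename("orgs")
        await mdb["crawls"].update_many({}, {"$rename": {"aid": "oid"}})

    await mdb["version"].update_one(
        {"_id": "version"},
        {"$set": {"version": CURRENT_DB_VERSION}},
        upsert=True,
    )
```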
Ilya Kreymer
2daa742585
Copy tags from crawlconfig to crawl (#467), fixes #466
- add tags to crawl object
- ensure tags are copied from crawlconfig to crawl when crawl is created (both manually and scheduled)
- tests: add test to ensure tags are added to the crawl; remove redundant wait, replacing it with fixtures
2023-01-12 17:46:19 -08:00
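A small sketch of the tag copy at crawl-creation time; the field names are assumptions based on the message above:

```python
from datetime import datetime

def new_crawl_from_config(crawlconfig, manual: bool = True) -> dict:
    """Create a crawl record, copying tags from its crawl config.

    The same copy happens for both manual and scheduled crawls.
    """
    return {
        "cid": crawlconfig.id,
        "manual": manual,
        "started": datetime.utcnow(),
        # copy, don't share, so later config edits don't mutate the crawl
        "tags": list(crawlconfig.tags or []),
    }
```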
Ilya Kreymer
30bda8c75d
VNC-Based Profile Browser (#433)
* profile browser vnc support + fixes:
- switch profile browser rendering to use VNC
- frontend: add @novnc/novnc as dependency, create separate bundle novnc.js to load into vnc browser (to avoid loading from each container)
- frontend: update proxy paths to proxy websocket, index page to crawler
- frontend: allow browser profiles in all browsers, remove browser compatibility check
- frontend: update webpack dev config, apply prettier
- frontend: node version fix
- backend: get vncpassword, build new URL for proxying to crawler iframe
- backend: change profile / crawl job pull policy from 'Always' to 'Never', so the job uses the existing image
- backend: fix kill signal to use bash -c to work with latest backend image
- backend/chart: add 'profile_browser_timeout_seconds' to chart values to control how long an idle profile browser remains open (defaults to 60)
- backend: remove utils.py, now using secrets.token_hex() for random suffix
Co-authored-by: sua yoo <sua@suayoo.com>
2023-01-10 14:42:42 -08:00
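A hedged sketch of two backend pieces named above: the secrets.token_hex() suffix that replaced utils.py, and building the URL the frontend proxies to the crawler iframe (the path layout and parameter name are assumptions):

```python
from secrets import token_hex

def random_suffix() -> str:
    """Random suffix for per-browser resources (replaces the utils.py helper)."""
    return token_hex(5)

def profile_browser_url(browser_id: str, vnc_password: str) -> str:
    """Build the URL for the VNC-based profile browser.

    The frontend proxies both this index page and the websocket to the
    crawler; the page itself loads the separate novnc.js bundle.
    """
    return f"/browser/{browser_id}/?password={vnc_password}"
```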
Ilya Kreymer
793611e5bb
add exclusion api, fixes #311 (#349)
* add exclusion api, fixes #311
add new apis: `POST crawls/{crawl_id}/exclusion?regex=...` and `DELETE crawls/{crawl_id}/exclusion?regex=...` which will:
- create a new config with 'regex' added as an exclusion (deleting or making inactive the previous config), OR remove it as an exclusion
- update crawl to point to new config
- update statefulset to point to new config, causing crawler pods to restart
- filter out URLs matching 'regex' from both the queue and the seen list (when adding only; currently a bit slow)
- return 400 if the exclusion already exists when adding, or doesn't exist when removing
- api reads the redis list in reverse to match how the exclusion queue is used
2022-11-12 17:24:30 -08:00
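An illustrative FastAPI sketch of the two endpoints; the CrawlOps helper is a hypothetical stand-in for the real ops layer, not the project's actual code:

```python
from fastapi import APIRouter, HTTPException

router = APIRouter()

class CrawlOps:
    """Hypothetical ops layer: clones the config with the exclusion
    added/removed, repoints the crawl and statefulset, and filters
    matching URLs out of the redis queue and seen list."""

    async def has_exclusion(self, crawl_id: str, regex: str) -> bool: ...
    async def set_exclusion(self, crawl_id: str, regex: str, add: bool) -> None: ...

ops = CrawlOps()

@router.post("/crawls/{crawl_id}/exclusion")
async def add_exclusion(crawl_id: str, regex: str):
    # regex is a query param, matching POST crawls/{crawl_id}/exclusion?regex=...
    if await ops.has_exclusion(crawl_id, regex):
        raise HTTPException(status_code=400, detail="exclusion_already_exists")
    await ops.set_exclusion(crawl_id, regex, add=True)
    return {"success": True}

@router.delete("/crawls/{crawl_id}/exclusion")
async def remove_exclusion(crawl_id: str, regex: str):
    if not await ops.has_exclusion(crawl_id, regex):
        raise HTTPException(status_code=400, detail="exclusion_not_found")
    await ops.set_exclusion(crawl_id, regex, add=False)
    return {"success": True}
```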
Ilya Kreymer
d340bceb39 style pass: normalize docstring spacing 2022-10-19 21:47:34 -07:00
lasztoth
5ccacd8a16
Fixed issue with missing variables affecting deployments on Podman (#338) 2022-10-14 18:06:51 -07:00
Ilya Kreymer
2842ca6a06 quick fix: add back USER_ID to crawlconfig config map, needed by crawler pod 2022-08-11 17:33:12 -07:00
Ilya Kreymer
3859be009a k8s: don't add the entire crawl config as env vars from the configmap; add only specified env vars from the configmap
fixes an issue where crawls with a large number of seeds failed due to an unusually large env var
2022-08-11 15:33:22 -07:00
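A sketch of the fix: reference individual configmap keys in the pod spec instead of injecting the whole configmap, so the (potentially huge) seed list never becomes an env var. The key names here are illustrative:

```python
# only these keys become env vars; the full crawl config stays out
WANTED_KEYS = ("STORE_ENDPOINT_URL", "STORE_PATH", "USER_ID")

def env_from_configmap(configmap_name: str) -> list:
    """Build an explicit env list of configMapKeyRef entries,
    rather than envFrom on the entire configmap."""
    return [
        {
            "name": key,
            "valueFrom": {
                "configMapKeyRef": {"name": configmap_name, "key": key}
            },
        }
        for key in WANTED_KEYS
    ]
```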
Ilya Kreymer
2717a60763
improvements / bug fixes for stop/cancel handling: (#279)
- only send signal if stopping, no need for canceling as pods/containers will be removed
- refactor stop/cancel handling to be unified in manager, separate in job
- when stopping / graceful shutdown, return false if sending signal fails
- return success=true in the json response if and only if stop/cancel actually succeeds; otherwise return an 'error' message, should fix #270
- allow canceling after stopping / if stopping fails
- ensure finished time is set in case of cancellation before the crawl starts, should fix #273
2022-06-29 17:47:25 -07:00
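A sketch of the unified handling described above; the function names are assumptions, and the one-line stubs stand in for the manager/job split:

```python
async def shutdown_crawl(crawl_id: str, graceful: bool) -> dict:
    """Stop (graceful=True) or cancel (graceful=False) a crawl."""
    if graceful:
        # stopping: signal the crawler, and report failure honestly
        # (canceling is still allowed afterwards if this fails)
        if not await send_stop_signal(crawl_id):
            return {"success": False, "error": "error_stopping_crawl"}
    else:
        # canceling: no signal needed, pods/containers are removed;
        # finished time is set even if the crawl never started
        await delete_crawl_objects(crawl_id)
        await mark_finished(crawl_id, state="canceled")
    return {"success": True}

async def send_stop_signal(crawl_id: str) -> bool: return True  # stub
async def delete_crawl_objects(crawl_id: str) -> None: ...  # stub
async def mark_finished(crawl_id: str, state: str) -> None: ...  # stub
```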
Ilya Kreymer
b9d7907ab3
Single config and env vars (#267)
* simplify back to single config.env!
- back to good ole env vars!
- remove the shared secret, which made scheduled crawls difficult: since secrets are immutable, the config could not be updated while a scheduled crawl existed :/
- all env vars unified in configs/config.env - run-swarm.sh and run-pod.sh 'source' this config
- remove config.sample.yaml
- customize minio volume dir via config.env
- customize redis port via config.env
- include authsign ports in debug-ports config
2022-06-16 21:50:03 -07:00
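A minimal sketch of loading the unified config into a process, mirroring what run-swarm.sh and run-pod.sh do when they 'source' the file:

```python
import os

def load_config_env(path: str = "configs/config.env") -> None:
    """Load KEY=value lines from config.env into the environment."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # values already set in the shell take precedence
            os.environ.setdefault(key.strip(), value.strip())
```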
Ilya Kreymer
dee354f252 affinity: add affinity for k8s crawl deployments:
- prefer deploying crawler, redis and job to the same zone
- prefer deploying crawler and job together via crawler node type, redis via redis node type (all optional)
2022-06-07 21:52:04 -07:00
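A sketch of the "preferred" (soft, optional) affinity in k8s API form; the nodeType label key is an assumption, and the same-zone preference would be a similar preferred pod-affinity term:

```python
def preferred_node_affinity(node_type: str) -> dict:
    """Soft node affinity: the scheduler prefers, but doesn't require, a match."""
    return {
        "nodeAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 1,
                    "preference": {
                        "matchExpressions": [
                            {
                                "key": "nodeType",  # assumed label key
                                "operator": "In",
                                "values": [node_type],
                            }
                        ]
                    },
                }
            ]
        }
    }

crawler_affinity = preferred_node_affinity("crawler")  # crawler + job pods
redis_affinity = preferred_node_affinity("redis")
```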
Ilya Kreymer
e3f268a2e8
CI setup for new swarm mode (#248)
- build backend and frontend with caching using the GHA cache
- streamline frontend image to reduce layers
- set up local swarm with the test/setup.sh script, wait for containers to init
- copy sample config files as default (add storages.sample.yaml)
- add initial backend test for logging in with default superadmin credentials via 127.0.0.1:9871
- must use 127.0.0.1 instead of localhost to access the frontend container within the action
2022-06-06 09:34:02 -07:00
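A hedged sketch of the kind of smoke test added here; the endpoint path and credentials are assumptions (a fastapi-users-style login), not the actual test:

```python
import requests

API_BASE = "http://127.0.0.1:9871"  # 127.0.0.1, not localhost, inside the action

def test_superadmin_login():
    resp = requests.post(
        f"{API_BASE}/api/auth/jwt/login",
        data={"username": "admin@example.com", "password": "PASSW0RD0"},
    )
    assert resp.status_code == 200
    assert resp.json().get("access_token")
```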
Ilya Kreymer
0c8a5a49b4 refactor to use docker swarm for local alternative to k8s instead of docker compose (#247):
- use python-on-whales to call the docker cli api directly, creating a docker stack for each crawl or profile browser
- configure storages via storages.yaml secret
- add crawl_job, profile_job, splitting into base and k8s/swarm implementations
- split manager into base crawlmanager and k8s/swarm implementations
- swarm: load initial scale from the db to avoid modifying fixed configs; in k8s, load it from the configmap
- swarm: support scheduled jobs via swarm-cronjob service
- remove docker dependencies (aiodocker, apscheduler, scheduling)
- swarm: when using local minio, expose via /data/ route in nginx via extra include (in k8s, include dir is empty and routing handled via ingress)
- k8s: cleanup minio chart: move init containers to minio.yaml
- swarm: stateful set implementation to be consistent with k8s scaling:
  - don't use service replicas,
  - create a unique service with '-N' appended and allocate unique volume for each replica
  - allows crawl containers to be restarted w/o losing data
- add volume pruning background service, as volumes can be deleted only after service shuts down fully
- watch: fully simplify routing, route via replica index instead of ip for both k8s and swarm
- rename network btrix-cloud-net -> btrix-net to avoid conflict with compose network
2022-06-05 10:37:17 -07:00
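A sketch of the per-crawl stack lifecycle using python-on-whales, which wraps the docker CLI directly; the stack naming and compose file argument are illustrative:

```python
from python_on_whales import docker

def deploy_crawl_stack(crawl_id: str, compose_file: str) -> None:
    """Deploy one docker stack per crawl (or profile browser)."""
    docker.stack.deploy(f"crawl-{crawl_id}", compose_files=[compose_file])

def remove_crawl_stack(crawl_id: str) -> None:
    # volumes outlive the service; the background pruning service
    # deletes them once the service has fully shut down
    docker.stack.remove(f"crawl-{crawl_id}")
```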