Commit Graph

26 Commits

Author SHA1 Message Date
Ilya Kreymer
aa5207915c
backend: fix crawl config revision links (#149)
backed: crawlconfig:
- ensure newId is saved on old config being replaced
- if old config replaced is being deleted, ensure newId link is set on its old config (if any),
and the oldId points to the oldId of config being replaced (if any)
2022-02-21 16:51:27 -08:00
Ilya Kreymer
ee68a2f64e
Support for setting scale in crawlconfig (#148)
* backend: scale support:
- add 'scale' field to crawlconfig
- support updating 'scale' field in crawlconfig patch
- add constraint for crawlconfig and crawl scale (currently 1-3)
2022-02-20 11:27:47 -08:00
Ilya Kreymer
d05f04be9f
Crawl Config Editing Support (#141)
* support inactive configs in same collection, configs with `inactive` set to true (#137)
- add `inactive`, `newId`, `oldId` to crawlconfigs
- filter out inactive configs by default for most operations
- add index for aid + inactive field for faster querying
- delete returns status: 'deactivated' or 'deleted'
- if no crawls ran, config can be deleted, otherwise it is deactivated

* update crawl endpoint: add general PATCH crawl config endpoint, support updating schedule and name
2022-02-17 16:04:07 -08:00
Ilya Kreymer
d28ebcc7b6 backend: crawlconfig: don't pass default settings to crawlconfig to avoid redundant settings, use browsertrix-crawler defaults
when config not set
2022-02-14 18:47:52 -08:00
Ilya Kreymer
2b2e6fedfa
Misc backend fixes (#133)
* misc backend fixes:
- fix uuid typing: roles list, user invites
- crawlconfig: fix created date setting, fix userName lookup
- docker: fix timezone for scheduler, fix running check
- remove prints
- fix get crawl stuck in 'stopping' - check finished list first, then run list (in case k8s job has not been deleted)
2022-01-31 19:41:04 -08:00
Ilya Kreymer
bcbc40059e
Refactor backend data model to support UUID (fixes #118) (#119)
* uuid fix: (fixes #118)
- update all mongo models to use UUID type as main '_id' (users continue to use 'id' as defined by fastapi-users)
- update all foreign doc references to use UUID instead of string
- api handlers convert str->uuid as needed
api fix:
- fix single crawl api, add CrawlOut response model
- fix collections api
- fix standalone-docker apis
- for manual job, set user to current user, overriding the setting from crawlconfig

* additional fixes:
- rename username -> userName to indicate not the login 'username'
- rename user -> userid, archive -> aid for crawlconfig + crawls
- ensure invites correctly convert str -> uuid as needed
- filter out unset values from browsertrix-crawler config

* convert remaining user -> userid variables
ensure archive id is passed to crawl_manager as str (via archive.id_str)

* remove bulk crawlconfig delete
* add support for `stopping` state when gracefully stopping crawl
* for get crawl endpoint, check stopped crawls first, then running
2022-01-29 19:00:11 -08:00
Ilya Kreymer
9499ebfbba
Crawls API improvements (#117)
* crawls api improvements (fixes #110)
- add GET /crawls/{crawlid} api to return single crawl
- resolve crawlconfig name, add as `configName` to crawl model
- add 'created' date for crawlconfigs 
- flatten list to single 'crawls' list, instead of separate 'finished' and 'running' (running crawls added first)
- include 'fileCount' and 'fileSize', remove files
- remove `files` from crawl list response, also remove `aid`
- remove `schedule` from crawl data altogether, (available in crawl config)
- add ListCrawls response model
2022-01-29 12:08:02 -08:00
Ilya Kreymer
0bea0cfff2
crawl config new template: add support for 'extraHops' config option (available in browsertrix-crawler 0.5.0) (#104)
frontend:
- add checkbox to basic crawl config component which sets 'extraHops' to 1, otherwise to 0
- text tweaks: rename Scope Type -> Crawl Scope, capitalization

backend: add 'extraHops' to CrawlConfig
fixes #102
2022-01-26 21:18:22 -08:00
Ilya Kreymer
f55f84c60b backend:
- crawlconfigs cleanup: simplify get_crawl_configs api
- return CrawlConfigOut for single crawlconfig api endpoint, include currCrawlId
2022-01-22 17:41:37 -08:00
Ilya Kreymer
77aa5213f2 quickfix: typo fix, return config, not archive, fixes #96 2022-01-22 17:21:29 -08:00
Ilya Kreymer
b506442b21
backend api: add curr crawl to crawlconfig listing (#95)
* backend api: add current crawl id to crawlconfig listing
- model: add 'currCrawlId' to CrawlConfig model
- output: add response model to /crawlconfigs api response to show correct openapi model
- rename crawl_configs -> crawlConfigs for consistency
2022-01-22 13:52:46 -08:00
Ilya Kreymer
88f1689e0e crawlconfig: add 'name' property to crawl config
superuser init: don't check invite token for verified superuser (automatic init)
fix formatting
2022-01-15 19:06:48 -08:00
Ilya Kreymer
3d4d7049a2
Misc backend fixes for cloud deployment (#26)
* misc backend fixes:
- fix running w/o local minio
- ensure crawler image pull policy is configurable, loaded via chart value
- use digitalocean repo for main backend image (for now)
- add bucket_name to config only if using default bucket

* enable all behaviors, support 'access_endpoint_url' for default storages

* debugging: add 'no_delete_jobs' setting for k8s and docker to disable deletion of completed jobs
2021-11-25 11:58:26 -08:00
Ilya Kreymer
57a4b6b46f add collections api:
- collections defined by name per archive
- can update collections with additional metadata (currently just description)
- crawl config api accepts a list of collections by name, resolved to collection uids and stored in config
- finished crawls also associated with collection list
- /archives/{aid}/collections/{name} can list all crawl artifacts (wacz files) from a named collection (in frictionless data package-ish format)
- /archives/{aid}/collections/$all lists all crawled artifacts for the archive

readiness check: add /healthz endpoints for app and nginx
ingress: add /data/ route to local bucket

storage improvements:
- for default storages, store path only, and prepend default storage access endpoint
- collections api returns the paths using the storage access endpoint
- define default storages as secrets in k8s (can support multiple), hard-coded in docker (only one for now)
2021-10-27 09:39:14 -07:00
Ilya Kreymer
c38e0b7bf7 use redis based queue instead of url for crawl done webhook
update docker setup to support redis webhook, add consistent CRAWL_ARGS, additional fixes
2021-10-10 12:18:28 -07:00
Ilya Kreymer
4ae4005d74 add ingress + nginx container for better routing
support screencasting to dynamically created service via nginx (k8s only thus far)
add crawl /watch endpoint to enable watching, creates service if doesn't exist
add crawl /running endpoint to check if crawl is running
nginx auth check in place, but not yet enabled
add k8s nginx.conf
add missing chart files
file reorg: move docker config to configs/
k8s: add readiness check for nginx and api containers for smoother reloading
ensure service deleted along with job
todo: update dockerman with screencast support
2021-10-09 23:47:29 -07:00
Ilya Kreymer
19879fe349 Storage + Data Model Refactor (fixes #3):
- Add default vs custom (s3) storage
 - K8S: All storages correspond to secrets
 - K8S: Default storages inited via helm
 - K8S: Custom storage results in custom secret (per archive)
 - K8S: Don't add secret per crawl config
 - API for changing storage per archive
 - Docker: default storage just hard-coded from env vars (only one for now)
 - Validate custom storage via aiobotocore before confirming
 - Data Model: remove usage from users
 - Data Model: support adding multiple files per crawl for parallel crawls
 - Data Model: track completions for parallel crawls
 - Data Model: initial support for tags per crawl, add collection as 'coll' tag

README fixes
2021-10-09 18:58:40 -07:00
Ilya Kreymer
b6d1e492d7 add redis for storing crawl state data!
- supported in both docker and k8s
- additional pods with same job id automatically use same crawl state in redis
- support dynamic scaling (#2) via /scale endpoint - k8s job parallelism adjusted dynamically for running job (only supported in k8s so far)
2021-09-17 15:02:11 -07:00
Ilya Kreymer
223658cfa2 misc tweaks:
- better error handling for not found resources, ensure 404
- typo in k8smanager
- add pylintrc
- ensure manual job ares deleted when complete
- fix typos, reformat
2021-08-25 18:34:49 -07:00
Ilya Kreymer
60b48ee8a6 dockermanager + scheduler:
- run as child process using aioprocessing
- cleanup: support cleanup of orphaned containers
- timeout: support crawlTimeout via check in cleanup loop
- support crawl listing + crawl stopping
2021-08-25 15:28:57 -07:00
Ilya Kreymer
b417d7c185 docker manager: support scheduling with apscheduler and separate 'scheduler' process 2021-08-25 12:21:03 -07:00
Ilya Kreymer
91e9fc8699 dockerman: initial pass
- support for creating, deleting crawlconfigs, running crawls on-demand
- config stored in volume
- list to docker events and clean up containers when they exit
2021-08-24 22:49:06 -07:00
Ilya Kreymer
20b19f932f make crawlTimeout a per-crawconfig property
allow crawl complete/partial complete to update existing crawl state, eg. timeout
enable handling backofflimitexceeded / deadlineexceeded failure, with possible success able to override the failure state
filter out only active jobs in running crawls listing
2021-08-24 11:29:15 -07:00
Ilya Kreymer
66c4e618eb crawls work (#1), support for:
- canceling a crawl (via sigterm)
- stopping a crawl gracefully (via custom exec sigint)
2021-08-23 12:25:04 -07:00
Ilya Kreymer
a8255a76b2 crawljob:
- support run once on existing crawl job
- support updating/patching existing crawl job with new crawl config, new schedule and run once
2021-08-21 22:10:31 -07:00
Ilya Kreymer
170958be37 rename crawls -> crawlconfigs.py
add crawls for crawl api management
2021-08-20 15:15:51 -07:00