* Btrixjobs Operator - Phase 1 (#679)
  - add metacontroller and custom crds
  - add main_op entrypoint for operator

* Btrix Operator Crawl Management (#767)

* operator backend:
  - run operator api in a separate container, but in the same pod, with WEB_CONCURRENCY=1
  - operator creates statefulsets and services for CrawlJob and ProfileJob
  - operator: use service hook endpoint, set port in values.yaml

* crawls working with CrawlJob:
  - jobs start with the 'crawljob-' prefix
  - update status to reflect the current crawl state
  - set sync time to 10 seconds by default, overridable with 'operator_resync_seconds'
  - mark crawl as running, failed, or complete when finished
  - store finished status when the crawl is complete
  - support updating scale, forcing rollover, and stopping via patching the CrawlJob (see the sketch after this message)
  - support cancel via deletion
  - requires a hack to Content-Length for patching custom resources
  - auto-delete of CrawlJob via 'ttlSecondsAfterFinished'
  - also delete pvcs until auto-delete is supported via statefulset (k8s >1.27)
  - ensure filesAdded is always set correctly: keep a counter in redis, add it to the status display
  - optimization: attempt to reduce automerging by reusing volumeClaimTemplates from existing children, as these may have additional props added
  - add add_crawl_errors_to_db() for storing crawl errors from the redis '<crawl>:e' key to mongodb when a crawl is finished/failed/canceled
  - add .status.size to display human-readable crawl size, if available (from webrecorder/browsertrix-crawler#291)
  - support the new page size key (>0.9.0) as well as the old one (changed in webrecorder/browsertrix-crawler#284)

* support for scheduled jobs!
  - add main_scheduled_job entrypoint to run scheduled jobs
  - add crawl_cron_job.yaml template for declaring CronJob
  - CronJobs moved to the default namespace

* operator manages ProfileJobs:
  - jobs start with the 'profilejob-' prefix
  - update expiry time by updating the ProfileJob object's 'expireTime' while the profile is active

* refactor/cleanup:
  - remove k8s package
  - merge k8sman and basecrawlmanager into crawlmanager
  - move templates, k8sapi, utils into the root package
  - delete all *_job.py files
  - remove dt_now, ts_now from crawls, now in utils
  - all db operations happen in the crawl/crawlconfig/org files
  - move shared crawl/crawlconfig/org functions that use the db to be importable directly, including get_crawl_config, add_new_crawl, inc_crawl_stats

* role binding: more secure setup, don't allow the crawler namespace any k8s permissions
  - move cronjobs to be created in the default namespace
  - grant the default namespace access to create cronjobs in the default namespace
  - remove the role binding from the crawler namespace

* additional tweaks to templates:
  - split the crawler and redis statefulsets into separate yaml files (in case one needs to be loaded separately from the other)

* stats / redis optimization:
  - don't update stats in mongodb on every operator sync, only when the crawl is finished
  - for api access, read stats directly from redis to get up-to-date stats
  - move get_page_stats() to utils, add get_redis_url() to k8sapi to unify access

* add migration for operator changes:
  - update the configmap for crawl configs with scale > 1 or crawlTimeout > 0, and recreate CronJobs where a schedule exists
  - add an option to rerun the last migration, enabled via env var by running helm with --set=rerun_last_migration=1

* subcharts:
  - move the crawljob and profilejob crds to a separate subchart, as this seems the best way to guarantee proper install order and updates on upgrade with helm; add the built btrix-crds-0.1.0.tgz subchart
  - metacontroller: use the release from ghcr, add the metacontroller-helm-v4.10.1.tgz subchart

* backend api fixes:
  - ensure changing the scale of a crawl also updates it in the db
  - crawlconfigs: add 'currCrawlSize' and 'lastCrawlSize' to the crawlconfig api

---------

Co-authored-by: D. Lee <leepro@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
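As a rough illustration of the patch-based control flow described above (a hypothetical sketch, not the project's actual code: the CRD group, version, namespace, and spec field names are assumptions), updating a CrawlJob's scale with the official kubernetes Python client could look like:

from kubernetes import client, config

# Hypothetical sketch: scale a crawl by patching its CrawlJob custom resource.
# group/version/namespace/plural values below are illustrative assumptions,
# not taken from the chart's actual CRD definitions.
config.load_kube_config()
custom_api = client.CustomObjectsApi()

custom_api.patch_namespaced_custom_object(
    group="btrix.cloud",           # assumed CRD group
    version="v1",                  # assumed CRD version
    namespace="crawlers",          # assumed crawler namespace
    plural="crawljobs",
    name="crawljob-example",       # CrawlJob names use the 'crawljob-' prefix
    body={"spec": {"scale": 3}},
)

Cancellation would then map to delete_namespaced_custom_object on the same resource, matching the "cancel via deletion" behavior noted above.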
"""
|
|
Migration 0004 - Ensuring all config.seeds are Seeds not HttpUrls
|
|
"""
|
|
from pydantic import HttpUrl
|
|
|
|
from btrixcloud.crawlconfigs import CrawlConfig, ScopeType, Seed
|
|
from btrixcloud.crawls import Crawl
|
|
from btrixcloud.migrations import BaseMigration
|
|
|
|
|
|
MIGRATION_VERSION = "0004"
|
|
|
|
|
|
class Migration(BaseMigration):
|
|
"""Migration class."""
|
|
|
|
def __init__(self, mdb, migration_version=MIGRATION_VERSION):
|
|
super().__init__(mdb, migration_version)
|
|
|
|
    async def migrate_up(self):
        """Perform migration up.

        Convert any crawlconfig.config.seeds HttpUrl values to Seeds with url value.
        """
        # pylint: disable=too-many-branches

        # Migrate workflows
        crawl_configs = self.mdb["crawl_configs"]
        crawl_config_results = [res async for res in crawl_configs.find({})]
        if not crawl_config_results:
            return

        for config_dict in crawl_config_results:
            seeds_to_migrate = []
            seed_dicts = []

            seed_list = config_dict["config"]["seeds"]
            for seed in seed_list:
                # Normalize each seed to a Seed model, defaulting to page scope
                if isinstance(seed, HttpUrl):
                    # HttpUrl is a str subclass with no .url attribute,
                    # so str(seed) yields the full url
                    new_seed = Seed(url=str(seed), scopeType=ScopeType.PAGE)
                    seeds_to_migrate.append(new_seed)
                elif isinstance(seed, str):
                    new_seed = Seed(url=str(seed), scopeType=ScopeType.PAGE)
                    seeds_to_migrate.append(new_seed)
                elif isinstance(seed, Seed):
                    seeds_to_migrate.append(seed)

            # Re-serialize the Seed models to plain dicts for mongodb storage
            for seed in seeds_to_migrate:
                seed_dict = {
                    "url": str(seed.url),
                    "scopeType": seed.scopeType,
                    "include": seed.include,
                    "exclude": seed.exclude,
                    "sitemap": seed.sitemap,
                    "allowHash": seed.allowHash,
                    "depth": seed.depth,
                    "extraHops": seed.extraHops,
                }
                seed_dicts.append(seed_dict)

            if seed_dicts:
                await crawl_configs.find_one_and_update(
                    {"_id": config_dict["_id"]},
                    {"$set": {"config.seeds": seed_dicts}},
                )
        # Migrate seeds copied into crawls
        crawls = self.mdb["crawls"]
        crawl_results = [res async for res in crawls.find({})]

        for crawl_dict in crawl_results:
            seeds_to_migrate = []
            seed_dicts = []

            seed_list = crawl_dict["config"]["seeds"]
            for seed in seed_list:
                if isinstance(seed, HttpUrl):
                    new_seed = Seed(url=str(seed), scopeType=ScopeType.PAGE)
                    seeds_to_migrate.append(new_seed)
                elif isinstance(seed, str):
                    new_seed = Seed(url=str(seed), scopeType=ScopeType.PAGE)
                    seeds_to_migrate.append(new_seed)
                elif isinstance(seed, Seed):
                    seeds_to_migrate.append(seed)

            for seed in seeds_to_migrate:
                seed_dict = {
                    "url": str(seed.url),
                    "scopeType": seed.scopeType,
                    "include": seed.include,
                    "exclude": seed.exclude,
                    "sitemap": seed.sitemap,
                    "allowHash": seed.allowHash,
                    "depth": seed.depth,
                    "extraHops": seed.extraHops,
                }
                seed_dicts.append(seed_dict)

            if seed_dicts:
                await crawls.find_one_and_update(
                    {"_id": crawl_dict["_id"]},
                    {"$set": {"config.seeds": seed_dicts}},
                )
        # Test migration: re-parse all documents through the pydantic models
        # to confirm every seed now validates as a Seed with a url
        crawl_config_results = [res async for res in crawl_configs.find({})]
        for config_dict in crawl_config_results:
            config = CrawlConfig.from_dict(config_dict)
            for seed in config.config.seeds:
                assert isinstance(seed, Seed)
                assert seed.url

        crawl_results = [res async for res in crawls.find({})]
        for crawl_dict in crawl_results:
            crawl = Crawl.from_dict(crawl_dict)
            for seed in crawl.config.seeds:
                assert isinstance(seed, Seed)
                assert seed.url
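For reference, a minimal, hypothetical sketch of invoking this migration by hand against a local MongoDB (the database name is made up for illustration, and this bypasses whatever version bookkeeping BaseMigration's normal entrypoint performs):

import asyncio

from motor.motor_asyncio import AsyncIOMotorClient

async def main():
    # "btrix" is an assumed database name for illustration only
    mdb = AsyncIOMotorClient("mongodb://localhost:27017")["btrix"]
    await Migration(mdb).migrate_up()

asyncio.run(main())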