Fixes #890
This PR introduces new streaming superuser-only API endpoints to export
and import database information for an organization. New Administrator
deployment documentation on how to manage the process and copy files
between S3 buckets as needed is also included.
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Updates the /api/orgs/create endpoint to:
- make name / slug optional; the org will be renamed by the first user via
#1870
- support optional quotas
- support an optional first admin user email; that user will receive an invite to
join the org.
Also supports a new shared secret mechanism, allowing an external
automation to access the /api/orgs/create endpoint (and, thus far, only that
endpoint) via a shared secret instead of a normal login.
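A hedged sketch of how an external automation might call the endpoint; the header and payload field names are assumptions for illustration, not the exact API contract:

```python
import requests

# Hypothetical values; the real header name and field names may differ.
API_BASE = "https://app.example.com/api"
SHARED_SECRET = "change-me"

resp = requests.post(
    f"{API_BASE}/orgs/create",
    # shared secret presented instead of a normal login token (assumed header)
    headers={"Authorization": f"Bearer {SHARED_SECRET}"},
    json={
        # name / slug omitted: org is renamed by its first user (see #1870)
        "quotas": {"maxPagesPerCrawl": 1000},        # optional quotas
        "firstAdminUserEmail": "admin@example.com",  # optional; receives an org invite
    },
)
resp.raise_for_status()
print(resp.json())
```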
Fixes #1893
- Removes crawl workflow-scoped configmaps, replacing them with operator-controlled
per-crawl configmaps that contain only the JSON config passed to Browsertrix
Crawler (as a volume).
- Other configmap settings are replaced by custom CrawlJob options
(most already were; profile_filename and storage_filename have been added)
- Cron jobs also updated to create CrawlJob without relying on configmaps,
querying the db for additional settings.
- The `userid` associated with cron jobs is set to the user that last modified
the schedule of the crawl, rather than whoever last modified the workflow
- Various functions that deal with updating configmaps have been removed,
including in migrations.
- New migration 0029 added to remove all crawl workflow configmaps
- needed to support js-wacz signing requests in upcoming crawler versions
- Also has slightly increased memory requirements due to new versions of
some libraries.
- 0.5.2 adds a fix that drops the fractional part of the second, making it
work with ISO date strings that include microseconds, such as those from
js-wacz.
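A minimal sketch of the kind of fix described, assuming a parser that previously choked on fractional seconds (the function name is illustrative):

```python
from datetime import datetime

def parse_iso_date(date_str: str) -> datetime:
    # Drop the fractional part of the second so that ISO strings with
    # microseconds (eg. from js-wacz) parse the same as whole-second strings.
    return datetime.strptime(date_str.split(".")[0].rstrip("Z"), "%Y-%m-%dT%H:%M:%S")

print(parse_iso_date("2024-05-01T12:00:00.123456Z"))  # 2024-05-01 12:00:00
print(parse_iso_date("2024-05-01T12:00:00Z"))         # same result, no fraction
```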
Resolves https://github.com/webrecorder/browsertrix/issues/1874
Support for the new two-part sign-up flow when a first admin user is added to an org
- If new user, user registers first, then is able to change the org name / slug on following screen
- If existing user, user accepts invite, then is able to change the org name / slug on following screen
- After confirming the org name and slug, the user is taken to the dashboard, or an error is shown if the org name or slug is already taken.
- If org name == org id, the org name and slug are automatically set to `{Your Name}'s Archive` when the first user registers / accepts the invite
- Email templates updated to better reflect new / existing users and to not show the org name if it is 'unset' (org name == org id internally)
- tests: frontend unit testing for accept + invite screens.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Fixes #1888
Refactors scale handling:
- Ensures the number of scaled instances does not exceed the number of pages,
but is also at minimum 1
- Checks the finish condition: numFailed + numDone >= desired scale (sketched below)
- If at least one instance succeeds, the crawl is considered successful / done.
- If all instances fail, the crawl is considered failed
- Ensures that pod done count >= redis done count
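A rough sketch of the conditions above (names are illustrative, not the operator's actual code):

```python
from typing import Optional

def desired_scale(requested: int, num_pages: int) -> int:
    # never run more instances than there are pages, but always at least 1
    return max(1, min(requested, num_pages)) if num_pages else requested

def crawl_state(num_done: int, num_failed: int, scale: int) -> Optional[str]:
    if num_done + num_failed < scale:
        return None  # not all instances have finished yet
    # one successful instance is enough for the crawl to be considered done;
    # if every instance failed, the crawl is considered failed
    return "complete" if num_done > 0 else "failed"
```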
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Adds Browsertrix GitHub repo and Webrecorder forum to the bottom of
the support email.
- Adds note about having an applicable plan to contact support
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Backend work for first two tasks of
https://github.com/webrecorder/browsertrix/issues/1875
New /billing API endpoint to be added separately once we have a better
idea of what data we can get from the payment processor.
- add an 'expire_at_duration_seconds' that is 75% of the actual presign
duration, so presigned URLs are refreshed once 75% of the duration has
elapsed (i.e. with <25% of validity remaining), before they actually expire
(see sketch below)
- set the cached expireAt time to the renew-at time for more frequent
updates
- update QA configmap in place with updated presigned URLs when expireAt
time is reached
- mount qa config volume under /tmp/qa/ without subPath to get automatic
updates, which crawler will handle
- tests: fix qa test typo (from main)
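A small sketch of the renewal timing, under the assumption that the presign duration is configured in seconds (names are illustrative):

```python
from datetime import datetime, timedelta

PRESIGN_DURATION_SECONDS = 3600  # example configured duration

# renew at 75% of the presign duration, so URLs are refreshed while
# more than 25% of their validity still remains
expire_at_duration_seconds = int(PRESIGN_DURATION_SECONDS * 0.75)

def cached_expire_at(signed_at: datetime) -> datetime:
    # the cached expireAt is the earlier renew-at time, not the real expiry,
    # forcing more frequent updates of the QA configmap's presigned URLs
    return signed_at + timedelta(seconds=expire_at_duration_seconds)
```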
- fixes #1864
- allow configuring QA run scale via 'qa_scale' setting in helm values
(overriding any setting on the qa crawljob)
- adds additional comments to browser instances helm values settings for clarity
- fixes #1842
Previously, the workflow crawl settings were not included at all in
QA runs.
This mounts the crawl workflow config, as well as QA configmap, into QA
run crawls, allowing for page limits from crawl workflow to be applied
to QA runs.
It also allows a different number of browser instances to be used for QA
runs, as QA runs might work better with fewer browsers (eg. 2 instead of
4). This can be set with `qa_browser_instances` in the helm chart.
QA browser workers default to 1 if unset (for now, for best results).
Fixes #1828
- ensure max_crawler_memory_size is initialized before it is set!
- pass profile_browser_memory / profile_browser_cpu from chart values
- map volume to /tmp/home to avoid persisting /tmp for profiles
Adds a `max_crawler_memory` chart setting which, if set,
defines the upper memory limit that crawler pods can be resized up to.
If not set, auto-resizing is disabled and pods are always set to 'crawler_memory' memory.
If 'registration_enabled' is set, 'registration_org_id' is checked for the org id
of an existing org that new users should be added to when they register.
If omitted, defaults to the default org.
Fixes #1729
Repository Index: Generate an index.yaml in ./docs/helm-repo/index.yaml
to allow browsertrix to act as a helm repository.
docs: rename docs.browsertrix.cloud -> docs.browsertrix.com
docs: update deployment doc to mention helm repo as preferred way to
install
docs build action: generate repository index in GH action
publish action: update auto-generated message to mention installing from
the repo.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- fixes #1684
- can be used to optionally restrict QA to only some crawls (eg. with
browsertrix-crawler>=1.0.0)
- enforce error on backend (return 400) and handle special error on the
frontend
- Remove globals from profile, uploads, and qa test modules in favor of fixtures
- Add retries to fix intermittent test failures due to timing
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- priority classes below -10 are ignored by cluster-autoscaler, so QA jobs
with too-low priorities would never run
- start crawl priorities at 0, going down (same as before)
- start QA run priorities at -2, going down (instead of -100)
- this means a crawl with scale of 3 can be preempted by the 1st QA pod,
but otherwise crawls have higher priority
- rename the priority classes, as they are otherwise immutable and error on
helm upgrade
This allows more room in the lower priority classes for other types of
objects, while staying above the -10 cutoff (see:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md)
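Roughly, the resulting priority layout (constants and function are illustrative, not the operator's actual code):

```python
CRAWL_BASE = 0   # crawl instance n runs at priority 0 - n
QA_BASE = -2     # QA instance n runs at priority -2 - n (previously -100 - n)
AUTOSCALER_CUTOFF = -10  # cluster-autoscaler ignores priority classes below this

def pod_priority(instance: int, is_qa: bool) -> int:
    base = QA_BASE if is_qa else CRAWL_BASE
    priority = base - instance
    # stay above the cutoff, or the autoscaler will never scale up for the pod
    assert priority > AUTOSCALER_CUTOFF
    return priority

# a crawl at scale 3 occupies priorities 0, -1, -2, so only its last
# instance competes with the first QA pod (-2); crawls otherwise outrank QA
```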
- modify invite email template to answer common questions
- email templates: make each email template overridable with --set-file
- docs: update customization doc to document how to customize email
templates
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Supports horizontal pod autoscaling (hpa) for backend and frontend pods:
- use cpu and memory averages
- adjust base memory + cpu for backend
- threshold set to 80% cpu and 95% memory utilization by default
(configurable in values.yaml)
- instead of backend and frontend replicas, set max replicas in
values.yaml
- only enable hpa if backend_max_replicas or frontend_max_replicas is
>1, default to 1 for now
- set memory limit to 1.2x memory request to provide extra padding and
avoid OOM
- attempt to resize crawler pods to 1.2x their memory when exceeding 90% of
available memory
- do a 'soft OOM' (send an extra SIGTERM) to the pod when it reaches 100% of
requested memory, resulting in a faster graceful restart while avoiding an
instant system OOM kill (see the sketch after this list)
- Fixes #1632
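A simplified sketch of the crawler-pod memory handling described above (names and return values are illustrative):

```python
MEMORY_PADDING = 1.2  # limit = 1.2x request, padding before a hard OOM kill

def memory_action(used: int, requested: int, max_memory: int) -> str:
    if used >= requested:
        # 'soft OOM': send an extra SIGTERM for a faster graceful restart,
        # instead of waiting for the kernel OOM killer at the 1.2x limit
        return "soft-oom-sigterm"
    if used > requested * 0.9 and requested * MEMORY_PADDING <= max_memory:
        # attempt to resize the pod to 1.2x before it hits its limit
        return f"resize-to-{int(requested * MEMORY_PADDING)}"
    return "ok"
```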
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Supports running QA Runs via the QA API!
Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498
Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)
Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or the
`qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.
- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch)
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results (example below).
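For instance, fetching pages whose screenshot match score is below a threshold might look like this (the exact URL shape is an assumption):

```python
import requests

API = "https://app.example.com/api"
oid, crawl_id, qa_run_id = "<org-id>", "<crawl-id>", "<qa-run-id>"

resp = requests.get(
    f"{API}/orgs/{oid}/crawls/{crawl_id}/qa/{qa_run_id}/pages",
    headers={"Authorization": "Bearer <token>"},
    # pages where the screenshot match score is strictly less than 0.9
    params={"qaFilterBy": "screenshotMatch", "lt": 0.9},
)
print(resp.json())
```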
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #1597
New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.
After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.
Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.
StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
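A condensed sketch of the approach, given a presigned WACZ URL (the paths and batch handling are illustrative):

```python
import json
from remotezip import RemoteZip  # pip install remotezip

BATCH_SIZE = 100

def stream_page_batches(presigned_wacz_url: str):
    with RemoteZip(presigned_wacz_url) as wacz:
        batch = []
        with wacz.open("pages/pages.jsonl") as fh:
            for line in fh:  # streamed line-by-line, never downloading the full WACZ
                page = json.loads(line)
                if "url" not in page:
                    continue  # skip the jsonl header record
                batch.append(page)
                if len(batch) >= BATCH_SIZE:
                    yield batch
                    batch = []
        if batch:
            yield batch

# each yielded batch is then inserted into the pages collection in one call,
# from a task kicked off with asyncio.create_task()
```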
- Backend: Use separate resource constraints for profiles: default
profile browser resources to either 'profile_browser_cpu' /
'profile_browser_memory' or single browser 'crawler_memory_base' /
'crawler_cpu_base', instead of scaled to the number of browser workers
- Frontend: check that the profile html page is loading; keep retrying if
still getting an nginx error, instead of loading an iframe with the error.
Fixes #1598 (Copy of #1599 from 1.9.4)
Part of #1241
### Changes
- Renames all instances of "Browsertrix Cloud" to "Browsertrix" on the
front end, emails, and documentation
---------
Co-authored-by: emma <hi@emma.cafe>
- increases the failureThreshold of the startupProbe for the api backend
container to account for long-running migrations, up to 300 seconds
- add `/healthzStartup` which checks if db is ready
- bump
- keeps `/healthz` to always return 200 when running
- increases the livenessProbe failureThreshold to be higher than the readiness
probe's, following the recommended best practice of liveness threshold >
readiness threshold
- fixes #1559
Fixes #1502
- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted
Note: Requires crawler version 1.0.0-beta.3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.
Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Supports setting WARC prefix for WARCs inside WACZ to `<org slug>-<slug
[crawl name | first seed host]>`.
- Prefix set via WARC_PREFIX env var, supported in browsertrix-crawler
1.0.0-beta.4 or higher
If a crawl name is provided, it is used; otherwise, the hostname of the first
seed. The name is 'sluggified', using lowercase alphanumeric characters
separated by dashes.
Ex: in an organization called `Default Org`, a crawl of
`https://specs.webrecorder.net/` and no name will have WARCs named:
`default-org-specs-webrecorder-net-....warc.gz`
If the crawl is given the name `SPECS`, the WARCs will be named
`default-org-specs-manual-....warc.gz`
Fixes #412 in a default way.
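The naming scheme can be sketched as follows (helper names are illustrative); this reproduces the example above:

```python
import re

def slugify(value: str) -> str:
    # lowercase alphanumeric runs, separated by dashes
    return "-".join(re.findall(r"[a-z0-9]+", value.lower()))

def warc_prefix(org_slug: str, crawl_name: str, first_seed_host: str) -> str:
    # crawl name takes precedence; fall back to the first seed's hostname
    return f"{org_slug}-{slugify(crawl_name or first_seed_host)}"

print(warc_prefix("default-org", "", "specs.webrecorder.net"))
# default-org-specs-webrecorder-net
```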
configmap: add --screenshot thumbnail,view as default screenshots
version: update update-version.sh to add newline in version.py to match
new black formatting (from changes in #1507)
Fixes #1519
Fixes #1341
Adds "User Agent" field to workflow editor under the Browser Settings
tab. If not set, the crawler will use the browser's default user agent.
Also added to docs and to the workflow details page (if set).
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Refactors backend deployment to:
- Use MOTOR_MAX_WORKERS (defaulting to 1) to reduce threads used by
mongodb connections
- Also sets backend workers to 1 by default to reduce default memory
usage
- Switches to gunicorn with uvloop worker for production use instead of
uvicorn (as recommended by uvicorn)
A lower thread count should address the memory leak / increased usage, which
resulted in 5 threads x cpus x workers, eg. potentially 20 or 40 threads
just for mongodb connections. A lower default number of workers should
make it easier to scale the backend with HPA when additional capacity is needed.
Fixes #1467
Fixes #1385
## Changes
Supports multiple crawler 'channels' which can be configured to
different browsertrix-crawler versions
- Replaces `crawler_image` in helm chart with `crawler_channels` array
similar to how storages are handled
- The `default` crawler channel must always be provided and specifies
the default crawler image
- Adds backend `/orgs/{oid}/crawlconfigs/crawler-channels` API endpoint
to fetch information about available crawler versions (name, image, and
label), plus tests
- Adds crawler channel select to workflow creation/edit screens and
profile creation dialog, and updates related API endpoints and
configmaps accordingly. The select dropdown is shown only if more than
one channel is configured.
- Adds `crawlerChannel` to workflow and crawl details.
- Adds `image` to the crawl, used to display the actual crawler image used as
part of the crawl.
- Modifies `crawler_crawl_id` backend test fixture to use `test` crawler
version to ensure crawler versions other than latest work
- Adds migration to add `crawlerChannel` set to `default` to existing
workflow and profile objects and workflow configmaps
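A sketch of how a channel might be resolved to an image (data shapes are assumptions based on the description above):

```python
from typing import Optional

# mirrors the `crawler_channels` array in the helm chart; the 'default'
# channel must always be present
crawler_channels = [
    {"id": "default", "image": "docker.io/webrecorder/browsertrix-crawler:latest"},
    {"id": "test", "image": "docker.io/webrecorder/browsertrix-crawler:1.0.0-beta.4"},
]

def get_channel_image(channel: str) -> Optional[str]:
    for entry in crawler_channels:
        if entry["id"] == channel:
            return entry["image"]
    return None  # unknown channel: workflow/profile creation should be rejected

assert get_channel_image("default") is not None
```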
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
- Emails are now processed from Jinja2 templates found in
`charts/email-templates`, to support easier updates via helm chart in
the future.
- The available templates are: `invite`, `password_reset`, `validate` and
`failed_bg_job`.
- Each template can be text only or also include HTML. The format of the
template is:
```
subject
~~~
<html content>
~~~
text
```
- A new `support_email` field is also added to the email block in
values.yaml
Invite Template:
- Currently, only the invite template includes an HTML version; other
templates are text only.
- The same template is used for new and existing users, with slightly
different text if adding user to an existing org.
- If the user is invited by the superadmin, the invited-by field is not
included; otherwise the email also includes 'You have been invited by X to join Y'
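Given the subject ~~~ html ~~~ text format shown above, a minimal parsing sketch (the real loader may differ):

```python
def parse_email_template(content: str):
    # templates are 'subject ~~~ html ~~~ text', or 'subject ~~~ text'
    # for text-only templates with no HTML section
    parts = [part.strip() for part in content.split("~~~")]
    if len(parts) == 3:
        subject, html, text = parts
    else:
        subject, text = parts[0], parts[-1]
        html = None
    return subject, html, text

subject, html, text = parse_email_template("Welcome!\n~~~\nHi there.")
assert subject == "Welcome!" and html is None and text == "Hi there."
```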
- Adds two new crawl finished states, stopped_by_user and
stopped_quota_reached
- Tracking other possible 'stop reasons' in operator, though not making
them distinct states for now.
- Updated frontend with 'Stopped by User' and 'Stopped: Time Quota
Reached', shown with same icon as current partial_complete
- Added migration of partial_complete to either stopped_by_user or
complete (no historical quota data available)
- Addresses edge case in scaling: if crawl never scaled (no redis entry,
no pod), automatically scale down
- Edge case in status: if crawl is somehow 'canceled' but not deleted,
immediately delete crawl object and begin finalizing.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Previously, crawler pods used preferred node affinity instead of
required node affinity. This could result in crawler pods running on the
main node pool. Instead, we want to ensure crawler pods run on the
dedicated node pool (if configured).
- Converts 'preferred node affinity' to 'required node affinity' for
the node pool, while keeping preferred pod affinity for keeping all
crawler / redis pods together.
- For profiles, updates to the same node affinity, and also adds a
resource constraint matching a single crawler to the profile browser,
which previously had no resource constraints.
Fixes #1364
Regression fix for issue introduced in storage refactoring (see issue
for more details).
Changes:
1. Add `profiles/` prefix to profile filename passed in to crawler for
profile creation and written into db
2. Remove hardcoded `profiles/` prefix from crawler YAML
3. Add migration to add `profiles/` prefix to profile filenames that
don't already have it, including updating PROFILE_FILENAME in ConfigMaps
This way between the related storage document and the profile filename,
we have the full path to the object in the database rather than relying
on additional prefixes hardcoded into k8s job YAML files.
Note that as a follow-up it will be necessary to manually move any
profiles that had been written into the `<oid>` "directory" in object
storage rather than `<oid>/profiles` to the latter. This should only
affect profiles created very recently in a 1.8.0-beta release.
- move authsign secret to signer and make port configurable
- rename storages to more general ops-configs
- put 'storages.json' path into env var
- rename backend secret to backend-auth
- cronjobs: don't keep succeeded jobs around, triggers operator update
Fixes #1337
Crawl timeout is tracked via the `elapsedCrawlTime` field on the crawl
status, which is similar to regular crawl execution time but only
counts one pod if scale > 1. If scale == 1, the two are equivalent.
The crawl is gracefully stopped when the elapsed execution time exceeds the
timeout. For more responsiveness, the current crawl time since the last
update interval is also added.
Details:
- handle crawl timeout via elapsed crawl time (the longest running time of a
single pod), instead of an expire time
- include the current running time since the last update, for best precision
- more accurately count the elapsed time the crawl is actually running
- store elapsedCrawlTime in addition to crawlExecTime, storing the
longest duration of each pod since the last update interval (sketched below)
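A rough sketch of the timeout check (field names follow the description above; not the operator's actual code):

```python
from datetime import datetime, timezone

def should_graceful_stop(elapsed_crawl_time: float, last_updated: datetime,
                         crawl_timeout: float) -> bool:
    # elapsedCrawlTime counts the longest-running single pod per interval;
    # add the time since the last update (tz-aware) for better responsiveness
    since_update = (datetime.now(timezone.utc) - last_updated).total_seconds()
    return elapsed_crawl_time + since_update > crawl_timeout
```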
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Instead of shipping the app templates launched from the backend inside
`backend/btrixcloud/templates`, add them to a configmap and mount the
configmap in the same location.
This allows these templates to be updated, like other values in
charts/... without having to rebuild any of the images, speeding up dev
and maintenance time.
Changes include:
- move backend/btrixcloud/templates -> chart/app-templates/
- add app-templates/*.yaml to app-templates configmap
- mount app-templates configmap to /app/btrixcloud/templates/ in api and op containers
Implemented a variable and defaults for cluster-issuer to allow users to
specify, if needed, their own cluster issuer (eg. installations with
only outbound traffic that cannot solve the ACME HTTPS challenge).
Fixes #1252
Supports a generic background job system, with two background jobs,
CreateReplicaJob and DeleteReplicaJob.
- CreateReplicaJob runs for new crawls, uploads, and profiles, and updates the
`replicas` array with info about the replica after the job succeeds.
- DeleteReplicaJob deletes the replica.
- Both jobs are created from the new `replica_job.yaml` template. The
CreateReplicaJob sets secrets for primary storage + replica storage,
while DeleteReplicaJob only needs the replica storage.
- The job is processed in the operator when the job is finalized
(deleted), which should happen immediately when the job is done, either
because it succeeds or because the backoffLimit is reached (currently
set to 3).
- /jobs/ api lists all jobs using a paginated response, including filtering and sorting
- /jobs/<job id> returns details for a particular job
- tests: nightly tests updated to check create + delete replica jobs for crawls as well as uploads, job api endpoints
- tests: also fixes to timeouts in nightly tests to avoid crawls finishing too quickly.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Refactors storage to support replicas + custom storages on the Org.
- There is a default primary + replica storage, while an Org can also have
primary and replica storages.
- StorageRef object is used to store references to default and custom
storage.
- CrawlFile has been updated to contain a StorageRef instead of a
def_storage_name, which references
either a default storage (in StorageOps) or a custom storage (on the
Organization)
- There is also a 'replicas' Optional[List[StorageRef]] which contains
replicas, if any.
- CrawlFileOut contains a numReplicas for how many replicas exist for
a given file.
- Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile)
Part of #1262
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #1306
- Include full `resources` with expireAt (as string) in crawlFinished
and uploadFinished webhook notifications rather than using the
`downloadUrls` field (this is retained for collections).
- Set the default presign duration to one minute short of 1 week and enforce
the maximum supported by S3 (example below)
- Add 'storage_presign_duration_minutes' commented out to helm values.yaml
- Update tests
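The duration logic can be sketched as follows (constant names are assumptions):

```python
S3_MAX_PRESIGN_SECONDS = 604800            # 7 days, the maximum S3 supports
DEFAULT_PRESIGN_MINUTES = 60 * 24 * 7 - 1  # one minute short of 1 week

def presign_duration_seconds(configured_minutes=None) -> int:
    minutes = configured_minutes or DEFAULT_PRESIGN_MINUTES
    # enforce the S3 maximum even if a larger value is configured
    return min(minutes * 60, S3_MAX_PRESIGN_SECONDS)

assert presign_duration_seconds() == 604740  # 1 week minus 60 seconds
```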
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #1050
Major refactor of the user/auth system to remove fastapi_users
dependency. Refactors users.py to be standalone
and adds a new auth.py module for handling auth. UserManager now works
similarly to other ops classes.
The auth should be fully backwards compatible with fastapi_users auth,
including accepting previous JWT tokens w/o having to re-login. The User
data model in mongodb is also unchanged.
Additional fixes:
- allows updating fastapi to latest
- add webhook docs to openapi (follow up to #1041)
API changes:
- Removes the `GET, PATCH, DELETE /users/<id>` endpoints, which were not
in use before, as users are scoped to orgs. For deletion, probably
auto-delete when a user is removed from their last org (to be implemented).
- Renames `/users/me-with-orgs` to just `/users/me/`
- New `PUT /users/me/change-password` endpoint with the current password required to update the password; fixes #1269, supersedes #1272
Frontend changes:
- Fixes from #1272 to support new change password endpoint.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@suayoo.com>
* Refactor microk8s playbook to follow structure with shared roles
- Integrates with btrix/deploy role for deploying
- Separated RedHat and Debian into separate roles
- Created Common role
- allow running remotely by default
- use 'browsertrix_cloud_home' for charts path
- add additional customizable options to btrix_values.j2 (todo: unify all the templates)
- docs: update to new playbook path
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* storage ops refactor:
- create StorageOps class similar to other ops classes
- init storages list in StorageOps, no longer requiring lookup of default storages via CrawlManager
- convert all storage functions to members, add storageops to operator
- remove unused params, ensure crawl exists for rollover restart
- add env var to determine if using local minio to use correct endpoint URL
* crawls /seeds endpoint: just return empty list if not a crawl (eg. upload)
* crawlmanager: remove unused code, rename check_storage -> has_storage
* keep track of per pod status on crawljob:
- crashes time, and reason
- 'used' vs 'allocated' resources
- 'percent' used / allocated
* crawl log errors: log error when crawler crashes via OOM, either via redis error log
or to console
* add initial autoscaling support!
- detect if metrics server is available via K8SApi.is_pod_metrics_available()
- if available, use metrics for 'used' fields
- if no metrics, set memory used for redis only (using redis apis)
- allow overriding memory and cpu via newMemory and newCpu settings on pod status
- scale memory / cpu based on newMemory and newCpu setting
- templates: update jinja templates to allow restarting crawler and redis with new resources
- ci: enable metrics-server on k3d, microk8s and nightly k3d ci runs
* roles: cleanup unused roles, add permissions for listing metrics
* stats for running crawls:
- update in db via operator
- avoids losing stats if the redis pod happens to be shut down
- tradeoff is more db access in operator, but less extra connections to redis + already
loading from db in backend
- size stat: ensure size of previous files is added to the stats
* crawler deployment tweaks:
- adjust cpu/mem per browser
- add --headless flag to configmap to use new headless mode by default!
- add liveness check / fix readiness check: ensure 'redis-cli ping' actually returns 'PONG', as the exit code is 0 even on errors;
this will detect situations where redis is not available, such as due to max clients being reached
- bump redis memory/cpu for now (until autoscaling/automatic adjustment is available)
* migration improvements + rerunning migrations (fixes #1227):
- avoid starting some workers while migration is still running
- ensure workers that aren't performing the migration wait for it to complete
- backend will not be valid until migration is run
* allow rerunning migration from specified version via --set rerun_from_migration=<VERSION> (replaces rerun_last_migration)
* use metacontroller's decoratorcontroller to create CrawlJob from Job
* scheduled job work:
- use existing job name for scheduled crawljob
- use suspended job, set startTime, completionTime and succeeded status on job when crawljob is done
- simplify cronjob template: remove job_image, cron_namespace, using same namespace as crawls,
placeholder job image for cronjobs
* move storage quota check to crawljob handler:
- add 'skipped_quota_reached' as new failed status type
- check for storage quota before checking if crawljob can be started, fail if not (check before any pods/pvcs created)
* frontend:
- show all crawls in crawl workflow, no need to filter by status
- add 'skipped_quota_reached' status, show as 'Skipped (Quota Reached)', render same as failed
* migration: make release namespace available as DEFAULT_NAMESPACE, delete old cronjobs in DEFAULT_NAMESPACE and recreate in crawlers namespace with new template
- Ability for a pod to be Completed, unlike in a StatefulSet, where if 3 pods are running and the first one finishes, all 3 must keep running until all 3 are done. With this setup, the first finished pod can remain in Completed state.
- Fixed shutdown order: crawler pods now correctly shut down before redis pods, by switching to background deletion.
- Pod priority decreases with scale: 1st instance of a new crawl can preempt 3rd or 2nd instance of another crawl
- Create priority classes up to 'max_crawl_scale', configured in values.yaml
- Improved scale change reconciliation: if increasing scale, immediately scale up. If decreasing scale,
gracefully stop the scaled-down instances via the redis 'stopone' key, waiting until they exit with Completed state
before adjusting status.scale / removing scaled-down pods. This ensures unexpected interrupts don't cause scaled-down data to be deleted.
- Redis pod remains inactive until crawler is first active, or after no crawl pods are active for 60 seconds
- Configurable Redis storage with 'redis_storage' value, set to 3Gi by default
- CrawlJob deletion starts as soon as post-finish crawl operations are run
- Post-crawl operations get their own redis instance, since the one used during the crawl is being cleaned up in the finalizer
- Finalizer ignores request with incorrect state (returns 400 if reported as not finished while crawl is finished)
- Current resource usage added to status
- Profile browser: also manage single pod directly without statefulset for consistency.
- Restart pods via restartTime value: if spec.restartTime != status.restartTime, clear out pods and update status.restartTime (using OnDelete policy to avoid recreate loops in edge cases).
- Update to latest metacontroller (v4.11.0)
- Add --restartOnError flag for crawler (for browsertrix-crawler 0.11.0)
- Failed crawl logging: add 'fail_crawl()' to be used for failing a crawl, which prints logs for the default container (if enabled) as well as pod status
- tests: check other finished states to avoid stuck in infinite loop if crawl fails
- tests: disable disk utilization check, which adds unpredictability to crawl testing!
fixes #1147
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* ingress: simplify ingress config (fixes #1135):
- use standard Prefix pathTypes
- remove nginx-specific rewriting
- remove 'scheme', use https/http based on 'tls' setting (in ingress and configmap)
- fix signing ingress to use ingressClassName
* log output from a failed container only if the 'log_failed_crawl_lines' value is
set to the number of last lines to log
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>