Commit Graph

1463 Commits

Author SHA1 Message Date
Anish Lakhwara
32428f4d93
fix: usr/bin/env bash interpreter for btrix (#1028) 2023-08-01 09:28:56 -07:00
sua yoo
54e2b2c703
List web captures in Collection (#1024)
- Adds tab for "Web Captures" in Collection detail view
- Move Collection description under Replay section
- Fixes app reloading when clicking into a Collection
- Standardizes Web Capture list headers from "Finished -> "Created Date"
2023-08-01 09:14:27 -07:00
Ilya Kreymer
06cf9c7cc3
add crawl ending states: 'generate-wacz', 'uploading-wacz', 'pending-wait' that occur after a crawl is finished or is being stopped (#1022)
operator: ensure transitions from each of these states is supported, including to 'waiting_capacity'
add extra check on stopping to avoid transitioning back to a running state after crawl is finished
ui: add states to UI display, localization, add as active states
fixes #263
2023-08-01 00:15:59 -07:00
Anish Lakhwara
d848502f84
fix(build): add DOCKER_BUILDKIT=1 to frontend Dockerfile to better support older versions of Docker (#1021) 2023-08-01 00:15:03 -07:00
Ilya Kreymer
7ea6d76f10
Resource Constraints Cleanup: (fixes #895) (#1019)
* resource constraints: (fixes #895)
- for cpu, only set cpu requests
- for memory, set mem requests == mem limits
- add missing resource constraints for minio and scheduled job
- for crawler, set mem and cpu constraints per browser, scale based on browser instances per crawler
- add comments in values.yaml for crawler values being multiplied
- default values: bump crawler to 650 millicpu per browser instance just in case

cleanup: remove unused entries from main backend configmap
2023-08-01 00:11:16 -07:00
Anish Lakhwara
d8502da885
fix(build): use /usr/bin/env bash instead of /bin/bash (#1020)
* fix: add to various other shell scripts
2023-07-28 21:50:04 -07:00
Ilya Kreymer
c76dd10928
chart: always pull latest crawler image - since default image is pointing to webrecorder/browsertrix-crawler:latest, makes sense to always pull latest (#1018) 2023-07-27 12:41:41 -07:00
Anish Lakhwara
a347f61973
ci: password check: fix: don't break on ScannerError (#1017) 2023-07-27 07:19:27 -07:00
Vinzenz Sinapius
5807507f29
Add proxy settings for crawler and profilebrowser (#997) 2023-07-26 16:11:10 -07:00
sua yoo
7069b33646
Show only running crawls in superadmin view (#1015)
- Show separate crawls list for admin view, fixes #1010
2023-07-26 15:48:20 -07:00
Ilya Kreymer
6506965d98
Streaming Download for Collections (#1012)
* support streaming download of collections (part of #927)
- WACZ zip created on the fly using stream-zip
- add 'Download Collection' option to collection detail and list
- after editing collection, return to collection view
- tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-07-26 15:42:17 -07:00
Anish Lakhwara
6062042fae
feat: create DO registry if it doesn't exist (#947)
- if use_do_registry is enabled and registry doesn't exist, create it
2023-07-26 15:41:03 -07:00
Anish Lakhwara
4c1465d94b
feat: ansible DO teardown (#950)
* feat: ansible DO teardown

* fix(DO): idempotency issues in ansible teardown

* chore(DO): remove unused code

* docs(ansible): mention teardown in the docs

* fix: pass ansible-lint

* fix: point database backup upload to the correct location in DO space
2023-07-26 15:38:59 -07:00
Anish Lakhwara
b5a9c42df1
feat: add pre-commit to check we don't have real passwords in yml files (#990)
* feat: use existing pre-commit framework

* feat(ci): add github action for password_check

* feat: add some simple tests to password_check.py

* fix: set `backend_password_secret` in default values.yaml to an allowed password
2023-07-26 13:29:37 -07:00
Tessa Walsh
c21153255a
Rename notes to description in frontend and backend (#1011)
- Rename crawl notes to description
- Add migration renaming notes -> description
- Stop inheriting workflow description in crawl
- Update frontend to replace crawl/upload notes with description
- Remove setting of config description from crawl list
- Adjust tests for changes
2023-07-26 13:00:04 -07:00
Ilya Kreymer
4bea7565bc
load handling: scale up redis only when crawler pods running (#1009)
Operator: Modified init behavior to only load redis when at least one crawler pod available:
- waits for at least one crawler pod to be available before starting redis pod, to avoid situation where many crawler pods are in pending mode, but redis pods are still running.
- redis statefulset starts at scale of 0
- once crawler pod becomes available, redis sts is scaled to 1 (via `initRedis==true` status)
- crawl remains in 'starting' or 'waiting_capacity' state until pod becomes available without redis pod running
- set to 'running' state only after redis and at least one crawler pod is available
- if no crawler pods available after running, or, if stuck in starting for >60 seconds, switch to 'waiting_capacity' state
- when switching to 'waiting_capacity', also scale down redis to 0, wait for crawler pod to become available, only then scale up redis to 1, and get back to 'running'

other tweaks:
- add new status field 'initRedis', default to false, not displayed
- crawler pod: consider 'ContainerCreating' state as available, as container will not be blocked by resource limits
- add a resync after 3 seconds when waiting for crawler pod or redis pod to become available, configurable via 'operator_fast_resync_secs'
- set_state: if not updating state, ensure state reflects actual value in db
2023-07-26 08:40:05 -07:00
Tessa Walsh
608a744aaf
Add migration to replace None with 0 for configmap CRAWL_TIMEOUT (#1008) 2023-07-24 15:49:26 -04:00
Tessa Walsh
fcd48b1831
Add totalSize to collections and make it sortable in list endpoint (#1001)
* Precompute collection.totalSize and make sortable

* Add migration to recompute collection data with totalSize
2023-07-24 13:12:23 -04:00
sua yoo
75b011f951
Upload WACZ via UI (#992)
- Users can now upload .WACZ archives from the "Archived Data" page.
- Can specify name, description, tags and collection(s) to add upload to
- Show progress of upload
- Support canceling upload
2023-07-21 16:45:52 +02:00
Tessa Walsh
9f32aa697b
Add collections and tags to upload API endpoints (#993)
* Add collections and tags to uploads

* Fix order of deletion check test

* Re-add tags to UploadedCrawl model after rebase

* Fix Users model heading
2023-07-21 16:44:56 +02:00
Tessa Walsh
4014d98243
Move pydantic models to separate module + refactor crawl response endpoints to be consistent (#983)
* Move all pydantic models to models.py to avoid circular dependencies
* Include automated crawl details in all-crawls GET endpoints
- ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated
crawls only:
- cid
- name
- description
- firstSeed
- seedCount
- profileName

* Add automated crawl fields to list all-crawls test

* Uncomment mongo readinessProbe

* cleanup CrawlOutWithResources:
- remove 'files' from output model, only resources should be returned
- add _files_to_resources() to simplify computing presigned 'resources' from raw 'files'
- update upload tests to be more consistent, 'files' never present, 'errors' always none

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-20 13:05:33 +02:00
Tessa Walsh
577416024b
Fix pull_request syntax in ansible lint GH Action (#995)
* Fix pull_request syntax in ansible lint GH Action

* Only lint Digital Ocean playbook for now

* fix: pass ansible lint

---------

Co-authored-by: Anish Lakhwara <anish+git@lakhwara.com>
2023-07-20 12:13:52 +02:00
sua yoo
85913112a2
Upgrade lit + shoelace to reduce build size (#938)
* upgrade lit

* upgrade shoelace

* upgrade testing libraries

* add webpack bundle analyzer

* revert shoelace changes

* remove bundle analyzer

* remove console log
2023-07-20 11:50:05 +02:00
Tessa Walsh
d5c3a8519f
Add crawler Use Sitemap option to Browsertrix Cloud (#978)
* Add user-guide docs for Use Sitemap option
---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-07-19 13:57:52 -04:00
Anish Lakhwara
db851b8360
Merge pull request #976 from webrecorder/ansible-lint-action
feat: ansible lint github action
2023-07-19 03:22:22 +10:00
Ilya Kreymer
a5312709bb
fix issues that caused cronjob container to crash: (#987)
- don't set CRAWL_TIMEOUT to "None" in configmap, and if encountered, just set to 0
- run register_exit_handler() after run loop has been inited
2023-07-18 18:08:53 +02:00
sua yoo
c5b3be0680
Fix frontend formatting pre-commit (#991)
* update lint staged config

* remove prettier defaults
2023-07-18 17:51:13 +02:00
Anish Lakhwara
4fed3ed1b0
fix: resolve ansible pipenv dependencies successfully (#977) 2023-07-18 17:39:38 +02:00
Ilya Kreymer
5dede47874 remove accidentally added values file! 2023-07-16 15:05:08 +02:00
Anish Lakhwara
bc82f562dc feat: ansible lint github action 2023-07-10 17:58:47 -07:00
Ilya Kreymer
2372f43c2c
frontend: fix to collection editor with crawls and uploads (#971)
* frontend:
- follow up to #969, fixes crawl workflows by using crawl-specific endpoint and merging results

* get crawls and uploads concurrently

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-07-10 19:29:19 +02:00
sua yoo
f3660839bf
Allow users to add uploads to collections (#968)
* show uploads in 'Select Uploads' section
2023-07-09 22:21:50 -07:00
Ilya Kreymer
7d694754c6
uploads api ext: (#970)
- also support collectionId filter on /all-crawls
- update tests
2023-07-09 22:12:54 -07:00
Ilya Kreymer
f1bce310d0
uploads api: support filtering uploads by collectionId (#969)
tests: add collection filter test
2023-07-09 10:54:30 -07:00
Ilya Kreymer
a640f58657
Tests: fix test get crawl loop (#967)
* tests: add sleep() between all looping get_crawl() calls to avoid tight request loop, also remove unneeded loop
will likely fix occasional '504 timeout' test failures where frontend is overwhelmed with /replay.json requests
2023-07-08 17:16:11 -07:00
Henry Wilkinson
d9e73fcbc3
Reorder Limits section (#966)
* Reorder Limits section

- Minor text change to section names
  - "Limit Per Page" → "Per-Page Limits"
  - "Limit Per Crawl" → "Per-Crawl Limits"

* Reorder limits section in documentation
2023-07-08 08:54:30 -07:00
Anish Lakhwara
fd310f620a
fix: mongodb uri password not accessible on second API call (#964) 2023-07-08 08:48:50 -07:00
Anish Lakhwara
9489c1e00d
fix: configure_kubectl is the variable name (#963) 2023-07-08 08:13:54 -07:00
Anish Lakhwara
df82a4755f
fix: pass ansible-lint in DO playbook (#962)
* fix: pass ansible-lint in DO playbook

* fix: don't break s3 module
2023-07-08 08:13:23 -07:00
Ilya Kreymer
8eeb66e11f
Frontend more upload path fixes (#961)
* additional fixes for #935:
- don't use artifactType for detail pages, ensure correct artifact selected based on path

* naming tweaks:
- from uploads detail, return to 'All Uploads' with filter
- from crawls detail, return to 'All Crawls' with filter
- rename general to 'All Archived Data'
2023-07-07 15:41:03 -07:00
Anish Lakhwara
478719d59a
fix: only use db_create when the db is created (#959) 2023-07-07 14:38:03 -07:00
Ilya Kreymer
d3a757e20b
partial fix for: #935: (#960)
- add route for /artifacts/upload/<id> to be used for uploads
- link uploads to /artifacts/upload/<id> instead of /artifacts/crawl/<id>
2023-07-07 14:23:26 -07:00
sua yoo
de4b18aa67
List crawls, uploads, and all objects in UI (#941)
- Adds top-level "Archived Data" view, replacing "Finished Crawls" and moving it as "Crawls" into view
- Adds list for viewing all artifacts/data
- Adds list for viewing all uploaded crawls
- Updates crawl detail view to show upload details
- Edit upload metadata, including 'name'
- Delete uploads
---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2023-07-07 13:20:28 -07:00
Ilya Kreymer
d7cb47390e
readd support for passing in 'crawler_extra_args' for additional/custom (#957)
options not covered by standard crawler opts (removed setting all args this way in #889)
2023-07-07 12:08:40 -07:00
Ilya Kreymer
2038e3d668
remove default: similar to #952, remove default extraHops setting as it disables 'url list' extraHops by forcing the value to 0 (#954) 2023-07-07 12:08:30 -07:00
Ilya Kreymer
7139b9a7a9
operator: ensure finished is always set (#953) 2023-07-07 12:08:15 -07:00
Anish Lakhwara
99117a532b
feat: configure mongodb firewall (#949)
Co-authored-by: Anish Lakhwara <anish+git@lakhwara.com>
2023-07-07 09:15:36 -07:00
Anish Lakhwara
c5803dcda0
feat: configure kubectl through ansible (#948)
Co-authored-by: Anish Lakhwara <anish+git@lakhwara.com>
2023-07-07 09:15:18 -07:00
Anish Lakhwara
dd3d9001fb
fix: idempotent mongodb creation, with saved facts (#945)
Co-authored-by: Anish Lakhwara <anish+git@lakhwara.com>
2023-07-07 09:14:12 -07:00
Ilya Kreymer
00eb62214d
Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937)
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove

* uploads api: (part of #929)
- new UploadCrawl object which extends BaseCrawl, has name and description
- support multipart form data data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to support for replay
- add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')

* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects / missing type are set to 'type: crawl'
- indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state

* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'

* collections: support adding and remove both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads

bump version to 1.6.0-beta.2
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-07-07 09:13:26 -07:00