browsertrix

Author	SHA1	Message	Date
Ilya Kreymer	6384d8b5f1	Additional Type Hints / Type Fix Pass (#1320) This PR adds more type safety to the backend codebase: - All ops class calls should be type checked - Avoiding circular references with TYPE_CHECKING conditional - Consistent UUID usage: replace uuid.UUID / UUID4 with just UUID - Crawl states moved to models, made into lists - Additional typing added as needed, fixed a few type related errors - CrawlOps / UploadOps / BaseCrawlOps now all have same param init order to simplify changes	2023-10-30 12:59:24 -04:00
Ilya Kreymer	6dc452ebad	Storage Refactor: Replication + Custom Storage Support (#1296) - Refactors storage to support replicas + custom storages on the Org. - There is a default primary + replica storage, while an Org can also have primary and replica storages. - StorageRef object is used to store references to default and custom storage. - CrawlFile has been updated to contain a StorageRef instead of a def_storage_name, which references either a default storage (in StorageOps) or custom storage (in Organization) - There is also a 'replicas' Optional[List[StorageRef]] which contains replicas, if any. - CrawlFileOut contains a numReplicas for how many replicas exist for a given file. - Migration: migration 0020 added to migrate existing Orgs, CrawlFile and ProfileFile objects to new storage system (CrawlFile and ProfileFile now extend BaseFile) Part of #1262 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-26 21:44:09 -07:00
Ilya Kreymer	41c054d209	Storage ops followup type checking (#1274) * storage ops: follow up to #1257: - fix refactor typo - add type hints for all storageops apis (add mypy_boto3_s3 and types_aiobotocore_s3 for type hints)	2023-10-11 14:03:00 -07:00
Ilya Kreymer	16e7a1d0a2	Storage Ops Refactor (#1257) * storage ops refactor: - create StorageOps class similar to other ops classes - init storages list in StorageOps, no longer require looking up default storages via CrawlManager - convert all storage functions to members, add storageops to operator - remove unused params, ensure crawl exists for rollover restart - add env var to determine if using local minio to use correct endpoint URL * crawls /seeds endpoint: just return empty list if not a crawl (eg. upload) * crawlmanager: remove unused code, rename check_storage -> has_storage	2023-10-10 15:04:23 -07:00
Anish Lakhwara	037396f3d9	Fix: Stream log downloading from WACZ (#1225) * Fix(backend): Stream logs without causing OOM Also be smarter about when to use `heapq.merge` and when to use `itertools.chain`: If all the logs are coming from the same instance we `chain` them, otherwise we'll `merge` them iterator fixes: - group wacz files by instance by suffix, eg. -0.wacz, -1.wacz, -2.wacz - sort wacz files, and all logs within each wacz file - chain log iterators for all log files within wacz group - merge log iterators across wacz files in different groups - add type hints to help keep track of iterator helper functions - add iter_lines() from botocore, use that for line parsing for simplicity --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-28 18:54:52 -07:00
Tessa Walsh	2efc461b9b	Implement sync streaming for finished crawl logs (#1168) - Crawl logs streamed from WACZs using the sync boto client	2023-09-14 17:05:19 -07:00
Ilya Kreymer	6506965d98	Streaming Download for Collections (#1012) * support streaming download of collections (part of #927) - WACZ zip created on the fly using stream-zip - add 'Download Collection' option to collection detail and list - after editing collection, return to collection view - tests: add test for streaming download, ensure WACZ files + datapackage present, STORE compression used --------- Co-authored-by: sua yoo <sua@suayoo.com>	2023-07-26 15:42:17 -07:00
Tessa Walsh	4014d98243	Move pydantic models to separate module + refactor crawl response endpoints to be consistent (#983) * Move all pydantic models to models.py to avoid circular dependencies * Include automated crawl details in all-crawls GET endpoints - ensure /all-crawls endpoint resolves names / firstSeed data same as /crawls endpoint for crawls to ensure consistent frontend display. fields added in get and list all-crawl endpoints for automated crawls only: - cid - name - description - firstSeed - seedCount - profileName * Add automated crawl fields to list all-crawls test * Uncomment mongo readinessProbe * cleanup CrawlOutWithResources: - remove 'files' from output model, only resources should be returned - add _files_to_resources() to simplify computing presigned 'resources' from raw 'files' - update upload tests to be more consistent, 'files' never present, 'errors' always none --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-07-20 13:05:33 +02:00
Ilya Kreymer	00eb62214d	Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937) * basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives - move shared functionality to basecrawl.py - create a base BaseCrawl object, which contains start / finish time, metadata and files array - create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove * uploads api: (part of #929) - new UploadCrawl object which extends BaseCrawl, has name and description - support multipart form data upload to /uploads/formdata - support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts - require 'filename' param to set upload filename for streaming uploads (otherwise use form data names) - sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz - uploads have internal id 'upload-<uuid>' - create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete' - handle upload failures, abort multipart upload - ensure uploads added within org bucket path - return id / added when adding new UploadedCrawl - support listing, deleting, and patch /uploads - support upload details via /replay.json for replay - add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far). - support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name') * base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources) - support all-crawls/<id>/replay.json with resources - Use ListCrawlOut model for /all-crawls list endpoint - Extend BaseCrawlOut from ListCrawlOut, add type - use 'type: crawl' for crawls and 'type: upload' for uploads - migration: ensure all previous crawl objects / missing type are set to 'type: crawl' - indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state * tests: add test for multipart and streaming upload, listing uploads, deleting upload - add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz' * collections: support adding and removing both crawls and uploads via base crawl - include collection_ids in /all-crawls list - collections replay.json can include both crawls and uploads bump version to 1.6.0-beta.2 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-07-07 09:13:26 -07:00
Tessa Walsh	fb80a04f18	Add crawl /log API endpoint If a crawl is completed, the endpoint streams the logs from the log files in all of the created WACZ files, sorted by timestamp. The API endpoint supports filtering by log_level and context whether the crawl is still running or not. This is not yet proper streaming because the entire log file is read into memory before being streamed to the client. We will want to switch to proper streaming eventually, but are currently blocked by an aiobotocore bug - see: https://github.com/aio-libs/aiobotocore/issues/991?#issuecomment-1490737762	2023-04-11 11:51:17 -04:00
Tessa Walsh	bd4fba7af7	Fix POST /orgs/{oid}/crawls/delete (#591) * Fix POST /orgs/{oid}/crawls/delete - Add permissions check to ensure crawler users can only delete their own crawls - Fix broken delete_crawls endpoint - Delete files from storage as well as deleting crawl from db - Add tests, including nightly test that ensures crawl files are no longer accessible after the crawl is deleted	2023-02-15 21:06:12 -05:00
Tessa Walsh	23022193fb	Reformat backend for black 23.1.0 (#548)	2023-02-01 20:01:09 -05:00
Tessa Walsh	0fa60ebc45	Rename archives/teams -> orgs in codebase + add db migration (#486) * Rename archives to orgs and aid to oid on backend * Rename archive to org and aid to oid in frontend * Remove translation artifact * Rename team -> organization * Add database migrations and run once on startup * This commit also applies the new by_one_worker decorator to other asyncio tasks to prevent heavy tasks from being run in each worker. * Run black, pylint, and husky via pre-commit * Set db version and use in migrations * Update and prepare database in single task * Migrate k8s configmaps	2023-01-18 14:51:04 -08:00
Ilya Kreymer	aabb0b2a92	chart / deployment fixes to run on microk8s: (fixes #385) (#387) - ingress: fix proxying /data to minio, use another ingress which proxies correct host to ensure presigned urls work - presigning: determine if signing endpoint url (minio) or access endpoint (cloud bucket) based on if access endpoint is provided, set bool on storage object - chart: fix indent on incorrect storageClassName configs - ingress: make 'ingress_class' configurable (set to 'public' for microk8s, default to 'nginx') - minio: use older minio image which supports legacy fs based setup (for now) - nginx service: add 'nginx_service_use_node_port' config setting: if true, will use NodePort for frontend, otherwise will use default (ClusterIP) and only for the frontend / nginx - chart: remove changing service type for other services	2022-11-30 09:21:58 -08:00
Ilya Kreymer	d340bceb39	style pass: normalize docstring spacing	2022-10-19 21:47:34 -07:00
Ilya Kreymer	bf79959a5a	refactoring to use statefulsets + job (#245) - use statefulsets instead of deployments for mongo, redis, signer - use k8s job + statefulset for running crawls - use separate statefulset for crawl (scaled) and single-replica redis stateful set - move crawl job update logic to crawl_updater - remove shared redis chart package refactor: - move shared code to 'btrixcloud' - move k8s to 'btrixcloud.k8s' - move docker to 'btrixcloud.docker'	2022-06-05 10:37:17 -07:00

16 Commits
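
Note on 037396f3d9 (#1225): the commit message describes grouping WACZ files by instance suffix, chaining log iterators within a group, and merging across groups. The following is a minimal sketch of that idea only, not the code from the commit. The function names, the `-<n>.wacz` regex, the in-memory `logs_by_wacz` mapping, and the assumption that each log line is JSON with a sortable `timestamp` field are all hypothetical and chosen for illustration.

```python
import heapq
import itertools
import json
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

# Assumption: the crawler instance index is the number right before ".wacz".
INSTANCE_RE = re.compile(r"-(\d+)\.wacz$")


def group_by_instance(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group WACZ file names by instance suffix, sorted within each group."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in sorted(names):
        match = INSTANCE_RE.search(name)
        groups[match.group(1) if match else ""].append(name)
    return dict(groups)


def merged_log_lines(logs_by_wacz: Dict[str, List[str]]) -> Iterator[str]:
    """Yield log lines from all WACZ files in timestamp order.

    Assumes lines within one instance are already in timestamp order,
    so they can be chained; different instances are merged lazily.
    """
    groups = group_by_instance(logs_by_wacz)
    per_instance = [
        itertools.chain.from_iterable(logs_by_wacz[name] for name in names)
        for names in groups.values()
    ]
    if len(per_instance) == 1:
        # Single instance: plain concatenation, no line parsing needed.
        return per_instance[0]
    # Multiple instances: lazy k-way merge keyed on the parsed timestamp.
    return heapq.merge(
        *per_instance, key=lambda line: json.loads(line)["timestamp"]
    )
```

Both `itertools.chain.from_iterable` and `heapq.merge` are lazy, so only one pending line per instance is held at a time, which matches the commit's stated goal of streaming logs without running out of memory. In a real backend the per-file line lists would presumably be replaced by streaming iterators over the stored files.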