Ilya Kreymer 00eb62214d Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937)	2023-07-07 09:13:26 -07:00

* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
  - move shared functionality to basecrawl.py
  - create a base BaseCrawl object, which contains start / finish time, metadata and files array
  - create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove

* uploads api: (part of #929)
  - new UploadCrawl object which extends BaseCrawl, has name and description
  - support multipart form data upload to /uploads/formdata
  - support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
  - require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
  - sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
  - uploads have internal id 'upload-<uuid>'
  - create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
  - handle upload failures, abort multipart upload
  - ensure uploads added within org bucket path
  - return id / added when adding new UploadedCrawl
  - support listing, deleting, and patch /uploads
  - support upload details via /replay.json to support replay
  - add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far).
  - support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')

* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
  - support all-crawls/<id>/replay.json with resources
  - Use ListCrawlOut model for /all-crawls list endpoint
  - Extend BaseCrawlOut from ListCrawlOut, add type
  - use 'type: crawl' for crawls and 'type: upload' for uploads
  - migration: ensure all previous crawl objects / missing type are set to 'type: crawl'
  - indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state

* tests: add test for multipart and streaming upload, listing uploads, deleting upload
  - add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'

* collections: support adding and remove both crawls and uploads via base crawl
  - include collection_ids in /all-crawls list
  - collections replay.json can include both crawls and uploads

bump version to 1.6.0-beta.2

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
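
The data model and upload naming scheme described in the commit message above can be sketched roughly as follows. This is an illustrative sketch only, not the code in basecrawl.py: the field names, the `sanitize_filename`, `upload_path` and `new_upload` helpers, and the exact sanitization rules are assumptions. Only the `upload-<uuid>` id format, the `type` values, the 'complete' state and the `uploads/<uuid>/<sanitized-filename>-<random>.wacz` layout come from the commit message.

```python
import re
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class BaseCrawl:
    """Fields shared by crawls and uploads (field names are assumed)."""
    id: str
    type: str  # 'crawl' or 'upload', per the commit message
    started: datetime
    finished: Optional[datetime] = None
    files: List[str] = field(default_factory=list)
    collection_ids: List[str] = field(default_factory=list)


@dataclass
class UploadedCrawl(BaseCrawl):
    """An upload adds a name and description on top of the base fields."""
    name: str = ""
    description: str = ""
    state: str = "complete"  # uploads are stored as already complete


def sanitize_filename(filename: str) -> str:
    """Hypothetical sanitizer: drop directories and the .wacz suffix,
    then replace runs of unsafe characters with a single dash."""
    stem = filename.rsplit("/", 1)[-1]
    if stem.lower().endswith(".wacz"):
        stem = stem[: -len(".wacz")]
    stem = re.sub(r"[^A-Za-z0-9_-]+", "-", stem).strip("-")
    return stem or "upload"


def upload_path(upload_uuid: str, filename: str, suffix: Optional[str] = None) -> str:
    """Build uploads/<uuid>/<sanitized-filename>-<random>.wacz,
    relative to the org bucket path."""
    suffix = suffix or uuid.uuid4().hex[:8]
    return f"uploads/{upload_uuid}/{sanitize_filename(filename)}-{suffix}.wacz"


def new_upload(name: str, filename: str, now: datetime) -> UploadedCrawl:
    """Create an upload record with internal id 'upload-<uuid>'."""
    upload_uuid = str(uuid.uuid4())
    return UploadedCrawl(
        id=f"upload-{upload_uuid}",
        type="upload",
        started=now,
        finished=now,
        files=[upload_path(upload_uuid, filename)],
        name=name,
    )
```

Keeping the shared fields on a single base class is what lets a listing such as /all-crawls return crawls and uploads together and distinguish them by `type`.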
.github	Concurrent Crawl Limit (#874 )	2023-05-30 15:38:03 -07:00
ansible	add authsign block to microk8s playbook (#776 )	2023-05-05 11:32:32 -07:00
backend	Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937 )	2023-07-07 09:13:26 -07:00
chart	Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937 )	2023-07-07 09:13:26 -07:00
configs	Remove Code and Configs for Swarm/podman support (#407 )	2022-12-08 18:19:58 -08:00
docs	docs: ansible deploy docs reflect expected env var names (#946 )	2023-07-06 21:57:19 -07:00
frontend	Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937 )	2023-07-07 09:13:26 -07:00
scripts	Fix doc to build a local image for microk8s (#594 )	2023-02-14 16:10:04 -08:00
test	Single config and env vars (#267 )	2022-06-16 21:50:03 -07:00
.gitignore	Make btrix helper work with microk8s (#768 )	2023-04-18 08:50:46 -04:00
.pre-commit-config.yaml	Rename archives/teams -> orgs in codebase + add db migration (#486 )	2023-01-18 14:51:04 -08:00
btrix	btrix helper: Add -microk8s flag to explicitly use microk8s (#888 )	2023-05-30 15:41:26 -07:00
CHANGES.md	version: bump to 1.3.0	2023-02-24 18:07:56 -08:00
dev-values.yaml	Frontend collections beta UI (#886 )	2023-06-06 17:52:01 -07:00
LICENSE
mkdocs.yml	Org Settings documetation & Getting Started docs page updates	2023-06-11 17:39:16 -04:00
NOTICE
pylintrc
README.md	docs: fix link to dev docs	2023-05-24 10:59:41 -07:00
update-version.sh	Release Build + Versioning (#373 )	2022-11-18 17:15:25 -08:00
version.txt	Uploads API: BaseCrawl refactor + Initial support for /uploads endpoint (#937 )	2023-07-07 09:13:26 -07:00
yarn.lock	Frontend collections beta UI (#886 )	2023-06-06 17:52:01 -07:00

README.md

Browsertrix Cloud

Browsertrix Cloud is an open-source, cloud-native, high-fidelity, browser-based crawling service designed to make web archiving easier and more accessible for everyone.

The service provides an API and UI for scheduling crawls, viewing results, and managing all aspects of the crawling process. The system handles orchestration and management around crawling, while the actual crawling is performed by Browsertrix Crawler containers, which are launched for each crawl.

See Features for a high-level list of planned features.

Documentation

The full docs for using, deploying and developing Browsertrix Cloud are available at: https://docs.browsertrix.cloud

Deployment

The latest deployment documentation is available at: https://docs.browsertrix.cloud/deploy

The docs cover deploying Browsertrix Cloud in different environments using Kubernetes, from a single-node setup to scalable clusters in the cloud.

Previously, Browsertrix Cloud also supported Docker Compose and podman-based deployments. These are now deprecated because of the complexity of maintaining feature parity across different setups, and because various Kubernetes deployment options are available and easy to deploy, even on a single machine.

Making deployment of Browsertrix Cloud as easy as possible remains a key goal, and we welcome suggestions for how we can further improve our Kubernetes deployment options.

If you are looking to just try running a single crawl, you may want to try Browsertrix Crawler first to test out the crawling capabilities.

Development Status

Browsertrix Cloud is currently in beta. The system and backend API are fairly stable, but we are working on many additional features.

Additional developer documentation is available at https://docs.browsertrix.cloud/develop

Please see the GitHub issues and this GitHub Project for our current project plan and tasks.

License

Browsertrix Cloud is made available under the AGPLv3 License.

Documentation is made available under the Creative Commons Attribution 4.0 International License.