* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives - move shared functionality to basecrawl.py - create a base BaseCrawl object, which contains start / finish time, metadata and files array - create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove * uploads api: (part of #929) - new UploadCrawl object which extends BaseCrawl, has name and description - support multipart form data data upload to /uploads/formdata - support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts - require 'filename' param to set upload filename for streaming uploads (otherwise use form data names) - sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz - uploads have internal id 'upload-<uuid>' - create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete' - handle upload failures, abort multipart upload - ensure uploads added within org bucket path - return id / added when adding new UploadedCrawl - support listing, deleting, and patch /uploads - support upload details via /replay.json to support for replay - add support for 'replaceId=<id>', which would remove all previous files in upload after new upload succeeds. if replaceId doesn't exist, create new upload. (only for stream endpoint so far). - support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name') * base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources) - support all-crawls/<id>/replay.json with resources - Use ListCrawlOut model for /all-crawls list endpoint - Extend BaseCrawlOut from ListCrawlOut, add type - use 'type: crawl' for crawls and 'type: upload' for uploads - migration: ensure all previous crawl objects / missing type are set to 'type: crawl' - indexes: add db indices on 'type' field and with 'type' field and oid, cid, finished, state * tests: add test for multipart and streaming upload, listing uploads, deleting upload - add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz' * collections: support adding and remove both crawls and uploads via base crawl - include collection_ids in /all-crawls list - collections replay.json can include both crawls and uploads bump version to 1.6.0-beta.2 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> |
||
---|---|---|
.github | ||
ansible | ||
backend | ||
chart | ||
configs | ||
docs | ||
frontend | ||
scripts | ||
test | ||
.gitignore | ||
.pre-commit-config.yaml | ||
btrix | ||
CHANGES.md | ||
dev-values.yaml | ||
LICENSE | ||
mkdocs.yml | ||
NOTICE | ||
pylintrc | ||
README.md | ||
update-version.sh | ||
version.txt | ||
yarn.lock |
Browsertrix Cloud
Browsertrix Cloud is an open-source cloud-native high-fidelity browser-based crawling service designed to make web archiving easier and more accessible for everyone.
The service provides an API and UI for scheduling crawls and viewing results, and managing all aspects of crawling process. This system provides the orchestration and management around crawling, while the actual crawling is performed using Browsertrix Crawler containers, which are launched for each crawl.
See Features for a high-level list of planned features.
Documentation
The full docs for using, deploying and developing Browsertrix Cloud are available at: https://docs.browsertrix.cloud
Deployment
The latest deployment documentation is available at: https://docs.browsertrix.cloud/deploy
The docs cover deploying Browsertrix Cloud in different environments using Kubernetes, from a single-node setup to scalable clusters in the cloud.
Previously, Browsertrix Cloud also supported Docker Compose and podman-based deployment. This is now deprecated due to the complexity of maintaining feature parity across different setups, and with various Kubernetes deployment options being available and easy to deploy, even on a single machine.
Making deployment of Browsertrix Cloud as easy as possible remains a key goal, and we welcome suggestions for how we can further improve our Kubernetes deployment options.
If you are looking to just try running a single crawl, you may want to try Browsertrix Crawler first to test out the crawling capabilities.
Development Status
Browsertrix Cloud is currently in a beta, though the system and backend API is fairly stable, we are working on many additional features.
Additional developer documentation is available at https://docs.browsertrix.cloud/develop
Please see the GitHub issues and this GitHub Project for our current project plan and tasks.
License
Browsertrix Cloud is made available under the AGPLv3 License.
Documentation is made available under the Creative Commons Attribution 4.0 International License