* tests: add sleep() between all looping get_crawl() calls to avoid tight request loop, also remove unneeded loop
will likely fix occasional '504 timeout' test failures where the frontend is overwhelmed with /replay.json requests
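A minimal sketch of the polling pattern this refers to; `api_prefix`, `oid`, and `headers` are placeholder names, not the actual test fixtures, and `get_crawl()` here stands in for the test suite's helper:

```python
import time
import requests

def get_crawl(api_prefix, oid, crawl_id, headers):
    """Hypothetical helper standing in for the test suite's get_crawl()."""
    r = requests.get(
        f"{api_prefix}/orgs/{oid}/crawls/{crawl_id}/replay.json", headers=headers
    )
    r.raise_for_status()
    return r.json()

def wait_for_crawl_finished(api_prefix, oid, crawl_id, headers, timeout=300):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        data = get_crawl(api_prefix, oid, crawl_id, headers)
        if data.get("state") in ("complete", "failed", "canceled"):
            return data
        time.sleep(5)  # sleep between polls instead of a tight request loop
    raise TimeoutError(f"crawl {crawl_id} did not finish within {timeout}s")
```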
* additional fixes for #935:
- don't use artifactType for detail pages, ensure correct artifact selected based on path
* naming tweaks:
- from uploads detail, return to 'All Uploads' with filter
- from crawls detail, return to 'All Crawls' with filter
- rename general to 'All Archived Data'
- Adds top-level "Archived Data" view, replacing "Finished Crawls" and moving it into the new view as "Crawls"
- Adds list for viewing all artifacts/data
- Adds list for viewing all uploaded crawls
- Updates crawl detail view to show upload details
- Edit upload metadata, including 'name'
- Delete uploads
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
* basecrawl refactor: make crawls db more generic, supporting different types of 'base crawls': crawls, uploads, manual archives
- move shared functionality to basecrawl.py
- create a base BaseCrawl object, which contains start / finish time, metadata and files array
- create BaseCrawlOps, base class for CrawlOps, which supports base crawl deletion, querying and collection add/remove
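A rough pydantic sketch of the shared shape described above; field names are approximations, not the exact basecrawl.py models:

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel

class CrawlFile(BaseModel):
    filename: str
    size: int
    hash: Optional[str] = None

class BaseCrawl(BaseModel):
    """Shared shape for crawls, uploads, and manual archives."""
    id: str
    oid: str                       # owning organization
    started: Optional[datetime] = None
    finished: Optional[datetime] = None
    state: str = "complete"
    tags: List[str] = []
    notes: Optional[str] = None
    files: List[CrawlFile] = []    # WACZ files backing this item
```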
* uploads api: (part of #929)
- new UploadedCrawl object which extends BaseCrawl, has name and description
- support multipart form data upload to /uploads/formdata
- support streaming upload of a single file via /uploads/stream, using botocore multipart upload to upload to s3-endpoint in parts
- require 'filename' param to set upload filename for streaming uploads (otherwise use form data names)
- sanitize filename, place uploads in /uploads/<uuid>/<sanitized-filename>-<random>.wacz
- uploads have internal id 'upload-<uuid>'
- create UploadedCrawl object with CrawlFiles pointing to the newly uploaded files, set state to 'complete'
- handle upload failures, abort multipart upload
- ensure uploads added within org bucket path
- return id / added when adding new UploadedCrawl
- support listing, deleting, and patch /uploads
- support upload details via /replay.json to enable replay
- add support for 'replaceId=<id>', which removes all previous files in the upload after the new upload succeeds; if replaceId doesn't exist, a new upload is created (stream endpoint only so far)
- support patching upload metadata: notes, tags and name on uploads (UpdateUpload extends UpdateCrawl and adds 'name')
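A hedged sketch of the /uploads/stream flow described above, using a plain synchronous boto3 client for illustration (the real backend uses async botocore clients; the helper names and exact path details are approximations):

```python
import os
import re
import uuid

import boto3

def sanitize_filename(name: str) -> str:
    # keep only word chars, dots, and dashes from the client-supplied name
    return re.sub(r"[^\w.-]", "", os.path.basename(name)) or "upload"

def stream_upload(client, bucket: str, org_prefix: str, filename: str, chunks):
    """Upload an incoming stream of byte chunks in parts, aborting on failure.
    `chunks` is any iterable of bytes (each part except the last must be
    at least 5 MiB for S3 multipart uploads)."""
    base = sanitize_filename(filename).removesuffix(".wacz")
    key = f"{org_prefix}/uploads/{uuid.uuid4()}/{base}-{uuid.uuid4().hex[:8]}.wacz"

    mpu = client.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    try:
        for num, chunk in enumerate(chunks, start=1):
            resp = client.upload_part(
                Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                PartNumber=num, Body=chunk,
            )
            parts.append({"ETag": resp["ETag"], "PartNumber": num})
        client.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
            MultipartUpload={"Parts": parts},
        )
        return key
    except Exception:
        # on any failure, abort so no orphaned parts are left behind
        client.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"])
        raise

# client = boto3.client("s3", endpoint_url="https://s3.example.com")  # any S3-compatible endpoint
```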
* base crawls api: Add /all-crawls list and delete endpoints for all crawl types (without resources)
- support all-crawls/<id>/replay.json with resources
- Use ListCrawlOut model for /all-crawls list endpoint
- Extend BaseCrawlOut from ListCrawlOut, add type
- use 'type: crawl' for crawls and 'type: upload' for uploads
- migration: ensure all previous crawl objects with a missing type are set to 'type: crawl'
- indexes: add db indices on the 'type' field alone and on 'type' combined with oid, cid, finished, and state
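A rough motor/pymongo sketch of the migration backfill and the new indexes (the collection name and exact index set are illustrative):

```python
from motor.motor_asyncio import AsyncIOMotorDatabase

async def migrate_and_index(mdb: AsyncIOMotorDatabase):
    crawls = mdb["crawls"]

    # backfill: pre-existing crawl docs with no type become 'type: crawl'
    await crawls.update_many({"type": None}, {"$set": {"type": "crawl"}})

    # index on 'type' alone, plus 'type' combined with common query fields
    await crawls.create_index([("type", 1)])
    await crawls.create_index([("type", 1), ("oid", 1)])
    await crawls.create_index([("type", 1), ("cid", 1)])
    await crawls.create_index([("type", 1), ("finished", -1)])
    await crawls.create_index([("type", 1), ("state", 1)])
```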
* tests: add test for multipart and streaming upload, listing uploads, deleting upload
- add sample WACZ for upload testing: 'example.wacz' and 'example-2.wacz'
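A hedged sketch of what such a test can look like with requests; `api_prefix`, `oid`, and `headers` stand in for the real fixtures, and the exact HTTP methods and form field names are guesses:

```python
import requests

def test_upload_formdata_and_stream(api_prefix, oid, headers):
    # multipart form-data upload ('uploads' as the field name is a guess)
    with open("example.wacz", "rb") as fh:
        r = requests.post(
            f"{api_prefix}/orgs/{oid}/uploads/formdata",
            headers=headers,
            files={"uploads": ("example.wacz", fh, "application/octet-stream")},
        )
    assert r.status_code == 200
    assert r.json()["added"]

    # streaming upload of a single file, filename passed as a query param
    # (HTTP method here is a guess)
    with open("example-2.wacz", "rb") as fh:
        r = requests.put(
            f"{api_prefix}/orgs/{oid}/uploads/stream?filename=example-2.wacz",
            headers=headers,
            data=fh,
        )
    assert r.status_code == 200
    assert r.json()["id"].startswith("upload-")
```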
* collections: support adding and removing both crawls and uploads via base crawl
- include collection_ids in /all-crawls list
- collections replay.json can include both crawls and uploads
bump version to 1.6.0-beta.2
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Make API add and update method returns consistent
- Updates return {"updated": True}
- Adds return {"added": True}
- Both can additionally have other fields as needed, e.g. id or name
- remove Profile response model, as only added / id are returned now
- reformat
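A minimal FastAPI-style illustration of the convention (routes and fields are examples only, not the actual endpoints):

```python
from fastapi import APIRouter

router = APIRouter()

@router.post("/orgs/{oid}/things")
async def add_thing(oid: str):
    new_id = "generated-id"                 # would come from the db insert
    return {"added": True, "id": new_id}    # add endpoints -> {"added": True, ...}

@router.patch("/orgs/{oid}/things/{thing_id}")
async def update_thing(oid: str, thing_id: str):
    # ... apply the update ...
    return {"updated": True}                # update endpoints -> {"updated": True}
```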
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Adds collections search and list to workflow editor
- Adds collections to workflow details component
- Adds namePrefix filter to backend GET /orgs/{oid}/collections endpoint to support case-insensitive searching of collections
- Adds documentation for new setting
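For the namePrefix filter above, a minimal sketch of a case-insensitive prefix match in MongoDB (the query shape is illustrative, not the exact backend code):

```python
import re

def name_prefix_query(name_prefix: str) -> dict:
    # anchor at the start of the name, escape user input, match case-insensitively
    return {"name": {"$regex": "^" + re.escape(name_prefix), "$options": "i"}}

# e.g. collections.find({"oid": oid, **name_prefix_query("my coll")})
```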
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
* Updates footer
- Adds documentation link
- Adds label to GitHub link, moves outside of the version code
- Adds copy button to version code for quick access when filing bug reports :)
* Comments out invisible div
* Improves responsiveness on mobile
* scope fix: when using 'Custom Page Prefix' scope (fixes #873)
- don't include primary seed URL in include list
- don't always add trailing slash to extra in-scope URLs
- set seed scope to 'prefix' (supported via webrecorder/browsertrix-crawler#318) instead of re-including seed URL
- add comments on using 'custom' to indicate 'Custom Prefix Scope' semantics on frontend, setting actual scope to 'prefix' on backend
- remove unneeded conditional for additional urls, main scopeType overridden per seed anyway
- Unifies trash icons on all pages to use trash3 (there were a few stragglers!)
- Brings styling of org quotas dialogue in-line with the rest of our dialogues
- Adds missing localization strings
- Swaps button with icon button to match table row action styling elsewhere
This fixes #917, where crawls added to a collection via the workflow's autoAddCollections field were not correctly reflected in the collection's crawl and page count stats after completing.
wabac.js will reload the replay.json on 403 with new token (will be in next version of wabac.js)
presign urls: make presign timeout configurable (in minutes), defaults to 60 mins
dockerfile: fix configuring RWP_BASE_URL
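A sketch of how a minutes-based presign setting can feed S3 presigning; `PRESIGN_DURATION_MINUTES` is an illustrative name, not necessarily the real setting:

```python
import os

PRESIGN_DURATION_MINUTES = int(os.environ.get("PRESIGN_DURATION_MINUTES", 60))  # default 60 mins

def presign_wacz(client, bucket: str, key: str) -> str:
    # client is a boto3/botocore S3 client; ExpiresIn is in seconds
    return client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=PRESIGN_DURATION_MINUTES * 60,
    )
```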
* Adds initial version of the documentation style guide
* Adds a note about adding new pages
* Instructs users about where to edit the `nav:` for the section
* Adds acronym rule clarification
passed to crawler --userAgentSuffix and --userAgent params, respectively, using
'quote' to support spaces in user-agent.
config: re-order settings to put 'Crawler Settings' section first, followed by 'Cluster Settings'
fixes #787
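A hedged sketch of the --userAgentSuffix / --userAgent quoting above, assuming shlex.quote-style shell quoting (the exact 'quote' helper used may differ):

```python
from shlex import quote

def user_agent_args(suffix: str = "", full_ua: str = "") -> str:
    args = []
    if suffix:
        args.append(f"--userAgentSuffix {quote(suffix)}")
    if full_ua:
        args.append(f"--userAgent {quote(full_ua)}")
    return " ".join(args)

# user_agent_args(suffix="MyOrg Bot/1.0") -> "--userAgentSuffix 'MyOrg Bot/1.0'"
```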
- Adds two new properties: `name` to pick the icon's name and `content` to set a custom tooltip message. These are in line with what Shoelace uses but are perhaps not the best descriptors...
- Swaps the existing anchor links on the Workflow Details' Settings tab for these and relocates them to after the heading. (Navigation to the links is broken right now... but the copying part works nicely!)
- Updates btrix-section-heading to better handle multiple elements with flexbox and an 8px gap between elements
- Support for creating new collections and editing existing collections
- Can select crawl workflows, which adds the entire workflow, and then deselect individual crawls
- Can edit existing collections and add more crawls
- Can view, create and delete collections via new Collections top-level nav entry
* crawls list: unset errors in crawls list response to avoid very large responses #872
* Remove errors from crawl replay.json
* Add tests to ensure errors are excluded from crawl GET endpoints
* Update tests to accept None for errors
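An illustrative sketch of keeping the potentially huge errors array out of list responses via a MongoDB projection (names are approximations):

```python
async def list_crawls_without_errors(crawls_coll, oid):
    # crawls_coll is a motor collection; drop 'errors' via projection
    cursor = crawls_coll.find({"oid": oid}, projection={"errors": False})
    return [doc async for doc in cursor]
```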
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* crawler args cleanup:
- move crawler args command line entirely to configmap
- add required settings like --generateWACZ and --waitOnDone to the configmap so they can't be overridden
- values files can configure individual settings, assembled in configmap
- move disk_utilization_threshold to configmap
- add 'crawler_logging_opts' and 'crawler_extract_full_text' options to values.yaml to more easily set these options
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
concurrent crawl limits: (addresses #866)
- support limits on concurrent crawls that can be run within a single org
- change 'waiting' state to 'waiting_org_limit' for the concurrent crawl limit and 'waiting_capacity' for capacity-based limits
orgs:
- add 'maxConcurrentCrawl' to new 'quotas' object on orgs
- add /quotas endpoint for updating quotas object
operator:
- add all crawljobs as related; they appear to be returned in creation order
- operator: if a concurrent crawl limit is set, ensure the current job is within the first N crawljobs (as provided via the 'related' list of crawljob objects) before it can proceed to 'starting'; otherwise set it to 'waiting_org_limit' (see the sketch below)
- api: add org /quotas endpoint for configuring quotas
- remove 'new' state, always start with 'starting'
- crawljob: add 'oid' to crawljob spec and label for easier querying
- more stringent state transitions: add allowed_from to set_state()
- ensure state transitions only happen from allowed states, while failed/canceled can happen from any state
- ensure finished and state are synced from the db if a transition is not allowed
- add crawl indices by oid and cid
frontend:
- show different waiting states on frontend: 'Waiting (Crawl Limit)' and 'Waiting (At Capacity)'
- add gear icon on orgs admin page
- add initial popup for setting org quotas, showing all properties from the org 'quotas' object
tests:
- add concurrent crawl limit nightly tests
- fix state waiting -> waiting_capacity
- ci: add logging of operator output on test failure
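A hedged sketch of the operator's first-N gate mentioned above (object shapes are simplified; the real operator works on Kubernetes CrawlJob resources delivered via the 'related' list):

```python
def can_start(crawljob_name: str, related_crawljobs: list, max_concurrent: int) -> bool:
    """related_crawljobs: this org's CrawlJobs as passed back via 'related',
    which appear to be in creation order. Only jobs within the first
    max_concurrent slots may move to 'starting'; others stay 'waiting_org_limit'."""
    if not max_concurrent:
        return True  # no concurrent crawl quota set for this org
    names = [job["metadata"]["name"] for job in related_crawljobs]
    try:
        return names.index(crawljob_name) < max_concurrent
    except ValueError:
        return False  # current job not yet in the related list
```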
* optimizations:
- rename update_crawl_config_stats to stats_recompute_all, only used in migration to fetch all crawls and do a full recompute of all file sizes
- add stats_recompute_last to only get last crawl by size, increment total size by specified amount, and incr/decr number of crawls
- Update migration 0007 to use stats_recompute_all
- Add isCrawlRunning, lastCrawlStopping, and lastRun to stats_recompute_last
- Increment crawlSuccessfulCount in stats_recompute_last
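A rough sketch of the incremental $inc portion of stats_recompute_last ('totalSize' and 'crawlCount' are guesses at the field names; the isCrawlRunning/lastRun updates are omitted):

```python
async def stats_recompute_last(crawl_configs, cid, size_delta: int, inc_crawls: int = 1,
                               successful: bool = True):
    """Incrementally adjust a workflow's stats from its latest crawl,
    instead of re-reading every crawl like stats_recompute_all."""
    inc = {"totalSize": size_delta, "crawlCount": inc_crawls}
    if successful:
        inc["crawlSuccessfulCount"] = inc_crawls
    # crawl_configs is a motor collection; isCrawlRunning/lastRun/etc. would be
    # set here as well from the last crawl's data
    await crawl_configs.find_one_and_update({"_id": cid}, {"$inc": inc})
```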
* operator/crawls:
- operator: keep track of filesAddedSize in redis as well
- rename update_crawl to update_crawl_state_if_changed() and only update if state is different, otherwise return false
- ensure mark_finished() operations only occur if the crawl state has changed
- don't clear 'stopping' flag, can track if crawl was stopped
- state always starts with "starting", don't reset to starting
tests:
- Add test for incremental workflow stats updating
- don't clear stopping==true, indicates crawl was manually stopped
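A minimal sketch of an update that is a no-op when the state is unchanged, mirroring the rename described above (fields are illustrative):

```python
async def update_crawl_state_if_changed(crawls_coll, crawl_id: str, new_state: str, **extra):
    """Only write if the state actually differs; return False otherwise."""
    res = await crawls_coll.find_one_and_update(
        {"_id": crawl_id, "state": {"$ne": new_state}},  # match only on a real change
        {"$set": {"state": new_state, **extra}},
    )
    return res is not None
```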
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* Track collections in Crawl rather than crawls in Collection
* Add delete collection API endpoint and tests
* Precompute collection crawlCount, pageCount, and tags and add them to GET collection responses
* Add modified field to Collection
* Update collection replay.json method
* Make add and remove crawls accept list of crawl ids
* Auto-add new workflow crawls to collections when they successfully complete via CrawlConfig.autoAddCollections field
* Move long-running post-crawl operator tasks into asyncio task
* Make CrawlConfig.autoAddCollections updatable via /update API endpoint
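A rough sketch of the auto-add flow after a successful crawl; 'collectionIds' and the counter fields are approximations of the real field names:

```python
async def auto_add_to_collections(crawls_coll, colls_coll, crawl_id: str,
                                  auto_add_ids: list, page_count: int):
    """Run after a workflow crawl completes successfully."""
    if not auto_add_ids:
        return
    # collections are now tracked on the crawl document itself
    await crawls_coll.find_one_and_update(
        {"_id": crawl_id},
        {"$addToSet": {"collectionIds": {"$each": auto_add_ids}}},
    )
    # keep each collection's precomputed stats in sync
    await colls_coll.update_many(
        {"_id": {"$in": auto_add_ids}},
        {"$inc": {"crawlCount": 1, "pageCount": page_count}},
    )
```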