Commit Graph

1120 Commits

Author SHA1 Message Date
Tessa Walsh
11ca3e678a
Configure crawler disk utilization threshold via helm chart (#748) 2023-04-05 21:51:53 -07:00
Tessa Walsh
f6f3b7abba
Add btrix CLI dev helper (#732)
* Add btrix CLI dev helper

* Fix identation

* Use bash syntax for ifs
2023-04-05 21:51:22 -07:00
sua yoo
80bc4a3eb9
Fix additional URLs (#752) 2023-04-05 20:11:09 -07:00
sua yoo
91c2c1ad62
Allow users to set additional page time limits (#744) 2023-04-05 20:06:46 -07:00
sua yoo
72967a0381
Frontend Docker build improvements (#749) 2023-04-05 20:05:45 -07:00
sua yoo
c60dc5d086
Crawls list backend pagination (#735) 2023-04-05 10:55:42 -07:00
Ilya Kreymer
63be81d835 ci: make playwright integration tests run only on PRs involving frontend 2023-04-05 09:57:34 -07:00
Ilya Kreymer
7f757d396a
config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend… (#742)
* config: add 'pageLoadTimeout' and 'pageExtraDelay' options to backend config
- add 'default_page_load_timeout_seconds' to values.yaml, defaulting to 120, for pageLoadTimeout
- add 'defaultPageLoadTimeSeconds ' to /api/settings, update tests for /api/settings
addresses issue in #636
2023-04-04 19:52:23 -07:00
Ilya Kreymer
67172ca1e2
fix: only include finished crawls in crawlCount value for /api/crawlconfigs (#746) 2023-04-04 19:50:14 -07:00
Ilya Kreymer
88497d2a64
text: rename workflowuration -> workflow (#741) 2023-04-04 08:48:06 -07:00
sua yoo
370b8cbd4d
Set max pages to API default (#739) 2023-04-04 08:47:37 -07:00
Ilya Kreymer
2b0d5ff8b3
misc frontend build fixes: playwright version + chunking (#740)
* misc frontend build fixes:
- fix playwright version to be consistent to fix playwright test
- chunking: set max number of chunks generated

* lock playwright version

* remove intl polyfill

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-04-03 21:27:44 -07:00
Ilya Kreymer
1c47a648a9
Max page limit override (#737)
* more page limit: update to #717, instead of setting --limit in each crawlconfig,
apply override --maxPageLimit setting, implemented in crawler, to override individually configured page limit

* update tests, no longer returning 'crawl_page_limit_exceeds_allowed'
2023-04-03 14:01:32 -07:00
Tessa Walsh
3b99bdf26a
Update nightly test fixtures to use Seed objects (#734) 2023-04-03 16:21:25 -04:00
Tessa Walsh
e9b61c632d
Add pageSize to pagination format (#736) 2023-04-03 15:57:47 -04:00
Henry Wilkinson
68ec47cb7f Moves deployment docs back to the root docs directory
- Replaces hyphens on the homepage with em dashes
2023-03-31 00:06:45 -04:00
Ilya Kreymer
887cb16146
Allow configurable max pages per crawl in deployment settings (#717)
* backend: max pages per crawl limit, part of fix for #716:
- set 'max_pages_crawl_limit' in values.yaml, default to 100,000
- if set/non-0, automatically set limit if none provided
- if set/non-0, return 400 if adding config with limit exceeding max limit
- return limit as 'maxPagesPerCrawl' in /api/settings
- api: /all/crawls - add runningOnly=0 to show all crawls, default to 1/true (for more reliable testing)

tests: add test for 'max_pages_per_crawl' setting
- ensure 'limit' can not be set higher than max_pages_per_crawl
- ensure pages crawled is at the limit
- set test limit to max 2 pages
- add settings test
- check for pages.jsonl and extraPages.jsonl when crawling 2 pages
2023-03-28 16:26:29 -07:00
Sara Tavares
948cce3d30
Add README.md related to run playwright tests locally (#722) 2023-03-28 16:08:28 -07:00
Tessa Walsh
4724754efc
Filter and sort crawl and workflow list API endpoints in backend (#724)
* Re-implement pagination and paginate crawlconfig revs

First step toward simplifying pagination to set us up for sorting
and filtering of list endpoints. This commit removes fastapi-pagination
as a dependency.

* Migrate all HttpUrl seeds to Seeds

This commit also updates the frontend to always use Seeds and to
fix display issues resulting from the change.

* Filter and sort crawls and workflows

Crawls:
- Filter by createdBy (via userid param)
- Filter by state (comma-separated string for multiple values)
- Filter by first_seed, name, description
- Sort by started, finished, fileSize, firstSeed
- Sort descending by default to match frontend

Workflows:
- Filter by createdBy (formerly userid) and modifiedBy
- Filter by first_seed, name, description
- Sort by created, modified, firstSeed, lastCrawlTime

* Add crawlconfigs search-values API endpoint and test
2023-03-28 17:55:40 -04:00
Sara Tavares
36cfb2591f
ci: fix version related to @playwright/test (#729)
* fix version, add resolutions to have fixed playwright version
2023-03-28 14:30:36 -07:00
sua yoo
25e4da2522
fix: enable semibold variable 2023-03-28 12:17:34 -07:00
sua yoo
8033061540
Leave trailing slash in seed URLs (#731) 2023-03-27 14:46:59 -07:00
Tessa Walsh
e293e98ac3
Fix migration to avoid jobType KeyError (#727)
* Fix migration to avoid KeyError

* Use .get() for other optional fields
2023-03-27 13:52:05 -07:00
sua yoo
bca67c74e2
chore: format frontend files with prettier 2023-03-27 11:05:19 -07:00
Henry Wilkinson
96afa408d9 Update docs.md
- Adds links to tools
- Adds future docs style guide section
- Updates name and makes it an H1
2023-03-27 02:46:50 -04:00
Henry Wilkinson
2afc13e35a Rename "Developer Docs" index page
- Better title for sidebar
2023-03-27 02:19:41 -04:00
Henry Wilkinson
f6bab4f26c Reorganize content
- Renames "Dev" to "Develop" for improved navigation labels
- Deployment docs are now located under a larger "Development" section (fewer nav bar choices & realistically I think anyone who wants to do one is going to be referring to the other)
- Adds links to tools the first time they're mentioned
- Rewords part of the homepage
- Hides section navigation on the homepage (now we don't have a blank section nav bar!
- Adds some syntax highlighting
- Removes some manual word wrapping — this was done very rarely / inconsistently
2023-03-27 02:11:41 -04:00
Henry Wilkinson
7576ac8423 Add stylesheet & mkdocs features
- Adds a custom stylesheet & brand colours
- Adds Recursive as the code font
- Adds repo info to the nav bar
- Adds auto tracking ID links for deep linking to sections as users scroll the page
- Index pages are now a part of their section as determined by their H1
- Removes mkdocs info from future footer
2023-03-27 02:06:34 -04:00
Sara Tavares
48163db5d3
ci: fix version playwright version for tests (#725) 2023-03-26 21:57:06 -07:00
Sara Tavares
b61592b5ed
CI: Add Playwright UI e2e tests + CI (#614)
Adds Playwright for UI tests.
Basic Playwright test to login.
Playwright Github Action.

---------

Co-authored-by: sua yoo <sua@suayoo.com>
2023-03-22 16:23:22 -07:00
sua yoo
e8f88a797b
Remove new issue project automation config (#718) 2023-03-21 13:49:34 -07:00
sua yoo
5f5bb5ea6e
Allow users to set workflow description (#708) 2023-03-21 13:40:23 -07:00
Tessa Walsh
4136bdad2e
Add optional description to crawl configs and return in crawl endpoints (#707) 2023-03-21 15:39:09 -04:00
sua yoo
0b0bae00c8
chore: add PR template for UI changes 2023-03-21 11:32:36 -07:00
Sara Tavares
3fa93b01b8
ci: Create proofread-action.yaml (#714) 2023-03-20 21:08:56 -07:00
Ilya Kreymer
ba70d3227e version: update to 1.4.0-beta.1 2023-03-17 21:14:42 -07:00
Ilya Kreymer
07e9f51292
backend: update queue apis to work with new sorted queue apis (also b… (#712)
* backend: update queue apis to work with new sorted queue apis (also backwards compatible to existing apis)
designed for browsertrix-crawler 0.9.0-beta.1 but also backwards compatible with older list-based queue as well
2023-03-17 21:11:17 -07:00
sua yoo
b9a24fa5e2
Combine watch crawl with crawl queue (#710)
- crawl queue and watch page are now part of single view
- exclusions can be edited via 'Edit Exclusions' popup
2023-03-17 21:04:08 -07:00
sua yoo
03e9b2aba5
Disable copy tags menu item if no tags (#709) 2023-03-16 19:45:04 -07:00
sua yoo
0009ce8bf6
fix limit fields (#704) 2023-03-14 18:28:13 -07:00
Ilya Kreymer
de9212eec7
exclusions editor fix: (#692)
- backend: fix updating model after exclusions change
- frontend: don't check for new_cid, just success
- fixes #691
2023-03-10 22:36:10 -08:00
D. Lee
7528f2ec6d
Add lightweight logging mode (#668)
Enabled with `logging.fileMode`: true
- disables elasticsearch, kibana and ingress
- only enables fluentd to write logs in the node's volume
- lightweight logging into files (in JSON format and compressed in gzip)
- log file rotation (default: rotating files every 4 hours, retention 3 days)
2023-03-10 14:34:37 -08:00
Ilya Kreymer
86ca9c4bac
backend: Fix for total crawl time limit. (#665)
* backend: fix for total crawl timelimit:
- time limit is computed for total job run time
- when limit is exceeded, job starts to stop crawls gracefully, equivalent to 'stop crawl' operation
- fix for #664

* rename crawl-timeout -> crawl_expire_time

* fix lint
2023-03-10 11:43:16 -08:00
sua yoo
8ca4276c57
Migrate crawl config frontend -> workflow (#686) 2023-03-10 11:39:42 -08:00
sua yoo
fecdc6229d
Improve crawl queue pagination UX (#680)
* switches to infinite scroll for crawl queue
2023-03-09 12:18:26 -08:00
sua yoo
934ee18044
chore: switch actions for issue assign automation
addresses #658
2023-03-08 10:01:00 -08:00
Ilya Kreymer
c2fa78859b
permissions: allow user with 'viewer' permissions to access read-only crawlconfig apis (#687)
addresses issue in #653, fixes #685
2023-03-08 09:29:25 -08:00
sua yoo
666c28f420
Limit organization name length (#671) 2023-03-08 09:21:48 -08:00
Ilya Kreymer
544346d1d4
backend: make crawlconfigs mutable! (#656) (#662)
* backend: make crawlconfigs mutable! (#656)
- crawlconfig PATCH /{id} can now receive a new JSON config to replace the old one (in addition to scale, schedule, tags)
- exclusions: add / remove APIs mutate the current crawlconfig, do not result in a new crawlconfig created
- exclusions: ensure crawl job 'config' is updated when exclusions are added/removed, unify add/remove exclusions on crawl
- k8s: crawlconfig json is updated along with scale
- k8s: stateful set is restarted by updating annotation, instead of changing template
- crawl object: now has 'config', as well as 'profileid', 'schedule', 'crawlTimeout', 'jobType' properties to ensure anything that is changeable is stored on the crawl
- crawlconfigcore: store share properties between crawl and crawlconfig in new crawlconfigcore (includes 'schedule', 'jobType', 'config', 'profileid', 'schedule', 'crawlTimeout', 'tags', 'oid')
- crawlconfig object: remove 'oldId', 'newId', disallow deactivating/deleting while crawl is running
- rename 'userid' -> 'createdBy'
- remove unused 'completions' field
- add missing return to fix /run response
- crawlout: ensure 'profileName' is resolved on CrawlOut from profileid
- crawlout: return 'name' instead of 'configName' for consistent response
- update: 'modified', 'modifiedBy' fields to set modification date and user modifying config
- update: ensure PROFILE_FILENAME is updated in configmap is profileid provided, clear if profileid==""
- update: return 'settings_changed' and 'metadata_changed' if either crawl settings or metadata changed
- tests: update tests to check settings_changed/metadata_changed return values

add revision tracking to crawlconfig:
- store each revision separate mongo db collection
- revisions accessible via /crawlconfigs/{cid}/revs
- store 'rev' int in crawlconfig and in crawljob
- only add revision history if crawl config changed

migration:
- update to db v3
- copy fields from crawlconfig -> crawl
- rename userid -> createdBy
- copy userid -> modifiedBy, created -> modified
- skip invalid crawls (missing config), make createdBy optional (just in case)

frontend: Update crawl config keys with new API (#681), update frontend to use new PATCH endpoint, load config from crawl object in details view

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
2023-03-07 20:36:50 -08:00
sua yoo
d3bb524971
Fix missing crawl config name (#683) 2023-03-07 19:13:56 -08:00