browsertrix/backend
Anish Lakhwara 037396f3d9
Fix: Stream log downloading from WACZ (#1225)
* Fix(backend): Stream logs without causing OOM

Also choose between `heapq.merge` and `itertools.chain` more carefully:
if all the logs come from the same crawler instance, `chain` them;
otherwise `merge` them.

iterator fixes:
- group wacz files by instance, using the filename suffix, e.g. -0.wacz, -1.wacz, -2.wacz
- sort wacz files, and all logs within each wacz file
- chain log iterators for all log files within wacz group
- merge log iterators across wacz files in different groups
- add type hints to help keep track of iterator helper functions
- add iter_lines() from botocore, use that for line parsing for simplicity
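
The grouping, chaining and merging steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper names `instance_suffix` and `combine_logs`, the input shape (a dict mapping WACZ filenames to lists of line iterables), and the assumption that each log line is a JSON object with a sortable `timestamp` field are all hypothetical.

```python
import heapq
import itertools
import json
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List


def instance_suffix(wacz_name: str) -> str:
    """Return the instance number from a name ending in e.g. '-0.wacz' (hypothetical helper)."""
    match = re.search(r"-(\d+)\.wacz$", wacz_name)
    return match.group(1) if match else ""


def combine_logs(logs_by_wacz: Dict[str, List[Iterable[str]]]) -> Iterator[str]:
    """Chain logs within one instance group, merge across groups (hypothetical helper)."""
    groups: Dict[str, List[Iterable[str]]] = defaultdict(list)
    # Sorting the filenames keeps logs from the same instance in name order.
    for name in sorted(logs_by_wacz):
        groups[instance_suffix(name)].extend(logs_by_wacz[name])

    # Within a group, logs are assumed to already be in order: just chain them.
    chained = [itertools.chain.from_iterable(group) for group in groups.values()]
    if len(chained) == 1:
        return chained[0]

    # Across groups, interleave lazily by timestamp. Assumption: every line is
    # JSON with a "timestamp" string that sorts chronologically.
    return heapq.merge(*chained, key=lambda line: json.loads(line)["timestamp"])
```

Both `itertools.chain.from_iterable` and `heapq.merge` are lazy, so lines are only pulled from the underlying iterables as the caller consumes the result, which is what keeps memory use flat. `heapq.merge` only produces sorted output if each input is itself already sorted, which is why the per-group ordering matters.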

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-28 18:54:52 -07:00
..
btrixcloud Fix: Stream log downloading from WACZ (#1225) 2023-09-28 18:54:52 -07:00
test Remove username lookups for crawls and workflows by storing usernames in db (#1199) 2023-09-28 09:37:23 -07:00
test_nightly Expect that crawl deleted response is bool, not int (#1170) 2023-09-12 15:03:17 -07:00
.pylintrc
Dockerfile
mypy.ini Improved type checking for backend with mypy (#1174) 2023-09-13 19:40:26 -07:00
requirements.txt better resources scaling by number of browsers per crawler container (#1103) 2023-09-06 01:42:44 -04:00