Docs: Elaborates on using user agents (#1841)

- Provides a link to Mozilla's page explaining what they are (good for
folks new to the concept)
- Provides a link to useragents.me, the same site we link to in the app
- Provides two examples of situations where they may be helpful to get
around content restrictions
This commit is contained in:
Henry Wilkinson 2024-05-30 14:50:10 -04:00 committed by GitHub
parent b432d226bd
commit 251aef3ac1
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

View File

@ -168,7 +168,20 @@ Will prevent any content from the domains listed in [Steven Black's Unified Host
### User Agent
Sets the browser's user agent in outgoing requests to the specified value. If left blank, the crawler will use the browser's default user agent.
Sets the browser's [user agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) in outgoing requests to the specified value. If left blank, the crawler will use the Brave browser's default user agent. For a list of common user agents see [useragents.me](https://www.useragents.me/).
??? example "Using custom user agents to get around restrictions"
Despite being against best practices, some websites will block specific browsers based on their user agent: a string of text that browsers send web servers to identify what type of browser or operating system is requesting content. If Brave is blocked, using a user agent string of a different browser (such as Chrome or Firefox) may be sufficient to convince the website that a different browser is being used.
User agents can also be used to voluntarily identify your crawling activity, which can be useful when working with a website's owners to ensure crawls can be completed successfully. We recommend using a user agent string similar to the following, replacing the `orgname` and URL comment with your own:
```
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.3 orgname.browsertrix (+https://example.com/crawling-explination-page)
```
If you have no webpage to identify your organization or statement about your crawling activities available as a link, omit the bracketed comment section at the end entirely.
This string must be provided to the website's owner so they can allowlist Browsertrix to prevent it from being blocked.
### Language