feat: Browserbase Web Reader (#12877)
mishushakov committed May 2, 2024
1 parent 0f8a6ef commit 8f3b518
Showing 7 changed files with 155 additions and 0 deletions.
52 changes: 52 additions & 0 deletions docs/docs/examples/data_connectors/WebPageDemo.ipynb
@@ -130,6 +130,58 @@
"display(Markdown(f\"<b>{response}</b>\"))"
]
},
{
"cell_type": "markdown",
"id": "005d14cd",
"metadata": {},
"source": [
"# Using Browserbase Reader 🅱️\n",
"\n",
"[Browserbase](https://browserbase.com) is a serverless platform for running headless browsers. It offers advanced debugging, session recordings, stealth mode, integrated proxies, and captcha solving.\n",
"\n",
"## Installation and Setup\n",
"\n",
"- Get an API key from [browserbase.com](https://browserbase.com) and set it in environment variables (`BROWSERBASE_API_KEY`).\n",
"- Install the [Browserbase SDK](http://github.com/browserbase/python-sdk):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c74e6425",
"metadata": {},
"outputs": [],
"source": [
"%pip install browserbase"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c23d02bc",
"metadata": {},
"outputs": [],
"source": [
"from llama_index.readers.web import BrowserbaseWebReader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7e71d347",
"metadata": {},
"outputs": [],
"source": [
"reader = BrowserbaseWebReader()\n",
"docs = reader.load_data(\n",
"    urls=[\n",
"        \"https://example.com\",\n",
"    ],\n",
"    # Set to True to return text-only content\n",
"    text_content=False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "15f46387",
@@ -5,6 +5,7 @@
from llama_index.readers.web.beautiful_soup_web.base import (
    BeautifulSoupWebReader,
)
from llama_index.readers.web.browserbase.base import BrowserbaseWebReader
from llama_index.readers.web.firecrawl_web.base import FireCrawlWebReader
from llama_index.readers.web.knowledge_base.base import (
    KnowledgeBaseWebReader,
@@ -42,6 +43,7 @@
__all__ = [
    "AsyncWebPageReader",
    "BeautifulSoupWebReader",
    "BrowserbaseWebReader",
    "FireCrawlWebReader",
    "KnowledgeBaseWebReader",
    "MainContentExtractorReader",
@@ -0,0 +1,5 @@
python_sources()

python_requirements(
    name="reqs",
)
@@ -0,0 +1,47 @@
# Browserbase Web Reader

[Browserbase](https://browserbase.com) is a serverless platform for running headless browsers. It offers advanced debugging, session recordings, stealth mode, integrated proxies, and captcha solving.

## Installation and Setup

- Get an API key from [browserbase.com](https://browserbase.com) and set it in environment variables (`BROWSERBASE_API_KEY`).
- Install the [Browserbase SDK](http://github.com/browserbase/python-sdk):

```
pip install browserbase
```
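
If you would rather fail early when the key is missing, a small helper can resolve it before the reader is constructed. This is only a minimal sketch: `resolve_api_key` is a hypothetical helper, not part of the Browserbase SDK or LlamaIndex. Only the `BROWSERBASE_API_KEY` variable name and the reader's `api_key` argument come from this integration.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    explicit: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the explicit key if given, else BROWSERBASE_API_KEY from env.

    Hypothetical helper for illustration; raises ValueError when no key is found.
    """
    # Default to the real process environment; a mapping can be injected for tests.
    env = os.environ if env is None else env
    key = explicit or env.get("BROWSERBASE_API_KEY")
    if not key:
        raise ValueError("Set BROWSERBASE_API_KEY or pass api_key explicitly.")
    return key
```

The result could then be passed along as `BrowserbaseWebReader(api_key=resolve_api_key())`.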

## Usage

### Loading documents

You can load webpages into LlamaIndex using `BrowserbaseWebReader`. Optionally, set the `text_content` parameter to `True` to convert the pages to a text-only representation (it defaults to `False`).

```python
from llama_index.readers.web import BrowserbaseWebReader


reader = BrowserbaseWebReader()
docs = reader.load_data(
    urls=[
        "https://example.com",
    ],
    # Set to True to return text-only content
    text_content=False,
)
```
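
According to the reader source in this commit, each returned `Document` carries the page content as its text and the source URL under `metadata["url"]`. The sketch below illustrates that pairing without any network access. It uses a stand-in `PageDoc` dataclass instead of the real `Document` class and a hypothetical `fake_load_urls` in place of the Browserbase call, so treat it as an illustration of the shape of the output rather than the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Sequence


@dataclass
class PageDoc:
    """Stand-in for llama_index's Document (text plus metadata)."""

    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def fake_load_urls(urls: Sequence[str]) -> List[str]:
    """Hypothetical placeholder for the Browserbase call; one page per URL."""
    return [f"<html>content of {url}</html>" for url in urls]


def lazy_load(urls: Sequence[str]) -> Iterator[PageDoc]:
    """Yield one document per URL, pairing each page with its source URL."""
    for url, page in zip(urls, fake_load_urls(urls)):
        yield PageDoc(text=page, metadata={"url": url})


docs = list(lazy_load(["https://example.com", "https://example.org"]))
# Collect the source URL recorded on each document.
sources = [doc.metadata["url"] for doc in docs]
```

With the real reader, the same `doc.metadata["url"]` lookup can be used to trace a document back to the page it came from.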

### Loading images

You can also load screenshots of webpages (as bytes) for multi-modal models.

```python
from browserbase import Browserbase
from base64 import b64encode

browser = Browserbase()
screenshot = browser.screenshot("https://browserbase.com")

# Optional. Convert to base64
img_encoded = b64encode(screenshot).decode()
```
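
Some multi-modal APIs accept images as data URLs rather than raw base64 strings. The following is a minimal sketch of that conversion using only the standard library. `to_data_url` is a hypothetical helper, and the `image/png` default is an assumption: check the actual format of the screenshot bytes before relying on it.

```python
from base64 import b64encode


def to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URL (mime_type default is an assumption)."""
    encoded = b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder bytes standing in for a real screenshot.
example_url = to_data_url(b"\x89PNG\r\n")
```

In practice you would pass the bytes returned by `browser.screenshot(...)` instead of the placeholder.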
@@ -0,0 +1,48 @@
import logging
from typing import Optional, Iterator, Sequence
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


logger = logging.getLogger(__name__)


class BrowserbaseWebReader(BaseReader):
    """BrowserbaseWebReader.
    Load pre-rendered web pages using a headless browser hosted on Browserbase.
    Depends on `browserbase` package.
    Get your API key from https://browserbase.com
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
    ) -> None:
        try:
            from browserbase import Browserbase
        except ImportError:
            raise ImportError(
                "`browserbase` package not found, please run `pip install browserbase`"
            )

        self.browserbase = Browserbase(api_key=api_key)

    def lazy_load_data(
        self, urls: Sequence[str], text_content: bool = False
    ) -> Iterator[Document]:
        """Load pages from URLs."""
        pages = self.browserbase.load_urls(urls, text_content)

        for i, page in enumerate(pages):
            yield Document(
                text=page,
                metadata={
                    "url": urls[i],
                },
            )


if __name__ == "__main__":
    reader = BrowserbaseWebReader()
    logger.info(reader.load_data(urls=["https://example.com"]))
@@ -0,0 +1 @@
browserbase
