Add local file path & Raw HTML string support. #16

Closed
2 changes: 2 additions & 0 deletions .gitignore
@@ -128,3 +128,5 @@ dist
.yarn/build-state.yml
.yarn/install-state.gz
.pnp.*

examples/*.html
55 changes: 55 additions & 0 deletions examples/local-html.ts
@@ -0,0 +1,55 @@
// Example grabs today's HN front page and works much like examples/hn.ts, but
// parses a local HTML file instead of using chromium/playwright
import { z } from 'zod'
import OpenAI from 'openai'
import LLMScraper from '../src'
import path from 'path'
import { writeFileSync } from 'fs'

// Initialize LLM provider
const llm = new OpenAI()

// Create a new LLMScraper
const scraper = new LLMScraper(null, llm)

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5) // How many results to parse for this specific instance.
    .describe('Top 5 stories on Hacker News'),
})

// Where we store the local HTML file for this example.
const HNRawHtmlPath = path.resolve('./example-hn.html')

// Grab today's HN front page to run the example
await fetch('https://news.ycombinator.com/')
  .then((res) => res.text())
  .then((html) => writeFileSync(HNRawHtmlPath, html, { encoding: 'utf-8', flag: 'w' }))
  .catch((e) => {
    console.error('Failed to fetch content from Hacker News', e)
  })

// Local file paths to scrape.
const filePaths = [HNRawHtmlPath]

// Run the scraper
const pages = await scraper.runFiles(filePaths, {
  model: 'gpt-4-turbo',
  schema,
  mode: 'html',
  closeOnFinish: true,
})

// Stream the result from LLM
for await (const page of pages) {
  console.log(page.data)
}
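
Review note on examples/local-html.ts: when the download fails, the example only logs the error, so the path in `filePaths` may point at a file that was never written. How `runFiles` reacts to a missing file is not shown in this diff. A minimal sketch of a pre-check a caller could run first, assuming Node's built-in `fs`, `os`, and `path` modules; `partitionReadableFiles` and the temp file names are hypothetical and not part of LLMScraper:

```typescript
import { existsSync, mkdtempSync, statSync, writeFileSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

// Hypothetical helper (not part of LLMScraper): split paths into those that
// point at an existing, non-empty regular file and those that do not.
function partitionReadableFiles(paths: string[]): {
  readable: string[]
  skipped: string[]
} {
  const readable: string[] = []
  const skipped: string[] = []
  for (const p of paths) {
    // existsSync guards statSync, which throws on a missing path
    const stat = existsSync(p) ? statSync(p) : null
    if (stat !== null && stat.isFile() && stat.size > 0) {
      readable.push(p)
    } else {
      skipped.push(p)
    }
  }
  return { readable, skipped }
}

// Usage: one file that was saved and one that was never written
const dir = mkdtempSync(join(tmpdir(), 'llm-scraper-'))
const saved = join(dir, 'example-hn.html')
writeFileSync(saved, '<html><body>Hacker News</body></html>', 'utf-8')
const missing = join(dir, 'never-downloaded.html')

const { readable, skipped } = partitionReadableFiles([saved, missing])
console.log({ readable, skipped })
```

Only the `readable` list would then be passed on to the scraper, and `skipped` could be logged so a failed download is not mistaken for an empty result.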
47 changes: 47 additions & 0 deletions examples/raw-html.ts
@@ -0,0 +1,47 @@
// Example grabs today's HN front page and works much like examples/hn.ts, but
// parses the raw HTML response instead of using chromium/playwright
import { z } from 'zod'
import OpenAI from 'openai'
import LLMScraper from '../src'

// Initialize LLM provider
const llm = new OpenAI()

// Create a new LLMScraper
const scraper = new LLMScraper(null, llm)

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5) // How many results to parse for this specific instance.
    .describe('Top 5 stories on Hacker News'),
})

// Grab today's HN front page to run the example
const htmlString = await fetch('https://news.ycombinator.com/')
  .then((res) => res.text())
  .catch((e) => {
    console.error('Failed to fetch content from Hacker News', e)
    return null
  })

// Run the scraper
const pages = await scraper.rawHTML([htmlString], {
  model: 'gpt-4-turbo',
  schema,
  mode: 'html',
  closeOnFinish: true,
})

// Stream the result from LLM
for await (const page of pages) {
  console.log(page.data)
}
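
Review note on examples/raw-html.ts: the `.catch` handler returns `null`, so `htmlString` can be `null` when it is placed in the array given to `rawHTML`. Whether `rawHTML` tolerates `null` entries is not visible in this diff. A minimal sketch of filtering the inputs beforehand; `usableHtml` is a hypothetical helper, not part of LLMScraper:

```typescript
// Hypothetical helper (not part of LLMScraper): drop failed downloads
// (null/undefined) and blank responses so only real HTML strings remain.
function usableHtml(inputs: Array<string | null | undefined>): string[] {
  return inputs.filter(
    (html): html is string => typeof html === 'string' && html.trim().length > 0
  )
}

// Usage: one good response, one failed fetch, one blank body
const htmlInputs = usableHtml(['<html><body>ok</body></html>', null, '   '])
console.log(htmlInputs.length)
```

The type guard in `filter` narrows the result to `string[]`, so the compiler can also catch a `null` slipping through if the scraper method is typed to accept strings only.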
122 changes: 122 additions & 0 deletions package-lock.json

Some generated files are not rendered by default.

3 changes: 2 additions & 1 deletion package.json
@@ -26,6 +26,7 @@
  },
  "homepage": "https://github.com/mishushakov/llm-scraper#readme",
  "dependencies": {
    "node-html-parser": "^6.1.13",
    "node-llama-cpp": "^2.8.9",
    "openai": "^4.38.2",
    "turndown": "^7.1.3",
@@ -37,4 +38,4 @@
    "typescript": "^5.4.5",
    "zod": "^3.22.5"
  }
}
}