Simple node server to extract relevant content from website source code using Mozilla's Readability.js

SbstnErhrdt/node-readability

Readability Service

This is a small Node.js server that processes HTML content with Readability, the library Mozilla developed for Firefox's Reader View.

See: https://github.com/mozilla/readability/

The goal of this project is to provide an HTTP endpoint that uses the Readability library to extract the most relevant content from a rendered website.

Docker

Run the Docker container:

docker run -p8080:8080 ese7en/node-readability

Request

The request body must be a JSON object containing the following:

  • data: the HTML source code as an escaped string
HTTP PUT /
HTTP HEADER: Content-Type: application/json

{
    "data": "...HTML SOURCE CODE AS STRING ..."
}
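
Escaping the HTML by hand is error-prone, so the request body is best produced by a JSON serializer. The following is a minimal sketch in Node.js; `buildRequestBody` is a hypothetical helper name, not part of this project.

```javascript
// Hypothetical helper: wrap raw HTML in the JSON body the service expects.
// JSON.stringify escapes double quotes, backslashes, and line breaks.
function buildRequestBody(html) {
  return JSON.stringify({ data: html });
}

const html = '<html>\r\n<head><title>Hello "World"</title></head>\r\n</html>';
const body = buildRequestBody(html);

// Parsing the body again yields the original, unescaped HTML.
console.log(JSON.parse(body).data === html);
```

JSON serializers usually leave forward slashes unescaped. The `\/` sequences in the end-to-end example are an optional escape and decode to the same string.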

Response

The response object contains the following properties:

  • title: article title
  • content: HTML string of processed article content
  • textContent: text content of the article (all HTML removed)
  • length: length of an article, in characters
  • excerpt: article description, or short excerpt from the content
  • byline: author metadata
  • dir: content direction
  • siteName: name of the site

Environment Variables

  • PORT: the port on which the server listens
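
The server code itself is not shown in this README. As an illustration only, a common Node.js pattern for honoring such a variable looks like the sketch below. `resolvePort` is a hypothetical helper, and the fallback of 8080 is an assumption taken from the Docker example.

```javascript
// Hypothetical helper: read the listening port from the environment.
// Falls back to 8080 (assumed default) when PORT is missing or invalid.
function resolvePort(env = process.env, fallback = 8080) {
  const port = Number.parseInt(env.PORT ?? '', 10);
  const valid = Number.isInteger(port) && port > 0 && port < 65536;
  return valid ? port : fallback;
}

console.log(resolvePort({ PORT: '3000' }));
console.log(resolvePort({}));
```

When running the container with a different port, the variable can be passed with Docker's `-e` flag (for example `-e PORT=3000`), and the `-p` mapping should then point at the same port.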

End-to-end example

Website

<html>
    <head>
        <title>Hello World</title>
    </head>
    <body>
        <h1>This is a website</h1>
        <p>With some text</p>
    </body>
</html>

HTTP PUT Request to http://localhost:8080

{
    "data": "<html>\r\n    <head>\r\n        <title>Hello World<\/title>\r\n    <\/head>\r\n    <body>\r\n        <h1>This is a website<\/h1>\r\n        <p>With some text<\/p>\r\n    <\/body>\r\n<\/html>"
}

With curl

curl --request POST \
  --url http://localhost:8080/ \
  --header 'Content-Type: application/json' \
  --data '{
    "data": "<html>\r\n    <head>\r\n        <title>Hello World<\/title>\r\n    <\/head>\r\n    <body>\r\n        <h1>This is a website<\/h1>\r\n        <p>With some text<\/p>\r\n    <\/body>\r\n<\/html>"
}'

HTTP Response

{
  "title": "Hello World",
  "byline": null,
  "dir": null,
  "content": "<div id=\"readability-page-1\" class=\"page\">\n        <h2>This is a website</h2>\n        <p>With some text</p>\n    \n</div>",
  "textContent": "\n        This is a website\n        With some text\n    \n",
  "length": 55,
  "excerpt": "With some text",
  "siteName": null
}
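
The same round trip can be sketched from Node.js. This assumes Node 18 or later (for the global `fetch`) and a service running at the URL from the Docker example. `buildRequest` and `extractArticle` are hypothetical names. The sketch uses PUT as documented in the Request section; the curl example uses POST, so adjust the method if the server rejects it.

```javascript
// Assumed service location, matching the Docker example.
const SERVICE_URL = 'http://localhost:8080/';

// Build the fetch options for one HTML document.
function buildRequest(html) {
  return {
    method: 'PUT', // as documented in the Request section
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ data: html }),
  };
}

// Send the HTML to the service and return the parsed response object.
async function extractArticle(html, url = SERVICE_URL) {
  const response = await fetch(url, buildRequest(html));
  if (!response.ok) {
    throw new Error(`Readability service returned HTTP ${response.status}`);
  }
  return response.json();
}

// Usage (requires a running service):
// extractArticle(pageHtml).then((article) => {
//   console.log(article.title, article.excerpt);
// });
```

Keeping the request construction in its own function makes it possible to inspect the method, headers, and body without contacting the service.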
