GitHub - html2rss/html2rss: 📰 Build RSS 2.0 feeds from websites (and JSON APIs) with a few CSS selectors.

This Ruby gem builds RSS 2.0 feeds from a feed config.

The feed config contains the URL to scrape and the CSS selectors used to extract information (like title, URL, ...); from these, html2rss builds your RSS feed. Extractors and chainable post processors make information extraction, processing and sanitizing a breeze. Scraping JSON responses and setting HTTP request headers are supported, too.

Searching for a ready-to-use app that serves generated feeds via HTTP? Head over to html2rss-web!

To support the development, feel free to sponsor this project on GitHub. Thank you! 💓

Installation

Install	`gem install html2rss`
Usage	`html2rss help`

You can also install it as a dependency in your Ruby project:

Add this line to your `Gemfile`:	`gem 'html2rss'`
Then execute:	`bundle`
In your code:	`require 'html2rss'`

Generating a feed on the CLI

Create a file called my_config_file.yml with this example content:

channel:
  url: https://stackoverflow.com/questions
selectors:
  items:
    selector: "#hot-network-questions > ul > li"
  title:
    selector: a
  link:
    selector: a
    extractor: href

Build the RSS with: html2rss feed ./my_config_file.yml.

Generating a feed with Ruby

Here's a minimal working example within Ruby:

require 'html2rss'

rss =
  Html2rss.feed(
    channel: { url: 'https://stackoverflow.com/questions' },
    selectors: {
      items: { selector: '#hot-network-questions > ul > li' },
      title: { selector: 'a' },
      link: { selector: 'a', extractor: 'href' }
    }
  )

puts rss

The feed config and its options

A feed config consists of a channel and a selectors Hash. The contents of both hashes are explained in the chapters below.

Good to know:

You'll find extensive example feed configs at spec/*.test.yml.
See html2rss-configs for ready-made feed configs!
If you've already created feed configs, you're invited to send a PR to html2rss-configs to make your config available to the general public.

Alright, let's move on.

The `channel`

attribute		type	default	remark
`url`	required	String
`title`	optional	String	auto-generated
`description`	optional	String	auto-generated
`ttl`	optional	Integer	`360`	TTL in minutes
`time_zone`	optional	String	`'UTC'`	TimeZone name
`language`	optional	String	`'en'`	Language code
`author`	optional	String		Format: `email (Name)`
`headers`	optional	Hash	`{}`	Set HTTP request headers. See notes below.
`json`	optional	Boolean	`false`	Handle JSON response. See notes below.

Dynamic parameters in `channel` attributes

Sometimes there are structurally equal pages with different URLs. In such a case you can add dynamic parameters to the channel's attributes.

Example of a dynamic id parameter in the channel URLs:

channel:
  url: "http://domainname.tld/whatever/%<id>s.html"

Command line usage example:

bundle exec html2rss feed the_feed_config.yml id=42

See a Ruby example

config = Html2rss::Config.new({ channel: { url: 'http://domainname.tld/whatever/%<id>s.html' } }, {}, { id: 42 })
Html2rss.feed(config)

See the documentation of Ruby's sprintf method for more complex formatting options.
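The `%<id>s` placeholder uses the named-reference syntax of Ruby's `Kernel#format` (an alias of `sprintf`). Here is a minimal sketch of that substitution on its own, outside of html2rss. The `params` hash is a hypothetical stand-in for the values given on the command line; how html2rss applies them internally is not shown.

```ruby
# Named references like %<id>s are filled from a Hash with Symbol keys.
url_template = 'http://domainname.tld/whatever/%<id>s.html'
params = { id: 42 } # hypothetical stand-in for `id=42` on the command line

url = format(url_template, params)
puts url

# Width and padding flags work as with any other sprintf directive.
padded = format('http://domainname.tld/whatever/%<id>05d.html', params)
puts padded
```

The second call shows why the sprintf syntax is useful here: the same parameter can be zero-padded or otherwise formatted without changing the caller.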

The `selectors`

First, you must give an items selector hash which contains a CSS selector. The selector selects a collection of HTML tags from which the RSS feed items are built. Except for the items selector, all other keys are scoped to each item of the collection.

Then, to build a valid RSS 2.0 item, you need to have at least a title or a description. You can have both.

Having an items and a title selector is already enough to build a simple feed.

Your selectors Hash can contain arbitrarily named selectors, but only a few will make it into the RSS feed (this is due to the RSS 2.0 specification):

RSS 2.0 tag	name in `html2rss`	remark
`title`	`title`
`description`	`description`	Supports HTML.
`link`	`link`	A URL.
`author`	`author`
`category`	`categories`	See notes below.
`guid`	`guid`	Default title/description. See notes below.
`enclosure`	`enclosure`	See notes below.
`pubDate`	`updated`	An instance of `Time`.
`comments`	`comments`	A URL.
`source`	~~source~~	Not yet supported.

The `selector` hash

Every named selector in your selectors hash can have these attributes:

name	value
`selector`	The CSS selector to select the tag with the information.
`extractor`	Name of the extractor. See notes below.
`post_process`	A hash or array of hashes. See notes below.

Using extractors

Extractors help with extracting the information from the selected HTML tag.

The default extractor is text, which returns the tag's inner text.
The html extractor returns the tag's outer HTML.
The href extractor returns a URL from the tag's href attribute and corrects relative ones to absolute ones.
The attribute extractor returns the value of that tag's attribute.
The static extractor returns the configured static value (it doesn't extract anything).
See file list of extractors.

Extractors might need extra attributes on the selector hash. 👉 Read their docs for usage examples.
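To illustrate what correcting a relative URL to an absolute one means, here is a stdlib-only sketch using `URI.join`. It is not necessarily how the href extractor is implemented; `channel_url` and the sample `hrefs` are made up for illustration.

```ruby
require 'uri'

channel_url = 'https://example.com/blog/index.html' # hypothetical channel URL

# Hypothetical href attribute values as they might appear in the HTML.
hrefs = ['/posts/1', 'posts/2', 'https://other.example/3']

# Relative values are resolved against the channel URL,
# absolute ones pass through unchanged.
absolute = hrefs.map { |href| URI.join(channel_url, href).to_s }
puts absolute
```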

See a Ruby example

Html2rss.feed(
  channel: {}, selectors: { link: { selector: 'a', extractor: 'href' } }
)

See a YAML feed config example

channel:
  # ... omitted
selectors:
  # ... omitted
  link:
    selector: 'a'
    extractor: 'href'

Using post processors

Extracted information can be further manipulated with post processors.

name	description
`gsub`	Allows global substitution operations on Strings (Regexp or simple pattern).
`html_to_markdown`	HTML to Markdown, using reverse_markdown.
`markdown_to_html`	Converts Markdown to HTML, using kramdown.
`parse_time`	Parses a String containing a time in a time zone.
`parse_uri`	Parses a String as URL.
`sanitize_html`	Strips unsafe and unneeded HTML and adds security related attributes.
`substring`	Cuts a part off of a String, starting at a position.
`template`	Based on a template, it creates a new String filled with other selectors values.

⚠️ Always make use of the sanitize_html post processor for HTML content. Never trust the internet! ⚠️

See file list of post processors.

👉 Read their docs for usage examples.

See a Ruby example

Html2rss.feed(
  channel: {},
  selectors: {
    description: {
      selector: '.content', post_process: { name: 'sanitize_html' }
    }
  }
)

See a YAML feed config example

channel:
  # ... omitted
selectors:
  # ... omitted
  description:
    selector: '.content'
    post_process:
      - name: sanitize_html

Chaining post processors

Pass an array to post_process to chain the post processors.

YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML

channel:
  # ... omitted
selectors:
  # ... omitted
  price:
    selector: '.price'
  description:
    selector: '.section'
    post_process:
      - name: template
        string: |
          # %{self}

          Price: %{price}
      - name: markdown_to_html

Note the use of | for a multi-line String in YAML.
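Conceptually, a chain feeds the output of one post processor into the next. The following is a rough stdlib-only sketch of that idea, not html2rss's actual implementation: the `PROCESSORS` table, the `post_process` helper, the option keys and the sample item are all hypothetical, and a simple `gsub` step stands in for `markdown_to_html` because kramdown is not part of the standard library.

```ruby
# Each hypothetical processor receives the current value, the item's other
# selector values and its own options, and returns a new value.
PROCESSORS = {
  'template' => lambda { |value, item, options|
    format(options.fetch(:string), item.merge(self: value))
  },
  'gsub' => lambda { |value, _item, options|
    value.gsub(options.fetch(:pattern), options.fetch(:replacement))
  }
}.freeze

# Runs the steps in order, passing each result on to the next step.
def post_process(value, item, chain)
  chain.reduce(value) do |current, step|
    PROCESSORS.fetch(step.fetch(:name)).call(current, item, step)
  end
end

item = { price: '12.99 USD' } # hypothetical value of the `price` selector
chain = [
  { name: 'template', string: "# %{self}\n\nPrice: %{price}" },
  { name: 'gsub', pattern: 'USD', replacement: '$' }
]

result = post_process('Fancy teapot', item, chain)
puts result
```

The order of the array matters: here the template is filled first, and the substitution then operates on the templated String.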

Adding `<category>` tags to an item

The categories selector takes an array of selector names. Each value of those selectors will become a <category> on the RSS item.

See a Ruby example

Html2rss.feed(
  channel: {},
  selectors: {
    genre: {
      # ... omitted
      selector: '.genre'
    },
    branch: { selector: '.branch' },
    categories: %i[genre branch]
  }
)

See a YAML feed config example

channel:
  # ... omitted
selectors:
  # ... omitted
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch

Custom item GUID

By default, html2rss generates a GUID from the title or description.

If this does not work well, you can choose other attributes from which the GUID is built. The principle is the same as for the categories: pass an array of selector names.

In all cases, the GUID is a SHA1 hash string.
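As a rough illustration of deriving such a GUID, the sketch below hashes the values of the chosen selectors with `Digest::SHA1` from Ruby's standard library. The `guid_for` helper and the sample item are hypothetical, and joining the values before hashing is an assumption; html2rss may combine them differently.

```ruby
require 'digest/sha1'

# Hypothetical item values, keyed by selector name.
item = { title: 'Headline', link: 'https://example.com/headline' }

# Assumption: the values of the chosen selectors are concatenated, then hashed.
def guid_for(item, selector_names)
  Digest::SHA1.hexdigest(selector_names.map { |name| item.fetch(name) }.join)
end

guid = guid_for(item, %i[link])
puts guid
```

Because the digest depends only on the chosen values, an item keeps its GUID between builds as long as those values do not change.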

See a Ruby example

Html2rss.feed(
  channel: {},
  selectors: {
    title: {
      # ... omitted
      selector: 'h1'
    },
    link: { selector: 'a', extractor: 'href' },
    guid: %i[link]
  }
)

See a YAML feed config example

channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "h1"
  link:
    selector: "a"
    extractor: "href"
  guid:
    - link

Adding an `<enclosure>` tag to an item

An enclosure can be any file, e.g. an image, audio or video.

The enclosure selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.

Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:

The content-type is guessed from the file extension of the URL.
If the content-type guessing fails, it will default to application/octet-stream.
The content-length will always be undetermined and therefore stated as 0 bytes.

Read the RSS 2.0 spec for further information on enclosing content.
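The trade-offs above can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions: `CONTENT_TYPES` is a deliberately tiny, hypothetical lookup table and `enclosure_for` is a made-up helper, not part of html2rss.

```ruby
require 'uri'

# Hypothetical, deliberately tiny extension-to-type table for illustration.
CONTENT_TYPES = {
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png',
  '.mp3' => 'audio/mpeg',
  '.mp4' => 'video/mp4'
}.freeze

def enclosure_for(url, channel_url)
  absolute = URI.join(channel_url, url) # relative URLs use the channel URL as base
  extension = File.extname(absolute.path).downcase

  {
    url: absolute.to_s,
    # Fall back to a generic type when the extension is unknown or missing.
    type: CONTENT_TYPES.fetch(extension, 'application/octet-stream'),
    length: 0 # the size is not determined
  }
end

p enclosure_for('/media/episode.mp3', 'https://example.com/podcast')
p enclosure_for('https://example.com/download?id=1', 'https://example.com')
```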

See a Ruby example

Html2rss.feed(
  channel: {},
  selectors: {
    enclosure: { selector: 'img', extractor: 'attribute', attribute: 'src' }
  }
)

See a YAML feed config example

channel:
  # ... omitted
selectors:
  # ... omitted
  enclosure:
    selector: "img"
    extractor: "attribute"
    attribute: "src"

Scraping and handling JSON responses

Although this gem's name is html2rss, it's possible to scrape and process JSON.

Adding json: true to the channel config will convert the JSON response to XML.

See a Ruby example

Html2rss.feed(
  channel: {
    url: 'https://example.com', json: true
  },
  selectors: {} # ... omitted
)

See a YAML feed config example

channel:
  url: https://example.com
  json: true
selectors:
  # ... omitted

See example of a converted JSON object

This JSON object:

{
  "data": [{ "title": "Headline", "url": "https://example.com" }]
}

converts to:

<object>
  <data>
    <array>
      <object>
        <title>Headline</title>
        <url>https://example.com</url>
      </object>
    </array>
  </data>
</object>

Your items selector would be array > object, the item's link selector would be url.

See example of a converted JSON array

This JSON array:

[{ "title": "Headline", "url": "https://example.com" }]

converts to:

<array>
  <object>
    <title>Headline</title>
    <url>https://example.com</url>
  </object>
</array>

Your items selector would be array > object, the item's link selector would be url.
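The mapping shown in the two examples above can be approximated with a short recursive function. This is a sketch of the idea only, not the gem's converter: `json_to_xml` is a hypothetical helper, it assumes every JSON key is a valid XML tag name, and it emits no whitespace between tags.

```ruby
require 'json'
require 'cgi'

# Recursively turns parsed JSON into an XML String:
# Hash -> <object>, Array -> <array>, Hash keys -> child tags.
def json_to_xml(value)
  case value
  when Hash
    children = value.map { |key, child| "<#{key}>#{json_to_xml(child)}</#{key}>" }
    "<object>#{children.join}</object>"
  when Array
    "<array>#{value.map { |child| json_to_xml(child) }.join}</array>"
  else
    CGI.escapeHTML(value.to_s) # scalars become escaped text
  end
end

json = '{"data": [{"title": "Headline", "url": "https://example.com"}]}'
xml = json_to_xml(JSON.parse(json))
puts xml
```

Once the response is in this shape, the usual CSS selectors such as `array > object` apply to it like to any other markup.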

Set any HTTP header in the request

You can add any HTTP headers to the request to the channel URL. Use this, for example, to send Cookie or Authorization information or to spoof the User-Agent.

See a Ruby example

Html2rss.feed(
  channel: {
    url: 'https://example.com',
    headers: {
      'User-Agent': 'html2rss-request',
      'X-Something': 'Foobar',
      Authorization: 'Token deadbea7',
      Cookie: 'monster=MeWantCookie'
    }
  },
  selectors: {}
)

See a YAML feed config example

channel:
  url: https://example.com
  headers:
    "User-Agent": "html2rss-request"
    "X-Something": "Foobar"
    "Authorization": "Token deadbea7"
    "Cookie": "monster=MeWantCookie"
selectors:
  # ...

The headers provided by the channel are merged into the global headers.
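A merge like this can be pictured with `Hash#merge`. The sketch below uses made-up header values, and it assumes that the channel's value wins when both define the same header; the text above does not state the precedence, so treat that part as an assumption.

```ruby
# Hypothetical global and channel headers.
global_headers = { 'User-Agent' => 'html2rss-request', 'Accept' => 'text/html' }
channel_headers = { 'Accept' => 'application/json', 'Cookie' => 'monster=MeWantCookie' }

# Assumption: on a key collision the channel's value replaces the global one.
request_headers = global_headers.merge(channel_headers)
p request_headers
```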

Reverse the order of items

By default, html2rss keeps the order of the collection returned from the items selector. The items selector hash can optionally contain an order attribute. If its value is reverse, the order of items in the RSS will reverse.

See a YAML feed config example

channel:
  # ... omitted
selectors:
  items:
    selector: 'ul > li'
    order: 'reverse'
  # ... omitted

Note that the order of items, according to the RSS 2.0 spec, should not matter to the feed-consuming client.

Usage with a YAML config file

This step is not required to work with this gem. If you're using html2rss-web and want to create your private feed configs, keep on reading!

First, create a YAML file, e.g. feeds.yml. This file will contain your global config and multiple feed configs under the key feeds.

Example:

headers:
  "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
feeds:
  myfeed:
    channel:
    selectors:
  myotherfeed:
    channel:
    selectors:

Your feed configs go below feeds. Everything else is part of the global config.

Find a full example of a feeds.yml at spec/feeds.test.yml.
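To make the split between feed configs and global config concrete, this sketch loads a small inline YAML document with Ruby's standard `YAML` module and separates the two parts by hand. The feed contents are hypothetical, and this is only an illustration of the file layout, not of how html2rss reads the file.

```ruby
require 'yaml'

# A small, hypothetical feeds.yml, inlined to keep the sketch self-contained.
yaml = <<~YAML
  headers:
    "User-Agent": "html2rss-request"
  feeds:
    myfeed:
      channel:
        url: https://example.com
      selectors:
        items:
          selector: "ul > li"
YAML

config = YAML.safe_load(yaml)

feeds = config.fetch('feeds')                             # the named feed configs
global_config = config.reject { |key, _| key == 'feeds' } # everything else

p feeds.keys
p global_config
```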

Now you can build your feeds like this:

Build feeds in Ruby

require 'html2rss'

myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')

Build feeds on the command line

$ html2rss feed feeds.yml myfeed
$ html2rss feed feeds.yml myotherfeed

Display the RSS feed nicely in a web browser

To display RSS feeds nicely in a web browser, you can:

add a plain old CSS stylesheet, or
use XSLT (eXtensible Stylesheet Language Transformations).

A web browser will apply these stylesheets and show the contents as described.

In a CSS stylesheet, you'd use element selectors to apply styles.

If you want to do more, then you need to create an XSLT. XSLT allows you to use an HTML template and to freely design the information of the RSS, including using JavaScript and external resources.

You can add as many stylesheets and types as you like. Just add them to your global configuration.
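Browsers pick up stylesheets for XML documents from `xml-stylesheet` processing instructions at the top of the document. The sketch below shows how a list of stylesheet entries could map to such instructions; the hrefs are hypothetical, and the exact output html2rss writes may differ in attribute order or escaping.

```ruby
require 'cgi'

# Hypothetical stylesheet entries, shaped like the global config's list.
stylesheets = [
  { href: '/rss.xsl', media: :all, type: 'text/xsl' },
  { href: 'http://example.com/rss.css', media: :all, type: 'text/css' }
]

# One processing instruction per stylesheet, attributes in the given order.
instructions = stylesheets.map do |sheet|
  attributes = sheet.map { |name, value| %(#{name}="#{CGI.escapeHTML(value.to_s)}") }
  "<?xml-stylesheet #{attributes.join(' ')}?>"
end

puts instructions
```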

Ruby: a stylesheet config example

config = Html2rss::Config.new(
  { channel: {}, selectors: {} }, # omitted
  {
    stylesheets: [
      {
        href: '/relative/base/path/to/style.xsl',
        media: :all,
        type: 'text/xsl'
      },
      {
        href: 'http://example.com/rss.css',
        media: :all,
        type: 'text/css'
      }
    ]
  }
)

Html2rss.feed(config)

YAML: a stylesheet config example

stylesheets:
  - href: "/relative/base/path/to/style.xsl"
    media: "all"
    type: "text/xsl"
  - href: "http://example.com/rss.css"
    media: "all"
    type: "text/css"
feeds:
  # ... omitted

Gotchas and tips & tricks

Check that the channel URL does not redirect to a mobile page with a different markup structure.
Do not rely on your web browser's developer console. html2rss does not execute JavaScript.
Fiddling with curl and pup to find the selectors seems efficient (curl URL | pup).
CSS selectors are versatile. Here's an overview.

Development

Check out the repository: git clone ... && cd html2rss
Install Ruby >=3.3, if you haven't already.
Run bin/setup to install dependencies.
Run the test suite with bundle exec rspec.
- To generate Test Coverage, run COVERAGE=true bundle exec rspec and open coverage/index.html.
For an interactive prompt, you can also run bin/console.

Releasing a new version

git pull
increase version in lib/html2rss/version.rb
bundle
git add Gemfile.lock lib/html2rss/version.rb
VERSION=$(ruby -e 'require "./lib/html2rss/version.rb"; puts Html2rss::VERSION')
git commit -m "chore: release $VERSION"
git tag v$VERSION
standard-changelog -f
git add CHANGELOG.md && git commit --amend
git tag v$VERSION -f
git push && git push --tags

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/html2rss/html2rss.
