Jan-Lukas Else

Thoughts of an IT expert

Miniflux scraper rules

Published in 👨‍💻 Dev
Short link: https://b.jlel.se/s/448
⚠️ This entry is already over one year old. It may no longer be up to date. Opinions may have changed.

Speaking of web comics that I follow via RSS, Atom or JSON feeds: ideally I would like to see the comics directly in my feed reader (Miniflux). Some feeds already include the images in their content, others do not.

For the feeds that don’t show the content directly, Miniflux provides an option to extract the content from the corresponding website, combined with the ability to configure extraction (scraper) rules.

One feed for which I had to create an extraction rule the other day is The Joy of Tech.

Miniflux relies internally on the Go library goquery, which in turn relies on the cascadia library for the CSS selectors.

So in the extraction rule, all you really need to do is set a CSS selector, which then selects the HTML elements to be extracted.

CSS selectors are a powerful tool, although, as with regular expressions, it sometimes takes a bit of research or practice to find the appropriate selector. For The Joy of Tech, the following rule works:

p.Maintext a[href="support.html"] img[src$=".png"]

This selector matches all image tags whose src attribute ends in .png and that sit inside an anchor element referencing support.html, which in turn sits inside a paragraph with the class Maintext. Complicated, but it works, and with this rule Miniflux now shows me only the actual comic.

