-rw-r--r-- | src/blog/downloading-articles-for-my-ebook-reader.html | 99 |
1 files changed, 99 insertions, 0 deletions
diff --git a/src/blog/downloading-articles-for-my-ebook-reader.html b/src/blog/downloading-articles-for-my-ebook-reader.html
new file mode 100644
index 0000000..8e5d463
--- /dev/null
+++ b/src/blog/downloading-articles-for-my-ebook-reader.html
@@ -0,0 +1,99 @@
+title: Downloading Articles for my Ebook Reader
+date: April 08, 2021 14:40
+---
+<p>
+I've recently taken to reading blog posts and other internet articles on my
+ereader. And I don't mean using my tablet's browser and wifi connection to load
+up websites. Instead, I convert the articles I want to read to PDF and read them
+like I would any other ebook (I have a large-screen tablet on which reading PDFs
+is very comfortable; I would probably be playing around with EPUB conversion if
+I had a smaller screen).
+</p>
+
+<p>
+The obvious way to get a PDF of a website would be to use my browser's built-in
+print-to-PDF feature. But this has some minor problems for me:
+
+<ul>
+    <li>
+        Articles from different websites will look very different. I can't
+        anticipate how the website's CSS will affect readability (things like
+        font, text size, etc.).
+    </li>
+    <li>
+        It's not super easy to automate. Maybe this is possible with headless
+        browsers? But I haven't played around with those much and it feels silly
+        to spin up a whole browser just to render some HTML as a PDF.
+    </li>
+</ul>
+</p>
+
+<p>
+That second point, about automation and scripting, was
+particularly important to me. So the obvious tool for the job was the Swiss-army
+knife of document conversions, <code>pandoc</code>.
+</p>
+
+<p>
+For a while I was wondering if I would have to write some clever script that
+downloads all of the article's HTML and other resources (like images) and then
+passes them to <code>pandoc</code>. Fortunately, it turns out that <code>pandoc
+&lt;article url&gt; -o &lt;output file&gt;</code> does exactly what you think it
+does. 
The article ends up converted to PDF, with LaTeX used as an intermediate
+step, so everything is in the beautiful LaTeX font. <code>pandoc</code> also
+takes care of downloading and including images.
+</p>
+
+<h3>Hotkeys</h3>
+<p>
+I wrote a short script that calls <code>pandoc</code> and saves the PDF in a
+specific directory. With that script available and working, I added hotkeys to
+my browser and RSS reader that invoke it. These are the two programs in which I
+might find articles to read, and now I can easily generate PDFs from both.
+</p>
+
+<p>
+Here's what the <code>newsboat</code> config looks like:
+
+<pre>
+macro p set browser "article2pdf %u" ; open-in-browser ; set browser "elinks %u"
+</pre>
+
+And here's the <code>qutebrowser</code> binding:
+
+<pre>
+config.bind(
+    '"P',
+    'spawn article2pdf {url}'
+)
+</pre>
+
+(<code>article2pdf</code> being the name of my script)
+</p>
+
+<h3>Caveats</h3>
+<p>
+This doesn't work perfectly.
+
+<ul>
+    <li>
+        There are some issues with certain Unicode characters (including emojis)
+        that LaTeX apparently can't handle. Adding the
+        <code>--pdf-engine=xelatex</code> flag when calling <code>pandoc</code>
+        doesn't fully fix the issue, but it does produce reasonable output
+        instead of failing completely.
+    </li>
+    <li>
+        Sometimes images are not handled well. For example, they might not fit
+        width-wise. LaTeX completely fails on images in the WebP format.
+    </li>
+    <li>
+        Similarly, sometimes code blocks might get cut off and not fit
+        width-wise. This is admittedly a pretty big problem.
+    </li>
+    <li>
+        Headers and footers from many sites will not be rendered well. This
+        doesn't bother me; all I care about is the main article contents.
+    </li>
+</ul>
+</p>
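The post describes the `article2pdf` script but does not include it. As a rough illustration only, here is a minimal Python sketch of what such a wrapper could look like. It is not the author's script: the output directory (`~/articles`), the helper names `slug_from_url` and `build_command`, and the filename scheme are all assumptions made for this example. The only pieces taken from the post are the `pandoc <article url> -o <output file>` invocation and the `--pdf-engine=xelatex` flag.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of an article2pdf-style wrapper (not the author's script)."""
import re
import subprocess
import sys
from pathlib import Path
from urllib.parse import urlparse

# Assumption: the post only says "a specific directory"; ~/articles is made up.
OUTPUT_DIR = Path.home() / "articles"


def slug_from_url(url):
    """Derive a filesystem-friendly name from the URL's host and path."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of non-alphanumeric characters into a single dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return slug or "article"


def build_command(url, out_dir):
    """Build the pandoc argument list for converting `url` to a PDF in `out_dir`."""
    out_file = Path(out_dir) / (slug_from_url(url) + ".pdf")
    # xelatex is the engine the post mentions for coping with Unicode characters.
    return ["pandoc", url, "--pdf-engine=xelatex", "-o", str(out_file)]


def main(argv):
    if len(argv) != 2:
        print("usage: {} <article url>".format(argv[0]), file=sys.stderr)
        return 2
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    # Pass pandoc's exit status through so the calling hotkey can detect failures.
    return subprocess.run(build_command(argv[1], OUTPUT_DIR)).returncode


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Saved as an executable named `article2pdf` somewhere on `PATH`, a script along these lines would be callable from both the `newsboat` macro and the `qutebrowser` binding shown in the post. Building the command as an argument list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.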