author    Marcin Chrzanowski <m@m-chrzan.xyz>    2021-04-08 15:48:34 +0200
committer Marcin Chrzanowski <m@m-chrzan.xyz>    2021-04-08 15:48:34 +0200
commit    912df09249e699a06d5f19e01d0e8bf712d8f08f (patch)
tree      db23c7e01daf9cc2658be9ddb22597e1f67f1337
parent    3b380f473c9f030bf2a86ec5e7f8b5dc3a074fda (diff)
Add PDF articles article
-rw-r--r--    src/blog/downloading-articles-for-my-ebook-reader.html    99
1 file changed, 99 insertions, 0 deletions
diff --git a/src/blog/downloading-articles-for-my-ebook-reader.html b/src/blog/downloading-articles-for-my-ebook-reader.html
new file mode 100644
index 0000000..8e5d463
--- /dev/null
+++ b/src/blog/downloading-articles-for-my-ebook-reader.html
@@ -0,0 +1,99 @@
+title: Downloading Articles for my Ebook Reader
+date: April 08, 2021 14:40
+---
+<p>
+I've recently taken to reading blog posts and other internet articles on my
+ereader. And I don't mean using my tablet's browser and wifi connection to load
+up websites. Instead, I convert the articles I want to read to PDF and read them
+like I would any other ebook (I have a large screen tablet on which reading PDFs
+is very comfortable; I would probably be playing around with EPUB conversion if
+I had a smaller screen).
+</p>
+
+<p>
+The obvious way to get a PDF of a website would be to use my browser's built-in
+print-to-PDF feature. But this has some minor problems for me:
+
+<ul>
+ <li>
+    Articles from different websites will look very different. I can't
+    anticipate how a website's CSS will affect readability (things like
+    font, text size, etc.).
+ </li>
+ <li>
+    It's not easy to automate. It's probably possible with headless
+    browsers, but I haven't played around with those much, and it feels
+    silly to spin up a whole browser just to render some HTML as a PDF.
+ </li>
+</ul>
+</p>
+
+<p>
+That second point &mdash; about automation and scripting &mdash; was
+particularly important to me. So the obvious tool for the job was the Swiss
+Army knife of document conversion, <code>pandoc</code>.
+</p>
+
+<p>
+For a while I was wondering if I would have to write some clever script that
+downloads all of the article's HTML and other resources (like images) and then
+feeds them to <code>pandoc</code>. Fortunately, it turns out that <code>pandoc
+&lt;article url&gt; -o &lt;output file&gt;</code> does exactly what you think it
+does. The article ends up converted to PDF, with LaTeX used as an intermediate
+step, so everything is set in LaTeX's beautiful default font.
+<code>pandoc</code> also takes care of downloading and including images.
+</p>
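+
+<p>
+For example (the URL and output name here are placeholders):
+</p>
+
+<pre>
+pandoc https://example.com/some-article -o some-article.pdf
+</pre>
+
+<p>
+<code>pandoc</code> infers the output format from the <code>.pdf</code>
+extension; for PDF output it needs a LaTeX engine (<code>pdflatex</code> by
+default) installed.
+</p>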
+
+<h3>Hotkeys</h3>
+<p>
+I wrote a short script that calls <code>pandoc</code> and saves the PDF in a
+specific directory. With that script available and working, I added hotkeys to
+my browser and RSS reader that invoke it. These are the two programs in which I
+might find articles to read, and now I can easily generate PDFs from both.
+</p>
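+
+<p>
+Here's a minimal sketch of the idea (the destination directory and the
+filename handling are just placeholders):
+</p>
+
+<pre>
+#!/bin/sh
+# article2pdf: convert the article at the given URL to a PDF.
+
+url="$1"
+outdir="$HOME/articles" # placeholder destination directory
+mkdir -p "$outdir"
+
+# Name the PDF after the URL's last path component, sanitized.
+name="$(basename "${url%/}")"
+name="${name%.*}"
+name="$(printf '%s' "$name" | tr -cd '[:alnum:]._-')"
+[ -n "$name" ] || name="article-$(date +%s)"
+
+pandoc "$url" -o "$outdir/$name.pdf"
+</pre>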
+
+<p>
+Here's what the <code>newsboat</code> config looks like:
+
+<pre>
+macro p set browser "article2pdf %u" ; open-in-browser ; set browser "elinks %u"
+</pre>
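+
+(The macro temporarily points the browser setting at the script, opens the
+selected article with it, then restores <code>elinks</code>. Since
+<code>newsboat</code> triggers macros with the <code>,</code> prefix, this
+one runs on <code>,p</code>.)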
+
+And here's the <code>qutebrowser</code> binding:
+
+<pre>
+config.bind(
+ '"P',
+ 'spawn article2pdf {url}'
+)
+</pre>
+
+(<code>article2pdf</code> being the name of my script)
+</p>
+
+<h3>Caveats</h3>
+<p>
+This doesn't work perfectly.
+
+<ul>
+ <li>
+    There are some issues with certain Unicode characters (including emojis)
+    that LaTeX apparently can't handle. Adding the
+    <code>--pdf-engine=xelatex</code> flag when calling <code>pandoc</code>
+    (see the example after this list) doesn't fully fix the issue, but it
+    produces reasonable output instead of failing completely.
+ </li>
+ <li>
+    Sometimes images aren't handled well. For example, they might not fit
+    the page width-wise. LaTeX fails outright on images in the WebP format.
+ </li>
+ <li>
+    Similarly, code blocks sometimes get cut off when they don't fit the
+    page width. This is admittedly a pretty big problem.
+ </li>
+ <li>
+    Headers and footers from many sites won't render well. This doesn't
+    bother me; all I care about is the main article content.
+ </li>
+</ul>
+</p>
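+
+<p>
+For reference, an invocation with the <code>xelatex</code> engine looks like
+this (the URL is a placeholder):
+</p>
+
+<pre>
+pandoc --pdf-engine=xelatex https://example.com/some-article -o some-article.pdf
+</pre>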