author    Marcin Chrzanowski <m@m-chrzan.xyz>    2021-04-08 15:48:34 +0200
committer Marcin Chrzanowski <m@m-chrzan.xyz>    2021-04-08 15:48:34 +0200
commit    912df09249e699a06d5f19e01d0e8bf712d8f08f (patch)
tree      db23c7e01daf9cc2658be9ddb22597e1f67f1337
parent    3b380f473c9f030bf2a86ec5e7f8b5dc3a074fda (diff)
Add PDF articles article
-rw-r--r--    src/blog/downloading-articles-for-my-ebook-reader.html    99
1 file changed, 99 insertions, 0 deletions
diff --git a/src/blog/downloading-articles-for-my-ebook-reader.html b/src/blog/downloading-articles-for-my-ebook-reader.html
new file mode 100644
index 0000000..8e5d463
--- /dev/null
+++ b/src/blog/downloading-articles-for-my-ebook-reader.html
@@ -0,0 +1,99 @@
+title: Downloading Articles for my Ebook Reader
+date: April 08, 2021 14:40
+---
+<p>
+I've recently taken to reading blog posts and other internet articles on my
+ereader. And I don't mean using my tablet's browser and wifi connection to load
+up websites. Instead, I convert the articles I want to read to PDF and read them
+like I would any other ebook (I have a large screen tablet on which reading PDFs
+is very comfortable; I would probably be playing around with EPUB conversion if
+I had a smaller screen).
+</p>
+
+<p>
+The obvious way to get a PDF of a website would be to use my browser's built-in
+print-to-PDF feature. But this has some minor problems for me:
+
+<ul>
+ <li>
+    Articles from different websites will look very different. I can't
+    anticipate how a website's CSS will affect readability (things like
+    font, text size, etc.).
+ </li>
+ <li>
+    It's not easy to automate. It's probably possible with headless
+    browsers, but I haven't played around with those much, and it feels
+    silly to spin up a whole browser just to render some HTML as a PDF.
+ </li>
+</ul>
+</p>
+
+<p>
+That second point &mdash; about automation and scripting &mdash; was
+particularly important to me. So the obvious tool for the job was the Swiss
+Army knife of document conversion, <code>pandoc</code>.
+</p>
+
+<p>
+For a while I was wondering if I would have to write some clever script that
+downloads all of the article's HTML and other resources (like images) and then
+feeds them to <code>pandoc</code>. Fortunately, it turns out that <code>pandoc
+&lt;article url&gt; -o &lt;output file&gt;</code> does exactly what you think it
+does. The article ends up converted to PDF, with LaTeX used as an intermediate
+step, so everything is set in LaTeX's beautiful default font.
+<code>pandoc</code> also takes care of downloading and including images.
+</p>
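+
+<p>
+For example (the URL and output name here are placeholders):
+</p>
+
+<pre>
+pandoc https://example.com/some-article -o some-article.pdf
+</pre>
+
+<p>
+<code>pandoc</code> infers the output format from the <code>.pdf</code>
+extension; for PDF output it needs a LaTeX engine (<code>pdflatex</code> by
+default) installed.
+</p>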
+
+<h3>Hotkeys</h3>
+<p>
+I wrote a short script that calls <code>pandoc</code> and saves the PDF in a
+specific directory. With that script available and working, I added hotkeys to
+my browser and RSS reader that invoke it. These are the two programs in which I
+might find articles to read, and now I can easily generate PDFs from both.
+</p>
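+
+<p>
+Here's a minimal sketch of the idea (the destination directory and the
+filename handling are just placeholders):
+</p>
+
+<pre>
+#!/bin/sh
+# article2pdf: convert the article at the given URL to a PDF.
+
+url="$1"
+outdir="$HOME/articles" # placeholder destination directory
+mkdir -p "$outdir"
+
+# Name the PDF after the URL's last path component, sanitized.
+name="$(basename "${url%/}")"
+name="${name%.*}"
+name="$(printf '%s' "$name" | tr -cd '[:alnum:]._-')"
+[ -n "$name" ] || name="article-$(date +%s)"
+
+pandoc "$url" -o "$outdir/$name.pdf"
+</pre>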
+
+<p>
+Here's what the <code>newsboat</code> config looks like:
+
+<pre>
+macro p set browser "article2pdf %u" ; open-in-browser ; set browser "elinks %u"
+</pre>
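+
+(The macro temporarily points the browser setting at the script, opens the
+selected article with it, then restores <code>elinks</code>. Since
+<code>newsboat</code> triggers macros with the <code>,</code> prefix, this
+one runs on <code>,p</code>.)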
+
+And here's the <code>qutebrowser</code> binding:
+
+<pre>
+config.bind(
+ '"P',
+ 'spawn article2pdf {url}'
+)
+</pre>
+
+(<code>article2pdf</code> being the name of my script)
+</p>
+
+<h3>Caveats</h3>
+<p>
+This doesn't work perfectly.
+
+<ul>
+ <li>
+    There are some issues with certain Unicode characters (including emojis)
+    that LaTeX apparently can't handle. Adding the
+    <code>--pdf-engine=xelatex</code> flag when calling <code>pandoc</code>
+    (see the example after this list) doesn't fully fix the issue, but it
+    produces reasonable output instead of failing completely.
+ </li>
+ <li>
+    Sometimes images aren't handled well. For example, they might not fit
+    the page width-wise. LaTeX fails outright on images in the WebP format.
+ </li>
+ <li>
+    Similarly, code blocks sometimes get cut off when they don't fit the
+    page width. This is admittedly a pretty big problem.
+ </li>
+ <li>
+    Headers and footers from many sites won't render well. This doesn't
+    bother me; all I care about is the main article content.
+ </li>
+</ul>
+</p>
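+
+<p>
+For reference, an invocation with the <code>xelatex</code> engine looks like
+this (the URL is a placeholder):
+</p>
+
+<pre>
+pandoc --pdf-engine=xelatex https://example.com/some-article -o some-article.pdf
+</pre>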