title: Downloading Articles for my Ebook Reader
date: April 08, 2021 14:40
---
<p>
I've recently taken to reading blog posts and other internet articles on my
ereader. And I don't mean using my tablet's browser and wifi connection to load
up websites. Instead, I convert the articles I want to read to PDF and read them
like I would any other ebook (I have a large-screen tablet on which reading PDFs
is very comfortable; I would probably be playing around with EPUB conversion if
I had a smaller screen).
</p>

<p>
The obvious way to get a PDF of a website would be to use my browser's built-in
print-to-PDF feature. But this has some minor problems for me:

<ul>
    <li>
        Articles from different websites will look very different. I can't
        anticipate how the website's CSS will affect readability (things like
        font, text size, etc.).
    </li>
    <li>
        It's not super easy to automate. Maybe this is possible with headless
        browsers? But I haven't played around with those much and it feels silly
        to spin up a whole browser just to render some HTML as a PDF.
    </li>
</ul>
</p>

<p>
That second point &mdash; about automation and scripting &mdash; was
particularly important to me. So the obvious tool for the job was the Swiss-army
knife of document conversions, <code>pandoc</code>.
</p>

<p>
For a while I was wondering if I would have to write some clever script that
downloads all of the article's HTML and other resources (like images) and then
inputs them to <code>pandoc</code>. Fortunately, it turns out that <code>pandoc
&lt;article url&gt; -o &lt;output file&gt;</code> does exactly what you think it
does. The article ends up converted to PDF, with LaTeX used as an intermediate
step, so everything is in the beautiful LaTeX font. <code>pandoc</code> also
takes care of downloading and including images.
</p>

<h3>Hotkeys</h3>
<p>
I wrote a short script that calls <code>pandoc</code> and saves the PDF in a
specific directory. With that script available and working, I added hotkeys to
my browser and RSS reader that invoke it. These are the two programs in which I
might find articles to read, and now I can easily generate PDFs from both.
</p>
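
<p>
A minimal sketch of what such a script might look like, in Python. The output
directory, the function names, and the scheme for deriving a file name from the
URL are all assumptions made for illustration; the only thing taken from above
is the <code>pandoc &lt;article url&gt; -o &lt;output file&gt;</code> call.
</p>

```python
import re
import subprocess
import sys
from pathlib import Path
from urllib.parse import urlparse

# Assumed location; the post only says "a specific directory".
OUTPUT_DIR = Path.home() / "articles"


def slug_from_url(url):
    """Derive a file-name-safe slug from the last path segment of a URL."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if segments:
        # Drop a trailing extension such as ".html" from the last segment.
        base = re.sub(r"\.[A-Za-z0-9]+$", "", segments[-1])
    else:
        # URLs with an empty path fall back to the host name.
        base = parsed.netloc
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", base).strip("-").lower()
    return slug or "article"


def output_path(url, directory=OUTPUT_DIR):
    """Return the PDF path that an article URL would be saved to."""
    return Path(directory) / (slug_from_url(url) + ".pdf")


def main(url):
    """Convert the article at `url` to a PDF inside OUTPUT_DIR."""
    target = output_path(url)
    target.parent.mkdir(parents=True, exist_ok=True)
    # pandoc fetches the page itself and picks PDF from the extension.
    subprocess.run(["pandoc", url, "-o", str(target)], check=True)
    return target


# Only run when called as a script with exactly one argument, the URL.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

<p>
Saved as an executable named <code>article2pdf</code> somewhere on the
<code>PATH</code>, a script like this takes the article URL as its only
argument, which is the calling convention both hotkey configs below rely on.
</p>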

<p>
Here's what the <code>newsboat</code> config looks like:

<pre>
macro p set browser "article2pdf %u" ; open-in-browser ; set browser "elinks %u"
</pre>

And here's the <code>qutebrowser</code> binding:

<pre>
config.bind(
        '"P',
        'spawn article2pdf {url}'
)
</pre>

(<code>article2pdf</code> is the name of my script.)
</p>

<h3>Caveats</h3>
<p>
This doesn't work perfectly.

<ul>
    <li>
        There are some issues with certain Unicode characters (including
        emoji) that LaTeX apparently can't handle. Adding the
        <code>--pdf-engine=xelatex</code> flag when calling <code>pandoc</code>
        doesn't fully fix this, but it does produce reasonable output instead
        of failing completely.
    </li>
    <li>
        Images are sometimes handled poorly. For example, they might not fit
        width-wise. LaTeX fails completely on images in the WebP format.
    </li>
    <li>
        Similarly, code blocks sometimes get cut off because they don't fit
        width-wise. This is admittedly a pretty big problem.
    </li>
    <li>
        Headers and footers from many sites are rendered poorly. This doesn't
        bother me; all I care about is the main article content.
    </li>
</ul>
</p>
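
<p>
One possible way to act on the first caveat is to try the default PDF engine
first and retry with <code>--pdf-engine=xelatex</code> only when that fails.
The sketch below is an assumption about how this could be wired up, not a
description of the actual script; the function names and the injectable
<code>run</code> parameter are hypothetical.
</p>

```python
import subprocess


def pandoc_args(url, output, engine=None):
    """Build the pandoc argument list, optionally forcing a PDF engine."""
    args = ["pandoc", url, "-o", output]
    if engine is not None:
        args.append("--pdf-engine=" + engine)
    return args


def convert_with_fallback(url, output, run=subprocess.run):
    """Try pandoc's default engine, then retry with xelatex on failure.

    `run` is injectable so the retry logic can be exercised without
    pandoc installed. Returns the engine that succeeded, where None
    means pandoc's default engine.
    """
    for engine in (None, "xelatex"):
        result = run(pandoc_args(url, output, engine))
        if result.returncode == 0:
            return engine
    raise RuntimeError("pandoc failed with both engines for " + url)
```

<p>
Trying the default engine first keeps its output for articles that convert
cleanly, and falls back to <code>xelatex</code> only for the ones that trip
over Unicode.
</p>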