title: Downloading Articles for my Ebook Reader
date: April 08, 2021 14:40
---
<p>
I've recently taken to reading blog posts and other internet articles on my
ereader. And I don't mean using my tablet's browser and wifi connection to load
up websites. Instead, I convert the articles I want to read to PDF and read them
like I would any other ebook (I have a large screen tablet on which reading PDFs
is very comfortable; I would probably be playing around with EPUB conversion if
I had a smaller screen).
</p>
<p>
The obvious way to get a PDF of a website would be to use my browser's built-in
print-to-PDF feature. But this has some minor problems for me:
<ul>
<li>
Articles from different websites will look very different. I can't
anticipate how the website's CSS will affect readability (things like
font, text size, etc.).
</li>
<li>
It's not super easy to automate. Maybe this is possible with headless
browsers? But I haven't played around with those much and it feels silly
to spin up a whole browser just to render some HTML as a PDF.
</li>
</ul>
</p>
<p>
That second point — about automation and scripting — was
particularly important to me. So the obvious tool for the job was the Swiss-army
knife of document conversions, <code>pandoc</code>.
</p>
<p>
For a while I was wondering if I would have to write some clever script that
downloads all of the article's HTML and other resources (like images) and then
inputs them to <code>pandoc</code>. Fortunately, it turns out that <code>pandoc
&lt;article url&gt; -o &lt;output file&gt;</code> does exactly what you think it
does. The article ends up converted to PDF, with LaTeX used as an intermediate
step, so everything is in the beautiful LaTeX font. <code>pandoc</code> also
takes care of downloading and including images.
</p>
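<p>
For scripting purposes, the same call can be wrapped in a few lines of Python.
This is only a minimal sketch: the function names are made up for illustration,
and it assumes <code>pandoc</code> and a LaTeX installation are on the
<code>PATH</code>.
</p>

```python
import subprocess


def pandoc_command(url, output_file):
    """Build the argument list for: pandoc URL -o OUTPUT_FILE."""
    return ["pandoc", url, "-o", str(output_file)]


def convert(url, output_file):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(url, output_file), check=True)
```

<p>
Calling <code>convert("https://example.com/post", "post.pdf")</code> would then
behave like the command line above, with <code>pandoc</code> picking the output
format from the <code>.pdf</code> extension.
</p>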
<h3>Hotkeys</h3>
<p>
I wrote a short script that calls <code>pandoc</code> and saves the PDF in a
specific directory. With that script available and working, I added hotkeys to
my browser and RSS reader that invoke it. These are the two programs in which I
might find articles to read, and now I can easily generate PDFs from both.
</p>
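<p>
A script along these lines could look roughly like the following sketch. It is
not my exact script: the target directory (<code>~/articles</code>), the
file-naming scheme, and the helper names are all assumptions chosen for
illustration.
</p>

```python
import re
import subprocess
import sys
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical target directory; adjust to wherever the ereader syncs from.
ARTICLE_DIR = Path.home() / "articles"


def pdf_name(url):
    """Derive a file name from the URL's host and path (assumed scheme)."""
    parts = urlparse(url)
    raw = parts.netloc + "-" + parts.path
    # Collapse every run of characters other than letters, digits and dots.
    slug = re.sub(r"[^A-Za-z0-9.]+", "-", raw).strip("-.")
    return (slug or "article") + ".pdf"


def article2pdf(url, directory=ARTICLE_DIR):
    """Convert the article at `url` and save the PDF in `directory`."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / pdf_name(url)
    subprocess.run(["pandoc", url, "-o", str(target)], check=True)
    return target


# Only run when invoked with exactly one argument: the article URL.
if __name__ == "__main__" and len(sys.argv) == 2:
    print(article2pdf(sys.argv[1]))
```

<p>
Deriving the file name from the URL keeps the script non-interactive, which is
what makes it usable from a hotkey.
</p>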
<p>
Here's what the <code>newsboat</code> config looks like:
<pre>
macro p set browser "article2pdf %u" ; open-in-browser ; set browser "elinks %u"
</pre>
And here's the <code>qutebrowser</code> binding:
<pre>
config.bind(
'"P',
'spawn article2pdf {url}'
)
</pre>
(<code>article2pdf</code> being the name of my script)
</p>
<h3>Caveats</h3>
<p>
This doesn't work perfectly.
<ul>
<li>
There are some issues with certain Unicode characters (including emojis)
that LaTeX apparently can't handle. Adding the
<code>--pdf-engine=xelatex</code> flag when calling <code>pandoc</code>
doesn't fully mitigate the issue, but it will produce reasonable output
without completely failing.
</li>
<li>
Sometimes images are not handled well. For example, they might not fit
width-wise. LaTeX completely fails on images in the WebP format.
</li>
<li>
Similarly, sometimes code blocks might get cut off and not fit
width-wise. This is admittedly a pretty big problem.
</li>
<li>
Headers and footers from many sites will not be rendered well. This
doesn't bother me; all I care about is the main article content.
</li>
</ul>
</p>
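<p>
One way to act on the Unicode caveat is to try pandoc's default PDF engine
(<code>pdflatex</code>) first and only fall back to <code>xelatex</code> when
that fails. The sketch below illustrates the idea; the fallback strategy and
the function names are assumptions, not something my script necessarily does.
</p>

```python
import subprocess

# None means pandoc's default PDF engine (pdflatex).
ENGINES = (None, "xelatex")


def build_command(url, output_file, engine=None):
    """Build a pandoc command, optionally selecting a PDF engine."""
    command = ["pandoc", url, "-o", str(output_file)]
    if engine is not None:
        command.append(f"--pdf-engine={engine}")
    return command


def convert_with_fallback(url, output_file, run=subprocess.run):
    """Try each engine in turn; return the name of the one that worked."""
    for engine in ENGINES:
        result = run(build_command(url, output_file, engine))
        if result.returncode == 0:
            return engine or "pdflatex"
    raise RuntimeError(f"pandoc could not convert {url}")
```

<p>
The <code>run</code> parameter exists only so the function can be exercised
without actually invoking <code>pandoc</code>. This does nothing for the image
and code block problems.
</p>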