title: Downloading Articles for my Ebook Reader
date: April 08, 2021 14:40
---
<p>
I've recently taken to reading blog posts and other internet articles on my ereader. And I don't mean using my tablet's browser and wifi connection to load up websites. Instead, I convert the articles I want to read to PDF and read them like I would any other ebook (I have a large-screen tablet on which reading PDFs is very comfortable; I would probably be playing around with EPUB conversion if I had a smaller screen).
</p>
<p>
The obvious way to get a PDF of a website would be to use my browser's built-in print-to-PDF feature. But this has some minor problems for me:
<ul>
<li> Articles from different websites will look very different. I can't anticipate how the website's CSS will affect readability (things like font, text size, etc.). </li>
<li> It's not super easy to automate. Maybe this is possible with headless browsers? But I haven't played around with those much and it feels silly to spin up a whole browser just to render some HTML as a PDF. </li>
</ul>
</p>
<p>
That second point, about automation and scripting, was particularly important to me. So the obvious tool for the job was the Swiss Army knife of document conversion, <code>pandoc</code>.
</p>
<p>
For a while I was wondering if I would have to write some clever script that downloads all of the article's HTML and other resources (like images) and then feeds them to <code>pandoc</code>. Fortunately, it turns out that <code>pandoc &lt;article url&gt; -o &lt;output file&gt;</code> does exactly what you think it does. The article ends up converted to PDF, with LaTeX used as an intermediate step, so everything is in the beautiful LaTeX font. <code>pandoc</code> also takes care of downloading and including images.
</p>
<h3>Hotkeys</h3>
<p>
I wrote a short script that calls <code>pandoc</code> and saves the PDF in a specific directory. With that script available and working, I added hotkeys to my browser and RSS reader that invoke it.
These are the two programs in which I might find articles to read, and now I can easily generate PDFs from both.
</p>
<p>
Here's what the <code>newsboat</code> config looks like:
<pre>
macro p set browser "article2pdf %u" ; open-in-browser ; set browser "elinks %u"
</pre>
And here's the <code>qutebrowser</code> binding:
<pre>
config.bind( '"P', 'spawn article2pdf {url}' )
</pre>
(<code>article2pdf</code> being the name of my script)
</p>
<h3>Caveats</h3>
<p>
This doesn't work perfectly.
<ul>
<li> There are some issues with certain Unicode characters (including emojis) that LaTeX apparently can't handle. Adding the <code>--pdf-engine=xelatex</code> flag when calling <code>pandoc</code> doesn't fully fix this, but it does produce reasonable output instead of failing completely. </li>
<li> Sometimes images are not handled well. For example, they might be too wide for the page. LaTeX completely fails on images in the WebP format. </li>
<li> Similarly, code blocks are sometimes too wide for the page and get cut off. This is admittedly a pretty big problem. </li>
<li> Headers and footers from many sites will not be rendered well. This doesn't bother me; all I care about is the main article content. </li>
</ul>
</p>
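<p>
For reference, here is a minimal sketch of what a script like <code>article2pdf</code> could look like. It is not my actual script. The output directory (<code>~/articles</code>), the way the file name is derived from the URL, and the helper names <code>output_name</code> and <code>build_command</code> are all assumptions made for illustration. The only <code>pandoc</code> features it relies on are the ones mentioned above: a URL as input, <code>-o</code> for the output file, and <code>--pdf-engine=xelatex</code>.
</p>

```python
#!/usr/bin/env python3
"""Hypothetical sketch of an article2pdf-style script (not the original)."""
import re
import subprocess
import sys
from pathlib import Path
from urllib.parse import urlparse

# Assumption: PDFs are collected in this directory; adjust to taste.
OUTPUT_DIR = Path.home() / "articles"


def output_name(url: str) -> str:
    """Derive a filesystem-friendly PDF name from the URL's host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Collapse every run of non-alphanumeric characters into one dash.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return f"{slug or 'article'}.pdf"


def build_command(url: str, out_dir: Path = OUTPUT_DIR) -> list[str]:
    """Build the pandoc invocation; xelatex copes better with Unicode."""
    return [
        "pandoc",
        url,
        "--pdf-engine=xelatex",
        "-o",
        str(out_dir / output_name(url)),
    ]


def main(argv: list[str]) -> int:
    if len(argv) != 2:
        print(f"usage: {argv[0]} URL", file=sys.stderr)
        return 2
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    # pandoc downloads the page and its images on its own.
    return subprocess.run(build_command(argv[1])).returncode


# Only run when at least one argument was actually passed.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

<p>
Saved as an executable file named <code>article2pdf</code> somewhere on the <code>PATH</code>, this would be called as <code>article2pdf &lt;article url&gt;</code>, which matches how the hotkeys above invoke it. Building the command as a list rather than a shell string avoids quoting problems with URLs that contain characters like <code>&amp;</code> or <code>?</code>.
</p>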