Archiving The Internet
Once upon a time, say in the 90s, we’d create an <a href> to a URL to cite our sources, and our readers could follow it.
Then came the influx of link rot, where content would move, so we started archiving content. With the recent turmoil in the American federal government[1], attacks on shared collective knowledge, and last year’s hack of the Internet Archive, many people are concerned about the process of archiving our digital culture.
Personal Archives
Here’s an idea. If you like the contents of something cool you read this morning, make a personal copy while you can.
Why yes, you can save the source as an Ugly HTML File, which may render horrifically without the online style sheets and images from soon-to-be transient CDN services.
I’ve a better idea. Store it as an Org file.
How?
The idea is simple: pipe the output from curl into pandoc, then open the result in Emacs:

    curl "$URL" | pandoc -f html -t org --wrap=none -o "$FILE.org"
    emacsclient "$FILE.org"   # To tidy up the results
Actually, you don’t even need curl, as pandoc can fetch a URL:
    pandoc -f html -t org --wrap=none -o "$FILE.org" "$URL"
You might want to read through the Pandoc documentation, place your standard options in a YAML file, and then use the --defaults command-line option.
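As a rough sketch of that last suggestion, the options used above could live in a defaults file. The file location, the name archive-defaults.yaml, and the archive_url helper are placeholders of my own, not anything Pandoc prescribes; the from, to, and wrap keys mirror the command-line flags of the same names.

```shell
#!/bin/sh
# Write a Pandoc defaults file holding the options used above.
# The location and file name are assumptions; keep it wherever you like.
DEFAULTS="${TMPDIR:-/tmp}/archive-defaults.yaml"

cat > "$DEFAULTS" <<'EOF'
from: html
to: org
wrap: none
EOF

# Hypothetical helper: archive_url URL NAME converts URL into NAME.org.
# Options given on the command line are combined with those in the file.
archive_url() {
  pandoc --defaults "$DEFAULTS" --output "$2.org" "$1"
}
```

With that in place, something like archive_url "$URL" "$FILE" should behave like the longer command above.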
Emacs Wrapper
While Pandoc does a decent job, the results aren’t perfect for someone with mild-to-severe obsessive-compulsive disorder such as myself.
So I’m going to let Emacs do some of the heavy lifting of calling Pandoc, doing some of the initial cleanup, and then leaving the results in an editable buffer.
To begin, I want a function that can download and store a web page as an HTML file in a temporary directory. Later, I will point Pandoc at this file, but before that, I can analyze the file to retrieve information like the author and page title.
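    (defun archival-web-as-html (url)
      "Download URL and return the temporary file containing its HTML content."
      (let ((file (make-temp-file "archival-" nil ".html")))
        ;; Quote both arguments so that characters like & or ? in the
        ;; URL are not interpreted by the shell:
        (shell-command (format "curl --location --silent --output %s %s"
                               (shell-quote-argument file)
                               (shell-quote-argument url)))
        file))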
In HTML, the <head> section contains information about the webpage, such as the page’s title and author.
Given the HTML file, let’s use libxml-parse-html-region to turn the HTML tags into a walkable internal structure of s-expressions:
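    (defun archival-html-head (file)
      "Return the entries of the <head> section from FILE with HTML contents."
      (with-temp-buffer
        ;; insert-file is meant for interactive use; from Lisp,
        ;; insert-file-contents is the right call:
        (insert-file-contents file)
        (thread-last
          (libxml-parse-html-region (point-min) (point-max))
          (nth 2)     ; The <head> element
          (cddr))))   ; Its children, as an alist keyed by tag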
First, let’s have a function that scans through the <head> entries looking for <meta…> tags, and elevates either the name or property attribute as the key and the content as the value:
    (defun archival-head-to-metadata (headers)
      "Filter the <meta...> entries from HEADERS."
      (thread-last headers
        ;; filter headers of type <meta>:
        (seq-filter (lambda (e) (eq (car e) 'meta)))
        ;; get the attributes for the <meta> tag:
        (seq-map (lambda (e) (cadr e)))
        ;; filter entries that have either a name or
        ;; property attribute:
        (seq-filter (lambda (e) (or (alist-get 'property e)
                                    (alist-get 'name e))))
        ;; Pair the name/property and content:
        (seq-map (lambda (e) (cons (or (alist-get 'property e)
                                       (alist-get 'name e))
                                   (alist-get 'content e))))))
At this point, an assortment of header entries, like:
    <html lang="en">
      <head>
        <title>How to Get a Common Lisp Job In 2025 | Sebastian Carlos | Satire, Humor, Programming, Computer Science, Lisp | Medium </title>
        <meta charset="utf-8"/>
        <meta name="theme-color" content="#000000"/>
        <meta property="og:site_name" content="Medium"/>
        <meta property="og:type" content="article"/>
        <meta property="article:published_time" content="2025-03-07T20:52:15.901Z"/>
        <meta name="title" content="How to Get a Common Lisp Job In 2025 | Sebastian Carlos | Satire, Humor, Programming, Computer Science, Lisp | Medium"/>
        <meta property="og:title" content="Common Lisp In 2055"/>
        <meta name="description" content="It’s 2055. The job market is tough. AI automation has made 99.7% of all jobs obsolete — a human barista is now a novelty. Despite AI calling the shots, it can’t generate truly original ideas —…"/>
        <meta property="og:description" content="The year is 2055. AI automation has made 99.7% of all jobs obsolete — a human barista is now a rare novelty."/>
        <meta property="og:url" content="https://medium.com/@sebastiancarlos/common-lisp-in-2055-f3debf4df01c"/>
        <meta property="og:image" content="https://miro.medium.com/v2/resize:fit:1200/0*l_JDD2Os1gdRilMg.jpg"/>
        <meta property="article:author" content="https://medium.com/@sebastiancarlos"/>
        <meta name="author" content="Sebastian Carlos"/>
        <meta name="robots" content="index,noarchive,follow,max-image-preview:large"/>
        <meta name="referrer" content="unsafe-url"/>
        <link rel="icon" href="https://miro.medium.com/v2/5d8de952517e8160e40ef9841c781cdc14a5db313057fa3c3de41c6f5b494b19"/>
        <link rel="search" type="application/opensearchdescription+xml" title="Medium" href="/osd.xml"/>
        ...
Becomes an association list (an alist of key/value pairs, not a hash table), like:
    (("theme-color" . "#000000")
     ("og:site_name" . "Medium")
     ("og:type" . "article")
     ("article:published_time" . "2025-03-07T20:52:15.901Z")
     ("title" . "How to Get a Common Lisp Job In 2025 | Sebastian Carlos | Satire, Humor, Programming, Computer Science, Lisp | Medium")
     ("og:title" . "Common Lisp In 2055")
     ("description" . "It’s 2055. The job market is tough. AI automation has made 99.7% of all jobs obsolete — a human barista is now a novelty. Despite AI calling the shots, it can’t generate truly original ideas —…")
     ("og:description" . "The year is 2055. AI automation has made 99.7% of all jobs obsolete — a human barista is now a rare novelty.")
     ("og:url" . "https://medium.com/@sebastiancarlos/common-lisp-in-2055-f3debf4df01c")
     ("al:web:url" . "https://medium.com/@sebastiancarlos/common-lisp-in-2055-f3debf4df01c")
     ("og:image" . "https://miro.medium.com/v2/resize:fit:1200/0*l_JDD2Os1gdRilMg.jpg")
     ("article:author" . "https://medium.com/@sebastiancarlos")
     ("author" . "Sebastian Carlos")
     ("robots" . "index,noarchive,follow,max-image-preview:large")
     ("referrer" . "unsafe-url"))
Since we can’t agree on the metadata, I needed to write a fairly complicated and probably evolving function.
For instance, the <title> is seldom the actual title. Since a browser displays that text at the top of a window (or its tab), web page authors often overload it with more than just the title. Other details, like the author or the date written, are often in the prose and not in the header.
Yes, we’ve attempted to standardize this before, so:
    <head>
      <meta name="author" content="Howard Abrams">
      ...
    </head>
May or may not work, so I will just use Lisp’s or function to pick the most preferred approach.
    (defun archival-html-metadata (headers)
      "Extract interesting metadata from s-exp HEADERS.
    Return a property list containing :title and others.

    Since webpages have different ways of including information in
    the header, we attempt to be pretty liberal in what we accept,
    and choose elements that are more promising over others.

    For instance, the <title>...</title> should be the article's
    title, but since it shows as the tab or window in a browser, some
    articles include too much information, like the name of the
    site, etc."
      (let ((title (let-alist headers (nth 1 .title)))
            (metas (archival-head-to-metadata headers)))
        (cl-flet ((meta-get (tag) (alist-get tag metas nil nil 'string-equal)))
          (list :title  (or (meta-get "og:title")
                            (meta-get "title")
                            title)
                :author (or (meta-get "author"))
                :date   (or (meta-get "article:published_time"))
                :desc   (or (meta-get "og:description")
                            (meta-get "description"))))))
Ambiguities still abound. For instance, what should $FILE be? Perhaps it could be extracted from the URL? Or the <title> in the HTML?
    (defun archival-html-file (url &optional title)
      "Extract filename from TITLE or URL."
      (let ((filename (thread-last url
                                   url-generic-parse-url
                                   url-path-and-query
                                   car
                                   file-name-base)))
        (thread-last (or title filename)
                     (downcase)
                     (replace-regexp-in-string (rx (any ":'!?.*#@\"")) "")
                     (replace-regexp-in-string (rx (or space "_")) "-")
                     (format "%s.org"))))
Some simple tests to validate that it behaves as expected.
    (ert-deftest archival-html-file-test ()
      (should (equal "common-lisp-in-2055-f3debf4df01c.org"
                     (archival-html-file
                      "https://medium.com/@sebastiancarlos/common-lisp-in-2055-f3debf4df01c")))
      (should (equal "something-thats-completely-different.org"
                     (archival-html-file
                      "https://stevelosh.com/blog/2018/08/a-road-to-common-lisp"
                      "Something that's completely different?"))))
Even with all my lovely extraction code, I may want to review the results before I use them.
    (defun archival-url-data (url)
      "Return a list of information from gleaning URL.
    Including:
      - Input :: The original HTML file to read
      - Output :: a spaceless filename with .org ending
      - Title :: title from tag or metadata
      - Author :: gleaned from the author metadata
      - Date"
      (let* ((file   (archival-web-as-html url))
             (head   (archival-html-head file))
             (meta   (archival-html-metadata head))
             ;; plist-get takes the plist first, then the key:
             (name   (archival-html-file url (plist-get meta :title)))
             (full   (read-file-name "Org File: "
                                     (expand-file-name
                                      (file-name-as-directory org-directory))
                                     nil nil name))
             (title  (read-string "Title: " (plist-get meta :title)))
             (author (read-string "Author: " (plist-get meta :author)))
             ;; Parse the timestamp in the file (or the current time):
             (date   (read-string "Date: "
                                  (thread-last
                                    (current-time-string)
                                    (or (plist-get meta :date))
                                    (parse-time-string)
                                    (encode-time)
                                    (format-time-string "%Y-%m-%d")))))
        (list file full title author date)))
No matter what programming language you use, manipulating dates and times is awful. Here, I start with the current time, overwrite it with the value in the HTML file (if given, using the or), parse it, encode and format it. Sheesh.
Now I’m ready to rock and roll here. This function serves as my UI: after asking for a URL, it downloads the page, analyzes it, and then calls functions like read-string (with defaults from the parsed file) to get … alllllofthedata.
    (defun archival-url-to-org (url)
      "Store URL as an org file."
      (interactive "sURL: ")
      (seq-let (input output title author date) (archival-url-data url)
        ;; Convert the already downloaded HTML into Org format.
        ;; Expand and quote the file names so that a leading ~ or any
        ;; spaces survive the trip through the shell:
        (shell-command
         (format "pandoc --from html --to org --wrap=none --output %s %s"
                 (shell-quote-argument (expand-file-name output))
                 (shell-quote-argument input)))
        ;; Pull in the file so that we can clean-it-up
        (find-file output)
        ;; Insert some proper header information previously collected:
        (goto-char (point-min))
        (insert (format (concat "#+TITLE: %s\n"
                                "#+AUTHOR: %s\n"
                                "#+DATE: %s\n"
                                "#+LASTMOD: [%s]\n"
                                "#+REF: %s\n"
                                "#+FILETAGS: website archive\n"
                                "#+STARTUP: inlineimages\n")
                        title author date
                        (format-time-string "%Y-%m-%d %a")
                        url))))
If I didn’t want to deal with the metadata, then I could have made this code quite a bit smaller, so your own script may be just a shell script. Feel free to steal this code, and I hope it was helpful to see a cookbook-approach to solving a problem in Lisp.
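For what it’s worth, here is a minimal sketch of what such a shell-only version might look like. The names slug_from_url and archive are made up for illustration, and the cleanup rules only loosely mirror archival-html-file above; it skips the metadata entirely.

```shell
#!/bin/sh
# slug_from_url URL: print a file-name-friendly slug taken from the
# last segment of the URL's path (a hypothetical helper).
slug_from_url() {
  path=${1%%[?#]*}     # drop any query string or fragment
  path=${path%/}       # drop a trailing slash
  base=${path##*/}     # keep only the last path segment
  base=${base%.*}      # drop an extension such as .html, if present
  printf '%s\n' "$base" |
    tr '[:upper:]' '[:lower:]' |
    tr -d ":'!?.*#@\"" |
    sed 's/[ _]/-/g'
}

# archive URL: fetch the page and convert it to an Org file in the
# current directory, named after the URL.
archive() {
  curl --location --silent "$1" |
    pandoc --from html --to org --wrap=none \
           --output "$(slug_from_url "$1").org"
}
```

Calling archive "$URL" would then leave a file like common-lisp-in-2055-f3debf4df01c.org behind, ready for tidying up by hand.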
Footnotes:
[1] Sigh. No link is necessary, as I assume all readers of this essay will understand what I’m referring to.