Processing a List of Links using Python and BeautifulSoup
Whilst uploading the weekly podcast, I am required to produce a list of the links discussed on the show. This can become a little tedious, as I must visit each link to find a suitable title. Additionally, Markdown requires the list to be written in a specific format. I had been doing this manually for a couple of weeks, and last night I thought, “I am a developer; I should automate this.”
Below is a simple script I wrote in Python 3 that grabs the latest entry from your clipboard (a list of links) and processes it into the specified format. By default, it creates a Markdown-formatted list, but this can be changed at the command line by supplying a different Python format string as the first argument. The script requires an environment with the “requests”, “xerox”, and “beautifulsoup4” packages installed.
#!/usr/bin/env python3
import sys

import requests
import xerox
from bs4 import BeautifulSoup
from requests.exceptions import InvalidSchema, MissingSchema

# Output template: a Markdown list item by default, or whatever format
# string is supplied as the first command-line argument.
template = sys.argv[1] if len(sys.argv) > 1 else '- [{title}]({url})'

links = []
# The clipboard is expected to hold one URL per line.
for link in xerox.paste().split('\n'):
    try:
        url = link.strip()
        print(url, '... ', end='')
        # Fetch the page and pull the text of its <title> element.
        req = requests.get(url)
        res = BeautifulSoup(req.text, 'html.parser')
        title = res.title.string.strip()
        links.append(template.format(title=title, url=req.url))
        print(title)
    except (InvalidSchema, MissingSchema):
        # Skip clipboard lines that are not valid URLs.
        print('x')

# Replace the clipboard contents with the formatted list.
xerox.copy('\n'.join(links))
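Usage is simple: copy the list of URLs, run the script, and the clipboard is replaced with the formatted list. As a rough sketch, assuming the script is saved under the hypothetical name links.py, an invocation with the default Markdown template and one with a custom template might look like:

    ./links.py
    ./links.py '* {title} <{url}>'

Any string accepted by Python's str.format with the {title} and {url} placeholders should work as a template.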
For convenience, I keep this script in my ‘~/bin’ directory with execute permissions, so I can run it without specifying the Python interpreter.
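For reference, a minimal setup along those lines (again assuming the hypothetical links.py name, and pip as the package manager) might look like:

    pip install requests beautifulsoup4 xerox
    cp links.py ~/bin/links.py
    chmod +x ~/bin/links.py

With ~/bin on the PATH, the shebang line takes care of the rest.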