Processing a List of Links using Python and BeautifulSoup
Whilst uploading the weekly podcast I am required to produce a list of links we discussed about on the show. This can get a little tiresome, visiting each link and finding a suitable title. Along with this, using Markdown you are required to provide lists in a specific format. I had been doing this manually for a couple of weeks and last night I thought, I am a developer, I should not be doing unnecessary work.
Below is a simple script I wrote in Python (3) that grabs the latest entry from your clipboard (a list of links) and then processes them into the specified format. By default it creates a Markdown formatted list, but this can be changed at the command line, by supplying another Python-format compliant string. It is required that the script has access to an environment with ‘xerox’ and ‘beautifulsoup4’ packages installed.
#!/usr/bin/env python3
import sys, requests, xerox
from bs4 import BeautifulSoup
from requests.exceptions import InvalidSchema, MissingSchema
template = sys.argv[1] if len(sys.argv) > 1 else '- [{title}]({url})'
links = []
for link in xerox.paste().split('\n'):
try:
url = link.strip()
print(url, '... ' , end='')
req = requests.get(url)
res = BeautifulSoup(req.text)
title = res.title.string.strip()
links.append(template.format(title=title, url=req.url))
print(title)
except (InvalidSchema, MissingSchema) as exp:
print('x')
xerox.copy('\n'.join(links))
For convenience of invocation I store this script in my ‘~/bin’ directory with execute privileges, allowing me to not have to specify the Python interpreter.