Using Telegra.ph as an external editor for articles
In this article I will show an interesting use case: using Telegra.ph as an external WYSIWYG editor. The same approach also works as a grabber for any other resource, or for parsing and converting HTML into a different structure.
So, let's begin. We will use Python 3 and only the standard library, without any third-party modules or extensions. The initial goal is to grab the desired HTML structure of the article and save its images so that we can serve them from our own server.
First we need to download the page; urllib is on duty:
from urllib import request
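The download step can be sketched roughly as follows. The function name `fetch_html`, the timeout, and the user agent string are assumptions made for illustration and are not taken from the original listing.

```python
from urllib import request


def fetch_html(url, timeout=10):
    """Download a page and return its markup as text (hypothetical helper)."""
    # Some servers reject the default urllib user agent, so send a generic one.
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with request.urlopen(req, timeout=timeout) as response:
        # Fall back to UTF-8 when the response does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling `fetch_html` with the address of a Telegra.ph article should return the page markup as a string, ready to be fed to the parser.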
After that we need to drop the tags that will not be used, update their attributes, or replace them with other tags. The tool for parsing the structure is html.parser:
class ArticleParser(HTMLParser):
There are three main methods, each acting as a callback: one is called when an opening tag is encountered, one when a closing tag is encountered, and one for the data inside the tag currently being processed. We will use a stack to restore the correct structure of the document and transform it on the fly as needed. The appending flag shows whether we add data to the resulting document or skip it.
For an image tag we first need to download the image from its remote URL; the function below helps with that:
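To make the interplay of the three callbacks, the stack, and the appending flag concrete, here is a minimal sketch. The class name `SketchParser`, the tag sets, and the `clean` helper are assumptions for illustration; the article's own ArticleParser may be organised differently.

```python
from html.parser import HTMLParser


class SketchParser(HTMLParser):
    """Keeps whitelisted tags and drops the content of skipped ones."""

    ALLOWED = {"p", "h3", "h4", "b", "i"}  # assumed whitelist
    SKIPPED = {"script", "style"}          # their content is dropped entirely
    VOID = {"br", "hr", "img"}             # tags that never get a closing tag

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, innermost last
        self.appending = True  # False while inside a skipped tag
        self.result = []       # pieces of the output document

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)
        if tag in self.SKIPPED:
            self.appending = False
        elif self.appending and tag in self.ALLOWED:
            self.result.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        if tag in self.SKIPPED:
            self.appending = True
        elif self.appending and tag in self.ALLOWED:
            self.result.append(f"</{tag}>")

    def handle_data(self, data):
        if self.appending:
            self.result.append(data)


def clean(html_data):
    """Run the sketch parser over a document and return the filtered markup."""
    parser = SketchParser()
    parser.feed(html_data)
    parser.close()
    return "".join(parser.result)
```

Tags outside the whitelist are removed while their text is kept, and everything inside a skipped tag disappears because the flag is off while that tag is open.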
import os
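A helper of this kind might look like the sketch below. The name `download_image`, its parameters, and the choice of the last URL path segment as the file name are assumptions, not the original implementation.

```python
import os
from urllib import parse, request


def download_image(url, target_dir):
    """Save the image at `url` into `target_dir` and return the local path."""
    os.makedirs(target_dir, exist_ok=True)
    # Use the last path segment of the URL as the file name (assumed policy).
    file_name = os.path.basename(parse.urlparse(url).path) or "image"
    local_path = os.path.join(target_dir, file_name)
    with request.urlopen(url) as response, open(local_path, "wb") as out:
        out.write(response.read())
    return local_path
```

The returned path can then replace the remote address in the `src` attribute, so the image is served from our own server.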
And the full code of the handle_starttag method:
def handle_starttag(self, tag, attrs):
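One part of such a method is rebuilding the opening tag with only the attributes we want to keep. HTMLParser passes `attrs` as a list of `(name, value)` tuples; the helper `render_start_tag` and its `allowed_attrs` parameter below are hypothetical names used only to illustrate that step.

```python
from html import escape


def render_start_tag(tag, attrs, allowed_attrs):
    """Rebuild an opening tag, keeping only attributes in allowed_attrs."""
    kept = []
    for name, value in attrs:
        if name in allowed_attrs:
            # Attributes without a value (e.g. <video controls>) arrive as None.
            kept.append('%s="%s"' % (name, escape(value or "", quote=True)))
    return "<" + " ".join([tag] + kept) + ">"
```

Escaping the values matters because the parser hands them over already unescaped, so writing them back verbatim could produce broken markup.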
Another helper method that we use is wrap_in_tag. It ensures that the data is properly enclosed within a tag:
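The original listing for this helper is not available here, so the following is only a guess at its shape. The signature, the optional `attrs` mapping, and the assumption that `data` is already valid markup are all hypothetical.

```python
from html import escape


def wrap_in_tag(tag, data, attrs=None):
    """Enclose already-rendered markup in <tag>...</tag> (hypothetical signature)."""
    attr_text = "".join(
        f' {name}="{escape(str(value), quote=True)}"'
        for name, value in (attrs or {}).items()
    )
    return f"<{tag}{attr_text}>{data}</{tag}>"
```

For example, `wrap_in_tag("p", "Hello")` gives `<p>Hello</p>`.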
The function for handling a closing tag should be symmetrical to the one that handles the opening tag, like this:
def handle_endtag(self, tag):
This code also does a simple validation of tag balancing and reports errors, if any.
Finally, we handle the enclosed data and append it to an intermediate buffer:
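The balancing check on its own can be sketched with a stack, as below. The class `BalanceChecker`, the error messages, and the list of void tags are assumptions for illustration; the article's method performs this check as part of its regular closing-tag handling.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # no closing tag expected


class BalanceChecker(HTMLParser):
    """Collects messages about closing tags that do not match the open ones."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack:
            self.errors.append(f"unexpected </{tag}>")
        elif self.stack[-1] != tag:
            self.errors.append(f"expected </{self.stack[-1]}>, got </{tag}>")
        else:
            self.stack.pop()


def find_balance_errors(html_data):
    """Return a list of balancing problems found in the markup."""
    checker = BalanceChecker()
    checker.feed(html_data)
    checker.close()
    # Anything still on the stack was opened but never closed.
    return checker.errors + [f"unclosed <{tag}>" for tag in checker.stack]
```

An empty list means every opening tag was closed in the right order.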
def handle_data(self, data):
The same result can be achieved with regular expressions, but that would be much more complex and error-prone. For example, we can look up the title of the article using a helper function like this:
def find_tag(tag_name, html_data):
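A regex-based body for a function with this signature could look like the sketch below. Only the name and parameters come from the listing above; the pattern, the stripping of nested markup, and the `None` return value are assumptions.

```python
import re
from html import unescape


def find_tag(tag_name, html_data):
    """Return the text of the first <tag_name> element, or None if absent."""
    pattern = r"<{0}\b[^>]*>(.*?)</{0}\s*>".format(re.escape(tag_name))
    match = re.search(pattern, html_data, re.IGNORECASE | re.DOTALL)
    if match is None:
        return None
    # Drop nested markup and decode entities such as &amp;.
    text = re.sub(r"<[^>]+>", "", match.group(1))
    return unescape(text).strip()
```

This is enough for a single, non-nested element such as the page title, but it breaks down on nested tags of the same name, which is one reason the parser-based approach is more robust.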
Summary
We have built a grabber and parser for articles that fetches and formats them the way we want, using only tools from the standard library. You might extend this example by adding different providers, which could turn it into a tool for populating your own blog with articles aggregated from different resources. If you want to rely on more user-friendly libraries, see the links below.