How to scrape articles from data science publications using Python

What data science articles attract more attention (Part 1)

Have you ever wondered what makes an article great? Are there specific areas of data science that readers are more interested in? I certainly have! I am aiming to answer these questions by analysing articles from data science publications on Medium. This series of articles will cover areas such as web scraping, cleansing text data and topic modelling.

In Part 1 of the series we obtain historical articles from various data science publications by web scraping with Python.

Navigating to archives

A simple Google search gives us the link to the archive of the publication we are interested in. Almost all publications have a URL of the form ‘https://medium.com/publication-name/archive’, with the exception of Towards Data Science (TDS). The layout of the archive page, however, is the same everywhere.

Let’s inspect a publication’s archive page. At the top of the page we find the years in which articles were published; clicking on a year reveals the months, and clicking on a month reveals the days. One thing to note: not all years have months, and not all months have days.

Inspecting the archive page further, we can see that these links are stored in ‘div’ containers whose class starts with ‘timebucket’. The only difference in the class naming between the three containers (years, months, days) is the width.

Inspecting the Analytics Vidhya archive page.

Now that we have all the information needed to obtain the links, we can write the code. Let’s start with the imports necessary for this task: we will use the Python libraries requests and BeautifulSoup to make HTTP requests and extract the data from the HTML.
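A minimal sketch of how this could look, assuming a publication slug such as ‘analytics-vidhya’ and matching on the ‘timebucket’ class spotted during inspection:

```python
import requests
from bs4 import BeautifulSoup

def timebucket_links(url):
    """Return the links held in 'timebucket' div containers on a page."""
    page = BeautifulSoup(requests.get(url).text, 'html.parser')
    divs = page.find_all('div', class_='timebucket')
    return [a['href'] for div in divs for a in div.find_all('a', href=True)]

def get_archive_links(publication):
    """Collect links to every year/month/day page in a publication's archive."""
    archive_url = f'https://medium.com/{publication}/archive'
    links = []
    for year in timebucket_links(archive_url) or [archive_url]:
        for month in timebucket_links(year) or [year]:        # not all years have months
            links.extend(timebucket_links(month) or [month])  # not all months have days
    return links

links = get_archive_links('analytics-vidhya')  # hypothetical publication slug
```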

The final output is a list of links to all the pages in the publication archive.

Scraping the articles

Inspecting the page is a good starting point for this task as well. All article boxes have a standard format, so working out how to extract information from one box lets us collect everything we need from all the other articles.

Before we can write the code, we need to establish the elements, and their classes, where the information we need is stored. At this point we are only interested in getting the title, subtitle, number of claps and number of responses for each article.

Inspecting individual articles

We now have all the information to write the code; we start with smaller snippets of the main script. To get all the articles from a single link, we search for ‘div’ containers whose class starts with ‘streamItem’.
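A sketch of that step, reusing requests and BeautifulSoup from above; matching on the ‘streamItem’ class alone is assumed to be sufficient, even though the full class string on the page is longer:

```python
# Parse one archive page and pull out its article boxes.
page = BeautifulSoup(requests.get(links[0]).text, 'html.parser')
articles = page.find_all('div', class_='streamItem')
```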

Next we obtain the title and subtitle for a single article. In the code below we check two elements for the title, ‘h3’ and ‘h2’, because some authors choose a different approach to writing the title. Where the title or subtitle is missing we set it to an empty string, though this value can be changed to anything.
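A sketch of the extraction; the ‘graf--title’ and ‘graf--subtitle’ classes are assumptions about Medium’s markup, not something stated above:

```python
def get_title_and_subtitle(article):
    """Return (title, subtitle) for one article box, '' when missing."""
    # Some authors use 'h3' for the title, others 'h2' (assumed classes).
    title = article.find(['h3', 'h2'], class_='graf--title')
    subtitle = article.find('h4', class_='graf--subtitle')
    return (title.text if title else '',
            subtitle.text if subtitle else '')
```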

To obtain claps and responses, we find the corresponding elements by their class. In this case we are not just interested in the text; we need integer values. Claps for individual articles sometimes reach the thousands, in which case they are displayed in the form ‘1K’, ‘2.2K’, and so on. To deal with this we strip the ‘K’, convert the remaining string to a number and multiply it by 1,000. All other values we either convert to an integer directly or set to zero.
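A sketch of the conversion just described:

```python
def convert_claps(claps_text):
    """Turn claps text such as '857' or '2.2K' into an integer."""
    if not claps_text:
        return 0
    if claps_text[-1].upper() == 'K':
        return int(float(claps_text[:-1]) * 1000)  # '2.2K' -> 2200
    return int(claps_text)
```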

When an article has responses from readers, they come in the format ‘1 response’, ‘2 responses’, and so on. Here we just want the integer, which we extract by searching the string with the ‘\d+’ regular expression.
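A sketch of that extraction:

```python
import re

def convert_responses(responses_text):
    """Extract the integer from text such as '1 response' or '11 responses'."""
    match = re.search(r'\d+', responses_text or '')
    return int(match.group()) if match else 0
```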

Running the script might take some time depending on the number of HTTP requests; each page contains 10 articles, and some publications have quite large archives. The script uses the tqdm package to keep track of progress.
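Below is a sketch of how the main script could tie the earlier snippets together. The clap-button and response-link classes are assumptions about Medium’s markup at the time, and pandas is assumed for assembling the final data set:

```python
import pandas as pd
from tqdm import tqdm

records = []
for link in tqdm(get_archive_links('analytics-vidhya')):
    page = BeautifulSoup(requests.get(link).text, 'html.parser')
    for article in page.find_all('div', class_='streamItem'):
        title, subtitle = get_title_and_subtitle(article)
        # Assumed classes for the clap count and the response link.
        claps = article.find('button', class_='js-multirecommendCountButton')
        responses = article.find('a', class_='button--chromeless')
        records.append({
            'title': title,
            'subtitle': subtitle,
            'claps': convert_claps(claps.text if claps else ''),
            'responses': convert_responses(responses.text if responses else ''),
        })

df = pd.DataFrame(records)
```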

After running the script, we get a data set with the title, subtitle, claps and responses for each article.

Wrap up

We started our journey of uncovering which areas of data science readers find most interesting by obtaining a data set from the archives of several publications. The information on an article is not limited to the features we covered; there are plenty more metrics to obtain, such as whether the author included an image or how long the article is. These can be collected in a similar fashion to the title, subtitle, claps and responses.

Keep an eye out for further articles in this series, where we continue with data cleansing and topic modelling.
