Weekend Hack, Another Scraper

1/20/2019

So I recently started to write programming articles for another website. I thought it'd be a good idea to link to the articles on my own website. I've set it up here: https://msanatan.com/categories/other/.

The first thing I needed to do was get all my articles in one go. The scraper consists of 3 parts:

  • Parse the HTML to get the articles from my user page - https://stackabuse.com/author/marcus
  • Save each article as a Markdown file with a link property
  • Allow users to specify which author to scrape via command line arguments

Parsing

For parsing I used Beautiful Soup, definitely one of the most popular scraping libraries for Python. I skipped lxml as the page is very simple, so there was no need for fancy XPath. The function below returns all the posts as a list of dictionaries. I'm not copying the content, so all I need are the titles, links and dates.

import csv
import datetime
import json
import logging
import pathlib

import requests
from bs4 import BeautifulSoup

# Defined near the top of the full script, used to build absolute links
BASE_URL = 'https://stackabuse.com'


def parse_posts(author_url):
    '''Recursively retrieves all the posts of a writer on Stack Abuse'''
    logging.info('Scraping {}'.format(author_url))
    posts = []
    # It's always good to set a user-agent; it makes the request look more like a
    # regular browsing request, and without it some sites would outright block you
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
    response = requests.get(author_url, headers=headers)
    # Don't be too hasty, check that your response was actually successful
    if response is not None and response.status_code == 200:
        html = BeautifulSoup(response.content, 'html.parser')
        # Loop through all the articles
        for article in html.find_all('article'):
            title_tag = article.find('h2', {'class': 'post-title'}).find('a')
            title = title_tag.text
            link = BASE_URL + title_tag['href']
            meta = article.find('div', {'class': 'post-meta'})
            # The date comes like December 10, 2018
            #  We really want it like 2018-12-10
            date_text = meta.find('span', {'class': 'date'}).text
            date = datetime.datetime.strptime(date_text, '%B %d, %Y')
            post = {
                'title': title,
                'link': link,
                'date': datetime.date.strftime(date, "%Y-%m-%d"),
            }
            posts.append(post)
        logging.info('{} posts found on page'.format(len(posts)))
        # Stack Abuse paginates every 5 posts, this collects the older ones
        pagination = html.find('nav', {'class': 'pagination'}).find('a', {'class': 'older-posts'})
        if pagination is not None:
            logging.info('Retrieving older posts')
            # Who said you don't use recursion?
            return posts + parse_posts(BASE_URL + pagination['href'])
        return posts
    else:
        logging.error('Could not get a response for the link')
        return []
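
To give an idea of what comes back, here's a quick illustration (not part of the script itself) of calling the function on my author page; the example entry is the same article shown in the conclusion:

posts = parse_posts('https://stackabuse.com/author/marcus/')
# Each entry is a plain dictionary, for example:
# {
#     'title': 'Building a GraphQL API with Django',
#     'link': 'https://stackabuse.com/building-a-graphql-api-with-django/',
#     'date': '2018-12-10',
# }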

Saving Data

I needed to convert the data to Markdown so that Hexo can use it. I figured that by allowing JSON and CSV formats as well, this scraper could be of more use to others. It doesn't add much work either, since Python makes saving each file type dead simple:

def get_posts_json(filename, author_url):
    '''Dumps JSON for stack abuse articles'''
    posts = parse_posts(author_url)
    logging.info('Retrieved {} posts'.format(len(posts)))
    with open(filename, 'w') as json_file:
        json.dump(posts, json_file, indent=4)


def get_posts_csv(filename, author_url):
    '''Saves CSV file for stack abuse articles'''
    posts = parse_posts(author_url)
    logging.info('Retrieved {} posts'.format(len(posts)))
    headers = ['Title', 'Link', 'Date']
    with open(filename, 'w', newline='') as csv_file:  # newline='' stops csv adding blank rows on Windows
        csv_writer = csv.writer(csv_file, delimiter=',', quoting=csv.QUOTE_ALL)
        csv_writer.writerow(headers)
        for post in posts:
            csv_writer.writerow([post['title'], post['link'], post['date']])


def get_posts_markdown(author_url):
    '''Saves posts as markdown files to work in Hexo'''
    posts = parse_posts(author_url)
    logging.info('Retrieved {} posts'.format(len(posts)))
    # As markdown produces many files, it's neater to have them in one folder
    pathlib.Path('articles').mkdir(exist_ok=True)
    for post in posts:
        # Use the slug as it's a more appropriate file name
        post_slug = slugify(post['title'])
        with open('articles/{}.md'.format(post_slug), 'w') as f:
            f.writelines([
                '---\n',
                'title: {}\n'.format(post['title']),
                'date: {}\n'.format(post['date']),
                'categories: [other]\n',
                'link: {}\n'.format(post['link']),
                '---\n',
            ])
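
A small note: slugify isn't in the standard library, so it has to come from a package. A library like python-slugify does the job, turning a title into a file-system friendly name:

from slugify import slugify  # from the python-slugify package, for example

print(slugify('Building a GraphQL API with Django'))
# building-a-graphql-api-with-django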

Command Line Arguments

Software is used by humans, so always make your programs friendly. Python comes with a flexible argument-parsing library, argparse, which brings some order and useful help messages. Even for small programs like this, it feels better than processing sys.argv values myself. I put the argparse logic in the main function:

def main():
    '''Argument parser for scraper'''
    # I imported the library like: from argparse import ArgumentParser
    # For most cases you just need the ArgumentParser class
    parser = ArgumentParser(description='Web scraper for Stack Abuse writers')
    # Adding an argument is pretty simple: I give the short and long forms,
    # specify the property the value will be saved as in dest, and write a
    # useful help message. In this case the author should be required
    parser.add_argument('-a', '--author', dest='author',
                        help='Writer whose articles you want', required=True)

    # The user selects what format they would like the data in. A file can't be
    # JSON and CSV or CSV and Markdown at the same time so we make these options
    # mutually exclusive.
    group = parser.add_mutually_exclusive_group()
    group.add_argument('--csv', action='store_true',
                       help='Save data in CSV format')
    group.add_argument('--json', action='store_true',
                       help='Save data in JSON format')
    group.add_argument('--markdown', action='store_true',
                       help='Save data as Markdown articles for Hexo')

    # A simple way to manage log levels in your app!
    parser.add_argument('-l', '--loglevel', dest='loglevel',
                        help='Select log level', default='info')
    args = parser.parse_args() # Where all the magic happens

    # Set logging preferences
    if args.loglevel == 'error':
        log_level = logging.ERROR
    elif args.loglevel == 'debug':
        log_level = logging.DEBUG
    else:
        log_level = logging.INFO

    logging.basicConfig(filename='stackabuse_scraper.log', level=log_level)

    # BASE_URL was defined earlier in the script, fyi
    author_url = '{}/author/{}/'.format(BASE_URL, args.author)
    # Determine output format
    if args.csv:
        get_posts_csv('stackabuse_articles.csv', author_url)
    elif args.json:
        get_posts_json('stackabuse_articles.json', author_url)
    elif args.markdown:
        get_posts_markdown(author_url)
    # Put a default case for that one user who'll try to break it :-/
    else:
        print(json.dumps(parse_posts(author_url)))
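
Finally, the bottom of the script just needs the usual entry point so it can be run from the command line (a minimal sketch, with a couple of example invocations):

if __name__ == '__main__':
    main()

# Example invocations:
#   python3 stackabuse_scraper.py -a marcus --json
#   python3 stackabuse_scraper.py -a marcus --csv --loglevel debug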

Conclusion

When I run python3 stackabuse_scraper.py -a marcus --markdown I get the following Markdown file as output:


---
title: Building a GraphQL API with Django
date: 2018-12-10
categories: [other]
link: https://stackabuse.com/building-a-graphql-api-with-django/
---

Exactly what I wanted and usable with Hexo! This was just some of the annotated code; you can find the full script at https://github.com/msanatan/stackabuse_scraper. That's all I got through this weekend, till next time.

Happy Hacking!