My Spam Blog: BeautifulSoup for web scraping

BeautifulSoup for web scraping

Posted by Viswanadh Y on Saturday, August 21, 2010
Tags: BeautifulSoup, facebook, python, scraping, script, web

Ever wondered why spam e-mails start arriving in your inbox within minutes of posting your e-mail address on a popular website or blog? Most of those spam e-mails are sent by bots, whose job is to search (scrape) web pages for text that matches the pattern of an e-mail address. This is accomplished using powerful scraping tools. One such tool is BeautifulSoup.
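
To make the idea of "text that matches the pattern of an e-mail address" concrete, here is a minimal sketch in Python 3 using only the standard 're' module. The pattern, the function name 'find_email_like_text' and the sample string are assumptions made up for illustration; real bots are likely more elaborate, and the pattern is much stricter than actual e-mail syntax.

```python
import re

# Deliberately simple pattern (an assumption for illustration); real
# e-mail address syntax is far more permissive than this.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email_like_text(page_text):
    """Return every substring of page_text that looks like an e-mail address."""
    return EMAIL_PATTERN.findall(page_text)


# Hypothetical page text: one plain address and one obfuscated address.
sample = "Contact me at jane.doe@example.com or jane (at) example (dot) com."
print(find_email_like_text(sample))  # prints ['jane.doe@example.com']
```

Only the plain address is picked up. The obfuscated "jane (at) example (dot) com" form does not match this simple pattern, which is one reason people write their addresses that way on public pages.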

Forget e-mail spamming for a moment; there are a lot of other things that you can do with these tools, e.g. scraping the 'wall' of ebay.in's Facebook community for all its posts. The following Python code shows how this can be done with the help of BeautifulSoup.

If you look at the source code of the web page, you'll see that every post on the wall is under the tag span class="UIStory_Message". So we have to parse the page to find all the 'span' tags that have the 'class' attribute set to 'UIStory_Message'. The method 'bs.findAll', shown below, does exactly that. We may also want to print the name of each post's author before the post. From the HTML source, you can see that these names are available as text under the tag span class="UIIntentionalStory_Names". This tag comes just before our post tag span class="UIStory_Message". As we already have references to all the post tags, we can find the matching name tag by calling 'findPreviousSibling' on each of them; the name itself is one level deeper, under the 'a' tag. Finally, we call the 'getText' method on that 'a' tag to get the name of the post's author.

If this is confusing, please see the official BeautifulSoup documentation.

#!/usr/bin/env python
# Python 2 script; uses urllib2 and BeautifulSoup 3.
__author__ = "Kasi Viswanadh Yakkala"

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

def ebayin_fb_parse():
    # Download the wall page and parse it.
    frontpage = urlopen("http://www.facebook.com/ebaydotin?v=wall").read()
    bs = BeautifulSoup(frontpage)
    # Every wall post sits in a <span class="UIStory_Message">.
    fbstories = bs.findAll(name='span', attrs={'class': 'UIStory_Message'})
    for s in fbstories:
        # The author's name is in an <a> inside the preceding names span.
        names = s.findPreviousSibling(name='span', attrs={'class': 'UIIntentionalStory_Names'})
        if names is not None and names.a is not None:
            print names.a.getText(), ':'
        try:
            print s.getText()
        except Exception:
            # Fall back to the first text node if the full text cannot be printed.
            print s.find(text=True)

# HOW TO USE: run this file directly as a script.
if __name__ == "__main__":
    ebayin_fb_parse()
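
The lookup used in the script can also be tried without a network connection. The sketch below is not the script above: it assumes Python 3 and the newer 'bs4' package (BeautifulSoup 4), where the same methods are spelled 'find_all', 'find_previous_sibling' and 'get_text'. The HTML snippet, the names in it and the function name 'extract_posts' are made up for illustration; only the two class names come from the post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described in the post:
# a names span immediately before each message span.
SAMPLE_HTML = """
<div>
  <span class="UIIntentionalStory_Names"><a href="#">Alice</a></span>
  <span class="UIStory_Message">Great deals today!</span>
</div>
<div>
  <span class="UIIntentionalStory_Names"><a href="#">Bob</a></span>
  <span class="UIStory_Message">When does the sale end?</span>
</div>
"""


def extract_posts(html):
    """Return a list of (author, message) tuples found in html."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    # Find every message span, then walk back to the names span before it.
    for message in soup.find_all("span", attrs={"class": "UIStory_Message"}):
        names = message.find_previous_sibling(
            "span", attrs={"class": "UIIntentionalStory_Names"}
        )
        # The author's name is the text of the <a> inside the names span.
        if names is not None and names.a is not None:
            author = names.a.get_text()
        else:
            author = None
        posts.append((author, message.get_text()))
    return posts


for author, text in extract_posts(SAMPLE_HTML):
    print(author, ":", text)
```

Returning a list of tuples instead of printing inside the loop keeps the parsing separate from the output, so the same function could be pointed at a downloaded page later.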

3 comments:

Darius said... August 22, 2010 at 8:42 PM

I imagine people will want to know what website scraping software is.
Check out this series of posts dedicated to executives taking charge of projects that entail scraping information from one or more websites.
http://www.fornova.net/blog/?p=4

csharpp said... September 23, 2010 at 2:45 AM

You should try ScrapePro Web Scraper Designer.

Unknown said... July 30, 2012 at 9:46 AM

Hello All,

Web Content Extractor is the most powerful and easy-to-use data extraction software for web scraping and data extraction from websites. Web scraping is a method of pulling information from the seemingly infinite number of locations on the web where it is stored. I really like what you have going here. Lots of information on a lot of subjects that I find interesting. Thank you.
Web Scraping Tool
