Sunday, 30 November 2014

Why scraping and why TheWebMiner?

If you read this blog, you are probably one of two things: either you are interested in web scraping and have studied this domain for quite a while, or you are just curious about this relatively new field and want to know what it is, how it is done and, especially, why. Either way, that’s fine!

In case you haven’t googled it already, I can tell you that data extraction (or scraping) is a technique in which a computer program extracts data from human-readable output coming from another program (Wikipedia). Basically, it can collect all the information on a certain subject from certain places. It is something like the equivalent of Ctrl+F, at the scale of the whole internet. It is nothing like the search engines we currently use, because it can deliver the data in a particular file format, such as Excel, CSV (comma-separated values) or any other format the buyer wants, and it extracts only the relevant data, only the values you are interested in.

Now that you understand the concept, you are probably wondering why you would need such data. Let’s take the example of an online store, pretty common nowadays. The manager, like any manager, wants his business to thrive, so he has to keep up with the other online stores. This is where web scraping comes in: it is very useful for him to have the competitors’ prices for certain products, if not all of them, saved as Excel files. With these he can maintain a fair pricing policy and stay ahead of his competitors by knowing all of their prices and fluctuations. Of course, the data collection could also be done manually, but that is not practical when we are talking about thousands of products, each with its own page, and so on. This is only one example of a situation in which scraping is useful, but there are hundreds, and each of them can be profitable for a company.
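
To make this concrete, here is a minimal sketch of what such a price-monitoring scraper could look like in C#. The product URLs and the price selector are hypothetical placeholders, and the HtmlAgilityPack parser is used only as one possible implementation; this is not TheWebMiner’s actual tooling, just an illustration of the kind of deliverable (a CSV that opens in Excel) described above.

using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;

class PriceMonitor
{
    static void Main()
    {
        // Hypothetical competitor product pages to track (placeholders, not real URLs).
        var productUrls = new List<string>
        {
            "http://competitor.example/product/1",
            "http://competitor.example/product/2"
        };

        var web = new HtmlWeb();
        var lines = new List<string> { "url,price" }; // CSV header row

        foreach (var url in productUrls)
        {
            // Download and parse the product page.
            var doc = web.Load(url);

            // Assumed markup: the price sits in an element with class "price".
            var priceNode = doc.DocumentNode.SelectSingleNode("//*[@class='price']");
            var price = priceNode != null ? priceNode.InnerText.Trim() : "n/a";

            lines.Add(url + "," + price);
        }

        // Save the result as a CSV file that opens directly in Excel.
        File.WriteAllLines("competitor_prices.csv", lines);
    }
}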

By now I’ve talked about what it is and why you should be interested in it; from now on I’m going to explain why you should use thewebminer.com. First of all, it’s easy: you only have to specify what type of data you want and from where, and we’ll manage the rest. Throughout the project you will receive first a price estimate, followed by a time estimate. You will be in contact with us the whole time, so you can find out the state of your project at any point. The pricing policy is reasonable and depends on factors like the project’s size and complexity. For very big projects a discount may be applicable, so the total cost stays within reason.

I believe that thewebminer.com can handle any kind of situation or requirement from users all over the world, and to convince you, free samples are available for any project you may have, whatever your uncertainty or doubt.

Source:http://thewebminer.com/blog/2013/07/

Wednesday, 26 November 2014

Web Scraping Tools for Non-developers

I recently spoke with a resource-limited organization that is investigating government corruption and wants to access various public datasets to monitor politicians and law firms. They don’t have developers in-house, but feel pretty comfortable analyzing datasets in CSV form. While many public datasources are available in structured form, some sources are hidden in what us data folks call the deep web. Amazon is a nice example of a deep website, where you have to enter text into a search box, click on a few buttons to narrow down your results, and finally access relatively structured data (prices, model numbers, etc.) embedded in HTML. Amazon has a structured database of their products somewhere, but all you get to see is a bunch of webpages trapped behind some forms.

A developer usually isn’t hindered by the deep web. If we want the data on a webpage, we can automate form submissions and key presses, and we can parse some ugly HTML before emitting reasonably structured CSVs or JSON. But what can one accomplish without writing code?
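
As a point of reference, here is a rough sketch of that "with code" route, using Selenium WebDriver (the same library that appears in the C# example later on this blog). The URL, the search field name and the result selector are hypothetical placeholders, not a real site.

using System.IO;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class DeepWebSketch
{
    static void Main()
    {
        using (var driver = new ChromeDriver())
        {
            // Step through the search form that hides the data.
            driver.Navigate().GoToUrl("http://deep-web-site.example/search");
            var searchBox = driver.FindElement(By.Name("q"));
            searchBox.SendKeys("acme widget");
            searchBox.Submit();

            // Parse the "ugly HTML" into something reasonably structured.
            var rows = driver.FindElements(By.XPath("//div[@class='result']"))
                             .Select(r => r.Text.Replace(",", " ")) // naive CSV escaping
                             .ToList();

            File.WriteAllLines("results.csv", rows);
        }
    }
}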

This turns out to be a hard problem. Lots of companies have tried, to varying degrees of success, to build a programmer-free interface for structured web data extraction. I had the pleasure of working on one such project, called Needlebase at ITA before Google acquired it and closed things down. David Huynh, my wonderful colleague from grad school, prototyped a tool called Sifter that did most of what one would need, but like all good research from 2006, the lasting impact is his paper rather than his software artifact.

Below, I’ve compiled a list of some available tools. The list comes from memory, the advice of some friends that have done this before, and, most productively, a question on Twitter that Hilary Mason was nice enough to retweet.

The bad news is that none of the tools I tested would work out of the box for the specific use case I was testing. To understand why, I’ll break down the steps required for a working web scraper, and then use those steps to explain where various solutions broke down.

The anatomy of a web scraper

There are three steps to a structured extraction pipeline:

    Authenticate yourself. This might require logging in to a website or filling out a CAPTCHA to prove you’re not…a web scraper. Because the source I wanted to scrape required filling out a CAPTCHA, all of the automated tools I’ll review below failed step 1. It suggests that as a low bar, good scrapers should facilitate a human in the loop: automate the things machines are good at automating, and fall back to a human to perform authentication tasks the machines can’t do on their own.

    Navigate to the pages with the data. This might require entering some text into a search box (e.g., searching for a product on Amazon), or it might require clicking “next” through all of the pages that results are split over (often called pagination). Some of the tools I looked at allowed entering text into search boxes, but none of them correctly handled pagination across multiple pages of results.

    Extract the data. On any page you’d like to extract content from, the scraper has to help you identify the data you’d like to extract. The cleanest example of this that I’ve seen is captured in a video for one of the tools below: the interface lets you click on some text you want to pluck out of a website, asks you to label it, and then allows you to correct mistakes as it learns how to extract the other examples on the page.

As you’ll see in a moment, the steps at the top of this list are hardest to automate.
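
For comparison, here is roughly what steps 2 and 3 look like when a developer scripts them by hand with Selenium WebDriver; the listing URL, the CSS selector and the "next" link text are assumptions about a hypothetical paginated site, not any of the tools reviewed below.

using System.Collections.Generic;
using System.IO;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class PaginationSketch
{
    static void Main()
    {
        var extracted = new List<string>();

        using (var driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://paginated-site.example/listings");

            while (true)
            {
                // Step 3: extract the fields we care about on the current page.
                foreach (var title in driver.FindElements(By.CssSelector(".listing .title")))
                {
                    extracted.Add(title.Text);
                }

                // Step 2: click "next" until there are no more pages.
                var nextLinks = driver.FindElements(By.LinkText("next"));
                if (nextLinks.Count == 0)
                {
                    break;
                }
                nextLinks[0].Click();
            }
        }

        File.WriteAllLines("titles.txt", extracted);
    }
}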

What are the tools?

Here are some of the tools that came highly recommended, and my experience with them. None of those passed the CAPTCHA test, so I’ll focus on their handling of navigation and extraction.

    Web Scraper is a Chrome plugin that allows you to build navigable site maps and extract elements from those site maps. It would have done everything necessary in this scenario, except the source I was trying to scrape captured click events on links (I KNOW!), which tripped things up. You should give it a shot if you’d like to scrape a simpler site, and the YouTube video that comes with it helps get around the slightly confusing user interface.

    import.io looks like a clean webpage-to-API story. The service views any webpage as a potential data source to generate an API from. If the page you’re looking at has been scraped before, you can access an API or download some of its data. If the page hasn’t been processed before, import.io walks you through the process of building connectors (for navigation) or extractors (to pull out the data) for the site. Once at the page with the data you want, you can annotate a screenshot of the page with the fields you’d like to extract. After you submit your request, it appears to get queued for extraction. I’m still waiting for the data 24 hours after submitting a request, so I can’t vouch for the quality, but the delay suggests that import.io uses crowd workers to turn your instructions into some sort of semi-automated extraction process, which likely helps improve extraction quality. The site I tried to scrape requires an arcane combination of javascript/POST requests that threw import.io’s connectors for a loop, and ultimately made it impossible to tell import.io how to navigate the site. Despite the complications, import.io seems like one of the more polished website-to-data efforts on this list.

    Kimono was one of the most popular suggestions I got, and is quite polished. After installing the Kimono bookmarklet in your browser, you can select elements of the page you wish to extract, and provide some positive/negative examples to train the extractor. This means that unlike import.io, you don’t have to wait to get access to the extracted data. After labeling the data, you can quickly export it as CSV/JSON/a web endpoint. The tool worked seamlessly to extract a feed from the Hackernews front page, but I’d imagine that failures in the automated approach would make me wish I had access to import.io’s crowd workers. The tool would be high on my list except that navigation/pagination is coming soon, and will ultimately cost money.

    Dapper, which is now owned by Yahoo!, provides about the same level of scraping capabilities as Kimono. You can extract content, but like Kimono it’s unclear how to navigate/paginate.

    Google Docs was an unexpected contender. If the data you’re extracting is in an HTML table/RSS Feed/CSV file/XML document on a single webpage with no navigation/authentication, you can use one of the Import* functions in Google Docs. The IMPORTHTML macro worked as advertised in a quick test.

    iMacros is a tool that I could imagine solves all of the tasks I wanted, but costs more than I was willing to pay to write this blog post. Interestingly, the free version handles the steps that the other tools on this list don’t do as well: navigation. Through your browser, iMacros lets you automate filling out forms, clicking on “next” links, etc. To perform extraction, you have to pay at least $495.

    A friend has used Screen-scraper in the past with good outcomes. It handles navigation as well as extraction, but costs money and requires a small amount of programming/tokenization skills.

    Winautomation seems cool, but it’s only available for Windows, which was a dead end for me.

So that’s it? Nothing works?

Not quite. None of these tools solved the problem I had on a very challenging website: the site clearly didn’t want to be crawled given the CAPTCHA, and the javascript-submitted POST requests threw most of the tools that expected navigation through links for a loop. Still, most of the tools I reviewed have snazzy demos, and I was able to use some of them for extracting content from sites that were less challenging than the one I initially intended to scrape.

All hope is not lost, however. Where pure automation fails, a human can step in. Several proposals suggested paying people on oDesk, Mechanical Turk, or CrowdFlower to extract the content with a human touch. This would certainly get us past the CAPTCHA and hard-to-automate navigation. It might get pretty expensive to have humans copy/paste the data for extraction, however. Given that the tools above are good at extracting content from any single page, I suspect there’s room for a human-in-the-loop scraping tool to steal the show: humans can navigate and train the extraction step, and the machine can perform the extraction. I suspect that’s what import.io is up to, and I’m hopeful they keep the tool available to folks like the ones I initially tried to help.
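
A minimal sketch of that division of labor, again with Selenium WebDriver: the program opens a real browser, a person handles the CAPTCHA, login and navigation, and the machine then takes over for the extraction step. The URL and the table selector are hypothetical placeholders.

using System;
using System.IO;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

class HumanInTheLoopSketch
{
    static void Main()
    {
        using (var driver = new ChromeDriver())
        {
            driver.Navigate().GoToUrl("http://hard-to-scrape.example/login");

            // Hand control to the human for the parts machines are bad at.
            Console.WriteLine("Solve the CAPTCHA, log in and navigate to the data,");
            Console.WriteLine("then press Enter to let the scraper take over...");
            Console.ReadLine();

            // The machine does the part it is good at: repetitive extraction.
            var records = driver.FindElements(By.CssSelector("table.data tr"))
                                .Select(row => row.Text)
                                .ToList();

            File.WriteAllLines("extracted.txt", records);
        }
    }
}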

While we’re on the topic of human-powered solutions, it might make sense to hire a developer on oDesk to just implement the scraper for the site this organization was looking at. While a lot of the developer-free tools I mentioned above look promising, there are clearly cases where paying someone for a few hours of script-building just makes sense.

Source: http://blog.marcua.net/post/74655674340

Wednesday, 19 November 2014

NHL ending dry scraping of ice before overtime

TORONTO (AP) — The NHL will no longer dry scrape the ice before overtime.

Instituted this season in an effort to reduce the number of shootouts, the dry scraping will stop after Friday's games.

The general managers decided at their meeting Tuesday to make the change after the league talked to the players' union the past few days.

Beginning Saturday, ice crews around the league will again shovel the ice after regulation as they did in previous years. The GMs said the dry scrape was causing too much of a delay. Director of hockey operations Colin Campbell said the delays were lasting from more than four minutes to almost seven.

The dry scrape initially had been approved in hopes of reducing shootouts by improving scoring chances without unduly slowing play by recoating the ice.

The GMs also discussed expanded video review, including goaltender interference, and the possibility of three-on-three overtime. The American Hockey League is experimenting with the three-on-three format this season.

This annual meeting the day after the Hockey Hall of Fame induction usually doesn't produce actual changes, with the dry scrape providing an exception.

The main purpose is to set up the March meeting in Boca Raton, Florida, where these items will be further addressed.

Source:http://missoulian.com/sports/hockey/nhl-ending-dry-scraping-of-ice-before-overtime/article_3dd5473c-6102-5800-99f7-2c98be0f99ad.html

Saturday, 15 November 2014

Is Web Scraping Legal?

Web scraping might be one of the best ways to aggregate content from across the internet, but it comes with a caveat: It’s also one of the hardest tools to parse from a legal standpoint.

For the uninitiated, web scraping is a process whereby an automated piece of software extracts data from a website by “scraping” through the site’s many pages. While search engines like Google and Bing do a similar task when they index web pages, scraping engines take the process a step further and convert the information into a format which can be easily transferred over to a database or spreadsheet.

It’s also important to note that a web scraper is not the same as an API. While a company might provide an API to allow other systems to interact with its data, the quality and quantity of data available through APIs is typically lower than what is made available through web scraping. In addition, web scrapers provide more up-to-date information than APIs and are much easier to customize from a structural standpoint.

The applications of this “scraped” information are widespread. A journalist like Nate Silver might use scrapers to monitor baseball statistics and create numerical evidence for a new sports story he’s working on. Similarly, an eCommerce business might bulk scrape product titles, prices, and SKUs from other sites in order to further analyze them.

While web scraping is an undoubtedly powerful tool, it’s still undergoing growing pains when it comes to legal matters. Because the scraping process appropriates pre-existing content from across the web, there are all kinds of ethical and legal quandaries confronting businesses that hope to leverage scrapers for their own processes.

In this “wild west” environment, where the legal implications of web scraping are in a constant state of flux, it helps to get a sense of where the legal needle currently falls. The following timeline outlines some of the biggest cases involving web scrapers in the United States and gives us a better understanding of the precedents set by the courts’ rulings.

Terms of Use Tug-of-War—2000-2009

For years after they first came into use, web scrapers went largely unchallenged from a legal standpoint. In 2000, however, the use of scrapers came under heavy and consistent fire when eBay fired the first shot against an auction data aggregator called Bidder’s Edge. In this very early case, eBay argued that Bidder’s Edge was using scrapers in a way that violated Trespass to Chattels doctrine. While the lawsuit was settled out of court, the judge upheld eBay’s original injunction, stating that heavy bot traffic could very well disrupt eBay’s service.

Then in 2003’s Intel Corp. v. Hamidi, the California Supreme Court overturned the basis of eBay v. Bidder’s Edge, ruling that Trespass to Chattels could not extend to the context of computers if no actual damage to personal property occurred.

So in terms of legal action against web scraping, Trespass to Chattels no longer applied, and things were back to square one. This began a period in which the courts consistently rejected Terms of Service as a valid means of prohibiting scrapers, including in cases like Perfect 10 v. Google and Cvent v. Eventbrite.

The Takeaway: The earliest cases against scrapers hinged on Trespass to Chattels law, and were successful. However, that doctrine is no longer a valid approach.

2009—Facebook Steps In

In 2009, Facebook turned the tides of the web scraping war when Power.com, a site which aggregated multiple social networks into one centralized site, included Facebook in their service. Because Power.com was scraping Facebook’s content instead of adhering to their established standards, Facebook sued Power on grounds of copyright infringement.

In denying Power.com’s motion to dismiss the case, the Judge ruled that scraping can constitute copying, however momentary that copying may be. And because Facebook’s Terms of Service don’t allow for scraping, that act of copying constituted an infringement on Facebook’s copyright. With this decision, the waters regarding the legality of web scrapers began to shift in favor of the content creators.

The Takeaway: Even if a web scraper ignores infringing content on its way to freely-usable content, it might qualify as copyright infringement by virtue of having technically “copied” the infringing content first.

2011-2014— U.S. v Auernheimer

In 2010, hacker Andrew “Weev” Auernheimer found a security flaw in AT&T’s website, which would display the email addresses of users who visited the site via their iPads. By exploiting the flaw using some simple scripts and a scraper, Auernheimer was able to gather thousands of emails from the AT&T site.

Although these email addresses were publicly available, Auernheimer’s exploit led to his 2012 conviction on charges of identity fraud and conspiracy to access a computer without authorization.

Earlier this year, the court vacated Auernheimer’s conviction, ruling that the trial’s New Jersey venue was improper. But even though the case turned out to be mostly inconclusive, the court noted that there was no evidence that “any password gate or code-based barrier was breached.” This seems to leave room for the web scraping of publicly-available personal information, although the question is still very much open to interpretation and not set in stone.

The Takeaway: Using a web scraper to aggregate sensitive personal information can lead to a conviction, even if that information was technically available to the public. While there is hope in the court’s observation that no passwords or barriers were broken to retrieve this information, the waters here are still very volatile.

2013—Associated Press vs. Meltwater

Meltwater is a software company whose “Global Media Monitoring” product uses scrapers to aggregate news stories for paying clients. The Associated Press took issue with Meltwater’s scraping of their original stories, some of which had been copyrighted. In 2012, AP filed suit against Meltwater for copyright infringement and hot news misappropriation.

While it’s already been established that facts cannot be copyrighted, the court decided that the AP’s copyrighted articles—and more specifically, the way in which the facts within those articles were arranged—were not fair game for copying. On top of this, Meltwater’s use of the articles failed to meet the established fair use standards, and could not be defended on that front either.

The Takeaway: Fair use is limited when it comes to web scrapers, and copyrighted content is not always open to be scraped.

~~

By closely observing the outcomes of previous rulings, you’ll find that there are a few guidelines that a scraper should attempt to adhere to:

    Content being scraped is not copyright protected
    The act of scraping does not burden the services of the site being scraped
    The scraper does not violate the Terms of Use of the site being scraped
    The scraper does not gather sensitive user information
    The scraped content adheres to fair use standards

While all of these guidelines are important to understand before using scrapers, there are other ways to acclimate to the legal nuances. In many cases, you’ll find that a simple conversation with a business software developer or consultant will lead to some satisfying conclusions: Odds are, they’ve used scrapers in the past and can shed light on any snags they’ve hit in the process. And of course, talking with a lawyer is always an ideal course of action when treading into questionable legal territory.

Source:http://blog.icreon.us/2014/09/12/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/

Monday, 10 November 2014

Example of Scraping with Selenium WebDriver in C#

In this article I will show you how easy it is to scrape a web site using Selenium WebDriver. I will guide you through a sample project, written in C#, which uses WebDriver in conjunction with the Chrome browser to log in to the testing page and scrape the text from the private area of the website.

Downloading the WebDriver

First of all we need to get the latest version of Selenium Client & WebDriver Language Bindings and the Chrome Driver. Of course, you can download WebDriver bindings for any language (Java, C#, Python, Ruby), but within the scope of this sample project I will use the C# binding only. In the same manner, you can use any browser driver, but here I will use Chrome.

After downloading the libraries and the browser driver, we need to include them in our Visual Studio solution:

(Figure: VS Solution)

Creating the scraping program

In order to use the WebDriver in our program we need to add its namespaces:

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

Then, in the main function, we need to initialize the Chrome Driver:

using (var driver = new ChromeDriver())
{

This piece of code searches for the chromedriver.exe file. If that file is located in a directory different from the one where our program is executed, we need to specify its path explicitly in the ChromeDriver constructor. All of the code that follows runs inside this using block, so its closing brace appears at the end of the program (see the complete listing below).

When an instance of ChromeDriver is created, a new Chrome browser will be started. Now we can control this browser via the driver variable. Let’s navigate to the target URL first:

driver.Navigate().GoToUrl("http://testing-ground.scraping.pro/login");

Then we can find the web page elements needed for us to login in the private area of the website:

var userNameField = driver.FindElementById("usr");
var userPasswordField = driver.FindElementById("pwd");
var loginButton = driver.FindElementByXPath("//input[@value='Login']");

Here we search for the user name and password fields and the login button and put them into the corresponding variables. After we have found them, we can type in the user name and the password and press the login button:

userNameField.SendKeys("admin");
userPasswordField.SendKeys("12345");
loginButton.Click();

At this point the new page will be loaded into the browser, and once it’s done we can scrape the text we need and save it into a file:

var result = driver.FindElementByXPath("//div[@id='case_login']/h3").Text;

File.WriteAllText("result.txt", result);

That’s it! At the end, I’d like to give you a bonus – saving a screenshot of the current page into a file (note that the ImageFormat class used below comes from the System.Drawing.Imaging namespace, which is included in the complete listing):

driver.GetScreenshot().SaveAsFile(@"screen.png", ImageFormat.Png);

The complete program listing

using System.Drawing.Imaging; // required for ImageFormat.Png
using System.IO;
using System.Text;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

namespace WebDriverTest
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize the Chrome Driver
            using (var driver = new ChromeDriver())
            {
                // Go to the home page
                driver.Navigate().GoToUrl("http://testing-ground.scraping.pro/login");

                // Get the page elements
                var userNameField = driver.FindElementById("usr");
                var userPasswordField = driver.FindElementById("pwd");
                var loginButton = driver.FindElementByXPath("//input[@value='Login']");

                // Type user name and password
                userNameField.SendKeys("admin");
                userPasswordField.SendKeys("12345");

                // and click the login button
                loginButton.Click();

                // Extract the text and save it into result.txt
                var result = driver.FindElementByXPath("//div[@id='case_login']/h3").Text;
                File.WriteAllText("result.txt", result);

                // Take a screenshot and save it into screen.png
                driver.GetScreenshot().SaveAsFile(@"screen.png", ImageFormat.Png);
            }
        }
    }
}

You can also download a ready-made project here.

Conclusion

I hope you are impressed with how easy it is to scrape web pages using the WebDriver. You can press keys and click buttons naturally, just as you would when working with the browser yourself. You don’t even need to understand what kind of HTTP requests are sent and what cookies are stored; the browser does all of this for you. This makes the WebDriver a wonderful tool in the hands of a web scraping specialist.

Source:http://scraping.pro/example-of-scraping-with-selenium-webdriver-in-csharp/

Wednesday, 5 November 2014

Application of Web Data Mining in CRM

Customer relationship management (CRM) can be described as the process of improving customer relations and interactions and making them more amicable. Since web data mining applies various modeling and data analysis methods to detect patterns and relationships in data, it can be used as an effective tool in CRM. By using web data mining effectively, you are able to understand what your customers want.

It is important to note that web data mining can be used effectively to find the right potential customers and offer them the right products at the right time. The result of this, for any business, is an increase in the revenue generated, because you are able to respond to each customer in an effective and efficient way. The method also requires very few resources and can therefore be considered an economical one.

In the next paragraphs we discuss the basic process of customer relationship management and its integration with a web data mining service. The following are the basic steps for understanding what your customers need, sending them the right offers and products, and reducing the resources spent on managing your customers.

Defining the business objective. Web data mining can be used to define your business objective and to communicate it to your customers. Through research you can determine whether your business objective is communicated well to your customers and clients. Does your business objective take the customers’ interests into account? Your business goal must be clearly outlined in your CRM. Having a precise, well-defined goal is the surest way of ensuring success in customer relationship management.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/application-web-data-mining-crm/