Scraping Website Data That Requires You to Log In

By | July 6, 2016

There are many ways to scrape data from a website, but have you thought about how to scrape a website that requires you to log in first? It's the same story: there are several approaches, but we will use the easiest one, the Selenium WebDriver API for .NET in combination with ChromeDriver.

Logging in to a website programmatically is difficult and tightly coupled to how the site implements its login procedure. That means you either write incredibly complicated, convoluted, and perhaps clunky code just to keep up with a site's ever-changing login flow, or you log in manually, intercept the values you need, and plug them into your request objects. Better yet, you can let Selenium WebDriver do everything for you: all you need to do is script the actions you would perform yourself, such as clicking, submitting forms, and reading from (or writing to) elements located by divs, ids, or classes. You are basically creating a bot that imitates the physical actions you would perform on the site.


Selenium can handle all of these actions; just use them as a reference when writing your code.

So, the million-dollar question: how do I use it? Well, it's quite easy, and I will demonstrate the procedure step by step below.

First, download the Selenium WebDriver API for .NET using NuGet.

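If you use the Package Manager Console, the install is a one-liner (Selenium.WebDriver is the package id as published on NuGet):

```powershell
Install-Package Selenium.WebDriver
```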

Then you also need to download ChromeDriver, as we will be using Chrome in this exercise.


Let's start!

You can use any project type you want, but for this example we will use a C# console application.

Let's code a simple login process to begin with.

using System;
using OpenQA.Selenium.Chrome;
 
namespace Site_Scraper
{
    class Program
    {
        private static void Main(string[] args)
        {
            // ChromeDriver takes the folder that contains chromedriver.exe --
            // here we keep it in the application directory itself
            var chromeDriver = new ChromeDriver(AppDomain.CurrentDomain.BaseDirectory)
            {
                Url = @"https://www.yoursite.com/login"
            };
 
            chromeDriver.Navigate();
 
            // Fill in the username and password textboxes
            var e = chromeDriver.FindElementById("user_session_email");
            e.SendKeys("yourusername@yoursite.com");
            e = chromeDriver.FindElementById("user_session_password");
            e.SendKeys("Fr34k!n6H4rdP455w0rd");
 
            // Find the login button and click it
            e = chromeDriver.FindElementByXPath(@"/html/body/div[2]/div/div/div/div/div/div/div/div/form/p/input");
            e.Click();
        }
    }
}

You can see from the code that you instantiate the ChromeDriver with the folder location where you saved "chromedriver.exe". You can hardcode a full path if you wish, but it is more manageable to put the driver in the application directory itself. From there you just set which URL is the login page, then navigate.

Once it's on the page, you can find the textboxes where the username and password go; you can search by Id, Class, CSS selector, Text, Name, etc.

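Which locator works best depends entirely on the target site's markup, so inspect the page (F12 in Chrome) first. As a sketch, here are a few equivalent ways to find the same username textbox; the id comes from the example above, while the name attribute and CSS selector are assumptions about the page's HTML:

```csharp
// The id "user_session_email" is from the example site; the name attribute
// and CSS selector below are hypothetical -- substitute your site's markup.
var byId    = chromeDriver.FindElementById("user_session_email");
var byName  = chromeDriver.FindElementByName("user_session[email]");
var byCss   = chromeDriver.FindElementByCssSelector("input#user_session_email");
var byXPath = chromeDriver.FindElementByXPath("//input[@id='user_session_email']");
```

Prefer ids or names over long absolute XPaths where you can: they are far less likely to break when the site's layout changes.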

Send the values you want to type by using SendKeys, then do the same for the password. From there you just need to find the button to click. In this example we used FindElementByXPath, but if the site expects a form submit you can also use the Submit function without even finding the submit button:

e.Submit();

Simple, isn't it?
Now let's add more options, which is especially important when you want to scrape files as well. Usually when a browser downloads a file it prompts you with a Windows dialog that you cannot control from your code. Luckily, with ChromeDriver you can override Chrome preferences, such as preventing the confirmation dialog from popping up, meaning files download automatically.

Let's update our code.

using System;
using System.Configuration;
using OpenQA.Selenium.Chrome;
 
namespace Site_Scraper
{
    class Program
    {
        private static void Main(string[] args)
        {
            // These preferences let us download items such as documents, PDFs, etc.
            // from the website without any prompts
            var chromeOptions = new ChromeOptions();
            chromeOptions.AddUserProfilePreference("download.default_directory", ConfigurationManager.AppSettings["DefaultDownloadLocation"]);
            chromeOptions.AddUserProfilePreference("intl.accept_languages", "nl");
            chromeOptions.AddUserProfilePreference("disable-popup-blocking", "true");
            chromeOptions.AddUserProfilePreference("download.prompt_for_download", false);
 
            var chromeDriver = new ChromeDriver(AppDomain.CurrentDomain.BaseDirectory, chromeOptions)
            {
                Url = @"https://www.yoursite.com/login"
            };
 
            chromeDriver.Navigate();
 
            var e = chromeDriver.FindElementById("user_session_email");
            e.SendKeys("yourusername@yoursite.com");
            e = chromeDriver.FindElementById("user_session_password");
            e.SendKeys("Fr34k!n6H4rdP455w0rd");
 
            e.Submit();
        }
    }
}

As you can see, we are now setting the download.default_directory, disable-popup-blocking and download.prompt_for_download preferences so our scrape will continue without any hassle. Once that is set up, just pass the options as an extra parameter when you instantiate your ChromeDriver.
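The download directory is read from App.config via ConfigurationManager.AppSettings. A minimal sketch of that config entry follows; the key name matches the code above, while the path value is just an example:

```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <appSettings>
    <!-- Folder where Chrome will drop downloaded files (example path) -->
    <add key="DefaultDownloadLocation" value="C:\Scraper\Downloads" />
  </appSettings>
</configuration>
```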

From here it will be a breeze: now that you are logged in, you can navigate to any page and scrape data off it. Let me give you an example of that.

Let's say you are on a URL where a table named "tbl_files" exists, and in that table there are rows where the file's "href" is in the first cell. All you have to do is iterate over each row and read that cell; you can even filter what you want to download, as in the example below.

IWebElement filesTable = chromeDriver.FindElementById("tbl_files");
ReadOnlyCollection<IWebElement> attachmentRows = filesTable.FindElements(By.TagName("tr"));
 
foreach (IWebElement row in attachmentRows)
{
    ReadOnlyCollection<IWebElement> cells = row.FindElements(By.TagName("td"));

    // Skip header rows (they contain th cells, not td cells)
    if (cells.Count == 0)
        continue;

    var fileCell = cells[0];
    var fileName = fileCell.Text;

    var fileUrl = fileCell.FindElement(By.TagName("a")).GetAttribute("href");
 
    // Only download the file types we are interested in
    if (fileName.EndsWith(".doc") || fileName.EndsWith(".docx") || fileName.EndsWith(".pdf") ||
        fileName.EndsWith(".jpg") || fileName.EndsWith(".rtf") || fileName.EndsWith(".txt"))
    {
        // Navigating to the file URL triggers the download
        // (no prompt, thanks to our Chrome preferences)
        chromeDriver.Url = fileUrl;
        chromeDriver.Navigate();
    }
}
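As the list of extensions grows, that if-chain gets unwieldy. A minimal sketch of pulling the check into a helper (the class and method names here are my own, not from the article):

```csharp
using System;
using System.Linq;

static class FileFilter
{
    // Extensions we want to download -- the same list as the example above
    private static readonly string[] AllowedExtensions =
        { ".doc", ".docx", ".pdf", ".jpg", ".rtf", ".txt" };

    // Returns true when the file name ends with one of the allowed
    // extensions, ignoring case so "Report.PDF" also matches
    public static bool IsDownloadable(string fileName) =>
        AllowedExtensions.Any(ext =>
            fileName.EndsWith(ext, StringComparison.OrdinalIgnoreCase));
}
```

Inside the loop, the filter then becomes a single readable call: `if (FileFilter.IsDownloadable(fileName)) { ... }`.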

See, it's quite easy and fun! Once you get the hang of it, you will be scraping data off various websites whenever you can't do it via an API.

