Challenge powered by Oxylabs

Parse The Page: prepared by Karolina Šarauskaitė, Python Developer & Squad Lead at Oxylabs.

Try out the role of a software developer: Don’t wait, go ahead and give it a try right now! Karolina from Oxylabs has created a very simple task, designed for beginners. The steps are described in great detail, and you can even see a possible solution. So don’t wait any longer! Two winners will receive a career consultation with a recruiter from Oxylabs!*

Here’s what you need to do:

  1. Read the instructions provided below.
  2. Complete the task following the instructions.
  3. Once you’ve completed the task, take a screenshot of your solution.
  4. Return to this page.
  5. Submit your task solution and register by May 31st. You can find the registration form at the bottom of the page.
  6. On June 4th, two winners will be chosen randomly to receive consultations and announced here on the page.
  7. And remember, the most important thing is to try. You can do it!

Good luck!

*Only those who have registered are included in the competition and have the chance to win a prize.

The objective of the task is to parse a product page. This is a beginner level task.

  1. Total time needed for the task: 40 min. (setup may take a while, but there is an easier option that doesn’t require setting up the program).
  2. Deadline: May 31st.
  3. The prize for 2 winners: a career consultation with a recruiter.

See a sample product page: https://books.toscrape.com/catalogue/dune-dune-1_151/index.html

I have already prepared all the required code to scrape the page. What you need to do is locate the elements on the web page, work out their XPaths, and add them to your code in the places provided.

I am using a Chrome browser myself, but you can use any browser you like.

To see the source of the page, press F12. This should open the developer tools panel on the right of the page, which is what we will be using to locate the elements.

I’ve pressed CTRL + F to open a search bar with the placeholder text “Find by string, selector, or XPath”.

We will be looking for elements by their XPaths.

The path to locate the name of the book is already provided.

I’ve entered .//h1/text() in the search bar and now I am able to detect the element.

Your objective is to locate the elements of price, product description and, if you are up for a bigger challenge, stock availability.
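Before plugging an XPath into the code, it can help to test it offline against a tiny HTML fragment. Here is a minimal sketch using lxml; the fragment below is invented for illustration and only loosely mirrors the sample page’s markup:

```python
from io import StringIO

from lxml import etree

# A tiny, made-up HTML fragment loosely modeled on the sample product page.
html = """
<article class="product_page">
  <h1>Dune (Dune #1)</h1>
  <p class="price_color">£54.86</p>
</article>
"""

tree = etree.parse(StringIO(html), etree.HTMLParser())

# The same XPath used in the walkthrough extracts the book name:
print(tree.xpath(".//h1/text()")[0])  # Dune (Dune #1)

# An attribute predicate selects the price paragraph in this fragment:
print(tree.xpath(".//p[@class='price_color']/text()")[0])  # £54.86
```

Once an expression returns what you expect on a fragment like this, the same expression can be tried in the browser search bar against the real page.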

Here are the elements:

Printed out as a result, here is an example of the output:

Book information:

Name: Dune (Dune #1)

Price: £54.86

Is in stock: True

Description: Set in the far future amidst a sprawling feudal interstellar empire where planetary dynasties are controlled by noble houses that owe an allegiance to the imperial House Corrino, Dune tells the story of young Paul Atreides (the heir apparent to Duke Leto Atreides and heir of House Atreides) as he and his family accept control of the desert planet Arrakis, the only source o Set in the far future amidst a sprawling feudal interstellar empire where planetary dynasties are controlled by noble houses that owe an allegiance to the imperial House Corrino, Dune tells the story of young Paul Atreides (the heir apparent to Duke Leto Atreides and heir of House Atreides) as he and his family accept control of the desert planet Arrakis, the only source of the 'spice' melange, the most important and valuable substance in the cosmos. The story explores the complex, multi-layered interactions of politics, religion, ecology, technology, and human emotion as the forces of the empire confront each other for control of Arrakis.Published in 1965, it won the Hugo Award in 1966 and the inaugural Nebula Award for Best Novel. Dune is frequently cited as the world's best-selling sf novel. ...more

Right now, if you run the code provided, you will receive a result as such: 

Book information:

Name: Dune (Dune #1)

Price: 

Is in stock: 

Description:


Look for the TODO parts in the code – they will guide you on what needs to be done and where. Follow the example of how the name element was parsed.

Resources

For project setup, I suggest installing PyCharm Community and following this guide – https://www.jetbrains.com/help/pycharm/creating-and-running-your-first-python-project.html

After you set up your project environment, you will need to install two additional packages to run the project (the code is provided below) – lxml and requests. You can do this in PyCharm Community by going to File -> Settings -> Project: <your project> -> Python Interpreter and clicking the + button. A pop-up with a list of packages should appear. Search for lxml and requests and install them one by one. Then you should be ready for the next part – running the sample code provided below.

Also, to learn about XPath, which is used throughout this task, I suggest looking at this resource – https://www.w3schools.com/xml/xpath_syntax.asp.
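As a quick, hands-on complement to that syntax reference, the sketch below demonstrates two XPath features that come in handy for this task – selecting an element by attribute and stepping to a following sibling. The HTML fragment is made up for the example:

```python
from io import StringIO

from lxml import etree

# A made-up fragment: a heading container followed by a sibling paragraph.
html = """
<div id="product_description"><h2>Product Description</h2></div>
<p>The description lives in the paragraph after the div.</p>
"""
tree = etree.parse(StringIO(html), etree.HTMLParser())

# [@id='...'] filters elements by an attribute value.
div = tree.xpath(".//div[@id='product_description']")[0]
print(div.tag)  # div

# following-sibling:: moves to the next element at the same nesting level.
text = tree.xpath(
    ".//div[@id='product_description']/following-sibling::p/text()"
)[0]
print(text)  # The description lives in the paragraph after the div.
```

The same two building blocks, combined, are enough to reach every element this task asks for.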

In case of issues setting up the project

It happens, no worries! You can still play around with extracting the data you need for this task without using Python and learn XPath as you go. It can be done simply in your browser instead.

Following the example above, where we extracted the name element, look for the Console tab. It should be located next to the Elements tab, where the page source is shown. Open this tab. You can enter XPaths there, like this:

With .//h1/text() as the XPath we want to test, you can enter a command following this format: $x("<your XPath>")[0].data and it should return the result.

In the example above, the command entered was $x(".//h1/text()")[0].data. It returned 'Dune (Dune #1)' as a result.

Here is the code you may start with when using Python to do this task. Just copy and paste it.

from io import StringIO

import requests
from lxml import etree


def get_lxml_tree(html: str) -> etree.ElementTree:
    parser = etree.HTMLParser()
    tree = etree.parse(
        StringIO(html),
        parser=parser,
    )
    return tree


def scrape(url: str) -> str:
    response = requests.get(url)
    return response.content.decode("utf-8")


def parse_element(tree: etree.ElementTree, xpath: str) -> etree.Element:
    element_list = tree.xpath(xpath)
    return element_list[0] if element_list else None


def parse(tree: etree.ElementTree) -> dict:
    product_page = parse_element(tree, ".//article[@class='product_page']")
    name = parse_element(product_page, ".//h1/text()")
    price = ""  # TODO: Add price XPath here.
    description = ""  # TODO: Add description XPath here.
    is_in_stock = ""  # TODO: Add in stock XPath here & determine if it's True or False.
    return {
        "name": name,
        "price": price,
        "description": description,
        "is_in_stock": is_in_stock,
    }


def print_result(parsed_result: dict) -> None:
    print("Book information:")
    print(f"Name: {parsed_result['name']}")
    print(f"Price: {parsed_result['price']}")
    print(f"Is in stock: {parsed_result['is_in_stock']}")
    print(f"Description: {parsed_result['description']}")


if __name__ == "__main__":
    url_to_scrape = "https://books.toscrape.com/catalogue/dune-dune-1_151/index.html"
    content = scrape(url_to_scrape)
    html_tree = get_lxml_tree(content)
    parsed_content = parse(html_tree)
    print_result(parsed_content)

          

Here is a possible solution to the task. Of course, the XPaths can be written in multiple ways, so there are many valid solutions.

from io import StringIO

import requests
from lxml import etree


def get_lxml_tree(html: str) -> etree.ElementTree:
    parser = etree.HTMLParser()
    tree = etree.parse(
        StringIO(html),
        parser=parser,
    )
    return tree


def scrape(url: str) -> str:
    response = requests.get(url)
    return response.content.decode("utf-8")


def parse_element(tree: etree.ElementTree, xpath: str) -> etree.Element:
    element_list = tree.xpath(xpath)
    return element_list[0] if element_list else None


def parse(tree: etree.ElementTree) -> dict:
    product_page = parse_element(tree, ".//article[@class='product_page']")
    name = parse_element(product_page, ".//h1/text()")
    price = parse_element(
        product_page,
        ".//div[contains(@class, 'product_main')]/p[@class='price_color']/text()",
    )
    description = parse_element(
        product_page, ".//div[@id='product_description']/following-sibling::p/text()"
    )
    is_in_stock = (
        parse_element(
            product_page, ".//p[contains(@class,'instock')]/i[@class='icon-ok']"
        )
        is not None
    )
    return {
        "name": name,
        "price": price,
        "description": description,
        "is_in_stock": is_in_stock,
    }


def print_result(parsed_result: dict) -> None:
    print("Book information:")
    print(f"Name: {parsed_result['name']}")
    print(f"Price: {parsed_result['price']}")
    print(f"Is in stock: {parsed_result['is_in_stock']}")
    print(f"Description: {parsed_result['description']}")


if __name__ == "__main__":
    url_to_scrape = "https://books.toscrape.com/catalogue/dune-dune-1_151/index.html"
    content = scrape(url_to_scrape)
    html_tree = get_lxml_tree(content)
    parsed_content = parse(html_tree)
    print_result(parsed_content)

Tried the challenge from Oxylabs? Great! Congratulations on completing it! Please take a screenshot of your solution, upload it here, and sign up.