
3. Start scraping

Javinator9889 edited this page Aug 12, 2018 · 3 revisions

Every time we want to use pyGle, we first have to import the class:

from pyGle import PyGle  # Importing the class

There are several ways to initialize a PyGle object:

  1. Enabling history: every search we run from then on is saved in a private data structure, so you can recover it later and work with the results.
pSearch = PyGle(enable_history=True)
  2. Using session cookies: as a faster way to navigate through Google, you can enable cookies so that some pages do not need to be loaded.
pSearch = PyGle(use_session_cookies=True)
  3. Defining a query for searching: with this option, you can run a search directly without defining anything else.
# Just defining the query
pSearch = PyGle(query="your_query")

# Running a search directly
search_results = PyGle(query="your_query").doSearch()  # This will return a Future object

Of course, you can combine the three options mentioned above.

  4. Defining nothing: using the default constructor, so there will be no history and no session cookies.
pSearch = PyGle()

Defining a query is always necessary. The query is the set of words you would like to search for on Google. Once you have a PyGle object initialized, you can define (or redefine) the query with:

pSearch.withQuery("your_query")

As mentioned in README.md, all option methods start with the prefix with, and when constructing the object and defining options you can chain the calls to define everything at once. For example, you can do:

# Directly at the constructor
def demonstrateFunctionA(history_enabled: bool = False, cookies_enabled: bool = False):
    pSearch = (
        PyGle(enable_history=history_enabled, use_session_cookies=cookies_enabled)
        .withQuery("query")
        .withTextInTitle("text_in_title")
        .withContainingTwoTerms("term 1", "term 2")
    )
    # Do stuff

# Once the PyGle object is initialized
def demonstrateFunctionB(history_enabled: bool = False, cookies_enabled: bool = False):
    pSearch = PyGle(enable_history=history_enabled, use_session_cookies=cookies_enabled)
    pSearch.withQuery("query").withTextInTitle("text_in_title").withContainingTwoTerms("term 1", "term 2")
    # Do stuff
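
Chaining like this usually works because each with... method returns the object it was called on. The following is only a minimal sketch of that pattern, not pyGle's actual implementation: the QueryBuilder class and its options dictionary are hypothetical names introduced for illustration, and only the method names mirror the ones used above.

```python
class QueryBuilder:
    """Hypothetical stand-in (not pyGle) showing chainable with* methods."""

    def __init__(self, query: str = "") -> None:
        # Every option set through a with* method is stored here
        self.options = {"query": query}

    def withQuery(self, query: str) -> "QueryBuilder":
        self.options["query"] = query
        return self  # returning self is what makes chaining possible

    def withTextInTitle(self, text: str) -> "QueryBuilder":
        self.options["text_in_title"] = text
        return self


builder = QueryBuilder()
# Each call returns the same object, so the calls can be written in one line
same_object = builder.withQuery("query").withTextInTitle("text_in_title")
print(same_object is builder)
print(builder.options)
```

Because every call hands back the same object, it makes no difference whether the options are chained at construction time or added later, one call at a time.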

Finally, each time we want to perform a search, we must call the doSearch() method, which identifies which Google page we are using (Google Images, Google News, Google Search, etc.) and gets the results automatically.

This method also includes an option for torifying the search. As explained in the corresponding section of README.md, torifying routes our traffic through a proxy so that others cannot see what we have searched.

To run the search, we do:

search_results = pSearch.doSearch()

# Using Tor
search_results = pSearch.doSearch(torify=True)

To handle the results, we must know that the object doSearch() returns is a Future, which means that it will hold the search results once they are available.

To obtain the object containing the results, we do:

results = search_results.result()

# We can also pass a timeout (in seconds) so that we stop waiting if the search takes too long
results = search_results.result(timeout=10)
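
To see how result() and its timeout behave, here is a small sketch built on Python's standard concurrent.futures module. It assumes the Future returned by doSearch() behaves like a standard concurrent.futures.Future, which this page does not confirm. The fake_search function is a hypothetical stand-in for a real search job.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError


def fake_search(delay: float) -> list:
    """Hypothetical stand-in for a search job: sleeps, then returns a list."""
    time.sleep(delay)
    return [{"placeholder": "result"}]


with ThreadPoolExecutor(max_workers=2) as executor:
    quick = executor.submit(fake_search, 0.01)
    slow = executor.submit(fake_search, 0.5)

    # Blocks until the job finishes (well within the 5 second limit)
    quick_results = quick.result(timeout=5)

    try:
        # The slow job needs about 0.5 s, so waiting 0.05 s is not enough
        slow.result(timeout=0.05)
        timed_out = False
    except FutureTimeoutError:
        timed_out = True

    # With a standard Future the job keeps running after the timeout,
    # so asking again without a timeout still returns its results
    slow_results = slow.result()

print(quick_results, timed_out, slow_results)
```

With a standard Future, an expired timeout raises an exception but does not stop the underlying job, so it is usually worth wrapping result(timeout=...) in a try/except block.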

The returned object is a list of dictionaries.

The length of the list therefore corresponds to the number of results, and the last position holds the statistics.
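
Assuming the layout just described (result entries followed by one statistics entry in the last position), the list can be split as sketched below. The dictionaries are made-up placeholders: the real keys are covered on the following pages, not here.

```python
# Hypothetical data standing in for what search_results.result() returns
results = [
    {"placeholder": "first result"},
    {"placeholder": "second result"},
    {"placeholder": "statistics"},
]

# Everything except the last entry goes to `found`, the last entry to `stats`
*found, stats = results

for position, entry in enumerate(found, start=1):
    print(position, entry)
print("stats:", stats)
```

Note that this unpacking raises a ValueError on an empty list, so check that the list is not empty first if that case can happen.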


On the following pages you will see how the data is structured inside the list, so that you can navigate through it.
