Web Scraping Course for SEO – Class 6

Find all internal links

Building on the previous class, we now look for all the internal links of the website by visiting each page we find and collecting the links inside it.

# Web Scraping for SEO with Python project
# Class 6: Find all internal links
# webscrap6
import re
import urllib.parse
import urllib.request

url_sitio = "https://evginformatica.blogspot.com/"
url_inicio = url_sitio
lista_enlaces = []


def buscar_enlaces(posicion, busqueda):
    # download the page stored at this position in the list
    url = lista_enlaces[posicion]
    htmluni = urllib.request.urlopen(url).read()
    html = urllib.parse.unquote(htmluni.decode("utf-8"))

    enlaces = re.findall(busqueda, html)
    for enlace in enlaces:
        # strip the leading href=' or href=" (6 characters) and the closing quote
        enlace2 = enlace[6:-1]

        # excluded links:
        #   /search pages
        #   links that contain #
        # only .html pages are kept
        anadir = True
        if ".html" not in enlace2:
            anadir = False

        if "/search" in enlace2:
            anadir = False

        if "#" in enlace2:
            anadir = False

        # skip links that are already in the list
        if enlace2 in lista_enlaces:
            anadir = False

        if anadir:
            lista_enlaces.append(enlace2)


if __name__ == "__main__":
    print("--- Internal links ---")

    lista_enlaces.append(url_inicio)
    posicion = 0
    # the list grows while we walk it; stop once every page has been visited
    while posicion < len(lista_enlaces):
        # print("--- " + lista_enlaces[posicion])

        # first search: links enclosed in single quotes
        busqueda = "href='" + re.escape(url_sitio) + ".+?'"
        buscar_enlaces(posicion, busqueda)
        # second search: links enclosed in double quotes
        busqueda = 'href="' + re.escape(url_sitio) + '.+?"'
        buscar_enlaces(posicion, busqueda)

        posicion = posicion + 1

    # print the numbered list of internal links
    for i, lenlace in enumerate(lista_enlaces, 1):
        print(str(i) + " " + lenlace)
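
The loop in the program works like a to-do list: every new link is appended to lista_enlaces, and posicion advances until it catches up with the end of the list. The sketch below runs that same idea offline so it can be tried without downloading anything. The example.com URLs, the ENLACES_POR_PAGINA dictionary and the helper names es_enlace_valido and rastrear are assumptions made for illustration; they are not part of the course code.

```python
# Hypothetical in-memory site: each URL maps to the links found on that page.
# It stands in for downloading and parsing real pages.
ENLACES_POR_PAGINA = {
    "https://example.com/": [
        "https://example.com/a.html",
        "https://example.com/search?q=seo",
    ],
    "https://example.com/a.html": [
        "https://example.com/b.html",
        "https://example.com/a.html#comments",
        "https://example.com/",
    ],
    "https://example.com/b.html": [
        "https://example.com/a.html",
    ],
}


def es_enlace_valido(enlace):
    """Mirror the lesson's filter: only .html pages, no /search, no #."""
    return ".html" in enlace and "/search" not in enlace and "#" not in enlace


def rastrear(url_inicio, obtener_enlaces):
    """Visit every page reachable from url_inicio exactly once."""
    lista_enlaces = [url_inicio]
    posicion = 0
    # The list grows while it is being walked; the loop ends when
    # posicion reaches the end and no unvisited page is left.
    while posicion < len(lista_enlaces):
        for enlace in obtener_enlaces(lista_enlaces[posicion]):
            if es_enlace_valido(enlace) and enlace not in lista_enlaces:
                lista_enlaces.append(enlace)
        posicion += 1
    return lista_enlaces


resultado = rastrear(
    "https://example.com/",
    lambda url: ENLACES_POR_PAGINA.get(url, []),
)
for i, enlace in enumerate(resultado, 1):
    print(i, enlace)
```

Because a link is only appended when it is not already in the list, pages that link back to each other (a.html and b.html here) do not cause an endless loop.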
    

 

. Links are searched on every page that is found, not only on the home page.

. Two searches are made per page because some links are enclosed in double quotes and others in single quotes.

. The function buscar_enlaces(posicion, busqueda) searches for links on a page that has already been added to the list.

. The line

if __name__ == "__main__":

marks the entry point of the program: the code under it runs only when the file is executed directly.
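
As a possible variation on the two searches per page, both quote styles can be matched with one regular expression by capturing the opening quote and requiring the same character to close the attribute. This is only a sketch: the example.com address and the sample HTML are placeholders, and the course program itself uses two separate patterns.

```python
import re

url_sitio = "https://example.com/"  # placeholder site, not the course blog

# (["']) captures the opening quote; \1 requires the same quote to close.
# re.escape keeps the dots in the URL from acting as regex wildcards.
patron = re.compile(r"""href=(["'])(""" + re.escape(url_sitio) + r""".+?)\1""")

html = (
    "<a href='https://example.com/uno.html'>1</a>"
    '<a href="https://example.com/dos.html">2</a>'
    '<a href="https://other.example.org/tres.html">3</a>'
)

# group(2) is the URL itself, so no slicing is needed to strip the quotes
enlaces = [m.group(2) for m in patron.finditer(html)]
print(enlaces)
```

With a pattern like this, each page would only need to be downloaded and scanned once instead of twice.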

 

