Web Scraping Course for SEO – Class 5

Create a list of valid links

 

From the links obtained previously, we are going to discard those that are not web pages or that are used for searches.

The links that remain are stored in a list.

 

# -*- coding: iso-8859-15 -*-

# Web Scraping for SEO with Python project
# Class 5: Create a list of valid links
# webscrap5
# Note: written for Python 2 (urllib.urlopen and the print statement)

import urllib
import re

url = "https://evginformatica.blogspot.com/"

# download the page and decode it to text
htmluni = urllib.urlopen(url).read()
html = urllib.unquote(htmluni).decode("utf-8")

# pattern for internal links: href='<url>...'
# re.escape keeps the dots in the url from matching any character
busqueda = "href='" + re.escape(url) + ".+?'"

lista_enlaces = []

enlaces = re.findall(busqueda, html)
for enlace in enlaces:

    # strip the leading href=' and the closing quote
    enlace2 = enlace[6:-1]

    # excluded links:
    #   those containing /search
    #   those containing #
    # keep only .html pages
    if enlace2.find(".html") > 0:
        if enlace2.find("/search") < 0:
            if enlace2.find("#") < 0:
                lista_enlaces.append(enlace2)

for lenlace in lista_enlaces:
    print lenlace

 

This gives us a list of the internal links that point to web pages on our site.

 

For example:

https://evginformatica.blogspot.com/p/blog.html

https://evginformatica.blogspot.com/p/cursos-gratis-programacion.html

https://evginformatica.blogspot.com/p/cursoweb.html

https://evginformatica.blogspot.com/p/curso-online-java.html

https://evginformatica.blogspot.com/p/curso-c-avanzado.html

https://evginformatica.blogspot.com/p/curso-programando-con-python.html

https://evginformatica.blogspot.com/p/cursos-online-disponibles-introduccion.html

https://evginformatica.blogspot.com/2018/09/curso-programacion-orientada-objetos_21.html

https://evginformatica.blogspot.com/2018/09/curso-programacion-orientada-objetos_18.html

https://evginformatica.blogspot.com/2018/09/curso-programacion-orientada-objetos.html

https://evginformatica.blogspot.com/2018/10/curso-programando-con-python-clase-5.html

https://evginformatica.blogspot.com/2018/10/curso-programando-con-python-clase-4.html
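The filtering rules used in this class can also be tried without downloading anything. The following is only a minimal sketch: the function name `filtrar_enlaces` and the sample markup in `html_ejemplo` are made up for this illustration and are not part of the course script. It applies the same three rules (only `.html` pages, no `/search`, no `#`) to an HTML string passed as a parameter, and it uses the `in` operator instead of `find`, which is assumed to be equivalent here because every matched link starts with the site URL.

```python
import re


def filtrar_enlaces(html, url):
    """Return the internal .html links found in html, in order of appearance.

    Hypothetical helper for illustration; it mirrors the loop of the class script.
    """
    # Same pattern as the script: href='<url>...' written with single quotes.
    busqueda = "href='" + re.escape(url) + ".+?'"
    lista_enlaces = []
    for enlace in re.findall(busqueda, html):
        # Strip the leading href=' (6 characters) and the closing quote.
        enlace2 = enlace[6:-1]
        # Keep only .html pages that are not searches and have no # fragment.
        if ".html" in enlace2 and "/search" not in enlace2 and "#" not in enlace2:
            lista_enlaces.append(enlace2)
    return lista_enlaces


url = "https://evginformatica.blogspot.com/"

# Sample markup invented for this example (not taken from the real site).
html_ejemplo = (
    "<a href='https://evginformatica.blogspot.com/p/blog.html'>Blog</a>"
    "<a href='https://evginformatica.blogspot.com/search?q=python'>Search</a>"
    "<a href='https://evginformatica.blogspot.com/p/cursoweb.html#comments'>Comments</a>"
    "<a href='https://example.com/other.html'>External</a>"
)

resultado = filtrar_enlaces(html_ejemplo, url)
print(resultado)
# ['https://evginformatica.blogspot.com/p/blog.html']
```

Only the first link is kept: the second is a search, the third contains `#`, and the fourth is not an internal link, so the pattern never matches it. Wrapping the rules in a function like this makes it easy to check them against a small, known input before running them on a downloaded page.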

 
