Web Scraping Course for SEO with Python – Class 4

Decode links and see accents

In this step we decode the links, replacing the percent-encoded special characters so that they are readable.

 

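# -*- coding: iso-8859-15 -*-

# Web Scraping for SEO with Python project
# Class 4: Decode links and see accents
# webscrap4

import urllib
import re

url = "https://evginformatica.blogspot.com/"

# Download the page as a raw byte string (Python 2 urllib)
htmluni = urllib.urlopen(url).read()
# Replace percent-escapes such as %20, then decode the bytes as UTF-8
html = urllib.unquote(htmluni).decode("utf-8")

# Match single-quoted href attributes that start with the blog URL
busqueda = "href='" + url + ".+?'"

enlaces = re.findall(busqueda, html)
for enlace in enlaces:
    print enlace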

The line

html = urllib.unquote(htmluni).decode("utf-8")

does two things: unquote replaces the percent-encoded codes (e.g. %20 for a space), and decode("utf-8") converts the resulting bytes to text so that Spanish characters (e.g. accented letters) display correctly.
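To see the two steps in isolation, here is a minimal sketch. It assumes Python 3, where the function lives in urllib.parse rather than in the Python 2 urllib module used in this class, and the percent-encoded URL is an assumed example of how a label link might appear in the page source.

```python
from urllib.parse import unquote, unquote_to_bytes

# Assumed example: a label link as it might appear, percent-encoded, in the HTML
encoded = ("https://evginformatica.blogspot.com/search/label/"
           "Lenguajes%20de%20programaci%C3%B3n")

# Python 3's unquote replaces the escapes and decodes them as UTF-8 in one call
decoded = unquote(encoded)

# The same result in two explicit steps, mirroring unquote(...).decode("utf-8")
two_step = unquote_to_bytes(encoded).decode("utf-8")

print(decoded)
```

In Python 3, unquote decodes as UTF-8 by default, so the separate decode call is only needed when working with raw bytes.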

With these changes the links are displayed correctly, for example:

href='https://evginformatica.blogspot.com/search/label/Curso Online'

href='https://evginformatica.blogspot.com/search/label/Lenguajes de programación'
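The same search can be tried without downloading anything. The sketch below assumes Python 3 and runs on a made-up HTML fragment. The helper name find_links and the sample markup are hypothetical, and re.escape is an addition not used in the class script: it makes the dots in the URL match literally instead of matching any character.

```python
import re
from urllib.parse import unquote

def find_links(html, base_url):
    """Return the decoded, single-quoted href attributes that start with base_url."""
    # re.escape keeps characters such as "." in the URL from acting as regex wildcards
    pattern = "href='" + re.escape(base_url) + ".+?'"
    return re.findall(pattern, unquote(html))

# Hypothetical sample page: one link inside the blog, one outside it
sample_html = (
    "<a href='https://evginformatica.blogspot.com/search/label/Curso%20Online'>Curso</a>"
    "<a href='https://example.com/other'>Other</a>"
)

for link in find_links(sample_html, "https://evginformatica.blogspot.com/"):
    print(link)
```

Only the link that starts with the blog URL is printed, with %20 already replaced by a space.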

 
