Get internal links from a web page
Once we have read the content of the website, we will extract the internal links.
We will search for HTML tags of the form <a href='…'> whose address points to our own website.
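# -*- coding: iso-8859-15 -*-
# Web Scraping for SEO with Python project
# Lesson 3: Get the internal links of a web page
# webscrap3 (written for Python 3)
import re
import urllib.request

url = "https://evginformatica.blogspot.com/"
# Download the page; read() returns bytes, so decode them into text (UTF-8 assumed)
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
# href=' + our address (escaped so its dots are literal) + the shortest run up to the next quote
busqueda = "href='" + re.escape(url) + ".+?'"
enlaces = re.findall(busqueda, html)
for enlace in enlaces:
    print(enlace)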
To achieve this, we use a regular expression that searches the text for href=' followed by the address of our website, then one or more characters (as few as possible) up to the next single quote. The example links below end in .html, but the pattern itself does not require that.
The findall function returns every non-overlapping match of the pattern in the html text as a list.
With this, all the internal links used on the website are printed on the screen,
e.g.
href='https://evginformatica.blogspot.com/p/blog.html'
href='https://evginformatica.blogspot.com/p/cursos-gratis-programacion.html'
…
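If only the address itself is wanted, without the surrounding href=' and closing quote, one option is a capturing group. The following is a minimal sketch for Python 3, not the code from this lesson. The html string is a made-up sample standing in for the downloaded page, and [^']+ is swapped in for .+? so that a match cannot run past the closing quote.

```python
import re

url = "https://evginformatica.blogspot.com/"

# Hypothetical sample markup, standing in for the downloaded html text
html = (
    "<a href='https://evginformatica.blogspot.com/p/blog.html'>Blog</a>"
    "<a href='https://www.python.org/'>Python</a>"
    "<a href='https://evginformatica.blogspot.com/p/cursos-gratis-programacion.html'>Cursos</a>"
)

# re.escape makes the dots in the address literal.
# The parentheses capture only the address; [^']+ stops at the closing quote.
busqueda = "href='(" + re.escape(url) + "[^']+)'"

# With one capturing group, findall returns the captured text of each match
enlaces = re.findall(busqueda, html)

for enlace in enlaces:
    print(enlace)
```

The external link to python.org is skipped because it does not start with the address stored in url. Note that this pattern, like the one in the lesson, only finds href values written with single quotes.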
In some cases, links may contain sequences such as %20. These are percent-encoded special characters, such as spaces or accented letters (%20 stands for a space).
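To turn such a link back into readable text, Python 3 offers urllib.parse.unquote. The short sketch below assumes the link was encoded as UTF-8, which is the default for unquote. The address is invented for illustration and is not a real page of the blog.

```python
from urllib.parse import unquote

# Hypothetical link: %C3%B3 is the UTF-8 encoding of an accented o, %20 is a space
enlace = "https://evginformatica.blogspot.com/p/programaci%C3%B3n%20web.html"

# unquote replaces each %xx sequence with the character it stands for
legible = unquote(enlace)
print(legible)
```

The decoded form is convenient for display. When requesting the page, the original encoded link is the one to use.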