Web Scraping Course for SEO with Python – Class 3

Get internal links from a web page

Once we have read the content of the website, we will extract the internal links.

We will search for HTML tags of the form <a href='…..'> that point to pages on our own website.

 

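# -*- coding: iso-8859-15 -*-
# Web Scraping for SEO with Python project
# Class 3: Get internal links from a web page
# webscrap3

import re
import urllib.request

url = "https://evginformatica.blogspot.com/"

# Download the page and decode the bytes into text (Python 3)
html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
# Pattern: href=' + our site address + the shortest run of characters up to the next quote
busqueda = "href='" + re.escape(url) + ".+?'"

enlaces = re.findall(busqueda, html)
for enlace in enlaces:
    print(enlace)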

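To achieve this, we use a regular expression that searches the text for href=' followed by the address of our website, then one or more characters up to the next single quote. The .+? part is a lazy match, so it stops at the first closing quote instead of running on to a later one. The pattern does not require the link to end in html, although the example links shown below do.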
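To see why the lazy match matters, here is a small sketch run against a made-up one-line HTML fragment rather than the real page (the sample string and the names lazy and greedy are assumptions for illustration). It compares the lazy .+? with a greedy .+ on the same text.

```python
import re

url = "https://evginformatica.blogspot.com/"
# Hypothetical one-line fragment containing two internal links
sample = ("<a href='https://evginformatica.blogspot.com/p/blog.html'>Blog</a> "
          "<a href='https://evginformatica.blogspot.com/p/cursos-gratis-programacion.html'>Cursos</a>")

# Lazy: each match ends at the first closing quote after the address
lazy = re.findall("href='" + re.escape(url) + ".+?'", sample)
# Greedy: the match runs on to the last quote on the line
greedy = re.findall("href='" + re.escape(url) + ".+'", sample)

print(len(lazy))    # 2: one match per link
print(len(greedy))  # 1: a single match spanning both links
```

With the greedy version, two links on the same line would be merged into one long, useless match, which is why the lazy form is used here.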

 

The findall function returns a list with every non-overlapping match of the search pattern in the HTML text.
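As a small illustration of what findall returns, the sketch below runs it on a made-up HTML fragment instead of the downloaded page (the sample string and the helper name extract_links are hypothetical). Without a group, findall returns the whole matched text, including href=. Putting parentheses around the address, as done here, makes it return only the URL.

```python
import re

def extract_links(html, url):
    """Return the internal URLs found in single-quoted href attributes."""
    # The parentheses form a capture group, so findall returns only the URL
    pattern = "href='(" + re.escape(url) + ".+?)'"
    return re.findall(pattern, html)

url = "https://evginformatica.blogspot.com/"
# Hypothetical fragment with one internal and one external link
sample = ("<a href='https://evginformatica.blogspot.com/p/blog.html'>Blog</a>\n"
          "<a href='https://www.python.org/'>External</a>\n")

print(extract_links(sample, url))
# ['https://evginformatica.blogspot.com/p/blog.html']
```

The external link is skipped because it does not start with the site address. Note that this sketch, like the pattern it is based on, only sees links written with single quotes and a full address.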

With this, all the internal links found on the page are printed to the screen.

For example:

'https://evginformatica.blogspot.com/p/blog.html'

'https://evginformatica.blogspot.com/p/cursos-gratis-programacion.html'

 

In some cases, links may contain sequences such as %20. These are percent-encoded (URL-encoded) special characters, such as spaces or accented letters; %20, for example, stands for a space.
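If a readable form of such a link is needed, the standard library can decode it. A minimal sketch, assuming Python 3 and using a made-up link for illustration (the address in enlace is not a real page):

```python
from urllib.parse import unquote

# Hypothetical link with an encoded space (%20) and an encoded "ó" (%C3%B3)
enlace = "https://evginformatica.blogspot.com/p/curso%20programaci%C3%B3n.html"

# unquote turns each %XX sequence back into the character it stands for
print(unquote(enlace))
# https://evginformatica.blogspot.com/p/curso programación.html
```

The decoded form is handy for reports. When requesting the page again, the original encoded link is the one to use.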
