I'm trying to scrape data from a website that has a tag like
<a href="https: evisa.mfa.am ">; for example, look at this website.
Is there any way in BeautifulSoup to extract data from non-HTML tags?
Here's a snippet of the whole HTML page from the above link:
<br/>2. Airlines must provide advance passenger information of scheduled arrival of nationals of Antigua and Barbuda and resident diplomats. <br/><br/><b>ARGENTINA</b> - published 02.04.2020 <br/>Passengers are not allowed to enter Argentina until 12 April 2020.<br/><br/><b>ARMENIA</b> - published 22.03.2020 <br/>1. Nationals of China (People's Rep.) with a normal passport are no longer visa exempt. <br/>2. Nationals of Iran can no longer obtain a visa on arrival. They must obtain a visa or an e-visa prior to their arrival in Armenia. The e-visa can be obtained at <a href="https://evisa.mfa.am/">https://evisa.mfa.am/</a> <br/>3. Passengers who have been in Austria, Belgium, China (People's Rep.), Denmark, France, Germany, Iran, Italy, Japan, Korea (Rep.), Netherlands, Norway, Spain, Sweden, Switzerland or United Kingdom in the past 14 days are not allowed to enter Armenia.<br/>- This does not apply to nationals or residents of Armenia.<br/>- This does not apply to spouses or children of nationals of Armenia.<br/>- This does not apply to employees of foreign diplomatic missions and consular institutions.<br/>- This does not apply to representations of official international missions or organizations.<br/>4. Nationals of Armenia who have been in Austria, Belgium, China (People's Rep.), Denmark, France, Germany, Iran, Italy, Japan, Korea (Rep.), Netherlands, Norway, Spain, Sweden, Switzerland or United Kingdom in the past 14 days must undergo 14-days of quarantine or self-isolation regime.
Those are AMP chars (ampersand-escaped character references). html.parser does not handle them well. Just use a real parser such as html5lib:
    from bs4 import BeautifulSoup
    import requests

    r = requests.get(
        "https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm")
    soup = BeautifulSoup(r.content, 'html5lib')
    print(soup.prettify())
If you fetch the webpage with requests and fix the part of the tag that is wrong, you can pass the result to BeautifulSoup.
In the following I'm replacing the entity with a plain space, because it is just an HTML representation of a space.
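    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.iatatravelcentre.com/international-travel-document-news/1580226297.htm'
    response = requests.get(url)
    # Swap the space entity for a plain space before parsing.
    # '&nbsp;' is an assumption: the original entity string did not survive in this post.
    content = response.text.replace('&nbsp;', ' ')
    soup = BeautifulSoup(content, 'html.parser')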
Now you can use BeautifulSoup as you usually do.
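As a possible alternative to replacing one entity at a time, the standard library's html.unescape decodes every character reference in a string in one pass. This is a minimal sketch, not what the answer above does; the helper name decode_refs and the escaped link are made up for illustration and are not copied from the real page.

```python
import html


def decode_refs(raw):
    """Decode HTML character references such as &#47; or &amp; into text."""
    return html.unescape(raw)


if __name__ == "__main__":
    # Hypothetical escaped link, not copied from the real page
    escaped = '<a href="https:&#47;&#47;evisa.mfa.am&#47;">'
    print(decode_refs(escaped))
    # &nbsp; becomes U+00A0 (a non-breaking space), not a plain ' '
    print(repr(decode_refs("a&nbsp;b")))
```

One caveat: running this over a whole document before parsing also turns &lt; and &gt; into real angle brackets, which can change the markup, so it is probably safer to apply it to individual attribute values or text nodes.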
You should analyse your HTML before posting a question.
Now try to get your URL:
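    from bs4 import BeautifulSoup

    # Read a locally saved copy of the page
    with open("test.html", "r") as f:
        page = f.read()

    soup = BeautifulSoup(page, 'html.parser')
    # find_all takes the tag name; href=True keeps only anchors that have an href
    urls = [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("https:")]
    print(urls)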
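If you don't have a saved test.html to hand, the same lookup can be tried against an inline fragment. This is only a sketch: FRAGMENT is a shortened piece of the snippet from the question, and extract_links is a made-up helper name. It assumes the href is already a plain URL, as it appears in that snippet.

```python
from bs4 import BeautifulSoup

# Fragment adapted from the snippet in the question (shortened).
FRAGMENT = (
    '<b>ARMENIA</b> - published 22.03.2020 <br/>'
    'The e-visa can be obtained at '
    '<a href="https://evisa.mfa.am/">https://evisa.mfa.am/</a> <br/>'
)


def extract_links(markup):
    """Return the href of every <a> tag that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    # href=True matches only anchors that actually carry an href attribute
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    print(extract_links(FRAGMENT))
```

Passing the tag name "a" plus href=True to find_all is what makes this work; a string like "a href=..." is treated as a tag name and matches nothing.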