You can use the pandas read_html() function to read HTML tables into a pandas DataFrame.
This function uses the following basic syntax:
df = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')
The following example shows how to use this function to read in a table of NBA team names from this Wikipedia page.
Example: Read HTML Table with Pandas
Before using the read_html() function, you’ll likely have to install lxml:
pip install lxml
Note: If you’re using a Jupyter notebook, you need to restart the kernel after performing this installation.
Next, we can use the read_html() function to read every HTML table on this Wikipedia page:
import pandas as pd import numpy as np import matplotlib.pyplot as plt from unicodedata import normalize #read all HTML tables from specific URL tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association') #display total number of tables read len(tabs) 44
We can see that a total of 44 HTML tables were found on this page.
I know that the table I’m interested in has the word “Division” in it, so I can use the match argument to only retrieve HTML tables that contain this word:
#read HTML tables from specific URL with the word "Division" in them
tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association',
match='Division')
#display total number of tables read
len(tabs)
1
I can then list the names of the columns of the table:
#define table
df = tabs[0]
#list all column names of table
list(df)
[('Division', 'Eastern Conference'),
('Team', 'Eastern Conference'),
('Location', 'Eastern Conference'),
('Arena', 'Eastern Conference'),
('Capacity', 'Eastern Conference'),
('Coordinates', 'Eastern Conference'),
('Founded', 'Eastern Conference'),
('Joined', 'Eastern Conference'),
('Unnamed: 8_level_0', 'Eastern Conference')]
I’m only interested in the first two columns, so I can filter the DataFrame to only contain these columns:
#filter DataFrame to only contain first two columns
df_final = df.iloc[:, 0:2]
#rename columns
df_final.columns = ['Division', 'Team']
#view first few rows of final DataFrame
print(df_final.head())
Division Team
0 Atlantic Boston Celtics
1 Atlantic Brooklyn Nets
2 Atlantic New York Knicks
3 Atlantic Philadelphia 76ers
4 Atlantic Toronto Raptors
The final table contains only the ‘Division’ and ‘Team’ columns.
Additional Resources
The following tutorials explain how to read other types of files in pandas:
How to Read a Text File with Pandas
How to Read Excel Files with Pandas
How to Read CSV Files with Pandas