Scraping MBA programmes information from MBAStudies.com

Francisco Riaño

Scraping MBA programmes information from MBAStudies.com

Francisco Riano
January 11, 2023
No Comments

Scraping MBA programmes information from MBAStudies.com

GitHub repository: https://github.com/FranciscoRiano/web_scrapping_python_project

Motivation

The importance of higher education cannot be denied as it plays a role in various areas of human life from personal development to business growth and socio-economic advancement. Higher education is an instrument for economic progress (Kromydas, 2017). It creates a path to financial security, economic mobility, personal growth, professional development, and leadership opportunities (Teague, 2015). Moreover, according to UNESCO, higher education institutions play a crucial role in developing innovative solutions to global and local problems. Therefore, it is essential to have access to transparent information about available study programmes in order to be able to make an informed decision.

The dataset was created with the aim of comparing different MBA programmes offered within the Group of Seven (G7) countries: Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. Nowadays, the number of academic offers is considerably increasing; therefore, it is important to better organize all the information based on important features such as tuition fees, length of study, language, and location, among others.

While there is a lot of information available on the internet regarding MBA programmes, with the currently available tools, it is quite difficult to compare several programmes or get access to general statistics on the academic offerings presented per country. For example, MBAStudies allows a comparison of only two programmes at once. This option is expanded to four programmes when the account is created on the website. Therefore, our main purpose is to offer an alternative solution based on web scraping techniques, to allow for more effective comparison of MBA programmes. We considered it to be important to develop this project in order to enable students to benchmark their individual programmes of interest and discover new options.

Composition

Each instance in the dataset represents one MBA programme from the https://www.mbastudies.com/MBA website in the selected countries (G7). The data is scraped per country. For that reason, it is important to take into account the fact that programmes offered in more than one country can be duplicates in the dataset. While all programmes are in the same category (Master of Business Administration), each programme has its own characteristics that differentiate it from other programmes. These characteristics include locations (in which the programme is offered), earliest start date, languages, pace (part-time/full-time), duration of the programme, application deadline, type of study (e.g. online, on-campus), and the number of tuition fees.

There are 1266 MBA programmes in the G7 countries (as of 26 March 2022). This number can change in the future as the programmes are either added or deleted. The number of programmes was extracted from each country page of MBA programmes.

Collection process

The collected data is reported by higher education institutions (subjects). In this specific case, the subjects (e.g. University representatives) are those in charge of submitting and keeping updated the exhibited information on www.mbastudies.com. Keystone Education Group, which is the company behind the website, offers a friendly interface for all of those who want to promote academic programmes, as well the company is responsible for guaranteeing adequate filtering and organization of the deployed data on its web portal. The data scraped from the website is directly exhibited on each programmes’ page and publicly available to every user.

The mechanism used to extract data was web scraping techniques carried out through the Python programming language. The main libraries used were BeautifulSoup, Pandas and Selenium. BeautifulSoup was the most appropriate tool because it allows us to understand and interpret the html structure of the website and therefore retrieve the data more easily. Selenium was useful in our project because some content of the website is available after clicking the button to expand the text. At the end of the project it was aimed to store the information in JSON dictionaries as well making it available in csv format.

Preprocessing, cleaning, labelling

As mentioned before, some of the programme links were not provided in the programme listing. Therefore, when scraping these links, we have obtained some empty values for the missing links. These empty values were removed from the list of all links so that the scraper can work correctly. Besides that, all data was collected, without removing any instances.

The software Python in Jupyter Notebook (https://jupyter.org/) was used for writing and processing the code in Python (https://www.python.org/). Moreover, BeautifulSoup (https://pypi.org/project/beautifulsoup4/) and Selenium (https://pypi.org/project/selenium/) were used to retrieve information from the website. The statistical analysis was conducted in RStudio (https://www.rstudio.com/).

Uses

One obstacle of the MasterStudies website is that only two programmes’ information can be compared simultaneously. However, the dataset we scraped can be compared with different programmes as much as possible. Student candidates who want to apply to a university can easily compare the programmes they are interested in, as well as each programmes’ rankings, tuition fees, location, language, admission requirements, availability of scholarships, etc. It also provides an overview of programmes for educational institutions. By this way, they can find the information they need very efficiently and precisely.

Authors: Paulina Ambroziak, Yuetong Bi, Francisco Riaño

Francisco Riaño