Analysis of a dataset of flights, using the SparkSQL framework and extra web scraping techniques
The goal of this demo is to make extensive use of the SparkSQL framework to analyse a dataset. In this example, the dataset contains the US domestic flights for the year 1994.
In the final section of the project, web scraping techniques were applied to collect weather data, in order to understand whether flight delays were related to weather conditions. This gives us enough data to perform that analysis.
You can also find the Python notebook exported as an HTML file, which is more portable and easier to read.
As mentioned above, we used SparkSQL to query the dataset and analyse its content.
We also decided to use some of the most popular Python libraries:
- Pandas and NumPy: data processing and analysis
- Matplotlib and Seaborn: data visualization
- urllib and BeautifulSoup (bs4): web scraping for the weather conditions
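As a rough illustration of the web scraping step, the sketch below shows one way urllib and BeautifulSoup can be combined to turn an HTML table into records. It is not the code from the notebook: the `fetch_html` and `parse_weather_table` helpers, the `weather` table id, the column names, and the sample rows are all hypothetical placeholders, since the actual weather source and page layout are not described here.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text.

    Hypothetical helper: it is defined for completeness but not called
    below, so the example runs without network access.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_weather_table(html):
    """Return the rows of the (assumed) weather table as a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the page exposes its data in <table id="weather">.
    table = soup.find("table", id="weather")
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row has no <td> cells, so it is skipped here.
        if len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
    return rows


# Invented placeholder markup, not real weather observations.
SAMPLE_HTML = """
<table id="weather">
  <tr><th>date</th><th>events</th><th>precip_in</th></tr>
  <tr><td>1994-01-01</td><td>Snow</td><td>0.20</td></tr>
  <tr><td>1994-01-02</td><td></td><td>0.00</td></tr>
</table>
"""

records = parse_weather_table(SAMPLE_HTML)
for record in records:
    print(record)
```

Records in this shape can be loaded into a Pandas DataFrame with `pandas.DataFrame(records)` and then joined with the flight data on the date column.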
We want to thank our teacher, Michiardi Pietro, who created the baseline for the notebook and guided us throughout its development, teaching us all the techniques presented here.
ANGIUS Marco and AVALLE Giorgio - Ⓒ2017