sandraelekes/pands-project-2020

Programming and scripting project 2020

This repository holds the final project for the Programming and Scripting module of the Higher Diploma in Data Analytics course at GMIT. The project researches and investigates Fisher's Iris dataset.

A detailed project description can be found on the GitHub page of the lecturer, Ian McLoughlin.

Iris dataset

Iris dataset history

The Iris flower data, also known as Fisher's Iris dataset, was introduced by the British biologist and statistician Sir Ronald Aylmer Fisher. In 1936, Fisher published a paper titled “The Use of Multiple Measurements in Taxonomic Problems” in the journal Annals of Eugenics. Fisher did not collect the data himself. Credit for the data goes to Dr. Edgar Anderson, who collected the majority of it on the Gaspé Peninsula.
In this paper, Fisher developed and evaluated a linear function to differentiate Iris species based on the morphology of their flowers. It was the first time that the sepal and petal measurements of the three Iris species appeared publicly. [01]

The differences between the Iris species are pictured below. [02]

Iris flower species

Iris dataset file

This Iris dataset contains 150 records representing three Iris species (Iris setosa, Iris versicolor and Iris virginica), with 50 samples each.

Each record has the following columns:

  • Id
  • SepalLengthCm
  • SepalWidthCm
  • PetalLengthCm
  • PetalWidthCm
  • Species

The Iris dataset [03] used in this analysis is stored in this repository as Iris_dataset.csv.

Dataset code and analysis

This section explains the code for the imported libraries, the dataset import and the summary. The code used for plotting is explained in Plots.

Imported libraries and modules

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sys

NumPy is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
A shorter definition is that NumPy is the fundamental package for scientific computing in Python. [04]

pandas is a Python package for data science; it offers data structures for data manipulation and analysis. [05]
In this project, pandas is used to create a summary of the dataset from a .csv file.

Matplotlib is a comprehensive visualisation library in Python, built on NumPy arrays, for creating static, animated and interactive 2D plots of arrays. [06] [07]
matplotlib.pyplot is a state-based interface to matplotlib. It provides a MATLAB-like way of plotting. pyplot is mainly intended for interactive plots and simple cases of programmatic plot generation. [08]

Seaborn is a library for making statistical graphics in Python. It is built on top of matplotlib and closely integrated with pandas data structures. [09]
Working with DataFrames is a bit easier with Seaborn because its plotting functions operate on DataFrames and arrays that contain a whole dataset. [10]
Elite Data Science has an interesting Seaborn tutorial based on a dataset from the famous Pokemon cartoon.

The sys module provides system-specific parameters and functions. It gives access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. [11]

Interesting tutorials for working with these libraries can be found in Worthy mentions.

Libraries cheat sheets

List of useful cheat sheets for the libraries used in this project:

Dataset import

    ifds = pd.read_csv("Iris_dataset.csv", index_col = "Id")

This line of code reads the .csv file into a DataFrame and stores it in the variable ifds (iris flower dataset) for further analysis and manipulation.
By default pandas adds its own zero-based integer index to the DataFrame, so index_col = "Id" was used to make the Id column the index while reading the file. As a result, the Id values are treated as row labels and are not taken into consideration when analysing the data. [12]
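The effect of index_col can be checked on a small in-memory file. This is only a sketch: the three-row excerpt below is a hypothetical stand-in for Iris_dataset.csv with the same column names, and io.StringIO is used so that no file is needed.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same column layout as Iris_dataset.csv.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
"""

# Without index_col, pandas adds its own 0-based index and keeps Id as a data column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col="Id", the Id column becomes the row index instead.
id_index = pd.read_csv(io.StringIO(csv_text), index_col="Id")

print(list(default_index.columns))  # Id is listed among the data columns
print(list(id_index.index))         # Id values are now the row labels
print(list(id_index.describe().columns))  # Id is absent from the statistics
```

Without index_col, Id would be summarised by describe() like any measurement, which is the situation the import line avoids.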

Dataset summary

Part of the code for summary:

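    def summary_to_file():
        # Remember the console stream so it can be restored afterwards.
        original_stdout = sys.stdout
        sys.stdout = open("analysis_summary.txt", "w")
        try:
            ...
            print(ifds)
            ...
            print(ifds.describe())
            ...
            ifds.info()  # info() prints by itself and returns None
            ...
            print(ifds["Species"].value_counts())
            ...
            print(ifds["Species"].value_counts(normalize=True) * 100)
        finally:
            # Close the file and send later print output back to the console.
            sys.stdout.close()
            sys.stdout = original_stdout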

The dataset summary is not shown when the program starts, but is stored in analysis_summary.txt instead.

The function summary_to_file() creates the summary and writes it into the file at the same time.

Writing the summary output into a file is achieved with the sys module and its attribute stdout. stdout (the standard output stream) is simply the default place to which a program sends its text output. [13] [14]
The initial idea was to create a function that produces the summary and then write its output into a .txt file with .write(). After long research and trial and error, that proved too complicated to code. Redirecting stdout was chosen instead because the code is simpler and every print operation writes its output to the .txt file, whereas .write() only accepts a string as input. [15] [16] [17]
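The same redirection idea can also be sketched with contextlib.redirect_stdout from the standard library. This is an alternative to assigning sys.stdout directly, not the approach used in this project. The function name write_summary, the file name summary_sketch.txt and the printed placeholder values are assumptions made for illustration.

```python
import contextlib
import os
import tempfile

def write_summary(path):
    """Send everything printed inside the with-block to the file at path."""
    with open(path, "w") as out_file, contextlib.redirect_stdout(out_file):
        print("Dataset summary")            # any print call lands in the file
        print({"rows": 150, "species": 3})  # non-string objects work as well

# Hypothetical output location in a temporary directory.
summary_path = os.path.join(tempfile.mkdtemp(), "summary_sketch.txt")
write_summary(summary_path)

# Once the with-block has ended, print writes to the console again.
with open(summary_path) as in_file:
    summary_text = in_file.read()
print(summary_text)
```

The context manager restores the previous output stream and closes the file even if an error occurs while the summary is being printed.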

print(ifds) gives an overview of the whole dataset loaded from the Iris_dataset.csv file.

Summary of the values - describe()

ifds.describe() gives a summary of the numeric columns in the dataset. It shows the count of non-null values in each column, which can point to possible missing values. It also calculates the mean, standard deviation, minimum and maximum, and the 25th, 50th and 75th percentiles (the first, second and third quartiles) of each numeric column. [18]

Output

       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count     150.000000    150.000000     150.000000    150.000000
mean        5.843333      3.054000       3.758667      1.198667
std         0.828066      0.433594       1.764420      0.763161
min         4.300000      2.000000       1.000000      0.100000
25%         5.100000      2.800000       1.600000      0.300000
50%         5.800000      3.000000       4.350000      1.300000
75%         6.400000      3.300000       5.100000      1.800000
max         7.900000      4.400000       6.900000      2.500000
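The 25%, 50% and 75% rows of this output are quartiles. A minimal sketch on a hypothetical five-value column (not taken from the dataset) shows that they agree with quantile() and that the 50% row is the median.

```python
import pandas as pd

# Hypothetical toy column, chosen so the quartiles are easy to check by hand.
toy = pd.DataFrame({"length_cm": [1.0, 2.0, 3.0, 4.0, 5.0]})

summary = toy.describe()

# The 25%, 50% and 75% rows match quantile() at 0.25, 0.50 and 0.75.
quartiles = toy["length_cm"].quantile([0.25, 0.50, 0.75])

print(summary.loc[["25%", "50%", "75%"], "length_cm"].tolist())
print(quartiles.tolist())
print(toy["length_cm"].median())
```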

Samples of each type - info()

ifds.info() prints information about the dataset, including the index data type, the column data types, the number of non-null values and the memory usage. [18]

Output

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
SepalLengthCm    150 non-null float64
SepalWidthCm     150 non-null float64
PetalLengthCm    150 non-null float64
PetalWidthCm     150 non-null float64
Species          150 non-null object
dtypes: float64(4), object(1)
memory usage: 7.0+ KB
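Unlike describe(), info() writes its report itself and returns None. A small sketch, assuming a hypothetical two-row frame, captures the report through the buf parameter of DataFrame.info to make this visible.

```python
import io
import pandas as pd

# Hypothetical two-row frame; the real dataset has 150 rows.
toy = pd.DataFrame(
    {"SepalLengthCm": [5.1, 7.0], "Species": ["Iris-setosa", "Iris-versicolor"]}
)

buffer = io.StringIO()
result = toy.info(buf=buffer)  # info() writes its report into the buffer
info_text = buffer.getvalue()

print(result)     # the return value of info() is None
print(info_text)  # the report itself
```

This is why wrapping info() in print() adds an extra line reading None after the report.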

Number of occurrences of each of the species

The value_counts() method counts how many times each distinct value occurs in a column. In this case, the column of interest is Species. [19]
Setting the parameter normalize to True (it is False by default) returns relative frequencies instead, which are multiplied by 100 here to give percentages. [20]

Output

Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: Species, dtype: int64

Or, viewed as percentages:

Iris-setosa        33.333333
Iris-versicolor    33.333333
Iris-virginica     33.333333
Name: Species, dtype: float64
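Because the Iris dataset is perfectly balanced, both outputs carry the same information. The difference between the two calls is easier to see on an unbalanced sample. The four-value series below is hypothetical and only serves as a sketch.

```python
import pandas as pd

# Hypothetical, deliberately unbalanced sample so that the two outputs differ.
species = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"],
    name="Species",
)

counts = species.value_counts()                      # absolute counts
shares = species.value_counts(normalize=True) * 100  # percentages

print(counts.to_dict())
print(shares.to_dict())
```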

Plots

Histograms

Sepal length Sepal width

Petal length Petal width

Histogram code

Histograms are coded with the help of functions. There are four functions, one for each histogram: Sepal Length, Sepal Width, Petal Length and Petal Width. All of them are called from a single function, histograms().

Example of part of the code:

    iris_s = ifds[ifds.Species == "Iris-setosa"]
    iris_vers = ifds[ifds.Species == "Iris-versicolor"]
    iris_virg = ifds[ifds.Species == "Iris-virginica"]

    def petal_length_hist():
        plt.figure(figsize = (9,9))
        sns.distplot(iris_s["PetalLengthCm"],  kde = False, label = "Iris setosa", color = "deeppink")
        sns.distplot(iris_vers["PetalLengthCm"],  kde = False, label = "Iris versicolor", color = "mediumorchid")
        sns.distplot(iris_virg["PetalLengthCm"],  kde = False, label = "Iris virginica", color = "navy")
        plt.title("Petal length in cm", size = 20)
        plt.xlabel("")
        plt.ylabel("Frequency", size = 16)
        plt.legend()
        plt.savefig("Petal-lenght.png")
        plt.show()

The variables iris_s, iris_vers and iris_virg hold the subsets of the original DataFrame for Iris setosa, Iris versicolor and Iris virginica, respectively. They are defined outside the functions so that they can be reused. [21]
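The subsetting relies on a boolean mask. A minimal sketch on a hypothetical five-row frame shows the mask explicitly. The groupby variant is an assumption added for comparison and is not used in this project.

```python
import pandas as pd

# Hypothetical miniature frame with the same column names as the dataset.
toy = pd.DataFrame(
    {
        "PetalLengthCm": [1.4, 1.3, 4.7, 4.5, 6.0],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
                    "Iris-versicolor", "Iris-virginica"],
    }
)

# Boolean mask: True where the row belongs to the wanted species.
is_setosa = toy["Species"] == "Iris-setosa"
setosa_rows = toy[is_setosa]

# Alternative (assumption): groupby yields every per-species subset at once.
subsets = {name: rows for name, rows in toy.groupby("Species")}

print(is_setosa.tolist())
print(len(setosa_rows))
print(sorted(subsets))
```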

Many parameters in the code are added for aesthetic purposes only. An example of that is setting the text size of the title and labels. [22] figsize is set to 9 by 9 inches so that the legend is not positioned over the histogram in the saved picture. It is important to note that the figure size must be defined before plotting starts. [23]
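The order of the calls matters because plt.figure() always creates a new figure. The sketch below uses hypothetical sample values and the non-interactive Agg backend (an assumption, so that it runs without a display) to show that a second plt.figure() call made after plotting yields a fresh, empty figure instead of resizing the existing one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Size first, then plot: the histogram is drawn on the 9 x 9 inch figure.
fig = plt.figure(figsize=(9, 9))
plt.hist([1.4, 1.5, 1.3, 1.6], label="hypothetical sample")
plt.legend()
size_in_inches = tuple(fig.get_size_inches())
drawn_on_sized_figure = plt.gcf() is fig

# Calling plt.figure() after plotting opens a second figure with nothing on it.
late_fig = plt.figure(figsize=(9, 9))
late_fig_is_empty = len(late_fig.axes) == 0

print(size_in_inches, drawn_on_sized_figure, late_fig_is_empty)
plt.close("all")
```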

distplot() is a function used to flexibly plot a univariate distribution of observations. [24]
The parameter kde (kernel density estimate) is set to False because it was unnecessary in this case.
The parameter color was set for a clearer distinction between the species and a nicer picture. [25]

Scatterplots

Sepal length and Sepal width comparison Petal length and Petal width comparison

The Sepal length and Sepal width comparison shows that Iris setosa is easier to distinguish than Iris versicolor and Iris virginica. Iris setosa has wider and shorter sepals, while the other two species are not easy to tell apart based on these measurements.

In the Petal length and Petal width comparison the difference between the three species is much more noticeable. Iris setosa is very distinct and has the smallest and narrowest petals of the three. Iris virginica has the biggest petals.

Scatterplot code

Scatterplots are coded as two different functions: one for the Sepal width and length comparison and one for the Petal width and length comparison. Both functions are united under a function scatterplots().

Scatterplot code example:

    def sepal_length_width_scat():
        plt.figure(figsize = (9,9))
        # hue colours the points by species; palette and edgecolor are aesthetic choices
        sns.scatterplot(x = "SepalLengthCm", y = "SepalWidthCm", data = ifds, marker = "o", hue = "Species",
                        palette = ["deeppink","mediumorchid","navy"], edgecolor = "dimgrey")
        plt.title("Sepal length and Sepal width comparison", size = 20)
        plt.xlabel("Sepal length", size = 16)
        plt.ylabel("Sepal width", size = 16)
        plt.legend()
        plt.savefig("Sepal-length-width.png")
        plt.show()

sns.scatterplot() depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset. The viewer can then determine whether there is any meaningful relationship between the presented data. [26] [27]
The data compared this way are the columns "SepalLengthCm" and "SepalWidthCm", and the points are grouped by "Species". [28]

As with the histograms, many of the scatterplot parameters are added for aesthetic purposes.
The color palette is the same as for the histograms. A circle marker with an edgecolor is chosen for a neater look. [29] [30]

Pairplot

The pairplot allows a better comparison and observation of the data and provides enough information to draw conclusions.

Iris dataset pairplot

Pairplot code

    def pairplot():
        sns.pairplot(ifds, hue = "Species", diag_kind = "hist", palette = ["deeppink","mediumorchid","navy"])
        plt.savefig("Iris-dataset-pairplot.png")
        plt.show()

A pairplot is used for plotting pairwise relationships in a dataset. With hue set, the default diagonal plot is a KDE, but in this case it is changed to a histogram with the parameter diag_kind. The color palette remains the same. [31]

Because there are 4 numeric variables (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm), a 4x4 grid of plots is created.
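The grid size follows from pairing every numeric column with every numeric column, including itself. The sketch below only counts the panels and does not draw anything. The one-row frame is a hypothetical stand-in that mirrors the dataset's column names.

```python
import itertools
import pandas as pd

# Hypothetical one-row frame that only mirrors the dataset's column names.
toy = pd.DataFrame(
    {
        "SepalLengthCm": [5.1], "SepalWidthCm": [3.5],
        "PetalLengthCm": [1.4], "PetalWidthCm": [0.2],
        "Species": ["Iris-setosa"],
    }
)

# Only numeric columns get a row and a column in the grid; Species is used for color.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

# Every ordered (row variable, column variable) pair is one panel.
panels = list(itertools.product(numeric_columns, repeat=2))

# Panels that pair a variable with itself form the diagonal (the histograms).
diagonal = [pair for pair in panels if pair[0] == pair[1]]

print(len(numeric_columns), len(panels), len(diagonal))
```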

Conclusion

Even though it has the widest sepals of all three species, Iris setosa is the smallest flower.
Compared by sepal width and length alone, Iris versicolor and Iris virginica are not easy to distinguish.
Observing the petal length and width, and the petal and sepal ratios, makes the difference noticeable, with Iris virginica being the biggest of the flowers.

Technologies used

  • Visual Studio Code - version 1.44.2
  • cmder - version 1.3.14.982
  • python - version 3.7.4.final.0
  • Anaconda3 - 2019.10
  • Notepad++ - version 7.8.5
  • Mozilla Firefox 75.0 (64-bit)

References

[01] Towards Data Science. The Iris dataset - A little bit of history and biology.
[02] The Good Python. Iris dataset.
[03] Kaggle. UCI Machine Learning. Iris dataset download.
[04] Numpy.org. What is NumPy?
[05] Datacamp. Pandas tutorial.
[06] Geeksforgeeks. Introduction to Matplotlib.
[07] Matplotlib.org.
[08] Matplotlib.org. Matplotlib.pyplot.
[09] Seaborn. Introduction.
[10] Datacamp. Seaborn Python tutorial.
[11] Python.org. Sys.
[12] Real Python. Python csv.
[13] Lutz, M. (2009). "Learning Python", pg. 303.
[14] StackOverflow. Sys.stdout.
[15] Real Python. Read Write files Python.
[16] Geeksforgeeks. Reading and writing text files.
[17] StackOverflow. Python writing function output to a file.
[18] Towards Data Science. Getting started to data analysis with Python pandas.
[19] Medium. Exploratory data analysis.
[20] Towards Data Science. Getting more value from the pandas value counts.
[21] Cmdline tips. How to make histogram in python with pandas and seaborn.
[22] StackOverflow. Text size of x and y axis and the title on matplotlib.
[23] StackOverflow. Change size of figures drawn with matplotlib.
[24] Seaborn.pydata. Seaborn.distplot.
[25] Python graph gallery. Select color with matplotlib.
[26] Seaborn. Seaborn scatterplot.
[27] Seaborn. Relational tutorial.
[28] Honing Data Science.
[29] Matplotlib. Markers.
[30] StackOverflow. Matplotlib border around Scatterplot points.
[31] Kite. Seaborn pairplot.

Worthy mentions

This is a list of sources that were not used in the analysis or summary of the Iris dataset, but rather for a better understanding of the project requirements and for researching how to edit the readme file, along with other interesting sources worth reading.

GitHub editing

Dataset analysis approach by others
