Python Data Analysis Project — Palmer Penguins Dataset
The Palmer penguins dataset is a collection of data about penguins in the Palmer Archipelago, Antarctica. It is a common dataset that is easy to understand and work with. It contains variables such as island name (Dream, Torgersen, or Biscoe), species name (Adelie, Chinstrap, or Gentoo), bill length (mm), bill depth (mm), flipper length (mm), body mass (g), and sex. The dataset is available on Kaggle and can be used for data analysis and machine learning projects.
In this project, I look at the relationship between the sizes of the birds and features such as location, sex and flipper length. I also look at the clutch completion of each bird.
NB: A clutch of eggs is a group of eggs produced by birds, amphibians, or reptiles, often at a single time, particularly those laid in a nest. In birds, destruction of a clutch by predators (or removal by humans) results in double-clutching. The term “clutch” is also used to refer to living eggs that have not hatched yet.
I cleaned and analysed the data using Python.
Importing Libraries and Studying Dataset
I started by importing the following Python libraries into a Jupyter Notebook:
Pandas: for storing, manipulating and working with structured data.
NumPy: for working with arrays and matrices, as well as carrying out mathematical operations on them.
Matplotlib.pyplot: for creating plots.
Seaborn: a data-visualisation library built on top of Matplotlib for creating attractive statistical graphics.
Next, I imported the CSV file using pd.read_csv("filepath.csv").
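A minimal sketch of this setup; the file name penguins_lter.csv is an assumption, so substitute the path to your own copy of the dataset:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the CSV file into a DataFrame (the file name is a placeholder)
data = pd.read_csv("penguins_lter.csv")
```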
I then studied the dataset to get an overview before proceeding with the analysis.
The code data.head(10) allowed me to view the first ten rows of the dataset. This is important for getting an initial feel for the data one is working with.
Similarly, data.tail(10) allowed me to view the last ten rows of the dataset.
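For example, with the data DataFrame loaded earlier:

```python
# Preview the first ten rows
print(data.head(10))

# Preview the last ten rows
print(data.tail(10))
```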
Next, data.describe() generates summary statistics for the numeric columns, such as the count, mean, standard deviation, minimum, quartiles and maximum of each.
The code data.sample(10) returns a random sample of ten rows.
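Continuing with the same DataFrame:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
print(data.describe())

# A random sample of ten rows
print(data.sample(10))
```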
The code data.info() provides a concise summary of the dataset: the name of each column, its data type and the number of non-null rows it contains.
The code data.nunique() gives us the number of unique values in each column. This lets us know how many distinct values exist under each column, e.g. the number of penguin species in the dataset.
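For example:

```python
# Column names, data types and non-null counts
data.info()

# Number of distinct values in each column
print(data.nunique())
```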
Data Cleaning
After getting a summary of the data, I proceeded to clean the data.
Using data.drop(), I deleted all columns that were not of use in the analysis.
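A sketch of this step; the column names below are placeholders rather than the ones actually dropped in the original analysis:

```python
# Remove columns that are not needed for the analysis
# (placeholder column names; use the ones present in your file)
data = data.drop(columns=["Comments", "Sample Number"])
```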
The code data.isnull() returns a boolean DataFrame of the same shape as the data, with True marking every cell that contains a null value.
The code data.isnull().any() tells us, for each column in the dataset, whether it contains any null values.
The code data.isnull().sum() tells us the number of null values in each column.
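For example:

```python
# Boolean DataFrame: True wherever a cell is null
print(data.isnull())

# Per column: does it contain any null values?
print(data.isnull().any())

# Per column: how many null values does it contain?
print(data.isnull().sum())
```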
After discovering the rows and columns with null values, we decide whether to delete them or replace them. If deleting the rows with null values will not significantly affect the analysis, they can be dropped. For this project, however, I decided to replace the missing values.
We can replace missing values in an array (column) using the mean, median or mode of the values present in that array.
I started with the “culmen length” column, where I found two rows with missing values: one for a Gentoo penguin and another for an Adelie penguin. Next, I found the mean culmen length for each of the three penguin species, and likewise the means for culmen depth, flipper length and body mass.
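A sketch of how these per-species means can be computed in one step; the column names follow the Kaggle penguins_lter.csv file and are assumptions:

```python
# Mean culmen length, culmen depth, flipper length and body mass per species
# (column names are assumptions based on the Kaggle file)
numeric_cols = ["Culmen Length (mm)", "Culmen Depth (mm)",
                "Flipper Length (mm)", "Body Mass (g)"]
species_means = data.groupby("Species")[numeric_cols].mean()
print(species_means)
```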
NB: Culmen is a term used in ornithology to refer to the upper ridge of a bird’s beak, i.e. the ridge along the top of the upper mandible.
Having calculated the desired mean values, I proceeded to fill in the missing values. I also filled in the missing values in the sex column by comparing each bird’s measurements with the average sizes of male and female penguins of its species and inserting the sex accordingly.
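A sketch of the numeric fill using per-species means, reusing numeric_cols from the snippet above (the original analysis may have inserted the values one at a time):

```python
# Replace each measurement column's missing values with the mean
# for that penguin's species
for col in numeric_cols:
    data[col] = data.groupby("Species")[col].transform(
        lambda s: s.fillna(s.mean()))

# The sex column was filled separately, by comparing each bird's size
# with its species' male and female averages (not shown here)
```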
With the unwanted columns removed and the missing values filled in, the cleaning of the data was complete. Next, I proceeded to the analysis and visualisation of the data.