The Complete Guide to Moving from Excel to Python, Part I

Anyone who works with data (or wants to work with data) has probably used Excel. Many people exclusively use Excel for years and years because it is quite a powerful tool.

So, why use Python? The Python environment offers benefits that far exceed what Excel can do and make the switch worth it. However, there aren’t many step-by-step guides out there, so we created one.

In this article, we’ll use a popular Python package - Pandas - to replace many Excel functions you already know and love.


The Dataset

In this article, we’re going to use a .csv file. These are rather common, and you can save many Excel files with the .csv file extension. Python also works well with .csv files.

The dataset we will use in this article contains student grades in different subjects, and is taken from Kaggle. We’ll be able to easily switch from Excel spreadsheets to Python dataframes.

Setup

First things first, this article assumes you have Python installed on your computer, and have an IDE to use. If not, this guide will get you to that point. There are lots of IDE options, but VSCode is a good starter option. If you have questions about other IDEs, leave them in the comments.

To follow along with this guide, let’s install the packages we will need, Pandas and NumPy. You can do this by typing the following in your terminal window.

pip install pandas
pip install numpy

Next, you will need to create a new script (.py file) or notebook (.ipynb file) to work in. Now we’re ready to import our packages and look at the data. To do this, we first need to import the packages we want to use. We can do this easily using the following:

import pandas as pd
import numpy as np


You might notice that we’re using some aliases here. This is purely out of convenience - you will get tired of typing “pandas” over and over again, so these two-letter abbreviations save you a bit of time.

Next, to read in the dataset and view the first few rows, we’ll use Pandas’ read_csv function like this:

student_performance_df = pd.read_csv('StudentsPerformance.csv') 
student_performance_df.head()

While we could just print the dataset, above we’re assigning the data frame to a name - student_performance_df. This is so that we can reference it in future operations.

The .head() function in Pandas is used to return the first few rows of a DataFrame. By default, it returns the first five rows of the DataFrame, but you can specify the number of rows you want to return by passing a number as an argument to the .head() function, like .head(10).

The .head() function is often used to quickly inspect the data in a DataFrame and make sure it is in the correct format. It can be useful for quickly checking the structure of the data, the data types of the columns, and the values in the first few rows of the DataFrame.
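As a quick illustration, here is how .head() behaves on a small made-up DataFrame (the values and column names below are invented for the example; they stand in for the real dataset):

```python
import pandas as pd

# A small made-up DataFrame standing in for the real student data
scores_df = pd.DataFrame({
    'math score': [72, 69, 90, 47, 76, 71, 88],
    'reading score': [72, 90, 95, 57, 78, 83, 95],
})

print(scores_df.head())    # first five rows by default
print(scores_df.head(2))   # first two rows
```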

Now, if you run the file, the output should look similar to an Excel spreadsheet. However, in a pandas data frame the data will be easier to work with.

The most common way to run a Python file is to use the command line.

To do this, open a terminal or command prompt, navigate to the directory where your Python file is located, and then type the following command, replacing filename.py with the name of your Python file:

python filename.py

In addition to this method, there are also a number of other ways to run a Python file, such as using an Integrated Development Environment (IDE), running it as a script in a web framework, or using the exec function in the Python interpreter. The exact method you use will depend on your specific needs and the environment in which you are working.

Now we’ll dive into some common functions you probably already know in Excel, and how to do them in Python.

Column Wise Descriptive Statistics

These are functions just about everyone has used before (e.g. Sum, Average, Max, Min, Count). Luckily, they’re easy to replicate using Python. Let’s take a look.

The quickest way to explore quantitative data in a pandas data frame is to use the .describe() function, like this:

student_performance_df.describe()

This will return a breakdown of all numeric columns in your dataset.

With this we get the count, mean, standard deviation, min, max, and quartile values of all numerical columns. The .describe() method in Pandas is used to generate summary statistics for numerical columns in a DataFrame. It provides a quick and easy way to see the count of non-missing values, the mean, the standard deviation, the minimum and maximum, and the 25th, 50th (median), and 75th percentiles of each column.

The .describe() method is often used to quickly get a sense of the distribution of values in a numerical column, or to compare the summary statistics for different columns. It can be particularly useful when working with large datasets, as it allows you to quickly and easily summarize a large amount of data.

If we want to look at these stats for a specific column, we can select a specific column or subset of columns with square brackets [ ] and then use the method we need (e.g. .sum(), .mean()). For example, we can calculate the average reading score like this:

student_performance_df['reading score'].mean()

Most descriptive statistics can be computed like this (e.g. .max(), .min(), etc).
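The same pattern works for every column-wise statistic mentioned above. Here is a sketch using an invented Series of reading scores (not the real data):

```python
import pandas as pd

# Hypothetical reading scores standing in for the real column
reading = pd.Series([72, 90, 95, 57, 78], name='reading score')

print(reading.sum())    # 392
print(reading.mean())   # 78.4
print(reading.max())    # 95
print(reading.min())    # 57
print(reading.count())  # 5
```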

Row Wise Descriptive Statistics

Ok, so let’s say we want to calculate the average of the 3 grades (math, reading, and writing) for each student. Those values are all held in different columns, so we can’t use quite the same operation as before. However, there’s an easy addition you can make to allow this:

student_performance_df['mean_score'] = student_performance_df.mean(axis=1, numeric_only=True)


Here, we’re creating a new column (called mean_score), where we store the average score across the numeric columns for each student. The axis=1 argument tells Pandas that we want to compute the average for a given row across all columns, and numeric_only=True skips non-numeric columns (like gender), which would otherwise cause an error in recent versions of Pandas.

In the case where you want the average (or sum, or some other operation) between only a subset of columns, just remember the square brackets. For a single column you use one set of brackets:

student_performance_df['reading score']

and for multiple columns, you use two sets of brackets:

student_performance_df[['reading score', 'math score']]
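Putting these pieces together, a row-wise average over just two subject columns might look like this (a sketch on invented rows, not the real dataset):

```python
import pandas as pd

# Invented rows; the gender column is non-numeric and deliberately left out
df = pd.DataFrame({
    'math score': [70, 80],
    'reading score': [90, 60],
    'gender': ['female', 'male'],
})

# Double brackets select a subset of columns; axis=1 averages across them
df['avg_score'] = df[['math score', 'reading score']].mean(axis=1)
print(df['avg_score'])
```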

Next, suppose we want to count how many times each value appears in a column. We can use the .value_counts() method like this:

student_performance_df['reading score'].value_counts()

The .value_counts() method in Pandas is used to calculate the frequency of each unique value in a column. It returns a Series object containing the unique values as the index and the number of times each value appears as the values.

The .value_counts() method is often used to quickly get a sense of the distribution of values in a column, or to identify which values appear most frequently. It can be particularly useful when working with large datasets, as it allows you to quickly and easily summarize a large amount of data.
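A minimal sketch of the behavior, using an invented Series with repeated values:

```python
import pandas as pd

# Invented scores with repeats to make the counts interesting
scores = pd.Series([65, 70, 65, 90, 70, 65], name='reading score')

counts = scores.value_counts()
print(counts)  # 65 appears three times, 70 twice, 90 once
```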

If Statements

If you are a regular Excel user, you’ve almost certainly encountered IF statements in Excel.

If statements in Python work the same way as they do in other programming languages or environments. They check whether a certain condition is met, and if it is, they execute a block of code.

In Pandas, this logic is often used to filter a DataFrame or derive new columns based on certain conditions. For example, you might select only the rows that meet a certain criterion, or flag rows for later analysis.

To use an if statement in Pandas to create a new flag column, you can use the .apply() method. The .apply() method allows you to apply a custom function to each value in a column (a Series), or to each row or column of a DataFrame.

For example, you could use the following code to create a pass column indicating whether a given student has passed an exam.

def passing_grade(score, passing_threshold=64):
    if score > passing_threshold:
        return 1
    else:
        return 0

student_performance_df['pass'] = student_performance_df['mean_score'].apply(passing_grade)

In this example, the passing_grade() function is defined to take a single score as input, and it returns a 1 if the score is greater than the 64-point threshold (i.e. 65 or above), or a 0 if it is not. This function is then applied to each value in the mean_score column using the .apply() method.

The result of this code is a new column called pass that contains a 1 for each row where the value in the mean_score column is greater than 64, and a 0 for all other rows. This new column can be used as a flag to identify rows in the DataFrame that meet the passing criterion.

Using if statements with the .apply() method in this way is a powerful way to create new columns in a DataFrame based on custom conditions and logic. It can be useful for many different types of data analysis tasks.
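For simple conditions like this one, a vectorized alternative is often faster than .apply(). One common pattern uses NumPy’s np.where (a sketch on invented data; the 64-point threshold matches the example above):

```python
import numpy as np
import pandas as pd

# Invented mean scores standing in for the computed column
df = pd.DataFrame({'mean_score': [60.0, 64.0, 65.0, 90.0]})

# 1 where mean_score exceeds the threshold, 0 otherwise
df['pass'] = np.where(df['mean_score'] > 64, 1, 0)
print(df)
```

Because np.where operates on the whole column at once rather than calling a Python function per row, it scales much better on large DataFrames.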

Conclusion

This article covered the basics of Pandas that will allow you to replace many Excel functions you already know and love. In our next article we will cover Pandas equivalent to VLOOKUP and more complex functions.
