Data Analysis Note - Pandas Part 3 - Groupby

This post covers useful notes for pandas (the popular Python package for data analysis). It moves on to more advanced topics such as grouping a DataFrame, grouping data with custom Python functions, joining DataFrames, etc.

1. Groupby a DataFrame:

# It is good practice to check whether a column has any null values before groupby:
df.name.isnull().any()

movies_gb_rating = movies.groupby('rating')

movies_gb_rating.ngroups # returns the total number of groups

movies_gb_rating.groups
# returns a dict-like object whose keys are the group names and whose values are the lists of row indexes contained in each group

>> results ->
{'G': [107, 933, 1126, 1557, 2911, 3059, ...], 'NC-17': [5264, 6338, 7263], 'NR': [5971, 5987, 6015, 6054, 6087, 6093, 6095, 6179, 6190, 6207, 6208, 6226], 'PG': [6, 41, 42, 43, 45, 94, 115, 127, 137, 155, 156, 167, 168, 231, 300, 325, 326, 330, 344, 345, 347, 393, ...], 'PG-13': [0, 9, 27, 28, 29, 38, 44, 88, 112, 117, 128, 129, 134, 138, 142, ...], ...}


movies_gb_rating.first() # get the first row of each group, collected into a DataFrame

# get all rows (a DataFrame) that belong to a specific group;
movies_gb_rating.get_group('G') # similar to filtering the rows by rating == 'G'

# iterate through all groups and print each group name and its rows
for name, group in movies_gb_rating:
    print(name)
    print('--------')
    print(group)


grouped_df = df.groupby('student').grade.mean()
grouped_df = df.groupby('student', as_index=False).grade.mean() # as_index=False keeps 'student' as a regular column instead of making it the index

#column1 is the column that we want to group by ('student' in our example)
#column2 is the column that we want to perform a measurement on (grade in our example)
#measurement is the measurement function we want to apply (mean in our example)
df.groupby('column1').column2.measurement()
pricey_shoes = orders.groupby("shoe_type").price.max() # type is <class 'pandas.core.series.Series'>
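A minimal runnable sketch of this pattern (with made-up student data) shows what as_index=False changes:

```python
import pandas as pd

# Toy data (hypothetical): two students with two grades each.
df = pd.DataFrame({
    "student": ["Amy", "Amy", "Bob", "Bob"],
    "grade": [90, 80, 70, 60],
})

# Default: the grouping column becomes the index; the result is a Series.
by_index = df.groupby("student").grade.mean()
print(by_index.loc["Amy"])  # 85.0

# as_index=False: 'student' stays a regular column; the result is a DataFrame.
flat = df.groupby("student", as_index=False).grade.mean()
print(list(flat.columns))  # ['student', 'grade']
```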

# use the reset_index() function to reset the index and convert the result to a DataFrame (books dataframe);
# without reset_index() it will be a Series
df = books.groupby('Genre')['Price'].max().reset_index()
df >>
Genre Price
0 Fiction 82
1 Non Fiction 105

# rename the column 'Price' to 'Max_price'
df.rename(columns={"Price": "Max_price"}, inplace=True)

df >>
Genre Max_price
0 Fiction 82
1 Non Fiction 105


# The following command groups the books by year, counts the authors, computes the max and average price for each year, renames the columns, and sorts the results. The output is a sorted DataFrame where each row corresponds to a unique year and the columns hold the calculated values.
@sort_values()
books.fillna(0).groupby('Year').agg({'Author': 'count', 'Price': ['max', 'mean']})\
    .rename(columns={"count": 'num_of_authors', "max": 'max_price', "mean": 'avg_price'})\
    .sort_values(by=[('Author', 'num_of_authors'), ('Price', 'avg_price')], ascending=[False, True])
>> result ->
Author Price
num_of_authors max_price avg_price
Year
2019 50 27 10.08
2015 50 46 10.42
2018 50 46 10.52

# For instance, in 2019 there are 50 authors who published books recorded in this dataset; the max price (the highest price of a book sold that year) is $27 and the average price that year is $10.08.
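The two-level rename above can be avoided entirely with named aggregation (available since pandas 0.25), which produces flat column names directly. A sketch with a toy stand-in for the books data:

```python
import pandas as pd

# Toy stand-in for the books data (made-up values).
books = pd.DataFrame({
    "Year": [2018, 2018, 2019],
    "Author": ["A", "B", "C"],
    "Price": [10, 20, 30],
})

# Named aggregation: new_column=(source_column, function),
# so no MultiIndex rename is needed afterwards.
out = books.groupby("Year").agg(
    num_of_authors=("Author", "count"),
    max_price=("Price", "max"),
    avg_price=("Price", "mean"),
)
print(out.loc[2018, "avg_price"])  # 15.0
```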

2. Group data practical usage: agg function

# Group by a column ('Genre' in this case) to get a GroupBy object, then iterate over its (key, group) pairs:
grouped_by_genre = books.groupby('Genre')
for key, group in grouped_by_genre:
    print(key, type(group))

# The above code prints:
Fiction <class 'pandas.core.frame.DataFrame'>
Non Fiction <class 'pandas.core.frame.DataFrame'>

# to print the columns of each group:
for key, group in grouped_by_genre:
    print(key, group.columns)

# The above code prints:
Fiction Index(['Name', 'Author', 'User Rating', 'Reviews', 'Price', 'Year', 'Genre'], dtype='object')
Non Fiction Index(['Name', 'Author', 'User Rating', 'Reviews', 'Price', 'Year', 'Genre'], dtype='object')


@agg() # with customized function:
def price_range(s): # get the price range of each group
    return s.max() - s.min()

grouped_by_genre['Price'].agg(['min', 'max', price_range])


def count_nulls(s): # count the null values in each group
    return s.size - s.count()

grouped_by_genre['Author'].agg(count_nulls) # apply it to a column that may contain nulls
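A self-contained sketch of count_nulls in action, on toy data with one missing author name:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing author (made-up data).
df = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction"],
    "Author": ["A", np.nan, "C"],
})

def count_nulls(s):
    # .size counts all rows; .count() skips NaN, so the difference is the null count.
    return s.size - s.count()

nulls = df.groupby("Genre")["Author"].agg(count_nulls)
print(nulls["Fiction"])  # 1
```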

3. Group data advanced usage: apply function

@apply()
# using groupby to combine with apply() function.

# find the 100th percentile of the book price within each 'Genre' group (i.e. the highest price):
expensive_books=books.groupby("Genre").Price.apply(lambda x: np.percentile(x,100)).reset_index()
expensive_books
>>
Genre Price
0 Fiction 82.0
1 Non Fiction 105.0

# find the 25th percentile of the book price within each 'Genre' group (i.e. the cheap-book price):
cheap_books=books.groupby("Genre").Price.apply(lambda x: np.percentile(x,25)).reset_index()
cheap_books
>>
Genre Price
0 Fiction 6.0
1 Non Fiction 8.0

# find the 50th percentile of the book price within each 'Genre' group (i.e. the median price):
median_books=books.groupby("Genre").Price.apply(lambda x: np.percentile(x,50)).reset_index()
median_books
>>
Genre Price
0 Fiction 9.0
1 Non Fiction 12.0
# above is the same to "books.groupby('Genre')['Price'].median()"

# calculate the proportion of book names in each genre that contain the word 'love'.
books.groupby('Genre').Name.apply(lambda s: s.str.lower().str.contains('love').sum() / s.count())
>>
Genre
Fiction 0.008333
Non Fiction 0.067742
Name: Name, dtype: float64
# books.groupby('Genre'): This groups the books DataFrame by the ‘Genre’ column. Each group corresponds to a unique genre.
#.Name.apply(lambda s: s.str.lower().str.contains('love').sum()/s.count()): This applies a function to the ‘Name’ column of each group. The function does the following:
#s.str.lower(): Converts all the book names in the group to lowercase.
#.str.contains('love'): Checks if the word ‘love’ is in each book name. This returns a Series of True/False values.
#.sum(): Counts the number of True values, i.e., the number of book names that contain ‘love’.
#/s.count(): Divides the count by the total number of books in the group.


books.groupby('Genre').apply(lambda df: (df['Reviews'] / df['User Rating']).mean(), include_groups=False).sort_values()
# books.groupby('Genre'): This groups the books DataFrame by the ‘Genre’ column. Each group corresponds to a unique Genre.
# .apply(lambda df: (df['Reviews'] / df['User Rating']), include_groups=False): This applies a function to each group. The function does the following:
# (df['Reviews'] / df['User Rating']): For each group (i.e., for each genre), it divides the 'Reviews' column by the 'User Rating' column. This operation is performed element-wise, giving how many reviews correspond to one point of user rating.
# include_groups=False: This is an option for the apply() function. By setting it to False, the ‘Genre’ column (the grouping column) is excluded from the operation.
# .sort_values(): This sorts the resulting Series in ascending order.


# This adds a new column 'Max_Price_By_Genre' to the books DataFrame. The transform() function
# applies a function to each group and broadcasts the result back to every row of that group; here the 'max' function returns the maximum price for each genre.

books['Max_Price_By_Genre'] = books.groupby('Genre')['Price'].transform('max')
>>result ->
Name Author User Rating Reviews Price Year Genre Max_Price_By_Genre
0 bookname1 JJ Smith 4.7 17350 8 2016 Non Fiction 105
1 bookname2 Stephen 4.6 2052 22 2011 Fiction 82
2 bookname3 Jordan B 4.7 18979 15 2018 Non Fiction 105
3 bookname4 George O 4.7 21424 6 2017 Fiction 82


@filter() # this filter works together with groupby():
# group by 'Year', then use each group's mean price to filter the entire DataFrame; note that the returned df is the original df filtered, not the grouped one:
def filter_price(df):
    return df['Price'].mean() > 15

books.groupby('Year').filter(filter_price)
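A runnable sketch of the same filter on toy data: only years whose mean price exceeds 15 survive, and the result keeps the original rows:

```python
import pandas as pd

# Toy stand-in for the books data (made-up prices).
books = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019],
    "Price": [10, 12, 20, 30],
})

def filter_price(df):
    return df["Price"].mean() > 15

# 2018's mean price is 11 (group dropped); 2019's is 25 (group kept).
kept = books.groupby("Year").filter(filter_price)
print(sorted(kept["Year"].unique()))  # [2019]
```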


4. Reduce data size with the 'category' datatype:
NOTE: Only suitable for columns that have a limited number of distinct string values, for instance the type column of a movies DataFrame, or a gender column of a students DataFrame, and the like.

# Using 'category' can massively reduce the size of a DataFrame:
'''If a column's dtype is object and the DataFrame is large, the following converts that column from 'object' to 'category'. In essence, it replaces each original string with an int code that maps to an entry in a lookup table of category values.'''

# Internally, the column now stores integer codes instead of strings, and each code maps to one of the original string values in the lookup table. For instance, the following converts the 'type' column of a 'movies' DataFrame to a category.

# This line will convert the column 'type' to datatype of 'category'
movies['type'] = movies['type'].astype('category')

# inspect the integer codes stored in this column,
movies['type'].cat.codes.head() # the codes differ from the strings shown by movies['type']

# Or you can convert when reading the df:
dtypes = {"type":"category"}
movies = pd.read_csv(r".\database\movies.csv", dtype=dtypes)
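A quick way to see the saving is to compare memory_usage(deep=True) before and after the conversion. A sketch with a made-up rating column:

```python
import pandas as pd

# Toy column with only a few distinct strings (made-up data).
s = pd.Series(["PG", "R", "PG", "PG-13"] * 25000)

as_object = s.memory_usage(deep=True)
as_category = s.astype("category").memory_usage(deep=True)

# The category version stores small integer codes plus one copy of each
# label, so it is far smaller than storing 100,000 Python strings.
print(as_category < as_object)  # True
```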



# Sort value with Category type with custom order (Customized order):
# Convert to ordered categorical type with custom ordering: ['North America' < 'Asia' < 'Europe' < 'Oceania' < 'South America' < 'Africa']
cat_dtype = pd.api.types.CategoricalDtype(categories=["North America", "Asia", "Europe", "Oceania", "South America", "Africa"], ordered=True)
df["continent"] = df.continent.astype(cat_dtype) # Categories (6, object): ['North America' < 'Asia' < 'Europe' < 'Oceania' < 'South America' < 'Africa']
# then you can sort values using the custom order:
df.continent.sort_values()
# and you can filter values using the custom order:
df.loc[df.continent > "Oceania"]
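A minimal sketch of ordered-category sorting and comparison on a toy Series:

```python
import pandas as pd

cat_dtype = pd.api.types.CategoricalDtype(
    categories=["North America", "Asia", "Europe", "Oceania",
                "South America", "Africa"],
    ordered=True,
)
s = pd.Series(["Africa", "Asia", "Oceania"]).astype(cat_dtype)

# sort_values() follows the custom category order, not alphabetical order.
print(list(s.sort_values()))  # ['Asia', 'Oceania', 'Africa']

# Comparisons against a category label also use the custom order.
print(list(s[s > "Oceania"]))  # ['Africa']
```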