Data Analysis Note - Pandas Part 3 - Groupby

This post covers useful notes for pandas (the popular Python package for data analysis). It moves on to more advanced topics such as grouping a DataFrame, grouping data with custom Python functions, joining DataFrames, etc.

1. Groupby a DataFrame:

# It is good practice to check whether a column has any null values before groupby:
df.name.isnull().any()

movies_gb_rating = movies.groupby('rating')

movies_gb_rating.ngroups # returns the total number of groups

movies_gb_rating.groups
# returns a dict-like object whose keys are the group names and whose values are the lists of row indexes contained in each group

>> results ->
{'G': [107, 933, 1126, 1557, 2911, 3059, ...], 'NC-17': [5264, 6338, 7263], 'NR': [5971, 5987, 6015, 6054, 6087, 6093, 6095, 6179, 6190, 6207, 6208, 6226], 'PG': [6, 41, 42, 43, 45, 94, 115, 127, 137, 155, 156, 167, 168, 231, 300, 325, 326, 330, 344, 345, 347, 393, ...], 'PG-13': [0, 9, 27, 28, 29, 38, 44, 88, 112, 117, 128, 129, 134, 138, 142, ...], ...}


movies_gb_rating.first() # get the first row of each group, collected into a DataFrame

# get all rows (a DataFrame) that belong to a specific group;
movies_gb_rating.get_group('G') # similar to filtering the rows by rating == 'G'

# iterate through all groups and print each group name and its rows
for name, group in movies_gb_rating:
    print(name)
    print('--------')
    print(group)


grouped_df = df.groupby('student').grade.mean()
grouped_df = df.groupby('student', as_index=False).grade.mean() # as_index=False keeps 'student' as a regular column instead of making it the index

#column1 is the column that we want to group by ('student' in our example)
#column2 is the column that we want to perform a measurement on (grade in our example)
#measurement is the measurement function we want to apply (mean in our example)
df.groupby('column1').column2.measurement()
pricey_shoes = orders.groupby("shoe_type").price.max() # type is <class 'pandas.core.series.Series'>
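A minimal runnable sketch of this pattern (with made-up student data) shows what as_index=False changes:

```python
import pandas as pd

# Toy data (hypothetical): two students with two grades each.
df = pd.DataFrame({
    "student": ["Amy", "Amy", "Bob", "Bob"],
    "grade": [90, 80, 70, 60],
})

# Default: the grouping column becomes the index; the result is a Series.
by_index = df.groupby("student").grade.mean()
print(by_index.loc["Amy"])  # 85.0

# as_index=False: 'student' stays a regular column; the result is a DataFrame.
flat = df.groupby("student", as_index=False).grade.mean()
print(list(flat.columns))  # ['student', 'grade']
```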

# use the reset_index() function to reset the index and convert the result to a DataFrame (books dataframe);
# without reset_index() it will be a Series
df = books.groupby('Genre')['Price'].max().reset_index()
df >>
Genre Price
0 Fiction 82
1 Non Fiction 105

# rename the column 'Price' to 'Max_price'
df.rename(columns={"Price": "Max_price"}, inplace=True)

df >>
Genre Max_price
0 Fiction 82
1 Non Fiction 105


# The following command groups the books by year, counts the authors, computes the max and average price for each year, renames the columns, and sorts the results. The output is a sorted DataFrame where each row corresponds to a unique year and the columns hold the calculated values.
@sort_values()
books.fillna(0).groupby('Year').agg({'Author': 'count', 'Price': ['max', 'mean']})\
    .rename(columns={"count": 'num_of_authors', "max": 'max_price', "mean": 'avg_price'})\
    .sort_values(by=[('Author', 'num_of_authors'), ('Price', 'avg_price')], ascending=[False, True])
>> result ->
Author Price
num_of_authors max_price avg_price
Year
2019 50 27 10.08
2015 50 46 10.42
2018 50 46 10.52

# For instance, in 2019 there are 50 authors who published books recorded in this dataset; the max price (the highest price of a book sold that year) is $27 and the average price that year is $10.08.
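The two-level rename above can be avoided entirely with named aggregation (available since pandas 0.25), which produces flat column names directly. A sketch with a toy stand-in for the books data:

```python
import pandas as pd

# Toy stand-in for the books data (made-up values).
books = pd.DataFrame({
    "Year": [2018, 2018, 2019],
    "Author": ["A", "B", "C"],
    "Price": [10, 20, 30],
})

# Named aggregation: new_column=(source_column, function),
# so no MultiIndex rename is needed afterwards.
out = books.groupby("Year").agg(
    num_of_authors=("Author", "count"),
    max_price=("Price", "max"),
    avg_price=("Price", "mean"),
)
print(out.loc[2018, "avg_price"])  # 15.0
```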

2. Group data practical usage: agg function

# Group by a column ('Genre' in this case) to get a GroupBy object, then iterate over its (key, group) pairs:
grouped_by_genre = books.groupby('Genre')
for key, group in grouped_by_genre:
    print(key, type(group))

# The above code prints:
Fiction <class 'pandas.core.frame.DataFrame'>
Non Fiction <class 'pandas.core.frame.DataFrame'>

# to print the columns of each group:
for key, group in grouped_by_genre:
    print(key, group.columns)

# The above code prints:
Fiction Index(['Name', 'Author', 'User Rating', 'Reviews', 'Price', 'Year', 'Genre'], dtype='object')
Non Fiction Index(['Name', 'Author', 'User Rating', 'Reviews', 'Price', 'Year', 'Genre'], dtype='object')


@agg() # with customized function:
def price_range(s): # get the price range of each group
    return s.max() - s.min()

grouped_by_genre['Price'].agg(['min', 'max', price_range])


def count_nulls(s): # count the null values in each group
    return s.size - s.count()

grouped_by_genre['Author'].agg(count_nulls) # apply it to a column that may contain nulls
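A self-contained sketch of count_nulls in action, on toy data with one missing author name:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing author (made-up data).
df = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction"],
    "Author": ["A", np.nan, "C"],
})

def count_nulls(s):
    # .size counts all rows; .count() skips NaN, so the difference is the null count.
    return s.size - s.count()

nulls = df.groupby("Genre")["Author"].agg(count_nulls)
print(nulls["Fiction"])  # 1
```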

3. Group data advanced usage: apply function

@apply()
# using groupby to combine with apply() function.

# find the 100th percentile of the book price within each 'Genre' group (i.e. the highest price):
expensive_books=books.groupby("Genre").Price.apply(lambda x: np.percentile(x,100)).reset_index()
expensive_books
>>
Genre Price
0 Fiction 82.0
1 Non Fiction 105.0

# find the 25th percentile of the book price within each 'Genre' group (i.e. the cheap-book price):
cheap_books=books.groupby("Genre").Price.apply(lambda x: np.percentile(x,25)).reset_index()
cheap_books
>>
Genre Price
0 Fiction 6.0
1 Non Fiction 8.0

# find the 50th percentile of the book price within each 'Genre' group (i.e. the median price):
median_books=books.groupby("Genre").Price.apply(lambda x: np.percentile(x,50)).reset_index()
median_books
>>
Genre Price
0 Fiction 9.0
1 Non Fiction 12.0
# above is the same to "books.groupby('Genre')['Price'].median()"

# calculate the proportion of book names in each genre that contain the word 'love'.
books.groupby('Genre').Name.apply(lambda s: s.str.lower().str.contains('love').sum() / s.count())
>>
Genre
Fiction 0.008333
Non Fiction 0.067742
Name: Name, dtype: float64
# books.groupby('Genre'): This groups the books DataFrame by the ‘Genre’ column. Each group corresponds to a unique genre.
#.Name.apply(lambda s: s.str.lower().str.contains('love').sum()/s.count()): This applies a function to the ‘Name’ column of each group. The function does the following:
#s.str.lower(): Converts all the book names in the group to lowercase.
#.str.contains('love'): Checks if the word ‘love’ is in each book name. This returns a Series of True/False values.
#.sum(): Counts the number of True values, i.e., the number of book names that contain ‘love’.
#/s.count(): Divides the count by the total number of books in the group.


books.groupby('Genre').apply(lambda df: (df['Reviews'] / df['User Rating']).mean(), include_groups=False).sort_values()
# books.groupby('Genre'): This groups the books DataFrame by the ‘Genre’ column. Each group corresponds to a unique Genre.
# .apply(lambda df: (df['Reviews'] / df['User Rating']), include_groups=False): This applies a function to each group. The function does the following:
# (df['Reviews'] / df['User Rating']): For each group (i.e., for each genre), it divides the 'Reviews' column by the 'User Rating' column. This operation is performed element-wise, giving how many reviews correspond to one point of user rating.
# include_groups=False: This is an option for the apply() function. By setting it to False, the ‘Genre’ column (the grouping column) is excluded from the operation.
# .sort_values(): This sorts the resulting Series in ascending order.


# This adds a new column 'Max_Price_By_Genre' to the books DataFrame. The transform() function
# applies a function to each group and broadcasts the result back to every row of that group; here the 'max' function returns the maximum price for each genre.

books['Max_Price_By_Genre'] = books.groupby('Genre')['Price'].transform('max')
>>result ->
Name Author User Rating Reviews Price Year Genre Max_Price_By_Genre
0 bookname1 JJ Smith 4.7 17350 8 2016 Non Fiction 105
1 bookname2 Stephen 4.6 2052 22 2011 Fiction 82
2 bookname3 Jordan B 4.7 18979 15 2018 Non Fiction 105
3 bookname4 George O 4.7 21424 6 2017 Fiction 82


@filter() # this filter works together with groupby():
# group by 'Year', then use each group's mean price to filter the entire DataFrame; note that the returned df is the original df filtered, not the grouped one:
def filter_price(df):
    return df['Price'].mean() > 15

books.groupby('Year').filter(filter_price)
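A runnable sketch of the same filter on toy data: only years whose mean price exceeds 15 survive, and the result keeps the original rows:

```python
import pandas as pd

# Toy stand-in for the books data (made-up prices).
books = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019],
    "Price": [10, 12, 20, 30],
})

def filter_price(df):
    return df["Price"].mean() > 15

# 2018's mean price is 11 (group dropped); 2019's is 25 (group kept).
kept = books.groupby("Year").filter(filter_price)
print(sorted(kept["Year"].unique()))  # [2019]
```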


4. Reduce data size with the 'category' datatype:
NOTE: Only suitable for columns that have a limited number of distinct string values, for instance the type column of a movies DataFrame, or a gender column of a students DataFrame, and the like.

# Using 'category' can massively reduce the size of a DataFrame:
'''If a column's dtype is object and the DataFrame is large, the following converts that column from 'object' to 'category'. In essence, it replaces each original string with an int code that maps to an entry in a lookup table of category values.'''

# Internally, the column now stores integer codes instead of strings, and each code maps to one of the original string values in the lookup table. For instance, the following converts the 'type' column of a 'movies' DataFrame to a category.

# This line will convert the column 'type' to datatype of 'category'
movies['type'] = movies['type'].astype('category')

# inspect the integer codes stored in this column,
movies['type'].cat.codes.head() # the codes differ from the strings shown by movies['type']

# Or you can convert when reading the df:
dtypes = {"type":"category"}
movies = pd.read_csv(r".\database\movies.csv", dtype=dtypes)
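A quick way to see the saving is to compare memory_usage(deep=True) before and after the conversion. A sketch with a made-up rating column:

```python
import pandas as pd

# Toy column with only a few distinct strings (made-up data).
s = pd.Series(["PG", "R", "PG", "PG-13"] * 25000)

as_object = s.memory_usage(deep=True)
as_category = s.astype("category").memory_usage(deep=True)

# The category version stores small integer codes plus one copy of each
# label, so it is far smaller than storing 100,000 Python strings.
print(as_category < as_object)  # True
```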



# Sort value with Category type with custom order (Customized order):
# Convert to ordered categorical type with custom ordering: ['North America' < 'Asia' < 'Europe' < 'Oceania' < 'South America' < 'Africa']
cat_dtype = pd.api.types.CategoricalDtype(categories=["North America", "Asia", "Europe", "Oceania", "South America", "Africa"], ordered=True)
df["continent"] = df.continent.astype(cat_dtype) # Categories (6, object): ['North America' < 'Asia' < 'Europe' < 'Oceania' < 'South America' < 'Africa']
# then you can sort values using the custom order:
df.continent.sort_values()
# and you can filter values using the custom order:
df.loc[df.continent > "Oceania"]
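A minimal sketch of ordered-category sorting and comparison on a toy Series:

```python
import pandas as pd

cat_dtype = pd.api.types.CategoricalDtype(
    categories=["North America", "Asia", "Europe", "Oceania",
                "South America", "Africa"],
    ordered=True,
)
s = pd.Series(["Africa", "Asia", "Oceania"]).astype(cat_dtype)

# sort_values() follows the custom category order, not alphabetical order.
print(list(s.sort_values()))  # ['Asia', 'Oceania', 'Africa']

# Comparisons against a category label also use the custom order.
print(list(s[s > "Oceania"]))  # ['Africa']
```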