Data Analysis Note - Pandas Part 1

This post collects useful notes on pandas (the popular Python package for data analysis). I've listed quite a few of them here for future reference.

The most commonly used methods I've encountered

1. To change the settings for displaying the DataFrame (link to official doc - pandas.get_option):

# Display the current setting for this param ("display.max_rows"):
pd.get_option("display.max_rows")

# Display all rows of the table (careful: this can crash your program on a large dataset):
pd.set_option('display.max_rows', None)

# Other useful settings:
pd.set_option("display.max_colwidth", 1000) # change the maximum column width;
pd.set_option("display.precision", 2) # change how many decimal places are displayed for float figures;
pd.set_option("display.float_format", "{:,}".format) # format floats with thousands separators, e.g. 1,000.0
# This also makes sure scientific notation is not shown.

pd.set_option('display.float_format', "{:,.2f}".format) # format like: 53,844,501.67
pd.set_option('display.float_format', "${:,.2f}".format) # $53,844,501.67
pd.set_option('display.float_format', "{:.0f}".format) # get rid of .0

# To reset to default option settings:
pd.reset_option("display.max_rows") # reset the value for this param;
pd.reset_option("all") # reset all options

2. Read various files with more options:

# Change the column names and dtypes when reading the file:
df = pd.read_csv(r"name.csv",
    names=new_colnames, header=0, dtype={"specific_column": float}, na_values="N.A.") # treat "N.A." as a null value
# names must be used with header=0 to replace the column names and skip the old header row.

# Get a table from a website:
url = 'https://....'
df = pd.read_csv(url, sep=';')
pd.read_table(url, sep=';') # if not a csv file;

# Read a file that is not comma-separated, adding or changing the column names:
col_labels = ['id','title','year','imdb_rating','imdb_di','genres']
df = pd.read_csv(r'.\movies.tsv', sep='\t', names=col_labels)

# 2 ways to parse time when reading a file:
# Method 1 to parse time (using Python's own time module):
import time
d_parser = lambda x: time.strftime("%Y-%m-%d %H", time.strptime(x, "%Y-%m-%d %I-%p"))
df = pd.read_csv(r"D:\table.csv", parse_dates=["Date"], date_parser=d_parser)

# Method 2 to parse time (using datetime from the standard library):
from datetime import datetime
d_parser = lambda x: datetime.strptime(x, "%Y-%m-%d %I-%p")
df = pd.read_csv(r"D:\table.csv", parse_dates=["Date"], date_parser=d_parser)


3. Check DataFrame basic information:

>> df.head() # display the first 5 rows of the table
>> df.info() # includes an estimated memory usage
>> df.info(memory_usage="deep") # includes exact memory usage
>> df.memory_usage(deep=True) # only show exact memory usage per column, in bytes
>> df.dtypes # show the data types of all columns;

# give the length of the dataframe:
>> len(df) # return one number;
>> df.count() # return the count of non-null values for every column;

>> df.columns # display the column names of the table
>> df.columns.value_counts() # count occurrences of each column name, useful for identifying duplicate column names

# display the distinct values and counts for a single column (NaN excluded by default):
>> df["col"].unique() # display the distinct values of a column
>> df["col"].value_counts()
>> df["col"].value_counts(dropna=False) # include NaN values in the counts
>> df["col"].value_counts(normalize=True) # display frequencies / percentages instead of counts

# quickly give you summarized data info:
>> df.describe() # display the summary statistics of the table
>> df.describe(include=np.number) # only display the summary statistics of the numeric columns (requires import numpy as np)
>> df.describe(exclude=["object"]) # only display the summary statistics of columns whose datatypes are not string, aka 'object' here.

# only display the summary statistics of columns whose datatypes are string. The values returned differ from those for numeric columns:
>> df.describe(include=["object"])

# quickly select columns of data by type:
>> df.select_dtypes(include=["object"]) # select columns whose datatype is string.
>> df.select_dtypes(include=np.number) # select columns whose datatype is numeric.
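
A practical follow-up to the memory checks above: object (string) columns with few distinct values can usually be stored far more compactly as the category dtype. A minimal sketch (the column and values are made up):

import pandas as pd

df = pd.DataFrame({"genre": ["Fiction", "Non Fiction"] * 10_000})
print(df.memory_usage(deep=True))   # object column: large
df["genre"] = df["genre"].astype("category")
print(df.memory_usage(deep=True))   # category column: much smaller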

4. Sorting and filtering a DataFrame:

# To select two columns from the df dataframe and return another dataframe "df_extracted":
df_extracted = df[['clinic_north', 'clinic_south']]

# best way to select multiple rows (Best Practice)
df.loc[3:6] # row 6 is inclusive;
df.loc[:, "Date":"Volume"] # .loc to select all data for columns from "Date" to "Volume"

df.loc[::-1] # reverse row order;
df.loc[:, ::-1] # reverse column order;
df_big.loc[52, "pop_2020"] # .loc to select index 52 and column "pop_2020"

# best way to select multiple rows by position
df.iloc[3:6] # row 6 is not inclusive;
df.iloc[2:5, 3:6] # use integer positions for rows and columns, e.g. rows 2 to 4 and columns 3 to 5

df.loc[df.Hobbyist == "No"].index # use .loc and a condition to get the row indexes;

state["area"] # return a pandas Series (one column);
state["area"]["Texas"] # return the specific figure: 695662

# sort the entire dataframe by ['column_1','column_2'] and extract the 10 largest rows;
df.nlargest(10, ['column_1','column_2'])

# sort the df by column "Price" in descending order:
books.sort_values(by="Price", ascending=False)

# sort the df by the author's last name: the lambda splits each cell value (string) of 'Author', then the df is sorted by the last element of the resulting list;
books.sort_values(by="Author", key=lambda s: s.str.split(" ").str[-1])

@isin()
year_to_filter = [2018, 2022]
books.loc[books["Year"].isin(year_to_filter)] # get all books published in 2018 or 2022

# combine conditions to filter:
year_to_filter = [2018, 2019]
# get all books published in 2018 or 2019 that are also 'Fiction':
books.loc[books["Year"].isin(year_to_filter) & (books["Genre"] == "Fiction")]


5. Common methods to clean the data:
**NOTE: missing data in pandas is represented by NaN, which is a numerical value, i.e. float type. However, None is a Python NoneType, so NaN is not equivalent to None.

The best way in pandas to test whether a column contains NaN is to use isnull() or notnull().**
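A quick demonstration of the point above (nothing here beyond standard numpy/pandas behavior):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, None]) # None is converted to NaN in a float Series
print(np.nan == np.nan) # False -> plain equality cannot detect missing values
print(s.isnull())       # True for both the NaN and the converted None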


@isna() # checking na values
>> movies.isna() # Return a DataFrame of the same shape as movies where each element is a boolean: True where the corresponding value in movies is missing (NaN), otherwise False.

>> movies.isna().any() # return a Series indicating, for each column, whether it has any missing values (na)
>> result ->
show_id False
type False
title False
director True
cast True
country True
date_added True
release_year False
rating True
duration True
listed_in False
description False
dtype: bool

>> movies.isna().any().any() # return a single boolean checking whether the whole df has any missing value 'na'.

# use .loc and .isna to find the rows whose 'duration' value is missing and show them as a new dataframe:
>> movies.loc[movies['duration'].isna()]

>> fillna_lookup = {'director':'missing', 'country':'other', 'duration':0}
>> movies.fillna(fillna_lookup) # fill those columns ('director','country','duration') with the corresponding values in the lookup table.



# practical use: fill the missing values with the mean of the column, or with some other aggregate value:
bill_df = bill_df.fillna(value={"bill":bill_df.bill.mean(), "num_guests":bill_df.num_guests.mean()})

@fillna()
>> movies.fillna(0, inplace=True) # replace all missing values with 0 in the movies DataFrame itself; inplace=True modifies the original dataframe.


@dropna() # defaults to dropping rows containing 'na'; use axis=1 to switch to a column operation
>> movies.dropna() # remove all rows in the movies DataFrame where at least one element is missing. After this operation, movies will only contain rows with no missing values.

>> movies.dropna(how='all') # only drop rows whose values are all 'na'

>> movies.dropna(subset=['column_1']) # only drop rows whose value in 'column_1' is 'na'
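
dropna also accepts a thresh argument to keep only rows with at least a given number of non-null values; a small sketch with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, np.nan], "b": [1, 2, np.nan], "c": [1, 2, 3]})
print(df.dropna(thresh=2)) # keep rows that have at least 2 non-null values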

@duplicated()
>> df.duplicated() # check for duplicates; returns a Series telling us which rows are duplicate rows.
>> df.duplicated().any() # get one answer: does the df have any duplicates?

>> df.duplicated().sum() # count how many duplicate rows; can also pass subset=['item'] to consider only certain columns.

@drop_duplicates()
# To drop duplicates: remove all rows that are duplicates of another row.
>> df.drop_duplicates()
>> df.drop_duplicates(inplace=True) # make the change in the existing df.
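
drop_duplicates (like duplicated) also takes subset and keep arguments; a minimal sketch with a made-up 'item' column:

import pandas as pd

df = pd.DataFrame({"item": ["pen", "pen", "book"], "price": [1, 2, 5]})
# keep only the last occurrence of each 'item', comparing just that column:
print(df.drop_duplicates(subset=["item"], keep="last"))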

@Converting_datatype
# Convert strings containing numerical values to integers or floats. If some data cannot be parsed to the desired type, astype() raises an error; in that case use 'pd.to_numeric()' below.
>> us_census = us_census.astype({"Income": 'float', "MalePop": 'float', "FemalePop": 'float'})

# if the data contain strings or values that cannot be converted to numeric in the normal way:
# errors="coerce" forces invalid values to become NaN so the conversion can still go through.
>> new_stds.score = pd.to_numeric(new_stds.score)
>> df.Value = pd.to_numeric(df.Value, errors="coerce")

# astype() also converts a series to a certain data type, and it can be combined with boolean results:
>> df2.country_region.str.contains("ch").astype(int)

# OR use apply to convert the whole df or part of the df:
>> df.loc[:, ["fert_rate", "med_age"]] = df.loc[:, ["fert_rate", "med_age"]].apply(pd.to_numeric, errors="coerce")

@to_datetime
pd.to_datetime(df.visit_time, format="%Y-%m-%d %H:%M:%S") # convert the column to datetime by passing a format (dtype: datetime64[ns])


# e.g. parse all values in the 'date_added' column of the movies df with the format 'September 25, 2021':
pd.to_datetime(movies['date_added'].str.strip(), format="%B %d, %Y") # note str.strip() strips off all spaces before or after the value prior to parsing.

# errors='coerce' forces the column to be parsed as datetime; any value that cannot be parsed becomes 'NaT':
pd.to_datetime(movies['date_added'], errors='coerce')
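
Once a column is datetime64, the .dt accessor exposes the parsed parts; a short self-contained sketch (the dates are made up):

import pandas as pd

dates = pd.to_datetime(pd.Series(["September 25, 2021", "May 1, 2020"]), format="%B %d, %Y")
print(dates.dt.year)       # 2021, 2020
print(dates.dt.day_name()) # Saturday, Friday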