2023-07-16

Data Analysis Note - Pandas Part 2 - Series

Selecting a single column from a Pandas DataFrame gives you a new data type called Series. A Series is a one-dimensional labeled array of data. It can hold data of any type: strings, integers, floats, dictionaries, lists, booleans, and more.
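A minimal sketch of where a Series comes from. The DataFrame below and its `title` and `price` columns are made-up sample data, not part of the original note.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "price": [12.5, 8.0, 15.0],
})

price = df["price"]        # bracket access returns a Series
same_price = df.price      # attribute access returns the same Series
as_frame = df[["price"]]   # a list of column names returns a DataFrame instead

print(isinstance(price, pd.Series))         # True
print(isinstance(as_frame, pd.DataFrame))   # True
print(price.ndim, as_frame.ndim)            # 1 2
```

Attribute access (`df.price`) only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so bracket access is usually the safer habit.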

The most commonly used methods for working with Series:

1. Basic info checking:


# only display the summary statistics of this selected column:
>> df.column_name.describe()

# Return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
>> df.column_name.value_counts() 
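A small sketch of `describe()` and `value_counts()` on one column. The `genre` Series is invented sample data; only the method calls come from the note above.

```python
import pandas as pd

# Hypothetical sample column with one missing value.
genre = pd.Series(
    ["Fiction", "Non Fiction", "Fiction", None, "Fiction"], name="Genre"
)

summary = genre.describe()                        # count, unique, top, freq for text data
counts = genre.value_counts()                     # most frequent value first, NA excluded
counts_with_na = genre.value_counts(dropna=False) # NA gets its own row

print(summary)
print(counts)
print(counts_with_na)
```

For a numeric column, `describe()` reports count, mean, std, min, quartiles, and max instead of the text summary shown here.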

# count the unique combinations of values in ['column_1','column_2'] and return a Series indexed by a two-level MultiIndex. The dataframe itself is not modified. The result is in descending order of count, so the first element is the most frequently occurring combination. Rows containing NA values are excluded by default.
df[['column_1','column_2']].value_counts()
# for instance, count how many rows (books) fall into each Genre/Year combination:
books[['Genre','Year']].value_counts()
>> result ->
>>  Genre        Year
    Non Fiction  2015    33
                 2016    31
                 2019    30
                 2010    30
                 2018    29
    Fiction      2014    29
    Non Fiction  2012    29
                 2011    29
                 2017    26
                 2013    26
                 2009    26
    Fiction      2017    24
                 2013    24
                 2009    24
                 2018    21
    Non Fiction  2014    21
    Fiction      2012    21
                 2011    21
                 2010    20
                 2019    20
                 2016    19
                 2015    17
    Name: count, dtype: int64
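The result above is a Series with a two-level index, which can be awkward to work with at first. A rough sketch of how to read it, using a tiny made-up `books` frame (not the real dataset from the note):

```python
import pandas as pd

# Hypothetical miniature version of the books data.
books = pd.DataFrame({
    "Genre": ["Fiction", "Fiction", "Non Fiction", "Fiction", "Non Fiction"],
    "Year": [2014, 2014, 2015, 2017, 2015],
})

pair_counts = books[["Genre", "Year"]].value_counts()

# Look up a single (Genre, Year) combination with a tuple.
fiction_2014 = pair_counts.loc[("Fiction", 2014)]

# Flatten the MultiIndex back into ordinary columns; "n" is an arbitrary column name.
flat = pair_counts.reset_index(name="n")

print(fiction_2014)   # 2
print(flat)
```

`reset_index` is handy when the counts need to be merged, plotted, or exported as a regular table.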

# return an array of the unique values, in order of appearance (NaN is included if present)
>> df.column_name.unique()

# return the number of unique values (NaN is NOT counted by default)
>> df.column_name.nunique()
# count NaN as a value as well
>> df.column_name.nunique(dropna=False)
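A quick sketch of how `unique()` and `nunique()` treat missing values differently. The Series `s` is invented sample data.

```python
import numpy as np
import pandas as pd

# Hypothetical column with duplicates and missing values.
s = pd.Series([1.0, 2.0, 2.0, np.nan, np.nan])

values = s.unique()                  # NaN appears once in the result
n = s.nunique()                      # NaN is ignored
n_with_na = s.nunique(dropna=False)  # NaN counts as one value

print(values)
print(n)          # 2
print(n_with_na)  # 3
```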

# return the n largest values, sorted in descending order (here n = 5):
df.column_name.nlargest(5)

# return the n smallest values, sorted in ascending order (here n = 5):
df.column_name.nsmallest(5)
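A short sketch of `nlargest()` and `nsmallest()`. The `price` Series and its letter index are made up for illustration.

```python
import pandas as pd

# Hypothetical prices, labeled a-e so the original positions stay visible.
price = pd.Series([12, 30, 7, 25, 18], index=list("abcde"))

top2 = price.nlargest(2)       # largest first
bottom2 = price.nsmallest(2)   # smallest first

print(top2)
print(bottom2)
```

Both methods keep the original index labels, so the result shows which rows the values came from.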

2. Filter / select from a series:


# filter/select values with Series.where(): values where cond is True are kept, all others are replaced with other (NaN by default)
Series.where(cond, other=nan, *, inplace=False, axis=None, level=None)

# example: extract the title column from the dataframe, then keep only the titles that contain the string 'you' (all other entries become NaN; the Series keeps its original length).
title = df['title']
title.where(lambda x: x.str.contains('you'))
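A sketch contrasting `where()` with plain boolean indexing, since the two are easy to confuse. The titles are invented sample data.

```python
import pandas as pd

# Hypothetical titles.
title = pd.Series(["Thank you", "Hello world", "See you soon"])

mask = title.str.contains("you")

masked = title.where(mask)                        # same length, non-matches become NaN
replaced = title.where(mask, other="(no match)")  # same length, custom replacement
filtered = title[mask]                            # shorter: only the matching rows

print(masked)
print(replaced)
print(filtered)
```

Use `where()` when the result must stay aligned with the original rows, and boolean indexing (optionally followed by `dropna()`) when only the matching rows are wanted.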


# a DataFrame also has a .where() method:
df.where(cond, other=nan, *, inplace=False, axis=None, level=None)
df.where(df<90, "A+")   # keep values where df < 90 and replace all other values with "A+" (the replacement defaults to NaN)
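A small sketch of the DataFrame version, showing which cells are kept and which are replaced. The `scores` frame is made-up sample data.

```python
import pandas as pd

# Hypothetical exam scores.
scores = pd.DataFrame({"math": [85, 92], "art": [95, 70]})

graded = scores.where(scores < 90, "A+")   # cells >= 90 become "A+"
blanked = scores.where(scores < 90)        # cells >= 90 become NaN

print(graded)
print(blanked)
```

If it reads more naturally to state the condition for the cells being replaced, `scores.mask(scores >= 90, "A+")` expresses the same operation from the opposite side.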

3. Converting a Series to a DataFrame:

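# Convert a Series to a DataFrame with Series.to_frame(name); the optional name argument sets the resulting column name, and the Series' own name is used when it is omitted.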
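A minimal sketch of `to_frame()`. The `price` Series and the column name `"usd"` are invented for illustration.

```python
import pandas as pd

# Hypothetical named Series.
price = pd.Series([12.5, 8.0], name="price")

frame = price.to_frame()              # column takes the Series name
renamed = price.to_frame(name="usd")  # column name given explicitly

print(frame)
print(renamed)
```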