Miniconda and Conda Virtual Environment

This post covers miniconda, including its installation and managing conda virtual environments. After using the Anaconda suite for quite some time, I found it cumbersome, with many unused tools, a large number of default packages, and a large installation size. Consequently, I decided to uninstall Anaconda and switch to the more lightweight miniconda distribution.

For the miniconda installation, see: miniconda.

Creating and Managing Virtual Environment

1. Open the Anaconda Prompt (terminal)
2. Create a new virtual environment and install a specific Python version into it (a minimal command sketch is shown below)

3. Type “conda env list” to check the virtual environments just created
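
A minimal sketch of the corresponding conda commands (the environment name my_env and Python version 3.9 here are just examples):

# create a new virtual environment with a specific Python version
conda create -n my_env python=3.9

# list all conda environments to confirm it was created
conda env list

# activate / deactivate the environment
conda activate my_env
conda deactivate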

Read More

Data Analysis Note - Pandas Part 4 - Join Tables

The most commonly used methods for working with multiple tables

1. Concatenating Series and DataFrames:

@concat

# concatenate two columns (Series) vertically into one Series
pd.concat([column1, column2])

# concatenate column by column (Series) side by side to form a DataFrame, matching rows by index
pd.concat([column1, column2], axis=1) # the default is axis=0

# stack two DataFrames on top of each other and rebuild a fresh 0..n index
pd.concat([df1, df2], ignore_index=True)

# If the two dfs have different columns, join='inner' keeps only their common columns;
pd.concat([df1, df2], sort=True, ignore_index=True, join='inner') # sort=True orders the column names

# The keys plus the original indexes of the dfs together create a multi-index DataFrame;
pd.concat([df1, df2], sort=True, join='inner', keys=['df1', 'df2']) # keys adds a first level of index


2. Merging dataframes / tables with left, right, inner, and outer joins:
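
A minimal pd.merge sketch for these join types (df1, df2 and the key column 'id' are assumed names for illustration):

# merge two DataFrames on a common key column; how= controls the join type
pd.merge(df1, df2, on='id', how='inner') # keep only keys present in both
pd.merge(df1, df2, on='id', how='left')  # keep all rows of df1
pd.merge(df1, df2, on='id', how='right') # keep all rows of df2
pd.merge(df1, df2, on='id', how='outer') # keep all keys from either side

# merge when the key columns have different names in each DataFrame
pd.merge(df1, df2, left_on='id_1', right_on='id_2', how='left')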

Read More

Data Analysis Note - Pandas Part 3 - Groupby

This post covers useful notes for using pandas (the popular Python package for data analysis). It focuses on more advanced topics such as grouping a dataframe, grouping data with custom Python functions, joining dataframes, etc.

1. Groupby dataframe:

# The best way to check whether this column has any null values:
# It is good practice to check for nulls before a groupby:
df.name.isnull().any()

movies_gb_rating = movies.groupby('rating')

movies_gb_rating.ngroups # return the total number of groups

movies_gb_rating.groups
# return a dict-like object: each key is a group name and each value is the list of row indexes in that group

>> result ->
{'G': [107, 933, 1126, 1557, 2911, 3059, ...], 'NC-17': [5264, 6338, 7263], 'NR': [5971, 5987, 6015, 6054, 6087, 6093, 6095, 6179, 6190, 6207, 6208, 6226], 'PG': [6, 41, 42, 43, 45, 94, 115, 127, 137, 155, 156, 167, 168, 231, 300, 325, 326, 330, 344, 345, 347, 393, ...], 'PG-13': [0, 9, 27, 28, 29, 38, 44, 88, 112, 117, 128, 129, 134, 138, 142, ...], ...}


movies_gb_rating.first() # get the first row of each group, combined into one dataframe:

# get all rows (as a dataframe) that belong to a specific group
movies_gb_rating.get_group('G') # similar to filtering the result by rating == 'G'

# iterate through all groups and print each group name and its rows
for name, group in movies_gb_rating:
    print(name)
    print('--------')
    print(group)


grouped_df = df.groupby('student').grade.mean()
grouped_df = df.groupby('student', as_index=False).grade.mean() # as_index=False means: do not set the 'student' column as the index

#column1 is the column that we want to group by ('student' in our example)
#column2 is the column that we want to perform a measurement on (grade in our example)
#measurement is the measurement function we want to apply (mean in our example)
df.groupby('column1').column2.measurement()
pricey_shoes = orders.groupby("shoe_type").price.max() # type is <class 'pandas.core.series.Series'>

# use reset_index() to reset the index and convert the result to a DataFrame (from the books dataframe);
# without reset_index() the result would be a Series
df = books.groupby('Genre')['Price'].max().reset_index()
df >>
Genre Price
0 Fiction 82
1 Non Fiction 105

# rename the column 'Price' to "Max_price"
df.rename(columns = {"Price": "Max_price"}, inplace = True)

df >>
Genre Max_price
0 Fiction 82
1 Non Fiction 105


# The following command groups the books by year, calculates the number of authors and the max and average prices for each year, renames the columns, and sorts the results. The output is a sorted DataFrame where each row corresponds to a unique year and the columns are the calculated values.
@sort_values()
books.fillna(0).groupby('Year').agg({'Author':'count', 'Price':['max', 'mean']})\
.rename(columns={"count":'num_of_authors', "max":'max_price', "mean":'avg_price'})\
.sort_values(by=[('Author', 'num_of_authors'), ('Price', 'avg_price')], ascending=[False, True])
>> result ->
Author Price
num_of_authors max_price avg_price
Year
2019 50 27 10.08
2015 50 46 10.42
2018 50 46 10.52

# for instance, in 2019 there are 50 authors who published books recorded in this dataset; the max price (the highest price of a book sold that year) is $27 and the average price for that year is $10.08

2. Grouped data practical usage: the agg function
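
A minimal agg sketch (reusing the books dataframe and its 'Genre', 'Price' and 'Author' columns from the example above):

# apply several aggregations to one column at once
books.groupby('Genre').Price.agg(['min', 'max', 'mean'])

# apply different aggregations to different columns
books.groupby('Genre').agg({'Price': 'mean', 'Author': 'nunique'})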

Read More

Matplotlib Notes

This post covers useful notes for using matplotlib (plotting charts for data analysis). I’ve listed quite a lot of useful notes for future reference.

The most commonly used methods I've encountered

1. Presettings

# import the package:
import matplotlib.pyplot as plt

# check available predefined styles:
>> plt.style.available

>> ['Solarize_Light2',
'_classic_test_patch',
'_mpl-gallery',
'_mpl-gallery-nogrid',
'bmh',
'classic',
'dark_background',
'fast',
'fivethirtyeight',
'ggplot',
'grayscale',
'seaborn-v0_8',
...
'seaborn-v0_8-white',
'seaborn-v0_8-whitegrid',
'tableau-colorblind10']

# to use one of the styles (the name must match an entry in plt.style.available):
>> plt.style.use("seaborn-v0_8-notebook")


#NOTE: to save a figure to file, plt.show() must come after the plt.savefig() call; once the figure is shown it is cleared, so saving afterwards gives a blank image;

#We can use the command plt.savefig() to save out to many different file formats, such as png, svg, or pdf. After plotting, we can call plt.savefig('name_of_graph.png'):
plt.savefig('image.png') # png is high quality but a larger file;
plt.savefig('image.png', dpi = 300) # a higher dpi gives a higher-resolution image (and a larger file);
plt.savefig('subfolder/filename.png')
plt.savefig('filename.jpg', quality = 50) # jpg is a smaller file but lower quality; quality applies only to 'jpg'/'jpeg' (newer matplotlib versions expect pil_kwargs={'quality': 50} instead);

2. More examples for plotting charts
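
A minimal plotting sketch (the data values here are made up for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 6, 3]

plt.plot(x, y, label='sample line') # draw a line chart
plt.xlabel('x value')
plt.ylabel('y value')
plt.title('Simple example')
plt.legend()
plt.savefig('simple_example.png', dpi=300) # save before plt.show()
plt.show()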

Read More

Data Analysis Note - Pandas Part 2 - Series

When working with a specific column in pandas, you are working with a datatype called a Series. A Series is a one-dimensional array of data. It can hold data of any type: strings, integers, floats, dictionaries, lists, booleans, and more.
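
A minimal sketch of creating a Series (the values here are made up for illustration):

import pandas as pd

# a Series built from a list, with an explicit index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s['b'] # access by label -> 20

# selecting a single column of a DataFrame also returns a Series
df = pd.DataFrame({'price': [10, 20, 30]})
type(df['price']) # <class 'pandas.core.series.Series'>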

The most commonly used methods for working with Series

1. Basic info checking:

# only display the summary statistics of this selected column:
>> df.column_name.describe()

# Return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
>> df.column_name.value_counts()

# count the unique combinations of ['column_1','column_2'] and return a Series (with a two-level index) containing the counts. The resulting object is in descending order so that the first element is the most frequently-occurring combination. Excludes NA values by default.
df[['column_1','column_2']].value_counts()
# for instance, get the number of books sold in certain years based on different genre:
books[['Genre','Year']].value_counts()
>> result ->
>> Genre Year
Non Fiction 2015 33
2016 31
2019 30
2010 30
2018 29
Fiction 2014 29
Non Fiction 2012 29
2011 29
2017 26
2013 26
2009 26
Fiction 2017 24
2013 24
2009 24
2018 21
Non Fiction 2014 21
Fiction 2012 21
2011 21
2010 20
2019 20
2016 19
2015 17
Name: count, dtype: int64

# returns an array of the unique values (includes 'NaN' values by default)
>> df.column_name.unique()

# returns the number of unique values (does NOT include 'NaN' values by default)
>> df.column_name.nunique()
# to include 'NaN' values
>> df.column_name.nunique(dropna=False)

# return the n largest items
df.column_name.nlargest(5)

# return the n smallest items
df.column_name.nsmallest(5)

2. Filter / select from a series:
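
A minimal filtering sketch (df, column_name and other_column follow the placeholder naming used above):

# boolean mask: keep only the rows where the column is greater than 100
df[df.column_name > 100]

# combine conditions with & (and) / | (or); wrap each condition in parentheses
df[(df.column_name > 100) & (df.other_column == 'Fiction')]

# isin() to match against a list of values
df[df.column_name.isin(['Fiction', 'Non Fiction'])]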

Read More

Data Analysis Note - Pandas Part 1

This post covers useful notes for using pandas (the popular Python package for data analysis). I’ve listed quite a lot of useful notes for future reference.

The most commonly used methods I've encountered

1. To change the settings for displaying the dataframe (link to official doc - pandas.get_option):

# Display the current setting for this param ("display.max_rows"):
pd.get_option("display.max_rows")

# Display all rows of the whole table (careful: this can crash your program on a large dataset):
pd.set_option('display.max_rows', None)

# Other useful settings:
pd.set_option("display.max_colwidth", 1000) # change the maximum column width;
pd.set_option("display.precision", 2) # change how many decimal places are displayed for floats (2 here is just an example);
pd.set_option("display.float_format", "{:,}".format) # format floats with a thousands separator, e.g. 1,000.0
# This also makes sure scientific notation is not shown.

pd.set_option('display.float_format', "{:,.2f}".format) # format like: 53,844,501.67
pd.set_option('display.float_format', "${:,.2f}".format) # $53,844,501.67
pd.set_option('display.float_format', "{:.0f}".format) # get rid of .0

# To reset to the default option settings:
pd.reset_option("display.max_rows") # reset only this option,
pd.reset_option("all") # reset all options

2. Read various files with more options:

# Change the column names and dtypes when reading the file:
df = pd.read_csv(r"name.csv",
                 names = new_colnames, header=0, dtype = {"specific_column":float}, na_values = "N.A.") # treat "N.A." as a null value
# names must be used together with header=0 to replace the column names and skip the old header row.

# Get data / a table from a website:
url = 'https://....'
df = pd.read_csv(url, sep=';')
pd.read_table(url, sep=';') # if not a csv file;

# Read a file that is not comma-separated and add or change the column names:
col_labels=['id','title','year','imdb_rating','imdb_di','genres']
df = pd.read_csv(r'.\movies.tsv', sep='\t', names=col_labels)

# 2 ways to parse dates when reading a file:
# Method 1: using Python's own time module
import time
d_parser = lambda x: time.strftime("%Y-%m-%d %H", time.strptime(x, "%Y-%m-%d %I-%p"))
df = pd.read_csv(r"D:\table.csv", parse_dates=["Date"], date_parser = d_parser)

# Method 2: using the standard datetime module
from datetime import datetime
d_parser = lambda x: datetime.strptime(x, "%Y-%m-%d %I-%p")
df = pd.read_csv(r"D:\table.csv", parse_dates=["Date"], date_parser = d_parser)


3. Check the dataframe's basic information:
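
A minimal sketch of the usual first checks (df is any DataFrame loaded as above):

df.shape      # (number of rows, number of columns)
df.info()     # column names, dtypes, non-null counts, memory usage
df.head()     # first 5 rows
df.describe() # summary statistics for the numeric columns
df.columns    # column names
df.dtypes     # dtype of each column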

Read More

Python Virtual Environment

This post details the creation of a Python virtual environment using ‘venv’, the Python built-in package, and installing a specific Python version and the necessary packages.

Link to venv.

Creating and Managing Virtual Environment for Python project

1. Open a terminal and create a project directory

D:\Pyproject>mkdir testing_env_project

2. cd into the directory and create a virtual environment with a specific Python version (here, for instance, Python 3.8)

D:\Pyproject>cd testing_env_project

D:\Pyproject\testing_env_project>py -3.8 -m venv my_env_name

Then you will see a new folder ‘my_env_name’ created under ‘testing_env_project’.

# ls to check all the folders and files created under it.
D:\Pyproject\testing_env_project> ls */*

my_env_name/pyvenv.cfg

my_env_name/Include:

my_env_name/Lib:
site-packages/

my_env_name/Scripts:
activate Activate.ps1 easy_install.exe* pip.exe* pip3.exe* pythonw.exe*
activate.bat deactivate.bat easy_install-3.8.exe* pip3.8.exe* python.exe*

3. Activate the virtual environment you’ve just created with the following command

D:\Pyproject\testing_env_project> my_env_name\Scripts\activate

(my_env_name) D:\Pyproject\testing_env_project>

4. Then use the following command to install the specific packages you want in this virtual environment

(my_env_name) D:\Pyproject\testing_env_project> pip install package_name
# example to install python packages numpy and pandas
(my_env_name) D:\Pyproject\testing_env_project> pip install numpy pandas

5. Open the project in the VS Code editor using the following command:

(my_env_name) D:\Pyproject\testing_env_project> code .

6. List and save all the packages installed in this virtual environment to a txt file so that others can replicate the same project when it is shared

(my_env_name) D:\Pyproject\testing_env_project> pip freeze > requirements.txt

Finally, you will be able to work on this project in the VS Code editor.