Miniconda and Conda Virtual Environment

This post covers miniconda, including its installation and managing conda virtual environments. After using the Anaconda suite for quite some time, I found it cumbersome, with many unused tools, a large number of default packages, and a large installation size. Consequently, I decided to uninstall Anaconda and switch to the more lightweight miniconda distribution.

For the miniconda installation, see: miniconda.

Creating and Managing Virtual Environment

1. Open the Anaconda Prompt (terminal)
2. Create a new virtual environment and install a specific Python version into it (a minimal command sketch is shown below)

3. Type “conda env list” to check the virtual environments just created
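
A minimal sketch of the corresponding conda commands (the environment name my_env and Python version 3.9 here are just examples):

# create a new virtual environment with a specific Python version
conda create -n my_env python=3.9

# list all conda environments to confirm it was created
conda env list

# activate / deactivate the environment
conda activate my_env
conda deactivate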

Read More

Data Analysis Note - Pandas Part 4 - Join Tables

The most commonly used methods for working with multiple tables

1. Concatenating Series and DataFrames:

@concat

# concatenate two columns (Series) vertically into one Series
pd.concat([column1, column2])

# concatenate column by column (Series) side by side to form a DataFrame, matching rows by index
pd.concat([column1, column2], axis=1) # the default is axis=0

# stack two DataFrames on top of each other and rebuild a fresh 0..n index
pd.concat([df1, df2], ignore_index=True)

# If the two dfs have different columns, join='inner' keeps only their common columns;
pd.concat([df1, df2], sort=True, ignore_index=True, join='inner') # sort=True orders the column names

# The keys plus the original indexes of the dfs together create a multi-index DataFrame;
pd.concat([df1, df2], sort=True, join='inner', keys=['df1', 'df2']) # keys adds a first level of index


2. Merging dataframes / tables with left, right, inner, and outer joins:
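
A minimal pd.merge sketch for these join types (df1, df2 and the key column 'id' are assumed names for illustration):

# merge two DataFrames on a common key column; how= controls the join type
pd.merge(df1, df2, on='id', how='inner') # keep only keys present in both
pd.merge(df1, df2, on='id', how='left')  # keep all rows of df1
pd.merge(df1, df2, on='id', how='right') # keep all rows of df2
pd.merge(df1, df2, on='id', how='outer') # keep all keys from either side

# merge when the key columns have different names in each DataFrame
pd.merge(df1, df2, left_on='id_1', right_on='id_2', how='left')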

Read More

Data Analysis Note - Pandas Part 3 - Groupby

This post covers useful notes for using pandas (the popular Python package for data analysis). It focuses on more advanced topics such as grouping a dataframe, grouping data with custom Python functions, joining dataframes, etc.

1. Groupby dataframe:

# The best way to check whether this column has any null values:
# It is good practice to check for nulls before a groupby:
df.name.isnull().any()

movies_gb_rating = movies.groupby('rating')

movies_gb_rating.ngroups # return the total number of groups

movies_gb_rating.groups
# return a dict-like object: each key is a group name and each value is the list of row indexes in that group

>> result ->
{'G': [107, 933, 1126, 1557, 2911, 3059, ...], 'NC-17': [5264, 6338, 7263], 'NR': [5971, 5987, 6015, 6054, 6087, 6093, 6095, 6179, 6190, 6207, 6208, 6226], 'PG': [6, 41, 42, 43, 45, 94, 115, 127, 137, 155, 156, 167, 168, 231, 300, 325, 326, 330, 344, 345, 347, 393, ...], 'PG-13': [0, 9, 27, 28, 29, 38, 44, 88, 112, 117, 128, 129, 134, 138, 142, ...], ...}


movies_gb_rating.first() # get the first row of each group, combined into one dataframe:

# get all rows (as a dataframe) that belong to a specific group
movies_gb_rating.get_group('G') # similar to filtering the result by rating == 'G'

# iterate through all groups and print each group name and its rows
for name, group in movies_gb_rating:
    print(name)
    print('--------')
    print(group)


grouped_df = df.groupby('student').grade.mean()
grouped_df = df.groupby('student', as_index=False).grade.mean() # as_index=False means: do not set the 'student' column as the index

#column1 is the column that we want to group by ('student' in our example)
#column2 is the column that we want to perform a measurement on (grade in our example)
#measurement is the measurement function we want to apply (mean in our example)
df.groupby('column1').column2.measurement()
pricey_shoes = orders.groupby("shoe_type").price.max() # type is <class 'pandas.core.series.Series'>

# use reset_index() to reset the index and convert the result to a DataFrame (from the books dataframe);
# without reset_index() the result would be a Series
df = books.groupby('Genre')['Price'].max().reset_index()
df >>
Genre Price
0 Fiction 82
1 Non Fiction 105

# rename the column 'Price' to "Max_price"
df.rename(columns = {"Price": "Max_price"}, inplace = True)

df >>
Genre Max_price
0 Fiction 82
1 Non Fiction 105


# The following command groups the books by year, calculates the number of authors and the max and average prices for each year, renames the columns, and sorts the results. The output is a sorted DataFrame where each row corresponds to a unique year and the columns are the calculated values.
@sort_values()
books.fillna(0).groupby('Year').agg({'Author':'count', 'Price':['max', 'mean']})\
.rename(columns={"count":'num_of_authors', "max":'max_price', "mean":'avg_price'})\
.sort_values(by=[('Author', 'num_of_authors'), ('Price', 'avg_price')], ascending=[False, True])
>> result ->
Author Price
num_of_authors max_price avg_price
Year
2019 50 27 10.08
2015 50 46 10.42
2018 50 46 10.52

# for instance, in 2019 there are 50 authors who published books recorded in this dataset; the max price (the highest price of a book sold that year) is $27 and the average price for that year is $10.08

2. Grouped data practical usage: the agg function
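
A minimal agg sketch (reusing the books dataframe and its 'Genre', 'Price' and 'Author' columns from the example above):

# apply several aggregations to one column at once
books.groupby('Genre').Price.agg(['min', 'max', 'mean'])

# apply different aggregations to different columns
books.groupby('Genre').agg({'Price': 'mean', 'Author': 'nunique'})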

Read More

Matplotlib Notes

This post covers useful notes for using matplotlib (plotting charts for data analysis). I’ve listed quite a lot of useful notes for future reference.

The most commonly used methods I've encountered

1. Presettings

# import the package:
import matplotlib.pyplot as plt

# check available predefined styles:
>> plt.style.available

>> ['Solarize_Light2',
'_classic_test_patch',
'_mpl-gallery',
'_mpl-gallery-nogrid',
'bmh',
'classic',
'dark_background',
'fast',
'fivethirtyeight',
'ggplot',
'grayscale',
'seaborn-v0_8',
...
'seaborn-v0_8-white',
'seaborn-v0_8-whitegrid',
'tableau-colorblind10']

# to use one of the styles (the name must match an entry in plt.style.available):
>> plt.style.use("seaborn-v0_8-notebook")


#NOTE: to save a figure to file, plt.show() must come after the plt.savefig() call; once the figure is shown it is cleared, so saving afterwards gives a blank image;

#We can use the command plt.savefig() to save out to many different file formats, such as png, svg, or pdf. After plotting, we can call plt.savefig('name_of_graph.png'):
plt.savefig('image.png') # png is high quality but a larger file;
plt.savefig('image.png', dpi = 300) # a higher dpi gives a higher-resolution image (and a larger file);
plt.savefig('subfolder/filename.png')
plt.savefig('filename.jpg', quality = 50) # jpg is a smaller file but lower quality; quality applies only to 'jpg'/'jpeg' (newer matplotlib versions expect pil_kwargs={'quality': 50} instead);

2. More examples for plotting charts
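
A minimal plotting sketch (the data values here are made up for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 6, 3]

plt.plot(x, y, label='sample line') # draw a line chart
plt.xlabel('x value')
plt.ylabel('y value')
plt.title('Simple example')
plt.legend()
plt.savefig('simple_example.png', dpi=300) # save before plt.show()
plt.show()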

Read More

Data Analysis Note - Pandas Part 2 - Series

When working with a specific column in pandas, you are working with a datatype called a Series. A Series is a one-dimensional array of data. It can hold data of any type: strings, integers, floats, dictionaries, lists, booleans, and more.
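
A minimal sketch of creating a Series (the values here are made up for illustration):

import pandas as pd

# a Series built from a list, with an explicit index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s['b'] # access by label -> 20

# selecting a single column of a DataFrame also returns a Series
df = pd.DataFrame({'price': [10, 20, 30]})
type(df['price']) # <class 'pandas.core.series.Series'>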

The most commonly used methods for working with Series

1. Basic info checking:

# only display the summary statistics of this selected column:
>> df.column_name.describe()

# Return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
>> df.column_name.value_counts()

# count the unique combinations of ['column_1','column_2'] and return a Series (with a two-level index) containing the counts. The resulting object is in descending order so that the first element is the most frequently-occurring combination. Excludes NA values by default.
df[['column_1','column_2']].value_counts()
# for instance, get the number of books sold in certain years based on different genre:
books[['Genre','Year']].value_counts()
>> result ->
>> Genre Year
Non Fiction 2015 33
2016 31
2019 30
2010 30
2018 29
Fiction 2014 29
Non Fiction 2012 29
2011 29
2017 26
2013 26
2009 26
Fiction 2017 24
2013 24
2009 24
2018 21
Non Fiction 2014 21
Fiction 2012 21
2011 21
2010 20
2019 20
2016 19
2015 17
Name: count, dtype: int64

# returns an array of the unique values (includes 'NaN' values by default)
>> df.column_name.unique()

# returns the number of unique values (does NOT include 'NaN' values by default)
>> df.column_name.nunique()
# to include 'NaN' values
>> df.column_name.nunique(dropna=False)

# return the n largest items
df.column_name.nlargest(5)

# return the n smallest items
df.column_name.nsmallest(5)

2. Filter / select from a series:
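
A minimal filtering sketch (df, column_name and other_column follow the placeholder naming used above):

# boolean mask: keep only the rows where the column is greater than 100
df[df.column_name > 100]

# combine conditions with & (and) / | (or); wrap each condition in parentheses
df[(df.column_name > 100) & (df.other_column == 'Fiction')]

# isin() to match against a list of values
df[df.column_name.isin(['Fiction', 'Non Fiction'])]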

Read More

Data Analysis Note - Pandas Part 1

This post covers useful notes for using pandas (the popular Python package for data analysis). I’ve listed quite a lot of useful notes for future reference.

The most commonly used methods I've encountered

1. To change the settings for displaying the dataframe (link to official doc - pandas.get_option):

# Display the current setting for this param ("display.max_rows"):
pd.get_option("display.max_rows")

# Display all rows of the whole table (careful: this can crash your program on a large dataset):
pd.set_option('display.max_rows', None)

# Other useful settings:
pd.set_option("display.max_colwidth", 1000) # change the maximum column width;
pd.set_option("display.precision", 2) # change how many decimal places are displayed for floats (2 here is just an example);
pd.set_option("display.float_format", "{:,}".format) # format floats with a thousands separator, e.g. 1,000.0
# This also makes sure scientific notation is not shown.

pd.set_option('display.float_format', "{:,.2f}".format) # format like: 53,844,501.67
pd.set_option('display.float_format', "${:,.2f}".format) # $53,844,501.67
pd.set_option('display.float_format', "{:.0f}".format) # get rid of .0

# To reset to the default option settings:
pd.reset_option("display.max_rows") # reset only this option,
pd.reset_option("all") # reset all options

2. Read various files with more options:

# Change the column names and dtypes when reading the file:
df = pd.read_csv(r"name.csv",
                 names = new_colnames, header=0, dtype = {"specific_column":float}, na_values = "N.A.") # treat "N.A." as a null value
# names must be used together with header=0 to replace the column names and skip the old header row.

# Get data / a table from a website:
url = 'https://....'
df = pd.read_csv(url, sep=';')
pd.read_table(url, sep=';') # if not a csv file;

# Read a file that is not comma-separated and add or change the column names:
col_labels=['id','title','year','imdb_rating','imdb_di','genres']
df = pd.read_csv(r'.\movies.tsv', sep='\t', names=col_labels)

# 2 ways to parse dates when reading a file:
# Method 1: using Python's own time module
import time
d_parser = lambda x: time.strftime("%Y-%m-%d %H", time.strptime(x, "%Y-%m-%d %I-%p"))
df = pd.read_csv(r"D:\table.csv", parse_dates=["Date"], date_parser = d_parser)

# Method 2: using the standard datetime module
from datetime import datetime
d_parser = lambda x: datetime.strptime(x, "%Y-%m-%d %I-%p")
df = pd.read_csv(r"D:\table.csv", parse_dates=["Date"], date_parser = d_parser)


3. Check the dataframe's basic information:
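
A minimal sketch of the usual first checks (df is any DataFrame loaded as above):

df.shape      # (number of rows, number of columns)
df.info()     # column names, dtypes, non-null counts, memory usage
df.head()     # first 5 rows
df.describe() # summary statistics for the numeric columns
df.columns    # column names
df.dtypes     # dtype of each column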

Read More

Python Virtual Environment

This post details the creation of a Python virtual environment using ‘venv’, the Python built-in package, and installing a specific Python version and the necessary packages.

Link to venv.

Creating and Managing Virtual Environment for Python project

1. Open a terminal and create a project directory

D:\Pyproject>mkdir testing_env_project

2. cd into the directory and create a virtual environment with a specific Python version (here, for instance, Python 3.8)

D:\Pyproject>cd testing_env_project

D:\Pyproject\testing_env_project>py -3.8 -m venv my_env_name

Then you will see a new folder ‘my_env_name’ created under ‘testing_env_project’.

# ls to check all the folders and files created under it.
D:\Pyproject\testing_env_project> ls */*

my_env_name/pyvenv.cfg

my_env_name/Include:

my_env_name/Lib:
site-packages/

my_env_name/Scripts:
activate Activate.ps1 easy_install.exe* pip.exe* pip3.exe* pythonw.exe*
activate.bat deactivate.bat easy_install-3.8.exe* pip3.8.exe* python.exe*

3. Activate the virtual environment you’ve just created with the following command

D:\Pyproject\testing_env_project> my_env_name\Scripts\activate

(my_env_name) D:\Pyproject\testing_env_project>

4. Then use the following command to install the specific packages you want in this virtual environment

(my_env_name) D:\Pyproject\testing_env_project> pip install package_name
# example to install python packages numpy and pandas
(my_env_name) D:\Pyproject\testing_env_project> pip install numpy pandas

5. Open the project in the VS Code editor using the following command:

(my_env_name) D:\Pyproject\testing_env_project> code .

6. List and save all the packages installed in this virtual environment to a txt file so that others can replicate the same project when it is shared

(my_env_name) D:\Pyproject\testing_env_project> pip freeze > requirements.txt

Finally, you will be able to work on this project in the VS Code editor.