2023-10-15

Data Analysis Note - Pandas Part 4 - Join Tables

The most commonly used methods for working with multiple tables

1. Concating series and dataframes:


@concat

# concat both columns into 1 column (1 series)
pd.concat([column1, column2])   

# concat the column by column (series) to form a dataframe and using same index to match each row by row.
pd.concat([column1, column2], axis=1)     # axis=1 default is 0;

pd.concat([df,df2], ignore_index=True) 

# If two dfs have different columns, then using join='inner' would only join their mutual columns;
pd.concat([df1, df2], sort=True, ignore_index=True,  join='inner',)   # sort=True is to order the column names

# Keys and then original indexes of the df would togeter create a multi-index dataframe;
pd.concat([df1, df2], sort=True, join='inner', keys=['df1', 'df2']) # Using keys to give a first level of index

2. Merging dataframes / tables on left, right, inner, outer joining methods:


# Outer join:
store_a_b_outer = pd.merge(store_a,store_b, how="outer") # default merger method is 'inner'
# Left Join
store_a_b_left = pd.merge(store_a, store_b, how="left")
# Right Join
store_b_a_left = pd.merge(store_a, store_b, how="right")

# To join two table with same multi-index:
new_df = pd.merge(df1, df2, left_index=True, right_index=True)

# Join based on multiple columns:
merged_df = pd.merge(country_df, economies_df, left_on = ['code', 'year'], right_on = ['code', 'year'])

# chain merge
all_data = sales.merge(targets).merge(men_women)

# Inner join:
# when two differnt table both have "id" column with differt table means:
# First way to do:
pd.merge(orders, products.rename(columns={"id":"product_id"}))
# Second way to do:
pd.merge(orders, products, left_on="product_id", right_on = "id", suffixes=["_order", "_product"])


# Cross join
# Cross join: needs to create same column to each df. then join on the same new column, key column in this case:
cities['key'] = 1
languages['key'] = 1
city_lang = pd.merge(cities, languages, on='key', how='outer')

# Intersect between two df: using isin():
cities.loc[cities.name.isin(countries.country_name), 'name']