Fundamentals 20 min read

Advanced Pandas Tutorial: Data Access, Indexing, Merging, Grouping, and Time Handling

This article extends a basic pandas tutorial by covering advanced data access methods, index types, concatenation, merging, joining, grouping, aggregation, time series handling, and basic plotting, providing clear code examples and explanations for each feature.

Python Programming Learning Circle
Python Programming Learning Circle
Python Programming Learning Circle
Advanced Pandas Tutorial: Data Access, Indexing, Merging, Grouping, and Time Handling

The article continues from a previous introductory pandas tutorial and assumes the reader has completed the basics. It provides a GitHub repository (pandas_tutorial) for test data and source code.

Data Access – Two fundamental ways to access pandas data are demonstrated using the # select_data.py import pandas as pd import numpy as np series1 = pd.Series([1, 2, 3, 4, 5, 6, 7], index=["C", "D", "E", "F", "G", "A", "B"]) print("series1['E'] = {} \n".format(series1['E'])) print("series1.E = {} \n".format(series1.E)) which outputs the same value via bracket and dot notation. Notes explain that dot notation works only when the index label is a valid Python identifier.

loc and iloc – Access by label and by integer position is shown: # select_data.py import pandas as pd import numpy as np df1 = pd.DataFrame({"note": ["C", "D", "E", "F", "G", "A", "B"], "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]}, index=['1','2','3','4','5','6','7']) print("df1.loc['2']:\n{}\n".format(df1.loc['2'])) print("series1.loc['E':'A']=\n{}\n".format(series1.loc['E':'A'])) print("df1.iloc[2:4]=\n{}\n".format(df1.iloc[2:4])) The output demonstrates selecting a single row, a label slice, and an integer‑position slice.

at and iat – For scalar value access: # select_data.py print("series1.at['E']={}\n".format(series1.at['E'])) print("df1.iloc[4,1]={}\n".format(df1.iloc[4,1])) Outputs the scalar values 3 and "Fri" respectively.

Index Objects – Creating an Index with a name: # index.py index = pd.Index(["C","D","E","F","G","A","B"], name='note') Set operations are illustrated: # index.py a = pd.Index([1,2,3,4,5]) b = pd.Index([3,4,5,6,7]) print("a|b = {}\n".format(a|b)) print("a&b = {}\n".format(a&b)) print("a.difference(b) = {}\n".format(a.difference(b))) Resulting unions, intersections, and differences are shown.

MultiIndex – A hierarchical index is built: # multiindex.py import pandas as pd import numpy as np multiIndex = pd.MultiIndex.from_arrays([ ["Geagle","Geagle","Geagle","Geagle","Epple","Epple","Epple","Epple","Macrosoft","Macrosoft","Macrosoft","Macrosoft"], ["S1","S2","S3","S4","S1","S2","S3","S4","S1","S2","S3","S4"]], names=("Company","Turnover")) df = pd.DataFrame(data=np.random.randint(0,1000,36).reshape(-1,12), index=[2016,2017,2018], columns=multiIndex) print("df = \n{}\n".format(df)) The example shows how to create a DataFrame with a MultiIndex and retrieve specific quarters for each company.

Data Concatenation – Using pd.concat and DataFrame.append to stack DataFrames vertically and horizontally: # concat_append.py import pandas as pd import numpy as np df1 = pd.DataFrame({"Note": ["C","D"], "Weekday": ["Mon","Tue"]}, index=[1,2]) df2 = pd.DataFrame({"Note": ["E","F"], "Weekday": ["Wed","Thu"]}, index=[3,4]) df3 = pd.DataFrame({"Note": ["G","A","B"], "Weekday": ["Fri","Sat","Sun"]}, index=[5,6,7]) df_concat = pd.concat([df1, df2, df3], keys=['df1','df2','df3']) print("df_concat=\n{}\n".format(df_concat)) df_concat_column = pd.concat([df1, df2, df3], axis=1) print("df_concat_column=\n{}\n".format(df_concat_column)) df_append = df1.append([df2, df3]) print("df_append=\n{}\n".format(df_append)) The outputs illustrate vertical concatenation (with a MultiIndex) and horizontal concatenation.

Merge and Join – Demonstrating SQL‑like merges: # merge_join.py import pandas as pd import numpy as np df1 = pd.DataFrame({"key": ["K1","K2","K3","K4"], "A": ["A1","A2","A3","A8"], "B": ["B1","B2","B3","B8"]}) df2 = pd.DataFrame({"key": ["K3","K4","K5","K6"], "A": ["A3","A4","A5","A6"], "B": ["B3","B4","B5","B6"]}) merge_df = pd.merge(df1, df2) merge_inner = pd.merge(df1, df2, how='inner', on=['key']) merge_left = pd.merge(df1, df2, how='left') merge_left_on_key = pd.merge(df1, df2, how='left', on=['key']) merge_right_on_key = pd.merge(df1, df2, how='right', on=['key']) merge_outer = pd.merge(df1, df2, how='outer', on=['key']) print("merge_df=\n{}\n".format(merge_df)) print("merge_inner=\n{}\n".format(merge_inner)) print("merge_left=\n{}\n".format(merge_left)) print("merge_left_on_key=\n{}\n".format(merge_left_on_key)) print("merge_right_on_key=\n{}\n".format(merge_right_on_key)) print("merge_outer=\n{}\n".format(merge_outer)) The results show inner, left, right, and outer joins. A further example uses DataFrame.join with suffixes: # merge_join.py df3 = pd.DataFrame({"key": ["K1","K2","K3","K4"], "A": ["A1","A2","A3","A8"], "B": ["B1","B2","B3","B8"]}, index=[0,1,2,3]) df4 = pd.DataFrame({"key": ["K3","K4","K5","K6"], "C": ["A3","A4","A5","A6"], "D": ["B3","B4","B5","B6"]}, index=[1,2,3,4]) join_df = df3.join(df4, lsuffix='_self', rsuffix='_other') join_left = df3.join(df4, how='left', lsuffix='_self', rsuffix='_other') print("join_df=\n{}\n".format(join_df)) print("join_left=\n{}\n".format(join_left))

Grouping and Aggregation – Using groupby , agg , and apply : # groupby.py import pandas as pd import numpy as np df = pd.DataFrame({ 'Name': ['A','A','A','B','B','B','C','C','C'], 'Data': np.random.randint(0,100,9)}) print('df=\n{}\n'.format(df)) grouped = df.groupby('Name') for name, group in grouped: print('Name: {}\nGroup:\n{}\n'.format(name, group)) print('Sum:') print(grouped.sum()) print('Agg Sum:') print(grouped.agg(['sum'])) print('Agg Map:') print(grouped.agg([('Total','sum'),('Min','min')])) def sort(df): return df.sort_values(by='Data', ascending=False) print('Sort Group: \n{}\n'.format(grouped.apply(sort))) The script shows grouping by a column, computing sums, custom aggregations, and applying a sorting function to each group.

Time Series – Working with timestamps, periods, and intervals using Python's datetime module and pandas' date_range: # time.py import datetime as dt import numpy as np import pandas as pd now = dt.datetime.now() print('Now is {}'.format(now)) yesterday = now - dt.timedelta(1) print('Yesterday is {}'.format(yesterday.strftime('%Y-%m-%d'))) this_year = pd.date_range(dt.datetime(2018,1,1), dt.datetime(2018,12,31), freq='5D') print('Selected days in 2018: \n{}\n'.format(this_year)) df = pd.DataFrame(np.random.randint(0,100,this_year.size), index=this_year) print('Jan: \n{}\n'.format(df['2018-01'])) The example prints the current date, generates a 5‑day frequency date range for 2018, creates a DataFrame indexed by that range, and selects all rows for January.

Plotting – A brief illustration of using matplotlib to plot a histogram of a CSV file: # plot.py import matplotlib.pyplot as plt import pandas as pd data = pd.read_csv('data/housing.csv') data.hist(bins=50, figsize=(15,12)) plt.show() The code reads a housing dataset and displays a histogram.

The article concludes by noting that while it covers many advanced pandas topics, there is still much more to explore, and readers are encouraged to consult the official documentation for deeper learning.

Indexingpandasdata-manipulationmergingGroupingtime-series
Python Programming Learning Circle
Written by

Python Programming Learning Circle

A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.