Using Python (pandas) to Perform Common Excel Data Processing Tasks
This article demonstrates how to replace typical Excel operations such as VLOOKUP, pivot tables, duplicate removal, missing‑value handling, multi‑condition filtering, fuzzy matching, column splitting, outlier replacement, grouping and labeling with concise Python pandas code to streamline data analysis workflows.
After learning Python, the author uses it to handle Excel data, aiming to reinforce Python skills and improve data‑processing capability.
1. VLOOKUP Equivalent
Split the original table into two DataFrames and merge them to retrieve profit for each order using df_c = pd.merge(df1, df2, on="订单明细号", how="left") .
2. Pivot Table
Calculate total and average profit per salesperson in each region with pd.pivot_table(sale, index="地区名称", columns="业务员名称", values="利润", aggfunc=[np.sum, np.mean]) .
3. Compare Two Columns
Create a modified order‑detail column and find differences using result = sale.loc[sale["订单明细号"].isin(sale["订单明细号2"]) == False] .
4. Remove Duplicates
Drop duplicate salesperson codes: sale.drop_duplicates("业务员编码", inplace=True) .
5. Missing‑Value Handling
Identify columns with missing values via sale.info() , then fill or drop them: sale["客户名称"] = sale["客户名称"].fillna(0) and sale.dropna(subset=["客户编码"]) .
6. Multi‑Condition Filtering
Retrieve orders where Beijing salesperson "张爱" sold items over 6000: sale.loc[(sale["地区名称"]=="北京") & (sale["业务员名称"]=="张爱") & (sale["订单金额"]>5000)] .
7. Fuzzy Filtering
Select rows with product names containing "三星" or "索尼": sale.loc[sale["存货名称"].str.contains("三星|索尼")] .
8. Group Summary
Summarize total profit per salesperson in Beijing: sale.groupby(["地区名称", "业务员名称"])["利润"].sum() .
9. Conditional Statistics
For products with "三星" in the name and tax > 1000, compute count, sum and average profit using sale.loc[sale["存货名称"].str.contains("三星") & (sale["税费"]>=1000)][["订单明细号", "利润"]].describe() .
10. Trim Spaces
Remove surrounding spaces from product names: sale["存货名称"] = sale["存货名称"].map(lambda s: s.strip()) .
11. Split Columns
Separate date and time from the "单据日期" column: sale = pd.merge(sale, pd.DataFrame(sale["单据日期"].str.split(" ", expand=True)), how="inner", left_index=True, right_index=True) .
12. Outlier Replacement
Replace negative tax values (considered outliers) with zero: sale["订单金额"] = sale["订单金额"].replace(min(sale["订单金额"]), 0) .
13. Profit‑Based Grouping
Classify regions into "较差", "中等", "较好", "非常好" based on total profit quartiles using pd.cut (bins: [-10,7091,10952,17656,37556]).
14. Business‑Logic Labeling
Label products with profit margin >30% as "优质商品" and <5% as "一般商品": sale.loc[(sale["利润"]/sale["订单金额"])>0.3, "label"] = "优质商品" and sale.loc[(sale["利润"]/sale["订单金额"])<0.05, "label"] = "一般商品" .
In conclusion, both Excel and Python are valuable tools; the choice depends on the specific task, and a good data analyst should use whichever tool solves the problem efficiently.
Python Programming Learning Circle
A global community of Chinese Python developers offering technical articles, columns, original video tutorials, and problem sets. Topics include web full‑stack development, web scraping, data analysis, natural language processing, image processing, machine learning, automated testing, DevOps automation, and big data.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.