Python Pandas Refresher
If you do not use pandas often, searching the documentation every time is a time-consuming step.
Core concepts
- DataFrame (usually instantiated as df) is a table with row and column names/indices
  - row index defaults to 0:n (that is, 0 to n-1), may be specified otherwise.
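- avoid chained (indexing then) assignment!
  e.g. df["col2"][df["col1"] > 4] = 0 may silently leave df unchanged, because the assignment can land on a temporary copy. Assign in one step instead: df.loc[df["col1"] > 4, "col2"] = 0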
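The chained-assignment pitfall above can be sketched as follows. This is a minimal illustration on a made-up three-row frame (the data values are assumptions, not from the original notes). It shows only the single-step .loc form, which takes the row mask and the column label in one call.

```python
import pandas as pd

# Made-up example data, for illustration only.
df = pd.DataFrame({"col1": [1, 5, 8], "col2": [10, 20, 30]})

# Chained form (avoid): df["col2"][df["col1"] > 4] = 0
# It indexes twice, so the assignment may be applied to a temporary copy.

# Single-step form: boolean row mask and column label in one .loc call.
df.loc[df["col1"] > 4, "col2"] = 0

print(df["col2"].tolist())  # [10, 0, 0]
```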
Import export
- import from
  - files
    - table, csv: pd.read_csv…
    - json: pd.read_json
      - take note of the orientations: 'split', 'records', 'index', 'columns', 'values', see doc
  - memory: pd.DataFrame.from_dict or from_records
- export
  - type df.to and check auto-complete for the available writers (df.to_csv, df.to_json, and so on)
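A small round trip may make the import and export calls above more concrete. This is only a sketch: the records are made up, and the text is kept in memory with io.StringIO instead of real files so the example needs no paths.

```python
import io

import pandas as pd

# Made-up in-memory data, for illustration only.
records = [{"col1": 1, "col2": "a"}, {"col1": 5, "col2": "b"}]

# From memory: a list of row dicts, or a dict of columns.
df = pd.DataFrame.from_records(records)
df_from_dict = pd.DataFrame.from_dict({"col1": [1, 5], "col2": ["a", "b"]})

# CSV round trip: to_csv returns a string when no path is given.
csv_text = df.to_csv(index=False)
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON round trip: the orient used for reading should match the one used for writing.
json_text = df.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text), orient="records")
```

With real files, the same calls take a path (or file object) in place of the StringIO buffer.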
Using
- quick view:
  - df.head(n=5)
  - df.columns for the column names
  - df.index for row indices
  - df.info()
  - df.describe() for statistics (mean, std, etc)
- access:
  - df.col_label or df['col_label'] to get one column
  - loc for label, iloc for integer index (like usual matrix)!
  - ignore .at, it is for single value access

        df[["col1", "col2"]]  # select multiple columns
        df["col1"], df.col1  # select one specific column (two equivalent forms)
        df.loc[["row1", "row2"], ["col1", "col2"]]  # select specific rows and columns
        df.loc["row4":"row7", "col3":"col5"]  # select row and column ranges
        df.iloc[[1, 2, 5, 8], 5:8]  # select by indices
- others
  - df.sort_index or df.sort_values
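The quick-view, access and sorting calls above can be tried on a small frame. The sketch below assumes a hypothetical 10x8 table with labels row0..row9 and col0..col7 (these names and values are illustrative, not from the original notes).

```python
import numpy as np
import pandas as pd

# Hypothetical 10x8 frame: the value at (row r, col c) is r * 8 + c.
df = pd.DataFrame(
    np.arange(80).reshape(10, 8),
    index=[f"row{i}" for i in range(10)],
    columns=[f"col{j}" for j in range(8)],
)

# Quick view
first_rows = df.head(n=5)
column_names = list(df.columns)
summary = df.describe()  # count, mean, std, min, quartiles, max per column

# Access
one_column = df["col1"]                          # a single column as a Series
by_label = df.loc["row4":"row7", "col3":"col5"]  # label slices include both ends
by_position = df.iloc[[1, 2, 5, 8], 5:8]         # integer slices exclude the end

# Sorting
by_values = df.sort_values("col0", ascending=False)  # largest col0 first
by_index = by_values.sort_index()                    # back to row label order
```

Note the asymmetry: the loc slice returns rows row4 through row7 inclusive, while the iloc slice 5:8 returns positions 5, 6 and 7 only.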
Plotting
- matplotlib: MATLAB-like, not the easiest, low level
- seaborn: wrapper of matplotlib, still requires matplotlib for tweaks e.g. axis etc
- ggplot: from R, quite some learning curve
- bokeh: looks nice and clean, interactive, but must be on notebook (more or less)
- pygal: svg, no pandas integration, interactive-ish graph
- plotly: interactive, high level