Python Pandas Refresher
When you do not use pandas often, searching for documentation is a time-consuming step.
Core concepts
- DataFrame (usually instantiated as `df`) is a table with row and column names/indices
  - row index defaults to `0:n` (that is, 0 to n-1), may be specified otherwise.
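- avoid chained assignment (indexing, then assigning)! e.g. `df["col2"][df["col1"] > 4] = 0` may silently leave `df` unchanged, because the first indexing step can return a copy; use a single `df.loc[df["col1"] > 4, "col2"] = 0` instead.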
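As a minimal sketch of the chained-assignment point above, the following uses a small made-up DataFrame (the column values are illustrative only) and does the conditional write in one `.loc` call, so pandas knows the target is the original table.

```python
import pandas as pd

# Illustrative data; the values are made up for this sketch.
df = pd.DataFrame({"col1": [1, 5, 8], "col2": [10, 20, 30]})

# Default row index is 0..n-1 unless specified otherwise.
default_index = list(df.index)

# Avoid: df["col2"][df["col1"] > 4] = 0
# The first [] may hand back a temporary copy, so the write can be lost.

# Prefer: row mask and column label in a single .loc indexing step.
df.loc[df["col1"] > 4, "col2"] = 0

print(df)
```

The same pattern works for any boolean mask: put the row condition first and the column label second.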
Import export
- import from
  - files
    - table, csv: `pd.read_csv` …
    - json: `pd.read_json`
      - take note of the orientations: ‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, see doc
  - memory: `pd.DataFrame.from_dict` or `from_records`
- export: type `df.to` and check auto-complete (e.g. `df.to_csv`, `df.to_json`)
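A short sketch of the import and export calls listed above. The data is invented for illustration, and `io.StringIO` stands in for a real file so the snippet runs without anything on disk; with real files you would pass a path instead.

```python
import io

import pandas as pd

# CSV: read_csv accepts a path or any file-like object.
csv_text = "name,score\nann,1\nbob,2\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: orient must match how the data is laid out.
# 'records' means a list of {column: value} rows.
records_json = '[{"name": "ann", "score": 1}, {"name": "bob", "score": 2}]'
df_json = pd.read_json(io.StringIO(records_json), orient="records")

# In memory: a dict of columns, or a list of row records.
df_dict = pd.DataFrame.from_dict({"name": ["ann", "bob"], "score": [1, 2]})
df_rec = pd.DataFrame.from_records(
    [{"name": "ann", "score": 1}, {"name": "bob", "score": 2}]
)

# Export: the to_* methods return a string when no path is given.
csv_out = df_csv.to_csv(index=False)
json_out = df_csv.to_json(orient="split")
```

All four DataFrames here hold the same two rows; only the input layout differs.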
Using
- quick view:
  - `df.head(n=5)`
  - `df.columns` for the column names
  - `df.index` for row indices
  - `df.info()`
  - `df.describe()` for statistics (mean, std, etc)
- access:
  - `df.col_label` or `df['col_label']` to get one column
  - `loc` for label, `iloc` for integer index (like usual matrix)!
  - ignore `.at`, it is for single value access

        df[["col1", "col2"]]                           # select multiple columns
        df["col1"], df.col1                            # select one column (two equivalent forms)
        df.loc[["row1", "row2"], ["col1", "col2"]]     # select specific rows and columns
        df.loc["row4":"row7", "col3":"col5"]           # select row and column ranges
        df.iloc[[1, 2, 5, 8], 5:8]                     # select by integer positions
- others: `df.sort_index` or `df.sort_values`
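The quick-view, access and sorting calls above can be tried on a tiny table. This is only a sketch: the row and column labels and the values are invented for illustration.

```python
import pandas as pd

# Illustrative table with explicit row labels, deliberately unsorted.
df = pd.DataFrame(
    {"col1": [3, 1, 2], "col2": [30, 10, 20], "col3": ["c", "a", "b"]},
    index=["row3", "row1", "row2"],
)

# Quick view
first_two = df.head(n=2)
col_names = list(df.columns)
row_names = list(df.index)
stats = df.describe()  # summarises the numeric columns by default

# Access: loc uses labels, iloc uses integer positions.
by_label = df.loc["row1", "col2"]
by_position = df.iloc[0, 1]  # first row, second column
label_slice = df.loc["row3":"row1", "col1":"col2"]  # label slices include both ends
pos_slice = df.iloc[0:2, 0:2]  # position slices exclude the end

# Sorting returns a new DataFrame and leaves df untouched.
by_index = df.sort_index()
by_value = df.sort_values("col1", ascending=False)
```

Note the asymmetry: `label_slice` and `pos_slice` pick the same cells here, but the label slice names its last row explicitly while the position slice stops before index 2.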
Plotting
- `matplotlib`: MATLAB-like, not the easiest, low level
- `seaborn`: wrapper of `matplotlib`, still requires `matplotlib` for tweaks, e.g. axes
- `ggplot`: from R, quite a learning curve
- `bokeh`: looks nice and clean, interactive, but must be on a notebook (more or less)
- `pygal`: SVG, no pandas integration, interactive-ish graphs
- `plotly`: interactive, high level