16  Pandas

Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It’s built on top of NumPy, another library offering support for multi-dimensional arrays, and integrates well with other libraries in the Python Data Science stack like Matplotlib for plotting, SciPy for scientific computing, and scikit-learn for machine learning.

16.1 Core Features

  • Data Structures: Pandas introduces two primary data structures: DataFrame and Series. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Series, on the other hand, is a one-dimensional array with axis labels.

  • Handling of Data: Pandas provides tools for handling missing data, aligning data on labels, and merging, reshaping, selecting, slicing, and indexing data sets.

  • File Import and Export: Pandas reads and writes data in a wide variety of formats, including CSV, Excel, SQL databases, JSON, HTML, and more.

  • Time Series: Pandas offers comprehensive support for working with time series data, including date range generation, frequency conversion, moving window statistics, and more.
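
As a minimal sketch of the time series features listed above, the following builds a hypothetical daily series and applies date range generation, frequency conversion, and a moving window statistic. The values and the 5-day and 3-day window sizes are illustrative assumptions, not taken from the text.

```python
import numpy as np
import pandas as pd

# Date range generation: 10 consecutive days starting 2023-01-01 (illustrative)
idx = pd.date_range("2023-01-01", periods=10, freq="D")
ts = pd.Series(np.arange(10, dtype=float), index=idx)

# Frequency conversion: downsample the daily values into 5-day sums
five_day = ts.resample("5D").sum()

# Moving window statistic: mean over a 3-observation rolling window
# (the first two entries are NaN because the window is not yet full)
rolling = ts.rolling(window=3).mean()

print(five_day)
print(rolling)
```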

16.2 Installation

Pandas can be installed using pip, a package installer for Python:

pip install pandas

16.2.1 Basic Usage

Pandas is typically imported under the pd alias. The examples below also use NumPy, imported under the np alias:

import numpy as np
import pandas as pd

Creating Data Structures

  • Series:

    s = pd.Series([1, 3, 5, np.nan, 6, 8])
  • DataFrame:

    dates = pd.date_range('20230101', periods=6)
    df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
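
A DataFrame can also be built from a dictionary of columns, which is one way to see the "potentially heterogeneous" property in practice. The sketch below uses a hypothetical table named people with made-up values; only the Series line comes from the text above.

```python
import numpy as np
import pandas as pd

# The Series from the text: the NaN forces a floating-point dtype
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s.dtype)

# Hypothetical table: each column may hold a different type
people = pd.DataFrame({
    "name": ["a", "b", "c"],
    "score": [1.5, 2.5, 3.5],
    "passed": [True, False, True],
})

# dtypes reports the type pandas inferred for each column
print(people.dtypes)
```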

Viewing Data

  • View the top and bottom rows of the frame:

    df.head()
    df.tail(3)
  • Display the index, columns, and the underlying NumPy data:

    df.index
    df.columns
    df.to_numpy()
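
To make the viewing calls concrete, here is a small sketch that uses deterministic values (np.arange instead of the random data above, an assumption made so the results are reproducible) and checks what each call returns.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20230101", periods=6)
# Deterministic stand-in for the random frame used in the text
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

top = df.head()       # first 5 rows by default
bottom = df.tail(3)   # last 3 rows
values = df.to_numpy()  # plain NumPy array, without index or column labels

print(top.shape, bottom.shape, values.shape)
print(list(df.columns))
```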

Data Selection

  • Selecting a single column, which yields a Series:

    df['A']
  • Passing a slice to [], which selects rows by position:

    df[0:3]
  • Selection by label:

    df.loc[dates[0]]
  • Selection by position:

    df.iloc[3]
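
The difference between label-based and position-based selection is easiest to see with slices. In the sketch below (deterministic values are an assumption for reproducibility), a label slice with loc includes both endpoints, while a position slice with iloc excludes the end, so the two expressions select the same block. A boolean mask is shown as a third common selection style.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20230101", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slice: both 2023-01-02 and 2023-01-04 are included
by_label = df.loc["2023-01-02":"2023-01-04", ["A", "B"]]

# Position slice: rows 1, 2, 3 and columns 0, 1 (end positions excluded)
by_pos = df.iloc[1:4, 0:2]

# Boolean indexing: keep rows where column A exceeds 8
large_a = df[df["A"] > 8]

print(by_label.equals(by_pos))
print(large_a)
```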

Missing Data

Pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations such as sum() and mean().

  • To drop any rows that contain missing data:

    df.dropna(how='any')
  • Filling missing data:

    df.fillna(value=5)
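
The following sketch, using a small made-up frame, illustrates the points above: reductions skip NaN by default, and dropna and fillna return new objects rather than modifying the original.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.sum())              # NaN is skipped by default
print(s.sum(skipna=False))  # NaN propagates when skipping is turned off

# Hypothetical frame with one missing value in each column
df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, np.nan]})

dropped = df.dropna(how="any")  # keeps only rows with no missing values
filled = df.fillna(value=5)     # replaces every NaN with 5

# The original frame is unchanged by both calls
print(df.isna().sum())
```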

Operations

  • Descriptive statistics, such as the mean of each column:

    df.mean()
  • Applying functions to the data:

    df.apply(np.cumsum)
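
A short sketch of these operations on a small made-up frame: mean works per column by default and per row with axis=1, and apply calls a function once per column. The lambda computing each column's range is an illustrative choice, not something from the text.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

col_means = df.mean()        # one value per column
row_means = df.mean(axis=1)  # one value per row

# apply passes each column (a Series) to the function
running = df.apply(np.cumsum)                        # cumulative sum down each column
spread = df.apply(lambda col: col.max() - col.min())  # max minus min per column

print(col_means)
print(row_means)
print(running)
print(spread)
```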

Grouping

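  • Group by operations, which split the rows by the values in a column and aggregate each group (this assumes column A holds repeated keys):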

    df.groupby('A').sum()
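
Grouping is only meaningful when the grouping column contains repeated keys. As a minimal sketch, assuming a hypothetical frame in which column A holds the labels "foo" and "bar", the call sums the remaining columns within each group:

```python
import pandas as pd

# Hypothetical data: A is the grouping key, C and D are values to aggregate
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1, 2, 3, 4],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# One row per distinct key in A, sorted by key; C and D are summed per group
totals = df.groupby("A").sum()
print(totals)
```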

Merging

  • Concatenating pandas objects together:

    pd.concat([df1, df2])
  • SQL style merges:

    pd.merge(left, right, on='key')
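
The names df1, df2, left, and right above are placeholders. The sketch below fills them in with small made-up frames to show the difference: concat stacks rows, while merge matches rows on a shared key column (an inner join by default).

```python
import pandas as pd

# Hypothetical frames with the same columns, stacked vertically
df1 = pd.DataFrame({"key": ["a", "b"], "lval": [1, 2]})
df2 = pd.DataFrame({"key": ["c", "d"], "lval": [3, 4]})
stacked = pd.concat([df1, df2], ignore_index=True)  # renumber the index 0..3

# Hypothetical frames sharing a 'key' column, joined SQL-style
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [4, 5, 6]})
merged = pd.merge(left, right, on="key")              # inner join: keys in both
outer = pd.merge(left, right, on="key", how="outer")  # keys from either side

print(stacked)
print(merged)
print(outer)
```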

File I/O

  • Reading from and writing to CSV:

    pd.read_csv('filename.csv')
    df.to_csv('my_dataframe.csv')
  • Reading from and writing to Excel (for .xlsx files this requires an Excel engine such as openpyxl to be installed):

    pd.read_excel('filename.xlsx', sheet_name='Sheet1')
    df.to_excel('my_dataframe.xlsx', sheet_name='Sheet1')
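
As a minimal sketch of a CSV round trip, the following writes a small made-up frame to a temporary directory and reads it back. Passing index=False is a design choice here: without it, to_csv also writes the row index, which read_csv would then load as an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data for the round trip
df = pd.DataFrame({"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_dataframe.csv")
    df.to_csv(path, index=False)  # write without the row index
    loaded = pd.read_csv(path)    # column types are inferred on load

print(loaded)
```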

Pandas is a foundational tool for data analysis in Python, offering functions and methods for efficient data manipulation and analysis. Its support for complex data operations, such as the grouping, merging, and missing-data handling shown above, makes it a standard tool for data scientists and analysts working in Python.

To load CSV data in Python, pass the file path or file name to pd.read_csv:

pd.read_csv("file path or file name")

For example:

import pandas as pd
# load data
data1 = pd.read_csv("customers-100.csv")