Monday, 24 February 2025

Load data from a dictionary and perform data checking in Python

Python 

Below is a detailed example of how to load sample data from a dictionary in Python and perform data checking. This example includes checking for missing values, data types, and specific conditions.

Example: Loading Sample Data from a Dictionary and Performing Data Checking

import pandas as pd

# Sample data stored in a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, None],  # None represents a missing value
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'Salary': [50000, 54000, None, 62000, 58000],  # None represents a missing value
    'Department': ['HR', 'Finance', 'IT', 'IT', 'HR']
}

# Load the dictionary into a pandas DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Data Checking

# 1. Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())

# 2. Check data types
print("\nData Types:")
print(df.dtypes)

# 3. Check for specific conditions
# Example: Check if all ages are above 20
print("\nCheck if all ages are above 20:")
print(df['Age'].dropna().gt(20).all())

# Example: Check if all salaries are within a reasonable range (e.g., 30000 to 100000)
print("\nCheck if all salaries are within the range 30000 to 100000:")
print(df['Salary'].dropna().between(30000, 100000).all())

# 4. Handle missing values (optional)
# Fill missing ages with the mean age (mean() skips NaN by default)
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

# Fill missing salaries with the median salary (median() also skips NaN)
median_salary = df['Salary'].median()
df['Salary'] = df['Salary'].fillna(median_salary)

# Display the DataFrame after handling missing values
print("\nDataFrame after handling missing values:")
print(df)

# 5. Verify that missing values have been handled
print("\nMissing Values after handling:")
print(df.isnull().sum())

Explanation:

1. Loading Data from a Dictionary:

  • We start by creating a dictionary data that contains sample data with keys as column names and values as lists of data.
  • We then load this dictionary into a pandas DataFrame using pd.DataFrame(data).
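
One detail worth seeing at load time: a None inside an otherwise numeric list becomes NaN, and the column is stored as float64. The sketch below shows this on a cut-down version of the dictionary (the shortened data is for illustration only). It then converts the column to pandas' nullable Int64 dtype, an optional step that the main example does not use.

```python
import pandas as pd

# A cut-down version of the sample dictionary (illustrative only)
data = {
    'Name': ['Alice', 'Bob', 'Eva'],
    'Age': [24, 27, None],  # None in a numeric list becomes NaN
}

df = pd.DataFrame(data)

# The Age column is float64 because NaN is a floating-point value
age_dtype = str(df['Age'].dtype)
print(age_dtype)

# Optional: the nullable integer dtype keeps whole numbers and marks the gap as <NA>
age_nullable = df['Age'].astype('Int64')
print(age_nullable.dtype)
print(age_nullable.isna().sum())
```

This is why the Age and Salary columns show up as float64 in the data type check, even though every value in the dictionary was written as an integer.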

2. Data Checking:

  • Missing Values: We use df.isnull().sum() to count the number of missing values in each column.
  • Data Types: We use df.dtypes to check the data types of each column.
  • Specific Conditions: We check if all ages are above 20 using df['Age'].dropna().gt(20).all() and if all salaries are within a reasonable range using df['Salary'].dropna().between(30000, 100000).all().
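
The checks above return a single True or False, which says that something is wrong but not where. A minimal sketch of one way to find the offending rows is shown below. The rules dictionary and its bounds are hypothetical and chosen only for illustration; the Salary upper bound is deliberately tight so that one row fails.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, None],
    'Salary': [50000, 54000, None, 62000, 58000],
}
df = pd.DataFrame(data)

# Hypothetical rule table: column -> (lowest allowed, highest allowed)
rules = {
    'Age': (21, 65),           # assumed bounds, for illustration
    'Salary': (30000, 60000),  # deliberately tight so one row fails
}

violations = {}
for column, (low, high) in rules.items():
    present = df[column].notna()
    in_range = df[column].between(low, high)  # inclusive on both ends
    # A row violates the rule only if it has a value and that value is out of range
    violations[column] = df.loc[present & ~in_range, 'Name'].tolist()

print(violations)
```

Missing values are excluded from the violation list on purpose, so they can be reported separately by the missing value check.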

3. Handling Missing Values:

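  • We fill missing ages with the mean age by calling fillna(mean_age) on the Age column. The mean is computed from the non-missing ages only.
  • We fill missing salaries with the median salary by calling fillna(median_salary) on the Salary column. The median is likewise computed from the non-missing salaries only.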
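
As an alternative sketch, both columns can be filled in one call by passing DataFrame.fillna a dictionary that maps column names to fill values. The variable names fill_values and filled are our own, and the data is a trimmed copy of the sample dictionary.

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, 22, 32, None],
    'Salary': [50000, 54000, None, 62000, 58000],
}
df = pd.DataFrame(data)

# Map each column to its fill value: mean for Age, median for Salary
fill_values = {
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
}

# fillna returns a new DataFrame; the original df is left unchanged
filled = df.fillna(fill_values)

print(filled[['Name', 'Age', 'Salary']])
```

Keeping the original DataFrame untouched makes it easy to compare the data before and after filling.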

4. Verification:

  • After handling missing values, we verify that there are no more missing values using df.isnull().sum().
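
Printing the counts works for a quick look, but in a script it can be handy to fail loudly instead. The sketch below assumes a small helper of our own, assert_no_missing, which is not part of pandas. It raises an error that names the columns still containing missing values.

```python
import pandas as pd

# Hypothetical helper; the name is our own and not a pandas function
def assert_no_missing(frame: pd.DataFrame) -> None:
    """Raise ValueError naming the columns that still contain missing values."""
    counts = frame.isnull().sum()
    remaining = counts[counts > 0]
    if not remaining.empty:
        raise ValueError(f"Missing values remain in columns: {list(remaining.index)}")

clean = pd.DataFrame({'Age': [24.0, 26.25], 'Salary': [50000.0, 56000.0]})
assert_no_missing(clean)  # passes silently

dirty = pd.DataFrame({'Age': [24.0, None], 'Salary': [50000.0, 56000.0]})
try:
    assert_no_missing(dirty)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```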

Output:

Original DataFrame:
      Name   Age Gender   Salary Department
0    Alice  24.0      F  50000.0         HR
1      Bob  27.0      M  54000.0    Finance
2  Charlie  22.0      M      NaN         IT
3    David  32.0      M  62000.0         IT
4      Eva   NaN      F  58000.0         HR

Missing Values:
Name          0
Age           1
Gender        0
Salary        1
Department    0
dtype: int64

Data Types:
Name          object
Age          float64
Gender        object
Salary       float64
Department    object
dtype: object

Check if all ages are above 20:
True

Check if all salaries are within the range 30000 to 100000:
True

DataFrame after handling missing values:
      Name    Age Gender   Salary Department
0    Alice  24.00      F  50000.0         HR
1      Bob  27.00      M  54000.0    Finance
2  Charlie  22.00      M  56000.0         IT
3    David  32.00      M  62000.0         IT
4      Eva  26.25      F  58000.0         HR

Missing Values after handling:
Name          0
Age           0
Gender        0
Salary        0
Department    0
dtype: int64

This example demonstrates how to load data from a dictionary, perform basic data checking, handle missing values, and verify the results.

If you want to get the row numbers where missing values (NaN) occur in a Pandas DataFrame, you can do the following:

import pandas as pd
import numpy as np

# Sample data stored in a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, np.nan, 32, None],  # np.nan and None represent missing values
    'Gender': ['F', 'M', 'M', 'M', 'F'],
    'Salary': [50000, 54000, None, 62000, 58000],  # None represents a missing value
    'Department': ['HR', 'Finance', 'IT', 'IT', 'HR']
}

# Load the dictionary into a pandas DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Get row numbers where missing values occur
missing_value_rows = df[df.isnull().any(axis=1)].index.tolist()

# Print the row numbers with missing values
print("\nRow numbers with missing values:")
print(missing_value_rows)

Explanation:

1. Sample Data:

  • We create a dictionary data with some missing values represented by np.nan and None.
  • The dictionary is loaded into a pandas DataFrame.

2. Finding Missing Values:

  • df.isnull() returns a DataFrame of the same shape as df, with True where there are missing values and False otherwise.
  • df.isnull().any(axis=1) checks if any value in a row is True (i.e., if there is at least one missing value in the row).
  • df[df.isnull().any(axis=1)] filters the DataFrame to include only rows with missing values.
  • .index.tolist() extracts the index labels of these rows and converts them to a list. Because this DataFrame uses the default 0-based index, the labels are the same as the row numbers.

3. Output:

  • The row numbers with missing values are printed.

Output:

Original DataFrame:
      Name   Age Gender   Salary Department
0    Alice  24.0      F  50000.0         HR
1      Bob  27.0      M  54000.0    Finance
2  Charlie   NaN      M      NaN         IT
3    David  32.0      M  62000.0         IT
4      Eva   NaN      F  58000.0         HR

Row numbers with missing values:
[2, 4]

Explanation of Output:

  • Row 2 has missing values in the Age and Salary columns.
  • Row 4 has a missing value in the Age column.
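
To get the same row-by-row breakdown from code rather than by reading the table, one possible sketch is to build a dictionary that maps each affected row to the columns that are missing in it. The name missing_by_row is our own, and the data is a trimmed copy of the sample dictionary.

```python
import numpy as np
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [24, 27, np.nan, 32, None],
    'Salary': [50000, 54000, None, 62000, 58000],
}
df = pd.DataFrame(data)

# True where a value is missing, False otherwise
mask = df.isnull()

# Index labels of the rows that have at least one missing value
rows_with_gaps = mask.index[mask.any(axis=1).to_numpy()]

# For each such row, list the columns whose value is missing
missing_by_row = {
    row: mask.columns[mask.loc[row].to_numpy()].tolist()
    for row in rows_with_gaps
}

print(missing_by_row)
```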

This example demonstrates how to identify and retrieve the row numbers where missing values occur in a DataFrame.


