Data Exploration
This lesson explains how to explore and inspect data in a Pandas DataFrame. It presents methods to obtain structural information, descriptive statistics, and techniques to select and filter data.
Basic Inspection Methods
The head(), tail(), and sample() methods allow viewing subsets of data from a DataFrame.
import pandas as pd
df = pd.DataFrame({
'name': ['Anna', 'Bernat', 'Carla', 'David', 'Eva'],
'age': [25, 30, 22, 35, 28],
'city': ['Barcelona', 'Girona', 'Lleida', 'Tarragona', 'Barcelona']
})
print(df.head(3)) # Shows the first rows (default 5)
print(df.tail(2)) # Shows the last rows
print(df.sample(2)) # Shows random rowsThe head() method returns the first n rows, tail() the last ones, and sample() a random sample. They are useful for a quick inspection of the content.
The info() and describe() methods provide summaries of the DataFrame.
df.info() # Information about the DataFrame structure
print(df.describe()) # Descriptive statistics of numeric columnsThe info() method shows the number of rows, columns, data types, and memory usage. The describe() method calculates statistics such as mean, standard deviation, minimum, and maximum for numeric columns.
Accessing Structural Information
The dimensions and shape of a DataFrame can be queried with the attributes shape, size, ndim, and columns.
print(df.shape) # Dimensions (rows, columns)
print(df.size) # Total number of elements
print(df.ndim) # Number of dimensions
print(df.columns) # Column names
print(df.index) # Row indexThe shape attribute returns a tuple with the number of rows and columns. The size attribute indicates the total number of elements, and ndim is always 2 for DataFrames.
The data types of each column can be queried with dtypes.
print(df.dtypes) # Data types of each column
print(df['age'].dtype) # Data type of a specific columnThe most common types are int64, float64, object (for text), and bool.
To obtain unique values and counts, the methods unique(), nunique(), and value_counts() are used.
print(df['city'].unique()) # Unique values of a column
print(df['city'].nunique()) # Number of unique values
print(df['city'].value_counts()) # Count of each unique valueThe unique() method returns an array with values without repetition, nunique() the number of different values, and value_counts() counts the occurrences of each value.
Quick Descriptive Statistics
Pandas offers methods to calculate basic statistics directly.
print(df['age'].mean()) # Mean of ages
print(df['age'].median()) # Median
print(df['age'].std()) # Standard deviation
print(df['age'].min()) # Minimum
print(df['age'].max()) # Maximum
print(df['age'].sum()) # Total sumThese methods are applied on numeric columns and return a scalar value with the statistical calculation result.
Data Selection
The [] operator allows selecting columns or rows from a DataFrame.
df = pd.DataFrame({
'name': ['Anna', 'Bernat', 'Carla'],
'age': [25, 30, 22],
'city': ['Barcelona', 'Girona', 'Lleida']
})
print(df['name']) # Select a column (returns a Series)
print(df[['name', 'age']]) # Select multiple columns
print(df[0:2]) # Select rows by position using slicingWith a single label inside [], a column is selected. With a list of labels, multiple columns are selected. With slicing, rows are selected by position.
Selection with loc
The loc indexer selects data using row and column labels.
print(df.loc[0]) # Select a row by label
print(df.loc[0:1, 'name':'age']) # Select specific rows and columns
print(df.loc[[0, 2]]) # Select rows with a list of indexes
print(df.loc[:, 'city']) # Select all rows of a columnThe loc operator uses [rows, columns]. Ranges with : include the end value, unlike usual Python slicing.
Selection with iloc
Similarly, the iloc indexer selects data using integer positions.
print(df.iloc[0]) # Select a row by position
print(df.iloc[0:2, 0:2]) # Select rows and columns by position
print(df.iloc[[0, 2], [0, 2]]) # Select rows and columns with lists
print(df.iloc[:, 1]) # Select all rows of a columnRanges do not include the end value!
Filtering with Boolean Conditions
Boolean conditions allow filtering rows that meet certain criteria.
# Filter rows where age is greater than 24
print(df[df['age'] > 24])
# Multiple conditions with & (and) and | (or)
print(df[(df['age'] > 24) & (df['city'] == 'Barcelona')])
# Membership condition
print(df[df['city'].isin(['Barcelona', 'Girona'])])
# Negation with ~
print(df[~(df['age'] > 24)])Conditions return a boolean Series used to filter the DataFrame. Parentheses are required with logical operators &, |, and ~.
Query and Advanced Selections
The query() method allows filtering with more readable expressions.
# Filtering with query
print(df.query('age > 24'))
# Multiple conditions
print(df.query('age > 24 and city == "Barcelona"'))
# Using external variables with @
threshold = 24
print(df.query('age > @threshold'))The query() method accepts expressions as text and is especially useful for complex conditions.
Selecting Multiple Columns and Rows
Pandas offers various ways to select subsets of data.
# Select columns with select_dtypes
print(df.select_dtypes(include=['int64']))
# Filter columns by name with filter
print(df.filter(items=['name', 'age']))
# Filter columns by pattern
df_extended = df.copy()
df_extended['age_years'] = df_extended['age']
print(df_extended.filter(like='age'))The select_dtypes() method filters columns by data type, filter() by exact names or patterns.
