
Adult UCI Dataset Analysis with Python

The Adult UCI dataset is a popular dataset for practice. It poses a supervised binary classification problem.

AIM

To predict whether a person earns more than 50K a year.

Links to download the dataset:

Details of the Dataset

The dataset contains a mix of categorical and numeric type data.

Categorical Attributes

  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
    • Individual work category
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
    • Individual’s highest education degree
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
    • Individual marital status
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspect, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
    • Individual’s occupation
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
    • Individual’s relation in a family
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
    • Race of Individual
  • sex: Female, Male.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad Tobago, Peru, Hong, Holland-Netherlands.
    • Individual’s native country

Continuous Attributes

  • age: continuous.
    • Age of an individual
  • fnlwgt: final weight, continuous.
    • The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US. These estimates are prepared monthly by the Population Division at the Census Bureau.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
    • Individual’s working hour per week

Let’s get started by loading the libraries and the dataset.
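As a minimal sketch of the loading step, the snippet below reads an Adult-style file with pandas. It assumes the raw, header-less file format from the UCI repository; the helper name load_adult and the two inline rows are illustrative stand-ins for the real download. If your copy of the data already has a header row, drop the header and names arguments.

```python
import io

import pandas as pd

# Column names as given in the UCI documentation for the Adult dataset
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]


def load_adult(source):
    """Read a header-less Adult-style CSV from a path or file-like object."""
    # skipinitialspace strips the blank that follows each comma in the raw file
    return pd.read_csv(source, header=None, names=COLUMNS, skipinitialspace=True)


# Two illustrative rows in the raw file's format (stand-in for the download)
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, "
    "Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K\n"
)
df = load_adult(sample)
print(df.head())
```

With the real file, you would pass its path to load_adult instead of the in-memory sample.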

As you may have noticed, the data does not contain any null values, but it does contain question marks (“?”). We have two options: drop the rows with those values or replace them with the mode of the column. We shall deal with this later (just keep it in mind).

Exploring the Data

The data contains 48,842 rows and 15 columns, including the target (output) column, income. The data type is a mix of categorical and numeric data. We notice that there are no null values.

Checking the unique value for each column.
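The exploration steps can be sketched roughly as follows. The small frame here is a made-up stand-in for the full dataset, so the numbers it prints are not the real ones.

```python
import pandas as pd

# Tiny illustrative frame standing in for the full dataset
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K"],
})

print(df.shape)                    # (rows, columns)
df.info()                          # dtypes and non-null counts per column

null_counts = df.isnull().sum()    # '?' is an ordinary string, not a null
unique_counts = df.nunique()       # number of distinct values per column
print(null_counts)
print(unique_counts)
```

Note that isnull() reports zero missing values even though “?” is present, which is why the question marks have to be searched for separately.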

Remember the “?”

The columns (features) that contain the “?” sign are workclass, occupation and native-country. Let’s explore them in more depth. I am using the pandas value_counts function to find the count of each value in a feature.

Utility of the value_counts function:

  • It makes it easy to see the most and the least frequent values.
  • It makes it easy to figure out the mode of categorical data.
  • It exposes corrupt or noisy data, in this case “?”.
  • It helps to find the values needed for replacement.
  • It is an easy operation to perform.
  • It proves handy when it comes to plotting the data.

Just for practice, you can run value_counts on all the columns.
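A short sketch of how value_counts can be used for this, on a small made-up frame (the values are illustrative, not the real counts):

```python
import pandas as pd

df = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov", "Private", "?"],
    "occupation": ["Sales", "?", "Sales", "Adm-clerical", "Tech-support", "?"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
})

# Counts per value, sorted from most to least frequent
counts = df["workclass"].value_counts()
print(counts)

mode_value = counts.idxmax()        # the most frequent value, i.e. the mode
n_question = counts.get("?", 0)     # how many '?' entries the column holds

# Names of all columns that contain '?' at least once
cols_with_question = [c for c in df.columns if (df[c] == "?").any()]
print(cols_with_question)
```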

Time for Solution

Let us deal with the ‘?’ now. We shall replace it with the ‘MODE’.

The ‘?’ values have been replaced with the mode (the most frequently occurring value) of each column.
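One way to carry out this replacement is sketched below on a small made-up frame. Computing the mode over the non-“?” values is a choice made here so that “?” itself can never be picked as the replacement.

```python
import pandas as pd

df = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "occupation": ["Sales", "?", "Sales", "Adm-clerical"],
    "native-country": ["United-States", "United-States", "?", "India"],
})

for col in ["workclass", "occupation", "native-country"]:
    # Mode of the real values only, so '?' is never selected
    mode_value = df.loc[df[col] != "?", col].mode()[0]
    df[col] = df[col].replace("?", mode_value)

print(df)
```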

Let us do some feature engineering to make more sense of the data. As this data holds a lot of information, we will reduce it to make it more manageable.

Dealing with Columns

Have a look at the columns and their values; we can then reduce the number of distinct values in a column by grouping similar ones together.
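As an illustration of reducing the number of values in a column, the sketch below merges the school-level entries of education into a single category. The grouping and the label "School" are hypothetical choices made for this example, not necessarily the ones used in the original analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["Bachelors", "9th", "HS-grad", "Masters", "Preschool", "11th"],
})

# Hypothetical grouping: every level below high-school graduation becomes "School"
school_levels = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th",
]
df["education"] = df["education"].replace(school_levels, "School")

print(df["education"].value_counts())
```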

Now, let us come to a more interesting part: data visualization.

We shall look into the following:

  • Histogram to study the shape of the numeric data
  • BoxPlot to have an idea of outliers
  • Correlation plot to study the correlation among the numeric variables
  • Plotly Pie charts
  • Plotly Scatter Charts
  • Countplot for the income variable
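A rough sketch of four of these plots is given below. It uses plain matplotlib rather than Plotly or a count-plot helper, and it runs on synthetic random data, so the shapes it draws say nothing about the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data, generated only to make the example runnable
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(17, 91, size=200),
    "hours-per-week": rng.integers(1, 100, size=200),
    "income": rng.choice(["<=50K", ">50K"], size=200, p=[0.75, 0.25]),
})
numeric_cols = ["age", "hours-per-week"]

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: shape of a numeric feature
axes[0, 0].hist(df["age"], bins=20)
axes[0, 0].set_title("Histogram of age")

# Box plot: points beyond the whiskers are potential outliers
axes[0, 1].boxplot([df[c] for c in numeric_cols])
axes[0, 1].set_xticks([1, 2])
axes[0, 1].set_xticklabels(numeric_cols)
axes[0, 1].set_title("Box plots")

# Correlation plot of the numeric features
corr = df[numeric_cols].corr()
im = axes[1, 0].imshow(corr.values, vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 0].set_xticks(range(len(numeric_cols)))
axes[1, 0].set_xticklabels(numeric_cols)
axes[1, 0].set_yticks(range(len(numeric_cols)))
axes[1, 0].set_yticklabels(numeric_cols)
axes[1, 0].set_title("Correlation")
fig.colorbar(im, ax=axes[1, 0])

# Count plot for the income variable
income_counts = df["income"].value_counts()
axes[1, 1].bar(income_counts.index, income_counts.values)
axes[1, 1].set_title("Income counts")

fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, call fig.savefig with a file name of your choice.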

Key Findings

  • The minimum age is 17 and the maximum is 90 years; most of the working-age group lies between 20 and 40.
  • The minimum hours-per-week is 1 and the maximum is 99, with most of the values lying between 30 and 40.
  • Outliers, that is, extreme values present in the data, are observed in almost all the numeric features.
  • No strong correlation is observed among the numeric variables.

Training the MODEL and Making Predictions

  • Create X and y objects to store the independent variables (X) and the dependent variable (y).
  • Perform standard scaling to scale the data.
  • Perform label encoding to convert the categorical data into numeric format, which makes the data suitable for machine learning algorithms.
  • Perform fit and transform.
  • Split the dataset into train and test sets.
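The steps above can be sketched with scikit-learn as follows. The frame is a small made-up sample, and the ordering is a deliberate variation: the split is done before scaling so that the scaler is fitted on the training rows only, which avoids leaking test-set statistics into training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Small illustrative sample standing in for the cleaned dataset
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52],
    "workclass": ["State-gov", "Private", "Private", "Private",
                  "Private", "Private", "Self-emp-inc", "Self-emp-inc"],
    "hours-per-week": [40, 13, 40, 40, 40, 40, 16, 45],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K",
               "<=50K", ">50K", "<=50K", ">50K"],
})

# Independent variables (X) and dependent variable (y)
X = df.drop(columns="income")
y = LabelEncoder().fit_transform(df["income"])   # '<=50K' -> 0, '>50K' -> 1

# Label-encode each categorical column (listed explicitly here)
categorical_cols = ["workclass"]
for col in categorical_cols:
    X[col] = LabelEncoder().fit_transform(X[col])

# Split first, then fit the scaler on the training part only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Label encoding assigns arbitrary integer codes to categories; tree-based models tolerate this well, while for linear models one-hot encoding is a common alternative.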

Here, I have used three classifiers: Logistic Regression, Decision Tree and Random Forest.
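A minimal sketch of fitting and comparing the three classifiers is shown below. It uses a synthetic dataset from make_classification in place of the prepared Adult data, and the hyperparameters are assumptions, so the accuracies it prints are not the ones reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the prepared dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                       # train on the training split
    predictions = model.predict(X_test)               # predict on unseen rows
    scores[name] = accuracy_score(y_test, predictions)
    print(f"{name}: {scores[name]:.3f}")
```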

Key Findings

  • Random Forest Classifier gives the best accuracy on the test data: 85%
  • Logistic Regression Classifier accuracy is: 84%
  • Decision Tree Classifier accuracy is: 81%

I have also tried to explain the confusion matrix, along with its formulas and how the metrics can be calculated for both classes.
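As a worked example of those formulas, the sketch below builds a confusion matrix from ten made-up labels and computes accuracy = (TP + TN) / total, precision = TP / (TP + FP) and recall = TP / (TP + FN), first treating class 1 (>50K) as positive and then class 0 (<=50K).

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 stands for '<=50K', 1 stands for '>50K'
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 treated as the positive class
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0 treated as the positive class (roles of TP and TN swap)
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

print(cm)
print(accuracy, precision_1, recall_1, precision_0, recall_0)
```

Here the matrix is [[5, 1], [1, 3]], so accuracy is 0.8, precision and recall for class 1 are both 0.75, and for class 0 both are 5/6.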

Further Scope:

Apply boosting algorithms and tune the model parameters to improve performance on the test data.
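One possible starting point is sketched below: a gradient boosting classifier tuned with a small grid search. The choice of GradientBoostingClassifier, the parameter grid and the synthetic data are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (stand-in for the prepared dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small illustrative grid; a real search would usually cover more values
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training split
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
test_accuracy = search.score(X_test, y_test)   # accuracy of the refitted best model
print(test_accuracy)
```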
