Next I noticed that the BODY TYPE column had incomplete data. I cleaned the data by dropping the rows with incomplete data.

Next I noticed that the IS NEW column has complete data, however the OWNER COUNT column which should be related to the IS NEW column had some missing values. I cleaned the data by comparing the IS NEW column and the OWNER COUNT columns. Comparison 1: The car is used but the owner count is not listed. In this case I dropped the rows because we did not want any incomplete information in the data. Comparision 2: The car is new but the owner count is not na. In this case I dropped the rows because we did not any misleading information in the data. Comparison 3: The car is new but the owner count is na. In this case I want to change the value from na to 0.0.

Next I noticed the missing values in the MILEAGE column, however this column should also relate to the IS NEW column. We cleaned the data by comparing the IS NEW column and the MILEAGE columns. For my purposes new cars will have 0.0 mileage. A new car with less than 200 miles will have its mileage changed to 0. Anything above a 200 will be droppped because the data is incorrect. Comparison 1: The car is used but the mileage data is missing. I dropped the rows becuase we do not want incomplete data. Comparision 2: The car is new but the mileage data is greater than 200. I droppped the rows because the data is incorrect. Comparision 3: The car is new but the mileage data is less than 200. I changed the values to 0.0 because a new car in our data will have 0.0 mileage. Comparison 4: The car is new but the mileage data is na. I changed the values from na to 0.0.

Next I noticed the missing values in the HAS ACCIDENTS column, however this column should also relate to the IS NEW column. I cleaned the data by comparing the IS NEW column and the HAS ACCIDENTS columns. Comparison 1: The car is used but the HAS ACCIDENTS data is missing. We dropped the rows becuase I did not want incomplete data. Comparision 3: The car is new but the HAS ACCIDENTS data is True. I droppped the rows because the data is incorrect. Comparison 4: The car is new but the HAS ACCIDENTS data is na. I changed the values from na to False.

Next I noticed the missing in values in MAXIMUM SEATING, ENGINE TYPE,HORSEPOWER, TRANSMISSION, FUEL ECONOMY, and SELLER RATING. All of these columns are independant of other columns. I wanted the most complete data so any rows with incomplete data I dropped.

Lastly I found that the LISTING COLOR column while not having any missing values did have the string 'UNKNOWN' for some rows. I want the most complete data we can get so we dropped the rows were the color was listed as unknown.

Next I decided to turn the seats column into a flat integers and drop any nan values for better use by algorithms.

This makes sense because fuel economy and horsepower rely on engine size which makes the defining columns not independant so they essentially give similar information.

Clustering Vs Classification: Predicting the body type from City Fuel Economy, Highway Fuel Economy, Engine Type, Maximum seating, and Horsepower

A brief overview of the results

Davies-Bouldin Index is interpreted as lower scores meaning better clustering, so we achieved a relatively poor score. Our Silhouette Coefficient is not horrible because the range is [-1, 1], so our score indicates that samples were assigned to the right cluster Our adjusted rand score outright failed, most likely due to the size of the data set. Our Fowlkes Mallows score is again pretty poor because better clustering results in a higher score.

I observed that these scores where pretty bad, so I decided to see the prediction results of a classification method. In my case I chose to use a Decision Tree Classification algorithm.

Results of classification

All of the tests indicate better results when closer to one, so our results all indicate that the classification algorithm performed well.

Clustering Vs Classification: Predicting the Price Range of a Used Car by Make, Model, Year, Mileage, Fuel Economy (CJN)

For the clustering and classification predictions I decided to focus on only used cars under 50,000 dollars. I binned the price column into 5 ranges to come up with the original labels. To bin the price ranges I used the pandas function cut which cuts the data into bins based on 10000 dollar ranges. I then labeled the bins from 1 to 5 1 being the cheapest range 5 being the most expensive range. Next I label encoded both the make and model name columns. I chose k-means and Agglomerative for clustering. I chose K-neighbors and Decision Tree for classification.

K-Means

The Davies-Bouldin score is poor the optimal score is 0 with lower values indicating better clustering.

Decision Tree

K-Neighbors

All Cars: Visualizing price by Make, Mileage, and Color

The scatter plot and bar plot shows the presence of outliers and a skew in price which needs to be addresses for the other graphs to be meaningful. Also, we can see how the majority of accident cars are cheaper and spread across all mileage

Problem!

As you can see there are some heavy outliers that would make it difficult to plot anything with price. So, I am going to remove them before plotting anything else using percentiles because anything dealing with means would be influenced by the outliers.

As we can see, some colors like pink and purple have a similar quartile price points, but not many exist on the higher end of price. This makes sense because they are less common car colors while the more common colors have a wider distribution.

Much Better!

Pie Chart of top 25 Makes

Violin plot of Price and Mileage vs Accidents

This violinplot further solidifies the notion that a cars price noticably drops when it incurs an accident

As expected the non accident section contains a lot of cars on the low end because they are new and couldn't have been in an accident.

Top 500 cars by Make and Accident Status

Since a car with more miles has more opporutunity to be in an accident, we chose the top mileage of the top makes to see if there was a relation between make and accident occurance. Surprisingly, there doesn't seem to be any coorelation between make and accident status

Count of transmission type and year

Interestingly, we would assume that manual cars are more liekly to decrease in count but they are actually increasing very gradually up until 2021.

Dealing with skewed Data

The price and mileage categories are heavily skewed to the right, so, by square rooting the categories we can get a better sense of how mileage and price are reliant on each other.

So, using this graph we can show how price significantly decreases as mileage increases. But, at a certain point there is a cooresponding priced vehicle that has 0 miles on it.

Pair Plot to examine confounding variables when interpreting price

This plot is useful because we can understand how accidents play into the capabilities of a car. For example, we can see how fuel economy is worse with mileage (aka lifespan) when a car has had an accident.

All Cars: Scatterplot Comparing Engine Type and Make Name with the Hue representing Transmission Types (CJN)

This plot is helpful in identifying maker trends regarding engine types and transmission types. Such as: Chevy creates the most different engine types; High end companies have less diversity in both engine type and transmission type; The most popular transmission type is A; The least popular is Dual Clutch; The most popular engine is I4.

Used Cars under 50,000 Dollars: Pair Plot Comparing Year, Mileage with the Hue representing different Price Ranges (CJN)

Two interesting trends can be found by looking at these plots both of which make sense when you think about them. One the cheaper cars have higher mileages as seen in the lower right plot. Two the older cars are cheaper as seen in the upper left hand plot.