Jeffrey Schlimmer. This data set consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared to other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates.

Cars are initially assigned a risk factor symbol associated with their price. Then, if a car is more (or less) risky, this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". The third factor is the relative average loss payment per insured vehicle year.

Note: Several of the attributes in the database could be used as a "class" attribute.

Relevant paper: Kibler, D. Instance-based prediction of real-valued attributes. Computational Intelligence, Vol 5.

Abstract: Derived from a simple hierarchical decision model, this database may be useful for testing constructive induction and structure discovery methods.

Creator: Marko Bohanec. Donors: Marko Bohanec and Blaz Zupan.

The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1)). The model evaluates cars according to the following concept structure: CAR (car acceptability), PRICE (overall price), and TECH (technical characteristics). In the original model, every concept is related to its lower-level descendants by a set of examples (for these example sets, see [Web Link]). The Car Evaluation Database contains examples with the structural information removed. Because the underlying concept structure is known, this database may be particularly useful for testing constructive induction and structure discovery methods.

Class Values: unacc, acc, good, vgood. Attributes: buying: vhigh, high, med, low.

Relevant papers: M. Bohanec and V. Rajkovic: Knowledge acquisition and explanation for multi-attribute decision making. B. Zupan, M. Bohanec, I. Bratko, J. Demsar: Machine learning by function decomposition.

Using this dataset, we will use data analysis to explore how the mileage of a used car affects its final price.

Since we will be using the used cars dataset, you will first need to download it.

The str function displays the internal structure of an R object and is an alternative to summary. When using str, roughly one line is displayed for each basic structure, such as each column of a data frame. The summary function is a generic function that is used to produce result summaries of various objects, including the results of model-fitting functions. You can also summarize a single column of the used cars dataset. For example, let's complete a summary of only the year of the used cars.
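As a minimal sketch of these two functions, the block below builds a small stand-in data frame rather than loading the downloaded file. The name `usedcars`, its columns, and all of its values are assumptions made up for illustration; the real dataset may be laid out differently.

```r
# Hypothetical stand-in for the downloaded used cars data (values are made up).
usedcars <- data.frame(
  year    = c(2008, 2010, 2010, 2011, 2012, 2012),
  price   = c(9500, 12000, 12500, 14000, 16500, 17000),
  mileage = c(82000, 60000, 55000, 41000, 23000, 18000),
  color   = c("Black", "Silver", "Red", "Silver", "Blue", "Black")
)

# Compact overview: one line per column with its type and first values.
str(usedcars)

# Summary statistics for every column of the data frame.
summary(usedcars)

# Summary of a single column, selected with the $ operator.
year_summary <- summary(usedcars$year)
print(year_summary)
```

With your own download, you would replace the hand-built data frame with the result of reading the file, for example with `read.csv`.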

The range function returns a vector containing the minimum and maximum of all the given arguments. Applying the diff function to the result of range returns the difference between the maximum and the minimum, that is, the spread of the data (in general, diff returns suitably lagged and iterated differences). The quantile function produces sample quantiles corresponding to the given probabilities.

The smallest observation corresponds to a probability of 0 and the largest to a probability of 1. The probs parameter specifies which quantiles to return, and quantile uses interpolation methods to handle ties among values and datasets with no single middle value.
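The following sketch shows range, diff, and quantile on a small vector. The `price` values are made up for illustration and are not taken from the used cars dataset.

```r
# Hypothetical used-car prices (made up for illustration).
price <- c(9500, 12000, 12500, 14000, 16500, 17000)

# range() returns the minimum followed by the maximum.
price_range <- range(price)
print(price_range)

# diff() on that two-element vector gives max - min, the spread of the data.
price_spread <- diff(price_range)
print(price_spread)

# With no probs argument, quantile() returns the five-number summary
# at probabilities 0, 0.25, 0.5, 0.75 and 1.
five_num <- quantile(price)
print(five_num)

# probs selects other cut points, here the 10th and 90th percentiles.
tails <- quantile(price, probs = c(0.10, 0.90))
print(tails)
```

The middle values returned by quantile are interpolated between neighbouring observations, which is why they need not appear in the data.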

A boxplot is a common visualization of the five-number summary.

The boxplot function produces box-and-whisker plots of the given (grouped) values. The median is drawn as the dark line inside the box. You can add extra parameters such as main and ylab to give the figure a title and to label the y-axis (the vertical axis). Histograms are another way to graphically depict the spread of a numeric variable. Like a boxplot, a histogram divides the variable's values into a predefined number of portions, called bins, that act as containers for values.
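A short sketch of both plots follows. The `price` and `mileage` vectors, the titles, and the axis labels are illustrative assumptions, not values from the used cars dataset.

```r
# Hypothetical used-car values (made up for illustration).
price   <- c(9500, 12000, 12500, 14000, 16500, 17000)
mileage <- c(82000, 60000, 55000, 41000, 23000, 18000)

# Box-and-whisker plot of price; main sets the title, ylab labels the y-axis.
bp <- boxplot(price, main = "Boxplot of Used Car Prices", ylab = "Price ($)")

# Histogram of mileage; breaks is a suggestion for the number of bins.
h <- hist(mileage, breaks = 4, main = "Histogram of Used Car Mileage",
          xlab = "Odometer (mi.)")

# Both functions also return the numbers behind the picture.
print(bp$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(h$counts)   # number of observations in each bin
```

Storing the return values is optional; calling `boxplot(price)` or `hist(mileage)` on their own is enough to draw the figures.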

The table function uses cross-classifying factors to build a contingency table of the counts at each combination of factor levels.

All datasets below are provided as CSV files. If you are using D3 or Altair for your project, there are built-in functions to load these files into your project. If you are using Processing, these classes will help you load CSV files into memory: download tableDemos.

The zip file contains the Table class (all files named Table). Since some of the datasets include country data, we also provide you with a file countries. For an example of how to use this file to draw a map, download mapDemo. For this you might have to use the DateFormat class or regular expressions.

These are simple multidimensional datasets that are, for the most part, classic infovis datasets. If you use one of these datasets, you will need to focus your effort on creating good, interactive representations that are well suited to your analytic tasks.

Includes mostly free-form text with some structured data, including id, title, when created, published, updated, deleted, author type, postal code, and text contents.

Causes of death in France. Other data on European countries can be downloaded from the Eurostat website.

This dataset consists of three files: sleep periods, feeding periods, and diaper changes of a baby in its first 2.

The above table is quite small and only provides the average rating for the question "How happy would you say you are these days?" (rating from 1, low, to 10, high) by country and by sex. On its own, this dataset is probably insufficient for this class project. You are encouraged to download and visualize answers to other questions as well.

Other data per country per year can be downloaded from Gapminder, such as electricity generation per person, alcohol consumption, air traffic accidents, and more classical measures such as GDP.

HIV prevalence per country per year, with uncertainty bounds.

Speed dating data with over 8, observations of matches and non-matches, with answers to survey questions about how people rate themselves and how they rate others on several dimensions. An aggregated dataset computed from the World Values Survey that measures cultural proximity of countries across two dimensions, and for different time periods.

A collection of over 20, dream reports with dates.

You may also choose your own dataset. In order to do so, you must first get your dataset approved by the instructor. Please acknowledge these authors when reusing content from this page, and the source data authors for external links.

Once you are satisfied with the table, click the disk icon at the top, then select the xls format.

Every data scientist will likely have to perform linear regression tasks and predictive modeling processes at some point in their studies or career.

For those of you looking to learn more about the topic or complete some sample assignments, this article will introduce 10 open datasets for linear regression. Additionally, some of the datasets on this list include regression tasks for you to complete with the data. This dataset includes data taken from cancer. Along with the dataset, the author includes a full walkthrough on how they sourced and prepared the data, their exploratory analysis, model selection, diagnostics, and interpretation.

From the Behavioral Risk Factor Surveillance System at the CDC, this dataset includes information about physical activity, weight, and average adult diet.

Built for multiple linear regression and multivariate analysis, the Fish Market Dataset contains information about common fish species in market sales. The dataset includes the fish species, weight, length, height, and width.

The data contains medical information and costs billed by health insurance companies. It contains rows of data and the following columns: age, gender, BMI, children, smoker, region, insurance charges.

Created as a resource for technical analysis, this dataset contains historical data from the New York stock market. The dataset comes in four CSV files: prices, prices-split-adjusted, securities, and fundamentals. Using this data, you can experiment with predictive modeling, rolling linear regression, and more.

The dataset contains data from cancer. It is in CSV format and includes the following information about cancer in the US: death rates, reported cases, US county name, income per county, population, demographics, and more.

This real estate dataset was built for regression analysis, linear regression, multiple regression, and prediction models. It includes the date of purchase, house age, location, distance to nearest MRT station, and house price per unit area.

The dataset includes info about the chemical properties of different types of wine and how they relate to overall quality.

A useful dataset for price prediction, this vehicle dataset includes information about cars and motorcycles listed on CarDekho. The data is in a CSV file which includes the following columns: model, year, selling price, showroom price, kilometers driven, fuel type, seller type, transmission, and number of previous owners.
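To illustrate the kind of price prediction such columns allow, here is a minimal linear regression sketch. It does not load the CarDekho file: the `listings` data frame, its column names, and the coefficients used to simulate it are all assumptions made up for this example.

```r
# Simulated stand-in for a vehicle listing table (all values are synthetic).
set.seed(42)
n <- 200
listings <- data.frame(
  year       = sample(2005:2018, n, replace = TRUE),
  kms_driven = round(runif(n, 5000, 150000))
)

# Assumed relationship: newer cars sell for more, heavily driven cars for less.
listings$selling_price <- 2 +
  0.4 * (listings$year - 2005) -
  0.00002 * listings$kms_driven +
  rnorm(n, sd = 0.5)

# Multiple linear regression of selling price on year and kilometers driven.
fit <- lm(selling_price ~ year + kms_driven, data = listings)
print(coef(fit))

# Predicted selling price for a hypothetical 2015 car with 40,000 km.
new_car <- data.frame(year = 2015, kms_driven = 40000)
predicted <- predict(fit, newdata = new_car)
print(predicted)
```

On real data, categorical columns such as fuel type or transmission could be added to the formula, and the fit should be checked with `summary(fit)` before trusting any prediction.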

This dataset contains information compiled by the World Health Organization and the United Nations to track factors that affect life expectancy.

The data contains rows and 22 columns.

Using the datasets above, you should be able to practice various predictive modeling and linear regression tasks.

Lucas is a seasoned writer, with a specialization in pop culture and tech. He spends most of his free time coaching high-school basketball, watching Netflix, and working on the next great American novel.

Automobiles Statistics. This dataset contains figures related to auto production, sales and usage. Automotive Registrations by Country in Europe. Source: European Automobile Manufacturers' Association. Automotive Registrations by Manufacturer in Europe.

Car Production Statistics. Passenger cars are motor vehicles with at least four wheels, used for the transport of passengers, and comprising no more than eight seats in addition to the driver's seat. Commercial vehicles include light commercial vehicles, heavy trucks, coaches and buses. Car Sales by Country. Source: China Association of Automobile Manufacturers. Car Sales Projections for Saudi Arabia.

Global Electric Vehicles Outlook. This is not an official indicator of the IEA. Data on various road transport parameters are provided here. The dataset only includes locally produced models and excludes imported cars. Data for brands that have no local production in a joint venture with a local manufacturer and only import their vehicles are not available, and neither are data for the imported models from brands that do produce some of their vehicles locally.

New cars registration, France. New passenger car registrations by fuel type (detailed). Country: Greece. Includes both new vehicles and used vehicles from abroad. Country: Hungary. Includes both new vehicles and used vehicles from abroad.

Eli Tilevich, Dr. Clifford A. Shaffer, Dr. Dennis Kafura.

Records of AIDS-related statistics from several countries. Information about flight delays in major airports. Information about billionaires from around the world. This library holds data about Broadway shows, such as tickets sold.

The Business Dynamics Statistics (BDS) includes measures of establishment openings and closings, firm startups, job creation and destruction by firm size, age, and industrial sector, and several other statistics on business dynamics for the US. Cumulative cancer deaths for the period are reported for each U. This is a dataset about cars and how much fuel they use.

Records and computed statistics about the top books on Project Gutenberg. This dataset provides data on the number and valuation of new housing units authorized by building permits. Estimates of the total dollar value of construction work done in the U.

Demographic information for counties in the United States. This dataset is about substance abuse (cigarettes, marijuana, cocaine, alcohol) among different age groups and states. Records from different earthquake occurrences across the world. A breakdown of how each county voted in the Presidential primaries.

This data set describes over U. Reported data for includes electrical generation, distribution, revenues, and customers. This data set contains data from through United States Government reports on consumption, production, import, and export of various fuel sources.

The Annual Survey of State Government Finances provides a comprehensive summary of the annual survey findings for state governments, as well as data for individual states. Statistics for various food items. Data about counties' ability to access supermarkets, supercenters, grocery stores, or other sources of healthy and affordable food.

