Top 25 Data Science Interview Questions
A list of frequently asked Data Science interview questions and answers is given below.
1) What do you understand by the term Data Science?
- Data science is a multidisciplinary field that combines statistics, data analysis, machine learning, mathematics, computer science, and related methods to understand data and solve complex problems.
- Data Science is the deep study of massive amounts of data, with the aim of finding useful information in raw, structured, and unstructured data.
- Data science is closely related to data mining and big data techniques, as it also deals with huge amounts of data and extracts insights from them.
- It uses various tools, powerful programming, scientific methods, and algorithms to solve data-related problems.
2) What are the differences between Data Science, Machine Learning, and Artificial intelligence?
Data Science, Machine Learning, and Artificial Intelligence are three related, and often confused, concepts in computer science.
Following are some main points to differentiate between these three terms:
Data Science | Artificial Intelligence | Machine Learning |
---|---|---|
Data science is a multidisciplinary field used for the deep study of data and for finding useful insights from it. | Artificial Intelligence is a branch of computer science that builds intelligent machines able to mimic human intelligence. | Machine learning is a branch of computer science that enables machines to learn from data automatically. |
Data Science is not exactly a subset of artificial intelligence or machine learning, but it uses ML algorithms for data analysis and prediction. | Artificial Intelligence is a wide field that ranges from natural language processing to deep learning. | Machine learning is a subset of Artificial Intelligence and a part of data science. |
The goal of data science is to find hidden patterns in raw data. | The goal of artificial intelligence is to build intelligent machines. | The goal of machine learning is to allow a machine to learn from data automatically. |
Data science finds meaningful insights from data to solve complex problems. | Artificial intelligence creates intelligent machines to solve complex problems. | Machine learning uses data to train models that solve specific problems. |
3) Discuss Linear Regression?
- Linear Regression is one of the most popular machine learning algorithms. It is based on supervised learning and is used for understanding the relationship between input and output numerical variables.
- It applies regression analysis, a predictive modeling technique that finds the relationship between dependent and independent variables.
- It models a linear relationship between the independent and dependent variables, which is why it is called linear regression.
- Linear Regression is used to predict continuous numerical variables such as sales per day, temperature, etc.
- It can be divided into two categories:
- Simple Linear Regression
- Multiple Linear Regression
Simple linear regression models a linear relationship between a single input variable and the output, which can be expressed by the equation Y = b0 + b1X, where b0 is the intercept and b1 is the slope of the fitted line.
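A minimal sketch of fitting a simple linear regression with scikit-learn; the numbers below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data: advertising spend (X) vs. daily sales (y)
X = np.array([[10], [20], [30], [40], [50]])
y = np.array([25, 45, 62, 85, 105])

model = LinearRegression()
model.fit(X, y)

print("intercept (b0):", model.intercept_)
print("slope (b1):    ", model.coef_[0])
print("prediction for X=60:", model.predict([[60]])[0])
```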
4) Differentiate between Supervised and Unsupervised Learning?
Supervised and Unsupervised learning are types of Machine learning.
Supervised Learning:
Supervised learning is based on the concept of supervision: the model is trained on labeled sample data, and on the basis of that training data, it predicts the output for new inputs.
Unsupervised learning:
Unsupervised learning has no concept of supervision: the machine learns on its own from data that is not labeled, classified, or categorized.
Below are some main differences between supervised and unsupervised learning:
Sr. No. | Supervised Learning | Unsupervised learning |
---|---|---|
1. | In supervised learning, the machine learns in supervision using training data. | In unsupervised learning, the machine learns without any supervision. |
2. | Supervised learning uses labeled data to train the model. | Unsupervised learning uses unlabeled data to train the model. |
3. | It uses known input data with the corresponding output. | It uses unknown data without any corresponding output. |
4. | It can be grouped into Classification and Regression algorithms. | It can be grouped into Clustering and Association algorithms. |
5. | It generally involves more complex computation than unsupervised learning. | It generally involves less complex computation than supervised learning. |
6. | It provides more accurate and reliable results. | Its results are generally less accurate and harder to validate. |
7. | It typically uses off-line data analysis. | It can also be applied to data in real time. |
5) What do you understand by bias, variance trade-off?
When we train a supervised machine learning model, it learns from the training data and tries to best estimate the mapping function between the input variable (X) and the output variable (Y). The error in estimating this target function can be divided mainly into bias error and variance error. These errors can be explained as follows:
- Bias Error: Bias is a prediction error introduced into the model by oversimplifying the machine learning algorithm. It is the difference between the predicted output and the actual output. There are two levels of bias:
- High Bias: If the predicted values are very different from the actual values, the model has high bias. Due to high bias, an algorithm may miss the relevant relationships between the input features and the target output, which is called underfitting.
- Low Bias: If the predicted values are close to the actual values, the model has low bias.
- Variance Error: Variance occurs when the model performs well on the training dataset but does not perform well on the test dataset. It can be defined as the error caused by the model's sensitivity to small fluctuations in the training data. High variance causes overfitting, which means the algorithm models the noise in the training data along with the underlying pattern.
Bias Variance tradeoff:
In a machine learning model, we always try to achieve both low bias and low variance, but
- techniques that reduce bias tend to increase variance, and
- techniques that reduce variance tend to increase bias.
Hence, finding an optimal balance between bias and variance is called the bias-variance trade-off. It is often illustrated with a bulls-eye diagram. There are four combinations of bias and variance:
- Low bias and low variance: the predicted outputs are mostly close to the desired outputs. This is the ideal case.
- Low bias and high variance: the predictions are centered on the correct value on average, but the model is not consistent (the predictions are scattered).
- High bias and low variance: the model is consistent, but the predicted results are far from the actual output.
- High bias and high variance: the model is inconsistent and its predictions are far from the actual values. This is the worst case.
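To make the trade-off concrete, here is a small sketch using scikit-learn and synthetic data (all numbers are illustrative): a low-degree polynomial underfits (high bias), while a very high-degree polynomial overfits (high variance), which shows up as a gap between training and test error.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic noisy data following a sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```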
6) Define Naive Bayes?
Naive Bayes is a popular classification algorithm used for predictive modeling. It is a supervised machine learning algorithm based on Bayes' theorem.
It is easy to build a Naive Bayes model, even on a large dataset. The name is made up of two words, Naive and Bayes, where "Naive" refers to the assumption that the features are unrelated to each other.
In simple words, we can say that "the Naive Bayes classifier assumes that, within a class, the features are statistically independent of each other."
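A minimal sketch of training a Gaussian Naive Bayes classifier with scikit-learn on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load a small labeled dataset and hold out 20% for testing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Gaussian Naive Bayes assumes features are conditionally independent given the class
clf = GaussianNB()
clf.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```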
7) What is the SVM algorithm?
SVM stands for Support Vector Machine. It is a supervised machine learning algorithm which is used for classification and regression analysis.
It works with labeled data, as it is a supervised learning method. The goal of the support vector machine algorithm is to construct a hyperplane in an N-dimensional space that separates the objects of the different classes; this hyperplane is also known as the decision boundary.
If there are only two distinct classes, the model is called a binary SVM classifier.
The data points of each class that lie closest to the other class (and hence to the decision boundary) are called support vectors.
There are two types of SVM classifiers:
- Linear SVM classifier: If the set of objects can be separated into their respective groups by a single straight line (hyperplane), the classifier is called a linear SVM classifier.
- Non-linear SVM classifier: A non-linear SVM classifier is used when the objects cannot be separated into two groups by a single straight line.
On the basis of the error function, SVM models can be divided into four categories:
- Classification SVM Type 1
- Classification SVM Type 2
- Regression SVM Type 1
- Regression SVM Type 2
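A minimal sketch of training both a linear and a non-linear (RBF-kernel) SVM classifier with scikit-learn on synthetic binary data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Linear SVM: separates the classes with a single hyperplane
linear_svm = SVC(kernel="linear").fit(X_train, y_train)

# Non-linear SVM: uses the RBF kernel for classes that are not linearly separable
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)

print("linear SVM accuracy:", linear_svm.score(X_test, y_test))
print("RBF SVM accuracy:   ", rbf_svm.score(X_test, y_test))
```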
8) What do you understand by Normal distribution?
- If the given data is distributed symmetrically around a central value in a bell-shaped curve, without any left or right skew, it is said to follow a normal distribution. Because of its shape, it is also called a bell curve.
- The normal distribution is centered on its mean value: half of the data lies to the left of the mean and half lies to the right.
- In probability theory, the normal distribution is also called the Gaussian distribution.
- It is a probability distribution used to describe how data is spread over a given range.
- Normal distribution has two important parameters: mean(µ) and standard deviation(σ).
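A small NumPy sketch illustrating these two parameters: samples drawn from a normal distribution cluster around the mean, and roughly 68% of them fall within one standard deviation of it.

```python
import numpy as np

# Draw samples from a normal distribution with mean 10 and standard deviation 2
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=10, scale=2, size=100_000)

print("sample mean:", samples.mean())   # close to mu = 10
print("sample std: ", samples.std())    # close to sigma = 2

# About 68% of the values fall within one standard deviation of the mean
within_one_sigma = np.mean(np.abs(samples - 10) <= 2)
print("fraction within 1 sigma:", within_one_sigma)
```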
9) Explain Reinforcement learning.
- Reinforcement learning is a type of machine learning in which an agent interacts with an environment and learns from its actions and their outcomes. For each good action it receives a positive reward, and for each bad action it receives a negative reward (penalty).
- The goal of the agent in reinforcement learning is to maximize the total positive reward.
- In reinforcement learning, the algorithm is not explicitly programmed for the task; it learns from experience without human intervention.
- Reinforcement learning differs from supervised learning in that no labeled training dataset is provided to the algorithm; it learns entirely from its own experience.
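As a rough illustration (not a full RL framework), here is a toy Q-learning sketch in Python; the two-state environment, its rewards, and all parameter values are invented purely for demonstration:

```python
import numpy as np

# Toy Q-learning: 2 states, 2 actions, hand-made rewards (purely illustrative)
n_states, n_actions = 2, 2
Q = np.zeros((n_states, n_actions))                        # action-value table
rewards = {(0, 0): 1, (0, 1): -1, (1, 0): -1, (1, 1): 1}   # reward for (state, action)
alpha, gamma, epsilon = 0.1, 0.9, 0.2                      # learning rate, discount, exploration

rng = np.random.default_rng(0)
state = 0
for step in range(1000):
    # Epsilon-greedy action selection: mostly exploit, sometimes explore
    action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
    reward = rewards[(state, action)]
    next_state = action  # toy transition: the chosen action becomes the next state
    # Q-learning update rule
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q)  # the agent learns that action 0 is best in state 0 and action 1 in state 1
```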
10) What do you mean by p-value?
- The p-value is the probability value which is used to determine the statistical significance in a hypothesis test.
- Hypothesis tests are used to check the validity of the null hypothesis (claim).
- P-values can be calculated using p-value tables or statistical software.
- The p-value lies between 0 and 1. There are mainly two cases:
- (p-value < 0.05): A small p-value indicates strong evidence against the null hypothesis, so we reject the null hypothesis.
- (p-value > 0.05): A large p-value indicates weak evidence against the null hypothesis, so we fail to reject the null hypothesis.
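For example, a two-sample t-test from SciPy returns a p-value that can be compared against the 0.05 significance level; the sample values below are made up for illustration:

```python
import numpy as np
from scipy import stats

# Two illustrative samples, e.g. page-load times for two designs (made-up numbers)
group_a = np.array([12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7])
group_b = np.array([12.9, 13.1, 12.8, 13.0, 12.7, 13.2, 12.9, 13.1])

# Two-sample t-test: the null hypothesis is that both groups have the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)

print("t statistic:", t_stat)
print("p-value:    ", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis: the means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```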
11) Differentiate between Regression and Classification algorithms?
Classification and regression are both supervised learning approaches in machine learning, and both use labeled training data to make predictions. The main difference between them is that the output variable of a regression algorithm is numerical (continuous), whereas the output variable of a classification algorithm is categorical (discrete).
Regression Algorithm: A regression algorithm maps the input variable x to a real number such as a percentage, age, price, etc. In other words, regression algorithms are used when the required output is continuous. Linear regression is a famous example of a regression algorithm.
Regression algorithms are used in weather forecasting, population growth prediction, market forecasting, etc.
Classification Algorithm: A classification algorithm maps the input variable x to one of a discrete set of labels such as true or false, yes or no, or male or female. In other words, classification algorithms are used when the required output is a discrete label. Logistic regression and decision trees are popular examples of classification algorithms. Classification algorithms are used for image classification, spam detection, identity fraud detection, etc.
12) Which is the best suitable language among Python and R for text analytics?
Both R and Python are suitable languages for text analytics, but Python is usually preferred because:
- Python has the Pandas library, which provides easy-to-use data structures and data analysis tools.
- Python offers fast execution for most text-processing tasks and has a rich ecosystem of text-analytics libraries.
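A minimal sketch of simple text analytics with Pandas; the reviews below are invented for illustration:

```python
import pandas as pd

# A tiny illustrative DataFrame of free-text feedback (made-up data)
df = pd.DataFrame({
    "review": [
        "Great product, fast delivery",
        "Terrible support, slow delivery",
        "Great value and great support",
    ]
})

# Simple text analytics using Pandas string methods
df["word_count"] = df["review"].str.split().str.len()
df["mentions_delivery"] = df["review"].str.contains("delivery", case=False)

print(df)
print("\nMost frequent words:")
print(df["review"].str.lower().str.split().explode().value_counts().head())
```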
13) What do you understand by L1 and L2 regularization methods?
Regularization is a technique to reduce the complexity of the model. It helps to solve the over-fitting problem in a model when we have a large number of features in a dataset. Regularization controls the model complexity by adding a penalty term to the objective function.
There are two main regularization methods:
L1 Regularization:
- The L1 regularization method is also known as Lasso regularization. L1 regularization adds a penalty term to the error function, where the penalty term is the sum of the absolute values of the weights.
- It performs feature selection by assigning zero weight to unimportant features and non-zero weights to important features.
- The cost function is given below:
  Cost = Σ (yᵢ − ŷᵢ)² + λ Σ |wⱼ|
- Here, Σ (yᵢ − ŷᵢ)² is the sum of the squared differences between the actual and predicted values, λ Σ |wⱼ| is the regularization term, and λ is the penalty parameter that determines how much to penalize the weights.
L2 Regularization:
- The L2 regularization method is also known as Ridge regularization. L2 regularization works the same way as L1 regularization, except that the penalty term is the sum of the squared values of the weights.
- It performs well when all the input features affect the output and all the weights are of roughly equal size.
- The cost function is given as:
  Cost = Σ (yᵢ − ŷᵢ)² + λ Σ wⱼ²
- Here, Σ (yᵢ − ŷᵢ)² is the sum of the squared differences between the actual and predicted values, λ Σ wⱼ² is the regularization term, and λ is the penalty parameter that determines how much to penalize the weights.
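A minimal sketch of both methods with scikit-learn, where the `alpha` parameter plays the role of λ; the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# Synthetic regression data where only 5 of the 20 features are informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0)

# L1 (Lasso): drives many weights exactly to zero, performing feature selection
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso non-zero weights:", np.sum(lasso.coef_ != 0))

# L2 (Ridge): shrinks all weights towards zero but rarely makes them exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge non-zero weights:", np.sum(ridge.coef_ != 0))
```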
14) What is the 80/20 rule? Explain its importance in model validation?
In machine learning, we usually split the dataset into two parts:
- Training set: Part of the dataset used to train the model.
- Test set: Part of the dataset used to test the performance of the model.
A commonly used ratio for splitting the dataset is 80:20, which creates the validation setup for a machine learning model: 80% of the data is assigned to the training set and 20% to the test set. Other ratios such as 90:10, 70:30, or 60:40 can also be used, but 80:20 is the usual default.
Importance of 80/20 rule in model validation:
The process of evaluating a trained model on the test dataset is called model validation in machine learning. In model validation, the ratio used to split the dataset is important for detecting the overfitting problem, because the model must be evaluated on data it did not see during training. The generally preferred ratio is 80:20, which is why it is known as the 80/20 rule, but the best split also depends on the amount of data in the dataset.
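A minimal sketch of an 80/20 split with scikit-learn's train_test_split on the built-in Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 80/20 split: 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("training samples:", len(X_train))  # 120
print("test samples:    ", len(X_test))   # 30
```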
15) What do you understand by confusion matrix?
- The confusion matrix is a fundamental concept in statistical classification.
- A confusion matrix is a table used to describe or measure the performance of a classification model (most commonly a binary classifier) in machine learning.
- The confusion matrix itself is easy to understand, but the terminology used in it can be confusing. It is also known as the error matrix.
- It is used in statistics, data mining, machine learning, and various Artificial Intelligence applications.
- It is a table with two dimensions, "actual" and "predicted", with an identical set of classes along both dimensions.
- The confusion matrix has the following four cases:
- True Positive (TP): the prediction is positive and the actual value is also positive.
- False Positive (FP): the prediction is positive but the actual value is negative.
- True Negative (TN): the prediction is negative and the actual value is also negative.
- False Negative (FN): the prediction is negative but the actual value is positive.
The classification accuracy can be obtained with the formula below:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
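A minimal sketch of computing a confusion matrix and accuracy with scikit-learn; the labels below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative actual vs. predicted labels for a binary classifier
y_actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))
print("check:   ", accuracy_score(y_actual, y_predicted))
```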
16) What is the ROC curve?
ROC stands for Receiver Operating Characteristic. The ROC curve graphically represents the performance of a binary classifier at all classification thresholds. The curve is a plot of the true positive rate (TPR) against the false positive rate (FPR) at different threshold values.
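A minimal sketch of computing the points of an ROC curve (and the area under it) with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Predicted probabilities for the positive class are needed to sweep the threshold
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
# Plotting tpr against fpr draws the ROC curve
```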
17) Explain the Decision Tree algorithm, and how is it different from the random forest algorithm?
- The decision tree algorithm belongs to supervised learning and can solve both classification and regression problems in machine learning.
- A decision tree solves problems using a tree-like structure consisting of decision nodes, branches, and leaves. Each decision node represents a test on an attribute or feature, each branch represents a decision (the outcome of that test), and each leaf represents a final outcome.
- Decision trees often mimic human decision-making, so they are easier to understand than many other classification algorithms.
Difference between Decision Tree and Random Forest algorithm:
Decision Tree Algorithm | Random Forest Algorithm |
---|---|
The decision tree algorithm uses a single tree-like structure to solve classification and regression problems. | The random forest algorithm combines many decision trees and gives the final output based on the aggregated (averaged or majority-vote) outputs of the individual trees. |
A single decision tree has a higher chance of overfitting. | Random forest reduces the chance of overfitting by averaging the predictions of several trees. |
Simpler to understand, as it mirrors human decision-making. | Comparatively more complex and harder to interpret. |
It usually gives less accurate results than a random forest. | It usually gives more accurate results. |
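A minimal sketch comparing the two with scikit-learn on the built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# A single decision tree
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A random forest: an ensemble of 100 decision trees
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("decision tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```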
18) Explain the term “Data warehouse”.
A data warehouse is a system used for the analysis and reporting of data collected from operational systems and other data sources. It plays an important role in Business Intelligence.
In a data warehouse, data is extracted from various sources, transformed (cleaned and integrated) according to the needs of the decision support system, and then loaded into the warehouse.
The data stored in the warehouse is not changed after loading; it is used directly by end users for reporting and data visualization.
Advantages of Data Warehouse:
- Data Warehouse makes data more readable, hence, strategic questions can be easily answered using various graphs, trends, plots, etc.
- Data warehouse makes data analysis and operation faster and more accurate.
19) What do you understand by clustering?
Clustering is a way of dividing data points into groups such that the data points within a group are more similar to each other than to the data points of other groups. These groups are called clusters; similarity within a cluster is high, while similarity between clusters is low.
Clustering techniques are used in various fields such as machine learning, data mining, image analysis, pattern recognition, etc.
Clustering is a type of unsupervised learning problem in machine learning. It can be divided into two types:
- Hard Clustering
- Soft Clustering
20) How to determine the number of clusters in k-means clustering algorithm?
In the k-means clustering algorithm, the number of clusters is fixed by the value of k, which must be chosen before running the algorithm. A common way to choose k is the elbow method: run k-means for a range of k values, plot the within-cluster sum of squares (WCSS, also called inertia) against k, and pick the k at the "elbow" of the plot, i.e., the point after which adding more clusters no longer gives a significant reduction in WCSS. The silhouette score is another common criterion for comparing values of k.
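A minimal sketch of the elbow method with scikit-learn's KMeans on synthetic blob data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with 4 natural clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Elbow method: compute the within-cluster sum of squares (inertia) for several k
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  WCSS={km.inertia_:.1f}")
# Plotting WCSS against k shows a clear "elbow" around k=4, the natural cluster count
```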
21) Differentiate between K-means clustering and hierarchical clustering?
K-means clustering and hierarchical clustering are both machine learning clustering algorithms. Below are some main differences between the two:
K-means clustering | Hierarchical clustering |
---|---|
K-means clustering is a simple clustering algorithm in which objects are divided into k clusters. | Hierarchical clustering shows the hierarchical (parent-child) relationships between the clusters. |
In k-means clustering, we need prior knowledge of k to define the number of clusters, which can sometimes be difficult. | In hierarchical clustering, we do not need prior knowledge of the number of clusters; it can be chosen afterwards as per our requirement. |
K-means clustering handles big data better than hierarchical clustering. | Hierarchical clustering does not handle big data well. |
The time complexity of k-means is linear, O(n). | The time complexity of hierarchical clustering is quadratic, O(n²). |
22) What do you understand by Ensemble Learning?
In machine learning, ensemble learning is the process of combining several diverse base models to produce one better predictive model. By combining all the predictions, ensemble learning improves the stability of the model.
The concept behind ensemble learning is that several weak learners come together to make a strong learner. Ensemble methods help in reducing the variance and bias errors that cause the difference between actual and predicted values. Ensemble learning can also be used for selecting optimal features, data fusion, error correction, incremental learning, etc.
Below are the two popular ensemble learning techniques:
- Bagging:
Bagging is short for Bootstrap Aggregation and is a powerful ensemble method. Bagging applies the bootstrap technique: several datasets are sampled with replacement from the original dataset, a separate model is trained on each sample, and their predictions are averaged. It is particularly effective for high-variance machine learning algorithms such as decision trees, because averaging the individual models reduces the overall variance.
- Boosting:
Boosting is a sequential ensemble method of machine learning. It exploits the dependencies between models and mainly reduces bias in machine learning algorithms. It is an iterative technique that adjusts the weights of the instances in the dataset based on the previous model's classifications: if an instance was classified incorrectly, its weight is increased so the next model focuses on it. In short, boosting combines weak learners into a strong learner. Boosting often achieves better accuracy than bagging, but it can also overfit the training data. AdaBoost is a common boosting algorithm.
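A minimal sketch of both techniques with scikit-learn's BaggingClassifier and AdaBoostClassifier on the built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging: many decision trees trained on bootstrap samples, predictions aggregated
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting (AdaBoost): weak learners trained sequentially, re-weighting misclassified samples
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)

print("bagging accuracy: ", bagging.score(X_test, y_test))
print("boosting accuracy:", boosting.score(X_test, y_test))
```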
23) Explain Box Cox transformation?
A Box-Cox transformation is a statistical technique for transforming a non-normally distributed dependent variable into an approximately normal shape. Normally distributed data is required by many statistical analysis tools, such as control charts, Cp/Cpk analysis, and analysis of variance. If the data is not normally distributed, we need to determine the cause of the non-normality and take appropriate action to make the data normal; the Box-Cox transformation is one such technique for making a non-normal dependent variable approximately normal.
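A minimal sketch with SciPy's boxcox applied to right-skewed synthetic data (Box-Cox requires strictly positive values):

```python
import numpy as np
from scipy import stats

# Right-skewed (non-normal) data, e.g. drawn from an exponential distribution
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)

# boxcox returns the transformed values and the lambda that best normalizes the data
transformed, fitted_lambda = stats.boxcox(data)

print("skewness before:", stats.skew(data))
print("skewness after: ", stats.skew(transformed))
print("fitted lambda:  ", fitted_lambda)
```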
24) What is the aim of A/B testing?
A/B testing is a way of comparing two versions of a webpage to determine which version performs better. It is a form of statistical hypothesis testing: a change is shown to part of the audience, and the results are analyzed to determine whether the change significantly improves the outcome metric of interest (for example, the conversion rate), so that only changes that actually improve the outcome of the strategy are adopted.
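A minimal sketch of analyzing an A/B test with a chi-square test of independence from SciPy; the visitor and conversion counts are invented for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative A/B test results (made-up numbers):
# version A: 200 conversions out of 5000 visitors
# version B: 260 conversions out of 5000 visitors
table = np.array([
    [200, 5000 - 200],   # A: converted, not converted
    [260, 5000 - 260],   # B: converted, not converted
])

# Chi-square test of independence: the null hypothesis is that the conversion
# rate does not depend on the page version
chi2, p_value, dof, expected = chi2_contingency(table)

print("p-value:", p_value)
if p_value < 0.05:
    print("The difference in conversion rates is statistically significant.")
else:
    print("No statistically significant difference was detected.")
```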
25) How is Data Science different from Data Analytics?
Data science and data analytics are related terms that are often confused with each other. Both deal with data, but they differ in how they deal with it. To clear up the confusion between data science and data analytics, some differences are given below:
Data Science:
Data Science is a broad term which deals with structured, unstructured, and raw data. It includes everything related to data such as data analysis, data preparation, data cleansing, etc.
Data science is not focused on answering particular queries. Instead, it focuses on exploring a massive amount of data, sometimes in an unstructured way.
Data Analytics:
Data analytics is the process of analyzing raw data to draw conclusions and meaningful insights from it. To draw insights from data, data analytics applies algorithms and mechanical processes.
Data analytics mainly focuses on inference, which is the process of deriving conclusions from observations.
Data analytics mainly focuses on answering particular questions and performs best when it is focused on a specific problem.