Ace the Data Science Interview

Nishant Tyagi
3 min read · May 12, 2021

Data Science is a vast ocean where different streams of ideas merge. Like an ocean, the subject is also deep: every concept can be dug into further, and there is always a lot more to explore.

Here is a humble attempt to put together multiple concepts that can come in handy before any Data Science interview:

Knowledge about Statistics and Probability

Knowledge of Python and its data types, classes, comprehensions, and web scraping

Knowledge of visualization tools like Power BI and Tableau

Knowledge of Data Science libraries like NLTK, Keras, and TensorFlow

Clear your concepts on traditional machine learning, text processing, reinforcement learning, and deep learning.

Precision vs Recall vs Accuracy

Accuracy is all about identifying the correct labels, positive or negative:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

It can give a good estimate of model performance, but there are a few limitations:

  • For an imbalanced dataset, like telecom churn prediction or credit card fraud detection where 99% of the data carries one label, a model that always predicts the majority class scores 99% accuracy even though every minority-class point is identified incorrectly!
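A quick sanity check of this pitfall, in plain Python with made-up churn numbers:

```python
# Hypothetical churn dataset: 990 customers stay (label 0), 10 churn (label 1)
y_true = [0] * 990 + [1] * 10
# A "model" that blindly predicts the majority class for everyone
y_pred = [0] * 1000

# Fraction of labels predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99 -- looks great, yet every churner is missed
```

This is exactly why precision and recall matter on imbalanced data.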

Precision is about identifying correct positive labels among everything predicted positive: TP / (TP + FP)

Recall (Sensitivity/TPR) is about identifying correct positive labels among all actual positives: TP / (TP + FN)

There is usually a trade-off between precision and recall, and we can choose which to improve based on our requirements.

Say for pregnancy prediction we don't want false positives, so we prefer precision. For cancer prediction we don't want false negatives, so we prefer recall.
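The three formulas above can be computed directly from confusion-matrix counts. A minimal sketch (the counts are illustrative, not from any real model):

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute the three metrics from raw confusion-matrix counts."""
    precision = tp / (tp + fp)                  # of predicted positives, how many were right
    recall = tp / (tp + fn)                     # of actual positives, how many were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # of all points, how many were right
    return precision, recall, accuracy

# Illustrative counts (made up for the example)
p, r, a = precision_recall_accuracy(tp=80, fp=20, fn=40, tn=860)
print(p, r, a)  # 0.8 precision, ~0.667 recall, 0.94 accuracy
```

Note how accuracy looks strong here even though a third of the actual positives were missed, which is the recall number's job to expose.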

TPR vs FPR and ROC(Receiver Operating Characteristics) curve

A random classifier gets an AUC (Area Under the Curve) score of 0.5, while better classifiers score higher, up to 1.0.
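One way to see why a random classifier lands at 0.5: AUC equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A small simulation of this interpretation (pure Python, ties counted as half):

```python
import random

def auc(scores_pos, scores_neg):
    """AUC = P(random positive scores higher than random negative),
    counting ties as half a win."""
    wins = sum((sp > sn) + 0.5 * (sp == sn)
               for sp in scores_pos for sn in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

random.seed(0)
# A "classifier" that outputs random scores regardless of the true label
pos = [random.random() for _ in range(1000)]
neg = [random.random() for _ in range(1000)]
score = auc(pos, neg)
print(score)  # close to 0.5, as expected for a random classifier
```

A classifier whose positive examples consistently score above its negatives would push this toward 1.0.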

ANN vs CNN vs RNN

ANN refers to an artificial neural network in deep learning. Neural networks are used to capture the complex features of real-world data, be it image, audio, or video. Feature engineering refers to feature extraction plus feature selection to improve the performance of our deep learning model.

A deep learning network typically has 3 kinds of layers: the input layer, hidden layers (which process the input), and the output layer.

In an ANN the inputs are processed only in the forward direction. It can work with tabular, image, and text data. Note that the processed input is passed through an activation function to introduce non-linearity in the network; otherwise the network will only learn linear functions. While dealing with an image of size 224*224 with 3 color channels, the trainable weights at the first layer alone (for just 4 neurons) come to 224*224*3*4 = 602,112. See image for reference:

Source- https://www.analyticsvidhya.com/blog/2020/02/cnn-vs-rnn-vs-mlp-analyzing-3-types-of-neural-networks-in-deep-learning/
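That parameter count is simple arithmetic and worth being able to reproduce in an interview:

```python
# First dense layer on a flattened 224x224 RGB image, with 4 neurons:
inputs = 224 * 224 * 3      # every pixel in every channel is a separate input
neurons = 4
weights = inputs * neurons  # one weight per input per neuron
print(weights)              # 602112 trainable weights (biases would add 4 more)
```

And this is only 4 neurons in one layer; realistic layer widths blow the count up further, which is the motivation for architectures that share parameters.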

This is huge, so computational cost becomes a challenge. One way to cut the parameter count is the RNN, wherein we add a looping constraint to the ANN. Since RNNs share the same parameters across different time steps, this results in fewer parameters to train.
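The parameter sharing is easy to see by counting the weights of a simple (Elman-style) recurrent cell; the sizes below are illustrative, not from the source:

```python
# Simple RNN cell: the same weight matrices are reused at every time step,
# so the parameter count is independent of sequence length.
input_size, hidden_size = 128, 64
rnn_params = (hidden_size * input_size    # W_xh: input-to-hidden weights
              + hidden_size * hidden_size # W_hh: hidden-to-hidden (the "loop")
              + hidden_size)              # bias
print(rnn_params)  # 12352, whether the sequence has 10 steps or 10,000
```

Contrast this with a feed-forward network, where a longer input would need its own extra weights.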

https://www.researchgate.net/figure/The-standard-RNN-and-unfolded-RNN_fig1_318332317

One common problem with very large RNNs is the vanishing or exploding gradient: since backpropagation is used to update the weights, the gradient can shrink toward zero or blow up as it propagates backward through many time steps. To work on image data, CNNs are most suitable, and they also work well on sequential data. CNNs capture the spatial features of an image (the arrangement of features and the relationships between them). CNNs also follow the concept of parameter sharing, i.e. a 2*2 feature map is produced by sliding the same 3*3 filter across different parts of the image.

https://www.analyticsvidhya.com/blog/2020/02/cnn-vs-rnn-vs-mlp-analyzing-3-types-of-neural-networks-in-deep-learning
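The sliding-filter idea can be sketched in a few lines of plain Python; the 4*4 image and identity-diagonal 3*3 kernel below are made up for illustration:

```python
# Weight sharing in a CNN: one 3x3 filter (just 9 weights) slides across
# the whole image, so the parameter count doesn't grow with image size.
image = [[1, 0, 1, 0],
         [0, 1, 0, 1],
         [1, 0, 1, 0],
         [0, 1, 0, 1]]
kernel = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1]]  # the same 9 weights are reused at every position

def conv2d(img, k):
    """Valid (no-padding) 2D convolution by sliding k over img."""
    n, m, kn = len(img), len(img[0]), len(k)
    return [[sum(img[i + a][j + b] * k[a][b]
                 for a in range(kn) for b in range(kn))
             for j in range(m - kn + 1)]
            for i in range(n - kn + 1)]

print(conv2d(image, kernel))  # a 4x4 image and a 3x3 filter give a 2x2 map
```

The filter responds strongly wherever the image locally matches its diagonal pattern, which is how convolution captures spatial structure with so few weights.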
