Balancing Up with SQL and Database Management

I had understood very early on while learning the basics of data science that the three pillars of a sturdy analytics structure are statistics, a programming language, and database management. So, after covering the first two in my previous posts, it's only natural that I move on to database foundations.

During Fall 2018, I started learning the basics of databases in Dr. James Scott's class. The man is a gifted speaker and entertainer. His class was full of marvelous impressions, anecdotes from his wide variety of experiences, and exciting PowerPoint presentations. It was here that I understood the concept of data modeling through topics like primary and foreign keys, Entity Relationship Diagrams (ERDs), schemas and sub-schemas, weak and strong relationships, and normalization. However, the most important part of this class was that it got me started on THE MOST IN-DEMAND TOOL asked for in every job role I desire – SQL!
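
To make those modeling ideas concrete, here is a minimal sketch of a primary key / foreign key relationship. The students and enrollments tables are hypothetical, and I'm running the SQL through R's DBI/RSQLite packages purely so the snippet is self-contained:

```r
# Hedged sketch: primary and foreign keys in an in-memory SQLite database.
# The students/enrollments schema is made up purely for illustration.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Strong entity: each student is identified by its own primary key.
dbExecute(con, "
  CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
  )")

# Dependent entity: each enrollment points back to a student through a
# foreign key – the line an ERD would draw between the two entities.
dbExecute(con, "
  CREATE TABLE enrollments (
    enrollment_id INTEGER PRIMARY KEY,
    student_id    INTEGER NOT NULL REFERENCES students(student_id),
    course        TEXT NOT NULL
  )")

dbDisconnect(con)
```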

Photo by Tobias Fischer on Unsplash

As my friend Ankita loves saying – SELECT is written in our star(*)s. It was a delight to work on class assignments that tested our knowledge of dependencies, NULL values, SQL functions, relational operators, joins, sub-queries, and views. We also got into the basics of transaction management using SQL. And since we had worked extensively with relational databases for most of the class, Dr. Scott spent the last leg of the semester teaching us the basics of NoSQL and MongoDB. It formed a great runway for my future big data endeavors.
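
For a taste of the retrieval work those assignments involved, here is a hedged sketch of a join combined with a sub-query. The customers and orders tables are invented, and I'm again using DBI/RSQLite so the query can be run as-is:

```r
# Hedged sketch: a join plus a sub-query against toy tables in SQLite.
# The customers/orders data is invented for illustration only.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "customers",
             data.frame(customer_id = 1:3,
                        name = c("Asha", "Ben", "Chen")))
dbWriteTable(con, "orders",
             data.frame(order_id    = 1:5,
                        customer_id = c(1, 1, 2, 3, 3),
                        amount      = c(20, 35, 15, 50, 10)))

# Join customers to their orders and keep only those whose total spend
# exceeds the average order amount (computed by the sub-query).
dbGetQuery(con, "
  SELECT c.name, SUM(o.amount) AS total_spent
  FROM customers AS c
  JOIN orders AS o ON o.customer_id = c.customer_id
  GROUP BY c.name
  HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
")

dbDisconnect(con)
```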

My SQL and database learning during this semester culminated in a project where I got my hands dirty with some data munging, database modeling, and even regression using SQL and R. Just cleaning this data before we could perform any kind of retrieval was a task in itself. Thanks to this class, I find myself proficient in creating ERDs and working with various SQL joins and clauses to retrieve simple as well as aggregated data from complex data sets.
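
As a rough illustration of how SQL and R came together in that project, the sketch below lets SQL do the aggregation and hands the result to R's lm(). The trips table and its columns are stand-ins, not the actual project data:

```r
# Hedged sketch: aggregate with SQL, then regress in R.
# The trips data below is simulated; the real project data was far messier.
library(DBI)

set.seed(42)
trips <- data.frame(day      = rep(1:30, each = 4),
                    distance = runif(120, min = 1, max = 15))
trips$fare <- 3 + 2.2 * trips$distance + rnorm(120)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "trips", trips)

# SQL handles the munging/aggregation ...
daily <- dbGetQuery(con, "
  SELECT day,
         AVG(distance) AS avg_distance,
         AVG(fare)     AS avg_fare
  FROM trips
  GROUP BY day
")

# ... and R handles the regression on the aggregated result.
fit <- lm(avg_fare ~ avg_distance, data = daily)
summary(fit)

dbDisconnect(con)
```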


This is the third post of my #10DaysToGraduate series where I share 10 key lessons from my Master’s degree in the form of a countdown to May 8, my graduation date.

Diving Deep into Business Analytics with R Programming

When a class is named after your graduation major – one of the most popular disciplines in the world today – you know it's going to be pivotal in your learning path. BA with R proved to be just that. The brilliant Dr. Sourav Chatterjee made it clear right at the beginning that R programming was going to be used just as a tool (which it is) to understand and master the nuances of business analytics. Having said that, his course material left no stone unturned in taking us through all aspects of R programming needed for data science.

I had worked a bit with Java and PHP, but this was my first experience with the R programming language. I started with an introductory course on DataCamp to quickly learn the very basics of R, like vectors, matrices, and data frames. Then, in class, Dr. Chatterjee proved to be a dedicated and patient professor as he started with basic manipulations and sample generation in R before quickly moving on to the foundations of data analytics. We got familiar with libraries like tidyverse, forecast, and gplots, and toyed with data visualization using ggplot on some interesting data sets. We created several plots, graphs, charts, and heatmaps before scaling up to larger data sets.
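
As a small taste of that visualization work, here's a hedged ggplot2 sketch on the built-in mtcars data set (a stand-in for the data sets we actually used in class):

```r
# Minimal ggplot2 sketch on the built-in mtcars data
# (a stand-in for the class data sets).
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +
  labs(title  = "Fuel efficiency vs. weight",
       x      = "Weight (1000 lbs)",
       y      = "Miles per gallon",
       colour = "Cylinders")
```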

This was followed by some of the most important things a business analyst or data scientist learns in their career. So far, everything had looked pretty straightforward to me, but now was the time to push boundaries and actually dive deep into analytics. I was introduced to dimension reduction, correlation matrices, and the all-important analytics task of principal component analysis (PCA). I learnt how to evaluate model performance, create lift and decile charts, and assess classification results with the help of a confusion matrix – all with just a few lines of code. As Dr. Chatterjee explained time and again, it was never about the code. It was about knowing when and how to use it and what to do with the result.
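
To show just how few lines those ideas take in R, here is a hedged sketch of PCA followed by a confusion matrix, using the built-in iris data and a simple k-nearest-neighbours classifier as stand-ins for the class examples:

```r
# Hedged sketch: PCA plus a confusion matrix on the built-in iris data.
library(class)   # for knn()

# Dimension reduction: project the four numeric columns onto principal components.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)     # proportion of variance explained by each component

# Classify species from the first two components, then evaluate the
# predictions against the actual labels with a confusion matrix.
set.seed(1)
idx  <- sample(nrow(iris), 100)                        # simple train/test split
pred <- knn(train = pca$x[idx, 1:2],
            test  = pca$x[-idx, 1:2],
            cl    = iris$Species[idx], k = 5)
table(Predicted = pred, Actual = iris$Species[-idx])   # confusion matrix
```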

Dr. Sourav Chatterjee’s BA with R class

We then followed the natural analytics progression with linear and multiple regression, where I learned about partitioning data and generating predictions. This was followed by a thorough understanding of the KNN model and how and when to run it. By now, I was beginning to get the hang of problem statements and the approach to take to solve them, thanks to class assignments on real-world scenarios like employee performance and spam detection. Through the examples done in class, it was easy to grasp the concepts of R-squared and p-values and the roles they play in model evaluation. It was in this class that I understood logistic regression, discriminant analysis, and association rules for the first time, and I have been working with them ever since, in every data science course or project that I have taken up.
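
A hedged sketch of that regression workflow – partition the data, fit a multiple regression, read off R-squared and p-values, then predict on the hold-out set – might look like this (again on built-in mtcars rather than the assignment data):

```r
# Hedged sketch: data partitioning, multiple linear regression, and prediction.
# mtcars stands in for the class assignment data.
set.seed(7)
idx   <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))  # ~70/30 split
train <- mtcars[idx, ]
valid <- mtcars[-idx, ]

fit <- lm(mpg ~ wt + hp, data = train)   # multiple regression on the training partition
summary(fit)                             # coefficients with p-values, plus R-squared

preds <- predict(fit, newdata = valid)   # predictions on the hold-out partition
head(preds)
```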

All of this knowledge and Dr. Chatterjee's guidelines were put to use in the final project, where I worked with a group led by the talented Abhishek Pandey on London cabs data. After rigorous work on large data sets downloaded and extracted from various sources, we trained models to predict cab arrival times, comparing RMSE across random forests, logistic regression, and SVMs. It was a great way to put into practice everything we had learned over four months.
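
The model-comparison step boiled down to computing RMSE on a hold-out set for each candidate. A hedged sketch of that idea, on simulated data and with just a linear model and a random forest standing in for the project's candidates, looks roughly like this:

```r
# Hedged sketch: compare candidate models by hold-out RMSE.
# The data is simulated; the real project used the London cabs data set.
library(randomForest)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(11)
n   <- 500
toy <- data.frame(distance = runif(n, 1, 20),
                  hour     = sample(0:23, n, replace = TRUE))
toy$arrival_min <- 5 + 2 * toy$distance + 0.3 * toy$hour + rnorm(n, sd = 2)

idx   <- sample(n, size = 0.7 * n)
train <- toy[idx, ]
valid <- toy[-idx, ]

lin <- lm(arrival_min ~ distance + hour, data = train)
rf  <- randomForest(arrival_min ~ distance + hour, data = train)

rmse(valid$arrival_min, predict(lin, valid))   # linear model RMSE
rmse(valid$arrival_min, predict(rf,  valid))   # random forest RMSE
```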

And with that, I had laid a robust foundation in data analytics and was ready to build on it in the time to come. By January 2019, I was confident enough to dive into analytics projects and work on complex data sets to generate prediction models using the tools taught by Dr. Sourav Chatterjee.


This is the second post of my #10DaysToGraduate series where I share 10 key lessons from my Master’s degree in the form of a countdown to May 8, my graduation date.