SOP for Thesis-based Master's in Computer Science with Interest in ML/NLP

nickil21 1 / -

Nov 17, 2019 #1

My SOP is pretty verbose, but it fits in two pages. I would be really thankful for any feedback or suggestions that would make it stronger. I really appreciate your help and the time you take to read my SOP. Have a nice day!

STATEMENT OF PURPOSE

I first came across the term "machine learning" while searching Quora in November 2014, when I stumbled on a post about the Netflix competition to improve the accuracy of its recommendation engine by 10 percent. Needless to say, I was hooked, because it combined two of my favorite activities - learning and competing. On exploring further, I came across Kaggle, which had an active competition at the time - "Avazu Click-Through Rate Prediction". The dataset had more than 10 million rows, and I wasn't able to run any algorithm on my 8GB RAM machine to get decent results. Upon investigating, I came across a technique used by Google for their online CTR prediction system: Per-Coordinate FTRL-Proximal with L1 and L2 regularization for logistic regression, combined with the hashing trick and one-hot encoding [1]. Memory usage dropped drastically to a couple of MBs, and I was able to optimize the execution speed. I knew, though, that I had to learn the foundations in order to understand this subject deeply enough.

As a beginner in the field, I focused on finishing the book "Pattern Recognition and Machine Learning" by C. M. Bishop. The first step was to make sure I had covered the basic background material: calculus, probability theory and statistics, and linear algebra. In 11th and 12th grade, I had been exposed to many concepts from probability and calculus, so multivariate calculus and probability were covered. For linear algebra, beginning at the end of January 2015, I worked through Prof. Gilbert Strang's video lectures on MIT OpenCourseWare and solved the accompanying assignments. I felt that was enough preparation for anything the book would cover. It took 2.5 months to complete the book, and once I had finished it, I was keen to read other books on mathematical optimization and artificial intelligence more generally.

My first foray into research came right after graduating in 2015 with a CGPA of 7.75, when I did a six-month research internship at MSRIT, Bangalore, working on an NLP project - "Sentiment Analysis in the healthcare field". This was my first experience dealing with textual data. To get familiar with the problem statement, I read close to 20 papers from the literature and implemented traditional algorithms including SVM, Naive Bayes, BayesNet, and Maximum Entropy. After trying different approaches, I found that an N-gram proximity-based algorithm gave the highest F-score on the holdout set. In my spare time, I also competed on competitive coding sites to hone my Python and algorithmic skills. I managed to secure a global rank under 1500 among 0.3 million users on Sphere Online Judge (SPOJ).

To gain working knowledge of how to derive insights from data, I started answering data-science-specific questions on Stack Overflow. I wrote 425 answers on topics including Python, NLP, Pandas, NumPy, scikit-learn, Machine Learning, and Matplotlib, and was one of India's top writers for these tags in early 2017. I have accumulated over 20,000 reputation points, and my posts have helped 2 million people worldwide.

Another encounter with independent research came during my time at XXX (Sept 2016 - Feb 2018), where the founder, who holds a PhD from KTH Stockholm, saw research potential in me and gave me a task for which little to no open-source resources were available. The task was to recommend and rank similar images based on a selected segment of a query image. This led me to read SOTA papers on object detection and image segmentation, visit Kaggle forums and blogs, and come up with a research plan. Ultimately, the final algorithm, a slight variant of selective search [2], performed well enough to correctly identify patterns in minuscule segments and recommend matches with great precision.

While working at XXX (May 2018 - current), I built NLP capabilities for a task-oriented dialogue system, spanning English as well as select vernacular languages: entity extraction and linking, parse classification modules, machine translation for low-resource Indic languages, an FAQ bot, code-mixing identification, unsupervised synonym detection, etc. Overall, the customer satisfaction score increased by 30% and the number of transacting customers by 40% (YoY).

In September 2018, I participated in a research-based competition hosted by DrivenData featuring data from Schneider Electric. More than 1200 data scientists from all over the world competed to build the most reliable forecasts given only a few days of historical data for each building. I finished in the top 1% (11/1200). My algorithm brought the median absolute percent error across daily and weekly forecasts down to 7% - less than half that of the LSTM benchmark. The energy consumption forecasts were also used by facility managers, utility companies, and building commissioning projects to implement energy-saving policies and optimize the operations of chillers, boilers, and energy storage systems.

Later the same year, I also took part in an NLP research-based competition hosted by Kaggle - "Quora Insincere Questions Classification" - which imposed a limit of 2 hours of GPU run-time to train models and generate predictions. I finished in the top 3% (93/4037) and won a silver medal. One of the highlights of my work was the use of bucketing, which builds each mini-batch from instances of similar length to reduce the cost of padding. This made training more than 3x faster, so I could run more epochs for each split of the 5-fold cross-validation. Another improvement came from learning a distribution of embeddings for each word instead of learning fixed vectors [3].

Towards the end of 2018, I was selected to represent XXX at the NITI Aayog Indic NLP Workshop. I consider myself extremely fortunate and humbled to have had the opportunity to interact and brainstorm with some of the best minds in industry and academia on the challenges of understanding Indic languages. This session re-emphasized the power of ML and AI and how they will soon create a huge impact, touching millions of lives.

XXX is my top-choice school because it is a large school with world-class faculty in a variety of fields, especially AI. I had a chance to interact with Prof. XXX, and I believe that our research interests are closely aligned. He brings rich work experience in NLP and ML, with a focus on structured prediction. His work on probabilistic grammars, document summarization, and parsing algorithms has intrigued me. Working under his guidance would help me broaden my horizons and gain a deeper understanding of this field of study. Upon completion of the thesis-based master's, preferably with financial aid or an assistantship, I plan to pursue a PhD in a related field (ML/NLP) and work on challenging problems with a research focus.

Maria - / 1096

Nov 19, 2019 #2

@nickil21
Hi there. Thanks for coming to the forum! I hope my feedback helps you.

While I appreciate the storytelling in the introduction, I find that it still lacks a straightforward approach that tells readers almost instantly what they should anticipate from the rest of the text. Generally speaking, inserting a thesis statement at the very beginning that briefly touches on why this writing is relevant would help guide readers through the entirety of the text.

Furthermore, when you're relaying information about your personal experiences in the field, it would be much better to prioritize which pieces of data are actually relevant (and which can be tossed out). Remember that a more specific approach to writing will benefit you because it saves readers from having to wade through information that won't contribute to the overall assessment in the end. In line with this, I strongly suggest rereading and revising the second through fifth paragraphs.

Try also to categorize and substantiate your points with a more organized approach. For instance, the latter parts of the text could have been presented in a more structured pattern that follows a time sequence, which would avoid cluttering the text as a whole.