Preparing for the Transition to Data Science

Published in Insight · 7 min read · Mar 14, 2016

Roni Kobrosly is a Program Director for the New York Insight Data Science Program. Before this he was a Fellow in the 2015 Summer Insight session, where he developed an app to help social scientists write in the style of the top journal articles in their field. You can learn more about this app and his other data analysis projects here.

Kevin Mercurio is a Data Scientist and Lead Mentor for Insight Data Labs. After a PhD spent analyzing data and studying the Higgs Boson at the Large Hadron Collider, Kevin joined the Insight Data Science team to provide unique educational experiences and help PhDs and postdocs make transitions to careers in data science.

The data science industry has grown quite a bit since our original Preparing for Insight post. For starters, the U.S. now has a Chief Data Scientist at the White House: DJ Patil, our very own former Insight advisor.

We know from firsthand experience that the transition out of academia can seem daunting. As part of the Insight team, we attend career panels at universities across the country to speak with scientists who are considering career transitions. One of the most common questions we get asked is: “What skills and tools should I be learning?”

Since 2012, Insight has helped 400+ of the brightest PhDs and postdocs transition into top industry positions, and we’ve learned a lot about how to make the transition efficient. Below you will find suggestions and resources that will help you with your own transition.

Skills to Cultivate Before You Transition to Data Science

By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meets Columbo — starry eyed explorers and skeptical detectives.

MONICA ROGATI
FORMER VP DATA SCIENCE AT JAWBONE
INSIGHT ADVISOR AND FREQUENT VISITOR TO INSIGHT IN SILICON VALLEY
VIA FORBES

Programming

There are many languages for conducting data science work: Python, R, MATLAB, Stata, SAS, and so on. However, we’ve found that the general trend in data science is towards Python.

Python is a general-purpose programming language with a growing number of modules for data analysis, including SciPy, NumPy, pandas, StatsModels, and scikit-learn, as well as many visualization tools like seaborn, matplotlib, and ggplot.

Action Items:

Databases

As mentioned in our previous post, academic researchers typically store their data in flat files, such as CSV files. However, data scientists in industry typically need to handle much larger datasets, often with complex linkages to other datasets. You will need to become more familiar with relational database systems like MySQL and PostgreSQL (or Postgres). These systems allow you to query data quickly by linking together many datasets within a larger database.

MySQL is still one of the most commonly used databases in industry, but increasingly we’ve found that PostgreSQL is the database of choice. Postgres has some richer features and many say that it’s better at joining/merging datasets than MySQL. Thankfully, MySQL and Postgres use highly similar syntax for querying, so learning how to use one database system will give you a foundation to work with both.

Action Items:

  • Work through the “basic” and “intermediate” lessons at Mode Analytics’ SQL School. Insight Fellows love this site because it includes a nice platform for testing your queries on real databases in your browser!
  • Practice connecting Python to MySQL or Postgres. Fellows from our previous sessions told us this piece was particularly helpful for building projects.
  • If you have more time, check out this tutorial on SQLAlchemy, an increasingly popular Python module that allows you to access and modify SQL databases in a more “pythonic” way.
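As a rough illustration of what connecting Python to a relational database and linking datasets looks like, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL or Postgres (an assumption made so the example runs without a database server), and the `users` and `purchases` tables are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for a MySQL/Postgres server (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical tables linked by user_id.
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE purchases ("
    "purchase_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
cur.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5)],
)
conn.commit()

# Join the two tables and total the purchase amounts per user.
query = """
    SELECT u.name, SUM(p.amount) AS total_spent
    FROM users AS u
    JOIN purchases AS p ON p.user_id = u.user_id
    GROUP BY u.name
    ORDER BY u.name
"""
totals = cur.execute(query).fetchall()
print(totals)

conn.close()
```

The same connect, execute, and fetch pattern carries over to the common MySQL and Postgres drivers for Python, although connection arguments and placeholder syntax differ (for example, `%s` rather than `?`).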

Data Analysis in Python

Once you gain some experience with Python and organize your databases, you will be ready to dig into some analysis. As we mentioned above, there are a number of Python modules for analyzing data (SciPy, NumPy, etc.). In addition to these modules, we also recommend using Wes McKinney’s pandas package to analyze your data.

Action Items:

  • A great way to learn pandas is through exercises, like these ones from Wes McKinney.
  • If you have trouble understanding what is going on, the pandas “Cookbook” is a valuable resource.
  • Review Big Data University’s Jupyter notebook on an exploratory data analysis.
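To give a flavor of the kind of analysis pandas makes easy, here is a small sketch of a first exploratory pass. The data is a hypothetical toy table built by hand; in practice it would more likely come from a CSV file or a SQL query.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
df = pd.DataFrame({
    "city": ["NYC", "NYC", "SF", "SF", "SF"],
    "year": [2014, 2015, 2014, 2015, 2015],
    "sales": [10.0, 12.0, 7.0, 9.0, 11.0],
})

# Summary statistics for the numeric columns.
print(df.describe())

# Average sales per city: split by city, then aggregate.
mean_sales = df.groupby("city")["sales"].mean()
print(mean_sales)

# Boolean indexing: keep only the rows from 2015.
recent = df[df["year"] == 2015]
print(recent)
```

Grouping, aggregating, and filtering like this cover a large share of everyday exploratory work, which is why they make good first exercises.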

Machine Learning

Expertise with machine learning methods is an integral part of data science. Gaining familiarity with the landscape of available methods will help you figure out what direction to go in when trying to solve a real-world problem.

Action Items:

Computer Science

At many companies, data science job applicants are routinely asked to write code on a whiteboard as part of the interview process; at others, this happens rarely, if ever. But being prepared for these types of interviews shouldn’t be your only goal. Data scientists often work closely with engineering teams, and being able to understand your colleagues and teammates is crucial for doing great work. So it’s important to be familiar with CS fundamentals like algorithms and data structures.
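For a sense of what a fundamentals exercise can look like, here is a sketch of binary search, a classic algorithm over a sorted list. It is offered only as an illustrative example of the genre, not as a question any particular company is known to ask.

```python
def binary_search(sorted_items, target):
    """Return the index of target in sorted_items, or -1 if it is absent."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1


print(binary_search([1, 3, 5, 7, 9], 7))
print(binary_search([1, 3, 5, 7, 9], 4))
```

Each pass halves the remaining search range, so the number of comparisons grows logarithmically with the length of the list, compared with linearly for a simple scan.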

Action Items:

Putting the tools to work

The job of the data scientist is to ask the right questions.

HILARY MASON
FOUNDER AND CEO OF FAST FORWARD LABS
INSIGHT ADVISOR AND FREQUENT SPEAKER AT INSIGHT NYC
VIA FORBES

All of the resources above are simply tools that will help you solve real problems. But your goal as a data scientist should be to get better at finding the right tool, the right model, and perhaps most importantly, asking the right questions of your data. The best way to develop these abilities is to practice: put your tools to work and answer questions with real data.

Action Items:

  • Determine a question you may be interested in answering by using data. What are you passionate about? What would make your life more efficient? What would improve the city you live in?
  • Search for datasets related to the question. There are many repositories of interesting datasets online, and many social media and news outlets have APIs with great documentation.
  • Start exploring the data and applying your tools. When you get stuck, ask for help! The data science community is often warm, responsive, and eager to help. Share your code online on GitHub, post on forums, and find friends to work with!

Preparing for Insight’s Learning Environment

Insight is not a traditional school with lectures and coursework. We build real data-related projects. We’ve created a unique social learning environment that is extremely collaborative and much different from an academic research setting. In the tech industry, one often has to continually reevaluate priorities and work in an agile manner rather than letting projects go on for long timescales with less flexibility. At Insight, we help you get ready for this new, flexible workstyle.

Action Items:

  • Check out Sasha Laundy’s talk on how to effectively give and ask for technical feedback. For those coming from a competitive academic environment, there can be a temptation to struggle alone with problems. To thrive in a fast-paced tech environment, communication and collaboration are key!
  • Try pair programming, the practice of two people working together on a coding problem at one computer: one person (the driver) writes the code while the other (the navigator) reviews the lines and concepts as they are typed. Pair programming is a great way to create simpler, well-documented code and to find and resolve bugs. It’s also a great way to pick up new tricks and learn each other’s strengths. Try it out!
  • Attend data science meetups, join an organization like PyLadies, or find a “hackathon” near you. From time to time, Insight will host a hackathon event, like this one!

Parting Advice

Data science is an exciting, burgeoning field, and employers are hungry for your analytic skills! One final piece of advice: immerse yourself in the data science world by keeping abreast of the latest news in the field. Here are some great data science news sources that we stay on top of:

  • Hacker News: A social news website focusing on computer science, data science, and entrepreneurship. It is run by Y Combinator, a well-known startup incubator. Don’t be thrown off by the name! In its original usage, the term “hacker” has nothing to do with cyber criminals; it refers to someone who comes up with clever solutions to problems through their programming skills.
  • Data Science Weekly: The latest data science news and helpful resources (books, meetups, and datasets).
  • Insight Blog: We’ve got a number of amazing posts. Program Director Emily Thompson’s Academia to Industry: Data Science Myths and Truths is particularly relevant to those transitioning into industry. Also, check out our Data Engineering Blog, which includes an awesome post about The New Data Engineering Ecosystem: Trends and Rising Stars, with a beautiful interactive map of Data Engineering tools.

Interested in transitioning to a career in data science?
Find out more about the Insight Data Science Fellows Program, apply today, or sign up for program updates.
