DS Roadmap

Full-Stack Data Science Roadmap 2023

Hey there! So you're looking to become a full-stack data scientist? Well, I can definitely say that it's a challenging but achievable goal. You start by mastering the fundamentals - linear algebra, calculus, statistics, basic Python, SQL, and data structures - these are the building blocks that every data scientist should know. From there, you can choose to specialize in data engineering, data analytics, or machine learning.

a roadmap with all skills needed to become an expert full-stack data scientists

Samuel Westby

February 20th, 2023

The truth is, knowing all parts of the stack (data engineering, data analytics, and machine learning) is extremely valuable, but it's also very difficult because there's so much to learn! But don't worry, by doing things, getting stuck, and learning, you'll be on your way to becoming a full-stack data scientist in no time!

Here's how to use this guide. First, master the fundamentals. Second, choose your adventure: data engineering, data analysis, or machine learning. Third, learn how to share your models using deployment strategies and web development. In each section I provide sub-sections and many resources for learning each subsection.

As a data scientist, it's easy to get lost in the technical aspects of the job, but it's important to remember that communication skills are just as critical to success. Even if the roadmap doesn't specifically mention it, being able to clearly communicate your findings and insights to a non-technical audience is a vital part of the role. It's all about making sure the work you've done gets put into action, and that requires being able to explain it in a way that everyone can understand. So don't forget about communication, even if it's not explicitly listed on the roadmap! It'll pay off in the long run. As you work through the topics below, try explaining them to your friends in a way they'll understand - the bird's eye view. A master can explain their topic to someone at any level.

Probability and Statistics

2. Basic Python

The Basics

Interactive Computing

3. Data structures

4. Basic SQL and database management

5. Version control (Git)

6. Algorithms

Data Engineering

1. Data storage systems

Databases

Data warehousing

2. Data processing frameworks

Pandas

Other

Data Analytics

1. Exploratory data analysis

2. Data visualization

3. Hypothesis testing

Machine Learning

1. Machine learning algorithms

3. Deployment strategies

Web Development

1. HTML, CSS, and JavaScript

2. REST APIs

3. Frameworks

Real-World Projects

1. Non-profits.

2. Open-source projects.

3. Hackathons and competitions.

Fundamentals

The fundamentals are crucial in becoming a full-stack data scientist because they provide the foundation for everything else that you'll need to learn. Basic math and programming concepts are essential for understanding complex data structures and algorithms, as well as building and managing databases. Exploratory data analysis, data visualization, and hypothesis testing give you the skills to understand and interpret data, and the ability to make informed decisions based on that data. Mastering the concepts below will give you a solid foundation to become a data scientist.

Math

Math is important for data science because it provides the foundation for many machine learning algorithms and data analysis techniques. It is used for tasks such as optimizing models, solving systems of equations, and understanding the relationships between variables in large datasets.

Linear Algebra

Data scientists use linear algebra to manipulate data at a large and general scale. Many algorithms leverage linear algebra for processing acceleration.

Resources (pick one):

Calculus

Data scientists use calculus for many techniques. A quite popular technique is gradient descent for model optimization. You do not need to learn the depths of calculus proofs. Just have an understanding of differentiation and integration.

Resources (pick one):

Calculus for Machine Learning YouTube Playlist (up to video 22)
A Mathematics Course for Political and Social Research (free pdf, good high-level introduction)

Probability and Statistics

Probability and statistics allow data scientists to be certain about conclusions they draw from the data. They provide tools to make claims about truth, correlation, change, and causality.

Resources (pick one):

Basic Python

Basic Python is important for data science because it is a widely used and versatile programming language that is well-suited for data analysis and machine learning tasks, with a large community and many powerful libraries and tools such as NumPy, Pandas, and scikit-learn. Data scientists use Python more than any other language. Visualization is also an important tool. Start with matplotlib.

The Basics

Resources (pick one):

Intro to Python for Computer Science and Data Science (free pdf)
Course 1, 2, and 3 of this Coursera specialization
Course 1, 2 and 3 of Python for Everybody Coursera specialization

Interactive Computing

Resources (pick one):

IPython Cookbook (Chapters 1-3)
Corey Schafer's Jupyter Notebook tutorial (30 minutes)

Data structures

Data structures are important for data science because they provide a way to organize and efficiently manipulate large amounts of data, allowing data scientists to perform complex data analysis tasks and make informed decisions.

Resources (pick one):

Course 2 of Python for Everybody Coursera specialization
Chapter 4 of the pythonds book

Basic SQL and database management

Basic SQL and database management is important for data science because it enables data scientists to efficiently store, manipulate, and retrieve large amounts of data, which is crucial for data analysis and machine learning tasks.

Resources (pick one):

Course 5 of this Coursera specialization
Course 4 of Python for Everybody Coursera specialization

Version control (Git)

Version control is important for data science because it allows data scientists to track changes to their code, collaborate with others, and revert to previous versions if necessary, ensuring the integrity and reproducibility of their work.

Resources (pick one):

Git in 1 hour YouTube video
ohmygit.org (open-source game that teaches git)

Algorithms

Algorithms are important for data science because they provide the basis for machine learning models and other data analysis techniques, enabling data scientists to make predictions, classify data, and identify patterns in complex data sets.

Resources (pick one):

Do lots of problems on LeetCode or HackerRank
Awesome Algorithms (GitHub repo with TONS of resources - the cheat sheet section has great resources like this)
Stanford Algorithms Course

Data Engineering

Data Engineering provides the infrastructure for collecting, storing, and processing large amounts of data. Data engineers design and implement data pipelines, workflow management systems, and data storage systems that ensure data are easily accessible, scalable, and secure. Without the contributions of data engineers, data scientists would be unable to work with the large and complex data sets

Data storage systems

Data storage systems provide the means for storing large amounts of structured and unstructured data and support the data processing needs of data scientists and other stakeholders. They come in many different forms.

Databases

Resources (pick one):

Carnegie Mellon University databases course

Data warehousing

Resources (pick one):

Data processing frameworks (Pandas, PySpark, Dask)

Data processing frameworks, such as Pandas, Apache Spark, and Dask, provide tools for efficiently manipulating, transforming, and aggregating large datasets. This helps data scientists to work with complex data.

Pandas

Resources (pick one):

Other

Resources (pick one):

Learning Spark: Lightning Fast Data Analytics (free pdf)
End-to-end PySpark project Udemy course (paid course)
Dask introduction

Data Analytics

Data analytics refers to the process of examining, cleaning, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.

Exploratory data analysis

Exploratory Data Analysis (EDA) is an approach to analyzing and understanding data by summarizing its main characteristics and identifying patterns, anomalies, and relationships between variables.

Resources (pick one):

Data visualization

Data scientists use visualization as a tool to effectively communicate insights and findings from complex data to stakeholders, as well as to explore and understand the relationships and patterns in the data.

Resources (pick one):

Chapter 6 of the IPython Cookbook
Corey Schafer's matplotlib YouTube playlist
Visual Analysis and Design by Tamara Munzner (visualization theory, very good)

Hypothesis testing

Data scientists use hypothesis testing to make conclusions about an entire population based on a sample of data.

Resources (pick one):

Machine Learning

Machine learning is a tool to learn from data and make predictions or decisions without being explicitly programmed. It involves training algorithms on large datasets to identify patterns and relationships in the data. Data scientists use these patterns and predictions to generate insights and anticipate the future.

Machine learning algorithms

Supervised learning is a type of machine learning where the algorithm is trained on labeled data, meaning that the desired outcome or label is provided for each example in the training data. Unsupervised learning, on the other hand, involves training the algorithm on unlabeled data, where the desired outcome or label is not provided, and the algorithm must discover patterns and relationships on its own.

Resources (pick one):

Hands-On Machine Learning (textbook with foundational ML methods like regression and PCA)
Chapter 8 of the IPython Cookbook
Tech with Tim YouTube playlist

Neural networks

Neural networks are a type of machine learning model that are loosely inspired by the structure and function of the human brain, and consist of interconnected nodes or "neurons" that process and transmit information. This is the foundation for deep learning and state-of-the-art ML.

Resources (pick one):

Andrew Ng's Deep Learning Coursera Specialization (just Course 1)
Siraj's Deep Learning Syllabus (just Week 1)

Deep learning

Deep learning is neural networks on steroids.This allows the algorithms to learn and make increasingly complex representations of the data.

Resources (pick one):

Andrew Ng's Deep Learning Coursera Specialization (Courses 2-5)
Siraj's Deep Learning Syllabus (just Week 2-6)

Deployment

Deployment in data science refers to the process of making a machine learning model or data-driven solution available for use by others, often by integrating it into existing systems and processes.

Docker

Docker is a platform for building, shipping, and running applications in self-contained containers. For example, you can “dockerize” your program with a unique Python environment so anyone can run your application even if they don't have the same environment or operating system.

Resources (pick one):

https://www.docker.com/101-tutorial/
Very good YouTube tutorial (2 hours)

Kubernetes

Kubernetes is an open-source platform for automating the deployment, scaling, and management of containerized applications, allowing them to run and be maintained in a stable and efficient manner in production environments.

Resources (pick one):

Deployment strategies

How do you distribute your work?

Resources:

Hands-on tutorial from Ahmed Besbes
MLFlow Tutorials (very good)
Read about blue-green deployment, rolling deployment, and canary deployment.

Web Development

Data scientists often use web development skills to create dashboards, reports, and interactive visualizations to communicate their findings and insights to stakeholders. They also use web development to build and deploy machine learning models as web services, enabling others to access and use their models through a web interface. This is a deep rabbit hole, so start with the basics. It fits well under the Deployment section, but is so large I gave it its own section.

HTML, CSS, and JavaScript

Data scientists use HTML, CSS, and JavaScript to create interactive dashboards, reports, and visualizations to present and communicate their findings and insights to stakeholders.

Resources:

Read freeCodeCamp's blog post
HTML, CSS, and JavaScript video explanation (30 minutes)
Beginner friendly website build YouTube tutorial (90 minutes)

REST APIs

Data scientists use REST APIs to retrieve and manipulate data from remote servers and databases

Resources:

REST APIs in 11 minutes
APIs for beginners YouTube tutorial (2 hours)

MERN, MEAN or flask/django + postgress + ReactJS

Data scientists should learn a web development stack because it allows them to build and deploy interactive, web-based applications and dashboards to communicate their findings and insights effectively and efficiently to a broad range of stakeholders.

Resources (find what works for you):

Real-World Projects

Working on real-world projects is important in learning data science because it provides hands-on experience with solving real-world problems. It also helps build a portfolio of work that demonstrates one's abilities and skills to potential employers, which is especially important for those looking to enter the field.

Work on real-world projects that incorporate all the skills you've learned so far.

By reaching out to a non-profit organization and offering your skills as a data scientist, you have the opportunity to make a positive impact while also gaining valuable real-world experience. Many non-profits have plenty of data, but may not have the resources or expertise to extract meaningful insights from it. By volunteering your time and expertise, you can work with members of the organization to analyze the data and generate insights that can inform their decision-making and help further their mission. This is a great way to build your portfolio and develop your skills while making a difference in your community. Be careful, do not oversell yourself and become a burden on the organizations you're helping. It's important to be useful.

Contribute to open-source projects.

Contributing to open-source projects is a valuable way for data scientists to develop their skills, expand their network, and give back to the community by sharing their knowledge and expertise.

Participate in data science hackathons and competitions.

Participating in data science hackathons and competitions provides an opportunity for data scientists to challenge themselves, gain new experiences, and build their portfolio by working on real-world problems and presenting their solutions to a wider audience.

Resources:

Coursera projects
Find a dataset you like on Kaggle, GitHub, or your city and uncover the stories in the data.
Data Analysis for Hospitals (paid, but there's a free trial)
DS Projects GitHub Repo
Tech with Tim Machine Learning Projects for Beginners
12 Data Science Projects

Conclusion

As a learner, I firmly believe in the power of doing things and getting stuck. That's how you learn best. When you actively apply what you know, it sticks with you so much better. For example, I initially learned Java, but it wasn't until I started working on a visualization project that I realized how much easier it was to do in Python. So I switched gears and dove into Python, and now I can confidently say that I have a good grasp of it. The hands-on approach to learning has always worked well for me, and I highly recommend it to others too. Best of luck on your data science journey!

Full-Stack Data Science Roadmap 2023

Fundamentals

Math

Linear Algebra

Calculus

Probability and Statistics

Basic Python

The Basics

Interactive Computing

Data structures

Basic SQL and database management

Version control (Git)

Algorithms

Data Engineering

Data storage systems

Databases

Data warehousing

Data processing frameworks (Pandas, PySpark, Dask)

Pandas

Other

Data Analytics

Exploratory data analysis

Data visualization

Hypothesis testing

Machine Learning

Machine learning algorithms

Neural networks

Deep learning

Deployment

Docker

Kubernetes

Deployment strategies

Web Development

HTML, CSS, and JavaScript

REST APIs

MERN, MEAN or flask/django + postgress + ReactJS

Real-World Projects

Work on real-world projects that incorporate all the skills you've learned so far.

Contribute to open-source projects.

Participate in data science hackathons and competitions.

Conclusion