Why the Pareto Principle Should Guide Your Data Science Career
There is no denying that data science is one of the most exciting fields today, and it constantly calls for professionals skilled and experienced in its many dimensions. The conventional data scientist typically comes from computer science, engineering, mathematics, or physics, all fields with a strong grounding in mathematical principles. It is, however, becoming increasingly common to see new entrants to data science arriving from genomics, linguistics, and other hitherto unconventional backgrounds.
Typically, the first thing that someone looking to begin a data science career asks is:
What knowledge and skills are needed?
The most commonly discussed skills needed in the context of the average data science project include the following:
- Big Data platforms (Spark or Hadoop)
- Cloud computing
- Data visualization
- Data wrangling and preprocessing
- Deep Learning
- Mathematics (Linear Algebra and Calculus)
- Programming (Python, R, Julia, Scala, etc.)
- SQL
- Statistics
- Supervised Learning
- Unsupervised Learning
- Communication skills
It appears logical, then, for aspiring data scientists to conclude, on perusing this list, that they are far from mastering most of these skills, and that numerous hours of study lie ahead before attaining such mastery and becoming an 🔗experienced professional in a data science career.
There is, however, a major upside. The list above includes a handful of skills that are intrinsic, at a basic level, to any data science project, and it is these basics that can serve as the foundation from which to start a career in data science.
The issue is one of prioritization. A great guiding principle, not originally conceived for this purpose yet very much applicable, is the Pareto Principle, named after Vilfredo Pareto, a 19th-century Italian economist, engineer, philosopher, political scientist, and sociologist. Essentially, this principle suggests that roughly 80% of the effects of many events come from just 20% of the causes.
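As a toy numeric illustration of the principle (the figures below are invented purely for the example), sorting causes by impact and taking the cumulative share shows how a small head of the distribution can dominate:

```python
# Toy illustration of the 80/20 pattern; the impact figures are invented.
impacts = [500, 300, 60, 40, 30, 25, 20, 15, 5, 5]  # effect per cause, sorted descending
total = sum(impacts)

# Share of the total effect explained by the top 20% of causes (2 of 10).
top_20_share = sum(impacts[:2]) / total
print(f"Top 20% of causes account for {top_20_share:.0%} of the effect")  # -> 80%
```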
When applied in the context of data science, the principle fits well: a minimum of three-quarters of the time spent on a project goes towards collecting, wrangling, and preprocessing data, i.e., organizing the dataset itself consumes most of the effort. Rare is the data scientist who gets to work on well-cataloged data and can therefore spend the bigger proportion of time on data visualization and modeling.
Compare two work scenarios:
- Expectation: SQL combines tables from different databases, or APIs fetch data from distinct sources; Pandas or R then help to organize and explore the data. Missing values and outliers are handled to produce a dataset ready for visualization and model training (a minimal wrangling sketch follows this list).
- Reality: The process takes much longer than expected, because data collection and wrangling are repeated iteratively before the final dataset is ready.
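To make the wrangling step concrete, here is a minimal Pandas sketch of the kind of cleanup described above; the file name, column names, and thresholds are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical raw dataset; file and column names are placeholders.
df = pd.read_csv("sales_raw.csv")

# Handle missing values: drop rows missing the target, impute a numeric feature.
df = df.dropna(subset=["revenue"])
df["units_sold"] = df["units_sold"].fillna(df["units_sold"].median())

# Remove outliers with the common 1.5 * IQR rule on the target column.
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["revenue"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# The cleaned frame is now ready for visualization and model training.
df.to_csv("sales_clean.csv", index=False)
```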
How can the Pareto Principle be applied to the journey of learning in data science?
A great way to start is to focus on mastering data extraction, data wrangling, and programming. Competence in Pandas, Python, and SQL should cover at least 80% of the day-to-day responsibilities in a 🔗data science project. Python, in particular, matters because it is the most popular language in data science and has a relatively gentle learning curve, making it a great first choice for a new programmer.
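As an illustration of how these three skills combine, the sketch below pulls joined, filtered rows from a local SQLite database into Pandas for exploration; the database file, table, and column names are assumptions made for the example.

```python
import sqlite3
import pandas as pd

# Hypothetical SQLite database; file, table, and column names are assumptions.
conn = sqlite3.connect("orders.db")

# SQL does the heavy lifting of joining and filtering at the source...
query = """
    SELECT c.region, o.order_date, o.amount
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.amount > 0
"""

# ...and Pandas takes over for exploration and preprocessing.
df = pd.read_sql(query, conn, parse_dates=["order_date"])
conn.close()

print(df.groupby("region")["amount"].describe())
```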
What would be the next step?
As candidates become stronger in data collection and preprocessing, they can approach model selection and visualization with much more confidence. Once they start defining hypotheses, they can research machine learning (ML) methods and algorithms.
A similar line of thought is echoed by the data science hierarchy of needs, which orders the needs as follows: collection → moving/storage → exploration and transformation → aggregation and labeling → learning and optimization. The implication is clear: without skills in data processing and wrangling, it is difficult to make progress on a data science project.
A great way to pick up the essential skills is to opt for one of the 🔗best data science certifications. A certification attests to competence with the most current skills and technologies in the field and signals the candidate's desire to grow in the role and/or move up in the organization.
Given the long path to becoming an experienced data scientist, optimized learning is the best bet for landing good job opportunities fast!