For a company to get serious about working with data, it can bring in a variety of roles to collect, organize, and analyze that data. Because the process is so broad, taking data from raw events all the way through to actionable insights, there can be considerable confusion about which responsibilities belong to each data role. Organizations with no experience with data may end up writing job listings that attach responsibilities to a title that run counter to general expectations. Job hunters can fall into the same trap from the other direction, seeking opportunities that they are not suited for or that do not match their interests.
This article describes general guidelines for differentiating between three major data roles that organizations hire for their data teams: data engineers, data analysts, and data scientists. While these aren’t the only “data” roles out there, these three are the most prominent, and the focus of most questions about distinguishing roles in data.
Traditionally, data engineers have been the number one hire for a company to start seriously working with data. It doesn’t really make sense to think about how to perform data analysis until you actually have data to analyze. A data engineer is responsible for figuring out how to gather data, organize it, and maintain it, so they are a vital role to have on a data team.
Data engineers frequently have to contend with messy or incomplete data and make decisions on how that data will be processed and maintained. The engineer needs to know how data applications are structured, test data pipelines, and monitor how data is being used. Done well, the data engineer’s work makes sure that data users are able to access what they need, and that their queries’ outputs are generated in a timely fashion. While a data engineer is unlikely to be performing analyses themselves, other data roles are dependent on the data engineer’s work in order to extract useful information from the data.
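As a rough illustration of the kind of decision described above, the sketch below separates usable records from messy or incomplete ones before they move further down a pipeline. The `Event` record, its fields, and the validity rules are all hypothetical; a real pipeline would define its own.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    # Hypothetical record shape; real events will have different fields.
    user_id: Optional[str]
    amount: Optional[float]


def split_valid_invalid(events: List[Event]) -> Tuple[List[Event], List[Event]]:
    """Separate records that pass basic checks from ones that need attention."""
    valid, rejected = [], []
    for event in events:
        has_user = bool(event.user_id)
        has_sane_amount = event.amount is not None and event.amount >= 0
        if has_user and has_sane_amount:
            valid.append(event)
        else:
            # Rejected rows are kept rather than dropped, so they can be inspected later.
            rejected.append(event)
    return valid, rejected


raw = [
    Event("u1", 9.99),   # complete record
    Event(None, 5.0),    # missing user
    Event("u2", None),   # missing amount
    Event("u3", -1.0),   # amount outside the assumed valid range
]
valid, rejected = split_valid_invalid(raw)
print(f"{len(valid)} valid, {len(rejected)} rejected")
```

Keeping the rejected records, instead of silently discarding them, is one possible design choice: it lets the engineer monitor how much data is being lost and why.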
In a smaller organization, however, having a data engineer might not be as high a priority. Over the past few years, data collection products like Stitch and Blendo have been developed to help manage the extraction of data from common data sources, in processes known as ETL (extract, transform, load) and ELT (extract, load, transform). Cloud-based data storage solutions like Amazon Redshift and Google BigQuery can flexibly manage the manipulation of large amounts of data. It is not inconceivable that a data analyst or scientist could address an organization’s data needs without a data engineer by setting up and leveraging these tools.
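To make the difference between ETL and ELT concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a data warehouse. The table names and sample rows are made up for illustration, and this is not how Stitch, Blendo, Redshift, or BigQuery work internally; it only shows where the transformation step happens in each approach.

```python
import sqlite3

# Hypothetical raw extract: inconsistent name formatting, amounts stored as text.
raw_rows = [
    ("2024-01-01", "  Alice ", "12.50"),
    ("2024-01-02", "BOB", "7.25"),
]

conn = sqlite3.connect(":memory:")

# ETL: transform in application code first, then load the cleaned rows.
conn.execute("CREATE TABLE orders_etl (day TEXT, customer TEXT, amount REAL)")
cleaned = [(day, name.strip().lower(), float(amount)) for day, name, amount in raw_rows]
conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)

# ELT: load the raw rows untouched, then transform inside the database with SQL.
conn.execute("CREATE TABLE orders_raw (day TEXT, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", raw_rows)
conn.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT day,
           lower(trim(customer)) AS customer,
           CAST(amount AS REAL) AS amount
    FROM orders_raw
    """
)

etl = conn.execute("SELECT * FROM orders_etl ORDER BY day").fetchall()
elt = conn.execute("SELECT * FROM orders_elt ORDER BY day").fetchall()
print(etl == elt)  # both routes should end with the same cleaned table
```

In the ELT route the raw table stays available in the database, so the transformation can be revised and rerun later without extracting the data again.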
That said, a data engineer is still an important role to have on a data team. Data pipeline applications and solutions take over the most common or tedious essential tasks that a data engineer would previously have had to handle. This frees up their time for more intricate tasks that better leverage their expertise. A modern data engineer will have less to do with the creation of data pipelines, and more to do with the maintenance and optimization of those pipelines, along with the creation of custom transformations or data gathering routines that ETL and ELT tools cannot handle. A blog post from Fishtown Analytics lays out a more detailed breakdown of the tasks a modern data engineer should perform, along with suggestions on when data engineers should be brought onto a team.
As a final note, the role of data engineer has a fair overlap with that of a data architect, often the fourth “data” role added to the three focused on in this article. A data architect shares a lot of the same knowledge as the data engineer in knowing how data can be extracted from data sources, how data should be transformed into useful forms, and how cleaned data can be stored. However, one general distinction that is made between the two roles is that a data architect has responsibility for planning the architecture or framework in which the data will be processed and stored. The architect dictates the ways in which data should be collected, stored, and made available to users at a high level, while anticipating and adapting to the changing needs of those users. The data engineer, then, will be responsible for implementation and maintenance of the data pipeline following the architect’s plan. The duties of the data architect may sometimes fall to a senior data engineer, or be a step in a data engineer’s career path.
One way to think about the distinction between data roles is whether they act before or after the data is collected. Data engineers and data architects are responsible for operations before the data is collected, while data analysts and data scientists are responsible for operations after the data is collected. Just as there can be some confusion between the roles of data engineer and data architect, there is also confusion between the roles of data analyst and data scientist.
For a company that is just starting out, the most likely case is that they will want a data analyst. (Discussion and contrast to the data scientist will follow in the next section.) While there is a broad range of responsibilities that a data analyst might have depending on the company, a good rule of thumb is to think of data analysts as explorers. A top data analyst will have the curiosity and skills to investigate the data from multiple angles, performing cleaning and transformation operations to look for trends in the data. They may find new paths for the company to explore, possibly identifying areas where more data could be collected for deeper analysis.
Data analysts are well served not just by the ability to mine through data, but also by the ability to report their findings to others. An analyst should be able to create visualizations or use tools to create dashboards that convey to others what they have found. Visualizations and dashboards are not only for members of a data team to understand the data; they are also for demonstrating findings to people outside of the team. A good data analyst or data scientist should know how to polish their exploratory visualization work into explanatory visualizations that effectively communicate findings.
One data role that data analysts may cross over into is that of a business analyst. When a data analyst performs their explorations and creates their reports, they may not necessarily be required to interpret their findings in terms of company actions. A business analyst, on the other hand, will be primarily focused on using data to answer business questions and suggest future actions to take. In a way, a business analyst might be considered a data analyst who acts in a specialized domain. Although the data analyst might make use of domain knowledge and business ideas to guide their exploration, they will be more concerned with identifying trends and patterns in the data than with collaborating with others to enact strategies based on those findings.
One rule of thumb that is often put forth considers data analysts and data scientists to be in the same general domain – gathering insights from data – but that the data analyst is basically a junior role to the data scientist. This isn’t exactly wrong, but there’s definitely more nuance to the two roles than just that.
One distinction promoted by Google’s Cassie Kozyrkov is that a data analyst’s work is all about stating what the data tells you: reporting facts rather than uncertainties. Coming up with a conclusion should not be the job of an analyst’s report; that is the job of statistics and a data scientist. Data analysts tend to develop performance metrics, report what is there, and convey those observations to others, while data scientists are geared towards making sure those observations actually carry statistical significance.
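The difference between reporting an observation and checking its statistical significance can be sketched in a few lines. In the example below, the "analyst" step reports the observed gap between two groups, and the "scientist" step runs a permutation test to estimate how often a gap that large would appear by chance. The checkout times are invented for illustration, and a permutation test is only one of several techniques that could be used here.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Estimate how often a mean gap at least as large as the observed one
    arises when group labels are shuffled at random (two-sided test)."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_gap = abs(mean(pooled[: len(a)]) - mean(pooled[len(a):]))
        if shuffled_gap >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical checkout times in seconds for two page designs.
old_design = [12, 15, 11, 14, 13, 16, 12, 15]
new_design = [14, 17, 13, 18, 15, 19, 14, 16]

# Analyst's report: what the data shows.
print("observed difference in means:", mean(new_design) - mean(old_design))

# Scientist's follow-up: could a gap this size plausibly be noise?
print("permutation p-value:", permutation_p_value(old_design, new_design))
```

A small p-value suggests the gap is unlikely to come from random variation alone; it says nothing by itself about why the gap exists, which is where experiment design comes in.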
A data scientist should be able to sift through data in the same way as an analyst, but also be able to apply statistical techniques to differentiate between signal and noise. Lead data scientists especially need the ability to decide which observations from a data analyst are worth following up on. They should understand which questions are worth investigating and how to answer those questions by gathering further data and running experiments. Knowing how to create balanced experiment designs and how to anticipate common design issues is an important skill for moving beyond correlational observations to an understanding of causal effects. Because the data scientist specializes in making deeper dives, it makes more sense to start by hiring data analysts to explore the data that already exists, then bring in a data scientist later to create focus around the most promising points.
Another factor that differentiates data scientists from analysts is in their ability to apply machine learning to data. Machine learning can be used in combination with other statistical techniques to move beyond descriptive analytics and into the realm of predictive analytics, making predictions about future events or outcomes. However, machine learning often requires a lot of data in order to make useful predictions. One of the main skills of a data scientist applying machine learning is to anticipate which algorithms have the highest chance of being useful in each project they tackle.
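As a small sketch of the step from descriptive to predictive analytics, the example below fits a least-squares line to part of a dataset and then checks its predictions against points it has not seen. A straight line stands in here for a machine learning algorithm, and the ad spend and signup figures are invented; real projects would typically use far more data and a dedicated library.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical weekly ad spend (in thousands) and resulting signups.
ad_spend = [1, 2, 3, 4, 5, 6, 7, 8]
signups = [12, 19, 31, 42, 48, 61, 72, 79]

# Descriptive: summarize what has already happened.
print("average signups per week:", mean(signups))

# Predictive: fit on the earlier weeks, then test on the held-out later weeks.
train_x, test_x = ad_spend[:6], ad_spend[6:]
train_y, test_y = signups[:6], signups[6:]
slope, intercept = fit_line(train_x, train_y)

predictions = [slope * x + intercept for x in test_x]
errors = [abs(p - y) for p, y in zip(predictions, test_y)]
print("mean absolute error on held-out weeks:", mean(errors))
```

Evaluating on held-out data is what separates a prediction from a description: the model is judged on points that played no part in fitting it.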
Other Data Analysis Roles
For a small company, a data scientist might need to be an expert in all of the aspects of the role outlined above. But not all data scientists need to be responsible for all of these points; as a company gets larger, there may be distinctions made between senior and junior data scientist roles at a company. In addition, explicitly specialist roles of statistician and machine learning engineer can also stand as part of a data team alongside a data scientist.
In this article, you have learned about three major roles that can be present on a data team: the data engineer, data analyst, and data scientist. A data engineer handles implementation of the infrastructure to gather data, transform it into useful forms, and make sure it is available to other data roles for analysis. Data analysts handle descriptive analytics: exploring the data, looking for patterns and trends, and reporting connections that have been observed. Data scientists build up from there into the realm of predictions, using statistics and machine learning to distinguish signal from noise and running experiments to untangle correlation from causation. Other related data roles like data architect or statistician can also take on specialized responsibilities.
Typically, the first dedicated data team hire will be an engineer or analyst. Which is more important depends on the experience and skills of those heading up the hiring process. The data scientist and other specialist roles can come after the data movement has gotten some momentum. Through all of this, it is also good to make sure that someone on the team can act as a leader and create direction for the data team in relation to the company’s goals.
As a final reminder, it is worth stressing the variation that can be found in job listings for data roles. Even if different job listings share the same job title, they may have different requirements, depending on the company’s size and its maturity around working with data. In addition, there can be some blurring of lines in responsibilities based on how a company has defined its data role structure. The guidelines above are not comprehensive, and some variation should be expected. When hiring for a data role, it is important for a company to look at candidates’ specific skills in order to know what they are capable of. Considering the rise in opportunities for people to learn new things and to share what they have done, it is important to look not only at academic and work experience, but also at any portfolio of projects when making an assessment.