Two types of data scheduling platform systems and their implementation methods and processes

What is a scheduling system

A scheduling system, or more precisely a job scheduling system (Job Scheduler) or workflow scheduling system (Workflow Scheduler), is an indispensable part of any big data development platform that has grown beyond toy scale.

Besides Crontab, Quartz, and other single-machine timers/libraries, there are many open-source distributed job scheduling systems, such as Oozie, Azkaban, Chronos, and Zeus. There are also Alibaba's TBSchedule and SchedulerX, Tencent's Lhotse, and our company's TASKCTL, which has been honed for ten years.

Two types of scheduling systems

The scheduling systems currently on the market can be divided by function into two types: timing job scheduling systems and DAG workflow job scheduling systems. These two types usually differ greatly in architecture and feature implementation. Let us look at the differences between them.

Timing job scheduling systems

Timing systems focus on scenarios where a large number of concurrent tasks are executed in shards.

In real application scenarios, the business logic that needs to run periodically is usually discrete, unordered maintenance work, with only simple associations between the pieces at most.

For example:

  • Periodically cleaning up disk space on a batch of machines

  • Periodically generating a batch of commodity lists

  • Periodically building indexes over a batch of data

  • Periodically sending push notifications to a batch of users, and so on
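
Jobs like these are usually expressed as independent timer entries. The following minimal sketch (the cron expressions and job names are invented for illustration) shows how the four examples above might look as crontab-style entries:

```python
# Illustrative only: cron expressions and job names are made up.
# Format: "minute hour day month weekday" -> job
CRON_TABLE = [
    ("0 3 * * *",    "clean_disk_space"),          # daily at 03:00
    ("0 1 * * *",    "build_product_list"),        # daily at 01:00
    ("*/30 * * * *", "index_new_data"),            # every 30 minutes
    ("0 9 * * 1",    "push_user_notifications"),   # Mondays at 09:00
]

for expr, job in CRON_TABLE:
    print(f"{expr:>14}  ->  {job}")
```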

The core objectives are basically two points:

1. Job sharding support: split a large task into many small tasks and assign them to different servers for execution. The difficulty lies in guaranteeing that no shard is missed and none is processed twice, keeping the load balanced, and automatically migrating tasks when a node crashes.

2. Highly available, precise trigger timing: because the timeliness and accuracy of real business processes are usually at stake, the system must normally guarantee strong real-time behavior and reliability of task triggering.

Therefore, "load balancing, elastic expansion", "state synchronization" and "failover" are usually the key features considered in the architecture design of this type of scheduling system.

DAG Workflow Job Scheduling System

This type of system is mainly positioned at correctly handling the scheduling dependencies between ordered jobs; sharded execution logic is usually not a granularity the system concerns itself with. If certain jobs really need sharding, it is usually delegated to the back-end cluster (for example, MR tasks come with their own sharding capability) or to a specific type of task-execution back-end.

A DAG workflow scheduling system usually serves scenarios with a large number of jobs and complex dependency flows between them.

For example, in the offline data-warehouse report processing of a big data development platform, a complete business pipeline, from data collection and cleaning, through summary computation of reports at various levels, to the final export of data to external business systems, may involve hundreds or even thousands of cross-dependent jobs.

Therefore, the focus of a DAG workflow scheduling system usually includes the following (a minimal scheduling sketch follows the list):

  • A sufficient and flexible dependency trigger mechanism (e.g., time-triggered tasks, dependency-triggered tasks, mixed triggers)

  • Management and synchronization of job plans, changes, and execution flows

  • Task priority management, business isolation, permission management, etc.

  • Handling of various special flows (e.g., pausing tasks, backfilling historical data, manually marking failure/success, coordinating ad-hoc and periodic tasks)

  • A complete monitoring and alerting mechanism
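
To make dependency triggering concrete, here is a minimal sketch modeled on the report pipeline described above (the job names and the run() stub are hypothetical): a job becomes runnable only after every one of its upstream jobs has succeeded:

```python
# Minimal sketch of dependency-triggered execution over a DAG.
# Job names and the run() stub are hypothetical.
from collections import deque

# downstream job -> list of upstream jobs it depends on
deps = {
    "collect": [],
    "clean":   ["collect"],
    "daily":   ["clean"],
    "weekly":  ["clean"],
    "export":  ["daily", "weekly"],
}

def run(job: str) -> bool:
    print(f"running {job}")
    return True  # a real scheduler would launch the job and poll its status

def schedule(deps):
    """Kahn-style topological execution: a job becomes runnable
    only when every upstream job has finished successfully."""
    remaining = {j: set(u) for j, u in deps.items()}
    ready = deque(j for j, u in remaining.items() if not u)
    done = set()
    while ready:
        job = ready.popleft()
        if not run(job):
            raise RuntimeError(f"{job} failed; downstream jobs are blocked")
        done.add(job)
        for j, upstream in remaining.items():
            if job in upstream:
                upstream.discard(job)
                if not upstream and j not in done and j not in ready:
                    ready.append(j)

schedule(deps)
```

A real system layers the other items, priorities, permissions, backfills, monitoring, on top of this same dependency-resolution core.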

Summary: the positioning goals of these two types of systems are not absolutely in conflict. Judging from how timing scheduling systems are currently developing, they too need to handle some strong dependencies between complex jobs, which is why concepts such as "micro-batch" (small-scale DAG batch job processing) have been put forward. It is just that, from an implementation standpoint, supporting both kinds of requirements satisfactorily at the same time is very difficult: because the focuses differ, each architecture makes trade-offs in some respects, and at present neither type of system achieves a perfect balance between the two.

Why do you need a scheduling system

We all know that big data computation, analysis, and processing generally consist of multiple task units (Hive, SparkSQL, Spark, Shell, and so on), each of which completes a specific piece of data processing logic.

There are often strong dependencies between task units: a downstream task may execute only after its upstream task has finished successfully. For example, if a downstream task needs result A from an upstream task in order to produce result B, then the downstream task must not start until the upstream task has run successfully and produced result A.

To guarantee correct results, these tasks must be executed in order, and efficiently, according to their upstream-downstream dependencies. A fairly basic approach is to estimate the processing time of each task, compute each task's start and end times from the sequence, and keep the whole pipeline running stably by firing each task on a fixed timer.
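
A small sketch of this naive approach (the estimated durations and task names are made up) shows how the start times are derived, and also where the scheme breaks down:

```python
# Sketch of the naive fixed-timer approach: estimate each task's
# duration, then derive start times along the dependency chain.
# Durations and task names are illustrative.
from datetime import datetime, timedelta

pipeline = [          # (task, estimated duration in minutes)
    ("collect", 30),
    ("clean",   20),
    ("report",  40),
    ("export",  10),
]

start = datetime(2020, 1, 1, 1, 0)   # first task fires at 01:00
for task, minutes in pipeline:
    print(f"{start:%H:%M}  start {task}")
    start += timedelta(minutes=minutes)

# The weakness described above: if "clean" overruns its 20-minute
# estimate, "report" still fires at its precomputed time and reads
# incomplete data; nothing in this scheme can react to the overrun.
```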

A complete data analysis task is executed at least once. In low-frequency data processing, where data volumes are small and dependencies are relatively simple, this scheduling method can fully meet the need.

However, in enterprise-level scenarios far more tasks must be executed every day. With a large number of tasks, working out each task's start time by hand consumes a great deal of time; worse, if an upstream task overruns its estimate or fails outright, this approach cannot cope at all, and human and machine effort is wasted again and again. For the enterprise data development process, therefore, a complete and efficient workflow scheduling system plays a vital role.

A final word

TASKCTL is currently the only scheduling product to propose the complete concept of "unordered timing plus ordered DAG job flow": timed jobs can be put under "micro-batch" control, and a DAG job flow can be put under "timer" control.

For example:

  • In big-data distributed (sharded) computing, running real-time ETL batch processing over the data

  • Within an ETL batch run, cycling a particular job or branch within a time window
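
The second scenario can be pictured with a short sketch (purely illustrative; this is not TASKCTL's actual API): one branch of the job flow is re-run on a fixed interval until its time window closes, while the rest of the DAG proceeds as usual:

```python
# Illustrative sketch of cycling one branch of a job flow within a
# time window (not TASKCTL's API; names are hypothetical).
import time
from datetime import datetime, timedelta

def run_branch():
    print("micro-batch ETL branch ran at", datetime.now().strftime("%H:%M:%S"))

def loop_in_window(minutes: float, interval_seconds: float):
    """Re-run the branch every interval until the window closes."""
    deadline = datetime.now() + timedelta(minutes=minutes)
    while datetime.now() < deadline:
        run_branch()
        time.sleep(interval_seconds)

loop_in_window(minutes=0.1, interval_seconds=2)  # 6-second demo window
```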

As demand for big data applications keeps expanding, data processing grows ever more complex and its real-time requirements ever higher. TASKCTL, a professional scheduling product independently developed in China, positions enterprises in advance for the big data 2.0 era.

To understand the product features, please refer to:


The pandemic swept the world in 2020 and dealt a serious blow to the entire market economy: many small and medium-sized enterprises saw their business chains blocked, large enterprises faced funding constraints, and shift arrangements greatly increased the workload of company operations and maintenance staff. The leadership of Tasker Information Technology therefore decided, as a matter of social responsibility and to give back to society, to help enterprises affected by the epidemic reduce O&M expenditures, improve work efficiency, and keep back-end data secure: during the epidemic, TASKCTL, an ETL batch job scheduling tool applicable to these work scenarios, with a total value of about 100,000, may be used free of charge.


Contact us if you have any questions

Consultation hotline: 028-68731039 (working days 9:30-18:00)

Online consultation: WeChat "Kitleer" (24 hours)

Community: 75273038 (QQ)

Email cooperation: service@taskctl.com