Task – TPL Dataflow vs. a plain semaphore
I need to build a scalable process. It consists mainly of I/O operations, with some minor CPU work (mostly deserializing strings). The process queries a database for a list of URLs, fetches the data from those URLs, deserializes the downloaded data into objects, and then persists some of that data to CRM Dynamics and to another database. After that, I need to update the first database to mark which URLs were processed. Part of the requirement is to make the degree of parallelism configurable.
Initially, I wanted to implement it as a sequence of tasks with await, using a semaphore to limit the parallelism, which is quite simple. Then I read some posts and answers from @Stephen Cleary that recommend TPL Dataflow, and I think it may be a good candidate. However, if I "complicate" the code by using Dataflow, I want to be sure it is worth it. It was also suggested that I use a ForEachAsync extension method, which is simple to use as well, but I'm not sure whether it will cause memory overhead because of the way it partitions the collection.
Is TPL Dataflow a good fit for this scenario? How is it better than a semaphore or the ForEachAsync method? What would I actually gain by implementing this with TPL Dataflow rather than with either of the other options (semaphore / ForEachAsync)?
Solution
This is almost entirely I/O. Unless those strings are very large, the deserialization will not be worth parallelizing. The kind of CPU work you are doing will be lost in the noise.
Therefore, you need to focus on asynchronous concurrency rather than CPU parallelism.
As you have found, SemaphoreSlim is the standard pattern for this. TPL Dataflow can also do concurrency (in both asynchronous and parallel forms).
ForEachAsync can take many forms. Note that the linked post contains five different implementations of this method, each of them valid: "[T]here are many different semantics possible for iteration, and each will result in different design choices and implementations." For your purposes (you do not want CPU parallelization), you should not consider the versions that use Task.Run or partitioning. In the world of asynchronous concurrency, any ForEachAsync implementation is just syntactic sugar that hides which semantics it implements, which is why I tend to avoid it.
This leaves you with SemaphoreSlim vs. ActionBlock. I generally suggest that people start with SemaphoreSlim and consider moving to TPL Dataflow if their requirements become more complex (in a way that would benefit from a dataflow pipeline).
For example, "part of the requirement is to make parallelism configurable."
You can start by allowing a single degree of concurrency, where the thing being throttled is one whole operation (fetching data from the URL, deserializing the downloaded data into objects, persisting to CRM Dynamics and the other database, and updating the first database). This is where SemaphoreSlim would be a perfect solution.
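A minimal sketch of that single-knob approach is shown here. The names `Throttled`, `ProcessAllAsync`, `maxConcurrency`, and `processOne` are hypothetical and not from the question; `processOne` stands in for the whole per-URL operation (fetch, deserialize, persist, update).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class Throttled
{
    // Runs processOne for every item, with at most maxConcurrency
    // operations in flight at any one time.
    public static async Task ProcessAllAsync<T>(
        IEnumerable<T> items, int maxConcurrency, Func<T, Task> processOne)
    {
        using var throttle = new SemaphoreSlim(maxConcurrency);

        // ToList() forces the lazy Select to start every task now;
        // each task waits at the semaphore before doing real work.
        var tasks = items.Select(async item =>
        {
            await throttle.WaitAsync();
            try
            {
                await processOne(item);
            }
            finally
            {
                // Always release, even if processOne throws.
                throttle.Release();
            }
        }).ToList();

        await Task.WhenAll(tasks);
    }
}
```

A call might look like `await Throttled.ProcessAllAsync(urls, configuredDegree, ProcessUrlAsync);`, where `configuredDegree` comes from configuration and `ProcessUrlAsync` is your own method. Note that this creates one task per URL up front, so it assumes the URL list fits comfortably in memory.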
However, you may decide that you want multiple knobs: say, one degree of concurrency for how many URLs you are downloading, a separate degree of concurrency for persisting, and another for updating the original database. Then you would also need to limit the "queues" between these points (only so many deserialized objects in memory, and so on) to ensure that fast URLs combined with a slow database do not cause your application to use too much memory. If these are useful semantics, you have started approaching the problem from a dataflow perspective, and that is the point at which you may be better served by a library such as TPL Dataflow.
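As a rough sketch of that multi-knob version, assuming three stages with one shared queue limit: the names `PipelineOptions`, `UrlPipeline`, and the three delegates are hypothetical, and the default values are placeholders rather than recommendations. Depending on the target framework, the System.Threading.Tasks.Dataflow package may need to be referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

// Hypothetical settings type: one knob per stage plus a queue limit.
public sealed class PipelineOptions
{
    public int DownloadConcurrency { get; set; } = 8;
    public int PersistConcurrency { get; set; } = 2;
    public int UpdateConcurrency { get; set; } = 1;
    public int QueueLimit { get; set; } = 50;
}

public static class UrlPipeline
{
    public static async Task RunAsync<TItem>(
        IEnumerable<string> urls,
        PipelineOptions options,
        Func<string, Task<TItem>> downloadAndDeserialize,
        Func<TItem, Task<TItem>> persist,
        Func<TItem, Task> markProcessed)
    {
        // BoundedCapacity caps how many items a block holds at once
        // (queued or in progress), which is what bounds memory use.
        ExecutionDataflowBlockOptions Stage(int concurrency) =>
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = concurrency,
                BoundedCapacity = options.QueueLimit
            };

        var download = new TransformBlock<string, TItem>(
            downloadAndDeserialize, Stage(options.DownloadConcurrency));
        var save = new TransformBlock<TItem, TItem>(
            persist, Stage(options.PersistConcurrency));
        var update = new ActionBlock<TItem>(
            markProcessed, Stage(options.UpdateConcurrency));

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(save, link);
        save.LinkTo(update, link);

        // SendAsync waits while the first block is full, so a slow
        // database throttles the downloads instead of piling up objects.
        foreach (var url in urls)
            await download.SendAsync(url);

        download.Complete();
        await update.Completion;
    }
}
```

Error handling is deliberately left out of this sketch. In particular, a fault in a later stage is not propagated back to earlier ones, so a real implementation would need to decide how failures should stop or skip work.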