Go – concurrent file system scanning

I want to get the file information (name and size, in bytes) of the files in a directory, but there are many subdirectories (~1,000) and files (~40,000).

My current solution uses filepath.Walk() to get the file information of each file, but it takes a long time:

package main

import (
    "flag"
    "fmt"
    "os"
    "path/filepath"
)

func visit(path string, f os.FileInfo, err error) error {
    if f.Mode().IsRegular() {
        fmt.Printf("Visited: %s File name: %s Size: %d bytes\n", path, f.Name(), f.Size())
    }
    return nil
}

func main() {
    flag.Parse()
    root := "C:/Users/HERNOUX-06523/go/src/boilerpipe" //flag.Arg(0)
    filepath.Walk(root, visit)
}

Can I use filepath.Walk() for parallel / concurrent processing?

Solution

You can add concurrency by modifying the visit() function: instead of descending into subfolders, start a new goroutine for each subfolder.

To do this, return the special filepath.SkipDir error from visit() if the entry is a directory. Don't forget to check whether the path in visit() is the folder the goroutine is supposed to handle, because it is also passed to visit(); without this check, you would endlessly start new goroutines for the initial folder.

In addition, you need some kind of "counter" of how many goroutines are still working in the background; for that you can use sync.WaitGroup.

This is a simple implementation:

package main

import (
    "flag"
    "fmt"
    "os"
    "path/filepath"
    "sync"
)

var wg sync.WaitGroup

func walkDir(dir string) {
    defer wg.Done()

    visit := func(path string, f os.FileInfo, err error) error {
        if f.IsDir() && path != dir {
            wg.Add(1)
            go walkDir(path)
            return filepath.SkipDir
        }
        if f.Mode().IsRegular() {
            fmt.Printf("Visited: %s File name: %s Size: %d bytes\n", path, f.Name(), f.Size())
        }
        return nil
    }

    filepath.Walk(dir, visit)
}

func main() {
    flag.Parse()
    root := "folder/to/walk" //flag.Arg(0)

    wg.Add(1)
    walkDir(root)
    wg.Wait()
}

Some notes:

Depending on the "distribution" of files among subfolders, this may not fully utilize your CPU / storage: if, for example, 99% of all files are in one subfolder, the goroutine handling that folder will still take most of the total time.

Also note that the fmt.Printf() calls are serialized, which also slows down the process. I assume this is just an example and in reality you will do some in-memory processing / statistics. Don't forget to also protect concurrent access to any variables used from the visit() function.

Don't worry about the large number of subfolders; this is normal, and the Go runtime is capable of handling even hundreds of thousands of goroutines.

Also note that the performance bottleneck is likely to be your storage / hard disk speed, so you may not see the gains you expect. After a certain point (your disk's limit), you will not be able to improve performance further.

Starting a new goroutine for every subfolder may also not be optimal; you may get better performance by limiting the number of goroutines walking your folders. To do that, check out and use a worker pool:

Is this an idiomatic worker thread pool in Go?
