Go – concurrent file system scanning
I want to get the file information (file name and size, in bytes) of the files in a directory, but there are many subdirectories (~1000) and files (~40000).
My current solution uses filepath.Walk() to get the file information of each file, but it takes a long time:
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

func visit(path string, f os.FileInfo, err error) error {
	if f.Mode().IsRegular() {
		fmt.Printf("Visited: %s File name: %s Size: %d bytes\n",
			path, f.Name(), f.Size())
	}
	return nil
}

func main() {
	flag.Parse()
	root := "C:/Users/HERNOUX-06523/go/src/boilerpipe" // flag.Arg(0)
	filepath.Walk(root, visit)
}
Can I use filepath.Walk() for parallel / concurrent processing?
Solution
You can perform concurrent processing by modifying your visit() function: instead of entering subfolders, start a new goroutine for each subfolder.
To do this, return the special filepath.SkipDir error from visit() if the entry is a directory. Don't forget to check whether the path in visit() is the folder the goroutine is meant to handle, because it too is passed to visit(); without this check, you would endlessly launch goroutines for the initial folder.
You also need some kind of "counter" of how many goroutines are still working in the background, for which you can use sync.WaitGroup.
This is a simple implementation:
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

var wg sync.WaitGroup

func walkDir(dir string) {
	defer wg.Done()

	visit := func(path string, f os.FileInfo, err error) error {
		if f.IsDir() && path != dir {
			wg.Add(1)
			go walkDir(path)
			return filepath.SkipDir
		}
		if f.Mode().IsRegular() {
			fmt.Printf("Visited: %s File name: %s Size: %d bytes\n",
				path, f.Name(), f.Size())
		}
		return nil
	}

	filepath.Walk(dir, visit)
}

func main() {
	flag.Parse()
	root := "folder/to/walk" // flag.Arg(0)

	wg.Add(1)
	walkDir(root)
	wg.Wait()
}
Some notes:
Depending on how files are distributed among subfolders, this may not fully utilize your CPU / storage: if, for example, 99% of all files are in one subfolder, that one goroutine will still take the majority of the time.
Also note that the fmt.Printf() calls are serialized, which also slows down the process. I assume this is just an example and in reality you will do some kind of processing / statistics in memory. Don't forget to also protect concurrent access to any variables accessed from the visit() function.
Don't worry about the large number of subfolders; it is normal, and the Go runtime is capable of handling even hundreds of thousands of goroutines.
Also note that the performance bottleneck will most likely be your storage / hard disk speed, so you may not gain the performance you wish for. After a certain point (your hard disk's limit), you won't be able to improve performance further.
Starting a new goroutine for each subfolder may also not be optimal; you may get better performance by limiting the number of goroutines walking your folders. For that, check out and use a worker pool:
Is this an idiomatic worker thread pool in Go?