The naming in this post is weird because I get moderated every time I mention the T word; replace the b in borrents with a t.
Last night I was very intoxicated and rewriting parts of my scraper for GetStrike, as I noticed it was bottlenecking.
GetStrike has millions of borrents indexed to disk for users to download, which means that while scraping I sometimes need to check whether a file exists or not. Each folder contains about 2.8 million files on average; I should segment them more later.
The structure of the folders is pretty straightforward.
So we parse each folder to check if the file exists; doing anything like *.* would be slow to begin with.
Here was the original code causing the bottleneck:
Code:
for (int i = 0; i < groupLength; ++i) {
    // Each File.Exists call is a separate file system lookup
    if (!File.Exists(dest + i)) {
        Console.WriteLine("borrent didn't exist");
        Downloadborrent(borrent_link, hash);
        break;
    } else {
        break;
    }
}
Too slow in our case; the average response was about 450ms.
Using Exists() to check whether a file or directory name is in use is subject to race conditions. Not a lot of people realize this. After the scraper's Exists() check has finished, another instance of the scraper could have created the borrent with that hash before the original instance reaches the point where it downloads the borrent, for example. I found it faster to simply open the borrent, specifying the FileShare parameter.
Code:
using (var stream = File.Open(borrentPath, FileMode.CreateNew, FileAccess.Write, FileShare.None))
{
// Write borrent
}
So handling the IOException on failure may result in code that is less prone to race conditions:
- If another instance has already created the borrent, FileMode.CreateNew will cause an IOException to be thrown.
- If the original instance's open and create succeeds, then because of FileShare.None no other instance can access the file until it closes it.
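The two cases above can be sketched roughly like this. TryWriteNew and BorrentWriter are hypothetical names, and the sketch assumes the borrent bytes are already in memory; neither is taken from the actual scraper.

```csharp
using System;
using System.IO;

static class BorrentWriter
{
    // Hypothetical helper: returns true if this instance created the file,
    // false if the create failed with an IOException.
    public static bool TryWriteNew(string borrentPath, byte[] data)
    {
        try
        {
            // CreateNew throws IOException if the file already exists;
            // FileShare.None keeps other instances out until the stream is disposed.
            using (var stream = File.Open(borrentPath, FileMode.CreateNew, FileAccess.Write, FileShare.None))
            {
                stream.Write(data, 0, data.Length);
            }
            return true;
        }
        catch (IOException)
        {
            // Most likely another instance created it first. IOException also
            // covers other failures (e.g. a missing directory), so real code
            // may want to inspect the exception before treating it as a duplicate.
            return false;
        }
    }
}
```

The existence check and the create happen in one call here, so there is no gap for another instance to slip into between them.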
It seems it is not possible to check whether a file is currently in use without throwing an exception, unless you use P/Invoke:
Code:
using System;
using System.Runtime.InteropServices;

class Win32
{
    // internal so that code outside this class can reference the constants
    internal const uint FILE_READ_DATA = 0x0001;
    internal const uint FILE_SHARE_NONE = 0x00000000;
    internal const uint FILE_ATTRIBUTE_NORMAL = 0x00000080;
    internal const uint OPEN_EXISTING = 3;
    internal const int INVALID_HANDLE_VALUE = -1;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    internal static extern IntPtr CreateFile(string lpFileName,
        uint dwDesiredAccess,
        uint dwShareMode,
        IntPtr lpSecurityAttributes,
        uint dwCreationDisposition,
        uint dwFlagsAndAttributes,
        IntPtr hTemplateFile);

    [DllImport("kernel32.dll", SetLastError = true)]
    internal static extern bool CloseHandle(IntPtr hObject);
}
Then actually checking if it's in use:
Code:
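bool IsFileInUse(string fileName)
{
    // Try to open the existing file with no sharing allowed
    IntPtr hFile = Win32.CreateFile(fileName, Win32.FILE_READ_DATA, Win32.FILE_SHARE_NONE, IntPtr.Zero, Win32.OPEN_EXISTING, Win32.FILE_ATTRIBUTE_NORMAL, IntPtr.Zero);
    // Note: this also returns true when the open fails for other reasons,
    // e.g. the file does not exist
    if (hFile == new IntPtr(Win32.INVALID_HANDLE_VALUE))
        return true;
    Win32.CloseHandle(hFile);
    return false;
}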
This faster method is also prone to race conditions, unless you return the file handle from it, and pass that to the relevant constructor.
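A rough sketch of what returning the handle could look like, Windows only. Win32Open and TryOpenExclusive are hypothetical names, and the sketch swaps in GENERIC_READ and a SafeFileHandle return type instead of the FILE_READ_DATA / IntPtr declaration above, so treat it as one possible variant rather than the code the scraper uses.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class Win32Open
{
    const uint GENERIC_READ = 0x80000000;   // assumption: used here instead of FILE_READ_DATA
    const uint FILE_SHARE_NONE = 0x00000000;
    const uint OPEN_EXISTING = 3;
    const uint FILE_ATTRIBUTE_NORMAL = 0x00000080;

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern SafeFileHandle CreateFile(string lpFileName,
        uint dwDesiredAccess,
        uint dwShareMode,
        IntPtr lpSecurityAttributes,
        uint dwCreationDisposition,
        uint dwFlagsAndAttributes,
        IntPtr hTemplateFile);

    // Hypothetical helper: returns an open read-only stream, or null if the
    // file could not be opened exclusively (in use, missing, access denied).
    public static FileStream TryOpenExclusive(string fileName)
    {
        SafeFileHandle handle = CreateFile(fileName, GENERIC_READ, FILE_SHARE_NONE,
            IntPtr.Zero, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, IntPtr.Zero);
        if (handle.IsInvalid)
        {
            handle.Dispose();
            return null;
        }
        // The stream takes ownership of the handle, so the file stays locked
        // until the caller disposes the stream.
        return new FileStream(handle, FileAccess.Read);
    }
}
```

Because the caller keeps the same handle that passed the check, nothing can grab the file between the check and the use.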
Total check time for 17 million files was 32ms and some change.
Funny enough, if I don't check anything at all and just download the duplicate borrents and overwrite, it's faster than checking if the file exists. Nonetheless, the FileShare method works best for my needs.
tl;dr Exists() sucks