Introduction
LinkedIn isn't just any social site. Many professionals use it to build networks around their vocation, with potential collaborators or business partners in sight. Nor has the scientific community shied away from it entirely, as it generally does with social networking or Web 2.0 platforms.
Since the LinkedIn data breach three days ago, users have been keen to SHA1-hash their passwords and look them up in the released hash file, which is around 300MB.
This motivated me to benchmark a few of the most commonly used methods for hash lookup, with the intent of providing a REST service that lets users look up their hash by supplying only a subset of it, i.e. character positions and the characters found there (e.g. //Service/10/8d852/30/8c1cd ), rather than the entire key.
The tools used are MySQL (an InnoDB table with the hash field set as INDEX alongside a running integer Id field), grep on Ubuntu Linux 10x, grep on cygwin (Windows 7), grep on andLinux (Windows 7), Windows' findstr, and Lister, the general-purpose file viewer of Total Commander 8.x (Windows 7). Memory consumption is mentioned where it is significant.
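The partial-key idea can be sketched roughly as follows. This is only an illustration, not the service itself: the function names parseFragments and matchesFragments, the 0-based character positions, and the sample hashes are all assumptions made for this example.

```javascript
// Hypothetical sketch: parse a path such as "//Service/10/8d852/30/8c1cd"
// into (position, characters) pairs. Positions are ASSUMED to be 0-based.
function parseFragments(path) {
  const parts = path.split('/').filter(Boolean).slice(1); // drop "Service"
  if (parts.length === 0 || parts.length % 2 !== 0) {
    throw new Error('expected position/characters pairs');
  }
  const fragments = [];
  for (let i = 0; i < parts.length; i += 2) {
    const position = Number.parseInt(parts[i], 10);
    const chars = parts[i + 1].toLowerCase();
    if (!Number.isInteger(position) || position < 0 || !/^[0-9a-f]+$/.test(chars)) {
      throw new Error('invalid fragment: ' + parts[i] + '/' + parts[i + 1]);
    }
    fragments.push({ position, chars });
  }
  return fragments;
}

// True if the hex hash carries every fragment at its stated position.
function matchesFragments(hash, fragments) {
  return fragments.every(({ position, chars }) => hash.startsWith(chars, position));
}

// Made-up 40-character hashes for illustration (not taken from the leaked file).
const sampleHashes = [
  '0123456789' + '8d852' + 'aaaaabbbbbccccc' + '8c1cd' + 'fffff',
  'f'.repeat(40),
];

const fragments = parseFragments('//Service/10/8d852/30/8c1cd');
const candidates = sampleHashes.filter((h) => matchesFragments(h, fragments));
console.log(candidates); // only the first sample hash matches
```

The point of the design is that the full hash never has to leave the user's machine; the lookup returns every stored hash that is consistent with the revealed fragments.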
Benchmarks
Three hashes were picked, at file positions of 33%, 66% and 99%.
Total Commander's Lister seems to perform a simple linear string search, comparing character by character, rather than using a search optimization such as suffix trees. This would explain its very slow speed. Memory consumption stays flat at <1MB, with the Lister viewer running as a thread inside the TOTALCMD.EXE process.
TotalCommander's Lister Viewer:
GUI Search (CTRL+F and pressing ENTER)
21sec, 41sec, 60sec
findstr:
findstr on Windows is a poor man's grep: it supports regular expressions, but only a limited subset of what grep offers. The /b flag matches the pattern only at the beginning of a line, which speeds up the search.
h:\Databases>findstr /b "c7268a410a9e3b8068...546f6e020a6e" linkedin_merge.combo_not
c7268a410a9e3b8068...546f6e020a6e
.6s
h:\Databases>findstr /b "00000a0dab6d941319...c89818aba59" linkedin_merge.combo_not
00000a0dab6d941319...c89818aba59
.4s
h:\Databases>findstr /b "000009a460f37...4ad88c1cd4d3b7" linkedin_merge.combo_not
000009a460f37...4ad88c1cd4d3b7
.3s
Memory consumption reaches 250MB.
Cygwin Grep:
c:\cygwin\bin>grep.exe "^c7268a410a9e3b8068...546f6e020a6e" H:\Databases\linkedin_merge.combo_not
2s
c:\cygwin\bin>grep.exe "^c7268a410a9e3b8068...546f6e020a6e" H:\Databases\linkedin_merge.combo_not
2s
c:\cygwin\bin>grep.exe "^c7268a410a9e3b8068...546f6e020a6e" H:\Databases\linkedin_merge.combo_not
3.5s
andLinux Grep:
ubuntu@andLinux:~$ grep "^...." databases/linkedin_merge.combo_not
etc.
Linux Grep:
see above.
Results:
function googleChartDrawVisualization() {
  // Create and populate the data table.
  var data = google.visualization.arrayToDataTable([
    ['Entry Position', 'TotalCmd Lister', 'findstr', 'cygwin grep', 'ubuntu grep', 'MySQL'],
    ['33%', 21, .6, 2, 1.5, .6],
    ['66%', 41, .4, 2, 1.5, .6],
    ['99%', 61, .3, 3.5, 2, .6],
  ]);
  // Create and draw the visualization.
  new google.visualization.BarChart(document.getElementById('visualization')).draw(data, {
    title: "Hash Entry Lookup in 300MB LinkedIn Hash-File",
    width: 600,
    height: 400,
    vAxis: {title: "Hash Position in File [%]"},
    hAxis: {title: "Time [s]"}
  });
}
setTimeout(googleChartDrawVisualization, 5000);
Note: The hash passwords are intentionally not shown in full.
Conclusion:
MySQL performance is disappointing (using the syntax LIKE "c7268....%"), though it can probably be improved through table and INDEX optimization. grep performs well considering its full regular-expression engine. findstr performs very well, with a less sophisticated regular-expression set. Both work in line mode by default, without further flags, rather than in multiline mode as Total Commander's Lister does.
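As a rough illustration of why an index helps with a prefix query such as LIKE "c7268....%": once the keys are sorted, the first candidate can be found by binary search instead of a scan from the start of the file, so the lookup time no longer depends on where the hash sits. The sketch below shows that idea on a sorted in-memory array. It is an analogy only and makes no claim about how InnoDB is implemented; findByPrefix and the sample values are hypothetical.

```javascript
// Hypothetical helper: return all entries of a SORTED array that start with
// the given prefix, using binary search to locate the first candidate.
function findByPrefix(sortedHashes, prefix) {
  let lo = 0;
  let hi = sortedHashes.length;
  // Lower bound: first index whose entry is >= prefix.
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (sortedHashes[mid] < prefix) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  // Entries sharing the prefix are contiguous in sorted order.
  const matches = [];
  for (let i = lo; i < sortedHashes.length && sortedHashes[i].startsWith(prefix); i++) {
    matches.push(sortedHashes[i]);
  }
  return matches;
}

// Made-up short keys for illustration; real entries would be 40 hex characters.
const sortedSample = ['c72f', '00aa', 'ff00', 'c726', '1f3c'].sort();
console.log(findByPrefix(sortedSample, 'c72'));
```

The binary search needs about log2(n) comparisons, which for a few million entries is on the order of twenty or so, compared with a linear scan whose cost grows with the position of the match.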
Lastly, one of the few secure ways to check whether your hash is in the hash file is to create a new browser profile (i.e. one devoid of browser extensions), open a blank page (e.g. CTRL+T), invoke the web-developer JavaScript console (e.g. CTRL+J or CTRL+I), and copy and paste the following code, then press ENTER:
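The snippet itself is not reproduced in this text. As a stand-in, here is a minimal sketch that defines a SHA1 function using the browser's built-in Web Crypto API. It is not the code from the original post, and it assumes a reasonably modern browser in which crypto.subtle is available on the page (it is restricted to secure contexts); the Node.js fallback is included only so the sketch can also be run outside a browser.

```javascript
// Stand-in sketch (NOT the original post's snippet): SHA-1 via Web Crypto.
// Assumes crypto.subtle is available; falls back to Node's webcrypto if the
// global crypto object is missing (older Node.js versions, CommonJS only).
async function SHA1(text) {
  const subtle = (globalThis.crypto ?? require('node:crypto').webcrypto).subtle;
  const bytes = new TextEncoder().encode(text);       // UTF-8 bytes of the password
  const digest = await subtle.digest('SHA-1', bytes); // 20-byte ArrayBuffer
  // Convert each byte to two lowercase hex characters.
  return Array.from(new Uint8Array(digest), (b) => b.toString(16).padStart(2, '0')).join('');
}
```

Because this version is asynchronous, the call returns a promise; in the console, SHA1("yourpassword").then(console.log) prints the 40-character lowercase hex digest.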
Finally, run SHA1("yourpassword"); and hit ENTER. Download the complete hash file from the web and search it for the hash string returned by the SHA1 function. By no means should you trust companies that sell password files and provide a web interface for LinkedIn hash lookup! Doing so is, at the very least, negligent on your part. Browsers are in most cases uniquely identifiable, and web-advertising companies can track you across sites; some do sell such information.
Relying on JavaScript hashing tools embedded in blogs isn't recommended either: most users cannot fully inspect the provided code, nor can the blog author rule out cross-site scripting attacks. Blogs are a poor fit for a secure environment because they generally pull in multiple scripts hosted at several locations, let users post limited content in the form of discussions, and are generally delivered over non-secure HTTP.
PS: You can report your own benchmarks below.