This page covers techniques for searching unreasonably large files. An example of a large file would be a database dump to plain text, such as when a Postgres server is dumped with pg_dumpall (let's say it is 5 GB).
(see advanced technique below to quickly examine portions of one large file)
To break a large file into smaller files that can be handled, use the split program, available in a unix shell. A workable chunk size is 500 MB. Remember that if you specify a byte size for each chunk instead of a number of lines, you run the risk of the file being split at a point that cuts your search term in half. (It's a low risk, but a risk nonetheless.)
split -b 500m <largefile> [filename prefix]
This will create smaller files named xaa, xab, xac, and so on. You can also give split a different filename prefix to use if desired.
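To sidestep the split-term risk entirely, split can divide by line count instead of bytes. A minimal sketch (the file name, line counts, and prefix here are made up for illustration):

```shell
# Build a stand-in "large" file of 100000 numbered lines.
seq 1 100000 > largefile

# Split by line count rather than bytes, so no line (and no match on it)
# is ever cut across two output files. Name the pieces part_aa, part_ab, ...
split -l 20000 largefile part_

# Five chunks of 20000 lines each.
ls part_*
```

Line-based splitting only works if the dump actually has reasonably short lines; for a file that is one enormous line, byte-based splitting is the only option.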
Now, to search through the smaller files for something specific, use grep. If a match might fall on a line containing binary data, you may want to suppress normal output and only list the files containing matches (the -l flag), or use the -o flag to show only the portion of the line matching your expression.
This command tells grep to search for a regular expression, show the line number and byte offset for the matching results, and only search files beginning with 'x'.
grep -n -b -E 'posix regex pattern' x*
This command will only show the portion of each line matching the expression (so binary data won't cause an output problem), along with its byte offset in the file. grep will stop after 25 matches per file.
grep -b -o -m 25 -E 'posix regex pattern' x*
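As a quick sanity check of what these flags produce, here is a sketch on a tiny stand-in file (the file name and pattern are invented for illustration; offsets assume GNU grep):

```shell
# A tiny stand-in file; imagine it is one of the 500 MB chunks.
printf 'alpha\nbeta\ngamma\nbeta\n' > sample.txt

# -n: line number, -b: byte offset where the matching line starts.
grep -n -b -E 'beta' sample.txt
# -> 2:6:beta
#    4:17:beta

# -o prints only the matched text (safe with binary lines);
# -m 2 stops after two matching lines in this file.
grep -b -o -m 2 -E 'beta' sample.txt
# -> 6:beta
#    17:beta
```

The byte offsets (6 and 17) are exactly what you will feed to dd in the technique below.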
Once you find a set of files with results, you could open them one at a time in a capable text editor, such as vim. However, a better approach is to open the file in a hex editor. The reason is that a hex editor opens the file as a raw sequence of bytes, whereas a text editor will try to find line breaks, apply syntax formatting, and so on.
You can also keep the large file as is and quickly examine portions of it. To do this, you will need the unix utility dd. First run grep on the large file as shown above, making sure to use the -b flag to output the byte offset. Then you can examine that area of the file with the command below. Let's say grep found a match at offset 155450, and we want to examine 1000 bytes before and after this match:
dd if=largefile bs=1 skip=154450 count=2000 | less

or, if we have binary data:

dd if=largefile bs=1 skip=154450 count=2000 | xxd | less
This tells dd to use largefile as an input file (if=), use a block size of 1 byte (bs=1), skip ahead to byte offset 154450 (skip=154450), and output only 2000 bytes from there (count=2000). The result is piped to less for viewing.
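The two steps can be chained end to end. A sketch on a small test file (the file, marker string, and window size are invented for illustration):

```shell
# Build a test file: 5000 'a' bytes, a marker, then 5000 'b' bytes.
{ head -c 5000 /dev/zero | tr '\0' 'a'
  printf 'NEEDLE'
  head -c 5000 /dev/zero | tr '\0' 'b'
} > bigfile

# Step 1: grep -b -o reports the byte offset of the match (5000 here).
off=$(grep -b -o 'NEEDLE' bigfile | cut -d: -f1)

# Step 2: view 100 bytes on either side of the 6-byte match with dd.
start=$((off - 100))
dd if=bigfile bs=1 skip="$start" count=206 2>/dev/null | xxd | head -n 3

# Equivalent without dd; tail's +N form is 1-based.
tail -c +$((start + 1)) bigfile | head -c 206 | xxd | head -n 3
```

Note that bs=1 makes dd read one byte at a time, which is slow when skipping deep into a multi-gigabyte file; the tail -c / head -c form shown last avoids that.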