Posted by Admin on 8/31/2008 8:58 AM

Some time back I interviewed with Google and a number of other well-known software development companies. I've written up a number of questions very similar to the ones I was asked, changing them enough that I'm not breaking my NDA.

Q: Why are manhole covers round?

Q: A man pushed his car to a hotel and lost his fortune. What happened?

Q: Explain the significance of "dead beef".

Q: Write a C program which measures the speed of a context switch on a UNIX/Linux system.

Q: Given a function which produces a random integer in the range 1 to 5, write a function which produces a random integer in the range 1 to 7.

Q: Describe the algorithm for a depth-first graph traversal.

Q: Design a class library for writing card games.

Q: You need to check that your friend, Bob, has your correct phone number, but you cannot ask him directly. You must write the question on a card and give it to Eve, who will take the card to Bob and return the answer to you. What must you write on the card, besides the question, to ensure Bob can encode the answer so that Eve cannot read your phone number?

Q: How are cookies passed in the HTTP protocol?

Q: What is the size of the C structure below on a 32-bit system? On a 64-bit?

struct foo {
    char a;
    char* b;
};

Q: Design the SQL database tables for a car rental database.

Q: Write a regular expression which matches an email address.

Q: Write a function f(a, b) which takes two character string arguments and returns a string containing only the characters found in both strings in the order of a. Write a version which is order N-squared and one which is order N.

Q: You are given the source to an application which crashes when run. After running it 10 times in a debugger, you find it never crashes in the same place. The application is single-threaded and uses only the C standard library. What programming errors could be causing this crash? How would you test for each one?

Q: Explain how congestion control works in the TCP protocol.

Here's one that someone sent me by email. I've outlined a possible solution, but I haven't thought the problem through very well. Here's the question:

Design and describe a system/application that will most efficiently produce a report of the top 1 million Google search requests. You are given:

  1. You are given 12 servers to work with, networked together. Each is a dual-processor machine with 4 GB of RAM and 4x 400 GB hard drives (basically, nothing more than high-end PCs).
  2. The log data has already been cleaned for you. It consists of 100 billion log lines, broken into 12 files of 320 GB each, with one 40-byte search term per line.
  3. You can use only custom-written applications or freely available open-source software.

There are approximately 8 billion search terms per log file (320 GB / 40 bytes). Each computer can return its top 1 million search terms to a central server at a cost of 48 MB each: 40 bytes * 1 million = 40 MB, plus an 8-byte unsigned integer occurrence count for each term. The combined top 1 million results from all 12 logs take only 576 MB, so there is plenty of space on any single computer to merge the 12 million entries together and produce the overall top 1 million.

So the only missing part at this point is how to efficiently count the search term occurrences on each computer and return the top 1 million (the same algorithm can be used for merging the 12 per-machine top-1-million lists, too). A hash table may be a bad idea: with this many search terms and only 4 GB of memory, hashing the terms would cause a lot of collisions. You want to use a tree, not a hash. O(log n) should work well for the search term lookups (if it's good enough for STL maps, it's good enough for me, right??!!).

Now, we know that we need to produce the top 1 million results, but there are potentially 8 billion unique search terms per log, and a minimum of 1 unique search term per log (what are the chances you get 8 billion searches for 'porn' or 'sex' in a single log? Not much, but it is a formal possibility. At the very least, such a case would make a great unit test for your program). Either way, you could overflow a 32-bit unsigned int trying to count them, so we have to use unsigned 64-bit counters. That increases the size of the search term count to 8 bytes instead of 4.

With 4 GB of memory, how many log entries can be counted at a time? One unique term takes 40 bytes, plus 8 bytes for the count, plus two pointers for the left/right nodes of the binary tree (16 bytes on a 64-bit system), for 64 bytes per entry. Assume you can use only 3.5 GB of the 4 GB of memory: 3.5 GB / 64 bytes is about 54 million entries. So, here's the strategy:

1) Go through the log and count the occurrences of the search terms found, using a binary tree for search term lookups and a 64-bit unsigned integer for the counter. Do this until you have 54 million unique terms in your tree. Write a marker into the log file for each search term counted so that further passes know it has been counted.
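This counting pass can be sketched with an STL map (a balanced binary tree, giving the O(log n) lookups discussed above). For illustration only, the 54-million cap is a parameter and the log is stood in for by an in-memory vector rather than a 320 GB file with markers; `count_pass` is my name, not part of the original design:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Count occurrences of terms in the log, admitting new terms into the
// tree only while it has room. Terms that don't fit are skipped; a later
// tree (step 3) will pick them up.
std::map<std::string, std::uint64_t>
count_pass(const std::vector<std::string>& log, std::size_t max_unique) {
    std::map<std::string, std::uint64_t> tree;
    for (const auto& term : log) {
        auto it = tree.find(term);
        if (it != tree.end()) {
            ++it->second;              // already tracked: just count it
        } else if (tree.size() < max_unique) {
            tree.emplace(term, 1);     // room left: start tracking it
        }
    }
    return tree;
}
```

In the real system, the marker written back into the log is what tells later trees which lines are already counted; that bookkeeping is omitted here.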

2) Transmit the tree over the network to the next computer and count only the search term occurrences which are already in the tree, marking any counted in the logs as such; then transmit the tree to the next computer, and repeat this process until the tree has passed through each computer and ends up at the computer where the first unique search term was added to the tree (more on this in the next step). Heap-sort the tree by occurrence count (why not, it's already in a binary tree -- that's great for heap-sort!). Truncate the tree to 1 million search terms, and write it to disk.
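The "sort by count and truncate to 1 million" part of this step can be sketched without a full heap-sort: a bounded min-heap of size k keeps memory flat and runs in O(n log k). The function name `top_k` and the reduced k are mine, for illustration:

```cpp
#include <cstdint>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Reduce a term -> count tree to its k most frequent entries using a
// min-heap of size k: push each entry, evict the smallest when over k.
using Entry = std::pair<std::uint64_t, std::string>;  // (count, term)

std::vector<Entry>
top_k(const std::map<std::string, std::uint64_t>& tree, std::size_t k) {
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& [term, count] : tree) {
        heap.emplace(count, term);
        if (heap.size() > k) heap.pop();   // evict the current minimum
    }
    std::vector<Entry> out;
    while (!heap.empty()) { out.push_back(heap.top()); heap.pop(); }
    return out;                            // ascending by count
}
```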

3) Every time a tree of unique search terms "fills up", that is, the tree contains 54 million unique search terms, a new unique-search-term tree must be built and sent to each computer so that the remaining terms can be counted. A tree might begin near the end of a log on one computer and be sent to the next computer with fewer than 54 million unique search terms; when this is the case, the tree continues to collect unique search terms from the log on the next computer.

There are two kinds of processes in this algorithm which travel in rings around the computer network until they reach the computer at which they began. The first process travels from computer to computer generating these unique-search-term trees (call this process #1). Once a tree fills up with unique terms, it breaks off from this process and is sent around the ring to count the occurrences of the terms it already contains (call these processes Tn; there can be many). Once process #1 reaches the computer where it started, it quits. Once all the Tn processes have traveled the ring, they quit too, after writing their top 1 million search results to disk in sorted order. The search terms in these files are all unique and accurately counted. Just keep popping the search term with the highest occurrence count off the top of these files until you have 1 million search terms. That should be it.
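The final "pop the highest count off the top of these files" step is a k-way merge driven by a max-heap of file cursors. A sketch, with each on-disk file stood in for by an in-memory vector already sorted by descending count (`merge_top` is my name for it):

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

using Entry = std::pair<std::uint64_t, std::string>;  // (count, term)

// K-way merge of the per-tree result lists, popping the globally highest
// count until we have k terms. Terms are unique across lists by
// construction, so no re-summing of counts is needed.
std::vector<Entry> merge_top(const std::vector<std::vector<Entry>>& lists,
                             std::size_t k) {
    // (count, list index, position): max-heap ordered by count first.
    using Cursor = std::tuple<std::uint64_t, std::size_t, std::size_t>;
    std::priority_queue<Cursor> heap;
    for (std::size_t i = 0; i < lists.size(); ++i)
        if (!lists[i].empty()) heap.emplace(lists[i][0].first, i, 0);

    std::vector<Entry> out;
    while (!heap.empty() && out.size() < k) {
        auto [count, i, pos] = heap.top();
        heap.pop();
        out.push_back(lists[i][pos]);
        if (pos + 1 < lists[i].size())     // advance this list's cursor
            heap.emplace(lists[i][pos + 1].first, i, pos + 1);
    }
    return out;
}
```

With real files, each cursor would be a buffered reader over one sorted results file instead of a vector index; the heap logic is unchanged.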
