
Q: In number 4, you say not to store the word from the data file (the story) in an array. Could you elaborate?

A: Yes. You may store a single word from the data file in a char array that you will use for comparison purposes. What you want to avoid is storing *all* the words from the story; only store one at a time, and write over it each time you read a new word.

When you are looking for bi-grams, you may store two words.
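Here is a minimal sketch of the "one word at a time" idea. The function name count_word and the 32-char limit (MAX_WORD_LEN) are my own assumptions for illustration, not part of the lab handout; use whatever sizes your assignment specifies.

```c
#include <stdio.h>
#include <string.h>

#define MAX_WORD_LEN 32   /* assumed limit, including the '\0' */

/* Count how often target appears in the stream, holding only one
   word from the file in memory at any time. */
int count_word(FILE *fp, const char *target)
{
    char word[MAX_WORD_LEN];   /* reused for every word we read */
    int count = 0;

    /* %31s stores at most 31 chars plus '\0', so it cannot overflow
       word; each call writes over the previous word. */
    while (fscanf(fp, "%31s", word) == 1) {
        if (strcmp(word, target) == 0) {
            count++;
        }
    }
    return count;
}
```

Note that the width in "%31s" has to stay one less than the array size, and that a word longer than 31 chars would be read in pieces. For bi-grams the same loop works with two buffers: copy the current word into a "previous" buffer before reading the next one.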

Q: May I use the gets() command? It seems easier to use than fgets().

A: No. The gets() function is a major source of program vulnerabilities: it has no way of knowing how big your array is, so it makes it easy to write past the end of the array you are using. Do not use it. Use fgets() instead. See

http://www.rsasecurity.com/rsalabs/node.asp?id=2011

for more information on buffer overflow security problems.
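If the extra arguments to fgets() are what make it seem harder, a small wrapper can hide them. This is only a sketch; the name read_line is my own invention, not a library function.

```c
#include <stdio.h>
#include <string.h>

/* Read one line into buf (which holds size chars), dropping the
   trailing newline if there is one.
   Returns 1 on success, 0 at end of file or on a read error. */
int read_line(char *buf, int size, FILE *fp)
{
    if (fgets(buf, size, fp) == NULL) {
        return 0;
    }
    /* strcspn gives the index of the first '\n', or the string
       length if there is none, so this is always in bounds. */
    buf[strcspn(buf, "\n")] = '\0';
    return 1;
}
```

You would call it as read_line(line, sizeof line, stdin) with line declared as a char array. Unlike gets(), fgets() is told the array size and never writes more than size chars; a line that is too long simply comes back in pieces on later calls.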

Q: I have a question about the final part of the lab--the bigram frequency. What should we do if the word is not part of the vocab array? Add it? Skip it? Print an error message?

A: Skip it. Don't count a bi-gram that starts or ends with a word we don't know.
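One way to sketch the "skip it" rule is below. The names find_word and count_bigram, and the sizes VOCAB_SIZE and MAX_WORD_LEN, are assumptions made for illustration; your vocab and bigram arrays may be declared differently.

```c
#include <string.h>

#define VOCAB_SIZE   6    /* assumed vocabulary size */
#define MAX_WORD_LEN 32   /* assumed word length limit */

/* Return the index of word in vocab, or -1 if it is not there. */
int find_word(char vocab[][MAX_WORD_LEN], int n, const char *word)
{
    int i;
    for (i = 0; i < n; i++) {
        if (strcmp(vocab[i], word) == 0) {
            return i;
        }
    }
    return -1;
}

/* Count the bi-gram (first, second) only if both words are known.
   Returns 1 if it was counted, 0 if it was skipped. */
int count_bigram(int bigram[][VOCAB_SIZE], char vocab[][MAX_WORD_LEN],
                 int n, const char *first, const char *second)
{
    int row = find_word(vocab, n, first);
    int col = find_word(vocab, n, second);

    if (row < 0 || col < 0) {
        return 0;   /* unknown word: skip, no error message */
    }
    bigram[row][col]++;
    return 1;
}
```

Checking both indexes before touching the array also keeps you from using -1 as a subscript.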

Q: When I read in a word that is followed by punctuation, the punctuation ends up in my data. How do I fix this?

A: You can modify one of the functions you wrote in lab to walk through the word, replacing the first instance of a non-letter char with a null terminating character. Or you can use some of the C char and string functions listed in Appendix B of H&K. Warning: there are many, many C string and char functions, and your TAs and I will not be familiar with all of them. Make sure you read carefully and understand what they do, then TEST to be sure you are correct about their behavior.
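The first approach might look something like this sketch. The function name trim_at_non_letter is hypothetical, and it uses isalpha() from <ctype.h>; the function you wrote in lab may be organized differently.

```c
#include <ctype.h>

/* Cut the word off at its first non-letter char by writing a
   null terminator there. Words made only of letters are unchanged. */
void trim_at_non_letter(char *word)
{
    int i;
    for (i = 0; word[i] != '\0'; i++) {
        /* the cast keeps isalpha() well defined for any char value */
        if (!isalpha((unsigned char)word[i])) {
            word[i] = '\0';
            return;
        }
    }
}
```

Test the edge cases before you rely on it: a word with an apostrophe is cut at the apostrophe, and a word that begins with punctuation (an opening quote, say) becomes the empty string.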

Q: My email was eaten by rabid wolves. Could you please post the information on how to print the bi-gram frequencies?

A: For each row in the bigram array (each row represents the start word
for a bigram), print four things:

the word that the row represents;
the sum of all bigram counts in that row;
the (first) max column word;
the count of the (first) max column word.

So if you printed the output for the bigram array in the project
handout, it would look like this:

row sum column frequency

the 2 quick 1
quick 1 brown 1
brown 1 fox 1
fox 1 jumped 1
jumped 1 the 1
spam 0 no_word 0

Note that this means finding the max and sum for each row. When you
find the max, only the first word that has the max count is to be
printed, so in the top row "quick" gets printed because it comes
before "spam" in the row, even though both have frequency 1.
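The sum and "first max" for each row can be found in a single pass, as in the sketch below. The names row_summary and print_bigram_table and the sizes VOCAB_SIZE and MAX_WORD_LEN are assumptions for illustration, and the single-space output layout is just one possibility.

```c
#include <stdio.h>

#define VOCAB_SIZE   6    /* assumed vocabulary size */
#define MAX_WORD_LEN 32   /* assumed word length limit */

/* Store the row total in *sum and return the index of the first
   column holding the largest count, or -1 if every count is 0. */
int row_summary(const int row[], int n, int *sum)
{
    int col, max_col = -1, max_count = 0;

    *sum = 0;
    for (col = 0; col < n; col++) {
        *sum += row[col];
        if (row[col] > max_count) {   /* strict >, so the first max wins ties */
            max_count = row[col];
            max_col = col;
        }
    }
    return max_col;
}

/* Print one line per row: word, row sum, first max word, its count. */
void print_bigram_table(int bigram[][VOCAB_SIZE],
                        char vocab[][MAX_WORD_LEN], int n)
{
    int r, sum, max_col;

    printf("row sum column frequency\n");
    for (r = 0; r < n; r++) {
        max_col = row_summary(bigram[r], n, &sum);
        if (max_col < 0) {
            printf("%s %d no_word 0\n", vocab[r], sum);
        } else {
            printf("%s %d %s %d\n", vocab[r], sum,
                   vocab[max_col], bigram[r][max_col]);
        }
    }
}
```

The strict > comparison is what keeps the first word on a tie: a later column with the same count never replaces the one already found. A row with no counts at all never updates max_col, which is how the no_word case is detected.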