Trouble reading CSV files with QBASIC/FreeBasic

dbish · Post by **dbish** » Tue Apr 12, 2005 8:02 pm

I am in the middle of a large project involving simple manipulation of a large number of files (related to stock market ticker symbols). I first read the raw CSV files and extract summary information and which I write to new CSV files. I have a massive amount of data to read in and process. Order of magnitude: 2,000,000 files / 15 GB of data in the form of small CSV files. The first set of QBASIC programs read the data and created summary CSV files. Much of the code is not shown but I used the following lines to write the summary files. I believe that I have set up correct DIM statements for all variables and am pretty good about using data type operators (%,$,# etc) on all my variables in the code:

OPEN FILENAMESAVE$ FOR APPEND AS #2
WRITE #2, "DATA", "INDEX", "CS", "MKTCAP", "MKTCAPMA", "CUMCALLOI", "CUMPUTOI", "PutCall", "PC SD OI", "PC MA OI"
WRITE #2, "DATE", "DATE", "PRICE", "999", "10", "CUMCALL", "CUMPUT", "1", "2", "10"
WRITE #2, ""

Above code writes a 3 line header to a new file. Code below adds a line for summary data from each file for a given stock ticker:

WRITE #2, DATADATE$, INDEXDATE$, CSPRICE$, MKTCAP$, "1", FINALCALLOI$, FINALPUTOI$, FINALPUTCALLOI$, "2", "3"

If I open the summary file in Windows Notepad I see the following:
"DATA","INDEX","CS","MKTCAP","MKTCAPMA","CUMCALLOI","CUMPUTOI","PutCall OI","PC SD OI","PC MA OI"
"DATE","DATE","PRICE","0.086778","10","CUMCALLOI","CUMPUTOI","1","2","10"
""
"2/5/2002 4:00","20040205","6.23","0.540626941655159","1","3122.825043682009","377.5000035464764","0.12088414","2","3"
"2/6/2002 4:00","20040206","6.1","0.5293457917242049","1","2792.900016859174","398.7999981790781","0.14279065","2","3"
etc.
etc.
etc.
etc.

My problem occurs in the next phase of the overall project - taking these summary files and doing something with them. I am trying to read the files (using a completely different QBASIC program) with the following code snippet:
FOR f = 1 TO TOTALFILECOUNT%
OPEN FILENAMELIST$(f) FOR INPUT AS #1
FOR i = 1 TO 800
LINE INPUT #1, WorkingLineArray$(i)
If i<4 Then Goto SKIPEND
CHAR1$ = LEFT$(WorkingLineArray$(i),1)
IF CHAR1$ = "" THEN EXIT FOR
SKIPEND:
NEXT
CLOSE #1
FINALROWNUMBER% = i-1

OPEN FILENAMELIST$(f) FOR INPUT AS #1
FOR i = 1 TO FINALROWNUMBER%
FOR j = 1 TO 10
INPUT #1, WorkingArray$(i, j)
NEXT
NEXT
CLOSE #1

FOR i = 1 TO FINALROWNUMBER%
FOR j = 1 TO 10
PRINT "i ";i;" j ";j;" cell= ";WorkingArray$(i, j)
sleep 250
NEXT
NEXT

The first loop determines the last row of FILENAMELIST(f) - for some reason DO WHILE NOT EOF(1) does not work - probably for same reason that I am doing this posting. The second loop reads data into WorkingArray$, and third displays data as a debugging test. What I found is that my program reads the next field every two (i,j) reads. By this I mean that in reading my CSV dump above I would get the following results in my print for debugging loop:
1 j value
-- -- ------
1 1 DATA
1 2
1 3 INDEX
1 4
1 5 CS
1 6
1 7 MKTCAP
1 8
1 9 MKTCAPMA

However, when I edited the CSV input file in Windows Notepad and took out all the quotes (") the file ran like it should. I tried to change my data generation process to eliminate the quotes from the program that generated the Data CSV file but QBASIC balked about "expected expression" - it mistook my unquoted strings as variables without definition.

My data files are so large and numerous that I have written the original code in QBASIC and then compiled and ran in FreeBasic to get around the data array size limitations. These results are all from the FreeBasic compiled runs. It will take me several hours to create special data and recode the program to test whether QBASIC can handle it correctly (and thus determine if the problem is FreeBasic or my coding). I am hoping that one of the forum members has enough experience with both QBASIC and FreeBasic to give me some help.

I could manually fix the quotes in my summary data files (3500 total) but as new files are created on an ongoing basis this is not a workable solution. I cannot figure out why it skips every other field with quotes and works with no quotes. For what it is worth, the LINE INPUT loop works fine even with the data with quotes.

Any ideas??

Thanks

dbish

MystikShadows · Post by **MystikShadows** » Tue Apr 12, 2005 9:08 pm

When you open a sequencial file, it's not necessarily CSV it's a text file. this means that there's not predetermined formatting per se on a line or in the whole file. I don't like to use Write, I prefer to use PRINT #2, DATA because Write for some reason always gives some kind of complication that I shouldn't need when dealing with text files

. My suggestion is to do the following changes:

Therefore the best thing you can do is.

1. Write your data one line at a time like so:

Code: Select all


OPEN FILENAMESAVE$ FOR APPEND AS #2

' If you absolutely want the commas present in the file:
PRINT #2, "DATA,INDEX,CS,MKTCAP,MKTCAPMA,CUMCALLOI,CUMPUTOI,PutCall,PC SD OI,PC MA OI"
PRINT #2, "DATE,DATE,PRICE,999,10,CUMCALL,CUMPUT,1,2,10
PRINT #2, "" 

' If you just want the header names
PRINT #2, "DATA INDEX CS MKTCAP MKTCAPMA CUMCALLOI CUMPUTOI PutCall PC SD OI PC MA OI"
PRINT #2, "DATE DATE PRICE 999 10 CUMCALL CUMPUT 1 2 10
PRINT #2, "" 

' IF in your official code the values are variables then you'll
' have to concatenate them into a string and write that string 
' with the print statement. Note this line should be on one line.
PRINT #2, DATADATE$ + "," + INDEXDATE$ + "," + CSPRICE$+ "," + MKTCAP$+ "," + "1" + "," + FINALCALLOI$+ "," + FINALPUTOI$ + "," + FINALPUTCALLOI$+ "," + "2"+ "," + "3"

I'm not sure you need to include quotes in what you're writing or reading at all. I see that each record in the summary line you write has 10 "fields" in them. If you open the data file you should have something like the following

Code: Select all

DATA,INDEX,CS,MKTCAP,MKTCAPMA,CUMCALLOI,CUMPUTOI,PutCall,OI,PC SD OI,PC MA OI
DATE,DATE,PRICE,0.086778, 10,CUMCALLOI,CUMPUTOI,1,2,10

2/5/2002 4:00,20040205,6.23,0.540626941655159,1,3122.825043682009,377.5000035464764, 0.12088414,2,3
2/6/2002 4:00,20040206,6.1,0.5293457917242049,1,2792.900016859174,398.7999981790781, 0.14279065,2,3
etc.
etc.
etc.
etc.

Next when you read that file, you'll just need to look for the commas to split your field values. To do so, perhaps something like the folllowing would do the trick.

Code: Select all


DIM WorkString AS STRING 
DIM TempString AS STRING 
DIM Offset     AS INTEGER
DIM Position   AS INTEGER
DIM ColumnNo   AS INTEGER

' This assumes each of your files has that same 3 line header.
FOR f = 1 TO TOTALFILECOUNT%
    OPEN FILENAMELIST$(f) FOR INPUT AS #1
    
    ' This skips the header so we can eliminate the SkipEnd Label
    LINE INPUT #1, WorkingLineArray$(1)
    LINE INPUT #1, WorkingLineArray$(2)
    LINE INPUT #1, WorkingLineArray$(3)
    ' And we loop to get the remainder of the records
    DO WHILE NOT EOF(1)
        LINE INPUT #1, WorkingLineArray$(i)
        IF RTRIM$(WorkingLineArray$(i)) = "" THEN EXIT DO
    LOOP
    CLOSE #1
    FINALROWNUMBER% = i-1

    OPEN FILENAMELIST$(f) FOR INPUT AS #1
    FOR i = 1 TO FINALROWNUMBER%
           LINE INPUT #1, WorkString
           Offset = 0
           Position = INSTR$(1, WorkString, ",")
           ColumnNo = 1
           DO While Position <> 0
              TempString = MID$(WorkString, Offset, Position - Offset - 1)
              WorkingArray$(i, ColumnNo)  = TempString
              Offset = Position + 1
              ColumnNo = ColumnNo + 1
              Position = INSTR$(Offset, WorkString, ",")
              If Position = 0 THEN
                 ' The following line should be on one line.
                 WorkArray$(i, ColumnNo) = RIGHT$(WorkString, LEN(WorkString) - OffSet) 
              END IF
           LOOP
    NEXT
    CLOSE #1

    FOR i = 1 TO FINALROWNUMBER%
        FOR j = 1 TO 10
            PRINT "i ";i;" j ";j;" cell= ";WorkingArray$(i, j)
            sleep 250
        NEXT
    NEXT 
NEXT

If your data is written properly and has 10 fields on each line like illustrated here, after your data should be well organized and should work fine.

If you have more questions, feel free to let me know.

Hope this helps.

dbish · Post by **dbish** » Wed Apr 13, 2005 12:40 am

Thanks for the quick reply. I will need some time to get through all your suggestions and try it out. Thanks - I'll be back!

MystikShadows · Post by **MystikShadows** » Wed Apr 13, 2005 8:31 am

You're welcome...the one good thing about my suggestion is that if you add a field in the future, my loop will auto adjust itself the way its' written. provided you changed the dim statement to the right number of columns

.

I'll be waiting to hear from ya

dbish · Post by **dbish** » Wed Apr 13, 2005 7:40 pm

I have made it through the first part with the replacement of WRITE with PRINT. MUCH MUCH better. I recreated the data files and they worked perfectly. Now, I need to move onto the reading using your second set of Routines. Thanks so much for the code you sent. You reply was posted within a short time and I had spent many multiples of that time trying manual debug action without success. Sure helps to go to the knowledgeable one!

By the way, I thought the throughput of my (OUR!) routines was phenomenal. I had 1,752,386 small CSV data files in 3581 directories comprising 14.75 GB. I estimated that there is about 80,000,000 lines of data (DIM WOW as string (80,000,000,000, 19)). The routines chunked through all the data and generated 3581 summary files of about 60k each in size that I can use in Excel or other QBASIC routines. The time to process all this data (Windows XP, 2.4 GB CPU)? 3 hours, 45 minutes!

Thanks again - I'll now start working on the second piece of this puzzle.

MystikShadows · Post by **MystikShadows** » Wed Apr 13, 2005 7:46 pm

indeed, that's not bad at all considering the number of file access (read and write) required per file. Pretty impressive.

Once you get the 2nd part going, we'll see if there isn't a way to accelerate things even more....my firsts priority was to provide something that does the job, reasonably since it seemed to be a inportant project to you that needed to be solved ASAP

.

So we'll get it working they we'll play with it a bit (in a different directory of course

)

MystikShadows · Post by **MystikShadows** » Thu Apr 14, 2005 8:43 pm

I suspect once the data file creation is complete and flawless we'll be able to go back to the DO WHILE NOT EOF(1) and it will work perfectly then. This will save us a pass in the file (the one that gets the row count for the parsing loop). Thus probably accelerating the whole process by close to 33%. so the 3:45 Hours could then be brought down by about an hour faster

.

But that's just one trick I got up my sleeve. I got more

. But first we'll just make sure it does it's job....then we'll make it do it's job faster

.

Just let me know when you're ready.