File search in QB without crash

mikefromca · Post by **mikefromca** » Sat Jul 10, 2021 9:44 pm

Pardon me for my crazy code below, but one thing I have an issue with is file scanning.

I'm making an application strictly for MS-DOS with QB 4.x that is to scan the entire hard drive for a particular file. The problem is the program bombs with the index of the array index being too high or insufficient string space.

Anyways, for those who don't want to go through my code, I'll explain in pseudo code what happens:

Code: Select all

Assume our X filesystem has this directory tree:

In X:
Folder: QB
Folder: WIN
In X:\QB
Folder: ASIC
Folder: 45
In X:\WIN
Folder: SYS

And no folders in either ASIC, 45, or SYS.

Pseudo code: 

1. Build empty array of x bytes to hold all folders.
2. Set 2 variables: s = number of folders found, p = our progress in folders
2. Set 1st index (s=1) of that array to root drive. example: "X:" (# folders found=1)
3. Load 1st index (p=1) and get directory list (= DIR X: in DOS)
4. Array now contains:
Index 1 = X:
index 2 = X:\QB
Index 3 = X:\WIN
5. s = 3 (as there's 3 entries in the folder array). increment p and get next list (= DIR X:\QB)
6. Array now contains:
Index 1 = X:
index 2 = X:\QB
Index 3 = X:\WIN
index 4 = X:\QB\ASIC
index 5 = X:\QB\45
7. s = 5 (as there's 5 entries in the folder array). increment p and get next list (= DIR X:\WIN)
8. Array now contains:
Index 1 = X:
index 2 = X:\QB
Index 3 = X:\WIN
index 4 = X:\QB\ASIC
index 5 = X:\QB\45
index 6 = X:\WIN\SYS
9. s = 6 (as there's 6 entries in the folder array). increment p and get next list (= DIR X:\QB\ASIC)
....Now this time the list won't grow because the remaining folders don't have folders inside of them. p will keep incrementing until it reaches 6 them the program ends.

As you can see, in my code, if there are a ton of folders like many computers will have today, my program will ultimately crash due to insufficient string space.

And I feel my options will look bleak because my target is old computers without much memory or disk space.

I see my choices are as follows:

1. Require user to have several megs of XMS memory, or
2. Thrash the user's disk as directory entries are constantly written to or read from it, or
3. somehow wipe out sections of the folder array and somehow fill them with new directory entries.

Any suggestions?

And there's my crummy code below.

Code: Select all

folder$() variable represents all folders collected so far.

myfol$ = RTRIM$(myfol$)
x.func = 0: x.filespec = myfol$ + "\*.*" + CHR$(0): x.mask = &H1F: xx& = 0
DO
CALL runcode: obj$ = LEFT$(x.fname, x.flen)
IF x.fattr = 16 THEN
IF obj$ <> "." AND obj$ <> ".." THEN
folct& = folct& + 1: folders$(folct&) = myfol$ + "\" + obj$
END IF
ELSE
mat$ = LEFT$(x.fname, x.flen)
IF RTRIM$(UCASE$(mat$)) = RTRIM$(UCASE$(matf$)) THEN findinfol% = 1: EXIT DO
END IF
xx& = xx& + 1
LOOP UNTIL x.errc > 0 OR xx& > 32766

Erik · Post by **Erik** » Mon Aug 02, 2021 11:51 pm

What if you cached your directory "database" array to a file? That way you could write chunks of it to disk before it hits the max string space.

For example, say the full directory structure will take 2.5x the max string space, once you get close to maxing it out, flush that data to a file, clear the array and continue where you left off. Then continue this process until your directory scan is done. You could call the files something incremental like DIRTEMP.001, DIRTEMP.002, DIRTEMP.003 or something like that where it could easily be read back in using a loop.

General idea of how I could see this being used:

First run/files do not exist? Do directory scan and generate files.
Read through files to find location where searched term is found.
If found in temp cache files, attempt a "DIR" on the found file to make sure that it's really still there.
If found file has been verified to exist, return found location and exit.
If not found, do scan to generate files and re-run search.
If found on follow up scan with fresh cache, return found location and exit.
If not found on follow up scan with fresh cache, return "Not found" and exit.

When the program is ran and the temp files already exist, it will skip the initial directory scan and file generation and attempt to search through temp files first so not to trash the disk generating them each time it's ran. Only if the search term isn't found will it refresh the cached directory structure and try again.

I added the "DIR" check in the bullet points above for when when it's found in the first cache search for times that the program is re-ran but the cache data might have fallen out of sync. Example: the directory temp cache files exist, but the file you searched for that's listed in the cache has since been deleted. The search will return found when the file no longer exists. A quick check that it's really there before returning the results saves the application from returning any false positives. If it fails the validation check, the temp files are regenerated and the directory structure is rescanned and then searched through one last time with a fresh pull to see if it can be found.

It's no fool proof though. Maybe you could add another command line option to force cache invalidation and rescan of directories at start. Also save the temp file timestamp at the top of the first temp file. Then if it's greater than a certain age, assume the contents can't be trusted and do a scan at the start.

Just thinking out loud. Hopefully the idea helps.

MikeHawk · Post by **MikeHawk** » Wed Aug 18, 2021 9:28 pm

I wrote an answer, but I'm dumb and tired so I'm revising it.

Short answer to your problem: you don't have to search all folders and keep track of the whole directory tree starting from the root IF you parse your folders linearly (one scope at the time.) That means you only need to keep track of folders located inside parents. In your example you try to search all folders and subfolders of both "x:/qb/[directories]" and "x:/win/[directories]", but you don't need to because you won't be parsing both directories at the same time.

Keep track of the currently searched folder via an absolute path string (something that would look like X:/QB/PROJECTS/FINISHED/). This way, you know where you are and most importantly, you know the name of the folder you leave when going DOWN. For instance, if there's nothing of interest in FINISHED/, you just go down to X:/QB/PROJECTS/ and search for the next folder after FINISHED/ on your list and enter it. Here's what I'd do:

When you enter a new folder/begin your search (going UP:)
1. Find First with your file.
- If a file exists, add to result array.
2. Now, Find First for all objects (discard files, retain folders.)
- For as long as you have folders, add them to the folder list for this scope.
3. When no more folders are found
- You found no folder at all. You already searched for your file so go DOWN to parent.
- You have some folders, go UP via the first on your list (update absolute path)

When you leave a folder (going DOWN:)
1. Search for the folder you left in this scope's folder list.
- If there's another folder available, go UP via that folder.
- If there's no more folder available, go DOWN to parent.

Since you search for your file upon entering (going UP) a new folder, the only thing left to do is listing all directories contained within and visit each one of them. When you're going DOWN to the parent directory, you can trash the directory list: going DOWN means you parsed all directories and already looked for your file. This should keep the array size in check (each entry will be at most 12 bytes - I mean, it's DOS) and since there's only one list per scope, you don't have exponentially growing directory names.

Does it make sense?

EDIT: Alright, I gave it a fair shake and it seems to work under DOSBox although I haven't tested it extensively. There's a 20,000 bytes buffer for directories and an extra 384 bytes to keep track of parent directories composition. There's a debug system to show where the program is searching, and the buffer content (with colors so you can see how each scope is processed.)

Code: Select all

' The basics:
' 1. Only when going UP (or on initialization):
'    > Get all folders from current directory, seek file
' 2. Seek next folder we should search
'    > If there's a folder, enter it (going UP) and repeat all steps.
'    > If there's no folder left, go to parent (going DOWN), repeat all steps.
' You're done when you're no longer allowed to go DOWN.

'$INCLUDE: 'QB.BI'
DEFINT A-Z

CONST debugLevel = 1 ' set to 0 for silent, 1 for path, 2 for path & buffer

DECLARE SUB searchFile (where AS STRING, query AS STRING)
DECLARE SUB getObjects (where AS STRING, query AS STRING)
DECLARE SUB debugBuffer ()
DECLARE FUNCTION scopeDown% ()
DECLARE FUNCTION scopeUp$ ()
DECLARE FUNCTION scopeAppend% (folder AS STRING)
DECLARE FUNCTION INSTRREV% (source AS STRING, find AS STRING)

CONST moveROOT = 0
CONST moveDOWN = -1
CONST moveUP = 1

TYPE scopeInfo          ' OFS SZE   DESCRIPTION
  start AS INTEGER      ' 000 ..2   Starting offset in pathLst (starts at 1)
  count AS INTEGER      ' 002 ..2   Number of folders
  visit AS INTEGER      ' 004 ..2   Index of last visited folder (0 for none)
END TYPE                ' 6 BYTES - Scope descriptor

DIM SHARED pathLst AS STRING * 20000      ' Parent and current sub-folders
DIM SHARED pathOfs AS INTEGER             ' Writing offset for pathLst
DIM SHARED levlNfo(0 TO 63) AS scopeInfo  ' Root plus 63 levels deep
DIM SHARED levlNow AS INTEGER             ' How deep we are

' Search for all files named "duke3d.grp" starting from "c:/games/"
' Note: the program assumes more than one file can have this name; finding a
' file won't stop the program. There's not user escape implemented.
searchFile "c:/games/", "duke3d.grp"

''
'' Display folder name buffer, DEBUG PURPOSE ONLY
''
SUB debugBuffer
  DIM length AS INTEGER, offset AS INTEGER, scopeId AS INTEGER

  offset = 1
  DO
    IF (offset = levlNfo(scopeId).start) THEN
      COLOR 1 + (scopeId AND &HF)
      scopeId = scopeId + 1
    END IF
    length = ASC(MID$(pathLst, offset, 1))
    PRINT MID$(pathLst, offset + 1, length); " ";
    offset = offset + length + 1
  LOOP WHILE (offset < pathOfs)
  COLOR 8: PRINT pathOfs
END SUB

''
'' Get all objects in current directory
''
SUB getObjects (where AS STRING, query AS STRING)
  DIM DTA AS STRING * 44, MaskZ AS STRING
  DIM regs AS RegTypeX, objName AS STRING

  '' SETUP DTA SO WE DON'T DESTROY COMMAND$ ''
  regs.ax = &H1A00                  ' Set DTA function
  regs.dx = VARPTR(DTA)             ' DS:DX points to our DTA
  regs.ds = -1                      ' Use current value for DS
  CALL INTERRUPTX(&H21, regs, regs) ' Do the interrupt

  MaskZ = where + "*.*" + CHR$(0)   '  Mask (search all)
  regs.ax = &H4E00                  '  FindFirst
  regs.cx = 16                      '  Get all object types
  regs.dx = SADD(MaskZ)             '  DS:DX points to ASCIIZ file mask
  regs.ds = -1                      '  Use current DS

  '' PARSE ALL OBJECTS ''
  DO
    CALL INTERRUPTX(&H21, regs, regs)    ' Do the interrupt
    IF (regs.flags AND &H1) THEN EXIT DO ' No object left

    objName = UCASE$(MID$(DTA, 31, INSTR(31, DTA, CHR$(0)) - 31))
    ' Folder, append to scope
    IF (ASC(MID$(DTA, &H15 + 1, 1)) AND &H10) THEN
      IF ((objName <> ".") AND (objName <> "..")) THEN
        IF scopeAppend%(objName) THEN PRINT "Buffer overflow!": END
      END IF
    ' File, compare with query
    ELSE
      IF (objName = query) THEN
        PRINT "Found file in " + CHR$(34) + where + CHR$(34) + " there might be more! (PRESS ANY KEY)": SLEEP
      END IF
    END IF

    ' FindNext
    regs.ax = &H4F00
  LOOP
END SUB

''
'' Last occurence of a string
''
FUNCTION INSTRREV% (source AS STRING, find AS STRING)
  DIM ofs AS INTEGER
  DO
    INSTRREV% = ofs
    ofs = INSTR(ofs + 1, source, find)
  LOOP WHILE ofs
END FUNCTION

''
'' Append subfolder to scope. One byte is reserved to provide the length in
'' bytes of the folder name. This takes less memory than assuming that all
'' folders have 12-byte long names. This function returns -1 if the memory
'' buffer is saturated. Returns 0 if the sub-folder was successfully added.
''
FUNCTION scopeAppend% (folder AS STRING)
  DIM tmp AS STRING

  IF (levlNfo(levlNow).start = 0) THEN levlNfo(levlNow).start = pathOfs
  levlNfo(levlNow).count = levlNfo(levlNow).count + 1

  tmp = LTRIM$(RTRIM$(UCASE$(folder)))
  tmp = CHR$(LEN(tmp)) + tmp
  IF ((pathOfs + LEN(tmp)) >= LEN(pathLst)) THEN
    scopeAppend% = -1
  ELSE
    MID$(pathLst, pathOfs, LEN(tmp)) = tmp
    pathOfs = pathOfs + LEN(tmp)
    scopeAppend% = 0
  END IF
END FUNCTION

''
'' Free current scope and go back to parent. Freeing means that we no longer
'' need the sub-folder list for this scope so we can rewrite it. We also clear
'' the descriptor so it can be re-used. This function returns -1 if root as
'' been reached (there's no parent directory.)
''
FUNCTION scopeDown%
  IF levlNfo(levlNow).start THEN pathOfs = levlNfo(levlNow).start
  levlNfo(levlNow).start = 0
  levlNfo(levlNow).count = 0
  levlNfo(levlNow).visit = 0
  scopeDown% = (levlNow = 0)
END FUNCTION

''
'' Search for the next sub-folder in this scope we should be browsing. The
'' "folder" argument returns the name of the next folder to enter (variable is
'' never read, only written.) If there is no more folder to visit, "folder" is
'' empty (you have to scopeDown.)
''
FUNCTION scopeUp$
  DIM offset AS INTEGER, length AS INTEGER

  levlNfo(levlNow).visit = levlNfo(levlNow).visit + 1
  IF (levlNfo(levlNow).visit > levlNfo(levlNow).count) THEN
    scopeUp$ = "" ' nothing left to see here
  ELSE
    ' get next folder in line to browse
    offset = levlNfo(levlNow).start
    FOR i% = 1 TO levlNfo(levlNow).visit
      length = ASC(MID$(pathLst, offset, 1))
      offset = offset + length + 1
    NEXT i%
    scopeUp$ = MID$(pathLst, offset - length, length)
  END IF
END FUNCTION

''
'' Where is the base directory (where we start looking for the file,) it must
'' be an absolute path and be terminated with a forward slash ("/".) Query is
'' the file name we're looking for.
''
SUB searchFile (where AS STRING, query AS STRING)
  DIM pathNxt AS STRING, pathAll AS STRING
  DIM move AS INTEGER

  pathOfs = 1     ' Always set to 1 before starting
  move = moveROOT ' Pretend we're moving up (for folder list)
  levlNow = 0     ' We start at level 0

  pathAll = where
  query = LTRIM$(RTRIM$(UCASE$(query)))

  DO
    IF (move <> moveDOWN) THEN
      if (debugLevel = 1) then PRINT CHR$(34) + pathAll + CHR$(34)
      ' Append directories, search for file
      IF (move) THEN levlNow = levlNow + 1
      CALL getObjects(pathAll, query)
    END IF

    if (debugLevel = 2) then
      CLS : PRINT CHR$(34) + pathAll + CHR$(34); TAB(75); levlNow: debugBuffer
    end if

    ' Enter next sub-directory
    pathNxt = scopeUp$
    IF LEN(pathNxt) THEN
      pathAll = pathAll + pathNxt + "/"
      move = moveUP
    ELSE
      IF (scopeDown%) THEN EXIT DO
      levlNow = levlNow - 1
      pathAll = LEFT$(pathAll, INSTRREV%(LEFT$(pathAll, LEN(pathAll) - 1), "/"))
      move = moveDOWN
    END IF
  LOOP
END SUB

This can be optimized. For instance, searching for the next folder via an index is slower than using an offset since we're parsing the same strings over and over again. I also suspect searching for the file separately could be a tiny bit faster than using the same code for both folders and file match. Using fixed-length folder names will be a lot faster, but will require more memory...

EDIT AGAIN: I made the tiny optimization I mentioned earlier (using an offset rather than an index for string parsing) and it really speeds things up when directories have many sub folders (it's easy to add and really worth it!) I tried a compiled version of the program on a disk with 43,902 files and 2,699 folders, it didn't crash. At most, only 10% of the string buffer was used (2,139 bytes out of 20,000 allocated - so really, no need for XMS or virtual memory,) and the deepest the program had to go was 11 levels (it can go up to 63 - there's no check for that so beware.) It found "ultramid.ini" 10 times in 44 seconds (DOSBox at 3,000 cycles,) and took about 3 seconds with 100% cycles.

Attaching a file containing the dirty (yet somewhat usable) program. Source included, use at your own risks, etc.

RE-EDIT AGAIN: adding another version, it supports file masking and displays the result in a somewhat orderly fashion (including file size and time stamp.) It's usable. Tested under DOSBox only.

Pete's QBASIC Site

File search in QB without crash

File search in QB without crash

Re: File search in QB without crash

Re: File search in QB without crash