Fix #703: Recover from empty structure files in PDB_CACHE_DIR#774
Fix #703: Recover from empty structure files in PDB_CACHE_DIR#774sbliven merged 11 commits intobiojava:bugfixes-4.2from
Conversation
log slow downloads upon start
The main change hasn't been implemented, so we want tests to fail. However, the tests exposed some NPE and IO exceptions. These are now fixed, so the tests fail in the expected manner. - Use ATP ligand, which is not covered by the ReducedChemCompProvider - Use the ReducedChemCompProvider as a fallback consistently, preventing null chemComp - Defensive parsing in SimpleMMcifConsumer - Robust test, doesn't require internet!
Files less than 40 bytes are deleted to allow for gzip headers. This addresses biojava#703
This is necessary when changing the cache path.
| return chemComp; | ||
| // May be null if the file was corrupt. Fall back on ReducedChemCompProvider in that case | ||
| if(chemComp != null) { | ||
| return chemComp; |
There was a problem hiding this comment.
This is a small change in behavior. If we can't read the downloaded chemcomp file, we now return a stub chemcomp rather than NULL. Previously the stub was returned for corrupt GZIP files (which threw an error) but not for malformed cif contents (which just returned null).
| // probably a network error happened. Try to use the ReducedChemCOmpProvider | ||
| ReducedChemCompProvider reduced = new ReducedChemCompProvider(); | ||
| if( fallback == null) { | ||
| fallback = new ReducedChemCompProvider(); |
There was a problem hiding this comment.
creating a ReducedChemCompProvider is cheap (and idempotent), but let's cache it anyways.
| * | ||
| * @param dir directory to delete | ||
| */ | ||
| public static void deleteDirectory(Path dir) throws IOException { |
There was a problem hiding this comment.
Surprisingly this isn't an nio method and I couldn't find one in BioJava
| cache.setPath(tmpCache.toString()); | ||
| cache.setCachePath(tmpCache.toString()); | ||
| cache.setUseMmCif(true); | ||
| ChemCompGroupFactory.setChemCompProvider(new DownloadChemCompProvider(tmpCache.toString())); |
There was a problem hiding this comment.
It's annoying how much code is needed to change the cache dir.
|
Tests pass locally, but there's some danger of system-dependent bugs here since I'm hacking the cache path for the two new tests. BTW, this could be an example of how to use AtomCache in tests without incurring network access. |
|
This should be merged to master after merging |
josemduarte
left a comment
There was a problem hiding this comment.
Thanks @sbliven !
I've submitted a few comments and questions
|
|
||
| protected void downloadFileFromRemote(URL remoteURL, File localFile) throws IOException{ | ||
| // System.out.println("downloading " + remoteURL + " to: " + localFile); | ||
| LOGGER.info("Downloaded file {} to local file {}", remoteURL, localFile); |
There was a problem hiding this comment.
Shouldn't this be "Downloading"? It isn't downloaded at this point yet
| if( ! success) { | ||
| return null; | ||
| } | ||
| assert(!f.exists()); |
There was a problem hiding this comment.
I'd rather log a warning here. Assert is going to be ignored in normal situations
There was a problem hiding this comment.
That's intended. If the delete was successful then the file will not exist. This is just a sanity check during development to make sure delete() wasn't asynchronous or something.
| if( f.length() < MIN_PDB_FILE_SIZE ) { | ||
| boolean success = f.delete(); | ||
| if( ! success) { | ||
| return null; |
There was a problem hiding this comment.
Shouldn't this rather throw an exception (perhaps IOException)? I don't see how can this null be handled if file can't be deleted.
There was a problem hiding this comment.
switched to Files.delete which throws the exception
| } | ||
|
|
||
| /** | ||
| * Force the cache to be reset. |
There was a problem hiding this comment.
This clears the memory cache, right? Could you add that to javadoc, it's an important detail since there is both memory and file cache
There was a problem hiding this comment.
Only memory. Documented.
| // delete files that are too short to have contents | ||
| if( f.length() < LocalPDBDirectory.MIN_PDB_FILE_SIZE ) { | ||
| f.delete(); | ||
| return false; |
There was a problem hiding this comment.
If the file fails to be deleted, this won't work as expected. I think the fail to delete case should throw an IOException
There was a problem hiding this comment.
I don't check the status because the delete isn't really necessary here. If this method returns false then we re-download the file and move it on top of the old one. I just added the delete defensively.
| } | ||
|
|
||
| return reduced.getChemComp(recordName); | ||
| return fallback.getChemComp(recordName); |
There was a problem hiding this comment.
Is there a warning logged when the fallback is used? That'd be important, so that users are aware and can investigate if there's something wrong in their code or file system
| residueNrInt = Integer.parseInt(residueNumberS); | ||
| } else { | ||
| String label_seq_id = atom.getLabel_seq_id(); | ||
| residueNrInt = Integer.parseInt(label_seq_id); |
There was a problem hiding this comment.
label_seq_id is not the same as residue number. This can introduce many problems.
There was a problem hiding this comment.
We should discuss this further.
I added it because the atp.cif.gz file I added (from pymol) has label_seq_id but not auth_seq_id. It seems reasonable to me to fall back on the sequential numbering if the authors don't specify a custom numbering.
Note that auth_seq_id is optional but label_seq_id is required. From the spec it sounds to me like label_seq_id is a good fallback for ResidueNumber:
An alternative identifier for _atom_site.label_seq_id that
may be provided by an author in order to match the identification
used in the publication that describes the structure.
Ideally BioJava would use Group.getId() rather then getResidueNumber() everywhere. However, many things still use residue numbers, so setting a default seems prudent.
There was a problem hiding this comment.
There's another potential issue here, which is that the seq_id is stored as a long while ResidueNumber contains an Integer. I don't think there are any files with >2billion groups per entity, but if we hit it then the ResidueNumber would throw a NumberFormatException here. I think that's fine.
| line = buf.readLine(); | ||
| while( line != null && (line.isEmpty() || line.startsWith(COMMENT_CHAR))) { | ||
| line = buf.readLine(); | ||
| } |
There was a problem hiding this comment.
Nice to handle comments in first line!
Is an empty first line also valid CIF?
There was a problem hiding this comment.
I think blank lines are permitted anywhere in CIF, but I could be wrong. I try to write permissive parsers. It seems like it shouldn't break the spec so we might as well be robust to it either way.
| assertTrue(chem.getAtoms().size() > 0); | ||
| assertEquals("NON-POLYMER", chem.getType()); | ||
| } finally { | ||
| // FileDownloadUtils.deleteDirectory(tmpCache); |
There was a problem hiding this comment.
Why commented? Is cleaning up not needed here?
If not needed please remove the try/finally
There was a problem hiding this comment.
Fixed. Cleaning up /tmp is polite, but makes it harder to debug failing tests.
| assertTrue(chem.getAtoms().size() > 0); | ||
| assertEquals("NON-POLYMER", chem.getType()); | ||
| } finally { | ||
| // FileDownloadUtils.deleteDirectory(tmpCache); |
See comments on biojava#774
Use GlobalsHelper.pushState()/restoreState() before and after tests to ensure that state isn't carried between tests. This is applied to the AtomCacheTest to fix test regressions while simplifying the code.
Maven runs tests with a clean environment, so we can't restore PDB_DIR
|
@josemduarte please check again and see if I addressed all your suggestions. The |
|
Tests pass with an old cache, but 4LNC got updated Tuesday and now |
4LNC was updated to remove the X-Ray experimental method. This switches the test to 6F2Q, which uses both Neutron & Xray.
| String recordName = atom.getGroup_PDB(); | ||
| String residueNumberS = atom.getAuth_seq_id(); | ||
| Integer residueNrInt = Integer.parseInt(residueNumberS); | ||
| Integer residueNrInt; |
There was a problem hiding this comment.
Could you handle this in a separate pull request? as per discussion in #775
Parsing such a file again throws a NumberFormatException. Further work/discussion of this issue is on biojava#775, but it was blocking the merging of biojava#774.
|
OK, I removed the mmcif changes (#775) and merged the result. |
For a short period the ChemComp files didn't resolve, resulting in empty files getting cached by BioJava. The caching problem has been long fixed, but this PR should make it so BioJava can recover and if the user happened to use an old version of BioJava (e.g. older CE-Symm versions).
Fixes #703